To assess whether long ORFs are genes, a researcher would ask the question, “is this sequence length indicative of a coding region or would I expect to see sequences this long in garbage?” By “garbage”, of course, we mean just random sequences of nucleotides.
We’ll test this by generating a bunch of “garbage” sequences of the same length as our test DNA sequence and measuring the maximum ORF length in each. Then, we’ll ask the following question. Is the very longest ORF among these still shorter than some ORFs we observe in our real DNA? If the real DNA ORFs are significantly longer than what we see in the garbage sequence, that is a very strong indicator that we did in fact find genes in our original DNA!
Write a function longestORFNoncoding(DNA, numReps) that makes a bunch of garbage sequences, finds the very longest ORF in all of these, and returns its length. Note: this function returns a number rather than a DNA string.
OK, so now it’s time to generate garbage sequences. We could generate totally random strings of the same length as our DNA string, but that might not be a very accurate test since our DNA string might have more nucleotides of one type and fewer of another. To be fair, our garbage strings should have the same nucleotides but just reordered or “shuffled” randomly.
To do that, you’ll first need to take your DNA string and turn it into an ArrayList or Array of its constituent symbols.
>>> String str = "hello";
>>> char[] charArray = str.toCharArray();   // store the array returned by toCharArray()
>>> ArrayList<Character> charList = new ArrayList<Character>();
>>> for (char ch : charArray) {             // add the characters one at a time
        charList.add(ch);
    }
>>> System.out.println(charList);
[h, e, l, l, o]
Now, in the Collections class (get it by including import java.util.Collections; at the top of your file) there is a method called shuffle that scrambles the ArrayList. Unlike other functions that we’ve used until now, shuffle doesn’t return a new list; it changes the list that you gave it. Here’s an example:
ArrayList<String> myList = new ArrayList<String>();
myList.add("A");
myList.add("T");
myList.add("G");
myList.add("hi");
myList.add("bye");
myList.add("ok");
System.out.println("Original List : \n" + myList);    // original order
Collections.shuffle(myList);                           // <-- this changes myList!
System.out.println("\nShuffled List : \n" + myList);  // shuffled order
In other words, do not do this in your code…
ArrayList someOtherList = Collections.shuffle(myList);  // won't compile: shuffle returns nothing (void)
but instead do just this…
Collections.shuffle(myList);  // this will actually change myList by scrambling its contents
Oh, wait! But now we want that shuffled list to be glued back together as a string, since after all we’re dealing with DNA strings. That’s easy, here’s a function that takes as input a list of symbols and returns back the string that we get by gluing those symbols together. Take a minute to see how it works. It’s really short and sweet.
Then, copy it into your Java source file and try it out!
public String collapse(ArrayList<Character> charList) {
    String backTogether = "";          // This is our initial output string
    for (Character ch : charList) {    // for each char in the list...
        backTogether += ch;            // ... construct a new output string
    }
    return backTogether;               // and return the final output string
}
Here’s the function in action:
>>> System.out.println(collapse(new ArrayList<Character>(Arrays.asList('o', 'l', 'h', 'e', 'l'))));  // needs import java.util.Arrays;
olhel
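Putting the three steps together (string to list, shuffle, list back to string) might look like the sketch below. The helper name shuffleDNA and the class name ShuffleSketch are made up for illustration and are not part of the assignment; collapse is rewritten here with a StringBuilder and made static only so the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Random;

class ShuffleSketch {

    // Glue a list of characters back together into one string.
    static String collapse(ArrayList<Character> charList) {
        StringBuilder backTogether = new StringBuilder();
        for (char ch : charList) {
            backTogether.append(ch);
        }
        return backTogether.toString();
    }

    // Hypothetical helper: returns a "garbage" string with the same
    // nucleotides as dna, in a random order.
    static String shuffleDNA(String dna, Random rng) {
        ArrayList<Character> symbols = new ArrayList<>();
        for (char ch : dna.toCharArray()) {
            symbols.add(ch);
        }
        Collections.shuffle(symbols, rng);  // shuffles in place, returns nothing
        return collapse(symbols);
    }

    public static void main(String[] args) {
        // Output differs from run to run because the order is random.
        System.out.println(shuffleDNA("ATGGGATGAATTAACC", new Random()));
    }
}
```

Because the shuffle only reorders the list, the garbage string always has the same length and the same count of each nucleotide as the original.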
For each garbage sequence you make, calculate the longest ORF using longestORFBothStrands. You should repeat this process numReps times, and return a number indicating the length of the longest ORF you see in all those repetitions.
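One possible shape for this loop is sketched below. It is only an outline under a few assumptions: your own longestORFBothStrands is assumed to return the longest ORF as a String, and it is passed in here as an extra parameter (not part of the assignment's signature) so that the example compiles without the rest of your code. The class name NoncodingSketch is also made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.function.Function;

class NoncodingSketch {

    // longestORFBothStrands is a stand-in for the function you wrote earlier;
    // it is assumed to take a DNA string and return its longest ORF.
    static int longestORFNoncoding(String dna, int numReps,
                                   Function<String, String> longestORFBothStrands) {
        // Break the DNA into a list of symbols once, up front.
        ArrayList<Character> symbols = new ArrayList<>();
        for (char ch : dna.toCharArray()) {
            symbols.add(ch);
        }

        int longest = 0;
        for (int rep = 0; rep < numReps; rep++) {
            Collections.shuffle(symbols);            // reshuffle in place
            StringBuilder garbage = new StringBuilder();
            for (char ch : symbols) {
                garbage.append(ch);                  // glue the symbols back together
            }
            String orf = longestORFBothStrands.apply(garbage.toString());
            longest = Math.max(longest, orf.length());  // keep the best length so far
        }
        return longest;
    }
}
```

In your own file you would call your longestORFBothStrands directly instead of passing it in.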
You can test your longestORFNoncoding function on a real Salmonella sequence in FASTA format in the included file X73525.fna. You can turn the file into one DNA String by doing this:
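// needs: import java.io.File; import java.util.Scanner;
try {
    String X73525 = "";
    Scanner sc = new Scanner(new File("X73525.fna"));
    while (sc.hasNext()) {
        X73525 += sc.next();   // append each whitespace-separated chunk
    }
    sc.close();
    System.out.println(longestORFNoncoding(X73525, 50));
} catch (Exception err) {
    System.out.println("Could not read the file: " + err);
}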
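FASTA files usually begin with a description line that starts with ">". If your copy of the file has one, a loop that appends every token would mix that description into the DNA string. The sketch below is one way to skip it; readFasta and FastaSketch are made-up names, and whether the provided file actually contains a header line is an assumption you should check by opening it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class FastaSketch {

    // Hypothetical helper: reads a FASTA file and returns only the sequence,
    // skipping blank lines and any description line that starts with '>'.
    static String readFasta(Path file) throws IOException {
        StringBuilder dna = new StringBuilder();
        for (String line : Files.readAllLines(file)) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith(">")) {
                continue;  // not part of the sequence
            }
            dna.append(trimmed);
        }
        return dna.toString();
    }
}
```

You could then call it as readFasta(java.nio.file.Paths.get("X73525.fna")) inside a try block, as before.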
Now run longestORFNoncoding a few times. Note that because it makes use of randomness, it will not give you exactly the same number each time. It will, however, be consistent enough for our purposes.
>>> longestORFNoncoding(X73525, 50)  // repeats 50 times
624
>>> longestORFNoncoding(X73525, 50)  // repeats 50 times
693
Our next step is to write a function findORFs( DNA ) that will identify all the ORFs in the real (unshuffled) DNA and return them as a list. If there are none, it should return an empty list.
Once again, this task is made easier by making use of functions that we have already written. findORFs should call oneFrame in each of the three possible reading frames of the sequence. It should then combine all of the ORFs found in each frame and return them. Note that this strategy makes it possible to find overlapping reading frames. findORFs will likely be very similar to the longestORF function that you wrote above.
>>> findORFs("ATGGGATGAATTAACCATGCCCTAA")
['ATGGGA', 'ATGCCC', 'ATGAAT']
>>> findORFs("GGAGTAAGGGGG")
[]
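To see how the three reading frames combine, here is a rough sketch. The versions of restOfORF and oneFrame below are guesses at what you wrote earlier, included only so the example compiles by itself: restOfORF is assumed to return everything before the first in-frame stop codon (or the whole string if there is none), and oneFrame is assumed to skip past each ORF it finds so nested start codons are not reported. Your own versions may differ.

```java
import java.util.ArrayList;
import java.util.List;

class FindORFsSketch {

    // Assumed behavior: dna starts with ATG; return it up to (not including)
    // the first in-frame stop codon, or all of it if no stop codon appears.
    static String restOfORF(String dna) {
        for (int i = 0; i + 3 <= dna.length(); i += 3) {
            String codon = dna.substring(i, i + 3);
            if (codon.equals("TAG") || codon.equals("TAA") || codon.equals("TGA")) {
                return dna.substring(0, i);
            }
        }
        return dna;
    }

    // Assumed behavior: scan codon by codon from position 0 and collect ORFs.
    static List<String> oneFrame(String dna) {
        List<String> orfs = new ArrayList<>();
        int i = 0;
        while (i + 3 <= dna.length()) {
            if (dna.startsWith("ATG", i)) {
                String orf = restOfORF(dna.substring(i));
                orfs.add(orf);
                i += orf.length();  // jump past this ORF
            } else {
                i += 3;             // move to the next codon
            }
        }
        return orfs;
    }

    // Run oneFrame on the sequence starting at offsets 0, 1, and 2.
    static List<String> findORFs(String dna) {
        List<String> orfs = new ArrayList<>();
        for (int frame = 0; frame < 3 && frame <= dna.length(); frame++) {
            orfs.addAll(oneFrame(dna.substring(frame)));
        }
        return orfs;
    }

    public static void main(String[] args) {
        System.out.println(findORFs("ATGGGATGAATTAACCATGCCCTAA"));  // prints [ATGGGA, ATGCCC, ATGAAT]
        System.out.println(findORFs("GGAGTAAGGGGG"));               // prints []
    }
}
```

Shifting the start by one or two characters is all it takes to change frames, since oneFrame always reads codons from the beginning of whatever string it is given.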
Next write a function called findORFsBothStrands( DNA ) that searches both the forward and reverse complement strands for ORFs and returns a list with all the ORFs found. For example:
>>> findORFsBothStrands("ATGAAACAT")
['ATGAAACAT', 'ATGTTTCAT']
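The second ORF in that example comes from the reverse complement strand: reverse the string and swap A with T and C with G. A minimal sketch of that step is below. The reverseComplement helper shown here is a stand-in for whatever version you already have, and findORFs is passed in as an extra parameter (not part of the assignment's signature) only so the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class BothStrandsSketch {

    // Walk the string backwards, swapping each nucleotide for its complement.
    static String reverseComplement(String dna) {
        StringBuilder result = new StringBuilder();
        for (int i = dna.length() - 1; i >= 0; i--) {
            switch (dna.charAt(i)) {
                case 'A': result.append('T'); break;
                case 'T': result.append('A'); break;
                case 'C': result.append('G'); break;
                case 'G': result.append('C'); break;
                default:  result.append(dna.charAt(i));  // leave unknown symbols alone
            }
        }
        return result.toString();
    }

    // findORFs is a stand-in for the function you wrote in the previous step.
    static List<String> findORFsBothStrands(String dna,
                                            Function<String, List<String>> findORFs) {
        List<String> orfs = new ArrayList<>(findORFs.apply(dna));  // forward strand
        orfs.addAll(findORFs.apply(reverseComplement(dna)));       // reverse complement strand
        return orfs;
    }

    public static void main(String[] args) {
        System.out.println(reverseComplement("ATGAAACAT"));  // prints ATGTTTCAT
    }
}
```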
Use the examples above to test each one of the methods you have written.