Part 2e: Gene finding!

getCoordinates

Another thing a user might want to know is the coordinates of a gene in the original DNA sequence. For this purpose, we provide the following method. Copy and paste it into your GeneFinder project file:

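public Coordinate getCoordinates(String gene, String DNA)
{
    // First look for the gene on the forward strand.
    String geneToUse = gene;
    int start = DNA.indexOf(geneToUse);

    // indexOf returns -1 when there is no match, so the gene must be on
    // the reverse strand: search for its reverse complement instead.
    if (start == -1)
    {
        geneToUse = reverseCompliment(gene);
        start = DNA.indexOf(geneToUse);
    }

    // Note: if the gene is on neither strand, start is still -1 here.
    // The end index is exclusive (one past the last base of the gene).
    int end = start + gene.length();
    return new Coordinate(start, end);
}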

getCoordinates follows the convention of reporting everything in forward-strand coordinates, even if the ORF is actually on the reverse-complemented strand. Positions are counted from 0; the start index is inclusive and the end index is exclusive.

Here are some examples of how getCoordinates is used, using some simple inputs:

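>>> getCoordinates("GTT", "ACGTTCGA")
[2, 5] // Coordinate.getStart() and Coordinate.getEnd(); GTT is found directly at index 2
>>> getCoordinates("CGAA", "ACGTTCGA")
[3, 7] // Coordinate.getStart() and Coordinate.getEnd(); CGAA is absent, so its reverse complement TTCG is found at index 3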
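
If you want to check these two examples yourself, the following is a minimal sketch you could run. It assumes a very simple Coordinate class (two ints with getStart() and getEnd(), as the examples suggest) and includes a stand-in for the reverseCompliment method you wrote earlier; both are illustrative, so your own versions may look different. The class name GetCoordinatesDemo is made up for this sketch.

```java
// Hypothetical minimal Coordinate; the class used in your project may differ.
class Coordinate {
    private final int start; // 0-based index of the first base (inclusive)
    private final int end;   // index one past the last base (exclusive)

    Coordinate(int start, int end) {
        this.start = start;
        this.end = end;
    }

    int getStart() { return start; }
    int getEnd() { return end; }
}

class GetCoordinatesDemo {
    // Stand-in for the reverseCompliment method written earlier in the project.
    static String reverseCompliment(String dna) {
        StringBuilder sb = new StringBuilder();
        // Walk the strand backwards, appending the complement of each base.
        for (int i = dna.length() - 1; i >= 0; i--) {
            char base = dna.charAt(i);
            switch (base) {
                case 'A': sb.append('T'); break;
                case 'T': sb.append('A'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                default:  sb.append(base); // leave unexpected characters unchanged
            }
        }
        return sb.toString();
    }

    // Same logic as getCoordinates above, made static so main can call it.
    static Coordinate getCoordinates(String gene, String DNA) {
        String geneToUse = gene;
        int start = DNA.indexOf(geneToUse);
        if (start == -1) {
            geneToUse = reverseCompliment(gene);
            start = DNA.indexOf(geneToUse);
        }
        int end = start + gene.length();
        return new Coordinate(start, end);
    }

    public static void main(String[] args) {
        String dna = "ACGTTCGA";
        Coordinate forward = getCoordinates("GTT", dna);
        Coordinate reverse = getCoordinates("CGAA", dna);
        System.out.println(forward.getStart() + ", " + forward.getEnd()); // prints 2, 5
        System.out.println(reverse.getStart() + ", " + reverse.getEnd()); // prints 3, 7
        // The reported range always indexes the forward strand:
        System.out.println(dna.substring(reverse.getStart(), reverse.getEnd())); // prints TTCG
    }
}
```

Because the end index is exclusive, the pair can be passed straight to String.substring, which uses the same convention.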

Gene Finding!

Finally, write a method called geneFinder(DNA, minLen) that identifies ORFs longer than minLen and returns a list with information about each one.

geneFinder should first call findORFsBothStrands to obtain a list of ORFs in the input DNA. It should then run through this list, keeping only those ORFs which are longer than minLen.
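
The filtering step described above might look something like the sketch below. It assumes that findORFsBothStrands returns an ArrayList of Strings and that minLen is measured in bases; check both against your own code. The class and method names (OrfFilterDemo, keepLongOrfs) and the sample ORFs are made up for illustration.

```java
import java.util.ArrayList;

class OrfFilterDemo {
    // Returns a new list holding only the ORFs strictly longer than minLen.
    static ArrayList<String> keepLongOrfs(ArrayList<String> orfs, int minLen) {
        ArrayList<String> longOrfs = new ArrayList<>();
        for (String orf : orfs) {
            if (orf.length() > minLen) { // "longer than", so not >=
                longOrfs.add(orf);
            }
        }
        return longOrfs;
    }

    public static void main(String[] args) {
        // Hand-written sample ORFs standing in for the output of findORFsBothStrands.
        ArrayList<String> orfs = new ArrayList<>();
        orfs.add("ATGTAA");          // length 6
        orfs.add("ATGGCATAA");       // length 9
        orfs.add("ATGAAACCCGGGTAG"); // length 15
        System.out.println(keepLongOrfs(orfs, 6)); // prints [ATGGCATAA, ATGAAACCCGGGTAG]
    }
}
```

Building a new list, rather than removing items from the original while looping over it, avoids the errors that come from modifying an ArrayList during iteration.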

For each ORF that is long enough, geneFinder should calculate:

  1. The beginning and end positions of the ORF in DNA, using getCoordinates.
  2. The protein sequence of the ORF, using Ribosome’s createProtein(). Before you use createProtein(), however, you’ll need to modify it so that it takes a String instead of an ArrayList (or you can turn the ORF into an ArrayList of codons!).
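
If you choose the second option in step 2, one possible way to turn an ORF into an ArrayList of codons is sketched below. It assumes your createProtein() expects an ArrayList of three-letter Strings; the names CodonSplitDemo and splitIntoCodons are hypothetical.

```java
import java.util.ArrayList;

class CodonSplitDemo {
    // Cuts the ORF into consecutive three-base codons.
    // Any one or two leftover bases at the end are ignored.
    static ArrayList<String> splitIntoCodons(String orf) {
        ArrayList<String> codons = new ArrayList<>();
        for (int i = 0; i + 3 <= orf.length(); i += 3) {
            codons.add(orf.substring(i, i + 3));
        }
        return codons;
    }

    public static void main(String[] args) {
        System.out.println(splitIntoCodons("ATGGCATAA")); // prints [ATG, GCA, TAA]
    }
}
```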

The coordinates and the protein sequence should then be placed in an ArrayList:

[beginningCoord, endCoord, proteinSequence] // int, int, String

There will be an ArrayList like this for every ORF that is long enough. You will collect these ArrayLists in another ArrayList (an ArrayList of ArrayLists).

Say our final ArrayList of ArrayLists is called finalOutputList. Its elements are ArrayLists that hold: [beginningCoord, endCoord, proteinSequence].

An ArrayList that mixes ints and a String would have to be declared as ArrayList<Object>, so creating your own class that can hold all of this information will be helpful.
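
As a rough sketch of what such a class could look like, here is one possibility. The name GeneInfo, its fields, and the sample values are all made up for illustration; design yours however suits your project.

```java
import java.util.ArrayList;

// Hypothetical container for one gene: its coordinates and its protein.
class GeneInfo {
    private final int start;      // beginningCoord
    private final int end;        // endCoord
    private final String protein; // proteinSequence

    GeneInfo(int start, int end, String protein) {
        this.start = start;
        this.end = end;
        this.protein = protein;
    }

    int getStart() { return start; }
    int getEnd() { return end; }
    String getProtein() { return protein; }

    @Override
    public String toString() {
        return "[" + start + ", " + end + ", " + protein + "]";
    }
}

class GeneInfoDemo {
    public static void main(String[] args) {
        // With a class, finalOutputList can be a typed ArrayList<GeneInfo>
        // instead of an ArrayList of ArrayLists.
        ArrayList<GeneInfo> finalOutputList = new ArrayList<>();
        finalOutputList.add(new GeneInfo(2, 11, "MA")); // hand-written sample values
        for (GeneInfo gene : finalOutputList) {
            System.out.println(gene); // prints [2, 11, MA]
        }
    }
}
```

With this approach each value keeps its own type, so no casting is needed when you read the coordinates or the protein back out.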