Part 2a: GeneFinder Overview

This part of the project looks at research into Salmonella pathogenesis. Researchers identified genomic regions present in Salmonella but absent from closely related species such as E. coli. Further studies revealed that some of these regions are essential to Salmonella-specific pathogenic processes. One region in particular is needed for the bacterium to enter the epithelial cells of the gut. To identify the protein-coding genes in this region, the researchers sequenced it and analyzed the sequence for genes. In this part of the project, you will develop a simple gene finder and use it to examine this sequence. You will also examine a related “mystery” sequence and try to determine its function.

You will be using a file that contains Salmonella DNA (NCBI accession number X73525) in FASTA format (extension .fna). We’ve already downloaded it for you, so you can find the X73525.fna file in your starter code.
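If you have not worked with FASTA files before: a FASTA file is plain text, where a header line starting with ">" is followed by one or more lines of sequence. The sketch below shows one possible way to read such a file into a single String. The class and method names (FastaSketch, parseFasta, loadFasta) are made up for illustration and are not part of the starter code, and the sketch assumes the file holds a single sequence.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not part of the starter code.
class FastaSketch {
    // Joins all sequence lines into one uppercase DNA string.
    // Header lines (starting with '>') and blank lines are skipped.
    // Assumption: the file holds a single sequence record.
    static String parseFasta(List<String> lines) {
        StringBuilder dna = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith(">")) {
                continue;
            }
            dna.append(trimmed.toUpperCase());
        }
        return dna.toString();
    }

    // Reads every line of the named file and parses it as FASTA.
    static String loadFasta(String fileName) throws IOException {
        return parseFasta(Files.readAllLines(Paths.get(fileName)));
    }
}
```

With a helper like this, a call such as FastaSketch.loadFasta("X73525.fna") would return the whole sequence as one String, provided the file sits in the program's working directory.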

A simple gene-finding strategy is to look for large open reading frames. An open reading frame (ORF) is the stretch of sequence between a start codon (ATG) and the next in-frame stop codon (TGA, TAA, or TAG). This strategy depends on the assumption that protein-coding sequences have longer ORFs than noncoding sequences. In noncoding regions, “start” and “stop” codons appear by random chance, so the ORFs between them tend to be short. Protein-coding genes often have ORFs that are longer than this.
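To make the ORF definition concrete, here is a minimal sketch of an ORF scan. It is one possible approach, not the required design for your gene finder. All names (OrfSketch, orfFrom, orfsInFrame, longestOrf) are hypothetical. The sketch makes a few assumptions that the definition above leaves open: the returned ORF includes the start codon but not the stop codon, an ORF with no stop codon runs to the end of the sequence, and only the forward strand is scanned.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of the starter code.
class OrfSketch {
    static boolean isStop(String codon) {
        return codon.equals("TGA") || codon.equals("TAA") || codon.equals("TAG");
    }

    // Expects an ATG at index start. Returns the ORF from that start codon
    // up to, but not including, the next in-frame stop codon.
    // Assumption: if no stop codon is found, the ORF runs to the end.
    static String orfFrom(String dna, int start) {
        for (int i = start; i + 3 <= dna.length(); i += 3) {
            if (isStop(dna.substring(i, i + 3))) {
                return dna.substring(start, i);
            }
        }
        return dna.substring(start);
    }

    // Scans one reading frame (offset 0, 1, or 2) codon by codon and
    // collects the ORFs found, without reporting ORFs nested inside others.
    static List<String> orfsInFrame(String dna, int frame) {
        List<String> orfs = new ArrayList<>();
        int i = frame;
        while (i + 3 <= dna.length()) {
            if (dna.startsWith("ATG", i)) {
                String orf = orfFrom(dna, i);
                orfs.add(orf);
                i += orf.length() + 3; // skip past the ORF and its stop codon
            } else {
                i += 3;
            }
        }
        return orfs;
    }

    // Returns the longest ORF over the three forward reading frames,
    // or an empty string if there is none.
    static String longestOrf(String dna) {
        String longest = "";
        for (int frame = 0; frame < 3; frame++) {
            for (String orf : orfsInFrame(dna, frame)) {
                if (orf.length() > longest.length()) {
                    longest = orf;
                }
            }
        }
        return longest;
    }
}
```

For example, in the sequence ATGAAATGACC the ORF starting at index 0 is ATGAAA: the codons are ATG, AAA, then the stop codon TGA.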

To look for genes in a particular sequence, you can identify its longest ORFs. But how do you know if these are genes? To help decide, you can look at many known noncoding regions to obtain a distribution of ORF lengths for noncoding DNA. You can then compare the longest ORFs in your test sequence with this distribution. If the test sequence ORFs are longer than those typically found in noncoding DNA, they are likely to be genes.
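As a rough illustration of this comparison, the sketch below takes a list of ORF lengths observed in noncoding DNA and keeps only the candidate ORFs that are longer than all of them. Using the maximum noncoding length as the cutoff is just one simple choice made for this example. The names (ThresholdSketch, cutoff, likelyGenes) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of the starter code.
class ThresholdSketch {
    // Returns the longest ORF length seen in the noncoding sample (0 if empty).
    // Assumption: the maximum is used as the cutoff; other cutoffs are possible.
    static int cutoff(List<Integer> noncodingLengths) {
        int max = 0;
        for (int length : noncodingLengths) {
            if (length > max) {
                max = length;
            }
        }
        return max;
    }

    // Keeps only the candidate ORFs that are strictly longer than the cutoff.
    static List<String> likelyGenes(List<String> candidateOrfs,
                                    List<Integer> noncodingLengths) {
        int limit = cutoff(noncodingLengths);
        List<String> kept = new ArrayList<>();
        for (String orf : candidateOrfs) {
            if (orf.length() > limit) {
                kept.add(orf);
            }
        }
        return kept;
    }
}
```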

Here is the approach you will take in this problem. You will compare the lengths of the open reading frames in the Salmonella sequence with ORF lengths from known noncoding sequences. How do you get known noncoding sequences? One way is to generate them randomly. Here you will do this by randomly shuffling (reordering) the genomic sequences. More on that shortly!
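In the meantime, here is a minimal sketch of what shuffling a DNA string could look like, using Collections.shuffle from the Java standard library. The class and method names (ShuffleSketch, shuffle) are hypothetical, and your project may ask for a different approach. The shuffled string contains exactly the same bases as the original, only in a random order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch, not part of the starter code.
class ShuffleSketch {
    // Returns a copy of dna with its characters in random order.
    // Passing in the Random makes results repeatable when a seed is used.
    static String shuffle(String dna, Random rng) {
        List<Character> bases = new ArrayList<>();
        for (char base : dna.toCharArray()) {
            bases.add(base);
        }
        Collections.shuffle(bases, rng);
        StringBuilder shuffled = new StringBuilder(dna.length());
        for (char base : bases) {
            shuffled.append(base);
        }
        return shuffled.toString();
    }
}
```

Repeating the shuffle many times and recording the longest ORF found in each shuffled copy would give a sample of ORF lengths to compare against.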