Computer Lab Assignment 11

Your name:
Your email address:

(20 minutes) Gene Identification Exercise A (Prokaryotic genomic DNA)

For the following sequences from a prokaryote (the archaeon Thermoplasma acidophilum), identify possible Open Reading Frames (ORFs) using the ORF finder at the NCBI:
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
- Cut and paste the Thermoplasma acidophilum genomic DNA sequence fragment (T_acido.fa) into the sequence window of the ORF finder and press <OrfFind>
  The returned web page shows a symbolic representation of the six possible translation frames. Potential ORFs are indicated in green.
- Select one ORF by clicking on it in the symbolic representation.
- Use the BLAST link at the top of the page to run a BLAST search for the selected ORF and for one additional open reading frame. Check the graphical overview button in the parameter box on the format page before requesting the results.
- Based on your and your neighbors' analyses: A) Which ORFs are likely to form an operon? B) Which is the coding strand? C) Is it the same for all ORFs?
- You can also select an ORF to work on by clicking on the green boxes in the table on the right. The first number gives the reading frame; the entries are ranked by length. Select the longest ORF in frame +2 and run a BLAST search. Check the graphical overview box on the format page. D) Do you notice anything strange?
- E) What happens when you press the accept button below the graphic representation?
- F) Glimmer is a program that aims to find real ORFs based on compositional analyses. A web version is here. Copy and paste the Thermoplasma acidophilum genomic DNA sequence fragment (T_acido.fa) into the sequence window, then select linear and bacterial/archaeal. Did Glimmer rank as most probable the ORFs that you identified as being part of the ATPase operon?
- G) As an aside, Glimmer and many similar programs do really well at identifying real ORFs (98% of the real ORFs are identified). What other parameter would you like to know in order to judge the overall success rate of the program?
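
To make the "six possible translation frames" concrete, here is a minimal Python sketch of a six-frame ORF scan. It is an illustration only, not the NCBI or Glimmer implementation. The function names, the `min_codons` cutoff, and the restriction to ATG start codons are assumptions made for this sketch (prokaryotes also use alternative start codons such as GTG and TTG).

```python
# Illustrative six-frame ORF scan; NOT the NCBI ORF finder algorithm.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Yield (frame, start, end, n_codons) for ATG...stop ORFs.

    frame is +1..+3 on the given strand and -1..-3 on the reverse
    complement. start/end are 0-based offsets on the scanned strand;
    end is exclusive and includes the stop codon. n_codons excludes
    the stop codon. min_codons is an arbitrary cutoff (assumption).
    """
    seq = seq.upper()
    for sign, strand in ((1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            start = None
            for i in range(offset, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG after the previous stop
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        yield (sign * (offset + 1), start, i + 3, n_codons)
                    start = None


if __name__ == "__main__":
    # Toy sequence invented for this sketch; replace with T_acido.fa contents.
    toy = "CCATGAAATTTTAGCC"
    for orf in find_orfs(toy, min_codons=3):
        print(orf)
```

A scan like this reports every stretch between a start and a stop codon, which is why a long ORF alone is not proof of a real gene and why a BLAST search or a compositional tool such as Glimmer is used as a second line of evidence.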
(15 minutes) Genome Alignments and Synteny

So far, you have learned how to search, compare, and manipulate single genes using bioinformatic tools. However, it is also possible to align complete prokaryotic and eukaryotic genomes. Many tools are freely available for different kinds of genome comparison, but for today's class we will concentrate on a new service developed by the Joint Genome Institute (JGI). The Integrated Microbial Genomes (IMG) server allows you to search, select, and compare portions of fully sequenced or partially completed genome sequences.

Go to the IMG server (link). In this exercise, we will compare portions of genomes from different bacterial species to check whether the genes encoding the ATP synthase subunits are in the same order in all of them. First, click on the find genes tool bar. In the Keyword window, type in ATP synthase and select Thermotoga maritima in the organism list. The resulting search displays all subunits that are part of the complete ATP synthase (except the flagellum-specific ATP synthase, which is part of the bacterial flagellum assembly).
In the Gene object ID column, click on the number corresponding to the beta chain (Subunit B); this will lead you to a page containing information about that particular ORF. In the Evidence for function Prediction box (patience -- it can take some time for the page to load completely), you can see a graphical representation of the genome region where the ATP synthase beta chain is located (red ORF). You can put your mouse cursor over an ORF to display the identity of the gene. (You need to wait until the page is completely loaded!) Look at the ORFs located around the beta chain gene. Where do you think the operon containing the ATP synthase begins and ends?

Now, under conserved neighborhood, click on "Show ortholog neighborhood regions" (since we haven't selected any other organism in our genome cart, the next page will display a default list of different species). Is the subunit order of the ATP synthase conserved in the other bacterial species? List a few of the differences you can find. (You can click on next to move to the next page.)
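
If you write down the gene order you read off the neighborhood display, the comparison can be automated. The following Python sketch is one possible way to do that. The function names and the placeholder labels `g1` to `g4` are hypothetical; they are not real gene names and not IMG output.

```python
def same_order(reference, other):
    """Classify the gene order of `other` relative to `reference`.

    Both arguments are lists of gene labels in chromosomal order.
    Only genes present in both lists are compared. Returns
    "conserved", "inverted" (same order read from the opposite
    strand), or "rearranged".
    """
    shared = set(reference) & set(other)
    ref = [g for g in reference if g in shared]
    oth = [g for g in other if g in shared]
    if oth == ref:
        return "conserved"
    if oth == ref[::-1]:
        return "inverted"
    return "rearranged"


def missing_genes(reference, other):
    """Return genes of `reference` that do not occur in `other`."""
    return [g for g in reference if g not in other]


if __name__ == "__main__":
    # Placeholder labels (assumption); substitute the subunit names
    # you observe for each species.
    species_a = ["g1", "g2", "g3", "g4"]
    species_b = ["g4", "g2", "g1"]
    print(same_order(species_a, species_b))     # order of shared genes
    print(missing_genes(species_a, species_b))  # genes absent from B
```

Treating an inverted order as its own category is a design choice: an operon encoded on the opposite strand appears reversed in the display even though the gene order along the transcript is unchanged.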
(15 minutes) The NCBI provides a facility for pairwise genome comparison. This is similar to DOTLET, but the units of comparison are complete Open Reading Frames. A circle is placed in the plot if a gene "x" from genome A has a top-scoring BLAST hit in genome B (gene "y") AND gene "x" is also a hit of gene "y". (Note: it is not clear whether, in the current implementation, a circle is also placed if the return hit is significant but not top-scoring.) Go to the NCBI GenePlot page.
- First compare the two Leptospira interrogans serovars from the scroll-down lists (serovar Copenhageni and serovar Lai) and generate the Genome Plot by pressing the 'Compare' button. What can you conclude about the genome differences by looking at the left window?
- Select two closely related species and generate the Genome Plot. (If the two species you chose do not give any interesting results, retry with other species.) Which species did you compare? What interesting feature did you find when comparing those two genomes? Does the plot change when you switch the axes? Try to use the zoom-in feature in the right window to figure out what an interesting sequence encodes.
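
The plotting criterion described for the genome plot, in its stricter reading where both hits must be top-scoring, is usually called a reciprocal best hit. The Python sketch below illustrates that reading only. It is not the GenePlot implementation, and the function names and the score-dictionary layout are assumptions made for this example.

```python
def best_hits(scores):
    """Map each query gene to its top-scoring subject gene.

    scores: dict of {(query, subject): score}, e.g. BLAST bit scores.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_in_A, gene_in_B) pairs that are each other's top hit."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((x, y) for x, y in a_best.items() if b_best.get(y) == x)


if __name__ == "__main__":
    # Invented scores for illustration only.
    a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50, ("a2", "b1"): 150}
    b_vs_a = {("b1", "a1"): 210, ("b1", "a2"): 140, ("b2", "a1"): 60}
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In this toy input, a2 finds b1 as its best hit, but b1 prefers a1, so only the pair (a1, b1) would produce a circle under the strict criterion.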

4) Background:

Problems in finding Open Reading Frames (ORFs) and coding sequences (CDS) provide a nice example of the failure of first-principles approaches:

In eukaryotes, the coding sequence is often interrupted by introns. Genes are transcribed into RNA; spliceosomes then remove the introns from the RNA, and the exon portions are joined together. In Arabidopsis the splice site consensus is as follows (from www.arabidopsis.org/info/splice_site.pdf):

(This table summarizes the sequences surrounding the intron splice sites in the plant Arabidopsis. For example, in 52.9% of the intron-exon boundaries (bottom part), the first base of the exon is a G, and in 40.5% the next nucleotide is a T.)
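
Percentages of this kind come from counting bases column by column in a set of aligned splice sites. The following Python sketch shows the counting step under simple assumptions: the four input strings are invented toy data, not Arabidopsis sequences, and the function names are hypothetical.

```python
from collections import Counter


def position_frequencies(sites):
    """Return one {base: percent} dict per alignment column.

    sites: equal-length uppercase DNA strings, all aligned on the
    same splice junction.
    """
    if not sites:
        return []
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    table = []
    for column in zip(*sites):
        counts = Counter(column)
        table.append({base: 100.0 * counts[base] / len(sites) for base in "ACGT"})
    return table


def consensus(table):
    """Return the most frequent base at each position."""
    return "".join(max(column, key=column.get) for column in table)


if __name__ == "__main__":
    # Toy sites invented for this sketch (assumption).
    toy_sites = ["AGGT", "AGGT", "CGGT", "AGAT"]
    table = position_frequencies(toy_sites)
    for position, column in enumerate(table):
        print(position, column)
    print(consensus(table))
```

A position where the most common base occurs in only about half of the sites carries little information on its own, which is one reason a consensus alone does not identify splice sites reliably.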

Given the many introns known in Arabidopsis, and the fact that many of the spliceosomal RNAs have been sequenced, one might expect that given a sequence it would be possible to recognize with high reliability which parts of a sequence are coding. The following exercises will demonstrate that this is not the case.

ORF predictions using GENSCAN (20 minutes)

This sequence is a fragment of genomic DNA from the plant Arabidopsis thaliana.

1) Use GENSCAN at:

http://genes.mit.edu/GENSCAN.html

or at

http://genome.dkfz-heidelberg.de/cgi-bin/GENSCAN/genscan.cgi

to predict exons and introns encoded on this piece of genomic DNA.

Paste sequence into the sequence window
Select Arabidopsis or plant as organism
Run GENSCAN
Inspect the graphic output (.pdf file)
Copy the predicted peptide and use it for a blastp search of Swiss-Prot.
Inspect the alignments between target and query sequence. Can you locate the exons that were missed by GENSCAN?

Alternatively, you can use either DOTLET or Blast2seq to better visualize where the missed exons are. Pick one of the two programs (take your favorite one :)) and compare the predicted peptide sequence with this one (protein sequence translated from cDNA).
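
To see why a dot plot reveals a missed exon, consider this minimal Python sketch. It is a simplification: it counts identical residues in a sliding window, whereas real tools typically score windows with a substitution matrix. The function names, the parameters `window` and `min_matches`, and the two short peptide strings are invented for illustration.

```python
def dot_plot(seq_a, seq_b, window=3, min_matches=3):
    """Return the set of (i, j) window starts that match.

    A dot is recorded when the windows starting at seq_a[i] and
    seq_b[j] share at least min_matches identical positions.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: rows follow seq_a, columns follow seq_b."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        for i in range(len(seq_a))
    )


if __name__ == "__main__":
    # Toy peptides (assumption): the second lacks two residues of the first,
    # mimicking a predicted peptide with a missed exon.
    cdna_protein = "MKTAYIAK"
    predicted = "MKTIAK"
    dots = dot_plot(cdna_protein, predicted)
    print(render(cdna_protein, predicted, dots))
```

The segment missing from the predicted peptide shows up as a jump of the diagonal: matches before the gap lie on one diagonal and matches after it on a shifted one.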

Finished?

Check the appropriate radio button below before pressing the submit button:

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.