CLASS 17. Gene Prediction.

HOMEWORK PROBLEM SET #3 is posted to Moodle. Due in class on FRIDAY, FEBRUARY 19.

INSTRUCTIONS:

For each exercise, provide the search query you used and keep your answers brief. Email me the answers by Sunday, 11:59 PM AST at the latest.

Use "CLASS 17 EXERCISE" as the message subject, and type your answers directly into the email body (i.e., no document attachments, please). Make sure that the first line of your message is your NAME.

    PART I. In the sequence fragment from a prokaryote (the archaeon Thermoplasma acidophilum), identify possible protein-coding genes.

  1. Use ORF-finder at the NCBI for this exercise.

    * Cut and paste the Thermoplasma acidophilum genomic DNA sequence fragment into the sequence window of the ORF finder and press the "OrfFind" button. The web page that is returned shows a symbolic representation of the six possible translation frames. Potential ORFs are indicated in green.

    * Select one ORF by clicking on its symbolic representation (coordinate with your neighbors to pick different ORFs).

    * Use the link to BLAST at the top of the page, and do a BLAST search for the selected ORF. Repeat the BLAST search for one additional ORF.

    * Based on your and your neighbors' analyses, A) which ORFs are likely to form an operon? B) Which is the coding strand? C) Is it the same for all ORFs?

    * You can also select an ORF to work with by clicking on the green boxes in the table on the right. The first number gives the reading frame, and the entries are ranked by length. Select the longest ORF in frame +2 and do a BLAST search. Check the graphical overview box of the BLAST results. D) Do you notice anything strange?

    * E) What happens when you press the "accept" button below the graphic representation in ORF finder?
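To make the six-frame picture concrete, here is a minimal Python sketch of what an ORF scan does: it walks each of the three frames on both strands and reports start-to-stop stretches. It is not the NCBI implementation. The function names, the `min_len` cutoff, the ATG-only start rule, and the way minus-strand frames are labeled are all assumptions made for illustration, so the frame labels may not match the ORF finder output exactly.

```python
# Simplified six-frame ORF scan (illustrative only, not NCBI's code).
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG"}  # assumption: prokaryotes also use GTG/TTG starts; add them here if wanted
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(strand_seq, min_len):
    """Yield (frame, start, end) for one strand; 0-based, end-exclusive."""
    n = len(strand_seq)
    for frame in range(3):
        start = None
        for i in range(frame, n - 2, 3):
            codon = strand_seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:   # length includes the stop codon
                    yield frame, start, i + 3
                start = None


def find_orfs(seq, min_len=30):
    """Return (frame_label, start, end, length), longest first.

    start and end are 1-based positions on the sequence as given.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for frame, s, e in orfs_in_strand(seq, min_len):
        hits.append((f"+{frame + 1}", s + 1, e, e - s))
    for frame, s, e in orfs_in_strand(reverse_complement(seq), min_len):
        # Map offsets on the reverse strand back to forward coordinates.
        hits.append((f"-{frame + 1}", n - e + 1, n - s, e - s))
    return sorted(hits, key=lambda h: h[3], reverse=True)


if __name__ == "__main__":
    toy = "CCATGAAATTTGGGTAACC"  # hypothetical toy sequence, not the class fragment
    for hit in find_orfs(toy, min_len=9):
        print(hit)
```

Sorting by length mirrors the ranked table in the ORF finder. Note that a long ORF found this way is only a candidate; the BLAST step is what supplies evidence that it codes for a real protein.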

  2. GLIMMER is a program that aims to find protein-coding ORFs based on compositional analyses. Copy and paste the Thermoplasma acidophilum genomic DNA sequence fragment into the sequence window. For the additional parameters, select the "bacterial/archaeal" genetic code and "linear" topology.

    A) Did Glimmer identify as most probable the same ORFs that you identified as part of the ATPase operon?

    B) As an aside, Glimmer and many similar programs do very well at identifying real ORFs (98% of the real ORFs are identified). What other parameter would you like to know in order to judge the overall success rate of the program?
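To give a feel for what "compositional analysis" means, the sketch below scores a sequence by how much more likely it is under a model trained on coding DNA than under a background model. Glimmer itself uses interpolated Markov models; this fixed-order version with add-one smoothing is a deliberate simplification, and the function names, the `order` parameter, and the toy training strings are all hypothetical.

```python
import math
from collections import defaultdict


def train_markov(seqs, order=2):
    """Count how often each base follows each context of `order` bases."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        s = s.upper()
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return counts


def log_prob(seq, counts, order=2):
    """Log-probability of seq under the model, with add-one smoothing."""
    seq = seq.upper()
    total = 0.0
    for i in range(order, len(seq)):
        ctx = counts.get(seq[i - order:i], {})
        n = sum(ctx.values())
        # +1 / +4 keeps unseen bases from getting probability zero
        total += math.log((ctx.get(seq[i], 0) + 1) / (n + 4))
    return total


def log_odds(seq, coding, background, order=2):
    """Positive values mean seq looks more like the coding training set."""
    return log_prob(seq, coding, order) - log_prob(seq, background, order)


if __name__ == "__main__":
    # Toy training data, invented for illustration only.
    coding = train_markov(["ATGGCCGCCGCCGCCGCC"])
    background = train_markov(["ATATATATATATATAT"])
    print(log_odds("GCCGCCGCC", coding, background))
    print(log_odds("ATATATAT", coding, background))
```

In a real gene finder the coding model is trained on long, confidently identified ORFs from the same genome, which is why the method needs no database of known proteins.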

    PART II. In the genomic DNA fragment from the plant Arabidopsis thaliana, predict exons and introns.

  3. * Paste the sequence into the sequence window.

    * Select Arabidopsis or plant as the organism.

    * Run GENSCAN.

    * Inspect the graphic output (.pdf file).

    * Copy the predicted peptide and use it in a BLASTP search against the Swiss-Prot database. Inspect the alignments between the target and query sequences. Can you locate the exons that were missed by GENSCAN?

  4. Alternatively, you can use either DotLet or pairwise BLAST to better visualize where the missed exons are. Pick one of those two programs and compare the predicted peptide sequence with this one (protein sequence translated from cDNA). Can you locate the exons that were missed by GENSCAN?
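If you want to see why a dot plot reveals missed exons, the following minimal Python sketch marks every position where a short window of the predicted peptide matches the cDNA-derived protein exactly. A missed exon shows up as a jump of the diagonal. This is not how DotLet is implemented (DotLet scores windows with a substitution matrix); the function names, the `window` size, and both toy sequences are assumptions for illustration.

```python
def dot_plot(seq_a, seq_b, window=3):
    """Return the set of (i, j) where seq_a[i:i+window] == seq_b[j:j+window]."""
    index = {}
    for j in range(len(seq_b) - window + 1):
        index.setdefault(seq_b[j:j + window], []).append(j)
    hits = set()
    for i in range(len(seq_a) - window + 1):
        for j in index.get(seq_a[i:i + window], []):
            hits.add((i, j))
    return hits


def render(seq_a, seq_b, hits):
    """Draw the plot as text: rows follow seq_a, columns follow seq_b."""
    rows = []
    for i in range(len(seq_a)):
        rows.append("".join("*" if (i, j) in hits else "."
                            for j in range(len(seq_b))))
    return "\n".join(rows)


if __name__ == "__main__":
    # Hypothetical toy peptides: `predicted` lacks the middle segment of `full`,
    # the way a prediction that skipped an exon would.
    predicted = "MKTAYIAKQR" + "GHWLNVPDE"
    full = "MKTAYIAKQR" + "SSCCFFMM" + "GHWLNVPDE"
    hits = dot_plot(predicted, full)
    print(render(predicted, full, hits))
    # Each distinct offset (j - i) is one diagonal; a new offset marks a gap.
    print(sorted({j - i for i, j in hits}))
```

The columns where the diagonal breaks correspond to the residues of the cDNA-derived protein that have no counterpart in the prediction.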