CLASS 13. Prokaryotic Genome Annotation.
Ch. 8, 9 (textbook);
Section 4.1 (Koonin and Galperin)
Section 9.1-9.3, 10.1-10.3 (supp. textbook)
The Escherichia coli K-12 genome is one of the 6548 ongoing genome projects (according to GOLD on February 2nd, 2010). The features of these genomes need to be detected (i.e. the genomes need to be annotated): the location (and function) of protein-coding genes, tRNA and rRNA genes, introns and exons in eukaryotic genomes, operons in prokaryotic genomes, transcription factor binding sites, promoters, ribosome binding sites, etc.
If you really do have a prokaryotic genome in need of annotation, the way to go is to use the RAST annotation service [part of the SEED].
The Basic Steps in Annotating a Genome Using RAST
(more information is here).
- Call the tRNA and rRNA genes and exclude these regions from further analyses
- Make an Initial Effort to Call Protein-Encoding Genes (i.e., determine putative genes ab initio, see below)
- Establishing Phylogenetic Context (using a small subset of universal house-keeping genes, find closest relatives to the genome under annotation)
- A Targeted Search Based on gene families that Occur in Closely Related Genomes (homology-based search)
- Processing the Remaining Genes Against the Entire Collection (i.e. not only closely related genomes)
- Clean Up Remaining Gene Calls (Remove Overlaps and Adjust Starting Positions)
- Process the Remaining, Unannotated Protein-encoding Genes
- Construct an Initial Metabolic Reconstruction
Gene Prediction in Prokaryotes
First, non-coding RNAs (such as ribosomal RNAs and tRNAs) are identified and excluded from further analyses. These regions of a genome can be identified with great accuracy. Ribosomal RNAs are extremely conserved and can be identified through sequence similarity searches against known rRNAs (see RDP).
tRNAscan-SE (online server) is used by RAST to predict transfer RNAs. The program takes the secondary structure of tRNAs into account.
tRNAscan predicts tRNAs with 97.5% accuracy and reports a false positive once per 3 million bases (which for a typical prokaryotic genome means just once per genome).
For most common purposes, a prokaryotic gene can be defined simply as the longest ORF for a given region of DNA (Koonin and Galperin). The gene starts with an ATG codon (or, in a few instances, with GTG, TTG, or CTG) and ends with a stop codon (TAA, TGA, or TAG).
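The longest-ORF definition above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by RAST or GLIMMER: the function name `longest_orfs`, the `min_codons` cutoff, and the 0-based coordinates (reported on the strand being scanned) are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}
START_CODONS = {"ATG", "GTG", "TTG", "CTG"}  # ATG plus the rarer alternatives

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_orfs(seq, min_codons=100):
    """List (strand, frame, start, end) for the longest ORF before each stop codon.

    Coordinates are 0-based, end-exclusive, and refer to the scanned strand.
    min_codons is a hypothetical cutoff on ORF length (stop codon not counted).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # first start codon seen since the previous stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in STOP_CODONS:
                    if start is not None and (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
                elif start is None and codon in START_CODONS:
                    # Keeping only the first start gives the longest ORF
                    start = i
    return orfs


if __name__ == "__main__":
    toy = "CC" + "ATG" + "AAA" * 5 + "TAA" + "G"
    print(longest_orfs(toy, min_codons=3))
```

Keeping the most upstream start codon is what makes each reported ORF the longest one for its stop codon; real gene finders revise this choice later using ribosome binding sites and homology.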
Additional Considerations:
- Because a stop codon occurs by chance about once in every 20 codons (3 of the 64 possible codons are stop codons), a long stretch of DNA without stop codons is an indication of coding potential
- An ORF has a typical GC composition at the 3rd codon position (Gs and Cs are preferred in the 3rd codon position)
- A promoter and a ribosome binding site precede the ORF
- ORFs for different regions do not usually overlap with each other (small overlaps are allowed)
- There are significant matches to the ORF in protein databases
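The first two considerations in the list above reduce to simple arithmetic. The sketch below assumes random DNA with equal, independent base frequencies (a simplification that real genomes do not satisfy); the function names are made up for this example.

```python
def stop_free_probability(n_codons, p_stop=3 / 64):
    """Chance that n_codons consecutive random codons contain no stop codon.

    Assumes equal, independent base frequencies, so 3 of 64 codons are stops.
    """
    return (1 - p_stop) ** n_codons


def gc3_fraction(orf):
    """Fraction of G or C at the third position of each complete codon."""
    third_positions = orf.upper()[2::3]
    if not third_positions:
        return 0.0
    gc = sum(base in "GC" for base in third_positions)
    return gc / len(third_positions)


if __name__ == "__main__":
    # Mean spacing between chance stop codons, in codons
    print(64 / 3)
    # Chance of a 100-codon stop-free run in random sequence
    print(stop_free_probability(100))
    # Third positions of ATG GCC AAA TAA are G, C, A, A
    print(gc3_fraction("ATGGCCAAATAA"))
```

Under these assumptions a stop-free run of 100 codons has less than a 1% chance of arising at random, which is why long ORFs are treated as evidence of coding potential.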
Markov models, including Hidden Markov Models, can be trained to detect genes (e.g. GLIMMER or GeneMark). Markov models for gene prediction rely on the observation that the oligonucleotide (k-mer) distribution in coding regions differs from that in non-coding regions. That is, the nucleotide at a given position depends on the k previous nucleotides (and the Markov model is therefore called a kth-order model). The models are then trained on specific sets of genes (genomes).
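To make the kth-order idea concrete, here is a minimal sketch of a fixed-order Markov model trained separately on coding and non-coding sequences and used to score a candidate region by log-odds. It is not GLIMMER's or GeneMark's implementation: it uses a single order k, ignores codon position, and applies add-one smoothing, all of which are simplifying assumptions; the function names are hypothetical.

```python
import math
from collections import defaultdict


def train_markov(seqs, k):
    """Count how often each nucleotide follows each k-mer context."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        s = s.upper()
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1
    return counts


def log_prob(seq, counts, k, pseudo=1.0):
    """Log-probability of seq under the model (first k bases are not scored)."""
    seq = seq.upper()
    total = 0.0
    for i in range(k, len(seq)):
        context = counts.get(seq[i - k:i], {})
        n = sum(context.values())
        # Add-one smoothing over the 4 nucleotides (an assumption of this sketch)
        total += math.log((context.get(seq[i], 0) + pseudo) / (n + 4 * pseudo))
    return total


def coding_log_odds(seq, coding, noncoding, k):
    """Positive values mean seq looks more like the coding training set."""
    return log_prob(seq, coding, k) - log_prob(seq, noncoding, k)


if __name__ == "__main__":
    k = 2
    # Toy training sets, invented for illustration only
    coding = train_markov(["ATGGCCGCCGCCGCC" * 3], k)
    noncoding = train_markov(["ATATATATTTAAATAT" * 3], k)
    print(coding_log_odds("GCCGCCGCC", coding, noncoding, k))
    print(coding_log_odds("ATATATAT", coding, noncoding, k))
```

In practice the training sets would be known genes and intergenic regions from the genome being annotated (or a close relative), and k would be limited by how much training data is available for each context.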
RAST uses GLIMMER [Gene Locator and Interpolated Markov ModelER] for the initial call of protein-coding genes. GLIMMER uses Markov models from 0th to 8th order. It correctly predicts >95% of genes, with relatively few false positives.
GLIMMER (ver. 3.02; iterated) predictions:
orfID     start    end  frame  score
--------  -----  -----  -----  -----
orf00002    716      3     -3   6.46
orf00003   1146    742     -1   3.89
orf00005   1523   3439     +2  18.72
orf00006   3528   4658     +3  13.34
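The prediction rows above are easy to load into a script for downstream filtering. The following is a minimal sketch that parses only the five whitespace-separated columns shown here (orfID, start, end, frame, score); the class and function names are invented for this example, and the length calculation assumes genes that do not wrap around the origin of a circular genome.

```python
from typing import NamedTuple


class GenePrediction(NamedTuple):
    orf_id: str
    start: int
    end: int
    frame: int
    score: float

    @property
    def strand(self):
        # A negative frame means the gene lies on the reverse strand
        return "+" if self.frame > 0 else "-"

    @property
    def length_nt(self):
        # Assumes the gene does not wrap around the genome origin
        return abs(self.end - self.start) + 1


def parse_predictions(text):
    """Parse 'orfID start end frame score' rows; skip anything else."""
    predictions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue
        try:
            predictions.append(GenePrediction(
                fields[0], int(fields[1]), int(fields[2]),
                int(fields[3]), float(fields[4])))
        except ValueError:
            continue  # header or separator line
    return predictions


EXAMPLE = """\
orfID     start    end  frame  score
--------  -----  -----  -----  -----
orf00002    716      3     -3   6.46
orf00003   1146    742     -1   3.89
orf00005   1523   3439     +2  18.72
orf00006   3528   4658     +3  13.34
"""

if __name__ == "__main__":
    for p in parse_predictions(EXAMPLE):
        print(p.orf_id, p.strand, p.length_nt, p.score)
```

Note that for reverse-strand genes the start coordinate is larger than the end coordinate, as in orf00002 and orf00003.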
GLIMMER output can subsequently be run through RBSfinder to search for ribosome binding sites in the vicinity of the predicted start codons.
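The idea of looking for a ribosome binding site near a predicted start codon can be illustrated with a naive motif scan. This is not RBSfinder's algorithm: the sketch simply counts matches to the Shine-Dalgarno consensus AGGAGG in a fixed window upstream of a forward-strand start codon, and the window size and function name are assumptions made for this example.

```python
SD_CONSENSUS = "AGGAGG"  # Shine-Dalgarno consensus sequence


def best_rbs_match(genome, start, window=20, motif=SD_CONSENSUS):
    """Find the best ungapped match to motif upstream of a start codon.

    start is the 0-based index of the first base of the start codon on the
    forward strand. Returns (matches, position, site) or None if the
    upstream region is shorter than the motif.
    """
    lo = max(0, start - window)
    upstream = genome[lo:start].upper()
    best = None
    for i in range(len(upstream) - len(motif) + 1):
        site = upstream[i:i + len(motif)]
        matches = sum(a == b for a, b in zip(site, motif))
        if best is None or matches > best[0]:
            best = (matches, lo + i, site)
    return best


if __name__ == "__main__":
    toy = "TTTT" + "AGGAGG" + "TTTTTT" + "ATG" + "AAA"
    print(best_rbs_match(toy, start=16))
```

A strong match a short distance upstream of one candidate start codon, and none near the others, is the kind of evidence used to adjust the start position of a predicted gene.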
Gene Families
Most commonly used sets of protein families:
- COGs (Clusters of Orthologous Groups)
- eggNOGs (evolutionary genealogy of genes: Non-supervised Orthologous Groups)
- TIGRFams (protein families from completed genomes based on Pfam Hidden Markov Models)
- FIGFams ("yet another set of protein families")
- NCBI's Protein Clusters
COG Functional Categories and functional categories used in RAST
An example: COGs in E.coli K-12 genome
AT THE END...