Here is the Escherichia coli K-12 genome, one of the 6548 ongoing genome projects (according to GOLD on February 2nd, 2010). These genomes need to be annotated, that is, their features need to be detected: the locations (and functions) of protein-coding genes, tRNA and rRNA genes, introns and exons in eukaryotic genomes, operons in prokaryotic genomes, transcription factor binding sites, promoters, ribosome binding sites, etc.

If you really do have a prokaryotic genome in need of annotation, the way to go is to use the RAST annotation service [part of the SEED].

The Basic Steps in Annotating a Genome Using RAST
(more information is here).

Call the tRNA and rRNA genes and exclude these regions from further analyses
Make an initial effort to call protein-encoding genes (i.e., determine putative genes ab initio, see below)
Establish phylogenetic context (using a small subset of universal housekeeping genes, find the closest relatives of the genome under annotation)
Run a targeted search based on gene families that occur in closely related genomes (homology-based search)
Process the remaining genes against the entire collection (i.e., not only closely related genomes)
Clean up the remaining gene calls (remove overlaps and adjust starting positions)
Process the remaining, unannotated protein-encoding genes
Construct an initial metabolic reconstruction

Gene Prediction in Prokaryotes

First, non-coding RNAs (such as ribosomal RNAs and tRNAs) are identified and excluded from further analyses. These regions of a genome can be identified with great accuracy. Ribosomal RNAs are extremely conserved and can be identified through sequence similarity searches against known rRNAs (see RDP).

tRNAscan-SE (online server) is used by RAST to predict transfer RNAs. The program takes into account the secondary structure of tRNAs:

Fig. 10.2 [Source]. tRNA secondary structure, showing features used by tRNAscan to predict tRNAs.

Fig. 10.3 [Source]. Decision tree algorithm employed by tRNAscan.

tRNAscan predicts tRNAs with 97.5% accuracy and reports a false positive once per 3 million bases (which for a typical prokaryotic genome means just once per genome).

An Open Reading Frame (ORF) is a stretch of DNA sequence between a start codon and a stop codon. Every protein-coding gene is an ORF, but not every ORF is a protein-coding gene (especially a very short ORF).

For most common purposes, a prokaryotic gene can be defined simply as the longest ORF for a given region of DNA (Koonin and Galperin). The gene starts with an ATG codon (or, in a few instances, with GTG, TTG, or CTG) and ends with a stop codon (TAA, TGA, or TAG).
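The "longest ORF" definition can be sketched in a few lines of Python. This is a minimal illustration only, not the algorithm used by NCBI's ORF finder or by RAST; the function names and the min_codons cutoff are assumptions made for the example. For each stop codon it reports the ORF that begins at the earliest in-frame start codon since the previous stop, on both strands.

```python
# Minimal six-frame ORF scan (illustrative sketch, hypothetical helper names).
START_CODONS = {"ATG", "GTG", "TTG", "CTG"}
STOP_CODONS = {"TAA", "TGA", "TAG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Yield (start, end, frame) for the longest ORF ending at each stop codon.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    min_codons counts the codons before the stop codon.
    """
    for frame in range(3):
        first_start = None  # earliest start codon seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                if first_start is not None and (i - first_start) // 3 >= min_codons:
                    yield first_start, i + 3, frame
                first_start = None
            elif first_start is None and codon in START_CODONS:
                first_start = i


def find_orfs(seq, min_codons=30):
    """Return a list of (start, end, strand, frame) in forward-strand coordinates."""
    seq = seq.upper()
    n = len(seq)
    results = []
    for start, end, frame in orfs_in_strand(seq, min_codons):
        results.append((start, end, "+", frame + 1))
    for start, end, frame in orfs_in_strand(reverse_complement(seq), min_codons):
        # Map reverse-strand coordinates back onto the forward strand.
        results.append((n - end, n - start, "-", frame + 1))
    return results
```

Raising min_codons removes most of the short ORFs that arise by chance, at the cost of missing genuinely short genes.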

Fig. Example of NCBI's ORF finder results.

Additional Considerations:

Because 3 of the 64 codons are stop codons, a stop codon is expected by chance about once every 20 codons, so a long stretch of DNA without stop codons is an indication of coding potential
An ORF has a typical GC composition at the 3rd codon position (Gs and Cs are preferred there)
Promoter and ribosome binding site precede the ORF
ORFs for different regions do not usually overlap with each other (small overlaps are allowed)
There are significant matches to the ORF in protein databases
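The first two considerations can be made concrete with a small calculation. The sketch below assumes, for simplicity, that all four bases are equally likely and independent, so each codon is a stop with probability 3/64; real genomes deviate from this, and both function names are hypothetical.

```python
def prob_stop_free(n_codons, p_stop=3 / 64):
    """Probability that n_codons random codons contain no stop codon.

    Assumes independent codons with uniform base composition (3 stops out of 64).
    """
    return (1 - p_stop) ** n_codons


def gc3_fraction(orf):
    """Fraction of G or C at the third position of each complete codon."""
    third_positions = orf.upper()[2::3]
    if not third_positions:
        return 0.0
    return sum(base in "GC" for base in third_positions) / len(third_positions)


if __name__ == "__main__":
    # Expected spacing between chance stop codons: 64/3 codons.
    print(f"mean codons between chance stops: {64 / 3:.1f}")
    for n in (20, 100, 300):
        print(f"P(no stop in {n} codons) = {prob_stop_free(n):.2e}")
```

Under these assumptions a stop-free run of 100 codons already occurs by chance less than 1% of the time, which is why ORF length alone is a useful first filter.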

Markov models can be trained to detect genes (e.g. GLIMMER or GeneMark). A Markov model for gene prediction uses the observation that the oligonucleotide (k-mer) distribution in coding regions differs from that in non-coding regions. That is, the probability of a nucleotide depends on the k previous nucleotides (the model is therefore called a kth-order model). The models are then trained on specific sets of genes (genomes).
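To illustrate the idea, here is a minimal sketch of a fixed kth-order Markov model in Python. It is a deliberate simplification and not the GLIMMER or GeneMark implementation: it uses a single order k, add-one pseudocounts, and no codon-position information, and all function names are hypothetical. One model is trained on known coding sequences and another on non-coding sequences; a candidate region is then scored by the log ratio of the two probabilities.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, k):
    """Count next-base occurrences for every k-base context (add-one pseudocounts)."""
    counts = defaultdict(lambda: {b: 1 for b in ALPHABET})
    for seq in seqs:
        seq = seq.upper()
        for i in range(k, len(seq)):
            context, base = seq[i - k:i], seq[i]
            # Skip windows containing ambiguous characters such as N.
            if set(context + base) <= set(ALPHABET):
                counts[context][base] += 1
    return counts


def log_prob(model, k, seq):
    """Log-probability of seq under the model, ignoring the first k bases."""
    seq = seq.upper()
    uniform = {b: 1 for b in ALPHABET}  # used for contexts never seen in training
    total = 0.0
    for i in range(k, len(seq)):
        base = seq[i]
        if base not in ALPHABET:
            continue
        row = model.get(seq[i - k:i], uniform)
        total += math.log(row[base] / sum(row.values()))
    return total


def log_odds(coding_model, noncoding_model, k, seq):
    """Positive values mean seq looks more like the coding training set."""
    return log_prob(coding_model, k, seq) - log_prob(noncoding_model, k, seq)
```

A higher order k captures longer-range context but needs far more training data, since the number of contexts grows as 4^k.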

RAST uses GLIMMER [Gene Locator and Interpolated Markov ModelER] for the initial call of protein-coding genes. GLIMMER uses Markov models from 0th to 8th order. It correctly predicts >95% of genes, with relatively few false positives.

GLIMMER (ver. 3.02; iterated)  predictions:
 orfID      start     end  frame  score
--------    -----    -----  --    -----
orf00002      716        3  -3     6.46
orf00003     1146      742  -1     3.89
orf00005     1523     3439  +2    18.72
orf00006     3528     4658  +3    13.34

Fig. Example of GLIMMER results for the piece of DNA sequence used in the ORF finder example above. According to the GLIMMER manual, the score shown is "100 times the log odds per base of the in-frame coding score versus the score of the independent, non-coding model. These scores provide a consistent scale to compare scores of different orfs."
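For downstream processing it helps to turn such a table into records. The sketch below assumes the five-column layout shown above (orfID, start, end, frame, score); actual GLIMMER output files may contain additional header lines, which this parser simply skips. The class and function names are hypothetical. Note that for genes on the reverse strand (negative frame) the start coordinate is larger than the end coordinate.

```python
from typing import NamedTuple


class GenePrediction(NamedTuple):
    orf_id: str
    start: int
    end: int
    frame: int
    score: float

    @property
    def strand(self):
        """'+' for forward-strand frames, '-' for reverse-strand frames."""
        return "+" if self.frame > 0 else "-"

    @property
    def length(self):
        """Length in bases; start > end on the reverse strand."""
        return abs(self.end - self.start) + 1


def parse_predictions(text):
    """Parse lines of the form 'orfID start end frame score'; skip everything else."""
    predictions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue
        try:
            predictions.append(
                GenePrediction(fields[0], int(fields[1]), int(fields[2]),
                               int(fields[3]), float(fields[4]))
            )
        except ValueError:
            # Header or separator line, not a prediction.
            continue
    return predictions
```

Applied to the four predictions above, every length is a multiple of three, as expected for ORFs reported with their stop codon.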

GLIMMER output can subsequently be run through RBSfinder to search for ribosome binding sites in the vicinity of the predicted start codons.
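The idea behind such a search can be sketched as follows. This is not RBSfinder's algorithm: it simply counts matches to the Shine-Dalgarno consensus AGGAGG in a window upstream of a forward-strand start codon. The 20-base window, the match-count scoring, and the function name are assumptions made for illustration.

```python
SD_CONSENSUS = "AGGAGG"  # Shine-Dalgarno consensus


def best_rbs(seq, start, window=20, motif=SD_CONSENSUS):
    """Find the best motif match within `window` bases upstream of `start`.

    `start` is the 0-based index of the first base of the start codon on the
    forward strand. Returns (number_of_matching_bases, position), where
    position is None if no base matches anywhere in the window.
    """
    seq = seq.upper()
    lo = max(0, start - window)
    best = (0, None)
    # Candidate sites must end before the start codon begins.
    for i in range(lo, start - len(motif) + 1):
        site = seq[i:i + len(motif)]
        matches = sum(a == b for a, b in zip(site, motif))
        if matches > best[0]:
            best = (matches, i)
    return best
```

For example, best_rbs("TTTTAGGAGGTTTTTTATGAAA", 16) finds the perfect site at position 4, six bases upstream of the ATG. A predicted start codon with no plausible site upstream might be a candidate for moving to an alternative in-frame start.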

Fig. 10.4 [Source]. Sequence logo of ribosome-binding sites in E. coli genes (based on 149 sequences).

Gene Families

Most commonly used sets of protein families:

COGs (Clusters of Orthologous Groups)
eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups)
TIGRFAMs (protein families from completed genomes based on Pfam Hidden Markov Models)
FIGFams ("yet another set of protein families")
NCBI's Protein Clusters

COG Functional Categories and functional categories used in RAST

An example: COGs in the E. coli K-12 genome

AT THE END...

annotated E. coli genome