Where to find genomes?
As of January 20, 2006 there are 2299 genome projects, of which 495 are completed and published
(Source: Genomes Online Database - GOLD).
Vastness of Protein Space
The Size of Protein Sequence Space (back of the envelope calculation):
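One common form of this estimate can be sketched as follows. The chain length of 100 residues and the alphabet of 20 amino acids are assumptions chosen for illustration; they are not taken from the original calculation.

```python
import math


def sequence_space_size(length: int, alphabet_size: int = 20) -> int:
    """Number of distinct sequences of a given length over a given alphabet."""
    return alphabet_size ** length


# Assumed example: a protein of 100 residues built from 20 amino acids.
n_sequences = sequence_space_size(100)
print(f"20^100 is about 10^{math.log10(n_sequences):.0f} possible sequences")
```

Even for this modest length the number of possible sequences is far larger than the number of proteins that could ever have been sampled by evolution.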
Protein Fold Space: the known folds are not scattered evenly across fold space, but form populated regions (clumps, or attractors). For more on the protein universe, click here and here.
Establishing homology through similarity
If two proteins show significant similarity in their primary sequence, they share ancestry and probably have a similar function. (This is not necessarily true for nucleotide sequences.)
The reverse is not true:
PROTEINS WITH THE SAME OR SIMILAR FUNCTION DO NOT ALWAYS SHOW SIGNIFICANT SEQUENCE SIMILARITY
for one of two reasons:
Establishing homology in practice: Basic Local Alignment Search Tool (BLAST)
Works with both DNA and protein sequences (as query sequences).
Databases (subject sequences): a complete genome, whole GenBank, a metagenome, etc.
There are five different BLAST programs, which perform the following searches:
The underlying operation: the program aligns a query sequence to each subject sequence in the database and reports the results as a hit list ranked by score/E-value. The E-value of a hit is the number of alignments of this quality or better that would be expected by chance alone. Sometimes P-values are used instead. A P-value gives the probability of finding at least one match of this quality or better by chance. P-values lie in the range [0,1]; E-values lie in the range [0,∞).
Rule of thumb: an E-value below 10^-4 is usually taken as evidence for homology.
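The relation between the two statistics can be made concrete with a short sketch. It assumes the usual Poisson approximation for chance hits, under which P = 1 - exp(-E). The hit list, the subject names and the function names below are made up for illustration and do not come from any BLAST output.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """P = 1 - exp(-E): probability of at least one chance hit this good,
    assuming the number of chance hits is Poisson-distributed with mean E."""
    return -math.expm1(-e_value)  # numerically stable for very small E


def significant_hits(hits, threshold=1e-4):
    """Keep (subject_id, e_value) pairs below the threshold, best hit first."""
    return sorted((h for h in hits if h[1] < threshold), key=lambda h: h[1])


# Hypothetical hit list: (subject id, E-value).
hits = [("subjA", 3e-50), ("subjB", 2e-5), ("subjC", 0.7), ("subjD", 12.0)]
for subject, e_value in hits:
    print(subject, e_value, p_value_from_e_value(e_value))
print(significant_hits(hits))
```

For small values E and P are nearly identical; they only diverge once E approaches 1, where P saturates at 1 while E keeps growing.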
More information: For a very basic BLAST tutorial, click here.
For not-so-gory details check out The Statistics of Sequence Similarity Scores.
PSI-BLAST provides an enormous advantage over basic BLAST in the detection of distantly related sequences. It only works if some closely related sequences are already available, but when they are, it finds many additional, more distantly related sequences.
The NCBI page describes PSI-BLAST as follows:
A diagram giving an overview of the PSI-BLAST procedure is here.
The results of a basic BLAST search are aligned and a pattern of conserved residues is
extracted from the alignment. This pattern is used for the next iteration. An important
parameter to adjust is the E-value threshold down to which matches are included in the
alignment and pattern extraction.
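The pattern extraction step can be illustrated with a minimal sketch. It is not the PSI-BLAST implementation: the real program also uses sequence weighting, pseudocounts and log-odds scores. The function name, the toy hits and the threshold of 1e-3 are assumptions made for the example.

```python
from collections import Counter


def build_profile(hits, e_threshold=1e-3):
    """Build per-column residue frequencies from aligned hits.

    hits: list of (aligned_sequence, e_value); all sequences have the same
    length and use '-' for gaps. Only hits with an E-value at or below
    e_threshold contribute to the profile.
    """
    included = [seq for seq, e_value in hits if e_value <= e_threshold]
    if not included:
        return []
    profile = []
    for col in range(len(included[0])):
        counts = Counter(seq[col] for seq in included if seq[col] != "-")
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


# Toy alignment: the last hit is above the threshold and is ignored.
hits = [("MKVLA", 1e-40), ("MKILA", 1e-20), ("MRV-A", 1e-6), ("QQQQQ", 5.0)]
profile = build_profile(hits)
for position, frequencies in enumerate(profile):
    print(position, frequencies)
```

Lowering the threshold keeps the profile clean but may exclude true distant homologs; raising it risks including unrelated sequences, which then contaminate every later iteration.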
What is BLAST useful for?
For example:
Genome Dot Plots
The following plots were generated using the CMR Protein Scatter Plot tool.
Synteny - gene order conservation along a chromosome.
1. Comparison of two strains of E.coli:
2. Comparison of two species of Pyrococcus (archaea):
3. Comparison of two strains of Prochlorococcus marinus (marine cyanobacteria):
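How such a plot is built can be sketched in a few lines. This is not the CMR tool: it assumes that homologous genes have already been identified (for example with BLAST) and assigned shared identifiers, and the two gene orders below are invented, with the second genome carrying an inversion of the last three genes.

```python
def dot_plot_points(order_a, order_b):
    """Return (i, j) for every gene at position i in genome A that has a
    match at position j in genome B. Inputs are gene IDs in chromosomal order."""
    positions_b = {}
    for j, gene in enumerate(order_b):
        positions_b.setdefault(gene, []).append(j)
    return [(i, j) for i, gene in enumerate(order_a) for j in positions_b.get(gene, [])]


def collinear_fraction(points):
    """Fraction of consecutive points (sorted by position in A) whose
    positions in B differ by exactly 1, in either orientation."""
    pts = sorted(points)
    if len(pts) < 2:
        return 0.0
    steps = sum(1 for (_, j1), (_, j2) in zip(pts, pts[1:]) if abs(j2 - j1) == 1)
    return steps / (len(pts) - 1)


# Hypothetical gene orders; genome B has an inversion of g4..g6.
genome_a = ["g1", "g2", "g3", "g4", "g5", "g6"]
genome_b = ["g1", "g2", "g3", "g6", "g5", "g4"]
points = dot_plot_points(genome_a, genome_b)
print(points)
print(collinear_fraction(points))
```

Conserved gene order shows up as points along a diagonal, an inversion as a run along the anti-diagonal, and extensive rearrangement as a scattered cloud.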
ORFans, group-specific genes and Genomic Islands
ORFan - a gene with no detectable homologs in the databases. Visit the ORFanage here or here. According to the latter source, there are 68770 ORFans in 330 genomes.
Identification of Genomic Islands: Synechococcus sp. WH8102 vs. Sargasso Sea MUMmerplot (alignment of Sargasso Sea reads against the genome of this marine cyanobacterium)
Clusters of Orthologous Groups (COGs) [Alternative - TIGR Families]
BLASTology blunders
Example: bacterial genes in the human genome. In the article "Initial sequencing and analysis of the human genome", 113 genes were claimed, on the basis of BLAST analyses, to be putative instances of horizontal gene transfer from bacteria to vertebrates.
"An interesting category is a set of 223 proteins that have significant similarity to proteins from bacteria, but no comparable similarity to proteins from yeast, worm, fly and mustard weed, or indeed from any other (nonvertebrate) eukaryote. These sequences should not represent bacterial contamination in the draft human sequence, because we filtered the sequence to eliminate sequences that were essentially identical to known bacterial plasmid, transposon or chromosomal DNA (such as the host strains for the large-insert clones). To investigate whether these were genuine human sequences, we designed PCR primers for 35 of these genes and confirmed that most could be readily detected directly in human genomic DNA (Table 24). Orthologues of many of these genes have also been detected in other vertebrates (Table 24). A more detailed computational analysis indicated that at least 113 of these genes are widespread among bacteria, but, among eukaryotes, appear to be present only in vertebrates. It is possible that the genes encoding these proteins were present in both early prokaryotes and eukaryotes, but were lost in each of the lineages of yeast, worm, fly, mustard weed and, possibly, from other nonvertebrate eukaryote lineages. A more parsimonious explanation is that these genes entered the vertebrate (or prevertebrate) lineage by horizontal transfer from bacteria."
Responses to the claim:
Steps of the phylogenetic analysis:
Bootstrapping is one of the most popular ways to assess the reliability
of branches. The term bootstrapping
goes back to the Baron Münchhausen (pulled himself out of a swamp
by his shoe laces). Briefly, positions of the aligned sequences are
randomly sampled from the multiple sequence alignment with replacement.
The sampled positions are assembled into new data sets, the
so-called bootstrapped samples. Each
position has about a 63% chance (1 - 1/e for long alignments) of making
it into a particular bootstrapped
sample. If a grouping has a lot of support, it will
be supported by at least some positions in each of the bootstrapped
samples, and all the bootstrapped samples will yield this grouping.
Bootstrapping can be applied to all methods of phylogenetic reconstruction.
Bootstrapping has become a very popular way to assess the reliability of reconstructed phylogenies. Its advantage is that it can be applied to different methods of phylogenetic reconstruction, and that it assigns a probability-like number to every possible partition of the dataset (= branch in the resulting tree). Its disadvantages are that the support for individual groups decreases as more sequences are added to the dataset, and that it only measures how much support for a partition is in the data given a particular method of analysis. If the method of reconstruction falls victim to a bias or an artifact, the error will be reproduced in every bootstrapped sample and will result in high bootstrap support values.
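The resampling step and the 63% figure can be checked with a short sketch. The tree reconstruction itself is left out: `supports_grouping` stands for a user-supplied function (hypothetical here) that builds a tree from a bootstrapped sample with the method of choice and reports whether the grouping of interest is present.

```python
import random


def inclusion_probability(n_positions: int) -> float:
    """Chance that a given alignment column appears at least once when
    n_positions columns are drawn with replacement: 1 - (1 - 1/n)^n."""
    return 1.0 - (1.0 - 1.0 / n_positions) ** n_positions


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: list of equal-length aligned sequences (one string per taxon).
    The same column indices are used for every sequence, so columns stay intact.
    """
    n = len(alignment[0])
    columns = [rng.randrange(n) for _ in range(n)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def bootstrap_support(alignment, supports_grouping, replicates=100, seed=0):
    """Fraction of bootstrapped samples for which supports_grouping(sample)
    returns True. supports_grouping is a placeholder for the tree-building
    and grouping check, which this sketch does not implement."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(replicates)
        if supports_grouping(bootstrap_alignment(alignment, rng))
    )
    return hits / replicates


print(round(inclusion_probability(1000), 3))  # close to 1 - 1/e
print(bootstrap_alignment(["ACGT", "ACGA", "TCGA"], random.Random(1)))
```

Because the same method is applied to every bootstrapped sample, a systematic artifact of that method is resampled along with the signal, which is why high support values alone do not rule out a biased reconstruction.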