CLASS 16. Evaluation of performance. Phylogenetic Reconstruction (Part I)
Evaluation of gene predictions
pp. 102-103 (textbook);
Section 9.6 (supp. textbook);
Ch. 10 (textbook);
Section 7.1 (supp. textbook);
The results of a gene prediction program fall into four categories:
- True Positive: correct prediction
- False Positive: prediction of region as a gene that is not a gene
- False Negative: missed prediction
- True Negative: correct prediction of gene absence
Prediction accuracy can be evaluated at three levels:
- nucleotide (base) level (coding status of each nucleotide)
- exon level (exact prediction of exon start and end points)
- protein level (maintenance of correct reading frames)
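It is important to use both sensitivity and specificity to assess program accuracy, because either one alone can be made high at the expense of the other.
Sensitivity is the proportion of actual genes that are correctly predicted, i.e. true predictions divided by the total number of real genes (including the missed ones): Sn=TP/(TP+FN)
Specificity is the proportion of true predictions among all predicted genes (including incorrectly predicted ones): Sp=TP/(TP+FP). Note that outside of gene prediction this quantity is usually called precision or positive predictive value.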
Both sensitivity and specificity can be summarized in a single parameter, ranging from -1 to +1, called the correlation coefficient: CC=(TP*TN-FP*FN)/sqrt[(TP+FN)*(TN+FP)*(TP+FP)*(TN+FN)]
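As a small illustration, the three measures can be computed directly from the four counts. This is a minimal sketch: the function name `prediction_accuracy` and the example counts are made up for illustration, and the correlation coefficient is assumed to be the standard one used in gene prediction assessment (also known as the Matthews correlation coefficient).

```python
import math

def prediction_accuracy(tp, fp, fn, tn):
    """Return (Sn, Sp, CC) from the four prediction counts.

    Assumes the denominators are non-zero, i.e. there is at least one
    real gene, one predicted gene, and one true negative or miss.
    """
    sn = tp / (tp + fn)   # sensitivity: share of real genes that were found
    sp = tp / (tp + fp)   # specificity: share of predictions that are real
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom   # ranges from -1 to +1
    return sn, sp, cc

# Hypothetical nucleotide-level counts, for illustration only
sn, sp, cc = prediction_accuracy(tp=80, fp=20, fn=10, tn=890)
print(f"Sn={sn:.3f}  Sp={sp:.3f}  CC={cc:.3f}")
```

A perfect prediction (FP=FN=0) gives CC=+1, while a program that predicts well on one measure but poorly on the other is pulled toward 0.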
Phylogenetic Reconstruction Overview
Phylogenetic analysis is the inference of evolutionary relationships between organisms.
Those relationships are usually represented by tree-like diagrams. Note: the assumption that evolution is tree-like is controversial.
Phylogeny is "the pattern of historical relationships between species or other groups resulting from divergence during evolution" (OED).
Steps of a phylogenetic analysis, in order:
- Compilation of sequence dataset
- Alignment
- Determination of substitution model
- Tree building
- Tree evaluation
Why Phylogenetic Reconstruction?
- systematic classification of organisms
- Evolution of molecules e.g.:
- domain shuffling,
- reassignment of function,
- gene duplications,
- horizontal gene transfer,
- drug targets,
- detection of genes that drive evolution of a species/population (e.g. influenza virus)
- relating extinct organisms to contemporary ones (e.g., mammoth, cave bear, neanderthals)
[Figure: tree diagram from the Histoire naturelle des animaux sans vertèbres of Jean-Baptiste Lamarck (1815).]
See also a more modern version at NCBI's Taxonomy Database.
Mark Ragan has written a quite entertaining historical overview of trees and networks, "Trees and networks before and after Darwin" (highly recommended).
Fossil evidence is scarce and incomplete. Moreover, microorganisms not only have very limited fossil evidence, but also lack enough morphological features to distinguish them from each other.
Seminal paper by Emile Zuckerkandl and Linus Pauling in 1965: Evolutionary divergence and convergence in proteins
Tree Terminology
Trees have a branching pattern (also called the topology) and branch lengths (usually proportional to the amount of substitution, not to time). If branch lengths are ignored, the tree is called a cladogram (this usually implies a rooted tree, see below).
Trees can be unrooted (showing relative relationship among the taxa) and rooted (indicating position of the ancestor). Usually phylogenetic trees reconstructed from extant molecular sequences are unrooted. Additional knowledge is required to root the tree (more on this later).
In contrast, the tree in the previous figure is fully resolved.
A phylogenetic tree can be represented as a set of splits, or bipartitions. Each branch (such as the branch marked in red in the trees above) divides the taxa into two groups, thereby forming a bipartition. A bipartition is often denoted by a string of "*" and "." characters, one per taxon. For the red branch above, the bipartition would be "****.." or, equivalently, "....**". Another example:
Different trees vs. Different Topologies:
You can think of phylogenetic trees as mobiles:
rotation around a node does not change the tree!
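The bipartition notation also gives a practical way to check whether two drawings show the same topology: rotating around a node changes the order in which taxa are drawn but not the set of splits. The sketch below is only an illustration under stated assumptions: trees are written as nested Python tuples of taxon names, the taxa A to F and the function name `splits` are hypothetical, and only non-trivial splits (at least two taxa on each side) are reported.

```python
def splits(tree, taxa):
    """Return the set of non-trivial bipartitions of a tree.

    `tree` is a nested tuple of taxon names; `taxa` fixes the column
    order of the "*"/"." strings.
    """
    all_taxa = frozenset(taxa)
    found = set()

    def leaves_below(node):
        if isinstance(node, str):          # a leaf
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        # Keep only branches with at least two taxa on each side
        if 1 < len(below) < len(all_taxa) - 1:
            # "****.." and "....**" are the same split, so always put
            # the stars on the side that contains the first taxon
            side = below if taxa[0] in below else all_taxa - below
            found.add("".join("*" if t in side else "." for t in taxa))
        return below

    leaves_below(tree)
    return found

taxa = ["A", "B", "C", "D", "E", "F"]
tree = (("A", "B"), ("C", "D"), ("E", "F"))
rotated = (("F", "E"), ("D", "C"), ("B", "A"))   # same mobile, nodes rotated
other = (("A", "C"), ("B", "D"), ("E", "F"))     # a different topology

print(sorted(splits(tree, taxa)))
print(splits(tree, taxa) == splits(rotated, taxa))
print(splits(tree, taxa) == splits(other, taxa))
```

Comparing sets of splits rather than drawings is the reason the mobile analogy works: two trees have the same topology exactly when they induce the same bipartitions.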
Number of Rooted and Unrooted Trees
Number of unrooted trees for n taxa: Nu=(2n-5)*(2n-7)*...*3*1=(2n-5)!/[2^(n-3)*(n-3)!]
Number of rooted trees for n taxa: Nr=(2n-3)*(2n-5)*(2n-7)*...*3*1=(2n-3)!/[2^(n-2)*(n-2)!]
Note that the number of unrooted trees for n sequences is equal to the number of rooted trees for (n-1) sequences.
Number of taxa | Number of unrooted trees | Number of rooted trees
3 | 1 | 3
4 | 3 | 15
5 | 15 | 105
6 | 105 | 945
7 | 945 | 10395
8 | 10395 | 135135
9 | 135135 | 2027025
10 | 2027025 | 34459425
20 | 2.22E+20 | 8.20E+21
30 | 8.69E+36 | 4.95E+38
40 | 1.31E+55 | 1.01E+57
50 | 2.84E+74 | 2.75E+76
60 | 5.01E+94 | 5.86E+96
70 | 5.00E+115 | 6.85E+117
80 | 2.18E+137 | 3.43E+139
For comparison, the universe contains only about 10^89 protons and has an age of about 5*10^17 seconds, or 5*10^29 picoseconds.
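The tree counts above are easy to reproduce, since both formulas are products of consecutive odd numbers. The following is a minimal sketch; the function names are chosen here for illustration, and the formulas are assumed to apply for n of at least 3.

```python
import math

def odd_product(k):
    """Return k * (k-2) * ... * 3 * 1 for odd k (1 if k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 taxa."""
    return odd_product(2 * n - 5)

def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 3 taxa."""
    return odd_product(2 * n - 3)

for n in (3, 4, 5, 10, 20, 50):
    # The closed form with factorials gives the same value as the product
    closed = math.factorial(2 * n - 5) // (2 ** (n - 3) * math.factorial(n - 3))
    assert closed == unrooted_trees(n)
    print(n, unrooted_trees(n), rooted_trees(n))
```

Python integers have arbitrary precision, so the exact counts are available even for 80 taxa; the scientific notation in the table is only a matter of display. The speed at which these numbers grow is why an exhaustive search over all trees is feasible only for a handful of taxa.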