CLASS 16. Evaluation of performance. Phylogenetic Reconstruction (Part I)

Evaluation of gene predictions

HOMEWORK READING:

pp.102-103 (textbook);

Section 9.6 (supp. textbook);


Ch.10 (textbook);

Section 7.1 (supp. textbook);

The results of a gene prediction program fall into four categories:

  • True Positive: correct prediction
  • False Positive: prediction of region as a gene that is not a gene
  • False Negative: missed prediction
  • True Negative: correct prediction of gene absence
For gene prediction, these classifications can be done at different levels:
  • nucleotide (base) level (coding level of each nucleotide)
  • exon level (exact prediction of exon start and end points)
  • protein level (maintenance of correct reading frames)

Fig. 9.12 [Source]. Assessing prediction of genes/exons on nucleotide level.


It is important to use both specificity and sensitivity to assess program accuracy.

Sensitivity is defined as Sn=TP/(TP+FN), i.e. the proportion of real genes that were correctly predicted (the denominator includes missed predictions).

Specificity is the proportion of correct predictions among all predictions made (including incorrect ones): Sp=TP/(TP+FP). Note that in other fields this quantity is usually called precision or positive predictive value.

Sensitivity and specificity can be summarized in a single parameter, ranging from -1 to +1, called the correlation coefficient:

$CC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FN)(TN+FP)(TP+FP)(TN+FN)}}$
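As a minimal sketch, the three measures can be computed directly from the four counts. The function name and the example counts are hypothetical, chosen only for illustration, and the correlation coefficient is computed in its standard four-factor form, with (TP+FN)(TN+FP)(TP+FP)(TN+FN) under the square root.

```python
import math

def prediction_accuracy(tp, tn, fp, fn):
    """Return (Sn, Sp, CC) for the given prediction counts."""
    sn = tp / (tp + fn)   # sensitivity: share of real features that were found
    sp = tp / (tp + fp)   # specificity as defined in these notes: TP / (TP + FP)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom
    return sn, sp, cc

# Hypothetical nucleotide-level counts, for illustration only
sn, sp, cc = prediction_accuracy(tp=80, tn=900, fp=20, fn=20)
print(f"Sn={sn:.2f}  Sp={sp:.2f}  CC={cc:.3f}")
```

The sketch does not guard against zero denominators (for example, a program that predicts no genes at all gives TP+FP=0), so real evaluation code would need to handle those cases explicitly.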

Phylogenetic Reconstruction Overview

Phylogenetic analysis is the inference of evolutionary relationships between organisms.
Those relationships are usually represented by tree-like diagrams. Note: the assumption that evolution is tree-like is controversial.

Phylogeny is "the pattern of historical relationships between species or other groups resulting from divergence during evolution" (OED).

Steps of a phylogenetic analysis:

  1. Compilation of the sequence dataset
  2. Alignment
  3. Determination of the substitution model
  4. Tree building
  5. Tree evaluation


Why Phylogenetic Reconstruction?

  1. Systematic classification of organisms

    Fig. Animal Phylogeny, from the Histoire naturelle des animaux sans vertèbres of Jean-Baptiste Lamarck (1815).

    See also a more modern version at NCBI's Taxonomy Database.

    A quite entertaining historical overview of trees and networks has recently been written by Mark Ragan, "Trees and networks before and after Darwin" (highly recommended).

    Fossil evidence is scarce and incomplete. Moreover, microorganisms not only have very limited fossil evidence, but also lack enough morphological features to distinguish them from each other.

    Seminal paper by Emile Zuckerkandl and Linus Pauling (1965): "Evolutionary divergence and convergence in proteins".

  2. Evolution of molecules, e.g.:
    • domain shuffling,
    • reassignment of function,
    • gene duplications,
    • horizontal gene transfer,
    • drug targets,
    • detection of genes that drive evolution of a species/population (e.g. influenza virus),
    • relating extinct organisms to contemporary ones (e.g. mammoth, cave bear, Neanderthals)

Tree Terminology

Fig. Anatomy of an unrooted phylogenetic tree


Trees have a branching pattern (also called the topology), and branch lengths (usually scaled with substitutions, not time). If branch lengths are ignored, the tree is called a cladogram (usually implies a rooted tree, see below).

Trees can be unrooted (showing relative relationship among the taxa) and rooted (indicating position of the ancestor). Usually phylogenetic trees reconstructed from extant molecular sequences are unrooted. Additional knowledge is required to root the tree (more on this later).

Fig. Same tree topology as above, but this tree is rooted.


Fig. Example of a partially resolved tree.
In contrast, the tree in the previous figure is fully resolved.


Fig. Example of a subtree within a phylogenetic tree.


New Hampshire/Newick Tree Format is a computer-friendly way to represent trees using parentheses and commas.
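For illustration, here is a minimal sketch that writes a tree as a Newick string. The nested-tuple representation, the function name, and the taxon names A to F are assumptions made for this example; they are not taken from the figures above. Each pair of parentheses encloses the descendants of one internal node, commas separate siblings, and a semicolon ends the tree. Branch lengths, when present, follow a colon after a node, as in "((A:0.1,B:0.2):0.05,C:0.3);".

```python
def to_newick(node):
    """Convert a nested-tuple tree into Newick text (without the final ';').

    A leaf is a string (the taxon name); an internal node is a tuple of children.
    """
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"

# Hypothetical six-taxon tree with a three-way split at the top level,
# which is the usual way to write an unrooted tree in Newick format
tree = (("A", "B"), ("C", "D"), ("E", "F"))
newick = to_newick(tree) + ";"
print(newick)
```

This sketch handles topology only; supporting branch lengths would require storing a length with every node.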

A phylogenetic tree can be represented as a series of splits, or bipartitions. Each branch (such as the branch marked in red in the trees above) divides the tree into two parts, thereby forming a bipartition. A bipartition is often denoted by a series of "*" and "." characters, one per taxon. For the red branch above, the bipartition would be "****.." or, equivalently, "....**". Another example:

Fig. 7.4 [Source]. Bipartitions in a phylogenetic tree. Also note that there is a scale under the tree (unit: number of substitutions per site)
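The star/dot notation can be generated mechanically. The sketch below assumes a hypothetical six-taxon tree stored as nested tuples, with the top-level tuple being a three-way split (an unrooted tree); the function names are likewise illustrative. It lists only the internal branches, since the branch leading to a single taxon always gives a trivial bipartition.

```python
def leaves(node):
    """Return the taxon names found below a node."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]

def bipartitions(tree):
    """Return one '*'/'.' string per internal branch of the tree."""
    taxa = sorted(leaves(tree))
    result = []

    def visit(node, is_top):
        if isinstance(node, str):
            return                      # leaf branches are trivial splits
        if not is_top:                  # the top node has no branch above it
            below = set(leaves(node))
            result.append("".join("*" if t in below else "." for t in taxa))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return result

tree = (("A", "B"), ("C", "D"), ("E", "F"))   # hypothetical example tree
splits = bipartitions(tree)
print(splits)
```

Swapping every "*" and "." in a string describes the same split, so "**...." and "..****" are the same bipartition. A fully resolved unrooted tree with n taxa has n-3 internal branches, which is why this six-taxon example yields three strings.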


Fig. "Phylogenetic Tree", growing on Galapagos Islands (summer 2005)


Different trees vs. Different Topologies:

Fig. Five rooted tree topologies for four-taxon trees. They all have different topologies because the root is placed in a different location; in addition, the trees differ because their branch lengths are different. Note that these are not all possible rooted trees for four taxa.


Fig. If the position of the root is ignored, all five trees in the previous figure have the same topology as this unrooted tree.


You can think of phylogenetic trees as mobiles:

Fig. Mobile Art [Source]


Fig. Mobile Art (the art piece from previous figure, but rotated) [Source]


Fig. These trees are the same (same connectivity and same branch lengths):
rotation around a node does not change the tree!
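The mobile analogy can be checked in code: if the children of every node are stored as an unordered set, rotating around a node changes nothing. This is a minimal sketch that assumes rooted trees written as nested tuples with hypothetical taxa A to E; it compares connectivity only and ignores branch lengths.

```python
def canonical(node):
    """Order-independent form of a rooted tree given as nested tuples."""
    if isinstance(node, str):
        return node
    # A frozenset has no order, so swapping children leaves it unchanged
    return frozenset(canonical(child) for child in node)

t1 = (("A", "B"), ("C", ("D", "E")))
t2 = ((("E", "D"), "C"), ("B", "A"))   # t1 with children rotated at every node
t3 = (("A", "C"), ("B", ("D", "E")))   # different connectivity

same = canonical(t1) == canonical(t2)
different = canonical(t1) != canonical(t3)
print(same, different)
```

Two drawings that look different on the page may therefore be the same tree, while two that look similar may differ in which taxa are grouped together.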


Fig. Is this tree the same as the ones above?


Fig. Are these two trees the same?


External nodes of a tree represent the extant taxa under study and are referred to as Operational Taxonomic Units (OTUs). Internal nodes represent ancestral taxa (Li, Molecular Evolution, Sinauer, 1997).

Fig. With four OTUs (taxa, sequences, species), there are only three possible unrooted tree topologies. However, there is an infinite number of different trees, because trees with the same topology can still differ in branch lengths.


Fig. Are these trees the same?


Fig. Tree Shapes. Three possible shapes of a rooted tree for five OTUs. All 105 possible trees for five OTUs will have one of these tree shapes.


Number of Rooted and Unrooted Trees

Number of unrooted trees for n taxa: $N_u = (2n-5)\cdot(2n-7)\cdots 3\cdot 1 = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$

Number of rooted trees for n taxa: $N_r = (2n-3)\cdot(2n-5)\cdot(2n-7)\cdots 3\cdot 1 = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$

Note that the number of unrooted trees for n sequences is equal to the number of rooted trees for (n-1) sequences.

Number of taxa    Number of unrooted trees    Number of rooted trees
3                 1                           3
4                 3                           15
5                 15                          105
6                 105                         945
7                 945                         10395
8                 10395                       135135
9                 135135                      2027025
10                2027025                     34459425
20                2.22E+20                    8.20E+21
30                8.69E+36                    4.95E+38
40                1.31E+55                    1.01E+57
50                2.84E+74                    2.75E+76
60                5.01E+94                    5.86E+96
70                5.00E+115                   6.85E+117
80                2.18E+137                   3.43E+139

For comparison, the universe contains only about $10^{89}$ protons and has an age of about $5\times10^{17}$ seconds, or $5\times10^{29}$ picoseconds.
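The tree counts listed above can be reproduced with a few lines of code. This is a minimal sketch with hypothetical function names: it multiplies the odd numbers 3, 5, ..., 2n-5 for unrooted trees, and uses the fact that rooting a tree is equivalent to adding one more taxon.

```python
def num_unrooted(n):
    """Number of fully resolved unrooted trees for n >= 3 taxa."""
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n-5
        count *= k
    return count

def num_rooted(n):
    """Number of fully resolved rooted trees for n >= 2 taxa."""
    return num_unrooted(n + 1)         # the root acts as one extra taxon

for n in (3, 4, 5, 10, 20):
    print(n, num_unrooted(n), num_rooted(n))
```

Python integers have arbitrary precision, so the exact values for 80 or more taxa can be printed without overflow. The growth rate is the reason exhaustive evaluation of all trees is feasible only for a small number of taxa.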