Student presentation:
     Dynamics of genes in populations: Genetic drift and Kimura's theory of neutral evolution.


Question to discuss:
        Why are some parts of proteins conserved while others are not? Is this consistent with the theory of neutral evolution?

 

 

   I forgot to do this last week:  

Evolution of protein families

Homology (shared ancestry) versus  Analogy (convergent evolution)

Types of homology  (orthology, paralogy, xenology, synology)

Discussion and examples from Fitch's article (TIG 2000, see handout).
See globin trees as example. Question: Can one gene have more than one ortholog in another organism? (What is the human ortholog to the plant globin gene?)

Further notes on systematic terminology:

The term cladogram refers to a strictly bifurcating diagram, where each clade is defined by a common ancestor that gives rise only to members of this clade. I.e., a clade is monophyletic (derived from one ancestor) as opposed to polyphyletic (derived from many ancestors). A clade is recognized and defined by shared derived characters (= synapomorphies). Shared primitive characters (= symplesiomorphies) do not define a clade.

To use these terms you need to have polarized characters; for most molecular characters you don't know which state is primitive and which is derived (exceptions:....).

Related terms:
autapomorphy = a derived character that is present in only one group; an autapomorphic character does not tell us anything about the relationship of the group that has this character to other groups.

homoplasy = a derived character state that arose independently more than once (convergent evolution). Note that the characters in question might still be homologous (e.g., a position in a sequence alignment, or front limbs turned into wings in birds and bats).

paraphyletic = a taxonomic group that is defined by a common ancestor; however, the common ancestor of this group also has descendants that do not belong to the group. Many systematists despise paraphyletic groups (and consider them to be polyphyletic). Examples of paraphyletic groups are reptiles and protists. Many consider the archaea to be paraphyletic as well.

holophyletic = same as above, but the common ancestor gave rise only to members of the group.
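
Whether a named group is holophyletic can be checked mechanically once a rooted tree is given. The following is a minimal sketch in Python, assuming a tree written as nested tuples. The tree below is a simplified, illustrative topology, and the function names (leaves, mrca_leaves, classify) are made up for this example.

```python
def leaves(tree):
    """Return the set of leaf names below a node (a leaf is a string)."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result

def mrca_leaves(tree, group):
    """Return all leaves descended from the most recent common ancestor of group."""
    for child in ([] if isinstance(tree, str) else tree):
        if group <= leaves(child):
            return mrca_leaves(child, group)  # the ancestor lies deeper in this subtree
    return leaves(tree)

def classify(tree, group):
    """Holophyletic if the common ancestor has no descendants outside the group."""
    extra = mrca_leaves(tree, set(group)) - set(group)
    return ("holophyletic", extra) if not extra else ("not holophyletic", extra)

# Simplified, illustrative amniote tree (an assumption, not a reference phylogeny)
amniotes = ("mammal", ("turtle", (("lizard", "snake"), ("crocodile", "bird"))))

print(classify(amniotes, {"lizard", "snake"}))
print(classify(amniotes, {"turtle", "lizard", "snake", "crocodile"}))
```

On this tree the group "reptiles" leaves out the birds, which descend from the same common ancestor. This is the situation described as paraphyletic above.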

 

  Review on approaches to phylogenetic reconstruction
     (caution, there are many ways to cut a cake):

a.    DISTANCE ANALYSES

I. calculate pairwise distances
     (different distance measures, correction for multiple hits, correction for codon bias)

II. make distance matrix (table of pairwise corrected distances)

III. calculate tree from distance matrix

i) using an optimality criterion
(e.g., the smallest squared error between the distance matrix and the distances in the tree), or
ii) using algorithmic approaches (UPGMA or neighbor joining)
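
Steps I and II can be sketched in a few lines of Python. This is a minimal illustration, assuming DNA sequences that are already aligned and using the Jukes-Cantor formula d = -3/4 ln(1 - 4p/3) as the correction for multiple hits; other distance measures exist. The sequences and function names are made up for the example and are not taken from any package.

```python
import math

def p_distance(a, b):
    """Fraction of aligned sites that differ (positions with a gap are skipped)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def jukes_cantor(p):
    """Correct an observed DNA distance p for multiple hits (Jukes-Cantor)."""
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(seqs):
    """Table of pairwise corrected distances, keyed by (name, name)."""
    names = list(seqs)
    return {(i, j): jukes_cantor(p_distance(seqs[i], seqs[j]))
            for i in names for j in names}

# Made-up toy alignment
toy = {"A": "ACGTACGTAC", "B": "ACGTACGTAT", "C": "ACGAACGGAT"}
matrix = distance_matrix(toy)
for (i, j), d in matrix.items():
    if i < j:
        print(i, j, round(d, 4))
```

The corrected distance is always at least as large as the observed fraction of differences, and the gap between the two grows as sequences become more divergent.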

b.    PARSIMONY ANALYSES

find the tree that explains the sequence data with the minimum number of substitutions
(tree includes hypothesis of sequence at each of the nodes)
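
For a given tree, the minimum number of substitutions can be counted one alignment column at a time. The sketch below uses Fitch's set-based counting method on rooted, bifurcating trees written as nested 2-tuples; this is one common way to do it, not necessarily what any particular program does. The alignment and function names are made up for illustration.

```python
def fitch(tree, states):
    """Return (possible ancestral states, substitution count) for one column.
    tree: nested 2-tuples with leaf names as strings; states: leaf name -> base."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    lset, lcost = fitch(left, states)
    rset, rcost = fitch(right, states)
    common = lset & rset
    if common:                       # children can agree: no extra substitution
        return common, lcost + rcost
    return lset | rset, lcost + rcost + 1   # disagreement costs one substitution

def parsimony_score(tree, alignment):
    """Sum the minimum substitution counts over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
               for i in range(length))

# Made-up toy alignment and the three possible groupings of four taxa
toy = {"A": "AAGT", "B": "AAGC", "C": "CTGT", "D": "CTGC"}
candidates = [(("A", "B"), ("C", "D")), (("A", "C"), ("B", "D")), (("A", "D"), ("B", "C"))]
for tree in candidates:
    print(tree, parsimony_score(tree, toy))
```

The most parsimonious tree is the candidate with the lowest score. With more than a handful of sequences the number of candidate trees becomes too large to enumerate, which is why programs rely on heuristic searches.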

c.    MAXIMUM LIKELIHOOD ANALYSES

Given a model for sequence evolution, find the tree that has the highest probability under this model.
This approach can also be used to successively refine the model.
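
The likelihood calculation can be illustrated as follows. This is a minimal sketch, assuming the Jukes-Cantor model and fixed branch lengths (a real program would also optimize the branch lengths and could use a richer model). It uses the standard recursion over the tree, often called the pruning algorithm, to sum over all unobserved ancestral states. The toy alignment, the branch length of 0.1 and all names are assumptions made for the example.

```python
import math

BASES = "ACGT"

def jc_prob(x, y, t):
    """Jukes-Cantor probability of base y after branch length t, starting from x."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def partial(node, column):
    """Probability of the data below node, for each possible base at node.
    A leaf is a name; an internal node is a tuple of (child, branch length) pairs."""
    if isinstance(node, str):
        return {b: 1.0 if b == column[node] else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        below = partial(child, column)
        for b in BASES:
            result[b] *= sum(jc_prob(b, c, t) * below[c] for c in BASES)
    return result

def log_likelihood(tree, alignment):
    """Sum of per-site log likelihoods, with equal base frequencies at the root."""
    length = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += math.log(sum(0.25 * v for v in partial(tree, column).values()))
    return total

def two_cherries(w, x, y, z, t=0.1):
    """Rooted tree ((w,x),(y,z)) with every branch of length t."""
    return ((((w, t), (x, t)), t), (((y, t), (z, t)), t))

toy = {"A": "AAGT", "B": "AAGC", "C": "CTGT", "D": "CTGC"}
trees = [two_cherries("A", "B", "C", "D"), two_cherries("A", "C", "B", "D"),
         two_cherries("A", "D", "B", "C")]
scores = [log_likelihood(tree, toy) for tree in trees]
print(scores)
```

The tree with the highest (least negative) log likelihood is preferred. Here only the topology differs between the three candidates, so the comparison isolates the effect of the branching order.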

 

On Wednesday we will use programs from the PHYLIP package.

The programs run on different platforms, are available free of charge, and are accompanied by excellent manuals. Joe Felsenstein has an extraordinary talent for getting complicated things (like formulas) across to readers with a biological background.

This package has different programs for
Distance, Parsimony and Maximum Likelihood analyses using DNA sequences
   (DNADIST, DNAPARS, DNAML) or
protein sequences
   (PROTDIST, PROTPARS, PROML).
In addition, there are programs that calculate trees from distance matrices
   (FITCH for methods that use an optimality criterion, and
   NEIGHBOR for algorithmic approaches).

All of the programs work the same way:

  • The data the program works on are in a file called infile.
    If this file is not in the same directory as the program, the user is prompted for a file name.
  • The user selects options from a menu and, when done, enters Y; the program then starts working.
  • The results are written into a file called outfile, and if trees were calculated these are written into a file called treefile.
  • The latest version asks the user whether to overwrite or append to an already existing file. But if the program crashes, the previous entries in the file are lost. Therefore it is highly recommended that you rename the outfile and the treefile every time a program finishes.
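
The renaming step is easy to forget, so it can be scripted. Below is a small Python sketch, assuming the result files are named outfile and treefile as described above; the function name keep_results and the tag naming scheme are made up for illustration and are not part of PHYLIP.

```python
from pathlib import Path

def keep_results(tag, directory="."):
    """Rename outfile and treefile to <tag>.outfile and <tag>.treefile if present.
    Returns the list of new file names."""
    renamed = []
    for name in ("outfile", "treefile"):
        source = Path(directory) / name
        if source.exists():
            target = source.with_name(f"{tag}.{name}")
            if target.exists():
                # Never silently overwrite an earlier result
                raise FileExistsError(f"{target} already exists; pick another tag")
            source.rename(target)
            renamed.append(target.name)
    return renamed

# Example call after a program has finished (hypothetical tag):
# keep_results("dnadist_run1")
```

Calling the function after each program run leaves the names outfile and treefile free for the next program.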

To perform a bootstrap analysis using neighbor joining, you use the following programs in series, each acting on the output of the previous one:

  • SEQBOOT: generates pseudosamples through bootstrap
  • DNADIST or PROTDIST: generates distance matrices for each of the pseudosamples
  • FITCH or NEIGHBOR: generate trees from the distance matrices
  • CONSENSE: generates a partition table and a consensus tree
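
The first step of this series, generating pseudosamples, amounts to drawing alignment columns at random with replacement until the pseudosample has as many columns as the original alignment. The Python sketch below illustrates the idea only; it is not SEQBOOT's code, and the toy alignment and function names are assumptions made for the example.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement; same length as the original."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    # The same column choices are applied to every sequence, so columns stay intact
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}

def pseudosamples(alignment, replicates, seed=1):
    """Generate a list of bootstrap pseudosamples from one alignment."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(replicates)]

# Made-up toy alignment
toy = {"A": "AAGTCC", "B": "AAGCCT", "C": "CTGTCA", "D": "CTGCCG"}
samples = pseudosamples(toy, replicates=3)
for sample in samples:
    print(sample)
```

Each pseudosample is then analyzed like the original data, and the consensus step counts how often each group of sequences appears among the resulting trees.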

While this is more cumbersome than using Clustal, you also have much better control over the individual steps, and you can choose among many different alternatives.

 

Long Branch Attraction Artifacts

A problem shared by many approaches to building trees is LONG BRANCH ATTRACTION: lineages that have experienced higher substitution rates are grouped together in the reconstructed history, even though each of the fast-changing lineages might be more closely related to a lineage with a slower substitution rate. Although this problem has been discussed extensively in the literature, it has no easy solution. The situation is compounded by the fact that the long branches often look shorter in the reconstructed phylogenies.

A similar case, although most algorithms appear less prone to fall victim to this problem, is long branch attraction in cases where the substitution rate is the same in all lineages, but the long branches are due to the absence of side branches.
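
Why parsimony is misled can be seen in a small calculation in the spirit of Felsenstein (1978). The Python sketch below is a deliberately simplified illustration: it assumes a symmetric two-state model (states 0 and 1) rather than DNA or protein sequences, a true tree ((A,B),(C,D)) in which A and C sit on long branches, and made-up change probabilities p (long branches) and q (short branches and the internal branch). It computes how often a site is expected to support the true grouping AB|CD versus the grouping AC|BD that joins the two long branches.

```python
from itertools import product

def pattern_probability(pattern, p, q):
    """Probability of a site pattern (states of A, B, C, D) on the tree ((A,B),(C,D)).
    p: change probability on the long branches leading to A and C;
    q: change probability on the branches to B and D and on the internal branch."""
    def step(parent, child, change):
        return change if parent != child else 1.0 - change
    a, b, c, d = pattern
    total = 0.0
    for u, v in product((0, 1), repeat=2):  # sum over the unobserved internal nodes
        total += (0.5 * step(u, a, p) * step(u, b, q) * step(u, v, q)
                  * step(v, c, p) * step(v, d, q))
    return total

def support(p, q):
    """Expected fraction of sites favoring the true tree and the long-branch tree."""
    true_tree = (pattern_probability((0, 0, 1, 1), p, q)
                 + pattern_probability((1, 1, 0, 0), p, q))
    long_together = (pattern_probability((0, 1, 0, 1), p, q)
                     + pattern_probability((1, 0, 1, 0), p, q))
    return true_tree, long_together

for p, q in [(0.1, 0.1), (0.4, 0.05)]:
    true_tree, long_together = support(p, q)
    print(f"p={p}, q={q}: AB|CD sites {true_tree:.4f}, AC|BD sites {long_together:.4f}")
```

With equal rates on all branches, sites supporting the true tree are the more frequent ones. With two long branches, sites supporting the wrong grouping become more frequent, so adding more data makes parsimony more confident in the wrong tree.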

Solutions:

 Break up long branches by adding additional sequences.

 Use algorithms that are less sensitive to long-branch attraction. Parsimony is quite sensitive; maximum likelihood approaches that incorporate among-site rate variation (ASRV) into the model usually do well. Using simulated protein sequence evolution, i.e., sequences for which the true tree is known, we found that long-branch attraction in the presence of ASRV can become a problem even when sequences are more than 50% identical. However, ML approaches did reasonably well, depending on the model, down to about 20% sequence identity (see the online poster).

 It has been suggested that algorithms that are insensitive to long branch attraction might make the opposite error, long branch repulsion (Mark Siddall in Cladistics 14, 209-220); however, we did not see any hint of this in our exploration of phylogenetic reconstruction from amino acid sequences (see the above-mentioned poster).

Some literature:

  • Felsenstein, J. (1978) Cases in which parsimony and compatibility methods will be positively misleading. Syst. Zool. 27, 401-410
  • Hendy, M.D., Penny, D. (1989) A framework for the quantitative study of evolutionary trees. Syst. Zool. 38, 297-309
  • Huelsenbeck, J.P. (1995) Performance of phylogenetic methods in simulation. Syst. Biol. 44, 17-48
  • Kuhner, M.K., Felsenstein, J. (1994) A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. 11, 459-468
  • Tateno, Y., Takezaki, N., Nei, M. (1994) Relative efficiencies of the maximum-likelihood, neighbor-joining, and maximum-parsimony methods when substitution rate varies with site. Mol. Biol. Evol. 11, 261-277
  • Siddall, M.E. (1998) Success of parsimony in the four-taxon case: long-branch repulsion by likelihood in the Farris zone. Cladistics 14, 209-220