MCB 372 - CLASS 6

   Overview of approaches to phylogenetic reconstruction
     (caution: there are many ways to cut a cake):

a.    DISTANCE ANALYSES

I.    calculate pairwise distances
      (different distance measures, correction for multiple hits, correction for codon bias)

II.   make distance matrix (table of pairwise corrected distances)

III.  calculate tree from distance matrix

      i) using an optimality criterion
         (e.g.: smallest error between distance matrix and distances in tree), or use
      ii) algorithmic approaches (UPGMA or neighbor joining)
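
Steps I to III can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used by any particular package: the four sequences are made up, the Jukes-Cantor formula stands in for the "correction for multiple hits", and the UPGMA function (a hypothetical helper) returns only the branching order, without branch lengths.

```python
import math
from itertools import combinations

# Hypothetical toy alignment (not from the course data).
TOY_SEQS = {
    "A": "ACGTACGTACGT",
    "B": "ACGTACGTACGA",
    "C": "ACCTAGGTTCGT",
    "D": "ACCTAGGTTCCT",
}

def p_distance(s1, s2):
    """Step I: fraction of differing sites, ignoring columns with gaps."""
    pairs = [(a, b) for a, b in zip(s1, s2) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor(p):
    """Correct an observed distance for multiple hits (defined for p < 0.75)."""
    return -0.75 * math.log(1 - 4 * p / 3)

def distance_matrix(seqs):
    """Step II: table of pairwise corrected distances, stored symmetrically."""
    d = {}
    for a, b in combinations(seqs, 2):
        d[a, b] = d[b, a] = jukes_cantor(p_distance(seqs[a], seqs[b]))
    return d

def upgma(names, dist):
    """Step III (algorithmic approach): join the two closest clusters until one is left."""
    clusters = {name: 1 for name in names}      # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: d[pair])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        new = (a, b)
        for c in clusters:                       # size-weighted average distance
            d[new, c] = d[c, new] = (size_a * d[a, c] + size_b * d[b, c]) / (size_a + size_b)
        clusters[new] = size_a + size_b
    return next(iter(clusters))

matrix = distance_matrix(TOY_SEQS)
print(upgma(list(TOY_SEQS), matrix))
```

For the toy data the corrected distances are slightly larger than the raw fractions of differing sites, and UPGMA joins A with B and C with D before joining the two pairs.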

b.    PARSIMONY ANALYSES

find the tree that explains the sequence data with the minimum number of substitutions

(tree includes hypothesis of sequence at each of the nodes)
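
As an illustration of what "minimum number of substitutions" means, the following sketch counts substitutions on a given tree with Fitch's algorithm and compares the three possible groupings of four sequences. The alignment and all names are hypothetical; real parsimony programs additionally have to search through the possible trees, which is not shown here.

```python
# Hypothetical toy alignment; trees are written as nested tuples of leaf names.
TOY_ALIGNMENT = {"A": "AAGC", "B": "AAGT", "C": "CGGC", "D": "CGTT"}

TREES = {
    "AB|CD": (("A", "B"), ("C", "D")),
    "AC|BD": (("A", "C"), ("B", "D")),
    "AD|BC": (("A", "D"), ("B", "C")),
}

def fitch_site(tree, states):
    """Return (possible states at this node, substitutions needed below it)."""
    if isinstance(tree, str):                 # a leaf: its state is observed
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:                                # children can agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the minimum number of substitutions over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )

for name, tree in TREES.items():
    print(name, parsimony_score(tree, TOY_ALIGNMENT))
```

The state sets computed at the internal nodes are the "hypothesis of sequence at each of the nodes". For this toy alignment the grouping of A with B needs 5 substitutions, the two alternatives need 6 and 7.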

 

c.    MAXIMUM LIKELIHOOD ANALYSES

given a model for sequence evolution, find the tree under which the observed sequences have the highest probability (likelihood).

This approach can also be used to successively refine the model.
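
A minimal sketch of the likelihood calculation, under strong simplifying assumptions: the Jukes-Cantor model (all substitutions equally likely), one fixed length t for every branch, and a made-up alignment. Real programs optimize the branch lengths and often the model parameters as well; the function names here are hypothetical.

```python
import math

BASES = "ACGT"

# Hypothetical toy alignment; trees are nested tuples of leaf names.
TOY_ALIGNMENT = {"A": "AAGC", "B": "AAGT", "C": "CGGC", "D": "CGTT"}

TREES = {
    "AB|CD": (("A", "B"), ("C", "D")),
    "AC|BD": (("A", "C"), ("B", "D")),
    "AD|BC": (("A", "D"), ("B", "C")),
}

def jc69(i, j, t):
    """Jukes-Cantor probability of going from base i to base j along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial(tree, column, t):
    """Probability of the data below a node, for each possible state at that node."""
    if isinstance(tree, str):                              # leaf: state is observed
        return {x: 1.0 if x == column[tree] else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child in tree:
        below = partial(child, column, t)
        for x in BASES:
            result[x] *= sum(jc69(x, y, t) * below[y] for y in BASES)
    return result

def log_likelihood(tree, alignment, t=0.1):
    """Log probability of the alignment given the tree (sites treated as independent)."""
    length = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        root = partial(tree, column, t)
        total += math.log(sum(0.25 * root[x] for x in BASES))   # equal base frequencies
    return total

for name, tree in TREES.items():
    print(name, round(log_likelihood(tree, TOY_ALIGNMENT), 3))
```

The tree with the highest (least negative) log likelihood is the maximum likelihood estimate among the trees compared; here that is the grouping of A with B.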

 

d.    OTHER APPROACHES

spectral analyses, evolutionary parsimony (i.e., look only at patterns of substitutions), Hadamard conjugation

 

A problem shared by many approaches to building trees is LONG BRANCH ATTRACTION: lineages that have experienced higher substitution rates tend to group together in the reconstructed history, even though each of the fast-changing lineages might be more closely related to a line of descent with a slower substitution rate. Although this problem has been discussed extensively in the literature, there is no easy solution. The situation is compounded by the fact that the long branches often look shorter in the reconstructed phylogenies.

A similar case, although most algorithms are much less prone to fall victim to it, is long branch attraction where the substitution rate is the same in all lineages, but the long branches are due to the absence of side branches.
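
The effect can be reproduced with a small simulation. The sketch below makes its own assumptions throughout: sites evolve under the Jukes-Cantor model on the true tree ((A,B),(C,D)), the branches leading to A and C are long, all other branches are short, and the branch lengths, number of sites and seed are arbitrary. For four sequences the most parsimonious tree is the one supported by the largest number of informative sites, so counting site patterns is sufficient.

```python
import math
import random
from collections import Counter

BASES = "ACGT"

def evolve(state, t, rng):
    """Evolve one site along a branch of length t (substitutions per site, Jukes-Cantor)."""
    p_change = 0.75 * (1 - math.exp(-4 * t / 3))
    if rng.random() < p_change:
        return rng.choice([b for b in BASES if b != state])
    return state

def simulate_site(long_t, short_t, rng):
    """True tree ((A,B),(C,D)); A and C sit on long branches."""
    left = rng.choice(BASES)                 # node joining A and B
    right = evolve(left, short_t, rng)       # node joining C and D (short central branch)
    return (evolve(left, long_t, rng), evolve(left, short_t, rng),
            evolve(right, long_t, rng), evolve(right, short_t, rng))

def parsimony_winner(n_sites=2000, long_t=1.5, short_t=0.05, seed=0):
    """Count informative sites per grouping and return the best-supported one."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_sites):
        a, b, c, d = simulate_site(long_t, short_t, rng)
        if a == b and c == d and a != c:
            counts["AB|CD"] += 1             # supports the true tree
        elif a == c and b == d and a != b:
            counts["AC|BD"] += 1             # groups the two long branches
        elif a == d and b == c and a != b:
            counts["AD|BC"] += 1
    return counts.most_common(1)[0][0], counts

print(parsimony_winner())                    # long branches to A and C
print(parsimony_winner(long_t=0.05))         # all branches equally short
```

With the long branches, most informative sites group A with C although they are not neighbors in the true tree; with equal branch lengths the true grouping is recovered.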

 

Solutions:

 Break up long branches by adding additional sequences.

 Use algorithms that are less sensitive to long-branch attraction.  Parsimony is quite sensitive; maximum likelihood approaches that incorporate among-site rate variation (ASRV) into the model usually do quite well. Using simulated protein sequence evolution (i.e., the true tree is known for these sequences), we found that long-branch attraction in the presence of ASRV can become a problem even when sequences are more than 50% identical. However, ML approaches did reasonably well, depending on the model, down to about 20% sequence identity (see the online poster).

*  A complicating problem is that some of the algorithms that are less sensitive to long branch attraction were found to make the opposite error, long branch repulsion (Siddall 1998, Cladistics 14, 209-220). However, we didn't see any hint of this in our exploration of phylogenetic reconstruction from amino acid sequences (see the above-mentioned poster).

Some literature:

Felsenstein, J. (1978) Cases in which parsimony and compatibility methods will be positively misleading. Syst. Zool. 27, 401-410

Hendy, M.D., and Penny, D. (1989) A framework for the quantitative study of evolutionary trees. Syst. Zool. 38, 297-309

Huelsenbeck, J.P. (1995) Performance of phylogenetic methods in simulation. Syst. Biol. 44, 17-48

Kuhner, M.K., and Felsenstein, J. (1994) A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. 11, 459-468

Tateno, Y., Takezaki, N., and Nei, M. (1994) Relative efficiencies of the maximum-likelihood, neighbor-joining, and maximum-parsimony methods when substitution rate varies with site. Mol. Biol. Evol. 11, 261-277

Siddall, M.E. (1998) Success of parsimony in the four-taxon case: long-branch repulsion by likelihood in the Farris Zone. Cladistics 14, 209-220

 Bootstrapping and non-informative sites

It is a good idea to use only those positions for phylogenetic analyses for which one is sure that the sites are indeed homologous.  However, in order to obtain a bootstrap value better than 90% for a branch one only needs three sites that change along this branch.  Contrary to intuition (at least mine), these bootstrap values are not lowered by adding non-informative sites to the alignment.  The following discusses an example of four sequences which have 5 positions supporting the central branch in one orientation and 2 positions each supporting the two alternatives:

spec1 ACGTG AC CG

spec2 ACGTG GG AA

spec3 TGCAC GG CG

spec4 TGCAC AC AA

 Regardless of how many non-informative sites are added, the grouping of spec1 with spec2 (and spec3 with spec4) is found in about 80% of the bootstrapped samples.
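
The remark that three changing sites are enough for a bootstrap value above 90% can be checked with a short calculation. It rests on a simplifying assumption: the branch is recovered whenever at least one of its k supporting sites is drawn into the bootstrap sample of n sites, and no sites conflict with it. The function name is hypothetical.

```python
def prob_branch_in_sample(k, n):
    """Probability that n draws with replacement from n sites hit at least one of k given sites."""
    return 1 - (1 - k / n) ** n

# Alignment length n hardly matters: the value approaches 1 - exp(-k).
for n in (10, 100, 1000, 10000):
    print(n, round(prob_branch_in_sample(3, n), 3), round(prob_branch_in_sample(2, n), 3))
```

For k = 3 the probability stays close to 1 - exp(-3), about 0.95, however long the alignment is; for k = 2 it stays below 0.9. This is also why padding the alignment with non-informative sites does not lower the support.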

 

Example of sequence file with added non-informative sites:

spec1 AAGCAGCTGT AAGCAGCTGT AAGCAGCTGT AAGCAGCTGT AAGCAGCTGT ACGTG ACCG

spec2 AACCGCCTGT AACCGCCTGT AACCGCCTGT AACCGCCTGT AACCGCCTGT ACGTG GGAA

spec3 AACGACCTCT AACGACCTCT AACGACCTCT AACGACCTCT AACGACCTCT TGCAC GGCG

spec4 ATCCACCAGA ATCCACCAGA ATCCACCAGA ATCCACCAGA ATCCACCAGA TGCAC ACAA
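
The bootstrap experiment can be imitated with a short simulation. The sketch below is a stand-in under stated assumptions, not a reproduction of the original analysis: for four sequences it picks the most parsimonious grouping by counting the informative sites that support each of the three possibilities, splits ties evenly, and represents the added non-informative sites simply as columns that support no grouping. Function names, number of replicates and random seed are arbitrary choices.

```python
import random
from collections import Counter

# The nine informative columns of the example (spaces removed).
ALIGNMENT = {
    "spec1": "ACGTGACCG",
    "spec2": "ACGTGGGAA",
    "spec3": "TGCACGGCG",
    "spec4": "TGCACACAA",
}

TOPOLOGIES = ("12|34", "13|24", "14|23")

def site_pattern(column):
    """Return the grouping a column supports, or None if it is non-informative."""
    a, b, c, d = column
    if a == b and c == d and a != c:
        return "12|34"
    if a == c and b == d and a != b:
        return "13|24"
    if a == d and b == c and a != b:
        return "14|23"
    return None

def bootstrap_support(alignment, n_extra=0, replicates=2000, seed=1):
    """Percent of bootstrap samples in which each grouping is the most parsimonious."""
    rng = random.Random(seed)
    names = sorted(alignment)
    columns = zip(*(alignment[name] for name in names))
    # n_extra columns that support nothing stand in for added non-informative sites.
    patterns = [site_pattern(col) for col in columns] + [None] * n_extra
    support = Counter()
    for _ in range(replicates):
        sample = rng.choices(patterns, k=len(patterns))    # resample sites with replacement
        counts = Counter(p for p in sample if p is not None)
        best = max(counts[t] for t in TOPOLOGIES)
        winners = [t for t in TOPOLOGIES if counts[t] == best]
        for t in winners:                                   # ties are split evenly
            support[t] += 1 / len(winners)
    return {t: 100 * support[t] / replicates for t in TOPOLOGIES}

for extra in (0, 50, 200):
    print(extra, bootstrap_support(ALIGNMENT, n_extra=extra))
```

Under these assumptions the support for grouping spec1 with spec2 should stay in the same range whether or not non-informative columns are added, in line with the table that follows.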

 

Table of bootstrap support (values in percent; means and standard deviations are over 3 replicates with 100 bootstrapped samples each):

Non-informative sites       Spec1 with 2              Spec1 with 3 or 1 with 4
added                       (5 informative sites      (2 informative sites each
                            support this branch)      support this branch)

+0
  replicate values          82.5, 77.83, 84.17        12.17, 3.67, 11, 6.5, 11.83, 10.33
  mean                      81.50                     9.25
  standard deviation        3.29                      3.41

+50
  replicate values          77.00, 77.00, 78.83       13.00, 10.00, 14.00, 9.00, 11.33, 9.83
  mean                      77.61                     11.19
  standard deviation        1.06                      1.96

+200
  replicate values          78.83, 78.33, 76.50       14.33, 6.83, 12.83, 8.83, 14.50, 9.00
  mean                      77.89                     11.05
  standard deviation        1.23                      3.25

+2000
  replicate values          79.67, 83.33, 81.33       10.67, 9.67, 10.33, 6.33, 10.83, 7.83
  mean                      81.44                     9.28
  standard deviation        1.83                      1.81

 

         X-Windows allows you to use a graphical user interface on a remote computer. For everyday operations it is usually not worth the trouble; for running alignment programs, however, it is difficult to do without. See the introduction to X-Windows page.

         seaview is an X-Windows program that allows the editing of sequence alignments (colorful and easy to use).  (A PC version is available as well.)

The program reads a variety of formats; I encountered the fewest problems with the interleaved PHYLIP format generated by clustalw. Gaps are inserted with the spacebar; for other options see the help menus.

To start seaview:
start the X-Windows server,
telnet to the UNIX machine,
redirect the display to your PC or Mac,
cd (change directory) to the directory where your sequences are, and
type seaview and hit return.

         phylo_win is an interactive program for phylogenetic reconstruction. The program allows many more approaches than clustalw.

Galtier, N., Gouy, M., and Gautier, C. 1996. SEAVIEW and PHYLO_WIN: two graphic tools for sequence alignment and molecular phylogeny. CABIOS 12: 543-548

The actual algorithms used for phylogenetic reconstruction are
the parsimony programs DNAPARS and PROTPARS (adapted from Joe Felsenstein's PHYLIP),
distance matrix analyses allowing the use of many different ways to calculate distances (including LogDet distances, which are less sensitive to differences in nucleotide composition), and
fastDNAml (code from Gary Olsen).

Positive features of phylo_win:

 Species and positions can be interactively selected and evaluated.

 The program has help facilities. Use them if you are not sure what the different options mean. Figure 11 in chapter 11 in Hillis, Moritz, and Mable: Molecular Systematics, Second Edition gives an overview of different frequently used substitution models.

 The tree display section of the program is positively stunning (at least to me). You can root, rearrange, and display additional information at the click of the mouse. You can output your trees in treefile format, which is read by TREEVIEW, or as a postscript file, which can be printed on a postscript printer or read into Adobe's Acrobat Distiller.

 The search algorithm used for maximum likelihood is quite good (see options), and the program is much faster than DNAml.

Drawbacks:
no maximum likelihood for protein sequences,
PROTPARS is very slow,
the implemented heuristic searches for most parsimonious trees fail when large data sets (more than 10 sequences) are analyzed (use PAUP instead), and
no programs are included that would take among-site rate variation into consideration.

PHYLO_WIN and SEAVIEW were developed by Nicolas Galtier, M. Gouy, and C. Gautier at the
Laboratoire de Biometrie, Genetique et Biologie des Populations
UMR 5558. Universite C. Bernard. Lyon1.
42, Bd du 11 novembre 1918
69622 VILLEURBANNE
France.

Both programs can be obtained for many different platforms via anonymous FTP from ftp://biom3.univ-lyon1.fr/pub/mol_phylogeny/

The display on PCs can be problematic. As I usually use the same terminal to log in, I added flags to my profile. For example, I added the line

alias phywi='phylo_win -seqfontsize 12 -systemfont helvetica,12,b -bigfont times,12'

to the file .profile in my home directory. You could also add a statement

alias sv='seaview -reducefonts'

Every time I type phywi, phylo_win is invoked with the specified font sizes.