MCB 372 - CLASS 6
Overview of approaches to phylogenetic reconstruction
(caution: there are many ways to cut a cake):
a. DISTANCE ANALYSES
I. calculate pairwise distances (different distance measures, correction for multiple hits, correction for codon bias)
II. make distance matrix (table of pairwise corrected distances)
III. calculate tree from distance matrix, either
i) using an optimality criterion (e.g., smallest error between the distance matrix and the distances in the tree), or
ii) using algorithmic approaches (UPGMA or neighbor joining)
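Steps I and II can be sketched in a few lines of Python. This is only a minimal illustration: the three toy sequences are invented, and the Jukes-Cantor formula stands in for "correction for multiple hits" as one of several possible corrections; the function names are not taken from any particular package.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)

def jukes_cantor(p):
    """Correct an observed proportion of differences for multiple hits."""
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(sequences):
    """Return the names and a symmetric table of pairwise corrected distances."""
    names = list(sequences)
    matrix = [[0.0] * len(names) for _ in names]
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            p = p_distance(sequences[names[i]], sequences[names[j]])
            matrix[i][j] = matrix[j][i] = jukes_cantor(p)
    return names, matrix

# Hypothetical toy alignment, invented for illustration only.
toy = {
    "seqA": "ACGTACGTACGTACGTACGT",
    "seqB": "ACGTACGTACGAACGTACGT",
    "seqC": "ACGTTCGTACGAACGTACCT",
}
names, matrix = distance_matrix(toy)
for name, row in zip(names, matrix):
    print(name, " ".join(f"{d:.4f}" for d in row))
```

The corrected distances are always at least as large as the observed proportions of differences, and the gap widens as sequences become more divergent. A matrix like this is the input for step III.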
b. PARSIMONY ANALYSES
find the tree that explains the sequence data with the minimum number of substitutions
(the tree includes a hypothesis of the sequence at each of the nodes)
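As a minimal sketch of the parsimony criterion, the following Python code counts the minimum number of substitutions for each of the three possible trees of four sequences, using Fitch's set-based counting. The alignment is the four-sequence example from the bootstrapping section of these notes with the spaces removed; the function and variable names are chosen here for illustration.

```python
def fitch_quartet_changes(a, b, c, d):
    """Minimum number of substitutions at one site on the tree ((a,b),(c,d))."""
    changes = 0

    def join(left, right):
        # Fitch rule: keep the shared states, otherwise take the union
        # and count one substitution.
        nonlocal changes
        common = left & right
        if common:
            return common
        changes += 1
        return left | right

    left_node = join({a}, {b})
    right_node = join({c}, {d})
    join(left_node, right_node)
    return changes

ALIGNMENT = {
    "spec1": "ACGTGACCG",
    "spec2": "ACGTGGGAA",
    "spec3": "TGCACGGCG",
    "spec4": "TGCACACAA",
}
# The three possible unrooted trees for four sequences.
TOPOLOGIES = {
    "(spec1,spec2),(spec3,spec4)": ("spec1", "spec2", "spec3", "spec4"),
    "(spec1,spec3),(spec2,spec4)": ("spec1", "spec3", "spec2", "spec4"),
    "(spec1,spec4),(spec2,spec3)": ("spec1", "spec4", "spec2", "spec3"),
}

def tree_length(alignment, order):
    """Sum the per-site minimum substitution counts over all columns."""
    columns = zip(*(alignment[name] for name in order))
    return sum(fitch_quartet_changes(*column) for column in columns)

lengths = {label: tree_length(ALIGNMENT, order) for label, order in TOPOLOGIES.items()}
best = min(lengths, key=lengths.get)
for label, length in lengths.items():
    print(label, length)
print("most parsimonious:", best)
```

With only four sequences all trees can be scored exhaustively; for larger data sets the number of trees grows so fast that heuristic searches are needed.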
c. MAXIMUM LIKELIHOOD ANALYSES
given a model for sequence evolution, find the tree under which the observed sequence data have the highest probability.
This approach can also be used to successively refine the model.
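The likelihood criterion can be illustrated with a deliberately simplified Python sketch. The assumptions are mine, not part of the programs discussed in these notes: the Jukes-Cantor model, one branch length shared by all five branches of a four-sequence tree, and a coarse grid search over that branch length in place of a proper optimizer. The alignment is the four-sequence example from the bootstrapping section.

```python
import math
from itertools import product

STATES = "ACGT"

def jc_prob(x, y, t):
    """Jukes-Cantor probability of observing y at the end of a branch of length t that starts in x."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay

def site_likelihood(a, b, c, d, t):
    """Probability of one column on ((a,b),(c,d)), summed over both internal nodes."""
    total = 0.0
    for x, y in product(STATES, repeat=2):
        total += (0.25 * jc_prob(x, a, t) * jc_prob(x, b, t)
                  * jc_prob(x, y, t) * jc_prob(y, c, t) * jc_prob(y, d, t))
    return total

def log_likelihood(alignment, order, t):
    """Log-likelihood of the whole alignment; sites are treated as independent."""
    columns = zip(*(alignment[name] for name in order))
    return sum(math.log(site_likelihood(*column, t)) for column in columns)

ALIGNMENT = {
    "spec1": "ACGTGACCG",
    "spec2": "ACGTGGGAA",
    "spec3": "TGCACGGCG",
    "spec4": "TGCACACAA",
}
TOPOLOGIES = {
    "(spec1,spec2),(spec3,spec4)": ("spec1", "spec2", "spec3", "spec4"),
    "(spec1,spec3),(spec2,spec4)": ("spec1", "spec3", "spec2", "spec4"),
    "(spec1,spec4),(spec2,spec3)": ("spec1", "spec4", "spec2", "spec3"),
}
# Assumption: a coarse grid of shared branch lengths (0.05 ... 1.00).
GRID = [0.05 * k for k in range(1, 21)]

results = {}
for label, order in TOPOLOGIES.items():
    # For each tree, keep the branch length that maximizes the likelihood.
    results[label] = max((log_likelihood(ALIGNMENT, order, t), t) for t in GRID)

best = max(results, key=lambda label: results[label][0])
for label, (lnl, t) in results.items():
    print(f"{label}  lnL = {lnl:.3f} at t = {t:.2f}")
print("highest likelihood:", best)
```

Real programs optimize every branch length separately and usually estimate further model parameters as well; the grid search here only hints at how parameters are adjusted to improve the fit.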
d. Else:
spectral analyses, evolutionary parsimony (i.e., approaches that look only at patterns of substitutions), Hadamard conjugation
A problem shared by many approaches to building trees is LONG BRANCH ATTRACTION: lineages that have experienced higher substitution rates are grouped together in the reconstructed history, even though each of the fast-changing lineages might be more closely related to a line of descent with a slower substitution rate. Although this problem has been extensively discussed in the literature, there is no easy solution to it. The situation is compounded by the fact that the long branches often look shorter in the reconstructed phylogenies.
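The effect is easy to reproduce in a small simulation. The following Python sketch makes several assumptions of its own: four sequences, a true tree ((A,B),(C,D)) in which A and C are the long branches, a simple substitution process in which a changed site takes any of the three other nucleotides with equal probability, and arbitrary change probabilities. For four sequences, parsimony prefers the grouping that is supported by the largest number of informative site patterns, so counting the patterns is enough.

```python
import random

STATES = "ACGT"

def evolve(state, p_change, rng):
    """With probability p_change, replace the state by one of the three others."""
    if rng.random() < p_change:
        return rng.choice([s for s in STATES if s != state])
    return state

def simulate_pattern_counts(p_long, p_short, n_sites, seed=0):
    """Simulate sites on the true tree ((A,B),(C,D)); A and C get p_long."""
    rng = random.Random(seed)
    counts = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for _ in range(n_sites):
        x = rng.choice(STATES)        # ancestor of A and B
        y = evolve(x, p_short, rng)   # ancestor of C and D (short internal branch)
        a = evolve(x, p_long, rng)
        b = evolve(x, p_short, rng)
        c = evolve(y, p_long, rng)
        d = evolve(y, p_short, rng)
        # Only patterns with two states, each present twice, are informative.
        if a == b and c == d and a != c:
            counts["AB|CD"] += 1
        elif a == c and b == d and a != b:
            counts["AC|BD"] += 1
        elif a == d and b == c and a != b:
            counts["AD|BC"] += 1
    return counts

# Assumed change probabilities per branch, chosen only for illustration.
unequal = simulate_pattern_counts(p_long=0.6, p_short=0.05, n_sites=5000)
equal = simulate_pattern_counts(p_long=0.05, p_short=0.05, n_sites=5000)
print("unequal rates:", unequal)
print("equal rates:  ", equal)
```

With equal rates most informative sites support the true grouping AB|CD; with two long branches the parallel changes along A and C outnumber the changes on the short internal branch, and the wrong grouping AC|BD collects the most support.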
A similar case, although most algorithms are much less prone to fall victim to it, is long branch attraction where the substitution rate is the same in all lineages, but the long branches are due to the absence of side branches.
Solutions:
Break up long branches by adding additional sequences.
Use algorithms that are less sensitive to long-branch attraction. Parsimony is quite sensitive; maximum likelihood approaches that incorporate among site rate variation (ASRV) into the model usually do well. Using simulated protein sequence evolution (i.e., the true tree is known for these sequences), we found that long-branch attraction in the presence of ASRV can become a problem even when sequences are more than 50% identical. However, ML approaches did reasonably well, depending on the model, down to about 20% sequence identity (details are given in an online poster).
A complicating problem is that some of the algorithms that are less sensitive to long branch attraction were found to make the opposite error, long branch repulsion (Mark Siddall in Cladistics 14, 209-220); however, we did not see any hint of this in our exploration of phylogenetic reconstruction from amino acid sequences (see the above-mentioned poster).
Some literature:
Felsenstein, J. (1978) Cases in which parsimony and compatibility methods will be positively misleading. Syst. Zool. 27: 401-410.
Hendy, M.D., and Penny, D. (1989) A framework for the quantitative study of evolutionary trees. Syst. Zool. 38: 297-309.
Huelsenbeck, J.P. (1995) Performance of phylogenetic methods in simulation. Syst. Biol. 44: 17-48.
Kuhner, M.K., and Felsenstein, J. (1994) A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. 11: 459-468.
Tateno, Y., Takezaki, N., and Nei, M. (1994) Relative efficiencies of the maximum-likelihood, neighbor-joining, and maximum-parsimony methods when substitution rate varies with site. Mol. Biol. Evol. 11: 261-277.
Siddall, M.E. (1998) Success of parsimony in the four-taxon case: long-branch repulsion by likelihood in the Farris zone. Cladistics 14: 209-220.
Bootstrapping and non-informative sites
It is a good idea to use only those positions for phylogenetic analyses for which one is sure that the sites are indeed homologous. However, to obtain a bootstrap value better than 90% for a branch, one needs only three sites that change along this branch. Counter to intuition (at least mine), these bootstrap values are not lowered by adding non-informative sites to the alignment. The following example uses four sequences that have 5 positions supporting the central branch in one orientation and 2 positions each supporting the two alternatives:
spec1  ACGTG AC CG
spec2  ACGTG GG AA
spec3  TGCAC GG CG
spec4  TGCAC AC AA
Regardless of how many non-informative sites are added, the grouping of spec1 with spec2 (and spec3 with spec4) is found in about 80% of the bootstrapped samples.
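The behaviour can be explored with a small resampling sketch in Python. It rests on assumptions that the notes do not spell out: each bootstrapped sample is scored by simply counting the informative columns that support each of the three groupings (which is what parsimony does for four sequences), and ties are split equally among the tied groupings. The function name and its parameters are made up for this illustration.

```python
import random

def bootstrap_support(site_counts, n_noninformative, n_replicates=1000, seed=1):
    """Fraction of bootstrapped samples in which each of three groupings wins.

    site_counts gives the number of informative columns supporting
    grouping 0, 1 and 2; ties are split equally (an assumption).
    """
    rng = random.Random(seed)
    # One entry per alignment column: 0, 1 or 2 for the grouping it
    # supports, None for a non-informative column.
    columns = [g for g, n in enumerate(site_counts) for _ in range(n)]
    columns += [None] * n_noninformative
    support = [0.0, 0.0, 0.0]
    for _ in range(n_replicates):
        tally = [0, 0, 0]
        # Resample as many columns as the alignment has, with replacement.
        for column in rng.choices(columns, k=len(columns)):
            if column is not None:
                tally[column] += 1
        winners = [g for g in range(3) if tally[g] == max(tally)]
        for g in winners:
            support[g] += 1.0 / len(winners)
    return [s / n_replicates for s in support]

# 5 columns support spec1 with spec2, 2 columns each support the alternatives.
for extra in (0, 50, 200):
    fractions = bootstrap_support((5, 2, 2), extra)
    print(f"+{extra} non-informative sites:", [round(f, 3) for f in fractions])

# Three supporting columns and no conflicting ones.
print("3 supporting sites:", round(bootstrap_support((3, 0, 0), 50)[0], 3))
```

The last call illustrates the claim about three sites: the branch is lost only when none of the three supporting columns happens to be drawn, which becomes no more likely when non-informative columns are added.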
Example of sequence file with added non-informative sites:
spec1  AAGCAGCTGT AAGCAGCTGT AAGCAGCTGT AAGCAGCTGT AAGCAGCTGT ACGTG ACCG
spec2  AACCGCCTGT AACCGCCTGT AACCGCCTGT AACCGCCTGT AACCGCCTGT ACGTG GGAA
spec3  AACGACCTCT AACGACCTCT AACGACCTCT AACGACCTCT AACGACCTCT TGCAC GGCG
spec4  ATCCACCAGA ATCCACCAGA ATCCACCAGA ATCCACCAGA ATCCACCAGA TGCAC ACAA
Table of bootstrap support (in %).
Spec1 with 2: 5 informative sites support this branch. Spec1 with 3 or 1 with 4: 2 informative sites each.
Means are means of 3 replicates with 100 bootstrapped samples each.

Non-informative sites added | Grouping                 | Individual values                      | Mean  | Standard deviation
+0                          | Spec1 with 2             | 82.5, 77.83, 84.17                     | 81.50 | 3.29
+0                          | Spec1 with 3 or 1 with 4 | 12.17, 3.67, 11, 6.5, 11.83, 10.33     | 9.25  | 3.41
+50                         | Spec1 with 2             | 77.00, 77.00, 78.83                    | 77.61 | 1.06
+50                         | Spec1 with 3 or 1 with 4 | 13.00, 10.00, 14.00, 9.00, 11.33, 9.83 | 11.19 | 1.96
+200                        | Spec1 with 2             | 78.83, 78.33, 76.50                    | 77.89 | 1.23
+200                        | Spec1 with 3 or 1 with 4 | 14.33, 6.83, 12.83, 8.83, 14.50, 9.00  | 11.05 | 3.25
+2000                       | Spec1 with 2             | 79.67, 83.33, 81.33                    | 81.44 | 1.83
+2000                       | Spec1 with 3 or 1 with 4 | 10.67, 9.67, 10.33, 6.33, 10.83, 7.83  | 9.28  | 1.81
X-Windows allows you to use a graphical user interface on a remote computer. For everyday operations it is usually not worth the trouble; however, for running alignment programs it is difficult to do without. See the introduction to X-Windows page.
seaview is an X-Windows program that allows the editing of sequence alignments (colorful and easy to use); a PC version is available as well. The program reads a variety of formats; I encountered the fewest problems with the interleaved PHYLIP format generated by clustalw. Gaps are inserted with the spacebar; for other options see the help menus.
To start seaview:
start the X-Windows server,
telnet to the UNIX machine,
redirect the display to your PC or Mac,
cd (change directory) to the directory where your sequences are,
type seaview, hit return.
phylo_win is an interactive program for phylogenetic reconstruction. The program allows many more different approaches than clustalw.
Galtier, N., Gouy, M., and Gautier, C. (1996) SEAVIEW and PHYLO_WIN: two graphic tools for sequence alignment and molecular phylogeny. CABIOS 12: 543-548.
The actual algorithms used for phylogenetic reconstruction are:
the parsimony programs DNAPARS and PROTPARS (adapted from Joe Felsenstein's PHYLIP),
distance matrix analyses allowing the use of many different ways to calculate distances (including logDet distances, which are less sensitive to nucleotide composition), and
fastDNAml (code from Gary Olsen).
Positive features of phylo_win:
Species and positions can be interactively selected and evaluated.
The program has help facilities. Use them if you are not sure what the different options mean. Figure 11 in chapter 11 of Hillis, Moritz, and Mable, Molecular Systematics, Second Edition, gives an overview of different frequently used substitution models.
The tree display section of the program is positively stunning (at least to me). You can root, rearrange, and display additional information at the click of the mouse. You can output your trees in treefile format, which is read by TREEVIEW, or as a PostScript file, which can be printed on a PostScript printer or read into Adobe's Acrobat Distiller.
The search algorithm used for maximum likelihood is quite good (see options), and the program is much faster than DNAml.
Drawbacks:
no maximum likelihood for protein sequences,
PROTPARS is very slow,
the implemented heuristic searches for most parsimonious trees fail when large data sets (more than 10 sequences) are analyzed (use PAUP instead), and
no programs are included that take among site rate variation into consideration.
PHYLO_WIN and SEAVIEW were developed by Nicolas
Galtier, M. Gouy, and C. Gautier at the
Laboratoire de Biometrie, Genetique et Biologie des Populations
UMR 5558. Universite C. Bernard. Lyon1.
42, Bd du 11 novembre 1918
69622 VILLEURBANNE
France.
Both programs can be obtained for many different platforms via anonymous FTP from ftp://biom3.univ-lyon1.fr/pub/mol_phylogeny/
The display on PCs can be problematic. As I usually use the same terminal to log in, I added flags to my profile. For example, I added the line
alias phywi='phylo_win -seqfontsize 12 -systemfont helvetica,12,b -bigfont times,12'
to the file .profile in my home directory. Every time I type phywi, phylo_win is invoked with these font sizes. You could also add a statement
alias sv='seaview -reducefonts'