MCB 372 - CLASS 7
PUZZLE
The reconstruction of phylogenetic trees from molecular sequences requires
that you assume a model of the evolutionary process. Often these assumptions
are not clearly spelled out, and some claim that parsimony analysis does not
assume a model at all: it just searches for the tree that explains the data
with the fewest substitutions. However, an alternative view is that parsimony
corresponds to a model in which all substitutions are equally likely. One of
the major problems, especially if one wants to calibrate the data with respect
to time, is the correction for multiple substitutions.
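To make the correction for multiple substitutions concrete, here is a minimal sketch in Python. It assumes the Jukes-Cantor model (all nucleotide substitutions equally likely, all sites changing at the same rate), which these notes do not prescribe; the function name is made up for illustration.

```python
import math


def jukes_cantor_distance(p):
    """Estimated substitutions per site, given the observed fraction p of
    sites that differ between two aligned nucleotide sequences.

    Assumption: Jukes-Cantor model (equal rates for all substitution types,
    no among-site rate variation).
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) under the Jukes-Cantor model")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # The gap between observed and corrected values widens as p grows,
    # because more substitutions are hidden by later ones at the same site.
    for p in (0.05, 0.20, 0.40, 0.60):
        print(f"observed p = {p:.2f}   corrected d = {jukes_cantor_distance(p):.3f}")
```

The corrected distance is always at least as large as the observed difference, and the correction grows quickly once sequences are strongly diverged.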
The situation is complicated by two factors:
1. different sites along a sequence experience substitutions with different frequency
2. if a replacement occurs, the different types of replacements occur with different probabilities
Both of these considerations are valid for amino acid and nucleotide
sequences. Taking both of these processes into account greatly improves the
validity of the obtained trees. Two approaches have been used to address
problem 1. One is to assign different weights to different positions a priori
(e.g. first, second, third codon position, or stem versus loop regions in
rRNA). The other is to have the program decide which distribution of
among-site rate variation (ASRV) is present in the data and make the
appropriate corrections for multiple substitutions.
The so-called gamma distribution has become very popular for this purpose. A
good and readable overview was published by Z. Yang (1996): Among-site rate
variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11,
367-372.
The Γ-distribution is useful because a single parameter (the shape
parameter α) continuously alters the character of the distribution. With
α = ∞ all sites change at the same rate; an extreme ASRV where only a few
sites vary and the majority of sites are invariant or change only very slowly
is obtained with α approaching 0. In between are cases that resemble
exponential (α = 1), Poisson (α ≈ 2), and normal distributions (α > 10). (see
here for graphs)
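How the shape parameter changes the character of the distribution can be explored with a small simulation. This sketch assumes that site rates are scaled to a mean of 1 (a common convention that is not stated above); the function names and the threshold of 0.1 for a "nearly invariant" site are arbitrary choices for illustration.

```python
import random


def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative substitution rates for n_sites sites from a gamma
    distribution with shape alpha and mean 1 (assumed scaling)."""
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale); scale = 1/alpha gives mean 1.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


def fraction_slow(rates, threshold=0.1):
    """Fraction of sites changing at less than `threshold` times the mean rate."""
    return sum(r < threshold for r in rates) / len(rates)


if __name__ == "__main__":
    for alpha in (0.2, 1.0, 2.0, 10.0):
        rates = sample_site_rates(alpha, 20000)
        mean = sum(rates) / len(rates)
        print(f"alpha = {alpha:5.1f}   mean rate = {mean:.2f}   "
              f"nearly invariant sites = {fraction_slow(rates):.1%}")
```

With a small shape parameter a large share of the sites barely changes while a few sites change very fast; with a large shape parameter almost all sites have rates close to the mean.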
There are several program packages that utilize the gamma distribution, e.g. PAML from Z. Yang (Yang, Z. 1996a. Phylogenetic analysis by maximum likelihood (PAML), Version 1.2. Department of Integrative Biology, UC Berkeley, CA), PAUP from D. Swofford (test version distributed through the author, Smithsonian Institution, Laboratory of Molecular Systematics, MRC-534, Washington DC 20560), and Strimmer and von Haeseler's PUZZLE (Strimmer, K., von Haeseler, A. (1996) Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol Biol Evol 13, 964-969). Version 4.2 is available through the authors at http://www.tree-puzzle.de/ .
The programs we will use in this course are PAUP and PUZZLE.
PAUP is a great program if you use nucleotide sequences. The second part of
this course will be devoted to PAUP. However, for proteins it leaves you
little choice but to use the encoding nucleotide sequences. As we have shown
(Lei, Holsinger, Gogarten, in prep.), recoverable phylogenetic information
decays more slowly in protein sequences in the presence of ASRV. At present,
in many instances nucleotide sequences are not a viable alternative to
studying protein sequences directly.
PUZZLE is a very versatile maximum likelihood program that is particularly
useful for analyzing protein sequences. The program was developed and is
maintained by Korbinian Strimmer and Arndt von Haeseler (Univ. of Munich).
PUZZLE allows fast and accurate estimation of ASRV (through estimating the shape parameter alpha) for both nucleotide and amino acid sequences.
It has a fast algorithm to determine trees through quartet puzzling (calculating ML trees for quartets of species and building the multispecies tree from the quartets).
The program provides confidence numbers which tend to be smaller than bootstrap values (i.e. provide a more conservative estimate).
The program calculates branch lengths and likelihood for user-defined trees, which is great if you want to compare different tree topologies, or different models, using the maximum likelihood ratio test.
Branches which are not significantly supported are collapsed.
PUZZLE runs on "all" platforms.
PUZZLE reads PHYLIP format.
Don't leave home without it.
Drawbacks:
The more species you add, the lower the support for individual branches. While this is true for all algorithms, in PUZZLE this can lead to completely unresolved trees.
The determined multi-species tree is not the tree with the highest likelihood; rather, it is the tree whose topology is supported through ML quartets, and the lengths of the resolved branches are determined through maximum likelihood.
Trees calculated with PUZZLE
Finding ML trees can be very time consuming; finding parsimony trees is
blazingly fast by comparison. PUZZLE does not include a search algorithm for
the tree with the highest likelihood. If you want to find this tree you need
to use either Adachi/Hasegawa's PROTML or Yang's PAML. PROTML does not
incorporate ASRV, but it is quite user friendly. PAML incorporates ASRV, and
is run via a control file. Both programs use star decomposition for heuristic
searches. You can also provide a partially unresolved user tree and let the
program determine the most likely resolution of the unresolved nodes (again
using star decomposition).
PUZZLE uses a different approach. All different quartets of species (four
sequences each) are evaluated (the more the better) using maximum likelihood
according to the model you select. Starting with a randomly chosen quartet,
species are added one at a time, selecting the position of the added species
so that it provides the least conflict with all the quartets calculated.
E.g.: (a,b),(c,d) is the starting quartet. Sequence e is to be added.
The quartet (e,b),(c,d) adds a penalty to placing e with either c or d,
whereas placement on the central branch or with a or b is compatible.
The quartet (e,a),(b,c) adds a penalty to placing e with either b or c or on
the central branch, whereas placement with a or d receives no penalty. The
algorithm also keeps track of partially resolved quartets, e.g. if according
to the ML analyses a can go with either c or d but not with b.
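The penalty bookkeeping in this example can be sketched in a few lines of Python. This is a simplified illustration, not PUZZLE's code: it handles only fully resolved quartets and only the two quartets named above, whereas the program uses all quartets that contain the new sequence. It applies the rule that a quartet (e,x),(j,k) penalizes every branch on the path between j and k, which reproduces the penalties described above. The internal node names u and v and all function names are hypothetical.

```python
from collections import Counter

# Starting quartet (a,b),(c,d): internal nodes u and v are joined by the
# central branch; each leaf hangs off one of them.
TREE = {
    "a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"],
    "u": ["a", "b", "v"], "v": ["c", "d", "u"],
}
LEAVES = {"a", "b", "c", "d"}


def path_branches(tree, start, goal):
    """Branches (as frozensets of two node names) on the path start -> goal."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return [frozenset(pair) for pair in zip(path, path[1:])]
        for neighbour in tree[node]:
            if neighbour not in path:
                stack.append((neighbour, path + [neighbour]))
    raise ValueError(f"no path between {start} and {goal}")


def branch_label(branch):
    """Name a branch after its leaf, or 'central' for the internal branch."""
    leaves = [node for node in branch if node in LEAVES]
    return leaves[0] if leaves else "central"


def penalty_table(tree, quartets):
    """quartets: list of (x, (j, k)), meaning the ML quartet is (e,x),(j,k).

    Returns {branch label: penalty} for inserting the new sequence e.
    """
    score = Counter()
    for _partner, (j, k) in quartets:
        # e is separated from j and k, so it should not sit between them.
        for branch in path_branches(tree, j, k):
            score[branch] += 1
    branches = {frozenset((n, m)) for n, nbrs in tree.items() for m in nbrs}
    return {branch_label(br): score[br] for br in branches}


if __name__ == "__main__":
    # (e,b),(c,d) and (e,a),(b,c) from the example in the text
    table = penalty_table(TREE, [("b", ("c", "d")), ("a", ("b", "c"))])
    for name in sorted(table):
        print(f"branch {name:8s} penalty {table[name]}")
    print("e is attached to branch:", min(table, key=table.get))
```

After both quartets have been counted, only the branch leading to a carries no penalty, so e would be joined to a in this toy case.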
This puzzling step is repeated many times, and the majority consensus of the
different trees is displayed. In practice it turns out that the support values
given by PUZZLE are more conservative than bootstrap values.
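The idea of a majority consensus with support values can be illustrated as follows. This is a sketch under simplifying assumptions, not PUZZLE's implementation: each tree is given as the set of its internal splits, each split is written as the group of taxa on the side that does not contain taxon a, and the three example trees are invented.

```python
from collections import Counter


def split(*taxa):
    """One internal branch, written as the taxa on the side without 'a'."""
    return frozenset(taxa)


def split_support(trees):
    """Fraction of trees in which each split occurs."""
    counts = Counter(s for tree in trees for s in tree)
    return {s: counts[s] / len(trees) for s in counts}


def majority_consensus(trees, threshold=0.5):
    """Splits found in more than `threshold` of the trees, with their support."""
    return {s: f for s, f in split_support(trees).items() if f > threshold}


if __name__ == "__main__":
    # Three invented five-taxon trees from three rounds of tree building.
    trees = [
        {split("d", "e"), split("c", "d", "e")},   # ((a,b),c,(d,e))
        {split("d", "e"), split("c", "d", "e")},   # ((a,b),c,(d,e))
        {split("d", "e"), split("b", "d", "e")},   # ((a,c),b,(d,e))
    ]
    for s, f in sorted(majority_consensus(trees).items(), key=lambda kv: -kv[1]):
        print(f"{','.join(sorted(s)):8s} support {f:.0%}")
```

Splits that fall below the threshold are simply left out, which corresponds to collapsing the branch in the displayed tree.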
Maximum likelihood mapping
An often-encountered problem in inspecting trees is the assessment of support
for different groupings, e.g. does Giardia form the deepest branch within the
known eukaryotes? Maximum likelihood mapping offers a graphic approach to this
problem.
You can generate an ML map for a single branch, or for the complete dataset.
The principle is that for each possible quartet, the probabilities for the
three tree types are plotted in a simplex (Pi = Li/(L1+L2+L3)). Note that
P1+P2+P3 = 1; Pi is the (kind of) posterior probability and Li is the
likelihood of tree i. For more information check out the literature posted
here. If you generate a plot for the whole tree, you learn how many quartets
are resolved with confidence; if you plot only a single branch, you learn how
many quartets support each of the possible orientations of the branch.
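The calculation behind a single point in such a plot can be sketched as follows. The input is the three log-likelihoods of one quartet; the example values, the function names, and the placement of the triangle corners are assumptions made for illustration.

```python
import math


def posterior_weights(log_likelihoods):
    """Turn three quartet log-likelihoods into Pi = Li / (L1 + L2 + L3)."""
    if len(log_likelihoods) != 3:
        raise ValueError("a quartet has exactly three tree topologies")
    # Subtract the maximum first so that exp() does not underflow to zero.
    top = max(log_likelihoods)
    scaled = [math.exp(x - top) for x in log_likelihoods]
    total = sum(scaled)
    return [s / total for s in scaled]


def simplex_point(p1, p2, p3):
    """Map (P1, P2, P3) to x, y in an equilateral triangle with unit sides.

    Assumed layout: tree 1 at (0, 0), tree 2 at (1, 0),
    tree 3 at (0.5, sqrt(3)/2).
    """
    return p2 + 0.5 * p3, (math.sqrt(3) / 2.0) * p3


if __name__ == "__main__":
    # Hypothetical log-likelihoods for the three topologies of one quartet.
    weights = posterior_weights([-2501.3, -2504.9, -2505.6])
    x, y = simplex_point(*weights)
    print("P1, P2, P3 =", ", ".join(f"{w:.3f}" for w in weights))
    print(f"position in the triangle: x = {x:.3f}, y = {y:.3f}")
```

A quartet that clearly favours one topology lands near a corner, while a quartet with three nearly equal likelihoods lands near the centre of the triangle.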
E.g. if one wants to know if Giardia lamblia is the deepest branch within the
eukaryotes, one can choose the "higher" eukaryotes as cluster a, another
deep-branching eukaryote (one that competes against Giardia) as cluster b,
Giardia as cluster c, and the outgroup as cluster d. For an easy example
output see this sample ml-map. A more complicated result from the analysis of
carbamoyl phosphate synthetase domains is here.
Maximum likelihood ratio test
If you want to compare two models of evolution (this includes the tree) given
a certain data set, you can utilize the so-called maximum likelihood ratio
test. If L1 and L2 are the likelihoods of the two models,
δ = 2(logL1 - logL2) approximately follows a chi-square distribution with n
degrees of freedom. Usually n is the difference in the number of model
parameters (i.e., how many parameters are used to describe the substitution
process). In particular, n can be the difference in the number of branches
between two trees (one tree is more resolved than the other). In principle,
this test can only be applied if one model is a more refined version of the
other. However, if you compare two completely resolved trees with each other
that differ only in a single branch, you can, following a suggestion by Joe
Felsenstein, use one degree of freedom.
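A small worked example of this test, using only the Python standard library. The log-likelihood values are invented, the function names are hypothetical, and the chi-square tail probability is computed from a textbook recurrence rather than taken from any phylogenetics package.

```python
import math


def chi2_sf(x, df):
    """P(X >= x) for a chi-square variable with a positive integer df."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for 1 and 2 degrees of freedom ...
    if df % 2 == 1:
        k, q = 1, math.erfc(math.sqrt(half))
    else:
        k, q = 2, math.exp(-half)
    # ... then step up two degrees of freedom at a time:
    # Q(k+2) = Q(k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
    while k < df:
        q += math.exp((k / 2.0) * math.log(half) - half - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return q


def likelihood_ratio_test(lnl_general, lnl_restricted, df):
    """Return the statistic 2(logL1 - logL2) and its approximate p-value.

    lnl_general belongs to the more refined model, lnl_restricted to the
    simpler one; df is the difference in the number of free parameters.
    """
    delta = 2.0 * (lnl_general - lnl_restricted)
    return delta, chi2_sf(delta, df)


if __name__ == "__main__":
    # Hypothetical values: the more refined model has one extra parameter.
    delta, p = likelihood_ratio_test(-1000.0, -1003.0, df=1)
    print(f"statistic = {delta:.2f}, p = {p:.4f}")
```

A small p-value indicates that the simpler model fits significantly worse than the more refined one.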
General Remarks:
If no file called infile is present in the program directory, PUZZLE will
prompt you for the input file. If you work a lot on the same dataset, it might
be worth renaming your data set to infile. The output is always put into the
files outfile, outtree and outdist.
You have to rename these files as soon as you are done. The next time PUZZLE
starts, the files get erased (this is particularly sad if your run took a
couple of days).
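One way to make the renaming a habit is a small helper that is run right after each analysis. This is only a sketch: the function name and the naming scheme (a label placed in front of the original file name) are arbitrary choices, and the demonstration works on dummy files in a temporary directory rather than on real program output.

```python
import tempfile
from pathlib import Path

OUTPUT_NAMES = ("outfile", "outtree", "outdist")


def keep_outputs(run_dir, label):
    """Rename outfile, outtree and outdist in run_dir to <label>.<name>.

    Missing files are skipped; an existing target is never overwritten.
    Returns the list of new paths.
    """
    run_dir = Path(run_dir)
    renamed = []
    for name in OUTPUT_NAMES:
        source = run_dir / name
        if not source.exists():
            continue
        target = run_dir / f"{label}.{name}"
        if target.exists():
            raise FileExistsError(f"{target} already exists, choose another label")
        source.rename(target)
        renamed.append(target)
    return renamed


if __name__ == "__main__":
    # Demonstration with dummy files in a throwaway directory.
    with tempfile.TemporaryDirectory() as workdir:
        for name in OUTPUT_NAMES:
            (Path(workdir) / name).write_text("dummy output\n")
        for path in keep_outputs(workdir, "atpase_run1"):
            print("saved", path.name)
```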