Class 7: PUZZLE

MCB 372 - CLASS 7

PUZZLE

The reconstruction of phylogenetic trees from molecular sequences requires that you assume a model that describes the evolutionary process. Often these assumptions are not clearly spelled out, and some claim that parsimony analysis does not assume a model at all: it just searches for the tree that explains the data with the fewest substitutions. However, an alternative view is that parsimony corresponds to a model in which all substitutions are equally likely. One of the major problems, especially if one wants to calibrate the data with respect to time, is the correction for multiple substitutions. The situation is complicated by two factors:

1. different sites along a sequence experience substitutions with different frequency

2. if a replacement occurs, the different types of replacements occur with different probabilities

Both of these considerations are valid for amino acid and nucleotide sequences. Taking both of these processes into account greatly improves the validity of the obtained trees. Two approaches have been used to address problem 1: either assign different weights to different positions a priori (e.g. first, second, and third codon positions, or stem versus loop regions in rRNA), or have the program decide which distribution of among-site rate variation (ASRV) is present in the data and make the appropriate corrections for multiple substitutions.

The so-called gamma distribution has become very popular for this purpose. A good and readable overview was published by Z. Yang (1996): The among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11, 367-372.

The Γ-distribution is useful because a single parameter (the shape parameter α) continuously alters the character of the distribution. With α = ∞ all sites change at the same rate; an extreme ASRV, where only a few sites vary and the majority of sites are invariant or change only very slowly, is obtained with α approaching 0. In between are cases that resemble exponential (α = 1), Poisson (α ≈ 2), and normal distributions (α > 10). (see here for graphs)
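To get a feeling for how the shape parameter changes the distribution, here is a minimal Python sketch. It assumes the common convention that the rates are scaled to a mean of 1 (shape α, scale 1/α); that scaling and the function names are assumptions made for illustration, not something taken from PUZZLE or PAML.

```python
import math

def gamma_rate_density(rate, alpha):
    """Density of a gamma distribution with shape alpha and mean 1.

    Assumption: scale = 1/alpha, so the mean rate is 1 and the
    variance is 1/alpha.
    """
    log_density = (alpha * math.log(alpha)
                   + (alpha - 1.0) * math.log(rate)
                   - alpha * rate
                   - math.lgamma(alpha))
    return math.exp(log_density)

def rate_variance(alpha):
    """Variance of the site rates under the mean-1 scaling."""
    return 1.0 / alpha

if __name__ == "__main__":
    rates = (0.05, 0.5, 1.0, 2.0)
    print("alpha   variance   density at rates", rates)
    for alpha in (0.2, 1.0, 2.0, 10.0, 100.0):
        densities = "  ".join(f"{gamma_rate_density(r, alpha):8.4f}" for r in rates)
        print(f"{alpha:5.1f}   {rate_variance(alpha):8.3f}   {densities}")
```

With a small α most of the density sits near rate 0 (many nearly invariant sites), whereas with a large α the density piles up around rate 1 (all sites change at about the same rate).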

There are several program packages that utilize the gamma distribution, e.g. PAML from Z. Yang (Yang, Z. 1996a. Phylogenetic analysis by maximum likelihood (PAML), Version 1.2. Department of Integrative Biology, UC Berkeley, CA), PAUP from D. Swofford (test version distributed through the author, Smithsonian Institution, Laboratory of Molecular Systematics, MRC-534, Washington DC 20560), and Strimmer and von Haeseler's PUZZLE (Strimmer, K., von Haeseler, A. (1996) Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol Biol Evol 13, 964-969). Version 4.2 is available through the authors at http://www.tree-puzzle.de/ .

The programs we will use in this course are PAUP and PUZZLE.

PAUP is a great program if you use nucleotide sequences. The second part of this course will be devoted to PAUP. However, for proteins it leaves you little choice but to use the encoding nucleotide sequences. As we have shown (Lei, Holsinger, Gogarten, in prep.), recoverable phylogenetic information decays more slowly in protein sequences in the presence of ASRV. At present, in many instances nucleotide sequences are not a viable alternative to studying protein sequences directly.

PUZZLE is a very versatile maximum likelihood program that is particularly useful for analyzing protein sequences. The program was developed and is maintained by Korbinian Strimmer and Arnd von Haeseler (Univ. of Munich).

PUZZLE offers several advantages:
It allows fast and accurate estimation of ASRV (through estimating the shape parameter alpha) for both nucleotide and amino acid sequences.
It has a fast algorithm to determine trees through quartet puzzling (calculating ml trees for quartets of species and building the multispecies tree from the quartets).
The program provides confidence numbers which tend to be smaller than bootstrap values (i.e. provide a more conservative estimate).
The program calculates branch lengths and likelihoods for user-defined trees, which is great if you want to compare different tree topologies, or different models, using the maximum likelihood ratio test.
Branches which are not significantly supported are collapsed.
PUZZLE runs on "all" platforms
PUZZLE reads PHYLIP format
Don't leave home without it

Drawbacks:
The more species you add the lower the support for individual branches. While this is true for all algorithms, in PUZZLE this can lead to completely unresolved trees.
The determined multi-species tree is not the tree with the highest likelihood; rather, it is the tree whose topology is supported through ml-quartets, and the lengths of the resolved branches are determined through maximum likelihood.

Trees calculated with PUZZLE

Finding ML trees can be very time consuming; finding parsimony trees is blazingly fast by comparison. PUZZLE does not include a search algorithm for the tree with the highest likelihood. If you want to find this tree you need to use either Adachi/Hasegawa's PROTML or Yang's PAML. PROTML does not incorporate ASRV, but it is quite user friendly. PAML incorporates ASRV and is run via a control file. Both programs use star decomposition for heuristic searches. You can also provide a partially unresolved user-tree and let the program determine the most likely resolution of the unresolved nodes (again using star decomposition).
PUZZLE uses a different approach. All different quartets of species (four sequences each) are calculated (the more the better) using maximum likelihood according to the model you select. Starting with a randomly chosen quartet, species are added one at a time, selecting the position of the added species so that it provides least conflict with all the quartets calculated.
E.g.: (a,b),(c,d) is the starting quartet. Sequence e is to be added.
The quartet (e,b),(c,d) adds a penalty to placing e with either c or d, whereas placement on the central branch, or with a or b, is compatible.
The quartet (e,a),(b,c) adds a penalty to placing e with either b or c or on the central branch, whereas placement with a or d is compatible. The algorithm also keeps track of partially resolved quartets, e.g. if according to the ml analyses a can go with either c or d but not with b.
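The bookkeeping in this example can be sketched in a few lines of Python. This is a simplified illustration, not PUZZLE's code: it assumes that a quartet (e,p),(j,k) adds one penalty point to every branch on the path between j and k (which reproduces both cases above), and it ignores partially resolved quartets. The node names x and y and all function names are made up for the sketch.

```python
# Starting quartet (a,b),(c,d) as an unrooted tree; x and y are internal nodes.
BRANCHES = {
    "a": ("a", "x"),
    "b": ("b", "x"),
    "central": ("x", "y"),
    "c": ("c", "y"),
    "d": ("d", "y"),
}

def branches_on_path(branches, start, goal):
    """Names of the branches on the unique path between two leaves."""
    neighbours = {}
    for name, (u, v) in branches.items():
        neighbours.setdefault(u, []).append((v, name))
        neighbours.setdefault(v, []).append((u, name))
    stack = [(start, None, [])]  # (node, node we came from, branches used)
    while stack:
        node, previous, used = stack.pop()
        if node == goal:
            return used
        for nxt, name in neighbours[node]:
            if nxt != previous:  # a tree has no cycles, so this is enough
                stack.append((nxt, node, used + [name]))
    raise ValueError(f"no path between {start} and {goal}")

def branch_penalties(branches, quartets):
    """Sum the penalties for attaching the new sequence to each branch.

    Each quartet is (partner, (j, k)), meaning the ml quartet is
    (new, partner),(j, k); the path between j and k is penalized.
    """
    penalties = {name: 0 for name in branches}
    for _partner, (j, k) in quartets:
        for name in branches_on_path(branches, j, k):
            penalties[name] += 1
    return penalties

# The two quartets from the text: (e,b),(c,d) and (e,a),(b,c).
QUARTETS = [("b", ("c", "d")), ("a", ("b", "c"))]

if __name__ == "__main__":
    penalties = branch_penalties(BRANCHES, QUARTETS)
    for name, penalty in penalties.items():
        print(f"branch {name:8s} penalty {penalty}")
    print("least conflict:", min(penalties, key=penalties.get))
```

With only these two quartets, the branch leading to a is the only one left without a penalty, so e would be attached there.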

This stepwise addition of sequences is repeated many times, and the majority consensus of the resulting trees is displayed. In practice it turns out that the support values given by PUZZLE are more conservative than bootstrap values.

Maximum likelihood mapping

An often-encountered problem in inspecting trees is the assessment of support for different groupings. E.g. does Giardia form the deepest branch within the known eukaryotes? Maximum likelihood mapping offers a graphic approach to this problem.

You can generate an ml-map for a single branch, or for the complete dataset. The principle is that for each possible quartet, the probabilities for the three tree types are plotted in a simplex: Pᵢ = Lᵢ/(L₁+L₂+L₃), where Lᵢ is the likelihood of tree i and Pᵢ is a (kind of) posterior probability. Note that P₁+P₂+P₃ = 1. For more information check out the literature posted here. If you generate a plot for the whole tree, you learn how many quartets are resolved with confidence; if you plot only a single branch, you learn how many quartets support each of the possible orientations of the branch.
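As a rough illustration of the formula, the following Python sketch turns three log-likelihoods into the three probabilities and into a point in a triangle. The log-likelihood values, the corner layout of the triangle and the function names are assumptions made for this example; PUZZLE does all of this internally.

```python
import math

def quartet_weights(log_likelihoods):
    """Return [P1, P2, P3] with Pi = Li / (L1 + L2 + L3).

    The input holds log Li; subtracting the largest value first avoids
    numerical underflow when the likelihoods are tiny.
    """
    if len(log_likelihoods) != 3:
        raise ValueError("a quartet has exactly three tree topologies")
    largest = max(log_likelihoods)
    scaled = [math.exp(value - largest) for value in log_likelihoods]
    total = sum(scaled)
    return [value / total for value in scaled]

def simplex_point(weights):
    """Map (P1, P2, P3) to x, y inside an equilateral triangle.

    Assumed layout: tree 1 at (0, 0), tree 2 at (1, 0) and
    tree 3 at (0.5, sqrt(3)/2).
    """
    _p1, p2, p3 = weights
    return p2 + 0.5 * p3, (math.sqrt(3.0) / 2.0) * p3

if __name__ == "__main__":
    # Hypothetical log-likelihoods for the three topologies of one quartet.
    examples = {
        "well resolved": [-1200.0, -1210.0, -1210.0],
        "star-like": [-1200.0, -1200.0, -1200.0],
    }
    for label, log_likelihoods in examples.items():
        weights = quartet_weights(log_likelihoods)
        x, y = simplex_point(weights)
        print(label, [round(w, 4) for w in weights], round(x, 3), round(y, 3))
```

A well-resolved quartet lands close to one corner of the triangle, while a quartet with three equally good topologies lands in the center.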

E.g. if one wants to know if Giardia lamblia is the deepest branch within the eukaryotes, one can choose the "higher" eukaryotes as cluster a, another deep branching eukaryote (one that competes against Giardia) as cluster b, Giardia as cluster c, and the outgroup as cluster d. For an easy example output see this sample ml-map. A more complicated result from the analysis of carbamoyl phosphate synthetase domains is here.

Maximum likelihood ratio test

If you want to compare two models of evolution (this includes the tree) given a certain data set, you can utilize the so-called maximum likelihood ratio test. If L₁ and L₂ are the likelihoods of the two models, d = 2(log L₁ - log L₂) approximately follows a chi-square distribution with n degrees of freedom. Usually n is the difference in the number of model parameters (i.e., how many parameters are used to describe the substitution process). In particular, n can be the difference in the number of branches between two trees (one tree is more resolved than the other). In principle, this test can only be applied if one model is a more refined version of the other. However, if you compare two completely resolved trees that differ only in a single branch, you can, following a suggestion by Joe Felsenstein, use one degree of freedom.
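The test is easy to carry out by hand with a chi-square table, but a small Python sketch may help. It uses only the standard library and a closed-form recurrence for the chi-square tail probability that is valid for whole-number degrees of freedom; the function names and the log-likelihood values in the example are hypothetical.

```python
import math

def chi_square_tail(statistic, degrees_of_freedom):
    """P(X >= statistic) for a chi-square variable with integer df."""
    if degrees_of_freedom < 1 or int(degrees_of_freedom) != degrees_of_freedom:
        raise ValueError("degrees_of_freedom must be a positive integer")
    if statistic <= 0.0:
        return 1.0
    y = statistic / 2.0
    if degrees_of_freedom % 2 == 1:
        a = 0.5
        tail = math.erfc(math.sqrt(y))  # exact tail for df = 1
    else:
        a = 1.0
        tail = math.exp(-y)  # exact tail for df = 2
    # Step up two degrees of freedom at a time:
    # Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < degrees_of_freedom / 2.0:
        tail += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return min(1.0, tail)

def likelihood_ratio_test(log_l1, log_l2, degrees_of_freedom):
    """Return (d, p) with d = 2 * (log L1 - log L2).

    log_l1 belongs to the more refined (parameter-rich) model.
    """
    d = 2.0 * (log_l1 - log_l2)
    return d, chi_square_tail(d, degrees_of_freedom)

if __name__ == "__main__":
    # Hypothetical log-likelihoods of two trees differing in one branch.
    d, p = likelihood_ratio_test(-1000.0, -1003.0, 1)
    print(f"d = {d:.2f}, p = {p:.4f}")
```

A small p means that the improvement in likelihood is larger than you would expect from the extra parameters alone.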

General Remarks:

If no file called infile is present in the program directory, PUZZLE will prompt you for the infile. If you work a lot on the same dataset, it may be worth renaming your data set to infile. The output is always put into the files outfile, outtree and outdist.

You have to rename these files as soon as you are done. The next time PUZZLE starts, the files get erased (this is particularly sad if your run took a couple of days).
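If you tend to forget the renaming, a small helper script can do it for you. The sketch below is one possible way, assuming the three output files sit in the directory you ran PUZZLE in; the function name and the "label.name" naming scheme are made up for this example.

```python
import shutil
from pathlib import Path

# Output file names as written by PUZZLE (see above).
OUTPUT_NAMES = ("outfile", "outtree", "outdist")

def keep_puzzle_output(run_directory, label):
    """Rename outfile, outtree and outdist to '<label>.<name>'.

    Files that do not exist are skipped; an existing target is never
    overwritten. Returns the list of new paths.
    """
    run_directory = Path(run_directory)
    moved = []
    for name in OUTPUT_NAMES:
        source = run_directory / name
        target = run_directory / f"{label}.{name}"
        if not source.exists():
            continue
        if target.exists():
            raise FileExistsError(f"refusing to overwrite {target}")
        shutil.move(str(source), str(target))
        moved.append(target)
    return moved

if __name__ == "__main__":
    # Hypothetical label; use something that identifies data set and model.
    for path in keep_puzzle_output(".", "atpase_run1"):
        print("saved", path)
```

Run it right after PUZZLE has finished, before you start the next analysis in the same directory.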