TREE RECONSTRUCTION ARTIFACTS and other BACKGROUND information Long Branch Attraction. In analyzing divergent sequences one frequently observed problem is that of long-branch attraction. Even though the long branches might be closer related to short branches, many algorithms will group the long branches together. Although this problem has been extensively discussed in the literature, there is no easy solution to this problem. The situation is compounded by the fact that the long branches often look shorter in the reconstructed phylogenies.A similar case, although most algorithms are much less prone fall victim to this problem, is long branch attraction in cases where the substitution rate is the same in all lineages, but the long branches are due to the absence of side branches. (Examples for long-branch attraction were included as testseq1b and c) Break up long branches by adding additional sequences.Use algorithms that are less sensitive to long-branch attraction. Parsimony is pretty sensitive, usually maximum likelihood approaches that incorporate among site rate variation (ASRV) into the model are doing pretty good. Distance matrix analyses are somewhat in between (depending on the distance measure). Using simulated protein sequence evolution, i.e. the true tree is known for these sequences, we found (see poster by Olga Zhaxybayeva) that long-branch attraction in the presence of ASRV can become a problem even when sequences are more than 50% identical. However, ML approaches did reasonable well (depending on the model down to about 20% sequence identity), distance matrix analyses were in between. Some literature: Felsenstein,
J. (1978) Cases in which parsimony and compatibility methods will be positively
misleading. Syst. Zool. 27, 401-410 Hendy,
M.D., Penny, D. (1989) A framework for the quantitative study of evolutionary
trees. Syst. Zool. 38, 297-309 Huelsenbeck
P.J. (1995) Performance of phylogenetic methods in simulation. Syst. Biol.
44: 17-48. Kuhner
M.K., and J.Felsenstein (1994) A simulation comparison of phylogeny algorithms
under equal and unequal evolutionary rates. Mol. Biol. Evol. 11: 459-468. Tateno
Y, N. Takezaki, and M. Nei. (1994) Relative efficiencies of the maximum-likelihood,
neighbor-joining, and maximum-parsimony methods when substitution rate varies
with site. Mol. Biol. Evol. 11: 261-277 TREE-PUZZLE The reconstruction of phylogenetic trees from molecular sequences necessitates that you assume a model that describes the evolutionary process. Often these assumptions are not clearly spelled out; and some make the claim that parsimony analyses does not assume a model at all, it just searches for the tree that explains the data with the least number of substitutions. However, an alternative view is that parsimony corresponds to a model in which all substitutions are equally likely. One of the major problems, especially if one wants to calibrate the data with respect to time, is the correction for multiple substitutions. The situation is complicated by two factors:
Both of these considerations are valid for amino acid and nucleotide sequences. Taking both of these processes into account greatly improves the validity of the obtained trees. Two approaches have been used to address problem: Either assign different weight to different positions a priori (e.g. first, second, third codon position, or stem versus loop regions in rRNA. Or have the program decide which distribution of among site rate variation is present in the data and make the appropriate corrections for multiple substitutions. The so called gamma distribution has become very popular for this purpose. A good and readable overview was published by Z. Yang (1996): The among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11, 367-372. The Gamma distribution is useful because a single parameter (the shape parameter alpha) continuously alters the character of the distribution. With a =infinity all sites change at the same rate; an extreme ASRV where only a few sites vary and the majority of sites are invariant or change only very slowly is obtained with a approaching 0. In between are cases that resemble exponential- (a =1),Poisson- (» 2), and normal distributions (a >10). (see here for graphs) There are several program packages that utilize the Gamma distribution. (e.g.: PAML from Z.Yang (Yang, Z 2001. Phylogenetic analysis by maximum likelihood (PAML),Version 3.12. Department of Integrative Biology, UCBerkeley, CA), PAUP from D. Swofford (test version distributed through the author, Smithonian Institution, Laboratory of Molecular Systematics, MRC-534, Washington DC 20560), and Strimmer and von Haseler's TREE-PUZZLE (Strimmer, K., von Haeseler, A. (1996) Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol Biol Evol 13, 964-969. Version 5.0 is available through the authors at http://www.tree-puzzle.de/ . TREE-PUZZLE is a very versatile maximum likelihood program that is particularly useful to analyze protein sequences. The program was developed by Korbian Strimmer and Arnd von Haseler (then at the Univ. of Munich) and is maintained by von Haseler, Heiko A. Schmidt, and Martin Vingron (contacts see http://www.tree-puzzle.de/). TREE-PUZZLE allows
Drawbacks:
Trees calculated with TREE-PUZZLEFinding ML trees can be very time consuming, finding parsimony trees is blazingly fast by comparison. TREE-PUZZLE does not include a search algorithm for the tree with the highest likelihood. If you want to find this tree you need to use either PROML from PHYLIP, Adachi/Hasegawa's PROTML, or Yang's PAML. PROTML does not incorporate ASRV, but it is quite user friendly. PAML incorporates ASRV, and is run via a control file. Both programs use star decomposition for heuristic searches. At present my first choice would be PROML, if one starts a heuristic search from scratch, or PAML, if you know how to constrain the tree. In PAML and PROTML you also can provide a partially unresolved user-tree and let the program determine the most likely resolution of the unresolved nodes (again using star decomposition or exhaustive search). TREE-PUZZLE uses a different approach. All different quartets of species (four sequences each) are calculated (the more the better) using maximum likelihood according to the model you select. Starting with a randomly chosen quartet, species are added one at a time, selecting the position of the added species so that it provides least conflict with all the quartets calculated. E.g: (a,b),(c,d) is the starting quartet. Sequence e is to be added. The quartet (e,b),(c.d) adds penalty to placing e with either c or d, whereas the central branch, and placement with a or b is compatible. The quartet (e,a),(b,c) adds penalty to placing e with either b or c or with the central branch, whereas placement with a or d is compatible. The algorithm also keeps track of partially resolved quartets, e.g. if according to the ml analyses a can go with either c or d but not with B. The adding of quartets is repeated many times, and the majority consensus of the different trees is displayed. In practice it turns out that the support values given by TREE-PUZZLE are more conservative than bootstrap values. Maximum likelihood ratio testIf you want to compare two models of evolution (this includes the tree) given a data set, you can utilize the so-called maximum likelihood ratio test. If L1 and L2 are the likelihoods of the two models, d =2(logL1-logL2) approximately follows a Chi square distribution with n degrees of freedom. Usually n is the difference in model parameters. I.e., how many parameters are used to describe the substitution process and the tree. In particular n can be the difference in branches between two trees (one tree is more resolved than the other). In principle, this test can only be applied if on model is a more refined version of the other. However, if you compare two completely resolved trees with each other that differ only in a single branch, you can, following a suggestion by Joe Felsenstein, use one degree of freedom. If you compare two trees, one calculated without assuming a clock, the other assuming a clock, the degrees of freedom are n-2. To calculate the probability you can use the CHISQUARE calculator for windows available from Paul Lewis. Maximum likelihood mappingAn often-encountered problem in inspecting trees is the assessment of support for different groupings. E.g. does Giardia form the deepest branch within the known eukaryotes. Maximum likelihood mapping offers a graphic approach to this problem. You can generate a ml-map for a single branch, or for the complete dataset. The principle is that for each possible quartet, the probabilities for the three tree types are plotted in a simplex. (Pi= Li/(L1+L2+L3) Note that P1+P2+P3=1; Pi is the (kind of) posterior probability and Li is the likelihood of tree i. For more information see here. If you generate a plot for the whole tree, you learn about how many quartets are resolved with confidence, if you plot only a single branch, you learn how many quartets support each of the possible orientations of the branch. E.g., if one wants to know, if Giardia lamblia is the deepest branch within the eukaryotes, on can choose the "higher" eukaryotes as cluster a, another deep branching eukaryote (one that competes against Giardia) as cluster b, Giardia as cluster c, and the outgroup as cluster d. For an easy example output see this sample ml-map. Application of ML mapping to comparative Genome analysesA recent article on the use of ml mapping in comparative genome analyses is here. General Remarks:If no file called infile is present in the program directory, TREE-PUZZLE will prompt you for the infile. If you work a lot on the same dataset, it might be worth to rename your data set into infile. The output is always put into the files outfile, outtree and outdist. You have to rename these files as soon as you are done. The next time puzzle starts, the files get erased (this is particularly sad if your run took a couple of days). If you start TREE-PUZZLE from the command line (MS-DOS prompt, or c-shell) using the command "puzzle name_of_infile.phy" the program writes the results into a file called name_of_infile.puzzle, name_of_infile.dist, etc.. Assignment #7: |