TREE RECONSTRUCTION ARTIFACTS and other BACKGROUND information

Long Branch Attraction. In analyzing divergent sequences one frequently observed problem is that of long-branch attraction. Even though the long branches might be closer related to short branches, many algorithms will group the long branches together. Although this problem has been extensively discussed in the literature, there is no easy solution to this problem. The situation is compounded by the fact that the long branches often look shorter in the reconstructed phylogenies.

A similar case, although most algorithms are much less prone fall victim to this problem, is long branch attraction in cases where the substitution rate is the same in all lineages, but the long branches are due to the absence of side branches. (Examples for long-branch attraction were included as testseq1b and c)

Break up long branches by adding additional sequences.

Use algorithms that are less sensitive to long-branch attraction.

Parsimony is pretty sensitive, usually maximum likelihood approaches that incorporate among site rate variation (ASRV) into the model are doing pretty good. Distance matrix analyses are somewhat in between (depending on the distance measure). Using simulated protein sequence evolution, i.e. the true tree is known for these sequences, we found (see poster by Olga Zhaxybayeva) that long-branch attraction in the presence of ASRV can become a problem even when sequences are more than 50% identical. However, ML approaches did reasonable well (depending on the model down to about 20% sequence identity), distance matrix analyses were in between.

Some literature:

Felsenstein, J. (1978) Cases in which parsimony and compatibility methods will be positively misleading. Syst. Zool. 27, 401-410

Hendy, M.D., Penny, D. (1989) A framework for the quantitative study of evolutionary trees. Syst. Zool. 38, 297-309

Huelsenbeck P.J. (1995) Performance of phylogenetic methods in simulation. Syst. Biol. 44: 17-48.

Kuhner M.K., and J.Felsenstein (1994) A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. 11: 459-468.

Tateno Y, N. Takezaki, and M. Nei. (1994) Relative efficiencies of the maximum-likelihood, neighbor-joining, and maximum-parsimony methods when substitution rate varies with site. Mol. Biol. Evol. 11: 261-277

TREE-PUZZLE

The reconstruction of phylogenetic trees from molecular sequences necessitates that you assume a model that describes the evolutionary process. Often these assumptions are not clearly spelled out; and some make the claim that parsimony analyses does not assume a model at all, it just searches for the tree that explains the data with the least number of substitutions. However, an alternative view is that parsimony corresponds to a model in which all substitutions are equally likely. One of the major problems, especially if one wants to calibrate the data with respect to time, is the correction for multiple substitutions. The situation is complicated by two factors:

different sites along a sequence experience substitutions with different frequency
if a replacement occurs, the different types of replacements occur with different probabilities

Both of these considerations are valid for amino acid and nucleotide sequences. Taking both of these processes into account greatly improves the validity of the obtained trees. Two approaches have been used to address problem: Either assign different weight to different positions a priori (e.g. first, second, third codon position, or stem versus loop regions in rRNA. Or have the program decide which distribution of among site rate variation is present in the data and make the appropriate corrections for multiple substitutions.

The so called gamma distribution has become very popular for this purpose. A good and readable overview was published by Z. Yang (1996): The among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11, 367-372.

The Gamma distribution is useful because a single parameter (the shape parameter alpha) continuously alters the character of the distribution. With a =infinity all sites change at the same rate; an extreme ASRV where only a few sites vary and the majority of sites are invariant or change only very slowly is obtained with a approaching 0. In between are cases that resemble exponential- (a =1),Poisson- (ť 2), and normal distributions (a >10). (see here for graphs)

There are several program packages that utilize the Gamma distribution. (e.g.: PAML from Z.Yang (Yang, Z 2001. Phylogenetic analysis by maximum likelihood (PAML),Version 3.12. Department of Integrative Biology, UCBerkeley, CA), PAUP from D. Swofford (test version distributed through the author, Smithonian Institution, Laboratory of Molecular Systematics, MRC-534, Washington DC 20560), and Strimmer and von Haseler's TREE-PUZZLE (Strimmer, K., von Haeseler, A. (1996) Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol Biol Evol 13, 964-969. Version 5.0 is available through the authors at http://www.tree-puzzle.de/ .

TREE-PUZZLE is a very versatile maximum likelihood program that is particularly useful to analyze protein sequences. The program was developed by Korbian Strimmer and Arnd von Haseler (then at the Univ. of Munich) and is maintained by von Haseler, Heiko A. Schmidt, and Martin Vingron (contacts see http://www.tree-puzzle.de/).

TREE-PUZZLE allows

fast and accurate estimation of ASRV (through estimating the shape parameter alpha) for both nucleotide and amino acid sequences,
it has a fast algorithm to determine trees through quartet puzzling (calculating ml trees for quartets of species and building the multispecies tree from the quartets).
The program provides confidence numbers which tend to be smaller than bootstrap values (i.e. provide an even more conservative estimate), and
the program calculates branch lengths and likelihood for user defined trees, which is great if you want to compare different tree topologies, or different models using the maximum likelihood ratio test.
Branches which are not significantly supported are collapsed.
TREE-PUZZLE runs on "all" platforms
TREE-PUZZLE reads PHYLIP format, and communicates with hte user in a way similar to the PHYLIP programs.
Don't leave home without it!

Drawbacks:

The more species you add the lower the support for individual branches. While this is true for all algorithms, in TREE-PUZZLE this can lead to completely unresolved trees.
Trees calculated via quartet puzzling are usually not completely resolved, and they do not correspond to the ML-tree.
The determined multi-species tree is not the tree with the highest likelihood, rather it is the tree whose topology is supported through ml-quartets, and the lengths of the resolved branches is determined through maximum likelihood.

Trees calculated with TREE-PUZZLE

Finding ML trees can be very time consuming, finding parsimony trees is blazingly fast by comparison. TREE-PUZZLE does not include a search algorithm for the tree with the highest likelihood. If you want to find this tree you need to use either PROML from PHYLIP, Adachi/Hasegawa's PROTML, or Yang's PAML. PROTML does not incorporate ASRV, but it is quite user friendly. PAML incorporates ASRV, and is run via a control file. Both programs use star decomposition for heuristic searches. At present my first choice would be PROML, if one starts a heuristic search from scratch, or PAML, if you know how to constrain the tree. In PAML and PROTML you also can provide a partially unresolved user-tree and let the program determine the most likely resolution of the unresolved nodes (again using star decomposition or exhaustive search).

TREE-PUZZLE uses a different approach. All different quartets of species (four sequences each) are calculated (the more the better) using maximum likelihood according to the model you select. Starting with a randomly chosen quartet, species are added one at a time, selecting the position of the added species so that it provides least conflict with all the quartets calculated. E.g: (a,b),(c,d) is the starting quartet. Sequence e is to be added. The quartet (e,b),(c.d) adds penalty to placing e with either c or d, whereas the central branch, and placement with a or b is compatible. The quartet (e,a),(b,c) adds penalty to placing e with either b or c or with the central branch, whereas placement with a or d is compatible. The algorithm also keeps track of partially resolved quartets, e.g. if according to the ml analyses a can go with either c or d but not with B.

The adding of quartets is repeated many times, and the majority consensus of the different trees is displayed. In practice it turns out that the support values given by TREE-PUZZLE are more conservative than bootstrap values.

Maximum likelihood ratio test

If you want to compare two models of evolution (this includes the tree) given a data set, you can utilize the so-called maximum likelihood ratio test. If L₁ and L₂ are the likelihoods of the two models, d =2(logL₁-logL₂) approximately follows a Chi square distribution with n degrees of freedom. Usually n is the difference in model parameters. I.e., how many parameters are used to describe the substitution process and the tree. In particular n can be the difference in branches between two trees (one tree is more resolved than the other). In principle, this test can only be applied if on model is a more refined version of the other. However, if you compare two completely resolved trees with each other that differ only in a single branch, you can, following a suggestion by Joe Felsenstein, use one degree of freedom. If you compare two trees, one calculated without assuming a clock, the other assuming a clock, the degrees of freedom are n-2.

To calculate the probability you can use the CHISQUARE calculator for windows available from Paul Lewis.

Maximum likelihood mapping

An often-encountered problem in inspecting trees is the assessment of support for different groupings. E.g. does Giardia form the deepest branch within the known eukaryotes. Maximum likelihood mapping offers a graphic approach to this problem.

Barycentric Coordinates

You can generate a ml-map for a single branch, or for the complete dataset. The principle is that for each possible quartet, the probabilities for the three tree types are plotted in a simplex. (P_i= L_i/(L₁+L₂+L₃) Note that P₁+P₂+P₃=1; P_i is the (kind of) posterior probability and L_i is the likelihood of tree i. For more information see here. If you generate a plot for the whole tree, you learn about how many quartets are resolved with confidence, if you plot only a single branch, you learn how many quartets support each of the possible orientations of the branch.

E.g., if one wants to know, if Giardia lamblia is the deepest branch within the eukaryotes, on can choose the "higher" eukaryotes as cluster a, another deep branching eukaryote (one that competes against Giardia) as cluster b, Giardia as cluster c, and the outgroup as cluster d. For an easy example output see this sample ml-map.

Application of ML mapping to comparative Genome analyses

A recent article on the use of ml mapping in comparative genome analyses is here.

General Remarks:

If no file called infile is present in the program directory, TREE-PUZZLE will prompt you for the infile. If you work a lot on the same dataset, it might be worth to rename your data set into infile. The output is always put into the files outfile, outtree and outdist. You have to rename these files as soon as you are done. The next time puzzle starts, the files get erased (this is particularly sad if your run took a couple of days).

If you start TREE-PUZZLE from the command line (MS-DOS prompt, or c-shell) using the command "puzzle name_of_infile.phy" the program writes the results into a file called name_of_infile.puzzle, name_of_infile.dist, etc..

Assignment #7:

Your name:
Your email address:

GOALS:

Learn how to get your data into TREE-PUZZLE and the results out.
Attain some understanding on what to do with the ml ratio test.
Learn about the different options in TREE-PUZZLE

Task 1 Transfer the file testseq5.phy and testseq5.aln into the directory that contains the executable program for TREE-PUZZLE5.0. This file contains vacuolar/archaeal ATPases from pro- and eukaryotes:

Daucus carota, Arabidopsis thaliana, Gossypium hirsutum are plants;

Acetabularia acetabulum is a green and Cyanidium caldarium is a red algae

Mus, Homo, Bos (mammals) Gallus (bird), Drosophila, Aedes (insects) are animals

Saccharomyces, Candida, Schizosaccharomyces, and Neurospora are fungi

Dictyostelium discoideum, Entamoeba, Plasmodium falciparum, Trypanosoma, Giardia are protists or protozoa

Sulfolobus acidocaldarius (70^oC),Archaeoglobus fulgidus (83^oC), Methanosarcina barkeri (30-37 ^oC), Methanosarcina mazeii (37^oC), Methanococcus jannaschii (80^oC), Haloferax volcanii (37^oC), Halobacterium salinarium (ca37^oC), Methanobacterium thermoautotrophicum (60-65^oC), Desulfurococcus sp. (85-90 ^oC), Thermococcus sp (75+ ^oC) (archaea),

Enterococcus hirae (37^oC), Borrelia burgdorferi (33-37^oC), Thermus thermophilus (70-80^oC), Deinococcus radiodurans (30^oC) (Bacteria) are prokaryotes. (Usually Bacteria have an F- and not an A-ATPase. The bacteria probably obtained the archaeal/vacuolar type ATPase through horizontal gene transfer).

The prokaryotes can be considered as outgroup for the eukaryotes.

Load testeseq5.aln into ClustalX. Create a bootstrapped tree (exclude positions with gaps, correct for multiple substitutions, do not forget to change output options to report bootstrap values at a node, not at a branch).

Design a strategy to use the maximum likelihood mapping to determine, if there is significant support in the data set to conclude that Giardia forms the deepest branch as compared to the other eukaryotes. Think first, then act. What other information do you need to form 4 clusters?

Start TREE-PUZZLE by double clicking on the TREE-PUZZLE icon
Load testseq5.phy (or what ever name you gave the file)
Select type of analyses: maximum likelihood mapping (b)
Select sequences in 4 clusters (g)
Assign sequences to the 4 groups (i.e. the 4 groups attached to the branch you want to study). Ask if you cannot figure it out. What are your groups?

Select number of quartets (n) enter 0 to select all (as two of your clusters contain a single sequence only, this number is quite small)
Select the model of heterogeneity. (Select Gamma distribution with 8 classes, enter the shape parameter as 0.6. DO NOT have the program find the shape parameter (takes about 30 minutes).
Interpret the outfiles.
What is your conclusion? (Which sequence is the deepest branch? Is the support for your finding strong, week, or nonexistent?)

Task 2. Use the maximum likelihood ratio test determine, if the trees stored in intree5.txt are significant improvements over the lesser-resolved one. (Use the VT model and no ASRV).

Start TREE-PUZZLE, read the aligned sequences from testseq5 into the program, toggle option "k" to usertree, enter "Y", enter name for treefile "intree5.txt". (If this takes too long, the outfile is here.)
Inspect the trees in this file using treeview (control copy from the text, control paste into treeview).
Use Paul's Chi square calculator/Chi Square table to determine, if the increase in likelihood (2(logL1-logL2)) is significant.
How many degrees of freedom do you assume for your tests?
What does the result mean?

Task 3 (IF YOU HAVE TIME LEFT):

When you incorporate ASRV into your substitution model, do you get similar results (outfile here)?

Compare the trees obtained with and without ASRV (using the ML ratio test). Does the more sophisticated model yield significantly higher likelihood when compared to the uniform rate? (outfile here)

Finished?

Check the appropriate radio button below before pressing the submit button:

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone