Class 10: Answers, PUZZLE, ml-mapping

MCB 372 - CLASS 10

Questions and comments regarding last Wednesday's class

The random seed allows you to start the random number generator at a defined place. This way you can force the program to do exactly the same thing twice; by using a different seed, you can make sure that the randomizations (jumbling or bootstrapping) differ when you run the program a second time. The n in the 4n+1 formula does not refer to the number of species; it can be any integer.
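As a rough illustration of what the seed does, here is a minimal Python sketch. It uses Python's own random module, not the generator built into PHYLIP, and the taxon names and the helper names jumble and is_4n_plus_1 are made up for this example.

```python
import random

TAXA = ["human", "chimp", "mouse", "yeast", "ecoli"]  # hypothetical names


def jumble(taxa, seed):
    """Return the taxa in a random input order determined only by the seed."""
    rng = random.Random(seed)  # private generator started at a defined place
    order = list(taxa)
    rng.shuffle(order)
    return order


def is_4n_plus_1(seed):
    """True if seed = 4n+1 for some integer n (n is not the number of species)."""
    return seed % 4 == 1


print(jumble(TAXA, 13) == jumble(TAXA, 13))  # same seed, same order: True
print(is_4n_plus_1(13), is_4n_plus_1(14))    # True False
```

Running jumble with a different seed will usually give a different input order, which is the point of supplying a new seed for each jumbling or bootstrapping run.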

Why use jumble? With the jumble option the heuristic search builds the tree in a different way each time (the sequences are added in a different order). Heuristic searches often fail to find the most parsimonious tree. The jumble option runs the search several times and returns the best tree found. The only problem with the jumble option is that it does not report the tree found after each jumbling step. (If you do four jumbles and the tree is different every time, you would be right to assume that repeating this a few more times might turn up shorter trees; if you always find the same tree, you can feel a little more confident, because the parsimony landscape is apparently smoother and easier to explore.) What could you do to make the program list the best tree for each starting tree generated with random input order?

If you do a heuristic search (as implemented in PHYLIP), you can never be sure that you really found the most parsimonious tree. If you have a large dataset, you should run the searches repeatedly, and if possible use PAUP, which offers more sophisticated and exhaustive branch swapping algorithms. However, even then you never are sure (e.g. here).

To use user-trees, you still need a dataset. I.e., you first put in your alignment (via infile or prompt), then tell the program via the menu that you want to use user-trees, and then you will be prompted to provide a filename (or use intree).

Regarding question 4, dataset #2: The idea was to replace gaps with missing data. Use either "?" or "X". "N" would be interpreted as asparagine, not as missing data. In this case "X" or "?" would both be fine. Why? Discuss the difference.

Regarding question 4, dataset #2: Are the trees different? (How did you treat gaps?) What could be used as an outgroup?

The outfile generated by PHYLIP programs is a text file that can be read in any text editor. To read the trees without too many problems you should use a non-proportional font (like Courier), decrease the font size and/or increase the size of the paper. You cannot open the outfile directly in TREEVIEW, BUT if the outfile contains a tree given in parenthesis notation, you can copy and paste this tree into TREEVIEW. You can load the treefile directly. If the file contains more than one tree, TREEVIEW opens one window, and you can click on the forward arrow to scroll through the trees.

Using a single species as an outgroup does not change the tree found by a parsimony program from molecular sequence data (usually), it only influences how the tree is depicted. Remember: molecular data usually have to be considered as non-polarized, and the calculated trees are UNROOTED. If you have ADDITIONAL information, you can root the tree using an outgroup, but this will not change the number of steps required.

Student presentation #2: Models of Substitution: Nucleotides

TREE-PUZZLE

The reconstruction of phylogenetic trees from molecular sequences necessitates that you assume a model that describes the evolutionary process. Often these assumptions are not clearly spelled out, and some claim that parsimony analysis does not assume a model at all: it just searches for the tree that explains the data with the least number of substitutions. However, an alternative view is that parsimony corresponds to a model in which all substitutions are equally likely. One of the major problems, especially if one wants to calibrate the data with respect to time, is the correction for multiple substitutions. The situation is complicated by two factors:

1. different sites along a sequence experience substitutions with different frequency
2. if a replacement occurs, the different types of replacements occur with different probabilities

Both of these considerations are valid for amino acid and nucleotide sequences. Taking both of these processes into account greatly improves the validity of the obtained trees. Two approaches have been used to address problem 1: assign different weights to different positions a priori (e.g. first, second, third codon position, or stem versus loop regions in rRNA), or have the program decide which distribution of among-site rate variation is present in the data and make the appropriate corrections for multiple substitutions.

The so-called gamma distribution has become very popular for this purpose. A good and readable overview was published by Z. Yang (1996): The among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11, 367-372.

The Γ-distribution is useful because a single parameter (the shape parameter α) continuously alters the character of the distribution. With α = ∞ all sites change at the same rate; an extreme among-site rate variation (ASRV), where only a few sites vary and the majority of sites are invariant or change only very slowly, is obtained with α approaching 0. In between are cases that resemble exponential (α = 1), Poisson (α ≈ 2), and normal distributions (α > 10). (see here for graphs)
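To get a feeling for how the shape parameter changes the distribution, the density can be evaluated directly. The sketch below assumes the common convention that the mean rate is fixed at 1, so that the variance equals 1/alpha; the function name gamma_rate_density and the chosen rates are made up for this example.

```python
import math


def gamma_rate_density(r, alpha):
    """Density at rate r > 0 of a gamma distribution with shape alpha and mean 1."""
    beta = alpha  # rate parameter chosen so that mean = alpha / beta = 1
    log_density = (alpha * math.log(beta) - math.lgamma(alpha)
                   + (alpha - 1.0) * math.log(r) - beta * r)
    return math.exp(log_density)


# Small alpha piles the density up near rate 0 (most sites nearly invariant);
# large alpha concentrates it around the mean rate of 1.
for alpha in (0.2, 1.0, 2.0, 10.0):
    values = [gamma_rate_density(r, alpha) for r in (0.1, 1.0, 3.0)]
    print(f"alpha={alpha:5.1f}  variance={1.0 / alpha:5.2f}  "
          + "  ".join(f"{v:8.4f}" for v in values))
```

With a shape parameter of 1 the function reduces to the exponential density exp(-r), which matches the exponential case mentioned in the text.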

There are several program packages that utilize the gamma distribution, e.g. PAML from Z. Yang (Yang, Z. 2001. Phylogenetic analysis by maximum likelihood (PAML), Version 3.1. Department of Integrative Biology, UC Berkeley, CA), PAUP from D. Swofford (test version distributed through the author, Smithsonian Institution, Laboratory of Molecular Systematics, MRC-534, Washington DC 20560), and Strimmer and von Haeseler's TREE-PUZZLE (Strimmer, K., von Haeseler, A. (1996) Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol Biol Evol 13, 964-969). Version 5.0 is available through the authors at http://www.tree-puzzle.de/ .

The programs we will use in this course are PAUP and TREE-PUZZLE.

PAUP is a great program if you use nucleotide sequences. The second part of this course will be devoted, in part, to PAUP. However, for proteins it leaves you little choice but to use the encoding nucleotide sequences. Recoverable phylogenetic information decays more slowly in protein sequences in the presence of ASRV (Lei, Holsinger, Gogarten, unpublished). In many instances nucleotide sequences are not (yet?) a viable alternative to studying protein sequences directly.

TREE-PUZZLE is a very versatile maximum likelihood program that is particularly useful for analyzing protein sequences. The program was developed by Korbinian Strimmer and Arndt von Haeseler (then at the Univ. of Munich) and is maintained by von Haeseler, Heiko A. Schmidt, and Martin Vingron (for contacts see http://www.tree-puzzle.de/).

TREE-PUZZLE allows

fast and accurate estimation of ASRV (through estimating the shape parameter alpha) for both nucleotide and amino acid sequences,
it has a fast algorithm to determine trees through quartet puzzling (calculating ML trees for quartets of species and building the multispecies tree from the quartets).
The program provides confidence numbers which tend to be smaller than bootstrap values (i.e. provide an even more conservative estimate), and
the program calculates branch lengths and likelihood for user defined trees, which is great if you want to compare different tree topologies, or different models using the maximum likelihood ratio test.
Branches which are not significantly supported are collapsed.
TREE-PUZZLE runs on "all" platforms
TREE-PUZZLE reads PHYLIP format, and communicates with the user in a way similar to the PHYLIP programs.
Don't leave home without it

Drawbacks:

The more species you add the lower the support for individual branches. While this is true for all algorithms, in TREE-PUZZLE this can lead to completely unresolved trees.
Trees calculated via quartet puzzling are usually not completely resolved, and they do not correspond to the ML-tree.
The determined multi-species tree is not the tree with the highest likelihood; rather, it is the tree whose topology is supported by the ML quartets, and the lengths of the resolved branches are determined through maximum likelihood.

Trees calculated with TREE-PUZZLE

Finding ML trees can be very time consuming; finding parsimony trees is blazingly fast by comparison. TREE-PUZZLE does not include a search algorithm for the tree with the highest likelihood. If you want to find this tree you need to use either PROML from PHYLIP, Adachi/Hasegawa's PROTML, or Yang's PAML. PROTML does not incorporate ASRV, but it is quite user friendly. PAML incorporates ASRV, and is run via a control file. Both programs use star decomposition for heuristic searches. At present my first choice would be PROML if one starts a heuristic search from scratch, or PAML if you know how to constrain the tree. In PAML and PROTML you can also provide a partially unresolved user-tree and let the program determine the most likely resolution of the unresolved nodes (again using star decomposition or exhaustive search).

TREE-PUZZLE uses a different approach. All different quartets of species (four sequences each) are calculated (the more the better) using maximum likelihood according to the model you select. Starting with a randomly chosen quartet, species are added one at a time, selecting the position of the added species so that it provides least conflict with all the quartets calculated.
E.g.: (a,b),(c,d) is the starting quartet. Sequence e is to be added.
The quartet (e,b),(c,d) adds a penalty to placing e with either c or d, whereas the central branch and placement with a or b are compatible.
The quartet (e,a),(b,c) adds a penalty to placing e with either b or c or on the central branch, whereas placement with a or d is compatible. The algorithm also keeps track of partially resolved quartets, e.g. if according to the ML analyses a can go with either c or d but not with b.
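The bookkeeping in this example can be sketched in a few lines of Python. The sketch reads the rule from the two cases above: for a quartet that pairs e with one species, every branch on the path between the other two species receives one penalty. This is a simplified illustration of a single addition step, not the TREE-PUZZLE source code; the internal node names u and v and all function names are made up.

```python
from collections import deque

# Starting quartet tree (a,b),(c,d): leaves a-d, internal nodes u and v.
EDGES = [("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"), ("d", "v")]


def path_edges(edges, start, goal):
    """Return the branches on the unique path between two nodes of a tree."""
    neighbours = {}
    for x, y in edges:
        neighbours.setdefault(x, []).append(y)
        neighbours.setdefault(y, []).append(x)
    parent = {start: None}
    queue = deque([start])
    while queue:  # breadth-first search outward from the start node
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    path = []
    node = goal
    while parent[node] is not None:  # walk back from goal to start
        path.append(frozenset((node, parent[node])))
        node = parent[node]
    return path


def penalties(edges, quartets):
    """Count, for each branch, how many quartets argue against placing e there."""
    score = {frozenset(edge): 0 for edge in edges}
    for _partner_of_e, (j, k) in quartets:
        for edge in path_edges(edges, j, k):
            score[edge] += 1
    return score


# (e,b),(c,d) and (e,a),(b,c), written as (partner of e, the other pair).
QUARTETS = [("b", ("c", "d")), ("a", ("b", "c"))]
SCORES = penalties(EDGES, QUARTETS)
for edge, count in SCORES.items():
    print(sorted(edge), count)
```

With these two quartets only the branch leading to a stays free of penalties, so e would be attached next to a; in a real run all quartets containing e contribute before the branch with the lowest penalty is chosen.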

This puzzling step (adding the species one at a time) is repeated many times, and the majority consensus of the resulting trees is displayed. In practice it turns out that the support values given by TREE-PUZZLE are more conservative than bootstrap values.

Maximum likelihood ratio test

If you want to compare two models of evolution (this includes the tree) given a data set, you can utilize the so-called maximum likelihood ratio test. If L₁ and L₂ are the likelihoods of the two models, d = 2(logL₁-logL₂) approximately follows a chi-square distribution with n degrees of freedom. Usually n is the difference in the number of model parameters, i.e., how many parameters are used to describe the substitution process and the tree. In particular, n can be the difference in the number of branches between two trees (one tree is more resolved than the other). In principle, this test can only be applied if one model is a more refined version of the other. However, if you compare two completely resolved trees that differ only in a single branch, you can, following a suggestion by Joe Felsenstein, use one degree of freedom. If you compare two trees, one calculated without assuming a clock and the other assuming a clock, the degrees of freedom are n-2, where n in this case is the number of sequences in the tree.
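The test statistic and its p-value are easy to compute by hand. The following is a minimal Python sketch that uses only the standard library; chi2_sf is a hand-written upper-tail probability for integer degrees of freedom (a statistics package would normally be used instead), and the two log-likelihoods are hypothetical numbers, not results from a real data set.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution (integer df >= 1)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # closed form for even degrees of freedom
        terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * terms
    # closed form for odd degrees of freedom
    tail = math.erfc(math.sqrt(half))
    extra = sum(half ** (i - 0.5) / math.gamma(i + 0.5)
                for i in range(1, (df - 1) // 2 + 1))
    return tail + math.exp(-half) * extra


def likelihood_ratio_test(log_l_refined, log_l_simple, df):
    """Return d = 2(logL1 - logL2) and the chi-square p-value."""
    d = 2.0 * (log_l_refined - log_l_simple)
    return d, chi2_sf(d, df)


# Hypothetical log-likelihoods of two trees that differ in a single branch.
d, p = likelihood_ratio_test(-2345.6, -2349.1, 1)
print(f"d = {d:.2f}, p = {p:.4f}")
```

A p-value below the chosen significance level (e.g. 0.05) would indicate that the more refined model fits the data significantly better than the simpler one.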

Maximum likelihood mapping

An often-encountered problem in inspecting trees is the assessment of support for different groupings. E.g., does Giardia form the deepest branch within the known eukaryotes? Maximum likelihood mapping offers a graphic approach to this problem.

You can generate an ml-map for a single branch, or for the complete dataset. The principle is that for each possible quartet, the probabilities for the three tree types are plotted in a simplex: Pi = Li/(L1+L2+L3), so that P1+P2+P3 = 1, where Pi is the (kind of) posterior probability and Li is the likelihood of tree i. For more information check out the literature posted here and here. If you generate a plot for the whole tree, you learn how many quartets are resolved with confidence; if you plot only a single branch, you learn how many quartets support each of the possible orientations of the branch.
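As an illustration of the formula, here is a minimal Python sketch that turns the three log-likelihoods of one quartet into P1, P2, P3 and into a point inside a triangle. The corner layout of the triangle is an assumption of this sketch and need not match the orientation used in TREE-PUZZLE's plots; the function names and the log-likelihood values are made up.

```python
import math


def posterior_weights(log_likelihoods):
    """Pi = Li / (L1 + L2 + L3), computed from log-likelihoods without underflow."""
    biggest = max(log_likelihoods)
    scaled = [math.exp(value - biggest) for value in log_likelihoods]
    total = sum(scaled)
    return [s / total for s in scaled]


# Corners of an equilateral triangle, one per tree type (assumed layout).
CORNERS = [(0.5, math.sqrt(3) / 2), (0.0, 0.0), (1.0, 0.0)]


def simplex_point(weights):
    """Place the point at the weighted average of the three corners."""
    x = sum(w * cx for w, (cx, _) in zip(weights, CORNERS))
    y = sum(w * cy for w, (_, cy) in zip(weights, CORNERS))
    return x, y


weights = posterior_weights([-1000.0, -1003.0, -1005.0])  # hypothetical values
print(weights, simplex_point(weights))
```

A quartet that clearly favors one tree type lands near a corner, a quartet torn between two tree types lands near an edge, and a quartet with three equal likelihoods lands in the center of the triangle.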

E.g., if one wants to know whether Giardia lamblia is the deepest branch within the eukaryotes, one can choose the "higher" eukaryotes as cluster a, another deep branching eukaryote (one that competes against Giardia) as cluster b, Giardia as cluster c, and the outgroup as cluster d. For an easy example output see this sample ml-map.

A more complicated result from the analysis of carbamoyl phosphate synthetase domains is here.

Application of ML mapping to comparative Genome analyses

A recent article on the use of ml mapping in comparative genome analyses is here. (Go through Fig1, 2, 3, 4, 7, and Tab. 4)

General Remarks:

If no file called infile is present in the program directory, TREE-PUZZLE will prompt you for the infile. If you work a lot on the same dataset, it might be worth renaming your data set to infile. The output is always put into the files outfile, outtree and outdist. You have to rename these files as soon as you are done. The next time puzzle starts, the files get erased (this is particularly sad if your run took a couple of days).
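If you tend to forget the renaming, a small helper script can do it for you. The following Python sketch is only a suggestion: the function name keep_results and the label run1 are made up, and the demonstration runs in a temporary directory with placeholder files rather than with real TREE-PUZZLE output.

```python
import tempfile
from pathlib import Path

OUTPUT_NAMES = ("outfile", "outtree", "outdist")  # file names given in the text


def keep_results(run_dir, label):
    """Rename outfile/outtree/outdist to <label>.<name> so the next run cannot erase them."""
    renamed = []
    for name in OUTPUT_NAMES:
        source = Path(run_dir) / name
        target = Path(run_dir) / f"{label}.{name}"
        if source.exists() and not target.exists():  # never overwrite older results
            source.rename(target)
            renamed.append(target.name)
    return renamed


# Demonstration with placeholder files in a throwaway directory.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in OUTPUT_NAMES:
        (Path(demo_dir) / name).write_text("placeholder\n")
    print(keep_results(demo_dir, "run1"))
    # prints ['run1.outfile', 'run1.outtree', 'run1.outdist']
```

Calling keep_results(".", "some_label") in the program directory right after a run would have the same effect as renaming the three files by hand.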

If you start TREE-PUZZLE from the command line (MS-DOS prompt, or c-shell) using the command "puzzle name_of_infile.phy", the program writes the results into files called name_of_infile.puzzle, name_of_infile.dist, etc.