The Effects of Among Site Rate Variation on Phylogenetic Reconstruction from Amino Acid Sequences


J. Peter Gogarten
Dept. of Molecular and Cell Biology,
University of Connecticut
Storrs, CT 06269-3044,USA
http://www.sp.uconn.edu/~gogarten
Olga A. Zhaxybayeva
Dept. of Molecular and Cell Biology,
University of Connecticut
Storrs, CT 06269-3044,USA
http://carrot.mcb.uconn.edu/~olgazh/

[Summary]
[Introduction]
[Materials and Methods]
[The Simulation Algorithm]
[The Tree Types]
[Results and Discussion]
[Conclusion]
[Acknowledgments]
[References]



SUMMARY

Using computer simulation to generate sequence data for protein families, we studied the influence of among site rate variation on the performance of phylogenetic reconstruction. The following methods for phylogenetic reconstruction were compared: maximum likelihood, parsimony, and distance-matrix analyses. Three different tree topologies were used to explore the effect of long branch attraction with data that incorporated substitution bias and different degrees of among site rate variation.

While all methods showed errors due to long branch attraction under extreme conditions, maximum likelihood was by far the most successful in accurately resolving deep phylogenies.

Using protein parsimony long branch attraction already became a problem with modest differences in rate (ratio 1:3) and sequences that were about 50% identical, where as maximum likelihood methods still performed well with realistic values of among site rate variation and branch length ratios of 1:12 (sequences less than 30% identical).

In the absence of fast evolving lineages (i.e., long branches were due to the absence of side branches only) all tested algorithms performed well even when sequences were below 20% identity.


INTRODUCTION

Systematic artifacts have long been recognized as important complications in phylogenetic reconstruction. The most serious of this is long branch attraction (e.g.: Felsenstein, 1978, Handy and Penny, 1989). Long branch attraction occurs when a phylogeny is being reconstructed that contains long branches that group intermingled with shorter branches. Two scenarios can yield long branches:

  • some sequences evolve faster than others, i.e., they experience faster substitution rates. (Tree Type #1)
  • the long branches can be due to the absence of side branches in some lineages (Tree Type #3)

In both instances most algorithms tend to group the long branches together regardless whether this grouping reflects the actual history. Once the length of the long branches is above a threshold, which is different for different algorithms, the wrong result, i.e. grouping the long branches together, is produced with high reproducibility (compare Fig. 2). In case of parsimony analysis this finding can be rephrased in the statement that the actual evolution is not necessarily most parsimonious.

Protein sequences are useful for comparisons between distantly related sequences because their 20 letter alphabet in conjunction with substitution bias facilitates the alignment of divergent sequences (Pearson, 1996). Sequence alignment is further aided by highly conserved residues that remain unchanged even between distantly related proteins.

In case of nucleotide sequences simulations have been performed to estimate the effect of among site rate variation (ASRV) on phylogenetic inference (e.g.: Kuhner and Felsenstein, 1994, Huelsenbeck, 1995, Tateno et al, 1994); however at present the effect of ASRV on long branch attraction on protein sequences based phylogenies has not been explored. In this study we simulate protein data for different tree topologies using four different distributions for ASRV.


MATERIALS AND METHODS

Three different tree topologies were considered. For each topology independent data sets were generated by simulating substitution events. Among site rate variation and substitution bias were taken into account. Different degrees of among site rate variation can be described using different shape parameters for the Gamma distribution. Four values of shape parameters were used: infinity (no ASRV), 0.98 (exponential distribution), 0.66 and 0.5 were used. Values of 0.98 and 0.66 for a were chosen because they represent values estimated for a variety of different protein families (Gogarten et al., 1996). The value of 0.5 for a was chosen because of indications that substitution patterns in extreme thermophiles follow a more extreme site to site variation (Liu et al. submitted).

For each value of shape parameter data sets with different branch lengths were generated. The distribution of the different rate classes along the sequence was chosen to mimic the spatial distribution observed in vacuolar and archaeal ATPase A-subunits (Gogarten et al., 1992).

The observed frequency of residues falling into each rate category was used to discretize a continuous Gamma distribution. Lower and upper bounds for each rate category were calculated so that the portion of the gamma distribution falling within these bounds corresponded to the observed frequency of that rate category. The rate assigned to each category was determined from of the median of the gamma distribution within the corresponding lower and upper bounds.

We used several programs for analyzing the simulated data: PUZZLE 3.0 (Strimmer and van Haeseler, 1996) for maximum likelihood, PHYLIP 3.5 package (Felsenstein, 1993) and PHYLOWIN (Galtier et al 1996) for parsimony, neighbor joining and Fitch-Margoliash (Fitch and Margoliash, 1967) methods, CLUSTALW 1.6 (Thompson et al., 1994) for neighbor joining method with prior alignment of the data.


THE SIMULATION ALGORITHM

Fig.1: Schematic diagram of the algorithm used to simulate protein sequence evolution.

The algorithm was developed by Lei Liu et al. to study the relationship between genetic distance and % sequence divergence in the presence of different degrees of Among Site Rate Variation. The algorithm uses a discrete approximation to a continuous Gamma distribution to describe ASRV (Yang, 1996b).

The distribution of rate classes along the sequence (i.e. the correlation between sites) was chosen to mimic the distribution found in ATPase catalytic subunits. This is necessary because slower evolving sites are not randomly distributed along the sequence, but clustered together. The frequency of the rate classes was chosen to approximate a continuous distribution. The resulting algorithm thus provides sequences with realistic ASRV, substitution bias and correlation between sites. The algorithm is consistent with estimation of ASRV through the maximum likelihood programs PUZZLE(Strimmer and van Haeseler, 1996) and PAML (Yang, 1996).


THE TREE TYPES

Tree type#1 tests for long branch attraction due to different substitution rates in different lines of descent.
Tree type#2 tests for "long branch repulsion". An algorithm could do well on type#1 trees not because it is better in extracting phylogenetic information, but because it has the tendency to group long branches apart from each other. In the latter case, this algorithm should do poorly on type#2 trees.
Tree type#3 tests for long branch attraction in the absence of fast evolving lineages. The tree conforms to a molecular clock, the long branches are due only to the absence of side branches.

RESULTS AND DISCUSSION

All of the tested algorithms are susceptible to long branch attraction. In the absence of ASRV errors due to long branch attraction only occur with sequences that show less than 20% identity (Tab. 1); however, in the presence of ASRV long branch attraction becomes a serious problem even when the sequences at the "long" branches are about 50% identical. Use of maximum likelihood methods that incorporate both substitution bias and ASRV is superior over parsimony analysis. However, even this approach falls victim to long branch attraction with more extreme ASRV (a = 0.5 and 0.66).

Test of algorithms for errors due to long branch attraction.

Sequence simulations used the indicated values for shape parameters and branch lengths. '+' denotes that the tree topology was correctly recovered, '-' indicates that the wrong tree was recovered (i.e., the long branches A and B grouped together). 'n/a' denotes that data could not be analyzed with the method. Signs n the brackets indicate that the result was found in two out of three replicates.


Fig. 2: Correlation between branch length and average of bootstrap probability calculated with clustalw (with alignment, gaps excluded, no correction for multiple hits). the white column denotes correct, the black columns the incorrect tree. Bars indicate standard deviations calculated from independent simulations.

Table 2 shows that none of the tested algorithm makes the opposite error (i.e., long branch repulsion). The better performance of some algorithms indeed appears to be due to the better extraction of phylogenetic information by these algorithms.

Test of algorithms for long branch repulsion.

For further explanations see text and table 1.

The better performance of the used maximum likelihood approaches is also indicated in the results of Table 3. However, all of the algorithms performed surprisingly well, when the long branches were not due to unequal substitution rates but due to the absence of side branches. Even when the sequences were aligned prior to phylogenetic reconstruction, phylogenies were accurately recovered when sequences were less than 20% identical. (No one in his right mind would sequences which are less than 10% identical for phylogenetic information).

Test for long branch attraction due to missing side branches only.

'+' denotes that the tree topology was correctly recovered, '-' indicates that the wrong tree was recovered (i.e., the long branches A and B grouped together). 'n/a' denotes that data could not be analyzed with the method and 'u' indicates that the tree was unresolved. Signs n the brackets indicate that the result was found in two out of three replicates.

If genetic distances used as the basis for comparison, ASRV facilitates phylogenetic reconstruction (Tab.3), especially if the alignment step is included.


CONCLUSION

Avoid data sets that contain fast and slow evolving sequences.

If data sets contain long branches that are due to higher substitution rates, maximum likelihood analyses that incorporate ASRV and substitution bias perform better than protein parsimony of distance matrix analyses.

Long branch attraction that is due to the absence of side branches does not appear to be a serious problem for phylogenetic reconstruction from amino acid sequences.


ACKNOWLEDGMENTS

Work in this project was supported through the NASA Exobiology Program. While working on this project Olga Zhaxybayeva was supported through the NASA Planetary Biology Internship Program. We thank Lei Liu and Kent Holsinger for stimulating discussions and suggestions, and Lei Liu for providing the algorithm used to simulate sequence evolution.


REFERENCES

  • Felsenstein, J. (1978) Cases in which parsimony and compatibility methods will be positively misleading. Syst. Zool. 27, 401-410
  • Felsenstein, J. (1993) Phylogeny Inference Package, version 3.5c. Distributed by the author. Dept. of Genetics, Univ. of Washington, Seattle.
  • Fitch, W. M., and E. Margoliash (1967). Construction of phylogenetic trees. Science 155: 279-284
  • Galtier, N., Gouy, M., and Gautier, C. (1996). SEAVIEW and PHYLO_WIN: two graphic tools for sequence alignment and molecular phylogeny. CABIOS 12: 543-548
  • Gogarten, J.P., L. Olendzenski, E. Hilario, C. Simon, and K.E. Holzinger (1996). Dating the cenancestor of organisms, Science 274: 1750-1751.
  • Gogarten, J.P., T. Starke, H. Kibak, J. Fichmann, and L. Taiz (1992). Evolution and Isoforms of V-ATPase Subunits. The Journal of Experimental Biology 172: 137-147
  • Hendy, M.D., Penny, D. (1989) A framework for the quantitative study of evolutionary trees. Syst. Zool. 38, 297-309
  • Huelsenbeck, P.J. (1995) Performance of phylogenetic methods in simulation. Syst. Biol. 44, 17-48.
  • Pearson, W.R. (1996) Effective protein sequence comparison. Meth. Enzymol. 266, 227-258.
  • Strimmer K., and A, von Haeseler 1996. Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol Biol Evol 13, 964-969.
  • Tateno, Y., Takezaki, N., Nei, M. (1994) Relative efficiencies of the maximum likelihood, neighbor-joining, and maximum-parsimony methods when substitution rate varies with site. Mol. Biol. Evol. 11, 261-277.
  • Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680.
  • Yang, Z. (1996) Phylogenetic analysis by maximum likelihood (PAML), Version 1.2. Department of Integrative Biology, UCBerkeley, CA
  • Yang, Z. (1996b) The among-site rate variation and its impact on phylogenetic analyses, Trends Ecol.Evol. 11, 367-372