Trees with CLUSTALX

Besides aligning sequences, Clustal can also calculate distance trees. The trees generated by clustalw certainly have their limitations; however, if one is aware of these limitations, the program is extremely useful for initial exploration.

Trees are calculated from a corrected or uncorrected distance matrix using the neighbor joining method. This method does not search among alternative trees for the one that optimizes some criterion; instead it builds a single tree directly with a much faster stepwise algorithm.
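To make the "algorithmic approach" concrete, here is a minimal sketch of neighbor joining in Python. It returns the topology only (no branch lengths), and the function name, the nested-tuple tree representation, and the small example matrix are all my own choices for illustration; this is not clustalw's code.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Build an unrooted tree topology (nested tuples) by neighbor joining.

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) -> pairwise distance
    """
    nodes = list(names)
    d = dict(dist)
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence: sum of distances from each node to all others.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i)
             for i in nodes}
        # Join the pair that minimizes Q(i, j) = (n - 2) d(i, j) - r(i) - r(j).
        # Ties are resolved in favor of the first pair encountered.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        new = (i, j)
        # Distance from the new internal node to every remaining node.
        for k in nodes:
            if k != i and k != j:
                d[frozenset((new, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))]
                    - d[frozenset((i, j))])
        nodes = [k for k in nodes if k != i and k != j] + [new]
    return tuple(nodes)

# Hypothetical five-taxon distance matrix, for illustration only.
example_names = ["a", "b", "c", "d", "e"]
example_dist = {
    frozenset(pair): value for pair, value in [
        (("a", "b"), 5), (("a", "c"), 9), (("a", "d"), 9), (("a", "e"), 8),
        (("b", "c"), 10), (("b", "d"), 10), (("b", "e"), 9),
        (("c", "d"), 8), (("c", "e"), 7), (("d", "e"), 3),
    ]
}
example_tree = neighbor_joining(example_names, example_dist)
print(example_tree)
```

Note that the loop runs once per join and never revisits an earlier decision. That is why the method is fast, and also why it cannot tell you whether a slightly different tree would fit the distances nearly as well.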

Several parameters that you can choose in clustalw influence tree building.

The choice of substitution matrix and of other alignment parameters.

You can ignore all positions at which any of the sequences contains a gap.

You can correct for multiple substitutions. (In a perfect world you want to use the actual number of substitutions that occurred in evolution, and not the number of sites that differ between two sequences.) Later in the course we will discuss other methods for distance correction; however, all things considered, clustalw does quite well.
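The last two options can be sketched in a few lines of Python. The correction shown is Kimura's empirical formula for protein distances, d = -ln(1 - p - 0.2 p^2). I am assuming it here for illustration; these notes do not state which formula clustalw applies. The function names and the toy alignment are hypothetical.

```python
import math

def gap_columns(sequences):
    """Return the set of alignment positions where any sequence has a gap."""
    return {pos for seq in sequences
            for pos, ch in enumerate(seq) if ch == "-"}

def p_distance(seq_a, seq_b, skip=frozenset()):
    """Fraction of compared sites that differ (uncorrected distance).

    Positions in `skip` are ignored, as are sites gapped in either sequence.
    """
    compared = differing = 0
    for pos, (a, b) in enumerate(zip(seq_a, seq_b)):
        if pos in skip or a == "-" or b == "-":
            continue
        compared += 1
        differing += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared

def kimura_protein_distance(p):
    """Kimura's empirical correction for multiple substitutions (assumed)."""
    inside = 1.0 - p - 0.2 * p * p
    if inside <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(inside)

# Toy alignment, made up for this example.
toy = {"A": "MKT-LV", "B": "MKSALV", "C": "MRSALI"}
skip = gap_columns(toy.values())           # "ignore positions with gaps"
p_ab = p_distance(toy["A"], toy["B"], skip)
d_ab = kimura_protein_distance(p_ab)       # "correct for multiple substitutions"
print(p_ab, d_ab)
```

The corrected distance is always larger than the observed fraction of differing sites, and the gap grows with divergence. For very similar sequences the correction makes almost no difference.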

Clustalw also provides possibilities for bootstrapping:

Bootstrapping - how to assess reliability of partitions given in a tree.

    5    6
Alpha     ACAAAC
Delta     CACCCA
Gamma     ACAAAC
Beta      ACCCCC
Epsilon   CAAAAC
    5    6
Alpha     AACAAC
Beta      AACCCC
Epsilon   CCAAAC
Delta     CCACCA
Gamma     CCCAAC
    5    6
Delta     CAACCC
Beta      ACCCCC
Gamma     ACCAAA
Alpha     ACCAAA
Epsilon   CAAAAA
    5    6
Alpha     AAAACA
Beta      AAAACC
Gamma     AAACCA
Delta     CCCCAC
Epsilon   CCCCAA
    5    6
Beta      ACCCCC
Epsilon   CAAACC
Delta     CCCCAA
Gamma     AAAACC
Alpha     AAAACC
    5    6
Gamma     CCAACC
Alpha     ACAACC
Epsilon   CAAACC
Delta     CACCAA
Beta      ACCCCC
    5    6
Alpha     AAACAA
Delta     CCCACC
Epsilon   CCCAAA
Gamma     AACCAA
Beta      AAACCC
    5    6
Alpha     AAAACC
Delta     CCCCAA
Beta      CCCCCC
Epsilon   AAAACC
Gamma     AAAACC
    5    6
Beta      AAAAAC
Alpha     AAAAAC
Gamma     AACCCC
Delta     CCCCCA
Epsilon   CCCCCC
    5    6
Delta     CCCCAA
Epsilon   CCAACC
Gamma     AAAACC
Alpha     AAAACC
Beta      AACCCC
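In bootstrapping, the columns of the alignment are resampled with replacement to produce pseudo-samples of the same length as the original; a tree is then calculated from each pseudo-sample. The following is a minimal sketch of the resampling step only. The function names and the small alignment are hypothetical; I am not reproducing the data set shown above.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths)
    rng:       random.Random instance, so that runs can be reproduced
    """
    length = len(next(iter(alignment.values())))
    # Draw `length` column indices; the same column may be drawn repeatedly.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}

def shuffle_input_order(alignment, rng):
    """Return the same alignment with the sequences in random order."""
    names = list(alignment)
    rng.shuffle(names)
    return {name: alignment[name] for name in names}

# Made-up alignment, for illustration only.
original = {
    "Alpha": "AACGTC",
    "Beta": "AACGCC",
    "Gamma": "ACCGTC",
    "Delta": "CCAGTA",
    "Epsilon": "CCAGTC",
}
rng = random.Random(42)
replicates = [shuffle_input_order(bootstrap_alignment(original, rng), rng)
              for _ in range(10)]
for name, seq in replicates[0].items():
    print(f"{name:<10}{seq}")
```

The bootstrap value of a partition is then simply the percentage of pseudo-samples whose tree contains that partition.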

Problems with clustalw:

The input order in analyzing the bootstrapped samples is not randomized; therefore, if you have no phylogenetic information at all, you get 100% bootstrap values.

LOOK AT YOUR ALIGNMENTS CAREFULLY! Or: "From junk comes junk!"

If you have very different branch lengths, even if you have a "molecular clock" running, long branches have the tendency to attract each other.
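The first problem in this list can be illustrated with a toy example. I am assuming here that, when several pairs of sequences are equally good candidates for joining, the first pair in input order wins. That tie-breaking rule is my assumption for the sketch, not a description of clustalw's code, and all names below are hypothetical.

```python
import random
from itertools import combinations

def first_best_pair(names, distance):
    """Return the closest pair; ties go to the first pair in input order."""
    return min(combinations(names, 2), key=lambda pair: distance(*pair))

def no_signal(a, b):
    """A data set without phylogenetic information: all distances are equal."""
    return 1.0

def joined_pairs(names, replicates, rng=None):
    """Collect the pairs joined first across replicates.

    If `rng` is given, the input order is shuffled for every replicate.
    """
    seen = set()
    for _ in range(replicates):
        order = list(names)
        if rng is not None:
            rng.shuffle(order)
        seen.add(frozenset(first_best_pair(order, no_signal)))
    return seen

taxa = ["Alpha", "Beta", "Gamma", "Delta"]
fixed = joined_pairs(taxa, 100)
shuffled = joined_pairs(taxa, 100, random.Random(1))
print(len(fixed), len(shuffled))
```

With a fixed input order, every replicate joins the same two sequences, so that grouping would be reported with 100% support although the data contain no signal at all. With a randomized order, the support is spread over the alternative groupings, as it should be.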

TREEVIEW

To view trees generated by clustalw, you can use treeview from Rod Page.

The program should already be installed on your computers. It is extremely user-friendly. Trees can be copied and pasted into Microsoft Word, and the labels can be rearranged or modified after double-clicking on the imported image.

Several programs are available that, among other things, calculate distance matrices (some with more sophisticated corrections than those available in clustal). You can use Joe Felsenstein's program Neighbor.exe to calculate neighbor joining trees from the distance matrices. Source code and executables are available through the PHYLIP homepage.

Assignment #5

Align the sequences contained in testseq1.txt using ClustalX. Perform a bootstrap analysis (Trees menu-> bootstrap N-J tree). Load the tree into treeview. In treeview toggle between the different display options (buttons on top of the tree window). Go to Tree menu and define the outgroup as Sulfolobus and Thermococcus. Then use the outgroup to root the tree (same menu). Does the tree correspond to your expectations?

Note: you can also use TreeView to edit a tree. This comes in handy when you need to generate usertrees for other programs. Try it out! In treeview, copy a tree onto the computer's clipboard and paste it into an MSWord document.

The sequences in testseq1.txt (V/A-ATPase catalytic subunits) are quite similar to one another. To test the effect of long branches, I added a homologous but only distantly related sequence to this file (the ATPase involved in flagellar assembly from Salmonella). The resulting file is testseq1b.txt.

Align the sequences and calculate bootstrapped trees for this file using ALL four combinations of the two options: positions with gaps included or excluded, and correction for multiple substitutions turned on or off.
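If you prefer the command line to the ClustalX menus, the four runs can be listed as follows. This is only a sketch: it assumes a command-line clustalw executable named "clustalw" on your path and an already aligned input file, here given the hypothetical name testseq1b.aln. The options -bootstrap, -tossgaps and -kimura are taken from clustalw's command-line help; check the help text of your own version before relying on them.

```python
from itertools import product

def clustalw_bootstrap_commands(alignment_file, replicates=1000):
    """Return one clustalw command (as an argument list) per option combination."""
    commands = []
    for exclude_gaps, correct in product((False, True), repeat=2):
        cmd = ["clustalw",
               f"-infile={alignment_file}",
               f"-bootstrap={replicates}"]
        if exclude_gaps:
            cmd.append("-tossgaps")   # ignore positions with gaps
        if correct:
            cmd.append("-kimura")     # correct for multiple substitutions
        commands.append(cmd)
    return commands

# The commands are only printed here; run them yourself, for example with
# subprocess.run(cmd, check=True). The tree file is named after the input,
# so rename it after each run or the next run may overwrite it.
for cmd in clustalw_bootstrap_commands("testseq1b.aln"):
    print(" ".join(cmd))
```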

Which of the resulting trees appears to best reflect the actual evolution?

Give a justification for your choice.

What might be the reason that the other options worked less well?

What do you expect to happen when you replace the Salmonella sequence with a completely (?) unrelated sequence (testseq1c.txt)?

Is your expectation confirmed?

Discussion of Results

Analysis of trees obtained with testseq1b: In my opinion, the best trees were obtained with correction for multiple substitutions turned on. Without correction for multiple substitutions, the two longest branches (flSalmonella and Borrelia) group together, and the group of the two yeasts is broken up by the Neurospora sequence. Excluding the positions with gaps resulted in a slight improvement (the yeasts go together), and the bootstrap values for the branches that are supported by other evidence were higher, whereas questionable groupings received appropriately low support.

Analysis of trees obtained with testseq1c: The Synechococcus sequence is not homologous to any of the other sequences. Accordingly, the distance correction does not work in all instances. With positions that contain gaps excluded, the sequence groups with the longest branch (long branch attraction); with these positions included, it goes with the Drosophila sequence, probably because the amino-terminal ends of the two sequences match up.

Exclusion of positions with gaps gets rid of a lot of noise (these regions are usually the least conserved) and of instances of convergent gap formation (some other programs handle this problem with more alternatives).

Multiple substitutions do occur, so it is a good idea to take them into account when calculating distances.
