Your name: Your email address:
Download Seaview (see computer lab #8). The latest versions of the seaview program available for different platforms are here.
If you're doing this exercise from home, you may need to have WINZIP or STUFF-IT expander installed (read the help manuals to learn how to use these programs) - UConn's ftp server has copies of these programs in the restricted folder (only accessible from within UConn).
1 (15 minutes):
Load the sequences contained in testseq1.txt into clustalx. Align the sequences (note the inteins in two of the sequences) and calculate neighbor joining trees with bootstrap support values using the possible permutations of gaps/no gaps, with correction for multiple substitutions. Load the trees into njplot. Root the tree using Sulfolobus and/or Thermococcus as outgroup. Which of the trees correspond to your expectations? (Sulfolobus and Thermococcus are Archaea; Borrelia is a spirochete (bacterium); Acetabularia is a green alga; Daucus is a flowering plant (carrot); Candida and Saccharomyces are yeasts; Neurospora is another fungus (not a yeast, though); Drosophila is an animal (fruit fly); and trypanosomes are protists.)
The sequences in testseq1.txt (V/A-ATPase catalytic subunits) are quite similar to one another. To test the effect of long branches, I added a homologous, but only distantly related sequence to this file (the ATPase involved in flagellar assembly from Salmonella). The resulting file is testseq1b.txt . Align the sequences and calculate neighbor joining trees for this file using the possible permutations of gaps/ no gaps, and with and without correction for multiple substitutions.
Which of the resulting trees appears to best reflect the actual evolution? (note that the Borrelia sequence actually is an archaeal type ATPase acquired through gene transfer)
Give a justification for your choice. What might be the reason that the other options worked less well?
What do you expect to happen, when you replace the Salmonella sequence with a completely (?) unrelated sequence ( testseq1c.txt )?
Is your expectation confirmed?
Analysis of trees obtained with testseq1b: In my opinion, the best trees were obtained with correction for multiple substitutions turned on. Without correction for multiple substitutions, the two longest branches (flSalmonella and Borrelia) group together, and the group of the two yeasts is broken up by the Neurospora sequence. Excluding the positions with gaps resulted in a slight improvement (the yeasts go together).
Analysis of trees obtained with testseq1c: The Synechococcus sequence is not homologous to any of the other sequences. Accordingly, the distance correction does not work for all instances. Without considering gaps, the sequence groups with the longest branch (long branch attraction); with positions that contain gaps included, it goes with the Drosophila sequence, probably because the amino terminal ends of the two sequences match up.
Exclusion of positions with gaps gets rid of a lot of noise (these regions are usually least conserved), and of instances of convergent gap formation. In case of distance based analyses, it is dangerous to not exclude columns that contain gaps or missing data. The reason is that these regions often evolve faster, thus sequences that contain them (in our case the two yeast sequences) are artificially pushed further apart from each other.
Multiple substitutions occur, thus it is a good thing to take this into consideration when calculating distances.
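The two points above (excluding gapped columns and correcting for multiple substitutions) can be illustrated with a minimal Python sketch. It uses the Poisson correction, d = -ln(1 - p), as one simple correction; this is not necessarily the formula clustalx applies. The toy alignment and all function names are made up for illustration.

```python
import math

GAP_CHARS = set("-?")

def p_distance(seq_a, seq_b, skip_columns=frozenset()):
    """Observed fraction of differing sites; sites with a gap in either sequence are ignored."""
    compared = 0
    differing = 0
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if i in skip_columns or a in GAP_CHARS or b in GAP_CHARS:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared

def poisson_distance(p):
    """Poisson correction: estimated substitutions per site, d = -ln(1 - p)."""
    if p >= 1.0:
        raise ValueError("distance is undefined when every site differs")
    return -math.log(1.0 - p)

def gapped_columns(alignment):
    """Indices of alignment columns in which at least one sequence has a gap."""
    length = len(next(iter(alignment.values())))
    return {i for i in range(length)
            if any(seq[i] in GAP_CHARS for seq in alignment.values())}

# Hypothetical toy alignment (not taken from testseq1.txt).
toy = {
    "A": "MKTAYIAKQR",
    "B": "MKTAYLAKQR",
    "C": "MR--YLSKQK",
}

p_ab = p_distance(toy["A"], toy["B"])                        # gaps handled pair by pair
p_ab_no_gaps = p_distance(toy["A"], toy["B"], gapped_columns(toy))  # gapped columns removed for all pairs
p_ac = p_distance(toy["A"], toy["C"])
```

In this toy case the observed distance between A and C is 0.5, while the Poisson-corrected value is ln 2 (about 0.69): the correction matters most for the most divergent pairs, which are the ones involved in long branch attraction.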
2. (15 minutes) PHYLIP is a collection of programs for phylogenetic analyses written by Joe Felsenstein. The programs are freely available (including source code), and can be used on a variety of different operating systems. The programs are modular. Different modules exist to create bootstrap samples, calculate distance matrices, calculate trees from the distance matrices (Fitch and Neighbor), calculate consensus trees, etc. All programs either use files called infile or intree, or alternatively the user needs to provide the file name. We will use the sequences from exercise 2c above (i.e., testseq1b.txt).
We will use the program protpars as implemented in seaview.
(IGNORE THIS PARAGRAPH, if you are in class) This paragraph has comments on phylip that you can safely ignore for now: If you use protpars directly (without the seaview interface), note that phylip by default treats gaps as a 21st character. If you want to treat the gap as missing data, you need to replace each gap symbol with a "?". In case you want to use one of the programs on your own, you need to read the excellent manuals that come with the software. Download PHYLIP from HERE. Drag the "phylip-3.68" folder to your Desktop. (The original download location is here for PCs and Macs.)
To calculate a phylogenetic tree from the aligned sequences using protein parsimony, open seaview by double clicking, and drag the file into the alignment window.
Align the sequences using muscle (click on align, then align all).
In the trees menu select parsimony. Uncheck "ignore all gap sites" and check "gaps as unknown states". To do a more thorough search for the tree that explains the data with the smallest number of substitutions, select "randomize seq order"; 5 randomizations provide a reasonable number of starting points for the heuristic search.
How are the fungal sequences resolved? (What does this tell us about parsimony and missing data?) Where does the Salmonella sequence go? (This is as expected, parsimony analyses are very sensitive to the Long Branch Attraction (LBA)).
Seaview also allows calculating trees using the maximum likelihood (ml) principle. The ml tree is the tree under which the sequence alignment has the highest probability. To calculate the ml tree for a sequence alignment, select phyml under trees, and run the program with the default settings. How does the ml tree compare to the parsimony and neighbor joining trees?
Long Branch Attraction (LBA) is a serious problem in phylogenetic reconstruction. LBA denotes the fact that long branches tend to be grouped together with significant support, even though the organisms representing the long branches did not share more recent common ancestry. The support usually is measured through bootstrap support values for the different trees. We have simulated the evolution of 4 sequences (named A,B,C,D) according to the following tree:
Files containing these sequences in multiple sequence fasta format were generated and named according to the length chosen for the two long branches (all scaled in substitutions per site). For the simulation we assumed that the Among Site Rate Variation could be described with a gamma distribution that has a shape factor of 1 (equal to an exponential distribution).
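To see what a gamma distribution with shape factor 1 means for the rates at individual sites, here is a minimal Python sketch. It assumes the common convention that the distribution is scaled to a mean rate of 1; the scaling used in the actual simulation is not stated here, and the function name is made up for illustration.

```python
import random

def draw_site_rates(n_sites, shape=1.0, seed=None):
    """Draw one relative rate per site from a gamma distribution with mean 1 (assumed scaling)."""
    rng = random.Random(seed)
    # gammavariate(alpha, beta): alpha is the shape, beta the scale.
    # Setting scale = 1 / shape gives a mean of shape * scale = 1.
    return [rng.gammavariate(shape, 1.0 / shape) for _ in range(n_sites)]

rates = draw_site_rates(20000, shape=1.0, seed=1)
mean_rate = sum(rates) / len(rates)
# Fraction of sites evolving at less than a tenth of the average rate.
slow_fraction = sum(r < 0.1 for r in rates) / len(rates)
```

With shape 1 the distribution is exponential, so roughly 1 - exp(-0.1), or about 10%, of the sites change very slowly, while a few sites change several times faster than average. The fast sites are the ones that become saturated first on the long branches.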
These files are HERE (open the folder, then ctrl click on the individual files to save them into a folder on your computer).
Your task is to explore the sensitivity of different phylogenetic reconstruction algorithms towards LBA. At the minimum you should use protein parsimony and one protein distance matrix analysis approach. In this case we know that the sequences are aligned as given; however, to explore the effect that the alignment algorithm has on LBA, we can align them before phylogenetic reconstruction. To keep track of things, name the files accordingly.
NOTE I: If you want to explore the effect of alignment, it might be a good idea to use seaview with muscle as the alignment program; especially for the more divergent sequences, clustalx takes a very long time. We will use the GUI provided in seaview.
Note II: You can divide the labor with your neighbor, distributing different sequences to different students.
We will use programs as implemented in SEAVIEW
3A: To test parsimony, choose the files with x = 0.1, 0.3, 1, 3, 10.
How long are the sequences before and after alignment with muscle (the default in the align menu of seaview)?
For each of the datasets, use the tree menu in seaview, select parsimony, uncheck "ignore all gap sites", check "gaps as unknown states", check "bootstrap with 100 replicates". (Note: If you are interested in the best parsimony tree, then you want to use the original dataset (not bootstrapped) and randomize the input order for several independent heuristic searches. If you do a bootstrap analysis, repeated heuristic searches for each dataset are not worth the time.)
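If you prefer to check the lengths outside of seaview, a small Python sketch like the following reads fasta-formatted text and reports, for each sequence, the length including gaps and the length without gaps. The function names and the example text are made up for illustration; for a real file, pass the result of open(path).read() with the name you chose for that file.

```python
def read_fasta(text):
    """Parse fasta-formatted text into a dict mapping sequence name to sequence string."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else "unnamed"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def length_report(records):
    """For each sequence return (length including gaps, length without gaps)."""
    return {n: (len(s), len(s.replace("-", ""))) for n, s in records.items()}

# Hypothetical two-sequence example standing in for one of the simulated files.
example = ">A\nMKT-AY\nIA\n>B\nMKTQAY\nI-\n"
report = length_report(read_fasta(example))
```

Comparing the first number of each pair before and after running muscle shows how many gap columns the alignment program introduced.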
In the following box list the files that you chose, aligned or as provided, the bootstrap support for the correct tree, and the support for the LBA tree:
3B) Explore a distance matrix based approach with respect to LBA (neighbor joining using Poisson corrected or observed distances works well). Depending on the settings, these might be less sensitive to LBA. x = 0.3, 3, 30, 300 are good choices to explore.
In the following box list the parameters you selected in seaview, the files that you chose (aligned or as provided), and for each file indicate the bootstrap support for the correct tree, and the support for the LBA tree:
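To make the bootstrap numbers for the distance approach less of a black box, here is a minimal Python sketch for exactly four sequences named A, B, C and D. It is not seaview's implementation. It relies on the fact that with four taxa, neighbor joining first joins the pairing whose two within-pair distances have the smallest sum, so each bootstrap replicate votes for one of the three possible topologies. Which topology is the correct one and which is the LBA one depends on the simulation tree, so the sketch simply reports all three. The toy alignment and all function names are made up for illustration.

```python
import math
import random

def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance over sites where neither sequence has a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return float("inf")
    p = sum(a != b for a, b in pairs) / len(pairs)
    return float("inf") if p >= 1.0 else -math.log(1.0 - p)

def best_split(aln):
    """Topology whose two pairs have the smallest summed distance; None if tied."""
    def d(x, y):
        return poisson_distance(aln[x], aln[y])
    scores = {
        "AB|CD": d("A", "B") + d("C", "D"),
        "AC|BD": d("A", "C") + d("B", "D"),
        "AD|BC": d("A", "D") + d("B", "C"),
    }
    ranked = sorted(scores.items(), key=lambda item: item[1])
    if ranked[0][1] == ranked[1][1]:
        return None  # unresolved replicate
    return ranked[0][0]

def bootstrap_support(aln, replicates=100, seed=0):
    """Percentage of bootstrap replicates supporting each of the three topologies."""
    rng = random.Random(seed)
    n_sites = len(aln["A"])
    counts = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for _ in range(replicates):
        # Resample alignment columns with replacement.
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        resampled = {name: "".join(seq[i] for i in cols) for name, seq in aln.items()}
        winner = best_split(resampled)
        if winner is not None:
            counts[winner] += 1
    return {topology: 100.0 * n / replicates for topology, n in counts.items()}

# Hypothetical toy alignment: A and B are identical, C and D are identical.
toy = {
    "A": "MKTAYIAKQRMKTAYIAKQR" * 3,
    "B": "MKTAYIAKQRMKTAYIAKQR" * 3,
    "C": "MRTGYLAKNRMKSAYIGKQK" * 3,
    "D": "MRTGYLAKNRMKSAYIGKQK" * 3,
}
support = bootstrap_support(toy, replicates=100, seed=0)
```

Replacing poisson_distance with the uncorrected fraction of differing sites lets you compare "Poisson corrected" and "observed" distances on the same data.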
3C (optional) Explore the sensitivity of phyml towards LBA. This only works on a fast computer - either transfer the sequences to the cluster, login, qlogin and start the program by typing phyml at the command line, or use seaview on your own computer.
In the following box, give the parameters you chose for phyml and the files that you chose, indicate whether you aligned them or used them as provided, and for each file give the support value for the correct tree and the support for the LBA tree: