Assignment for Class 29

Your name:
Your email address:

Note: To do these exercises at home, you need to install clustalx, treeview and njplot on your computer. For PCs the software is available here, here and here. For Mac OS X go here, here and here. The PHYLIP package is available at http://evolution.genetics.washington.edu/phylip.html . Follow the instructions given at that site.
You need to have WinZip or StuffIt Expander installed (read the help manuals to learn how to use these programs). UConn's ftp server has copies of these programs in the restricted folder (only accessible from within UConn).

Assignments:

1. (25 minutes)

Download this file onto your computer. Find the program clustalx, start it, and load the sequences into the clustalx program (Menu file open).
The sequences are named according to the following schema:

A denotes the catalytic ATP binding subunits of the vacuolar proton pumping ATPase that is found on membranes of the eukaryotic endomembrane system and of the archaeal type A-ATPsynthase
- these were called A-subunits, because they are the largest subunits of the water soluble head group of these ATPase/ATPsynthases

bet denotes the catalytic ATP binding subunits of the bacterial type F-ATPsynthase (also present in mitochondria and plastids)
- these were called beta subunits because they are the second largest subunit in the head group

B denotes the non-catalytic ADP binding subunits of the vacuolar proton pumping V-ATPase that is found on membranes of the eukaryotic endomembrane system and of the archaeal type A-ATPsynthase
- these were called B-subunits, because they are the second largest subunits in the head group of these ATPase/ATPsynthases

alp denotes the non-catalytic ADP binding subunits of the bacterial type F-ATPsynthase
- these were called alpha subunits because they are the largest subunit in the head group

fl denotes proteins that if mutated prevent the assembly of the bacterial flagella

The subunit designation is followed by the genus name:
Mus musculus - house mouse, an animal (mammal)
Arabidopsis thaliana - thale cress, a flowering plant (model organism for botanists)
Neurospora crassa - red bread mold, an ascus-forming fungus (model organism for geneticists)
Ecoli - Escherichia coli, a Gram-negative bacterium (model organism for biology)
Aquifex aeolicus - an extremely thermophilic bacterium (model organism for astrobiologists)
Methanosarcina barkeri - a euryarchaeote
Sulfolobus acidocaldarius - a crenarchaeote
Plasmodium, Trypanosoma - flagellated protozoa

Align the sequences using the default options (alignment menu)

Calculate neighbor joining trees (Tree menu) using the different options (don't forget to give different names to your trees!):
Exclude positions with gaps (unchecked) - correct for multiple substitutions (unchecked)
Exclude positions with gaps (unchecked) - correct for multiple substitutions (checked)
Exclude positions with gaps (checked) - correct for multiple substitutions (unchecked)
Exclude positions with gaps (checked) - correct for multiple substitutions (checked)
Save the trees into different files using names that allow you to remember which options you used - the absence of a logfile is one of the drawbacks of clustalx
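Because clustalx keeps no logfile, one way to stay organized is to decide on the four file names before you start. The following is only a sketch: the prefix "atpase" and the tags in the names are made-up conventions, not something clustalx requires.

```python
from itertools import product


def tree_names(prefix="atpase"):
    """Return one descriptive tree file name per combination of the two options."""
    names = {}
    for exclude_gaps, correct in product([False, True], repeat=2):
        gap_tag = "nogaps" if exclude_gaps else "gaps"
        corr_tag = "corr" if correct else "nocorr"
        # Key: (exclude positions with gaps, correct for multiple substitutions)
        names[(exclude_gaps, correct)] = f"{prefix}_{gap_tag}_{corr_tag}.ph"
    return names


if __name__ == "__main__":
    for (exclude_gaps, correct), name in tree_names().items():
        print(f"exclude gaps: {exclude_gaps!s:5}  correction: {correct!s:5}  ->  {name}")
```

Type the printed names into the clustalx save dialog as you go through the four runs.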

Using njplot, load the tree that was calculated including positions with gaps and without correction for multiple substitutions. (clustalx generates two types of tree files: *.dnd is the tree that is used to guide the alignment; *.ph is the neighbor joining tree in PHYLIP format. The *.ph trees are the ones you want to explore.)

In njplot explore the new outgroup option.
Where do you think the root of the tree might be located?
Which subunits are paralogs, and which are probably orthologs? In particular, are the beta and A subunits orthologs or paralogs?
Which of the bifurcations correspond to gene duplications?

In njplot explore the swap nodes option. Does this change the tree?

In njplot explore the subtree option. Can you manage to draw a tree from which only the flagellar assembly ATPases are excluded?

Compare the neighbor joining trees that clustalx calculated using the different options.
What are the differences?

2. (20 minutes)

Load the sequences contained in testseq1.txt into clustalx. Align the sequences (note the inteins in two of the sequences) and calculate a neighbor joining tree. Load the tree into treeview. In treeview toggle between the different display options (buttons on top of the tree window). Go to Tree and define the outgroup as Sulfolobus and Thermococcus. Then use the outgroup to root the tree (same menu).
Does the tree correspond to your expectations?

Sulfolobus and Thermococcus are Archaea, Borrelia is a spirochete (bacterium), Acetabularia is a green alga, Daucus is a flowering plant (carrot), Candida and Saccharomyces are yeasts, Neurospora is another fungus (not a yeast though), Drosophila is an animal (fruit fly) and trypanosomes are protists.

The sequences in testseq1.txt (V/A-ATPase catalytic subunits) are quite similar to one another. To test the effect of long branches, I added a homologous, but only distantly related sequence to this file (the ATPase involved in flagellar assembly from Salmonella). The resulting file is testseq1b.txt .
Align the sequences and calculate neighbor joining trees for this file using all four combinations of the options: positions with gaps included or excluded, and with or without correction for multiple substitutions.

Which of the resulting trees appears to best reflect the actual evolution?

Give a justification for your choice. What might be the reason that the other options worked less well?

What do you expect to happen when you replace the Salmonella sequence with a completely (?) unrelated sequence (testseq1c.txt)?

Is your expectation confirmed?

Discussion of Results :

Analyses of trees obtained with testseq1b: In my opinion, the best trees were obtained with correction for multiple substitutions turned on. Without correction for multiple substitutions, the two longest branches (flSalmonella and Borrelia) group together, and the group of the two yeasts is broken up by the Neurospora sequence. Excluding the positions with gaps resulted in a slight improvement (the yeasts go together).

Analysis of trees obtained with testseq1c: The Synechococcus sequence is not homologous to any of the other sequences. Accordingly, the distance correction does not work for all instances. With positions that contain gaps excluded, the sequence groups with the longest branch (long branch attraction); with positions that contain gaps included, it goes with the Drosophila sequence, probably because the amino terminal ends of the two sequences match up.

Exclusion of positions with gaps gets rid of a lot of noise (these regions are usually least conserved), and of instances of convergent gap formation. In distance based analyses, it is dangerous not to exclude columns that contain gaps or missing data. The reason is that these regions often evolve faster; thus sequences that contain them (in our case the two yeast sequences) are artificially pushed further apart from each other.
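To make "excluding positions with gaps" concrete, here is a minimal sketch that removes every alignment column in which at least one sequence has a gap. It illustrates the idea only and is not the code clustalx uses; the function name and the toy sequences are invented for this example.

```python
def drop_gap_columns(alignment, gap_chars="-?"):
    """Return a copy of the alignment without the columns that contain a gap.

    alignment maps a sequence name to its aligned sequence (all the same length).
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    if len({len(seq) for seq in seqs}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    # zip(*seqs) walks through the alignment column by column.
    kept = [col for col in zip(*seqs) if not any(c in gap_chars for c in col)]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


if __name__ == "__main__":
    toy = {"yeast1": "MKT-AVL", "yeast2": "MKTGAVL", "mouse": "MR--AVL"}
    for name, seq in drop_gap_columns(toy).items():
        print(f"{name:<8}{seq}")
```

Note that a single sequence with a long insertion (for example an intein) removes those columns for all sequences, so the number of positions left for the distance calculation can shrink considerably.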

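Multiple substitutions occur, so it is a good idea to take them into consideration when calculating distances: the observed fraction of differing sites underestimates the actual number of substitutions, and the underestimate grows as sequences become more divergent.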
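A small sketch of what such a correction does. It assumes Kimura's empirical formula for proteins, K = -ln(1 - D - D*D/5), which is the correction described in the ClustalW documentation; that clustalx applies exactly this formula to your data is an assumption here, and the function name is made up.

```python
import math


def kimura_protein_distance(observed):
    """Estimate substitutions per site from the observed fraction of differing sites.

    Assumed formula (Kimura's empirical correction): K = -ln(1 - D - D*D/5).
    """
    if not 0.0 <= observed <= 1.0:
        raise ValueError("observed must be a fraction between 0 and 1")
    inside = 1.0 - observed - observed ** 2 / 5.0
    if inside <= 0.0:
        # The logarithm is undefined here: the sequences are too divergent.
        raise ValueError("sequences are too divergent for this formula")
    return -math.log(inside)


if __name__ == "__main__":
    for d in (0.1, 0.3, 0.5, 0.7):
        print(f"observed {d:.1f}  ->  corrected {kimura_protein_distance(d):.3f}")
```

Running it shows that the correction barely matters for similar sequences but stretches large distances considerably, which is why it mainly affects the placement of long branches.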

3. (Only if you have time, BUT it provides a valuable learning experience!) PHYLIP is a collection of programs for phylogenetic analyses written by Joe Felsenstein. The programs are freely available (including source code), and can be used on a variety of different operating systems. The programs are modular. Different modules exist to create bootstrap samples, calculate distance matrices, calculate trees from the distance matrices (Fitch and Neighbor), calculate consensus trees, etc. All programs either use files called infile or intree, or alternatively the user needs to provide the file name. We will use the sequences from exercise 2c above. The file is here. (Note that phylip by default treats gaps as a 21st character. If you want to treat gaps as missing data, you need to replace the gap symbol with "?".) In case you want to use one of the programs on your own, you need to read the excellent manuals that come with the software.
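If you want to try the "gaps as missing data" variant, the replacement has to touch the sequences only, not the names (a name may itself contain a "-"). A minimal sketch, assuming the alignment is already held as a name-to-sequence dictionary and is written out in sequential PHYLIP layout with names padded to 10 characters; both function names are invented for this example.

```python
def gaps_to_missing(alignment, gap="-", missing="?"):
    """Replace gap symbols in the sequences only; the names are left untouched."""
    return {name: seq.replace(gap, missing) for name, seq in alignment.items()}


def to_phylip(alignment):
    """Format an alignment as sequential PHYLIP text (10-character name field)."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    lines = [f"{len(alignment)} {lengths.pop()}"]
    for name, seq in alignment.items():
        short = name[:10]  # PHYLIP reads the first 10 characters as the name
        lines.append(f"{short:<10}{seq}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    toy = {"yeastA": "MK-T", "mouse": "M--T"}
    print(to_phylip(gaps_to_missing(toy)), end="")
```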

Save the file with the aligned sequences in the directory where the PHYLIP executables are stored (within the mcb221 folder).

To calculate a phylogenetic tree from the aligned sequences using protein parsimony, double click the icon protpars.app. This should open a terminal window and launch the protpars program. When prompted, enter the name of the sequence file (testseq1b.phy). Read through the menu options, but do not change them. Enter Y to start the program. The results will be written into two files, outtree and outfile. Outtree can be opened with njplot; outfile is a text file. Inspect both files. (Note: Phylip by default uses the two files named outfile and outtree. Rename your files so that you remember what is in them.)
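Renaming by hand works fine; if you run many PHYLIP programs in a row, a small helper can do it for you. This is only a sketch: the helper name, the label, and the naming pattern are made up and have nothing to do with PHYLIP itself.

```python
from pathlib import Path


def keep_phylip_output(label, directory="."):
    """Rename outfile/outtree to <label>_outfile.txt and <label>_outtree.ph."""
    directory = Path(directory)
    renamed = []
    for default_name, suffix in (("outfile", "_outfile.txt"), ("outtree", "_outtree.ph")):
        source = directory / default_name
        target = directory / f"{label}{suffix}"
        if not source.exists():
            continue  # this run did not produce that file
        if target.exists():
            raise FileExistsError(f"{target} already exists; pick another label")
        source.rename(target)
        renamed.append(target)
    return renamed


if __name__ == "__main__":
    for path in keep_phylip_output("protpars_testseq1b"):
        print("kept", path)
```

Run it in the folder that holds the PHYLIP executables right after a program has finished, before the next program gets a chance to ask whether it should overwrite outfile.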

Where does the Salmonella sequence go? (This is as expected, parsimony analyses are very sensitive to the Long Branch Attraction (LBA)).
How are the fungal sequences resolved? (What does this tell us about parsimony and missing data?)

To calculate a distance matrix from the aligned sequences, we can use the program PROTDIST. You start it by double clicking on PROTDIST.app. First, we run the program with the default options. When done, rename the outfile dist1.txt. Run the program again, this time selecting a model that considers that different sites change with different probabilities. We select option G. After we start the program by entering Y, we are prompted to enter 1/shape parameter; we choose 1, which corresponds to an exponential distribution.
We rename the outfile dist2.txt.

Phylip has several programs to calculate trees from distance matrices. The fastest is Neighbor (same algorithm as used in clustalw). Neighbor joining is an algorithmic tree building method (not much liked by many evolutionary biologists, because it does not search for a tree that fulfills an optimality criterion). Using Neighbor.app, calculate neighbor joining trees from dist1.txt and dist2.txt. Rename the outtrees Neighbor1.ph and Neighbor2.ph. Inspect the trees using njplot.
What is the difference, how do you explain it?
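To see why neighbor joining is called algorithmic, here is a sketch of the single decision it repeats over and over: picking the next pair of taxa to join. The pair is the one that minimizes Q(i,j) = (n-2)*d(i,j) - r(i) - r(j), where r(i) is the sum of all distances from taxon i. This is the textbook criterion, not code taken from Neighbor or clustalw; the function name and the five-taxon toy matrix are made up for illustration.

```python
from itertools import combinations


def nj_pick_pair(names, dist):
    """Return the pair of taxa that neighbor joining would join first.

    dist maps frozenset({a, b}) to the distance between taxa a and b.
    """
    n = len(names)
    if n < 3:
        raise ValueError("need at least three taxa")
    # r(i): sum of the distances from taxon i to all other taxa
    row_sum = {a: sum(dist[frozenset((a, b))] for b in names if b != a) for a in names}

    def q(a, b):
        return (n - 2) * dist[frozenset((a, b))] - row_sum[a] - row_sum[b]

    return min(combinations(names, 2), key=lambda pair: q(*pair))


if __name__ == "__main__":
    taxa = ["a", "b", "c", "d", "e"]
    raw = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
           ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
           ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
    toy = {frozenset(pair): d for pair, d in raw.items()}
    print(nj_pick_pair(taxa, toy))
```

In this toy matrix d and e have the smallest distance (3), yet the criterion selects a and b. Each choice is final: the algorithm never goes back to compare the finished tree against alternatives, so the input distances (dist1.txt versus dist2.txt) fully determine the result.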

FITCH.app is a program that tries to find the tree that fits the distance matrix with the least amount of error (with the default settings it minimizes a weighted least squares error, placing a higher weight on shorter distances). Especially for trees with many branches, this gives superior results compared to NEIGHBOR, but it can take a long time. Start FITCH.app, load dist2.txt, select option G (global rearrangement) and enter Y to start the program. Rename the outtree FITCH2.ph. Inspect the tree with njplot. What is different compared to Neighbor2.ph? Which one seems more realistic?
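The optimality criterion can be written down in a few lines. The sketch below assumes the Fitch-Margoliash weighting, in which each squared difference between an observed distance D and the corresponding distance d along the tree is divided by D to the power 2; the function name and the two toy distances are invented, and this is not FITCH's own code.

```python
def weighted_sum_of_squares(observed, fitted, power=2.0):
    """Sum of (D - d)**2 / D**power over all pairs of taxa.

    observed: pair -> distance D from the distance matrix
    fitted:   pair -> distance d measured along the branches of the tree
    """
    total = 0.0
    for pair, d_obs in observed.items():
        if d_obs <= 0:
            raise ValueError("observed distances must be positive")
        total += (d_obs - fitted[pair]) ** 2 / d_obs ** power
    return total


if __name__ == "__main__":
    observed = {("yeast1", "yeast2"): 0.1, ("yeast1", "mouse"): 1.0}
    # Hypothetical tree that misses both observed distances by the same amount (0.05)
    fitted = {("yeast1", "yeast2"): 0.15, ("yeast1", "mouse"): 1.05}
    print("weighted:  ", weighted_sum_of_squares(observed, fitted))
    print("unweighted:", weighted_sum_of_squares(observed, fitted, power=0.0))
```

With the weighting, the same absolute misfit counts far more for the short distance than for the long one, which is what "placing a higher weight on shorter distances" means. FITCH rearranges the tree and adjusts branch lengths to make this sum as small as it can.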

In the analyses using PROTDIST, we usually found that the two yeasts did not group together. What could one do to improve the analyses?
