Assignment 10

Your name:
Your email address:

Note: To do these exercises you need to install clustalx, treeview and njplot on your computer.

Download ClustalX from HERE. Drag the ClustalX icon to your Desktop. (Original source is here, for PCs and Macs.)

Download NJplot from HERE (original link is here for PCs and Macs).

Download Seaview (see last computer lab). The latest versions of the seaview program available for different platforms are here.

Download Figtree from here. (The latest versions for different operating systems are here.)

If you're doing this exercise from home, you may need to have WINZIP or STUFF-IT expander installed (read the help manuals to learn how to use these programs) - UConn's ftp server has copies of these programs in the restricted folder (only accessible from within UConn).

Assignments:

1. (20 minutes)

Download this file onto your computer. Find the program clustalx, start it, and load the sequences into the clustalx program (Menu file ... load sequences).
The sequences are named according to the following schema:

A denotes the catalytic ATP binding subunits of the vacuolar proton pumping ATPase that is found on membranes of the eukaryotic endomembrane system and of the archaeal type A-ATPsynthase
- these were called A-subunits, because they are the largest subunits of the water soluble head group of these ATPase/ATPsynthases

bet denotes the catalytic ATP binding subunits of the bacterial type F-ATPsynthase (also present in mitochondria and plastids)
- these were called beta subunits because they are the second largest subunit in the head group

B denotes the non-catalytic ADP binding subunits of the vacuolar proton pumping V-ATPase that is found on membranes of the eukaryotic endomembrane system and of the archaeal type A-ATPsynthase
- these were called B-subunits, because they are the second largest subunits in the head group of these ATPase/ATPsynthases

alp denotes the non-catalytic ADP binding subunits of the bacterial type F-ATPsynthase
- these were called alpha subunits because they are the largest subunit in the head group

fl denotes proteins that if mutated prevent the assembly of the bacterial flagella

The subunit designation is followed by the genus name:
Mus musculus - house mouse, an animal (mammal)
Arabidopsis thaliana - thale cress, a flowering plant (model organism for botanists)
Neurospora crassa - red bread mold, an ascus-forming fungus (model organism for geneticists)
Ecoli - Escherichia coli, a Gram-negative bacterium (model organism for biology)
Aquifex aeolicus - an extremely thermophilic bacterium (model organism for astrobiologists)
Methanosarcina barkeri - a euryarchaeote
Sulfolobus acidocaldarius - a crenarchaeote
Plasmodium, Trypanosoma - flagellated protozoa
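
If you want to sort the sequences by subunit type outside of clustalx, the naming schema above can be taken apart with a few lines of Python. This is only a sketch: it assumes that every name is the subunit code immediately followed by the genus name (as in flSalmonella), and the helper name split_name is made up for this illustration.

```python
# Subunit codes from the naming schema; the descriptions are shortened.
SUBUNITS = {
    "alp": "F-ATPsynthase alpha subunit (non-catalytic)",
    "bet": "F-ATPsynthase beta subunit (catalytic)",
    "fl": "flagellar assembly ATPase",
    "A": "V/A-ATPase A subunit (catalytic)",
    "B": "V/A-ATPase B subunit (non-catalytic)",
}


def split_name(name):
    """Return (subunit_code, genus) for a name such as 'betEcoli'.

    Assumes the code is glued directly onto the genus name. The match is
    case-sensitive, and longer codes are tried first as a precaution.
    """
    for code in sorted(SUBUNITS, key=len, reverse=True):
        if name.startswith(code):
            return code, name[len(code):]
    raise ValueError(f"unrecognised subunit code in {name!r}")
```

For example, split_name("flSalmonella") separates the code "fl" from the genus "Salmonella". Names that do not follow the schema (such as the unprefixed names in exercise 2) would need separate handling.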

Align the sequences using the default options (alignment menu).

Calculate neighbor joining trees (Tree menu ... Draw N-J tree) using the different options: (don't forget to give different names to your trees!)
Exclude positions with gaps (unchecked) - correct for multiple substitutions (unchecked)
Exclude positions with gaps (unchecked) - correct for multiple substitutions (checked)
Exclude positions with gaps (checked) - correct for multiple substitutions (unchecked)
Exclude positions with gaps (checked) - correct for multiple substitutions (checked)
Save the trees into different files using names that allow you to remember which options you used - the absence of a logfile is one of the drawbacks of clustalx
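
To see what the two check boxes do to the distances that go into the neighbor joining calculation, here is a minimal Python sketch. It is not clustalx's code. It assumes that the correction for multiple substitutions is Kimura's empirical formula for proteins, d = -ln(1 - p - p^2/5), and that positions where either sequence of a pair has a gap are always skipped for that pair. The function names are invented for this example.

```python
import math


def gapped_columns(alignment):
    """Indices of columns in which at least one sequence has a gap."""
    return frozenset(
        i for i, column in enumerate(zip(*alignment)) if "-" in column
    )


def pairwise_distance(seq1, seq2, skip_columns=frozenset(), correct=False):
    """Distance between two aligned sequences.

    skip_columns: column indices to ignore for every pair
                  (pass gapped_columns(alignment) to exclude gap positions).
    correct:      apply Kimura's correction for multiple substitutions.
    """
    compared = differences = 0
    for i, (a, b) in enumerate(zip(seq1, seq2)):
        # A gap in either sequence cannot be scored for this pair.
        if i in skip_columns or a == "-" or b == "-":
            continue
        compared += 1
        differences += a != b
    if compared == 0:
        raise ValueError("no comparable positions")
    p = differences / compared  # observed fraction of differing sites
    if correct:
        # Assumed formula: d = -ln(1 - p - p^2/5); breaks down for large p.
        return -math.log(1.0 - p - 0.2 * p * p)
    return p
```

Calling pairwise_distance for every pair of sequences, with skip_columns either empty or set to gapped_columns(alignment), and with correct either False or True, gives the four distance matrices that correspond to the four option combinations. The corrected distance is always at least as large as the uncorrected one, and the difference grows as the sequences become more divergent.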

Using njplot, load the tree that was calculated including positions with gaps and without correction for multiple substitutions. (clustalx generates two types of tree files: *.dnd is the tree that is used to guide the alignment; *.ph is the neighbor joining tree in PHYLIP format. The *.ph trees are the ones you want to explore.)
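
The *.ph files are plain text in the nested-parentheses (Newick) notation, so you can also look at them in a text editor. As a rough illustration, the following Python sketch pulls the sequence names and the lengths of the terminal branches out of such a string. It assumes simple names without quotes or special characters, and the example tree in the test is made up.

```python
import re

# A leaf is a name that follows "(" or "," and is followed by ":length".
# Internal nodes follow ")" and are therefore not matched.
LEAF = re.compile(r"[(,]\s*([^():,;\s]+)\s*:\s*([-+0-9.eE]+)")


def leaf_lengths(newick):
    """Map each leaf name to the length of its terminal branch."""
    return {name: float(length) for name, length in LEAF.findall(newick)}
```

To use it on one of your own trees, read the file into a string first, for example with open("mytree.ph").read(), where mytree.ph stands for whatever name you gave the file.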

In njplot explore the new outgroup option.
Where do you think the root of the tree might be located?
Which subunits are paralogs (i.e., evolved from a gene duplication), and which are probably orthologs (i.e., the homologs are related by a speciation, not a gene duplication event)?
In particular, are the beta and A subunits orthologs or paralogs?
Which of the bifurcations correspond to gene duplications?

In njplot explore the swap nodes option. Does this change the tree?

In njplot explore the subtree option. Can you manage to draw a tree from which only the flagellar assembly ATPase are excluded?

Compare the neighbor joining trees that clustalx calculated using the different options.
What are the differences? (The scale bar indicates the average number of substitutions per site.)

Using a setting for the neighbor joining tree calculation that worked well, perform a bootstrap analysis calculating neighbor joining trees. Inspect the resulting tree in njplot. Given the support values, can you be sure that the flagellar assembly ATPase subunit diverged before the duplication that gave rise to the catalytic and non-catalytic subunits?
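
The bootstrap resamples the columns of the alignment with replacement, calculates a tree from each resampled alignment, and reports how often each group shows up. The resampling step can be sketched in Python as follows. This is a minimal illustration, not the clustalx implementation, and the function name is invented.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one pseudo-replicate of an alignment.

    Columns are drawn with replacement, so some columns appear several
    times and others not at all; the alignment length stays the same.
    """
    length = len(alignment[0])
    picks = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[i] for i in picks) for seq in alignment]
```

Passing a seeded generator such as random.Random(1) as rng makes the replicates reproducible. The support value of a branch is then the number of replicate trees that contain the same group of sequences.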

2. (20 minutes)

Load the sequences contained in testseq1.txt into clustalx. Align the sequences (note the inteins in two of the sequences) and calculate neighbor joining trees with bootstrap support values using the possible permutations of gaps/ no gaps with correction for multiple substitutions. Load the trees into njplot. Root the tree using Sulfolobus and/or Thermococcus as outgroup.
Which of the trees correspond to your expectations? (Sulfolobus and Thermococcus are Archaea, Borrelia is a Spirochete (bacterium), Acetabularia is a green alga, Daucus is a flowering plant (carrot), Candida and Saccharomyces are yeasts, Neurospora is another fungus (not a yeast though), Drosophila is an animal (fruit fly), and Trypanosomes are protists.)

The sequences in testseq1.txt (V/A-ATPase catalytic subunits) are quite similar to one another. To test the effect of long branches, I added a homologous, but only distantly related sequence to this file (the ATPase involved in flagellar assembly from Salmonella). The resulting file is testseq1b.txt .
Align the sequences and calculate neighbor joining trees for this file using the possible permutations of gaps/ no gaps, and with and without correction for multiple substitutions.

Which of the resulting trees appears to best reflect the actual evolution? (note that the Borrelia sequence actually is an archaeal type ATPase acquired through gene transfer)

Give a justification for your choice. What might be the reason that the other options worked less well?

What do you expect to happen, when you replace the Salmonella sequence with a completely (?) unrelated sequence ( testseq1c.txt )?

Is your expectation confirmed?

Discussion of Results :

Analysis of trees obtained with testseq1b: In my opinion, the best trees were obtained with correction for multiple substitutions turned on. Without correction for multiple substitutions, the two longest branches (flSalmonella and Borrelia) group together, and the group of the two yeasts is broken up by the Neurospora sequence. Excluding the positions with gaps resulted in a slight improvement (the yeasts go together).

Analysis of trees obtained with testseq1c: The Synechococcus sequence is not homologous to any of the other sequences. Accordingly, the distance correction does not work in all instances. Without considering positions that contain gaps, the sequence groups with the longest branch (long branch attraction); with positions that contain gaps included, it goes with the Drosophila sequence, probably because the amino-terminal ends of the two sequences match up.

Exclusion of positions with gaps gets rid of a lot of noise (these regions are usually least conserved), and of instances of convergent gap formation. In distance-based analyses, it is dangerous not to exclude columns that contain gaps or missing data. The reason is that these regions often evolve faster, so sequences that contain them (in our case the two yeast sequences) are artificially pushed further apart from each other.

Multiple substitutions at the same site do occur, so it is a good idea to take them into consideration when calculating distances.

3. (20 minutes) PHYLIP is a collection of programs for phylogenetic analyses written by Joe Felsenstein. The programs are freely available (including source code), and can be used on a variety of different operating systems. The programs are modular. Different modules exist to create bootstrap samples, calculate distance matrices, calculate trees from the distance matrices (Fitch and Neighbor), calculate consensus trees, etc. All programs either use files called infile or intree, or alternatively the user needs to provide the file name. We will use the sequences from exercise 2 above (i.e., testseq1b.txt).

We will use the program protpars as implemented in seaview.

This paragraph has comments on phylip that you can safely ignore for now: If you use protpars directly (without the seaview interface), note that phylip by default treats gaps as a 21st character. If you want to treat the gap as missing data, you need to replace the gap symbols with "?". In case you want to use one of the programs on your own, you need to read the excellent manuals that come with the software. Download PHYLIP from HERE. Drag the "phylip-3.68" folder to your Desktop. (The original download location is here for PCs and Macs.)
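
If you do want to run protpars directly, the aligned sequences have to be written in PHYLIP format, and this is a convenient moment to turn the gaps into "?". The following Python sketch shows one way to do this. It assumes the classic layout (a first line with the number of sequences and the number of positions, then each name padded to ten characters and followed by its sequence), and the function name to_phylip is invented for this example.

```python
def to_phylip(records, gap_as_missing=True):
    """Format (name, aligned_sequence) pairs as PHYLIP text.

    Each sequence is written on a single line, so the result is meant
    for short alignments such as the ones in this exercise.
    """
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    lines = [f" {len(records)} {lengths.pop()}"]
    for name, seq in records:
        if gap_as_missing:
            # "?" is read as missing data instead of a 21st character.
            seq = seq.replace("-", "?")
        # Names are cut or padded to exactly ten characters.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file called infile in the folder that contains the program lets protpars find it without asking for a file name. Note that names longer than ten characters are truncated, so check that the shortened names are still unique.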

To calculate a phylogenetic tree from the aligned sequences using protein parsimony, open seaview by double clicking, and drag the file into the alignment window.

Align the sequences using muscle (click on align, then align all).

In the trees menu select parsimony. Uncheck "ignore all gap sites" and check "gaps as unknown state". To do a more thorough search for the tree that explains the data with the least number of substitutions, select "randomize seq order"; 5 times provides a reasonable number of starting points for the heuristic search.
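
To make "the least number of substitutions" concrete, here is a small Python sketch that counts the minimum number of changes one alignment column requires on a given tree, treating "?" as an unknown state. It uses the simple Fitch counting rule, in which every change costs one step. This is a simplification: protpars uses a more elaborate step model for amino acids, so its counts can differ. The tree representation and the names are chosen for this illustration only.

```python
def fitch(tree, states):
    """Return (possible_states, changes) for one alignment column.

    tree:   a leaf name (str) or a (left, right) tuple of subtrees.
    states: dict mapping leaf name -> residue; "?" means unknown.
    """
    if isinstance(tree, str):
        residue = states[tree]
        # An unknown residue fits any state, so it never forces a change.
        return (None if residue == "?" else {residue}), 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    cost = left_cost + right_cost
    if left_set is None:
        return right_set, cost
    if right_set is None:
        return left_set, cost
    common = left_set & right_set
    if common:
        return common, cost       # the subtrees agree: no extra change
    return left_set | right_set, cost + 1  # disagreement: one more change
```

Summing the changes over all columns gives the length of the tree, and the parsimony search looks for the tree with the smallest total. Because an unknown state never adds a change, sequences with many "?" positions constrain the tree only weakly.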

How are the fungal sequences resolved? (What does this tell us about parsimony and missing data?)
Where does the Salmonella sequence go? (This is as expected, parsimony analyses are very sensitive to the Long Branch Attraction (LBA)).

Seaview also allows calculating trees using the maximum likelihood (ml) principle. The ml tree is the tree under which the sequence alignment has the highest probability. To calculate the ml tree for a sequence alignment, select phyml under trees, and run the program with the default settings. (This might take some time. When I calculated my first ml tree for 12 sequences, it took over a week on a UNIX cluster, and the model for sequence evolution was much simpler.)
How does the ml tree compare to the parsimony and neighbor joining trees?

If this takes too long on your iMacs, and you still have time and energy, load one of the trees from assignment 1 (the one with correction for multiple substitutions and without excluding gap positions works well) into Figtree, and explore the different display and annotation options.
Select NODE under selection mode (in the header), click on the branch where you want to place the root, then click on the re-root tool in the header.

Click on clade in the selection mode, and select the different paralogs and color the branches in different colors.

Select the eukaryotic clades and display them in cartoon version (one after the other).

Rotate works as in njplot. (Careful with the highlight feature, it sometimes gives unexpected results.)
