MCB 3421 — Assignment 10

Your name:
Your email address:

Assignments:

1. (30 minutes)

Download ATPaseSU.fasta.txt onto the Desktop. Start seaview and open the sequence file via the File menu (Open).

The sequences are named according to the following schema:

A denotes the catalytic ATP binding subunits of the vacuolar proton pumping ATPase that is found on membranes of the eukaryotic endomembrane system and of the archaeal type A-ATPsynthase
- these were called A-subunits, because they are the largest subunits of the water soluble head group of these ATPase/ATPsynthases

bet denotes the catalytic ATP binding subunits of the bacterial type F-ATPsynthase (also present in mitochondria and plastids)
- these were called beta subunits because they are the second largest subunit in the head group

B denotes the non-catalytic ADP binding subunits of the vacuolar proton pumping V-ATPase that is found on membranes of the eukaryotic endomembrane system and of the archaeal type A-ATPsynthase
- these were called B-subunits, because they are the second largest subunits in the head group of these ATPase/ATPsynthases

alp denotes the non-catalytic ADP binding subunits of the bacterial type F-ATPsynthase
- these were called alpha subunits because they are the largest subunit in the head group

fl denotes proteins that, if mutated, prevent the assembly of the bacterial flagellum

The subunit designation is followed by the genus name:
Mus musculus - house mouse, an animal (mammal)
Arabidopsis thaliana - thale cress, a flowering plant (model organism for botanists)
Neurospora crassa - red bread mold, an ascus-forming fungus (model organism for geneticists)
Ecoli - Escherichia coli, a Gram-negative bacterium (model organism for biology)
Aquifex aeolicus - an extremely thermophilic bacterium (model organism for astrobiologists)
Methanosarcina barkeri - a euryarchaeote
Sulfolobus acidocaldarius - a crenarchaeote
Plasmodium, Trypanosoma - flagellated protozoa

Align the sequences (Align all) using muscle (in Alignment options).
Using Trees—Distance methods, calculate four neighbor joining trees as follows:

"BioNJ", Distance "Observed" (observed means no correction for multiple substitutions), ignore all gap sites UNCHECKED
"BioNJ", Distance "Observed", ignore all gap sites CHECKED
"BioNJ", Distance Poisson or Kimura (choose one, both these models have correction for multiple substitutions), ignore all gap sites UNCHECKED
"BioNJ", Distance Poisson/Kimura (choose same as above), ignore all gap sites CHECKED

Within each tree window, explore the "Re-root" option. (After choosing a new root, switch back to the "Full" option to make the display neater.)
Where do you think the root of the tree might be located?
Which subunits are paralogs (i.e., evolved from a gene duplication), and which are probably orthologs (i.e., homologs related by a speciation, not a gene duplication event)?
In particular, are the beta and A subunits orthologs or paralogs?
Which of the bifurcations correspond to gene duplications?

Explore the "Swap" option. Does this change the tree?

Explore the "Subtree" option. Can you manage to draw a tree from which only the flagellar assembly ATPases are excluded?

Compare the neighbor joining trees calculated using the different options.
What are the differences? (Note: The scale bar indicates the average number of substitutions per site.)
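
To see why the four settings can give different branch lengths, the distances can be sketched by hand. The following is a minimal illustration, not seaview's actual implementation: the toy alignment is made up, and it is an assumption here that with "ignore all gap sites" unchecked a gapped site is skipped only for the pair of sequences that contains the gap. The Poisson correction is d = -ln(1 - p) and Kimura's protein correction is d = -ln(1 - p - 0.2 p^2), where p is the observed fraction of differing sites.

```python
import math

def drop_gap_columns(alignment):
    """Remove every column in which at least one sequence has a gap ('-')."""
    keep = [i for i in range(len(alignment[0]))
            if all(seq[i] != "-" for seq in alignment)]
    return ["".join(seq[i] for i in keep) for seq in alignment]

def observed_distance(seq_a, seq_b):
    """Fraction of differing sites; sites gapped in either sequence are skipped."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)

def poisson_distance(p):
    """Poisson correction for multiple substitutions: d = -ln(1 - p)."""
    return -math.log(1.0 - p)

def kimura_distance(p):
    """Kimura's empirical protein correction: d = -ln(1 - p - 0.2 * p^2)."""
    return -math.log(1.0 - p - 0.2 * p * p)

# Made-up toy alignment (hypothetical, not taken from ATPaseSU.fasta.txt)
toy = ["MKV-LAGT", "MRVALAGS", "MKIALCGT"]

for label, aln in (("gap sites kept", toy),
                   ("gap sites ignored", drop_gap_columns(toy))):
    p = observed_distance(aln[1], aln[2])
    print(label, round(p, 3),
          round(poisson_distance(p), 3), round(kimura_distance(p), 3))
```

Both corrections return a distance larger than the observed one, and the difference grows as the sequences become more divergent, which is why the long branches are the ones that change most between the uncorrected and corrected trees.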

Using a setting for the neighbor joining tree calculation that worked well, perform a bootstrap analysis calculating neighbor joining trees. Check the "Bootstrap" box (the default 100 replicates is fine). After re-rooting the tree on the flagellar subunits (and then going back to "Full" view), check the "Br support" box. Inspect the resulting tree. Given the support values, can you be sure that the flagellar assembly ATPase subunit diverged before the duplication that gave rise to the catalytic and non-catalytic subunits?
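
What the "Bootstrap" box does can be sketched as follows: each replicate is a pseudo-alignment of the same length, built by drawing columns of the original alignment at random with replacement. A tree is then calculated from every replicate, and the support value of a branch is the percentage of replicate trees that contain it. This is a minimal sketch with a made-up toy alignment; the function name is hypothetical and the tree-building step is left out.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(alignment[0])
    picked = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in picked) for seq in alignment]

# Made-up toy alignment (hypothetical data)
toy = ["MKVALAGT", "MRVALAGS", "MKIALCGT"]

rng = random.Random(1)  # fixed seed so the example is repeatable
replicates = [bootstrap_replicate(toy, rng) for _ in range(100)]
print(len(replicates), replicates[0])
```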

Save the tree from the tree window using "Save rooted tree" in the File menu. Open the tree from the bootstrap analysis in FigTree (it should already be installed on the lab computers). Explore different options to display and editorialize the tree (re-root, collapse and/or color different clades). The header allows you to select nodes, clades, or taxa. Once you have made a selection, the available options are turned on in the header (e.g., reroot, after you have selected a "node" (really a branch)). (Note: The automated coloring according to support values only works if the support is given as a fraction of 1.) The menus in the bar on the left allow you to set font sizes, line widths, etc. If you arrive at a nice depiction, export the tree as PDF (in the File menu), and send a copy to gogarten@uconn.edu.
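
If the saved tree carries support values as percentages, they can be rescaled to fractions of 1 before opening the file in FigTree. The following is a minimal sketch under two assumptions that you should check against your own file: the support values are written as node labels directly after a closing parenthesis in the Newick string, and they are percentages. The function name and the example tree are hypothetical.

```python
import re

def rescale_support(newick, divisor=100.0):
    """Divide numeric labels that directly follow ')' by divisor."""
    def replace(match):
        return ")" + format(float(match.group(1)) / divisor, "g")
    # A number right after ')' is read as a support label; branch lengths
    # follow ':' and leaf names follow '(' or ',', so they are left alone.
    return re.sub(r"\)(\d+(?:\.\d+)?)", replace, newick)

# Made-up example tree (hypothetical)
tree = "((A:0.1,B:0.2)95:0.05,(C:0.1,D:0.3)100:0.05,E:0.4);"
print(rescale_support(tree))
```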

2. (15 minutes)

Load the sequences contained in testseq1.txt into seaview. Align the sequences with muscle (note the inteins in two of the sequences) and calculate neighbor joining trees with bootstrap support values using the possible permutations of gaps / no gaps with correction for multiple substitutions (again, choose either Poisson or Kimura as your model). Root the tree using Sulfolobus and/or Thermococcus as outgroup.
Which of the trees corresponds to your expectations? (Sulfolobus and Thermococcus are Archaea, Borrelia is a spirochete (bacterium), Acetabularia is a green alga, Daucus is a flowering plant (carrot), Candida and Saccharomyces are yeasts, Neurospora is another fungus (not a yeast, though), Drosophila is an animal (fruit fly), and trypanosomes are protists.)

The sequences in testseq1.txt (V/A-ATPase catalytic subunits) are quite similar to one another. To test the effect of long branches, I added a homologous, but only distantly related sequence to this file (the ATPase involved in flagellar assembly from Salmonella). The resulting file is testseq1b.txt.
Align the sequences and calculate neighbor joining trees for this file using the possible permutations of gaps / no gaps, and with and without correction for multiple substitutions.

Which of the resulting trees appears to best reflect the actual evolution? (Note that the Borrelia sequence actually is an archaeal type ATPase acquired through gene transfer.)

Give a justification for your choice. What might be the reason that the other options worked less well?

What do you expect to happen when you replace the Salmonella sequence with a completely (?) unrelated sequence (testseq1c.txt)?

Is your expectation confirmed?

Discussion of Results:

Analyses of trees obtained with testseq1b: In my opinion, the best trees were obtained with correction for multiple substitutions turned on. Without correction for multiple substitutions, the two longest branches (flSalmonella and Borrelia) group together, and the group of the two yeasts is broken up by the Neurospora sequence. Excluding the positions with gaps resulted in a slight improvement (the yeasts go together).

Analysis of trees obtained with testseq1c: The Synechococcus sequence is not homologous to any of the other sequences. Accordingly, the distance correction does not work for all instances. When positions with gaps are excluded, the sequence groups with the longest branch (long branch attraction); with positions that contain gaps included, it goes with the Drosophila sequence, probably because the amino-terminal ends of the two sequences match up.

Exclusion of positions with gaps gets rid of a lot of noise (these regions are usually least conserved) and of instances of convergent gap formation. In distance-based analyses, it is dangerous not to exclude columns that contain gaps or missing data. The reason is that these regions often evolve faster; thus, sequences that contain them (in our case the two yeast sequences) are artificially pushed further apart from each other.

Multiple substitutions at the same site do occur, so it is a good idea to take them into consideration when calculating distances.

3. (15 minutes)

PHYLIP is a collection of programs for phylogenetic analyses written by Joe Felsenstein. The programs are freely available (including source code) and can be used on a variety of different operating systems. The programs are modular: different modules exist to create bootstrap samples, calculate distance matrices, calculate trees from the distance matrices (Fitch and Neighbor), calculate consensus trees, etc. All programs either use files called infile or intree, or the user needs to provide the file name. We will use the sequences from exercise 2c above (i.e., testseq1b.txt).

We will use the program protpars as implemented in seaview.

(IGNORE THIS PARAGRAPH if you are in class. It has comments on PHYLIP that you can safely ignore for now: If you use protpars directly (without the seaview interface), note that PHYLIP by default treats gaps as a 21st character. If you want to treat gaps as missing data, you need to replace the gap symbol with "?". In case you want to use one of the programs on your own, you need to read the excellent manuals that come with the software. Download PHYLIP from here for PCs and Macs.)
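
The gap replacement mentioned above can be done with a few lines of code. This is a minimal sketch that assumes the aligned sequences are held as FASTA-formatted text; the function name and the toy sequences are hypothetical, and converting the result to the input format that the PHYLIP programs expect is a separate step that is not shown.

```python
def gaps_to_unknown(fasta_text):
    """Replace '-' by '?' in sequence lines; header lines ('>') stay unchanged."""
    converted = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            converted.append(line)  # names may legitimately contain '-'
        else:
            converted.append(line.replace("-", "?"))
    return "\n".join(converted) + "\n"

# Made-up aligned sequences (hypothetical)
example = ">seq-1\nMKV-LAGT\n>seq-2\nMRVALAGS\n"
print(gaps_to_unknown(example))
```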

To calculate a phylogenetic tree from the aligned sequences using protein parsimony, open seaview by double clicking, and drag the file into the alignment window.

Align the sequences using muscle (click on Align, then Align all).

In the Trees menu select Parsimony. Uncheck "ignore all gap sites" and check "gaps as unknown state". To do a more thorough search for the tree that explains the data with the least number of substitutions, select "randomize seq order"; 5 times provides a reasonable number of starting points for the heuristic search.
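
To make "the least number of substitutions" concrete, the following sketch counts the minimum number of changes that one alignment column requires on a given tree, using plain Fitch parsimony. This is an illustration only: protpars uses a more elaborate scoring that takes the genetic code into account, and the tree representation (nested tuples), taxon names, and column data are made up. An unknown state ("?") is treated as compatible with every amino acid, which corresponds to the "gaps as unknown state" setting.

```python
def fitch(tree, states):
    """Return (possible_states, changes) for a binary tree given as nested tuples.

    possible_states is None when everything below the node is unknown ('?').
    """
    if isinstance(tree, str):  # leaf: look up its character
        state = states[tree]
        return (None if state == "?" else {state}), 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    cost = left_cost + right_cost
    if left_set is None:   # unknown data never forces a change
        return right_set, cost
    if right_set is None:
        return left_set, cost
    common = left_set & right_set
    if common:             # children agree: no change needed
        return common, cost
    return left_set | right_set, cost + 1  # children disagree: one change

# One made-up alignment column for four hypothetical taxa
column = {"A": "K", "B": "K", "C": "R", "D": "R"}
print(fitch((("A", "B"), ("C", "D")), column)[1])  # 1 change
print(fitch((("A", "C"), ("B", "D")), column)[1])  # 2 changes
```

Summing these counts over all columns gives the length of a tree; the parsimony search looks for the tree with the smallest sum. Note that a taxon with "?" in a column contributes no changes, whichever branch it is attached to.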

How are the fungal sequences resolved? (What does this tell us about parsimony and missing data?)
Where does the Salmonella sequence go? (This is as expected: parsimony analyses are very sensitive to long branch attraction (LBA).)

Seaview also allows calculating trees using the maximum likelihood (ml) principle. The ml tree is the tree under which the sequence alignment has the highest probability. To calculate the ml tree for a sequence alignment, select phyml under trees, and run the program with the default settings.
How does the ml tree compare to the parsimony and neighbor joining trees?
