GOALS:

·      Learn how to get your data into PUZZLE and the results out.

·      Gain some understanding of what to do with the maximum likelihood (ML) ratio test.

·      Think about models for the substitution process.

·      Learn about the different options in PUZZLE.

General Remarks:

If no file called infile is present in the program directory, PUZZLE will prompt you for the input file.  If you work a lot on the same data set, it might be worth renaming your data file to infile.  The output is always written to the files outfile, outtree and outdist.

You have to rename these files as soon as the run is done.  The next time PUZZLE starts, the files are overwritten (this is particularly sad if your run took a couple of days).
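
The renaming can be scripted so that it is not forgotten. The following is a minimal Python sketch and not part of PUZZLE; the function name archive_puzzle_output and the run label are made up for illustration, and it assumes the three output files sit in the directory you pass in.

```python
from pathlib import Path

# Names of the files PUZZLE always writes its results to (see above).
PUZZLE_OUTPUT_FILES = ("outfile", "outtree", "outdist")


def archive_puzzle_output(directory, label):
    """Rename outfile/outtree/outdist to <label>.outfile etc.

    Returns the list of new file names. Files that do not exist are
    skipped, and an existing target is never overwritten.
    """
    directory = Path(directory)
    renamed = []
    for name in PUZZLE_OUTPUT_FILES:
        source = directory / name
        target = directory / f"{label}.{name}"
        if not source.exists():
            continue
        if target.exists():
            raise FileExistsError(f"{target} already exists; pick another label")
        source.rename(target)
        renamed.append(target.name)
    return renamed


# Example call (hypothetical label for the run that just finished):
# archive_puzzle_output(".", "testseq5_mapping")
```

Refusing to overwrite an existing target is a deliberate choice here: it protects an earlier run if you reuse a label by accident.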

 

1.  Transfer the files testseq5.txt, testseq5.aln and testseq5.phb onto your computer.  These files contain vacuolar/archaeal ATPases from 49 prokaryotes and eukaryotes: 

Daucus carota, Arabidopsis thaliana, Gossypium hirsutum are plants;

Acetabularia acetabulum is a green alga and Cyanidium caldarium is a red alga;

Mus, Homo, Bos (mammals), Gallus (bird), Drosophila and Aedes (insects) are animals;

Saccharomyces, Candida, Schizosaccharomyces, Eremothecium and Neurospora are fungi

Dictyostelium discoideum, Entamoeba, Plasmodium falciparum, Trypanosoma, Nosema, and Giardia are protists or protozoa;

Sulfolobus acidocaldarius (70 °C), Sulfolobus solfataricus (70-85 °C), Archaeoglobus fulgidus (83 °C), Methanosarcina barkeri (30-37 °C), Methanosarcina mazeii (37 °C), Methanococcus jannaschii (80 °C), Haloferax volcanii (37 °C), Halobacterium salinarium (ca. 37 °C), Methanobacterium thermoautotrophicum (60-65 °C), Desulfurococcus sp. (85-90 °C), Thermococcus sp. (75+ °C), Aeropyrum pernix (90 °C), Thermoplasma acidophilum (55-60 °C), Pyrococcus abyssi and Pyrococcus horikoshii (95 °C) are archaea;

Enterococcus hirae (37 °C), Borrelia burgdorferi (33-37 °C), Thermus thermophilus (70-80 °C), Deinococcus radiodurans (30 °C), Chlamydia trachomatis and Chlamydophila pneumoniae are bacteria. Archaea and bacteria are both prokaryotes. (Usually bacteria have an F-ATPase and not an A-ATPase. The bacteria listed here probably obtained the archaeal/vacuolar type ATPase through horizontal gene transfer.) 

Note: the German Collection of Microorganisms and Cell Cultures (DSMZ) is a good place to find growth temperatures for microorganisms. 

The prokaryotes can be considered as an outgroup for the eukaryotes.

2.    Use maximum likelihood mapping to address the question whether the Giardia or the Trichomonas sequence represents the deeper-branching lineage among the eukaryotic homologues.  As we are not sure about the relative branching order of Trypanosoma and Plasmodium, do not assign them to any group. 

For this you need to:

*         Load testseq5.aln into clustalx.

*         Save the sequences (they should already be aligned) in phylip format.  Check that truncating the sequence names does not result in duplicates. 

*         Start PUZZLE by double-clicking on the PUZZLE icon.

*         Load testseq5.phy (or whatever name you gave the file).

*         Select type of analysis: maximum likelihood mapping (“b”)

*         Select sequences in 4 clusters (“g”)

*         Assign sequences to the 4 groups (i.e. the 4 groups attached to the branch you want to study).  Ask if you cannot figure it out.

*         Select number of quartets (“n”) and enter “0” to select all (as two of your clusters contain only a single sequence, this number is quite small).

*         Select the model of heterogeneity: select Gamma distribution with 8 classes and enter the shape parameter as 0.6.  DO NOT have the program find the shape parameter (this takes about 30 minutes).

*         Interpret the outfiles. 
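
The check for duplicate truncated names in the steps above can be done before exporting. This is a small sketch that assumes names are cut to their first 10 characters, as in the classic phylip format; the function name truncation_clashes and the example names are hypothetical, and the names in testseq5.aln may differ.

```python
from collections import defaultdict


def truncation_clashes(names, width=10):
    """Group names that become identical when cut to `width` characters."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:width]].append(name)
    # Keep only truncated names shared by more than one full name.
    return {short: full for short, full in groups.items() if len(full) > 1}


# Hypothetical sequence names, for illustration only.
example = [
    "Sulfolobus_acidocaldarius",
    "Sulfolobus_solfataricus",
    "Methanosarcina_barkeri",
    "Methanosarcina_mazeii",
    "Giardia",
]
for short, full in truncation_clashes(example).items():
    print(short, "<-", ", ".join(full))
```

Any name reported here needs to be shortened by hand (for example to a genus abbreviation plus species) before saving the phylip file.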

3.  Choose one of the following problems to address using the maximum likelihood ratio test: 

A)    Do the V-ATPase A subunits from the red and green algae form a monophyletic group with the higher plants?

  For this you

*   Delete all prokaryotic sequences from testseq5.aln using clustalx

*   Save as phylip file (testseq5b.phy)

*   Use the treeview editor to build the appropriate trees to test.  A good starting tree is this one [it is a neighbor-joining tree calculated from distances that were estimated using among-site rate variation (gamma distribution with alpha = 0.6) and the JTT substitution model.  I then collapsed all branches that in a maximum likelihood reconstruction were not at least 2.5 standard deviations larger than zero.]  You can load several trees into PUZZLE simultaneously.  If you do so, PUZZLE also performs a Kishino-Hasegawa test to determine which of the trees is significantly better than the others.  An example of a file with multiple user trees is here.  You can use treeview to edit your tree to generate the different starting trees, and you can copy directly from treeview and paste into a text editor.  The only problem is that treeview puts single quotes around sequence names that start with a number, e.g. '1Acetabularia'.  You need to remove these quotes before you can read the trees into PUZZLE! 

*   Save the trees you want to test (multiple trees should be separated by ;) in a text file in the same directory as the PUZZLE program

*   Start PUZZLE

*   Load testseq5b.phy 

*   Select tree search procedure and toggle to “User defined trees”

*   Select model of heterogeneity (in the interest of time you might leave it at the default setting)

*   Calculate branch lengths and likelihood values for the different trees. 

*   Use the chi-square distribution (see handout) to determine whether the increase in likelihood, 2(logL1 - logL2), is significant. 
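
Removing the quotes that treeview adds and joining several trees into one file can be scripted. The sketch below is only an illustration: the function names and the toy trees are hypothetical, it assumes that sequence names contain no quote characters of their own, and writing one tree per line is a layout choice made here for readability (compare the result with the example file before using it).

```python
def clean_treeview_tree(newick):
    """Remove the quotes treeview puts around names and ensure a final ';'."""
    # Straight quotes plus the curly variants a word processor may substitute.
    for quote in ("'", "\u2018", "\u2019"):
        newick = newick.replace(quote, "")
    newick = newick.strip()
    return newick if newick.endswith(";") else newick + ";"


def build_user_tree_file(trees):
    """Return the text of a multi-tree file: one cleaned tree per line."""
    return "\n".join(clean_treeview_tree(tree) for tree in trees) + "\n"


# Hypothetical toy trees; the real ones come from the treeview editor.
trees = [
    "(('1Acetabularia','2Cyanidium'),(Daucus,Arabidopsis),Homo);",
    "(('1Acetabularia',(Daucus,Arabidopsis)),'2Cyanidium',Homo)",
]
print(build_user_tree_file(trees), end="")
```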

B)    Does the incorporation of among site rate variation lead to a significant increase in likelihood?

For this you

*   Delete all prokaryotic sequences from testseq5.aln using clustalx

*   Save as phylip file (testseq5b.phy)

*   Save this treefile (see above) in your PUZZLE directory

*   Start PUZZLE

*   Load testseq5b.phy 

*   Select tree search procedure and toggle to “User defined trees”

*   Select the default model of heterogeneity

*   Start the program

*   When prompted, enter the name of the treefile

*   When done, save outfile under a different name!

*   Run PUZZLE again, but select a different model of rate heterogeneity (Gamma with 8 categories and alpha = 0.6)

*   Use the chi-square distribution to determine whether the increase in likelihood between the two runs, 2(logL1 - logL2), is significant. 
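
The final comparison can be sketched in a few lines of Python. This is an illustration of the likelihood ratio test as described above, not output from PUZZLE: the log-likelihood values are made up, the function names are hypothetical, and the degrees of freedom (df) must be taken from the handout for the comparison you are making; df=1 below is only a placeholder.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) of a chi-square with integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for df = 2 and df = 1, then the recurrence
    # Q(x; m + 2) = Q(x; m) + (x/2)^(m/2) * exp(-x/2) / Gamma(m/2 + 1).
    if df % 2 == 0:
        q, m = math.exp(-half), 2
    else:
        q, m = math.erfc(math.sqrt(half)), 1
    while m < df:
        q += half ** (m / 2.0) * math.exp(-half) / math.gamma(m / 2.0 + 1.0)
        m += 2
    return q


def likelihood_ratio_test(log_l_better, log_l_worse, df):
    """Return the statistic 2(logL1 - logL2) and its chi-square p-value."""
    statistic = 2.0 * (log_l_better - log_l_worse)
    return statistic, chi2_sf(statistic, df)


# Made-up log-likelihoods, NOT results from testseq5b.phy.
stat, p = likelihood_ratio_test(-12345.6, -12350.1, df=1)
print(f"2(logL1 - logL2) = {stat:.2f}, p = {p:.4f}")
```

Read the two log-likelihoods from the two saved outfiles and pass the larger one first, so that the statistic is positive.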

4.   The data set in testseq5.aln contains organisms growing at very different temperatures.  Does the environmental temperature influence the amount of among-site rate variation (ASRV)?  To address this question we can estimate the shape parameter (= the amount of ASRV) separately for subsets of sequences from thermophilic and mesophilic organisms. 

*        Using clustalx generate three data sets:

      4-7 prokaryotes with a growth temperature above 50 °C
      a comparable** group of 4-7 prokaryotes with a growth temperature below 50 °C
      a data set containing both of the above groups. 

** Use this tree to choose the appropriate (same number of sequences, similar relationships) subsets.

*        Analyse all three of these data sets using PUZZLE (use the default options, but select a rate heterogeneity model described by a Gamma distribution with 8 rate categories, and have PUZZLE estimate the shape parameter).
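
As a first pass at choosing the subsets, the growth temperatures listed in step 1 can be split at 50 °C. This sketch only sorts the candidates; picking 4-7 comparable sequences from each list still has to be done by eye using the tree. The function name is hypothetical, the lower bound is used where a range is given, and organisms without a listed temperature are left out.

```python
# Growth temperatures in °C as listed in step 1. For ranges the lower bound
# is used; the "ca." and "+" qualifiers are dropped.
GROWTH_TEMP_C = {
    "Sulfolobus acidocaldarius": 70,
    "Sulfolobus solfataricus": 70,
    "Archaeoglobus fulgidus": 83,
    "Methanosarcina barkeri": 30,
    "Methanosarcina mazeii": 37,
    "Methanococcus jannaschii": 80,
    "Haloferax volcanii": 37,
    "Halobacterium salinarium": 37,
    "Methanobacterium thermoautotrophicum": 60,
    "Desulfurococcus sp.": 85,
    "Thermococcus sp.": 75,
    "Aeropyrum pernix": 90,
    "Thermoplasma acidophilum": 55,
    "Pyrococcus abyssi": 95,
    "Pyrococcus horikoshii": 95,
    "Enterococcus hirae": 37,
    "Borrelia burgdorferi": 33,
    "Thermus thermophilus": 70,
    "Deinococcus radiodurans": 30,
}


def split_by_temperature(temps, threshold=50):
    """Return (above, below): sorted names growing above/below the threshold."""
    above = sorted(name for name, t in temps.items() if t > threshold)
    below = sorted(name for name, t in temps.items() if t < threshold)
    return above, below


thermophiles, mesophiles = split_by_temperature(GROWTH_TEMP_C)
print("Above 50 °C:", ", ".join(thermophiles))
print("Below 50 °C:", ", ".join(mesophiles))
```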

Is the among site rate variation different for the two sets of species? 

Discuss your findings.