GOALS:
· Learn how to get your data into PUZZLE and the results out.
· Attain some understanding of what to do with the maximum likelihood ratio test.
· Think about models for the substitution process.
· Learn about the different options in PUZZLE.
General Remarks:
If no file called infile is present in the program directory, PUZZLE will prompt you for the name of the input file. If you work a lot on the same data set, it might be worth renaming your data file to infile. The output is always written to the files outfile, outtree and outdist.
Rename these files as soon as a run is done. The next time PUZZLE starts, the files are overwritten (this is particularly sad if your run took a couple of days).
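One way to make the renaming a habit is a small script run after every analysis. The following is only a sketch: the three output names come from the remarks above, while the function name keep_puzzle_outputs and the run label used as prefix are hypothetical.

```python
from pathlib import Path

# Output names as given in the handout; everything else here is illustrative.
PUZZLE_OUTPUTS = ("outfile", "outtree", "outdist")


def keep_puzzle_outputs(directory, prefix):
    """Rename outfile/outtree/outdist to <prefix>.<name> so the next run cannot erase them."""
    renamed = []
    for name in PUZZLE_OUTPUTS:
        source = Path(directory) / name
        if not source.exists():
            continue  # skip files that this run did not produce
        target = source.with_name(f"{prefix}.{name}")
        if target.exists():
            # Refuse to overwrite results saved from an earlier run.
            raise FileExistsError(f"{target} already exists; choose another prefix")
        source.rename(target)
        renamed.append(target.name)
    return renamed


if __name__ == "__main__":
    # "mlmapping_run1" is a made-up label; use something that describes your run.
    print(keep_puzzle_outputs(".", "mlmapping_run1"))
```

Refusing to overwrite an existing target is a deliberate choice: losing an old result to a reused label is the same accident the renaming is meant to prevent.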
1. Transfer the files testseq5.txt, testseq5.aln and testseq5.phb onto your computer. These files contain vacuolar/archaeal ATPases from 49 pro- and eukaryotes:
Daucus carota, Arabidopsis thaliana and Gossypium hirsutum are plants;
Acetabularia acetabulum is a green alga and Cyanidium caldarium is a red alga;
Mus, Homo, Bos (mammals), Gallus (bird), Drosophila and Aedes (insects) are animals;
Saccharomyces, Candida, Schizosaccharomyces, Eremothecium and Neurospora are fungi;
Dictyostelium discoideum, Entamoeba, Plasmodium falciparum, Trypanosoma, Nosema and Giardia are protists or protozoa;
Sulfolobus acidocaldarius (70 °C), Sulfolobus solfataricus (70-85 °C), Archaeoglobus fulgidus (83 °C), Methanosarcina barkeri (30-37 °C), Methanosarcina mazeii (37 °C), Methanococcus jannaschii (80 °C), Haloferax volcanii (37 °C), Halobacterium salinarium (ca. 37 °C), Methanobacterium thermoautotrophicum (60-65 °C), Desulfurococcus sp. (85-90 °C), Thermococcus sp. (75+ °C), Aeropyrum pernix (90 °C), Thermoplasma acidophilum (55-60 °C), Pyrococcus abyssi and Pyrococcus horikoshii (95 °C) (archaea),
Enterococcus hirae (37 °C), Borrelia burgdorferi (33-37 °C), Thermus thermophilus (70-80 °C), Deinococcus radiodurans (30 °C), Chlamydia trachomatis and Chlamydophila pneumoniae (Bacteria) are prokaryotes.
(Bacteria usually have an F-ATPase rather than an A-ATPase. These bacteria probably obtained the archaeal/vacuolar-type ATPase through horizontal gene transfer.)
Note: the German Collection of Microorganisms and Cell Cultures is a good place to find growth temperatures for microorganisms.
The prokaryotes can be considered as an outgroup for the eukaryotes.
2. Use maximum likelihood mapping to address the question of whether the Giardia or the Trichomonas sequence represents the deeper branching lineage among the eukaryotic homologues. As we are not sure about the relative branching order of Trypanosoma and Plasmodium, do not assign them to any group.
For this you need to:
Load testseq5.aln into clustalx.
Save the sequences (they should already be aligned) in phylip format. Check that truncating the sequence names does not result in duplicates.
Start PUZZLE by double-clicking on the PUZZLE icon.
Load testseq5.phy (or whatever name you gave the file).
Select the type of analysis: maximum likelihood mapping (“b”).
Select sequences in 4 clusters (“g”).
Assign sequences to the 4 groups (i.e. the 4 groups attached to the branch you want to study). Ask if you cannot figure it out.
Select the number of quartets (“n”) and enter “0” to select all (as two of your clusters contain a single sequence only, this number is quite small).
Select the model of rate heterogeneity: select a Gamma distribution with 8 classes and enter the shape parameter as 0.6. DO NOT have the program find the shape parameter (this takes about 30 minutes).
Interpret the outfiles.
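The check for duplicate names after truncation (second step above) can be done by eye, but a short script is less error-prone with 49 sequences. This is a sketch under two assumptions: that names are cut to 10 characters, as in the classic phylip format, and that the labels look like the hypothetical ones in the example (the real labels in testseq5.aln may differ).

```python
from collections import defaultdict


def truncation_clashes(names, width=10):
    """Return {truncated name: [original names]} for names that collide after truncation.

    width=10 assumes the classic phylip name field; change it if your
    export uses a different length.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name[:width]].append(name)
    return {short: full for short, full in groups.items() if len(full) > 1}


if __name__ == "__main__":
    # Hypothetical labels built from species in this data set.
    labels = [
        "Methanosarcina_barkeri",
        "Methanosarcina_mazeii",
        "Sulfolobus_acidocaldarius",
        "Sulfolobus_solfataricus",
        "Giardia",
    ]
    for short, originals in truncation_clashes(labels).items():
        print(short, "<-", ", ".join(originals))
```

Any name reported here needs to be edited in clustalx (for example by abbreviating the genus) before the phylip file is saved.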
3. Choose one of the following problems to address using the maximum likelihood ratio test:
A) Do the V-ATPase A subunits from the red and green algae form a monophyletic group with the higher plants?
For this you:
Delete all prokaryotic sequences from testseq5.aln using clustalx.
Save as phylip file (testseq5b.phy).
Use the treeview editor to build the appropriate trees to test. A good starting tree is this one [it is a neighbor joining tree calculated from distances that were estimated using among site rate variation (gamma distribution with alpha=0.6) and the JTT substitution model. I then collapsed all branches that in a maximum likelihood reconstruction were not at least 2.5 standard deviations larger than zero.] You can load several trees into PUZZLE simultaneously. If you do so, PUZZLE also performs a Kishino-Hasegawa test to determine which of the trees is significantly better than the others. An example of a file with multiple user trees is here. You can use treeview to edit your tree to generate the different starting trees. You can copy directly from treeview and paste into a text editor. The only problem is that treeview adds single quotes (' ') around sequence names that start with a number, e.g. '1Acetabularia'. You need to remove the quotes before you can read the trees into PUZZLE!
Save the trees you want to test (multiple trees should be separated by ;) in a text file in the same directory as the PUZZLE program.
Start PUZZLE.
Load testseq5b.phy.
Select tree search procedure and toggle to “User defined trees”.
Select the model of rate heterogeneity (in the interest of time you might leave it at the default setting).
Calculate branch lengths and likelihood values for the different trees.
Use the chi-square distribution (see handout) to determine if the increase in likelihood (2(logL1-logL2)) is significant.
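Removing the quotes that treeview adds can be done by hand in a text editor, or with a few lines of code when there are several trees. The sketch below assumes that none of your sequence names contains a quote of its own and that every tree ends with a semicolon; both function names are hypothetical.

```python
def strip_treeview_quotes(tree_text):
    """Remove the single quotes treeview puts around names that start with a digit.

    Assumes no sequence name legitimately contains a single quote.
    """
    return tree_text.replace("'", "")


def count_trees(tree_text):
    """Count the trees in a multi-tree file, assuming each tree ends with ';'."""
    return tree_text.count(";")


if __name__ == "__main__":
    # Toy tree for illustration only; it is not one of the trees to test.
    pasted = "('1Acetabularia',(Homo,Mus));"
    print(strip_treeview_quotes(pasted))
    print(count_trees(pasted), "tree(s) found")
```

Counting the semicolons before starting PUZZLE is a quick way to confirm that every tree you pasted into the file is terminated properly.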
B) Does the incorporation of among site rate variation lead to a significant increase in likelihood?
For this you:
Delete all prokaryotic sequences from testseq5.aln using clustalx.
Save as phylip file (testseq5b.phy).
Save this treefile (see above) in your PUZZLE directory.
Start PUZZLE.
Load testseq5b.phy.
Select tree search procedure and toggle to “User defined trees”.
Select the default model of rate heterogeneity.
Start the program.
When prompted, enter the name of the treefile.
When done, save outfile under a different name!
Run PUZZLE again, but select a different model of rate heterogeneity (Gamma with 8 categories and alpha=0.6).
Use the chi-square distribution to determine if the increase in likelihood between the two runs (2(logL1-logL2)) is significant.
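Instead of looking up the test statistic in a chi-square table, the tail probability can be computed directly. The following is a minimal sketch using only the Python standard library. The log likelihoods in the example are made up, and df=1 is an assumption (one extra parameter, the Gamma shape parameter); use the degrees of freedom given in the handout for your comparison.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square distribution with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for df = 1 and df = 2, then the recurrence
    # Q(k) = Q(k - 2) + (x/2)^(k/2 - 1) * exp(-x/2) / Gamma(k/2).
    if df % 2:
        q, k = math.erfc(math.sqrt(half)), 1
    else:
        q, k = math.exp(-half), 2
    while k < df:
        k += 2
        q += half ** (k / 2 - 1) * math.exp(-half) / math.gamma(k / 2)
    return q


def likelihood_ratio_test(lnl_simple, lnl_complex, df=1):
    """Return the statistic 2(lnL_complex - lnL_simple) and its chi-square tail probability."""
    statistic = 2.0 * (lnl_complex - lnl_simple)
    return statistic, chi2_sf(statistic, df)


if __name__ == "__main__":
    # Made-up log likelihoods; take the real values from your two outfiles.
    stat, p = likelihood_ratio_test(lnl_simple=-12345.6, lnl_complex=-12300.1, df=1)
    print(f"2*delta lnL = {stat:.2f}, p = {p:.3g}")
```

A quick sanity check is that a statistic of 3.841 with one degree of freedom gives a probability of about 0.05, matching the usual table value.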
4. The data set in testseq5.aln contains organisms growing at very different temperatures. Does the environmental temperature influence the amount of among site rate variation? To address this question we can estimate the shape parameter (= the amount of ASRV) separately for subsets of sequences from thermophilic and mesophilic organisms.
Using clustalx, generate three data sets:
4-7 prokaryotes with a growth temperature above 50 °C;
a comparable** group of 4-7 prokaryotes with a growth temperature below 50 °C;
a data set containing both of the above groups.
** Use this tree to choose the appropriate (same number of sequences, similar relationships) subsets.
Analyse all three of these data sets using PUZZLE (use the default options, but select a rate heterogeneity model described by a Gamma distribution with 8 rate categories, and have PUZZLE estimate the shape parameter).
Is the among site rate variation different for the two sets of species?
Discuss your findings.