Assignment #5: Exploring PHYLIP package

PHYLIP HTML DOCUMENTATION

(The latest versions [3.6a2.1] of the programs from the PHYLIP package that we are going to use today can be copied to your computer from here)

Dataset #1: - testseq4.phy

The sequences in testseq4 are small ribosomal RNAs from bacteria and mitochondria, with two archaeal sequences that can be used as an outgroup:

MC THLITH: Methanococcus sp., an archaeon; included to be used as outgroup
HR SACCHA: Halorubrum sp., another archaeon
E COLI: Escherichia coli, a gamma proteobacterium
TT MARITI: Thermotoga maritima, an extremely thermophilic bacterium
AQU PYROP: Aquifex pyrophilus, another extremely thermophilic bacterium
D RADIODU: Deinococcus radiodurans, the "radiation resistant berry from hell", a relative of Thermus
FRA SCN10: Frankia sp., an actinomycete (despite the name, these are bacteria, NOT fungi!)
HER AURAN: Herpetosiphon aurantiacus, a green non-sulfur bacterium, often assumed to be related to the Deinococcales.
RIC PROWA: Rickettsia prowazekii, an intracellular alpha proteobacterium, often assumed to be closely related to mitochondria
RIC BELLI: Rickettsia bellii, another Rickettsia species

Mitochondrial sequences from:

ZEA MAYS: corn (plant)
ASPE THO: Aspergillus (fungus)
CLMY REI: Chlamydomonas reinhardtii (single-celled green alga)
PRTH WIC: Prototheca wickerhamii (colorless unicellular "green" alga)
DROS YA2: Drosophila (fruit fly, animal)
HOMO_SA7: Human (vertebrate, animal)

The rRNA in animal mitochondria evolves much faster than the rRNA in bacteria and in plant mitochondria.
This dataset is designed to test and explore the long branch attraction artifact.

Run DNAPARS on the original dataset (use default options). The results will be written to the file named "outfile". !!! Rename it !!! Otherwise the next PHYLIP program you run will overwrite it.

Explore the different options. (If in doubt, check the manual).

Why might it be useful to jumble the input order and run repeated analyses?

What is the number of steps in the most parsimonious tree?
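To make "number of steps" concrete, here is a minimal sketch of Fitch's counting rule for a fixed rooted binary tree. It is an illustration only, not DNAPARS's actual implementation; the nested-tuple tree format and the function names `fitch_site` and `parsimony_steps` are assumptions made for this example, and gaps or ambiguity codes are not given any special treatment.

```python
def fitch_site(tree, states):
    """Return (possible_states, steps) for one alignment column.

    `tree` is either a leaf name (str) or a pair (left, right) of subtrees.
    `states` maps each leaf name to its character at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_site(left, states)
    right_set, right_steps = fitch_site(right, states)
    shared = left_set & right_set
    if shared:
        # The two subtrees can agree on a state: no extra change needed.
        return shared, left_steps + right_steps
    # No shared state: one change is required on this part of the tree.
    return left_set | right_set, left_steps + right_steps + 1


def parsimony_steps(tree, alignment):
    """Sum the Fitch step counts over all columns of `alignment`.

    `alignment` maps leaf names to equal-length sequence strings.
    """
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


toy = {"A": "AA", "B": "AA", "C": "GA", "D": "GA"}
print(parsimony_steps((("A", "B"), ("C", "D")), toy))
print(parsimony_steps((("A", "C"), ("B", "D")), toy))
```

The first toy tree groups the identical sequences and needs fewer steps than the second. Parsimony programs search over many trees for the one with the smallest such count.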

Do your analyses always get the same tree with the same number of steps?

How are gaps treated by this program? Does this make sense?

How could you change this? Do it. (If you need a hint, take a peek here.)
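One possible way to modify the alignment, sketched under the assumption that the intended change is to recode gap characters (-) as unknown characters (?); check the DNAPARS manual for how each symbol is counted before relying on this. The function name `recode_gaps` is hypothetical, and the sketch assumes a strict PHYLIP file in which the first line is the header and each taxon name occupies the first 10 columns of its line.

```python
def recode_gaps(phylip_text, name_width=10):
    """Replace '-' with '?' in the sequence data of a PHYLIP alignment.

    The first `name_width` columns of the first block are treated as
    taxon names and left untouched (assumption: strict PHYLIP layout).
    """
    lines = phylip_text.splitlines()
    n_taxa = int(lines[0].split()[0])
    out = [lines[0]]
    named_lines_left = n_taxa
    for line in lines[1:]:
        if named_lines_left > 0 and line.strip():
            # Line from the first block: protect the name columns.
            name, seq = line[:name_width], line[name_width:]
            out.append(name + seq.replace("-", "?"))
            named_lines_left -= 1
        else:
            # Blank line or a later interleaved block (no names).
            out.append(line.replace("-", "?"))
    return "\n".join(out) + "\n"


example = "2 9\nSeq-A     AC-GT-\nSeq-B     A--GTA\n\nAC-\nA-T\n"
print(recode_gaps(example))
```

Write the result to a new file rather than overwriting the original, so that both versions of the alignment remain available for comparison.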

Run DNAPARS on the modified alignment. Do you get a different tree?

Does it correspond more closely to your expectation?

Open one of the trees you calculated in a text editor, and copy the tree into TreeView (via [ctrl]-C [ctrl]-V). Edit the tree in TreeView so that it corresponds to expectation (i.e., the mitochondrial sequences should all group together). Paste this tree and the trees that resulted from #3 and #1 into a single text file. The first line should be a "3" (i.e., the number of trees), and the trees should be separated from one another by ";". If this is too cryptic, look at this file for an example. If you generate your user trees with TreeView, one unfortunate difference in handling names is the treatment of names that contain numbers: TreeView adds single quotes (') around these names. For example, a name like Dros_4 turns into 'Dros_4'. The problem is that DNAPARS will not find the sequence 'Dros_4', only Dros_4. You need to remove the quotes (') from the edited trees in your text editor.
Save your trees (feel free to use more than one additional tree) as a file called intree, or give it any name you like.
Run DNAPARS on testseq4c.phy (see question #3). Select the usertree option, and when prompted, enter the name of your usertree file.
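The user tree file described above (a count on the first line, then one tree per entry, each ended by ";", with TreeView's single quotes removed) can also be assembled with a short script. This is a minimal sketch; the function name `build_usertree_file` and the example trees are made up for illustration, and the sketch assumes that no taxon name legitimately contains a single quote.

```python
def build_usertree_file(trees):
    """Return the text of a user tree file for a list of Newick strings.

    Single quotes are stripped from names, every tree is terminated
    with ';', and the number of trees is written on the first line.
    """
    cleaned = []
    for tree in trees:
        tree = tree.strip().replace("'", "")
        if not tree.endswith(";"):
            tree += ";"
        cleaned.append(tree)
    return "\n".join([str(len(cleaned))] + cleaned) + "\n"


# Hypothetical trees, only to show the format.
text = build_usertree_file(["(A,('Dros_4',B));", "(A,(B,Dros_4))"])
print(text)

# To save it under the default name mentioned above:
# with open("intree", "w") as handle:
#     handle.write(text)
```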

Are the trees in the usertree file considered to be significantly different? (The answer should be in outfile.)

Run DNADIST on the original dataset. Use the default values, then repeat the analysis with at least two other distance measures (option D; don't worry, we will talk about these in more detail. For now, all you need to know is that LogDet should be insensitive to compositional bias, and that Jukes-Cantor is a simpler model of substitution than the Kimura two-parameter model, which in turn is simpler than F84). If you are confident that you will not crash the program, you can append the distance matrices to a single file.

Use FITCH to calculate trees from the distance matrices you calculated in the previous step. Turn on global rearrangements [option G] and randomization of input order [option J]. (If you appended the matrices to a single file, you need to use the M option.) Use the default options for everything else.

Use NEIGHBOR to calculate trees from the distance matrices.
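NEIGHBOR builds its tree by neighbor joining (or, optionally, UPGMA). As a rough illustration of the first neighbor-joining step, the sketch below computes the Q criterion, Q(i,j) = (n-2)·d(i,j) - Σd(i,k) - Σd(j,k), and picks the pair with the smallest value. This is a simplified sketch of the published method, not NEIGHBOR's code; the function name `nj_pick_pair` and the toy distance matrix are made up for the example.

```python
def nj_pick_pair(names, dist):
    """Return (q_value, name_i, name_j) for the pair that neighbor
    joining would join first, given a symmetric distance matrix."""
    n = len(names)
    totals = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, names[i], names[j])
    return best


# Toy distances between five hypothetical taxa.
names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pick_pair(names, dist))
```

Note that the pair joined first (a and b) is not the pair with the smallest raw distance (d and e). The correction for each taxon's total distance to all others is what distinguishes neighbor joining from simple clustering.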

Are the trees from Fitch and Neighbor different?

Which of the trees comes closest to your expectation? Why? (I.e., describe the part of your expectation that is met by the tree.)

Run SEQBOOT to generate a file with 100 bootstrap replicates (use default options). (Remember: Rename the outfile!)
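Conceptually, each bootstrap replicate is a new alignment of the same length, built by sampling columns of the original alignment with replacement. The sketch below illustrates that idea only; it is not SEQBOOT's implementation, and the function name `bootstrap_alignment` and the toy alignment are assumptions for this example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of `alignment`.

    `alignment` maps sequence names to equal-length strings. Columns are
    drawn with replacement, and the same columns are used for every
    sequence so that homologous positions stay together.
    """
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


toy = {"seq1": "ACGTAC", "seq2": "ACGTTT", "seq3": "GCGTAC"}
rng = random.Random(42)  # fixed seed so the run is repeatable
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
print(replicates[0])
```

Some columns appear several times in a replicate and others not at all, which is why trees built from different replicates can differ.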

Run DNADIST on the bootstrapped dataset once with Jukes-Cantor (JC) distances [option D]. For the program to read all the bootstrapped samples, you have to tell it to analyze multiple datasets [option M]. The output of the program is the set of distance matrices in the file named "outfile". Again, do not forget to rename it! (OK, this was the last warning you get on the renaming business.)

Use NEIGHBOR to calculate trees from the distance matrices you calculated in the previous step. Turn on randomization of input order [option J with one replicate], and do not forget to switch on [option M] for multiple dataset analysis.

Use CONSENSE to calculate a consensus tree from the trees generated in the last step.

Explore the tree in TreeView.

Which branches are strongly supported? Are there any surprises?

Dataset #2: - testseq2mod.phy

The prefix A indicates vacuolar or archaeal ATPase catalytic subunits; the prefix B indicates the paralogous non-catalytic B subunit of the vacuolar or archaeal ATPases. The prefixes alp and bet denote the alpha (non-catalytic) and beta (catalytic) subunits of the bacterial F-ATPases (also found as ATP synthases in mitochondria and plastids).

Arabidopsis thaliana: green flowering plant of the mustard family

Mus musculus: the mouse

Homo sapiens: humans

Neurospora crassa: red bread mold

Plasmodium falciparum: malaria causing protist

Trypanosoma congolense: another disease-causing protist

Sulfolobus acidocaldarius: an archaeon (a crenarchaeote, a thermoacidophile)

Methanosarcina barkeri: another archaeon (a euryarchaeote, methanogen)

E. coli: a Gram-negative bacterium

Aquifex pyrophilus: a deep branching bacterium (according to 16S rRNA)

Thermotoga maritima: another deep branching bacterium (second deepest branch in 16S rRNA phylogenies)

Salmonella typhimurium and Bacillus subtilis: representatives of Gram negative and Gram positive bacteria, respectively.

Run PROTDIST on the bootstrapped dataset with the PAM matrix [option P]. The output of the program is a set of distance matrices in the file named "outfile".
Use FITCH to calculate trees from the distance matrices you calculated in the previous step. Turn on global rearrangements [option G] and randomization of input order [option J].
Explore the tree in TreeView.

Run PROTPARS on the original dataset. (JUMBLE 5 replicates)

Run PROML on the original dataset. Use the defaults for everything.
Explore the tree in TreeView. (Note: when we did the same analysis 14 years ago on a mainframe, it took weeks to run an ML analysis on 12 protein sequences.)

What, if any, are the differences between the trees calculated in #1-3?

Is there a reason you would prefer one tree over the others?

What might have gone wrong in PROTPARS? What could you do to fix this? Does the result improve? (See here for a hint).