Assignment #4:


The programs we will use are in the "class folder" on your desktop, in the "MCB Software" subfolder. Some programs require the program and its input file to be in the same folder. Because you are not allowed to copy anything into the class folder, copy the programs into a folder where you have read and write privileges.

Align the sequences contained in the file testseq1.txt using ClustalX and the default parameters.
(ClustalX is located in "Class Software" Folder, "MCB Software" subfolder on your Desktop).
(Save the alignment on your computer; you'll need it again for #3.)
In ClustalX, what happens when you change the gap penalty parameters away from their default values? (The parameters are under the "Alignment->Alignment Parameters->Multiple Alignment Parameters" menu; play with the "Gap Opening" and "Gap Extension" values.)

Can you find a parameter combination for which the inserted sequence present in the two yeast sequences is no longer recognizable in the alignment? What is that parameter combination?

LISTEN TO LECTURE ON BOOTSTRAP (Notes below)
Calculate a phylogenetic bootstrap consensus tree from the alignment calculated under #1 and look at it in TreeView (TreeView is located in "Class Software" Folder, "MCB Software" subfolder on your Desktop).

Does this tree conform to your expectation based on what you know about organismal evolution? (Saccharomyces, Candida and Neurospora are fungi, Drosophila is a fly, Acetabularia is a green alga, Daucus is the carrot genus, Sulfolobus is an archaeon, and Borrelia is a bacterium.)

Does the tree change when you exclude positions with gaps?
Does the tree change when you add correction for multiple substitutions?

Repeat your exploration of the different gap penalty parameters using the more divergent sequences in testseq2.txt.
The names indicate the organisms from which the genes that encode the given protein sequences were isolated. The complete species names are given in the file; however, Clustal only reads the name of the sequence up to the first space. The prefix A indicates vacuolar or archaeal ATPase catalytic subunits; the prefix B indicates the paralogous non-catalytic B subunit of the vacuolar or archaeal ATPases. The prefixes alp and bet denote the alpha (non-catalytic) and beta (catalytic) subunits of the bacterial F-ATPases (also found as ATP synthases in mitochondria and plastids). fl denotes the ATPase subunit that was found to play a role in the assembly of the bacterial flagella, and ttf denotes a bacterial transcription termination factor.
Arabidopsis thaliana: green flowering plant of the mustard family
Mus musculus: the mouse
Homo sapiens: humans
Neurospora crassa: red bread mold
Plasmodium falciparum: malaria-causing protist
Trypanosoma congolense: another disease-causing protist
Sulfolobus acidocaldarius: an archaeon (a crenarchaeote, a thermoacidophile)
Methanosarcina barkeri: another archaeon (a euryarchaeote, a methanogen)
E. coli: a Gram-negative bacterium
Aquifex pyrophilus: a deep-branching bacterium (according to 16S rRNA)
Thermotoga maritima: another deep-branching bacterium (second deepest branch in 16S rRNA phylogenies)
Salmonella typhimurium and Bacillus subtilis: representatives of Gram-negative and Gram-positive bacteria, respectively.

Can you find parameter choices that reveal the so-called non-homologous region in the catalytic V-ATPase subunits (i.e. a region of 100 aa, about 150 aa from the amino terminal end that is absent in the other ATPase subunits)? Report your findings:
(Can you get rid of the amino acids sprinkled into the corresponding gap? What is a parameter combination that seemed to work?)
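As a hint for thinking about the "sprinkled" residues, here is a minimal sketch of how a simple affine gap penalty scores one long gap compared with several short ones. It assumes each gap costs gap_open + gap_extend * length; ClustalX adjusts its penalties further (for example by position), so the numbers are illustrative only, and the gap layouts and parameter values are made up.

```python
def affine_gap_cost(gap_lengths, gap_open, gap_extend):
    """Total penalty for a set of gaps under a simple affine model.

    Assumption: every gap costs gap_open + gap_extend * length.
    This is a simplification, not ClustalX's exact scoring.
    """
    return sum(gap_open + gap_extend * length for length in gap_lengths)


# Two hypothetical ways to place the same 100 gapped positions:
one_block = [100]                  # a single contiguous gap
sprinkled = [20, 20, 20, 20, 20]   # five gaps separated by stray residues

# Illustrative parameter pairs (gap opening, gap extension).
for gap_open, gap_extend in [(10.0, 0.2), (30.0, 0.05)]:
    print(gap_open, gap_extend,
          affine_gap_cost(one_block, gap_open, gap_extend),
          affine_gap_cost(sprinkled, gap_open, gap_extend))
```

In this model the two layouts differ only by the extra gap openings, so the ratio of gap opening to gap extension decides how strongly a single contiguous gap is preferred.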

Are the nucleotide binding site motifs GGxxxxxGxxGxGKTV (in the V-ATPase A-subunits) properly aligned between the different types?
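If you would rather locate the motif by program than by eye, the following sketch may help. It turns the x-wildcard notation into a regular expression and reports where the motif occurs in a sequence after gap characters are removed. The function names and the toy sequence are hypothetical; the toy sequence is not taken from testseq2.txt.

```python
import re

# The motif quoted in the assignment; "x" stands for any residue.
MOTIF = "GGxxxxxGxxGxGKTV"


def motif_to_regex(motif):
    """Translate the x-wildcard notation into a compiled regular expression."""
    return re.compile(motif.replace("x", "."))


def find_motif(sequence, motif=MOTIF):
    """Return (start, matched_text) for every motif hit.

    Gap characters are removed first, so rows copied from an alignment
    can be searched too; start is 0-based in the ungapped sequence.
    """
    ungapped = sequence.replace("-", "")
    pattern = motif_to_regex(motif)
    return [(m.start(), m.group()) for m in pattern.finditer(ungapped)]


# Made-up toy sequence for illustration only.
toy = "MKLAGGAFPSAGAVGCGKTVISQ"
print(find_motif(toy))
```

To check the alignment itself, compare the alignment columns at which the hits start in the different sequences; if the motif is properly aligned, they should coincide.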

Which sequence in the V-ATPase B subunits aligns to this motif?

Do the ATPase subunits and the rho termination factors (ttf in testseq2.txt) share other motifs besides the GKT region?

Calculate a bootstrap consensus tree for this dataset (remember to tell the program to write the bootstrap values at the node, not the branch) and inspect it with TreeView.
Try to root the tree with the ttf and fl subunits.
Which of the bifurcations in this tree are certain to represent gene duplications, which are likely to represent speciation events?
The sequences in testseq1.txt (V/A-ATPase catalytic subunits) were quite similar to one another. To test the effect of long branches, I added a homologous, but only distantly related sequence to this file (the ATPase involved in flagellar assembly from Salmonella). The resulting file is testseq1b.txt.

Align the sequences and calculate bootstrapped trees for this file using the possible permutations of gaps/no gaps and with/without correction for multiple substitutions.

Which of the resulting trees appears to best reflect the actual evolution?

Give a justification for your choice.

What might be the reason that the other options worked less well?

What do you expect to happen when you replace the Salmonella sequence with a completely (?) unrelated sequence (testseq1c.txt)?

Is your expectation confirmed?

If you have time, copy the program clustalw from the class folder into your working directory, and align the data in testseq1.txt and testseq2.txt using the command line interface. Save the alignment in the different formats and inspect the formats using a text editor.
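For the command line exercise, a small script can generate the ClustalW calls for several gap penalty settings. The sketch below only builds the argument lists and does not run anything. The option names (-INFILE, -ALIGN, -GAPOPEN, -GAPEXT, -OUTPUT, -OUTFILE) are the ones I know from ClustalW's command-line help; please confirm them against the copy in the class folder. The program path, the output file naming scheme and the parameter values are assumptions made for illustration.

```python
import itertools


def clustalw_command(infile, gap_open, gap_extend, output_format="PHYLIP",
                     program="./clustalw"):
    """Build (but do not run) a ClustalW command line as a list of arguments.

    The program path and the output file name pattern are hypothetical.
    """
    stem = infile.rsplit(".", 1)[0]
    outfile = f"{stem}_go{gap_open}_ge{gap_extend}.{output_format.lower()}"
    return [
        program,
        f"-INFILE={infile}",
        "-ALIGN",
        f"-GAPOPEN={gap_open}",
        f"-GAPEXT={gap_extend}",
        f"-OUTPUT={output_format}",
        f"-OUTFILE={outfile}",
    ]


# Illustrative grid of gap opening and gap extension values.
for gap_open, gap_extend in itertools.product([5, 10, 30], [0.05, 0.2]):
    print(" ".join(clustalw_command("testseq1.txt", gap_open, gap_extend)))
```

Each printed line can be pasted into a terminal, or the list can be passed to subprocess.run(command, check=True) once you have checked that the options match your ClustalW version.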

Bootstrapping - how to assess reliability of partitions given in a tree.

Baron Karl Friedrich Hieronymus von Münchhausen

Bootstrapping is one of the most popular ways to assess the reliability of branches. The term goes back to Baron Münchhausen, who pulled himself out of a swamp by his shoelaces. Briefly, positions of the aligned sequences are randomly sampled from the multiple sequence alignment with replacement. The sampled positions are assembled into new data sets, the so-called bootstrapped samples. Each position has about a 63% chance of making it into a particular bootstrapped sample. If a grouping has a lot of support, it will be supported by at least some positions in each of the bootstrapped samples, and all the bootstrapped samples will yield this grouping. Bootstrapping can be applied to all methods of phylogenetic reconstruction.
Bootstrapping thus realizes the impossible: the evolution of sequences in real life happened only once, and it is impossible to run the evolution of, let's say, small subunit ribosomal RNAs again. Nevertheless, using the resampling approach, pseudosamples are generated whose variation resembles the variation one would have obtained if it were possible to sample 100 or 1000 parallel worlds in which the evolution of 16S rRNAs occurred over and over again. You end up with a statistical analysis using only a single original sample.
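The figure of about 63% follows from sampling with replacement: a given position is missed in one draw with probability 1 - 1/N, so it is missed in all N draws with probability (1 - 1/N)^N, which approaches 1/e for large N. The sketch below computes this value and checks it by simulation; the function names are my own.

```python
import math
import random


def inclusion_probability(n_positions):
    """Chance that one given column is drawn at least once in n draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n_positions) ** n_positions


def simulated_inclusion(n_positions, replicates=2000, seed=0):
    """Estimate the same chance by actually drawing bootstrapped samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        sample = [rng.randrange(n_positions) for _ in range(n_positions)]
        hits += 0 in sample  # did column 0 make it into this sample?
    return hits / replicates


for n in (6, 100, 1000):
    print(n, round(inclusion_probability(n), 3), round(simulated_inclusion(n, 500), 3))
print("limit for long alignments:", round(1.0 - math.exp(-1.0), 3))
```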

Bootstrapping has become very popular to assess the reliability of reconstructed phylogenies. Its advantage is that it can be applied to different methods of phylogenetic reconstruction, and that it assigns a probability-like number to every possible partition of the dataset (= branch in the resulting tree). Its disadvantage is that the support for individual groups decreases as you add more sequences to the dataset, and that it just measures how much support for a partition is in your data given a method of analysis. If the method of reconstruction falls victim to a bias or an artifact, this will be reproduced for each of the bootstrapped samples, and it will result in high bootstrap support values.

Creating a bootstrapped sample

Joe Felsenstein describes the bootstrap procedure in his manual to the seqboot program (part of the PHYLIP package) as follows:

The bootstrap. Bootstrapping was invented by Bradley Efron in 1979, and its use in phylogeny estimation was introduced by me (Felsenstein, 1985b; see also Penny and Hendy, 1985). It involves creating a new data set by sampling N characters randomly with replacement, so that the resulting data set has the same size as the original, but some characters have been left out and others are duplicated. The random variation of the results from analyzing these bootstrapped data sets can be shown statistically to be typical of the variation that you would get from collecting new data sets. The method assumes that the characters evolve independently, an assumption that may not be realistic for many kinds of data.

The sample input and output of the seqboot program illustrates the generation of the bootstrapped samples:

TEST DATA SET

    5    6
Alpha     AACAAC
Beta      AACCCC
Gamma     ACCAAC
Delta     CCACCA
Epsilon   CCAAAC

CONTENTS OF OUTPUT FILE

(If Replicates are set to 10 and seed to 4333)

    5     6
Alpha        ACAAAC
Delta        CACCCA
Gamma        ACAAAC
Beta         ACCCCC
Epsilon      CAAAAC
    5     6
Alpha        AACAAC
Beta         AACCCC
Epsilon      CCAAAC
Delta        CCACCA
Gamma        CCCAAC
    5     6
Delta        CAACCC
Beta         ACCCCC
Gamma        ACCAAA
Alpha        ACCAAA
Epsilon      CAAAAA
    5     6
Alpha        AAAACA
Beta         AAAACC
Gamma        AAACCA
Delta        CCCCAC
Epsilon      CCCCAA
    5     6
Beta         ACCCCC
Epsilon      CAAACC
Delta        CCCCAA
Gamma        AAAACC
Alpha        AAAACC
    5     6
Gamma        CCAACC
Alpha        ACAACC
Epsilon      CAAACC
Delta        CACCAA
Beta         ACCCCC
    5     6
Alpha        AAACAA
Delta        CCCACC
Epsilon      CCCAAA
Gamma        AACCAA
Beta         AAACCC
    5     6
Alpha        AAAACC
Delta        CCCCAA
Beta         CCCCCC
Epsilon      AAAACC
Gamma        AAAACC
    5     6
Beta         AAAAAC
Alpha        AAAAAC
Gamma        AACCCC
Delta        CCCCCA
Epsilon      CCCCCC
    5     6
Delta        CCCCAA
Epsilon      CCAACC
Gamma        AAAACC
Alpha        AAAACC
Beta         AACCCC
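
The column resampling illustrated by the seqboot test data can be sketched in a few lines. This is not seqboot's code and will not reproduce the replicates listed above, because the random number generator differs. It only shows the principle: draw as many column indices as there are sites, with replacement, and rebuild every sequence from those columns. Keeping the drawn columns in their original order is my choice; it matches what the quoted output appears to do.

```python
import random

# The seqboot test data set quoted above.
ALIGNMENT = {
    "Alpha": "AACAAC",
    "Beta": "AACCCC",
    "Gamma": "ACCAAC",
    "Delta": "CCACCA",
    "Epsilon": "CCAAAC",
}


def bootstrap_sample(alignment, rng):
    """Return one bootstrapped alignment of the same size as the input."""
    n_sites = len(next(iter(alignment.values())))
    # Draw n_sites column indices with replacement; some columns repeat,
    # others are left out.
    columns = sorted(rng.randrange(n_sites) for _ in range(n_sites))
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


rng = random.Random(4333)  # the seed value is arbitrary here
for replicate in range(3):
    sample = bootstrap_sample(ALIGNMENT, rng)
    print("    5     6")
    for name, seq in sample.items():
        print(f"{name:<13}{seq}")
```

Because whole columns are drawn, every column of a bootstrapped sample is identical to some column of the original alignment.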

Bootstrapping and non-informative sites

It is a good idea to only use those positions for phylogenetic analyses for which one is sure that the sites are indeed homologous. However, in order to obtain a bootstrap value better than 90% for a branch one only needs three sites that change along this branch. Counter to intuition (at least mine), these bootstrap values are not lowered by adding non-informative sites to the alignment. The following discusses an example of four sequences that have 5 positions supporting the central branch in one orientation and 2 positions each supporting the two alternatives:

spec1 ACGTG AC CG

spec2 ACGTG GG AA

spec3 TGCAC GG CG

spec4 TGCAC AC AA

Regardless of how many non-informative sites are added, the grouping of spec1 with spec2 (and spec3 with spec4) is found in about 80% of the bootstrapped samples.

Example of sequence file with added non-informative sites:

spec1 AAGCAGCTGT AAGCAGCTGT AAGCAGCTGT AAGCAGCTGT AAGCAGCTGT ACGTG ACCG

spec2 AACCGCCTGT AACCGCCTGT AACCGCCTGT AACCGCCTGT AACCGCCTGT ACGTG GGAA

spec3 AACGACCTCT AACGACCTCT AACGACCTCT AACGACCTCT AACGACCTCT TGCAC GGCG

spec4 ATCCACCAGA ATCCACCAGA ATCCACCAGA ATCCACCAGA ATCCACCAGA TGCAC ACAA

Table of bootstrap support (% of bootstrapped samples; each block lists 3 replicates with 100 bootstrapped samples each, and the mean and standard deviation are taken over the values in each column):

Non-informative      Species 1 with 2           Species 1 with 3, or 1 with 4
sites added          (5 informative sites       (2 informative sites each
                     support this branch)       support this branch)

+0                   82.50                      12.17
                     77.83                       3.67
                     84.17                      11.00
                                                 6.50
                                                11.83
                                                10.33
  Mean               81.50                       9.25
  Standard deviation  3.29                       3.41

+50                  77.00                      13.00
                     77.00                      10.00
                     78.83                      14.00
                                                 9.00
                                                11.33
                                                 9.83
  Mean               77.61                      11.19
  Standard deviation  1.06                       1.96

+200                 78.83                      14.33
                     78.33                       6.83
                     76.50                      12.83
                                                 8.83
                                                14.50
                                                 9.00
  Mean               77.89                      11.05
  Standard deviation  1.23                       3.25

+2000                79.67                      10.67
                     83.33                       9.67
                     81.33                      10.33
                                                 6.33
                                                10.83
                                                 7.83
  Mean               81.44                       9.28
  Standard deviation  1.83                       1.81
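
The pattern in the table can be explored with a small simulation. The sketch below makes a simplifying assumption that is mine and not necessarily the method used for the table: for four sequences, each bootstrapped sample is taken to support whichever of the three possible groupings has the most informative sites in the sample (as parsimony would), and ties are shared equally among the tied groupings. Non-informative sites take up draws but never vote.

```python
import random


def bootstrap_support(n_a=5, n_b=2, n_c=2, n_noninformative=0,
                      replicates=1000, seed=1):
    """Percent of bootstrapped samples favouring each of three groupings.

    A = spec1 with spec2, B and C = the two alternative groupings.
    Assumption: the grouping with the most supporting sites wins.
    """
    rng = random.Random(seed)
    # One label per alignment column; "-" marks a non-informative column.
    columns = (["A"] * n_a + ["B"] * n_b + ["C"] * n_c
               + ["-"] * n_noninformative)
    n_sites = len(columns)
    support = {"A": 0.0, "B": 0.0, "C": 0.0}
    for _ in range(replicates):
        sample = rng.choices(columns, k=n_sites)  # sampling with replacement
        counts = {key: sample.count(key) for key in support}
        best = max(counts.values())
        winners = [key for key, value in counts.items() if value == best]
        for key in winners:
            support[key] += 1.0 / len(winners)  # ties are shared
    return {key: 100.0 * value / replicates for key, value in support.items()}


for extra in (0, 50, 200, 2000):
    result = bootstrap_support(n_noninformative=extra, replicates=300)
    print(extra, {key: round(value, 1) for key, value in result.items()})
```

Under this assumption the support for grouping A should stay in the same range as the values in the table, whatever the number of added non-informative sites, although individual runs fluctuate by several percent.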

Discussion of some Results

Analysis of trees obtained with testseq1b: In my opinion, the best trees were obtained with correction for multiple substitutions turned on. Without correction for multiple substitutions, the two longest branches (flSalmonella and Borrelia) group together and the group of the two yeasts is broken up by the Neurospora sequence. Excluding the positions with gaps resulted in a slight improvement (the yeasts go together), and the bootstrap values for the branches that are supported by other evidence were higher, whereas questionable groupings appropriately received little support.

Analysis of trees obtained with testseq1c: The Synechococcus sequence is not homologous to any of the other sequences. Accordingly, the distance correction does not work in all instances. When positions with gaps are excluded, the sequence groups with the longest branch (long branch attraction); when positions that contain gaps are included, it goes with the Drosophila sequence, probably because the amino-terminal ends of the two sequences match up.

Exclusion of positions with gaps gets rid of a lot of noise (these regions are usually least conserved), and instances of convergent gap formation are ignored (some other programs handle this problem with more alternatives).
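
What "excluding positions with gaps" means can be shown with a short sketch: every alignment column in which at least one sequence has a gap character is dropped before distances are calculated. The function name and the toy alignment are made up for illustration.

```python
def strip_gapped_columns(alignment, gap_chars="-."):
    """Remove every column in which any sequence carries a gap character."""
    names = list(alignment)
    rows = [alignment[name] for name in names]
    # zip(*rows) walks through the alignment column by column.
    kept = [column for column in zip(*rows)
            if not any(residue in gap_chars for residue in column)]
    return {name: "".join(column[i] for column in kept)
            for i, name in enumerate(names)}


# Made-up toy alignment.
toy = {"seqA": "MK-LV", "seqB": "MKALV", "seqC": "M-ALI"}
print(strip_gapped_columns(toy))
```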

Multiple substitutions do occur, so it is a good idea to take them into account when calculating distances.
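
To make the effect of such a correction concrete, here is a sketch using Kimura's empirical formula for proteins, K = -ln(1 - D - D^2/5), where D is the observed fraction of differing residues. To my knowledge this is the kind of correction ClustalW applies to protein distances, but treat that as an assumption and check the program documentation. The formula is only meaningful for moderate divergence, and the toy sequences are made up.

```python
import math


def observed_distance(seq_a, seq_b):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)


def kimura_corrected(d):
    """Kimura's empirical correction for proteins: K = -ln(1 - D - D^2/5)."""
    inner = 1.0 - d - 0.2 * d * d
    if inner <= 0:
        raise ValueError("observed distance too large for this formula")
    return -math.log(inner)


# Made-up toy sequences.
d = observed_distance("MKALVGGTSA", "MKSLIGGTAA")
print(round(d, 3), round(kimura_corrected(d), 3))

# The correction grows with the observed distance.
for d in (0.1, 0.3, 0.5, 0.7):
    print(d, round(kimura_corrected(d), 3))
```

The corrected distance is always at least as large as the observed one, and the gap widens for divergent pairs. This is why long branches are underestimated, and may be drawn together, when the correction is turned off.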