Assignment #4:
The programs we will use are in the "class folder"
on your desktop, in the "MCB Software" subfolder. Some programs require
the program and the input file to be in the same folder. As you are not
allowed to copy anything into the class folder, you should copy the
programs into a folder where you have read and write privileges.
Align the sequences and calculate bootstrapped trees for this file
using all four combinations of gaps / no gaps and with / without
correction for multiple substitutions. Which of the resulting trees appears to best reflect the actual
evolution?
If you have time, copy the program clustalw from the class
folder into your working directory, and align the data in testseq1.txt
and testseq2.txt using the command line interface. Save the alignment
in the different formats and inspect the formats using a text editor.
Bootstrapping - how to assess reliability of partitions given in a tree.
Baron Karl Friedrich Hieronymus von Münchhausen
Bootstrapping is one of the most popular ways to assess the reliability
of branches. The term bootstrapping
goes back to Baron Münchhausen (who pulled himself out of a swamp
by his shoe laces). Briefly, positions of the aligned sequences are
randomly sampled from the multiple sequence alignment with replacement.
The sampled positions are assembled into new data sets, the so-called
bootstrapped samples. Each position
has about a 63% chance of making it into a particular bootstrapped sample (for an alignment of n positions the chance is 1 - (1 - 1/n)^n, which approaches 1 - 1/e). If a grouping has a lot of support, it will
be supported by at least some positions in each of the bootstrapped
samples, and all the bootstrapped samples will yield this grouping.
Bootstrapping can be applied to all methods of phylogenetic reconstruction.
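The resampling step described above can be sketched in a few lines of Python. This is only an illustration, not the code used by seqboot: the function names `bootstrap_sample` and `inclusion_probability` are made up for this example, the seed is arbitrary, and the four short sequences are the ones from the example further down this page with the spaces removed.

```python
import random

def bootstrap_sample(alignment, rng):
    """Return one bootstrapped alignment plus the column indices that were drawn."""
    n_sites = len(next(iter(alignment.values())))
    # Draw n_sites column indices with replacement: some columns repeat,
    # others do not make it into this sample at all.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    resampled = {name: "".join(seq[i] for i in columns)
                 for name, seq in alignment.items()}
    return resampled, columns

def inclusion_probability(n_sites):
    """Chance that one given column appears at least once in a bootstrapped sample."""
    return 1.0 - (1.0 - 1.0 / n_sites) ** n_sites

# Example alignment (sequences from this page, spaces removed).
alignment = {
    "spec1": "ACGTGACCG",
    "spec2": "ACGTGGGAA",
    "spec3": "TGCACGGCG",
    "spec4": "TGCACACAA",
}

rng = random.Random(4333)  # arbitrary seed, chosen only for reproducibility
sample, columns = bootstrap_sample(alignment, rng)
print("columns drawn:", columns)
for name, seq in sample.items():
    print(name, seq)
print("inclusion probability for 1000 sites:", round(inclusion_probability(1000), 3))
```

Every sequence is rebuilt from the same list of drawn columns, so the columns of the alignment stay intact; only their selection and multiplicity change from one bootstrapped sample to the next.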
Bootstrapping has become a very popular way to assess the reliability of reconstructed phylogenies. Its advantage is that it can be applied to different methods of phylogenetic reconstruction, and that it assigns a probability-like number to every possible partition of the dataset (= branch in the resulting tree). Its disadvantage is that the support for individual groups decreases as you add more sequences to the dataset, and that it only measures how much support for a partition is in your data given a method of analysis. If the method of reconstruction falls victim to a bias or an artifact, this will be reproduced in every one of the bootstrapped samples, and it will result in high bootstrap support values.
Creating a bootstrapped sample
Joe Felsenstein describes the bootstrap procedure in his manual to the seqboot program (part of the PHYLIP package, the manual is here, the citations here) as follows:
The sample input and output of the seqboot program illustrate the generation of the bootstrapped samples:
TEST DATA SET
CONTENTS OF OUTPUT FILE (if Replicates are set to 10 and the seed to 4333)
Bootstrapping and non-informative sites
It is a good idea to use only those positions for phylogenetic analyses for
which one is sure that the sites are indeed homologous. However, to obtain a bootstrap value better than 90% for
a branch, one only needs three sites that change along this branch. Counter to intuition (at least mine), these
bootstrap values are not lowered by adding non-informative sites to the
alignment. The following discusses
an example of four sequences that have 5 positions supporting the central
branch in one orientation and 2 positions each supporting the two alternatives:

spec1 ACGTG AC CG
spec2 ACGTG GG AA
spec3 TGCAC GG CG
spec4 TGCAC AC AA

Regardless of how many non-informative
sites are added, the grouping of spec1 with spec2 (and spec3 with spec4)
is found in about 80% of the bootstrapped samples.

Example of a sequence file with added non-informative sites:

spec1 AAGCAGCTGT AAGCAGCTGT AAGCAGCTGT AAGCAGCTGT AAGCAGCTGT ACGTG ACCG
spec2 AACCGCCTGT AACCGCCTGT AACCGCCTGT AACCGCCTGT AACCGCCTGT ACGTG GGAA
spec3 AACGACCTCT AACGACCTCT AACGACCTCT AACGACCTCT AACGACCTCT TGCAC GGCG
spec4 ATCCACCAGA ATCCACCAGA ATCCACCAGA ATCCACCAGA ATCCACCAGA TGCAC ACAA

Table of bootstrap support:
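The effect of adding non-informative sites can be explored with a small simulation. The sketch below rests on assumptions that are not stated on this page: the tree for each bootstrapped sample is chosen simply by counting how many sampled sites favour each of the three possible groupings (as parsimony would do for four sequences), and ties are split evenly between the tied groupings. The function name `bootstrap_support` and the numbers of added sites are made up for this illustration.

```python
import random

# Site patterns of the 9 informative columns in the example above:
# 5 favour (spec1,spec2), 2 favour (spec1,spec4), 2 favour (spec1,spec3).
INFORMATIVE = [0] * 5 + [1] * 2 + [2] * 2

def bootstrap_support(n_noninformative, n_replicates=5000, seed=1):
    """Estimate how often grouping 0 (spec1 with spec2) wins under resampling."""
    rng = random.Random(seed)
    # None marks a column that favours none of the three groupings.
    sites = INFORMATIVE + [None] * n_noninformative
    n_sites = len(sites)
    support = 0.0
    for _replicate in range(n_replicates):
        counts = [0, 0, 0]
        # Sample n_sites columns with replacement and tally their patterns.
        for _draw in range(n_sites):
            pattern = sites[rng.randrange(n_sites)]
            if pattern is not None:
                counts[pattern] += 1
        best = max(counts)
        winners = [t for t in range(3) if counts[t] == best]
        if 0 in winners:
            # Assumption: tied groupings share the credit equally.
            support += 1.0 / len(winners)
    return support / n_replicates

for extra in (0, 50, 200):
    print(extra, "added sites:", round(bootstrap_support(extra), 2))
```

A program that builds trees from distances, or that breaks ties differently, may give somewhat different values, so the printed estimates should be compared with the table from the actual analysis rather than taken as a replacement for it.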
Discussion of some Results
Analysis of trees obtained
with testseq1b: In my opinion, the best trees were obtained with
correction for multiple substitutions turned on. Without correction for multiple
substitutions, the two longest branches (flSalmonella and Borrelia) group together
and the group of the two yeasts is broken up by the Neurospora sequence. Excluding
the positions with gaps resulted in a slight improvement (the yeasts go together),
and the bootstrap values for the branches that are supported by other evidence
were higher, whereas questionable groupings received appropriately little support.
Analysis of trees obtained
with testseq1c: The Synechococcus sequence is not homologous to any
of the other sequences. Accordingly, the distance correction does not work in
all instances. When positions with gaps are excluded, the sequence groups with the longest
branch (long branch attraction); with positions that contain gaps included, it
goes with the Drosophila sequence, probably because the amino terminal ends
of the two sequences match up.
Exclusion of positions with gaps gets rid of a lot of noise
(these regions are usually the least conserved), and instances of convergent gap
formation are ignored (some other programs offer more options for handling this problem).
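Excluding positions with gaps amounts to dropping every alignment column in which at least one sequence has a gap character. The following is a minimal sketch of that idea, not the code of any particular program; the function name `remove_gap_columns`, the set of gap characters and the toy alignment are made up for this illustration.

```python
def remove_gap_columns(alignment, gap_chars="-."):
    """Drop every alignment column in which at least one sequence has a gap."""
    names = list(alignment)
    # zip(*sequences) walks through the alignment column by column.
    columns = zip(*(alignment[name] for name in names))
    kept = [col for col in columns if not any(ch in gap_chars for ch in col)]
    return {name: "".join(col[i] for col in kept)
            for i, name in enumerate(names)}

# Hypothetical toy alignment, not taken from the course files.
toy = {
    "seqA": "MK-LVA",
    "seqB": "MKTLVA",
    "seqC": "MRTL-A",
}
print(remove_gap_columns(toy))
```

In the toy alignment the third and fifth columns each contain a gap, so only four of the six columns remain for the distance calculation.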
Multiple substitutions do occur, so it is a good idea to
take them into consideration when calculating distances.
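As an illustration of what such a correction does, the sketch below uses Kimura's empirical formula for protein sequences, K = -ln(1 - D - D^2/5), which the ClustalW documentation gives for its "correct for multiple substitutions" option. Whether the trees discussed on this page were calculated with exactly this formula is an assumption, and the function names are made up for this example.

```python
import math

def observed_distance(seq_a, seq_b, gap_chars="-."):
    """Fraction of differing positions, skipping columns where either sequence has a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b)
             if a not in gap_chars and b not in gap_chars]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(a != b for a, b in pairs) / len(pairs)

def kimura_protein_distance(d):
    """Kimura's empirical correction for proteins: K = -ln(1 - D - D^2/5)."""
    inside = 1.0 - d - d * d / 5.0
    if inside <= 0.0:
        # Too divergent (D above roughly 0.85): the correction is undefined.
        return None
    return -math.log(inside)

for d in (0.1, 0.5, 0.8, 0.9):
    print("observed", d, "corrected", kimura_protein_distance(d))
```

The corrected distance grows faster than the observed one, and for very divergent pairs the formula has no solution. That is one way in which a distance correction can fail for a sequence that is not homologous to the others, as with the Synechococcus sequence in testseq1c.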