Assignment #4:
 [Links from this page open in a separate window]
 The programs we will use are in the "class folder" on your desktop, in the
          "MCB Software" subfolder. Some programs need the program and the data
          file to be in the same folder. As you are not allowed to copy anything
          into the class folder, you should copy the programs into a folder
          where you have read and write privileges.
 
   
 
 
 
 
 If you have time, copy the program clustalw from the class
        folder into your working directory, and align the data in testseq1.txt
        and testseq2.txt using the command line interface. Save the alignment
        in the different formats and inspect the formats using a text editor.
Bootstrapping - how to assess reliability of partitions given in a tree.
 [Image: Baron Karl Friedrich Hieronymus von Münchhausen]
 Bootstrapping is one of the most popular ways to assess the reliability of branches in a reconstructed phylogeny. The term goes back to Baron Münchhausen, who pulled himself out of a swamp by his shoe laces. Briefly, positions of the aligned sequences are randomly sampled from the multiple sequence alignment with replacement. The sampled positions are assembled into new data sets, the so-called bootstrapped samples, each with as many positions as the original alignment. Each position has about a 63% chance of making it into a particular bootstrapped sample. If a grouping has a lot of support, it will be supported by at least some positions in each of the bootstrapped samples, and all the bootstrapped samples will yield this grouping.
 The advantage of bootstrapping is that it can be applied to all methods of phylogenetic reconstruction, and that it assigns a probability-like number to every possible partition of the dataset (= branch in the resulting tree). Its disadvantage is that the support for individual groups decreases as you add more sequences to the dataset, and that it only measures how much support for a partition is in your data given a method of analysis. If the method of reconstruction falls victim to a bias or an artifact, this will be reproduced in each of the bootstrapped samples and will result in high bootstrap support values.
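 The resampling step described above can be sketched in a few lines of Python. This is only an illustration, not the code used by any particular program: the function names `bootstrap_sample` and `inclusion_probability` are made up for this example, and the toy alignment is assumed to be held as a dictionary of equal-length strings. The second function shows where the figure of about 63% comes from: a given column is missed by one draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e for long alignments.

```python
import random


def bootstrap_sample(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    Returns the resampled alignment and the list of sampled column indices.
    """
    n_sites = len(next(iter(alignment.values())))
    # Draw n_sites column indices; the same column may be drawn repeatedly.
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    resampled = {
        name: "".join(seq[i] for i in picks) for name, seq in alignment.items()
    }
    return resampled, picks


def inclusion_probability(n_sites):
    """Chance that one given column appears at least once in a sample."""
    return 1.0 - (1.0 - 1.0 / n_sites) ** n_sites


if __name__ == "__main__":
    # Hypothetical toy alignment, used only to exercise the functions.
    toy = {
        "spec1": "ACGTGACCG",
        "spec2": "ACGTGGGAA",
        "spec3": "TGCACGGCG",
        "spec4": "TGCACACAA",
    }
    sample, picks = bootstrap_sample(toy, random.Random(4333))
    print("sampled columns:", picks)
    for name, seq in sample.items():
        print(name, seq)
    print("inclusion probability for 1000 sites:", inclusion_probability(1000))
```

 Every sequence is rebuilt from the same list of sampled columns, so the columns stay intact and only their frequency changes.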
 Creating a bootstrapped sample
 Joe Felsenstein describes the bootstrap procedure in his manual for the seqboot program (part of the PHYLIP package; the original page links to the manual and to the citations) as follows:
 The sample input and output of the seqboot program illustrate the generation of the bootstrapped samples:
 TEST DATA SET
 
 
 
 CONTENTS OF OUTPUT FILE (if Replicates is set to 10 and the seed to 4333)
 
 
 
 Bootstrapping and non-informative sites
 It is a good idea to use only those positions for phylogenetic analyses for which one is sure that the sites are indeed homologous. However, in order to obtain a bootstrap value better than 90% for a branch, one only needs three sites that change along this branch. Counter to intuition (at least mine), these bootstrap values are not lowered by adding non-informative sites to the alignment. The following discusses an example of four sequences that have 5 positions supporting the central branch in one orientation and 2 positions each supporting the two alternatives:

        spec1 ACGTG AC CG
        spec2 ACGTG GG AA
        spec3 TGCAC GG CG
        spec4 TGCAC AC AA

 Regardless of how many non-informative sites are added, the grouping of spec1 with spec2 (and spec3 with spec4) is found in about 80% of the bootstrapped samples.

 Example of sequence file with added non-informative sites:

        spec1 AAGCAGCTGT AAGCAGCTGT AAGCAGCTGT AAGCAGCTGT AAGCAGCTGT ACGTG ACCG
        spec2 AACCGCCTGT AACCGCCTGT AACCGCCTGT AACCGCCTGT AACCGCCTGT ACGTG GGAA
        spec3 AACGACCTCT AACGACCTCT AACGACCTCT AACGACCTCT AACGACCTCT TGCAC GGCG
        spec4 ATCCACCAGA ATCCACCAGA ATCCACCAGA ATCCACCAGA ATCCACCAGA TGCAC ACAA

 Table of bootstrap support:
 
 | 
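 The four-sequence example can be checked with a small simulation. The sketch below is a simplified stand-in for a real parsimony analysis, not the method that produced the numbers on this page: each resampled column is classified by which of the three possible splits it supports, the split with the most supporting columns wins the replicate, and ties are shared equally (an assumption made here for simplicity). All function and variable names are hypothetical. The helper `unopposed_support` illustrates one way to read the "three sites" remark: if k sites support a branch and nothing contradicts it, the branch is recovered whenever at least one of them is sampled, which happens with probability 1 - (1 - k/n)^n, close to 95% for k = 3.

```python
import random

# The nine informative columns from the example (5 + 2 + 2).
INFORMATIVE = ["ACGTGACCG", "ACGTGGGAA", "TGCACGGCG", "TGCACACAA"]
# One block of ten non-informative columns, taken from the padded file.
PADDING = ["AAGCAGCTGT", "AACCGCCTGT", "AACGACCTCT", "ATCCACCAGA"]


def supported_split(column):
    """Return which split a four-taxon column supports, or None."""
    a, b, c, d = column
    if a == b and c == d and a != c:
        return 0  # (spec1, spec2) | (spec3, spec4)
    if a == c and b == d and a != b:
        return 1  # (spec1, spec3) | (spec2, spec4)
    if a == d and b == c and a != b:
        return 2  # (spec1, spec4) | (spec2, spec3)
    return None  # constant or singleton column: non-informative


def support_for_first_split(seqs, replicates, rng):
    """Fraction of bootstrap replicates won by the spec1+spec2 grouping."""
    columns = list(zip(*seqs))
    credit = 0.0
    for _ in range(replicates):
        counts = [0, 0, 0]
        for _ in range(len(columns)):
            split = supported_split(rng.choice(columns))
            if split is not None:
                counts[split] += 1
        best = max(counts)
        winners = [i for i, c in enumerate(counts) if c == best]
        if 0 in winners:
            credit += 1.0 / len(winners)  # ties are shared equally
    return credit / replicates


def unopposed_support(k, n_sites):
    """Chance that at least one of k supporting sites is sampled."""
    return 1.0 - (1.0 - k / n_sites) ** n_sites


if __name__ == "__main__":
    rng = random.Random(4333)
    padded = [pad * 5 + seq for pad, seq in zip(PADDING, INFORMATIVE)]
    print("informative sites only:", support_for_first_split(INFORMATIVE, 4000, rng))
    print("with 50 padding sites: ", support_for_first_split(padded, 4000, rng))
    print("three unopposed sites: ", unopposed_support(3, 1000))
```

 Under these assumptions both runs should land close to the 80% quoted above, because padding changes how many informative columns end up in a sample but not the proportions in which the three splits are supported.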
Discussion of some Results
Analysis of trees obtained with testseq1b: In my opinion, the best trees were obtained with
  correction for multiple substitutions turned on. Without correction for multiple
  substitutions, the two longest branches (flSalmonella and Borrelia) group together
  and the group of the two yeasts is broken up by the Neurospora sequence. Excluding
  the positions with gaps resulted in a slight improvement (the yeasts go together),
  and the bootstrap values for the branches that are supported by other evidence
  were higher, whereas questionable groupings received appropriately little support.
  
Analysis of trees obtained with testseq1c: The Synechococcus sequence is not homologous to any
  of the other sequences. Accordingly, the distance correction does not work in
  all instances. When positions with gaps are excluded, the sequence groups with the longest
  branch (long branch attraction); when positions that contain gaps are included, it
  groups with the Drosophila sequence, probably because the amino-terminal ends
  of the two sequences match up.
 Exclusion of positions with gaps gets rid of a lot of noise 
  (these regions are usually least conserved), and instances of convergent gap 
  formation are ignored (some other programs handle this problem with more alternatives).
  
 Multiple substitutions occur, thus it is a good thing to 
  take these into consideration when calculating distances.
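 To make the idea of a correction for multiple substitutions concrete, the sketch below uses Kimura's approximation for protein sequences, K = -ln(1 - D - D^2/5), where D is the observed fraction of differing positions. This page does not state which formula the program applied, so the choice of formula is an assumption, and the function name `kimura_protein_distance` is hypothetical. The formula has no solution once D exceeds roughly 0.85, which is one plausible reason a correction can fail for sequences that are not homologous.

```python
import math


def kimura_protein_distance(p_observed):
    """Corrected distance from the observed fraction of differences.

    Uses Kimura's approximation K = -ln(1 - D - D^2 / 5). This formula is
    an assumption made for illustration; other corrections exist.
    """
    inner = 1.0 - p_observed - 0.2 * p_observed ** 2
    if inner <= 0.0:
        # Too many differences: the logarithm is undefined.
        raise ValueError("observed difference too large to correct")
    return -math.log(inner)


if __name__ == "__main__":
    for d in (0.1, 0.3, 0.5, 0.7):
        print(d, round(kimura_protein_distance(d), 3))
```

 The corrected distance is always at least as large as the observed one, and the gap widens as the sequences become more divergent, which is why long branches are affected most.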