Assignment for Class 39

Your name:
Your email address:

Introduction to Bayesian Analyses

MrBayes 3.1.2 should be already installed on the iMACs in the computer lab (in the MCB221 folder).

In case you work at home on a PC, go here and follow instructions to download and install MrBayes.

Exercise 1:
The goal of this exercise is to learn how to use MrBayes to reconstruct phylogenies.

Save testseq1c.nex into the MrBayes folder that also contains the MrBayes Program (Folder MrBayes3.1, probably your desktop). This is the dataset of vacuolar ATPase A subunit you used earlier, but in the NEXUS format that MrBayes reads (the nexus format is used by PAUP and many other programs - it allows to pass trees and commands to the different programs -- clustal and many other programs write nexus formated files, but usually, you need to go into the file with a texteditor and change things like the way gaps are treated, or the data type). You might need to rename the file after downloading. Start MrBayes through double clicking on the MrBayes icon. (do NOT use the one ending in .p).
At the MrBayes command prompt type "execute testseq1c.nex". This will load the data into the program.
Type the following commands one by one:

prset aamodelpr=fixed(jones) [this sets the substititution matrix to JTT, a mdern version of PAM]
mcmcp samplefreq=50 printfreq=50 nchains=2 startingtree=random [this sets the frequency with which the "robot" reports results to the screen and to the files (different files for parameters (.p) and trees (.t))]
mcmcp savebrlens=yes [mcmcp sets parameters for the chain. savebrlens tells it to save the trees with branchlengths]
[mcmcp filename=testseq1c] [if you use this command, it tells the program to save the files under a certaian name. This is handy, if you want to read in data from a previous run, but it usually is easier to go with the default, which in this case is testseq1c.nex]
mcmc ngen=20000 [this starts the chain, at the end you'll be asked if you want to continue the run, read the stuff below while you wait]
sump [or sump filename=testseq1c] [this summarizes the data stored in the parameter file]

Rather than typing the commands one by one you could use a MrBayes block at the end of the input file, e.g.:
begin mrbayes;
    prset aamodelpr=fixed(jones);
    mcmcp samplefreq=50 printfreq=50 nchains=2 startingtree=random;
    mcmcp savebrlens=yes filename=testseq1c;
    mcmc ngen=20000; [you might need to add set autoclose=yes to get to the next step]
    sump filename=testseq1c;
end;

" prset" sets the priors for the model used to calculate likelihoods. In this case we choose the substitution parameters from the JTT amino acid substitution model (Jones et al., 1992).
" mcmcp " sets parameters for Markov Chain Monte Carlo: we set to sample every 50 generation, to print results to a screen every 50th generation, run 2 chains simultaneously, start with random tree, and save branch lengths.
" mcmc " actually runs the chain, and we set it to run for 20000 generations.
The program runs to analyses in parallel (by default each with 4 chains, and three of these chains are heated; we use only two to make things a little faster--- it definitely is a good idea to run mb on a fast and remote computer). The smaller the reported average standard deviation of split frequencies is, the more reliable the result (i.e., your run is close enough to infinitely long). When the value is below .015, or when your patience is exhausted, terminate the run, by typing no at the prompt.

After the run is finished, the " sump " command will plot the logL vs. generation number, that allows to determine the necessary burnin (you want to discard those samples as "burnin" where the -logL is still rising steadily).

[Rather than using the sump command, you also can import the parameter file into EXEL and plot the logL values as a chart in EXEL. See below.]

During the start of the run, the likelihood rapidly increases by
orders of magnitude. If the first samples are included in the plot, one
really cannot see if the likelihood values fluctuate around a constant
value. You can exclude the first couple of samples by specifying a burnin.
sump burnin=20 excludes the first 20 samples.
Note, that the y-axis is rescaled automatically, which provides you a better spread to judge it the chain is "mixing well", or still in the initial orientation.

Type
" sumt burnin=20 " or " sumt filename=testseq1c burnin=20 "
, where you need to substitute '20' with the number you obtained in the previous step of the exercise (Note: that burnin values is the # of generations divided by 50, since we sample every 50th generation). This command creates a consensus tree, and shows posterior probabilities of the clades. You can take a look at the tree in Treeview or njplot by loading the testseq1.con file.

Which branch in the tree is the longest?
How long is it?
What is the measure?
Can you explain in a few words, why is it important to exclude a 'burnin' from our analyses?

MrBayes by example: Identification of sites under positive selection in a protein

Background:

Professor Walter M. Fitch and assistant research biologist Robin M. Bush of UCI's Department of Ecology and Evolutionary Biology, working with researchers at the Centers for Disease Control and Prevention, studied the evolution of a prevalent form of the influenza A virus during an 11-year period from 1986 to 1997. They discovered that viruses having mutations in certain parts of an important viral surface protein were more likely than other strains to spawn future influenza lineages. Human susceptibility to infection depends on immunity gained during past bouts of influenza; thus, new viral mutations are required for new epidemics to occur. Knowing which currently circulating mutant strains are more likely to have successful offspring potentially may help in vaccine strain selection. The researchers' findings appear in the Dec. 3 issue of Science magazine.

Fitch and his fellow researchers followed the evolutionary pattern of the influenza virus, one that involves a never-ending battle between the virus and its host. The human body fights the invading virus by making antibodies against it. The antibodies recognize the shape of proteins on the viral surface. Previous infections only prepare the body to fight viruses with recognizable shapes. Thus, only those viruses that have undergone mutations that change their shape can cause disease. Over time, new strains of the virus continually emerge, spread and produce offspring lineages that undergo further mutations. This process is called antigenic drift. "The cycle goes on and on-new antibodies, new mutants," Fitch said.

The research into the virus' genetic data focused on the evolution of the hemagglutinin gene-the gene that codes for the major influenza surface protein. Fitch and fellow researchers constructed "family trees" for viral strains from 11 consecutive flu seasons. Each branch on the tree represents a new mutant strain of the virus. They found that the viral strains undergoing the greatest number of amino acid changes in specified positions of the hemagglutinin gene were most closely related to future influenza lineages in nine of the 11 flu seasons tested.

By studying the family trees of various flu strains, Fitch said, researchers can attempt to predict the evolution of an influenza virus and thus potentially aid in the development of more effective influenza vaccines.

The research team is currently expanding its work to include all three groups of circulating influenza viruses, hoping that contrasting their evolutionary strategies may lend more insight into the evolution of influenza.

Along with Fitch and Bush, Catherine A. Bender, Kanta Subbarao and Nancy J. Cox of the Centers for Disease Control and Prevention participated in the study.

A talk by Walter Fitch (slides and sound) is here

Exercise2:

The goal of this exercise is to detect sites in hemmagglutinin that are under positive selection.

Since the analysis takes a very long time to run (several days), here are the saved results of the MrBayes run: Fitch_HA.nex.p , Fitch_HA.nex .t .

The original data file is flu_data.paup . The dataset is obtained from an article by Yang et al, 2000 . The File used for MrBayes is here

The MrBayes block used to obtain results above is:

begin mrbayes;
set autoclose=yes;
lset nst=2 rates=gamma nucmodel=codon omegavar=Ny98;
mcmcp samplefreq=500 printfreq=500;
mcmc ngen=500000;
sump burnin=50;
sumt burnin=50; end;

Selecting a nucmodel=codon with Omegavar=Ny98 specifies a model in which for every codon the ratio of the rate of non-synonymous to synonymous substitutions is considered. This ratio is called OMEGA. The Ny98 model considers three different omegas, one equal to 1 (no selection, this site is neutral); the second with omega < 1, these sites are under purifying selection; and the third with Omega >1, i.e. these sites are under positive or diversifying selection. (The problem of this model is that the there are only three distinct omegas estimated, and for each site the probability to fall into one of these three classes. If the omega>1 is estimated to be very large, because one site has a large omega, the other sites might not have a high probability to have the same omega, even though they might also be under positive selection. This leads to the site with largest omega to be identified with confidence, the others have more moderate probabilities to be under positive selection).

Note : Version 2.0 of Mr Bayes has a model that estimates omega for each site individually, the new version only allows the Ny98 model as described above..

First, you need to detect how many generations to burn in (meaning the number of samples you will have to discard). Import the file Fitch_HA.nex.p.txt into Excel (link is above, open a new spreadsheet, select DATA, get external data, import text file (if your file does not have a txt extension, select enable all document types) -- select windows format, else defaults; to import the remaining columns of the spreadsheet see 2. below) and plot # of generations versus -LnL values. Determine after how many generations the graph becomes "stationary". The burnin value is that number of generations divided by 50 (since only every 50th generation was sampled; i.e. the burnin value roughly is equal to the number of rows - not quite because there is a header). To more accurately determine the burnin, you need to rescale the Y-axis (click at the Y-axis -- if you aim accurately, you'll get a box that allows rescaling). Selecting the values to plot becomes easier, if you delete the first row from the spreadsheet (then the first row acts as headers).
The result (scatterplot of LogL versus generation) might look like this:
Load Fitch_HA.nex.p into Excel. (You need to load the data into 2 sheets, since Excel does not allow to load more than 256 columns per sheet. To load the data, create a new Excel spreadsheet. Go to Data->Get External Data->Import Text File , and load first 256 columns.

Go to a separate sheet, and repeat "Get External Data" command to load the rest of the data, you need to block out (=select =make black) and exclude the first 256 columns (the last imported codon ended on nuc 555) -- you need to click the radio button AFTER you selected the columns to skip!). This file contains information for posterior probabilities for each codon (columns) at each sampled generation (rows).
Calculate average posterior probability for each site of being under positive selection ( see example from Monday ). Do not forget to exclude first N rows as a burnin; you should have detected value of N in the first question of this exercise - to be clear on where the burnin ends, you might want to highlight the rows representing the burnin and select a different font color. (Use AVERAGE() function of Excel, enter the formula in a cell below the values for the individual trees, copy the formula to all columns)
Plot average posterior probability vs. site # . (select the row in which you calculated the averages, then click Graph, and select a bar graph). Write down the codon positions for a few sites with the highest posterior probability of being positively selected (the columns name pr+(1,2,3), pr+(4,5,6)....and so on. pr+(1,2,3) mean probability of codon #1 (nucleotide #1, #2 and #3) to be under positive selection)) (Examples are here and here , check all sheets - note: these were calculated from only 30 000 generations)
Create a histogram for the omega value <1 and the omega>1. To do this you need to use Analysis ToolPak of Excel ( Tools->Data Analyses ; if you do not see this item in the Tools menu, you need to activate the ToolPak. To do that go to Tools->Add-Ins... and check "Analysis ToolPak").
The histogram data will be placed into a separate sheet. Again, do not forget to discard the burnin from consideration. Plot the calculated histogram data as a histogram. What is the shape of the distribution?
The histogram fro omega <1 should look something like this:
Determine the 95% credibility interval for the omega<1 value. To do this you sort posterior probability column in ascending order (Select data you want to sort, and go to Data->Sort... ). Again, do not forget to discard the burnin ; the easiest might be to actually delete it.. After sorting, exclude 2.5% of the data on the top and on the bottom. The range of the remaining data gives you the 95% confidence interval. (The values for the 95% credibility interval should be approximately .31- .52 ; using a burnin of 55500 generations, 2.5% of the remainder is about 7.)
The structure of hemagglutinin has been crystallized and is publicly available through PDB. Download the PDB file here and examine it with SPDBV. Chain A of the PDB file corresponds to the sequences we did our analysis with (color the molecule according to chain). Below is a comparison of the one of the sequences we used for analyses with the sequence for which the structure was determined:

Using this alignment as a help, map the sites which have the highest probability to belong to the class with omega>1. Where are these sites located in the protein? (Reminders: The position number in the nexus file corresponds to nucleotide sequence, the structure is based on the amino acid sequence - take the third codon position and divide by 3 to find the amino acid. You only want to be concerned with Chain A!)

In case you have energy left:

Exploring intein structures in SPDBV

Background information:

Most inteins are composed of two domains: one is responsible for protein splicing, and the other has endonuclease activity. A few inteins have lost the endonuclease domain completely and retain only the self-splicing domain and activity. The latter inteins are called mini-inteins .

The structures of several inteins are crystallized.

Saccharomyces cerevisiae intein (PMID: 9160747, 1VDE ),
the same Saccharomyces cerevisiae intein bound to its recognition sequence (PMID: 12219083, 1LWS ),
the Mycobacterium xenopi mini intein (PMID: 9437427, 1AM2 )

Open 1VDE in SPDBV. This structure has two chains. Save chain A to a separate file.
Reopen chain A in SPDBV. Open Mycobacterium mini intein 1AM2. Align two structures using Magic Fit. How good is the alignment? (your answer should be "Bad").

Depict the structures as ribbons and color them according to the secondary structure. Rotate two structures until you can see the similarities between mini intein and one of the domains of large intein. Move them on top of each other. Do "Fit -> Improved Fit...". Does it work now?

Can you find which part of Saccharomyces cerevisiae intein corresponds to the endonuclease domain by comparison of the two structures? [Manually inspect the two structures to find the similarities]. Color the putative self-splicing and endonuclease domains of 1VDE in two different colors. Select N and C terminals (First a.a. and the last a.a.) in both structures. How close are they (in angstroms)? [Click on the button with "1.5A" label, and select first and second a.a. to obtain the distance]. Save your project.
Open Saccharomyces cerevisiae intein that is bound to its recognition DNA sequence. Display intein (chain A) as ribbons with secondary structure color scheme (select color target with little black triangle in Control Panel). Color DNA as CPK, and compute hydrogen bonds (Tools - > compute H-bonds). Does the finding of DNA interaction domain agree with your previous assignment of the self-splicing domain? (see the saved structure from the previous exercise)
Try to find a way to display the interactions between the aminoacid side chains and the DNA helix. One way to do it is to select two DNA chains and select Neighbors of selected amino acids. Choose "select groups that are within" option. Click on the "Cloud" icon in the header of the control panel to display the aminoacids as balls. Also click in the header for showing sidechain. Manually turn off the cloud checkmarks for the DNA. One way to look at individual interactions is to turn the molecule so that one looks down the DNA helix, and select the "Display -> Slab" option. If you press shift while mouse cursor up and down the visible slice moves through the molecule along the axes perpendicular to the screen. If you press the shift key and move the mouse left and right you increase or decrease the size of the slab.
The Lys 340 and Glu 366 are residues that are important for interaction with DNA. Select those residues (in Control panel choose label column to depict the label). Do they interact with major or minor groove of DNA? Which base pairs interact with these amino acids?

Optional exercises #2 :

If you have more time to spare and you are up for a challenge, take a look at the nucleosome. Right click here and save as pdb file. Open it from within spdbv. You might want to do some of the future exercises with the nucleosome in addition to the ATPases - thus save the pdb file, where you can find it again.

Align all the histones form the nucleosome to one reference histone and color in rmv:

The result might look something like this:

The picture shows a structure alignment of the 8 histones (2 each) that are part of the nucleosome. All the histones were colored regarding the match to H2A, except H2A, which was colored according to its match to H3. Coloring option RMS - shorter wavelengths - better match

Below same as last figure, but histones are depicted side by side :

Below are two views of the complete nucleosome. Histones H2A are depicted as spacefilling balls and RMS colored regarding their match to H3. The rest of the molecule is colored according to chain.

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.