MrBayes by example: Identification of sites under positive selection in a protein

Background:

News piece from UC Irvine

A talk by Walter Fitch (slides and sound) is here

Professor Walter M. Fitch and assistant research biologist Robin M. Bush of UCI's Department of Ecology and Evolutionary Biology, working with researchers at the Centers for Disease Control and Prevention, studied the evolution of a prevalent form of the influenza A virus during an 11-year period from 1986 to 1997. They discovered that viruses having mutations in certain parts of an important viral surface protein were more likely than other strains to spawn future influenza lineages. Human susceptibility to infection depends on immunity gained during past bouts of influenza; thus, new viral mutations are required for new epidemics to occur. Knowing which currently circulating mutant strains are more likely to have successful offspring potentially may help in vaccine strain selection. The researchers' findings appear in the Dec. 3 issue of Science magazine.

Fitch and his fellow researchers followed the evolutionary pattern of the influenza virus, one that involves a never-ending battle between the virus and its host. The human body fights the invading virus by making antibodies against it. The antibodies recognize the shape of proteins on the viral surface. Previous infections only prepare the body to fight viruses with recognizable shapes. Thus, only those viruses that have undergone mutations that change their shape can cause disease. Over time, new strains of the virus continually emerge, spread and produce offspring lineages that undergo further mutations. This process is called antigenic drift. "The cycle goes on and on-new antibodies, new mutants," Fitch said.

The research into the virus' genetic data focused on the evolution of the hemagglutinin gene-the gene that codes for the major influenza surface protein. Fitch and fellow researchers constructed "family trees" for viral strains from 11 consecutive flu seasons. Each branch on the tree represents a new mutant strain of the virus. They found that the viral strains undergoing the greatest number of amino acid changes in specified positions of the hemagglutinin gene were most closely related to future influenza lineages in nine of the 11 flu seasons tested.

By studying the family trees of various flu strains, Fitch said, researchers can attempt to predict the evolution of an influenza virus and thus potentially aid in the development of more effective influenza vaccines.

The research team is currently expanding its work to include all three groups of circulating influenza viruses, hoping that contrasting their evolutionary strategies may lend more insight into the evolution of influenza.

Along with Fitch and Bush, Catherine A. Bender, Kanta Subbarao and Nancy J. Cox of the Centers for Disease Control and Prevention participated in the study.

Introduction to Bayesian Analyses

Paul Lewis, from EEB department, is one of the pioneers applying a Bayesian framework to the analysis of molecular data. His lecture notes for his Introduction to Bayesian Phylogenetics are here (about four hours of lecture). Essentially, the Bayesian approach tries to assess the probability of a model, bipartition or range of a parameter value - in contrast to ML, which assesses the probability of the data given a model.

It has been shown that under some conditions the biased sampling of tree and parameter space converges on the posterior probability. The approach most often used in recent months is Markov Chain Monte Carlo sampling. The principle is illustrated by a little program that Paul Lewis wrote called MCRobot. This little robot runs around in two dimensional space over which different distribution can be defined. The walk of the robot is biased in a way so that probability to find the robot in a place is proportional to the defined distribution.

MCRobot demo - you are welcome to play yourself.

MrBayes is doing the same as the MCRobot, but it walks around in tree and parameter space. For each place it visits, the program calculates the likelihood. The decision to take or reject a step is based on the likelihood. From the evaluation of all the trees and parameters visited (minus the burn-in phase), one can calculate the posterior probabilities of trees and parameters.

Exercise1:

The goal of this exercise is to learn how to use MrBayes to reconstruct phylogenies.

Save testseq1c.nex. This is the dataset of vacuolar ATPase A subunit you used earlier, but in the NEXUS format that MrBayes reads. Start MrBayes.
At the MrBayes command prompt type "execute testseq1c.nex". This will load the data into the program.
Type the following commands one by one:
```
lset aamodel=jones
mcmcp samplefreq=50 printfreq=50 nchains=2 startingtree=random
mcmcp savebrlens=yes filename=testseq1c
mcmc ngen=20000
sump filename=testseq1c.bp

    
```
"lset" sets amino acid substitution model to be JTT (Jones, 1992). "mcmcp" sets parameters for Markov Chain Monte Carlo: we set to sample every 50 generation, to print results to a screen every 50th generation, run 2 chains simultaneously, start with random tree, save branch lengths and outfile name will be testseq1c. "mcmc" actually runs the chain, and we set it to run for 20000 generations. After the run is finished, "sump" command will plot the logL vs. generation number, that allows to detect the burnin.
Type "sumt filename=testseq1c.t burnin=20", where you need to substitute burnin value with the number you obtained in the previous step of the exercise (Note: that burnin values is the # of generations divided by 50, since we sample every 50th generation). This command creates a consensus tree, and shows posterior probabilities of the clades.

Exercise2:

Your name:
Your email address:

The goal of this exercise is to detect sites in hemmagglutinin that are under positive selection.

Since the analysis takes a very long time to run (several days), here are the saved results of the MrBayes run: possel2.p, possel2.t, possel2.bp, possel2.posexhaust.

The original data file is flu_data.paup. The dataset is obtained from an article by Yang et al, 2000

The MrBayes block used to obtain results above is:

  
  begin mrbayes;
   set autoclose=yes;
   lset nst=2 omega=estimate enforcecodon=yes omegavar=ny98 inferpossel=yes;
   mcmcp samplefreq=100 printfreq=100 filename=possel2;
   mcmc ngen=60000;
   quit;
  end;

We used a codon model of substitutions, and enabled inference of positively selected sites; We ran the MCMC for 60,000 generations with sampling of every 100th generation.

First, you need to detect how many generations to burn in. Open possel2.p in Excel and plot # of generations versus -LnL values. Determine after how many generations the graph becomes stationary. The burnin value is that number of generations divided by 100 (since only every 100th generation was sampled). To more accurately determine the burnin, you want to rescale the Y-axis.
The burnin can also be detected using sump command:
Start MrBayes. Load the data. To do that, type the following command into MrBayes command prompt:
execute flu_data.paup;
Now type:
sump filename=possel2.bp;
The result of the latter command is an ASCII graph of the number of generations vs. log-likelihood.
Load possel2.posexhaust into Excel. (You need to load the data into 2 sheets, since Excel does not allow to load more than 256 columns per sheet. To load the data, create a new Excel spreadsheet. Go to Data->Get External Data->Import Text File, and load first 256 columns. Go to a separate sheet, and repeat "Get External Data" command to load the rest of the data, you need to block out (=select =make black) and exclude the first 256 columns). This file contains information for posterior probabilities for each site (columns) at each sampled generation (rows).
Calculate average posterior probability for each site of being under positive selection (Do not forget to exclude first N rows as a burnin; you should have detected value of N in the first question of this exercise) [Hint: use AVERAGE() function of Excel]
Plot average posterior probability vs. site # (do separate graphs: one for first 256 sites, and the second for the rest of the sites). Select a few sites with the highest posterior probability of being positively selected.
Paste here site numbers that you selected:
For these sites create a histogram of posterior probabilities. To do this you need to use Analysis ToolPak of Excel (Tools->Data Analyses; If you do not see this item in the Tools menu, you need to activate the ToolPak. To do that go to Tools->Add-Ins... and check "Analysis ToolPak"). The histogram data will be placed into a separate sheet. Again, do not forget to discard the burnin from consideration. Plot the calculated histogram data as a histogram. What is the shape of the distribution?
Determine the 95% confidence interval for the posterior probability for the particular site. To do this you sort posterior probability column in ascending order (Select data you want to sort, and go to Data->Sort...). After sorting, exclude 2.5% of the data on the top and on the bottom. The range of the remaining data gives you the 95% confidence interval.
The structure of hemmagglutinin has been crystallized and is publicly available through PDB. Download the PDB file here and examine it with SPDBV. The Chain A of the PDB file corresponds to the sequences we did analysis with. Below is a comparison of one of the sequences we used for analyses with the sequence for which the structure was determined:
.

Using this alignment as a help, map the site which you have chosen for the analysis onto the protein structure. Where is it located in the protein?
Plot "Nonsynonimous/Synonimous ratio vs. # of generations" (nonsyn_syn column in the possel2.p file); and construct a histogram for this column (see above instructions for histogram creation). Is this protein overall under positive selection? Explain your answer.

Finished?

Check the appropriate radio button below before pressing the submit button:

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone