Your name: Your email address:
Exercise 1:
Likelihood ratio tests
Open seaview, and load Yeast_vma1_all_seaview.mase. Note that these nucleotide sequences are already aligned based on their encoded amino acids. (To see the aa alignment click on view and check protein - return to the nucleotide sequence view). The dataset has three set of sites: all, intein, extein. Collaborate with two of your neighbors, each of you will use a different set of sites.
We want to calculate thee likelihood values using Trees — PhyML:
All other parameters are the same for the three runs: Support values: aLRTvalues, nucleotide equilibrium frequencies optimized, tree searching NNI, starting tree BioNJ, optimize tree topology.
Once the tree reconstruction is finished, DO NOT CLICK OK, rather click on copy at the bottom of the window, open a text editor and paste the content into the test window. We are interested in the log likelihood, and the last values estimated for the shape parameter, and the % invariable sites (if the latter two were estimated as part of the model).
One important condition that has to be fulfilled before one can use a Likelihood Ratio Test (LRT) to compare two models, is that the models should be "nested". This means that the simpler model must be a constrained version of the parameter-rich model. The likelihood ratio test is performed by doubling the difference in log-likelihood scores and comparing this test statistic with the critical value from a chi-squared distribution having degrees of freedom equal to the difference in the number of estimated parameters in the two models. The parameter-rich model will always have a better fit, due to the extra parameters and will therefore have the highest log-likelihood, so the difference should be a positive number. In this case there is 1 degree of freedom between each of the models — the gamma shape parameter is one parameter and the % invariant sites is the second parameter. Use this online chi-square calculator to determine the significance of the test.
Does the GTR+Gamma model explain the data significantly better than the more simple GTR model? Does the GTR+Gamma+I model explain the data significantly better than GTR+Gamma model? Is this true for all of the the three sets of sites?
When you compare the trees you obtains for the intein, extein and the combined datasets, do you observe any differences?
Introduction to Bayesian Analyses Intro Slides are here
In class we will use MrBayes3.2 as installed on the bioinformatics cluster. Before you set up an ssh terminal connection to cluster, increase the width of the terminal to >150 characters. The goal of this exercise is to learn how to use MrBayes to reconstruct phylogenies and estimate parameters.
MrBayes by example: Identification of sites under positive selection in a protein
Exercise 3:
Professor Walter M. Fitch and assistant research biologist Robin M. Bush of UCI's Department of Ecology and Evolutionary Biology, working with researchers at the Centers for Disease Control and Prevention, studied the evolution of a prevalent form of the influenza A virus during an 11-year period from 1986 to 1997. They discovered that viruses having mutations in certain parts of an important viral surface protein were more likely than other strains to spawn future influenza lineages. Human susceptibility to infection depends on immunity gained during past bouts of influenza; thus, new viral mutations are required for new epidemics to occur. Knowing which currently circulating mutant strains are more likely to have successful offspring potentially may help in vaccine strain selection. The researchers' findings appear in the Dec. 3 issue of Science magazine.
Fitch and his fellow researchers followed the evolutionary pattern of the influenza virus, one that involves a never-ending battle between the virus and its host. The human body fights the invading virus by making antibodies against it. The antibodies recognize the shape of proteins on the viral surface. Previous infections only prepare the body to fight viruses with recognizable shapes. Thus, only those viruses that have undergone mutations that change their shape can cause disease. Over time, new strains of the virus continually emerge, spread and produce offspring lineages that undergo further mutations. This process is called antigenic drift. "The cycle goes on and on-new antibodies, new mutants," Fitch said.
The research into the virus' genetic data focused on the evolution of the hemagglutinin gene-the gene that codes for the major influenza surface protein. Fitch and fellow researchers constructed "family trees" for viral strains from 11 consecutive flu seasons. Each branch on the tree represents a new mutant strain of the virus. They found that the viral strains undergoing the greatest number of amino acid changes in specified positions of the hemagglutinin gene were most closely related to future influenza lineages in nine of the 11 flu seasons tested.
By studying the family trees of various flu strains, Fitch said, researchers can attempt to predict the evolution of an influenza virus and thus potentially aid in the development of more effective influenza vaccines.
The research team is currently expanding its work to include all three groups of circulating influenza viruses, hoping that contrasting their evolutionary strategies may lend more insight into the evolution of influenza.
Along with Fitch and Bush, Catherine A. Bender, Kanta Subbarao and Nancy J. Cox of the Centers for Disease Control and Prevention participated in the study.
The goal of this exercise is to detect sites in hemmagglutinin that are under positive selection.
Since the analysis takes a very long time to run (several days), here are the saved results of the MrBayes run: Fitch_HA.nex.p.txt, Fitch_HA.nex.t.txt . The original data file is flu_data.paup . The dataset is obtained from an article by Yang et al, 2000 . The File used for MrBayes is here The MrBayes block used to obtain results above is:
Selecting a nucmodel=codon with Omegavar=Ny98 specifies a model in which for every codon the ratio of the rate of non-synonymous to synonymous substitutions is considered. This ratio is called OMEGA. The Ny98 model considers three different omegas, one equal to 1 (no selection, this site is neutral); the second with omega < 1, these sites are under purifying selection; and the third with Omega >1, i.e. these sites are under positive or diversifying selection. (The problem of this model is that the there are only three distinct omegas estimated, and for each site the probability to fall into one of these three classes. If the omega>1 is estimated to be very large, because one site has a large omega, the other sites might not have a high probability to have the same omega, even though they might also be under positive selection. This leads to the site with largest omega to be identified with confidence, the others have more moderate probabilities to be under positive selection).
Note : Version 2.0 of Mr Bayes has a model that estimates omega for each site individually, the new version only allows the Ny98 model as described above..
Exercise 4:
dN/dS ratios along a sequence. I ran the dataset from exercise #1 and 2 using the NY98 model (no partitions). The file is here. The model has a parameter for transition/tranversion ratio and for the dN/dS ratio (called omega). The model uses and explores only two of these - one for sites under purifying selection, one for sites under diversifying selection. In addition for each codon the probability that the codon is in the omega+ group is estimated, and the omega value for each codon is estimated. This results in a lot of parameters, which makes for a slow moving robot. The parameter files resulting form the run are here. This is the MrBayes block in the file:
begin mrbayes; lset nst=2 rates=gamma nucmodel=codon omegavar=Ny98; report possel = yes siteomega = yes; mcmcp filename=analysisS; mcmcp samplefreq=50 printfreq=50 diagnfreq=500; mcmcp savebrlens=yes; end;
This results in a lot of parameters, which makes for a slow moving robot. The parameter files resulting form a 24h run is here. (already imported into excel). Two sheets give the parameters for each run, the third contains the values after the burnin.
Scroll to the right to see these columns, starting with pr+(1,2,3), pr+(4,5,6), etc. Calculate average posterior probability for each site of being under positive selection . Use the AVERAGE() function of Excel, enter the formula in a cell below the values for the individual generations -- starting in column pr+(1,2,3) -- copy the formula to all columns). You can do the same with the esimated omega values (to the right of the pr(xyz columns).
Type logout to release the compute node from the queue. If you you encountered problems in your session, check the queue for abandoned sessions using the command qstat. If there are abandoned sessions under your account, kill them by deleting them from the queue by typing qdel job-ID, e.g. "qdel 40000" would delete Job # 40000
Send email to your instructor (and yourself) upon submit Send email to yourself only upon submit (as a backup) Show summary upon submit but do not send email to anyone.