Assignment 14

Your name:
Your email address:

Some of these slides are from last year's class, so not everything is relevant to this year's lab — slides with examples are here !

1) Handling genome (or larger) amounts of data -- Extracting text from other applications

EMBOSS is installed on the cluster. Here is a list of programs in EMBOSS. Today we will be using pepstats. Click on its entry in the list to see the command line arguments.

First we will download the encoded proteins from the genome of a group of organisms you are interested in (Halobacteria, Methanocella, Halomonas).

Go to the NCBI's current genome list.

Click on the "Prokaryotes" tab. Click on Filters in the upper right corner, Check Assembley level "Complete" and "chromosome". This should result in a listing of completely sequenced genomes only.

Use the Group and SubGroup boxes to narrow your selection, or better yet, use the "Search by organism" box to narrow to a taxonomic group. (Note that after a "Search by organism", one might need to repeat the process of clicking on the "Prokaryotes" tab, and re-tick the "Complete" genomes box.)

Then look for the R and G links in the far right-hand column (you will probably have to scroll to the right). The R takes you to a listing of all the refseq files for a genome project.

You want to download the file ending in ".faa.gz". This "faa" file contains all of the proteins coded by a genome.

Right-click on the faa file, and select "Copy link location". Then go to the cluster and paste the link, like this:

Start a program called Bitvise SSH Client. In the "Host" field, type bbcsrv3.biotech.uconn.edu
In the "Username" field, type your username. Login.

Cluster commands: (bracketed parts are comments only!)

	mkdir lab14a								(make a new folder called "lab14a")
	cd lab14a								(equivalent of "double-clicking" a folder to descend "into" it)
	curl -O paste the link you copied from NCBI				(curl stands for "see URL", this downloads the NCBI link to the cluster)
	                                                        (...you could also accomplish this with the Bitvise SFTP client)
                                                          (NOTE: The curl command wnats a URL, i.e something that starts with http:// 
                                                                ; if what you copy starts with ftp:// replase the first ftp with http.   

	[don't copy this mindlessly! :-)] e.g. curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000940915.1_ASM94091v1/GCF_000940915.1_ASM94091v1_protein.faa.gz

	ls -l									(make sure the file is there)
	gunzip *								(the .gz suffix means this text file is compressed, so uncompress it)
	[...or...]
	gunzip hit your tab key to complete the name				
	ls -l									(the .gz suffix should be gone, and the file got larger)
	[comment] let's make the filename simpler and more descriptive for humans
	[comment] A good convention would be the first letter of the genus, and then the species name, e.g., D_radiodurans.faa.gz
	[comment] REMEMBER TO USE THE TAB KEY TO COMPLETE LONG FILENAMES — this makes life much easier
	mv oldname.faa newname.faa						(changes the name of a file from oldname to newname
	[comment] e.g., mv GCF_000940915.1_ASM94091v1_protein.faa G_species.faa
	more G_species.faa							(inspect the first few lines of the faa file, type "q" to exit)
										(...and space to go forward)
										(...and "b" to go back)
	qlogin									(we're about to do some serious computation, so onto a compute node we go...)
	cd lab14a								(logging into the compute node "resets" our location to our home directory)
	ls									(once more to make sure the faa file is present in this folder)

One more note about qlogin, the command that takes you to a "compute" node. We could "qlogin" before typing the above commands, and perhaps that is better, but the general rule is that as long as you are executing commands that only take a few seconds, you can stay on the master node. DEFINITELY qlogin before running programs like pepstats or BLAST+ utilities. If the scheduler can't schedule your login, use qacct -q to get a listing of available queues and how busy they are. Try the one least busy using qlogin -q name_of_queue, e.g. qlogin -q course.q

And finally, here is the exciting pepstats command:

	pepstats G_species.faa -outfile G_species.pepstats
	more G_species.pepstats							(inspect the pepstats file, type space to go forward)
										(...and "b" to go back)
										(...and "q" to exit)

Now we need a program to extract the isoelectric points (amongst other stuff). It's called parse_pepstats.pl.txt. It will work provided the output of pepstats is in a file ending in ".pepstats" (remember that is what you named it above, type "ls" to confirm. Read through the parse_pepstats.pl.txt script. Try to figure out how the program finds the values for the theoretical isoelectric points in the pepstats output.

	curl -O http://carrot.mcb.uconn.edu/mcb3421_2017/parse_pepstats.pl.txt		(download the perl script via link)
	perl parse_pepstats.pl.txt							(run the script, and extract the isoelectric points)
	ls -l										(you made a bunch of additional files)
	head G_species.pepstats.pI							(the first 10 isoelectic points!)

As an alternative, especially if you have many genomes that you want to analyze, you could use a modified script that calls a Rscript to calcualte and plot the histogram. The Perl script is parse_pepstats_mod2.pl.
The R script is histogramscript_pdf.R

	curl -O http://carrot.mcb.uconn.edu/mcb3421_2017/parse_pepstats_mod2.pl		(download the perl script via link)
	curl -O http://carrot.mcb.uconn.edu/mcb3421_2017/histogramScript_pdf.R      (download the R script via link)
  perl parse_pepstats_mod2.pl.txt							(run the script, and extract the isoelectric points)
	ls -l										        (you made a bunch of additional files)
	head G_species.pepstats.pI							(the first 10 isoelectic points!)
                                                      (for each pepstats file a pdf file containing the histogram should be created in the folder)
                                                      (move these files to your PC and inspect them using acrobat or similar)

Using the Bitvise SFTP Window, drag the file containing the isoelectric points (.pepstats.pI ending) to your computer, and load it into Excel. (Remember that it's in the lab14a folder on the cluster.) Make a histogram! In Excel, after loading the .pI file (remember to select "All Files" to see it), use Insert -- Statistic Chart (all-blue column chart icon in the Charts section) -- Histogram.

(More about histograms in Excel, including the formula used for default binning.)

Which genomes did you analyze?
Describe the distribution of isoelectric points in your selected genome. How many peaks?
Why might there be a minimum at around pH7?
Compare your finding with others in the class. Do thermo- and halophiles have the same distributions of isoelectric points?

Check a few of the ORFs with very alkaline theoretical isoelectric point (the *.parsed outfile contains accession numbers in the first column; you could use
fastacmd -s accession_number to retrieve the sequence from nr).
Which charge would these proteins have at neutral pH? Can you see a pattern in the types of enzymes?

Exercise 2:

dN/dS ratios along a sequence. I ran the dataset from lab11 exercise #1 and 2 using the NY98 model (no partitions). The file is here. The model has a parameter for transition/tranversion ratio and for the dN/dS ratio (called omega). The model uses and explores only two of these - one for sites under purifying selection, one for sites under diversifying selection. In addition for each codon the probability that the codon is in the omega+ group is estimated. This results in a lot of parameters, which makes for a slow moving robot. This is the MrBayes block in the file:

begin mrbayes;
lset nst=2 rates=gamma nucmodel=codon omegavar=Ny98;
report possel = yes siteomega = yes;
mcmcp filename=analysisS;
mcmcp samplefreq=50 printfreq=50 diagnfreq=500;
mcmcp savebrlens=yes;
end;

The parameter files resulting form a 24h run is here. (already imported into excel). Two sheets give the parameters for each run, the third contains the values from both runs after the burnin.

Scroll to the right to see these columns, starting with pr+(1,2,3), pr+(4,5,6), etc. Calculate average posterior probability for each site of being under positive selection . Use the AVERAGE() function of Excel, enter the formula in a cell below the values for the individual generations -- starting in column pr+(1,2,3) -- copy the formula to all columns). You can do the same with the esimated omega values (to the right of the pr(xyz columns). A spreadsheet with the graph already calculated is here.

Can you recognize from the plots where the intein is located?
Why do many sites within the intein appear to have little evidence for purifying (negative) selection?
Is there clear and convincing evidence for sites to have been under diversifying (positive) selection?

Background:

Professor Walter M. Fitch and assistant research biologist Robin M. Bush of UCI's Department of Ecology and Evolutionary Biology, working with researchers at the Centers for Disease Control and Prevention, studied the evolution of a prevalent form of the influenza A virus during an 11-year period from 1986 to 1997. They discovered that viruses having mutations in certain parts of an important viral surface protein were more likely than other strains to spawn future influenza lineages. Human susceptibility to infection depends on immunity gained during past bouts of influenza; thus, new viral mutations are required for new epidemics to occur. Knowing which currently circulating mutant strains are more likely to have successful offspring potentially may help in vaccine strain selection. The researchers' findings appear in the Dec. 3 issue of Science magazine.

Fitch and his fellow researchers followed the evolutionary pattern of the influenza virus, one that involves a never-ending battle between the virus and its host. The human body fights the invading virus by making antibodies against it. The antibodies recognize the shape of proteins on the viral surface. Previous infections only prepare the body to fight viruses with recognizable shapes. Thus, only those viruses that have undergone mutations that change their shape can cause disease. Over time, new strains of the virus continually emerge, spread and produce offspring lineages that undergo further mutations. This process is called antigenic drift. "The cycle goes on and on-new antibodies, new mutants," Fitch said.

The research into the virus' genetic data focused on the evolution of the hemagglutinin gene-the gene that codes for the major influenza surface protein. Fitch and fellow researchers constructed "family trees" for viral strains from 11 consecutive flu seasons. Each branch on the tree represents a new mutant strain of the virus. They found that the viral strains undergoing the greatest number of amino acid changes in specified positions of the hemagglutinin gene were most closely related to future influenza lineages in nine of the 11 flu seasons tested.

By studying the family trees of various flu strains, Fitch said, researchers can attempt to predict the evolution of an influenza virus and thus potentially aid in the development of more effective influenza vaccines.

The research team is currently expanding its work to include all three groups of circulating influenza viruses, hoping that contrasting their evolutionary strategies may lend more insight into the evolution of influenza.

Along with Fitch and Bush, Catherine A. Bender, Kanta Subbarao and Nancy J. Cox of the Centers for Disease Control and Prevention participated in the study.

A talk by Walter Fitch (slides and sound) is here

The goal of this exercise is to detect sites in hemmagglutinin that are under positive selection.

Since the analysis takes a very long time to run (several days), here are the saved results of the MrBayes run: Fitch_HA.nex.p.txt, Fitch_HA.nex.t.txt .

The original data file is flu_data.paup . The dataset is obtained from an article by Yang et al, 2000 . The File used for MrBayes is here

The MrBayes block used to obtain results above is:

begin mrbayes;
set autoclose=yes;
lset nst=2 rates=gamma nucmodel=codon omegavar=Ny98;
mcmcp samplefreq=500 printfreq=500;
mcmc ngen=500000;
sump burnin=50;
sumt burnin=50; end;

Selecting a nucmodel=codon with Omegavar=Ny98 specifies a model in which for every codon the ratio of the rate of non-synonymous to synonymous substitutions is considered. This ratio is called OMEGA. The Ny98 model considers three different omegas, one equal to 1 (no selection, this site is neutral); the second with omega < 1, these sites are under purifying selection; and the third with Omega >1, i.e. these sites are under positive or diversifying selection. (The problem of this model is that the there are only three distinct omegas estimated, and for each site the probability to fall into one of these three classes. If the omega>1 is estimated to be very large, because one site has a large omega, the other sites might not have a high probability to have the same omega, even though they might also be under positive selection. This leads to the site with largest omega to be identified with confidence, the others have more moderate probabilities to be under positive selection).

Note : Version 2.0 of Mr Bayes has a model that estimates omega for each site individually, the new version only allows the Ny98 model as described above..

First, you need to detect how many generations to burn in (meaning the number of samples you will have to discard). Open the file Fitch_HA.nex.p.txt with Excel and plot # of generations versus -LnL values. Determine after how many generations the graph becomes "stationary" (hint: change the Y-axis bounds to "zoom in", e.g., -3300 min to -3200 max). The burnin value is that number of generations divided by 50 (since only every 50th generation was sampled; i.e. the burnin value roughly is equal to the number of rows - not quite because there is a header). To more accurately determine the burnin, you need to rescale the Y-axis (click at the Y-axis -- if you aim accurately, you'll get a box that allows rescaling).
The result (scatterplot of LogL versus generation) might look like this:
This file contains information for posterior probabilities for each codon (columns) at each sampled generation (rows). Scroll to the right to see these columns, starting with pr+(1,2,3), pr+(4,5,6), etc. Calculate average posterior probability for each site of being under positive selection (Do not forget to exclude first N rows as a burnin; you should have detected value of N in the first question of this exercise - to be clear on where the burnin ends, you might want to highlight the rows representing the burnin and select a different font color. (Use AVERAGE() function of Excel, enter the formula in a cell below the values for the individual generations -- starting in column pr+(1,2,3) -- copy the formula to all columns) (see slides)
Plot average posterior probability vs. site #. (select the row in which you calculated the averages, then click Graph, and select a bar graph). Write down the codon positions for a few sites with the highest posterior probability of being positively selected (the columns name pr+(1,2,3), pr+(4,5,6)....and so on. pr+(1,2,3) mean probability of codon #1 (nucleotide #1, #2 and #3) to be under positive selection))

Determine the 95% credibility interval for the omega<1 value. To do this you sort posterior probability column in ascending order (Select data you want to sort, and go to Data->Sort... ). Again, do not forget to discard the burnin ; the easiest might be to actually delete it.. After sorting, exclude 5% of the data on the top and on the bottom. The range of the remaining data gives you the 90% confidence interval. (Enter answer in box below!)
The structure of hemagglutinin has been crystallized and is publicly available through PDB. Download the PDB file here and examine it with SPDBV. Chain A of the PDB file corresponds to the sequences we did our analysis with (color the molecule according to chain). Below is a comparison of one of the sequences we used for analyses with the sequence for which the structure was determined:

Using this alignment as a guide, map the site(s) which have the highest probability to belong to the class with omega>1. Where are these sites located in the protein? (Reminders: The position number in the nexus file corresponds to nucleotide sequence, the structure is based on the amino acid sequence - take the third codon position and divide by 3 to find the amino acid. You only want to be concerned with Chain A!)

What is the 95% credibility interval for the omega < 1?
Does this value indicate strong purifying selection?
Which codon (s) showed signs of positive selection?
Is this signal stronger than the one detected in exercise 2?
Which position and which amino acid does this correspond to in the above alignment?
Where is this aa located in the structure?

Finished?

Type logout to release the compute node from the queue.
If you you encountered problems in your session, check the queue for abandoned sessions using the command qstat. If there are abandoned sessions under your account, kill them by deleting them from the queue by typing qdel job-ID, e.g. "qdel 40000" would delete Job # 40000

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.