Assignment 13

Your name:
Your email address:

Assignments:

1. Using PRSS determine if there is significant similarity between proteins with the following gi numbers 145722 (D-Ala D-Ala ligase) and 121663 (Glutathione synthetase). Select "Accession/GI number" from the drop-down box that will by default have "FASTA format" selected. Click on "Shuffle Sequence" to start the comparison.

What is the E-Value (10000) of the comparison?

2. (10 minutes) Collaborate with your neighbor. One of you should do exercise #2, the other exercise #3. Share the results!

Do a PSI-BLAST search with the Glutathione synthetase (121663) as a query (use uniProtKB/swiss-prot as database). Under "Filters and Masking", turn on the filter for low complexity (tick/check the box), set the E-value cut-off for inclusion in the next round ("PSI-BLAST Threshold") to 0.00001 (or 1e-5) and change the maximum target sequences to 20000.
Note : Depending on the settings, PSI-Blast switches back and forth between the format and the result window. Place a checkmark in the box Show results in a new window (at the bottom of the page).

Note that the iteration number is listed at the top of the result page. One "launches" the next PSI-Blast iteration by looking for the "Run PSI-Blast iteration ? with max" at the top (or the bottom) of the list of significant matches, and clicking "Go".

After how many iterations (do not more than 5 iterations!) do you start to pick up D-Ala D-Ala ligase and Carbamoyl-phosphate synthase ?

Which other types of enzyme are included among the hits?

Notice! : If this takes a long time, collaborate with your neighbor. One of you could do task #2, the other #3.

3. (10 minutes) Do a PSI-BLAST search with the D-Ala D-Ala ligase (145722) as a query (use uniProtKB/swiss-prot as database). Under "Filters and Masking", turn on the filter for low complexity (tick/check the box), set the E-value cut-off for inclusion in the next round ("PSI-BLAST Threshold") to 0.00001 (or 1e-5) and change the maximum target sequences to 10000.
Note : Depending on the settings, PSI-Blast switches back and forth between the format and the result window. Place a checkmark in the box Show results in a new window (at the bottom of the page).

After how many iterations (do not more than 5 iterations!) do you start to pick up carbamoyl phosphate synthetases, Glutathione synthetase, and Biotin carboxylase (aka Acetyl-CoA carboxylase A1)?

Which other types of enzyme are included among the hits?

What might be the reason for the different results obtained in tasks 2 and 3?

4. (30 Minutes) Do a PSI-BLAST (use uniProtKB/swiss-prot as database) search for 4 iterations, an E-value cut-off for inclusion in the next round ("PSI-BLAST Threshold") to 0.0001 with the following sequence:

DO NOT use the gi number as query. The gi number refers to the whole protein, we want to use the intein sequence only!
>Pab_VMA intein from gi|7436316|pir||D75028
CVDGDTLVLTKEFGLIKIKDLYKILDGKGKKTVNGNEEWTELERPITLYGYKDGKIVEIKATHVYKGFS
AGMIEIRTRTGRKIKVTPIHKLFTGRVTKNGLEIREVMAKDLKKGDRIIVAKKIDGGERVKLNIRVEQKR
GKKIRIPDVLDEKLAEFLGYLIADGTLKPRTVAIYNNDESLLRRANELANELFNIEGKIVKGRTVKALLI
HSKALVEFFSKLGVPRNKKARTWKVPKELLISEPEVVKAFIKAYIMCDGYYDENKGEIEIVTASEEAAYG
FSYLLAKLGIYAIIREKIIGDKVYYRVVISGESNLEKLGIERVGRGYTSYDIVPVEVEELYNALGRPYAE
LKRAGIEIHNYLSGENMSYEMFRKFAKFVGMEEIAENHLTHVLFDEIVEIRYISEGQEVYDVTTETHNFIGG
NMPTLLHNT

DO NOT use the gi number as query. The gi number refers to the whole protein, we want to use the intein sequence only!

What types of enzymes do you get as hits?

Can you verify that the target proteins contain inteins? (How?)

Which E-value cut-off for inclusion in the next round did you choose?

What is the percent identity of the least significant hit added in the last iteration (clicking on the score in the table will jump to the alignment)?

Save the PSSM (Position Specific Scoring Matrix, or profile) from this search. To do that choose PSSM from menu inside the download link on top of the result page. Save the PSSM as an ASN file on your computer.
(If the iterations take too long, the PSSM after 4 iterations on the swissprot database is here, the PSSM after 5 iterations on the nr database is here*)

* I used the following command (commandline using the new blast+ package): psiblast -db nr -out out.inteinq -query inteinquery.txt -out_pssm inteinquery.pssm -out_ascii_pssm inteinquery.asci.pssm -inclusion_ethresh 0.0001 -num_iterations 5

Go to Microbial Genomes Genomic BLAST page. Select BlastP (on top of the page). Paste intein sequence into query sequence box. Select the non redundant database.
Select the organisms to which the search should be restricted. You can select individual organisms or whole groups. (if you start typing, options will appear form which to select taxa)

Possible are

Pyrobaculum aerophilum
Aeropyrum pernix
Sulfolobus tokodaii
Archaeoglobus fulgidus
Methanothermobacter thermautotrophicus
Thermoplasma volcanium
Methanocaldococcus
Methanococcales
Halobacteria

After selecting the genomes to search, go to Algorithm parameters and under PSI-blast otions select and upload your PSSM from Question #7 .

As you start your search with a PSSM matrix already, you do not need to do iterations.

What are the results of your search? Did you get any significant matches? What are they?
If you have significant matches, does the match occur over the full lengths of both query and subject sequences? What is the percent identity?
Use Blink to investigate if the hits are indeed inteins.
What is your conclusion?
In your answer indicate - genomes searched, - number of significant matches found, - the E-values of these matches, and - the identity of these matches (i.e., are these probable inteins, or are they likely to be something else?).

5. Using PSSMs

SKIP THE COMMAND IN GREEN!!!!!

IS605.faa.txt is a FASTA formatted file containing an annotated IS605 transposase protein sequence from a Frankia genome.

We will use it to build a PSSM for this protein family, and then compare (mainly quantitatively) three searches:

a blastp search of Frankia genomes for ORFs with significant matches to the sequence in the FASTA file,
a PSI-blastp search of the collection of ORFs from the five genomes, and
a PSI-tblastn search of Frankia genomes for significant matches to this PSSM

To do this, we will use the cluster. Establish a terminal connection to the cluster bbcsrv3.biotech.uconn.edu with the Bitvise SSH program installed on your computer. (Also recall that this program includes a SFTP Window to transfer files between the cluster and your computer, although since the files are going TO the cluster in this case, we will use "curl" for convenience.)

A FASTA file with all the proteins from 5 Frankia genomes is here - fiveFrankia.faa.txt

A FASTA file of the nucleotide sequences of the 5 Frankia genomes is here - fiveFrankia.fna.txt

Note that these (currently 5) Frankia genomes were retreived from the Entrez ftp site. In this case, I retreived the *.faa (protein) files, and the *.fna (genome nucleotide) files.

Preparation: (bracketed parts are comments only!)

	mkdir lab13								(make a new folder called "lab13")
	cd lab13								(equivalent of "double-clicking" a folder to descend "into" it)
	curl -O http://carrot.mcb.uconn.edu/mcb3421_2016/IS605.faa.txt		(curl stands for "see URL", this downloads the IS605 file to the cluster)
	                                                              	 	(...you could also accomplish this with the Bitvise SFTP client)
	ls									(make sure the file is there)
	cat IS605.faa.txt							(...and that the contents look OK)
	qlogin									(we're about to do some serious computation, so onto a compute node we go...)
	cd lab13								(logging into the compute node "resets" our location to our home directory)
	ls									(once more to make sure the IS605 file is present in this folder)

To create the PSSM for IS605: (the psiblast command is long!...remember to scroll to the right to get all of it!)
SKIP THE COMMAND IN GREEN!!!!!

	export BLASTDB=/common/blast/data					(just to make sure BLAST knows where to find the "nr" database)
	export BLASTMAT=/opt/bio/ncbi/data					(...and the scoring matrices)
										(...ideally "BLASTDB" would be set in your .bash_profile, to avoid extra typing)
	psiblast -db nr -query IS605.faa.txt -out_pssm IS605checkpointPSSM -out_ascii_pssm IS605pssm.txt -inclusion_ethresh 1e-5 -out blast.out -num_iterations 2 -num_threads 2

("-num_threads 2" says to use two threads -- similar to two processors/CPUs -- which should make it faster, at the expense of using twice the resources)

This takes quite a while, 40 MINUTES FOR JUST 2 ITERATIONS!!!!! 90 minutes for 5 iterations with 2 threads, 60 minutes for 5 iterations with 4 threads, if you are in a hurry, grab the checkpoint file (with -num_iterations 5) from here - IS605checkpointPSSM

Options for psiblast can be seen using

	psiblast -help

or look at this psiblast-help.txt

In preparation for our blast searches on the Frankia genomes, we need to first create a searchable blast database:

	curl -O http://carrot.mcb.uconn.edu/mcb3421_2016/fiveFrankia.faa.txt
	makeblastdb -in fiveFrankia.faa.txt -dbtype prot -parse_seqids
	curl -O http://carrot.mcb.uconn.edu/mcb3421_2016/fiveFrankia.fna.txt
	makeblastdb -in fiveFrankia.fna.txt -dbtype nucl -parse_seqids

To do a normal blastp search:

	blastp -db fiveFrankia.faa.txt -query IS605.faa.txt -out blastp.out -evalue 1e-5 -outfmt 6 -num_threads 2

To do a PSI-blast search of the encoded proteins:

	curl -O http://carrot.mcb.uconn.edu/mcb3421_2016/IS605checkpointPSSM		(get the PSSM first, because we skipped the command in green)
	psiblast -db fiveFrankia.faa.txt -in_pssm IS605checkpointPSSM -out PSIblastP.out -inclusion_ethresh 1e-5 -outfmt 6 -num_threads 2

You will receive a fancy warning message, but I believe it works just fine. More on the warning message is here.

To do a PSI-blast search of the 6 reading frames of the genomes:

	tblastn -db fiveFrankia.fna.txt -in_pssm IS605checkpointPSSM -out psitblastn.out -evalue 1e-5 -outfmt 6 -num_threads 2

Counting the number of lines (corresponding to the number of significant matches; note, we had set the evalue to 10^-5, and selected tabular output) in a file:

wc -l blastp.out
wc -l PSIblastP.out
wc -l psitblastn.out

or wc -l psitblastn.out PSIblastP.out blastp.out

Also note (for fun purposes only) that the tblastn matches are not distributed evenly by genome:

	cut -f2 psitblastn.out | uniq -c				(cuts out 2nd field, which is the target genome ID, and counts the matches for each of the 5 genomes)

How does the number of blastp matches compare to the number of PSI-blast and PSI-tblastn matches?

If there is a significant difference in the number of matches, can you think of a reason why this could happen?

BE SURE TO TYPE

	logout								(log out from compute node and return to master node)

TO RELEASE THE QLOGIN JOB FROM THE QUEUE

6. MrBayes by example: Identification of sites under positive selection in a protein

Background:

Professor Walter M. Fitch and assistant research biologist Robin M. Bush of UCI's Department of Ecology and Evolutionary Biology, working with researchers at the Centers for Disease Control and Prevention, studied the evolution of a prevalent form of the influenza A virus during an 11-year period from 1986 to 1997. They discovered that viruses having mutations in certain parts of an important viral surface protein were more likely than other strains to spawn future influenza lineages. Human susceptibility to infection depends on immunity gained during past bouts of influenza; thus, new viral mutations are required for new epidemics to occur. Knowing which currently circulating mutant strains are more likely to have successful offspring potentially may help in vaccine strain selection. The researchers' findings appear in the Dec. 3 issue of Science magazine.

Fitch and his fellow researchers followed the evolutionary pattern of the influenza virus, one that involves a never-ending battle between the virus and its host. The human body fights the invading virus by making antibodies against it. The antibodies recognize the shape of proteins on the viral surface. Previous infections only prepare the body to fight viruses with recognizable shapes. Thus, only those viruses that have undergone mutations that change their shape can cause disease. Over time, new strains of the virus continually emerge, spread and produce offspring lineages that undergo further mutations. This process is called antigenic drift. "The cycle goes on and on-new antibodies, new mutants," Fitch said.

The research into the virus' genetic data focused on the evolution of the hemagglutinin gene-the gene that codes for the major influenza surface protein. Fitch and fellow researchers constructed "family trees" for viral strains from 11 consecutive flu seasons. Each branch on the tree represents a new mutant strain of the virus. They found that the viral strains undergoing the greatest number of amino acid changes in specified positions of the hemagglutinin gene were most closely related to future influenza lineages in nine of the 11 flu seasons tested.

By studying the family trees of various flu strains, Fitch said, researchers can attempt to predict the evolution of an influenza virus and thus potentially aid in the development of more effective influenza vaccines.

The research team is currently expanding its work to include all three groups of circulating influenza viruses, hoping that contrasting their evolutionary strategies may lend more insight into the evolution of influenza.

Along with Fitch and Bush, Catherine A. Bender, Kanta Subbarao and Nancy J. Cox of the Centers for Disease Control and Prevention participated in the study.

A talk by Walter Fitch (slides and sound) is here

Exercise6:

The goal of this exercise is to detect sites in hemmagglutinin that are under positive selection.

Since the analysis takes a very long time to run (several days), here are the saved results of the MrBayes run: Fitch_HA.nex.p.txt, Fitch_HA.nex.t.txt .

The original data file is flu_data.paup . The dataset is obtained from an article by Yang et al, 2000 . The File used for MrBayes is here

The MrBayes block used to obtain results above is:

begin mrbayes;
set autoclose=yes;
lset nst=2 rates=gamma nucmodel=codon omegavar=Ny98;
mcmcp samplefreq=500 printfreq=500;
mcmc ngen=500000;
sump burnin=50;
sumt burnin=50; end;

Selecting a nucmodel=codon with Omegavar=Ny98 specifies a model in which for every codon the ratio of the rate of non-synonymous to synonymous substitutions is considered. This ratio is called OMEGA. The Ny98 model considers three different omegas, one equal to 1 (no selection, this site is neutral); the second with omega < 1, these sites are under purifying selection; and the third with Omega >1, i.e. these sites are under positive or diversifying selection. (The problem of this model is that the there are only three distinct omegas estimated, and for each site the probability to fall into one of these three classes. If the omega>1 is estimated to be very large, because one site has a large omega, the other sites might not have a high probability to have the same omega, even though they might also be under positive selection. This leads to the site with largest omega to be identified with confidence, the others have more moderate probabilities to be under positive selection).

Note : Version 2.0 of Mr Bayes has a model that estimates omega for each site individually, the new version only allows the Ny98 model as described above..

First, you need to detect how many generations to burn in (meaning the number of samples you will have to discard). Open the file Fitch_HA.nex.p.txt with Excel and plot # of generations versus -LnL values. Determine after how many generations the graph becomes "stationary" (hint: change the Y-axis bounds to "zoom in", e.g., -3300 min to -3200 max). The burnin value is that number of generations divided by 50 (since only every 50th generation was sampled; i.e. the burnin value roughly is equal to the number of rows - not quite because there is a header). To more accurately determine the burnin, you need to rescale the Y-axis (click at the Y-axis -- if you aim accurately, you'll get a box that allows rescaling).
The result (scatterplot of LogL versus generation) might look like this:
This file contains information for posterior probabilities for each codon (columns) at each sampled generation (rows). Scroll to the right to see these columns, starting with pr+(1,2,3), pr+(4,5,6), etc. Calculate average posterior probability for each site of being under positive selection (Do not forget to exclude first N rows as a burnin; you should have detected value of N in the first question of this exercise - to be clear on where the burnin ends, you might want to highlight the rows representing the burnin and select a different font color. (Use AVERAGE() function of Excel, enter the formula in a cell below the values for the individual trees -- starting in column pr+(1,2,3) -- copy the formula to all columns)
Plot average posterior probability vs. site #. (select the row in which you calculated the averages, then click Graph, and select a bar graph). Write down the codon positions for a few sites with the highest posterior probability of being positively selected (the columns name pr+(1,2,3), pr+(4,5,6)....and so on. pr+(1,2,3) mean probability of codon #1 (nucleotide #1, #2 and #3) to be under positive selection))

Determine the 95% credibility interval for the omega<1 value. To do this you sort posterior probability column in ascending order (Select data you want to sort, and go to Data->Sort... ). Again, do not forget to discard the burnin ; the easiest might be to actually delete it.. After sorting, exclude 5% of the data on the top and on the bottom. The range of the remaining data gives you the 90% confidence interval. (Enter answer in box below!)
The structure of hemagglutinin has been crystallized and is publicly available through PDB. Download the PDB file here and examine it with SPDBV. Chain A of the PDB file corresponds to the sequences we did our analysis with (color the molecule according to chain). Below is a comparison of one of the sequences we used for analyses with the sequence for which the structure was determined:

Using this alignment as a guide, map the site(s) which have the highest probability to belong to the class with omega>1. Where are these sites located in the protein? (Reminders: The position number in the nexus file corresponds to nucleotide sequence, the structure is based on the amino acid sequence - take the third codon position and divide by 3 to find the amino acid. You only want to be concerned with Chain A!)

What is the 95% credibility interval for the omega < 1?
Does this value indicate strong purifying selection?
Which codon(s) showed signs of positive selection?
Which position and which amino acid does this correspond to in the above alignment?
Where is this aa located in the structure?

Finished?

Type logout to release the compute node from the queue.
If you you encountered problems in your session, check the queue for abandoned sessions using the command qstat. If there are abandoned sessions under your account, kill them by deleting them from the queue by typing qdel job-ID, e.g. "qdel 40000" would delete Job # 40000

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.