Assignment 13

Your name:
Your email address:

Take home exam #9 has been posted (Due Monday after Thanksgiving)!

 

Assignments:

1. (2 Minutes) Using PRSS, determine whether there is significant similarity between the proteins with the following gi numbers: 145722 (D-Ala D-Ala ligase) and 121663 (Glutathione synthetase). Select "Accession/GI number" from the drop-down box, which has "FASTA format" selected by default.

What is the E-Value (10000) of the comparison?

2. (10 minutes) Collaborate with your neighbor. One of you should do exercise #2, the other exercise #3. Share the results!

Do a PSI-BLAST search with the Glutathione synthetase (gi|121663) as a query (use swissprot as database). On the Format page, turn on the filter for low complexity, set the E-value cut-off for inclusion in the next round ("PSI-BLAST Threshold") to 0.00001 and change the maximum target sequences to 20000.
Note:
Depending on the settings, PSI-BLAST switches back and forth between the format and the result window.

After how many iterations (do not run more than 5 iterations!) do you start to pick up D-Ala D-Ala ligase and Carbamoyl-phosphate synthase?

Which other types of enzyme are included among the hits?


Notice: If this takes a long time, collaborate with your neighbor. One of you could do task #2, the other task #3.


3. (10 minutes) Do a PSI-BLAST search with the D-Ala D-Ala ligase (gi|145722) as a query (use swissprot as database).
On the Format page, turn on the filter for low complexity, set the E-value cut-off for inclusion in the next round ("PSI-BLAST Threshold") to 0.00001 and change the maximum target sequences to 10000.
Note:
Depending on the settings, PSI-BLAST switches back and forth between the format and the result window. DO NOT CLICK the "Run PSI_Blast iteration X" button repeatedly. Click it once and open the Format window!

After how many iterations (do not run more than 5 iterations!) do you start to pick up carbamoyl phosphate synthetases, Glutathione synthetase, and Biotin carboxylase (aka Acetyl-CoA carboxylase A1)?

Which other types of enzyme are included among the hits?


What might be the reason for the different results obtained in tasks 2 and 3?


4. (30 Minutes) Do a PSI-BLAST (use SwissProt as the database) search for 4 iterations with the following sequence:

DO NOT use the gi number as the query. The gi number refers to the whole protein; we want to use only the intein sequence!
>Pab_VMA intein from gi|7436316|pir||D75028
CVDGDTLVLTKEFGLIKIKDLYKILDGKGKKTVNGNEEWTELERPITLYGYKDGKIVEIKATHVYKGFS
AGMIEIRTRTGRKIKVTPIHKLFTGRVTKNGLEIREVMAKDLKKGDRIIVAKKIDGGERVKLNIRVEQKR
GKKIRIPDVLDEKLAEFLGYLIADGTLKPRTVAIYNNDESLLRRANELANELFNIEGKIVKGRTVKALLI
HSKALVEFFSKLGVPRNKKARTWKVPKELLISEPEVVKAFIKAYIMCDGYYDENKGEIEIVTASEEAAYG
FSYLLAKLGIYAIIREKIIGDKVYYRVVISGESNLEKLGIERVGRGYTSYDIVPVEVEELYNALGRPYAE
LKRAGIEIHNYLSGENMSYEMFRKFAKFVGMEEIAENHLTHVLFDEIVEIRYISEGQEVYDVTTETHNFIGG
NMPTLLHNT
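
Before pasting, it can help to confirm that the query really is the intein alone. The following is a minimal sketch, assuming the sequence above (header line included) has been saved to a file. The file name inteinquery.txt is only borrowed from the psiblast footnote further down, and fasta_length is a hypothetical helper name, not part of BLAST:

```shell
# Hypothetical helper: count the residues in a FASTA file,
# ignoring header lines (those starting with ">") and whitespace.
fasta_length() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Usage (assumes the intein sequence was saved as inteinquery.txt):
#   fasta_length inteinquery.txt
```

The count should be smaller than the length of the whole protein, since the intein is only part of it.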



What types of enzymes do you get as hits?

Which E-value cut-off for inclusion in the next round did you choose?

What is the percent identity of the least significant hit added in the last iteration (clicking on the score in the table will jump to the alignment)?

Save the PSSM (Position Specific Scoring Matrix, or profile) from this search. To do that, choose PSSM from the menu under the Download link at the top of the results page. Save the PSSM as an ASN file on your computer.
(If the iterations take too long, the PSSM after 4 iterations on the swissprot database is here, and the PSSM after 5 iterations on the nr database is here*)

* I used the following command (command line, using the new BLAST+ package): psiblast -db nr -out out.inteinq -query inteinquery.txt -out_pssm inteinquery.pssm -out_ascii_pssm inteinquery.asci.pssm -inclusion_ethresh 0.0001 -num_iterations 5

Go to the Microbial Genomes Genomic BLAST page. Paste the intein sequence into the query sequence box. Select the non-redundant database.
Select the organisms to which the search should be restricted. You can select individual organisms or whole groups. (If you start typing, options will appear from which to select taxa.)

Possible are

After selecting the genomes to search, go to the alignment options and, under the PSI-BLAST options, select and upload your PSSM from Question #7.

What are the results of your search? Did you get any significant matches? What are they?
If you have significant matches, does the match occur over the full lengths of both query and subject sequences? What is the percent identity?
Use Blink to investigate if the hits are indeed inteins.
What is your conclusion? In your answer indicate - genomes searched, - number of significant matches found, - the E-values of these matches, and - the identity of these matches (i.e., are these probable inteins, or are they likely to be something else?).

5. Using PSSMs

Here is a FASTA-formatted file containing annotated IS605 transposase protein sequences from Frankia genomes.

We will use it to build a PSSM for this protein family, and then compare (mainly quantitatively) three searches:

  1. a blastp search of Frankia genomes for ORFs with significant matches to the sequences in the FASTA file,
  2. a PSI-blastp search of the collection of ORFs from the five genomes, and
  3. a PSI-tblastn search of Frankia genomes for significant matches to this PSSM

To do this, we will use the cluster. Establish a terminal connection and a file transfer connection to the cluster (see assignment 6 for information).

A FASTA file with all the proteins from 5 Frankia genomes is here - fiveFrankia.faa

A FASTA file of the nucleotide sequences of the 5 Frankia genomes is here - fiveFrankia.fna

Note that these (currently 5) Frankia genomes were retrieved from the Entrez ftp site. In this case, I retrieved the *.faa (protein) files and the *.fna (genome nucleotide) files.

blastpgp -d nr -i IS605.faa -j 2 -C IS605Check.chk -h 1e-5 -o blast.out -a 2

This takes quite a while. If you are in a hurry, grab the checkpoint file (with -j 4) from here - IS605Check.chk

Options for blastpgp are here

formatdb -p T -i fiveFrankia.faa -o T

formatdb -p F -i fiveFrankia.fna -o T
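
formatdb writes its index files next to the input FASTA file. The following is a minimal sketch for checking that they exist before searching. It assumes the usual legacy extensions (.phr, .pin, .psq for protein; .nhr, .nin, .nsq for nucleotide) and a database small enough to fit in a single volume. db_ready is a hypothetical helper name, not a BLAST tool:

```shell
# Hypothetical helper: succeed only if the three core formatdb index
# files exist for the given FASTA file.
#   $1 = FASTA file that was passed to formatdb -i
#   $2 = "p" for protein (-p T) or "n" for nucleotide (-p F)
db_ready() {
    for ext in hr in sq; do
        [ -f "$1.$2$ext" ] || { echo "missing $1.$2$ext" >&2; return 1; }
    done
}

# Usage:
#   db_ready fiveFrankia.faa p && db_ready fiveFrankia.fna n
```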

blastall -p blastp -d fiveFrankia.faa -i IS605.faa -o blastp.out -a 2 -e 1e-5 -m 8

blastpgp -d fiveFrankia.faa -i IS605.faa -R IS605Check.chk -o PSIblastP.out -a 2 -e 1e-5 -m 8

blastall -p psitblastn -i IS605.faa -d fiveFrankia.fna -R IS605Check.chk -o psitblastn.out -a 2 -e 1e-5 -m 8
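
With -m 8 the output is tab-separated, one line per alignment (HSP), with the E-value in column 11. The following is a minimal sketch for re-filtering such a file at a stricter cut-off without rerunning the search. filter_by_evalue is a hypothetical helper name, and the cut-off in the usage line is only an example value:

```shell
# Hypothetical helper: print only the -m 8 lines whose E-value
# (column 11) is at or below the given cut-off.
#   $1 = tabular BLAST output file
#   $2 = E-value cut-off (e.g. 1e-20)
filter_by_evalue() {
    awk -F '\t' -v cutoff="$2" '($11 + 0) <= (cutoff + 0)' "$1"
}

# Usage:
#   filter_by_evalue blastp.out 1e-20 | wc -l
```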

Counting the number of lines (corresponding to the number of significant matches) in a file:

wc -l blastp.out

wc -l PSIblastP.out

wc -l psitblastn.out
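
wc -l counts alignments (HSPs), so a subject that aligns in several pieces, or that matches more than one query sequence, is counted more than once. The following is a minimal sketch for counting distinct subject sequences instead, assuming -m 8 output with the subject ID in column 2. count_unique_subjects is a hypothetical helper name:

```shell
# Hypothetical helper: count distinct subject IDs (column 2)
# in a tab-separated -m 8 BLAST output file. Blank lines are skipped.
count_unique_subjects() {
    awk -F '\t' 'NF > 1 && !seen[$2]++ { n++ } END { print n + 0 }' "$1"
}

# Usage:
#   count_unique_subjects blastp.out
#   count_unique_subjects PSIblastP.out
```

For psitblastn.out the subjects are the genome sequences in fiveFrankia.fna rather than individual ORFs, so the distinct-subject count will be small and the line count is likely the more informative number there.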

How does the number of blastp matches compare to the number of PSI-blast and PSI-tblastn matches?

If there is a significant difference in the number of matches, can you think of a reason why this could happen?
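
To see where a difference comes from, it can help to list the ORFs that one search reports and the other does not. The following is a minimal sketch, assuming both files are -m 8 tabular output with the subject ID in column 2. subjects_only_in_second is a hypothetical helper name:

```shell
# Hypothetical helper: print each subject ID (column 2) that occurs in
# the second -m 8 file but not in the first, once each, in order of
# first appearance.
#   $1 = first output file, $2 = second output file
subjects_only_in_second() {
    awk -F '\t' '
        FILENAME == ARGV[1] { if (NF > 1) in_first[$2] = 1; next }
        NF > 1 && !($2 in in_first) && !printed[$2]++ { print $2 }
    ' "$1" "$2"
}

# Usage: ORFs found by the PSSM-based search but not by plain blastp
#   subjects_only_in_second blastp.out PSIblastP.out
```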
