Assignment 5: Statistics of Sequence Comparison

Your name:
Your email address:

1. PRSS (15 minutes)

Using PRSS*, determine whether there is significant similarity between the proteins with gi numbers 2506213, 2493127, 4323566, 2983405, and 1303679. Choose one of these sequences and compare it to all the others. (The first three students (closest to the door) will use the first sequence, the second row the second sequence, etc.) We will combine the findings from the different rows into a single table. Please enter the pseudo E-value for 10000 comparisons into the table on the whiteboard and below. To start the program, click on shuffle sequence.

Your sequence 1:

E-values for comparison to

* The PRSS server at EMBnet provides the traditional PRSS output, but its link to sequence retrieval is broken. You can instead download the FASTA files from NCBI and copy and paste them into the form.
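
The idea behind a shuffle test can be sketched in a few lines of Python. This is a minimal illustration and not the PRSS implementation: the scoring scheme (match +2, mismatch -1, linear gap -2), the function names, the number of shuffles, and the method-of-moments fit to an extreme value (Gumbel) distribution are all assumptions made for the example. PRSS itself uses substitution matrices and its own fitting procedure, so the numbers will not match the server's output.

```python
import math
import random

def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best

def shuffle_test(query, subject, n_shuffles=200, n_comparisons=10000, seed=0):
    """Estimate a pseudo E-value for query vs. subject from shuffled subjects."""
    rng = random.Random(seed)
    observed = local_score(query, subject)
    letters = list(subject)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same composition, randomized order
        null_scores.append(local_score(query, "".join(letters)))
    mean = sum(null_scores) / n_shuffles
    var = sum((s - mean) ** 2 for s in null_scores) / (n_shuffles - 1)
    sd = max(math.sqrt(var), 1e-6)  # guard against a degenerate null
    # Method-of-moments Gumbel fit (an assumption of this sketch):
    # sd = pi / (lam * sqrt(6)), mean = mu + gamma / lam
    lam = math.pi / (sd * math.sqrt(6))
    mu = mean - 0.5772156649 / lam
    z = max(lam * (observed - mu), -50.0)  # clamp to avoid overflow in exp
    p_value = -math.expm1(-math.exp(-z))   # P(score >= observed) under the fit
    return {"observed": observed, "p_value": p_value,
            "evalue": p_value * n_comparisons}
```

Calling `shuffle_test(seq_a, seq_b)` with two protein strings returns the observed score, the fitted probability of reaching that score by chance, and that probability multiplied by the number of comparisons (10000 by default), which plays the role of the pseudo E-value asked for above.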

2. BLAST (15 minutes)

Repeat a few of the pairwise comparisons using pairwise BLAST (go here, then select protein BLAST; make sure the blastp tab in the header is selected). Check the box "Align two or more sequences" at the bottom of the "Enter Query Sequence" box. You can paste the GI numbers directly into the box labeled "Enter accession number(s), gi(s), or FASTA sequence(s)". Hint: copy only the number, not the letters G or I. You can force the program to report insignificant alignments by increasing the expect value. To do this, click on the plus sign labeled "Algorithm parameters". A number of additional options will drop down, including the "Expect threshold", which is set to 10 by default; you can enter a larger number to obtain less significant matches (do this only if the default setting does not return any result). When all of your parameters are set, click the BLAST button.

What do the E-values mean in the case of pairwise BLAST? (Hint: check the output after clicking align.)

How do these E-values compare to the ones obtained using PRSS?
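
As background for the two questions above, the standard relation between a BLAST bit score and an E-value is E = m * n * 2^(-S'), where S' is the bit score and m and n are the lengths that define the search space. The short Python sketch below applies that relation. The function names are invented for this example, and the calculation ignores the length corrections that BLAST applies, so treat the result as an approximation rather than a reproduction of the program's output.

```python
def evalue_from_bit_score(bit_score, query_len, subject_len):
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * subject_len * 2.0 ** (-bit_score)

def rescale_evalue(evalue, old_space, new_space):
    """Rescale an E-value to a different search-space size.

    E grows linearly with the size of the search space, so the same
    alignment score is less significant in a larger search.
    """
    return evalue * new_space / old_space
```

For example, `evalue_from_bit_score(20, 100, 100)` evaluates 100 * 100 * 2^(-20). Comparing the search space of a two-sequence comparison with that of a databank search is one way to think about why E-values from different programs and settings are not directly interchangeable.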

3a. Transitive homology? Part one (5 minutes)

You find that sequence A (gi 1303679) has significant similarity to sequence B (gi 2506213) over the entire length, and sequence B has significant similarity to sequence C (gi 2983405) over the entire length, but C and A are not significantly similar.

Can you nevertheless conclude that A is homologous to C? (Two characters -- sequences, or morphological characters -- are homologous if they are derived from the same character existing in some ancient organism.)

3b. Transitive homology? Part two (10 minutes)

Does the same reasoning hold for gi 6320016, gi 1303679, gi 2507047?

Why might this case be different from the previous one?
How does the output from the pairwise BLAST comparison help you to draw a conclusion (compare 6320016 with 1303679, and 6320016 with 2507047)?
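
When reading the pairwise BLAST output for this question, it can help to note which residues of the shared sequence each alignment covers. The following Python sketch is a hypothetical helper: the function name is invented for illustration, and any coordinates you pass in should be read from your own BLAST results. It reports how many residues two alignment ranges on the same sequence have in common.

```python
def shared_residues(range_1, range_2):
    """Number of positions covered by both (start, end) ranges.

    Ranges are 1-based and inclusive, as in BLAST alignment coordinates.
    """
    start = max(range_1[0], range_2[0])
    end = min(range_1[1], range_2[1])
    return max(0, end - start + 1)
```

A result of zero means the two alignments cover different parts of the shared sequence; a large value means both alignments involve the same region.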


4. FASTA (10 minutes)

Do a databank search of the Swiss-Prot database with gi 2493127 as the query. FASTA is accessible through the web at [link].

(Enter the sequence in FASTA format (either search Entrez for 2493127 or go here and cut and paste (step 2)). In the display pull-down menu select FASTA (step 3). Select the UniProtKB/Swiss-Prot database as target (step 1). In step 3, click on "More Options". Then change the HISTOGRAM pull-down menu to yes, under the "Matrix" pull-down select the BLOSUM62 matrix, and under the "alignments" pull-down select 1000 alignments; leave everything else at the default settings.)

Once the program is done, click on the "Tool Output" button.

How many proteins show sequence similarity to the query sequence? (Give number of hits and the score cut-off or the E-value cut-off you chose.)

What type of sequences are among the matches?
In comparing the number of actual matches (==) to the distribution fitted to these data (*) for each alignment score (given in the left column), for which score value(s) does the number of actual hits deviate most from the expectation?
In the pairwise alignments, what is signified by the ":" and the "." ?

5. More BLAST (10 minutes)

Using a protein-coding nucleotide sequence of your choice as query (or use [link]; DANGER: cut and paste the nucleotide sequence itself, because the gi number refers to the whole genome, which is what you get if you display the FASTA sequence. Numbers pasted into the sequence box are ignored. At the beginning of the sequence add a ">" followed by a name and a line break),

1. search the non-redundant databank using blastn [link] (set maximum target sequences (under alignment parameters) to 20000; otherwise use the defaults),

2. repeat the search with the translated query vs. protein database search tool (blastx), using the protein non-redundant databank (nr) as target (set maximum target sequences to 20000; otherwise use the defaults), and

3. repeat it with the translated query vs. translated database search tool (tblastx), again using the non-redundant databank as target.

Do you notice any differences in results? (How many significant matches did you get? The taxonomy report provides an easy way to check things.)
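
The instruction above to add a ">" line with a name before the pasted sequence can also be done with a short script. This Python sketch uses a hypothetical helper name and a line width of 60, both chosen for illustration. It removes digits and whitespace (such as the position numbers in a GenBank-style listing) and writes a FASTA record.

```python
def to_fasta(name, sequence, width=60):
    """Build a FASTA record: a '>' header line followed by the sequence."""
    # Keep letters only; position numbers and whitespace are dropped.
    cleaned = "".join(ch for ch in sequence if ch.isalpha()).upper()
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"
```

The returned text can be pasted directly into the BLAST sequence box.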

