Assignment 5: Statistics of Sequence Comparison

Your name:
Your email address:

1. PRSS (20 minutes)

Using PRSS*, determine whether there is significant similarity between the proteins with gi numbers 2506213, 2493127, 4323566, 2983405, and 1303679. Choose one of these sequences and compare it to all the others. (The first three students (closest to the door) will use the first sequence, the second row the second sequence, etc.) We will combine the findings from the different rows into a single table. Please enter the pseudo E-value for 10000 comparisons into the table on the white-board and below. To start the program, click on shuffle sequence.

Your sequence 1:

E-values for comparison to

For a comparison that resulted in a significant but not overwhelming E-value (smaller than 1e-8 but larger than 1e-100), repeat the analysis with a different substitution matrix (Blosum50, Blosum80, PAM250). How does the E-value change?

For a comparison that resulted in a significant but not overwhelming E-value (smaller than 1e-8 but larger than 1e-100), repeat the analysis with larger and smaller gap opening penalties (e.g., -1, -10, -13, -20, -100). Which gap opening penalty gives the smallest E-value? What might be the reason for your finding?

* The PRSS server at embNET provides the traditional PRSS output, but their link to sequence retrieval (among others) is broken ... .
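
The idea behind the shuffling test in exercise 1 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the PRSS implementation: the scoring scheme (match +2, mismatch -1, linear gap -2) stands in for a substitution matrix with affine gap penalties, the function names are made up for this example, and the p-value is obtained by counting shuffles instead of fitting an extreme value distribution to the shuffled scores as PRSS does.

```python
import random


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local (Smith-Waterman) alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not PRSS defaults.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            if score > best:
                best = score
        prev = cur
    return best


def shuffle_test(a, b, n_shuffles=200, seed=0):
    """Compare the real score with scores against shuffled copies of b.

    Returns the real score and an empirical p-value: the fraction of
    shuffles that score at least as well as the unshuffled sequence
    (with +1 in numerator and denominator so p is never exactly zero).
    """
    rng = random.Random(seed)
    real = local_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition and length, destroys order
        if local_score(a, "".join(letters)) >= real:
            hits += 1
    return real, (hits + 1) / (n_shuffles + 1)
```

Multiplying such a p-value by 10000 gives a rough analogue of the pseudo E-value for 10000 comparisons. Counting alone cannot resolve p-values much below 1/(number of shuffles), which is one reason to fit a distribution to the shuffled scores instead.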

2. BLAST (10 minutes)

Repeat a few of the pairwise comparisons using Pairwise BLAST (go here, then select the protein BLAST. Click the box to select "align two or more sequences", which is at the bottom of the "Enter Query Sequence" box. Make sure the blastp tab in the header is selected). You can paste the GI numbers directly into the box labeled "Enter accession number(s), gi(s), or FASTA sequence(s)". Hint: copy only the number, not the letters G or I. You can force the program to report insignificant alignments by increasing the expect value. To do this, click on the plus sign labeled "Algorithm parameters". A number of additional options will drop down, including the "Expect Threshold", which is set to 10 by default. You can enter a larger number to obtain less significant matches (do this only if the default parameter does not return any result). When all of your parameters are set, click the BLAST button.

How do these E-values compare to the ones obtained using PRSS?

Note: The NCBI BLAST interface adjusts the gap penalties to a new default value for each substitution matrix. How do these values correspond to the impact of the gap penalty on the E-value that you observed in exercise 1?
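
To see why the matrix and the gap penalties both matter for the E-value, it can help to write out the Karlin-Altschul relation E = K * m * n * exp(-lambda * S) that BLAST statistics are based on. The sketch below is a hedged illustration: the function names are invented for this example, and the default values of lambda and K are the ones commonly tabulated for BLOSUM62 with gap costs 11/1, used here as assumptions. The values that apply to your own search are printed in the statistics section of the BLAST report, and they change with the matrix and the gap penalties.

```python
import math


def evalue(raw_score, m, n, lam=0.267, k=0.041):
    """Expected number of chance hits with score >= raw_score.

    m and n are the (effective) lengths of the query and of the
    database or subject sequence. lam and k are assumed example values.
    """
    return k * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=0.267, k=0.041):
    """Normalized score that folds lam and k into the score itself."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def evalue_from_bits(bits, m, n):
    """Same E-value, computed from the bit score: E = m * n * 2^(-bits)."""
    return m * n * 2.0 ** (-bits)
```

Because lambda and K are estimated separately for each combination of matrix and gap penalties, the same raw score can translate into quite different E-values when these settings change.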

3a. Transitive homology? Part one (5 minutes)

You find that sequence A (gi 1303679) has a significant similarity to sequence B (gi 2506213) over the entire length and sequence B has significant similarity to C (gi 2983405) over the entire length, but C and A are not significantly similar.

Can you nevertheless conclude that A is homologous to C? (Two characters -- sequences, or morphological characters -- are homologous if they are derived from the same character existing in some ancient organism.)

3b. Transitive homology? Part two (10 minutes)

Does the same reasoning hold for gi 6320016, gi 1303679, gi 2507047?

Why might this case be different from the previous one?
How does the output from the pairwise blast comparison help you to draw a conclusion (compare 6320016 with 1303679, and 6320016 with 2507047)?
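
One thing worth checking in the pairwise output is which part of the shared sequence each alignment covers, since every BLAST alignment reports start and end positions on the query. The following sketch shows one way to compare two such ranges. The function names, the coordinates in the usage line, and the 30-residue threshold are hypothetical choices for illustration, not values taken from the sequences in this exercise.

```python
def overlap_length(range1, range2):
    """Number of positions shared by two inclusive (start, end) ranges."""
    start = max(range1[0], range2[0])
    end = min(range1[1], range2[1])
    return max(0, end - start + 1)


def share_region(range1, range2, min_overlap=30):
    """True if both alignments cover a common stretch of the shared sequence.

    min_overlap is an arbitrary illustrative threshold.
    """
    return overlap_length(range1, range2) >= min_overlap


# Hypothetical query coordinates of two alignments on the same sequence:
print(share_region((1, 200), (150, 400)))
```

If the two alignments fall on different, non-overlapping parts of the shared sequence, the two similarities do not necessarily say anything about each other.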


4. FASTA (10 minutes)

Do a databank search of the Swissprot database with (gi 2493127). Fasta is accessible through the web at

(Enter the sequence in Fasta format (either search entrez for 2493127 or go here and cut and paste (step 2)). In the display pulldown-menu select FASTA (step 3). Select the UniProtKB/swissprot database as target (step 1). In step 3, click "More Options". Then change the HISTOGRAM pulldown menu to yes, under the "Matrix" pulldown select the BLOSUM62 matrix, under the "alignments" pulldown select 1000 alignments and 1000 scores, and leave everything else at the default options.)

How many proteins show sequence similarity to the query sequence with an E-value smaller than 1E-20?
What type of sequences are among the matches?

Click on the "Tool Output" button.
Compare the number of actual matches (==) to the distribution fitted to these data (*) for each alignment score (given in the left column). For which score value(s) does the number of actual hits deviate most from the expectation?
In the pairwise alignments, what is signified by the ":" and the "." ?

5. More BLAST (10 minutes)

Using a protein coding nucleotide sequence of your choice as query (or use

    ^ DANGER: cut and paste the nucleotide sequence -- the gi number refers to the whole genome, which is what you get if you display the FASTA sequence, numbers pasted into the sequence box are ignored. At the beginning of the sequence add a ">" followed by a name and an <enter> symbol)),

1. search the non-redundant databank using blastn [link] (set maximum target sequences (under alignment parameters) to 20000, else defaults - this should be fast; if it isn't, a temporary link to the results is here)

2. repeat the search with the translated query vs. protein database (nr) search tool (Blastx) using the protein non-redundant databank as target (set maximum target sequences to 20000, blosum45, else defaults) (for the above query, results may still be here)

Do you notice any differences in results? (How many hits did you get in how many different organisms? The taxonomy report provides an easy way to check things.)
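
The DANGER note above asks for a header line starting with ">" in front of the pasted nucleotide sequence. If you prefer to prepare the query outside the browser, a small helper along the following lines would do it. This is a sketch: the function name and the 60-character line width are choices made for this example, and it simply drops everything that is not a letter (position numbers, spaces, line breaks), which matches the note that numbers in the sequence box are ignored.

```python
def to_fasta(name, sequence, width=60):
    """Return a FASTA record: a '>' header line followed by the sequence.

    Digits, spaces and line breaks in the pasted sequence are removed,
    and the sequence is wrapped to `width` characters per line.
    """
    cleaned = "".join(ch for ch in sequence if ch.isalpha()).upper()
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Example with a made-up sequence fragment copied together with position numbers:
print(to_fasta("my_query", "1 atggctagct agctaggatc\n21 cgatcgatta"))
```

The resulting text can be pasted into the query box as a whole.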


6. Preparation for next week

Connect to the server

You will need two connections to the server,
(A) a terminal window that allows you to execute commands from the command line, and
(B) a program that allows you to move files back and forth.

For A:

On a Windows machine, download and install the ssh client from . The installation should include both a terminal program and a file transfer program. Open the terminal program and connect to the server. To do so, first create and then edit a profile with your information, then use the profile to connect.

On a Mac, open the terminal program ( -- the program usually resides in the application folder inside the utilities subfolder).
In the terminal window type:

enter your password when prompted

If you want to change your password type

Then follow the instructions to change your password.

For B:

On a Windows machine, if you have the ssh client installed, log in and open the file transfer window.

On a Mac:
The easiest option is to use a program like FileZilla. Go to their web page, then download and install the FileZilla client (use the correct version for your operating system and processor; note that you can also use this on Windows, but make sure you download the client, not the server). Unpack the program and double-click on it. For Macs with an Intel processor, the application is also available here, and for old Macs with RISC processors, the application is also available here. Connect FileZilla to the server (top left icon, then click on the new site button, then select protocol sftp and logon type normal). The two panes in the FileZilla window represent your computer (left) and the server (right).

Find the files you want to transfer in the left window pane and drag them into the directory on the server where you want them to go. A right click (or control click) on the right pane opens a menu that allows you to create new subdirectories and change file permissions.

Using the terminal,



Check the appropriate radio button below before pressing the submit button:

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.