Assignment 5: Statistics of Sequence Comparison

Your name:
Your email address:

1. PRSS (15 minutes)

Using PRSS*, determine whether there is significant similarity among the proteins with gi numbers 2506213, 2493127, 4323566, 2983405, and 1303679. Choose one of these sequences and compare it to all the others. (The 1st row (closest to the door) will use the 1st sequence, the 2nd row the 2nd sequence, etc.) We will combine the findings from the different rows into a single table. Please enter the pseudo E-value for 10000 comparisons into the table on the whiteboard and below.

Your sequence 1:

E-values for comparison to
2506213:
2493127:
4323566:
2983405:
1303679:

* The PRSS server at embNET provides the traditional PRSS output, but their link to sequence retrieval is broken ...
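
The idea behind PRSS can be sketched in a few lines of Python: score the real pair, score the first sequence against many shuffled copies of the second, and ask how often a shuffled score reaches the real one. This is only an illustration under loud assumptions: the scoring function is a toy local alignment score (match +2, mismatch -1, gap -2) rather than a substitution-matrix-based Smith-Waterman score, and the p-value is a simple empirical count rather than the extreme value distribution fit that PRSS uses. The function names and parameters are hypothetical.

```python
import random


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Toy Smith-Waterman local alignment score with a linear gap penalty.

    ASSUMPTION: simple match/mismatch scoring, not a BLOSUM matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def shuffle_test(seq1, seq2, n_shuffles=200, n_comparisons=10000, seed=0):
    """Return (real score, empirical p-value, pseudo E-value).

    ASSUMPTION: the p-value is the fraction of shuffled scores that reach
    the real score (with a +1 correction), not a fitted distribution.
    """
    rng = random.Random(seed)
    real = local_score(seq1, seq2)
    letters = list(seq2)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps length and composition, destroys order
        if local_score(seq1, "".join(letters)) >= real:
            hits += 1
    p_value = (hits + 1) / (n_shuffles + 1)
    # Pseudo E-value: expected number of chance hits this good
    # if n_comparisons unrelated sequences had been compared.
    return real, p_value, p_value * n_comparisons
```

With only 200 shuffles, the smallest p-value this count can report is about 0.005, so the pseudo E-value for 10000 comparisons cannot drop below roughly 50. That limitation is one reason to fit a distribution to the shuffled scores instead of counting.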

2. BLAST (15 minutes)

Repeat a few of the pairwise comparisons using pairwise BLAST (go here, select align two or more sequences, then choose blastp for protein BLAST in the tabs in the third part of the header). You can paste the GI numbers directly. (You can force the program to report insignificant alignments by increasing the expect value under Algorithm options on the page you use to submit your sequences.)

What do the E-values mean in the case of pairwise BLAST? (Hint: check the output after clicking align.)


How do these E-values compare to the ones obtained using PRSS?
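
One way to approach the two questions above: for a fixed alignment, the expected number of chance hits grows with the size of the search space. The following is a minimal sketch, assuming the standard relation between bit score S' and E-value, E = m * n * 2^(-S'), where m is the query length and n is the total length of the target. The lengths and the bit score are made-up numbers for illustration, and the length corrections that real BLAST applies to m and n are ignored here.

```python
def evalue_from_bits(bit_score, query_len, target_len):
    """E-value from a bit score: E = m * n * 2**(-S').

    ASSUMPTION: raw lengths are used, without edge-effect corrections.
    """
    return query_len * target_len * 2.0 ** (-bit_score)


# Hypothetical numbers: the same alignment (40 bits, query of 300 residues)
# judged against a single 350-residue sequence and against a
# 200-million-residue database.
pairwise = evalue_from_bits(40.0, 300, 350)
database = evalue_from_bits(40.0, 300, 200_000_000)

print(f"pairwise search space: E = {pairwise:.2e}")
print(f"database search space: E = {database:.2e}")
```

The alignment and its score are identical in both calls; only the assumed search space differs, so the two E-values differ by the ratio of the target lengths.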

3a. Transitive homology? Part one (5 minutes)

You find that sequence A (gi 1303679) has significant similarity to sequence B (gi 2506213), and sequence B has significant similarity to sequence C (gi 2983405), but C and A are not significantly similar.

Can you nevertheless conclude that A is homologous to C? (Two characters -- sequences, or morphological characters -- are homologous if they are derived from the same character existing in some ancient organism.)


3b. Transitive homology? Part two (10 minutes)

Does the same reasoning hold for gi 6320016, gi 1303679, gi 2507047?

Why might this case be different from the previous one?

How does the output from the pairwise BLAST comparisons help you to draw a conclusion (compare 6320016 with 1303679, and 6320016 with 2507047)?

 

4. FASTA (10 minutes)

Do a databank search of the Swissprot database with gi 2493127 as the query. FASTA is accessible through the web at

   http://www.ebi.ac.uk/Tools/fasta33/index.html (Bill Pearson's site currently cannot be convinced to draw the histogram)

(Enter the sequence in FASTA format on the form: go to Entrez, select the protein databank, search for 2493127, and in the display pull-down menu select FASTA, or go here. Then select the UniProtKB/swissprot database as target, paste the FASTA-formatted sequence into the box, select yes in the HISTOGRAM pull-down menu, select the BLOSUM62 matrix, select 1000 alignments, and leave everything else at the default options.)

Once the program is done, click on the FASTA Results button.

How many proteins show sequence similarity to the query sequence? (Give number of hits and the score cut-off or the E-value cut-off you chose.)

What type of sequences are among the matches?

Compare the number of actual matches (==) to the distribution fitted to these data (*) for each alignment score (given in the left column). For which score value(s) does the number of actual hits deviate most from the expectation?

In the pairwise alignments, what is signified by the ":" and the "." ?

5. More BLAST (10 minutes)

Using a protein coding nucleotide sequence of your choice as query (or use

    http://www.ncbi.nlm.nih.gov/nuccore/21226102?from=2146398&to=2148008&report=gbwithparts

    ^ DANGER: cut and paste the nucleotide sequence. The gi number refers to the whole genome, which is what you get if you display the FASTA sequence. Numbers pasted into the sequence box are ignored. At the beginning of the sequence, add a ">" followed by a name and a line break),

1. search the non-redundant databank using blastn [link] (set maximum target sequences to 20000, else defaults)

2. repeat the search with the translated query vs. protein database search tool (blastx), using the protein non-redundant databank (nr) as target (set maximum target sequences to 20000, else defaults), and

3. with the translated query vs. translated database search tool (tblastx), again using the non-redundant databank as target.

Do you notice any differences in results? (How many significant matches did you get? The taxonomy report provides an easy way to check things.)
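
The clean-up described in the DANGER note in this section (drop the position numbers and spaces, then add a ">" header line) can also be done with a short script if you prefer not to edit by hand. This is a sketch only: the function name `genbank_origin_to_fasta`, the default record name, and the 60-column line width are illustrative choices, not requirements of the BLAST form.

```python
import re


def genbank_origin_to_fasta(pasted, name="my_query", width=60):
    """Turn sequence lines copied from a GenBank record into FASTA text.

    ASSUMPTION: the input is the numbered sequence block of a GenBank
    display, optionally including the "ORIGIN" line and the closing "//".
    """
    residues = []
    for line in pasted.splitlines():
        stripped = line.strip()
        # Skip the block markers that surround the sequence in GenBank format.
        if stripped.upper().startswith("ORIGIN") or stripped == "//":
            continue
        # Remove position numbers, spaces, and anything else that is not a letter.
        residues.append(re.sub(r"[^A-Za-z]", "", stripped))
    seq = "".join(residues).upper()
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


example = """ORIGIN
        1 atggcc ggtacc
       61 ttaa
//
"""
print(genbank_origin_to_fasta(example), end="")
```

The printed text starts with the ">" line and can be pasted into the query box as is.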



Finished?

Check the appropriate radio button below before pressing the submit button:

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.