CLASS 5. Statistics of Sequence Comparison

HOMEWORK PROBLEM SET #1 is posted to Moodle. Due in class on FRIDAY, JANUARY 22.

INSTRUCTIONS:

For each exercise, provide the search query you used and keep your answers brief. Email me the answers by Sunday 11:59PM AST at the latest.

Use "CLASS 5 EXERCISE" as the message subject, and type your answers directly into the email body (i.e., no document attachments please). Make sure that the first line of your message is your NAME.

  1. Using PRSS, determine if there is significant similarity between the proteins with the following GI numbers:
    1303679 (A),
    2506213 (B),
    2983405 (C).
    Do all possible comparisons. Report the pseudo E-value for 10000 shuffles. Choose one comparison and repeat it 2-3 times. Do you always get the same E-value? Why or why not? Choose one GI number and compare it against itself. What is the E-value? Does it conform to your expectations?

  2. Repeat a few of the pairwise comparisons using protein BLAST [To get to pairwise BLAST, check the "Align two or more sequences" box]. You can force the program to report insignificant alignments by increasing the "Expect threshold" parameter in the "Algorithm Parameters" section. What do the E-values mean in the case of pairwise BLAST? (Hint: on the results report, examine "Search Summary", under "Other Reports".) How do these E-values compare to the ones obtained using PRSS?

  3. Transitive homology? Part I. You find that sequence A (gi 1303679) has significant similarity to sequence B (gi 2506213) and sequence B has significant similarity to sequence C (gi 2983405), but C and A are not significantly similar. Can you nevertheless conclude that A is homologous to C?

  4. Transitive homology? Part II. Does the same reasoning hold for GI 6320016, 1303679 and 2507047? Why might this case be different from the previous one? How does the output from the pairwise BLAST comparison help you draw a conclusion? (To answer the question, compare 6320016 with 1303679, and 6320016 with 2507047, in pairwise BLAST.)

  5. If you still have time left, play around with BLAST and explore how its output varies under different search parameters. Take a query of your choice (e.g., take a protein from your research project, or concatenate your first, middle and last name into a string of letters and pretend that it is a protein sequence [remove letters not used in the amino acid alphabet]). Do a BLASTP search against the non-redundant (nr) database. Repeat the search using RefSeq as the database. Did your results change? Try different parameters: use a word size of 2, turn the low-complexity filter ON/OFF, choose a different substitution matrix.
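
To build intuition for the shuffle-based comparison in exercise 1, here is a minimal Python sketch of a shuffling significance test. It is not PRSS itself: the names local_score and shuffle_test are hypothetical, the scoring (match +2, mismatch -1, linear gap -2) is an assumed stand-in for a real substitution matrix with affine gap penalties, and the significance is reported as a plain empirical fraction of shuffles rather than a value derived from a fitted score distribution.

```python
import random


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty).

    The scoring values are illustrative assumptions, not PRSS defaults.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            if score > best:
                best = score
        prev = curr
    return best


def shuffle_test(a, b, n_shuffles=200, seed=None):
    """Estimate how often a shuffled copy of b scores as well as b itself.

    Shuffling preserves the length and composition of b but destroys
    the order of its residues. Returns (observed_score, empirical_p).
    """
    rng = random.Random(seed)
    observed = local_score(a, b)
    residues = list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        if local_score(a, "".join(residues)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    p_value = (hits + 1) / (n_shuffles + 1)
    return observed, p_value


if __name__ == "__main__":
    # Made-up toy sequences, for illustration only.
    seq_a = "MKTAYIAKQRQISFVKSHFSRQ"
    seq_b = "MKTAYLAKQRQLSFVKSHFTRQ"
    for seed in (1, 2, 3):
        print(seed, shuffle_test(seq_a, seq_b, seed=seed))
```

Each run draws a different random set of shuffles unless the seed is fixed, so the estimate can change from run to run. The smallest value this sketch can report is 1 / (n_shuffles + 1), so the number of shuffles limits how small an estimate you can obtain.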