Assignment 5: Statistics of Sequence Comparison

Your name:
Your email address:

Some cluster computer basics

Spend 5 minutes reading this introduction to the command-line interface.

Start a program called Bitvise SSH Client. In the "Host" field, type bbcsrv3.biotech.uconn.edu
In the "Username" field, type your username. Login. It will ask you to accept a new host key. Now enter your password.

Two windows will probably pop up. The one with a black background is the command-line window. The other window is a file transfer window (drag and drop).

In the window with a black background, you will see a flashing cursor. Stuff you type appears at the cursor position.

You can read more about the cluster here.

Type qlogin
and then hit the return or enter key. This takes you to a "compute" node. Now type qstat
This command shows the job-ID of your session.

Type ls
This gives you a directory (or folder) listing. There may be nothing to see (yet). Type mkdir lab5
This command stands for "make directory". Now type ls
again. You should see your newly-created directory. Type pwd
This stands for "print working directory". This forward-slash (/) separated "path" is a shorthand for the directory you are currently in.
Type cd lab5
Now type pwd again. You have moved into the lab5 directory.

You're stranded on Mars, with no access to the Internet, and for some reason you need an ASCII table (ASCII what?!). Suddenly you type man ascii
and all is not lost. "man" stands for manual. Hit the space bar to scroll through the pages, and press the "q" key to exit back to the prompt.

There are many ways to get files into your account, e.g., the file transfer application that is part of the SSH Client. Another possibility is the wget commant. to move this file into the directory on your cluster, you could type
wget http://carrot.mcb.uconn.edu/mcb3421_2017/HaloATPases+Intein.txt
You could copy paste the above line to the terminal window, or right click "this file" and copy the link address.

Do the same for this file:
wget http://carrot.mcb.uconn.edu/mcb3421_2017/HaloATPases-Intein.txt

Once you downloaded the files into the lab5 directory, you can display their content by typing
cat HaloATPases-Intein.txt
Alternatively you could use the more commands that allows you to go through a file page by page (Type <space> to get to the next page, type "q" to quit).

To display both files you could use the wildcard "*", which stands for any character.
cat H*
Dispalys both files to the screen. By default the output of cat goes to the standard output, in our case the screen. But you can redirect the output to a file using the > symbol.
cat H* > HaloATPases.txt
copies both files into a new file.

Use more to scrol through the file
more HaloATPases.txt

Type logout
You will return to the master computer. Now type qstat
You should have no jobs running. If there is a job listed, then type qdel
followed by a space, and then the job-ID number, followed by return or enter key. Then type qstat
again, to confirm that there are no running jobs. Then type logout
to exit the main computer.

 

1. PRSS (20 minutes)

Using PRSS*, determine if there is significant similarity between the proteins with gi numbers: 2506213, 2493127, 4323566, 2983405, 1303679. To start the program click on "Shuffle Sequence". Sample values are in the table below. Note that 2e-3 means 2 × 10-3.

Table 1: PRSS pseudo E-value for 10000 comparisons. Fill out one column and enter numbers on whiteboard.

protein GI number
2506213
flagellar ATPase
2493127
vacuolar ATPase subunit B
4323566
ATPase subunit alpha
2983405
transcription termination factor Rho
1303679
vacuolar ATPase subunit A
2506213
flagellar ATPase
         
2493127
vacuolar ATPase subunit B
       
4323566
ATPase subunit alpha
     
2983405
transcription termination factor Rho
   
1303679
vacuolar ATPase subunit A
 

For a comparison that resulted in a significant E value (strong - smaller than 1e-8), but not overwhelming (i.e. > 1e-100) repeat the analysis selecting a different substitution matrix (Blosum50, Blosum80, PAM250). How does the E-value change?

For a comparison that resulted in a significant E value (strong - smaller than 1e-8), but not overwhelming (i.e. > 1e-100) repeat the analysis increasing and decreasing the gap opening penalty (e.g., -1, -10, -13, -20, -100). Which gap opening penalty gives the smallest E-value? What might be the reason for your finding?

* The PRSS server at embNET provides the traditional PRSS output, but their link to sequence retrieval (among others) is broken ... .

2. BLAST (10 minutes)

Repeat a few of the pairwise comparisons using Pairwise BLAST (go here, then select the protein BLAST. Click the box to select align two or more sequences, which is at the bottom of the "Enter Query Sequence" box. Make sure the blastp tab in the header is selected). You can paste the GI numbers directly into the box labeled "Enter accession number(s), gi(s), or FASTA sequence(s)" You can force the program to report insignificant alignments by increasing the expect value. To do this, click on the plus sign labeled "Algorithm parameters". A number of additional options will drop down, including the "Expect Threshold," which is set to 10 by default, but you can enter a larger number to obtain less significant matches (do this only if the default parameter does not return any result). When all of your parameters are set, click the BLAST button.

Table 2: BLASTp E-value.Enter your values on the whiteboard

protein GI number
2506213
flagellar ATPase
2493127
vacuolar ATPase subunit B
4323566
ATPase subunit alpha
2983405
transcription termination factor Rho
1303679
vacuolar ATPase subunit A
2506213
flagellar ATPase
         
2493127
vacuolar ATPase subunit B
       
4323566
ATPase subunit alpha
     
2983405
transcription termination factor Rho
   
1303679
vacuolar ATPase subunit A
 

How do these E-values compare to the ones obtained using PRSS?

Note: The NCBI BLAST interface adjusts the gap penalties to a new default value for each substitution matrix. How do these values correspond the impact of the gap penalty on the E-value you performed in exercise 1?


3a. Transitive homology? Part one (5 minutes)

You find that sequence A (gi 1303679) has a significant similarity to sequence B (gi 2506213) over the entire length and sequence B has significant similarity to C (gi 2983405) over the entire length, but C and A are not significantly similar.

Can you nevertheless conclude that A is homologous to C? (Two characters -- sequences, or morphological characters -- are homologous if they are derived from the same character existing in some ancient organism.)


3b. Transitive homology? Part two (10 minutes)

Does the same reasoning hold for gi 6320016, gi 1303679, gi 2507047?

Why might this case be different from the previous one?
How does the output from the pairwise blast comparison help you to draw a conclusion (compare 6320016 with 1303679, and 6320016 with 2507047)?

 

4. FASTA (10 minutes)

Do a databank search of the Swissprot database with (gi 2493127). Fasta is accessible through the web at

   http://www.ebi.ac.uk/Tools/sss/fasta/

(Enter the sequence in Fasta format (either search entrez for for 2493127 or go here and cut and paste (step 2)). In the display pulldown-menu select FASTA (step 3). Select the UniProtKB/swissprot database as target (step 1). In step 3, change the HISTOGRAM pulldown menu to yes, under the "Matrix" pulldown select the BLASTP62 matrix, under the "alignments" pulldown select 1000 alignments and 1000 scores, leave everything else in the default options.) <If the server is overloaded, you can get the results here.>

1) How many proteins show sequence similarity to the query sequence with an E-value smaller than 1E-20?
2) What type of sequences are among the matches?

Click on the "Tool Output" tab.
3) In comparing the number of actual matches (==) to the distribution fitted to these data (*) for each alignment score (given in the left column), for which value of score(s) does the number of actual hits deviate most from the expectation?
4) In the pairwise alignments, what is signified by the ":" and the "." ?

Repeat the search (same parameters) with the uniprot databank that clusters sequences together that have more than 50% identity. <If the server is overloaded, you can get the results here.>
5) How many matches better than 1E-20 did you obtain?

5. More BLAST (10 minutes)

Using a protein coding nucleotide sequence of your choice as query (or use

    https://www.ncbi.nlm.nih.gov/nuccore/21226102?report=fasta&from=2146398&to=2148008

    ^ DANGER: cut and paste the nucleotide sequence -- the gi number refers to the whole genome, which is what you get if you display the FASTA sequence, numbers pasted into the sequence box are ignored. At the beginning of the sequence add a ">" followed by a name and an <enter> symbol)),

1. search the non-redundant databank using blastn [link] (set maximum target sequences (under alignment parameters) to 20000, else defaults - this should be fast)
If you are in a hurry, or this takes too long, the result of the search is here (till 3.55 am on Saturday 9/30/2017).

2. repeat the search with the translated query vs. protein database (nr) search tool (Blastx) using the protein non-redundant databank as target (set maximum target sequences to 20000, blosum45, else defaults)
This might take awhile - the results are here (till 3.55 am on Saturday 9/30/2017)

Do you notice any differences in results? (How many hits did you get in how many different organisms? The taxonomy report provides an easy way to check things.)

 

Finished?

Check the appropriate radio button below before pressing the submit button:

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.