!!! NEXT CLASS (MONDAY FEB 4th)
WILL MEET IN THE COMPUTER LAB !!!

 

Blast, PSIblast, and Homology (continued from class 2)

PRSS Example. 
We will use the web version (available here).

Go through example. Sequences are here (fl), here (B), here (A) and here (A2)

There are many other alignment programs. BLAST is widely used and is offered through the NCBI (go here for more info). It can also do pairwise comparisons (go here to do an example).

To force the program to report an alignment, increase the E-value threshold.

Rules of thumb:

If you can demonstrate significant similarity using either randomization or an unweighted blast search, your sequences are homologous (i.e. related by common ancestry). Convergent evolution has not been shown to lead to sequence similarities detectable by these means (see above; this might not be true for scores in PSI-blast).

If the actual alignment score is more than three standard deviations (of the randomized sequences) better than the mean for the randomized sequences, the two sequences are homologous (i.e. related by common ancestry). PRSS and many other programs use more accurate distributions to describe the distribution of random hits. The expectation value for the alignment score of the actual sequences is based on these statistics.

Usually E-values (in a blast search or through randomization) smaller than 10^-4 are convincing.

 

Terminology

E-values give the expected number of matches with an alignment score this good or better.

P-values give the probability of finding a match of this quality or better.

P-values lie in [0,1], E-values in [0,infinity). For small values, E is approximately equal to P.
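The relation between E and P can be checked numerically. The sketch below assumes that the number of chance matches follows a Poisson distribution, which is the usual assumption in BLAST statistics but is not stated in these notes; under that assumption P = 1 - exp(-E). The function name p_from_e is made up for this illustration.

```python
import math


def p_from_e(e_value: float) -> float:
    """Probability of at least one chance match, given the expected number E.

    Assumes a Poisson distribution of chance matches: P = 1 - exp(-E).
    """
    # expm1 keeps precision when E is very small
    return -math.expm1(-e_value)


# For small E the two values nearly coincide; for large E, P levels off below 1.
for e in (1e-4, 0.1, 1.0, 10.0):
    print(f"E = {e:g}  ->  P = {p_from_e(e):.6g}")
```

This is why E and P can be used interchangeably for convincing matches, but not for weak ones.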

z-values give the distance between the actual alignment score and the mean of the scores for the randomized sequences, expressed as multiples of the standard deviation calculated for the randomized scores. For example, a z-value of 3 means that the actual alignment score is 3 standard deviations better than the average for the randomized sequences. z-values > 3 are usually considered suggestive of homology; z-values > 5 are considered a sufficient demonstration (see the BUT below).

A somewhat readable description of E, P, HSP and other values is here.
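The z-value calculation can be sketched in a few lines. This is a toy version of the randomization idea, not what PRSS actually does: the scoring scheme (match +2, mismatch -1, gap -2) is an assumption chosen for brevity instead of a real substitution matrix, and PRSS fits a more accurate distribution to the shuffled scores rather than relying on a plain z-value. The function names are hypothetical.

```python
import random
import statistics


def local_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score (toy scoring, linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def z_value(a: str, b: str, n_shuffles: int = 100, seed: int = 0) -> float:
    """(actual score - mean of shuffled scores) / standard deviation of shuffled scores."""
    rng = random.Random(seed)
    actual = local_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps the composition of b, destroys its order
        shuffled_scores.append(local_score(a, "".join(letters)))
    sd = statistics.stdev(shuffled_scores)
    if sd == 0:
        raise ValueError("shuffled scores do not vary; use longer sequences")
    return (actual - statistics.mean(shuffled_scores)) / sd


# First line of the Pab VMA intein sequence from assignment question 10.
query = "CVDGDTLVLTKEFGLIKIKDLYKILDGKGKKTVNGNEEWTELERPITLYGYKDGKIVEIKATHVYKGFS"
print("query vs. itself:  z =", round(z_value(query, query), 1))
# The reversed string is only a stand-in for a sequence of the same composition.
print("query vs. reverse: z =", round(z_value(query, query[::-1]), 1))
```

Shuffling the second sequence preserves its amino acid composition, so the comparison asks whether the order of residues, not just their frequency, produces the high score.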

BUT:
Failure to detect significant similarity only shows our inability to detect homology; it does not prove that the sequences are not homologous.

Examples:

Jim Knox (MCB-UConn) has studied many proteins involved in bacterial cell wall biosynthesis and antibiotic binding, synthesis or destruction. Many of these proteins have identical 3-D structures and can therefore be assumed to be homologous; however, the above tests fail to detect these homologies. (For example, enzymes with GRASP nucleotide binding sites are depicted here.)

DNA replication involves many different enzymes. Some of the proteins do the same thing in bacteria, archaea and eukaryotes; they have similar 3-D structures (e.g.: sliding clamp, E. coli dnaN and eukaryotic PCNA, see Edgell and Doolittle, Cell 89, 995-998), but again, the above tests fail to detect homology.

Helicase and F1-ATPase. Both form hexamers with something rotating in the middle (either the gamma subunit or the DNA; D. Crampton, pers. communication). The monomers have the same type of nucleotide binding fold (picture)

BLAST and PSI BLAST

Run a blast trial with  this sequence

A tutorial for standard blast search is here

An easy way to force the program to report less significant matches is to increase the expect value in the advanced blast page. 

PSI-blast provides an enormous advantage over normal blast in the detection of distantly related sequences.  It only works if some closely related sequences are already available, but if this is the case it finds a lot of other distantly related sequences. 

The NCBI page describes PSI blast as follows:
Position-Specific Iterated BLAST (PSI-BLAST) provides an automated, easy-to-use version of a "profile" search, which is a sensitive way to look for sequence homologues. The program first performs a gapped BLAST database search. The PSI-BLAST program uses the information from any significant alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching. PSI-BLAST may be iterated until no new significant alignments are found. At this time PSI-BLAST may be used only for comparing protein queries with protein databases. 

A diagram giving an overview of the PSI-blast procedure is here.

The results of a normal blast search are aligned and a pattern of conserved residues is extracted from the alignment.  This pattern is used for the next iteration.  An important parameter to adjust is the E-value threshold down to which matches are included in the alignment and pattern extraction. 
At higher iterations a PSI-blast profile can become corrupted, and false positives are then reported with significant E-values. For example, in a traditional blast search one can be quite certain that a match with an E-value of 10^-13 represents a homologue; with PSI-blast this is not as clear. Test studies indicate that profile corruption becomes likely after more than 5 iterations. On the positive side, there are many fewer false negatives with PSI-blast than with normal blast.
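To make the "pattern of conserved residues" concrete, here is a minimal sketch of how a position-specific score matrix could be derived from a set of aligned hits. It is heavily simplified and is not the NCBI implementation: it assumes a uniform background frequency for all 20 amino acids, a flat pseudocount, and no sequence weighting, and the aligned segments are invented for illustration. The names build_pssm and score are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """One dict per alignment column: log2(observed frequency / background)."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c in AMINO_ACIDS)  # gaps are skipped
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Sum of the column scores for an ungapped segment of the same length."""
    return sum(col[aa] for col, aa in zip(pssm, segment))


# Invented aligned segments standing in for the significant hits of one round.
hits = ["CVDGDT", "CIDGET", "CLDGDS", "CVSGDT"]
pssm = build_pssm(hits)
print("conserved-looking segment:", round(score(pssm, "CVDGDT"), 2))
print("unrelated segment:        ", round(score(pssm, "WWWWWW"), 2))
```

Every sequence admitted below the inclusion threshold changes the column counts, which is how a single false positive can shift the profile and pull in further unrelated sequences in later rounds.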

 

 


Assignment #2:

[Links from this page open in a separate window]

  1. Using Entrez download the protein sequence with PID# 2506213. What type of proteins do you find among the related proteins?

  2. Copy the FASTA format of gi2506213 onto your computer's clipboard (not the clipboard of the NCBI page), go to the NCBI BLAST page and do a gapped BLAST search (otherwise default options). Do you obtain any different results?

  3. Download proteins with the following gi numbers: 2506213, 2493127, 4323566, 2983405, 1303679 [Hint: you can paste multiple space/comma separated gi numbers to retrieve all 5 proteins simultaneously]. Using PRSS determine if there is significant similarity between these proteins.

  4. Do the same for the D-alanine-D-alanine ligase (gi|145722) and a glutathione synthetase (gi|121663).

  5. Do a BLAST search with the D-Ala D-Ala ligase as a query. Increase the "Expect" parameter to 1000 and the "descriptions" to 250.

  6. Do a PSI-BLAST search with the D-Ala D-Ala ligase as a query. After how many iterations do you start to pick up the carbamoyl phosphate synthetases? Which other types of enzyme are included among the hits? (If it is too slow check here for 7 rounds of PSI-BLAST).


  7. START WORKING ON YOUR STUDENT PROJECT.

    If you still have time left after that, you may continue with the questions below.
  8. Go to the Taxonomy browser in Entrez. Can you use this to find the taxonomic position of Pyrococcus and Aeropyrum? To which kingdoms do they belong?

  9. Go to the Entrez Microbial genomes section. Explore the different genome options for Aeropyrum pernix ([G] [F] [T] [P] [C] [D] [L] [S]). Select Taxtable ([T]). How many ORFs do you find whose most similar sequence is a eukaryotic one? What could be the reason for this? Select one of these ORFs and try to find out what it might encode. (Don't forget to check the pairwise alignment, this will tell you which part of the archaeal protein picked up the eukaryotic protein! Also, you don't need to run a normal blast search, the first number links to the BLink of the sequence.)

  10. Use Internet Explorer for this exercise. Do a PSI-BLAST search for 3 iterations with the following sequence:
     >gi|7436316|pir||D75028 Pab VMA intein
     CVDGDTLVLTKEFGLIKIKDLYKILDGKGKKTVNGNEEWTELERPITLYGYKDGKIVEIKATHVYKGFS
     AGMIEIRTRTGRKIKVTPIHKLFTGRVTKNGLEIREVMAKDLKKGDRIIVAKKIDGGERVKLNIRVEQKR
     GKKIRIPDVLDEKLAEFLGYLIADGTLKPRTVAIYNNDESLLRRANELANELFNIEGKIVKGRTVKALLI
     HSKALVEFFSKLGVPRNKKARTWKVPKELLISEPEVVKAFIKAYIMCDGYYDENKGEIEIVTASEEAAYG
     FSYLLAKLGIYAIIREKIIGDKVYYRVVISGESNLEKLGIERVGRGYTSYDIVPVEVEELYNALGRPYAE
     LKRAGIEIHNYLSGENMSYEMFRKFAKFVGMEEIAENHLTHVLFDEIVEIRYISEGQEVYDVTTETHNFI
     GGNMPTLLHNT  
    Do you notice anything strange about the search results? Save the PSSM (Position Specific Scoring Matrix, or profile) from your search on the 4th iteration. To do that, choose PSSM from the pull-down menu under Format options and click the "Format!" button. We are going to use this profile next class in different BLAST searches.
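If you want to check a FASTA record before pasting it into a search form, a small parser is enough. The sketch below is only an illustration: parse_fasta and gi_number are hypothetical helper names, and the example holds just the header and first sequence line of the intein record from question 10.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without line breaks
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def gi_number(header):
    """Extract the gi number from a header of the form 'gi|<number>|...'."""
    fields = header.split("|")
    return fields[1] if len(fields) > 1 and fields[0] == "gi" else None


# Header plus the first sequence line only, to keep the example short.
example = """>gi|7436316|pir||D75028 Pab VMA intein
CVDGDTLVLTKEFGLIKIKDLYKILDGKGKKTVNGNEEWTELERPITLYGYKDGKIVEIKATHVYKGFS
"""

for header, sequence in parse_fasta(example):
    print(gi_number(header), len(sequence), sequence[:10])
```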

 
