Blast, PSI–blast, and Homology

Are two similar sequences homologous
(i.e., is their similarity due to shared ancestry)?

One way to quantify the similarity between two sequences is to

1.    compare the actual sequences and calculate the alignment score

2.    randomize (scramble) one (or both) of the sequences and calculate the alignment score for the randomized sequences.

3.    repeat step 2 at least 100 times

4.    describe the distribution of the randomized alignment scores

5.    do a statistical test to determine if the score obtained for the real sequences is significantly better than the score for the randomized sequences
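
The five steps above can be sketched in a few lines of Python.  This is only a minimal illustration, not how PRSS works internally: the scoring scheme (match = 2, mismatch = -1, linear gap penalty = -2), the function names, and the simple empirical P-value are all assumptions made for this example.  Real programs use substitution matrices such as BLOSUM62, affine gap penalties, and a fitted score distribution.

```python
import random

def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score (toy scoring, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def shuffle_test(a, b, n_shuffles=200, seed=0):
    """Steps 1-5: score the real pair, then score a against shuffled copies of b."""
    rng = random.Random(seed)
    real = local_alignment_score(a, b)            # step 1
    residues = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):                   # steps 2 and 3
        rng.shuffle(residues)                     # keeps composition, destroys order
        shuffled_scores.append(local_alignment_score(a, "".join(residues)))
    # Steps 4 and 5: fraction of shuffled scores at least as good as the real one.
    # The +1 terms keep the estimate from ever being exactly zero.
    n_as_good = sum(s >= real for s in shuffled_scores)
    p_empirical = (n_as_good + 1) / (n_shuffles + 1)
    return real, shuffled_scores, p_empirical

# Hypothetical toy sequences, used only to show the call.
real, scores, p = shuffle_test("HEAGAWGHEE", "PAWHEAE")
print(real, max(scores), p)
```

With only a few hundred shuffles the empirical P-value cannot fall below roughly 1/(number of shuffles), which is one reason programs fit a distribution to the shuffled scores instead of counting.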

To illustrate the assessment of similarity/homology we will use a program from Pearson's FASTA package called PRSS. 
This and many other programs by Bill Pearson are available from his web page at ftp://ftp.virginia.edu/pub/fasta/.  A version for PCs is here.

Rules of thumb:

If you can demonstrate significant similarity using either randomization or an unweighted blast search, your sequences are homologous (i.e., related by common ancestry).  Convergent evolution has not been shown to lead to sequence similarities detectable by these means (this might not be true for scores in PSI-blast).

If the actual alignment score is more than three standard deviations (of the scores for the randomized sequences) better than the mean for the randomized sequences, the two sequences are homologous (i.e., related by common ancestry).  PRSS and many other programs use more accurate distributions to describe the scores of random hits.  The expectation value for the alignment score of the actual sequences is based on these statistics.

Usually E-values (in a blast search or through randomization) smaller than 10^-4 are convincing.

 

Terminology

E-values give the expected number of matches with an alignment score this good or better.

P-values give the probability of finding a match of this quality or better.

P-values lie in [0,1]; E-values lie in [0,infinity).  For small values, E and P are approximately equal.
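
Why E and P agree for small values: if the number of chance matches is treated as Poisson distributed (the assumption behind standard BLAST statistics), the probability of at least one such match is P = 1 - exp(-E).  The helper names below are made up for this sketch.

```python
import math

def p_from_e(e_value):
    """P = 1 - exp(-E); expm1 keeps precision for very small E."""
    return -math.expm1(-e_value)

def e_from_p(p_value):
    """Inverse relation: E = -ln(1 - P), valid for P < 1."""
    return -math.log1p(-p_value)

for e in (1e-4, 0.1, 1.0, 10.0):
    print(e, p_from_e(e))
```

For E = 10^-4 the two values differ only in the ninth decimal place, whereas for E = 10 the P-value is already close to 1.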

z-values give the distance between the actual alignment score and the mean of the scores for the randomized sequences, expressed as multiples of the standard deviation calculated for the randomized scores.  For example, a z-value of 3 means that the actual alignment score is 3 standard deviations better than the average for the randomized sequences.  z-values > 3 are usually considered suggestive of homology; z-values > 5 are considered sufficient demonstration (but see the BUT below).
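
As a small worked example of the z-value definition, the sketch below uses made-up scores; the function name and the numbers are purely illustrative.

```python
import statistics

def z_value(real_score, randomized_scores):
    """Distance of the real score from the randomized mean, in standard deviations."""
    mean = statistics.mean(randomized_scores)
    sd = statistics.stdev(randomized_scores)   # sample standard deviation
    return (real_score - mean) / sd

# Hypothetical randomized scores with mean 20 and standard deviation 10.
print(z_value(50, [10, 20, 30]))
```

In practice the list would hold the 100 or more scores from the shuffled sequences, not three values.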

A somewhat readable description of E, P, HSP and other values is here.

Examples of PRSS output are here.

 

BUT: failure to detect significant similarity shows only our inability to detect the homology; it does not prove that the sequences are not homologous.

 

Examples:

Jim Knox (MCB-UConn) has studied many proteins involved in bacterial cell wall biosynthesis and antibiotic binding, synthesis or destruction.  Many of these proteins have identical 3-D structures and can therefore be assumed to be homologous; however, the above tests fail to detect these homologies (for example, enzymes with GRASP nucleotide binding sites are depicted here).

DNA replication involves many different enzymes.  Some of the proteins do the same thing in bacteria, archaea and eukaryotes; they have similar 3-D structures (e.g., the sliding clamp: E. coli dnaN and eukaryotic PCNA; see Edgell and Doolittle, Cell 89, 995-998), but again, the above tests fail to detect homology.

 

Output of a “normal” gapped blast search is here.  Notice that you can modify the type of alignment.  A tutorial for standard blast search is here. 

An easy way to force the program to report less significant matches is to increase the expect value in the advanced blast page. 

The results from a normal blast search of the nr databank using an intein from a V-ATPase of Pyrococcus (an archaeon or archaebacterium) are depicted here.  The bottom line is that there are a few other inteins that show significant similarity, but not many. 

PSI-blast provides an enormous advantage over normal blast in the detection of distantly related sequences.  It only works if some closely related sequences are already available, but if this is the case it finds a lot of other distantly related sequences.  The NCBI page describes PSI blast as follows:
Position-Specific Iterated BLAST (PSI-BLAST) provides an automated, easy-to-use version of a "profile" search, which is a sensitive way to look for sequence homologues. The program first performs a gapped BLAST database search. The PSI-BLAST program uses the information from any significant alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching. PSI-BLAST may be iterated until no new significant alignments are found. At this time PSI-BLAST may be used only for comparing protein queries with protein databases. 

The results of a normal blast search are aligned and a pattern of conserved residues is extracted from the alignment.  This pattern is used for the next iteration.  An important parameter to adjust is the E-value threshold down to which matches are included in the alignment and pattern extraction. 
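
To make the idea of a "pattern of conserved residues" concrete, here is a much simplified sketch of a position-specific score matrix built from an ungapped alignment.  It is not PSI-BLAST's actual procedure: the uniform background frequency, the flat pseudocount, and the function names are assumptions for illustration, and the real program also weights sequences and uses substitution-matrix-based pseudocounts.  Only hits below the inclusion E-value threshold would be placed in the alignment passed to it.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned_seqs, pseudocount=1.0):
    """One dict of log-odds scores per alignment column (uniform background assumed)."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(aligned_seqs[0])):
        # Count residues in this column; gap characters such as '-' are ignored.
        counts = Counter(s[col] for s in aligned_seqs if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm

def score_segment(pssm, segment):
    """Sum the column scores for a segment of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

# Hypothetical four-residue alignment of included hits.
profile = build_pssm(["ACDW", "ACEW", "ACDW", "GCDW"])
print(score_segment(profile, "ACDW"), score_segment(profile, "WWWW"))
```

Because every sequence admitted to the alignment shifts the column frequencies, a single unrelated sequence that slips under the threshold makes its own relatives score well in the next iteration.
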
PSI blast is comparatively new, and it is not clear to me to what extent false positives are reported with "significant" E-values.  I.e., in a traditional blast search one can be quite certain that a match with an E-value of 10^-13 represents a homologue; this is not clear with PSI blast.  On the other hand, it is certain that there are many fewer false negatives with PSI blast than with normal blast.  The results from a PSI blast search using the same intein as a query are here: in the 1st iteration, all additional matches are inteins.  The 2nd iteration also finds other endonucleases (homing and mating type switching endonucleases), all of which appear to be homologous to at least part of the intein.  There are also a few matches to HSP70s (70 kDa heat shock proteins).  In later iterations (3rd iteration) one obtains many of these HSPs, and because the pattern is built including these HSPs (!), the E-values for these matches are reported as highly significant.  The same is true for myosin and other eukaryotic cytoskeletal proteins.  Could all of these have evolved from domesticated selfish genes, or is this an example of false positives?