Blast, PSI-blast, and Homology
Are two similar sequences homologous (i.e., is their similarity due to shared ancestry)?
One way to quantify the similarity between two sequences is to
1. compare the actual sequences and calculate the alignment score
2. randomize (scramble) one (or both) of the sequences and calculate the alignment score for the randomized sequences
3. repeat step 2 at least 100 times
4. describe the distribution of the randomized alignment scores
5. do a statistical test to determine whether the score obtained for the real sequences is significantly better than the scores for the randomized sequences
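The five steps above can be sketched in a few lines of Python. This is only an illustration and not what PRSS does internally: the scoring function is a plain Smith-Waterman local alignment with arbitrary match, mismatch and gap values (a real protein comparison would use a substitution matrix and more careful gap penalties), and the function names, the default of 100 shuffles and the seed parameter are assumptions made for this sketch.

```python
import random
import statistics


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not PRSS defaults.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def shuffle_test(seq_a, seq_b, n_shuffles=100, seed=0):
    """Steps 1-5: score the real pair, then score shuffled copies of seq_b."""
    rng = random.Random(seed)
    real = local_alignment_score(seq_a, seq_b)            # step 1
    letters = list(seq_b)
    shuffled_scores = []
    for _ in range(n_shuffles):                           # step 3
        rng.shuffle(letters)                              # step 2
        shuffled_scores.append(local_alignment_score(seq_a, "".join(letters)))
    mean = statistics.mean(shuffled_scores)               # step 4
    sd = statistics.stdev(shuffled_scores)
    # Step 5 (simplest form): distance from the mean in standard deviations.
    z = (real - mean) / sd if sd > 0 else float("inf")
    return real, mean, sd, z
```

Calling `shuffle_test(seq_a, seq_b)` returns the real score, the mean and standard deviation of the shuffled scores, and the resulting z-value. Shuffling keeps the composition and length of the sequence but destroys the order of its residues, which is the property the test relies on.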
To illustrate the assessment of similarity/homology we will use a program from Pearson's FASTA package called PRSS. This and many other programs by Bill Pearson are available from his web page at ftp://ftp.virginia.edu/pub/fasta/. A version for PCs is here.
Rules of thumb:
If you can demonstrate significant similarity using either randomization or an unweighted blast search, your sequences are homologous (i.e., related by common ancestry). Convergent evolution has not been shown to lead to sequence similarities detectable by these means (this might not be true for scores in PSI-blast).
If the actual alignment score is more than three standard deviations (of the randomized scores) better than the mean for the randomized sequences, the two sequences are homologous (i.e., related by common ancestry). PRSS and many other programs use more accurate distributions to describe the distribution of random hits. The expectation value for the alignment score of the actual sequences is based on these statistics.
Usually, E-values (in a blast search or through randomization) smaller than 10^-4 are convincing.
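The notes do not say which distribution PRSS fits to the randomized scores. A common choice for local alignment scores is the extreme value (Gumbel) distribution, and the sketch below assumes it. The method-of-moments fit is a simple stand-in for whatever fitting procedure a real program uses, and `n_comparisons` is a hypothetical parameter standing for the number of sequences the query is compared against.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel(scores):
    """Method-of-moments estimate of Gumbel location (mu) and scale (beta)."""
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta


def gumbel_p_value(score, mu, beta):
    """Probability that a single random comparison scores >= `score`."""
    # 1 - exp(-x) written with expm1 to stay accurate for very small x.
    return -math.expm1(-math.exp(-(score - mu) / beta))


def expectation(score, mu, beta, n_comparisons):
    """Expected number of random comparisons scoring >= `score`."""
    return n_comparisons * gumbel_p_value(score, mu, beta)
```

Given the list of randomized scores from a shuffling experiment, `fit_gumbel` yields the two parameters, and `expectation` then turns the real score into an E-value that can be compared against the 10^-4 rule of thumb. Because the Gumbel distribution has a heavier upper tail than the normal distribution, it tends to be less optimistic about high scores than a simple z-value cutoff.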
Terminology
E-values give the expected number of matches with an alignment score this good or better.
P-values give the probability of finding a match of this quality or better.
P-values lie in [0,1], E-values in [0,infinity). For small values, E ≈ P.
z-values give the distance between the actual alignment score and the mean of the scores for the randomized sequences, expressed as multiples of the standard deviation calculated for the randomized scores. For example, a z-value of 3 means that the actual alignment score is 3 standard deviations better than the average for the randomized sequences. z-values > 3 are usually considered suggestive of homology; z-values > 5 are considered sufficient demonstration (see the BUT below).
A somewhat readable description of E, P, HSP and other values is here.
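The statement that E and P agree for small values can be checked numerically. The sketch below assumes that the number of chance matches follows a Poisson distribution with mean E, which is the usual model in blast statistics; the notes themselves only state the approximation. Under that assumption the probability of at least one chance match is P = 1 - exp(-E). The function name is chosen for this example.

```python
import math


def p_from_e(e_value):
    """P(at least one chance match) when chance matches are Poisson with mean E."""
    # -expm1(-E) equals 1 - exp(-E) but keeps precision for very small E.
    return -math.expm1(-e_value)
```

For E = 10^-4 the two numbers differ by only about 5 x 10^-9, whereas for E = 10 the P-value is just below 1. This is why E-values are more informative than P-values for weak matches.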
Examples of PRSS output are here.
BUT: failure to detect significant similarity shows only our inability to detect the homology; it does not prove that the sequences are not homologous.
Examples:
Jim Knox (MCB-UConn) has studied many proteins involved in bacterial cell wall biosynthesis and in antibiotic binding, synthesis or destruction. Many of these proteins have identical 3-D structures and can therefore be assumed to be homologous; however, the above tests fail to detect these homologies (for example, enzymes with GRASP nucleotide binding sites are depicted here).
DNA replication involves many different enzymes. Some of the proteins do the same thing in bacteria, archaea and eukaryotes and have similar 3-D structures (e.g., the sliding clamp: E. coli dnaN and eukaryotic PCNA; see Edgell and Doolittle, Cell 89, 995-998), but again, the above tests fail to detect homology.
Output of a “normal” gapped blast search is here. Notice that you can modify the type of alignment. A tutorial for a standard blast search is here.
An easy way to force the program to report less significant matches is to increase the expect value on the advanced blast page.
The results from a normal blast search of the nr databank using an intein from a V-ATPase of Pyrococcus (an archaeon or archaebacterium) are depicted here. The bottom line is that there are a few other inteins that show significant similarity, but not many.
PSI-blast provides an enormous advantage over normal blast in the detection of distantly related sequences. It only works if some closely related sequences are already available, but if this is the case it finds many other distantly related sequences. The NCBI page describes PSI-blast as follows:
Position-Specific Iterated BLAST (PSI-BLAST) provides an automated, easy-to-use version of a "profile" search, which is a sensitive way to look for sequence homologues. The program first performs a gapped BLAST database search. The PSI-BLAST program uses the information from any significant alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching. PSI-BLAST may be iterated until no new significant alignments are found. At this time PSI-BLAST may be used only for comparing protein queries with protein databases.
The results of a normal blast search are aligned and a pattern of conserved residues is extracted from the alignment. This pattern is used for the next iteration. An important parameter to adjust is the E-value threshold down to which matches are included in the alignment and pattern extraction.
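To make the idea of a pattern extracted from the alignment concrete, here is a minimal sketch of how a position-specific score matrix could be built from hits that pass the inclusion threshold. It is a simplification and not the PSI-BLAST implementation: it assumes the hits are already aligned to the query without insertions, uses a uniform background frequency and a flat pseudocount, and applies no sequence weighting. All function and parameter names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def include_hits(hits, e_threshold):
    """Keep aligned hits whose E-value is at or below the inclusion threshold.

    `hits` is a list of (aligned_sequence, e_value) pairs.
    """
    return [seq for seq, e_value in hits if e_value <= e_threshold]


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*aligned_seqs):
        # Gap characters and unknown residues are simply not counted.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Sum of column scores for an ungapped segment of standard residues."""
    return sum(column[res] for column, res in zip(pssm, segment))
```

Lowering `e_threshold` admits fewer sequences into the matrix, which makes the search less sensitive but also less likely to pull unrelated sequences into later iterations. Raising it has the opposite effect: once an unrelated sequence is included, its residues are counted in every column and matches to it score well in the next round.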
PSI-blast is comparatively new, and it is not clear to me to what extent false positives are identified with “significant” E-values. I.e., in a traditional blast search one can be quite certain that a match with an E-value of 10^-13 represents a homologue; this is not clear with PSI-blast. On the other hand, it is certain that there are many fewer false negatives with PSI-blast than with normal blast.
The results from a PSI-blast search using the same intein as a query are here: in the 1st iteration all additional matches are inteins. The 2nd iteration also finds other endonucleases (homing and mating type switching endonucleases), all of which appear to be homologous to at least part of the intein. There are also a few matches to HSP70s. In later iterations (3rd iteration) one obtains many of these HSPs, and because the pattern is built including these HSPs (!), the E-values for these matches are reported as highly significant. The same is true for myosin and other eukaryotic cytoskeletal proteins. Could all of these have evolved from domesticated selfish genes, or is this an example of false positives?