BLAST, PSI-BLAST, and Homology

Are two similar sequences homologous (i.e., is their similarity due to shared ancestry)?

One way to quantify the similarity between two sequences is to:

1. Compare the actual sequences and calculate the alignment score.
2. Randomize (scramble) one (or both) of the sequences and calculate the alignment score for the randomized sequences.
3. Repeat step 2 at least 100 times.
4. Describe the distribution of the randomized alignment scores.
5. Do a statistical test to determine whether the score obtained for the real sequences is significantly better than the scores for the randomized sequences.
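
The procedure above can be sketched in a few lines of Python. This is only an illustration, not how PRSS is implemented: the scoring scheme (match = 2, mismatch = -1, linear gap = -2) is a toy assumption, whereas real programs use substitution matrices and affine gap penalties, and the function names are made up for this example.

```python
import random
import statistics

def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score under a toy scoring scheme."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def shuffle_test(a, b, n_shuffles=100, seed=0):
    """Score the real pair, then score `a` against shuffled copies of `b`."""
    rng = random.Random(seed)
    real = local_alignment_score(a, b)
    letters = list(b)
    random_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys the order
        random_scores.append(local_alignment_score(a, "".join(letters)))
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)
    # z is undefined if all randomized scores happen to be identical
    z = (real - mean) / sd if sd > 0 else float("nan")
    # fraction of randomized scores at least as good as the real score
    frac = sum(s >= real for s in random_scores) / n_shuffles
    return real, mean, sd, z, frac
```

Calling `shuffle_test(seq1, seq2)` returns the real score, the mean and standard deviation of the randomized scores, the z-value, and the fraction of shuffles that scored at least as well as the real pair. Shuffling preserves the amino acid composition, so the comparison asks whether the order of residues, not just their frequencies, explains the score.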

To illustrate the assessment of similarity/homology we will use a program from Pearson's FASTA package called PRSS.
This and many other programs by Bill Pearson are available from his web page at ftp://ftp.virginia.edu/pub/fasta/.

A web version is available here.

Example: Sequences are here (fl), here (B), here (A) and here (A2)

P-values give the probability of finding a match of this quality or better by chance. P-values lie in [0,1]; E-values lie in [0,infinity). For small values, E and P are approximately equal.
Z-values give the distance between the actual alignment score and the mean of the scores for the randomized sequences, expressed as multiples of the standard deviation of the randomized scores. For example, a z-value of 3 means that the actual alignment score is 3 standard deviations better than the average for the randomized sequences. Z-values > 3 are usually considered suggestive of homology; z-values > 5 are considered a sufficient demonstration.
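
As a small illustration of these definitions, the sketch below computes a z-value from a list of randomized scores and converts an E-value to a P-value. The conversion P = 1 - exp(-E) assumes that the number of chance matches follows a Poisson distribution, which is the relation commonly used in BLAST statistics; the function names are made up for this example.

```python
import math
import statistics

def z_value(real_score, random_scores):
    """Distance of the real score from the random mean, in standard deviations."""
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)  # sample standard deviation
    return (real_score - mean) / sd

def p_from_e(e_value):
    """P-value for at least one chance match, assuming Poisson-distributed hits."""
    return -math.expm1(-e_value)  # numerically stable form of 1 - exp(-E)
```

For a small E-value such as 1e-6, `p_from_e` returns a number that is practically identical to E, while for large E-values the result approaches, but never exceeds, 1.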

If you can demonstrate significant similarity using either randomization or an unweighted BLAST search, your sequences are homologous (i.e., related by common ancestry). Convergent evolution has not been shown to lead to sequence similarities detectable by these means (see above; this might not be true for scores in PSI-BLAST).

If the actual alignment score is more than three standard deviations (of the randomized sequences) better than the mean for the randomized sequences, the two sequences are homologous (i.e. related by common ancestry).

BUT: Failure to detect significant similarity only shows our inability to detect homology; it does not prove that the sequences are not homologous.

Example:

Jim Knox (MCB-UConn) has studied many proteins involved in bacterial cell wall biosynthesis and in antibiotic binding, synthesis, or destruction. Many of these proteins have identical 3-D structures and can therefore be assumed to be homologous; however, the above tests fail to detect these homologies (for example, enzymes with GRASP nucleotide binding sites are depicted here).

Basic Local Alignment Search Tool (BLAST)

BLAST is a tool to search for similar sequences. It works with both DNA and protein sequences (as query sequences).

There are five different BLAST programs, which perform the following searches:

BLASTP compares an amino acid query sequence against a protein sequence database;

BLASTN compares a nucleotide query sequence against a nucleotide sequence database;

BLASTX compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database;

TBLASTN compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).

TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
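
The choice among the five programs follows directly from the type of the query and of the database. The helper below is a hypothetical sketch that encodes the list above as a lookup; the function name and its arguments are made up for this example and are not part of any BLAST software.

```python
def choose_blast_program(query_type, database_type, translated=False):
    """Return the BLAST program for a query/database type combination.

    query_type and database_type are "protein" or "nucleotide".
    `translated` only matters for nucleotide vs. nucleotide searches:
    True compares six-frame translations (tblastx), False compares DNA (blastn).
    """
    pair = (query_type, database_type)
    if pair == ("protein", "protein"):
        return "blastp"
    if pair == ("nucleotide", "protein"):
        return "blastx"   # query is translated in six frames
    if pair == ("protein", "nucleotide"):
        return "tblastn"  # database is translated in six frames
    if pair == ("nucleotide", "nucleotide"):
        return "tblastx" if translated else "blastn"
    raise ValueError(f"unknown sequence types: {pair}")
```

For example, a protein query searched against unannotated genomic DNA calls for `choose_blast_program("protein", "nucleotide")`, which returns `"tblastn"`.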

Databases (subject sequences): nr (non-redundant) is most commonly used.

Filtering for low-complexity.

Scoring Matrices for the nucleotide sequences and for the amino acid sequences.

The underlying operation: the program aligns a query sequence to each subject sequence in the database and reports the results as a hit list, which is ranked by Score/E-value. The E-value for a hit represents the number of such alignments that would be expected by chance alone.

Rule of thumb: an E-value below 10^-4 usually indicates evidence for homology.
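
Applying this rule of thumb to a hit list is a simple filter. The sketch below assumes the hits are in the 12-column tab-separated layout written by the command-line BLAST+ programs with `-outfmt 6`, where column 2 is the subject ID, column 11 the E-value, and column 12 the bit score. The function name is made up, and the three example rows are invented values, not real search results.

```python
def significant_hits(tabular_text, max_evalue=1e-4):
    """Return (subject_id, evalue, bit_score) for hits below the E-value cutoff."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id = fields[1]
        evalue = float(fields[10])
        bit_score = float(fields[11])
        if evalue < max_evalue:
            hits.append((subject_id, evalue, bit_score))
    return sorted(hits, key=lambda hit: hit[1])  # best (smallest) E-value first

# Invented rows for illustration only.
example = "\n".join([
    "query1\tsubjB\t35.0\t180\t110\t4\t5\t180\t10\t190\t3e-07\t52.0",
    "query1\tsubjA\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t180",
    "query1\tsubjC\t28.0\t60\t40\t2\t20\t80\t33\t90\t0.5\t25.4",
])
kept = significant_hits(example)
```

With these invented rows, `kept` contains the hits for `subjA` and `subjB`, while `subjC` (E-value 0.5) is dropped. The cutoff is a convention, not a proof: hits above it may still be homologs that the search could not detect.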

Another variation of BLAST: pairwise BLAST - compares two sequences with each other.

For a very basic BLAST tutorial, click here.
For an explanation of how the BLAST algorithm actually works, click here.

PSI BLAST

PSI-BLAST provides an enormous advantage over basic BLAST in the detection of distantly related sequences. It only works if some closely related sequences are already available, but if this is the case, it finds many other distantly related sequences.

The NCBI page describes PSI blast as follows:
Position-Specific Iterated BLAST (PSI-BLAST) provides an automated, easy-to-use version of a "profile" search, which is a sensitive way to look for sequence homologues. The program first performs a gapped BLAST database search. The PSI-BLAST program uses the information from any significant alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching. PSI-BLAST may be iterated until no new significant alignments are found. At this time PSI-BLAST may be used only for comparing protein queries with protein databases.

A diagram giving an overview of the PSI-BLAST procedure is here.

The results of a basic BLAST search are aligned and a pattern of conserved residues is extracted from the alignment. This pattern is used for the next iteration. An important parameter to adjust is the E-value threshold down to which matches are included in the alignment and pattern extraction.
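
To make the idea of extracting a pattern from an alignment concrete, here is a minimal sketch of a position-specific scoring matrix built from a gapless toy alignment. It is not the PSI-BLAST algorithm: it assumes a uniform background frequency of 1/20 and a fixed pseudocount, whereas PSI-BLAST weights the sequences and uses data-dependent pseudocounts and realistic background frequencies. All names are made up for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Log-odds score (log2) for every amino acid at every alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(length):
        # count residues in this column, ignoring gaps and unknown symbols
        counts = Counter(seq[col] for seq in aligned if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score_segment(pssm, segment):
    """Sum the column scores for a segment aligned to the start of the profile."""
    return sum(col.get(aa, 0.0) for col, aa in zip(pssm, segment))

profile = build_pssm(["GKS", "GKT", "GRS"])
```

In `profile`, a residue that is conserved in a column (G in the first column) gets a positive score and a residue never seen there gets a negative one, so `score_segment(profile, "GKS")` is higher than `score_segment(profile, "WWW")`. Which sequences are allowed into the alignment, as set by the E-value threshold, therefore directly shapes the scores used in the next iteration.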

At higher iterations a PSI blast profile can be corrupted and false positives are identified with "significant" E-values. I.e., in a traditional blast search one can be quite certain that a match with an E-value of 10^-13 represents a homologue; this is not clear with PSI blast. Test studies indicate that profile corruptions are likely after more than 5 iterations. On the positive side: there are many fewer false negatives with PSI blast than with basic blast.

Assignment #3:

Using Entrez download the protein sequence with PID# 2506213. What type of proteins do you find among the related proteins?

Copy the FASTA format of gi2506213 onto your computer's clipboard (not the clipboard of the NCBI page), go to the NCBI BLAST page, and do a gapped BLAST search (otherwise default options). Do you obtain any different results?

Download proteins with the following gi numbers: 2506213, 2493127, 4323566, 2983405, 1303679 [Hint: you can paste multiple space/comma separated gi numbers to retrieve all 5 proteins simultaneously]. Using PRSS determine if there is significant similarity between these proteins.

Do the same for the D-alanine-D-alanine ligase (gi|145722) and a glutathione synthetase (gi|121663).

Do a BLAST search with the D-Ala D-Ala ligase as a query. Increase the "Expect" parameter to 1000 and the "descriptions" to 250.

Do a PSI-BLAST search with the D-Ala D-Ala ligase as a query. After how many iterations do you start to pick up the carbamoyl phosphate synthetases? Which other types of enzyme are included among the hits?
(If it is too slow check here for 7 rounds of PSI-BLAST).

Use Internet Explorer for this exercise. Do a PSI-BLAST search for 3 iterations with the following sequence:
```
 >gi|7436316|pir||D75028 Pab VMA intein
 CVDGDTLVLTKEFGLIKIKDLYKILDGKGKKTVNGNEEWTELERPITLYGYKDGKIVEIKATHVYKGFS
 AGMIEIRTRTGRKIKVTPIHKLFTGRVTKNGLEIREVMAKDLKKGDRIIVAKKIDGGERVKLNIRVEQKR
 GKKIRIPDVLDEKLAEFLGYLIADGTLKPRTVAIYNNDESLLRRANELANELFNIEGKIVKGRTVKALLI
 HSKALVEFFSKLGVPRNKKARTWKVPKELLISEPEVVKAFIKAYIMCDGYYDENKGEIEIVTASEEAAYG
 FSYLLAKLGIYAIIREKIIGDKVYYRVVISGESNLEKLGIERVGRGYTSYDIVPVEVEELYNALGRPYAE
 LKRAGIEIHNYLSGENMSYEMFRKFAKFVGMEEIAENHLTHVLFDEIVEIRYISEGQEVYDVTTETHNFI
 GGNMPTLLHNT  
```
What types of enzymes do you get as hits? Do you notice anything strange about the search results?

Save the PSSM (Position Specific Scoring Matrix, or profile) from your search on the 4th iteration. To do that, choose PSSM from the pull-down menu under Format options and click the "Format!" button. After the search is done, you should see a strange-looking mixture of alphanumeric symbols in your browser window. This is a PSSM. Save the PSSM to disk as a text file, or keep this browser window open. We are going to use this profile in question #8.
Now we will use the PSSM to BLAST the completed genomes. Go to the Microbial Genomes Genomic BLAST page (let it load completely before choosing any options). Paste the intein sequence into the query sequence box and change the Query and Database entries to "Protein". Choose one of the following genomes as the Database:
- Pyrobaculum aerophilum
- Aeropyrum pernix
- Sulfolobus tokodaii
- Archaeoglobus fulgidus
- Methanothermobacter thermautotrophicus
- Thermoplasma volcanium
- Methanococcus jannaschii
- Saccharomyces cerevisiae (this genome is on the Other eukaryotes Genomic BLAST page, but the user interface is the same)
After that, click the "Adv. BLAST" button. This will redirect you to the BLAST search window. Paste your PSSM from Question #7 into the PSSM box (under Options). What are the results of your search? Did you get any significant matches? What are they? If you have significant matches, does the match occur over the full lengths of both query and subject sequences? Use Blink to investigate whether the hits are indeed inteins.
What is your conclusion? In your answer indicate
- genomes searched,
- number of significant matches found,
- the E-values of these matches, and
- the identity of these matches
(i.e., are these probable inteins, or are they likely to be something else?).
