CLASS 4. Database Similarity Searching. BLAST.

Requirements of Database Similarity Searching

HOMEWORK READING:
Ch.4 (textbook);

Section 4.6; (supp. textbook);

Section 5.3 (supp. textbook); [optional, detailed theory]

Database searching means making pairwise alignments en masse: a query sequence (chosen by the user) is aligned against every sequence in the database. The alignments are then ranked by quality, with the best ones at the top.

The desirable features of database similarity searching are:
  • Sensitivity - To find as many correct (i.e. significantly similar) hits as possible (true positives)
  • Specificity - To exclude incorrect hits (false positives)
  • Speed - To obtain results in a reasonable amount of time
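
The first two criteria can be expressed as simple ratios. The sketch below is only an illustration: the function names and the hit counts are hypothetical, and it assumes we already know which database sequences are truly related to the query (which in practice we usually do not).

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of truly related sequences that the search reports."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of unrelated sequences that the search correctly leaves out."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts for a single search, invented for illustration only.
tp, fn, fp, tn = 45, 5, 10, 940
print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.3f}")
```

A search tuned to report more hits tends to gain sensitivity and lose specificity, which is why both numbers are considered together.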

Exhaustive dynamic programming algorithms are too slow for searching large databases. The solution is to use heuristic (as opposed to exhaustive) searches, i.e. to examine only a fraction of the possible alignments.
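
To see where the cost comes from, here is a minimal sketch of local alignment scoring by dynamic programming (in the style of Smith-Waterman). The scoring values (match 2, mismatch -1, linear gap -2) and the function name are assumptions chosen for illustration, not the parameters of any particular program. The function also counts how many matrix cells it fills.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2):
    """Return (best local alignment score, number of matrix cells filled)."""
    prev = [0] * (len(b) + 1)   # previous row of the scoring matrix
    best = 0
    cells = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
            cells += 1
        prev = curr
    return best, cells


score, cells = local_alignment_score("TTACGTTT", "GGACGGG")
print(score, cells)
```

Every cell of the matrix is filled, so the work grows with the product of the two sequence lengths, and this has to be repeated for every sequence in the database.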

Basic Local Alignment Search Tool (BLAST)

There are five different BLAST "flavors":
  • BLASTP compares an amino acid query sequence against a protein sequence database;

  • BLASTN compares a nucleotide query sequence against a nucleotide sequence database;

  • BLASTX compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database;

  • TBLASTN compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).

  • TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
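
The "six-frame conceptual translation" used by BLASTX, TBLASTN and TBLASTX can be sketched as follows. This is an illustrative sketch, not BLAST's own code: the function names are hypothetical, it assumes the standard genetic code, and any codon containing a character other than A, C, G or T is translated as X.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from the first base; '*' marks a stop."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna: str) -> dict:
    """Translate three frames on each strand, keyed '+1'..'+3', '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translations("ATGGCCTAA").items():
    print(name, protein)
```

Each nucleotide sequence therefore yields six candidate protein sequences, which is why the translated searches take more time than BLASTP or BLASTN on sequences of comparable size.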
BLAST was initially developed by Stephen Altschul and colleagues at NCBI in 1990. (As of January 8, 2010, the article describing the currently used gapped BLAST version had been cited 23,569 times according to ISI Web of Science.)

OVERVIEW OF STEPS IN BLAST ALGORITHM:

  1. Divide the query sequence into words (short stretches of sequence; the default word length is three amino acids or 11 nucleotides).
  2. Search the database for occurrences of these words (matches are scored using a substitution matrix).
  3. Make local pairwise alignments by extending the matching words in both directions until the alignment score drops too far because of mismatches and gaps.
  4. Generate a report, where matches are ranked by alignment scores and their significance is assessed.
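
The steps above can be sketched in a few lines. This is a deliberately simplified toy, not the actual BLAST implementation: it seeds on exact word matches only (BLAST also accepts similar words that score well under a substitution matrix), it scores with a flat match/mismatch scheme instead of a matrix, it extends without gaps, and it stops extending once the running score falls more than `x_drop` below the best score seen. All function and parameter names are hypothetical.

```python
from collections import defaultdict


def make_words(seq: str, w: int):
    """Step 1: list (position, word) for every overlapping word of length w."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]


def index_subject(subject: str, w: int):
    """Step 2 (preparation): map each word to its positions in the subject."""
    index = defaultdict(list)
    for pos, word in make_words(subject, w):
        index[word].append(pos)
    return index


def extend_one_way(query, subject, qi, si, step, match, mismatch, x_drop):
    """Step 3: walk in one direction; return (score gain, residues added)."""
    running = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        running += match if query[qi] == subject[si] else mismatch
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break                      # the alignment has become too bad
        qi += step
        si += step
    return best, best_len


def seed_and_extend(query, subject, w=3, match=1, mismatch=-1, x_drop=2):
    """Return hits as (score, query_start, subject_start, length), best first."""
    index = index_subject(subject, w)
    hits = set()                       # a set merges seeds that grow into the same hit
    for q, word in make_words(query, w):
        for s in index.get(word, []):
            right, r_len = extend_one_way(query, subject, q + w, s + w, +1,
                                          match, mismatch, x_drop)
            left, l_len = extend_one_way(query, subject, q - 1, s - 1, -1,
                                         match, mismatch, x_drop)
            hits.add((w * match + left + right, q - l_len, s - l_len,
                      l_len + w + r_len))
    return sorted(hits, reverse=True)  # step 4: rank by score


for hit in seed_and_extend("GATTACA", "CCGATTACATT"):
    print(hit)
```

Only diagonals that contain a word match are ever examined, which is what makes the search heuristic: an alignment with no exact word in common with the query would be missed by this toy version.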

Rule of thumb: an E-value below 10^-4 is usually taken as evidence for homology.

Assessing Alignment Quality. Every alignment receives a raw score, but as we have seen earlier, the raw score depends on the scoring system used. A more useful score is the bit score, a normalized raw score that no longer depends on the scoring scheme, so bit scores from different searches can be compared. The P-value gives the probability of finding a match of this quality or better by chance; P-values lie in the range [0,1]. The E-value (expectation value) describes how many matches with this alignment score or better are expected to be found in the database by chance. E-values depend on database size (n) and range from 0 to n. For small values (when E < 0.01), the E-value is approximately equal to the P-value.
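
The relationships between these quantities can be written out using the Karlin-Altschul formulas: bit score S' = (lambda*S - ln K) / ln 2, E = m * n * 2^(-S'), and P = 1 - e^(-E), where S is the raw score, m the query length and n the database length. The sketch below is an illustration under assumptions: the `lam` and `k` values are placeholders (BLAST reports the values it actually used along with its output), the lengths are invented, and corrections such as effective sequence lengths are ignored.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw score so it no longer depends on the scoring system."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits with this bit score or better."""
    return query_len * db_len * 2.0 ** (-bits)


def p_value(e: float) -> float:
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e)   # numerically stable form of 1 - exp(-e)


# Placeholder statistical parameters and sizes, assumed for illustration.
lam, k = 0.267, 0.041
query_len, db_len = 300, 50_000_000

for raw in (60, 100, 150):
    bits = bit_score(raw, lam, k)
    e = e_value(bits, query_len, db_len)
    print(f"raw={raw:4d}  bits={bits:6.1f}  E={e:.3g}  P={p_value(e):.3g}")
```

Because P = 1 - e^(-E) and e^(-E) is close to 1 - E when E is small, the two values nearly coincide for small E, while for large E the P-value saturates at 1 and the E-value keeps growing.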

Low Complexity Regions of protein sequences complicate similarity searches. These are regions of biased amino acid composition (such as runs of the same amino acid) that can achieve high alignment scores in unrelated proteins. BLAST copes with these regions by masking them with Xs. This filtering option is ON in BLAST by default.
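
The idea of masking can be illustrated with a toy filter. This is not the filtering algorithm BLAST actually uses; it is a simplified stand-in that slides a window along the sequence, measures compositional diversity with Shannon entropy, and replaces low-entropy windows with Xs. The window size, the threshold and the function names are all assumptions made for the example.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Entropy in bits of the residue composition of a window."""
    total = len(window)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(window).values())


def mask_low_complexity(seq: str, window: int = 12,
                        threshold: float = 2.2) -> str:
    """Replace every residue in a low-entropy window with 'X'."""
    masked = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = True
    return "".join("X" if flag else aa for aa, flag in zip(seq, masked))


# Invented sequence: a run of glutamines between two diverse segments.
protein = "ACDEFGHIKLMN" + "Q" * 12 + "ACDEFGHIKLMN"
print(mask_low_complexity(protein))
```

Masked positions cannot contribute to word matches or to the alignment score, so a run of identical residues no longer produces spurious high-scoring hits.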

Anatomy of BLAST Output

Fig. BLASTP search output fragment. Color overview of pairwise alignments of top matches. CAA49873.1 was used as a query in BLASTP search with default parameters.


Fig. BLASTP search output fragment. Ranked list of sequences in the database that produced significant hits. CAA49873.1 was used as a query in BLASTP search with default parameters.


Fig. BLASTP search output fragment. An example of pairwise alignment. CAA49873.1 was used as a query in BLASTP search with default parameters.