CLASS 4. Database Similarity Searching. BLAST.
Requirements of Database Similarity Searching
Ch.4 (textbook);
Section 4.6 (supp. textbook);
Sections 5.3 (supp. textbook); [optional, detailed theory]
The desirable features of database similarity searching are:
- Sensitivity - To find as many correct (i.e. significantly similar) hits as possible (true positives)
- Specificity - To exclude incorrect hits (false positives)
- Speed - To obtain results in a reasonable amount of time
Dynamic programming algorithms are too slow for database searches. The solution is to use heuristic (as opposed to exhaustive) searches, i.e. to examine only a fraction of the possible alignments.
Basic Local Alignment Search Tool (BLAST)
- BLASTP compares an amino acid query sequence against a protein sequence database;
- BLASTN compares a nucleotide query sequence against a nucleotide sequence database;
- BLASTX compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database;
- TBLASTN compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands);
- TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
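The six-frame conceptual translation used by BLASTX, TBLASTN and TBLASTX can be sketched as follows. This is a minimal illustration, assuming the standard genetic code and an uppercase A/C/G/T alphabet; the function names (reverse_complement, translate, six_frame_translations) and the frame labels "+1" to "-3" are hypothetical and not part of BLAST itself.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement (the other strand, read 5' to 3')."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translate three frames on each strand; '*' marks a stop codon."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translations("ATGGCCTAA").items():
        print(name, protein)
```

Each of the six translated products is then treated as a separate protein sequence for comparison.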
OVERVIEW OF STEPS IN BLAST ALGORITHM:
- Divide the query sequence into words (short stretches of sequence; by default three amino acids or 11 nucleotides).
- Search the database for occurrences of these words (matches are scored using a substitution matrix).
- Make local pairwise alignments by extending matching words in both directions until the alignment score deteriorates due to mismatches and gaps.
- Generate a report, where matches are ranked by alignment scores and their significance is assessed.
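The steps above can be sketched as a toy "seed and extend" search. This is a deliberately simplified illustration, not the BLAST implementation: it assumes exact word matches only, a plain match/mismatch score instead of a substitution matrix, ungapped extension, and a single subject sequence. The function names and the parameter values (w=4, match=1, mismatch=-3, x_drop=5) are assumptions chosen for the example.

```python
from collections import defaultdict


def index_words(query, w):
    """Step 1: map every word of length w in the query to its positions."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def extend(query, subject, qi, si, w, match=1, mismatch=-3, x_drop=5):
    """Step 3: extend a seed in both directions without gaps.

    Extension in a direction stops once the running score has fallen more
    than x_drop below the best score seen. Returns (score, q_start, q_end)
    with q_end exclusive.
    """
    score = best = w * match
    best_end = qi + w
    i, j = qi + w, si + w
    while i < len(query) and j < len(subject) and best - score <= x_drop:
        score += match if query[i] == subject[j] else mismatch
        i += 1
        j += 1
        if score > best:
            best, best_end = score, i
    score = best  # restart the leftward extension from the best score so far
    best_start = qi
    i, j = qi - 1, si - 1
    while i >= 0 and j >= 0 and best - score <= x_drop:
        score += match if query[i] == subject[j] else mismatch
        if score > best:
            best, best_start = score, i
        i -= 1
        j -= 1
    return best, best_start, best_end


def search(query, subject, w=4):
    """Steps 2 and 4: find seeds in the subject, extend, rank by score."""
    index = index_words(query, w)
    hits = set()  # a set removes duplicates from seeds on the same diagonal
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            score, q_start, q_end = extend(query, subject, i, j, w)
            s_start = j - (i - q_start)
            hits.add((score, q_start, q_end, s_start))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    for hit in search("ACGTACGTGA", "TTTACGTACGTGATTT"):
        print(hit)  # (score, query start, query end, subject start)
```

The speed-up over dynamic programming comes from the word index: only diagonals that contain a word match are ever examined, which is also why a heuristic search can miss alignments that contain no matching word.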
Assessing Alignment Quality. Every alignment receives a raw score, but as we've seen earlier, the raw score depends on the scoring system used. A more useful measure is the bit score, a normalized raw score that no longer depends on the scoring scheme. The P-value gives the probability of finding a match of this quality or better by chance; P-values lie in the range [0, 1]. The E-value (expectation value) describes how many matches with this alignment score or better are expected to be found in the database by chance. E-values depend on database size (n) and range from 0 to n. For small values (E < 0.01), the E-value is approximately equal to the P-value.
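The relationships between raw score, bit score, E-value and P-value can be sketched numerically. The formulas below are the usual Karlin-Altschul forms, which the notes do not spell out: bit score S' = (lambda * S - ln K) / ln 2, E = m * n * 2^(-S') for a query of length m and a database of total length n, and P = 1 - exp(-E). The values of lambda, K, m and n in the usage example are illustrative assumptions; in practice lambda and K depend on the scoring system and are reported by BLAST.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score using the scoring-system parameters lam and k."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance matches scoring at least `bits`."""
    return query_len * db_len * 2.0 ** (-bits)


def p_value(e):
    """Probability of at least one chance match: P = 1 - exp(-E)."""
    return -math.expm1(-e)  # numerically stable form of 1 - exp(-e)


if __name__ == "__main__":
    # Illustrative (assumed) parameters, not values taken from the notes.
    lam, k = 0.267, 0.041
    m, n = 250, 50_000_000
    for raw in (40, 80, 120):
        bits = bit_score(raw, lam, k)
        e = e_value(bits, m, n)
        print(f"raw={raw} bits={bits:.1f} E={e:.3g} P={p_value(e):.3g}")
```

Because 1 - exp(-E) is close to E when E is small, the printed P-value and E-value agree for the high-scoring alignments and diverge only when E approaches or exceeds 1, where P saturates at 1.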
Low Complexity Regions of protein sequences complicate similarity searches. These are regions of biased amino acid composition (such as runs of the same amino acid) that can achieve high alignment scores in unrelated proteins. BLAST copes with these regions by masking them with Xs. This filtering option is ON in BLAST by default.
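The effect of masking can be illustrated with a minimal sketch that handles only the simplest case mentioned above, runs of a single amino acid. BLAST's own filter is more sophisticated and also detects other kinds of compositional bias; the function name mask_runs and the min_run threshold of 5 are assumptions made for this example.

```python
import re


def mask_runs(protein, min_run=5):
    """Replace every run of min_run or more identical residues with Xs."""
    # (.) captures one residue; \1{min_run-1,} requires it to repeat.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: "X" * len(m.group(0)), protein)


if __name__ == "__main__":
    print(mask_runs("MKT" + "Q" * 7 + "LAAV"))
```

Masked positions keep the sequence length unchanged, so alignment coordinates outside the masked region still refer to the original sequence.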
Anatomy of BLAST Output