CLASS 7. Multiple Sequence Alignment.
Ch.5 (textbook);
Sections 4.5, 6.4, 6.5 (supp. textbook) [OPTIONAL]
FOR FRIDAY LAB:
Familiarize yourself with DotLet.
Why do Multiple Sequence Alignments?
- Phylogenetic Reconstruction (inferring evolutionary relationships among sequences)
- Identification of Conserved Domains/Motifs
- Design of PCR primers
- Prediction of secondary and tertiary structures
Scoring and Types of Algorithms
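An MSA is typically scored with the sum-of-pairs function: each column's score is the sum of the scores of all pairs of residues in that column, and the alignment score is the sum of the column scores. The objective of most algorithms is to maximize this alignment score.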
[Figure is based on Figure 5.1 in your textbook].
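The sum-of-pairs score can be illustrated with a minimal Python sketch. The match, mismatch and gap values below are toy assumptions chosen for readability (real programs use substitution matrices such as BLOSUM and more elaborate gap penalties), and the function names are hypothetical.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned symbols (toy scoring scheme, an assumption)."""
    if a == "-" and b == "-":
        return 0  # a gap aligned with a gap is conventionally scored as 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sum_of_pairs(alignment, **scores):
    """Add up pair scores over all pairs of rows in every column."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b, **scores)
    return total

# Three aligned toy sequences (made-up data).
msa = ["AC-T", "ACGT", "A-GT"]
print(sum_of_pairs(msa))  # columns score 3, -3, -3, 3, so the total is 0
```

Note that the number of pairs per column grows quadratically with the number of sequences, so even evaluating the score of a given alignment becomes more expensive as sequences are added.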
Approaches to find MSA:
- Exhaustive Searches (e.g., dynamic programming; impractical for more than a few sequences because the computation and memory required grow exponentially with the number of sequences)
- Semi-exhaustive Searches (e.g., Divide-and-Conquer Alignment)
- Heuristic Searches
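To see why exhaustive dynamic programming is impractical, consider a back-of-the-envelope sketch. It assumes the straightforward extension of pairwise dynamic programming to N dimensions and, for simplicity, N sequences of equal length L; the function names are hypothetical.

```python
def dp_lattice_cells(n_sequences, length):
    """Cells in the N-dimensional DP lattice: (L + 1) ** N."""
    return (length + 1) ** n_sequences

def moves_per_cell(n_sequences):
    """Predecessors of a cell: any non-empty subset of sequences may advance."""
    return 2 ** n_sequences - 1

# Growth of the problem size for sequences of 100 residues.
for n in (2, 3, 5, 10):
    cells = dp_lattice_cells(n, 100)
    print(n, cells, cells * moves_per_cell(n))
```

Two sequences of 100 residues need about ten thousand cells, whereas ten such sequences would need on the order of 10^20 cells, which is far beyond available memory.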
Progressive Alignment
The most widely used progressive alignment program is ClustalW (PMID: 7984417 and 9396791). A version with a graphical user interface is called ClustalX.
PROs of ClustalW:
- Flexible Choice of Substitution Matrices: the matrix for each step is chosen based on evolutionary distances inferred from the guide tree
- Adjustable Gap Penalties: e.g., fewer gaps allowed in conserved domains and more allowed in loop regions of the protein
- Down-weighting the impact of redundant and very closely related sequences
CONs of ClustalW:
- If the data set consists of many partial sequences, the overall alignment is usually very poor (because ClustalW performs global alignment)
- The final alignment depends on the quality of the initial pairwise alignments. Early "errors" are locked in and cannot be corrected at later steps, so the resulting alignment can be far from optimal.
T-Coffee [Tree-based Consistency Objective Function For alignmEnt Evaluation] (PMID: 10964570) was designed to improve initial pairwise sequence alignments (and hence the guide tree) by performing both global and local alignments and choosing the optimal initial alignments from multiple alternatives. While it sometimes outperforms Clustal in finding optimal alignments, it is much slower.
[Source: Fig. 1 in PMID: 10964570]
[Source: Fig. 2 in PMID: 10964570].
For a simplified version of this figure, see Fig. 6.17 in the supp. textbook.
When the number of sequences is large, computing all pairwise alignments can take a very long time. MUSCLE (PMID: 15034147) offers a faster alternative: instead of aligning each pair, it measures pairwise sequence similarity as the fraction of shared k-mers (stretches of k residues).
[Source: Fig. 2 in PMID: 15034147].
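The idea of alignment-free k-mer similarity can be sketched in a few lines of Python. This is a simplified illustration, not MUSCLE's actual implementation (which, among other things, can count k-mers over a reduced amino acid alphabet); the function names, the choice of k = 3 and the example peptides are assumptions.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping stretch of k residues in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def shared_kmer_fraction(a, b, k=3):
    """Shared k-mers divided by the number of k-mers in the shorter sequence."""
    shortest = min(len(a), len(b))
    if shortest < k:
        raise ValueError("both sequences must contain at least k residues")
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    # A k-mer occurring several times is shared as often as it occurs in both.
    shared = sum(min(n, counts_b[kmer]) for kmer, n in counts_a.items())
    return shared / (shortest - k + 1)

# Two made-up peptides that differ at one position.
print(shared_kmer_fraction("MKVLAAGI", "MKVLSAGI"))  # 3 of 6 k-mers shared: 0.5
```

Because no alignment is computed, the cost per pair is roughly linear in sequence length, which is what makes this measure attractive for large data sets.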
ProbCons (PMID: 15687296) uses Hidden Markov Models (we will learn about this technique next week) for pairwise alignments.
Alignment Programs Benchmarking
BAliBASE "provides high quality, manually refined, reference alignments based on 3D structural superpositions. Version 3.0 of BAliBASE includes new, more challenging test cases, representing the real problems encountered when aligning large sets of complex sequences." [PMID: 16044462]
Other benchmark databases exist, such as OXBENCH, PREFAB and SABMARK.
Note that benchmark data sets themselves can contain errors, as has been recently pointed out.
Protein-Coding DNA Sequences Alignment
Improving Alignments with Additional Knowledge
[Source: Berg, Tymoczko and Stryer, Biochemistry, 5th ed., 2001: Fig. 29.17].
Cleaning and Editing Alignments
There are always poorly aligned or misaligned regions that need to be either removed or edited manually.
[Source: Fig. 1 in PMID: 10742046].
ClustalX
ClustalW/X (or just Clustal) reads and writes several formats used by different programs. It can also read pre-aligned sequences. One of the most commonly used input formats is FASTA. By default, the aligned sequences are saved in Clustal's own format (with the .aln extension); see the sample output file.
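For reference, a FASTA file consists of records that each start with a ">" header line followed by one or more lines of sequence. The minimal Python sketch below parses such text; the function name and the example records are made up for illustration, and for real work an established library parser is preferable.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            if name in records:
                raise ValueError(f"duplicate header: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name].append(line)  # sequences may span several lines
    return {header: "".join(chunks) for header, chunks in records.items()}

# Two made-up records; the first sequence is wrapped over two lines.
example = """>seq1 hypothetical protein
MKVLAAGI
VGSTT
>seq2
MKVLSAGI
"""
for header, sequence in parse_fasta(example).items():
    print(header, len(sequence))
```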