CLASS 7. Multiple Sequence Alignment.

Multiple Sequence Alignment (MSA) is an arrangement of multiple (>2) related sequences in an optimal way. Optimality is assessed using some scoring function (details below).

HOMEWORK READING:

Ch.5 (textbook);

Sections 4.5, 6.4, 6.5 (supp. textbook) [OPTIONAL]

FOR FRIDAY LAB:
Familiarize yourself with DotLet.

Why do Multiple Sequence Alignments?

  • Phylogenetic Reconstruction (inferring evolutionary relationships among sequences)
  • Identification of Conserved Domains/Motifs
  • Design of PCR primers
  • Prediction of secondary and tertiary structures
Multiple alignments are usually more accurate than pairwise sequence alignments!

Scoring and Types of Algorithms

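The scoring function of an MSA is the sum-of-pairs score: for each alignment column, the substitution scores of all pairs of residues in that column are summed, and these column sums are then added up over the whole alignment. The objective of most algorithms is to maximize this alignment score.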

Fig. Concept of the sum of pairs. Scores are calculated using BLOSUM62 matrix.
[Figure is based on Figure 5.1 in your textbook].
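
As a minimal sketch of the sum-of-pairs concept, the Python snippet below scores a small alignment column by column. To stay self-contained it uses a toy scoring scheme (match = +1, mismatch = -1, residue against gap = -2, gap against gap = 0) instead of BLOSUM62; these values and the function names are assumptions made for illustration and are not taken from the textbook.

```python
from itertools import combinations

# Toy scores (assumptions for illustration; a real MSA would use e.g. BLOSUM62).
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(a, b):
    """Score one pair of aligned characters."""
    if a == "-" and b == "-":
        return 0  # two gaps facing each other are ignored
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def column_score(column):
    """Sum the scores of all pairs of characters in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sum_of_pairs(alignment):
    """Add up the column scores over the whole alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return sum(column_score(col) for col in zip(*alignment))


alignment = ["ACG-", "ACGT", "A-GT"]
print(sum_of_pairs(alignment))
```

Swapping pair_score for a lookup in a real substitution matrix would give the kind of score shown in the figure.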

Approaches to find MSA:

  • Exhaustive Searches (e.g., dynamic programming; very impractical due to amount of computation and memory required)
  • Semi-exhaustive Searches (e.g., Divide-and-Conquer Alignment)
  • Heuristic Searches
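
To get a feeling for why exhaustive dynamic programming is impractical, the sketch below counts the work involved. It assumes the standard full dynamic programming formulation: N sequences of length L need an N-dimensional matrix with (L+1)^N cells, and each cell looks back at up to 2^N - 1 neighbouring cells. The function name and the example length of 300 are illustrative assumptions.

```python
def dp_cost(num_seqs, length):
    """Return (cells, cell_updates) for full N-dimensional dynamic programming."""
    cells = (length + 1) ** num_seqs   # one cell per combination of sequence prefixes
    predecessors = 2 ** num_seqs - 1   # each cell looks back at up to 2^N - 1 neighbours
    return cells, cells * predecessors


for n in (2, 3, 5, 10):
    cells, updates = dp_cost(n, 300)
    print(f"{n:>2} sequences of length 300: {cells:.2e} cells, {updates:.2e} updates")
```

The number of cells grows exponentially with the number of sequences, which is why heuristic searches are used in practice.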

Progressive Alignment

Fig. An overview of the progressive alignment algorithm, which is a heuristic search algorithm.
[Source]


The tree reconstructed from pairwise alignment scores is called a guide tree. The consensus sequence contains the most abundant nucleotide or amino acid residue at each alignment site.
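
The definition of a consensus sequence can be sketched in a few lines of Python. This is a minimal illustration, assuming that gaps are ignored when counting residues and that ties go to the residue seen first in the column; both choices, and the function name, are assumptions rather than part of any particular program.

```python
from collections import Counter


def consensus(alignment, gap="-"):
    """Return the most abundant residue in each alignment column."""
    result = []
    for column in zip(*alignment):
        counts = Counter(ch for ch in column if ch != gap)
        if not counts:
            result.append(gap)  # column holds only gaps
        else:
            # Ties go to the residue encountered first in the column.
            result.append(counts.most_common(1)[0][0])
    return "".join(result)


alignment = ["ACGT-A", "ACGTTA", "AGGT-C", "ACCTTA"]
print(consensus(alignment))
```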

The most widely used progressive alignment program is ClustalW (PMID: 7984417 and 9396791). A version with a graphical user interface is called ClustalX.

PROs of ClustalW:

  • Flexible Choice of Substitution Matrices: the matrix for each step is chosen based on evolutionary distances inferred from the guide tree
  • Adjustable Gap Penalties: e.g., fewer gaps allowed in conserved domains and more allowed in loop regions of the protein
  • Down-weighting the impact of redundant and very closely related sequences

CONs of ClustalW:

  • If the data set consists of many partial sequences, the overall alignment is usually very poor (because ClustalW performs a global alignment)
  • The final alignment depends on the quality of the initial pairwise alignments. "Errors" made early are locked in and cannot be corrected at later steps. The resulting alignment can therefore be far from optimal.

T-Coffee [Tree-based Consistency Objective Function For alignmEnt Evaluation] (PMID: 10964570) was designed to improve initial pairwise sequence alignments (and hence the guide tree) by performing both global and local alignments and choosing the optimal initial alignments from multiple alternatives. While it sometimes outperforms Clustal in finding optimal alignments, it is much slower.

Fig. T-Coffee strategy
[Source: Fig. 1 in PMID: 10964570]


Fig. The Library Extension Step of T-Coffee strategy.
[Source: Fig. 2 in PMID: 10964570].
For a simplified version of this figure, see Fig. 6.17 in the supp. textbook.

When the number of sequences is large, it may take a very long time to calculate all pairwise alignments. MUSCLE (PMID: 15034147) offers a faster alternative: instead of aligning every pair, it measures pairwise sequence similarity as the fraction of shared k-mers (stretches of k residues).

Fig. MUSCLE algorithm overview.
[Source: Fig. 2 in PMID: 15034147].
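
The k-mer idea can be illustrated with a short Python sketch. This is a simplified stand-in, not MUSCLE's actual distance measure: it assumes the similarity is the number of shared k-mers (counted as a multiset) divided by the number of k-mers in the shorter sequence, with k = 3 as an arbitrary default. The function names and the example sequences are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping stretch of k residues in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(seq_a, seq_b, k=3):
    """Fraction of k-mers shared by two sequences (0.0 to 1.0)."""
    counts_a, counts_b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    shared = sum((counts_a & counts_b).values())  # multiset intersection
    shortest = min(len(seq_a), len(seq_b)) - k + 1
    return shared / shortest if shortest > 0 else 0.0


print(kmer_similarity("MKVLAAGIV", "MKVLSAGIV"))
```

No alignment is computed here, which is what makes this kind of measure fast when there are many sequences.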

ProbCons (PMID: 15687296) uses Hidden Markov Models (we will learn about this technique next week) for pairwise alignments.

Alignment Programs Benchmarking

BAliBASE "provides high quality, manually refined, reference alignments based on 3D structural superpositions. Version 3.0 of BAliBASE includes new, more challenging test cases, representing the real problems encountered when aligning large sets of complex sequences." [PMID: 16044462]

Other benchmark databases exist, such as OXBENCH, PREFAB and SABMARK.

Note that benchmark data sets themselves can contain errors, as has been recently pointed out.

Protein-Coding DNA Sequences Alignment

Since amino acids are coded by nucleotide triplets, direct alignment of protein-coding DNA sequences can introduce gaps that disrupt the reading frame and look like frameshift mutations. Solution: translate each DNA sequence into protein, align the protein sequences, and then convert the protein alignment back to nucleotide sequences.
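
The last step of this procedure (converting the protein alignment back to nucleotides) can be sketched as follows. The sketch assumes that each DNA sequence starts in frame, contains no stop codon, and has exactly three bases per residue of its protein; the translation and protein alignment steps are assumed to have been done already. The function name and the example sequences are hypothetical.

```python
def protein_to_codon_alignment(aligned_protein, dna, gap="-"):
    """Thread the codons of dna onto a gapped protein sequence."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    n_residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(dna) % 3 != 0 or len(codons) != n_residues:
        raise ValueError("DNA length must be 3 x the number of residues")
    codon_iter = iter(codons)
    # Each residue becomes its codon; each protein gap becomes a three-base gap.
    return "".join(gap * 3 if ch == gap else next(codon_iter)
                   for ch in aligned_protein)


# Protein alignment of two hypothetical sequences and their coding DNA.
print(protein_to_codon_alignment("MK-L", "ATGAAACTG"))
print(protein_to_codon_alignment("MKAL", "ATGAAGGCTCTT"))
```

Because gaps are only ever inserted in multiples of three bases, the reading frame is preserved in the nucleotide alignment.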

Improving Alignments with Additional Knowledge

Alignment can be improved through incorporation of expert knowledge, such as secondary and/or tertiary structure information. For example, the Ribosomal Database Project (RDP) uses secondary structure information of ribosomal RNA.
Fig. Secondary and tertiary structures of 16S rRNA from E. coli
[Source: Berg, Tymoczko and Stryer, Biochemistry, 5th ed., 2001: Fig. 29.17].

Cleaning and Editing Alignments

Alignments almost always contain poorly aligned or misaligned regions that need to be either removed or edited manually.
Fig. Example of alignment cleaning with GBLOCKS program
[Source: Fig. 1 in PMID: 10742046].
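
As a rough illustration of automated cleaning, the sketch below removes alignment columns that consist mostly of gaps. This is a deliberately simple filter and not the GBLOCKS algorithm, which applies its own conservation and block-length criteria; the 50% threshold, the function name, and the example alignment are assumptions.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5, gap="-"):
    """Remove columns whose fraction of gaps exceeds max_gap_fraction."""
    n_seqs = len(alignment)
    kept = [col for col in zip(*alignment)
            if col.count(gap) / n_seqs <= max_gap_fraction]
    # Transpose the kept columns back into rows (one string per sequence).
    return ["".join(row) for row in zip(*kept)] if kept else [""] * n_seqs


alignment = ["AC-GT", "A--GT", "ACTG-"]
print(drop_gappy_columns(alignment))
```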

ClustalX

Fig. Screenshot of ClustalX program.


ClustalW/X (or just Clustal) reads and writes several formats used by different programs. It can also read pre-aligned sequences. One of the most commonly used input formats is FASTA. By default, the aligned sequences are saved in Clustal's own format (with the .aln extension) - see sample output file.
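
For readers unfamiliar with the FASTA format: each record starts with a header line beginning with ">", followed by one or more lines of sequence. The minimal Python reader below is a sketch for illustration only (the function name and the two example records are hypothetical); it assumes that header lines are unique and does not validate the sequence characters.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


example = """>seq1 hypothetical protein
MKVLAAGIV
GSGLA
>seq2 hypothetical protein
MKVLSAGIV
"""
print(parse_fasta(example))
```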