topic#4 multiple sequence alignment

There are many different algorithms to calculate sequence alignments. For two sequences it is (comparatively) easy to calculate an "optimal alignment". The so-called Needleman-Wunsch algorithm is widely used to find the best possible global alignment between two sequences. The mathematics of the Needleman-Wunsch algorithm is explained on the following website:

http://www.ibc.wustl.edu/CMB/bio5495/dynamic/dynamic.html

For more mathematical background information check M. Zuker’s other lecture at http://www.ibc.wustl.edu/~zuker/Bio-5495/.

Take your time to go through Zuker’s example.

Many program packages allow you to calculate pairwise alignments. A web resource to do this (and many other things) is the BCM launcher at Baylor College of Medicine.

While it is possible to find a best global alignment, note that there is usually more than one best alignment, and that the program will often return an alignment even if you input two random sequences.
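To make the idea of an optimal global alignment concrete, here is a minimal Python sketch of the Needleman-Wunsch dynamic programming recursion. It only computes the optimal score, not the alignment itself. The function name and the scoring values (match = 1, mismatch = -1, linear gap penalty = -1) are assumptions chosen for illustration; real programs use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b.

    Illustrative scoring only: identical letters score `match`,
    different letters score `mismatch`, every gap position costs `gap`.
    """
    # prev[j] = best score for aligning the already processed prefix of a
    # with b[:j]; initially the prefix of a is empty, so b[:j] faces j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Aligning a[:i] with an empty prefix of b costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            diagonal = prev[j - 1] + (match if same else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] is aligned to a gap
            gap_in_a = curr[j - 1] + gap  # b[j-1] is aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("ACGT", "AGT"))    # one gap, three matches
    print(needleman_wunsch_score("AAAA", "TTTT"))   # unrelated sequences
```

Note that the second call still returns a score (and a real program would still print an alignment) although the two sequences have nothing in common. Whenever two of the three options in the max() tie, there is more than one best alignment.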

An alternative to getting a global alignment is to find local alignments, i.e., stretches of sequences that match nicely between two sequences.
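A local alignment can be found with a small change to the global recursion: scores are never allowed to drop below zero, and the best cell anywhere in the table is reported. The sketch below follows the Smith-Waterman scheme, which is one standard way to do this and is not named in these notes. The function name and the scoring values are assumptions for illustration only.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Smith-Waterman style sketch with illustrative scoring values.
    """
    best = 0
    prev = [0] * (len(b) + 1)  # a local alignment may start anywhere
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            diagonal = prev[j - 1] + (match if same else mismatch)
            # The 0 lets a poor stretch be dropped and the alignment restart.
            cell = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    # Only the shared stretch ACGT contributes to the score.
    print(local_alignment_score("XXXACGTYYY", "QQACGTWW"))
```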

Multiple Sequence Alignments

One of the easiest, most sophisticated, and most versatile alignment programs is clustalw (Higgins DG, Sharp PM (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73:237-244; Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22, 4673-4680).

The latest versions of clustalw and clustalx are available at the EBI server.

Clustalw1.74 is installed on the sp UNIX machine; it is also available for PCs and Macs. To start the program on the UNIX machine, telnet to sp.uconn.edu, log in, go to the directory where your sequences are, and type clustalw17 (see below).

Clustalx uses the same algorithms as clustalw. However, it has a much nicer interface, it displays information on the level of similarity, and it uses color in the alignment. Especially for amino acids the use of color greatly enhances the ability to recognize conservative replacements.

Clustal reads and writes most of the usual formats. The easiest format is the FASTA format:

> name of sequence or any other information goes in the first line. This line starts with ">". The line can be longer than 80 characters. The first line ends with the first paragraph sign (¶), i.e., the first line break.
The second line contains the sequence itself; numbers and other nonstandard characters are ignored. Be careful if you download sequences: transfer programs often introduce paragraph signs every 100 characters. See the examples below.
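The FASTA rules described above can be sketched in a few lines of Python. This is a simplified reader written for illustration, not the parser clustal uses; the function name and the choice to keep only letters plus the symbols "-" and "*" are assumptions. Sequence lines broken by extra line breaks are simply joined.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new ">" line closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Keep letters and gap/stop symbols; drop digits, blanks, etc.
            chunks.append("".join(c for c in line if c.isalpha() or c in "-*"))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">seq1 human globin\nMVLS 10\nPADK\n>seq2\nMGLS\n"
    for name, seq in parse_fasta(example):
        print(name, seq)
```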

Clustal also reads aligned sequences. If you input aligned sequences, you can go directly to the tree section. Be careful: if you make a mistake and the sequences are not aligned, your tree will look pretty strange!

Using globin.pep as an example, familiarize yourself with the different input/output options. *.MSF is read by GCG. PIR output is like the FASTA format, but the sequences all have equal lengths, and you have to merge the first and second lines to make it FASTA compatible (you could also use the GDE format and, using a text editor, replace % with >). Phylip (*.phy) is the "new" interleaved Phylip format.

You can use the input/output options to reformat an alignment. Some programs require specialized formats. You can use a text editor like MSWord 6.0 to get your alignment into the desired format, but things are certainly much easier if you start out with a format close to the desired one.

Hint: Often you do not want to use the complete alignment, but only those portions that are sufficiently conserved. You can take a file in clustal format (*.aln) and delete columns with a good text editor (in MSWord, holding down the alt key before clicking the mouse switches to column mode). Although the different lines in the resulting alignment have different lengths, clustalw reads in the aligned sequences correctly, and you can output the shortened sequences in any format you want.
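If you prefer a script to a text editor, the same kind of column deletion can be sketched in Python. The version below uses the fraction of gaps in a column as a crude stand-in for "not sufficiently conserved"; that criterion, the function name, and the threshold parameter are assumptions for illustration, and you may well want to choose columns by eye instead.

```python
def drop_gappy_columns(seqs, max_gap_fraction=0.0):
    """Return the aligned sequences with gap-rich columns removed.

    seqs: list of equal-length aligned strings, with '-' as the gap symbol.
    A column is kept if its fraction of gaps is <= max_gap_fraction.
    """
    if not seqs:
        return []
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences are not aligned (lengths differ)")
    n = len(seqs)
    keep = [
        col
        for col in range(len(seqs[0]))
        if sum(s[col] == "-" for s in seqs) / n <= max_gap_fraction
    ]
    return ["".join(s[col] for col in keep) for s in seqs]


if __name__ == "__main__":
    alignment = ["AC-GT", "ACTGT", "A--GT"]
    print(drop_gappy_columns(alignment))        # gap-free columns only
    print(drop_gappy_columns(alignment, 0.34))  # tolerate one gap in three
```

Unlike the text editor approach, all output lines have the same length here, because whole columns are removed from every sequence at once.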

Clustalw is menu driven. Each menu comes with a help item; use it if you want more information.

CLUSTALW ON UNIX:

Log into your account using any of the many available telnet programs. It is a good idea to make subdirectories ("mkdir new_directory_Name") for the different projects. Some other useful UNIX commands are here.

To run clustal, type clustalw. It is easiest to have the program and the input file in the same directory. To get your sequences onto the UNIX machine, and the output files to your local machine, use FTP (PC, UNIX, VAX) or Fetch (Mac).

To start the program type clustalw17 (clustalw starts an older version).

Other Alignment programs:

Other excellent alignment programs are treealign from Jotun Hein and the progressive alignment program from Feng and Doolittle. However, the interfaces are not that easy to use, and the results are usually not very different from those of clustal.

A program available via the www is SAM (sequence alignment and modeling system) by Richard Hughey, Anders Krogh, Christian Barrett, & Leslie Grate at UCSC. The input consists of a multiple sequence file (aligned or not aligned) in FASTA format. The program uses secondary structure predictions, neighboring sites, etc. to place gaps. The program can be accessed using netscape at " http://www.cse.ucsc.edu/research/compbio/sam.html ".

If your sequences are not very similar, and if you are not able to generate a trustworthy multiple sequence alignment, you can calculate distance trees based on pairwise alignments only. The best program for this purpose is statalign from Jeff Thorne (Thorne JL, Kishino H (1992) Freeing phylogenies from artifacts of alignment. Mol Biol Evol 9:1148-1162). It runs under standard UNIX. It is only worth your effort if you are getting gray hairs because of a data set you cannot reliably align. We will not use this program in this course.

PILEUP in the GCG package generates alignments that are very similar to those of clustalw. The TREE programs in GCG are currently considered by many to be worthless (UPGMA). For four years now, the plan has been to incorporate PAUP into GCG in the "near future".

Trees with CLUSTALW

Besides aligning sequences, clustalw also includes programs to calculate distance trees. The trees generated by clustalw certainly have their limitations; however, if one is aware of these limitations, the program is extremely useful for initial exploration.

Trees are calculated from a corrected or uncorrected distance matrix using the neighbor joining method. This method does not use an optimization procedure but a much faster algorithmic approach (pages 486ff in Hillis, Moritz and Mable: Molecular Systematics).
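To illustrate what "algorithmic approach" means here, the sketch below shows only the first step of neighbor joining: from a distance matrix, compute for every pair (i, j) the criterion Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), and pick the pair with the smallest Q as the next two neighbors to join. A full implementation would then replace the pair by a new node, recompute distances, and repeat. The function name and the small example matrix are illustrative assumptions and have nothing to do with clustalw's own code.

```python
def nj_pick_pair(d):
    """Return ((i, j), q) for the pair that neighbor joining would join first.

    d: symmetric distance matrix as a list of lists, zeros on the diagonal.
    """
    n = len(d)
    totals = [sum(row) for row in d]  # summed distance of each taxon to all others
    best_pair, best_q = None, None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - totals[i] - totals[j]
            if best_q is None or q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair, best_q


if __name__ == "__main__":
    # Hypothetical distances among five taxa a..e.
    distances = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    print(nj_pick_pair(distances))
```

In this example the first two taxa are joined first (Q = -50), although the pair with the smallest raw distance is d and e (distance 3, Q = -48). This is the main difference from simple clustering methods such as UPGMA.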

Several parameters that you can choose in clustalw influence tree building.

The choice of substitution matrix, and of other alignment parameters

You can ignore all positions that contain a gap in any of the sequences

You can correct for multiple substitutions
(In a perfect world you want to use the actual number of substitutions that occurred in the evolution, and not the number of sites that differ between two sequences).
Later in the course we will discuss other methods for distance correction; all things considered, however, clustalw does quite well.
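The difference between observed differences and corrected distances can be sketched as follows. The example computes the fraction of differing sites between two aligned sequences (skipping gapped positions) and then applies the Jukes-Cantor correction for nucleotides, d = -3/4 ln(1 - 4p/3). Jukes-Cantor is used here only because it is the simplest textbook correction; it is an assumption for illustration and not necessarily the correction clustalw applies. Note also that gaps are skipped per pair here, whereas the clustalw option described above removes a gapped position for all sequences.

```python
import math


def p_distance(s1, s2, ignore_gaps=True):
    """Fraction of compared sites at which two aligned sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences are not aligned (lengths differ)")
    compared = differences = 0
    for x, y in zip(s1, s2):
        if ignore_gaps and (x == "-" or y == "-"):
            continue  # skip positions where either sequence has a gap
        compared += 1
        differences += x != y
    if compared == 0:
        raise ValueError("no comparable positions")
    return differences / compared


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed nucleotide p-distance."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    p = p_distance("ACGT-ACGTA", "ACGTTACGAA")
    print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as the observed one, and the gap between the two grows as the sequences become more divergent.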

Clustalw also provides possibilities for bootstrapping (Hillis, Moritz, Mable, Molecular Systematics, p. 507 ff. and p. 522 ff.).

Bootstrapping is one of the most popular ways to assess the reliability of branches. Briefly, positions of the aligned sequences are randomly sampled from the multiple sequence alignment with replacement. The sampled positions are assembled into new data sets, the so-called bootstrapped samples. Each position has about a 63% chance of making it into a particular bootstrapped sample (for an alignment with n positions the chance is 1 - (1 - 1/n)^n, which approaches 1 - 1/e for large n). If a grouping has a lot of support, it will be supported by at least some positions in each of the bootstrapped samples.
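The resampling step can be sketched in Python as follows. Columns of the alignment are drawn with replacement until the sample is as long as the original alignment, and a second small function computes the chance that a given position ends up in a sample. The function names are assumptions for illustration; this is not how clustalw is implemented internally.

```python
import random


def bootstrap_alignment(seqs, rng=None):
    """Return (resampled_seqs, columns) for one bootstrapped sample.

    seqs: list of equal-length aligned strings.
    columns: the sampled column indices, drawn with replacement.
    """
    rng = rng or random.Random()
    length = len(seqs[0])
    columns = [rng.randrange(length) for _ in range(length)]
    # The same columns are used for every sequence, so the sample stays aligned.
    resampled = ["".join(s[c] for c in columns) for s in seqs]
    return resampled, columns


def inclusion_probability(n):
    """Chance that a given position appears at least once in a sample of n draws."""
    return 1 - (1 - 1 / n) ** n


if __name__ == "__main__":
    sample, cols = bootstrap_alignment(["ACGTAC", "AGGTAC", "ACGTTC"])
    print(cols)
    print(sample)
    print(round(inclusion_probability(1000), 3))
```

In a complete bootstrap analysis a tree is built from each of many such samples, and the support for a branch is the percentage of these trees that contain it.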

Problems with clustalw:

The input order in analyzing the bootstrapped samples is not randomized; therefore, if you have no phylogenetic information at all, you get 100% bootstrap values.
LOOK AT YOUR ALIGNMENTS CAREFULLY! -
or "From junk comes junk!"

If you have very different branch lengths, even if you have a "molecular clock" running, long branches have the tendency to attract each other.

TREEVIEW

To view trees generated by clustalw, you can use treeview from Rod Page.
Note: since clustalw1.8 the default tree format has changed. If you want to read the bootstrap values into treeview, you need to change the tree options in clustalx to put the bootstrap values on the nodes.

The program should already be installed on your Macs (PC versions are available as well). The program is extremely user friendly. Trees generated can be copied and pasted into Microsoft Word, and the labels can be rearranged after double clicking on the imported image. The tree edit function lets you alter the tree, which is handy for generating the user-defined trees that many other programs use.