There
are many different algorithms to calculate sequence alignments. For two sequences it is (comparatively) easy
to calculate an "optimal alignment". The so-called Needleman-Wunsch algorithm
is widely used to find the best possible global alignment between two
sequences. The mathematics of the
Needleman-Wunsch algorithm is explained on the following website:
http://www.ibc.wustl.edu/CMB/bio5495/dynamic/dynamic.html
For more mathematical background information, check M. Zuker’s other lecture
at http://www.ibc.wustl.edu/~zuker/Bio-5495/.
Take your time to go through Zuker’s example.
Many program packages can calculate pairwise alignments. A web resource to do
this (and many other things) is the BCM launcher at Baylor College of Medicine.
While it is possible to find a best global alignment, take note that there is
usually more than one best alignment, and that the program will return an
alignment even if you input two random sequences.
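The dynamic-programming idea behind the global alignment can be sketched in a few lines. This is a minimal illustration, not the implementation used by any particular program: the function name and the scoring values (match, mismatch, and a linear gap penalty) are assumptions chosen for the example. It returns only the optimal score, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b.

    Scoring values are illustrative assumptions (linear gap penalty).
    """
    # prev[j] is the best score for aligning the processed prefix of a
    # with b[:j]; row 0 aligns the empty prefix, i.e. j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("GATTACA", "GATTACA"))
    print(needleman_wunsch_score("AC", "AGC"))
```

Note that the function returns a score for any pair of strings, which is why a program will report an alignment even for unrelated sequences.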
An alternative to
getting a global alignment is to find local alignments, i.e., stretches of
sequences that match nicely between two sequences.
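A local alignment score can be computed with a small change to the global recurrence: scores are never allowed to drop below zero, and the best cell anywhere in the table is reported. The sketch below uses the Smith-Waterman recurrence, which the text above does not name; the function name and scoring values are assumptions for illustration only.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions (linear gap penalty).
    """
    best = 0
    prev = [0] * (len(b) + 1)  # first row: a local alignment may start anywhere
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Resetting to 0 lets a new local alignment start at this cell.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    # Only the shared stretch "HELLO" contributes to the score.
    print(smith_waterman_score("XXXHELLOYYY", "ZZHELLOZZ"))
```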
Multiple Sequence Alignments
One of the easiest, most sophisticated, and most versatile alignment programs
is clustalw (Higgins DG, Sharp PM (1988) CLUSTAL: a package for performing
multiple sequence alignment on a microcomputer. Gene 73:237-244; Thompson JD,
Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of
progressive multiple sequence alignment through sequence weighting,
position-specific gap penalties and weight matrix choice. Nucleic Acids
Research 22:4673-4680).
The latest versions of clustalw and clustalx are available at the EBI server.
Clustalw1.74 is installed on the sp UNIX machine; it is also available for PCs
and Macs. To start the program on the UNIX machine, telnet to sp.uconn.edu,
log in, go to the directory where your sequences are, and type clustalw17
(see below).
Clustalx uses the same algorithms as clustalw. However, it
has a much nicer interface, it displays information on the level of similarity,
and it uses color in the alignment. Especially for amino acids the use of color
greatly enhances the ability to recognize conservative replacements.
Clustal reads and writes most of the usual formats. The easiest format is the
FASTA format:
> name of sequence or any other information goes in the first line. This line
starts with ">". The line can be longer than 80 characters. The first line
ends with the first paragraph sign (¶, i.e., the first line break).
The second line contains the sequence itself; numbers and other non-standard
characters are ignored. Be careful if you download sequences: transfer
programs often introduce paragraph signs (line breaks) every 100 characters.
See the examples below.
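The FASTA rules described above can be illustrated with a small reader. This is a minimal sketch, not how clustalw parses its input: the function name is hypothetical, and the choice to keep only letters plus "-" and "*" is an assumption made for the example. Sequence lines broken up by a transfer program are simply joined again.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Keep letters and gap/stop symbols; drop digits, spaces, etc.
            chunks.append("".join(c for c in line if c.isalpha() or c in "-*"))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">seq1 human globin\n1 MVLSP ADKTN\n11 VKAAW\n>seq2\nMV-LS\n"
    for name, seq in parse_fasta(example):
        print(name, seq)
```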
Clustal also reads aligned sequences. If you input aligned sequences, you can
go directly to the tree section. Be careful: if you make a mistake and the
sequences are not aligned, your tree will look pretty strange!
Using globin.pep as an example, familiarize yourself with the different
output/input options. *.MSF is read by GCG. PIR output is like FASTA format,
but the sequences all have equal lengths; you have to merge the first and
second lines in order to be FASTA compatible (you could also use the GDE
format and, using a text editor, replace % with >). Phylip (*.phy) is the
"new" interleaved phylip format.
You
can use the input/output options to reformat an alignment. Some programs
require specialized formats. You can use a text editor like MSWord 6.0 to get
your alignment into the desired format, but things are certainly much easier,
if you start out with a format close to the desired one.
Hint:
Often you do not want to use the complete alignment, but only those portions
which are sufficiently conserved. You can take a file in clustal format (*.aln)
and delete columns with a good text editor (in MSWord pressing down the alt key
before clicking the mouse switches to column mode). Although the different
lines in the resulting alignment have different lengths, clustalw reads in the
aligned sequences correctly, and you can output the shortened sequences in any
format you want.
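Deleting columns can also be done programmatically instead of in a text editor. The sketch below assumes the alignment has already been read into a dictionary mapping sequence names to aligned strings; the function name and the data layout are hypothetical and not part of clustalw.

```python
def drop_columns(alignment, start, end):
    """Remove alignment columns start..end-1 (0-based) from every sequence.

    alignment: dict mapping sequence name -> aligned sequence string.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("sequences are not aligned (unequal lengths)")
    return {name: seq[:start] + seq[end:] for name, seq in alignment.items()}


if __name__ == "__main__":
    aln = {"a": "MKV--LSA", "b": "MKVTTLSA"}
    # Drop the two poorly conserved columns in the middle.
    print(drop_columns(aln, 3, 5))
```

Unlike the text-editor approach, all sequences keep the same length after trimming.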
Clustalw is menu driven; each menu comes with a help item. Use it if you want
more information.
CLUSTALW ON UNIX:
Log into your account using any of the many available telnet programs. It is
a good idea to make subdirectories ("mkdir new_directory_Name") for the
different projects. Some other useful UNIX commands are here.
To run clustal, type clustalw. It is easiest to have the program and the input
file in the same directory. To get your sequences onto the UNIX machine, and
the output files to your local machine, use FTP (PC, UNIX, VAX) or fetch (Mac).
To start the program, type clustalw17 (clustalw starts an older version).
Other Alignment programs:
Other excellent alignment programs are treealign from Jotun Hein, and the
progressive alignment program from Feng and Doolittle. However, the interfaces
are not that easy, and the results are usually not very different from clustal.
A
program available via the www is SAM (sequence alignment and
modeling system) by Richard Hughey, Anders Krogh, Christian Barrett, &
Leslie Grate at UCSC. The input consists of a multiple sequence file (aligned
or not aligned) in FASTA format. The program uses secondary structure predictions,
neighboring sites, etc. to place gaps. The program can be accessed using
netscape at http://www.cse.ucsc.edu/research/compbio/sam.html.
If
your sequences are not very similar, and if you are not able to generate a
trustworthy multiple sequence alignment, you can calculate distance trees based
on pairwise alignments only. The best program for this purpose is statalign
from Jeff Thorne (Thorne JL, Kishino H (1992) Freeing phylogenies from
artifacts of alignment. Mol Biol Evol 9:1148-1162). It runs under standard
UNIX. It's only worth your effort if you are getting gray hairs because of a
data set you cannot reliably align. We will not use this
program in this course.
PILEUP in the GCG
package generates alignments that are very similar to clustalw. The TREE
programs in GCG are currently considered by many to be worthless (UPGMA). For
four years now, there have been plans to incorporate PAUP into GCG in the
"near future".
Trees with CLUSTALW
Besides aligning sequences, Clustalw also includes programs to calculate
distance trees. The trees generated by clustalw certainly have their
limitations; however, if one is aware of these limitations, the program is
extremely useful for initial exploration.
Trees are calculated from a
corrected or uncorrected distance matrix using the neighbor joining
method. This method does not use an optimization procedure but a much faster
algorithmic approach (pages 486ff in Hillis, Moritz and Mable: Molecular Systematics).
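The "algorithmic" character of neighbor joining can be seen in its first step: from the distance matrix, a criterion Q(i,j) = (n-2)*d(i,j) - r(i) - r(j) is computed for every pair, where r(i) is the sum of all distances from taxon i, and the pair with the smallest Q is joined. The sketch below shows only this pair-selection step, not a full tree builder and not clustalw's own code; the function name and the example distances are made up for illustration.

```python
def nj_pair_to_join(dist):
    """Return ((i, j), q) for the pair with the smallest neighbor-joining Q.

    dist: symmetric distance matrix as a list of lists, zeros on the diagonal.
    """
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    best_pair, best_q = None, None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if best_q is None or q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair, best_q


if __name__ == "__main__":
    # Hypothetical distances between five taxa a..e (indices 0..4).
    d = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    print(nj_pair_to_join(d))
```

In this example the pair with the smallest raw distance (taxa 3 and 4) is not the pair that gets joined first, which illustrates how neighbor joining differs from simply clustering the closest sequences.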
Several parameters that you
can choose in clustalw influence tree building.
The choice of substitution matrix, and of other alignment parameters.
You can ignore all positions that contain a gap in any of the sequences.
You can correct for multiple substitutions. (In a perfect world you want to
use the actual number of substitutions that occurred during evolution, and not
the number of sites that differ between two sequences.)
Later in the course we will discuss other methods for distance correction;
however, everything considered, clustalw is doing quite well.
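Two of the options above, ignoring gapped positions and correcting for multiple substitutions, can be illustrated as follows. The sketch uses the Jukes-Cantor correction for nucleotide sequences, d = -(3/4) ln(1 - 4p/3), as one standard example of a correction; it is an assumption for illustration and not necessarily the formula clustalw applies. All function names are hypothetical.

```python
import math


def remove_gap_columns(seqs, gap="-"):
    """Drop every alignment column in which any sequence has a gap."""
    keep = [col for col in zip(*seqs) if gap not in col]
    if not keep:
        return ["" for _ in seqs]
    return ["".join(chars) for chars in zip(*keep)]


def p_distance(seq1, seq2):
    """Fraction of aligned sites that differ (uncorrected distance)."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(x != y for x, y in zip(seq1, seq2)) / len(seq1)


def jukes_cantor(p):
    """Estimate substitutions per site from the observed fraction p (DNA)."""
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    aligned = ["ACGT-A", "ACTTAA", "A-GTAA"]
    s1, s2, s3 = remove_gap_columns(aligned)
    p = p_distance(s1, s2)
    print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as the observed fraction of differing sites, because some sites have changed more than once.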
Clustalw also provides
possibilities for bootstrapping (Hillis, Moritz, Mable, Molecular Systematics,
p. 507 ff, and p. 522. ff).
Bootstrapping is one of the most popular ways to assess the reliability of
branches. Briefly, positions of the aligned sequences are randomly sampled
from the multiple sequence alignment with replacement. The sampled positions
are assembled into new data sets, the so-called bootstrapped samples. Each
position has about a 63% chance (1 - 1/e for long alignments) of making it
into a particular bootstrapped sample. If a grouping has a lot of support, it
will be supported by at least some positions in each of the bootstrapped
samples.
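The resampling step and the 63% figure can be checked with a short sketch. The function names are hypothetical and this is not clustalw's implementation; it only shows how columns are drawn with replacement and why a given column appears in a sample with probability 1 - (1 - 1/n)^n, which approaches 1 - 1/e for long alignments.

```python
import random


def bootstrap_sample(seqs, rng):
    """Resample alignment columns with replacement.

    Returns the resampled sequences and the list of chosen column indices.
    """
    n = len(seqs[0])
    cols = [rng.randrange(n) for _ in range(n)]
    return ["".join(seq[c] for c in cols) for seq in seqs], cols


def inclusion_probability(n):
    """Chance that a given column appears at least once in a sample of size n."""
    return 1 - (1 - 1 / n) ** n


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the example is repeatable
    sample, cols = bootstrap_sample(["ACGTAC", "ACGAAC"], rng)
    print(cols, sample)
    print(round(inclusion_probability(1000), 3))
```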
Problems with clustalw:
The input order in analyzing
the bootstrapped samples is not randomized; therefore, if you have no
phylogenetic information at all, you get 100% bootstrap values.
LOOK AT YOUR ALIGNMENTS CAREFULLY! -
or "From junk comes junk!"
If you have very different
branch lengths, even if you have a "molecular clock" running, long
branches have the tendency to attract each other.
TREEVIEW
To view trees generated by
clustalw, you can use treeview from Rod Page.
Note: since clustalw1.8 the default tree format has changed. If you want to
read the bootstrap values into treeview, you need to change the tree options
in clustalx to put the bootstrap values on the nodes.
The program should already be
installed on your Macs (PC versions are available as well). The program
is extremely user friendly. Trees generated can be copied and pasted into
Microsoft Word, and the labels can be rearranged after double clicking on the
imported image. The tree edit function
lets you alter the tree, which is handy to generate user defined trees that are
used by many other programs.