Class 5. Alignment

Questions and comments on Monday's class?

How do you know that you have a candidate for an intein? (You could look at the host protein in BLink!)

If you want our comments on your assignments, we need to be able to decipher your answers!

Sequence alignment

Pairwise alignment

A) DOT PLOT

The easiest way to align two sequences is to use a dot plot. In its most straightforward implementation the two sequences to be aligned are written along the coordinate axes,... See lecture 6 of the bioinformatics course at UChicago (same diagram as in Li's textbook).

In more realistic implementations a window of 5 to 20 nucleotides or amino acids is slid along one of the axes (i.e., sequences) and compared to every possible window on the other axis (sequence). The dot intensity is adjusted to reflect the percent identity (or similarity) in the two windows.
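The windowed comparison can be sketched in a few lines of Python. This is only an illustration, not the code of any particular dot plot program; the function names and the text rendering are made up for this example, and a small window is used so the toy sequences fit on screen.

```python
def window_identity(a, b):
    """Fraction of identical positions between two equal-length windows."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dot_plot(seq1, seq2, window=3):
    """Return m[i][j] = identity of the windows starting at seq1[i] and seq2[j]."""
    rows = len(seq1) - window + 1
    cols = len(seq2) - window + 1
    return [
        [window_identity(seq1[i:i + window], seq2[j:j + window])
         for j in range(cols)]
        for i in range(rows)
    ]


def render(matrix, threshold=1.0):
    """Draw the matrix as text: '*' where the window score reaches the threshold."""
    return "\n".join(
        "".join("*" if score >= threshold else "." for score in row)
        for row in matrix
    )


if __name__ == "__main__":
    # A sequence compared with itself gives the main diagonal.
    print(render(dot_plot("GATTACA", "GATTACA", window=3)))
```

Lowering the threshold below 1.0 corresponds to plotting weaker window matches, which is what the adjustable dot intensity does in a graphical program.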

The Swiss Institute of Bioinformatics provides a Java applet that performs interactive dot plots. The applet to calculate a dot plot from two amino acid sequences is here (many other tools are on the ExPASy tools page).

Demonstration: comparing Yeast ATPase catalytic subunit with Yeast HO endonuclease. The sequences are here, here and here (for the intein only).

Go to the applet and input the two sequences (careful: the program expects the bare sequence, NOT the FASTA format). Once you leave the web page, the back arrow will return you to the applet, but you will have to input the sequences again.
Select the two sequences and a window size between 9 and 15, then click calculate. The program will compare every window of the chosen size in one sequence to all the possible windows in the other sequence. On the right you see a histogram that describes how often window pairs with the indicated score occurred. The sliding bars below and above the histogram let you select the colors with which matches are depicted. (I prefer black for matches and white for mismatches over the default.)
If you click on the dot plot panel, the alignment window at the bottom aligns the two sequences accordingly. You can fine-tune the alignment using the arrows.
In a nucleotide sequence comparison, each window is also compared to the reverse complement of the window in the other sequence. The intensity plotted is the maximum of the two scores. This way one can also detect inversions with this program.
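The strand-aware scoring for nucleotide windows can be sketched as follows. This is a guess at the general idea, not the applet's actual code; the function names are invented for the example, and plain identity is assumed as the window score.

```python
# Translation table for complementing DNA (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def identity(a, b):
    """Fraction of identical positions between two equal-length windows."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def strand_aware_score(window1, window2):
    """Score the direct and the reverse-complement comparison; keep the maximum."""
    return max(
        identity(window1, window2),
        identity(window1, reverse_complement(window2)),
    )
```

A window pair that scores 0 in the direct comparison but 1 against the reverse complement is what shows up as an inversion in the plot.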

Only very few people use dot plots to calculate alignments or to find conserved motifs. The main use of dot plots is to detect domains, duplications, insertions, deletions, and, if you work at the DNA level, inversions (check the examples on the help pages of the dotlet application).

Optimal global and local alignments.

There are many different algorithms to calculate pairwise sequence alignments. For two sequences it is "easy" to calculate an optimal global alignment. (According to the motto "It can be easily shown": if we are lucky, we'll have a student walk us step by step through an example on Monday; see here, here or here, links refer to a bioinformatics course given at the Univ. of Munich.) The so-called Needleman-Wunsch algorithm is widely used; it optimizes a positive alignment score (see eq. 5.1 here). A related (and under some conditions equivalent) approach is to minimize the distance between two sequences (again see Li and Necrutenko's class notes here, eq. 5.3 ff.).

Li and Necrutenko's class has some good graphics on how the algorithm works (Fig. 3.5 and 3.6). The corresponding graphic for local alignment is in their class 7, Fig. 4.2.
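For those who prefer code to graphics, here is a minimal sketch of the dynamic programming behind global (Needleman-Wunsch) and local (Smith-Waterman) alignment. The scoring values (match +1, mismatch -1, linear gap penalty -1) are assumptions chosen for the example, not the values from the class notes; real programs use substitution matrices and separate gap opening/extension penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment score: same recursion, but cells never drop below 0
    and the best cell anywhere in the matrix is reported."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
    print(smith_waterman_score("TTTACGTTT", "GGGACGGGG"))
```

The only difference between the two is the floor at zero and where the score is read off: the global version must pay for everything from corner to corner, the local version may start and stop wherever the score is highest.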

Multiple Sequence Alignments

CLUSTAL, CLUSTALW and CLUSTALX

Usually global alignments are the easiest to calculate (for local alignments see below).

One of the easiest to use, most sophisticated, and most versatile alignment programs is clustalw.

(Higgins DG, Sharp PM (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73:237-244;
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22, 4673-4680).

Clustalw runs on all common platforms (unix, mac, pc), is part of most multiprogram packages, and is also available via different web interfaces (for example here).

It uses a very simple menu driven command-line interface, and you also can run it from the command line only (i.e. it is easy to incorporate into scripts.)
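As an illustration of scripting, the sketch below builds a clustalw command line from Python. The options -INFILE, -ALIGN and -OUTFILE are written from memory of the clustalw command-line help; treat them, the executable name (on some installations it is clustalw2), and the file names as assumptions, and check the help of your own installation.

```python
import shutil
import subprocess


def clustalw_align_command(infile, outfile, executable="clustalw"):
    """Build the argument list for a full multiple alignment.
    Option spellings are assumed from the clustalw command-line help."""
    return [executable, f"-INFILE={infile}", "-ALIGN", f"-OUTFILE={outfile}"]


def run_alignment(infile, outfile, executable="clustalw"):
    """Run clustalw if it is installed; raise a clear error otherwise."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} not found on PATH")
    subprocess.run(clustalw_align_command(infile, outfile, executable), check=True)


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(clustalw_align_command("testseq1.txt", "testseq1.aln")))
```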

Clustalx uses the same algorithms as clustalw. However, it has a much nicer interface, it displays information on the level of similarity, and it uses color in the alignment. Especially for amino acids the use of color greatly enhances the ability to recognize conservative replacements. Clustalx is available for different platforms at the EBI's ftp site (follow your platform; clustalx is stored in the clustalw folders).

Clustal reads and writes most formats used by different programs. The easiest format is the FASTA format:

> name of sequence or any other information goes in the first line. This line starts with ">". The line can be longer than 80 characters. The first line ends with the first paragraph sign (line break).
The second line contains the sequence itself; numbers and other nonstandard characters are ignored. Be careful if you download sequences. Often the transfer programs introduce paragraph signs every 100 characters, and the end of a command line frequently ends up as the beginning of the sequence.
All sequences to be read should be in a single file.
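A small reader for this format might look like the sketch below. It is not clustal's own parser; the function name is made up, and the decision to keep only letters plus "-" and "*" is an assumption about what counts as a standard character.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            # Keep letters and gap/stop symbols; drop digits and blanks.
            chunks.append("".join(ch for ch in line if ch.isalpha() or ch in "-*"))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">seq1 toy example\nACGT 10\nAC GT\n>seq2\nMKV\n"
    for name, seq in parse_fasta(example):
        print(name, seq)
```

Note that a sequence broken over several lines is simply joined, so extra line breaks introduced during transfer do no harm; a stray command-line fragment at the start of the sequence, however, would be read as residues.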

(sample clustalw input file)

(sample clustalw output file)

Clustal also reads aligned sequences. If you input aligned sequences you can go directly to the tree section.
!! Be careful: if you make a mistake and the sequences are not aligned, your tree will look strange!! ALWAYS CHECK YOUR ALIGNMENT!

Clustal is also useful for reformatting and editing alignments. It is very forgiving in reading formats; e.g., you can open the clustal format (*.aln) in a text editor, delete columns, reload the file into clustalw, and output it in the other formats available.

For an alignment, you can select different substitution matrices and gap penalties (end gaps can be treated differently!).

Clustal is better than its reputation. It does a great job in handling gaps, especially terminal gaps, and it makes good use of different substitution matrices.

To align sequences clustal performs the following steps:

1) Pairwise distance calculation
2) Clustering analysis of the sequences
3) Iterated alignment of two most similar sequences or groups of sequences.

Today's handout (2nd part) has a nice illustration on how the different steps work. It is important to realize that the second step is the most important. The relationships found here will create a serious bias in the final alignment. The better your guide tree, the better your final alignment. You can load a guide tree into clustal. This tree will then be used instead of the neighbor joining tree calculated by clustalw as a default. (The guide tree needs to be in normal parenthesis notation WITH branch lengths).
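To make steps 1 and 2 concrete, here is a toy sketch that computes pairwise distances, clusters the sequences, and writes the resulting guide tree in parenthesis notation with branch lengths. Two simplifications are assumptions of this example: the distance is just the fraction of differing positions between equal-length toy sequences (clustal derives distances from pairwise alignments), and the clustering is UPGMA, swapped in because it is short, whereas clustalw calculates a neighbor joining tree by default.

```python
from itertools import combinations


def p_distance(a, b):
    """Toy distance: fraction of differing positions (equal lengths assumed)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma_guide_tree(seqs):
    """seqs: dict name -> sequence. Returns (tree in parenthesis notation, join order)."""
    # Each cluster id maps to (tree text, number of leaves, height of its root).
    clusters = {name: (name, 1, 0.0) for name in seqs}
    dist = {frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}
    join_order = []
    while len(clusters) > 1:
        # Step 2: pick the closest pair (ties broken alphabetically).
        pair = min(dist, key=lambda p: (dist[p], sorted(p)))
        a, b = sorted(pair)
        (ta, na, ha), (tb, nb, hb) = clusters.pop(a), clusters.pop(b)
        height = dist[pair] / 2
        text = f"({ta}:{height - ha:.4f},{tb}:{height - hb:.4f})"
        new_id = f"{a}+{b}"
        join_order.append((a, b))
        # Distance from the merged cluster to the others: size-weighted average.
        for c in clusters:
            dist[frozenset((new_id, c))] = (
                na * dist[frozenset((a, c))] + nb * dist[frozenset((b, c))]
            ) / (na + nb)
        dist = {p: d for p, d in dist.items() if a not in p and b not in p}
        clusters[new_id] = (text, na + nb, height)
    tree_text = next(iter(clusters.values()))[0]
    return tree_text + ";", join_order


if __name__ == "__main__":
    toy = {"A": "AAAAAAAA", "B": "AAAAAAAT", "C": "AAAATTTT", "D": "AAATTTTT"}
    tree, order = upgma_guide_tree(toy)
    print(tree)
    print(order)
```

The join order is the order in which step 3 would align sequences and groups of sequences, which is why an error in the guide tree propagates into the final alignment.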

Other programs often used for multiple sequence alignment:

A program available via the www is SAM (sequence alignment and modeling system) by Richard Hughey, Anders Krogh, Christian Barrett, & Leslie Grate at UCSC. The input consists of a multiple sequence file (aligned or not aligned) in FASTA format. The program uses secondary structure predictions, neighboring sites, etc. to place gaps. The program can be accessed using netscape at http://www.cse.ucsc.edu/research/compbio/sam.html.

If your sequences are not very similar, and if you are not able to generate a trustworthy multiple sequence alignment, you can calculate distance trees based on pairwise alignments only. The best program for this purpose is statalign from Jeff Thorne (Thorne JL, Kishino H (1992) Freeing phylogenies from artifacts of alignment. Mol Biol Evol 9:1148-1162). It runs under standard UNIX. It's only worth your effort if you are getting gray hairs because of a data set you cannot reliably align. We will not use this program in this course.

PILEUP in the GCG package generates alignments that are very similar to those of clustalw. The TREE programs in GCG are currently considered by many to be worthless (UPGMA). For over three years, the plan has been to incorporate PAUP into GCG in the "near future".

Local Alignments (e.g. MACAW) search the sequences for motifs that occur in different sequences. In macaw the user has the option to select different tools to search for matching motifs, the user can select subsets of sequences or positions to search for similar motifs, and the user has to accept/reject each of the motifs found.

Advantage: you get to know your sequences, and the motifs that are conserved are reliably aligned.
Problems: time consuming; Macaw has the tendency to crash.
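To give a feeling for what a motif search does, the sketch below lists every word of length k that occurs in all sequences. This is a deliberately crude stand-in, not one of the search tools in macaw: it only finds exact matches, and the function name and the choice of k are made up for the example.

```python
def shared_words(seqs, k=4):
    """Return {word: [first position in each sequence]} for all words of
    length k that occur (exactly) in every sequence."""
    common = None
    for seq in seqs:
        words = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        common = words if common is None else common & words
    return {w: [s.find(w) for s in seqs] for w in sorted(common or [])}


if __name__ == "__main__":
    seqs = ["AAGATTACAGG", "CCGATTACATT", "GATTACA"]
    print(shared_words(seqs, k=7))
```

The positions returned for each word are the anchors around which the sequences could be lined up; deciding which of these candidate blocks to accept is the part macaw leaves to the user.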

Alignments by Eye:
On PCs there is a DOS program called the Eyeball Sequence Editor (ESEE) that allows you to simultaneously align nucleotide and encoded protein sequences. It needs some getting used to.

One useful sequence editor is seaview, the companion sequence editor to phylo_win. It runs on PC and most unix flavors, and is the easiest way to get alignments into phylo_win.

Another often used editor is JALVIEW. It is a JAVA applet, and you can download and install it on your own machine, or you can use the web version here. The latter includes alignment via clustalw, in case your sequences are not aligned already. You can select different coloring schemes, change the alignment by dragging things along with the mouse, and do rudimentary tree calculations and principal component analyses. In the tree you can draw lines to define groups in the alignment.

We'll use testseq1.txt as an example in class.

Trees with CLUSTALW

Besides aligning sequences, Clustal also includes programs to calculate distance trees. The trees generated by clustal certainly have their limitations; however, if one is aware of these limitations, the program is extremely useful for exploration.

You can choose several parameters in clustalw that influence tree building.

The substitution matrix and other alignment parameters
You can ignore all positions that contain a gap in any of the sequences
You can assume a molecular clock (UPGMA - this should be a no-no)
You can correct for multiple substitutions
(In a perfect world you want to use the actual number of substitutions that occurred in evolution, and not the number of sites that differ between two sequences).
Later in the course we will discuss other methods for distance correction, however, everything considered clustalw is doing quite well.
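Two of the options above, ignoring gapped positions and correcting for multiple substitutions, can be sketched as follows. The formulas are the textbook Jukes-Cantor correction for nucleotides and Kimura's empirical correction for proteins; they are given as examples of what such a correction looks like, and which correction your version of clustalw actually applies is something to check in its documentation.

```python
import math


def drop_gapped_columns(alignment):
    """Remove every column in which at least one sequence has a gap ('-')."""
    keep = [col for col in zip(*alignment) if "-" not in col]
    if not keep:
        return ["" for _ in alignment]
    return ["".join(chars) for chars in zip(*keep)]


def p_distance(a, b):
    """Observed fraction of sites that differ between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor(p):
    """Nucleotides: d = -(3/4) ln(1 - 4p/3). Undefined for p >= 0.75."""
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_protein(p):
    """Proteins (Kimura's empirical formula): d = -ln(1 - p - p^2/5)."""
    return -math.log(1 - p - 0.2 * p * p)


if __name__ == "__main__":
    aln = drop_gapped_columns(["AC-GT", "ACTGA", "A-TGT"])
    p = p_distance(aln[0], aln[1])
    print(aln, p, jukes_cantor(p))
```

In both formulas the corrected distance is larger than the observed fraction of differences, and the gap grows as sequences become more divergent: the more sites differ, the more likely it is that some sites were hit more than once.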

Clustalw also provides options for bootstrapping (a way to assess the reliability of individual branches).
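The resampling step of the bootstrap is easy to sketch: columns of the alignment are drawn at random, with replacement, until the pseudo-alignment is as long as the original. The function names and the fixed seed are assumptions of this example; the tree building for each replicate is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (same length as the original)."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Return a list of resampled alignments; the seed makes runs repeatable."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


if __name__ == "__main__":
    for replicate in bootstrap_replicates(["ACGT", "ACGA", "TCGA"], n_replicates=3):
        print(replicate)
```

A tree is then calculated from each replicate, and the bootstrap value of a branch is the percentage of replicate trees in which that branch appears.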

Problems with clustalw:

The input order in analyzing the bootstrapped samples is not randomized; therefore, if you have no phylogenetic information at all, you get 100% bootstrap values.
LOOK AT YOUR ALIGNMENTS CAREFULLY! -
or "From junk comes junk!" (this is a frequent problem if you choose to ignore positions with gaps).

If you have very different branch lengths, even if a next to perfect "molecular clock" is running, long branches have the tendency to attract each other.

If you use treeview to inspect your trees, you need to go to the tree menu in clustalx and select placing the bootstrap values at the nodes (not the branches).

TREEVIEW

To view trees generated by clustalw, you can use treeview from Rod Page, or the program njplot that is part of the clustal package.

Treeview should be already installed on your PCs. The program is extremely user friendly. Trees generated can be copied and pasted into Microsoft Word, and the labels can be rearranged after double clicking on the imported image.

(Try clustalx and treeview on testseq1, copy tree to MSWord.)