MCB 3421 Assignment 9

Your name:
Your email address:

A note about the "all on one line" problem you may come across when working with files

If you work under OS X, your text editor (MS Word) will be able to read all possible files and translate end-of-line characters correctly. A frequent problem is that the classic Mac end-of-line character (carriage return, '\r') differs from the one used by UNIX (line feed, '\n'), including Darwin, the system that OS X runs on. A UNIX application like clustalx expects the UNIX end-of-line character; if the file uses Mac end-of-line characters, everything appears on a single line. If this happens, you can use the following terminal command (the terminal is a window that opens a command line interface to your operating system; see here for a summary of frequently used UNIX commands):

tr '\r' '\n' < macfile.txt > unixfile.txt

(if this does not work on your system try /usr/bin/tr '\015' '\n' < macfile.txt > unixfile.txt)
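If you are not sure which end-of-line convention a file uses, a short script can check before you convert. The following is a minimal Python sketch, not part of the assignment; the function names are made up for illustration, and the file names are the ones used in the tr command above.

```python
def detect_newline_style(data: bytes) -> str:
    """Return 'windows', 'mac', 'unix', 'mixed', or 'none' for raw file bytes."""
    crlf = data.count(b"\r\n")
    cr = data.count(b"\r") - crlf   # lone carriage returns (classic Mac)
    lf = data.count(b"\n") - crlf   # lone line feeds (UNIX)
    found = [name for name, n in (("windows", crlf), ("mac", cr), ("unix", lf)) if n]
    if not found:
        return "none"
    return found[0] if len(found) == 1 else "mixed"


def to_unix_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) and classic Mac (CR) line endings to UNIX (LF)."""
    # Replace CRLF first so a Windows line ending does not become two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src: str, dst: str) -> str:
    """Write a UNIX-style copy of src to dst and return the style that was detected."""
    with open(src, "rb") as handle:
        data = handle.read()
    with open(dst, "wb") as handle:
        handle.write(to_unix_newlines(data))
    return detect_newline_style(data)
```

For example, convert_file("macfile.txt", "unixfile.txt") writes the converted copy and reports which convention the input used. Unlike the plain tr command, this sketch also handles Windows files without producing empty lines.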

Note: If you frequently use UNIX applications on a Mac under OS X, get the ConvertNewlines application. After unzipping the application, place it on the desktop. If you drag the file you want to change onto its icon, it recognizes the current end-of-line format and asks which convention (Unix, Windows, or Mac) you want to convert the file to.

Also, most versions of Microsoft Word let you select the end-of-line coding when you save a document as a text file; the default usually is the setting for Windows.

Part 1: ClustalX (30 minutes)

You can download ClustalX and ClustalW for Mac and PC from here. If you are on a Mac, then the file you need to download IS THIS ONE. Download it to your Desktop, and double-click it to extract the ClustalX app from the zip file.

We will be using this sequence file. The sequences in this file are annotated as follows:

Start ClustalX by double clicking on the clustalx icon. Using the FILE menu, load the sequence file.

Once you have loaded the sequences, calculate an alignment now (Alignment menu -> "Do complete Alignment"). It will take 2-3 minutes.

Maximize the window and scroll to position 300. Most of the ATPase subunits have a "canonical" motif (G.....GKT) characteristic of many nucleotide binding sites. Which sequence replaces this motif in the B subunits of the vacuolar type ATPases?
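If you want to double-check by script which sequences carry the motif, a regular expression search is enough. The sketch below is only an illustration: it assumes the commonly cited Walker A (P-loop) consensus GxxxxGK[ST], which is an assumption on top of the notation used above, so adjust the pattern if your alignment suggests a different spacing. The function name is made up.

```python
import re

# Assumed pattern: Walker A (P-loop) consensus, G, four arbitrary residues, G, K, then S or T.
WALKER_A = re.compile(r"G.{4}GK[ST]")


def find_motif(aligned_seq, pattern=WALKER_A):
    """Return (start, match) pairs; start is 1-based in the ungapped sequence."""
    # Remove gap characters ('-' in most formats, '.' in MSF) before searching.
    ungapped = aligned_seq.replace("-", "").replace(".", "").upper()
    return [(m.start() + 1, m.group()) for m in pattern.finditer(ungapped)]
```

A sequence for which find_motif returns an empty list does not contain the pattern; looking at the same alignment columns in ClustalX then shows what it has instead.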

Save the alignment in different formats (PHYLIP, FASTA, NEXUS, and MSF) (File menu -> Save sequences as). Using a text editor (MS Word) and a non-proportional font (e.g. Courier), inspect the different formats.

*.MSF is read by GCG; PHYLIP (*.phy) is the "new" PHYLIP interleaved format; NEXUS is used by PAUP and MrBayes.
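To see how two of these formats relate, the following Python sketch reads an aligned FASTA file and writes it in interleaved PHYLIP layout. It is a simplified illustration, not a replacement for the ClustalX output routines: it assumes names are cut to 10 characters and that all sequences have the same aligned length, and the function names are made up.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping name -> sequence (gaps kept)."""
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # first word of the header line
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def to_phylip_interleaved(records, width=60):
    """Format aligned sequences as interleaved PHYLIP text."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    nchar = lengths.pop()
    lines = [f" {len(records)} {nchar}"]          # header: number of sequences, columns
    for start in range(0, nchar, width):
        for name, seq in records.items():
            # Names appear only in the first block, padded to 10 characters.
            label = name[:10].ljust(10) if start == 0 else ""
            lines.append(label + seq[start:start + width])
        lines.append("")                          # blank line between blocks
    return "\n".join(lines)
```

Comparing the output of this sketch with the *.phy file saved by ClustalX is a quick way to recognize which parts of the file are header, names, and sequence blocks.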

Under alignment in the main menu bar, select alignment options. Select "reset all gaps before alignment". The next time you go to this submenu, there should be a check mark in front of this option. Under alignment options, change the parameters for both the pairwise and the multiple alignment. Make some drastic changes, and align the sequences again.
Which settings did you change, and what was the impact on the alignment? Were the intein / extein junctions retained?

If you have time: You can use the input/output options to reformat an alignment. Some programs require specialized formats. You can use a text editor like MS Word to get your alignment into the desired format, but things are much easier if you start out with a format close to the desired one. Hint: Often you do not want to use the complete alignment, but only those portions which are sufficiently conserved. You can take a file in clustal format (*.aln) and delete columns with a text editor (in MS Word, holding down the alt key (on a PC) before clicking the mouse switches to column mode; to see the alignment, you may need to decrease the font size and select a non-proportional font in your editor!). Although the different lines in the resulting alignment have different lengths, clustal reads in the aligned sequences correctly, and you can output the shortened sequences in any format you want. One problem is that on a Mac you have to convert back and forth between Mac and UNIX format. The seaview program makes it easier to select positions for further analysis.
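Selecting columns can also be done by script instead of in a text editor. The sketch below is a minimal illustration in Python; it assumes the alignment is already available as a dictionary mapping sequence names to aligned sequences of equal length, and the function name and column ranges are made up.

```python
def keep_columns(records, ranges):
    """Keep only the given 1-based, inclusive column ranges of every aligned sequence."""
    trimmed = {}
    for name, seq in records.items():
        # Slice each range out of the sequence and join the pieces in order.
        trimmed[name] = "".join(seq[start - 1:end] for start, end in ranges)
    return trimmed
```

For example, keep_columns(alignment, [(290, 320), (400, 450)]) would keep two conserved stretches and drop everything else; all sequences stay aligned because the same columns are kept in each of them.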

Part 2: Jalview

Jalview is a Java application to inspect and edit multiple sequence alignments. It also allows inspection of protein space for the aligned sequences. This works surprisingly well. The Jalview Homepage contains a lot of additional information.

Go to the Jalview applet page and either start the Jalview desktop (link on top) or a Jalview applet (links in the middle of the page).
If this does not work, download this file, unarchive it, and double-click on
Preferably, you want to load the Jalview desktop, but the Jalview lite version works just as well, except that the sequence input is more difficult (delete the sequences from the example, then add the sequences from the file).

Close the windows that may have opened as a demonstration, except for the multiple sequence alignment window.

Load the sequences from the ATPase subunit alignment into Jalview.

Explore the different coloring options (COLOUR menu). Which one seems to work best (most meaningful; scroll through the alignment to a more conserved region)?

Note: You can change/edit the alignment by clicking on an amino acid residue and dragging it to the right or left using the arrow keys. Try it, but leave the sequences in an aligned state before you move on.

Select all sequences. CALCULATE an AVERAGE DISTANCE TREE USING % identity.
Click somewhere in the resulting tree to color groups of related sequences in the same color. You can right-click (or command-click) on a node to change the color for a group of sequences.
Choose a color scheme that colors all subunits of the same type in the same color.
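The tree is built from pairwise % identity values. The following Python sketch shows one way such values can be computed from an alignment; it counts only columns where neither sequence has a gap, which is an assumption and may differ from the exact definition Jalview uses. The function names are made up.

```python
from itertools import combinations

GAP_CHARS = "-."


def percent_identity(seq_a, seq_b):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = identical = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue                      # skip columns with a gap in either sequence
        compared += 1
        identical += (a == b)
    return 100.0 * identical / compared if compared else 0.0


def identity_table(records):
    """Map each pair of sequence names to its percent identity."""
    return {
        (n1, n2): percent_identity(records[n1], records[n2])
        for n1, n2 in combinations(records, 2)
    }
```

Pairs with high values in this table are the ones an average distance tree joins first, so subunits of the same type should show up as groups with high % identity among themselves.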

In a principal component analysis, the new dimensions are calculated as linear combinations of the original dimensions, so that the greatest variance of any projection of the data set comes to lie on the first axis, the second greatest on the second axis, and so on for the following dimensions. Can you find a higher dimension that breaks up the vacuolar ATPase A subunits? (Their names start with A.)
Which of the A subunit sequences cluster together, if you use this dimension (1, 2 and 4 worked for me)?
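To make the idea of a principal component concrete, here is a small numeric sketch in plain Python. It is a generic illustration on a table of numbers, not Jalview's implementation (Jalview derives its analysis from the sequence alignment); the function name is made up, and only the first component is computed, using power iteration on the covariance matrix.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, projections) for the first principal component.

    rows: list of equal-length numeric tuples, at least two rows.
    """
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]
    # Covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    # Power iteration: repeated multiplication converges to the direction of
    # greatest variance, unless the start vector happens to be orthogonal to it.
    vec = [float(j + 1) for j in range(dims)]
    for _ in range(iterations):
        new = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0:
            break
        vec = [x / norm for x in new]
    # Coordinates of every data point along the new axis.
    scores = [sum(r[j] * vec[j] for j in range(dims)) for r in centered]
    return vec, scores
```

For points lying on a diagonal line, the returned direction is that diagonal and the scores are the positions of the points along it. Higher components are found the same way after removing the variance already explained, and they can separate groups that overlap on the first axes.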

Part 3: More alignments -- and combining alignments

We will use the seaview alignment graphical user interface (GUI) for this exercise. Seaview, like clustalx, uses an X Window interface. The latest version of the seaview program is available for different platforms from here. The Mac version is also here. Unzip the archive, and start the program by double-clicking on the icon.

The following is a list of intein-containing yeast V-ATPase catalytic subunits -- CLICK HERE --. Download these sequences (in the "send to" pulldown menu, select file and then FASTA format), and then drag the file into the seaview alignment window. Align them into a multiple sequence alignment using clustalw2 (set the option in the align menu under options, then start the alignment by selecting align all).

Scanning through the alignment, can you predict which part of the sequences corresponds to the ATPase subunit, and which to the intein? (If you click on a residue, seaview shows the position in the alignment and the position in the individual sequence.) Redo the alignment in muscle. Do you see any differences between the muscle and the clustalw alignment? (Yes/No; if yes, from about where to where in which sequence?)

Open a second seaview window and align the intein-containing sequences (the sequence names start with gi....) with the ones from the S. cerevisiae ATPaseSU file. --HERE are all the ATPase SU + the intein-containing A SU-- --Here are only the yeast A-SU --.

Explore different options to align the sequences with and without inteins. Define two groups of sequences (those with and without inteins) and explore a profile alignment between the two (select each group first and align within the group, then between the two groups). Is the intein clearly recognized as an inserted region? In an alignment containing only the A-subunits with inteins, and one A-SU from yeast without intein as reference, use gblocks to define conserved sites: In the sites menu, select create set, select GBLOCKS, select "allow for gap positions within the final blocks". Which parts of the intein are flagged as reliably aligned (you could compare this to the conserved blocks listed in inbase here)?
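To get a feeling for what "conserved sites" means, the sketch below flags alignment columns by two simple criteria: few gaps and a high share of the most common residue. This is a deliberately simplified stand-in and not the Gblocks algorithm, which uses additional rules (for example on block length and flanking positions); the function name and thresholds are made up.

```python
from collections import Counter

GAP_CHARS = "-."


def conserved_columns(records, min_identity=0.5, max_gap_fraction=0.0):
    """Return 1-based column numbers that pass a simple conservation filter."""
    seqs = [seq.upper() for seq in records.values()]
    flagged = []
    for col, residues in enumerate(zip(*seqs), start=1):
        gaps = sum(r in GAP_CHARS for r in residues)
        if gaps / len(residues) > max_gap_fraction:
            continue                      # too many gaps in this column
        counts = Counter(r for r in residues if r not in GAP_CHARS)
        # Share of sequences that carry the most common residue.
        if counts and counts.most_common(1)[0][1] / len(residues) >= min_identity:
            flagged.append(col)
    return flagged
```

Raising max_gap_fraction above zero roughly corresponds to allowing gap positions, which makes the filter less conservative.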

Different alignment programs create different alignments: clustal and muscle are rather similar, while PRANK produces longer alignments of a different flavor. Some programs allow you to estimate the robustness of aligned regions. Guidance from Tal Pupko's group at TAU produces alignments with a good estimation of the reliability of individual alignment columns (the output from guidance for the ATPases alignment is here and here). For your future work, you might want to keep this service in mind. You can also use GBLOCKS in seaview (see above) to select sites which are reliably aligned, but this may be too conservative.

If you like a challenge: move the sequences onto the biotech cluster, and align the sequences in muscle (command: muscle -in sequences.fasta -out muscle_alignment.aln). Are there noticeable differences in the resulting alignment?
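Differences between two alignments of the same sequences are easier to find by script than by eye. The following Python sketch reports, for each sequence, the first column where its aligned row differs between two alignments. It assumes both alignments are available as dictionaries mapping sequence names to aligned sequences with '-' as the gap character; the function names are made up.

```python
def first_difference(row_a, row_b):
    """1-based column of the first mismatch between two aligned rows, or None."""
    for col, (a, b) in enumerate(zip(row_a, row_b), start=1):
        if a != b:
            return col
    if len(row_a) != len(row_b):
        return min(len(row_a), len(row_b)) + 1   # one row is longer than the other
    return None


def compare_alignments(aln_a, aln_b):
    """Map each shared sequence name to the first column where its rows differ."""
    report = {}
    for name in aln_a:
        if name not in aln_b:
            continue
        # The comparison only makes sense if the ungapped sequences are identical.
        if aln_a[name].replace("-", "") != aln_b[name].replace("-", ""):
            raise ValueError(f"{name}: underlying sequences differ")
        col = first_difference(aln_a[name], aln_b[name])
        if col is not None:
            report[name] = col
    return report
```

An empty result means the two programs placed all gaps identically in the shared sequences; otherwise the reported columns tell you where to start looking in the alignment viewer.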

