MCB 3421 Assignment 9

Your name:
Your email address:

A note about the "all on one line" problem you may come across when working with files

If you work under OS X, your text editor (MS Word) will be able to read files with any of the end-of-line conventions and translate the end-of-line characters correctly. A frequent problem is that the end-of-line character differs between classic Mac files (carriage return, '\r') and UNIX files (line feed, '\n'); UNIX here includes Darwin, the system that OS X runs on. A UNIX application such as clustalx expects the UNIX end-of-line character, so if the file uses Mac end-of-line characters, everything will appear on a single line. If this happens, you can use the following terminal command (the terminal is a window that opens a command line interface to your operating system; see here for a summary of frequently used UNIX commands):

tr '\r' '\n' < macfile.txt > unixfile.txt

(if this does not work on your system try /usr/bin/tr '\015' '\n' < macfile.txt > unixfile.txt)
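
If you prefer a script to the terminal command, the same conversion can be sketched in Python. This is only an illustration: the function names to_unix_newlines and convert_file are made up for this example, and the file names are the placeholders used above. Unlike the tr command, the sketch also handles Windows (CR+LF) files without producing empty lines.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Return data with Windows (CR+LF) and classic Mac (CR) line endings as LF."""
    # Replace CR+LF first, so a Windows line ending does not turn into two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src: str, dst: str) -> None:
    """Read src as raw bytes and write a copy with UNIX line endings to dst."""
    with open(src, "rb") as handle:
        data = handle.read()
    with open(dst, "wb") as handle:
        handle.write(to_unix_newlines(data))


# Example call (placeholder file names from the tr command above):
# convert_file("macfile.txt", "unixfile.txt")
```

The file is read and written in binary mode so that Python does not translate the line endings itself.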

Note: If you frequently use UNIX applications on a Mac under OS X, get the ConvertNewlines application. After unzipping the application, place it on the desktop. If you drag the file you want to change onto its icon, it recognizes the current end-of-line format and asks which convention (Unix, Windows, or Mac) you want to convert it to.

Also, most versions of Microsoft Word let you select the end-of-line coding when you save a document as a text file; the default is usually the Windows setting.

Part 1: ClustalX (30 minutes)

You can download ClustalX and ClustalW for Mac and PC from here. If you are on a Mac, the file you need to download is this one. Download it to your Desktop, and double-click it to extract the ClustalX app from the zip file.

We will be using this sequence file. The sequences in this file are annotated as follows:

Start ClustalX by double-clicking the clustalx icon. Using the File menu, load the sequence file.

Once you have loaded the sequences, calculate an alignment (Alignment menu -> "Do Complete Alignment Now"). It will take 2-3 minutes.

Maximize the window and scroll to position 300. Most of the ATPase subunits have a "canonical" motif (G.....GKT) characteristic of many nucleotide binding sites. Which sequence replaces this motif in the B subunits of the vacuolar-type ATPases?
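
If you want to check by script which sequences carry the motif, a regular expression is enough. The following is a minimal sketch: the pattern is copied from the text above (each dot matches any one residue), the function name find_motif is made up for this example, and the test sequence is a toy string, not one of the ATPase subunits.

```python
import re

# Pattern copied from the assignment text: G, any five residues, then GKT.
MOTIF = re.compile(r"G.....GKT")


def find_motif(sequence: str):
    """Return (start, match) pairs; start is 1-based in the ungapped sequence."""
    # Remove alignment gaps so the motif is found even if the aligner split it.
    ungapped = sequence.replace("-", "").upper()
    return [(m.start() + 1, m.group()) for m in MOTIF.finditer(ungapped)]


# Toy example (hypothetical sequence, for illustration only):
print(find_motif("MKG--AAAAAGKTVL"))
```

Sequences for which the function returns an empty list are the ones to inspect by eye in the alignment window.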

Save the alignment in different formats (PHYLIP, FASTA, NEXUS, and MSF) (File menu -> Save Sequences as). Using a text editor (MS Word) and a non-proportional font (e.g. Courier), inspect the different formats.

*.MSF is read by GCG; PHYLIP (*.phy) is the "new" PHYLIP interleaved format; NEXUS is used by PAUP and MrBayes.
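
When you inspect the saved files, note how each format announces itself in its first lines. The sketch below guesses the format from those lines. It is a heuristic based on common conventions (a ">" header for FASTA, "#NEXUS" for NEXUS, a "CLUSTAL" title line for *.aln files, two integers for PHYLIP, a header line containing "MSF:" for MSF), and the function name guess_alignment_format is made up for this example; files written by other programs may deviate.

```python
def guess_alignment_format(text: str) -> str:
    """Guess an alignment file format from its first non-empty lines."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return "unknown"
    first = lines[0]
    if first.startswith(">"):
        return "fasta"
    if first.upper().startswith("#NEXUS"):
        return "nexus"
    if first.upper().startswith("CLUSTAL"):
        return "clustal"
    # PHYLIP: the first line holds the number of sequences and of columns.
    fields = first.split()
    if len(fields) == 2 and all(field.isdigit() for field in fields):
        return "phylip"
    # MSF: the header contains a line with "MSF:" before the sequence blocks.
    if any("MSF:" in line for line in lines[:10]):
        return "msf"
    return "unknown"


print(guess_alignment_format(" 3 120\nseqA      MKV"))
```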

Under Alignment in the main menu bar, select the alignment options. Select "Reset All Gaps Before Alignment". The next time you go to this submenu, there should be a check mark in front of this option. Under the alignment options, change the parameters for both the pairwise and the multiple alignment. Make some drastic changes, and align the sequences again.
Which settings did you change, and what was the impact on the alignment? Were the intein/extein junctions retained?

If you have time: You can use the input/output options to reformat an alignment. Some programs require specialized formats. You can use a text editor like MS Word to get your alignment into the desired format, but things are much easier if you start out with a format close to the desired one. Hint: Often you do not want to use the complete alignment, but only those portions that are sufficiently conserved. You can take a file in clustal format (*.aln) and delete columns with a text editor (in MS Word, holding down the alt key (on a PC) before clicking the mouse switches to column mode; to see the alignment, you may need to decrease the font size and select a non-proportional font in your editor!). Although the lines in the resulting alignment have different lengths, clustal reads the aligned sequences correctly, and you can output the shortened sequences in any desired format. One problem is that on a Mac you have to convert back and forth between Mac and UNIX format. The seaview program makes it easier to select positions for further analysis.
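
Deleting columns can also be scripted instead of done by hand in a text editor. The following minimal sketch assumes the alignment was saved in FASTA format (not *.aln) and keeps only the column ranges you list. The function names read_fasta and keep_columns, the sequence names, and the ranges are made up for this example.

```python
def read_fasta(text: str) -> dict:
    """Parse FASTA text into {name: sequence}; sequences may span several lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def keep_columns(alignment: dict, ranges) -> dict:
    """Keep only the given alignment columns.

    ranges is a list of (start, end) pairs, 1-based and inclusive,
    i.e. the way alignment positions are shown in clustalx or seaview.
    """
    return {
        name: "".join(seq[start - 1:end] for start, end in ranges)
        for name, seq in alignment.items()
    }


# Toy alignment (hypothetical sequences, for illustration only):
toy = read_fasta(">a\nMKV-LT\nGG\n>b\nMRVALT\nGG\n")
print(keep_columns(toy, [(1, 3), (7, 8)]))
```

Because every sequence is cut at the same column positions, the shortened sequences stay aligned to each other.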

Part 2: More alignments -- and combining alignments

We will use the seaview graphical user interface (GUI) for alignments in this exercise. Seaview, like clustalx, uses an x-windows interface. The latest version of the seaview program is available for different platforms from here. The Mac version is also here. Unzip the archive and start the program by double-clicking the icon.

The following is a list of intein-containing yeast V-ATPase catalytic subunits -- CLICK HERE --. Download these sequences (in the "send to" pulldown menu, select file and then FASTA format), and then drag the file into the seaview alignment window. Align them into a multiple sequence alignment using clustalo (set the option in the align menu under options, then start the alignment by selecting align all).

Scanning through the alignment, can you predict which part of the sequences corresponds to the ATPase subunit, and which to the intein? (If you click on a residue, seaview tells you the position in the alignment and the position in the individual sequence.) Redo the alignment in muscle. Do you see any differences between the muscle and the clustalo alignment? (Yes/No; Yes/No, if yes, from about where to where in which sequence?)
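
The two positions that seaview reports for a residue (column in the alignment and position in the individual sequence) differ by the number of gaps to the left of the residue. The sketch below shows the conversion in both directions, which helps when you write down "from about where to where" for two different alignments. The function names and the toy sequence are made up for this example; "-" and "." are assumed to be the gap characters.

```python
def residue_index_at_column(aligned_seq: str, column: int, gap_chars: str = "-."):
    """Return the 1-based residue number at a 1-based alignment column.

    Returns None if the sequence has a gap in that column.
    """
    if not 1 <= column <= len(aligned_seq):
        raise ValueError("column is outside the alignment")
    if aligned_seq[column - 1] in gap_chars:
        return None
    # Count the residues (non-gap characters) up to and including the column.
    return sum(1 for ch in aligned_seq[:column] if ch not in gap_chars)


def column_of_residue(aligned_seq: str, residue_index: int, gap_chars: str = "-."):
    """Return the 1-based alignment column that holds the given residue."""
    if residue_index < 1:
        raise ValueError("residue_index must be 1 or larger")
    seen = 0
    for column, ch in enumerate(aligned_seq, start=1):
        if ch not in gap_chars:
            seen += 1
            if seen == residue_index:
                return column
    raise ValueError("sequence has fewer residues than residue_index")


# Toy example (hypothetical aligned sequence):
print(residue_index_at_column("MK--VL", 5), column_of_residue("MK--VL", 3))
```

Positions in the individual sequence do not change between alignment programs, so they are the safer way to describe where two alignments disagree.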

Open a second seaview window and align the intein-containing sequences (the sequence names start with gi....) with the ones from the S. cerevisiae ATPaseSU file. --HERE are all the ATPase SU + the intein-containing A SU-- --Here are only the yeast A-SU --.

Explore different options to align the sequences with and without inteins. Define two groups of sequences (those with and without inteins) and explore a profile alignment between the two (select each group first and align within the group, then between the two groups). Try both muscle and clustalo as the alignment algorithm for the profile alignment. Is the intein clearly recognized as an inserted region?

In an alignment containing only the A-subunits with inteins, and one A-SU from yeast without intein as reference (do it yourself, or here), use gblocks to define conserved sites: In the sites menu, select create set, select GBLOCKS, select "allow for gap positions within the final blocks". Which parts of the intein are flagged as reliably aligned (you could compare this to the conserved blocks listed in inbase here)? What do you expect to happen if you use GBLOCKS on the larger dataset? In particular, what would happen to the intein sequences? (Try it out if the answer is not obvious!)
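
To get a feeling for what "conserved sites" means, you can flag columns with a very simple rule: few gaps and one dominant residue. The sketch below does only that. It is not the Gblocks algorithm (Gblocks also considers flanking positions and block lengths), and the function name conserved_columns, the thresholds, and the toy sequences are assumptions made for this example.

```python
from collections import Counter


def conserved_columns(sequences, min_identity=0.8, max_gap_fraction=0.0):
    """Return 1-based columns that pass a simple conservation filter.

    A column is kept if its fraction of gaps is at most max_gap_fraction and
    its most common residue occurs in at least min_identity of all sequences.
    """
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    kept = []
    for i in range(length):
        column = [seq[i].upper() for seq in sequences]
        if column.count("-") / len(column) > max_gap_fraction:
            continue
        residues = [ch for ch in column if ch != "-"]
        if not residues:
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        if top_count / len(column) >= min_identity:
            kept.append(i + 1)
    return kept


# Toy alignment (hypothetical sequences, for illustration only):
print(conserved_columns(["MKVL-T", "MKIL-T", "MRVLAT"]))
```

With the default max_gap_fraction of 0.0, any column in which even one sequence has a gap is dropped, which is worth keeping in mind when you think about the intein question above.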

Different alignment programs create different alignments - clustal and muscle are rather similar, PRANK produces alignments of a different flavor. Try an alignment of the yeast V-ATPase subunits with intein. (webPrank does not like the multiple fasta file created above; this one works.) If this takes too long, the resulting file is here. (Load the aligned multiple sequence fasta file into seaview, clustalx or jalview.) How do the muscle and the PRANK alignments differ? What are the overall lengths of the alignments? Why might the PRANK alignment be advantageous for some downstream applications?
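
To compare the overall lengths of two alignments, it is enough to count columns and gap characters. The sketch below takes an alignment as a dictionary of name to aligned sequence; the function name alignment_summary and the toy sequences are made up for this example.

```python
def alignment_summary(alignment: dict) -> dict:
    """Return the number of sequences, number of columns, and gap fraction."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("expected a non-empty alignment with equal-length rows")
    n_columns = lengths.pop()
    total_cells = n_columns * len(alignment)
    gap_cells = sum(seq.count("-") for seq in alignment.values())
    return {
        "sequences": len(alignment),
        "columns": n_columns,
        "gap_fraction": gap_cells / total_cells if total_cells else 0.0,
    }


# Toy alignment (hypothetical sequences, for illustration only):
print(alignment_summary({"s1": "MK--VL", "s2": "MKAAVL"}))
```

Running this on the muscle and the PRANK output for the same sequences gives you the numbers asked for above; the number of residues is identical, so a longer alignment means more gap characters.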

If you have time, repeat the alignment with MAFFT (online version here). This is fast. How does the resulting alignment compare to the muscle and PRANK alignments?

Some programs allow you to estimate the robustness of aligned regions. Guidance, from Tal Pupko's group at TAU, produces alignments with good estimates of the reliability of individual alignment columns (the output from guidance for the ATPases alignment is here and here). For your future work, you might want to keep this service in mind. You can also use GBLOCKS in seaview (see above) to select sites that are reliably aligned, but this may be too conservative.

