MCB 3421 Assignment 8

Your name:
Your email address:

A note about the "all on one line" problem you may come across when working with files

If you work under OS X, your text editor (MS Word) can read files with any end-of-line convention and translates the end-of-line characters correctly. A frequent problem is that the classic Mac end-of-line character (carriage return, \r) differs from the UNIX one (line feed, \n), which is also used by Darwin, the system underlying OS X. A UNIX application such as clustalx expects the UNIX end-of-line character; if the file uses Mac end-of-line characters, the whole file appears as a single line. If this happens, you can use the following terminal command (the terminal is a window that opens a command line interface to your operating system; see here for a summary of frequently used UNIX commands):

tr '\r' '\n' < macfile.txt > unixfile.txt

(if this does not work on your system try /usr/bin/tr '\015' '\n' < macfile.txt > unixfile.txt)
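
If tr is not available, the same conversion can be sketched in Python. This is only an illustration and not part of the assignment; the function names to_unix_newlines and convert_file are made up for this example. It works on raw bytes, so it makes no assumption about the text encoding, and it treats Windows line ends (\r\n) as a single line end, whereas the tr command above would turn each of them into two line feeds.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Return data with Windows (CRLF) and classic Mac (CR) line ends as LF."""
    # Replace CRLF first so that a Windows line end does not become two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src: str, dst: str) -> None:
    """Read src as raw bytes and write a UNIX-style copy to dst."""
    with open(src, "rb") as handle:
        data = handle.read()
    with open(dst, "wb") as handle:
        handle.write(to_unix_newlines(data))


if __name__ == "__main__":
    # Toy input with classic Mac line ends (made up for illustration).
    print(to_unix_newlines(b">seq1\rMKV\r"))
```

To convert a real file you would call convert_file("macfile.txt", "unixfile.txt"), using the same file names as in the tr example.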

Note: If you frequently use UNIX applications on a Mac under OS X, get the ConvertNewlines application. After unzipping the application, place it on the desktop. If you drag the file you want to change onto its icon, it recognizes the current end-of-line format and asks which convention (Unix, Windows, or Mac) you want to convert the file to.

Part 1: ClustalX (20 minutes)

You can download ClustalX and ClustalW for Mac and PC from here. If you are on a Mac, then the file you need to download IS THIS ONE. Download it to your Desktop, and double-click it to extract the ClustalX app from the zip file.

We will be using this sequence file. The sequences in this file are annotated as follows:

Start ClustalX by double clicking on the clustalx icon. Using the FILE menu, load the sequence file.

Once you have loaded the sequences, calculate an alignment (Alignment menu -> "Do complete Alignment" now). It will take 2-3 minutes.

Maximize the window and scroll to position 300. Most of the ATPase subunits have a "canonical" motif (G.....GKT) characteristic of many nucleotide binding sites. Which sequence replaces this motif in the B subunits of the vacuolar type ATPases?
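
Instead of scanning by eye, the motif can also be located with a regular expression. The sketch below is a hedged illustration: it assumes the motif meant here is the Walker A (P-loop) consensus, which is commonly written G-x(4)-G-K-[ST], so the number of wildcard positions may need adjusting to match the notation above. The function name find_motif and the toy sequences are invented for the example and do not come from the ATPase file. Gaps are removed before searching, so the reported positions refer to the individual sequence, not to the alignment column.

```python
import re

# Walker A (P-loop) consensus as commonly written: G-x(4)-G-K-[ST].
# Assumption: this is the motif meant in the text; adjust the pattern if not.
WALKER_A = re.compile(r"G.{4}GK[ST]")


def find_motif(aligned_seq: str, pattern: re.Pattern = WALKER_A) -> list:
    """Return (start, matched_text) pairs in ungapped, 1-based coordinates."""
    ungapped = aligned_seq.replace("-", "").upper()
    return [(m.start() + 1, m.group()) for m in pattern.finditer(ungapped)]


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    toy = {
        "with_motif": "MA--GGAGVGKTVL",
        "without_motif": "MA--GGAGVAAAVL",
    }
    for name, seq in toy.items():
        print(name, find_motif(seq))
```

A sequence that returns an empty list is a candidate for having the motif replaced; you still need to look at the alignment to see what is there instead.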

Save the alignment in different formats (PHYLIP, FASTA, NEXUS, and MSF) (File Menu -> Save sequences as). Using a text editor (MS Word) and a non-proportional font (e.g., Courier), inspect the different formats.

NOTE:
*.MSF is read by GCG; PHYLIP (*.phy) is the "new" PHYLIP interleaved format; NEXUS is used by PAUP and MrBayes.
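
To see how two of these formats differ, here is a minimal Python sketch that writes the same toy alignment as FASTA and as interleaved PHYLIP. The function names and the two toy sequences are made up for the example, and the exact spacing of files written by ClustalX may differ from what this sketch produces. It assumes the usual PHYLIP layout: a header line with the number of sequences and the alignment length, names padded to 10 characters in the first block only, and blank lines between blocks.

```python
def to_fasta(alignment: dict, width: int = 60) -> str:
    """One '>' header line per sequence, followed by its wrapped residues."""
    lines = []
    for name, seq in alignment.items():
        lines.append(">" + name)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


def to_phylip_interleaved(alignment: dict, width: int = 50) -> str:
    """Header with counts, then blocks that cycle through all sequences."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")
    lines = [f" {len(names)} {length}"]
    for start in range(0, length, width):
        if start:
            lines.append("")  # blank line between blocks
        for name in names:
            chunk = alignment[name][start:start + width]
            # Names appear only in the first block, cut or padded to 10 chars.
            prefix = name[:10].ljust(10) if start == 0 else ""
            lines.append(prefix + chunk)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    toy = {"seqA": "MKV-LA", "seqB": "MRVGLA"}  # invented toy alignment
    print(to_fasta(toy))
    print(to_phylip_interleaved(toy, width=3))
```

In FASTA each sequence is written out completely before the next one starts, while in the interleaved layout every block contains a slice of all sequences.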

If you have time: You can use the input/output options to reformat an alignment. Some programs require specialized formats. You can use a text editor like MS Word to get your alignment into the desired format, but things are much easier if you start out with a format close to the desired one. Hint: Often you do not want to use the complete alignment, but only those portions that are sufficiently conserved. You can take a file in clustal format (*.aln) and delete columns with a text editor (in MS Word on a PC, holding down the alt key before clicking the mouse switches to column mode; to see the alignment, you may need to decrease the font size and select a non-proportional font in your editor). Although the lines in the resulting alignment have different lengths, clustal reads in the aligned sequences correctly, and you can output the shortened sequences in any format you want. The drawback is that on a Mac you have to convert back and forth between Mac and UNIX end-of-line formats.
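
Deleting columns can also be done programmatically, which avoids the column-mode editing. The following Python sketch keeps only chosen column ranges of an alignment held in memory as a dictionary of name to gapped sequence. The function name keep_columns, the toy sequences, and the column ranges are hypothetical; reading and writing the actual alignment file is left out.

```python
def keep_columns(alignment: dict, ranges: list) -> dict:
    """Keep only the listed 1-based, inclusive column ranges of each sequence."""
    trimmed = {}
    for name, seq in alignment.items():
        # Convert each 1-based inclusive range to a Python slice and join them.
        trimmed[name] = "".join(seq[start - 1:end] for start, end in ranges)
    return trimmed


if __name__ == "__main__":
    # Invented toy alignment; columns 4-5 and 8-9 are treated as poorly aligned.
    toy = {"seqA": "MKV--LAGT", "seqB": "MRVGGLA-T"}
    print(keep_columns(toy, [(1, 3), (6, 7)]))
```

Because every sequence is cut at the same columns, the result is still a valid alignment with lines of equal length.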

Part 2: Jalview (15 minutes)

Jalview is a Java application for inspecting and editing multiple sequence alignments. It also allows inspection of protein space for the aligned sequences, which works surprisingly well. The Jalview Homepage contains a lot of additional information.

Start Jalview as a Java Web Start application (if the window does not appear after a few seconds, check the dock for the Java icon and click on it).
If this does not work, download this file, unarchive it, and double-click on Jalview.app.
(In case neither of these options works, go to this webpage and start Jalview as an applet. This makes it more difficult to interact with the program.)
Close any windows that opened as a demonstration.

Load your ATPaseSUall.aln file (File -- Input alignment -- from file) from the previous exercise (sequences need to be aligned!) into Jalview (select file format Clustal .aln).

Explore the different coloring options (COLOUR menu). Which one seems to work best (is most meaningful)? Scroll through the alignment to a more conserved region to compare them.


Note: You can change/edit the alignment by shift-clicking on an amino acid residue and dragging it to the right or left. Try it, but leave the sequences in an aligned state before you move on.

Select all sequences. CALCULATE an AVERAGE DISTANCE TREE USING % identity.
Click somewhere in the resulting tree to color groups of related sequences in the same color. You can right-click (or command click) on a node to change color for a group of sequences.
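
The tree is built from pairwise distances between the aligned sequences. As a rough illustration of what "% identity" means, the Python sketch below computes percent identity over the columns in which neither sequence has a gap and turns it into a distance (100 minus percent identity). This is one common definition; Jalview's exact calculation may differ. The function names and toy sequences are invented for the example.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = identical = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        identical += a == b
    return 100.0 * identical / compared if compared else 0.0


def distance_matrix(alignment: dict) -> dict:
    """Distance (100 - % identity) for every ordered pair of sequence names."""
    names = list(alignment)
    return {
        (p, q): 100.0 - percent_identity(alignment[p], alignment[q])
        for p in names
        for q in names
    }


if __name__ == "__main__":
    toy = {"seqA": "MKVLA-GT", "seqB": "MRVLAAGT"}  # invented toy alignment
    print(round(percent_identity(toy["seqA"], toy["seqB"]), 1))
```

Sequences that end up in the same colored group of the tree are those with small pairwise distances to each other.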

CALCULATE the PRINCIPAL COMPONENT ANALYSIS.
In a principal component analysis, the new dimensions are calculated as linear combinations of the original dimensions, so that the greatest variance of any projection of the data set comes to lie on the first axis, the second greatest on the second axis, and so on for the following dimensions. Can you find a higher dimension that breaks up the vacuolar ATPase A subunits? (Their names start with A.)
Which of the A subunit sequences cluster together if you use this dimension (1, 2 and 5 worked for me)?
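
To make the idea of "the axis of greatest variance" concrete, here is a small generic sketch in plain Python. It is not Jalview's implementation, which starts from sequence similarity scores and is not reproduced here; the sketch works on ordinary numeric coordinates. It finds the first principal axis by power iteration on the covariance matrix, which is a swapped-in textbook technique chosen because it needs no external library. All names and the toy points are made up, and the simple starting vector would fail in the unlikely case that it is exactly perpendicular to the first axis.

```python
import math


def first_principal_component(points, iterations=200):
    """Return (unit axis of greatest variance, mean-centered points)."""
    n, dim = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - means[j] for j in range(dim)] for p in points]
    # Sample covariance matrix of the original dimensions.
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1)
            for j in range(dim)] for i in range(dim)]
    # Power iteration: repeated multiplication by the covariance matrix
    # converges to the eigenvector with the largest eigenvalue.
    vec = [1.0] * dim
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec, centered


def project(centered, axis):
    """Coordinate of each centered point along the given axis."""
    return [sum(x * a for x, a in zip(row, axis)) for row in centered]


if __name__ == "__main__":
    # Toy points lying on the line y = 2x (invented for illustration).
    axis, centered = first_principal_component([(1, 2), (2, 4), (3, 6), (4, 8)])
    print(axis)
    print(project(centered, axis))
```

The returned axis is a linear combination of the original x and y dimensions; for these toy points it points along the line on which they lie, so all of the variance falls on the first axis.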

Part 3: More alignments -- and combining alignments

We will use the seaview alignment graphical user interface (GUI) for this exercise. Seaview, like clustalx, uses an x-windows interface. The latest version of the seaview program is available for different platforms from here. The Mac version is also here. Unzip the archive, then start the program by double clicking on the icon.

The following is a list of intein-containing yeast V-ATPase catalytic subunits -- CLICK HERE --. Download these sequences (send to: pulldown menu, select file and then fasta format), and then drag the file into the seaview alignment window. Align them into a multiple sequence alignment using clustalw2 (set the option in the align menu under options, then start the alignment by selecting align all).

Scanning through the alignment, can you predict which part of the sequences corresponds to the ATPase subunit, and which to the intein? (If you click on a residue, seaview tells you the position in the alignment and the position in the individual sequence.) Redo the alignment in muscle. Do you see any differences between the muscle and the clustalw alignment? (Yes/No; if yes, from about where to where, and in which sequence?)
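
If the differences are hard to spot by eye, two alignments of the same sequences can be compared programmatically. The Python sketch below is one possible approach, not a feature of seaview: for a chosen sequence it records which residue of a reference sequence each of its residues is paired with, does this for both alignments, and reports the first and last residue that are paired differently. The function names and the toy alignments are invented; loading the clustalw and muscle output files into dictionaries is left out.

```python
def residue_partners(seq: str, ref: str) -> list:
    """For each residue of seq, the 1-based ref residue in its column (or None)."""
    partners, ref_pos = [], 0
    for a, r in zip(seq, ref):
        if r != "-":
            ref_pos += 1
        if a != "-":
            partners.append(ref_pos if r != "-" else None)
    return partners


def differing_region(aln1: dict, aln2: dict, name: str, ref: str):
    """First and last residue of `name` aligned differently to `ref`, or None."""
    p1 = residue_partners(aln1[name], aln1[ref])
    p2 = residue_partners(aln2[name], aln2[ref])
    diff = [i + 1 for i, (x, y) in enumerate(zip(p1, p2)) if x != y]
    return (diff[0], diff[-1]) if diff else None


if __name__ == "__main__":
    # Two invented toy alignments of the same two sequences.
    first = {"ref": "MKVLAGT", "x": "MK-LAGT"}
    second = {"ref": "MKVLAGT", "x": "MKL-AGT"}
    print(differing_region(first, second, "x", "ref"))
```

Positions are counted in the individual sequence, which matches the second number seaview shows when you click on a residue.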

Align the intein-containing sequences (the sequence names start with gi....) with the ones from the S. cerevisiae ATPaseSU file. --HERE are all the ATPase SU + the intein-containing A SU-- --Here are only the yeast A-SU --.

Explore different options to align the sequences with and without inteins. Define two groups of sequences (those with and without inteins) and explore a profile alignment between the two (select each group first and align within the group, then between the two groups). Is the intein clearly recognized as an inserted region? In an alignment containing only the A-subunits with inteins, and one A-SU from yeast without intein as reference, use gblocks to define conserved sites: In the sites menu, select create set, select GBLOCKS, select "allow for gap positions within the final blocks". Which parts of the intein are flagged as reliably aligned (you could compare this to the conserved blocks listed in inbase here)?

(For the following exercise, in case each alignment takes too long, delete some of the sequences from the alignment, but leave at least two intein containing and some other A-subunit sequences in the dataset.)

Start clustalX, load the combined sequences --HERE--. In the alignment menu, in the alignment parameter submenu, place a check mark in "reset all gaps before alignment".

Change the alignment parameters (e.g., reduce the opening gap penalty in both pairwise and multiple alignment parameter settings, or change the scoring matrix for the multiple sequence alignments).

Is the intein still aligned to gaps only in the other ATPase subunits? If no, which parameters did you change?
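
Why the gap opening penalty matters here can be illustrated with a little arithmetic. With an affine gap penalty, each gap is charged an opening cost once plus an extension cost per gapped position. The Python sketch below compares one long gap opposite an insertion with the same number of gapped positions split into several shorter gaps. The penalty values, the insertion length of 400, and the function name gap_cost are arbitrary choices for illustration; the real program also adjusts penalties by position and sequence, which this sketch ignores.

```python
def gap_cost(gap_lengths, gap_open=10.0, gap_extend=0.5):
    """Total affine penalty: gap_open once per gap plus gap_extend per position."""
    return sum(gap_open + gap_extend * length for length in gap_lengths)


if __name__ == "__main__":
    one_gap = [400]      # insertion aligned to a single long gap
    five_gaps = [80] * 5  # same 400 positions spread over five gaps
    for opening in (10.0, 1.0):
        print(
            opening,
            gap_cost(one_gap, gap_open=opening),
            gap_cost(five_gaps, gap_open=opening),
        )
```

With the higher opening penalty the single long gap is clearly cheaper than the fragmented arrangement; with the lower one the two cost almost the same, so small gains in residue matches can be enough to break the insertion into pieces.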

Different alignment programs create different alignments: clustal and muscle are rather similar, while PRANK produces longer alignments of a different flavor. Some programs allow you to estimate the robustness of aligned regions. Guidance from Tal Pupko's group at TAU produces alignments with a good estimation of the reliability of individual alignment columns (the output from guidance for the ATPases alignment is here). For your future work, you might want to keep this service in mind. You can also use GBLOCKS in seaview (under the sites menu) to select sites which are reliably aligned, but this may be too conservative.

If you like a challenge: move the sequences onto the biotech cluster, and align the sequences in muscle (command: muscle -in sequences.fasta -out muscle_alignment.aln). Are there noticeable differences in the resulting alignment?

Finished?

Check the appropriate radio button below before pressing the submit button:

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.