Assignment 7: Database searches using Blastall continued

Your name:
Your email address:


Use of <tab> and command history on the unix command-line.

Questions you should answer today are given in red.

Comments: in unix, file and directory names should NOT include any spaces!

<control> c stops whatever process you are currently running.

<control> z suspends the process you are currently running; entering bg continues the suspended process in the background (so you can do something else on the command line).

Use the <tab key> to complete the command line. Use the upwards arrow to recall commands that you already executed.

If you use a windows computer, crimson is a recommended text editor.

In the context of connecting to servers using ssh, sftp, or ftp, your local computer/laptop is considered the client; the computer you connect to is known as the server.
(to install filezilla on your computer, you need the client version)

Instructions on how to connect to a computer using sftp and ssh are given below

At the end of the exercise type logout to release the compute node from the queue.
Check the queue for abandoned sessions using qstat.

As many of you did not do that last week, you should check that your previous qlogin session has been terminated. You do this after logging into the head node (using ssh).

1) Using Blastall to do genome plots for microbial genomes (different Frankia, Aeromonas, or different Thermotoga species work nicely).

Download the *.faa and *.ptt files for two genomes (main chromosome only) from the NCBI's ftp site. As we want to plot syntenic relationships between the two genomes, we want to use the main chromosomes from two close relatives. (Or, if there is a second chromosome, you could also compare the two secondary chromosomes.)

We want to plot a gene's genome position in one genome against the genome position of a homolog in another genome. One way to do this is to replace the GI numbers in the genome with the location in the genome. A program that does this is here. We also will use the script. Save the two genomes (as faa files), the two ptt files, the two files, and the two perl scripts into a new directory on the cluster.

Open a terminal connection to the cluster, login, then
qlogin to a compute node !!!!
mkdir blastdir2 (make a directory to work in)
cd blastdir2

Open FileZilla or the ssh client and open a transfer window to the cluster (go here if you need more instructions)

In your sftp program move to the blastdir2 directory and transfer the two .faa files, the two .ptt files, and the two perl scripts to the blastdir2 directory on the cluster.

Run the addnucnum script. For example, if the genome and ptt files are ACN.faa, ACN.ptt, CCI.faa, and CCI.ptt, then the commands would be
perl ACN
perl CCI
Check the program, the ptt files, and the output in a text editor. (Note: at present the script uses the middle of the ORF to identify the location; if you want to plot the beginning of each ORF, modify the script.)
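
To make the "middle of the ORF" idea concrete, here is a minimal Python sketch. It is not the course's Perl script; it only assumes the usual NCBI .ptt layout (tab-separated columns Location, Strand, Length, PID, ..., with Location written as start..end). The function names are made up for this illustration.

```python
def orf_midpoint(location):
    """Return the midpoint of a ptt Location field such as '190..255'."""
    start, end = (int(x) for x in location.split(".."))
    return (start + end) // 2


def read_ptt_midpoints(lines):
    """Map PID (the GI number, 4th column) to the ORF midpoint.

    Assumes tab-separated ptt rows: Location, Strand, Length, PID, ...
    """
    midpoints = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # The title line, the protein-count line and the column header
        # either have too few columns or no 'start..end' in column 1.
        if len(fields) < 4 or ".." not in fields[0]:
            continue
        midpoints[fields[3]] = orf_midpoint(fields[0])
    return midpoints


# Example use (ACN.ptt is the example file name used above):
#   with open("ACN.ptt") as handle:
#       positions = read_ptt_midpoints(handle)
```

To use the beginning of each ORF instead, one would return start rather than the midpoint (and probably end for genes on the minus strand).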

Pick one of the numbered genomes and turn it into a "blastable" databank. e.g.
formatdb -i ACN.num.faa -p T -o T
Check the result of the formatdb command by typing
more *.log

Then run a blastall search of the other genome (or rather the proteins encoded in the other genome) against the databank
blastall -i CCI.num.faa -d ACN.num.faa -o blast.out -p blastp -e 10e-4 -m 8 -a2
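NOTE: You want to select a reasonable significance cut-off, because we will plot all significant matches. (In scientific notation 10e-4 means 10 x 10^-4 = 0.001; for a cut-off of 10^-4 you would write 1e-4.)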
NOTE: You need to use the numbered genomes in both cases.

Run blast.out through
perl blast.out

(Note: If you do more than one comparison, rename blast.out into something meaningful!)

Download the files blast.out and into excel and create scatterplots for the first two columns (alternatively, you can use the gnuplot script linked in class11).

Which genomes did you use?

What, if any, is the difference between the two plots?

Why do some proteins have more than one match?

Repeat the scatterplot exercise after you have removed all blast hits that had an E-value worse than 10^-20.
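
One way to remove the weaker hits without sorting in a spreadsheet is a short filter. The Python sketch below assumes the tabular output produced by -m 8, in which the E-value is column 11 and the bitscore column 12. The function names and the output file name in the example are made up for this illustration.

```python
def filter_hits(lines, max_evalue=1e-20):
    """Keep only tabular (-m 8) BLAST lines with E-value <= max_evalue."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        if float(fields[10]) <= max_evalue:  # column 11 is the E-value
            kept.append(line)
    return kept


def filter_file(in_path, out_path, max_evalue=1e-20):
    """Read in_path, write the surviving hits to out_path."""
    with open(in_path) as src:
        kept = filter_hits(src, max_evalue)
    with open(out_path, "w") as dst:
        dst.writelines(kept)
    return len(kept)


# Example use (the output name is arbitrary):
#   filter_file("blast.out", "blast.e20.out")
```

The filtered file has the same columns as blast.out, so it can be loaded and plotted in exactly the same way.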

In the plot of, do you see matches along the main diagonal? What could explain the deviations from the diagonal?

In case your analysis resulted in a gene plot that had many matches close to both of the two diagonals (see here and here for an example), what process could have given rise to this pattern?

Optional: Sequence conservation along a genome

Plot the level of sequence conservation along a genome. An easy way to do this is to sort the spreadsheet of on the ORF position, and then plot the bitscores (or -log E-values, or % identity) as a bar graph, or using a scatter plot (bitscore (or ...) versus position). For this last exercise, if you want to identify the genes (see the use of fastacmd in assignment 6), it might help to use one genome with numbers giving the position in the genome, and the other one using the GI numbers. (Even better would be to do the blast search with the normal genomes and write a Perl script to add another column to the spreadsheet that contains the genome location of the query and the target genome. If you do, send me a copy of the script.)
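
The "even better" variant can be sketched as follows, in Python rather than Perl. The sketch assumes that the search was run on the normal (GI-numbered) genomes with -m 8, that sequence IDs look like gi|NUMBER|ref|...|, and that you already have two dictionaries mapping GI numbers to genome positions (for example, built from the two .ptt files). All function and variable names are made up for this illustration.

```python
def gi_of(seq_id):
    """Extract the GI number from an ID such as 'gi|16127995|ref|NP_414543.1|'."""
    parts = seq_id.split("|")
    if len(parts) >= 2 and parts[0] == "gi":
        return parts[1]
    return seq_id  # fall back to the raw ID if the format is different


def annotate_hits(blast_lines, query_pos, target_pos):
    """Return (query position, target position, bitscore) per hit,
    sorted by position in the query genome.

    query_pos and target_pos map GI number (string) -> genome position.
    Hits whose GI number is missing from either map are skipped.
    """
    rows = []
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        q = query_pos.get(gi_of(fields[0]))
        t = target_pos.get(gi_of(fields[1]))
        if q is None or t is None:
            continue
        rows.append((q, t, float(fields[11])))  # column 12 is the bitscore
    rows.sort(key=lambda row: row[0])
    return rows
```

Plotting the third value against the first gives conservation along the query genome; plotting the second against the first gives the same kind of genome plot as in part 1.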

Which region(s) of the genome contains the least conserved proteins?

Histograms of ...

Go to assignment 6 if you want to plot any of the columns of the blast.out or files as a histogram.


You could do scatterplots of %identity against bitscore, or -log(E-value) against bitscore ....
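
If you compute -log(E-value) outside a spreadsheet, note that BLAST reports very small E-values as 0.0, for which the logarithm is undefined. The Python sketch below replaces such values with a floor before taking the logarithm. The floor of 1e-180 is an arbitrary assumption chosen for this illustration, as are the function names. Columns are those of the -m 8 output (3 = % identity, 11 = E-value, 12 = bitscore).

```python
import math


def neg_log10_evalue(evalue, floor=1e-180):
    """Return -log10(E-value); values below the floor (including 0.0) are capped."""
    return -math.log10(max(evalue, floor))


def score_columns(lines):
    """Return (% identity, -log10 E-value, bitscore) for each tabular hit."""
    points = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        points.append((
            float(fields[2]),                     # % identity
            neg_log10_evalue(float(fields[10])),  # -log10(E-value)
            float(fields[11]),                    # bitscore
        ))
    return points
```

Hits that were capped at the floor will line up at the same -log(E-value); keep that in mind when interpreting the scatterplot.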



Footnotes and Commentaries


SSH and SFTP connections to the server

You will need two connections to the server,
(A) a terminal window that allows you to execute commands from the command line, and
(B) a program that allows you to move files back and forth.

For A:

On a windows machine download and install the ssh client from . The program should include both a terminal program and a file transfer program. Open the terminal program and connect to the server: first edit a profile with your information, then use the profile to connect.

On a Mac, open the terminal program ( -- the program usually resides in the application folder inside the utilities subfolder).
In the terminal window type:

If you want to change your password type

Then follow the instructions to change your password.

For B:

Load the two .faa files from the distantly related organisms into a folder in your home directory on
As usual there are many ways to do this.

If you have the ssh client installed under windows, log in and open the file transfer window.

Alternative #1 (skip this if you use windows - the ssh client does the same as filezilla):
The easiest is to use a program like FileZilla. Go to their web page, download and install the FileZilla client (use the correct version for your operating system and processor). Unpack the program and double click on it. For Macs with an Intel processor, the application is also available here, and for old Macs with RISC processors, the application is also available here. Connect FileZilla to the server (top left icon, then click on the new site button, then select protocol sftp and logon type normal). The two panes in the FileZilla window represent your computer (left) and the server (right).

Find the files you want to transfer in the left window pane and drag them into the directory on the server where you want them to go. Right click (or control click) on the right pane opens a menu that allows you to create new subdirectories and change file permissions.

Alternative #2: Using sftp on your ssh connection "always" works, but it is rather pedestrian: Save the files you want to transfer onto your computer. Then open a terminal window on your mac (in the utilities folder inside the application folder) and type the commands in green

ssh    (enter your password when prompted)

Open another terminal window on your mac and type
sftp   (enter your password when prompted. This establishes a secure ftp connection to the cluster)
cd blasttest changes to the directory blasttest. lcd changes the directory on your local machine, lpwd reports your local directory name
mput *.faa transfers all files that end with .faa in your local directory into the directory on the server.

Whatever works for you!

More preparation: Often it is nice to have a text editor that uses context dependent coloring, especially, for creating scripts. For Macs a very nice program is textwrangler. The latest version is here, older versions for older operating systems are here.

* You can use a text editor for this (textwrangler), but the files might be pretty large. An alternative is to use unix:

open a terminal window (the application is in the utility folder inside the application folder).
cd to the directory where you downloaded the two files (e.g., cd Desktop)
cat name1.faa name2.faa > new_name_for_both_faa_files.faa
where name1.faa and name2.faa are the names of the files you want to copy. cat lists the content of one or more files, and > redirects the output from the default (the screen) to a file.

**** If you end up doing research in Science, Medicine, or Engineering you'll need to do statistical analyses. R is free, open source, and probably the most powerful software package available. You can also do publication quality plotting. There is a steep learning curve at the beginning, but it is a worthwhile investment of time! See here and here for more info.


Type logout to release the compute node from the queue.
Check the queue for abandoned sessions using qstat.
If there are abandoned sessions under your account, kill them by deleting them from the queue by typing qdel job-ID, e.g. "qdel 40000" would delete Job # 40000
