Assignment 7: Continuation of Blastall on Server

Your name:
Your email address:

Questions you should answer are given in red.

Comments: in unix files and directory names should NOT include any space!

<control> c stops whatever process you are currently running.

<control> z halts the process you are currently running, entering b continues the halted process in background (you can do something else on the commandline.

Use the <tab key> to complete the command line. Use the upwards arrow to recall commands that you already executed.

If you use a windows computer, crimson is a recommended text editor.

In the context of connecting to servers using ssh, sftp, of ftp, your local computer/laptop is considered the client, the computer you connect to is known as the server.
(to install filezilla on your computer, you need the client version)

Continue the exercises from last week. Leave the answers blank that you already submitted last week.

 

1) Using Blast on a "local" machine:

Many completed genomes can be found on the NCBI FTP site (ftp://ftp.ncbi.nih.gov) in the folder named genomes. The eukaryotic genomes are accessible from the genomes directory while all prokaryotic (bacteria + archaea) genomes are stored in the the subfolder /bacteria (a clear case of archaeal discrimination).

Download four genomes of your choice. The first two should be from distantly related organisms (e.g. one Archaeon, one Bacterium, if in doubt of what to choose, go for one of the Thermotoga and one of the Pyrococcus genomes.
For the second exercise two or more genomes from closely related bacteria or archaea are needed. Two or more of the Aeromonas genomes would be interesting):

Method: Go into the bacteria sub-folder using the ftp://ftp.ncbi.nih.gov link. Click on the appropriately named folder (e.g., Pyrococcus abyssi, Pyrococcus is a member of the archaeal domain).

Among the files in these folders we are interested in is the .ffn file which contains all nucleotide sequences for individual predicted ORF's, the .faa file which contains the predicted ORF's amino acids sequences, and the .fna file, which contains the whole genome. These files are in a fasta format. Another file of interest are the ptt files that contain information on the individual annotated open reading frames. This file is useful, if you want to map ORFs by position in the genome. For this exercise we will only use the .faa files and the .ptt. If a folder contains more than one .faa file, either take only the file containing the data from the chromosome (for the first exercise you could take all of the faa files and copy them into a single file*. The same approach is recommended, if you want to use more than one genome as target.) On your computer rename the files into something you recognize. (e.g., p_abyssi.faa and t_maritima.faa).

More preparation: Often it is nice to have a text editor that uses context dependent coloring, especially, for creating scripts. For Macs a very nice program is textwrangler. The latest version is here, older versions for older operating systems are here.

Connect to the server bbcsrv3.biotech.uconn.edu.

You will need two connections to the server,
(A) a terminal window that allows you execute commands from the command line, and
(B) a program that allows you to move files back and forth.

For A:

On a windows machine download and install the ssh client from ftp://ftp.uconn.edu/restricted/ssh/ . Install the program, it should include both a terminal program and a file transfer program. Open the terminal program and connect to the server. Your username is mcb3421_nn, the password is case sensitive and on the white board.

On a Mac, open the terminal program (terminal.app -- the program usually resides in the application folder inside the utilities subfolder).
In the terminal window type:
ssh mcb3421_nn@bbcsrv3.biotech.uconn.edu
where nn is the number assigned to you.

Your password is on the white-board.

Change your password by typing
passwd

Then follow the instructions to change your password. Write down your name, your account name and your password on a piece of paper and hand it to Amanda.

For B:

Load the two .faa files from the distantly related organisms into a folder in your home directory on bbcsrv3.biotech.uconn.edu.
As usual there are many ways to do this.

Alternative #1: The easiest is to use a program like filezilla. Go to their web page, download and install FileZilla client (use the correct version according to your operating system and processor). Unpack the program. Double click on the program. For Mac with Intel processor, the application is also available here, and for Macs with RISC processors, the application is also available here. Connect FileZilla to the server (top left icon, then click on the new site button, then select protocol sftp, and logon type normal. The two panes in the FileZilla window represent your computer (left) and the server (right).

Find the files you want to transfer on the left window pane and drag them into the directory on the server where you want them to go. Right click (or control click) on the right pane opens a menu that allows you create new subdirectories and change file permissions.

Alternative #2: ssh "always" works, but it is rather pedestrian: Save the two files onto your computer. Then open a terminal window on your mac (in the utilities folder inside the application folder) and type the commands in green

ssh your_account_name@bbcsrv3.biotech.uconn.edu    (enter your password when prompted)
mkdir blasttest    (creates a directory named blasttest)

Open another terminal window on your mac and type
sftp your_account_name@bbcsrv3.biotech.uconn.edu   (enter your password when prompted. This establishes a secure ftp connection to the XSERV cluster)
cd blasttest changes to the directory blasttest. lcd changes the directory on your local machine, lpwd reports your local directory name
mput *.faa transfers all files that end with .faa in your local directory into the directory on the server.

Whatever works for you - get the two files into your account on the cluster and establish a terminal with an ssh connection to the cluster.

=============== END(Alternative#2)

At this point you should have an ssh connection to the cluster running in terminal, and two genomes in a directory on the cluster called blasttest ==============

On the cluster (remote machine), you should NOT (meaning never-ever) run long processes on the master node. Either type qlogin to be transferred to a node that is not busy, or submit your command to the queue using qsub. (qrsh also seems to work)

On the cluster using the ssh connection established change to the directory, where you stored the *.faa files from the distantly related bacteria/archaea. If the directory is named blasttest and one of the genomes is called p_abyssi.faa, the following commands will create a searchable database:

cd blasttest
formatdb -i p_abyssi.faa -o T -p T
Where p_abyssi.faa is one of the genomes you downloaded; replace the name with the *.faa file you want to work with.

The -i is to indicate your infile.name, -p T is for a protein sequence, for a nucleotide sequence, you need to use -p F (for False); the -o T instructs the formatdb program to create indices. For more information on the formatdb program type  formatdb - <enter>
Remember: You need to replace p_abyssi.faa with one of the genomes you downloaded. This genome will be converted into a searchable database.

To do a blast search of every sequence in t_maritima.faa against the P. abyssi genome, we use blastall. To get information on the program type
blastall <enter>
For more details check one of the many help pages here and here.

We will use the following options (you need to change the command to reflect the names of the genomes you want to compare; -d gives the name of the databank that you created using formatdb):

blastall -i t_maritima.faa -d p_abyssi.faa -o blast.out -p blastp -e 10 -m 8 -a 2
-i the query sequence(s)
-d the database
-o the output file
-p the blast program to use
-e the e-value cut off
-m the format of the output. option 8 gives a tabular output, that can be easily read into excel spreadsheet or into a database program. option 9 is similar but includes comment lines.
the individual columns in the output denote the following
# Query id # Subject id # % identity # alignment length # number of mismatches # number of gap openings # position of query start # position of query end # position of subject start # position of subject end # e-value of a hit # bit score of a hit

Which option turns off the the low complexity filter?
Which option, and which setting, sets the wordsize to 2?
Which option allows to use two processors?

Inspect the output typing more blast.out on the commandline on the cluster

Often one is only interested in the top scoring hit for each query. A simple perl script (here) goes through the blastall output in m8 or m9 format and retains only the top scoring hit, and it adds a header line that will be correctly interpreted by EXCEL. Save this script into your blasttest folder on the cluster (the easiest is to ctrl click on here and select save link as, then select the blasttest folder in the finder). Glance through the script using a text editor (textwrangler, or just click on the here link in your browser).

To run the blast.out file through the perl script type
perl extract_lines.pl blast.out
This will create a file in your directory called blast.out.top that only contains the top scoring hits, and from which the comment lines (in case you used the m9 option in blastall) are removed.

(Aside: A modified version of this program (here) adds another column to the listing that for each alignment gives the bitscore per residue in the alignment.)

Import this file blast.top.out into an excel spreadsheet.

Download the file to your local machine using filezilla or sftp
start EXCEL on you local machine,
open an empty spreadsheet,
select Data -> Get external data;
select your file (you'll need to tell EXCEL to show you all files, not only files with a recognized extension)

Select sort in the data pull-down menu (the one on top, not in the application window) and sort the data on the different columns. Which gene was found to be most similar between the two genomes.

Which were the genomes you selected?


Which gene pair had the highest % identity?

Which gene pair had the smallest E-value?

(to retrieve the information on target sequence, you can use the command fastacmd. If the GI number of the target is 14521962, and the database is p_all.faa the command is as follows:
fastacmd -s 14521962 -d p_all.faa
By default the databank is nr as installed on the cluster. Thus fastacmd -s 14521962 also works (and you get more annotation), provided the sequences in your databank are also in nr.

Making HISTOGRAMs

Next, we will calculate a histogram for the % identity values for all the significant matches and one for all matches (including ones with an e-value above 0.0001).

The newer version of Excel for Macs no longer allows to calculate histograms via the add-ons. The simplest solution is to use use a histogram tool. Download this file and open it in excel. Allow it run macros. Then in the tools menu select better histograms, select the data you want to plot, and enter the start (e.g.: 0), step (e.g.: 10) and stop values (100). Then click ok.

Calculate % identity histograms for all matches and for matches that are significant and highly significant; e.g. all % identity values, and only the % identity values for matches with E-values < 10^-10. If you don't have time to do this yourself, go here - this compares T.maritima top scoring hits to three Pyrococcus genomes, check the %identity sheet.

=========================

Aside: If EXCEL on your computer does not cooperate, we can calculate the histograms in 'R' (a package for statistical analysis and graph plotting****).
Download the script from here (control click and 'save link as...' -- this is an archive with the script and with a file containing instructions, the script itself is here) and place the file histogramScript_pdf.R in the blasttest folder on the cluster via filezilla or sftp.

Copy the data you want to plot as a histogram into a text document (one number per line, using copy paste from Excel to a text-editor works).
Add a first line with information on the data (these will become a label on the histogram. The label is not allowed to contain any spaces! (use underlines instead). Save the textfile as indata in the directory that contains the script histogramScript_pdf.R (indata should not have a file extension)

In your ssh connection, using the cd and cd.. commands, move to the directory that contains the file indata.txt and histogramScript_pdf.R

At the commandline type
> R CMD BATCH histogramScript_pdf.R
(don’t type the >)

The script should generated a files called plot1.pdf and plot2.pdf. Using the file transfer connection you made earlierlocate this PDF file in Finder and open it (double clicking should open it in preview).

End(aside)

============================

Do you observe a smooth distribution of % identiy values?


Do any genes constitute a second peak of more similar sequences, compared to the bulk of the comparisons?


What can explain the difference in the histograms for significant and insignificant hits?

2) Using Blastall to do genome plots for microbial genomes (different Aeromonas, Frankia, or different Thermotoga species work nicely) .

Download the *.faa, and*.ptt from two genomes (main chromosome only) from the ncbi's ftp site ftp://ftp.ncbi.nih.gov. As we want to plot syntenic relationships between the two genomes, we want to use the chromosomes from two close relatives.

We want to plot a genes genome position in one genome against the genome position of a homolog in another genome. One way to do this is to replace the GI numbers in the genome with the location in the genome. A program that does this is here. Save the two genomes (as faa files, the two ptt files, the two files, and the perlscript into a new directory on the cluster (name in gp (for genome plot)). We also will use the extract_lines.pl script (so put it into the same directory).

Open a terminal connection to the cluster, login, then
qlogin to a compute node
cd gp, run the addnucnum script. For example, if the genome and ptt files are ACN.faa, ACN.ptt, CCI.faa, CCI.ppt then the commands would be
perl addnucnum.pl ACN
and
perl addnucnum.pl CCI
Check the addnucnum.pl program, the ptt files, and the output in a a texteditor (Note, at present the script uses the middle of the ORF to identify the location, if you want the plot the beginning of each ORF, modify the script).

Pick one of the numbered genomes and turn it into a "blastable" databank. e.g.
formatdb -i ACN.num.faa -p T -o T
Check the result of the formatdb command by typing
more *.log

then run a blastall search of the other genome (or rather the proteins encoded in the other genome) against the databank
blastall -i CCI.num.faa -d ACN.num.faa -o blast.out -p blastp -e 10e-4 -m 8 -a2
NOTE: You want to select a reasonable significance cut-off, because we will plot all significant matches.
NOTE: You need to use the numbered genomes in both cases.

Run blast.out through extract_lines.pl
perl extract_lines.pl blast.out

(Note: If you do more than one comparison, rename blast.out into something meaningful!)

Download the files blast.out and blast.out.top into excel and create scatterplots for the first two columns.

What if any is the difference between the two plots?


Why do some proteins have more than one match?



Repeat the scatterplot exercise after you removed all blast hits that had an E-value worse than 10^-20.

Optional: Another fun exercise might be to plot the level of sequence conservation along a genome. An easy way to do this is to sort the spreadsheet on the ORF position, and then plot the bitscores as a bargraph, or using a scatterplot (bitscore versus position, or -log E-values versus position, or % identity versus position ... ). For this last exercise, if you want to identify the genes (see the fastacmd above), it would help to use one genome with numbers giving the position in the genome, and the other one using the GI numbers.

Which region(s) of the genome is least conserved?
What measure did you plot as function of position?
What would be a better measure?
How could you calculate this measure? (see the ptt files and the addnucnum.pl script)

In case you are moved to learn more PERL, you can earn extra credit, if you combine the addnucnum script with the extract lines script so that for each line the length of the query, the length of the target, and the position in the target genome is added to the blast.out.top table.

The following might be helpful:
A modified version of the addnucnum sript is here (it creates a %ash that links gi number as key to the length of the encoded protein);
a short script that pulls out two gi numbers per line from a m8 or m9 formated output is here, and a modified extract lines script is here.

 

=========================================================

* You can use a text editor for this (textwrangler), but the files might be pretty large. An alternative is to use unix:

open a terminal window (the application is in the utility folder inside the application folder).
cd to the directory where you downloaded the two files (e.g., cd Desktop)
cat name1.faa name2.faa > new_name_for_both_faa_files.faa
where name1.faa name2.faa are the names of the files you want to copy. cat list the content of one ore more files and > directs the output from the default (the screen) to a file

**** If you end up doing research in Science, Medicine, or Engineering you'll need to do statistical analyses. R is free, open source, and probably the most powerful software package available. You can also do publication quality plotting. There is a learning curve at the beginning, but it is a worthwhile investment of time! See here and here for more info.

Finished?

Check the appropriate radio button below before pressing the submit button:

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.