Assignment 6: Database searches using Blast and Blastall

Your name:
Your email address:

Questions you should answer are given in red.

0) Last weeks BLAST continued (10 minutes)

Using a protein coding nucleotide sequence of your choice as query (or use

http://www.ncbi.nlm.nih.gov/nuccore/21226102?from=2146398&to=2148008&report=gbwithparts

^ DANGER: cut and paste the nucleotide sequence -- the gi number refers to the whole genome, which is what you get if you display the FASTA sequence, numbers pasted into the sequence box are ignored. At the beginning of the seqence add a ">" followed by a name and an <enter> symbol)),

1. search the non-redundant databank using blastn [link] (set maximum target sequences to 20000, else defaults)

2. repeat the search with the translated query vs. protein database ( nr) search tool (Blastx) using the protein nonredundant databank as target (set maximum target sequences to 20000, else defaults), and

3. with the translated query vs. translated database search tool (Tblastx) again using the non-redundant databank as target.

Do you notice any differences in results? (How many significant matches did you get? The taxonomy report provides an easy way to check things.)

1) Search Entrez for a transposase sequence from Frankia (an Actinobacterium living in the soil and as a nitrogen fixing symbiont in some angiosperm trees and shrubs). Use the protein sequence and search all available Frankia genomes of (at http://www.ncbi.nlm.nih.gov/sutils/genom_table.cgi ). Use blastp and tblastn as search algorithm.
How many significant hits did you find?
(Note: for tblastn, the list of hits only lists the genome as a whole, to get the number of transposons you need to count the pairwise alignments)

gi number of your query:
Names of genomes searched:
number of hits using blastp in genome 1:
number of hits using blastp in genome 2:
number of hits using tblastn in genome 1:
number of hits using tblastn in genome 2:

2) Using Blast on a "local" machine:

Many completed genomes can be found on the NCBI FTP site (ftp://ftp.ncbi.nih.gov) in the folder named genomes. The eukaryotic genomes are accessible from the genomes directory while all prokaryotic (bacteria + archaea) genomes are stored in the the subfolder /bacteria (a clear case of archaeal discrimination).

Download two genomes of your choice, using one of these two methods (if in doubt of what to choose, go for one of the Thermotoga and one of the Pyrococcus genomes):

Method 1: Use the Refseq PID or GPID column for your chosen genomes here. (click on the PID link, then under Project Data on protein. In the "Send to" pull-down menu select File and select FASTA format. Rename the file (sequence.fasta) so that you recall what is in it.

Method 2: Go into the bacteria sub-folder using the ftp://ftp.ncbi.nih.gov link. Click on the appropriately named folder (e.g., Pyrococcus abyssi, Pyrococcus is a member of the archaeal domain).

Among the files in these folders we are are interested in the .ffn file which contains all nucleotide sequences for individual predicted ORF's, the .faa file which contains the predicted ORF's amino acids sequences, and the .fna file, which contains the whole genome. These files are in a fasta format. Another file of interest ar the ptt files that contain information on the individual annotated open reading frames, This file is useful, if you want to map ORFs by position in the genome. For this exercise we will only use the .faa files. If a folder contains more than one .faa file, either take only the file containing the data from the chromosome, or take all of the faa files and copy them into a single file*. The same approach is recommended, if you want to use more than one genome as target. On your computer rename the files into something you recognize. (e.g., p_abyssi.faa and t_maritima.faa)

Connect to the server bbcxsrv1.biotech.uconn.edu. Open the terminal program (terminal.app -- the program usually resides in the application folder inside the utilities subfolder).

In the terminal window type:
ssh mcb221_uxxx@bbcxsrv1.biotech.uconn.edu
where xxx is the number assigned to you.

Change your password by typing
passwd
Then follow the instructions to change your password. Write down your name, your account name and your password on a piece of paper and hand it to Amanda.

Load the two .faa files into a folder in your home directory on bbcxsrv1.biotech.uconn.edu.
As usual there are many ways to do this.

Alternative #1: The easiest might be to use the finder. Select Go->connect to server in the finder-menu. In the address field enter afp://bbcxsrv1.biotech.uconn.edu . When prompted enter your username and password. This opens a folder that contains the contents of your directory on the bioinformatics cluster. You can drop and drag files into that folder, and you can use the finder to create new folders. In your folder on the cluster create a sub-folder called blasttest and move your two faa files into that folder.
(Possible Problem: your iMac might remember a previous user and log him in automatically. If this happens, click at bbcxsrv1 in the bar on the left side of the finder window, this open a diconnect button on the right side of the finder, click it and log in as you).
Continue after =========== below!

Alternative #2: ssh "always" works, but it is very pedestrian: Save the two files onto your computer. Then open a terminal window on your mac (in the utilities folder inside the application folder) and type the commands in green

ssh your_account_name@bbcxsrv1.biotech.uconn.edu (enter your password when prompted)
mkdir blasttest (creates a directory named blasttest)

Open another terminal window on your mac and type
sftp your_account_name@bbcxsrv1.biotech.uconn.edu (enter your password when prompted. This establishes a secure ftp connection to the XSERV cluster)
cd blasttest changes to the directory blasttest. lcd changes the directory on your local machine, lpwd reports your local directory name
mput *.faa transfers all files that end with .faa in your local directory into the directory on the seerver.

Alternative 3: use filezilla and Jellyfissh, or finder and jellyfissh. jellyfissh is a simple program that keeps track of the ssh connections. It saves you from typing ssh username ....

Whatever works for you - get the two files into your account on the cluster and establish a terminal with an ssh connection to the cluster.

Establish an ssh connection to bbcxsrv1.biotech.uconn.edu

Open a terminal window on your mac (in the utilities folder inside the application folder) and type the commands in green

ssh your_account_name@bbcxsrv1.biotech.uconn.edu (enter your password when prompted)

===============

At this point you should have (A) and ssh connection to the cluster running in terminal, and two genomes in a directory on the cluster called blasttest ==============

On the cluster (remote machine), you should NOT (meaning never ever) run long processes on the master node. Either qrsh to a node that is not busy (type qrsh **), or submit your command to the queue using qsub.

On the XSERV cluster using the ssh connection established earlier type
qrsh -l hipri=TRUE <enter> enter your password when prompted,
cd blasttest
formatdb -i p_abyssi.faa -o T -p T
Where p_abyssi.faa is one of the genomes you downloaded; replace the name with the *.faa file you want to work with.

The -i is to indicate your infile.name, -p T is for a protein sequence, for a nucleotide sequence, you need to use -p F (for False); the -o T instructs the formatdb program to create indices. For more information on the formatdb program type formatdb - <enter>
Remember: You need to replace p_abyssi.faa with one of the genomes you downloaded. This genome will be converted into a searchable database.

To do a blast search of every sequence in t_maritima.faa against the P.abyssi genome, we use blastall. To get information on the program type
blastall <enter>
For more details check one of the many help pages here and here.

We will use the following options (you need to change the command to reflect the names of the genomes you want to compare; -d gives the name of the databank that you created using formatdb):

blastall -i t_maritima.faa -d p_abyssi.faa -o blast.out -p blastp -e 10 -m 8 -a 2
-i the query sequence(s)
-d the database
-o the output file
-p the blast program to use
-e the e-value cut off
-m the format of the output. option 8 gives a tabular output, that can be easily read into excel spreadsheet or into a database program. option 9 is similar but includes comment lines.
the individual columns in the output denote the following
# Query id # Subject id # % identity # alignment length # number of mismatches # number of gap openings # position of query start # position of query end # position of subject start # position of subject end # e-value of a hit # bit score of a hit

Which option turns off the the low complexity filter?
Which option, and which setting, sets the wordsize to 2?
Which option allows to use two processors?

Inspect the output typing more blast.out on the commandline on the cluster

Often one is only interested in the top scoring hit for each query. A simple perlscript (here) goes through the blastall output in m8 or m9 format and retains only the top scoring hit, and it adds a header line that will be correctly interpreted by EXCEL. Save this script into your blasttest folder on the cluster (the easiest is to ctrl click on here and select save link as, then select the blasttemp folder in the finder). Glance through the script using a text editor (vi, or textwrangler, or just click on the here link in your browser).

To run the blast.out file through the perl script type
perl extract_lines.pl blast.out
This will create a file in your directory called blast.out.top that only contains the top scoring hits, and from which the comment lines (in case you used the m9 option in blastall) are removed.

A modified version of this program (here) adds another column to the listing that for each alignment gives the bitscore per residue in the alignment.

Import this file into an excel spreadsheet. If you have an afp connection to the cluster, you can

control click on the blast.top.out file in finder
select open with Excel (you need to uncheck show preferred applications)

OR

start EXCEL on you local machine,
open an empty spreadsheet,
select Data -> Get external data;
select your file (you'll need to tell it to show you all files, not only files with a recognized extension)

Sorting the data on the different columns (selecting sort in the data menu) can figure out which gene was found to be most similar between the two genomes.

Which were the genomes you selected?
Which gene pair had the highest % identity?

Which gene pair had the smallest E-value?

(to retrieve the information on target sequence, you can use the command fastacmd. If the GI number of the target is 14521962, and the database is p_all.faa the command is as follows:
fastacmd -s 14521962 -d p_all.faa
By default the databank is nr as installed on the cluster. Thus fastacmd -s 14521962 also works (and you get more annotation), provided the sequences in your databank are also in nr.

Next, we will calculate a histogram for the % identity values for all the significant alignments. The newer version of Excel for Macs no longer allows to calculate histograms via the add-ons*** so we will use a script in 'R' (a package for statistical analysis and graph plotting****).
Download the script from here (control click and 'save link as...' -- this is an rchive with the script and with a file containing instructions, the script itself is here) and place the file histogramScript_pdf.R in the blasttest folder on the cluster via afp.

Copy the data you want to plot as a histogram into a text document (one number per line, using copy paste from Excel to a texteditor works).
Add a first line with information on the data (these will become a label on the histogram. The label is not allowed to contain any spaces! (use underlines instead). Save the textfile as indata in the directory that contains the script histogramScript_pdf.R (indata should not have a file extension)

In your ssh connection, using the cd and cd.. commands, move to the directory that contains the file indata.txt and histogramScript_pdf.R

At the commandline type
> R CMD BATCH histogramScript_pdf.R
(don’t type the >)

The script should generated a files called plot1.pdf and plot2.pdf. Using the afp connection you made earlier (or one of the alternative methods described above) locate this PDF file in Finder and open it (double clicking should open it in preview).

Do you observe a smooth distribution of % identiy values?
Do any genes constitute a second peak of more similar sequences, compared to the bulk of the comparisons?

Are there other interesting things you could plot from these data?

A particularily enlightening comparison is obtained, when you plot histograms for the % identity values for differnet significant levels, e.g. all % identity values, and only the % identity values for matches with E-values < 10^-10.

Can you explain the observation? If you don't have time to do this yourself, go here - this compares T.maritima top scoring hits to three Pyrococcus genomes, check the %identity sheet.

* You can use a text editor for this (textwrangler), but the files might be pretty large. An alternative is to use unix:

open a terminal window (the application is in the utility folder inside the application folder).
cd to the directory where you downloaded the two files (e.g., cd Desktop)
cat name1.faa name2.faa > new_name_for_both_faa_files.faa
where name1.faa name2.faa are the names of the files you want to copy. cat list the content of one ore more files and > directs the output from the default (the screen) to a file

** Screen is a unix program that can be very useful, especially, if you want to go home and keep the cluster running on a program that you started. The best is to start screen immediately after you connect to the server via ssh. To get going type screen -a <return>, then type qrsh <return> this sends you to a compute node, you might need to enter your password again. Once you are ready to leave, enter the following keys <ctrl> and a at the same time followed by a d. This detaches the screen. logout from the master node; go home; ....; when you want to check the program, login to the master node and type screen -r to reattach to the screen (if you have several screens running you get instructions how to connect to a particular screen). In class we will use the high priority queue that works for jobs that do not run for a long time.

*** The 2008 version of Excel for Macs no longer allows to calculate histograms via the add-ons, but there are spreadsheets available on the 'net into which you can copy and paste the data you want to plot - a histogram generating spreadsheet for Excel2008 is here). Histograms may be available via the analysis tool pack in some Excel versions. If you did not install office with all options, you will need to have the office installation disk, sorry.

**** If you end up doing research in Science, Medicine, Engineering or working in many other areas you'll need to do statistical analyses. R is free, open source, and probably the most powerful software package available. You can also do publication quality plotting. There is a learning curve at the beginning but it is a worthwhile investment of time! See here, here and here for more info.

Finished?

Check the appropriate radio button below before pressing the submit button:

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.