Assignment 6: Database searches using BLAST+ executables

Your name:
Your email address:

1) Using Blast on a "local" machine

A) obtaining the genome sequences

When searching the Internet, you may find references to both the BLAST (classic) and BLAST+ (new) command-line tools. If you have used the old BLAST tools, you may find this Quick Start Guide for switching from BLAST to BLAST+ command line tools useful.

Go to the NCBI's current genome list.

Click on the "Prokaryotes" tab. Then untick all the "Levels:" (at the top right) except for "Complete". This should result in a listing of completely sequenced genomes.

Take a note of the number of complete genomes (Items 1 - 100 of X) here:

Click on "Download Reports from FTP site" (top right of page). Download "prokaryotes.txt" to your computer, or get a local copy here. (Save target as, or Save link as, depending on your browser.) Save to the Desktop.

Load "prokaryotes.txt" into Excel. Start with a blank workbook, then File -- Open -- Browse -- Select "All Files". Choose prokaryotes.txt

Text Import Wizard opens. Just choose "Finish", as the defaults work fine.

Click on the header of column "A". The status bar should show 74338 rows.

Click on the first cell (A1). Then select "Format as Table". The header lines should now appear in table format.

Scroll to the right until you see a "Status" heading. Click on Status, and tick only "Complete Genome". Check status bar for the number of records found.
Does it agree with the previous number?

Sort by Release Date. How many complete genomes have been published so far this month (October)?

Which genome has the most genes? The highest GC content? (No answer required :-)

Go back to the NCBI's current genome list.

Click on the "Prokaryotes" tab. Then untick all the "Levels:" (at the top right) except for "Complete". This should result in a listing of completely sequenced genomes.

Download two genomes of your choice. They should be from distantly related organisms (e.g. one Archaeon, one Bacterium, if in doubt of what to choose, go for one of the Thermotoga and one of the Pyrococcus genomes.

To do this, use the Group and SubGroup boxes to narrow your selection, or better yet, use the search box to narrow to a taxonomic group.

Then look for the green diamond link in the far right-hand column. This green diamond takes you to a listing of all the files for a genome project.

You want to download the protein.faa file for each genome. This "faa" file contains all of the proteins coded by a genome.

Right-click on the faa file, and select "Copy link location". Then go to the cluster and paste the link, like this:

Start a program called Bitvise SSH Client. In the "Host" field, type
In the "Username" field, type your username. Login. It may ask you to accept a new host key. Now enter your password.

mkdir lab6

cd lab6

curl -O paste the link you copied

e.g. curl -O


mv oldname newname

mv ridiculouslyLongFilename.faa.gz somethingShorter.faa.gz

A good convention would be the first letter of the genus, and then the species name, e.g., D_radiodurans.faa.gz

gunzip somethingshorter.faa.gz


The file should now end with just "faa". The "gz" ending was a compressed file format. Take a look at the first few lines with

head somethingshorter.faa

And repeat the procedure for the second genome.

This takes you to a "compute" node. (Why? Because we are going to run the BLAST+ command, and we want to "farm" the processing out to another computer, rather than hammering the single computer which operates as the "gateway" for everyone. See cluster etiquette.)

cd lab6
makeblastdb -in database_protein.faa -dbtype prot
Choose one of your genomes as the "database". Do an "ls" to see the extra files you just made.

blastp -query query_protein.faa -db database_protein.faa -out blast.txt -outfmt 6
The other genome will be the "query".

This will take a few minutes. When finished, open a new SFTP window in BitVise, navigate to your lab6 directory, and transfer over the blast.txt file to your Desktop. Load it into Excel.

Here is a description of the columns.

What is the accession ID pair of the most conserved protein between your two genomes? (IDs in the first two columns are accession IDs)

In Excel, select the third column from the left (this is the percent identity for each BLASTp match), and Insert -- Statistic Chart (all-blue column chart icon in the Charts section) -- Histogram

(More about histograms in Excel, including the formula used for default binning.)

You get a histogram of all percent identities. You may find it useful to convert to table format, and add headings. Repeat the histogram for E-values ≤ 1, and E-values ≤ 10-3.

Add a column to the spreadsheet that calculates the number of identical residues (column 3 * column 4 /100). Make histograms for the number of identical residues for all E-values, better than 1 and better than E-3.

Do you observe a smooth distribution of % identiy values?

Do any genes constitute a second peak of more similar sequences, compared to the bulk of the comparisons?

What can explain the difference in the histograms for significant and insignificant hits?

For the following questions, the BLAST® Command Line Applications User Manual may be helpful → link

How can you get information on the possible parameters in blastp using the commandline? (try blastp -h first)

How can you set the wordsize to 2 ?

How can you filter the query sequence for regions with low complexity?

2) Using Blast to do genome plots for microbial genomes (different Frankia, Aeromonads, or different Thermotoga species work nicely).

Download the proteins from two genomes of your choice -- two genomes from closely related bacteria or archaea are needed. Two of four Aeromonas genomes would be interesting :)

This time you will also need the feature_table files for both genomes. So, two "faa" files, and their corresponding "feature_table" files, for a total of 4 files. The feature_table files contain the genome coordinates for each protein.

Repeat the BLASTp steps from before, but to get just the top BLASTp hit for each query sequence, use
blastp -query query_protein.faa -db database_protein.faa -out blast_top.txt -outfmt 6 -max_target_seqs 1

The BLASTp output contains accession numbers, but no genome coordinates. The genome coordinates are in the feature_table files. We want to compare the genome coordinates of the matches. One way is with the "join" command:

e.g., Take two genomes, g1 and g2.

BLASTp results
g1 accession g2 accession
g1 feature table
g1 accession coordinate
g2 feature table
g2 accession coordinate

join BLASTp_results.txt g1_feature.txt > step1.txt

It will join on the first column.

g1 accession g2 accession g1 coordinate

join -1 2 step1.txt g2_feature.txt

The "-1 2" tells join that the first file (-1) will be joined on the second (2) column.

final output
g2 accession g1 accession g1 coordinate g2 coordinate

A bit tedious, but it gets the job done. The files to join must be sorted by the columns they're joined on.

grep '^CDS' query_feature_table.txt | grep $'\tchromosome\t' | cut -f8,11 > query_start_accession.txt
head query_start_accession.txt
grep '^CDS' database_feature_table.txt | grep $'\tchromosome\t' | cut -f8,11 > database_start_accession.txt
cut -f1,2 blast_top.txt > accession_top.txt
join -1 1 -2 2 <(sort accession_top.txt) <(sort -k 2 query_start_accession.txt) > accession_top_query_start.txt
join -1 2 -2 2 <(sort -k 2 accession_top_query_start.txt) <(sort -k 2 database_start_accession.txt) > accession_top_query_start_database_start.txt

Make an Excel scatter plot of the joined file (accession_top_query_start_database_start.txt)

Below are probably questions for next week. It is intended to do two genome plots for comparison — one of just the top hits, and the other plot of all hits (we can remove the -max_target_seqs_1 option, but otherwise repeat the procedure).

What if any is the difference between the two plots?

Why do some proteins have more than one match?

Repeat the scatterplot exercise after you removed all blast hits that had an E-value worse than 10^-20.

Type logout
You will return to the master computer. Now type qstat
You should have no jobs running. If there is a job listed, then type qdel
followed by a space, and then the job-ID number, followed by return or enter key. Then type qstat
again, to confirm that there are no running jobs. Then type logout
to exit the main computer.

Optional: Sequnece conservation along a genome

Plot the level of sequence conservation along a genome. An easy way to do this is to sort the spreadsheet on the ORF position, and then plot the bitscores as a bargraph, or using a scatterplot (bitscore versus position, or -log E-values versus position, or % identity versus position ... ). For this last exercise, if you want to identify the genes (see the fastacmd above), it might help to use one genome with numbers giving the position in the genome, and the other one using the GI numbers. (Even better would be do the blast search with the normal genomes and write a Perl script to add another coulms to the spreadsheet that contains the genome location of the query and the target genome. If you do, send me a copy of the script.)

Which region(s) of the genome is least conserved?


Type logout to release the compute node form the queue.
Check the queue for abandoned sessions using qstat.
If there are abandoned sessions under your account, kill them by deleting them from the queue by typing qdel job-ID, e.g. "qdel 40000" would delete Job # 40000

Check the appropriate radio button below before pressing the submit button:

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.