Assignment 7: Database searches using BLAST+ executables (continued from last week)

Your name:
Your email address:


This week, we will make a genome plot. (As seen in the midterm and lectures.) This is basically a way of visualising the BLASTp hits between proteins (ORFs) in two genomes, in order to compare their relative arrangement (inversions, etc.). One genome is the x-axis, and the other genome is the y-axis. For each point (x,y) on a scatter plot, the following holds:

From last week, you'll recall that the tabular output (i.e., outfmt option 6) of a BLASTp search between proteins in two genomes (g1 and g2) looks like this:

BLASTp results
g1 accession g2 accession and 8 other columns (link) E-value Bit score

You'll also recall that we downloaded a "..._protein.faa" file, which was a FASTA format file of all the encoded proteins in a given genome.

The FASTA format has a ">" line, followed by each sequence. The NCBI ..._protein.faa files look something like this:

>proteinAccessionA(space)some other descriptive text
M...the rest of the protein for the accession on the previous line, on one or more lines...
>proteinAccessionB(space)some other descriptive text
M...the rest of the protein for the accession on the previous line, on one or more lines...

There are several files in the RefSeq FTP directory for a genome (the "green diamond link"). Today we will be using the two shaded in blue:

ReqSeq FTP directory contents, amongst others
File type Description of contents
..._protein.faaFASTA of encoded proteins
..._protein.gbffGenbank format file
..._feature_table.txtFor each gene, lists the coordinate and whether chromosome/plasmid
..._genomic.fnaFASTA of genome assembly
..._cds_from_genomic.fnaFASTA of nucleotide coding sequences
..._genomic.gffGeneric Feature Format Version 3, similar to the feature table information, link
..._genomic.gbffGenbank format file
..._rna_from_genomic.fnaFASTA of RNA sequences

Adding gene coordinates to BLASTp output

Given two genome (g1 and g2) FASTA protein files, our BLASTp output might look as follows:

g1 protein.faa FASTA (the query)
g2 protein.faa FASTA (the database)

BLASTp output
g1 accession g2 accession and 8 other columns (link) E-value Bit score

We would then need some way to add the coordinates for the query and database accessions to the BLASTp output. This information is in the respective feature tables.

g1 feature table
g1 accession start coordinate end coordinate
g2 feature table
g2 accession start coordinate end coordinate

One simple way would be to change the header lines in our FASTA protein files to a genome coordinate instead of an accession. For this example we will choose the start coordinate.

modified g1 protein.faa FASTA
modified g2 protein.faa FASTA

BLASTp output from modified FASTA
g1 accession g2 accession and 8 other columns (link) E-value Bit score

Then we proceed to do a scatter plot of the first two columns.

There are many ways to accomplish this, but today we will use option 1.

  1. Use a Perl program to substitute the accession numbers in the FASTA files for each genome with the corresponding genome coordinate in the feature table, then BLAST as usual.
  2. Use the command-line program "join". See Appendix 1 (if you dare).
  3. Use Excel to merge the BLASTp output with the feature tables. This is called "Get & Transform".

Using Blast to do genome plots for microbial genomes (different Frankia, Aeromonads, or different Thermotoga species work nicely).

A) obtaining the genome sequences

Go to the NCBI's current genome list.

Click on the "Prokaryotes" tab. Then untick all the "Levels:" (at the top right) except for "Complete". This should result in a listing of completely sequenced genomes.

Download two genomes of your choice. Different Frankia, Aeromonads, or different Thermotoga species work nicely.

To do this, use the Group and SubGroup boxes to narrow your selection, or better yet, use the search box (the box to the left of "Search by organism") to narrow to a taxonomic group.

Then look for the green diamond link in the far right-hand column. This green diamond takes you to a listing of all the files for a genome project.

You want to download TWO files for each genome — the protein.faa file, and the feature_table file.

Right-click on the faa file, and select "Copy link location". Then go to the cluster and paste the link, like this:

Start a program called Bitvise SSH Client. In the "Host" field, type
In the "Username" field, type your username. Login. It may ask you to accept a new host key. Now enter your password.

mkdir lab7

cd lab7

curl -O paste the link you copied

You will need to download four files in total:

curl -O your first genome protein faa file link
curl -O your first genome feature table file link
curl -O your second genome protein faa file link
curl -O your second genome feature table file link


The "mv" command can be used to rename a long, unwieldy filename into a more concise one.

mv   oldname   newname

mv   ridiculouslyLongFilename.faa.gz   somethingShorter.faa.gz

A good convention would be the first letter of the genus, and then the species name, e.g., D_radiodurans.faa.gz and D_radiodurans_feature.txt.gz

gunzip somethingShorter.faa.gz


The file should now end with just "faa" or "txt". The "gz" ending was a compressed file format. Take a look at the first few lines with

head somethingShorter.faa

Here is a Perl program that substitutes the accession numbers in a protein FASTA file with the corresponding genome (start) coordinates in the feature table:

curl -O

You use it as follows:

perl   yourGenome1.faa   yourFeatureTable1.txt   >   yourGenome1WithStart.faa

perl   yourGenome2.faa   yourFeatureTable2.txt   >   yourGenome2WithStart.faa

head yourGenome1WithStart.faa

Check that the ">accession" lines have been replaced with ">number" lines.

Replace the accession lines in both genome FASTA files. Be careful that you use the corresponding feature table for each genome. Also note that if a file exists with the same name as that to the right of the "screen output" redirect (>) symbol, it will be replaced!

This takes you to a "compute" node. (Why? Because we are going to run the BLAST+ command, and we want to "farm" the processing out to another computer, rather than hammering the single computer which operates as the "gateway" for everyone. See cluster etiquette.)

cd lab7
makeblastdb -in database_protein.faa -dbtype prot
Choose one of your genomes as the "database". Do an "ls" to see the extra files you just made. Use the FASTA file with the start coordinates.

blastp -query query_protein.faa -db database_protein.faa -out blast.txt -outfmt 6 -evalue 1e-4
The other genome will be the "query". An E-value cut-off of 10-4 is used. Use the FASTA files with the start coordinates.

This will take a few minutes. Again, here is a description of the columns.

To get just the top hit for each query sequence, we use another Perl program. Since the hits for each query are ordered by best E-value to worst, the top hit is simply the first hit for each query:

curl -O

You use it as follows:

perl blast.txt > blastTopHit.txt

head blastTopHit.txt

Notice that there is only one hit returned per query in the blastTopHit.txt file.

Note: The "-max_target_seqs 1" option also returns the top BLASTp database hit for each query sequence. However, since we also want all hits with E-value ≤ 10-4 for the other plot, we can use the Perl program to avoid computing the BLASTp twice (once without the max_target_seqs option, and once with).

Open a SFTP window in BitVise, navigate to your lab7 directory, and transfer over the blast.txt and blastTopHit.txt files to your Desktop.

Make an Excel scatter plot using all BLAST hits with E-value ≤ 10-4, and another using just the top hits.

What if any is the difference between the two plots?

Why do some proteins have more than one match?

Repeat the scatterplot exercise after you removed all blast hits that had an E-value worse than 10^-20.

In the plot of blastTopHit.txt, do you see matches along the main diagonal? What could explain the deviations from the diagonal?

In case your analysis resulted in a gene plot that had many matches close to both of the the two diagonals (see here and here for an example),
what process could have given rise to this pattern?

Type logout
You will return to the master computer. Now type qstat
You should have no jobs running. If there is a job listed, then type qdel
followed by a space, and then the job-ID number, followed by return or enter key. Then type qstat
again, to confirm that there are no running jobs. Then type logout
to exit the main computer.

Optional: Sequence conservation along a genome

Plot the level of sequence conservation along a genome. An easy way to do this is to sort the spreadsheet on the ORF position, and then plot the bitscores as a bargraph, or using a scatterplot (bitscore versus position, or -log E-values versus position, or % identity versus position ... ). For this last exercise, if you want to identify the genes (see blastdbcmd), it might help to use one genome with numbers giving the position in the genome, and the other one using the accession numbers. (Even better would be do the blast search with the normal genomes and write a Perl script to add another columns to the spreadsheet that contains the genome location of the query and the target genome. If you do, send me a copy of the script.)

Which region(s) of the genome is least conserved?


Type logout to release the compute node form the queue.
Check the queue for abandoned sessions using qstat.
If there are abandoned sessions under your account, kill them by deleting them from the queue by typing qdel job-ID, e.g. "qdel 40000" would delete Job # 40000

Check the appropriate radio button below before pressing the submit button:

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.

Appendix 1: Using the join command

Problem: Your BLASTp output contains accession numbers, but no genome coordinates. The genome coordinates are in the feature_table files. We want to compare the genome coordinates of the matches. One way is with the "join" command:

e.g., Take two genomes, g1 and g2.

BLASTp results, top hits with "-max_target_seqs 1"
g1 accession g2 accession
g1 feature table
g1 accession coordinate
g2 feature table
g2 accession coordinate

join blast_top_hit.txt g1_feature.txt > step1.txt

It will join on the first column.

g1 accession g2 accession g1 coordinate

join -1 2 step1.txt g2_feature.txt

The "-1 2" tells join that the first file (-1) will be joined on the second (2) column.

final output
g2 accession g1 accession g1 coordinate g2 coordinate

A bit tedious, but it gets the job done. The files to join must be sorted by the columns they're joined on.

grep '^CDS' query_feature_table.txt | grep $'\tchromosome\t' | cut -f8,11 > query_start_accession.txt
head query_start_accession.txt
grep '^CDS' database_feature_table.txt | grep $'\tchromosome\t' | cut -f8,11 > database_start_accession.txt
head database_start_accession.txt
cut -f1,2 blast_top_hit.txt > accession_top_hit.txt
join -1 1 -2 2 <(sort accession_top_hit.txt) <(sort -k 2 query_start_accession.txt) > accession_top_hit_query_start.txt
join -1 2 -2 2 <(sort -k 2 accession_top_hit_query_start.txt) <(sort -k 2 database_start_accession.txt) > accession_top_hit_query_start_database_start.txt

When finished, open a new SFTP window in BitVise, navigate to your lab7 directory, and transfer over the accession_top_hit_query_start_database_start.txt file to your Desktop. Load it into Excel.

Make an Excel scatter plot of the joined file (accession_top_hit_query_start_database_start.txt)