Assignment 14

Your name:
Your email address:

Some of these slides are from last year's class, so not everything is relevant to this year's lab — slides with examples are here !

1) Handling genome (or larger) amounts of data -- Extracting text from other applications

EMBOSS is installed on the cluster. Here is a list of programs in EMBOSS. Today we will be using pepstats. Click on its entry in the list to see the command line arguments.

First we will download the encoded proteins from the genome of a group of organisms you are interested in (Halobacteria, Methanocella, Halomonas).

Go to the NCBI's current genome list.

Click on the "Prokaryotes" tab. Then untick all the "Levels:" (at the top right) except for "Complete". This should result in a listing of completely sequenced genomes.

Use the Group and SubGroup boxes to narrow your selection, or better yet, use the "Search by organism" box to narrow to a taxonomic group. (Note that after a "Search by organism", one will need to repeat the process of clicking on the "Prokaryotes" tab, and re-tick the "Complete" genomes box.)

Then look for the green diamond link in the far right-hand column (you will probably have to scroll to the right). This green diamond takes you to a listing of all the files for a genome project.

You want to download the file ending in ".faa.gz". This "faa" file contains all of the proteins coded by a genome.

Right-click on the faa file, and select "Copy link location". Then go to the cluster and paste the link, like this:

Start a program called Bitvise SSH Client. In the "Host" field, type
In the "Username" field, type your username. Login.

Cluster commands: (bracketed parts are comments only!)

	mkdir lab14a								(make a new folder called "lab14a")
	cd lab14a								(equivalent of "double-clicking" a folder to descend "into" it)
	curl -O paste the link you copied from NCBI				(curl stands for "see URL", this downloads the NCBI link to the cluster)
	                                                              	 	( could also accomplish this with the Bitvise SFTP client)

	[don't copy this mindlessly! :-)] e.g. curl -O

	ls -l									(make sure the file is there)
	gunzip *								(the .gz suffix means this text file is compressed, so uncompress it)
	gunzip hit your tab key to complete the name				(...wasn't that impressive!?)
	ls -l									(the .gz suffix should be gone, and the file got larger)
	[comment] let's make the filename simpler and more descriptive for humans
	[comment] A good convention would be the first letter of the genus, and then the species name, e.g., D_radiodurans.faa.gz
	[comment] REMEMBER TO USE THE TAB KEY TO COMPLETE LONG FILENAMES — this makes life much easier
	mv oldname.faa newname.faa						(changes the name of a file from oldname to newname
	[comment] e.g., mv GCF_000940915.1_ASM94091v1_protein.faa G_species.faa
	more G_species.faa							(inspect the first few lines of the faa file, type "q" to exit)
										(...and space to go forward)
										(...and "b" to go back)
	qlogin									(we're about to do some serious computation, so onto a compute node we go...)
	cd lab14a								(logging into the compute node "resets" our location to our home directory)
	ls									(once more to make sure the faa file is present in this folder)

One more note about qlogin, the command that takes you to a "compute" node. (Why? Because we are going to run the pepstats command, and we want to "farm" the processing out to another computer, rather than hammering the single computer which operates as the "gateway" for everyone. See cluster etiquette.) We could "qlogin" before typing the above commands, and perhaps that is better, but the general rule is that as long as you are executing commands that only take a few seconds, you can stay on the master node. DEFINITELY qlogin before running programs like pepstats or BLAST+ utilities.

And finally, here is the exciting pepstats command:

	pepstats G_species.faa -outfile G_species.pepstats
	more G_species.pepstats							(inspect the pepstats file, type space to go forward)
										(...and "b" to go back)
										(...and "q" to exit)

Now we need a program to extract the isoelectric points (amongst other stuff). It's called It will work provided the output of pepstats is in a file ending in ".pepstats" (remember that is what you named it above, type "ls" to confirm. Read through the script. Try to figure out how the program finds the values for the theoretical isoelectric points in the pepstats output.

	curl -O		(download the perl script via link)
	perl							(run the script, and extract the isoelectric points)
	ls -l										(you made a bunch of additional files)
	head G_species.pepstats.pI							(the first 10 isoelectic points!)

Using the Bitvise SFTP Window, drag the file containing the isoelectric points (.pepstats.pI ending) to your computer, and load it into Excel. (Remember that it's in the lab14a folder on the cluster.) Make a histogram! In Excel, after loading the .pI file (remember to select "All Files" to see it), use Insert -- Statistic Chart (all-blue column chart icon in the Charts section) -- Histogram.

(More about histograms in Excel, including the formula used for default binning.)

Which genomes did you analyze?
Describe the distribution of isoelectric points in your selected genome. How many peaks?
Why might there be a minimum at around pH7?
Compare your finding with others in the class. Do thermo- and halophiles have the same
distributions of isoelectric points?

Check a few of the ORFs with very alkaline theoretical isoelectric point (the *.parsed outfile contains accession numbers in the first column; you could use
fastacmd -s accession_number to retrieve the sequence from nr).
Which charge would these proteins have at neutral pH? Can you see a pattern in the types of enzymes?

2. A simple (i.e., much less sophisticated) version of Tax Plot, that we can make ourselves

Step 1: Download three genomes of interest. In this example I will use
Escherichia coli str. K-12 substr. MG1655 (reference genome, X-axis)
Archaeoglobus fulgidus (reference genome, Y-axis)
Deinococcus radiodurans (genome to analyze, compare this to the two others)

	cd								(make sure I'm in my home directory)
	mkdir lab14b
	cd lab14b
	[downloading 1st genome via green diamond/.faa link]
	curl -O
	gunzip GCF_000005845.2_ASM584v2_protein.faa.gz
	mv GCF_000005845.2_ASM584v2_protein.faa E_coli.faa
	[repeating for 2nd genome]
	curl -O
	gunzip GCF_000008665.1_ASM866v1_protein.faa.gz
	mv GCF_000008665.1_ASM866v1_protein.faa A_fulgidus.faa
	[repeating for 3rd genome]
	curl -O
	gunzip GCF_000008565.1_ASM856v1_protein.faa.gz
	mv GCF_000008565.1_ASM856v1_protein.faa D_radiodurans.faa
	ls -l								(...and I have the encoded proteins from 3 genomes)

Make sure you're logged into a compute node first!

Step 2: BLASTp the genome to analyse against both reference genomes

	makeblastdb -in E_coli.faa -dbtype prot				(first make databases out of ref genomes)
	makeblastdb -in A_fulgidus.faa -dbtype prot
	blastp -query D_radiodurans.faa -db E_coli.faa -out X.out -outfmt 6 -max_target_seqs 1
	blastp -query D_radiodurans.faa -db A_fulgidus.faa -out Y.out -outfmt 6 -max_target_seqs 1
	curl -O
	perl > plot.txt						(send this to Excel)

The output of (plot.txt) contains four (tab delimited) columns, they are:

  1. E.coli bitscore
  2. A.fulgidus bitscore
  3. an incrementing number from 1... — this is going to be an improvised label for the points in Excel -- then look up the number in the table for the annotation (HORRIBLE, TA says SORRY!)
  4. the full annotation line of this protein in D.radiodurans

The program does the following:

Transfer the plot.txt to your computer, and open it with Excel. Select the A and B columns, and Insert -- Scatter Chart. Right-click on your chart -- Move Chart -- New Sheet. Right-click on one of your points in the scatter plot -- Add Data Labels -- Add Data Labels. Right-click on one of your points -- Format Data Labels. Select "Value From Cells" and then select column C (the boring incrementing number column). Click OK. Deselect all other "Label Contains" except for the Value From Cells option.

A) Your question:

B) Your genome:

C) Your two reference genomes:

D) Which candidate genes did you find?:

For example:


    Type logout to release the compute node from the queue.
    If you you encountered problems in your session, check the queue for abandoned sessions using the command qstat. If there are abandoned sessions under your account, kill them by deleting them from the queue by typing qdel job-ID, e.g. "qdel 40000" would delete Job # 40000


    Send email to your instructor (and yourself) upon submit
    Send email to yourself only upon submit (as a backup)
    Show summary upon submit but do not send email to anyone.