Assignment 14

Your name:
Your email address:


1.   (20 minutes) One Genome in different data banks:

Go to the ENTREZ genome section.
Search for the genome of Aeropyrum pernix.

Find out which type of organism Aeropyrum pernix is, how many proteins are encoded in its genome, and how many other genes it contains.
(tip: select genome summary here)

Go to the DOE IMG site and select Aeropyrum pernix (link here)

Select the link to protein coding genes (see above). Under preference (link on upper right), select 5000 as the number of returned search results, reload the page, then search with the filter term EC: in gene product name. Select APE2326 (click on the link in the first column) and inspect the gene neighborhood using the Show ortholog neighborhood regions option (link in the second box below "Evidence for protein function prediction"). Repeat this for a few other ATP synthase subunits.

Select the "find gene" link in the header and search for locus tag APE0405 in Aeropyrum. Again inspect the gene neighborhood using the Show ortholog neighborhood regions option.

2. (20 minutes) Develop a question to address using TAX PLOT (see below for inspiration, see the microbial genomes page for possible organisms). Use the pulldown menu in Tax Plot for selection. (In this plot each circle represents a single protein from the query genome, plotted by its BLAST scores to the highest scoring protein from each of the selected organisms.) THINK ABOUT THIS FOR A SECOND before you start clicking.
What does it mean if an ORF is plotted close to one of the axes? Select two reference genomes appropriate for your question (see below for examples).
Click on the graphic to change the zoom. To look at a different function, select it, then click compare.

A) Your question:

B) Your genome:

C) Your two reference genomes:

D) Which candidate genes did you find?:

For example:

4. Gene plot is a program provided through the NCBI that compares two genomes against each other using blastp on the encoded ORFs. It is great for quickly exploring synteny between two genomes. Its drawbacks are that it simply concatenates all ORFs encoded on plasmids and chromosomes into a single file, and that only some genomes are available. Try a few comparisons in gene plot (Frankia, Prochlorococcus, Borrelia, Leptospira or Mycobacteria).

Download 2 (or more) genomes (complete DNA sequence of the main chromosome) from related organisms that seem worth analyzing from here.

E.g., one could use (right click, save as)



and )

Choose genomes that you already tried in gene plot (Frankia, Borrelia, Leptospira or Mycobacteria) or in an earlier assignment, and that resulted in interesting plots. Use the FASTA-formatted DNA sequence files that contain a single nucleotide sequence. If you want to analyze the main chromosome, make sure that this is the file you download! (A file that contains a genome is larger than 50 kByte.)
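Before you transfer a file, you can check both conditions (a single sequence, more than 50 kByte) from the command line. The following is a minimal sketch; check_fasta is a hypothetical helper name, and 50000 bytes is used as a rough stand-in for the 50 kByte rule of thumb.

```shell
# check_fasta FILE (hypothetical helper, not part of any tool):
# prints the number of sequences and the file size, and succeeds only
# if the file holds exactly one sequence and is larger than ~50 kByte.
check_fasta() {
    file=$1
    # Every sequence in a FASTA file starts with a header line beginning with ">".
    nseq=$(grep -c '^>' "$file" || true)
    # File size in bytes (tr strips the padding some wc versions add).
    size=$(wc -c < "$file" | tr -d ' ')
    echo "$file: $nseq sequence(s), $size bytes"
    [ "$nseq" -eq 1 ] && [ "$size" -gt 50000 ]
}
```

For example, check_fasta my_genome.fna (with your own file name) prints one summary line; if it reports more than one sequence, you probably downloaded a file that also contains plasmids.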

Run mummer on the cluster using default conditions. First, log on to the head node of the cluster, with the Terminal application, then connect to a worker node with "qrsh":

ssh (enter password)


Transfer your genomes to the cluster using FileZilla or an alternative approach (ssh client / Fugu).

It is a good idea to rename the NC_... files so that you know which genomes you are working with.
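One way to do the renaming safely is to look at the FASTA header first, so that you know which organism the file contains. A small sketch follows; rename_genome is a hypothetical helper, and the file names in the usage comment are placeholders, not real accession numbers.

```shell
# rename_genome OLD NEW (hypothetical helper):
# prints the FASTA header of OLD, then renames the file to NEW.
rename_genome() {
    old=$1
    new=$2
    # The first line of a FASTA file names the organism and the replicon.
    head -n 1 "$old"
    # -i asks for confirmation instead of silently overwriting an existing file.
    mv -i "$old" "$new"
}

# Usage (placeholder names):
#   rename_genome NC_example.fna B_burg.fna
```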

Now run mummer: (Mummer is similar to dotlet, but it aligns whole genomes, and it is very, very fast.)

mummer -mum -b -c ref.fna qry.fna > ref_qry.mums
Example: mummer -mum -b -c B_garinii.fna B_burg.fna > B_garinii_B_burg.mums
(type mummer to get information on the options; see the MUMmer documentation for in-depth info)
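Before plotting, it can help to look at the .mums file itself. The sketch below assumes the usual MUMmer 3 text output for a single reference sequence: header lines that start with ">" (the query name, and the query name followed by "Reverse" for the reverse strand), followed by one line per match giving the position in the reference, the position in the query, and the match length. The function names count_matches and longest_match are hypothetical.

```shell
# count_matches FILE: number of matches listed in a mummer output file.
count_matches() {
    # Header lines start with ">"; every remaining line with digits is one match.
    grep -v '^>' "$1" | grep -c '[0-9]' || true
}

# longest_match FILE: length of the longest match
# (assumed to be the last column of each match line).
longest_match() {
    awk '!/^>/ && NF { if ($NF + 0 > max) max = $NF + 0 } END { print max + 0 }' "$1"
}

# Usage:
#   count_matches ref_qry.mums
#   longest_match ref_qry.mums
```

A very small number of matches, or only short ones, suggests that the two genomes are too distantly related for a useful plot.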

exit (to get to the head node)

(mummerplot is a script that uses gnuplot)

mummerplot --postscript --prefix=ref_qry ref_qry.mums

(To keep track of things, you could replace ref and qry with the names you gave the genomes.)

(OLD: this step is now incorporated into mummerplot, so you don't need to type this gnuplot command.

You will get a warning about "deprecated syntax"; ignore it.)

If you use a Mac, you can view the postscript output with Preview. On a Windows PC you might need to install a program to display postscript files (GSview is one option). Alternatively, you could replace the --postscript option with --png --large.

Is the result similar to the one you obtained with geneplot at the NCBI? (Two of the gene plots for the different Aeromonas strains are here and here). What can explain the difference?

If you don't have time, here (Frankia), here (Mycobacteria) and here (Borrelia) are some example outputs to look at.

If you have time, re-run mummer with the -maxmatch option (instead of -mum), and try different minimum match lengths. The default is 20 nucleotides. If you have too much noise, you could increase the length to 30 or 40 by setting the parameter -l 30 or -l 40. If you have too few matches, decrease the length parameter.
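To compare several minimum match lengths in one go, you could wrap the command in a small loop. This is only a sketch: sweep_lengths and the output naming scheme (ref_qry_l30.mums and so on) are assumptions made for illustration, and it expects input files that end in .fna. Run it on a worker node, as before.

```shell
# Program to run; override MUMMER if mummer is not on your PATH.
MUMMER=${MUMMER:-mummer}

# sweep_lengths REF QRY LEN... (hypothetical helper):
# runs mummer -maxmatch once per minimum match length and
# prints the name of each output file it writes.
sweep_lengths() {
    ref=$1
    qry=$2
    shift 2
    for len in "$@"; do
        # Output name built from the input names without the .fna ending.
        out="${ref%.fna}_${qry%.fna}_l${len}.mums"
        "$MUMMER" -maxmatch -b -c -l "$len" "$ref" "$qry" > "$out"
        echo "$out"
    done
}

# Usage:
#   sweep_lengths ref.fna qry.fna 20 30 40
```

Each output file can then be passed to mummerplot separately, so that you can see how the amount of noise changes with the length setting.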
