Assignment 14

Your name:
Your email address:

1. (20 minutes) One Genome in different data banks:

Go to the ENTREZ genome section.
Search for the genome from Aeropyrum pernix.

Try to find out which type of organism Aeropyrum pernix is, how many proteins are encoded in its genome, and how many other genes it contains.
(tip: select genome summary here)

To which phylum does Aeropyrum pernix belong?
How many genes does the NCBI annotation recognize?
How many of these encode proteins?
What might the other genes encode?

Go to the DOE IMG site and select Aeropyrum pernix (link here)

How many protein and structural RNA coding genes are recognized by the Joint Genome Institute?
What can explain the difference?

Select the link to protein coding genes (in the box below genome statistics), then select three or more proteins annotated only as a hypothetical protein. (Select a few at random; don't pick the first three.)
Do any of the links provide a more detailed description of the possible function or type of protein fold?
For each of the hypothetical proteins, give the Locus Tag, the length of the ORF, and the possible function or domain.

Select the link to protein coding genes (see above). Under preference (link on upper right) select 5000 as the number of returned search results, reload the page, then search with the filter term EC:3.6.3.14 in gene product name. Select APE2326 (click on the link in the first column) and inspect the gene neighborhood using the Show ortholog neighborhood regions option (link in the second box below "Evidence for protein function prediction"). Repeat this for a few other ATP synthase subunits.

Select the "find gene" link in the header and search for locus tag APE0405 in Aeropyrum. Again inspect the gene neighborhood using the Show ortholog neighborhood regions option.

Does APE2326 encode a vacuolar ATPase subunit? Is it part of the operon in Aeropyrum?
Can you find the gene in Aeropyrum that encodes subunit I of the vacuolar ATPase? What is its Locus Tag?
Which ATPase subunits are encoded next to each other in the Aeropyrum genome?

2. (20 minutes) Develop a question to address using TAX PLOT (see below for inspiration; see the microbial genomes page for possible organisms). Use the pulldown menu in Tax Plot for selection. (In this plot each circle represents a single protein from the query genome, plotted by its BLAST scores to the highest scoring protein from each of the selected organisms. THINK ABOUT THIS FOR A SECOND before you start clicking.)
What does it mean if an ORF is plotted close to one of the axes? Select two reference genomes appropriate for your question (see below for examples).
Change the zoom by clicking on the graphic. Select a different function, then click compare.

A) Your question:

B) Your genome:

C) Your two reference genomes:

D) Which candidate genes did you find?:

For example:

If you ask the question: which genes in Treponema pallidum are candidates for having been transferred from the archaeal domain into this genome, you would select the Treponema pallidum as your query genome. To look for genes transferred from the archaea, you need to select one bacterial genome (a deep branching one would be nice, if there is such a thing), and an archaeal genome. Aquifex aeolicus and Archaeoglobus fulgidus would be suitable. The putatively transferred genes will have higher scores to the Archaeoglobus homolog than to the Aquifex homolog.

If you look for halobacterial (archaeal) genes in cyanobacteria, you could select B. subtilis (B. = Bacillus) and H. sp. NRC1 (H. = Halobacterium, which is an archaeon, not a bacterium!) as reference genomes and the genome from Synechocystis sp. PCC6803 as the genome to analyze (i.e. your query).

If you look for proteobacterial genes in Prochlorococcus, you could choose one of the Prochlorococcus genomes to analyze and use E. coli and a Synechococcus genome as references.
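The reasoning behind the examples above can be sketched in a few lines of Python. This is only an illustration, not how Tax Plot itself works internally: the protein names, the scores, and the `min_ratio` cutoff are all invented, and you would normally read the candidates off the plot by eye.

```python
# Hypothetical BLAST scores for four query proteins:
# name -> (score to best hit in reference 1, score to best hit in reference 2).
# Every name and number here is made up for illustration.
scores = {
    "protA": (310.0, 95.0),
    "protB": (120.0, 125.0),
    "protC": (40.0, 260.0),
    "protD": (0.0, 180.0),
}


def flag_candidates(scores, min_ratio=2.0):
    """Return proteins that score at least min_ratio times higher
    against reference 2 than against reference 1."""
    candidates = []
    for name, (ref1, ref2) in sorted(scores.items()):
        # A protein with no hit in reference 1 (score 0) lies on the
        # reference 2 axis and is flagged whenever it has any hit at all.
        if ref2 > 0 and ref2 >= min_ratio * ref1:
            candidates.append(name)
    return candidates


print(flag_candidates(scores))
```

In the Treponema example, reference 1 would be Aquifex and reference 2 would be Archaeoglobus. Proteins near the diagonal (like protB here) are about equally similar to both references and are not flagged.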

What do the two coordinates represent?
What are the individual dots? If substitutions were fixed in the different genes in a clock like fashion, and if there were no Horizontal Gene Transfer, where would all the ORFs end up?

4. Gene plot is a program provided through the NCBI that compares two genomes against each other using blastp on the encoded ORFs. It is great for quickly exploring synteny between two genomes. Its drawbacks are that it concatenates all ORFs encoded on plasmids and chromosomes into a single file, and that only some genomes are available. Try a few comparisons in gene plot (Frankia, Prochlorococcus, Borrelia, Leptospira or Mycobacteria).

Download 2 (or more) genomes (complete DNA sequence of the main chromosome) from related organisms that seem worth analyzing from here.

E.g., one could use (right click, save as)
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Frankia_CcI3_uid58397/NC_007777.fna
and
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Frankia_alni_ACN14a_uid58695/NC_008278.fna

or
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Borrelia_garinii_PBi_uid58125/NC_006156.fna
and
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Borrelia_burgdorferi_B31_uid57581/NC_001318.fna

or
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Aeromonas_hydrophila_ATCC_7966_uid58617/NC_008570.fna
and
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Aeromonas_hydrophila_ML09_119_uid205540/NC_021290.fna
(and
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Aeromonas_salmonicida_A449_uid58631/NC_009348.fna
and
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Aeromonas_veronii_B565_uid66323/NC_015424.fna )

Choose genomes that you already tried in gene plot (Frankia), (Borrelia), (Leptospira) or (Mycobacteria), or in an earlier assignment, and that resulted in interesting plots. Use the FASTA-formatted DNA sequence files that contain a single nucleotide sequence. If you want to analyze the main chromosome, make sure that this is the file you download! (A file that contains a genome is larger than 50 kByte.)
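If you want to double-check a downloaded file before running mummer, a small Python sketch like the following could count the sequences in it and report its size. The function names and the demo file are made up for this example; the 50 kByte cutoff is the rule of thumb given above.

```python
import os
import tempfile


def fasta_summary(path):
    """Return (number of sequences, number of residues, file size in bytes)."""
    n_records = 0
    n_bases = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                n_records += 1          # each header line starts a new sequence
            elif line:
                n_bases += len(line)    # sequence lines: count the characters
    return n_records, n_bases, os.path.getsize(path)


def looks_like_single_chromosome(path, min_bytes=50_000):
    """True if the file holds exactly one sequence and exceeds min_bytes."""
    n_records, _, size = fasta_summary(path)
    return n_records == 1 and size > min_bytes


# Demo on a tiny made-up file; replace the path with your own .fna file.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.fna")
    with open(demo, "w") as out:
        out.write(">demo_sequence\nACGTACGT\nACGT\n")
    print(fasta_summary(demo))
    print(looks_like_single_chromosome(demo))
```

The demo file is far too small to be a genome, so the second check reports False for it; a real main chromosome file should pass both conditions.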

Run mummer on the cluster using default conditions. First, log on to the head node of the cluster with the Terminal application, then connect to a worker node with "qlogin":

ssh mcb3421_u#@bbcxsrv1.biotech.uconn.edu (enter password)

qlogin

Transfer your genomes to the cluster using filezilla or an alternative approach (ssh client / fugu)

It is a good idea to rename the NC_... files so that you know which genomes you are working with.

Now run mummer: (Mummer is similar to dotlet, but it aligns whole genomes, and it is very, very fast.)

mummer -mum -b -c ref.fna qry.fna > ref_qry.mums
Example: mummer -mum -b -c B_garinii.fna B_burg.fna > B_garinii_B_burg.mums
(type mummer to get information on the options; go to http://mummer.sourceforge.net/manual/ for in-depth info)
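If you are curious what is inside the .mums file, the following Python sketch reads it into a list of matches. It assumes the layout that mummer typically writes for a single reference sequence: a header line starting with ">" for each query (with the word "Reverse" appended for reverse-complement matches when -b is used), followed by lines giving the reference position, the query position, and the match length. The example text below is invented, not real output.

```python
def parse_mums(lines):
    """Parse mummer match lines into (strand, ref_start, qry_start, length)."""
    matches = []
    strand = "+"
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: a trailing "Reverse" marks reverse-complement matches.
            strand = "-" if line.endswith("Reverse") else "+"
            continue
        fields = line.split()
        # Assumption: the last three columns are ref start, query start, length.
        ref_start, qry_start, length = (int(x) for x in fields[-3:])
        matches.append((strand, ref_start, qry_start, length))
    return matches


# Invented example in the assumed format.
example = """> B_burg
      1201      1187        35
     40210     39876        22
> B_burg Reverse
     15000    880000        41
""".splitlines()

for match in parse_mums(example):
    print(match)
```

For a real file you could pass an open file handle instead of the example lines, e.g. `parse_mums(open("B_garinii_B_burg.mums"))`.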

exit (to get back to the head node)

(mummerplot is a script that uses gnuplot)

mummerplot --postscript --prefix=ref_qry ref_qry.mums

(To keep track of things, you could replace ref and qry with the names you gave the genomes.)

(OLD, now incorporated into mummerplot; you don't need to type this: gnuplot ref_qry.gp

You will get a warning about "deprecated syntax"; ignore it.)

If you use a Mac, you can view the postscript output (ref_qry.ps) with Preview. On a Windows PC you might need to install a program to display postscript files (GSview is one option). Alternatively, you could replace the --postscript option with --png --large.

Is the result similar to the one you obtained with geneplot at the NCBI? (Two of the gene plots for the different Aeromonas strains are here and here). What can explain the difference?

If you don't have time, here (Frankia), here (Mycobacteria) and here (Borrelia) are some example outputs to look at.

If you have time, re-run mummer with the -maxmatch option (instead of -mum), and select different minimum match lengths. The default is 20 nucleotides; if you have too much noise, you could try increasing the length to 30 or 40 by setting the parameter -l 30 or -l 40. If you have too few matches, decrease the length parameter.
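To get a feeling for how the minimum match length changes the amount of noise, you could count how many matches survive at different cutoffs. The Python sketch below does this for a short, invented list of matches given as (reference position, query position, length); it only mimics the effect of -l on a list you already have and does not run mummer.

```python
# Invented matches: (reference position, query position, match length).
matches = [
    (1201, 1187, 35),
    (40210, 39876, 22),
    (15000, 880000, 41),
    (500, 910, 20),
]


def count_by_min_length(matches, thresholds=(20, 30, 40)):
    """For each threshold, count the matches at least that long."""
    return {
        t: sum(1 for _, _, length in matches if length >= t)
        for t in thresholds
    }


print(count_by_min_length(matches))
```

Short matches are more likely to occur by chance, so raising the cutoff removes scattered dots from the plot first, while long conserved stretches remain.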
