Assignment 14

Your name:
Your email address:

Slides with examples are here !

1. (20 Minutes) Develop a question to address using TAX PLOT (see below for inspiration, see the microbial genomes page for possible organisms). Use the pulldown menu in Tax Plot for selection. (In this plot each circle represents a single protein from the query genome, plotted by its BLAST scores to the highest scoring protein from each of the selected organisms. THINK ABOUT THIS FOR A SECOND before you start clicking.
What does it mean if an ORF is plotted close to one of the axes? Select two reference genomes appropriate for your question (see below for examples).
Change the zoom when you click at the graphic. Select a different function, then click compare.

A) Your question:

B) Your genome:

C) Your two reference genomes:

D) Which candidate genes did you find?:

For example:

If you ask the question: which genes in Treponema pallidum are candidates for having been transferred from the archaeal domain into this genome, you would select the Treponema pallidum as your query genome. To look for genes transferred from the archaea, you need to select one bacterial genome (a deep branching one would be nice, if there is such a thing), and an archaeal genome. Aquifex aeolicus and Archaeoglobus fulgidus would be suitable. The putatively transferred genes will have higher scores to the Archaeoglobus homolog than to the Aquifex homolog.

If you look for halobacterial (archaeal) genes in cyanobacteria you could select B. subtilis (B.=Bacillus) and H.sp NRC1 (-Halobacterium, which is an archaeon, not a bacterium!) as reference genomes and the genome from Synechocystis sp PCC6803 as the genome to analyze (i.e. your query)

If you look for proteobacterial genes in Prochlorococcus, you could choose one of the Prochlorococcus genomes to analyze and use E. coli and a Synecchococcus genome as references.

What do the two coordinates represent?
What are the individual dots? If substitutions were fixed in the different genes in a clock like fashion, and if there were no Horizontal Gene Transfer, where would all the ORFs end up?

2) Handling genome (or larger) amounts of data -- Extracting text from other applications

EMBOSS is installed on the cluster. Here is a list of programs in EMBOSS. Today we will be using pepstats. Click on its entry in the list to see the command line arguments.

Download one or many "genomes" of your choice from NCBI. The easiest is to use the ftp server: ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/. Alternatively, especially if the genome is not jet on the ftp server, select the genome link at the NCBI, follow the link to the "bioproject," click on the number in protein links (protein sequences nnnn links), then use the "send to" -> file pull down menu to create a multiple fasta file for all the linked sequences. Since we will be using pepstats, be sure to grab the protein sequence ".faa" file(s). You also can save linked encoded proteins files from the taxonomy browser (click on protein in header, then display, click on the number, save as file in fasta sormat).

First, log on to the head node of the cluster, with the Terminal application, then connect to a worker node with "qlogin":

ssh username@bbcsrv3.biotech.uconn.edu (enter password)

qlogin

Move your multiple sequence fasta file to a directory on the cluster. (As you are transferring files, you might as well move this R-script, and the scripts parse_pepstats.pl and parse_pepstats_mod2.pl into the same directory.)
The script histogramScript_pdf.R creates a histogram from data in a file called indata. The first line in indata contains the name of the data (no spaces), the following lines contain the the data (one data point per line). The output of the script are two pdf files, plot1.pdf contains a histogram of the data, plot2.pdf compares the histogram of the data in indata to a normal distribution. More information is in the documentation file (MS-Word formated file).

(This is for information, do not type this yet) parse_pepstats_mod2.pl calls the histogram script, which needs to be in the same directory. parse_pepstats.pl creates tables that you can import and analyze in Excel.

Note: the modified program parse_pepstats_mod2.pl works on all "pepstats" files in the directory.

To run the EMBOSS program from the command line, type: pepstats genome.faa -outfile genome.pepstats
where "genome.faa" should be the name of the file that contains the protein sequences encoded in the genome, and "genome.pepstats" is the output file.

Check the output file generated by pepstats using a text editor.

We will use parse_pepstats.pl or parse_pepstats_mod2.pl to extract the isoelectric point for all proteins in the faa file:

Read through the script. Try to figure out how the program finds the values for the theoretical isoelectric points in the pepstats output. Execute the program:

perl parse_pepstats.pl genome.pepstats
where genome.pepstats is the name of outfile (see above)

perl parse_pepstats_mod2.pl genome.pepstats
where genome.pepstats is the name of outfile (see above)

parse_pepstats.pl will generate three files, with suffixes ".pI", ".pos_charged", and ".parsed".

Use the ".pI" file (isoelectric points) to construct a histogram (or the .parsed file). You can use the histogram_script (via parse_pepstats_mod2.pl) or try to use an Excel addon for this.

Which genomes did you analyze?
Describe the distribution of isoelectric points in your selected genome. How many peaks?
Why might there be a minimum at around pH7?
Compare your finding with others in the class. Do thermo- and halophiles have the distributions of isoelectric points?

Check a few of the ORFs with very alkaline theoretical isoelectric point (the *.parsed outfile contains accession numbers in the first column; you could use
fastacmd -s accession_number to retrieve the sequence from nr).
Which charge would these proteins have at neutral pH? Can you see a pattern in the types of enzmes?

3. Gene plot is a program provided through the NCBI that compares two genomes against each other using blastp on the encoded ORF. It is great to quickly explore synteny between two genomes, its problem is that it just concatenates all ORF encoded on plasmids and chromosomes into a single file, and that only some genomes are available. Try a few comparisons in gene plot (Frankia, Prochlorococcus, Borrelia, Leptosira or Mycobacteria.

Download 2 (or more) genomes (complete DNA sequence of the main chromosome) from related organisms that seem worth anayzing from here.

E.g., one could use (right click, save as)
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Frankia_CcI3_uid58397/NC_007777.fna
and
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Frankia_alni_ACN14a_uid58695/NC_008278.fna

or
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Borrelia_garinii_PBi_uid58125/NC_006156.fna
and
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Borrelia_burgdorferi_B31_uid57581/NC_001318.fna

or
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Aeromonas_hydrophila_ATCC_7966_uid58617/NC_008570.fna
and
fftp://ftp.ncbi.nih.gov/genomes/Bacteria/Aeromonas_hydrophila_ML09_119_uid205540/NC_021290.fna
(and
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Aeromonas_salmonicida_A449_uid58631/NC_009348.fna
and
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Aeromonas_veronii_B565_uid66323/NC_015424.fna )

Choose genomes that you already tried in gene plot (Frankia), (Borrelia), (Leptosira) or (Mycobacteria) or in an earlier assignment, and that resulted in interesting plots (use the fasta formatted DNA sequence files that contain a single nucleotide sequence. If you want to analyze the main chromosome, make sure that this is the file you download! (a file that contains a genome is larger than 50 kByte).

Run mummer on the cluster using default conditions. First, log on to the head node of the cluster, with the Terminal application, then connect to a worker node with "qlogin":

ssh username@bbcsrv3.biotech.uconn.edu (enter password)

qlogin

Transfer your genomes to the cluster using filezilla or an alternative approach (ssh client / fugu)

It is a good idea to rename the NC_... files so that you know which genomes you are working with.

Now run mummer: (Mummer is similar to dotlet, but it aligns whole genomes, and it is very, very fast.)

mummer -mum -b -c ref.fna qry.fna > ref_qry.mums
Example: mummer -mum -b -c B_garinii.fna B_burg.fna > B_garinii_B_burg.mums
(type mummer to get information on the options, go to http://mummer.sourceforge.net/manual/ for in deapth info)

exit (to get to the head node)

(mummerplot is a script that uses gnuplot)

mummerplot --postscript --prefix=ref_qry ref_qry.mums

(To keep track of things, you could replace ref and qry with the names you gave the genomes.

(OLD -- now inorporated into mummerplot- you don't need to type this gnuplot ref_qry.gp

You will get a warning about "depreciated syntax" - ignore it. )

If you use a Mac, you can view the postscript output (ref_qry.ps) with preview. On a Windows PCyou might need to install a program to display postscript files (GSview is one option). Alternatively, you could replace the --postscript option with --png --large.

Is the result similar to the one you obtained with geneplot at the NCBI? (Two of the gene plots for the different Aeromonas strains are here and here). What can explain the difference?

If you don't have time, here (Frankia), here (Mycobacteria) and here (Borrelia) are some example outputs to look at.

If you have time, re-run mummer with the -maxmatch option (instead of -mum), and select different lenghts of matches. The default is 20 nucleotides, if you have too much noise, you could try to increase the lenght to 30 or 40 setting the parameter -l 30 or -l 40. If you have too few matches decrease the lengths parameter.

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.