Your name: Your email address:
Slides with examples are here !
1. (20 Minutes) Develop a question to address using TAX PLOT (see below for inspiration, see the microbial genomes page for possible organisms). Use the pulldown menu in Tax Plot for selection. (In this plot each circle represents a single protein from the query genome, plotted by its BLAST scores to the highest scoring protein from each of the selected organisms. THINK ABOUT THIS FOR A SECOND before you start clicking. What does it mean if an ORF is plotted close to one of the axes? Select two reference genomes appropriate for your question (see below for examples). Change the zoom when you click at the graphic. Select a different function, then click compare. A) Your question: B) Your genome: C) Your two reference genomes: D) Which candidate genes did you find?:
For example: If you ask the question: which genes in Treponema pallidum are candidates for having been transferred from the archaeal domain into this genome, you would select the Treponema pallidum as your query genome. To look for genes transferred from the archaea, you need to select one bacterial genome (a deep branching one would be nice, if there is such a thing), and an archaeal genome. Aquifex aeolicus and Archaeoglobus fulgidus would be suitable. The putatively transferred genes will have higher scores to the Archaeoglobus homolog than to the Aquifex homolog. If you look for halobacterial (archaeal) genes in cyanobacteria you could select B. subtilis (B.=Bacillus) and H.sp NRC1 (-Halobacterium, which is an archaeon, not a bacterium!) as reference genomes and the genome from Synechocystis sp PCC6803 as the genome to analyze (i.e. your query) If you look for proteobacterial genes in Prochlorococcus, you could choose one of the Prochlorococcus genomes to analyze and use E. coli and a Synecchococcus genome as references.
For example:
EMBOSS is installed on the cluster. Here is a list of programs in EMBOSS. Today we will be using pepstats. Click on its entry in the list to see the command line arguments.
Download one or many "genomes" of your choice from NCBI. The easiest is to use the ftp server: ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/. Alternatively, especially if the genome is not jet on the ftp server, select the genome link at the NCBI, follow the link to the "bioproject," click on the number in protein links (protein sequences nnnn links), then use the "send to" -> file pull down menu to create a multiple fasta file for all the linked sequences. Since we will be using pepstats, be sure to grab the protein sequence ".faa" file(s). You also can save linked encoded proteins files from the taxonomy browser (click on protein in header, then display, click on the number, save as file in fasta sormat).
First, log on to the head node of the cluster, with the Terminal application, then connect to a worker node with "qlogin":
ssh username@bbcsrv3.biotech.uconn.edu (enter password) qlogin
Move your multiple sequence fasta file to a directory on the cluster. (As you are transferring files, you might as well move this R-script, and the scripts parse_pepstats.pl and parse_pepstats_mod2.pl into the same directory.) The script histogramScript_pdf.R creates a histogram from data in a file called indata. The first line in indata contains the name of the data (no spaces), the following lines contain the the data (one data point per line). The output of the script are two pdf files, plot1.pdf contains a histogram of the data, plot2.pdf compares the histogram of the data in indata to a normal distribution. More information is in the documentation file (MS-Word formated file).
(This is for information, do not type this yet) parse_pepstats_mod2.pl calls the histogram script, which needs to be in the same directory. parse_pepstats.pl creates tables that you can import and analyze in Excel.
Note: the modified program parse_pepstats_mod2.pl works on all "pepstats" files in the directory.
To run the EMBOSS program from the command line, type: pepstats genome.faa -outfile genome.pepstats where "genome.faa" should be the name of the file that contains the protein sequences encoded in the genome, and "genome.pepstats" is the output file.
We will use parse_pepstats.pl or parse_pepstats_mod2.pl to extract the isoelectric point for all proteins in the faa file:
Read through the script. Try to figure out how the program finds the values for the theoretical isoelectric points in the pepstats output. Execute the program:
perl parse_pepstats.pl genome.pepstats where genome.pepstats is the name of outfile (see above)
or
perl parse_pepstats_mod2.pl genome.pepstats where genome.pepstats is the name of outfile (see above)
parse_pepstats.pl will generate three files, with suffixes ".pI", ".pos_charged", and ".parsed".
Use the ".pI" file (isoelectric points) to construct a histogram (or the .parsed file). You can use the histogram_script (via parse_pepstats_mod2.pl) or try to use an Excel addon for this.
Which genomes did you analyze? Describe the distribution of isoelectric points in your selected genome. How many peaks? Why might there be a minimum at around pH7? Compare your finding with others in the class. Do thermo- and halophiles have the distributions of isoelectric points?
Check a few of the ORFs with very alkaline theoretical isoelectric point (the *.parsed outfile contains accession numbers in the first column; you could use fastacmd -s accession_number to retrieve the sequence from nr). Which charge would these proteins have at neutral pH? Can you see a pattern in the types of enzmes?
3. Gene plot is a program provided through the NCBI that compares two genomes against each other using blastp on the encoded ORF. It is great to quickly explore synteny between two genomes, its problem is that it just concatenates all ORF encoded on plasmids and chromosomes into a single file, and that only some genomes are available. Try a few comparisons in gene plot (Frankia, Prochlorococcus, Borrelia, Leptosira or Mycobacteria.
Download 2 (or more) genomes (complete DNA sequence of the main chromosome) from related organisms that seem worth anayzing from here.
E.g., one could use (right click, save as) ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Frankia_CcI3_uid58397/NC_007777.fna and ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Frankia_alni_ACN14a_uid58695/NC_008278.fna
or ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Borrelia_garinii_PBi_uid58125/NC_006156.fna and ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Borrelia_burgdorferi_B31_uid57581/NC_001318.fna
or ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Aeromonas_hydrophila_ATCC_7966_uid58617/NC_008570.fna and fftp://ftp.ncbi.nih.gov/genomes/Bacteria/Aeromonas_hydrophila_ML09_119_uid205540/NC_021290.fna (and ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Aeromonas_salmonicida_A449_uid58631/NC_009348.fna and ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Aeromonas_veronii_B565_uid66323/NC_015424.fna )
Choose genomes that you already tried in gene plot (Frankia), (Borrelia), (Leptosira) or (Mycobacteria) or in an earlier assignment, and that resulted in interesting plots (use the fasta formatted DNA sequence files that contain a single nucleotide sequence. If you want to analyze the main chromosome, make sure that this is the file you download! (a file that contains a genome is larger than 50 kByte).
Run mummer on the cluster using default conditions. First, log on to the head node of the cluster, with the Terminal application, then connect to a worker node with "qlogin":
Transfer your genomes to the cluster using filezilla or an alternative approach (ssh client / fugu)
It is a good idea to rename the NC_... files so that you know which genomes you are working with.
Now run mummer: (Mummer is similar to dotlet, but it aligns whole genomes, and it is very, very fast.)
mummer -mum -b -c ref.fna qry.fna > ref_qry.mums Example: mummer -mum -b -c B_garinii.fna B_burg.fna > B_garinii_B_burg.mums (type mummer to get information on the options, go to http://mummer.sourceforge.net/manual/ for in deapth info)
exit (to get to the head node)
(mummerplot is a script that uses gnuplot)
mummerplot --postscript --prefix=ref_qry ref_qry.mums
(To keep track of things, you could replace ref and qry with the names you gave the genomes.
(OLD -- now inorporated into mummerplot- you don't need to type this gnuplot ref_qry.gp You will get a warning about "depreciated syntax" - ignore it. )
If you use a Mac, you can view the postscript output (ref_qry.ps) with preview. On a Windows PCyou might need to install a program to display postscript files (GSview is one option). Alternatively, you could replace the --postscript option with --png --large.
Is the result similar to the one you obtained with geneplot at the NCBI? (Two of the gene plots for the different Aeromonas strains are here and here). What can explain the difference?
If you don't have time, here (Frankia), here (Mycobacteria) and here (Borrelia) are some example outputs to look at.
If you have time, re-run mummer with the -maxmatch option (instead of -mum), and select different lenghts of matches. The default is 20 nucleotides, if you have too much noise, you could try to increase the lenght to 30 or 40 setting the parameter -l 30 or -l 40. If you have too few matches decrease the lengths parameter.
Send email to your instructor (and yourself) upon submit Send email to yourself only upon submit (as a backup) Show summary upon submit but do not send email to anyone.