====================================================================
|This package is free software; you can redistribute it and/or     |
|modify it under the terms of the GNU General Public License       |
|as published by the Free Software Foundation; either version 2    |
|of the License, or (at your option) any later version.            |
|                                                                  |
|This package is distributed in the hope that it will be useful,   |
|but WITHOUT ANY WARRANTY; without even the implied warranty of    |
|MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the      |
|GNU General Public License for more details.                      |
|                                                                  |
|                                                                  |
|Authors:                                                          |
| Olga A. Zhaxybayeva                                              |
| J. Peter Gogarten                                                |
| Department of Molecular and Cell Biology                         |
| University of Connecticut                                        |
| 1999-2001                                                        |
|                                                                  |
|Citation:                                                         |
| Olga Zhaxybayeva and J. Peter Gogarten. QuartOP analysis package.|
| Distributed by authors, Department of Molecular and Cell Biology.|
| University of Connecticut.                                       |
|                                                                  |
|Questions and/or Comments: olga.zh@uconn.edu or gogarten@uconn.edu|
====================================================================

=================================================
DOCUMENTATION VERSION ALPHA 0.1
NOVEMBER 29, 2001
PLEASE CHECK HTTP://CARROT.MCB.UCONN.EDU/QUARTETS
FOR THE LATEST VERSION
=================================================

==============
Prerequisites:
==============
*Unix operating system (the package is tested on Red Hat Linux, but
 should work under any Unix)
*SEALS package (Walker and Koonin, 1997) -
 http://www.ncbi.nlm.nih.gov/CBBresearch/Walker/SEALS/index.html
*JDK v. 1.2 or higher - http://java.sun.com
*PAL library (Strimmer, 2001) - http://www.pal-project.org/
*MySQL database - http://www.mysql.com/
*GnuPlot v. 3.7 - http://www.gnuplot.org/
*Perl 5 - http://www.cpan.org/
*local copies of the databases: nr, COG, completed genomes (available
 through NCBI's FTP site) in FASTA format (amino acid sequences).
 The databases need to be formatted with formatdb (from the BLAST
 package)
*local BLAST - ftp://ftp.ncbi.nlm.nih.gov/blast/
*ClustalW v. 1.8 - ftp://ftp.ebi.ac.uk/pub/software/unix/clustalw/
*TREE-PUZZLE v. 5.0 - http://www.tree-puzzle.de/

==============
Installation:
==============
1) You need to create the following environment variables:
     $QUARTOP_DB_FOLDER - folder where all databases are located
     $MYSQL_NAME        - username for account in MySQL
     $MYSQL_PSWD        - password for account in MySQL
     $QUARTOP_HOME      - home folder of the package
2) Add the path to the package home folder to your $PATH

=================================
Preparations to run the programs:
=================================
1) All local databases have to be formatted with the formatdb program
   (formatdb -i name_of_the_database -o T -p T) and indexed with
   fasta2index (through SEALS).
   "nr" has to be no more than 30 days old (updates are available
   through NCBI's FTP site at ftp://ftp.ncbi.nlm.nih.gov/pub/blast/db/).
   Every record of the completed genome databases has to have a unique
   GI number assigned to it. This is usually the case if the genomes
   are downloaded from NCBI's FTP site
   (ftp://ftp.ncbi.nlm.nih.gov/genomes/). For incomplete genomes, fake
   GIs may be assigned through the SEALS package using the
   "assign_subversive_gis" function.
   All genome database names have to have the .faa extension (except
   nr). In addition, the COG database has to be prepared using the
   COG_trans1.pl and COG_trans2.pl scripts (see below).
   All databases have to be located in the $QUARTOP_DB_FOLDER directory.
2) The MySQL daemon has to be running with an existing account
   specified in the environment variables (see the Installation section
   above). Because the username and password are stored insecurely (as
   environment variables), it is recommended to use a separate account
   with limited rights for QUARTOPs analyses.
3) SEALS has to be activated at least in the shell you are going to
   run the analysis from (this depends on the SEALS installation; check
   the SEALS installation notes).
4) The scripts assume that paths to all prerequisite packages are
   added to your $PATH variable.
5) Every genome quartet has to be run in an empty folder, since the
   scripts use globs like *.fa etc.

========================
How to run the analysis:
========================
Execute the following scripts in this order:
1) script_creator3.pl (see instructions below)
2) script.sh
3) gnuplot ml_plot.p
4) gv ml_map.ps (gv is GhostViewer, http://www.cs.wisc.edu/~ghost/,
   but you can view the PS file with other programs).

For more on where to look for results, see the "Results" section below.
"script.sh" automatically calls additional programs/scripts, which are
described in the section below.

====================================
Description of the scripts/programs:
====================================
1) PREPARING COG DATABASE
   COG_trans1.pl - operates on the COGs.txt file. Creates the COG
                   look-up table used to calculate the distribution of
                   QUARTOPs among functional categories.
   COG_trans2.pl - adds the COG name to the FASTA definition line.
                   Operates on the COGs folder in the COGs database
                   distribution.
   All created *.faa files have to be concatenated into one file called
   COGs.faa, and that file has to be formatted with the formatdb
   program.

2) CREATION OF GENOME QUARTET SPECIFIC SCRIPTS
   script_creator3.pl - script that creates the main executable file
   for the analysis, as well as additional necessary files that are
   customized for the selected genome quartet.

   Usage: script_creator3.pl genome1 genome2 genome3 genome4

   "genome1 genome2 genome3 genome4" are the names of the files that
   correspond to the genome databases. As newly sequenced genomes
   become available, their names have to be added to the following
   hash tables: %genomes, %fullname, %name10.
   By default most of the names are in the form "xyyy", where x=first
   letter of the genus name, yyy=first three letters of the species
   name. If a genome consists of more than one chromosome/plasmid, all
   the ORFs were concatenated into one file with the name
   "xyyy_genome". If different names are desired, the script has to be
   changed accordingly.
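The default naming convention can be illustrated with a minimal sketch.
It is not part of the package: the function name default_genome_name,
the lower-casing of the result, and the example organisms are
assumptions made for illustration only. The authoritative names are the
ones stored in the %genomes, %fullname and %name10 hash tables.

```python
def default_genome_name(genus: str, species: str,
                        multi_replicon: bool = False) -> str:
    """Build an "xyyy" style name: first letter of the genus name plus
    the first three letters of the species name.

    Lower-casing is an assumption; the README does not state the case.
    """
    name = (genus[:1] + species[:3]).lower()
    # Genomes with more than one chromosome/plasmid use "xyyy_genome".
    return name + "_genome" if multi_replicon else name


if __name__ == "__main__":
    print(default_genome_name("Escherichia", "coli"))
    print(default_genome_name("Vibrio", "cholerae", multi_replicon=True))
```

A name produced this way still has to be present in the three hash
tables before script_creator3.pl will recognize it.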
   script_creator3.pl creates 6 files:
     script.sh    - main executable file with executable permissions
     mysql.script - instructions for MySQL to run in batch mode
     treefile1, treefile2, treefile3 - the three possible unrooted
                    tree topologies for the genome quartet in NH format
     ml_plot.p    - instructions for GNUPlot to run in batch mode

3) COGnification OF QUARTOPS
   COGnify.pl - assigns a COG category to each of the QUARTOPs and
                generates the distribution of QUARTOPs among
                functional categories

4) ANALYSES OF LOG-LIKELIHOODS:
   The following Java programs are used:
   lv_analysis.java   - creates the file lv_analyzed99.out (see
                        Results section).
   lv_analysis90.java - creates the file lv_analyzed90.out (see
                        Results section).
   lv_analysis2.java  - converts posterior probabilities (barycentric
                        coordinates) into the cartesian coordinate
                        system.
   lv_analysis3.java  - creates lists of datasets that belong to the
                        three corners (i.e. have posterior probability
                        above 99%)

=========
Results:
=========
The results are in the following files:

Files "lv_analyzed99.out" and "lv_analyzed90.out" contain the
information on how many quartets support each tree topology in each
"zone". This data has to be entered into the ml_plot.p file manually,
and then the "gnuplot ml_plot.p" command has to be executed to obtain a
PostScript plot of the likelihood values mapped onto an equilateral
triangle (ml_map.ps).

The datasets are located under the "shattered" subfolder.

The QUARTOPS with posterior probabilities above 99% are sorted into 3
folders: "corner_1", "corner_2" and "corner_3" under the "shattered"
folder. Each of those subfolders contains the results of
COGnification. The summary files are called "distrib.out"
(distribution of COGs among the functional categories) and
"COGnified.tab" (definition lines, assigned COG category and dataset
number for each QuartOP). For the abbreviations of the COG functional
categories go to the COG home page at http://www.ncbi.nlm.nih.gov/COG/.
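The conversion performed by lv_analysis2.java and the corner selection
performed by lv_analysis3.java can be sketched as follows. This is an
illustration under stated assumptions, not the code shipped with the
package: the function names, the vertex layout of the triangle
(topology 1 at the top, topologies 2 and 3 at the base), and the
derivation of posterior probabilities from log-likelihoods with equal
priors are all assumptions.

```python
import math

# Assumed vertex layout of the unit equilateral triangle:
# topology 1 at the apex, topologies 2 and 3 at the base corners.
VERTICES = ((0.5, math.sqrt(3) / 2), (0.0, 0.0), (1.0, 0.0))


def posteriors_from_loglik(logls):
    """Normalize three log-likelihoods into posterior probabilities,
    assuming equal prior probability for the three topologies."""
    top = max(logls)  # subtract the maximum to avoid underflow in exp()
    weights = [math.exp(value - top) for value in logls]
    total = sum(weights)
    return [w / total for w in weights]


def barycentric_to_cartesian(p):
    """Map barycentric coordinates (p1, p2, p3), which sum to 1, to an
    (x, y) point inside the triangle defined by VERTICES."""
    x = sum(pi * vx for pi, (vx, _) in zip(p, VERTICES))
    y = sum(pi * vy for pi, (_, vy) in zip(p, VERTICES))
    return x, y


def corner(p, threshold=0.99):
    """Return the 1-based index of the topology whose posterior
    probability exceeds the threshold, or None if there is none."""
    for index, pi in enumerate(p, start=1):
        if pi > threshold:
            return index
    return None


if __name__ == "__main__":
    # Hypothetical log-likelihoods for the three topologies of one dataset.
    p = posteriors_from_loglik([-1234.5, -1240.0, -1241.2])
    print(p, barycentric_to_cartesian(p), corner(p))
```

With this layout, a dataset that supports all three topologies equally
lands at the centre of the triangle, and a dataset with posterior
probability above 99% for one topology lands close to that topology's
vertex.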
==================
Auxiliary Scripts:
==================
These scripts are located in the "extras" subfolder.

1) COGnification of all ORFs in a single genome:
   COGnify_mod.pl - assigns a COG category to each of the ORFs in the
                    genome. Shows the distribution of all genes in a
                    genome among functional categories.

2) AUTOMATION of DATA ANALYSIS with MrBAYES program.
   [Requires PAUP* 4.0 and MrBAYES programs]
   process.sh - script that runs the automated analysis of QUARTOPs
                using the MrBAYES program. Operates on the QUARTOPs in
                PHYLIP format.
   Calls the following scripts/programs:
     tonexus.sh     - shell script to convert PHYLIP formatted COGs
                      into NEXUS format.
     clean_nexus.pl - cleans NEXUS files of the content that MrBayes
                      does not understand.
     create_cmd.pl  - creates the executable file run.sh
     mrbayes.block  - NEXUS block that is used by MrBayes
     counter4.pl    - calculates posterior probabilities for each
                      dataset.
     pp_analysis.java, pp_analysis2.java, pp_analysis3.java are
     analogous to the lv_analysis.java programs, but operate on
     posterior probabilities instead of likelihood values.

3) AUTOMATION of BOOTSTRAPPING ANALYSIS of QUARTOPs
   [Requires PHYLIP 3.6 package]
   seqboot.sh     - generates 100 bootstrap samples for each QUARTOP.
                    Uses "seqboot.cmd" as choices for the seqboot
                    program from the PHYLIP package
   lboot3.sh      - analyzes the bootstrap samples generated with
                    seqboot.sh and reports logL values and tree
                    topology for each of the bootstrap samples
   triangulate.pl - calculates bootstrap support values for each of
                    the tree topologies and calculates barycentric
                    coordinates from the bootstrap support values
   percent.pl     - summarizes QuartOPs by zones (total, 70%, 90%)

4) TAKING AMONG SITE RATE VARIATION INTO ACCOUNT
   If ASRV needs to be taken into account, the puzzle1.cmd through
   puzzle3.cmd files have to be replaced with the ones located under
   "extras/asrv/"

==================
Unimportant Notes:
==================
*Internal program versions are being kept for now (e.g.
 script_creator3.pl corresponds to internal version #3)

===========
Known bugs:
===========
* If the number of QUARTOPs exceeds 1000, "clustalify" fails to
  analyze more than 999 QUARTOPs in batch mode