====================================================================
|This package is free software; you can redistribute it and/or     |
|modify it under the terms of the GNU General Public License       |
|as published by the Free Software Foundation; either version 2    |
|of the License, or (at your option) any later version.            |
|                                                                  |
|This package is distributed in the hope that it will be useful,   |
|but WITHOUT ANY WARRANTY; without even the implied warranty of    |
|MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the     |
|GNU General Public License for more details.                      |
|                                                                  |
|                                                                  |
|Authors:                                                          |
| Olga A. Zhaxybayeva                                              |
| J. Peter Gogarten                                                |
| Department of Molecular and Cell Biology                         |
| University of Connecticut                                        |
| 1999-2001                                                        |
|                                                                  |
|Citation:                                                         |
| Olga Zhaxybayeva and J. Peter Gogarten. QuartOP analysis package.| 
| Distributed by authors, Department of Molecular and Cell Biology.|
| University of Connecticut.                                       |
|                                                                  |
|Questions and/or Comments: olga.zh@uconn.edu or gogarten@uconn.edu|
====================================================================

=================================================
DOCUMENTATION VERSION ALPHA 0.1
NOVEMBER 29, 2001
PLEASE CHECK HTTP://CARROT.MCB.UCONN.EDU/QUARTETS
FOR THE LATEST VERSION
=================================================

==============
Prerequisites:
==============

    *Unix operating system (package is tested on Red Hat Linux, but should work under any Unix)

    *SEALS package (Walker and Koonin, 1997) - http://www.ncbi.nlm.nih.gov/CBBresearch/Walker/SEALS/index.html

    *JDK v. 1.2 or higher - http://java.sun.com

    *PAL library (Strimmer, 2001) - http://www.pal-project.org/

    *MySQL database - http://www.mysql.com/

    *GnuPlot v. 3.7 - http://www.gnuplot.org/

    *Perl 5 - http://www.cpan.org/

    *local copies of the databases: nr, COG, 
    completed genomes (available through NCBI's FTP site) in FASTA format (a.a. sequences);
    Databases need to be formatted with formatdb (from BLAST package)

    *local BLAST - ftp://ftp.ncbi.nlm.nih.gov/blast/

    *ClustalW v. 1.8 - ftp://ftp.ebi.ac.uk/pub/software/unix/clustalw/

    *TREE-PUZZLE v. 5.0 - http://www.tree-puzzle.de/

==============
Installation:
==============

1) You need to create the following environment variables:

    $QUARTOP_DB_FOLDER - folder where all databases are located

    $MYSQL_NAME - username for account in MySQL

    $MYSQL_PSWD - password for account in MySQL

    $QUARTOP_HOME - home folder of the package

2) Add the path to the package home folder to your $PATH 
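
The two installation steps above could look like the following lines in a shell start-up file. This is only a sketch: every path, the account name, and the password are placeholders, not values shipped with the package.

```shell
# Placeholder values; substitute your own paths and MySQL credentials.
export QUARTOP_HOME="$HOME/quartop"           # assumed install location of the package
export QUARTOP_DB_FOLDER="$HOME/quartop_db"   # assumed folder holding nr, COGs and genome databases
export MYSQL_NAME="quartop_user"              # hypothetical MySQL account name
export MYSQL_PSWD="change_me"                 # placeholder password

# Step 2: make the package scripts reachable through $PATH.
export PATH="$QUARTOP_HOME:$PATH"
```

The syntax shown is for Bourne-style shells (sh, bash); csh-style shells use setenv instead of export.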

=================================
Preparations to run the programs:
=================================

1) All local databases have to be formatted with the formatdb program 
    (formatdb -i name_of_the_database -o T -p T) and indexed with fasta2index (through SEALS).
    
    "nr" has to be no more than 30 days old (updates are available through 
    NCBI's FTP site at ftp://ftp.ncbi.nlm.nih.gov/pub/blast/db/).
    
    Every record in the completed genome databases has to have a unique GI number assigned to it. 
    This is usually the case when the genomes are downloaded from NCBI's FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/).
    For incomplete genomes, fake GIs may be assigned through 
    the SEALS package using the "assign_subversive_gis" function. 
    All genome database names have to have the .faa extension (except nr).
    
    In addition, COG database 
    has to be prepared using the COG_trans1.pl and COG_trans2.pl scripts (see below).
    
    All databases have to be located in the $QUARTOP_DB_FOLDER directory. 
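
As an illustration of the formatting step, the helper below loops over "nr" and every *.faa file in one folder and runs formatdb with the flags quoted above. The function name and the dry-run argument are inventions for this sketch. The fasta2index step is left out because its invocation is not described in this document.

```shell
# Sketch: format every protein database in one folder with formatdb.
# Pass "echo" as the second argument for a dry run that only prints the commands.
format_all_databases() {
    db_dir=$1
    runner=${2:-}
    for db in "$db_dir"/nr "$db_dir"/*.faa; do
        [ -f "$db" ] || continue    # skip a missing nr or an unmatched glob
        $runner formatdb -i "$db" -o T -p T
    done
}

# Dry run against the database folder from the Installation section:
# format_all_databases "$QUARTOP_DB_FOLDER" echo
```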

2) The MySQL daemon has to be running, with an existing account specified 
    in the environment variables (see the Installation section above).
    Because the username and password are stored insecurely (as environment variables), 
    it is recommended to use a separate account with limited rights for QUARTOP analyses.

3) SEALS has to be activated at least in the shell you are going 
    to run the analysis from. (How to do this depends on the SEALS installation; 
    check the SEALS installation notes.)

4) The scripts assume that paths to all prerequisite 
    packages are added to your $PATH variable. 

5) Every genome quartet has to be run in an empty folder, since the scripts use globs 
    like *.fa etc.
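
One way to honor the empty-folder requirement is to create a fresh working folder per quartet and refuse to reuse one that already holds files. The function and the folder naming scheme below are hypothetical, not part of the package.

```shell
# Sketch: create (or reuse only if empty) a working folder for one genome quartet.
# Prints the folder name on success; fails if the folder already contains files.
new_quartet_dir() {
    dir="quartet_${1}_${2}_${3}_${4}"    # hypothetical naming scheme
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir")" ]; then
        echo "folder $dir is not empty; choose another one" >&2
        return 1
    fi
    mkdir -p "$dir" && printf '%s\n' "$dir"
}

# Example (genome names are placeholders):
# cd "$(new_quartet_dir genome1 genome2 genome3 genome4)"
```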
    
========================
How to run the analysis:
========================

Execute the following steps in this order:

1) script_creator3.pl (see instructions below)

2) script.sh

3) gnuplot ml_plot.p

4) gv ml_map.ps (gv is GhostViewer, http://www.cs.wisc.edu/~ghost/, 
    but you can view the PS file with other programs). For more on where 
    to look for results, see the "Results" section below.

The "script.sh" script automatically calls additional programs/scripts, 
    which are described in the section below.


====================================
Description of the scripts/programs:
====================================

1) PREPARING COG DATABASE

    COG_trans1.pl - operates on COGs.txt file. 
    Creates COG look-up table used to calculate distribution 
    of QUARTOPs among functional categories.

    COG_trans2.pl - adds COG name into FASTA definition line. 
    Operates on COGs folder in COGs database distribution.
    All created *.faa files have to be concatenated into one file called COGs.faa, 
    and that file has to be formatted with the formatdb program.
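
The concatenation step can be sketched as below. The function name is made up, and where COG_trans2.pl writes its *.faa files is an assumption here, so the source folder is passed as an argument.

```shell
# Sketch: concatenate the per-COG *.faa files into a single COGs.faa.
build_cogs_faa() {
    src_dir=$1    # folder with the *.faa files produced by COG_trans2.pl (assumed)
    out_dir=$2    # destination folder; keep it different from src_dir
    cat "$src_dir"/*.faa > "$out_dir/COGs.faa"
}

# Afterwards, format the result with the flags used for the other databases:
# formatdb -i "$QUARTOP_DB_FOLDER/COGs.faa" -o T -p T
```

Writing COGs.faa into a separate folder keeps a previous COGs.faa from matching the *.faa glob and being fed back into itself on a rerun.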
    

2) CREATION OF GENOME QUARTET SPECIFIC SCRIPTS

    script_creator3.pl - script that creates main executable file for 
    the analysis as well as additional 
    necessary files that are customized for the selected genome quartet.

    Usage: script_creator3.pl genome1 genome2 genome3 genome4
    "genome1 genome2 genome3 genome4" are the names of the files 
    that correspond to the genome databases.    
    As newly sequenced genomes become available, their names have to be added 
    to the following hash tables: %genomes, %fullname, %name10. 
    By default most of the names are in the form "xyyy", where x=first letter 
    of the genus name, yyy=first three letters of the species name.
    If a genome consists of more than one chromosome/plasmid, 
    all the ORFs were concatenated into one file with the name "xyyy_genome". 
    If different names are desired, the script has to be changed accordingly.  
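
The default "xyyy" naming convention can be illustrated with a small helper. The function is hypothetical, and lowercasing the result is an assumption based on how the pattern is written above. The names actually accepted are whatever is listed in the hash tables of script_creator3.pl.

```shell
# Sketch: derive an "xyyy" short name from a genus and a species name.
short_name() {
    genus_initial=$(printf '%s' "$1" | cut -c1)      # x   = first letter of the genus
    species_start=$(printf '%s' "$2" | cut -c1-3)    # yyy = first three letters of the species
    printf '%s%s\n' "$genus_initial" "$species_start" | tr '[:upper:]' '[:lower:]'
}

# short_name Escherichia coli   prints "ecol"
```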

    The script creates 6 files:
    script.sh - main executable file with executable permissions
    mysql.script - instructions for MySQL to run in batch mode
    treefile1, treefile2, treefile3 - three possible unrooted tree 
        topologies for the genome quartet in NH format
    ml_plot.p - instructions for gnuplot to run in batch mode
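
For four genomes there are exactly three unrooted topologies, one for each way of pairing the genomes across the internal branch. The sketch below prints them in NH (Newick) notation. Which topology goes into which treefile, and the exact notation script_creator3.pl writes, are not documented here, so the order and form shown are assumptions.

```shell
# Sketch: print the three possible unrooted topologies for a quartet, one per line.
quartet_topologies() {
    printf '((%s,%s),(%s,%s));\n' "$1" "$2" "$3" "$4"    # genomes 1+2 versus 3+4
    printf '((%s,%s),(%s,%s));\n' "$1" "$3" "$2" "$4"    # genomes 1+3 versus 2+4
    printf '((%s,%s),(%s,%s));\n' "$1" "$4" "$2" "$3"    # genomes 1+4 versus 2+3
}

# quartet_topologies A B C D
```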

3) COGnification OF QUARTOPS
    COGnify.pl - assigns a COG category to each of the QUARTOPs and generates 
    the distribution of QUARTOPs among functional categories

4) ANALYSES OF LOG-LIKELIHOODS:    
    The following Java programs are used:
    lv_analysis.java - creates the file lv_analyzed99.out (see Results section).
    lv_analysis90.java - creates the file lv_analyzed90.out (see Results section).
    lv_analysis2.java - converts posterior probabilities (barycentric coordinates) 
        into the Cartesian coordinate system.
    lv_analysis3.java - creates lists of datasets that belong to 
        the three corners (i.e. have posterior probability above 99%).
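
The barycentric-to-Cartesian conversion can be sketched as follows. This is not the code of lv_analysis2.java. It assumes an equilateral triangle with corners at (0,0), (1,0) and (0.5, sqrt(3)/2), with topology 1, 2 and 3 assigned to those corners in that order. The layout used by the package may differ.

```shell
# Sketch: map three posterior probabilities (p1 p2 p3) to a point in the triangle.
# Assumed corners: topology 1 -> (0,0), topology 2 -> (1,0), topology 3 -> (0.5, sqrt(3)/2).
bary_to_cart() {
    awk -v p1="$1" -v p2="$2" -v p3="$3" 'BEGIN {
        s = p1 + p2 + p3            # normalize in case the inputs do not sum to 1
        if (s <= 0) exit 1
        x = (p2 + 0.5 * p3) / s
        y = (sqrt(3) / 2) * p3 / s
        printf "%.4f %.4f\n", x, y
    }'
}

# A dataset that fully supports topology 3 lands on the top corner:
# bary_to_cart 0 0 1
```

Under these assumptions a dataset with equal support for all three topologies falls at the centroid of the triangle.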
    
    
=========
Results:
=========

The results are in the following files:
Files "lv_analyzed99.out" and "lv_analyzed90.out" contain the 
information on how many quartets support each tree topology in each "zone".
These data have to be entered into the ml_plot.p file manually, and then the 
"gnuplot ml_plot.p" command has to be executed to obtain a 
PostScript plot of the likelihood values mapped onto an equilateral triangle (ml_map.ps). 

The datasets are located under "shattered" subfolder.

The QUARTOPS with posterior probabilities above 99% are sorted 
into 3 folders: "corner_1", "corner_2" and "corner_3" under "shattered" folder.
Each of those subfolders contains the results of COGnification. 
The summary files are called "distrib.out" (distribution of COGs 
among the functional categories) and "COGnified.tab" (definition lines, 
assigned COG category and dataset number for each QuartOP).

For the abbreviations of COGs functional categories 
go to COG home page at http://www.ncbi.nlm.nih.gov/COG/.


==================
Auxiliary Scripts:
==================
These scripts are located in "extras" subfolder.


1) COGnification of all ORFs in a single genome: 
    COGnify_mod.pl - assigns a COG category to each of the ORFs in the genome. Shows the 
    distribution of all genes in a genome among functional categories.  

2) AUTOMATION of DATA ANALYSIS with MrBAYES program. [Requires PAUP* 4.0 and MrBAYES programs]
     process.sh - script that runs automated analysis of QUARTOPs using MrBAYES program. 
         Operates on the QUARTOPs in PHYLIP format.
         Calls the following scripts/programs:
    
        tonexus.sh - shell script to convert PHYLIP formatted COGs into NEXUS format.
        clean_nexus.pl - removes content from NEXUS files that MrBayes cannot parse.
        create_cmd.pl - creates executable file run.sh
        mrbayes.block - NEXUS block that is used by MrBayes
        counter4.pl - calculates posterior probabilities for each dataset.
        pp_analysis.java, pp_analysis2.java, pp_analysis3.java are analogous 
        to lv_analysis.java programs, but operate on posterior probabilities 
        instead of likelihood values.
       

3) AUTOMATION of BOOTSTRAPPING ANALYSIS of QUARTOPs [Requires PHYLIP 3.6 package]
    seqboot.sh - generates 100 bootstrap samples for each QUARTOP. Uses "seqboot.cmd" 
    as choices for the seqboot program from the PHYLIP package
    lboot3.sh - analyzes bootstrap samples generated with seqboot.sh and reports logL values and 
        tree topology for each of the bootstrap samples
    triangulate.pl - calculates bootstrap support values for each tree topology and 
        calculates barycentric coordinates from the bootstrap support values
    percent.pl - summarizes QuartOPs by zones (total, 70%, 90%)         


4) TAKING AMONG SITE RATE VARIATION INTO ACCOUNT
    If ASRV needs to be taken into account, the puzzle1.cmd through puzzle3.cmd 
    files have to be replaced with the ones located under "extras/asrv/"


==================
Unimportant Notes:
==================

*Internal Program versions are being kept for now (e.g. script_creator3.pl 
    corresponds to internal version #3)

===========
Known bugs:
===========

* If the number of QUARTOPs exceeds 1000, "clustalify" fails to analyze 
    more than 999 QUARTOPs in batch mode