**** THIS FILE DESCRIBES THE COMPONENTS OF THE *****
**** STUBB PROGRAM SUITE ***************************

The STUBB program suite implements algorithm(s) for
finding likely cis-regulatory modules, described in the following
paper:
"A Probabilistic Method to Detect Regulatory Modules"
by Saurabh Sinha, Erik van Nimwegen and Eric Siggia. 
Eleventh International Conference on Intelligent Systems for
Molecular Biology, Brisbane, Australia, July 2003, pg 292-301.


**** REQUIREMENTS ****
**********************
Our local installation of Stubb uses the following:
1. Red Hat Linux 9.0 
2. gcc compiler 3.2.2
3. Lagan toolkit version 1.0+ (courtesy Michael Brudno, Stanford).
   (Starting with Stubb version 2.1, the Lagan toolkit is co-distributed with Stubb.)
   (***NOTE: Lagan is separate software and comes with its own license, a GPL 
   license, which is separate from Stubb's license.***)
   (Lagan toolkit requires Perl.)
4. Some preprocessing tools for multiple species Stubb (see helpers/) require Python.


**** INSTALLING STUBB ************
**********************************

1. tar xvfz <stubb_xxx>.tar.gz  (where xxx is the version number)
2. cd stubb_xxx/
3. make

This creates the following program files:
1. bin/stubb :                    the single species stubb, no correlation parameters
2. bin/stubbh01 and
3. bin/stubbh01f:                 both are single species stubb, with correlation parameters 
                                  (see below for details.)         
4. bin/stubbms:                   multiple species stubb.
5. bin/stubbms_modifyalignments:  multiple species stubb. (optional preprocessor) 
6. bin/stubb_fitprobs:            single species stubb to run on a single window and write out
                                  the fit parameters.
7. bin/stubb_fixedprobs:          single species stubb, with no parameter training. (uses 
                                  fit probability values and score windows.)

===================================================================================
IMPORTANT: Stubb uses an external library called "newmat" (http://www.robertnz.net).
This library comes with the Stubb distribution, in the directory lib/newmat/
The library file is lib/newmat/libnewmat.a. If you wish to recompile the newmat 
library, go to lib/newmat/ and type gmake -f nm_gnu.mak. 
===================================================================================

===================================================================================
IMPORTANT: To run stubb with multiple species (only 2 species supported currently),
you will need to install the LAGAN software.
===================================================================================

**** INSTALLING LAGAN ON YOUR SYSTEM ******************************
*******************************************************************
1. Go to lib/mlagan/ and type make.
2. Set the LAGAN_DIR environment variable to LAGAN's installation 
   directory. This would be <path of Stubb installation dir>/lib/mlagan/
   It is advised that you set the LAGAN_DIR environment variable through 
   your .cshrc or .bashrc file so that you don't have to redo it every
   time you log on. 
   To set the env. var., you can use something like
   export LAGAN_DIR=<path of Stubb installation dir>/lib/mlagan/   or
   setenv LAGAN_DIR <path of Stubb installation dir>/lib/mlagan/
   depending on the shell that you use.

**** CONTENTS OF STUBB DISTRIBUTION ****
****************************************
Subdirectories:
---------------
bin/    : this is where the compiled Stubb programs are output
lib/    : this has the external libraries (newmat and mlagan) used by Stubb
sample/ : this has sample data to run Stubb on
helpers/: this has some Python/Perl tools to preprocess Stubb input 
          and postprocess Stubb output

Cpp files:
----------
stubbh01f.cpp
parameters_h01.cpp  
stubbms.cpp
fastafile.cpp   
parameters_h0.cpp   
stubbms_modifyalignments.cpp
parameters_h1.cpp   
lagan.cpp
util.cpp
sequence.cpp
windowiterator.cpp
wtmx.cpp
stubb.cpp
parameters.cpp
stubbh01.cpp
stubb_fitprobs.cpp
stubb_fixedprobs.cpp

Header files:
-------------
parameters.h
fastafile.h     
typedefs.h
util.h
sequence.h
mynewmat.h
wtmx.h

Makefiles:
----------
Makefile.ss : this is for single species Stubb
Makefile.ms : this is for multiple species Stubb
Makefile    : this calls make for both Makefiles.

********************************
**** STUBB (single species) ****
********************************

Type `bin/stubb' to see usage information.
The format is :

bin/stubb <sequencefile> <wtmxfile> <windowsize> <shiftsize> [-od <output_dir>] [-b <background file>] [-ft <energy threshold for printing>] [-ot <motif occurrence threshold for printing>]

e.g.,  bin/stubb sample/eve.fna sample/gap_wtmx 500 100 -od sample/output/ -b sample/eve.fna -ft 10.0 -ot 0.5 

(you may run this command and compare output files from sample/output/ to the precomputed output files in sample/test_output/singles_species/)

Required arguments:
-------------------
<sequencefile> : Fasta file containing one or more input DNA sequences.
<wtmxfile>     : Weight matrix file. See sample/gap_wtmx for format example.
<windowsize>   : Size of the sliding window.
<shiftsize>    : Number of positions by which the window slides at each step.

Optional arguments:
-------------------
-od <output_dir> : where you want the results to go. If this is not specified, output files go to the directory in which <sequencefile> resides.

-b <background file> : a Fasta file that has sequence to be used to derive background. If this is not specified, the local sequence context of the current window is used for background. (See the paragraph on CONTEXT_SIZE, written below, for details.)

-ft <energy threshold for printing> : window information is printed in the `dictionary' and `profile' files if the free energy of the window is above this threshold. Default value is 0.

-ot <motif occurrence threshold for printing> : in printing the `profile' of the window, the occurrence of a matrix is shown if its strength (on a scale of 0 to 1) is above this threshold. Default value is 0.1.

Compile-time arguments:
-----------------------
MARKOV_ORDER : Order of Markov chain used to model background. Change its value in Makefile.{ss,ms} (-DMARKOV_ORDER=..). Use an integer value >= 0. Markov order n means that (n+1)-mers are counted, and the emission probability of the next base depends on the previous n bases encountered.

CONTEXT_SIZE : If a background file is not specified, the background Markov model for a window is generated from the sequence around the window. A sequence of length <windowsize>*CONTEXT_SIZE is extracted from either side of the window, and these sequences, along with the window itself, serve as background sequence. In the following figure, CONTEXT_SIZE = c, windowsize = l.

   ----------<-----[WINDOW]----->---------  : sequence
             --------------------           : window context.
       c*l   +        l        +    c*l     = 2cl + l
Change the value of CONTEXT_SIZE in Makefile.{ss,ms} (-DCONTEXT_SIZE=..). Use an integer value >= 0.
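
The two compile-time settings above can be illustrated with a short Python sketch. It is not Stubb's C++ implementation. The function names context_region and train_markov are hypothetical, coordinates are taken to be 0-based, the context is assumed to be clipped at the sequence ends, words containing characters other than A, C, G, T are assumed to be skipped, and no pseudocounts are applied.

```python
from collections import defaultdict

def context_region(window_start, window_size, context_size, seq_length):
    """Return (start, end), 0-based half-open, of the window plus its flanks."""
    flank = context_size * window_size           # c*l on each side of the window
    start = max(0, window_start - flank)         # assumption: clip at sequence start
    end = min(seq_length, window_start + window_size + flank)  # and at its end
    return start, end

def train_markov(seq, order):
    """Count (order+1)-mers and return P(next base | previous `order` bases)."""
    counts = defaultdict(lambda: defaultdict(int))
    seq = seq.upper()
    for i in range(order, len(seq)):
        word = seq[i - order:i + 1]              # an (order+1)-mer ending at i
        if any(ch not in "ACGT" for ch in word):
            continue                             # assumption: skip N and other codes
        counts[word[:-1]][word[-1]] += 1
    probs = {}
    for context, nxt in counts.items():
        total = sum(nxt.values())
        probs[context] = {base: n / total for base, n in nxt.items()}
    return probs
```

For a window starting at 1000 with windowsize 500 and CONTEXT_SIZE 2, context_region returns the 2500-base span (0, 2500), i.e. 2cl + l, and train_markov can then be applied to that slice of the sequence.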

Output
------
1. <sequencefile>.fen: 
The free energy file. For each **starting** position of the sliding window, there is one line in this file. Each line has the format (tab-separated):

<Starting position> <Free Energy> <Free Energy per length> <#non-N chars in window> <#iterations to converge>

2. <sequencefile>.parameters: 
This shows the values of some of the important parameters in use in the program.

3. <sequencefile>.dict: 
This has one entry for each window scoring above the free energy threshold. Each entry has the following format:

><Sequencename>
Position: <Starting Position>	Nucl: <#non-N chars in window>	Word_av_length: 0.00	Free Energy: <free energy>
<Matrix name>	<Matrix prior probability>	<Matrix count>
<Matrix name>	<Matrix prior probability>	<Matrix count>
....
<

(The field "Word_av_length: 0.00" is uninformative.)
(The field <Sequencename> is obtained from the header line of the input Fasta sequence.)

4. <sequencefile>.prof:
This has one entry for each window scoring above the free energy threshold. It shows the occurrences of matrices in the window. Only matrices scoring above the motif occurrence threshold argument are displayed. Each entry has the format:

>
Sequence: <Sequencename>	Position <Starting Position>
<Position>	<Nucleotide at this position>
<Matrix name>	[+	<offset>]	<Occurrence score>
<Position>	<Nucleotide at this position>
<Matrix name>	[+	<offset>]	<Occurrence score>
.
.
.
<

(The field `+	offset' is shown only for non-Background matrices.)
(The field <Sequencename> is obtained from the header line of the input Fasta sequence.)
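
The .fen and .dict layouts described above can be read with a few lines of Python. This is a sketch based only on the format descriptions in this file. The names FenRecord, parse_fen and parse_dict are hypothetical, positions and counts of non-N characters are assumed to be integers, and the matrix count is read as a floating-point number.

```python
import re
from collections import namedtuple

FenRecord = namedtuple(
    "FenRecord", "start free_energy energy_per_length non_n iterations")

def parse_fen(lines):
    """Parse .fen lines: five whitespace/tab-separated fields per window."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue                       # skip blank or malformed lines
        records.append(FenRecord(int(fields[0]), float(fields[1]),
                                 float(fields[2]), int(fields[3]),
                                 int(fields[4])))
    return records

def parse_dict(lines):
    """Parse .dict entries delimited by a '>' header line and a '<' line."""
    entries, current = [], None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            current = {"sequence": line[1:].strip(), "matrices": []}
        elif current is None or not line.strip():
            continue                       # ignore text outside an entry
        elif line.strip() == "<":
            entries.append(current)        # end of the current entry
            current = None
        elif line.startswith("Position:"):
            m = re.search(r"Position:\s*(\d+).*Free Energy:\s*(\S+)", line)
            if m:
                current["position"] = int(m.group(1))
                current["free_energy"] = float(m.group(2))
        else:
            fields = line.split("\t")      # name, prior probability, count
            if len(fields) >= 3:
                current["matrices"].append(
                    (fields[0], float(fields[1]), float(fields[2])))
    return entries
```

Both functions accept any iterable of lines, so an open file object can be passed directly.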


Parsing the output
------------------
There is a Perl script `helpers/parse.pl' that is provided to extract high scoring
modules from Stubb's output. This script extracts those peaks in the free energy 
profile (from the .fen file) that have free energy (score) above a threshold. It also
filters for peaks based on their content/dictionary (from the .dict file). Using the script
requires modification of some of its variables. Run the script from the installation directory, e.g.:
helpers/parse.pl -html -chrom=eve.fna -out=peaks.html
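
The idea of picking peaks above a threshold from the free energy profile can be sketched in Python as below. This is not the logic of helpers/parse.pl, which is not reproduced here. The function find_peaks is hypothetical and simply reports local maxima of the free energy that exceed the threshold, without any filtering on dictionary content.

```python
def find_peaks(profile, threshold):
    """Return (start, free_energy) pairs that are local maxima above threshold.

    profile: list of (start, free_energy) tuples ordered by start position,
    e.g. the first two columns of a .fen file.
    """
    peaks = []
    for i, (pos, energy) in enumerate(profile):
        if energy <= threshold:
            continue
        left = profile[i - 1][1] if i > 0 else float("-inf")
        right = profile[i + 1][1] if i + 1 < len(profile) else float("-inf")
        # ">=" on the left and ">" on the right reports a plateau only once
        if energy >= left and energy > right:
            peaks.append((pos, energy))
    return peaks
```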




*****************************************
**** STUBBMS (Stubb for two species) ****
*****************************************

This is the main program for running Stubb on two species. More than two species will be supported in future versions.

Type `bin/stubbms' to see usage information.
The format is:

bin/stubbms <sequencefile> <wtmxfile> <windowsize> <shiftsize> -pf <phylogeny file> -af <alignments file> [-od <output_dir>] [-b <background file>] [-ft <energy threshold for printing>] [-ot <motif occurrence threshold for printing>] 

e.g., bin/stubbms sample/eve.fna.mfasta sample/gap_wtmx 500 100 -pf sample/phylogeny.txt -af sample/eve.fna.blk.mod -od sample/output/ -ft 10.0 -ot 0.5

(you may run this command and compare output files from sample/output/ to the precomputed output files in sample/test_output/two_species/)

The arguments have the same semantics as for bin/stubb (see above.) The exceptions are explained next:
1. <sequencefile> is a fasta file with two sequences (from the two species). The order is important. The first sequence will be used as the "reference sequence", i.e., the output will be in terms of its coordinates. 
2. <phylogeny file> gives the distance from the common ancestor to each of the two species. The sample file provided (sample/phylogeny.txt) reflects a $\mu$ of 0.5 between the two species. (Distance to first species is 0, distance to second is 0.5).
3. <alignments file> is the list of locally aligned blocks having a minimum length and pid. Each line of this file has the following format:

(Seq1_left Seq1_right)=(Seq2_left Seq2_right) PID

The above entry represents the aligned block between (and including) coordinates Seq<i>_left ... Seq<i>_right of the i-th sequence (i = 1, 2). (Note that these coordinates are assumed to start at 1, not 0.) The block is ungapped. PID is the percent identity of the block; this value is ignored by the Stubb program.
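
A line of the alignments file can be parsed as sketched below in Python. The function parse_block is hypothetical. It assumes the two coordinates inside each pair of parentheses are separated by whitespace, as in the format line above, and that PID is numeric. Because blocks are ungapped, both sides must have the same length, and the sketch rejects lines where they do not.

```python
import re

_BLOCK_RE = re.compile(
    r"^\(\s*(\d+)\s+(\d+)\s*\)\s*=\s*\(\s*(\d+)\s+(\d+)\s*\)\s+(\S+)\s*$")

def parse_block(line):
    """Parse '(l1 r1)=(l2 r2) PID' into 0-based half-open intervals."""
    m = _BLOCK_RE.match(line.strip())
    if m is None:
        raise ValueError("not an aligned-block line: %r" % line)
    s1_left, s1_right, s2_left, s2_right = (int(g) for g in m.groups()[:4])
    if s1_right - s1_left != s2_right - s2_left:
        raise ValueError("block is not ungapped: %r" % line)
    return {
        # file coordinates are 1-based and inclusive; convert to Python slices
        "seq1": (s1_left - 1, s1_right),
        "seq2": (s2_left - 1, s2_right),
        "length": s1_right - s1_left + 1,
        "pid": float(m.group(5)),
    }
```

The returned intervals can be used directly as slice bounds, e.g. seq1[start:end], on sequences held as Python strings.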

Output: 
------
The output from stubbms is very similar to that from single species stubb (described above). The output files are --

1. <sequencefile>.fen: 
The free energy file. For each **starting** position of the sliding window, there is one line in this file. (The sliding window goes over the "reference sequence" only, so the starting positions are positions in this reference sequence.) Each line has the format (tab-separated):

<Starting position> <Free Energy> <Free Energy per length> <#non-N chars in window> <#iterations to converge>

(The "#non-N chars in window " field refers to the entire homologous window, i.e., the sliding window in the reference species, as well as the unaligned regions from the other species.)

2. <sequencefile>.parameters: Same as for stubb.

3. <sequencefile>.dict: 
4. <sequencefile>.prof: Both these files are similar in format to that for stubb, with one difference. For each homologous window scoring above the free-energy threshold, there may be multiple dictionaries or profiles (respectively) -- one for the entire sliding window in the reference species (including the aligned blocks) and one each for the unaligned regions, if any, in the other species.

5. <sequencefile>.align: This is the list of aligned blocks that the stubbms program used. It carries redundant information that was already part of the input (in the form of the .blk file).


Preparing for executing stubbms:
--------------------------------
***NOTE: The LAGAN software must be installed on your system for this step to work. The environment variable LAGAN_DIR must point to the top of the lagan installation. See notes above on how to install Lagan.

***Local Alignment of blocks:
As explained above, executing stubbms requires a local alignment of the two sequences in the form of the "alignments file" argument. To create this, run:

python helpers/prepare_for_stubbms.py <Seq1file> <Seq2file>

where <Seq1file> and <Seq2file> each contain one fasta sequence, for the appropriate species, corresponding roughly to homologous regions in the genomes. This may have been detected, e.g., by BLAST.

e.g., python helpers/prepare_for_stubbms.py sample/eve.fna sample/eve_Contig5521.fa

(there is an identical script helpers/prepare_for_stubbms.pl, written in perl, that may be invoked the same way to obtain the same effect.)

This command creates the following files:
<Seq1file>.blk    : this is the file of locally aligned blocks.
<Seq1file>.mfasta : this is the concatenated sequence file (containing both Seq1file and Seq2file sequences) to be used as input to stubbms.

Note that by default this program (helpers/prepare_for_stubbms.py) extracts locally aligned blocks that are at least 10 bp long and at least 70 percent identical. You may change these two parameters via the variables `minblklen' and `minpid' in the script helpers/prepare_for_stubbms.py. 
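
The effect of the two thresholds can be pictured with the following Python sketch. It is not taken from helpers/prepare_for_stubbms.py. The function filter_blocks and the tuple layout (Seq1_left, Seq1_right, Seq2_left, Seq2_right, PID) are assumptions made for illustration, with 1-based inclusive coordinates as in the alignments file.

```python
def filter_blocks(blocks, minblklen=10, minpid=70):
    """Keep blocks that are at least minblklen long and minpid percent identical.

    blocks: iterable of (s1_left, s1_right, s2_left, s2_right, pid) tuples.
    """
    kept = []
    for s1_left, s1_right, s2_left, s2_right, pid in blocks:
        length = s1_right - s1_left + 1      # inclusive coordinates
        if length >= minblklen and pid >= minpid:
            kept.append((s1_left, s1_right, s2_left, s2_right, pid))
    return kept
```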

Also note that the helpers/prepare_for_stubbms.py script is where the LAGAN program is invoked from. The status messages output by LAGAN are redirected by the script into a file called __laganlog__ which is deleted later. If you wish to examine these status messages, edit the script helpers/prepare_for_stubbms.py : change `Process("rm anchs.final __laganlog__")' on line 37 to Process("rm anchs.final"). 

***Modification of locally aligned blocks
The program stubbms does not allow any motif occurrences to overlap an aligned block boundary. Motifs must occur either completely inside or completely outside these blocks. However, since the alignment program does not use the knowledge of the weight matrices in producing the aligned blocks, sometimes a good motif occurrence may end up overlapping a block boundary. Hence, before you run stubbms, you may want to modify the aligned blocks to reflect the weight matrix occurrences. To do this, run the program bin/stubbms_modifyalignments. 

usage: bin/stubbms_modifyalignments <sequencefile> <wtmxfile> <windowsize> -pf <phylogeny file> -af <alignments file>

This has exactly the same format as stubbms (explained above), except that it does not take a "<shiftsize>" argument. The output of this program is the file <alignments file>.mod, which has the same format as <alignments file>, but the blocks may have changed somewhat. (Also, the PID field is irrelevant, and is set arbitrarily to 100.)

e.g., bin/stubbms_modifyalignments sample/eve.fna.mfasta sample/gap_wtmx 500 -pf sample/phylogeny.txt -af sample/eve.fna.blk




****************************************************************
**** STUBBH01 (Stubb for single species, with correlations) ****
****************************************************************

stubbh01: This has exactly the same usage as "stubb" (see above). When scoring a window, this program checks if any pair of input matrices (excluding background) are correlated, and if so, includes the correlation parameter for that (ordered) pair. The detection of correlation (on a per-window basis) suffers from insufficient statistics, and YOU ARE ADVISED TO INSTEAD USE THE NEXT PROGRAM FOR WORKING WITH CORRELATED PARAMETERS.

stubbh01f: This uses correlation parameters only for the (ordered) motif-pairs specified by the user. Type `bin/stubbh01f' to see usage information.
The format is:

bin/stubbh01f <sequencefile> <wtmxfile> <windowsize> <shiftsize> [-od <output_dir>] [-b <background file>] [-ft <energy threshold for printing>] [-ot <motif occurrence threshold for printing>] [-cf <corr file>] [-cl <numpairs> <corr_l> <corr_r> ...]

e.g.,  bin/stubbh01f sample/eve.fna sample/gap_wtmx 500 100 -od sample/output/ -ft 10.0 -ot 0.5 -b sample/eve.fna -cl 3 0 1 1 0 0 2

(you may run this command and compare output files from sample/output/ to the precomputed output files in sample/test_output/correlated/)

All arguments are as in "stubb" except the "-cf" and "-cl" optional arguments.
These are two alternative ways to specify correlations:
-cl: This allows the user to specify the correlated motif pairs by a list.
<numpairs> is the number of pairs being specified. This many pairs of integers follow.
<corr_l> and <corr_r> are the indices of the left and right component matrices of the pair being specified. The indices are defined by the position of the matrix in the input wtmx file. The first matrix has index 0.
Note that specifying a matrix-pair means that the corresponding correlation parameter will be trained on each window. The matrix pair is ordered. Thus two pairs are needed to specify both directions of correlation between a pair of matrices.
-cf: This option allows use of specific values of correlation parameters. Not supported yet.
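
The layout of the -cl list can be made concrete with a small Python sketch that builds the argument list from ordered pairs of matrix indices. The helper names cl_args and both_directions are hypothetical and are not part of the Stubb distribution. They only assemble the tokens described above: the number of pairs, followed by that many (left, right) index pairs.

```python
def cl_args(pairs):
    """Build the '-cl' tokens from a list of ordered (left, right) index pairs."""
    args = ["-cl", str(len(pairs))]          # <numpairs> comes first
    for left, right in pairs:
        args.extend([str(left), str(right)]) # then <corr_l> <corr_r> per pair
    return args

def both_directions(i, j):
    """Two ordered pairs are needed to cover both directions of correlation."""
    return [(i, j), (j, i)]
```

For instance, cl_args(both_directions(0, 1) + [(0, 2)]) yields the tokens -cl 3 0 1 1 0 0 2, matching the stubbh01f example given earlier.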

Note on stubbh01f: Set the "_CYCLIC_WINDOWS" option in Makefile.ss to make Stubb treat each window as cyclic. This has the following interpretation. By default (if _CYCLIC_WINDOWS is not set), Stubb fits the HMM on the current window assuming that the last motif encountered before the first character (base) of the window was the background motif. That is, the initial probability distribution of the HMM is skewed towards the background motif. In the absence of correlation parameters, this makes no difference. In the presence of correlation parameters, however, it is an arbitrary assumption that leads to a translational asymmetry in the score: if the current window has a cluster of motifs, the score depends on whether the cluster lies at the center of the window or near one of its ends. With _CYCLIC_WINDOWS set, the window's sequence is treated as "cyclic" (circular) and the HMM is fit accordingly, which largely reduces this asymmetry.

This feature is currently supported only for stubbh01f, not for stubbh01. You may have to recompile the entire program depending on whether you want to use stubbh01 or stubbh01f.





*************************************************************************************
**** STUBB_FIXEDPROBS (Stubb for single species, with fixed prior probabilities) ****
*************************************************************************************

***stubb_fitprobs: This program runs stubb on the entire input sequence, fits the prior probabilities (as in Stubb) and prints them in a file called <sequencefile>.fitprobs. This output file has a single line containing the fit prior probabilities of each weight matrix in the input matrix file, in the same order. (the background's prior is the last entry.) 

Type `bin/stubb_fitprobs' to see usage information.
The format is:

bin/stubb_fitprobs <sequencefile> <wtmxfile> [-od <output_dir>] [-b <background file>] 

<sequencefile> contains the input sequence, on which the probabilities will be fit. <wtmxfile> contains the weight matrices, in Stubb format (see above). <output_dir> is where the program's output will be written -- the <sequencefile>.fitprobs file also goes here. <background file> specifies the sequence, in fasta format, to be used as background sequence. The last two arguments are optional.

e.g., bin/stubb_fitprobs sample/eve_stripe2.fna sample/gap_wtmx  -b sample/eve.fna -od sample/output/

(you may run this command and compare output files from sample/output/ to the precomputed output files in sample/test_output/fitprobs/)

Output
------
The output files are the same as for stubb, with the addition of the new output file <sequencefile>.fitprobs. As mentioned above, this file has a single line containing the fit prior probabilities of each weight matrix in the input matrix file, in the same order. (The background's prior is the last entry.) 
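
Reading the .fitprobs line back into named priors can be sketched in Python as follows. The function parse_fitprobs is hypothetical. It assumes the values on the line are separated by whitespace, and it takes the matrix names as an argument (in the order they appear in the weight matrix file) rather than reading them from the wtmx file, whose format is not described here.

```python
def parse_fitprobs(line, matrix_names):
    """Split a .fitprobs line into per-matrix priors and the background prior."""
    values = [float(tok) for tok in line.split()]
    if len(values) != len(matrix_names) + 1:
        raise ValueError("expected %d values, found %d"
                         % (len(matrix_names) + 1, len(values)))
    priors = dict(zip(matrix_names, values[:-1]))  # same order as the wtmx file
    background = values[-1]                        # background prior is last
    return priors, background
```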



***stubb_fixedprobs: This uses the *.fitprobs file generated by the stubb_fitprobs program, and uses the prior probabilities mentioned therein to run Stubb on each window in the input sequence. Thus, this is identical to Stubb except that there is no training of parameters for each window, or handling of multiple species. Correlated parameters are not allowed by this program. 

Type `bin/stubb_fixedprobs' to see usage information.
The format is:

bin/stubb_fixedprobs <sequencefile> <wtmxfile> <windowsize> <shiftsize> <fitprobs_file> [-od <output_dir>] [-b <background file>] [-ft <energy threshold for printing>] [-ot <motif occurrence threshold for printing>]

All arguments are as in "stubb" except the <fitprobs_file> argument, which is essential. This is the file created by the stubb_fitprobs program.

e.g., bin/stubb_fixedprobs sample/eve.fna sample/gap_wtmx 663 50 sample/output/eve_stripe2.fna.fitprobs -ft 10.0 -ot 0.5 -b sample/eve.fna -od sample/output/

(you may run this command and compare output files from sample/output/ to the precomputed output files in sample/test_output/fixedprobs/)

Output is same as for stubb.

In the above example, sample/eve_stripe2.fna carries the sequence of a 663bp long module, the eve stripe 2 enhancer. The goal was to search for similar (in content) modules in the 50Kb region surrounding the eve gene (sample/eve.fna). Hence we first run stubb_fitprobs on sample/eve_stripe2.fna, and then we run stubb_fixedprobs on sample/eve.fna with the window length argument being 663bp. This argument could very well have been some other value, in case we wanted to search for shorter or longer modules that are similar in composition to eve_stripe2. The "composition" here refers to the maximum likelihood prior probabilities of the input weight matrices, as trained from the eve_stripe2 sequence.



******************
**** PROBLEMS ****
******************
e-mail: saurabh@lonnrot.rockefeller.edu
