

This is the distribution package of PROCSE version 2.0.

The permission to use this software is granted by the authors only on
the following 5 conditions:
1. The PROCSE software suite remains at your institution and is not
published, distributed, or otherwise transferred or made available to
people other than institution employees and students involved in
research under your supervision.
2. The PROCSE software suite will be used by you and/or your
institution solely for non-commercial purposes, except with express
permission from the authors.
3. You may provide the authors with feedback on the use of the PROCSE
software suite in your research, and the authors are permitted to
use any information you provide in making changes to the PROCSE
software suite.
4. Any risk associated with using the PROCSE software suite at your
institution is with you and your institution.
5. The PROCSE software suite will be cited in any publication(s)
reporting on data obtained by using it, as:
Probabilistic clustering of sequences: Inferring new bacterial
regulons by comparative genomics.  
Erik van Nimwegen, Mihaela Zavolan, Nikolaus Rajewsky, 
and Eric D. Siggia 
PNAS 99 (2002) 7323-7328


INSTALLATION

On most platforms installation boils down to:
1) Unzipping and untarring the package in the directory where you want
the program to reside.
2) Typing make
This produces an executable file called 'procse'.

Please contact erik.vannimwegen@unibas.ch if you run into compilation
problems.

RUNNING

The general format for running the program is:
procse datafile parameterfile
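
When many datasets need to be processed, the call above can be driven
from a script. The following is a minimal Python sketch, not part of
the PROCSE distribution. The function names and the default executable
path './procse' are assumptions for illustration; only the two
positional arguments (datafile, then parameterfile) come from this
README. Where procse writes its results is not specified here, so the
sketch simply captures standard output.

```python
import subprocess
from pathlib import Path


def build_procse_command(datafile, parameterfile, executable="./procse"):
    """Return the argument list for: procse datafile parameterfile.

    The executable path is an assumption; adjust it to wherever
    'make' placed the binary on your system.
    """
    for path in (datafile, parameterfile):
        if not Path(path).is_file():
            raise FileNotFoundError(f"input file not found: {path}")
    return [executable, str(datafile), str(parameterfile)]


def run_procse(datafile, parameterfile, executable="./procse"):
    """Run procse once and return the completed process object."""
    cmd = build_procse_command(datafile, parameterfile, executable)
    # check=True raises CalledProcessError on a non-zero exit status.
    return subprocess.run(cmd, capture_output=True, text=True, check=True)
```

Checking that both input files exist before launching gives a clearer
error message than a failure inside the program itself.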

The datafile has a FASTA-like format. For data that consist of single
sequences the file format is simply:
>name_sequence_1
sequence1
>name_sequence_2
sequence2
>name_sequence_3
sequence3
etc.
where sequence1, sequence2, and sequence3 are DNA sequences (the
program does not yet work on protein sequences).

The datafile may also contain ungapped alignments of multiple binding
sites for the same factor, that is, sets of sequences that the user
has preclustered and prealigned and that should never be moved apart
by the program. These are entered in the datafile in exactly the same
way as single sequences, the only difference being that more than one
sequence follows the name line, e.g.
>name_sequence_group1
sequence1_from_group1
sequence2_from_group1
sequence3_from_group1
sequence4_from_group1
>name_sequence_group2
sequence1_from_group2
sequence2_from_group2

In this example group 1 has 4 sequences and group 2 has 2 sequences.
Note that the sequences within a group all need to have the same
length (and that no gaps should occur).
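
Before running the program it can help to check a datafile against
the rules above. The following Python sketch is not part of the PROCSE
distribution; the function name is hypothetical. It assumes one
sequence per line (which follows from the group format, since several
lines under one name are read as several sequences) and, as a further
assumption, accepts only the letters A, C, G, and T in either case.
The program itself may be more permissive.

```python
def parse_procse_datafile(text):
    """Parse FASTA-like text into a list of (name, [sequences]) pairs.

    A name line followed by one sequence is a single sequence; a name
    line followed by several sequences is a prealigned group, whose
    members must all have the same length and contain no gaps.
    """
    entries = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            entries.append((line[1:].strip(), []))
            continue
        if not entries:
            raise ValueError(f"line {lineno}: sequence before any name line")
        seq = line.upper()
        bad = set(seq) - set("ACGT")  # assumed alphabet; rejects '-' gaps
        if bad:
            raise ValueError(f"line {lineno}: unexpected characters {sorted(bad)}")
        group = entries[-1][1]
        if group and len(seq) != len(group[0]):
            raise ValueError(f"line {lineno}: length differs within group")
        group.append(seq)
    for name, group in entries:
        if not group:
            raise ValueError(f"entry '{name}' has no sequence")
    return entries
```

For the group example above, the function would return two entries,
the first holding four sequences and the second holding two.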

All parameters are explained in the included file
params_plus_explanation
which is an example parameter file with comments about each
parameter. Also included is a minimal parameter file that contains
only the parameters that the user *has* to specify when running the
program (all other parameters are set to their defaults). This file
is called: minimal_params.



NEW FEATURES IN VERSION 2.0

Note that the current version has a few more features than the
version described in the paper cited above.

The new features are:
-Apart from the standard scoring, the new version also implements a
scoring scheme in which each column of the cluster of aligned binding
sites is scored both according to a weight matrix model and according
to a background model. This can be appropriate for binding sites that
contain "don't care" columns.
-The original version used uniform priors over the space of weight
matrices. The new version allows for arbitrary symmetric Dirichlet
priors of the form (up to normalization):
P(w) = prod_a (w_a)^(q-1)
with q a parameter that can be set in the parameter file. Setting
q = 1 recovers the uniform prior.
-The original version used a uniform prior over the space of
partitions of the sequences into subsets. The new version also allows
for a prior over partitions that is uniform over cluster number, or a
prior that is uniform over both cluster number and the sizes of
the clusters.
-The prior over partitions can be further specified using a chemical
potential. That is, the prior probability of a partition C takes on
the form 
P(C) = exp(-mu * n(C))
with n(C) the number of clusters in partition C and mu the chemical
potential. This prior can be useful when the expected number of
clusters in the data is (approximately) known. 
-The original version always tried to align sequences in both the
forward and reverse-complement directions. In the current version the
user has the option to direct the program to use only the forward
strand.
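
To get a feel for the parameters q and mu in the two prior formulas
above, the following Python sketch evaluates them numerically. It is
an illustration only, not code from PROCSE: the function names are
hypothetical, and the normalization constant of the Dirichlet density
is added here from the standard definition, whereas the formula above
is stated without it. The partition prior is left unnormalized, as in
the formula for P(C).

```python
import math


def log_symmetric_dirichlet(w, q):
    """Log density of a symmetric Dirichlet prior at one column w.

    w is a list of probabilities (one per base) summing to 1, and q is
    the Dirichlet parameter. The unnormalized part is
    prod_a (w_a)^(q-1), as in the README.
    """
    if q <= 0:
        raise ValueError("q must be positive")
    if any(x <= 0 for x in w) or abs(sum(w) - 1.0) > 1e-9:
        raise ValueError("w must be positive and sum to 1")
    k = len(w)
    # Standard normalization: Gamma(k*q) / Gamma(q)^k
    log_norm = math.lgamma(k * q) - k * math.lgamma(q)
    return log_norm + (q - 1.0) * sum(math.log(x) for x in w)


def log_partition_prior(n_clusters, mu):
    """Unnormalized log prior of a partition: log P(C) = -mu * n(C)."""
    return -mu * n_clusters


# q < 1 favors peaked columns, q > 1 favors flat ones.
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
prefers_peaked = log_symmetric_dirichlet(peaked, 0.5) > log_symmetric_dirichlet(flat, 0.5)
prefers_flat = log_symmetric_dirichlet(flat, 2.0) > log_symmetric_dirichlet(peaked, 2.0)
```

From the formula for P(C), each additional cluster multiplies the
prior by exp(-mu), so a positive mu favors partitions with fewer
clusters and a negative mu favors partitions with more.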

CONTACT

Please send comments, questions, and suggestions to:
erik.vannimwegen@unibas.ch
 


