      Spa : A Probabilistic algorithm for Spliced Alignment

Spa is a computer program for aligning cDNA sequences to a genome. It uses a probabilistic
Bayesian model to find the optimal alignment. To keep running times feasible we use the BLAT
gfServer to identify genomic loci and return the best mapping from these loci.

BLAT source and executables are freely available for academic, nonprofit and personal use. The
source can be downloaded from http://www.soe.ucsc.edu/~kent and the executables from 
http://www.soe.ucsc.edu/~kent/exe/.

This command : gfServer start hostName portNumber *.nib

for example will run the gfServer with the genome nib files. If your genome data is only available 
in fasta format, you can use the faToNib utility provided with the blat suite to have your fasta files 
translated into nib files. 


      	Spa Usage :

usage:
   spa host port spa_top_parameters spa_parameters nibDir in.fa out.mach
where
   host is the name of the machine running the gfServer
   port is the same as you started the gfServer with

   spa_top_parameters is a parameter file which affects the search process 

   spa_parameters is another parameter file which affects the scoring scheme 
   nibDir is the path of the nib files relative to the current dir
       (note these are needed by the client as well as the server)
   in.fa a fasta format file.  May contain multiple records
   out.mach where to put the output

 

	Spa parameter descriptions

Spa uses two parameter files to set user selectable parameters: 
	- spa_top_parameters is the highest level of parameters which affect BLAT
     	  query, Locus attributes, and retiling strategy.
	- spa_parameters is the other parameter file which sets parameters for
	  the scoring scheme.

 
	spa_top_parameters

*tilesize* : used to create genomic locus index. 
values outside the range [4,100] yield error

*tileoverlay :* defines the number of tiles which overlap at each position
the values outside the range [1,tilesize] yield error

*locus_expand* : is the number of positions to expand the loci obtained from blat

*maximum_locus_expand* : is the maximum number of positions to expand the loci obtained
from blat

*clone_expand* : is the number of positions we expand the region we want to retile

*best_locus_list_size_max* : is the maximum number of blat loci we use in the mapping
         
*define_radius* : controls how much 'fuzz' is added to the end of diagonals in the 
dynamic programming matrix. Larger numbers mean longer running time and more accurate alignments.

*init_cell_number_cutoff* : controls the starting tile size. 
If number of cells > init_cell_number_cutoff then the tile size will be increased to keep the number 
of cells less than this cutoff.

*extension_cell_number_cutoff* : controls the number of cells occuring from an extension of the locus
number of extra positions in the dynamic programming matrix that are allowed when doing extensions.
Larger numbers mean longer running time and more accurate alignments

*inside_retiling_cell_number_cutoff* : controls the number of cells occuring during retiling.
Number of extra positions in the dynamic programming matrix that are allowed at each retiling step.

*retile_genomic_gap_lower_limit* : any genomic gap shorter than this will not be retiled 
in an effort to align them.

*retile_genomic_gap_upper_limit* : any genomic gap longer than this will not be retiled 
in an effort to align them.

*retile_poly_a_percentage *: the percentage of A (or T in complement) bases in unaligned 
5' tail (or 3' tail) to disqualify tail from consideration with dynamic tile resizing

*retile_tilesize* : finegrain tilesize to resolve genome gaps. Values outside of the range [4,12] yield error

*retile_tileoverlay* : tileoverlay used to resolve genome gaps values outside of 
[4,retile_tilesize] yield error

*retile_exon_extension* : when constructing exons for the retiling, an exon will include all 
aligned genome positions which are adjacent to another aligned genome positioning the exon, 
and where the difference in offset in either clone offset or genome offset is no more than
this value + 1 (i.e. 0 means each offset must be exactly adjacent to the adjacent aligned 
position


        spa_parameters

*score_match* : probability per base for a match 

*score_mismatch* : probability per base for a mismatch introduced by sequencing error
(score_match + score_mismatch = 1.0)

*score_splice_NNNN* : is the relative probability of various splice juctions
for example score_splice_GTAG is the relative probability associated with
the canonical splice junction GTAG. The sum of all possible splice junctions
must be equal to 1.0.

*min_diff_misorientation* : involved in deciding whether a clone is misoriented
Genome non-intron jump scoring curve parameters (specified as log(probability)) controls 
the probability of sequencing errors leading to deletions (from the clone) of various lengths.

*score_genome_jump_beta0* : controls the probability of no deletion in the clone.

*score_genome_jump_beta1* : controls the probability for having 1 deletion in the clone. 

*score_genome_jump_beta2* : controls the probability for having 2 deletions in the clone.

*score_genome_jump_beta3* : controls the probability for having 3 deletions in the clone.

*score_genome_jump_alpha* : controls the probability for having n (n > 3) deletions in the clone.
Clone jump scoring curve parameters (specified as log(probability)) control the probability of 
sequencing errors leading to insertions (in the clone) of various lengths.

*score_clone_jump_beta0* : controls the probability of no insetion in the clone.

*score_clone_jump_beta1* : controls the probability for having 1 insertion in the clone. 

*score_clone_jump_beta2* : controls the probability for having 2 insertions in the clone.

*score_clone_jump_beta3* : controls the probability for having 3 insertions in the clone.

*score_clone_jump_alpha* : controls the probability for having n (n > 3) insertions in the clone.

*score_genome_extension* : controls the probability for exon extension. It is the log of 
probability to NOT start an intron, it controls exon-length distribution.

*shortest_intron_length* : is the minimum intron length observed

*longest_intron_length* : is the maximum intron length observed

*score_intron* : controls the probability associated with an intron of a specified length. 
Each intron is assigned to a bin. There are 100 bins. Thus there are a hundred score_intron
entry in the spa_parameters file.

    
  	Spa utilities

Associated with the Spa program, some utilities are given. They are mainly perl scripts useful 
for generating the spa_parameter file from a set of fasta sequences of a specific organism. The 
parameters are estimated in a number of steps :
	- First we generate a default parameter file (unbiased parameters) and use them to get a first mapping.
	- From this mapping, we estimate a high error rate parameter and we remap only the high error rate mappings 
	  with the previous parameter file where mismatch_score is replaced by the high error rate estimated.
	- We generate then a new spa_parameters file from all the mappings.
	- Repeat step 1 with the new parameter file,and then 2, 3, to obtain the final parameters file.

To perform the estimation, we provide two perl scripts :

*generateParameters.pl* : that produces a spa_parameters file from a mapping or an unbiased spa_parameters file.

*extractHighERMappings.pl* : that extracs high error rate mappings.

To do the parameter estimation using these two scripts, you can first generate an
unbiased spa_parameters with generateParameters.pl then use this spa_parameters file
to run the first mapping. Using the extractHighERMappings.pl, you will get the high
error rate mappings and the new spa_parameters file to map them. Once the second
mapping is done, merge the two mappings and run the generateParameters script to have
a new spa_parameters file. Repeat the process for the new spa_parameters file to get
the final spa_parameters file.
 
*spa-parse* : Spa generates the result alignment in a machine readable match file. The spa-parse program 
can convert this format into a human readable format similar to Sim4 output format.

