Spa : A Probabilistic algorithm for Spliced Alignment

 
 
Spa is a computer program for aligning cDNA sequences to a genome. It uses a probabilistic
Bayesian model to find the optimal alignment. To keep running times feasible we use the BLAT
gfServer to identify genomic loci and return the best mapping from these loci.
 
BLAT source and executables are freely available for academic, nonprofit and personal use. The
source can be downloaded from http://www.soe.ucsc.edu/~kent and the executables from 
http://www.soe.ucsc.edu/~kent/exe/.
 
 
This command : gfServer start hostName portNumber *.nib
 
for example will run the gfServer with the genome nib files. If your genome data is only available 
in fasta format, you can use the faToNib utility provided with the blat suite to have your fasta files 
translated into nib files. 

Spa Usage :

usage:
spa host port spa_top_parameters spa_parameters nibDir in.fa out.mach
where
host is the name of the machine running the gfServer
port is the same as you started the gfServer with
   spa_top_parameters is a parameter file which affects the search process 
   spa_parameters is another parameter file which affects the scoring scheme 
   nibDir is the path of the nib files relative to the current dir
       (note these are needed by the client as well as the server)
   in.fa a fasta format file.  May contain multiple records
   out.mach where to put the output
 
 






Spa parameter descriptions

 

Spa uses two parameter files to set user selectable parameters: 

 

bullet

spa_top_parameters is the highest level of parameters which affect BLAT

     query, Locus attributes, and retiling strategy.

bullet spa_parameters is the other parameter file which sets parameters for the scoring 

     scheme.

 

 

spa_top_parameters

 

tilesize : used to create genomic locus index.
 	   values outside the range [4,100] yield error

tileoverlay : defines the number of tiles which overlap at each position
the values outside the range [1,tilesize] yield error

locus_expand : is the number of positions to expand the loci obtained from blat
 
maximum_locus_expand : is the maximum number of positions to expand the loci obtained
		       from blat


 
clone_expand : is the number of positions we expand the region we want to retile
 
best_locus_list_size_max : is the maximum number of blat loci we use in the mapping


 

define_radius : controls how much 'fuzz' is added to the end of diagonals in the 
dynamic programming matrix. Larger numbers mean longer running time 
and more accurate alignments.
 
init_cell_number_cutoff : controls the starting tile size. 
If number of cells > init_cell_number_cutoff then
the tile size will be increased to keep the number of cells
less than this cutoff.
 
extension_cell_number_cutoff : controls the number of cells occuring from an extension of the locus
number of extra positions in the dynamic programming matrix that are allowed when doing extensions.
Larger numbers mean longer running time and more accurate alignments

inside_retiling_cell_number_cutoff : controls the number of cells occuring during retiling.
Number of extra positions in the dynamic programming matrix that are allowed at each retiling step.

 

retile_genomic_gap_lower_limit : any genomic gap shorter than this will not be retiled 
in an effort to align them.

retile_genomic_gap_upper_limit : any genomic gap longer than this will not be retiled
in an effort to align them.

retile_poly_a_percentage : the percentage of A (or T in complement) bases in unaligned
5' tail (or 3' tail) to disqualify tail from consideration with dynamic tile resizing

retile_tilesize : finegrain tilesize to resolve genome gaps. Values outside of the range [4,12] yield error

retile_tileoverlay : tileoverlay used to resolve genome gaps values outside of
[4,retile_tilesize] yield error

retile_exon_extension : when constructing exons for the retiling, an exon will include all
aligned genome positions which are adjacent to another aligned genome positioning the exon, 
and where the difference in offset in either clone offset or genome offset is no more than
this value + 1 (i.e. 0 means each offset must be exactly adjacent to the adjacent aligned 
position


spa_parameters



score_match : probability per base for a match 
 
score_mismatch : probability per base for a mismatch introduced by sequencing error
(score_match + score_mismatch = 1.0)

score_splice_NNNN : is the relative probability of various splice juctions
for example score_splice_GTAG is the relative probability associated with
the canonical splice junction GTAG. The sum of all possible splice junctions
must be equal to 1.0.
 
 
min_diff_misorientation : involved in deciding whether a clone is misoriented
 

Genome non-intron jump scoring curve parameters (specified as log(probability)) controls 
the probability of sequencing errors leading to deletions (from the clone) of various lengths.
 
score_genome_jump_beta0 : controls the probability of no deletion in the clone.
score_genome_jump_beta1 : controls the probability for having 1 deletion in the clone. 
score_genome_jump_beta2 : controls the probability for having 2 deletions in the clone.
score_genome_jump_beta3 : controls the probability for having 3 deletions in the clone.
score_genome_jump_alpha : controls the probability for having n (n > 3) deletions in the clone.
 
 


Clone jump scoring curve parameters (specified as log(probability)) control 
the probability of sequencing errors leading to insertions (in the clone) of various lengths.
 
score_clone_jump_beta0 : controls the probability of no insetion in the clone.
score_clone_jump_beta1 : controls the probability for having 1 insertion in the clone. 
score_clone_jump_beta2 : controls the probability for having 2 insertions in the clone.
score_clone_jump_beta3 : controls the probability for having 3 insertions in the clone.
score_clone_jump_alpha : controls the probability for having n (n > 3) insertions in the clone.
 

score_genome_extension : controls the probability for exon extension
It is the log of probability to NOT start an intron, it controls exon-length distribution.
 
shortest_intron_length : is the minimum intron length observed

longest_intron_length : is the maximum intron length observed

score_intron : controls the probability associated with an intron of a specified length. 
Each intron is assigned to a bin. There are 100 bins. Thus there are a hundred score_intron
entry in the spa_parameters file.
   
 






Spa utilities

Associated with the Spa program, some utilities are given. They are mainly perl scripts useful 
for generating the spa_parameter file from a set of fasta sequences of a specific organism. The 
parameters are estimated in a number of steps :
 
bullet
First we generate a default parameter file (unbiased parameters) and use them to get a first
     mapping.
bullet
From this mapping, we estimate a high error rate parameter and we remap only the high error 
    rate mappings with the previous parameter file where mismatch_score is replaced by the high error 
    rate estimated.
bullet
We generate then a new spa_parameters file from all the mappings.
bullet
Repeat step 1 with the new parameter file,and then 2, 3, to obtain the final parameters file.
 
 
To perform the estimation, we provide two perl scripts :
 
generateParameters.pl : that produces a spa_parameters file from a mapping or an unbiased spa_parameters file.
 
extractHighERMappings.pl : that extracs high error rate mappings.
 
To do the parameter estimation using these two scripts, you can first generate an
unbiased spa_parameters with generateParameters.pl then use this spa_parameters file
to run the first mapping. Using the extractHighERMappings.pl, you will get the high
error rate mappings and the new spa_parameters file to map them. Once the second
mapping is done, merge the two mappings and run the generateParameters script to have
a new spa_parameters file. Repeat the process for the new spa_parameters file to get
the final spa_parameters file.
 
spa-parse : Spa generates the result alignment in a machine readable match file. The spa-parse program 
can convert this format into a human readable format similar to Sim4 output format.