Spa is a computer program for aligning cDNA sequences to a genome. It uses a probabilistic
Bayesian model to find the optimal alignment. To keep running times feasible we use the BLAT
gfServer to identify genomic loci and return the best mapping from these loci.
BLAT source and executables are freely available for academic, nonprofit and personal use. The
source can be downloaded from http://www.soe.ucsc.edu/~kent and the executables from
For example, the command
gfServer start hostName portNumber *.nib
starts the gfServer with the genome nib files. If your genome data is only available
in FASTA format, you can use the faToNib utility provided with the BLAT suite to convert your
FASTA files into nib files.
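The two server-side steps above can be sketched in Python. This is a minimal sketch: it only assembles the argument lists and leaves running them to the caller. The helper names, file names, host, and port are assumptions made for illustration; the argument order follows the usual `faToNib in.fa out.nib` and `gfServer start host port files` usage of the BLAT suite.

```python
from pathlib import Path


def fa_to_nib_command(fasta_path, nib_dir):
    """Build the faToNib command that converts one FASTA file into a nib file."""
    fasta = Path(fasta_path)
    nib = Path(nib_dir) / (fasta.stem + ".nib")
    return ["faToNib", str(fasta), str(nib)]


def gfserver_start_command(host, port, nib_files):
    """Build the 'gfServer start' command for a list of nib files."""
    return ["gfServer", "start", host, str(port)] + [str(n) for n in nib_files]


if __name__ == "__main__":
    # Hypothetical file names, host, and port, used only for illustration.
    print(fa_to_nib_command("chr1.fa", "nib"))
    print(gfserver_start_command("localhost", 17779, ["nib/chr1.nib"]))
```

Each returned list can be passed to `subprocess.run(..., check=True)` once the BLAT executables are on the PATH.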
usage:
spa host port spa_top_parameters spa_parameters nibDir in.fa out.mach
where
host is the name of the machine running the gfServer
port is the same as you started the gfServer with
spa_top_parameters is a parameter file which affects the search process
spa_parameters is another parameter file which affects the scoring scheme
nibDir is the path of the nib files relative to the current dir
(note these are needed by the client as well as the server)
in.fa a fasta format file. May contain multiple records
out.mach where to put the output
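As an illustration of the usage line above, the following sketch assembles a Spa invocation from Python. The function names and the example values are assumptions; only the order of the seven arguments is taken from the usage line.

```python
import subprocess


def spa_command(host, port, top_params, params, nib_dir, in_fa, out_mach):
    """Assemble the Spa argument list in the order given by the usage line."""
    return ["spa", host, str(port), top_params, params, nib_dir, in_fa, out_mach]


def run_spa(*args):
    """Run Spa; subprocess raises CalledProcessError on a non-zero exit status."""
    subprocess.run(spa_command(*args), check=True)


if __name__ == "__main__":
    # Hypothetical host, port, and file names.
    print(" ".join(spa_command("localhost", 17779, "spa_top_parameters",
                               "spa_parameters", "nib", "in.fa", "out.mach")))
```

The host and port must match the ones the gfServer was started with, and nibDir must be readable from the machine running the client.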
Spa uses two parameter files to set user selectable parameters:
spa_top_parameters is the highest level of parameters, which affect the BLAT query, locus
attributes, and retiling strategy.
spa_parameters is the other parameter file, which sets the parameters for the scoring scheme.
tilesize : used to create the genomic locus index. Values outside the range [4,100] yield an error
tileoverlay : defines the number of tiles which overlap at each position.
Values outside the range [1,tilesize] yield an error
locus_expand : is the number of positions to expand the loci obtained from blat
maximum_locus_expand : is the maximum number of positions to expand the loci obtained
from blat
clone_expand : is the number of positions we expand the region we want to retile
best_locus_list_size_max : is the maximum number of blat loci we use in the mapping
define_radius : controls how much 'fuzz' is added to the end of diagonals in the
dynamic programming matrix. Larger numbers mean longer running time
and more accurate alignments.
init_cell_number_cutoff : controls the starting tile size.
If number of cells > init_cell_number_cutoff then
the tile size will be increased to keep the number of cells
less than this cutoff.
extension_cell_number_cutoff : controls the number of cells arising from an extension of the locus,
i.e. the number of extra positions in the dynamic programming matrix that are allowed when doing extensions.
Larger numbers mean longer running time and more accurate alignments.
inside_retiling_cell_number_cutoff : controls the number of cells arising during retiling,
i.e. the number of extra positions in the dynamic programming matrix that are allowed at each retiling step.
retile_genomic_gap_lower_limit : any genomic gap shorter than this will not be retiled
in an effort to align them.
retile_genomic_gap_upper_limit : any genomic gap longer than this will not be retiled
in an effort to align them.
retile_poly_a_percentage : the percentage of A (or T in complement) bases in an unaligned
5' tail (or 3' tail) that disqualifies the tail from consideration for dynamic tile resizing
retile_tilesize : fine-grained tilesize used to resolve genome gaps. Values outside the range [4,12] yield an error
retile_tileoverlay : tileoverlay used to resolve genome gaps. Values outside the range
[4,retile_tilesize] yield an error
retile_exon_extension : when constructing exons for the retiling, an exon will include all
aligned genome positions which are adjacent to another aligned genome position in the exon,
and where the difference in either clone offset or genome offset is no more than
this value + 1 (i.e. 0 means each offset must be exactly adjacent to the adjacent aligned
position)
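The range constraints listed above can be checked before a run. The sketch below assumes the spa_top_parameters values have already been read into a Python dict of integers keyed by parameter name; the on-disk file format is not described here, so no parser is shown, and both function names are hypothetical.

```python
def check_top_parameters(p):
    """Return a list of range violations for a dict of spa_top_parameters values."""
    errors = []

    def in_range(name, low, high):
        # Record a message when the value falls outside the closed range.
        if not low <= p[name] <= high:
            errors.append(f"{name}={p[name]} outside [{low},{high}]")

    in_range("tilesize", 4, 100)
    in_range("tileoverlay", 1, p["tilesize"])
    in_range("retile_tilesize", 4, 12)
    in_range("retile_tileoverlay", 4, p["retile_tilesize"])
    return errors


def gap_is_retiled(gap_length, p):
    """A genomic gap is a retiling candidate only if it lies between the two limits."""
    return (p["retile_genomic_gap_lower_limit"] <= gap_length
            <= p["retile_genomic_gap_upper_limit"])


if __name__ == "__main__":
    # Hypothetical values, chosen so that tileoverlay violates its range.
    params = {"tilesize": 11, "tileoverlay": 12,
              "retile_tilesize": 8, "retile_tileoverlay": 4,
              "retile_genomic_gap_lower_limit": 10,
              "retile_genomic_gap_upper_limit": 1000}
    print(check_top_parameters(params))
    print(gap_is_retiled(500, params))
```

Spa itself reports out-of-range values as errors; a check like this only surfaces them earlier.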
score_match : probability per base for a match
score_mismatch : probability per base for a mismatch introduced by sequencing error
(score_match + score_mismatch = 1.0)
score_splice_NNNN : is the relative probability of the various splice junctions,
for example score_splice_GTAG is the relative probability associated with
the canonical splice junction GTAG. The sum over all possible splice junctions
must be equal to 1.0.
min_diff_misorientation : involved in deciding whether a clone is misoriented
Genome non-intron jump scoring curve parameters (specified as log(probability)) control
the probability of sequencing errors leading to deletions (from the clone) of various lengths.
score_genome_jump_beta0 : controls the probability of no deletion in the clone.
score_genome_jump_beta1 : controls the probability for having 1 deletion in the clone.
score_genome_jump_beta2 : controls the probability for having 2 deletions in the clone.
score_genome_jump_beta3 : controls the probability for having 3 deletions in the clone.
score_genome_jump_alpha : controls the probability for having n (n > 3) deletions in the clone.
Clone jump scoring curve parameters (specified as log(probability)) control
the probability of sequencing errors leading to insertions (in the clone) of various lengths.
score_clone_jump_beta0 : controls the probability of no insertion in the clone.
score_clone_jump_beta1 : controls the probability for having 1 insertion in the clone.
score_clone_jump_beta2 : controls the probability for having 2 insertions in the clone.
score_clone_jump_beta3 : controls the probability for having 3 insertions in the clone.
score_clone_jump_alpha : controls the probability for having n (n > 3) insertions in the clone.
score_genome_extension : controls the probability for exon extension
It is the log of probability to NOT start an intron, it controls exon-length distribution.
shortest_intron_length : is the minimum intron length observed
longest_intron_length : is the maximum intron length observed
score_intron : controls the probability associated with an intron of a specified length.
Each intron is assigned to a bin. There are 100 bins, so there are one hundred score_intron
entries in the spa_parameters file.
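The constraints stated in the list above (score_match + score_mismatch = 1.0, splice junction probabilities summing to 1.0, and 100 score_intron entries) can be verified with a short check. This sketch assumes the spa_parameters values are already held in a Python dict, with the splice probabilities stored as plain probabilities and the score_intron entries collected into a list; those representations and the function name are assumptions, not the file's actual layout.

```python
import math


def check_scoring_parameters(p, tol=1e-6):
    """Return a list of violated constraints for a dict of spa_parameters values."""
    errors = []
    # Match and mismatch probabilities per base must sum to 1.0.
    if not math.isclose(p["score_match"] + p["score_mismatch"], 1.0, abs_tol=tol):
        errors.append("score_match + score_mismatch != 1.0")
    # All score_splice_NNNN entries together must sum to 1.0.
    splice_total = sum(v for k, v in p.items() if k.startswith("score_splice_"))
    if not math.isclose(splice_total, 1.0, abs_tol=tol):
        errors.append(f"splice junction probabilities sum to {splice_total}")
    # One score_intron entry is expected per bin.
    if len(p.get("score_intron", [])) != 100:
        errors.append("expected 100 score_intron entries")
    return errors


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    params = {"score_match": 0.99, "score_mismatch": 0.01,
              "score_splice_GTAG": 0.98, "score_splice_GCAG": 0.02,
              "score_intron": [0.0] * 100}
    print(check_scoring_parameters(params))
```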
Some utilities are provided with the Spa program. They are mainly Perl scripts useful
for generating the spa_parameters file from a set of fasta sequences of a specific organism. The
parameters are estimated in a number of steps:
1. First we generate a default parameter file (unbiased parameters) and use it to get a first
mapping.
2. From this mapping, we estimate a high error rate parameter and remap only the high error
rate mappings with the previous parameter file, where score_mismatch is replaced by the estimated
high error rate.
3. We then generate a new spa_parameters file from all the mappings.
4. Repeat step 1 with the new parameter file, and then steps 2 and 3, to obtain the final parameter file.
To perform the estimation, we provide two Perl scripts:
generateParameters.pl : produces a spa_parameters file from a mapping, or an unbiased spa_parameters file.
extractHighERMappings.pl : extracts high error rate mappings.
To do the parameter estimation using these two scripts, first generate an
unbiased spa_parameters file with generateParameters.pl, then use this spa_parameters file
to run the first mapping. Using extractHighERMappings.pl, you will get the high
error rate mappings and the new spa_parameters file to map them. Once the second
mapping is done, merge the two mappings and run generateParameters.pl to obtain
a new spa_parameters file. Repeat the process with the new spa_parameters file to get
the final spa_parameters file.
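The estimation cycle described above can be outlined as a small driver loop. This is only a sketch of the control flow: the command-line options of generateParameters.pl and extractHighERMappings.pl are not documented here, so the four callables are hypothetical wrappers that the caller must supply around the scripts and around Spa itself.

```python
def estimate_parameters(fasta, generate, run_spa, extract_high_er, merge, rounds=2):
    """Iterate the mapping / re-estimation cycle and return the final parameters.

    All four callables are hypothetical wrappers supplied by the caller:
      generate(mapping_or_None) -> parameter file (None gives unbiased parameters)
      run_spa(sequences, params) -> mapping
      extract_high_er(mapping, params) -> (high error rate sequences, adjusted params)
      merge(first_mapping, second_mapping) -> merged mapping
    """
    params = generate(None)                        # unbiased starting parameters
    for _ in range(rounds):
        first = run_spa(fasta, params)             # map all sequences
        high_er, high_er_params = extract_high_er(first, params)
        second = run_spa(high_er, high_er_params)  # remap high error rate clones
        params = generate(merge(first, second))    # re-estimate from all mappings
    return params
```

The default of two rounds corresponds to one pass with the unbiased parameters followed by one repeat with the re-estimated parameters.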
spa-parse : Spa generates the result alignment in a machine readable match file. The spa-parse program
can convert this format into a human readable format similar to Sim4 output format.