fasta36/ssearch36/[t]fast[x,y]36/lalign36 1(local)
NAME
fasta36 - scan a protein or DNA sequence library for similar sequences
fastx36 - compare a DNA sequence to a protein sequence database, com‐
paring the translated DNA sequence in forward and reverse frames.
tfastx36 - compare a protein sequence to a DNA sequence database, cal‐
culating similarities with frameshifts to the forward and reverse ori‐
entations.
fasty36 - compare a DNA sequence to a protein sequence database, com‐
paring the translated DNA sequence in forward and reverse frames.
tfasty36 - compare a protein sequence to a DNA sequence database, cal‐
culating similarities with frameshifts to the forward and reverse ori‐
entations.
fasts36 - compare unordered peptides to a protein sequence database
fastm36 - compare ordered peptides (or short DNA sequences) to a pro‐
tein (DNA) sequence database
tfasts36 - compare unordered peptides to a translated DNA sequence
database
fastf36 - compare mixed peptides to a protein sequence database
tfastf36 - compare mixed peptides to a translated DNA sequence database
ssearch36 - compare a protein or DNA sequence to a sequence database
using the Smith-Waterman algorithm.
ggsearch36 - compare a protein or DNA sequence to a sequence database
using a global alignment (Needleman-Wunsch)
glsearch36 - compare a protein or DNA sequence to a sequence database
with alignments that are global in the query and local in the database
sequence (global-local).
lalign36 - produce multiple non-overlapping alignments for protein and
DNA sequences using the Huang and Miller sim implementation of the
Waterman-Eggert algorithm.
prss36, prfx36 - discontinued; all the FASTA programs will estimate
statistical significance using 500 shuffled sequence scores if two
sequences are compared.
DESCRIPTION
Release 3.6 of the FASTA package provides a modular set of sequence
comparison programs that can run on conventional single processor com‐
puters or in parallel on multiprocessor computers. More than a dozen
programs - fasta36, fastx36/tfastx36, fasty36/tfasty36,
fasts36/tfasts36, fastm36, fastf36/tfastf36, ssearch36, ggsearch36, and
glsearch36 - are currently available.
All the comparison programs share a set of basic command line options;
additional options are available for individual comparison functions.
Threaded versions of the FASTA programs (built by default under
Unix/Linux/MacOS X) run in parallel on modern Linux and Unix multi-core
or multi-processor computers. Accelerated versions of the Smith-Waterman
algorithm are available for the Intel SSE2 and Altivec PowerPC
architectures; these can speed up Smith-Waterman calculations 10- to
20-fold.
In addition to the serial and threaded versions of the FASTA programs,
MPI parallel versions are available as fasta36_mpi, ssearch36_mpi,
fastx36_mpi, etc. The MPI parallel versions use the same command line
options as the serial and threaded versions.
Running the FASTA programs
By default, the FASTA programs are no longer interactive; they are run
from the command line by specifying the program, query.file, and
library.file. Program options must precede the query.file and
library.file arguments:
fasta36 -option1 -option2 -option3 query.file library.file > fasta.output
The "classic" interactive mode, which prompts for a query.file and
library.file, is available with the -I option. Typing a program name
without any arguments (ssearch36) provides a short help message; pro‐
gram_name -help provides a complete set of program options.
Program options MUST precede the query.file and library.file arguments.
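As a minimal sketch of the argument order described above, the following shell fragment assembles a command line with the options placed before the query and library. The helper name fasta_cmdline and the file names query.aa, library.fasta, and fasta.output are hypothetical; the -E and -m 8 options are taken from the option summary.

```shell
#!/bin/sh
# fasta_cmdline is a hypothetical helper, not part of the FASTA package.
# usage: fasta_cmdline program query.file library.file [options...]
fasta_cmdline() {
    prog=$1; query=$2; library=$3
    shift 3
    if [ $# -gt 0 ]; then
        # options always go before the query and library arguments
        echo "$prog $* $query $library"
    else
        echo "$prog $query $library"
    fi
}

cmd=$(fasta_cmdline fasta36 query.aa library.fasta -E 0.001 -m 8)

# Run only if the program and the (hypothetical) input files are present;
# otherwise just show the command line that would be used.
if command -v fasta36 >/dev/null 2>&1 && [ -r query.aa ] && [ -r library.fasta ]; then
    $cmd > fasta.output
else
    echo "would run: $cmd > fasta.output"
fi
```

Keeping the options in one place ahead of the two file arguments makes it harder to append an option after library.file by accident.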
FASTA program options
The default scoring matrix and gap penalties used by each of the pro‐
grams have been selected for high sensitivity searches with the various
algorithms. The default program behavior can be modified by providing
command line options before the query.file and library.file arguments.
Command line options can also be used in interactive mode.
Command line arguments come in several classes.
(1) Commands that specify the comparison type. FASTA, FASTS, FASTM,
SSEARCH, GGSEARCH, and GLSEARCH can compare either protein or DNA
sequences, and attempt to recognize the comparison type by looking at the
residue composition. -n, -p specify DNA (nucleotide) or protein
comparison, respectively. -U specifies RNA comparison.
(2) Commands that limit the set of sequences compared: -1, -3, -M.
(3) Commands that modify the scoring parameters: -f gap-open penalty,
-g gap-extend penalty, -j inter-codon frameshift, within-codon
frameshift, -s scoring-matrix, -r match/mismatch score, -x X:X score.
(4) Commands that modify the algorithm (mostly FASTA and [T]FASTX/Y):
-c, -w, -y, -o. The -S option can be used to ignore lower-case (low
complexity) residues during the initial score calculation.
(5) Commands that modify the output: -A, -b number, -C width, -d number,
-L, -m 0-11,B, -w line-width, -W context-width, -o offset1,offset2
(6) Commands that affect statistical estimates: -Z, -k.
Option summary:
-1 Sort by "init1" score (obsolete)
-3 ([t]fast[x,y] only) use only forward frame translations
-a Displays the full length (including unaligned regions) of both
sequences with fasta36, ssearch36, glsearch36, and fasts36.
-A (fasta36 only) For DNA:DNA, force Smith-Waterman alignment for
output. Smith-Waterman is the default for FASTA protein align‐
ment and [t]fast[x,y], but not for DNA comparisons with FASTA.
For protein:protein, use band-alignment algorithm.
-b # number of best scores/descriptions to show (must be < expecta‐
tion cutoff if -E is given). By default, this option is no
longer used; all scores better than the expectation (E()) cutoff
are listed. To guarantee the display of # descriptions/scores,
use -b =#, i.e. -b =100 ensures that 100 descriptions/scores
will be displayed. To guarantee at least 1 description, but
possibly many more (limited by -E e_cut), use -b >1.
-c "E-opt E-join"
threshold for gap joining (E-join) and band optimization (E-opt)
in FASTA and [T]FASTX/Y. FASTA36 now uses BLAST-like statisti‐
cal thresholds for joining and band optimization. The default
statistical thresholds for protein and translated comparisons
are E-opt=0.2, E-join=0.5; for DNA, E-join=0.1 and E-opt=0.02.
The actual number of joins and optimizations is reported
after the E-join and E-opt scoring parameters. Statistical
thresholds improve search speed 2 - 3X, and provide much more
accurate statistical estimates for matrices other than BLOSUM50.
The "classic" joining/optimization thresholds that were the
default in fasta35 and earlier programs are available using -c O
(upper case O), possibly followed by a value > 1.0 to set the
optcut optimization threshold.
-C # length of name abbreviation in alignments, default = 6. Must be
less than 20.
-d # number of best alignments to show (must be < expectation (-E)
cutoff and <= the -b description limit).
-D turn on debugging mode. Enables checks on sequence alphabet
that cause problems with tfastx36, tfasty36 (only available
after compile time option). Also preserves temp files with -e
expand_script.sh option.
-e expand_script.sh
Run a script to expand the set of sequences displayed/aligned
based on the results of the initial search. When the -e
expand_script.sh option is used, after the initial scan and
statistics calculation, but before the "Best scores" are shown,
expand_script.sh is run with a single argument, the name of a file
that contains the accession information (the text on the fasta
description line between > and the first space) and the
E()-value for the sequence. expand_script.sh then uses this
information to send a library of additional sequences to stdout.
These additional sequences are included in the list of high-
scoring sequences (if their scores are significant) and aligned.
The additional sequences do not change the statistics or data‐
base size.
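The -e protocol above can be sketched as a small shell function. This is an illustration under stated assumptions, not the script shipped with FASTA: it assumes the file passed to the script holds one hit per line with the accession in the first whitespace-separated field and the E()-value in the second, and it assumes a hypothetical directory that holds one FASTA file of linked sequences per accession, named <accession>.fasta.

```shell
#!/bin/sh
# expand_hits is a hypothetical helper illustrating an -e expansion script.
# $1 = file written by the FASTA program (assumed: accession, E()-value per line)
# $2 = hypothetical directory with one <accession>.fasta file of extra sequences
expand_hits() {
    hits_file=$1
    extra_dir=$2
    while read -r acc evalue; do
        # skip blank lines
        [ -n "$acc" ] || continue
        extra="$extra_dir/$acc.fasta"
        # emit the additional sequences for this hit, if any are known
        if [ -r "$extra" ]; then
            cat "$extra"
        fi
    done < "$hits_file"
}

# Installed as expand_script.sh, the body would reduce to a single call:
#   expand_hits "$1" /path/to/extra_dir
```

Everything the function writes to stdout is treated by the search program as the library of additional sequences, so diagnostic messages would have to go to stderr.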
-E e_cut e_cut_r
expectation value upper limit for score and alignment display.
Defaults are 10.0 for FASTA36 and SSEARCH36 protein searches,
5.0 for translated DNA/protein comparisons, and 2.0 for DNA/DNA
searches. FASTA version 36 now reports additional alignments
between the query and the library sequence; the second value
sets the threshold for the subsequent alignments. If not given,
the threshold is e_cut/10.0. If given and value > 1.0, e_cut_r
= e_cut / value; for value < 1.0, e_cut_r = value. If e_cut_r <
0, then the additional alignment option is disabled.
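The rule for the second -E value can be worked through with a short calculation. The function name secondary_ecut is hypothetical, and the handling of a value of exactly 1.0 (which the text above does not cover) is an assumption: here it is treated like values greater than 1.0.

```shell
#!/bin/sh
# secondary_ecut is a hypothetical helper that reproduces the e_cut_r rule.
# $1 = e_cut, $2 = optional second value given to -E
secondary_ecut() {
    awk -v e="$1" -v v="${2:-}" 'BEGIN {
        if (v == "") {
            r = e / 10.0          # no second value: e_cut / 10
        } else if (v + 0 >= 1.0) {
            r = e / v             # value > 1.0 divides e_cut (1.0 itself: assumed)
        } else {
            r = v + 0             # value < 1.0 is used directly
        }
        if (r < 0) {
            print "disabled"      # negative threshold turns the option off
        } else {
            printf "%g\n", r
        }
    }'
}

secondary_ecut 10        # -E 10
secondary_ecut 10 5      # -E "10 5"
secondary_ecut 10 0.01   # -E "10 0.01"
secondary_ecut 10 -1     # -E "10 -1"
```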
-f # penalty for opening a gap.
-F # expectation value lower limit for score and alignment display.
-F 1e-6 prevents library sequences with E()-values lower than
1e-6 from being displayed. This allows the user to focus on more
distant relationships.
-g # penalty for additional residues in a gap
-h Show short help message.
-help Show long help message, with all options.
-H show histogram (with fasta-36.3.4, the histogram is not shown by
default).
-i (fasta DNA, [t]fast[x,y]) compare against only the reverse
complement of the library sequence.
-I interactive mode; prompt for query filename, library.
-j # # ([t]fast[x,y] only) penalty for a frameshift between two codons,
([t]fasty only) penalty for a frameshift within a codon.
-J (lalign36 only) show identity alignment.
-k specify number of shuffles for statistical parameter estimation
(default=500).
-l str specify FASTLIBS file
-L report long sequence description in alignments (up to 200 char‐
acters).
-m 0,1,2,3,4,5,6,8,9,10,11,B,BB,"F# out.file" alignment display
options. -m 0, 1, 2, 3 display different types of alignments.
-m 4 provides an alignment "map" on the query. -m 5 combines the
alignment map and a -m 0 alignment. -m 6 provides an HTML out‐
put.
-m 8 seeks to mimic BLAST -m 8 tabular output. Only query and
library sequence names, and identity, mismatch, starts/stops,
E()-values, and bit scores are displayed. -m 8C mimics BLAST
tabular format with comment lines. -m 8 formats do not show
alignments.
-m 9 does not change the alignment output, but provides
alignment coordinate and percent identity information with the
best scores report. -m 9c adds encoded alignment information to
the -m 9; -m 9C adds encoded alignment information as a CIGAR
formatted string. To accommodate frameshifts, the CIGAR format
has been supplemented with F (forward) and R (reverse). -m 9i
provides only percent identity and alignment length information
with the best scores. With current versions of the FASTA pro‐
grams, independent -m options can be combined; e.g. -m 1 -m 9c
-m 6.
-m 11 provides lav format output from lalign36. It does not
currently affect other alignment algorithms. The lav2ps and
lav2svg programs can be used to convert lav format output to
postscript/SVG alignment "dot-plots".
-m B provides BLAST-like alignments. Alignments are labeled as
"Query" and "Sbjct", with coordinates on the same line as the
sequences, and BLAST-like symbols for matches and mismatches. -m
BB extends BLAST similarity to all the output, providing an out‐
put that closely mimics BLAST output.
-m "F# out.file" allows one search to write different alignment
formats to different files. The 'F' indicates separate file
output; the '#' is the output format (1-6,8,9,10,11,B,BB).
Multiple compatible formats can be combined, separated by commas
(',').
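Because -m 8 output is tabular, it can be post-processed with standard text tools. The sketch below keeps hits at or below an E()-value limit. It assumes the usual 12-column BLAST tabular layout, with the E()-value in column 11 and the bit score in column 12, and tab-separated fields; the function name filter_m8 is hypothetical.

```shell
#!/bin/sh
# filter_m8 is a hypothetical helper for -m 8 / -m 8C output.
# $1 = file of tabular output, $2 = maximum E()-value to keep
filter_m8() {
    awk -F '\t' -v max="$2" '
        /^#/ { next }                      # skip -m 8C comment lines
        NF >= 12 && $11 + 0 <= max + 0 {   # column 11 assumed to hold the E()-value
            print $1 "\t" $2 "\t" $11      # query name, library name, E()-value
        }
    ' "$1"
}
```

A typical use would be to save a search with -m 8 to a file and then call filter_m8 on that file with a limit such as 1e-5.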
-M #-# molecular weight (residue) cutoffs. -M "101-200" examines only
library sequences that are 101-200 residues long.
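To preview which library entries a range such as -M "101-200" would examine, sequence lengths can be counted directly from a FASTA file. This is only an illustration of the effect of the option, not how the FASTA programs implement it; the function name seqs_in_length_range is hypothetical.

```shell
#!/bin/sh
# seqs_in_length_range is a hypothetical helper.
# $1 = FASTA library file, $2 = residue range such as 101-200
seqs_in_length_range() {
    awk -v range="$2" '
        function flush_seq() {
            # report the previous entry if its length falls inside the range
            if (name != "" && len >= lo && len <= hi) print name, len
        }
        BEGIN { split(range, r, "-"); lo = r[1] + 0; hi = r[2] + 0 }
        /^>/ { flush_seq(); name = substr($1, 2); len = 0; next }
        { gsub(/[ \t\r]/, ""); len += length($0) }   # count residues only
        END { flush_seq() }
    ' "$1"
}
```

Each output line gives the entry name (the text between > and the first space) followed by its length in residues.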
-n force query to nucleotide sequence
-N # break long library sequences into blocks of # residues. Useful
for bacterial genomes, which have only one sequence entry. -N
2000 works well for bacterial genomes. (This option was
required when FASTA only provided one alignment between the
query and library sequence. It is not as useful, now that mul‐
tiple alignments are available.)
-o "#,#"
offsets query, library sequence for numbering alignments
-O file
send output to file.
-p force query to protein alphabet.
-P pssm_file
(ssearch36, ggsearch36, glsearch36 only). Provide blastpgp
checkpoint file as the PSSM for searching. Two PSSM file formats
are available, which must be provided with the filename.
'pssm_file 0' uses a binary format that is machine specific;
'pssm_file 1' uses the "blastpgp -u 1 -C pssm_file" ASN.1 binary
format (preferred).
-q/-Q quiet option; do not prompt for input (on by default)
-r "+n/-m"
(DNA only) values for match/mismatch for DNA comparisons. +n is
used for the maximum positive value and -m is used for the
maximum negative value. Values between max and min are rescaled,
but residue pairs having the value -1 continue to be -1.
-R file
save all scores to statistics file (previously -r file)
-s name
specify substitution matrix. BLOSUM50 is used by default;
PAM250, PAM120, and BLOSUM62 can be specified by setting -s
P250, P120, or BL62. Additional scoring matrices include:
BLOSUM80 (BL80), and MDM10, MDM20, MDM40 (Jones, Taylor, and
Thornton, 1992 CABIOS 8:275-282; specified as -s MD10, -s MD20, -s
MD40), OPTIMA5 (-s OPT5, Kann and Goldstein, (2002) Proteins
48:367-376), and VTML160 (-s VT160, Mueller and Vingron (2002)
J. Comp. Biol. 19:8-13). Each scoring matrix has associated
default gap penalties. The BLOSUM62 scoring matrix and -11/-1
gap penalties can be specified with -s BP62.
Alternatively, a BLASTP format scoring matrix file can be speci‐
fied, e.g. -s matrix.filename. DNA scoring matrices can also be
specified with the "-r" option.
With fasta36.3, variable scoring matrices can be specified by
preceding the scoring matrix abbreviation with '?', e.g. -s
'?BP62'. Variable scoring matrices allow the FASTA programs to
choose an alternative scoring matrix with higher information
content (bit score/position) when short queries are used. For
example, a 90 nucleotide FASTX query can produce only a 30
amino-acid alignment, so a scoring matrix with 1.33 bits/posi‐
tion is required to produce a 40 bit score. The FASTA programs
include BLOSUM50 (0.49 bits/pos) and BLOSUM62 (0.58 bits/pos)
but can range to MD10 (3.44 bits/position). The variable scoring
matrix option searches down the list of scoring matrices to find
one with information content high enough to produce a 40 bit
alignment score.
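The arithmetic behind the variable scoring matrix choice can be reproduced with a short calculation. The sketch below uses only the three bits/position values quoted above; the list the FASTA programs actually search is longer and its order is not given here, so the selection step is an assumption-laden illustration. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the 40-bit target for short queries.

# $1 = DNA query length in nucleotides, $2 = target bit score
required_bits_per_pos() {
    # a translated query of nt nucleotides aligns at most nt/3 amino acids
    awk -v nt="$1" -v bits="$2" 'BEGIN { printf "%.2f\n", bits / (nt / 3) }'
}

# $1 = required bits/position; picks the first matrix that is informative enough
first_sufficient_matrix() {
    awk -v need="$1" 'BEGIN {
        # only the values quoted in the text; the real list is longer
        n = split("BLOSUM50:0.49 BLOSUM62:0.58 MD10:3.44", m, " ")
        for (i = 1; i <= n; i++) {
            split(m[i], kv, ":")
            if (kv[2] + 0 >= need + 0) { print kv[1]; exit }
        }
        print "none"
    }'
}

need=$(required_bits_per_pos 90 40)
echo "required: $need bits/position"
echo "matrix:   $(first_sufficient_matrix "$need")"
```

For the 90-nucleotide example, 40 bits over 30 aligned positions gives 1.33 bits/position, which neither BLOSUM50 nor BLOSUM62 can supply.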
-S treat lower case letters in the query or database as low com‐
plexity regions that are equivalent to 'X' during the initial
database scan, but are treated as normal residues for the final
alignment display. Statistical estimates are based on the 'X'ed
out sequence used during the initial search. Protein databases
(and query sequences) can be generated in the appropriate format
using John Wootton's "pseg" program, available from
ftp://ftp.ncbi.nih.gov/pub/seg/pseg. Once you have compiled the
"pseg" program, use the command:
pseg database.fasta -z 1 -q > database.lc_seg
-t # Translation table - [t]fastx36 and [t]fasty36 support the BLAST
translation tables. See
http://www.ncbi.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c/.
-T # (threaded, parallel only) number of threads or workers to use
(on Linux/MacOS/Unix, the default is to use as many processors
as are available; on Windows systems, 2 processors are used).
-U Do RNA sequence comparisons: treat 'T' as 'U', allow G:U base
pairs (by scoring "G-A" and "T-C" as score(G:G)-3). Search only
one strand.
-V "?$%*"
Allow special annotation characters in query sequence. These
characters will be displayed in the alignments on the coordinate
number line.
-w # line width for similarity score, sequence alignment, output.
-W # context length (default is 1/2 of line width -w) for alignment
programs, like fasta and ssearch, that provide additional sequence
context.
-X extended options. Less used options. Other options include
-XB, -XM4G, -Xo, -Xx, and -Xy; see fasta_guide.pdf.
-z 1, 2, 3, 4, 5, 6
Specify the statistical calculation. Default is -z 1 for local
similarity searches, which uses regression against the length of
the library sequence. -z -1 disables statistics. -z 0 estimates
significance without normalizing for sequence length. -z 2 pro‐
vides maximum likelihood estimates for lambda and K, censoring
the 250 lowest and 250 highest scores. -z 3 uses Altschul and
Gish's statistical estimates for specific protein BLOSUM scoring
matrices and gap penalties. -z 4,5: an alternate regression
method. -z 6 uses a composition based maximum likelihood esti‐
mate based on the method of Mott (1992) Bull. Math. Biol.
54:59-75.
-z 11,12,14,15,16
compute the regression against scores of randomly shuffled
copies of the library sequences. Twice as many comparisons are
performed, but accurate estimates can be generated from data‐
bases of related sequences. -z 11 uses the -z 1 regression
strategy, etc.
-z 21, 22, 24, 25, 26
compute two E()-values. The standard (library-based) E()-value
is calculated in the standard way (-z 1, 2, etc), but a second
E2() value is calculated by shuffling the high-scoring sequences
(those with E()-values less than the threshold). For "average"
composition proteins, these two estimates will be similar
(though the best-shuffle estimates are always more conserva‐
tive). For biased composition proteins, the two estimates may
differ by 100-fold or more. A second -z option, e.g. -z "21 2",
specifies the estimation method for the best-shuffle E2()-val‐
ues. Best-shuffle E2()-values approximate the estimates given by
PRSS (or in a pairwise SSEARCH).
-Z db_size
Set the apparent database size used for expectation value calcu‐
lations (used for protein/protein FASTA and SSEARCH, and for
[T]FASTX/Y).
Reading sequences from STDIN
The FASTA programs can accept a query sequence from the unix "stdin"
data stream. This makes it much easier to use fasta36 and its rela‐
tives as part of a WWW page. To indicate that stdin is to be used, use
"@" as the query sequence file name. "@" can also be used to specify a
subset of the query sequence to be used, e.g.:
cat query.aa | fasta36 @:50-150 s
would search the 's' database with residues 50-150 of query.aa. FASTA
cannot automatically detect the sequence type (protein vs DNA) when
"stdin" is used and assumes protein comparisons by default; the '-n'
option is required for DNA queries read from STDIN.
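A DNA query read from stdin therefore needs both "@" and -n. The following sketch builds the "@" argument, optionally with a residue range, and shows the resulting command. The helper name stdin_query and the file names query.dna and dna_library.fasta are hypothetical.

```shell
#!/bin/sh
# stdin_query is a hypothetical helper that builds the "@" query argument.
# With two arguments it restricts the query to residues start-end.
stdin_query() {
    if [ $# -eq 2 ]; then
        printf '@:%s-%s\n' "$1" "$2"
    else
        printf '@\n'
    fi
}

query_file=query.dna          # hypothetical DNA query
library=dna_library.fasta     # hypothetical DNA library

# -n is given because the sequence type is not detected for stdin queries.
if command -v fasta36 >/dev/null 2>&1 && [ -r "$query_file" ] && [ -r "$library" ]; then
    fasta36 -n "$(stdin_query 50 150)" "$library" < "$query_file"
else
    echo "would run: fasta36 -n $(stdin_query 50 150) $library < $query_file"
fi
```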
Environment variables:
FASTLIBS
location of library choice file (-l FASTLIBS)
SRCH_URL1, SRCH_URL2
format strings used to define options to re-search the database.
REF_URL
the format string used to define the option to lookup the
library sequence in entrez, or some other database.
AUTHOR
Bill Pearson
wrp@virginia.EDU
Version: $Revision: 210 $