FASTA/TFASTA/FASTX/TFASTXv2.0u(1)
NAME
fasta - scan a protein or DNA sequence library for similar sequences
tfasta - compare a protein sequence to a DNA sequence library, trans‐
lating the DNA sequence library `on-the-fly'.
lfasta - compare two protein or DNA sequences for local similarity and
show the local sequence alignments
plfasta - compare two sequences for local similarity and plot the local
sequence alignments
SYNOPSIS
fasta [-a -A -b # -c # -d # -E # -f # -g # -k # -l file -L FASTLIBS
-r STATFILE -m # -o -O file -p # -Q -s SMATRIX -w # -x "# #" -y # -z -1
] query-sequence-file library-file [ ktup ]
fasta [-QaAbcdEfgHiklmnoOprswxyz] query-file @library-name-file
fasta [-QaAbcdEfgHiklmnoOprswxyz] query-file "%PRMVI"
fasta [-aAbcdEgHlmnoOprswyx] - interactive mode
fastx [-aAbcdEfghHlmnoOprswyx] DNA-query-file protein-library [ ktup ]
tfasta [-aAbcdEfgkmoOprswy3] protein-query-file DNA-library [ ktup ]
tfastx [-abcdEfghHikmoOprswy3] protein-query-file DNA-library [ ktup ]
lfasta [-afgmnpswx] sequence-file-1 sequence-file-2 [ ktup ]
plfasta [-afgkmnpsxv] sequence-file-1 sequence-file-2 [ ktup ]
DESCRIPTION
fasta is used to compare a protein or DNA sequence to all of the
entries in a sequence library. For example, fasta can compare a pro‐
tein sequence to all of the sequences in the NBRF PIR protein sequence
database. fasta will automatically decide whether the query sequence
is DNA or protein by reading the query sequence as protein and deter‐
mining whether the `amino-acid composition' is more than 85% A+C+G+T.
fasta uses an improved version of the rapid sequence comparison
algorithm of Lipman and Pearson (Science, (1985) 227:1435); the improved
version is described in Pearson and Lipman, Proc. Natl. Acad. Sci. USA,
(1988) 85:2444. The program can be invoked either with command line
arguments or in interactive mode. The optional third argument, ktup,
sets the sensitivity and speed of the search. If ktup=2, similar regions in the
two sequences being compared are found by looking at pairs of aligned
residues; if ktup=1, single aligned amino acids are examined. ktup can
be set to 2 or 1 for protein sequences, or from 1 to 6 for DNA
sequences. The default if ktup is not specified is 2 for proteins and
6 for DNA.
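As a rough illustration only (this is not the code fasta itself uses), the
sequence-type test and the ktup rules described above could be sketched in
Python as follows. The function names looks_like_dna and choose_ktup are
hypothetical, and counting only letters as residues is an assumption.

```python
from typing import Optional


def looks_like_dna(sequence: str, threshold: float = 0.85) -> bool:
    """Return True if more than `threshold` of the residues are A, C, G or T."""
    # Assumption: only letters count as residues; case is ignored.
    residues = [ch for ch in sequence.upper() if ch.isalpha()]
    if not residues:
        return False
    acgt = sum(1 for ch in residues if ch in "ACGT")
    return acgt / len(residues) > threshold


def choose_ktup(is_dna: bool, ktup: Optional[int] = None) -> int:
    """Apply the documented defaults and ranges for ktup."""
    largest = 6 if is_dna else 2  # DNA: 1 to 6, protein: 1 or 2
    if ktup is None:
        return largest  # documented defaults: 6 for DNA, 2 for proteins
    if not 1 <= ktup <= largest:
        raise ValueError(f"ktup must be between 1 and {largest}")
    return ktup


if __name__ == "__main__":
    query = "ACGTACGTAC"
    is_dna = looks_like_dna(query)
    print(is_dna, choose_ktup(is_dna))
```

A query that is mostly ambiguity codes would fall below the 85% cut-off in
this sketch and be treated as protein; the -n option exists to force the
DNA interpretation in such cases.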
fasta compares a query sequence to a sequence library, which consists of
sequence data interspersed with comments (see below). Normally fasta,
fastx, tfasta, and tfastx search the libraries listed in the file
pointed to by the environment variable FASTLIBS. The format of this
file is described in the file FASTA.DOC. tfasta compares a protein
sequence to a DNA sequence database, translating the DNA sequence
library in 6 frames `on-the-fly' (3 frames with the -3 option). The
search uses the standard BLOSUM50 scoring matrix and ktup=2 by
default. tfasta searches a DNA sequence database in the standard text
format described below. tfastx, like tfasta, compares a protein
sequence to a DNA sequence library. However, tfastx compares the pro‐
tein sequence to the forward and reverse three-frame translation of the
DNA library sequence, allowing for frameshifts. fastx compares a DNA
sequence to a protein sequence database, translating the DNA sequence
in three frames and allowing frameshifts in the alignment. The lfasta
and plfasta programs compare two sequences, looking for local sequence
similarities. While fasta, fastx, and tfasta report only the best
alignment between the query sequence and the library sequence, lfasta and
plfasta will report all of the alignments between the two sequences
with scores greater than a cut-off value. lfasta shows the actual
local alignments between the two sequences and their scores, while
plfasta produces a plot of the alignments that looks similar to a
`dot-matrix' homology plot. On Unix™ systems, plfasta generates
PostScript output.
The fasta programs use a standard text format sequence file. Lines
beginning with '>' or ';' are considered comments and ignored.
Sequences can be upper or lower case; blanks, tabs, and unrecognizable
characters are ignored. fasta expects sequences to use the single-letter
amino acid codes; see protcodes(5). Library files for fasta should have
the form shown under EXAMPLES below.
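The reading rules above can be illustrated with a short Python sketch. It
is not the parser used by fasta: the function name read_library is
hypothetical, "unrecognizable characters" is approximated as "anything
that is not a letter", and the residues given for the second entry are
made-up placeholders.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_library(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (comment, sequence) pairs from a FASTA-style text library."""
    title: Optional[str] = None
    residues: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            # A '>' comment line starts a new library sequence.
            if title is not None:
                yield title, "".join(residues)
            title = line[1:].strip()
            residues = []
        elif line.startswith(";"):
            # ';' lines are comments and are skipped.
            continue
        elif title is not None:
            # Assumption: keep letters only, so blanks, tabs and other
            # characters are dropped; case is normalized to upper.
            residues.extend(ch.upper() for ch in line if ch.isalpha())
    if title is not None:
        yield title, "".join(residues)


if __name__ == "__main__":
    example = [
        ">LCBO bovine preprolactin",
        "wil lls\tq",
        "; a comment inside the entry",
        ">LCHU human",
        "ACDE",  # placeholder residues, not real data
    ]
    for title, sequence in read_library(example):
        print(title, sequence)
```

Sequence lines that appear before the first '>' line are discarded in this
sketch, which mirrors the requirement that every fasta library sequence
start with a comment line.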
OPTIONS
fasta and the other programs can be directed to change the scoring
matrix, search parameters, output format, and default search
directories by entering options on the command line (preceded by a `-',
or by a `/' for MS-DOS). All of the options should precede the file name
and ktup arguments. Alternatively, these options can be changed by
setting environment variables. The options and environment variables are:
-1 Normally, the top scoring sequences are ranked by the z-score
based on the opt score. To rank sequences by raw scores, use
the -z option. With the -1 option, sequences are ranked by the
z-score based on the init1 score.
-a (SHOWALL) Modifies the display of the two sequences in alignments.
Normally, both sequences are shown only where they overlap
(SHOWALL=0). If -a is given or the environment variable SHOWALL=1,
both sequences are shown in their entirety.
-A Force use of unlimited Smith-Waterman alignment for DNA FASTA
and TFASTA. By default, the program uses the older (and faster)
band-limited Smith-Waterman alignment for DNA FASTA and TFASTA
alignments.
-b # The number of similarity scores to be shown when the -Q option
is used. This value is usually calculated based on the actual
scores.
-c # (OPTCUT) The threshold for optimization: library sequences with
initn scores greater than OPTCUT are optimized (see -o). The
OPTCUT value is normally calculated based on sequence length.
-d # The number of alignments to be shown. Normally, fasta shows the
same number of alignments as similarity scores. By using fasta
-Q -b 200 -d 50, one would see the top scoring 200 sequences and
alignments for the 50 best scores.
-E # The expectation value threshold for displaying similarity scores
and sequence alignments. fasta -Q -E 2.0 would show all
library sequences with scores expected to occur no more than 2
times by chance in a search of the library.
-f # Penalty for the first residue in a gap (-12 by default for fasta
with proteins, -16 for DNA).
-g # Penalty for additional residues in a gap (-2 by default for
fasta with proteins, -4 for DNA).
-h # (fastx, tfastx only) penalty for a +1 or -1 frameshift.
-H Do not display histogram of similarity scores.
-i (fasta, fastx) search with the reverse-complement of the query
DNA sequence. (tfastx) search only the reverse complement of
the DNA library sequence.
-k # (GAPCUT) Sets the threshold for joining the initial regions for
calculating the initn score.
-l file
(FASTLIBS) The name of the library menu file. Normally this
will be determined by the environment variable FASTLIBS. How‐
ever, a library menu file can also be specified with -l.
-L display more information about the library sequence in the
alignment.
-m # (MARKX) =0,1,2,3,4,10. Alternate display of matches and mismatches
in alignments. MARKX=0 uses ":", ".", and " " for identities,
conservative replacements, and non-conservative replacements,
respectively. MARKX=1 uses " ", "x", and "X". MARKX=2 does not
show the second sequence, but uses the second alignment line to
display matches with a "." for identity, or with the mismatched
residue for mismatches. MARKX=2 is useful for aligning large
numbers of similar sequences. MARKX=3 writes out a file of
library sequences in FASTA format. MARKX=3 should always be
used with the "SHOWALL" (-a) option, but this does not com‐
pletely ensure that all of the sequences output will be aligned.
MARKX=4 displays a graph of the alignment of the library
sequence with respect to the query sequence, so that one can
identify the regions of the query sequence that are conserved.
MARKX=10 is used to produce a parseable output format.
-n Forces the query sequence to be treated as a DNA sequence.
-O filename
Send a copy of the results to "filename".
-o Turns off default fasta limited optimization on all of the
sequences in the library with initn scores greater than OPTCUT.
This option is now the reverse of previous versions of fasta.
-Q Quiet option. This allows fasta and tfasta to search a database
and report the results without asking any questions. fasta -Q
file library > output can be put in the background or run at a
later time with the Unix 'at' command. The number of similarity
scores and alignments displayed with the -Q option can be modi‐
fied with the -b (scores) and -d (alignments) options.
-r STATFILE Causes fasta to write out the sequence identifier,
superfamily number (if available), and similarity scores to
STATFILE for every sequence in the library. These results are
not sorted.
-s str (SMATRIX) the filename of an alternative scoring matrix file.
For protein sequences, BLOSUM50 is used by default; PAM250 can
be used with the command line option -s 250.
-v str (LINEVAL) (plfasta only) plfasta and pclfasta can use up to 4
different line styles to denote the scores of local alignments.
The scores that correspond to these line styles can be specified
with the environment variable LINEVAL, or with the -v option. In
either case, a string with three numbers separated by spaces
should be given. This string must be surrounded by double quo‐
tation marks. For example, LINEVAL="200 100 50" tells plfasta
to use solid lines for local alignments with scores greater than
200, long dashed lines for scores between 100 and 200, short
dashed lines for scores between 50 and 100, and dotted lines for
scores less than 50.
plfasta -v "200 100 50"
Normally, the values are 200, 100, and 50 for protein sequence
comparisons and 400, 200, and 100 for DNA sequence comparisons.
-w # (LINLEN) output line length for sequence alignments. (normally
60, can be set up to 200).
-x "offset1 offset2"
Causes fasta/lfasta/plfasta to number the aligned sequences
starting with offset1 and offset2, rather than 1 and
1. This is particularly useful for showing alignments of pro‐
moter regions.
-y # Set the band-width used for optimization. -y 16 is the default
for protein when ktup=2 and for all DNA alignments. -y 32 is
used for protein and ktup=1. For proteins, optimization slows
comparison 2-fold and is highly recommended.
-z Do not do statistical significance calculation. Results are
ranked by the unnormalized opt, initn, or init1 score.
-3 (tfasta, tfastx only) Normally tfasta and tfastx translate
sequences in the DNA sequence library in all six frames. With
the -3 option, only the three forward frames are searched.
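Two of the options above involve simple arithmetic that may be easier to
follow as code. The Python sketch below is illustrative only and is not
taken from the fasta sources: gap_score adds up the -f and -g penalties
for a gap of a given length, and line_style maps an lfasta score to the
plfasta line style implied by a LINEVAL string. Both function names are
hypothetical, and the treatment of a score exactly equal to a threshold
(it falls into the lower band) is an assumption, since the text does not
say.

```python
def gap_score(length: int, first: int = -12, extend: int = -2) -> int:
    """Total penalty for a gap of `length` residues.

    Defaults are the documented protein values for -f and -g;
    pass first=-16, extend=-4 for the documented DNA values.
    """
    if length < 1:
        return 0
    return first + (length - 1) * extend


def line_style(score: int, lineval: str = "200 100 50") -> str:
    """Pick one of the four plfasta line styles for a local alignment score."""
    high, middle, low = (int(value) for value in lineval.split())
    if score > high:
        return "solid"
    if score > middle:  # assumption: a score equal to `high` lands here
        return "long dashed"
    if score > low:
        return "short dashed"
    return "dotted"


if __name__ == "__main__":
    print(gap_score(3))                    # protein gap of three residues
    print(gap_score(3, first=-16, extend=-4))
    print(line_style(150))                 # protein thresholds
    print(line_style(150, "400 200 100"))  # DNA thresholds
```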
EXAMPLES
(1) fasta musplfm.aa $AABANK
Compare the amino acid sequence in the file musplfm.aa with the complete
PIR protein sequence library using ktup = 2. Each "library"
sequence (there need only be one) should start with a comment line
that starts with a '>', e.g.
>LCBO bovine preprolactin
WILLLSQ ...
>LCHU human ...
...
(2) fasta -a -w 80 musplfm.aa lcbo.aa 1
Compare the amino acid sequence in the file musplfm.aa with the
sequences in the file lcbo.aa using ktup = 1. Show both sequences in
their entirety, with 80 residues on each output line.
(3) fasta
Run the fasta program in interactive mode. The program will prompt for
the file name for the query sequence, list alternative libraries to be
searched (if FASTLIBS is set), and prompt for the ktup.
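For scripted, non-interactive searches, the command line can be assembled
programmatically. The Python sketch below only builds the argument list
from options documented above (-Q, -b, -d); the helper name
fasta_quiet_command is hypothetical, and it assumes the fasta executable
is on the search path under that name.

```python
import shlex
from typing import List, Optional


def fasta_quiet_command(
    query: str,
    library: str,
    scores: Optional[int] = None,
    alignments: Optional[int] = None,
    ktup: Optional[int] = None,
) -> List[str]:
    """Build an argument list for a quiet (-Q) fasta search."""
    argv = ["fasta", "-Q"]
    if scores is not None:
        argv += ["-b", str(scores)]      # number of similarity scores shown
    if alignments is not None:
        argv += ["-d", str(alignments)]  # number of alignments shown
    # Options must come before the file names and the optional ktup.
    argv += [query, library]
    if ktup is not None:
        argv.append(str(ktup))
    return argv


if __name__ == "__main__":
    command = fasta_quiet_command(
        "musplfm.aa", "lcbo.aa", scores=200, alignments=50, ktup=1
    )
    print(shlex.join(command))
```

The resulting list could be handed to subprocess.run with its output
redirected to a file, which corresponds to the "fasta -Q file library >
output" usage described under -Q.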
FILES
This version of fasta prompts for the library file to be searched from
a list of file names that are saved in the file pointed to by the envi‐
ronment variable FASTLIBS. If FASTLIBS = fastgb.list, then the file
fastgb.list might have the entries:
NBRF Protein$0P/u/lib/aabank.lib 0
GB Primate$1P@/u/lib/gpri.nam
GB Rodent$1R@/u/lib/grod.nam
GB Mammal$1M@/u/lib/gmammal.nam
Each line in this file has 4 fields: (1) the library name, separated
from the remaining fields by a '$'; (2) a 0 or a 1, indicating a protein
or DNA library respectively; (3) a single letter that will be used to
choose the library; (4) the location of the library file itself. The
library file name can be followed by an optional library format
specifier. fasta recognizes the following library formats: 0 -
Pearson/FASTA; 1 - GenBank flat file; 2 - NBRF/PIR Codata; 3 -
EMBL/SWISS-PROT; 4 - Intelligenetics; 5 - NBRF/PIR VMS. The fourth
field can also contain an '@' character, which indicates that the
library file is an indirect library file containing a list of library
files, one per line. An indirect library file might have the lines:
</usr/slib/genbank (the directory for the library files)
gbpri.seq 1
gbrod.seq 1
gbmam.seq 1
...
gbvrl.seq 1
...
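The four-field layout of a FASTLIBS entry can be made concrete with a
small Python sketch. This is an illustration of the format as described
above, not code from fasta: the LibraryEntry class and
parse_fastlibs_line function are hypothetical names, and reading a
trailing number after the file name as the library format specifier is an
assumption based on the example entries.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LibraryEntry:
    name: str                   # field 1: text before the '$'
    is_dna: bool                # field 2: 0 = protein, 1 = DNA
    key: str                    # field 3: letter used to choose the library
    path: str                   # field 4: library (or indirect) file name
    indirect: bool              # True if the location started with '@'
    format_code: Optional[int]  # optional library format specifier


def parse_fastlibs_line(line: str) -> LibraryEntry:
    """Split one FASTLIBS menu line into its four fields."""
    name, separator, rest = line.rstrip("\n").partition("$")
    if not separator or len(rest) < 3 or rest[0] not in "01":
        raise ValueError(f"not a FASTLIBS entry: {line!r}")
    location = rest[2:].strip()
    indirect = location.startswith("@")
    if indirect:
        location = location[1:]
    # Assumption: anything after the first blank is the format specifier.
    path, _, format_text = location.partition(" ")
    format_code = int(format_text) if format_text.strip() else None
    return LibraryEntry(name, rest[0] == "1", rest[1], path, indirect, format_code)


if __name__ == "__main__":
    for entry_line in (
        "NBRF Protein$0P/u/lib/aabank.lib 0",
        "GB Rodent$1R@/u/lib/grod.nam",
    ):
        print(parse_fastlibs_line(entry_line))
```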
You can use your own sequence files for fasta; just be certain to put a
'>' and comment as the first line before the sequence. Only one
library file type, the standard NBRF library format, is supported by
the VAX/VMS programs. lfasta and plfasta do not require the '>' and
comment line; fasta does.
SEE ALSO
rdf2(1), protcodes(5), dnacodes(5)
AUTHOR
Bill Pearson
wrp@virginia.EDU