fasta36 man page on DragonFly

fasta36/ssearch36/[t]fast[x,y]36/lalign36    1(local)

NAME
       fasta36 - scan a protein or DNA sequence library for similar sequences

       fastx36	 - compare a DNA sequence to a protein sequence database, com‐
       paring the translated DNA sequence in forward and reverse frames.

       tfastx36	 - compare a protein sequence to a DNA sequence database, cal‐
       culating	 similarities with frameshifts to the forward and reverse ori‐
       entations.

       fasty36	- compare a DNA sequence to a protein sequence database,  com‐
       paring the translated DNA sequence in forward and reverse frames.

       tfasty36	 - compare a protein sequence to a DNA sequence database, cal‐
       culating similarities with frameshifts to the forward and reverse  ori‐
       entations.

       fasts36 - compare unordered peptides to a protein sequence database

       fastm36	-  compare ordered peptides (or short DNA sequences) to a pro‐
       tein (DNA) sequence database

       tfasts36 - compare unordered peptides  to  a  translated	 DNA  sequence
       database

       fastf36 - compare mixed peptides to a protein sequence database

       tfastf36 - compare mixed peptides to a translated DNA sequence database

       ssearch36  -  compare  a protein or DNA sequence to a sequence database
       using the Smith-Waterman algorithm.

       ggsearch36 - compare a protein or DNA sequence to a  sequence  database
       using a global alignment (Needleman-Wunsch)

       glsearch36  -  compare a protein or DNA sequence to a sequence database
       with alignments that are global in the query and local in the  database
       sequence (global-local).

       lalign36 - produce multiple non-overlapping alignments for protein and
       DNA sequences using the Huang and Miller sim implementation of the
       Waterman-Eggert algorithm.

       prss36,	prfx36	-  discontinued;  all the FASTA programs will estimate
       statistical significance using 500  shuffled  sequence  scores  if  two
       sequences are compared.

DESCRIPTION
       Release	3.6  of	 the  FASTA package provides a modular set of sequence
       comparison programs that can run on conventional single processor  com‐
       puters  or  in  parallel on multiprocessor computers. More than a dozen
       programs	    -	  fasta36,     fastx36/tfastx36,     fasty36/tfasty36,
       fasts36/tfasts36, fastm36, fastf36/tfastf36, ssearch36, ggsearch36, and
       glsearch36 - are currently available.

       All the comparison programs share a set of basic command line  options;
       additional options are available for individual comparison functions.

       Threaded versions of the FASTA programs (built by default under
       Unix/Linux/MacOS) run in parallel on modern Linux and Unix multi-core
       or multi-processor computers.  Accelerated versions of the
       Smith-Waterman algorithm are available for Intel SSE2 and Altivec
       PowerPC architectures; they can speed up Smith-Waterman calculations
       10 - 20-fold.

       In addition to the serial and threaded versions of the FASTA  programs,
       MPI  parallel  versions	are  available	as fasta36_mpi, ssearch36_mpi,
       fastx36_mpi, etc. The MPI parallel versions use the same	 command  line
       options as the serial and threaded versions.

Running the FASTA programs
       By  default, the FASTA programs are no longer interactive; they are run
       from the command	 line  by  specifying  the  program,  query.file,  and
       library.file.  Program options must precede the query.file and
       library.file arguments:

     fasta36 -option1 -option2 -option3 query.file library.file > fasta.output

       The "classic" interactive mode, which  prompts  for  a  query.file  and
       library.file,  is  available with the -I option.	 Typing a program name
       without any arguments (ssearch36) provides a short help	message;  pro‐
       gram_name -help provides a complete set of program options.

       Program options MUST precede the query.file and library.file arguments.
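
       The ordering rule above can be sketched in a small Python wrapper.
       This is only an illustration: build_command and run_search are
       hypothetical helper names, and the options shown (-E 0.001, -b =100)
       are examples taken from the option summary, not recommended settings.

```python
import subprocess

def build_command(program, query, library, options=()):
    """Return an argv list with every option placed before the two file names."""
    return [program, *options, query, library]

def run_search(program, query, library, options=(), output="fasta.output"):
    """Run the search and send stdout to a file, like '> fasta.output'."""
    with open(output, "w") as out:
        subprocess.run(build_command(program, query, library, options),
                       stdout=out, check=True)

# Only build and print the command here; run_search() needs fasta36 installed.
cmd = build_command("fasta36", "query.file", "library.file",
                    ["-E", "0.001", "-b", "=100"])
print(" ".join(cmd))
```

       Passing the command as a list avoids shell quoting problems with
       option values such as "=100".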

FASTA program options
       The  default  scoring matrix and gap penalties used by each of the pro‐
       grams have been selected for high sensitivity searches with the various
       algorithms.   The default program behavior can be modified by providing
       command line options before the query.file and library.file  arguments.
       Command line options can also be used in interactive mode.

       Command line arguments come in several classes.

       (1) Commands that specify the comparison type. FASTA, FASTS, FASTM,
       SSEARCH, GGSEARCH, and GLSEARCH can compare either protein or DNA
       sequences, and attempt to recognize the comparison type by looking at
       the residue composition. -n, -p specify DNA (nucleotide) or protein
       comparison, respectively. -U specifies RNA comparison.

       (2) Commands that limit the set of sequences compared: -1, -3, -M.

       (3) Commands that modify the scoring parameters: -f gap-open penalty,
       -g gap-extend penalty, -j inter-codon frameshift, within-codon
       frameshift, -s scoring-matrix, -r match/mismatch score, -x X:X score.

       (4) Commands that modify the algorithm (mostly FASTA and [T]FASTX/Y):
       -c, -w, -y, -o. The -S option can be used to ignore lower-case (low
       complexity) residues during the initial score calculation.

       (5) Commands that modify the output: -A, -b number, -C width, -d
       number, -L, -m 0-11,B, -w line-width, -W context-width, -o
       offset1,offset2

       (6) Commands that affect statistical estimates: -Z, -k.

Option summary:
       -1     Sort by "init1" score (obsolete)

       -3     ([t]fast[x,y] only) use only forward frame translations

       -a     Displays the full length (including unaligned regions) of both
              sequences with fasta36, ssearch36, glsearch36, and fasts36.

       -A (fasta36 only) For DNA:DNA, force Smith-Waterman alignment for
	      output.	Smith-Waterman is the default for FASTA protein align‐
	      ment and [t]fast[x,y], but not for DNA comparisons  with	FASTA.
	      For protein:protein, use band-alignment algorithm.

       -b #   number  of  best scores/descriptions to show (must be < expecta‐
	      tion cutoff if -E is given).  By	default,  this	option	is  no
	      longer used; all scores better than the expectation (E()) cutoff
	      are listed. To guarantee the display of  #  descriptions/scores,
	      use  -b  =#,  i.e.  -b =100 ensures that 100 descriptions/scores
	      will be displayed.  To guarantee at  least  1  description,  but
	      possibly many more (limited by -E e_cut), use -b >1.

       -c "E-opt E-join"
              thresholds for gap joining (E-join) and band optimization
              (E-opt) in FASTA and [T]FASTX/Y.  FASTA36 now uses BLAST-like
              statistical thresholds for joining and band optimization.  The
              default statistical thresholds for protein and translated
              comparisons are E-opt=0.2, E-join=0.5; for DNA, E-join=0.1 and
              E-opt=0.02.  The actual number of joins and optimizations is
              reported after the E-join and E-opt scoring parameters.
              Statistical thresholds improve search speed 2 - 3X and provide
              much more accurate statistical estimates for matrices other
              than BLOSUM50.  The "classic" joining/optimization thresholds
              that were the default in fasta35 and earlier programs are
              available using -c O (upper case O), possibly followed by a
              value > 1.0 to set the optcut optimization threshold.

       -C #   length of name abbreviation in alignments, default = 6.  Must be
	      less than 20.

       -d #   number of best alignments to show (must be < expectation (-E)
              cutoff and <= the -b description limit).

       -D     turn  on	debugging  mode.   Enables checks on sequence alphabet
	      that cause problems  with	 tfastx36,  tfasty36  (only  available
	      after  compile  time option).  Also preserves temp files with -e
	      expand_script.sh option.

       -e expand_script.sh
              Run a script to expand the set of sequences displayed/aligned
              based on the results of the initial search.  When the -e
              expand_script.sh option is used, after the initial scan and
              statistics calculation, but before the "Best scores" are shown,
              expand_script.sh is run with a single argument, the name of a
              file that contains the accession information (the text on the
              fasta description line between > and the first space) and the
              E()-value for the sequence.  expand_script.sh then uses this
              information to send a library of additional sequences to stdout.
	      These  additional	 sequences  are	 included in the list of high-
	      scoring sequences (if their scores are significant) and aligned.
	      The  additional  sequences do not change the statistics or data‐
	      base size.
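
              The expansion step might look like the following Python
              sketch.  Several things here are assumptions, not documented
              behavior: that the hit file holds one whitespace-separated
              "accession E()-value" pair per line, that any executable can
              stand in for expand_script.sh, and that hits are filtered by a
              max_evalue cutoff.  RELATED is a made-up lookup table with a
              placeholder sequence.

```python
import sys

# Hypothetical table: accession -> extra (header, sequence) records to add.
RELATED = {
    "ACC1": [("ACC1_rel1 placeholder relative", "MKTAYIAKQR")],
}

def read_hits(path):
    """Yield (accession, evalue) pairs; assumes two whitespace-separated fields per line."""
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                yield fields[0], float(fields[1])

def expand(hits, related, max_evalue=0.001):
    """Return FASTA text for the relatives of every hit at or below max_evalue."""
    records = []
    for accession, evalue in hits:
        if evalue <= max_evalue:
            for header, sequence in related.get(accession, []):
                records.append(f">{header}\n{sequence}\n")
    return "".join(records)

# When run as a script with one argument (the hit file), write FASTA to stdout.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.stdout.write(expand(read_hits(sys.argv[1]), RELATED))
```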

       -E e_cut e_cut_r
              expectation value upper limit for score and alignment display.
              Defaults are 10.0 for FASTA36 and SSEARCH36 protein searches,
              5.0 for translated DNA/protein comparisons, and 2.0 for DNA/DNA
              searches.  FASTA version 36 now reports additional alignments
              between the query and the library sequence; the second value
              (e_cut_r) sets the threshold for these subsequent alignments.
              If it is not given, the threshold is e_cut/10.0.  If it is
              given and value > 1.0, e_cut_r = e_cut / value; for value <
              1.0, e_cut_r = value.  If e_cut_r < 0, the additional alignment
              option is disabled.
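
              The rules for the second -E value can be written out as a short
              Python sketch.  The function name secondary_threshold is
              hypothetical, and the treatment of a value of exactly 1.0 (used
              directly) is an assumption, since the text above does not cover
              that case.

```python
def secondary_threshold(e_cut, value=None):
    """Return e_cut_r for additional alignments, or None when they are disabled."""
    if value is None:
        return e_cut / 10.0      # no second value given: e_cut/10.0
    if value < 0:
        return None              # negative threshold disables additional alignments
    if value > 1.0:
        return e_cut / value     # values above 1.0 act as a divisor
    return value                 # values below 1.0 are the threshold itself
                                 # (exactly 1.0 handled here too: an assumption)

print(secondary_threshold(10.0))        # default for a protein search
print(secondary_threshold(10.0, 5.0))   # divisor form
print(secondary_threshold(10.0, 0.01))  # literal form
```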

       -f #   penalty for opening a gap.

       -F #   expectation value lower limit for score and alignment display.
              -F 1e-6 prevents library sequences with E()-values lower than
              1e-6 from being displayed. This allows the user to focus on
              more distant relationships.

       -g #   penalty for additional residues in a gap

       -h     Show short help message.

       -help  Show long help message, with all options.

       -H     show histogram (with fasta-36.3.4, the histogram is not shown by
	      default).

       -i     (fasta DNA, [t]fast[x,y]) compare against only the reverse
              complement of the library sequence.

       -I     interactive mode; prompt for query filename, library.

       -j # # ([t]fast[x,y] only) penalty for a frameshift between two codons,
	      ([t]fasty only) penalty for a frameshift within a codon.

       -J     (lalign36 only) show identity alignment.

       -k     specify number of shuffles for statistical parameter  estimation
	      (default=500).

       -l str specify FASTLIBS file

       -L     report  long sequence description in alignments (up to 200 char‐
	      acters).

       -m 0,1,2,3,4,5,6,8,9,10,11,B,BB,"F# out.file" alignment display
	      options.	-m 0, 1, 2, 3 display different types  of  alignments.
	      -m 4 provides an alignment "map" on the query. -m 5 combines the
	      alignment map and a -m 0 alignment.  -m 6 provides an HTML  out‐
	      put.

       -m 8 seeks to mimic BLAST -m 8 tabular output.  Only query and
	      library  sequence	 names,	 and identity, mismatch, starts/stops,
	      E()-values, and bit scores are displayed.	 -m  8C	 mimics	 BLAST
	      tabular  format  with  comment  lines.  -m 8 formats do not show
	      alignments.

       -m 9 does not change the alignment output, but provides
	      alignment coordinate and percent identity information  with  the
	      best scores report.  -m 9c adds encoded alignment information to
	      the -m 9; -m 9C adds encoded alignment information  as  a	 CIGAR
	      formatted string. To accommodate frameshifts, the CIGAR format
	      has been supplemented with F (forward) and R (reverse).	-m  9i
	      provides	only percent identity and alignment length information
	      with the best scores.  With current versions of the  FASTA  pro‐
	      grams,  independent  -m options can be combined; e.g. -m 1 -m 9c
	      -m 6.

       -m 11 provides lav format output from lalign36.	It does not
	      currently affect other alignment	algorithms.   The  lav2ps  and
	      lav2svg  programs	 can  be  used to convert lav format output to
	      postscript/SVG alignment "dot-plots".

       -m B provides BLAST-like alignments.  Alignments are labeled as
	      "Query" and "Sbjct", with coordinates on the same	 line  as  the
	      sequences, and BLAST-like symbols for matches and mismatches. -m
	      BB extends BLAST similarity to all the output, providing an out‐
	      put that closely mimics BLAST output.

       -m "F# out.file" allows one search to write different alignment
              formats to different files.  The 'F' indicates separate file
              output; the '#' is the output format (1-6,8,9,10,11,B,BB).
              Multiple compatible formats can be combined, separated by
              commas (',').
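
              Output written with -m 8 or -m 8C can be post-processed with a
              few lines of Python.  The sketch below assumes the conventional
              12-column, tab-separated BLAST tabular layout (query, subject,
              percent identity, alignment length, mismatches, gap opens,
              query start/end, subject start/end, E()-value, bit score); that
              FASTA emits exactly these columns is an assumption.  The sample
              line is invented for illustration.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    mismatches: int
    gap_opens: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    bit_score: float

def parse_tabular(lines):
    """Parse tab-separated hit lines, skipping blanks and '#' comment lines (-m 8C)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = [field.strip() for field in line.split("\t")]
        if len(f) < 12:
            continue  # not a 12-column hit line; ignore rather than guess
        hits.append(Hit(f[0], f[1], float(f[2]), *map(int, f[3:10]),
                        float(f[10]), float(f[11])))
    return hits

# Invented sample input, not real program output.
sample = [
    "# hypothetical comment line\n",
    "q1\ts1\t98.50\t200\t3\t0\t1\t200\t5\t204\t1e-50\t310.2\n",
]
for hit in parse_tabular(sample):
    print(hit.subject, hit.evalue, hit.bit_score)
```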

       -M #-# molecular weight (residue) cutoffs.  -M "101-200" examines  only
	      library sequences that are 101-200 residues long.

       -n     force query to nucleotide sequence

       -N #   break long library sequences into blocks of # residues.  Useful
              for bacterial genomes, which have only one sequence entry.  -N
              2000 works well for bacterial genomes. (This option was
              required when FASTA only provided one alignment between the
              query and library sequence.  It is not as useful now that
              multiple alignments are available.)

       -o "#,#"
	      offsets query, library sequence for numbering alignments

       -O file
	      send output to file.

       -p     force query to protein alphabet.

       -P pssm_file
	      (ssearch36,  ggsearch36,	glsearch36  only).   Provide  blastpgp
	      checkpoint file as the PSSM for searching. Two PSSM file formats
	      are  available,  which  must  be	provided  with	the  filename.
	      'pssm_file  0'  uses  a  binary format that is machine specific;
	      'pssm_file 1' uses the "blastpgp -u 1 -C pssm_file" ASN.1 binary
	      format (preferred).

       -q/-Q  quiet option; do not prompt for input (on by default)

       -r "+n/-m"
              (DNA only) values for match/mismatch for DNA comparisons. +n is
              used for the maximum positive value and -m is used for the
              maximum negative value.  Values between max and min are
              rescaled, but residue pairs having the value -1 continue to be
              -1.

       -R file
	      save all scores to statistics file (previously -r file)

       -s name
	      specify substitution  matrix.   BLOSUM50	is  used  by  default;
	      PAM250,  PAM120,	and  BLOSUM62  can  be specified by setting -s
	      P120, P250, or BL62.  Additional scoring matrices include:  BLO‐
	      SUM80 (BL80), and MDM10, MDM20, MDM40 (Jones, Taylor, and Thorn‐
	      ton, 1992 CABIOS 8:275-282; specified as -s MD10,	 -s  MD20,  -s
	      MD40),  OPTIMA5  (-s  OPT5,  Kann and Goldstein, (2002) Proteins
	      48:367-376), and VTML160 (-s VT160, Mueller and  Vingron	(2002)
	      J.  Comp.	 Biol.	19:8-13).   Each scoring matrix has associated
	      default gap penalties.  The BLOSUM62 scoring matrix  and	-11/-1
	      gap penalties can be specified with -s BP62.

	      Alternatively, a BLASTP format scoring matrix file can be speci‐
	      fied, e.g. -s matrix.filename.  DNA scoring matrices can also be
	      specified with the "-r" option.

	      With  fasta36.3,	variable  scoring matrices can be specified by
	      preceding the scoring matrix abbreviation with '?', e.g. -s
	      '?BP62'.	Variable  scoring matrices allow the FASTA programs to
	      choose an alternative scoring  matrix  with  higher  information
	      content  (bit  score/position) when short queries are used.  For
	      example, a 90 nucleotide FASTX  query  can  produce  only	 a  30
	      amino-acid  alignment,  so a scoring matrix with 1.33 bits/posi‐
	      tion is required to produce a 40 bit score. The  FASTA  programs
	      include  BLOSUM50	 (0.49	bits/pos) and BLOSUM62 (0.58 bits/pos)
	      but can range to MD10 (3.44 bits/position). The variable scoring
	      matrix option searches down the list of scoring matrices to find
	      one with information content high enough to  produce  a  40  bit
	      alignment score.
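
              The arithmetic behind the variable scoring matrix choice can be
              sketched in Python.  Only the three bits/position values quoted
              above are used, so MATRICES is a partial list; the real
              programs consider more matrices.  The function names, the
              ordering by increasing information content, and the fallback to
              the last matrix when none is sufficient are all assumptions.

```python
TARGET_BITS = 40.0  # bit score the alignment should be able to reach

# (name, bits/position), ordered by increasing information content.
MATRICES = [("BLOSUM50", 0.49), ("BLOSUM62", 0.58), ("MD10", 3.44)]

def required_bits_per_position(aligned_length):
    """Bits/position needed for an alignment of this length to reach TARGET_BITS."""
    return TARGET_BITS / aligned_length

def choose_matrix(aligned_length, matrices=MATRICES):
    """Return the first matrix whose information content is high enough."""
    needed = required_bits_per_position(aligned_length)
    for name, bits in matrices:
        if bits >= needed:
            return name
    return matrices[-1][0]  # assumption: fall back to the most informative matrix

# A 90 nucleotide FASTX query yields at most a 30 amino-acid alignment.
aa_length = 90 // 3
print(round(required_bits_per_position(aa_length), 2), choose_matrix(aa_length))
```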

       -S     treat  lower  case  letters in the query or database as low com‐
	      plexity regions that are equivalent to 'X'  during  the  initial
	      database	scan, but are treated as normal residues for the final
	      alignment display.  Statistical estimates are based on the 'X'ed
	      out  sequence  used during the initial search. Protein databases
	      (and query sequences) can be generated in the appropriate format
	      using John Wootton's "pseg" program, available from
	      ftp://ftp.ncbi.nih.gov/pub/seg/pseg.  Once you have compiled the
	      "pseg" program, use the command:

	      pseg database.fasta -z 1 -q  > database.lc_seg

       -t #   Translation table - [t]fastx36 and [t]fasty36 support the BLAST
              translation tables.  See
              http://www.ncbi.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c/.

       -T #   (threaded,  parallel  only)  number of threads or workers to use
	      (on Linux/MacOS/Unix, the default is to use as  many  processors
	      as are available; on Windows systems, 2 processors are used).

       -U     Do  RNA  sequence	 comparisons: treat 'T' as 'U', allow G:U base
	      pairs (by scoring "G-A" and "T-C" as score(G:G)-3).  Search only
	      one strand.

       -V "?$%*"
	      Allow  special  annotation  characters in query sequence.	 These
	      characters will be displayed in the alignments on the coordinate
	      number line.

       -w # line width for similarity score, sequence alignment, output.

       -W # context length (default is 1/2 of line width -w) for alignment
              programs, like fasta and ssearch, that provide additional
              sequence context.

       -X extended options.  Less used options. Other options include
	      -XB, -XM4G, -Xo, -Xx, and -Xy; see fasta_guide.pdf.

       -z 1, 2, 3, 4, 5, 6
	      Specify the statistical calculation. Default is -z 1  for	 local
	      similarity searches, which uses regression against the length of
	      the library sequence. -z -1 disables statistics.	-z 0 estimates
	      significance  without normalizing for sequence length. -z 2 pro‐
	      vides maximum likelihood estimates for lambda and	 K,  censoring
	      the  250	lowest	and 250 highest scores. -z 3 uses Altschul and
	      Gish's statistical estimates for specific protein BLOSUM scoring
	      matrices	and  gap  penalties.  -z  4,5: an alternate regression
	      method.  -z 6 uses a composition based maximum likelihood	 esti‐
	      mate  based  on  the  method  of	Mott  (1992) Bull. Math. Biol.
	      54:59-75.

       -z 11,12,14,15,16
	      compute the  regression  against	scores	of  randomly  shuffled
	      copies  of the library sequences.	 Twice as many comparisons are
	      performed, but accurate estimates can be	generated  from	 data‐
	      bases  of	 related  sequences.  -z  11  uses the -z 1 regression
	      strategy, etc.

       -z 21, 22, 24, 25, 26
	      compute two E()-values.  The standard (library-based)  E()-value
	      is  calculated  in the standard way (-z 1, 2, etc), but a second
	      E2() value is calculated by shuffling the high-scoring sequences
	      (those  with E()-values less than the threshold).	 For "average"
	      composition  proteins,  these  two  estimates  will  be  similar
	      (though  the  best-shuffle  estimates  are always more conserva‐
	      tive).  For biased composition proteins, the two	estimates  may
	      differ by 100-fold or more.  A second -z option, e.g. -z "21 2",
	      specifies the estimation method for the  best-shuffle  E2()-val‐
	      ues. Best-shuffle E2()-values approximate the estimates given by
	      PRSS (or in a pairwise SSEARCH).
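
              The numbering of the -z values follows a pattern that a short
              Python sketch makes explicit: the last digit selects the
              estimation method and the tens digit selects what is shuffled.
              decode_z is a hypothetical helper, the family labels are
              informal, and only the values listed above are accepted.

```python
BASE = {0, 1, 2, 3, 4, 5, 6}
SHUFFLED_LIBRARY = {11, 12, 14, 15, 16}
BEST_SHUFFLE = {21, 22, 24, 25, 26}

def decode_z(z):
    """Return (family, base_method) for a documented -z value."""
    if z == -1:
        return ("disabled", None)
    if z in BASE:
        return ("library scores", z)
    if z in SHUFFLED_LIBRARY:
        return ("shuffled library", z - 10)
    if z in BEST_SHUFFLE:
        return ("library plus best-shuffle E2()", z - 20)
    raise ValueError(f"-z {z} is not listed in the option summary")

for z in (1, 11, 21, -1):
    print(z, decode_z(z))
```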

       -Z db_size
	      Set the apparent database size used for expectation value calcu‐
	      lations  (used  for  protein/protein  FASTA and SSEARCH, and for
	      [T]FASTX/Y).

Reading sequences from STDIN
       The FASTA programs can accept a query sequence from  the	 unix  "stdin"
       data  stream.   This  makes it much easier to use fasta36 and its rela‐
       tives as part of a WWW page. To indicate that stdin is to be used,  use
       "@" as the query sequence file name.  "@" can also be used to specify a
       subset of the query sequence to be used, e.g.:

     cat query.aa | fasta36 @:50-150 s

       would search the 's' database with residues 50-150 of query.aa.	 FASTA
       cannot  automatically  detect  the  sequence type (protein vs DNA) when
       "stdin" is used and assumes protein comparisons by  default;  the  '-n'
       option is required for DNA for STDIN queries.
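
       Feeding a query through stdin from another program can be sketched
       in Python as follows.  stdin_query, stdin_command and
       search_from_stdin are hypothetical helper names; the sketch only
       relies on the "@" and "@:start-end" query names and the -n option
       described above.

```python
import subprocess

def stdin_query(start=None, end=None):
    """Return '@' or '@:start-end', the query name that selects stdin."""
    if start is None and end is None:
        return "@"
    if start is None or end is None:
        raise ValueError("give both start and end, or neither")
    return f"@:{start}-{end}"

def stdin_command(library, program="fasta36", dna=False, start=None, end=None):
    """Build the argv list; -n is added for DNA because stdin input is assumed protein."""
    cmd = [program]
    if dna:
        cmd.append("-n")
    cmd += [stdin_query(start, end), library]
    return cmd

def search_from_stdin(sequence_text, library, **kwargs):
    """Pipe sequence_text to the program and return its stdout (needs fasta36 installed)."""
    result = subprocess.run(stdin_command(library, **kwargs), input=sequence_text,
                            text=True, capture_output=True, check=True)
    return result.stdout

# Equivalent of: cat query.aa | fasta36 @:50-150 s
print(" ".join(stdin_command("s", start=50, end=150)))
```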

Environment variables:
       FASTLIBS
	      location of library choice file (-l FASTLIBS)

       SRCH_URL1, SRCH_URL2
	      format strings used to define options to re-search the database.

       REF_URL
	      the  format  string  used	 to  define  the  option to lookup the
	      library sequence in entrez, or some other database.

AUTHOR
       Bill Pearson
       wrp@virginia.EDU

       Version: $ Id: $ Revision: $Revision: 210 $

			 fasta36/ssearch36/[t]fast[x,y]36/lalign36    1(local)