EMBOSS: matcher


Function

Finds the best local alignments between two sequences

Description

matcher compares two sequences looking for local sequence similarities using a rigorous algorithm.

matcher is based on Bill Pearson's 'lalign' application, version 2.0u4 Feb. 1996.

Lalign uses code developed by X. Huang and W. Miller (Adv. Appl. Math. (1991) 12:337-357) for the "sim" program, which is a linear-space version of an algorithm described by M. S. Waterman and M. Eggert (J. Mol. Biol. (1987) 197:723-728).

Like water, matcher is rigorous, but also very slow. The advantage of matcher is that it uses far less memory than water, so you are much less likely to run out of memory when aligning large sequences.
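
To illustrate why a linear-space method needs far less memory, here is a minimal sketch (not the actual matcher or lalign code) that computes only the best local alignment score while keeping just two rows of the dynamic-programming matrix. The function name, the simple match/mismatch scores and the single per-residue gap penalty are assumptions made for illustration. The real "sim" algorithm also recovers the alignments themselves in linear space, and matcher scores with a substitution matrix plus separate gap and gap-length penalties.

```python
def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b using O(len(b)) memory.

    Illustrative only: hypothetical scoring values, linear gap penalty,
    and no traceback (the alignment itself is not recovered).
    """
    prev = [0] * (len(b) + 1)   # scores for the previous row
    best = 0
    for ca in a:
        curr = [0]              # column 0 is always 0 for local alignment
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap          # gap in b
            left = curr[j - 1] + gap    # gap in a
            score = max(0, diag, up, left)
            curr.append(score)
            best = max(best, score)
        prev = curr             # only two rows are ever held in memory
    return best


if __name__ == "__main__":
    print(local_score("TTACGTTT", "GGACGGG"))
```

A full-matrix implementation would hold len(a) x len(b) cells, which is the memory cost that becomes a problem for large sequences.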

matcher can also report a specified number of the best local alignments between the two sequences. (water will only report the single best match.) By default only 1 alignment is output, but this can be increased to (for example) the 10 best alignments by using the '-alternatives 10' command-line qualifier. In some cases, for example multidomain proteins or cDNA and genomic DNA comparisons, there may be many interesting and significant alignments.

Usage

Here is a sample session with matcher.

% matcher sw:hba_human sw:hbb_human
Finds the best local alignments between two sequences
Output file [hba_human.matcher]: 

Here is an example to find the 10 best alignments:

% matcher sw:hba_human sw:hbb_human -alt 10
Finds the best local alignments between two sequences
Output file [hba_human.matcher]: hba_human.matcher10
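
When matcher is driven from a script rather than typed interactively, the same command line can be assembled programmatically. The sketch below only builds the argument list; the helper name matcher_command is hypothetical, and it assumes the output file can be given as the third positional parameter, as in the "stdout" examples in this document.

```python
def matcher_command(seq_a, seq_b, outfile, alternatives=1):
    """Build an argument list for matcher (hypothetical helper).

    seq_a, seq_b -- sequence USAs, e.g. "sw:hba_human"
    outfile      -- output file name, passed as the third parameter
    alternatives -- number of alignments to report (1 or more)
    """
    if alternatives < 1:
        raise ValueError("alternatives must be 1 or more")
    return ["matcher", seq_a, seq_b, outfile,
            "-alternatives", str(alternatives)]


if __name__ == "__main__":
    cmd = matcher_command("sw:hba_human", "sw:hbb_human",
                          "hba_human.matcher10", alternatives=10)
    print(" ".join(cmd))
    # To actually run it (requires EMBOSS to be installed):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```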

Command line arguments

   Mandatory qualifiers:
  [-sequencea]         sequence   Sequence USA
  [-sequenceb]         sequence   Sequence USA
  [-outfile]           outfile    Output file name

   Optional qualifiers:
   -datafile           matrix     Matrix file
   -alternatives       integer    This sets the number of alternative matches
                                  output. By default only the highest scoring
                                  alignment is shown. A value of 2 gives you
                                  other reasonable alignments. In some cases,
                                  for example multidomain proteins or cDNA and
                                  genomic DNA comparisons, there may be other
                                  interesting and significant alignments.
   -gappenalty         integer    The gap penalty is the score taken away when
                                  a gap is created. The best value depends on
                                  the choice of comparison matrix. The
                                  default value of 14 assumes you are using
                                  the EBLOSUM62 matrix for protein sequences,
                                  or a value of 16 and the EDNAFULL matrix for
                                  nucleotide sequences.
   -gaplength          integer    The gap length, or gap extension, penalty is
                                  added to the standard gap penalty for each
                                  base or residue in the gap. This is how long
                                  gaps are penalized. Usually you will expect
                                  a few long gaps rather than many short
                                  gaps, so the gap extension penalty should be
                                  lower than the gap penalty. An exception is
                                  where one or both sequences are single
                                  reads with possible sequencing errors in
                                  which case you would expect many single base
                                  gaps. You can get this result by setting
                                  the gap penalty to zero (or very low) and
                                  using the gap extension penalty to control
                                  gap scoring.
   -markx              integer    This sets the alternate display of matches
                                  and mismatches in alignments.
                                  -markx=0 uses ':','.',' ', for identities,
                                  conservative replacements, and
                                  non-conservative replacements, respectively.
                                  -markx=1 uses ' ','x', and 'X'.
                                  -markx=2 does not show the second sequence,
                                  but uses the second alignment line to
                                  display matches with a '.' for identity, or
                                  with the mismatched residue for mismatches.
                                  -markx=3 outputs a title line with the
                                  percentage identity and score and then
                                  outputs the gapped sequences in multiple
                                  FASTA format.
                                  -markx=4 outputs only the title line with
                                  the percentage identity and score.
                                  -markx=5,6,7,8 and 9 are the same as
                                  -markx=1
                                  -markx=10 outputs a parseable output.
   -length             integer    Number of residues per line

   Advanced qualifiers: (none)
   General qualifiers:
  -help                bool       report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Mandatory qualifiers

  [-sequencea] (Parameter 1)
      Description:    Sequence USA
      Allowed values: Readable sequence
      Default:        Required

  [-sequenceb] (Parameter 2)
      Description:    Sequence USA
      Allowed values: Readable sequence
      Default:        Required

  [-outfile] (Parameter 3)
      Description:    Output file name
      Allowed values: Output file
      Default:        <sequence>.matcher

Optional qualifiers

  -datafile
      Description:    Matrix file
      Allowed values: Comparison matrix file in EMBOSS data path
      Default:        EBLOSUM62 for protein, EDNAFULL for DNA

  -alternatives
      Description:    This sets the number of alternative matches output. By default only the highest scoring alignment is shown. A value of 2 gives you other reasonable alignments. In some cases, for example multidomain proteins or cDNA and genomic DNA comparisons, there may be other interesting and significant alignments.
      Allowed values: Integer 1 or more
      Default:        1

  -gappenalty
      Description:    The gap penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default value of 14 assumes you are using the EBLOSUM62 matrix for protein sequences, or a value of 16 and the EDNAFULL matrix for nucleotide sequences.
      Allowed values: Positive integer
      Default:        14 for protein, 16 for nucleic

  -gaplength
      Description:    The gap length, or gap extension, penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors in which case you would expect many single base gaps. You can get this result by setting the gap penalty to zero (or very low) and using the gap extension penalty to control gap scoring.
      Allowed values: Positive integer
      Default:        4 for any sequence

  -markx
      Description:    This sets the alternate display of matches and mismatches in alignments. -markx=0 uses ':','.',' ', for identities, conservative replacements, and non-conservative replacements, respectively. -markx=1 uses ' ','x', and 'X'. -markx=2 does not show the second sequence, but uses the second alignment line to display matches with a '.' for identity, or with the mismatched residue for mismatches. -markx=3 outputs a title line with the percentage identity and score and then outputs the gapped sequences in multiple FASTA format. -markx=4 outputs only the title line with the percentage identity and score. -markx=5,6,7,8 and 9 are the same as -markx=1. -markx=10 outputs a parseable output.
      Allowed values: Integer up to 10
      Default:        0

  -length
      Description:    Number of residues per line
      Allowed values: Integer from 1 to 200
      Default:        60

Advanced qualifiers

  (none)
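
The interaction between -gappenalty and -gaplength can be shown with a little arithmetic. The sketch below reads the -gaplength description literally, so a gap of n residues costs the gap penalty plus n times the gap length penalty; whether matcher charges the extension for the first residue in exactly this way is an assumption, and gap_cost is a hypothetical helper, not part of EMBOSS.

```python
def gap_cost(length, gappenalty=14, gaplength=4):
    """Score taken away for one gap of `length` residues.

    Defaults are the documented protein defaults (14 and 4).
    Assumes cost = gappenalty + gaplength * length.
    """
    if length < 1:
        return 0
    return gappenalty + gaplength * length


if __name__ == "__main__":
    # One gap of five residues versus five separate single-residue gaps.
    print("one long gap:    ", gap_cost(5))
    print("five short gaps: ", 5 * gap_cost(1))
    # With the gap penalty set to zero, both cases cost the same,
    # which suits single reads with many one-base sequencing errors.
    print("zero gap penalty:", gap_cost(5, 0, 4), 5 * gap_cost(1, 0, 4))
```

Under these assumptions the default protein penalties make one five-residue gap much cheaper than five single-residue gaps, which is why the defaults favour a few long gaps.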

Input file format

Any 2 sequence USAs of the same type (DNA or protein).

Output file format

The output from matcher is a sequence alignment.

Here is the output for the example:



  43.4% identity in 145 HBA_HUMAN overlap; score:  264

              10        20        30        40         50          
HBA_HU LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH-----GSAQV
       :.: .:. : : ::::  .. : :.::: :... .: :. .:  : :::      :. .:
HBB_HU LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
             10          20        30        40        50        60

          60        70        80        90       100       110     
HBA_HU KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA
       :.:::::  :.....::.:.. .....::.::. ::.::: ::.::.. :. .:: :.  
HBB_HU KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
               70        80        90       100       110       120

         120       130       140
HBA_HU EFTPAVHASLDKFLASVSTVLTSKY
       :::: :.:. .: .:.:...:. ::
HBB_HU EFTPPVQAAYQKVVAGVANALAHKY
              130       140     

----------
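
The title line of this alignment can be checked by hand. The sketch below counts identical columns in the two gapped sequences shown above and divides by the number of alignment columns. Treating the "overlap" figure as the number of alignment columns (gaps included) is an assumption that happens to reproduce the reported values, and percent_identity is a hypothetical helper.

```python
def percent_identity(aligned_a, aligned_b):
    """Return (identical columns, percent identity) for two gapped rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = sum(1 for x, y in zip(aligned_a, aligned_b)
                    if x == y and x != "-")
    return identical, 100.0 * identical / len(aligned_a)


# Gapped sequences copied from the alignment above.
HBA = ("LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH-----GSAQV"
       "KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA"
       "EFTPAVHASLDKFLASVSTVLTSKY")
HBB = ("LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV"
       "KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK"
       "EFTPPVQAAYQKVVAGVANALAHKY")

if __name__ == "__main__":
    count, pct = percent_identity(HBA, HBB)
    print(f"{pct:.1f}% identity in {len(HBA)} overlap ({count} identities)")
```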

Here is the output for the example giving the 10 best alignments:



  43.4% identity in 145 HBA_HUMAN overlap; score:  264

              10        20        30        40         50          
HBA_HU LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH-----GSAQV
       :.: .:. : : ::::  .. : :.::: :... .: :. .:  : :::      :. .:
HBB_HU LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
             10          20        30        40        50        60

          60        70        80        90       100       110     
HBA_HU KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA
       :.:::::  :.....::.:.. .....::.::. ::.::: ::.::.. :. .:: :.  
HBB_HU KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
               70        80        90       100       110       120

         120       130       140
HBA_HU EFTPAVHASLDKFLASVSTVLTSKY
       :::: :.:. .: .:.:...:. ::
HBB_HU EFTPPVQAAYQKVVAGVANALAHKY
              130       140     

----------

  46.2% identity in 13 HBA_HUMAN overlap; score:   32

      60        70  
HBA_HU KKVADALTNAVAH
       .::. ...::.::
HBB_HU QKVVAGVANALAH
              140   

----------

  38.9% identity in 18 HBA_HUMAN overlap; score:   28

      90       100       
HBA_HU KLRVDPVNFKLLSHCLLV
       :..:: :. . :.. :.:
HBB_HU KVNVDEVGGEALGRLLVV
         20        30    

----------

  60.0% identity in 10 HBA_HUMAN overlap; score:   23

      10         
HBA_HU VKAAWGKVGA
       :.::. :: :
HBB_HU VQAAYQKVVA
         130     

----------

  60.0% identity in 10 HBA_HUMAN overlap; score:   23

      80         
HBA_HU LSALSDLHAH
       :.:.::  ::
HBB_HU LGAFSDGLAH
        70       

----------

  29.4% identity in 17 HBA_HUMAN overlap; score:   21

         80        90   
HBA_HU PNALSALSDLHAHKLRV
       :.:. .   . ::  .:
HBB_HU PDAVMGNPKVKAHGKKV
               60       

----------

  41.7% identity in 12 HBA_HUMAN overlap; score:   21

      30        40 
HBA_HU ERMFLSFPTTKT
       .:.: ::   .:
HBB_HU QRFFESFGDLST
       40        50

----------

  50.0% identity in 8 HBA_HUMAN overlap; score:   20

          110  
HBA_HU LLVTLAAH
       .:: . ::
HBB_HU VLVCVLAH
      110      

----------

  50.0% identity in 12 HBA_HUMAN overlap; score:   20

             120   
HBA_HU HLPAEFTPAVHA
       ::  :   :: :
HBB_HU HLTPEEKSAVTA
              10   

----------

  28.6% identity in 21 HBA_HUMAN overlap; score:   19

            10        20    
HBA_HU PADKTNVKAAWGKVGAHAGEY
       :.. .  :.. : ..: : .:
HBB_HU PVQAAYQKVVAGVANALAHKY
          130       140     

----------

Here are the example outputs using other values of -markx:


% matcher sw:hba_human sw:hbb_human stdout -markx 1
Finds the best local alignments between two sequences

  43.4% identity in 145 HBA_HUMAN overlap; score:  264

              10        20        30        40         50          
HBA_HU LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH-----GSAQV
        x Xx xX X X      xxX X x   X xxxXx X xXx XX     X      xXx 
HBB_HU LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
             10          20        30        40        50        60

          60        70        80        90       100       110     
HBA_HU KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA
        x     XX xxxxx  x xxXxxxxx  x  xX  x   X  x  xxX xXx  X xXX
HBB_HU KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
               70        80        90       100       110       120

         120       130       140
HBA_HU EFTPAVHASLDKFLASVSTVLTSKY
           X x xXx Xx x xxx xX  
HBB_HU EFTPPVQAAYQKVVAGVANALAHKY
              130       140     


% matcher sw:hba_human sw:hbb_human stdout -markx 2
Finds the best local alignments between two sequences

  43.4% identity in 145 HBA_HUMAN overlap; score:  264

              10        20        30        40         50          
HBA_HU LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH-----GSAQV
HBB_HU .T.EE.SA.T.L....--NVD.V.G...G.LLVVY.W.QRF.ES.G...TPDAVM.NPK.

          60        70        80        90       100       110     
HBA_HU KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA
HBB_HU .A.....LG.FSDGL..L.NLKGTFAT..E..CD..H...E..R..GNV.VCV..H.FGK

         120       130       140
HBA_HU EFTPAVHASLDKFLASVSTVLTSKY
HBB_HU ....P.Q.AYQ.VV.G.ANA.AH..
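
The -markx=2 display can be reproduced from the two gapped sequences. This is a minimal sketch of the rule described for -markx=2 ('.' for an identity, otherwise the residue or gap character from the second sequence); markx2_line is a hypothetical helper, not EMBOSS code.

```python
def markx2_line(aligned_a, aligned_b):
    """Render the second sequence in -markx=2 style against the first."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    return "".join("." if x == y and x != "-" else y
                   for x, y in zip(aligned_a, aligned_b))


if __name__ == "__main__":
    # First row of the alignment shown above.
    hba = "LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH-----GSAQV"
    hbb = "LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV"
    print(markx2_line(hba, hbb))
```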


% matcher sw:hba_human sw:hbb_human stdout -markx 3
Finds the best local alignments between two sequences

  43.4% identity in 145 HBA_HUMAN overlap; score:  264
>HBA_HUMAN ..
LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH
-----GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDP
VNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY
>HBB_HUMAN ..
LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLST
PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKY


% matcher sw:hba_human sw:hbb_human stdout -markx 4
Finds the best local alignments between two sequences

  43.4% identity in 145 HBA_HUMAN overlap; score:  264


% matcher sw:hba_human sw:hbb_human stdout -markx 10
Finds the best local alignments between two sequences
>>#1
; sw_score: 264
; sw_ident: 0.434
; sw_overlap: 145
>HBA_HUMAN ..
; sq_len: -115
; al_start: 2
; al_stop: 140
; al_display_start: 2
LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH
-----GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDP
VNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY
>HBB_HUMAN ..
; sq_len: 5
; al_start: 3
; al_stop: 145
; al_display_start: 3
LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLST
PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKY

Note that the parseable output starts each alignment record with ">>" while each aligned sequence record starts with ">".

All parameters produced will be of the form: ; xx_yyyyy

In this version, xx is one of:

  sw - Smith-Waterman scores
  sq - sequence length, type
  al - alignment start, stop, display_offset

All of the output parameters correspond to values that are presented in other output formats, with the exception of the "al_" parameters.

al_start gives the location of the alignment start in the original sequence

al_stop gives the location of the end of the alignment in the original sequence

al_display_start gives the location of the first displayed amino acid residue in the original sequence.

The -markx=10 alignments are the same as those produced in the other modes. If the beginning of the first sequence aligns with the 10th residue of the second sequence, then the first sequence will be padded with ten leading "-" to produce the alignment. The leading "-" are a formatting convenience only; they are not considered in the numbering system for al_display_start, al_start, or al_stop.
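
As a rough illustration of how the parseable format might be consumed, the sketch below reads -markx=10 text into nested dictionaries, using only the conventions described above: ">>" starts an alignment record, ">" starts a sequence record, and "; xx_yyyyy: value" lines are parameters. The function name and the dictionary layout are assumptions, and values are kept as strings rather than interpreted.

```python
def parse_markx10(text):
    """Parse -markx=10 output into a list of alignment records."""
    records, record, seq = [], None, None
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        if line.startswith(">>"):            # new alignment record
            record = {"id": line[2:], "params": {}, "sequences": []}
            records.append(record)
            seq = None
        elif record is None:                 # text before the first record
            continue
        elif line.startswith(">"):           # new aligned sequence record
            seq = {"name": line[1:].split()[0], "params": {}, "residues": ""}
            record["sequences"].append(seq)
        elif line.startswith(";"):           # "; xx_yyyyy: value"
            key, _, value = line[1:].partition(":")
            target = seq if seq is not None else record
            target["params"][key.strip()] = value.strip()
        elif seq is not None:                # gapped sequence line
            seq["residues"] += line
    return records


# Excerpt of the example output above.
SAMPLE = """>>#1
; sw_score: 264
; sw_ident: 0.434
; sw_overlap: 145
>HBA_HUMAN ..
; al_start: 2
; al_stop: 140
LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH
-----GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDP
VNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY
>HBB_HUMAN ..
; al_start: 3
; al_stop: 145
LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLST
PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKY
"""

if __name__ == "__main__":
    for rec in parse_markx10(SAMPLE):
        print(rec["id"], rec["params"]["sw_score"],
              [s["name"] for s in rec["sequences"]])
```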

Data files

For protein sequences EBLOSUM62 is used for the substitution matrix. For nucleotide sequence, EDNAFULL is used.

EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by EMBOSS environment variable EMBOSS_DATA.

Users can provide their own data files in their own directories. Project specific files can be put in the current directory, or for tidier directory listings in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory, or again in a subdirectory called ".embossdata".

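The directories are searched in the following order:

  1. . (your current directory)
  2. .embossdata (under your current directory)
  3. ~/ (your home directory)
  4. ~/.embossdata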
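
As a rough sketch of such a lookup, the function below checks the current directory, its ".embossdata" subdirectory, the home directory and its ".embossdata" subdirectory, and finally the directory named by EMBOSS_DATA. That order, and falling back to EMBOSS_DATA last, are assumptions made for illustration; find_data_file is a hypothetical helper and not how EMBOSS itself is implemented.

```python
import os
from pathlib import Path


def find_data_file(name, cwd=None, home=None, emboss_data=None):
    """Return the first existing path for a data file, or None.

    Assumed search order: cwd, cwd/.embossdata, home, home/.embossdata,
    then the EMBOSS_DATA directory (argument or environment variable).
    """
    cwd = Path(cwd) if cwd else Path.cwd()
    home = Path(home) if home else Path.home()
    candidates = [cwd, cwd / ".embossdata", home, home / ".embossdata"]
    if emboss_data is None:
        emboss_data = os.environ.get("EMBOSS_DATA")
    if emboss_data:
        candidates.append(Path(emboss_data))
    for directory in candidates:
        path = directory / name
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    print(find_data_file("EBLOSUM62"))
```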

Notes

None.

References

  1. X. Huang and W. Miller (1991) Adv. Appl. Math. 12:337-357
  2. M. S. Waterman and M. Eggert (1987) J. Mol. Biol. 197:723-728

Warnings

None.

Diagnostic Error Messages

None.

Exit status

0 upon successful completion.

Known bugs

None.

See also

seqmatchall    Does an all-against-all comparison of a set of sequences
supermatcher   Finds a match of a large sequence against one or more sequences
water          Smith-Waterman local alignment
wordmatch      Finds all exact matches of a given size between 2 sequences

water will give a single best rigorous local alignment. It will use memory of the order of the product of the lengths of the sequences to be aligned. If you wish the 'best' local alignment you should use water. If you run out of memory or want several possible good alignments, use matcher.

Author(s)

This program was originally written by Bill Pearson as part of the FASTA package under the name 'lalign'.

This application was modified for inclusion in EMBOSS by Ian Longden (il@sanger.ac.uk) Informatics Division, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

History

 Completed 11th May 1999.
 Last modified 19th July 1999.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments