EMBOSS: matcher
matcher is based on Bill Pearson's 'lalign' application, version 2.0u4 Feb. 1996
Lalign uses code developed by X. Huang and W. Miller (Adv. Appl. Math. (1991) 12:337-357) for the "sim" program, which is a linear-space version of an algorithm described by M. S. Waterman and M. Eggert (J. Mol. Biol. 197:723-728).
Like water, matcher is rigorous, but also very slow. The advantage of matcher is that it uses far less memory than water, so you are much less likely to run out of memory when aligning large sequences.
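To illustrate why a linear-space method needs so little memory, here is a minimal Python sketch. It is not matcher's code and it is not the Huang and Miller "sim" algorithm: it only computes the best local alignment score while keeping two rows of the dynamic-programming matrix, and recovering the alignment itself in linear space needs extra machinery that is not shown. The function name `local_score` and the scoring values (match, mismatch, per-residue gap cost) are made up for illustration.

```python
def local_score(a, b, match=5, mismatch=-4, gap=-6):
    """Best local alignment score, keeping only two rows of the matrix.

    Memory use grows with len(b), not with len(a) * len(b).
    The scoring values are illustrative, not matcher's defaults.
    """
    prev = [0] * (len(b) + 1)   # row for the previous residue of a
    best = 0
    for ca in a:
        curr = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, hence the 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr             # the older row is no longer needed
    return best
```

A full-matrix implementation would store every row in order to trace the alignment back, which is where memory proportional to the product of the sequence lengths comes from.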
matcher can also report a specified number of alignments between the two sequences, showing the actual local alignments (water reports only the single best match). By default 1 alignment is output, but this can be increased to, for example, the 10 best alignments with the '-alternatives 10' command-line qualifier. In some cases, for example multidomain proteins or cDNA and genomic DNA comparisons, there may be many interesting and significant alignments.
    % matcher sw:hba_human sw:hbb_human
    Finds the best local alignments between two sequences
    Output file [hba_human.matcher]:
Here is an example to find the 10 best alignments:
    % matcher sw:hba_human sw:hbb_human -alt 10
    Finds the best local alignments between two sequences
    Output file [hba_human.matcher]: hba_human.matcher10
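When matcher is driven from a script, the same command lines can be assembled programmatically. The sketch below is a hypothetical Python helper (`matcher_command` is not part of EMBOSS). It uses only the positional parameters and the `-alternatives` and `-markx` qualifiers documented on this page, and it only builds the argument list; running it requires an EMBOSS installation with matcher on the PATH.

```python
def matcher_command(seq_a, seq_b, outfile, alternatives=None, markx=None):
    """Build an argument list for a matcher run (hypothetical helper).

    seq_a, seq_b and outfile are the three positional parameters;
    the optional qualifiers are added only when a value is given.
    """
    cmd = ["matcher", seq_a, seq_b, outfile]
    if alternatives is not None:
        cmd += ["-alternatives", str(alternatives)]
    if markx is not None:
        cmd += ["-markx", str(markx)]
    return cmd


# Example: the "10 best alignments" run shown above.
cmd = matcher_command("sw:hba_human", "sw:hbb_human",
                      "hba_human.matcher10", alternatives=10)
# To execute it (assumes EMBOSS is installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with sequence addresses that contain special characters.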
    Mandatory qualifiers:
      [-sequencea]      sequence   Sequence USA
      [-sequenceb]      sequence   Sequence USA
      [-outfile]        outfile    Output file name

    Optional qualifiers:
       -datafile        matrix     Matrix file
       -alternatives    integer    This sets the number of alternative matches
                                   output. By default only the highest scoring
                                   alignment is shown. A value of 2 gives you
                                   other reasonable alignments. In some cases,
                                   for example multidomain proteins or cDNA and
                                   genomic DNA comparisons, there may be other
                                   interesting and significant alignments.
       -gappenalty      integer    The gap penalty is the score taken away when
                                   a gap is created. The best value depends on
                                   the choice of comparison matrix. The default
                                   value is 14 with the EBLOSUM62 matrix for
                                   protein sequences, or 16 with the EDNAFULL
                                   matrix for nucleotide sequences.
       -gaplength       integer    The gap length, or gap extension, penalty is
                                   added to the standard gap penalty for each
                                   base or residue in the gap. This is how long
                                   gaps are penalized. Usually you will expect
                                   a few long gaps rather than many short gaps,
                                   so the gap extension penalty should be lower
                                   than the gap penalty. An exception is where
                                   one or both sequences are single reads with
                                   possible sequencing errors, in which case
                                   you would expect many single base gaps. You
                                   can get this result by setting the gap
                                   penalty to zero (or very low) and using the
                                   gap extension penalty to control gap
                                   scoring.
       -markx           integer    This sets the alternate display of matches
                                   and mismatches in alignments.
                                   -markx=0 uses ':','.',' ', for identities,
                                   conservative replacements, and
                                   non-conservative replacements, respectively.
                                   -markx=1 uses ' ','x', and 'X'.
                                   -markx=2 does not show the second sequence,
                                   but uses the second alignment line to
                                   display matches with a '.' for identity, or
                                   with the mismatched residue for mismatches.
                                   -markx=3 outputs a title line with the
                                   percentage identity and score and then
                                   outputs the gapped sequences in multiple
                                   FASTA format.
                                   -markx=4 outputs only the title line with
                                   the percentage identity and score.
                                   -markx=5, 6, 7, 8 and 9 are the same as
                                   -markx=1.
                                   -markx=10 outputs a parseable output.
       -length          integer    Number of residues per line

    Advanced qualifiers: (none)

    General qualifiers:
       -help            bool       report command line options. More
                                   information on associated and general
                                   qualifiers can be found with -help -verbose
Mandatory qualifiers | Description | Allowed values | Default |
---|---|---|---|
[-sequencea] (Parameter 1) | Sequence USA | Readable sequence | Required |
[-sequenceb] (Parameter 2) | Sequence USA | Readable sequence | Required |
[-outfile] (Parameter 3) | Output file name | Output file | <sequence>.matcher |
Optional qualifiers | Description | Allowed values | Default |
-datafile | Matrix file | Comparison matrix file in EMBOSS data path | EBLOSUM62 for protein, EDNAFULL for DNA |
-alternatives | This sets the number of alternative matches output. By default only the highest scoring alignment is shown. A value of 2 gives you other reasonable alignments. In some cases, for example multidomain proteins or cDNA and genomic DNA comparisons, there may be other interesting and significant alignments. | Integer 1 or more | 1 |
-gappenalty | The gap penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default value is 14 with the EBLOSUM62 matrix for protein sequences, or 16 with the EDNAFULL matrix for nucleotide sequences. | Positive integer | 14 for protein, 16 for nucleic |
-gaplength | The gap length, or gap extension, penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors, in which case you would expect many single base gaps. You can get this result by setting the gap penalty to zero (or very low) and using the gap extension penalty to control gap scoring. | Positive integer | 4 for any sequence |
-markx | This sets the alternate display of matches and mismatches in alignments. -markx=0 uses ':','.',' ', for identities, conservative replacements, and non-conservative replacements, respectively. -markx=1 uses ' ','x', and 'X'. -markx=2 does not show the second sequence, but uses the second alignment line to display matches with a '.' for identity, or with the mismatched residue for mismatches. -markx=3 outputs a title line with the percentage identity and score and then outputs the gapped sequences in multiple FASTA format. -markx=4 outputs only the title line with the percentage identity and score. -markx=5, 6, 7, 8 and 9 are the same as -markx=1. -markx=10 outputs a parseable output. | Integer up to 10 | 0 |
-length | Number of residues per line | Integer from 1 to 200 | 60 |
Advanced qualifiers | Description | Allowed values | Default |
(none) | | | |
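
The interaction between -gappenalty and -gaplength is easier to see with numbers. The Python sketch below assumes the reading given in the -gaplength description, namely that a gap of k residues costs the gap penalty plus k times the gap length penalty. The function name `gap_cost` is hypothetical, and the formula should be checked against the program source if exact scores matter.

```python
def gap_cost(length, gappenalty=14, gaplength=4):
    """Score deducted for one gap of `length` residues.

    Assumes cost = gappenalty + gaplength * length, following the
    -gaplength description; defaults are the protein defaults.
    """
    if length < 1:
        raise ValueError("a gap contains at least one residue")
    return gappenalty + gaplength * length


# With the defaults, one long gap is much cheaper than many short ones.
one_long_gap = gap_cost(10)            # 14 + 4 * 10
ten_short_gaps = 10 * gap_cost(1)      # 10 * (14 + 4)

# With the gap penalty set to zero, as suggested for error-prone single
# reads, ten single-base gaps cost the same as one gap of ten residues.
reads_long = gap_cost(10, gappenalty=0)
reads_short = 10 * gap_cost(1, gappenalty=0)
```

Under this assumption the defaults give 54 for the single long gap against 180 for ten single-residue gaps, which is why the defaults favour a few long gaps.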
Here is the output for the example:
    43.4% identity in 145 HBA_HUMAN overlap; score: 264

                  10        20        30        40         50
    HBA_HU LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH-----GSAQV
           :.: .:. : : ::::  .. : :.::: :... .: :. .:  : :::      :. .:
    HBB_HU LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
                 10          20        30        40        50        60

              60        70        80        90       100       110
    HBA_HU KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA
           :.:::::  :.....::.:.. .....::.::. ::.::: ::.::.. :. .:: :.
    HBB_HU KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
                   70        80        90       100       110       120

             120       130       140
    HBA_HU EFTPAVHASLDKFLASVSTVLTSKY
           :::: :.:. .: .:.:...:. ::
    HBB_HU EFTPPVQAAYQKVVAGVANALAHKY
                  130       140

    ----------
Here is the output for the example giving the 10 best alignments:
    43.4% identity in 145 HBA_HUMAN overlap; score: 264

                  10        20        30        40         50
    HBA_HU LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH-----GSAQV
           :.: .:. : : ::::  .. : :.::: :... .: :. .:  : :::      :. .:
    HBB_HU LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
                 10          20        30        40        50        60

              60        70        80        90       100       110
    HBA_HU KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA
           :.:::::  :.....::.:.. .....::.::. ::.::: ::.::.. :. .:: :.
    HBB_HU KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
                   70        80        90       100       110       120

             120       130       140
    HBA_HU EFTPAVHASLDKFLASVSTVLTSKY
           :::: :.:. .: .:.:...:. ::
    HBB_HU EFTPPVQAAYQKVVAGVANALAHKY
                  130       140

    ----------

    46.2% identity in 13 HBA_HUMAN overlap; score: 32

          60        70
    HBA_HU KKVADALTNAVAH
           .::. ...::.::
    HBB_HU QKVVAGVANALAH
                  140

    ----------

    38.9% identity in 18 HBA_HUMAN overlap; score: 28

          90       100
    HBA_HU KLRVDPVNFKLLSHCLLV
           :..:: :. . :.. :.:
    HBB_HU KVNVDEVGGEALGRLLVV
             20        30

    ----------

    60.0% identity in 10 HBA_HUMAN overlap; score: 23

          10
    HBA_HU VKAAWGKVGA
           :.::. :: :
    HBB_HU VQAAYQKVVA
             130

    ----------

    60.0% identity in 10 HBA_HUMAN overlap; score: 23

          80
    HBA_HU LSALSDLHAH
           :.:.::  ::
    HBB_HU LGAFSDGLAH
            70

    ----------

    29.4% identity in 17 HBA_HUMAN overlap; score: 21

             80        90
    HBA_HU PNALSALSDLHAHKLRV
           :.:. .   . ::  .:
    HBB_HU PDAVMGNPKVKAHGKKV
                   60

    ----------

    41.7% identity in 12 HBA_HUMAN overlap; score: 21

          30        40
    HBA_HU ERMFLSFPTTKT
           .:.: ::   .:
    HBB_HU QRFFESFGDLST
           40        50

    ----------

    50.0% identity in 8 HBA_HUMAN overlap; score: 20

              110
    HBA_HU LLVTLAAH
           .:: . ::
    HBB_HU VLVCVLAH
          110

    ----------

    50.0% identity in 12 HBA_HUMAN overlap; score: 20

                 120
    HBA_HU HLPAEFTPAVHA
           ::  :   :: :
    HBB_HU HLTPEEKSAVTA
                  10

    ----------

    28.6% identity in 21 HBA_HUMAN overlap; score: 19

                10        20
    HBA_HU PADKTNVKAAWGKVGAHAGEY
           :.. .  :.. : ..: : .:
    HBB_HU PVQAAYQKVVAGVANALAHKY
              130       140

    ----------
Here are the example outputs using other values of -markx:
    % matcher sw:hba_human sw:hbb_human stdout -markx 1
    Finds the best local alignments between two sequences

    43.4% identity in 145 HBA_HUMAN overlap; score: 264

                  10        20        30        40         50
    HBA_HU LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH-----GSAQV
            x Xx xX X X      xxX X x   X xxxXx X xXx XX     X      xXx
    HBB_HU LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
                 10          20        30        40        50        60

              60        70        80        90       100       110
    HBA_HU KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA
            x     XX xxxxx  x xxXxxxxx  x  xX  x   X  x  xxX xXx  X xXX
    HBB_HU KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
                   70        80        90       100       110       120

             120       130       140
    HBA_HU EFTPAVHASLDKFLASVSTVLTSKY
               X x xXx Xx x xxx xX
    HBB_HU EFTPPVQAAYQKVVAGVANALAHKY
                  130       140
    % matcher sw:hba_human sw:hbb_human stdout -markx 2
    Finds the best local alignments between two sequences

    43.4% identity in 145 HBA_HUMAN overlap; score: 264

                  10        20        30        40         50
    HBA_HU LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH-----GSAQV
    HBB_HU .T.EE.SA.T.L....--NVD.V.G...G.LLVVY.W.QRF.ES.G...TPDAVM.NPK.

              60        70        80        90       100       110
    HBA_HU KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA
    HBB_HU .A.....LG.FSDGL..L.NLKGTFAT..E..CD..H...E..R..GNV.VCV..H.FGK

             120       130       140
    HBA_HU EFTPAVHASLDKFLASVSTVLTSKY
    HBB_HU ....P.Q.AYQ.VV.G.ANA.AH..
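The -markx=2 display can be reproduced from a pair of gapped sequences. The following Python sketch (the function name `markx2_line` is hypothetical and this is not matcher's code) writes '.' where the two residues are identical and otherwise copies the character from the second sequence, which matches the last block of the example output above.

```python
def markx2_line(first, second):
    """Render the second sequence in -markx=2 style.

    Identical residues become '.', anything else (a mismatched
    residue or a gap character) is copied from the second sequence.
    """
    if len(first) != len(second):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        "." if x == y and x != "-" else y
        for x, y in zip(first, second)
    )


# Last block of the example alignment (HBA_HUMAN vs HBB_HUMAN).
line = markx2_line("EFTPAVHASLDKFLASVSTVLTSKY",
                   "EFTPPVQAAYQKVVAGVANALAHKY")
```

For this block the result is `....P.Q.AYQ.VV.G.ANA.AH..`, as in the example output.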
    % matcher sw:hba_human sw:hbb_human stdout -markx 3
    Finds the best local alignments between two sequences

    43.4% identity in 145 HBA_HUMAN overlap; score: 264
    >HBA_HUMAN ..
    LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH
    -----GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDP
    VNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY
    >HBB_HUMAN ..
    LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLST
    PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
    ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKY
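The gapped sequences from the -markx=3 output are enough to recompute the percentage identity in the title line. The Python sketch below assumes that the percentage is the number of identical columns divided by the overlap length, gap columns included. That assumption is consistent with the example values on this page but has not been checked against the program source, and `percent_identity` is a hypothetical name.

```python
def percent_identity(a, b):
    """Identical columns as a percentage of the alignment length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


# Gapped sequences copied from the -markx=3 example output.
HBA = ("LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH"
       "-----GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDP"
       "VNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY")
HBB = ("LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLST"
       "PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP"
       "ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKY")

identity = round(percent_identity(HBA, HBB), 1)
```

Here 63 of the 145 columns are identical, which rounds to the 43.4% shown in the title line.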
    % matcher sw:hba_human sw:hbb_human stdout -markx 4
    Finds the best local alignments between two sequences

    43.4% identity in 145 HBA_HUMAN overlap; score: 264
    % matcher sw:hba_human sw:hbb_human stdout -markx 10
    Finds the best local alignments between two sequences

    >>#1
    ; sw_score: 264
    ; sw_ident: 0.434
    ; sw_overlap: 145
    >HBA_HUMAN ..
    ; sq_len: -115
    ; al_start: 2
    ; al_stop: 140
    ; al_display_start: 2
    LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH
    -----GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDP
    VNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY
    >HBB_HUMAN ..
    ; sq_len: 5
    ; al_start: 3
    ; al_stop: 145
    ; al_display_start: 3
    LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLST
    PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
    ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKY
Note that the parseable output starts each alignment record with ">>" while each aligned sequence record starts with ">".
All parameters produced will be of the form: ; xx_yyyyy

In this version, we have xx:

    sw - Smith-Waterman scores
    sq - sequence length, type
    al - alignment start, stop, display_offset
All of the output parameters correspond to values that are presented in other output formats, with the exception of the "al_" parameters.
al_start gives the location of the alignment start in the original sequence
al_stop gives the location of the end of the alignment in the original sequence
al_display_start gives the location of the first displayed amino acid residue in the original sequence.

The -markx=10 alignments are the same as those produced in the other modes. If the beginning of the first sequence aligns with the 10th residue of the second sequence, then the first sequence will be padded with ten leading "-" to produce the alignment. The leading "-" are a formatting convenience only; they are not considered in the numbering system for al_display_start, al_start, or al_stop.
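A small parser shows how the ">>", ">" and "; xx_yyyyy" conventions fit together. The Python sketch below is an illustration, not a reference implementation: it assumes each record marker, parameter and sequence chunk sits on its own line, keeps all values as strings, and uses the hypothetical name `parse_markx10`. The sample text is the example output from this page.

```python
def parse_markx10(text):
    """Parse -markx=10 style text into a list of alignment records."""
    records = []
    seq = None                      # sequence record being filled, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">>"):   # start of an alignment record
            records.append({"id": line[2:], "params": {}, "sequences": []})
            seq = None
        elif not records:
            continue                # banner text before the first record
        elif line.startswith(">"):  # start of an aligned sequence record
            seq = {"name": line[1:].split()[0], "params": {}, "residues": ""}
            records[-1]["sequences"].append(seq)
        elif line.startswith(";"):  # "; xx_yyyyy: value" parameter
            key, _, value = line[1:].partition(":")
            target = seq["params"] if seq is not None else records[-1]["params"]
            target[key.strip()] = value.strip()
        elif seq is not None:       # a chunk of gapped sequence
            seq["residues"] += line
    return records


SAMPLE = """\
Finds the best local alignments between two sequences
>>#1
; sw_score: 264
; sw_ident: 0.434
; sw_overlap: 145
>HBA_HUMAN ..
; sq_len: -115
; al_start: 2
; al_stop: 140
; al_display_start: 2
LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH
-----GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDP
VNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY
>HBB_HUMAN ..
; sq_len: 5
; al_start: 3
; al_stop: 145
; al_display_start: 3
LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLST
PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKY
"""

records = parse_markx10(SAMPLE)
```

Each record keeps the "sw_" parameters at the top level and the "sq_" and "al_" parameters with the sequence they describe, mirroring the layout of the output.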
EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.
Users can provide their own data files in their own directories. Project-specific files can be put in the current directory or, for tidier directory listings, in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory, or again in a subdirectory called ".embossdata".
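The directories are searched in the following order:

    . (your current directory)
    .embossdata (under your current directory)
    ~/ (your home directory)
    ~/.embossdata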
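The lookup described above can be sketched in Python as a list of candidate paths. The order used here (current directory, its ".embossdata" subdirectory, home directory, its ".embossdata" subdirectory, then the directory named by EMBOSS_DATA) is an assumption based on the description on this page, not a statement of what the EMBOSS library does internally, and `candidate_data_paths` is a hypothetical helper.

```python
import os
from pathlib import Path


def candidate_data_paths(filename, cwd=None, home=None, env=None):
    """List places where a data file such as EBLOSUM62 might be looked up.

    The search order is an assumption drawn from the text above.
    """
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    home = Path(home) if home is not None else Path.home()
    env = os.environ if env is None else env

    dirs = [cwd, cwd / ".embossdata", home, home / ".embossdata"]
    if env.get("EMBOSS_DATA"):          # standard EMBOSS data directory
        dirs.append(Path(env["EMBOSS_DATA"]))
    return [d / filename for d in dirs]


# Example with made-up directories; the first existing path would be used.
paths = candidate_data_paths(
    "EBLOSUM62",
    cwd="/work/project",
    home="/home/user",
    env={"EMBOSS_DATA": "/usr/share/EMBOSS/data"},
)
```

A caller could then pick the first path for which `Path.exists()` is true.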
Program name | Description |
---|---|
seqmatchall | Does an all-against-all comparison of a set of sequences |
supermatcher | Finds a match of a large sequence against one or more sequences |
water | Smith-Waterman local alignment |
wordmatch | Finds all exact matches of a given size between 2 sequences |
water will give a single best rigorous local alignment. It will use memory of the order of the product of the lengths of the sequences to be aligned. If you wish the 'best' local alignment you should use water. If you run out of memory or want several possible good alignments, use matcher.
This application was modified for inclusion in EMBOSS by Ian Longden (il@sanger.ac.uk) Informatics Division, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.
Completed 11th May 1999. Last modified 19th July 1999.