EMBOSS: water
The Smith-Waterman algorithm is a member of the class of algorithms that can calculate the best score and local alignment in the order of mn steps (where 'm' and 'n' are the lengths of the two sequences). These dynamic programming algorithms were first developed for protein sequence comparison by Smith and Waterman, though similar methods were independently devised during the late 1960s and early 1970s for use in the fields of speech processing and computer science.
A local alignment searches for regions of local similarity between two sequences and need not include the entire length of the sequences. Local alignment methods are very useful for scanning databases, or in other circumstances when you wish to find matches between small regions of sequences, for example between protein domains.
Dynamic programming methods guarantee the optimal local alignment by exploring all possible alignments and choosing the best. water does this by reading in a scoring matrix that contains a value for every possible residue or nucleotide match. It then finds an alignment with the maximum possible score, where the score of an alignment is equal to the sum of the match values taken from the scoring matrix.
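As a rough illustration of the dynamic programming idea described above, here is a minimal Python sketch of a Smith-Waterman matrix fill and traceback. It is not the code water uses: the names `smith_waterman` and `toy_score` are hypothetical, the scoring function is a toy match/mismatch scheme rather than EBLOSUM62 or EDNAFULL, and a single flat penalty per gap position stands in for the separate opening and extension penalties that water applies.

```python
def toy_score(x, y):
    """Toy scoring function (an assumption, not an EMBOSS matrix)."""
    return 3 if x == y else -1


def smith_waterman(a, b, score, gap=2.0):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Uses a flat per-position gap penalty for brevity.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of any local alignment ending at a[i-1], b[j-1]
    H = [[0.0] * cols for _ in range(rows)]
    best, best_pos = 0.0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + score(a[i - 1], b[j - 1])
            up = H[i - 1][j] - gap      # gap in b
            left = H[i][j - 1] - gap    # gap in a
            H[i][j] = max(0.0, diag, up, left)  # 0 lets an alignment restart
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            out_a.append(a[i - 1])
            out_b.append(".")
            i -= 1
        else:
            out_a.append(".")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


best, al_a, al_b = smith_waterman("GGGACGTACCC", "TTACGTATT", toy_score)
print(best, al_a, al_b)
```

The matrix has (m+1) x (n+1) cells and each cell is filled in constant time, which is where the "order of mn steps" comes from.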
An important problem is the treatment of gaps, i.e., spaces inserted to optimise the alignment score. A penalty is subtracted from the score for each gap opened (the 'gap open' penalty), and a further penalty, equal to the total number of gap spaces multiplied by a per-space cost, is also subtracted (the 'gap extension' penalty).
Typically, the cost of extending a gap is set to be 5-10 times lower than the cost for opening a gap.
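The two penalties combine into a single cost per gap. The sketch below is a hypothetical helper (`gap_penalty` and its `open_covers_first` switch are illustrative names, not EMBOSS ones) showing the two common ways of counting: either the opening penalty already pays for the first gap position and each further position costs one extension, or every position, including the first, is charged an extension on top of the opening penalty. The description above can be read either way, so both are shown.

```python
def gap_penalty(length, gap_open=10.0, gap_extend=0.5, open_covers_first=True):
    """Cost of one gap of `length` positions under an affine scheme.

    open_covers_first=True : cost = gap_open + gap_extend * (length - 1)
    open_covers_first=False: cost = gap_open + gap_extend * length
    Both conventions are assumptions for illustration.
    """
    if length <= 0:
        return 0.0
    extended = length - 1 if open_covers_first else length
    return gap_open + gap_extend * extended


# Total cost of three gaps of lengths 2, 1 and 5 under each convention.
lengths = [2, 1, 5]
total_first = sum(gap_penalty(n) for n in lengths)
total_second = sum(gap_penalty(n, open_covers_first=False) for n in lengths)
print(total_first, total_second)
```

With the defaults of 10.0 and 0.5, those three gaps cost 32.5 in total under the first convention and 34.0 under the second. Because the substitution matrix values are whole numbers, the half-integer score of 293.50 in the sample output below, whose alignment contains gaps of those three lengths, is consistent with the first convention.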

    % water sw:hba_human sw:hbb_human
    Gap opening penalty [10.0]:
    Gap extension penalty [0.5]:
    Output file [hba_human.water]:


    Mandatory qualifiers:
      [-sequencea]         sequence   Sequence USA
      [-seqall]            seqall     Sequence database USA
       -gapopen            float      The gap open penalty is the score taken away
                                      when a gap is created. The best value depends
                                      on the choice of comparison matrix. The
                                      default value assumes you are using the
                                      EBLOSUM62 matrix for protein sequences, and
                                      the EDNAFULL matrix for nucleotide sequences.
       -gapextend          float      The gap extension penalty is added to the
                                      standard gap penalty for each base or residue
                                      in the gap. This is how long gaps are
                                      penalized. Usually you will expect a few long
                                      gaps rather than many short gaps, so the gap
                                      extension penalty should be lower than the
                                      gap penalty. An exception is where one or
                                      both sequences are single reads with possible
                                      sequencing errors in which case you would
                                      expect many single base gaps. You can get
                                      this result by setting the gap open penalty
                                      to zero (or very low) and using the gap
                                      extension penalty to control gap scoring.
      [-outfile]           outfile    Output file name

    Optional qualifiers:
       -datafile           matrixf    Matrix file
       -showinternals      bool       Show debugging information on the internal
                                      state of the search.

    Advanced qualifiers:
       -[no]similarity     bool       Display percent identity and similarity
       -fasta              bool       Output overlap as fasta sequences

    General qualifiers:
       -help               bool       report command line options. More
                                      information on associated and general
                                      qualifiers can be found with -help -verbose

Qualifier | Description | Allowed values | Default |
---|---|---|---|
Mandatory qualifiers | | | |
[-sequencea] (Parameter 1) | Sequence USA | Readable sequence | Required |
[-seqall] (Parameter 2) | Sequence database USA | Readable sequence(s) | Required |
-gapopen | The gap open penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix for protein sequences, and the EDNAFULL matrix for nucleotide sequences. | Number from 1.000 to 100.000 | 10.0 for any sequence |
-gapextend | The gap extension penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors, in which case you would expect many single base gaps. You can get this result by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring. | Number from 0.100 to 10.000 | 0.5 for any sequence |
[-outfile] (Parameter 3) | Output file name | Output file | <sequence>.water |
Optional qualifiers | | | |
-datafile | Matrix file | Comparison matrix file in EMBOSS data path | EBLOSUM62 for protein, EDNAFULL for DNA |
-showinternals | Show debugging information on the internal state of the search. | Yes/No | No |
Advanced qualifiers | | | |
-[no]similarity | Display percent identity and similarity | Yes/No | Yes |
-fasta | Output overlap as fasta sequences | Yes/No | No |
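When water is driven from a script, it can help to check the penalty values against the allowed ranges before launching it. The following Python sketch builds an argument list from the qualifier names and ranges given in the table above. The function `build_water_command` is a hypothetical helper, and the resulting command line has not been checked against any particular EMBOSS release, so treat the exact qualifier spelling as an assumption taken from this page.

```python
def build_water_command(seq_a, seq_b, outfile, gapopen=10.0, gapextend=0.5):
    """Return an argument list for water, validating the penalty ranges.

    Ranges (1.0-100.0 and 0.1-10.0) and qualifier names are copied from
    the qualifier table on this page; nothing is executed here.
    """
    if not 1.0 <= gapopen <= 100.0:
        raise ValueError("gapopen must be between 1.0 and 100.0")
    if not 0.1 <= gapextend <= 10.0:
        raise ValueError("gapextend must be between 0.1 and 10.0")
    return [
        "water", seq_a, seq_b,
        "-gapopen", str(gapopen),
        "-gapextend", str(gapextend),
        "-outfile", outfile,
    ]


cmd = build_water_command("sw:hba_human", "sw:hbb_human", "hba_human.water")
print(" ".join(cmd))
```

The list form can be handed to `subprocess.run(cmd, check=True)` on a machine where EMBOSS is installed; supplying every required value on the command line should avoid the interactive prompts.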

    Local: HBA_HUMAN vs HBB_HUMAN
    Score: 293.50

    HBA_HUMAN   2 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF 46
                  |:| :|: | | ||||  :  | | ||| |: : :| |: :|  |
    HBB_HUMAN   3 LTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESF 45

    HBA_HUMAN  47 .DLS.....HGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD 85
                   |||      |: :|| |||||  | :: :||:|::    : ||:
    HBB_HUMAN  46 GDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSE 90

    HBA_HUMAN  86 LHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLA 130
                  ||  || ||| ||:|| : |:  || |   |||| | |:  | :|
    HBB_HUMAN  91 LHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVA 135

    HBA_HUMAN 131 SVSTVLTSKY 140
                   |:  |  ||
    HBB_HUMAN 136 GVANALAHKY 145

    %id = 45.99              %similarity = 64.23
    Overall %id = 43.15      Overall %similarity = 60.27

The %id is the percentage of identical matches between the two sequences over the reported aligned region.
The %similarity is the percentage of matches between the two sequences over the reported aligned region where the scoring matrix value is greater than or equal to 0.0.
The Overall %id and Overall %similarity are calculated in a similar manner, using the number of matches over the length of the longer of the two sequences.
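The identity figures can be reproduced from the aligned sequences in the sample output. The Python sketch below uses a hypothetical helper, `count_identities`, and assumes that columns containing a gap character are left out of the "reported aligned region"; that assumption is what makes the result agree with the 45.99 shown above. The length of 146 used for the overall figure is also an assumption about the longer sequence, since it is not stated on this page.

```python
def count_identities(aligned_a, aligned_b, gap_chars=".-"):
    """Return (identical, compared) over the columns that contain no gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = compared = 0
    for x, y in zip(aligned_a, aligned_b):
        if x in gap_chars or y in gap_chars:
            continue  # gap columns are skipped (assumption)
        compared += 1
        if x == y:
            identical += 1
    return identical, compared


# Aligned rows copied from the sample output, joined block by block.
hba = ("LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF"
       ".DLS.....HGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD"
       "LHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLA"
       "SVSTVLTSKY")
hbb = ("LTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESF"
       "GDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSE"
       "LHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVA"
       "GVANALAHKY")

identical, compared = count_identities(hba, hbb)
pct_id = round(100.0 * identical / compared, 2)

longer_length = 146  # assumed length of the longer sequence
overall_pct_id = round(100.0 * identical / longer_length, 2)
print(identical, compared, pct_id, overall_pct_id)
```

The %similarity values would be counted the same way, with a scoring matrix lookup in place of the equality test.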
EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.
Users can provide their own data files in their own directories. Project-specific files can be put in the current directory or, for tidier directory listings, in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory or, again, in a subdirectory called ".embossdata".
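The directories are searched in the following order: the current directory, the ".embossdata" subdirectory of the current directory, the user's home directory, and the ".embossdata" subdirectory of the home directory.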
water is for aligning the best matching subsequences of two sequences. It does not necessarily align whole sequences against each other; you should use needle if you wish to align closely related sequences along their whole lengths.
A true Smith-Waterman implementation like water needs memory proportional to the product of the sequence lengths. For two sequences of length 10,000,000 and 1,000 it therefore needs memory proportional to 10,000,000,000 characters. Two arrays of this size are produced, one of ints and one of floats, so multiply that figure by 8 to get the memory usage in bytes (about 80 GB in this example). That does not include other overheads. Therefore, use water and needle only for accurate alignment of reasonably short sequences.
If you run out of memory, try using supermatcher or matcher.
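The memory arithmetic above is easy to check before starting a long run. This Python sketch uses a hypothetical helper, `water_memory_bytes`, and assumes 8 bytes per matrix cell (a 4-byte int plus a 4-byte float, as implied by the "multiply by 8" rule). It ignores all other overheads, so it gives a lower bound rather than an exact figure.

```python
def water_memory_bytes(len_a, len_b, bytes_per_cell=8):
    """Lower-bound estimate of the memory for the two alignment arrays."""
    return len_a * len_b * bytes_per_cell


estimate = water_memory_bytes(10_000_000, 1_000)
print(f"{estimate:,} bytes, about {estimate / 1e9:.0f} GB")

# Two sequences of 1,000 residues each need far less.
small = water_memory_bytes(1_000, 1_000)
print(f"{small:,} bytes, about {small / 1e6:.0f} MB")
```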
Uncaught exception Assertion failed raised at ajmem.c:xxx
This probably means that you have run out of memory. Try using supermatcher or matcher if this happens.
matcher | Finds the best local alignments between two sequences |
seqmatchall | Does an all-against-all comparison of a set of sequences |
supermatcher | Finds a match of a large sequence against one or more sequences |
wordmatch | Finds all exact matches of a given size between 2 sequences |
matcher is a local alignment program that is less rigorous than water and therefore runs more quickly. It may be useful for database searching. supermatcher is designed for local alignments of very large sequences and is even less rigorous in its implementation.
Completed 7th July 1999. Modified 27th July 1999 - tweaking scoring. Modified 22 Oct 2000 - added ID and Similarity scores.