EMBOSS: etandem


Program etandem ( YMBC , NCHC )

Function

Looks for tandem repeats in a nucleotide sequence

Description

etandem looks for tandem repeats in a nucleotide sequence. It is normally run after equicktandem has been used to identify potential repeat sizes. It calculates a consensus for the repeat region and reports a score equal to the number of matches to the consensus minus the number of mismatches.

Input sequences are converted into ACGT or N (so ambiguity codes are ignored).
The score is +1 for a match, -1 for a mismatch.
The first copy of a repeat is ignored.
The highest score is kept for each start position and repeat size.
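
The rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the etandem source: the function names are hypothetical, the consensus is assumed to be the most common base in each column, and the effect of -mismatch on N is not modelled.

```python
from collections import Counter


def normalise(seq):
    """Upper-case the sequence and turn anything other than A, C, G, T into N."""
    return "".join(base if base in "ACGT" else "N" for base in seq.upper())


def score_region(region, size):
    """Return (consensus, score) for a region holding whole copies of a repeat."""
    region = normalise(region)
    copies = [region[i:i + size] for i in range(0, len(region) - size + 1, size)]
    if not copies:
        raise ValueError("region is shorter than one repeat")
    # Assumption: the consensus is the most common base in each column.
    consensus = "".join(
        Counter(copy[col] for copy in copies).most_common(1)[0][0]
        for col in range(size)
    )
    score = 0
    for copy in copies[1:]:  # the first copy of the repeat is ignored
        for base, cons in zip(copy, consensus):
            score += 1 if base == cons else -1  # +1 match, -1 mismatch
    return consensus, score
```

For five perfect copies of a 6-base unit, score_region("ACGTAC" * 5, 6) gives ("ACGTAC", 24); changing one base in the last copy lowers the score to 22.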

The lowest score to be reported is set by the threshold score, which can be set on the command line using the -threshold qualifier; the default is 20. For perfect repeats, the score is the length of the repeat region, not counting the first copy. Reduce the threshold score a little if you wish to allow mismatches. Each mismatch scores -1 instead of +1, so it lowers the score by 2 compared with a perfect match over the same number of bases.
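
As a rough guide to choosing a threshold, the arithmetic above can be written out as follows. The helper name threshold_for is hypothetical; it simply applies the stated rules (first copy ignored, 2 points lost per mismatch) and is not part of EMBOSS.

```python
def threshold_for(copies, size, max_mismatches=0):
    """Lowest score a repeat can have with the given number of copies,
    repeat size and at most max_mismatches mismatched bases."""
    perfect = (copies - 1) * size        # the first copy is not scored
    return perfect - 2 * max_mismatches  # each mismatch costs 2 points
```

Five perfect copies of a 6-base repeat score threshold_for(5, 6) = 24, so they pass the default threshold of 20. Allowing two mismatches gives threshold_for(5, 6, 2) = 20, which is still reported; a third mismatch would need a lower -threshold.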

Running with a wide range of repeat sizes is inefficient. That is why equicktandem was written - to give a rapid estimate of the major repeat sizes.

Usage

Here is a sample session with etandem. The input sequence is the human herpesvirus tandem repeat.

% etandem
Input sequence: embl:hhtetra
Output file [hhtetra.tan]: 
Minimum repeat size [10]: 6
Maximum repeat size [6]: 
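
The same run can be described without prompts by giving every mandatory qualifier on the command line. The sketch below only builds the argument list; the helper name build_etandem_command is hypothetical, it uses only qualifiers listed in this document, and it assumes etandem is installed and on the PATH if the list is actually run.

```python
def build_etandem_command(sequence, outfile, minrepeat, maxrepeat=None, threshold=None):
    """Build an argument list for etandem from the documented qualifiers."""
    if maxrepeat is None:
        maxrepeat = minrepeat  # documented default: same as -minrepeat
    cmd = [
        "etandem",
        "-sequence", sequence,
        "-outfile", outfile,
        "-minrepeat", str(minrepeat),
        "-maxrepeat", str(maxrepeat),
    ]
    if threshold is not None:
        cmd += ["-threshold", str(threshold)]
    return cmd


# To run it (requires an EMBOSS installation):
#   import subprocess
#   subprocess.run(build_etandem_command("embl:hhtetra", "hhtetra.tan", 6), check=True)
```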

Command line arguments

   Mandatory qualifiers:
  [-sequence]          sequence   Sequence USA
  [-outfile]           outfile    Output file name
   -minrepeat          integer    Minimum repeat size
   -maxrepeat          integer    Maximum repeat size

   Optional qualifiers: (none)
   Advanced qualifiers:
   -threshold          integer    Threshold score
   -mismatch           bool       Allow N as a mismatch
   -uniform            bool       Allow uniform consensus

   General qualifiers:
  -help                bool       report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Qualifier                  Description              Allowed values                          Default

Mandatory qualifiers
[-sequence] (Parameter 1)  Sequence USA             Readable sequence                       Required
[-outfile] (Parameter 2)   Output file name         Output file                             <sequence>.etandem
-minrepeat                 Minimum repeat size      Integer, 2 or higher                    10
-maxrepeat                 Maximum repeat size      Integer, same as -minrepeat or higher   Same as -minrepeat

Optional qualifiers
(none)

Advanced qualifiers
-threshold                 Threshold score          Any integer value                       20
-mismatch                  Allow N as a mismatch    Yes/No                                  No
-uniform                   Allow uniform consensus  Yes/No                                  No

Input file format

The input for etandem is a nucleotide sequence.

Output file format

The output from etandem is an uncommented list of identified repeats. In a future version this will change to be annotated sequence features.

The columns of the report show:

  1. Score
  2. Start base position
  3. End base position
  4. Repeat size
  5. Repeat count
  6. Percent identity
  7. Consensus sequence

An example of the output is:


   120        793        936  6  24  93.8 acccta
    90        283        420  6  23  84.8 taaccc
    38        432        485  6   9  90.7 ccctaa
    26        494        529  6   6  94.4 ccctaa
    24        568        597  6   5 100.0 aaccct
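
Because the report is a plain whitespace-separated list, it is easy to read into a script. The following is a minimal sketch, assuming the seven columns appear in the order listed above; the Repeat type and parse_line function are hypothetical names, not part of EMBOSS.

```python
from typing import NamedTuple


class Repeat(NamedTuple):
    score: int
    start: int
    end: int
    size: int
    count: int
    identity: float
    consensus: str


def parse_line(line):
    """Turn one line of etandem output into a Repeat record."""
    f = line.split()
    return Repeat(int(f[0]), int(f[1]), int(f[2]), int(f[3]),
                  int(f[4]), float(f[5]), f[6])


def region_length(rep):
    """Number of bases covered by the repeat region (positions are inclusive)."""
    return rep.end - rep.start + 1
```

For the first example line, parse_line gives a record with start 793, end 936, size 6 and count 24; its region length is 144 bases, which equals size times count.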

Data files

Notes

Running with a wide range of repeat sizes is inefficient. That is why equicktandem was written - to give a rapid estimate of the major repeat sizes.

References

None.

Warnings

None.

Diagnostics

None.

Exit status

None.

Known bugs

None.

See also

einverted      Finds DNA inverted repeats
equicktandem   Finds tandem repeats
palindrome     Looks for inverted repeats in a nucleotide sequence

Authors

This program was originally written by Richard Durbin at the Sanger Centre.

This application was modified for inclusion in EMBOSS by Peter Rice (pmr@sanger.ac.uk) Informatics Division, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

Priority

Completed 25 May 1999

Target

etandem is aimed at automated repeat identification in genomic sequence but can also be used by general users.

Comments