EMBOSS: extractseq


Program extractseq

Function

Extract regions from a sequence

Description

extractseq lets you specify one or more regions of a sequence. The sub-sequences from those regions are extracted and joined to build a single contiguous output sequence.

This is modelled on the cell's process of splicing exons together from pre-mRNA, but the program is generally applicable to any cutting, splicing or editing operation on a single sequence.

extractseq reads a sequence and a set of regions of that sequence, specified as pairs of start and end positions (either on the command line or in a file), and writes out the specified regions of the input sequence in the order in which they were given. For example, if the input sequence is "AAAGGGTTT" and the regions "7-9, 3-4" are specified, the output sequence is "TTTAG".
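The behaviour described above can be sketched in a few lines of Python. This is only an illustration of the semantics, not the program's actual source; the function name extract_regions is hypothetical, and the sketch assumes positions are counted from 1 and that both ends of a region are included, which is what the "AAAGGGTTT" example implies.

```python
def extract_regions(sequence, regions):
    """Concatenate 1-based, inclusive (start, end) regions in the order given."""
    pieces = []
    for start, end in regions:
        # Assumption: a region outside the sequence is treated as an error.
        if not 1 <= start <= end <= len(sequence):
            raise ValueError(f"bad region {start}-{end} for length {len(sequence)}")
        # Convert the 1-based inclusive range to a Python slice.
        pieces.append(sequence[start - 1:end])
    return "".join(pieces)


print(extract_regions("AAAGGGTTT", [(7, 9), (3, 4)]))  # prints TTTAG
```

Because the regions are processed in the order given, listing a later region first (as with 7-9 before 3-4) reorders the output accordingly.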

Usage

Extract the region from position 10 to 20:

% extractseq main.seq result.seq -regions '10-20'

Extract the regions 10 to 20, 30 to 45 and 533 to 537:

% extractseq main.seq result2.seq -regions '10-20, 30-45, 533-537'

Command line arguments

   Mandatory qualifiers:
  [-sequence]          seqall     Sequence database USA
  [-outseq]            seqoutall  Output sequence(s) USA
   -regions            range      Regions to extract.
                                  A set of regions is specified by a set of
                                  pairs of positions.
                                  The positions are integers.
                                  They are separated by any non-digit,
                                  non-alpha character.
                                  Examples of region specifications are:
                                  24-45, 56-78
                                  1:45, 67=99;765..888
                                  1,5,8,10,23,45,57,99

   Optional qualifiers:
   -separate           bool       If this is set true then each specified
                                  region is written out as a separate
                                  sequence. The name of the sequence is
                                  created from the name of the original
                                  sequence with the start and end positions of
                                  the range appended with underscore
                                  characters between them, eg: XYZ region 2 to
                                  34 is written as: XYZ_2_34

   Advanced qualifiers: (none)
   General qualifiers:
  -help                bool       report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Qualifier                    Description              Allowed values         Default

Mandatory qualifiers
[-sequence] (Parameter 1)    Sequence database USA    Readable sequence(s)   Required
[-outseq] (Parameter 2)      Output sequence(s) USA   Writeable sequence(s)  <sequence>.format
-regions                     Regions to extract       Sequence range         Whole sequence

Optional qualifiers
-separate                    Write each region as a   Yes/No                 No
                             separate sequence

Advanced qualifiers
(none)
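The region syntax accepted by '-regions' (integers separated by any non-digit, non-alpha character) can be illustrated with a short Python sketch. It is not the EMBOSS parser: the function name parse_regions is hypothetical, and raising an error for letters or for an unpaired position is an assumption about how bad input might be handled.

```python
import re


def parse_regions(spec):
    """Turn a region string such as '24-45, 56-78' into (start, end) pairs."""
    # Assumption: letters are rejected, since separators are described as
    # non-digit, non-alpha characters.
    if re.search(r"[A-Za-z]", spec):
        raise ValueError(f"letters are not valid separators: {spec!r}")
    # Every run of digits is one position; anything else is a separator.
    numbers = [int(n) for n in re.findall(r"\d+", spec)]
    # Assumption: an unpaired trailing position is an error.
    if len(numbers) % 2 != 0:
        raise ValueError(f"unpaired position in: {spec!r}")
    # Pair the positions up as (start, end).
    return list(zip(numbers[0::2], numbers[1::2]))


print(parse_regions("1:45, 67=99;765..888"))  # prints [(1, 45), (67, 99), (765, 888)]
```

Under this reading, the third example in the help text, '1,5,8,10,23,45,57,99', denotes the four regions 1-5, 8-10, 23-45 and 57-99.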

Input file format

Normal sequence.

You can specify a file of ranges to extract by giving the '-regions' qualifier the value '@' followed by the name of the file containing the ranges (e.g. '-regions @myfile').

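The format of the range file is:

Comment lines start with '#'.
Each other line holds two positive integers (the start and end positions), separated by white-space.
A line may start with white-space.
Optional text after the two numbers annotates the line.

An example range file is: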


# this is my set of ranges
12   23
 4   5       this is like 12-23, but smaller
67   10348   interesting region
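A minimal Python sketch of reading such a range file is given here, based only on what the example shows. The function name parse_range_lines is hypothetical, and skipping blank lines and raising an error on a line with fewer than two fields are assumptions rather than documented behaviour.

```python
def parse_range_lines(lines):
    """Return (start, end, annotation) tuples from the lines of a range file."""
    ranges = []
    for line in lines:
        text = line.strip()
        # Skip comment lines; skipping blank lines too is an assumption.
        if not text or text.startswith("#"):
            continue
        # Split off the two positions; the rest of the line is the annotation.
        fields = text.split(None, 2)
        if len(fields) < 2:
            raise ValueError(f"expected two positions in: {line!r}")
        start, end = int(fields[0]), int(fields[1])
        note = fields[2] if len(fields) > 2 else ""
        ranges.append((start, end, note))
    return ranges


example = """# this is my set of ranges
12   23
 4   5       this is like 12-23, but smaller
67   10348   interesting region
"""
for entry in parse_range_lines(example.splitlines()):
    print(entry)
```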

Output file format

The output is a normal sequence file.

For example, the coding regions of em:hsfau1 are joined as:

% extractseq em:hsfau1 -reg "782..856,951..1095,1557..1612,1787..1912" stdout

>HSFAU X65923 H.sapiens fau mRNA
atgcagctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacg
gtcgcccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtc
gtgctcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggag
gccctgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctg
gcccgtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaag
aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg
cccacctttggcaagaagaagggccccaatgccaactcttaa

If the option '-separate' is used then each specified region is written to the output file as a separate sequence. The name of the sequence is created from the name of the original sequence with the start and end positions of the range appended with underscore characters between them.

For example: "XYZ region 2 to 34" is written as: "XYZ_2_34"

To output each of the exons in em:hsfau1 to a separate entry:

% extractseq em:hsfau1 -reg "782..856,951..1095,1557..1612,1787..1912" stdout -separate

>HSFAU1_782_856 H.sapiens fau 1 gene
atgcagctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacg
gtcgcccagatcaag
>HSFAU1_951_1095 H.sapiens fau 1 gene
gctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgctcctggcaggc
gcgcccctggaggatgaggccactctgggccagtgcggggtggaggccctgactaccctg
gaagtagcaggccgcatgcttggag
>HSFAU1_1557_1612 H.sapiens fau 1 gene
gtaaagtccatggttccctggcccgtgctggaaaagtgagaggtcagactcctaag
>HSFAU1_1787_1912 H.sapiens fau 1 gene
gtggccaaacaggagaagaagaagaagaagacaggtcgggctaagcggcggatgcagtac
aaccggcgctttgtcaacgttgtgcccacctttggcaagaagaagggccccaatgccaac
tcttaa
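The naming rule used with '-separate' can be sketched in Python as follows. The function name separate_records is hypothetical and the code only illustrates the rule stated above (name, start and end joined by underscores), assuming 1-based inclusive positions; it is not the program's implementation.

```python
def separate_records(name, sequence, regions):
    """Return one (record_name, sub_sequence) pair per region."""
    records = []
    for start, end in regions:
        # Name is the original name plus "_start_end".
        record_name = f"{name}_{start}_{end}"
        # 1-based inclusive positions converted to a Python slice.
        records.append((record_name, sequence[start - 1:end]))
    return records


for record_name, sub_sequence in separate_records("XYZ", "AAAGGGTTT", [(7, 9), (3, 4)]):
    print(f">{record_name}")
    print(sub_sequence)
```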

Data files

None.

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

Several warning messages may be issued about malformed region specifications.

Exit status

It exits with status 0, unless a region is badly constructed.
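A calling script can test the exit status to detect a badly constructed region. The sketch below is one hedged way to do that from Python. The helper name command_succeeded is hypothetical, and the commented-out command simply reuses the command line from the Usage section; the value of a non-zero status is not specified here, so the code only checks for zero.

```python
import subprocess
import sys


def command_succeeded(command):
    """Run a command and report whether it exited with status 0."""
    completed = subprocess.run(command)
    if completed.returncode != 0:
        # Only "non-zero" is relied upon; the exact value is not documented here.
        print(f"command failed with status {completed.returncode}", file=sys.stderr)
    return completed.returncode == 0


# Hypothetical call, assuming extractseq is installed and main.seq exists:
# command_succeeded(["extractseq", "main.seq", "result.seq", "-regions", "10-20"])
```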

Known bugs

None noted.

See also

cutseq        Removes a specified section from a sequence
degapseq      Removes gap characters from sequences
descseq       Alter the name or description of a sequence
entret        Reads and writes (returns) flatfile entries
infoseq       Displays some simple information about sequences
listor        Writes a list file of the logical OR of two sets of sequences
maskfeat      Mask off features of a sequence
maskseq       Mask off regions of a sequence
newseq        Type in a short new sequence
noreturn      Removes carriage return from ASCII files
notseq        Excludes a set of sequences and writes out the remaining ones
nthseq        Writes one sequence from a multiple set of sequences
pasteseq      Insert one sequence into another
revseq        Reverse and complement a sequence
seqret        Reads and writes (returns) sequences
seqretall     Reads and writes (returns) a set of sequences one at a time
seqretset     Reads and writes (returns) a set of sequences all at once
seqretsplit   Reads and writes (returns) sequences in individual files
splitter      Split a sequence into (overlapping) smaller sequences
swissparse    Retrieves sequences from swissprot using keyword search
trimest       Trim poly-A tails off EST sequences
trimseq       Trim ambiguous bits off the ends of sequences
vectorstrip   Strips out DNA between a pair of vector sequences

Author(s)

This application was written by Gary Williams (gwilliam@hgmp.mrc.ac.uk)

History

Written (2000) - Gary Williams

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments