![]() |
EMBOSS: est2genome |
Unless instructed otherwise, the program makes three alignments: First it compares both stands of the spliced sequence against the forward strand of the genomic, assuming the splice consensus GT/AG (ie in the forward gene direction). The maximum-scoring orientation is then realigned assuming the splice consensus CT/AC (ie in the reversed gene direction). Only the overall maximum-scoring alignment is reported.
The program outputs a list of the exons and introns it has found. The format is like that of MSPcrunch, ie a list of matching segments. This format is easy to parse into other software. The program also indicates, based on the splice site information, the gene's predicted direction of transcription. Optionally the full sequence alignment is printed as well (see the example).
% est2genome Align EST and genomic DNA sequences EST sequence(s): embl:hs989235 Genomic sequence: embl:hsnfg9 Output file [hs989235.est2genome]:
Mandatory qualifiers: [-est] seqall EST sequence(s) [-genome] sequence Genomic sequence -outfile outfile Output file name Optional qualifiers: -match integer Score for matching two bases -mismatch integer Cost for mismatching two bases -gappenalty integer Cost for deleting a single base in either sequence, excluding introns -intronpenalty integer Cost for an intron, independent of length. -splicepenalty integer Cost for an intron, independent of length and starting/ending on donor-acceptor sites -minscore integer You can exclude alignments with scores below a threshold by setting this to be false. Advanced qualifiers: -reverse bool Reverse the orientation of the EST sequence -[no]splice bool Use donor and acceptor splice sites. If you want to ignore donor-acceptor sites then set this to be false. -align bool Show the alignment. The alignment includes the first and last 5 bases of each intron, together with the intron width. The direction of splicing is indicated by angle brackets (forward or reverse) or ???? (unknown). -width integer Alignment width -mode string This determines the comparion mode. The default value is 'both', in which case both strands of the est are compared assuming a forward gene direction (ie GT/AG splice sites), and the best comparsion redone assuming a reversed (CT/AC) gene splicing direction. The other allowed modes are 'forward', when just the forward strand is searched, and 'reverse', ditto for the reverse strand. -[no]best bool You can print out all comparisons instead of just the best one by setting this to be false. -space float for linear-space recursion. If product of sequence lengths divided by 4 exceeds this then a divide-and-conquer strategy is used to control the memory requirements. In this way very long sequences can be aligned. If you have a machine with plenty of memory you can raise this parameter (but do not exceed the machine's physical RAM) -shuffle integer Shuffle -seed integer Random number seed General qualifiers: -help bool report command line options. More information on associated and general qualifiers can be found with -help -verbose |
Mandatory qualifiers | Allowed values | Default | |
---|---|---|---|
[-est] (Parameter 1) |
EST sequence(s) | Readable sequence(s) | Required |
[-genome] (Parameter 2) |
Genomic sequence | Readable sequence | Required |
-outfile | Output file name | Output file | <sequence>.est2genome |
Optional qualifiers | Allowed values | Default | |
-match | Score for matching two bases | Any integer value | 1 |
-mismatch | Cost for mismatching two bases | Any integer value | 1 |
-gappenalty | Cost for deleting a single base in either sequence, excluding introns | Any integer value | 2 |
-intronpenalty | Cost for an intron, independent of length. | Any integer value | 40 |
-splicepenalty | Cost for an intron, independent of length and starting/ending on donor-acceptor sites | Any integer value | 20 |
-minscore | You can exclude alignments with scores below a threshold by setting this to be false. | Any integer value | 30 |
Advanced qualifiers | Allowed values | Default | |
-reverse | Reverse the orientation of the EST sequence | Yes/No | No |
-[no]splice | Use donor and acceptor splice sites. If you want to ignore donor-acceptor sites then set this to be false. | Yes/No | Yes |
-align | Show the alignment. The alignment includes the first and last 5 bases of each intron, together with the intron width. The direction of splicing is indicated by angle brackets (forward or reverse) or ???? (unknown). | Yes/No | No |
-width | Alignment width | Any integer value | 50 |
-mode | This determines the comparion mode. The default value is 'both', in which case both strands of the est are compared assuming a forward gene direction (ie GT/AG splice sites), and the best comparsion redone assuming a reversed (CT/AC) gene splicing direction. The other allowed modes are 'forward', when just the forward strand is searched, and 'reverse', ditto for the reverse strand. | Any string is accepted | both |
-[no]best | You can print out all comparisons instead of just the best one by setting this to be false. | Yes/No | Yes |
-space | for linear-space recursion. If product of sequence lengths divided by 4 exceeds this then a divide-and-conquer strategy is used to control the memory requirements. In this way very long sequences can be aligned. If you have a machine with plenty of memory you can raise this parameter (but do not exceed the machine's physical RAM) | Any integer value | 10.0 |
-shuffle | Shuffle | Any integer value | 0 |
-seed | Random number seed | Any integer value | 20825 |
Type score % gstart gstop genome estart estop EST EST doc Note Best alignment is between forward est and forward genome, but splice sites imply REVERSED GENE Exon 163 91.8 25685 25874 HSNFG9 1 193 HS989235 yo13c02.s1 Soares adult brain N2b5HB55Y ... -Intron -20 0.0 25875 26278 HSNFG9 Exon 207 98.1 26279 26492 HSNFG9 194 407 HS989235 yo13c02.s1 Soares adult brain N2b5HB55Y ... -Intron -20 0.0 26493 27390 HSNFG9 Exon 63 86.4 27391 27476 HSNFG9 408 494 HS989235 yo13c02.s1 Soares adult brain N2b5HB55Y ... Span 393 93.6 25685 27476 HSNFG9 1 494 HS989235 yo13c02.s1 Soares adult brain N2b5HB55Y ... Segment 14 83.3 25685 25702 HSNFG9 1 18 HS989235 yo13c02.s1 Soares adult brain N2b5HB55Y ... Segment 28 85.7 25703 25737 HSNFG9 20 54 HS989235 yo13c02.s1 Soares adult brain N2b5HB55Y ... Segment 4 100.0 25738 25741 HSNFG9 56 59 HS989235 yo13c02.s1 Soares adult brain N2b5HB55Y ... Segment 13 100.0 25742 25754 HSNFG9 61 73 HS989235 yo13c02.s1 Soares adult brain N2b5HB55Y ... Segment 4 100.0 25756 25759 HSNFG9 74 77 HS989235 yo13c02.s1 Soares adult brain N2b5HB55Y ... Segment 110 97.4 25760 25874 HSNFG9 79 193 HS989235 yo13c02.s1 Soares adult brain N2b5HB55Y ... Segment 37 100.0 26279 26315 HSNFG9 194 230 HS989235 yo13c02.s1 Soares adult brain N2b5HB55Y ... Segment 162 98.8 26317 26480 HSNFG9 231 394 HS989235 yo13c02.s1 Soares adult brain N2b5HB55Y ... Segment 12 100.0 26481 26492 HSNFG9 396 407 HS989235 yo13c02.s1 Soares adult brain N2b5HB55Y ... Segment 16 100.0 27391 27406 HSNFG9 408 423 HS989235 yo13c02.s1 Soares adult brain N2b5HB55Y ... Segment 10 91.7 27407 27418 HSNFG9 425 436 HS989235 yo13c02.s1 Soares adult brain N2b5HB55Y ... Segment 19 95.2 27419 27439 HSNFG9 438 458 HS989235 yo13c02.s1 Soares adult brain N2b5HB55Y ... Segment 24 80.6 27441 27476 HSNFG9 459 494 HS989235 yo13c02.s1 Soares adult brain N2b5HB55Y ...
The score for Exon segments is the alignment score excluding flanking intron penalties. The Span score is the total including the intron costs.
The coordinates of the genomic sequence always refer to the positive strand, but are swapped if the est has been reversed. The splice direction of Introns are indicated as +Intron (forward, splice sites GT/AG) or -Intron (reverse, splice sites CT/AC), or ?Intron (unknown direction). Segment entries give the alignment as a series of ungapped matching segments.
HSNFG9 vs HS989235: HSNFG9 25685 cccggaagctcatcttgg-ccaccgactctcgcttgcgccgccgcgggag 25733 || | ||||||| ||||| ||||||||||||| || ||||||||||| HS989235 1 ccggnaagctcancttggaccaccgactctcgantgnntcgccgcgggag 50 HSNFG9 25734 ccgg-tgga-aacctgagcgggagctgg-agaaggagcagagggaggcag 25780 |||| |||| ||||||||||||| |||| ||||||||||||||||||||| HS989235 51 ccggntgganaacctgagcggga-ctggnagaaggagcagagggaggcag 99 HSNFG9 25781 cacccggcgtgacgggagtgtgtggggcactcaggccttccgcagtgtca 25830 ||||||||||||||| |||||||||||||||||||||||||||||||||| HS989235 100 cacccggcgtgacggnagtgtgtggggcactcaggccttccgcagtgtca 149 HSNFG9 25831 tctgccacacggaaggcacggccacgggccagggggtctatgatctgga. 25874 ||||||||||||||||||||||||||||| |||||||||||||<<<<< HS989235 150 tctgccacacggaaggcacggccacgggcaggggggtctatgat...... 193 HSNFG9 25874 ....cataccttctgcatgcccagctggcatggccccacgtagagtgggg 26319 404 <<<<<||||||||||||||||||||||||||||||||||||| || HS989235 193 .........cttctgcatgcccagctggcatggccccacgtagagt-ggn 233 HSNFG9 26320 gtggcgtctcggtgctggtcagcgacacgttgtcctggctgggcaggtcc 26369 ||||||||||||||||||||||||||||||||||||||||||||||||| HS989235 234 ntggcgtctcggtgctggtcagcgacacgttgtcctggctgggcaggtcc 283 HSNFG9 26370 agctcccggaggacctggggcttcagcttcccgtagcgctggctgcagtg 26419 |||||||||||||||||||||||||||||||||||||||||||||||||| HS989235 284 agctcccggaggacctggggcttcagcttcccgtagcgctggctgcagtg 333 HSNFG9 26420 acggatgctcttgcgctgccatttctgggtgctgtcactgtccttgctca 26469 |||||||||||||||||||||||||||||||||||||||||||||||||| HS989235 334 acggatgctcttgcgctgccatttctgggtgctgtcactgtccttgctca 383 HSNFG9 26470 ctccaaaccag-tcggcggtccccctggc.....ggtacctgcggatggt 27401 ||||||||||| ||||||||||||<<<<< 898 <<<<<||||||||||| HS989235 384 ctccaaaccagttcggcggtcccc...............ctgcggatggt 418 HSNFG9 27402 ctgtg-tgatggacgtct-ggcgttgcagcaccggccgccggagctcatg 27449 ||||| |||||||||| | ||| ||||||||||||||||| ||| ||||| HS989235 419 ctgtgttgatggacgtttgggctttgcagcaccggccgcc-gagttcatg 467 HSNFG9 27450 gtggggtgaagagatgtgggctgtctc 27476 || |||| ||||||| |||| | | || HS989235 468 gtngggtnaagagatttgggttttttc 494 Alignment Score: 393In this example there are two reverse introns, of lengths 404 and 898 bp.
parameter default description match 1 score for matching two bases mismatch 1 cost for mismatching two bases gap_penalty 2 cost for deleting a single base in either sequence, excluding introns intron_penalty 40 cost for an intron, independent of length. splice_penalty 20 cost for an intron, independent of length and starting/ending on donor-acceptor sites. space 10 Space threshold (in megabytes) for linear-space recursion. If the product of the two sequence lengths divided by 4 exceeds this then a divide-and-conquer strategy is used to control the memory requirements. In this way very long sequences can be aligned. If you have a machine with plenty of memory you can raise this parameter (but do not exceed the machine's physical RAM) However, normally you should not need to change this parameter.There is no gap initiation cost for short gaps, just a penalty proportional to the length of the gap. Thus the cost of inserting a gap of length L in the EST is
L*gap_penaltyand the cost in the genome is
min { L*gap_penalty, intron_penalty } or min { L*gap_penalty, splice_penalty } if the gap starts with GT and ends with AG (or CT/AC if splice direction reversed)Introns are not allowed in the EST. The difference between the intron_penalty and splice_penalty allows for some slack in marking the intron end-points. It is often the case that the best intron boundaries, from the point of view of minimising mismatches, will not coincide exactly with the splice consensus, so provided the difference between the intron/splice penalties outweighs the extra mismatch/indel costs the alignment will respect the proper boundaries. If the alignment still prefers boundaries which don't start and end with the splice consensus then this may indicate errors in the sequences.
The default parameters work well, except for very short exons (length less than the splice_penalty, approx) which may be skipped. The intron penalties should not be set to less that the maximum expected random match between the sequences (typically 10-15 bp) in order to avoid spurious matches. The algorithm has the following steps:
needle | Needleman-Wunsch global alignment |
stretcher | Finds the best global alignment between two sequences |
The original program was est_genome, written by Richard Mott at the Sanger Centre. The original version is available from ftp://ftp.sanger.ac.uk/pub/pmr/est_genome.4.tar.Z
With the folowing changes it detects also a 14bp exon correctly:
mismatch 1 -> 3 intronpenalty 40 -> 20 splicepenalty 20 -> 10 minscore 30 -> 10
Dr. David Bauer GenProfile AG, Max-Delbrueck-Center, Erwin-Negelein-Haus Robert-Roessle-Str. 10, D-13125 Berlin, Germany bauer@genprofile.com, Tel:49-30-94892165, FAX:49-30-94892151