Release Note

Release 7

PALS db
Jan 24 2005 PALS db Release 7
50,075 human putative alternative splicing site pairs,
27,905 mouse putative alternative splicing site pairs,
3,642 worm putative alternative splicing site pairs

This document describes the format and content of release 7 of the PUTATIVE ALTERNATIVE SPLICING SITE DATABASE (PALS db).

If you have any questions or comments about PALS db or this document, please contact the Bioinformatics Research Center of National Yang-Ming University (YMBC) by email at binfo@ym.edu.tw or by mail at:

Bioinformatics Research Center, National Yang-Ming University,

No. 155, Sec. 2, Li-Noun St, Taipei, Taiwan 11221, R.O.C.

Phone: +886-2-2826-7128

Fax:     +886-2-2826-4843

==========================================================================
TABLE OF CONTENTS
==========================================================================

1. INTRODUCTION
   1.1 Release 7
   1.2 Important Changes in Release 7
   1.3 Audience
   1.4 Rationale
   1.5 Data Sources
   1.6 Criteria in Determining Reliable AS sites

2. RESULTS
   2.1 Human
   2.2 Mouse
   2.3 C. elegans
   2.4 Format of text output
   2.5 Abbreviations

3. PALSDB ADMINISTRATION
   3.1 Citing PALS db
   3.2 Other Methods of Accessing PALS db data
   3.3 Known Bugs
   3.4 Deposition of Experimental Data and Comments
   3.5 Credits and Acknowledgments
   3.6 Disclaimer

==========================================================================

1. INTRODUCTION

1.1 Release 7

     The Bioinformatics Research Center at National Yang-Ming University is responsible for producing and distributing the putative alternative splicing site database (PALS db). Using all available human and mouse mRNA sequences (and, starting with this release, C. elegans sequences) as reference sequences, we tried to collect all the putative alternative splicing information hidden in biological sequence databases.

1.2 Important Changes in Release 7

Fewer unique human genes are included in this release (26,324 unique genes in release 7 versus 33,111 in release 6).
Fewer unique mouse genes are included in this release (18,614 unique genes in release 7 versus 18,942 in release 6).
PALS db includes C. elegans genes starting with release 7 (14,393 unique genes).

1.3 Audience

PALS db is NOT designed to collect precise splicing site information for designing new gene prediction programs. It is designed for biologists who are interested in discovering biological phenomena and solving biological problems. Determining an AS site experimentally is time-consuming; however, by looking up putative AS information, biologists can design simple experiments to confirm the existence of these AS sites and to perform functional assays on them.
Although the splicing junctions in PALS db are not as precise as required for designing new gene prediction programs, biologists may find the database useful because it provides a large amount of putative AS information. Researchers in wet labs can use these putative sites as hints for forming hypotheses. For example,
To obtain reliable statistics about AS in human genes, we collected AS-related statistics using strict criteria. However, to give biologists more chances to find novel phenomena, we preserved all the putative information. We also created a user-friendly interface so that biologists can judge the validity of the putative information using their own expertise. We hope that PALS db can be a tool that energizes information-driven biomedical research.

1.4 Rationale

In constructing PALS db, we tried to find all available AS information. Because half of the human genomic sequence was still in draft form, we chose mRNA sequences as references and collected AS information from UniGene and dbEST. If a sequence carries AS information, as either a deleted or an inserted fragment, its alignment to the reference is split into separate alignments. According to the relationship between these alignments, there are three major types of alternatively spliced transcripts. '+' marks a region that can be aligned. '=' marks a region that has no corresponding fragment in the other sequence. The forward and backward slashes connect the alignable regions of the reference sequence and the putative AS-containing sequence. The identities of the two ends of the alignment are defined as ID1 and ID2, and their lengths as Len1 and Len2, respectively.
In type one AS, the EST may represent a transcript in which the region between pos1 and pos2 is excluded during splicing. Pos3 is the position on the putative AS-containing sequence where the alignments are separated.
In type two AS, there is an extra sequence between pos3 and pos4 on the putative AS-containing sequence. In contrast to type one AS, there is only one cut position (pos1) on the reference sequence.
In type three AS, there are unaligned sequences on both the reference and the putative AS-containing sequence. However, we discarded type three AS in the current release of PALS db because, in many cases, the unaligned regions are low-quality sequences.

  1.        pos1      pos2
     ===++++++++======++++++++===     mRNA
         \      \    /      /
          \      \  /      /
           ++++++++++++++++           EST
                 pos3

  2.          pos1
     ===++++++++++++++++===           mRNA
       /      /  \      \
      /      /    \      \
     ++++++++======++++++++           EST
          pos3    pos4

  3.        pos1    pos2
     ===++++++++====++++++++===       mRNA
       /      /      \      \
      /      /        \      \
     ++++++++==========++++++++       EST
          pos3        pos4
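
The three cases in the figure can be told apart by checking which of the two sequences has an unaligned gap between the two alignment blocks. The following is a minimal Python sketch of that idea, not the code used to build PALS db. The names `Block` and `classify_as_type`, and the 1-based inclusive coordinate convention, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One aligned block; coordinates are 1-based and inclusive (assumed)."""
    ref_start: int
    ref_end: int
    est_start: int
    est_end: int


def classify_as_type(left: Block, right: Block) -> int:
    """Return 1, 2, or 3 following the figure, or 0 if there is no gap."""
    # Unaligned bases between the two blocks on the reference (mRNA).
    ref_gap = right.ref_start - left.ref_end - 1
    # Unaligned bases between the two blocks on the AS-containing sequence.
    est_gap = right.est_start - left.est_end - 1

    if ref_gap > 0 and est_gap <= 0:
        return 1  # region between pos1 and pos2 is absent from the EST
    if ref_gap <= 0 and est_gap > 0:
        return 2  # extra sequence between pos3 and pos4 on the EST
    if ref_gap > 0 and est_gap > 0:
        return 3  # both sides unaligned; discarded in this release
    return 0      # blocks are contiguous on both sequences
```

For example, two blocks that are adjacent on the EST but 100 bases apart on the mRNA would be reported as type one.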

1.5 Data Sources

Human UniGene Build #176, Mouse UniGene Build #141, and dbEST released on Nov. 24, 2004

Human
  Number of UniGene clusters containing at least one CDS: 26,324 clusters (unique genes)
  CDS: 143,490 entries
  EST: 6,014,960 entries
Mouse
  Number of UniGene clusters containing at least one CDS: 18,614 clusters (unique genes)
  CDS: 66,348 entries
  EST: 4,266,272 entries
C. elegans
  Number of UniGene clusters containing at least one CDS: 14,393 clusters (unique genes)
  CDS: 18,411 entries
  EST: 303,342 entries

Homologous human and mouse genes from NCBI LocusLink (Nov 25, 2004)
Similar human and mouse genes from NCBI HomoloGene (Nov 25, 2004)
22,326 literature aliases collected by the Human Genome Organisation (HUGO) (Nov 25, 2004)
EST library information from the Cancer Genome Anatomy Project (CGAP) (Nov 25, 2004)

1.6 Criteria in Determining Reliable AS sites

To preserve information, we employed a two-step filter, marking EST sources and adopting moderately high criteria, to manage the issues of paralogous genes and repeats, respectively.
  1. If an EST sequence is derived from another UniGene cluster, it is marked as OtherClusterEST. When statistics are calculated, it is not counted as supporting evidence for a putative AS site pair. However, to preserve all information, the putative AS site pairs derived from EST sequences of other UniGene clusters are still included. In some cases,
  2. When computing statistics, we required 95% identity in a 50-bp fragment on both ends of the separated alignments as the criterion for predicting splicing sites.
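
The two filters above can be sketched in Python as follows. This is an illustration under stated assumptions, not the PALS db implementation: it reads "95% identity in a 50-bp fragment" as requiring ID1 and ID2 of at least 95 and LEN1 and LEN2 of at least 50, and the function names and dictionary keys are hypothetical.

```python
MIN_IDENTITY = 95.0  # percent identity required on each end (criterion 2)
MIN_LENGTH = 50      # bp; assumes the 50-bp fragment is a minimum matched length


def passes_identity_filter(id1, id2, len1, len2):
    """True if both ends of the separated alignment meet criterion 2."""
    return (id1 >= MIN_IDENTITY and id2 >= MIN_IDENTITY
            and len1 >= MIN_LENGTH and len2 >= MIN_LENGTH)


def count_supporting(candidates):
    """Count sequences that support a site pair in the statistics.

    Each candidate is a dict with the (hypothetical) keys
    seq_type, id1, id2, len1, len2. OtherClusterEST sequences are
    kept in the data but never counted as evidence (criterion 1).
    """
    return sum(
        1 for c in candidates
        if c["seq_type"] != "OtherClusterEST"
        and passes_identity_filter(c["id1"], c["id2"], c["len1"], c["len2"])
    )
```

Keeping the two checks separate mirrors the text: the source marking only affects counting, while the identity criterion decides whether an alignment end is trusted at all.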

2. RESULTS

2.1 Putative alternative splicing in human

374,923 sequences (mRNA, EST sequences) contained putative alternative splicing information
50,075 alternative splicing site pairs

2.2 Putative alternative splicing in mouse

118,640 sequences (mRNA, EST sequences) contained putative alternative splicing information
27,905 alternative splicing site pairs

2.3 Putative alternative splicing in C. elegans

6,977 sequences (mRNA, EST sequences) contained putative alternative splicing information
3,642 alternative splicing site pairs

2.4 Format of the "TEXT" output

The standard output

ug_id: UniGene Cluster ID
ref_unigene_id: the UniGene sequence accession number of the reference sequence used to collect AS information
ref_gb_id: the GenBank sequence accession number of the reference sequence used to collect AS information
ref_len: the length of the reference sequence
ALTER_INFO: essential information of an AS site pair
  POS1, POS2: see 1.4 Rationale
  ID1, ID2: identities of the two ends of the alignment, respectively
  LEN1, LEN2: matched lengths of the two ends of the alignment, respectively
  VARIANT_SIZE: the change in length of candidate sequences containing putative AS information
  ALTERTYPE: see 1.4 Rationale. Either type one or type two.
  AS_SEQ_COUNT: number of different sources of sequences supporting this AS site pair
  CDS_C: number of CDS
  C_CDS_C: number of complete CDS sequences
  C_SEQ_C: number of complete sequences
  S_EST_C: number of EST sequences from the same UniGene cluster
  O_EST_C: number of EST sequences from other UniGene clusters
  DB_EST_C: number of EST sequences from dbEST (sequences not classified into any UniGene cluster)
AS_SEQ_INFO: for each sequence containing putative alternative splicing information
  GenBank accession number
  Sequence type: one of SelfClusterEST, OtherClusterEST, cds, Complete_cds, Complete_sequence
  LIB_INFO: the library information of an EST sequence, e.g.:
    LIB=NIH_MGC_52
    TISSUE=pooled tissue
    HISTOLOGY=normal
    CELL_TYPE=flow-sorted
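
The LIB_INFO entries in the example are simple KEY=value lines, so they can be read into a dictionary with a few lines of Python. This is a sketch that assumes one KEY=value pair per line, as in the example above; the function name `parse_lib_info` is hypothetical and the layout of the rest of the TEXT output is not assumed here.

```python
def parse_lib_info(lines):
    """Parse LIB_INFO lines such as 'TISSUE=pooled tissue' into a dict."""
    info = {}
    for line in lines:
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = line.strip().partition("=")
        if sep:  # ignore lines that do not contain '='
            info[key] = value
    return info


example = [
    "LIB=NIH_MGC_52",
    "TISSUE=pooled tissue",
    "HISTOLOGY=normal",
    "CELL_TYPE=flow-sorted",
]
lib = parse_lib_info(example)
print(lib["TISSUE"])  # prints "pooled tissue"
```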

2.5 Abbreviations

The following abbreviations are the naming conventions used in PALS db. They may appear in the web interface, the TEXT output, and the web-based HELP system.

AS                alternative splicing
ASSPs             non-redundant AS information for a gene
Gb_id             the GenBank accession number of a DNA sequence (RefSeq, mRNA, or EST sequences)
Ug_id             the UniGene cluster ID of a unique gene
Gene              the gene symbol approved by HUGO
UniGene member    the number of sequences clustered in a UniGene cluster
AS lists (pic)    clustered (non-redundant) putative AS site pairs displayed in graphics
Text_Info         clustered putative AS site pairs in detail, including the number of CDS and EST sequences, the cloning library characteristics of each EST, the AS types, the length of an AS fragment, etc.
All seq info      all candidate sequences containing putative AS, displayed in graphics
Descriptions      the gene name approved by HUGO
Cytoband          the cytogenetic location of a gene

3. PALSDB ADMINISTRATION

3.1 Citing PALS db

If you have used PALS db in your research, we would appreciate it if you would include a reference to PALS db in all publications related to that research.
When citing data in PALS db, please give the database release number and the accession numbers of the sequences containing putative AS information. If necessary, we will try to retrieve the information from the past PALS db release used in your publication.
The following publication, which describes PALS db, should be cited:
Huang, Y.-H., Chen, Y.-T., Lai, J.-J., Yang, S.-T., and Yang, U.-C. (2002) PALS db: Putative alternative splicing database. Nucleic Acids Res. 30, 186-190.

3.2 Other Methods of Accessing PALS db data

We are currently contacting institutes about creating mirror sites outside the Bioinformatics Research Center, National Yang-Ming University, Taipei, Taiwan. To host a mirror site, please contact binfo@ym.edu.tw.
We plan to issue flat-file distributions for installation into SRS in the future.

3.3 Known Bugs

When clustering candidate sequences containing AS information into AS site pairs, we mistakenly merged type two AS candidates that share the same POS1 but have different variant lengths.
This mistake has the following consequences:
  1. Some unique type two AS site pairs with different variant lengths are incorrectly clustered together in both the "AS lists" and the "TEXT" output.
  2. The number of putative AS site pairs is underestimated.
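
The effect of this bug can be illustrated with a small Python sketch. It is not the PALS db clustering code; the accession names, positions, and variant sizes are made up, and the `cluster` helper is hypothetical. Grouping type two candidates by POS1 alone merges distinct site pairs, while grouping by POS1 together with the variant length keeps them apart.

```python
from collections import defaultdict


def cluster(candidates, key):
    """Group candidate accession numbers by the given clustering key."""
    groups = defaultdict(list)
    for c in candidates:
        groups[key(c)].append(c["gb_id"])
    return dict(groups)


# Hypothetical type two candidates: same POS1, two different variant lengths.
candidates = [
    {"gb_id": "EST_A", "pos1": 512, "variant_size": 87},
    {"gb_id": "EST_B", "pos1": 512, "variant_size": 87},
    {"gb_id": "EST_C", "pos1": 512, "variant_size": 141},
]

# Behaviour described in the known bug: POS1 is the only key.
merged = cluster(candidates, key=lambda c: c["pos1"])
# Keying on POS1 and variant length keeps the two site pairs distinct.
separated = cluster(candidates, key=lambda c: (c["pos1"], c["variant_size"]))

print(len(merged), len(separated))  # prints "1 2"
```

With the POS1-only key the three candidates collapse into one site pair, which is why the reported number of putative AS site pairs is an underestimate.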

3.4 Deposition of Experimental Data and Comments

Experimental data that provide further evidence for a putative AS site pair collected in PALS db are welcome for deposition into PALS db. Please contact binfo@ym.edu.tw.
Comments should be sent directly to 39103016@ym.edu.tw.

3.5 Credits and Acknowledgments

Credits
Database Release 7: 2005/01/24, by Fu, G. C.-L., and Yang, U.-C.
Database Release 6: 2003/04/21, by Fu, G. C.-L., and Yang, U.-C.
Database Release 5: 2002/10/30, by Huang, Y.-H., Fu, G. C.-L., and Yang, U.-C.
Web interface maintained by Fu, G. C.-L., and Huang, Y.-H.
Acknowledgments
Gloria Chiung-Ling Fu was supported by NSC91-3112-B-010-013 and has updated PALS db since release 5. Yen-Hua Huang has been at the Sanger Institute since Sept. 2002; special thanks for his assistance in verifying the quality of PALS db.
YHH and YTC were supported by grants from the National Science Council, Taiwan (NSC 89-2323-B-010-003 and NSC 89-2318-B-010-011-M51, respectively). STY was supported by a grant from Veterans General Hospital, Tsing-Hua University, and Yang-Ming University (VTY90-P5-40). Computational resources were supported by the Ministry of Education, ROC (Program for Promoting Academic Excellence of Universities, 89-B-FA22-2-4).
Special thanks to Mr. Lai, J.-J., and Mr. Wang, Y.-T., of MIS for their enthusiasm in maintaining system stability.

3.6 Disclaimer

     The Bioinformatics Research Center of National Yang-Ming University makes no representation about the suitability or accuracy of PALS db and the web interface for any purpose and makes no warranties, either express or implied, including merchantability and fitness for a particular purpose, or that the use of this software or data will not infringe any third-party patents, copyrights, trademarks, or other rights.

     This web interface and data are provided to enhance knowledge and encourage progress in the scientific community and are to be used only for research and educational purposes. Any reproduction or use for commercial purposes is prohibited without the prior express written permission of the Bioinformatics Research Center of National Yang-Ming University.

     For additional information about PALS db releases, please contact YMBC by e-mail at binfo@ym.edu.tw, by phone at +886-2-2826-7128, or by mail at:

Bioinformatics Research Center, National Yang-Ming University,
No. 155, Sec. 2, Li-Noun St, Taipei, Taiwan 11221, R.O.C.
FAX: +886-2-2826-4843