Release Note

Release 2

PALS db
Aug 15, 2001 PALS db Release 2
25,577 human putative alternative splicing site pairs,
9,214 mouse putative alternative splicing site pairs


This document describes the format and content of Release 2 of the PUTATIVE ALTERNATIVE SPLICING SITE DATABASE (PALS db).

If you have any questions or comments about PALS db or this document, please contact the Bioinformatics Research Center of National Yang-Ming University (YMBC) by email at binfo@ym.edu.tw or by mail at:

Bioinformatics Research Center, National Yang-Ming University,

No. 155, Sec. 2, Li-Noun St, Taipei, Taiwan 11221, R.O.C.

Phone: +886-2-2826-7128

Fax:     +886-2-2826-4843

==========================================================================
TABLE OF CONTENTS
==========================================================================

1. INTRODUCTION
   1.1 Release 2
   1.2 Important Changes in Release 2
   1.3 Audience
   1.4 Rationale
   1.5 Data Sources
   1.6 Criteria in Determining Reliable AS sites

2. RESULTS
   2.1 Putative alternative splicing in human
   2.2 Putative alternative splicing in mouse
   2.3 Format of the "TEXT" output
   2.4 Abbreviations

3. PALSDB ADMINISTRATION
   3.1 Citing PALS db
   3.2 Other Methods of Accessing PALS db data
   3.3 Known Bugs
   3.4 Deposition of Experimental Data and Comments
   3.5 Credits and Acknowledgments
   3.6 Disclaimer

==========================================================================

1. INTRODUCTION

1.1 Release 2

     The Bioinformatics Research Center at the National Yang-Ming University is responsible for producing and distributing the putative alternative splicing site database (PALS db). Using all available human and mouse mRNA sequences as reference sequences, we tried to collect all available putative alternative splicing information hidden in biological sequence databases.

1.2 Important Changes in Release 2

Mouse genes (16,615 unique genes) are incorporated in this database release.
More unique human genes are included in this release (19,936 unique genes in release 2, up from 17,595 in release 1).
The web interface was updated to version 0.9.5.
The length of a matched region can be measured on the scale bar in units of either amino acid positions or nucleotide positions.
Designed to reveal the effect of alternative splicing on proteins
The coding region of the reference sequence is shaded in pale blue.
Sequence scales use both amino acid and nucleic acid positions.
Hyperlinks to InterPro, CDD, and DART reveal the positions of protein signatures, motifs, and domains.
Homologous genes between mouse and human are linked using the information provided in the NCBI LocusLink and HomoloGene databases.
Designed to display other useful information to compare with alternative splicing information

1.3 Audience

PALS db is NOT designed to collect precise splicing site information for designing new gene prediction programs. This database is designed for biologists who are interested in discovering biological phenomena and solving biological problems. Determining an AS site experimentally is time-consuming; however, by looking up putative AS information, biologists can design simple experiments to prove the existence of these AS sites and to perform functional assays on them.
Although the splicing junctions in PALS db are not as precise as required for designing new gene-prediction programs, biologists may find the database useful because it provides a large amount of putative AS information. Researchers in wet labs can thus use the putative sites as hints for forming hypotheses. See, for example, the case of FOS.
In order to get reliable statistics about AS in human genes, we collected AS-related statistics using strict criteria. However, in order to give biologists more chances to find novel phenomena, we preserved all the putative information. We also created a user-friendly interface that lets biologists judge the validity of the putative information using their own expertise. Hopefully, PALS db can be a tool to energize information-driven biomedical research.

1.4 Rationale

In constructing PALS db, we tried to find all available AS information. Because half of the human genomic sequence was still in draft form, we chose mRNA sequences as references for collecting AS information from UniGene and dbEST. If a sequence contains AS information, in the form of either a deleted or an inserted fragment, its alignment to the reference is split into separate alignments. According to the relations between these alignments, there are 3 major types of alternative splicing transcripts, shown in the diagrams below. In the diagrams, '+' marks a region that can be aligned, and '=' marks a region that has no corresponding fragment in the other sequence. The forward and backward slashes connect the alignable regions of the reference sequence and the putative AS-containing sequence. The identities of the two ends of the alignments are defined as ID1 and ID2, respectively. The lengths of the two ends of the alignments are defined as Len1 and Len2, respectively.
In type one AS, the EST may represent a transcript in which the region between pos1 and pos2 is excluded during splicing. Pos3 is the position on the putative AS-containing sequence at which the alignments are separated.
In type two AS, there is an extra sequence between pos3 and pos4 on the putative AS-containing sequence. In contrast to type one AS, there is only one cutting position (pos1) on the reference sequence.
In type three AS, there are unaligned sequences on both the reference and the putative AS-containing sequence. However, we discarded type three AS in the current release of PALS db because, in many cases, the unaligned regions are low-quality sequences.


  1.     pos1   pos2
    ===++++++===++++++===     mRNA
       \    \   /    /
        \    \ /    /
         +++++++++++          EST
            pos3

  2.        pos1
    ===++++++++++++++===      mRNA
       /     /\     \
      /     /  \     \
     +++++++====+++++++       EST
        pos3    pos4

  3.    pos1    pos2
    ==++++++====++++++===     mRNA
      /    /    \    \
     /    /      \    \
    ++++++=======+++++++      EST
      pos3       pos4
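
The three cases above can be told apart by comparing the gap between two neighbouring alignments on the reference with the gap on the candidate sequence. The following is only a minimal sketch of that idea, not the code used to build PALS db. The `Block` type, the function name, and the 1-based inclusive coordinates are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One aligned region. Coordinates are assumed 1-based and inclusive."""
    ref_start: int  # start on the reference (mRNA) sequence
    ref_end: int    # end on the reference (mRNA) sequence
    qry_start: int  # start on the putative AS-containing sequence
    qry_end: int    # end on the putative AS-containing sequence


def classify_as_type(first: Block, second: Block) -> int:
    """Return 1, 2, or 3 for the cases in section 1.4, or 0 if none applies."""
    # Number of unaligned bases between the two blocks on each sequence.
    ref_gap = second.ref_start - first.ref_end - 1
    qry_gap = second.qry_start - first.qry_end - 1
    if ref_gap > 0 and qry_gap == 0:
        return 1  # region pos1..pos2 of the reference is missing in the candidate
    if ref_gap == 0 and qry_gap > 0:
        return 2  # extra sequence pos3..pos4 is present only in the candidate
    if ref_gap > 0 and qry_gap > 0:
        return 3  # unaligned regions on both sequences (discarded in this release)
    return 0      # contiguous or overlapping blocks: no AS signal


if __name__ == "__main__":
    left = Block(ref_start=1, ref_end=100, qry_start=1, qry_end=100)
    print(classify_as_type(left, Block(151, 250, 101, 200)))  # type one
    print(classify_as_type(left, Block(101, 200, 131, 230)))  # type two
```

In this sketch, type one corresponds to a gap on the reference only, type two to a gap on the candidate only, and type three to gaps on both.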

1.5 Data Sources

Human UniGene Build #138, Mouse UniGene Build #93, and dbEST released on Aug 12, 2001
Human
  UniGene clusters containing at least one CDS: 19,936 clusters (unique genes)
  CDS: 68,409 entries
  EST: 3,735,344 entries
Mouse
  UniGene clusters containing at least one CDS: 16,615 clusters (unique genes)
  CDS: 38,216 entries
  EST: 2,068,128 entries
Homologous human and mouse genes from NCBI LocusLink (Aug 12, 2001)
Similar human and mouse genes from NCBI HomoloGene (Aug 03, 2001)
9,386 literature aliases collected by the Human Genome Organisation (HUGO) (Sept 10, 2001)
EST library information from the Cancer Genome Anatomy Project (CGAP) (Aug 12, 2001)

1.6 Criteria in Determining Reliable AS sites

In order to preserve information, we employed a two-step filter: marking EST sources to manage the issue of paralogous genes, and adopting moderately high criteria to manage the issue of repeats.
  1. If an EST sequence is derived from another UniGene cluster, it is marked as OtherClusterEST. In calculating statistics, it is not counted as supporting evidence for a putative AS site pair. However, in order to preserve all information, the putative AS site pairs derived from EST sequences of other UniGene clusters are included. In some cases,
  2. At the stage of computing statistics, we used 95% identity in a 50-bp fragment on both ends of the separated alignments as the criterion for predicting splicing sites.
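
The two filtering steps can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used for the release. It reads the second criterion as "each flanking alignment is at least 50 bp long and at least 95% identical", which is one possible interpretation. The `Candidate` type and the function names are hypothetical.

```python
from typing import Iterable, NamedTuple

MIN_IDENTITY = 95.0  # percent identity required on both ends (section 1.6)
MIN_LENGTH = 50      # matched length in bp required on both ends (section 1.6)


class Candidate(NamedTuple):
    """One sequence carrying putative AS information (hypothetical layout)."""
    accession: str
    seq_type: str  # e.g. "SelfClusterEST", "OtherClusterEST", "cds"
    id1: float     # identity of the first flanking alignment, in percent
    id2: float     # identity of the second flanking alignment, in percent
    len1: int      # matched length of the first flanking alignment, in bp
    len2: int      # matched length of the second flanking alignment, in bp


def passes_criteria(c: Candidate) -> bool:
    """Step 2: both ends must be long enough and similar enough."""
    return (min(c.id1, c.id2) >= MIN_IDENTITY
            and min(c.len1, c.len2) >= MIN_LENGTH)


def count_support(candidates: Iterable[Candidate]) -> int:
    """Count supporting sequences, skipping ESTs from other clusters (step 1)."""
    return sum(
        1 for c in candidates
        if c.seq_type != "OtherClusterEST" and passes_criteria(c)
    )
```

Candidates that fail either step are only left out of the count; they are not deleted, which matches the stated aim of preserving all putative information.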

2. RESULTS

2.1 Putative alternative splicing in human

79,609 sequences (mRNA, EST sequences) contained putative alternative splicing information
25,577 alternative splicing site pairs

2.2 Putative alternative splicing in mouse

23,768 sequences (mRNA, EST sequences) contained putative alternative splicing information
9,214 alternative splicing site pairs

2.3 Format of the "TEXT" output

The standard output contains the following fields:
ug_id: UniGene Cluster ID
ref_unigene_id: the UniGene sequence accession number of the reference sequence used to collect AS information
ref_gb_id: the GenBank sequence accession number of the reference sequence used to collect AS information
ref_len: the length of the reference sequence
ALTER_INFO: essential information of an AS site pair
  POS1, POS2: see 1.4 Rationale
  ID1, ID2: identities of the two ends of the alignments, respectively
  LEN1, LEN2: matched lengths of the two ends of the alignments, respectively
  VARIANT_SIZE: the change in length of candidate sequences containing putative AS information
  ALTERTYPE: see 1.4 Rationale; either type one or type two
AS_SEQ_COUNT: number of different sources of sequences supporting this AS site pair
  CDS_C: number of CDS
  C_CDS_C: number of complete CDS sequences
  C_SEQ_C: number of complete sequences
  S_EST_C: number of EST sequences from the same UniGene cluster
  O_EST_C: number of EST sequences from other UniGene clusters
  DB_EST_C: number of EST sequences from dbEST (sequences not classified to any UniGene cluster)
AS_SEQ_INFO: for each sequence containing putative alternative splicing information
  GenBank accession number
  Sequence type: one of SelfClusterEST, OtherClusterEST, cds, Complete_cds, Complete_sequence
  LIB_INFO: the library information of an EST sequence, e.g.:
    LIB=NIH_MGC_52
    TISSUE=pooled tissue
    HISTOLOGY=normal
    CELL_TYPE=flow-sorted
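
The LIB_INFO entries shown in the example are simple KEY=VALUE pairs, so they can be read into a dictionary. The sketch below assumes one pair per line. The exact line layout of the TEXT output is not specified in this document, and the function name is hypothetical.

```python
from typing import Dict, Iterable


def parse_lib_info(lines: Iterable[str]) -> Dict[str, str]:
    """Turn KEY=VALUE lines (assumed one pair per line) into a dictionary."""
    info: Dict[str, str] = {}
    for line in lines:
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = line.strip().partition("=")
        if sep and key:  # ignore lines that are not KEY=VALUE pairs
            info[key] = value
    return info


if __name__ == "__main__":
    example = [
        "LIB=NIH_MGC_52",
        "TISSUE=pooled tissue",
        "HISTOLOGY=normal",
        "CELL_TYPE=flow-sorted",
    ]
    print(parse_lib_info(example)["TISSUE"])
```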

2.4 Abbreviations

The following abbreviations are the naming conventions in PALS db. They may appear in the web interface, the TEXT output, and the web-based HELP system.

AS: alternative splicing
ASSPs: non-redundant AS information for a gene
Gb_id: the GenBank accession number of a DNA sequence (either a RefSeq, mRNA, or EST sequence)
Ug_id: the UniGene cluster ID of a unique gene
Gene: the gene symbol approved by HUGO
UniGene member: the number of sequences clustered in a UniGene cluster
AS lists (pic): clustered (non-redundant) putative AS site pairs displayed in graphics
Text_Info: clustered putative AS site pairs in detail, including the number of CDS and EST sequences, the cloning library characteristics of the ESTs, the AS types, the length of an AS fragment, etc.
All seq info: all candidate sequences containing putative AS, displayed in graphics
Descriptions: the gene name approved by HUGO
Cytoband: the cytogenetic location of a gene

3. PALSDB ADMINISTRATION

3.1 Citing PALS db

If you have used PALS db in your research, we would appreciate it if you would include a reference to PALS db in all publications related to that research.
When citing data in PALS db, it is appropriate to give the database release number, and accession numbers of sequences containing putative AS information. If necessary, we will try to retrieve the information from the past PALS db release used in your publications.
The following publication, which describes the PALS db, should be cited:
Huang, Y-H, Chen, Y-T, Lai, J-J, Yang, S-T., and Yang, U-C. (2002) PALS db: Putative alternative splicing database. Nucleic Acids Res. 30, 186-190.

3.2 Other Methods of Accessing PALS db data

We are contacting institutes about creating mirror sites outside the Bioinformatics Research Center, National Yang-Ming University, Taipei, Taiwan. The first mirror site will soon be available at the National Center for High-performance Computing (NCHC), Hsinchu, Taiwan. To host a mirror site, please contact binfo@ym.edu.tw.
In the future, we plan to issue flat-file distributions for installation in the SRS system.

3.3 Known Bugs

In clustering candidate sequences containing AS information into AS site pairs, we mistakenly combined type two AS candidates that have the same POS1 but different variant lengths.
This mistake has the following consequences:
  Some unique type two AS site pairs with different variant lengths are incorrectly clustered together in both the "AS lists" and the "TEXT" output.
  The number of putative AS site pairs is underestimated.
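
The effect of this bug can be illustrated with a small sketch. It is not the clustering code used in PALS db. The function name, the tuple layout, and the accession labels are hypothetical. Keying type two candidates on POS1 alone merges site pairs that a key of POS1 plus variant length would keep apart.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def cluster_type_two(
    candidates: Iterable[Tuple[int, int, str]],
    use_variant_size: bool = True,
) -> Dict[Tuple[int, ...], List[str]]:
    """Group (pos1, variant_size, accession) tuples into AS site pairs.

    With use_variant_size=False the key is POS1 only, which reproduces
    the behaviour described as a known bug.
    """
    clusters: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
    for pos1, variant_size, accession in candidates:
        key = (pos1, variant_size) if use_variant_size else (pos1,)
        clusters[key].append(accession)
    return dict(clusters)


if __name__ == "__main__":
    # Hypothetical candidates: same POS1, two different insert lengths.
    example = [(120, 30, "seq_a"), (120, 45, "seq_b"), (120, 30, "seq_c")]
    print(len(cluster_type_two(example, use_variant_size=False)))  # merged
    print(len(cluster_type_two(example, use_variant_size=True)))   # separated
```

With the POS1-only key, the three example candidates collapse into a single site pair, which is how the underestimation arises.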

3.4 Deposition of Experimental Data and Comments

We welcome the deposition of any experimental data that provide further evidence for the putative AS site pairs collected in PALS db. Please contact binfo@ym.edu.tw.
Please send any comments directly to binfo@ym.edu.tw.

3.5 Credits and Acknowledgments

Credits
Database Release 2: 2001/08/15, by Huang, Y.-H., Lai, J.-J., Yang, S.-T., and Yang, U.-C.
Web interface created by Chen, Y.-T., and Huang, Y.-H.
Acknowledgments
YHH and YTC were supported by grants from the National Science Council, Taiwan (NSC 89-2323-B-010-003 and NSC 89-2318-B-010-011-M51, respectively). STY was supported by a grant from Veterans General Hospital, Tsing-Hua University, and Yang-Ming University (VTY90-P5-40). Computational resources were supported by the Ministry of Education, ROC (Program for Promoting Academic Excellence of Universities, 89-B-FA22-2-4).
Special thanks to MIS Mr. Wang, Y.-T. for his enthusiasm in maintaining the stability of the system.
Special thanks to RA Miss Chen, W.-H. for her help in preparing the Chinese version of the help files.

3.6 Disclaimer

     The Bioinformatics Research Center of National Yang-Ming University makes no representation about the suitability or accuracy of PALS db and the web interface for any purpose and makes no warranties, either express or implied, including merchantability and fitness for a particular purpose, or that the use of this software or data will not infringe any third-party patents, copyrights, trademarks, or other rights.

     This web interface and data are provided to enhance knowledge and encourage progress in the scientific community and are to be used only for research and educational purposes. Any reproduction or use for commercial purposes is prohibited without the prior express written permission of the Bioinformatics Research Center of National Yang-Ming University.

     For additional information about PALS db releases, please contact YMBC by e-mail at binfo@ym.edu.tw, by phone at +886-2-2826-7128, or by mail at:

Bioinformatics Research Center, National Yang-Ming University,
No. 155, Sec. 2, Li-Noun St, Taipei, Taiwan 11221, R.O.C.
FAX: +886-2-2826-4843