Protein Pattern, Motif, and Domain Databases

Sean Eddy; eddy@genetics.wustl.edu; http://genome.wustl.edu/eddy
Modified by Ueng-Cheng Yang for the NCHC course (with permission)


Contents


Lecture outline

Why use domain databases?

Choice of computational representations

A tradeoff between completeness and correctness

A closer look at PROSITE

A closer look at BLOCKS

A closer look at PRINTS

A closer look at PFAM

Always do control experiments: never trust a server


Annotated links to domain databases

There is a plethora of protein domain databases. They vary greatly in approach, completeness, documentation, and support. Here are links to three of the best (PROSITE, BLOCKS, and PRINTS), followed by our own (PFAM), which I feel is also pretty good, and pointers to some other domain databases that I know of. I've included my opinion on the strengths of each database.

PROSITE

Principal authors: Amos Bairoch et al. (Geneva, Switzerland)
Strengths: Exceptionally well documented. Closely tied to well-annotated Swissprot database. Reliable classification of Swissprot into homologous families.
Type of pattern: Deterministic patterns. A growing number of profiles.
Family definition method: Manual
Construction method: Manual
Search method: Regular expression matching (non-stochastic). (Also a growing number of profile HMMs.)
Current release: 13.2 (September 1996)
Number of distinct patterns: 1167 patterns linked to 889 documentation entries
Home page: http://expasy.hcuge.ch/sprot/prosite.html
On-line analysis server: http://expasy.hcuge.ch/sprot/scnpsite.html
Downloadable database: ftp://ncbi.nlm.nih.gov/repository/prosite (Prosite 13.0)
Downloadable software: Lots. For a list, go to http://expasy.hcuge.ch/cgi-bin/lists?prosite.prg
Recent reference: "The PROSITE database: its status in 1997." NAR 25:217-221, 1997
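
To make "regular expression matching" concrete, here is a minimal sketch of how a deterministic PROSITE-style pattern can be turned into an ordinary regular expression and scanned against a sequence. It handles only a subset of the pattern syntax (x, [..], {..}, repeat counts, and the < and > anchors). The function names, the example pattern, and the test sequence are illustrative assumptions, not part of any PROSITE distribution.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Covers a subset of the syntax only: x, [ABC], {ABC}, (n), (n,m), < and >.
    """
    pattern = pattern.strip().rstrip(".")        # patterns end with a period
    parts = []
    for element in pattern.split("-"):           # elements are '-' separated
        prefix = suffix = ""
        if element.startswith("<"):              # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):                # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":                          # x means any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"       # {ABC} means anything but A, B, C
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

def scan(pattern: str, sequence: str):
    """Return (1-based start, end, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end(), m.group()) for m in regex.finditer(sequence)]

# Illustrative pattern written in PROSITE syntax, and a made-up sequence.
PATTERN = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H."
SEQUENCE = "AACKKCAAALAAAAAAAAHAAAHGG"
print(prosite_to_regex(PATTERN))
print(scan(PATTERN, SEQUENCE))
```

A match here is all or nothing: a single substitution at a fixed position makes the pattern miss, which is the sense in which this search method is non-stochastic.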

BLOCKS

Principal authors: Steve and Jorja Henikoff et al. (Seattle, USA)
Strengths: Easily constructed and updated; most have links to PROSITE documentation; improves the sensitivity of searching for matches to PROSITE families.
Type of pattern: ungapped alignments
Family definition method: primarily taken from PROSITE
Construction method: automated (PROTOMAT)
Search method: ungapped HMM (aka position-specific score matrices, PSSMs)
Current release: 9.3 (March 1997)
Number of distinct patterns: 3417 blocks for 932 protein families
Home page: http://www.blocks.fhcrc.org/
On-line analysis server: http://www.blocks.fhcrc.org/blocks_search.html
Downloadable database: ftp://ncbi.nlm.nih.gov/repository/blocks
Downloadable software: ftp://ncbi.nlm.nih.gov/repository/blocks
Recent reference: "The BLOCKS database - a system for protein classification" NAR 24:197-200, 1996.
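
As a rough illustration of what searching with a position-specific score matrix involves, the sketch below builds a log-odds matrix from a toy ungapped block and slides it along a sequence without gaps. The block, the sequence, the uniform background, and the add-one pseudocount are all assumptions chosen for brevity. The real BLOCKS tools use sequence weighting and more careful pseudocounts, so treat this as a teaching sketch rather than a description of their method.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(block, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length, ungapped aligned segments."""
    width = len(block[0])
    assert all(len(segment) == width for segment in block)
    background = 1.0 / len(AMINO_ACIDS)          # assumption: uniform background
    pssm = []
    for column in range(width):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for segment in block:
            counts[segment[column]] += 1
        total = sum(counts.values())
        # Log-odds (in bits) of each residue at this column versus background.
        pssm.append({aa: math.log2(counts[aa] / total / background)
                     for aa in AMINO_ACIDS})
    return pssm

def best_hit(pssm, sequence):
    """Slide the matrix along the sequence with no gaps.

    Returns (score, 1-based start) of the best window, or None if the
    sequence is shorter than the matrix. Only the 20 standard residues
    are handled.
    """
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(pssm[i][sequence[start + i]] for i in range(width))
        if best is None or score > best[0]:
            best = (score, start + 1)
    return best

# Toy block and sequence, invented for this example.
TOY_BLOCK = ["GKSTL", "GKSSL", "GKTTL", "GRSTL"]
score, start = best_hit(build_pssm(TOY_BLOCK), "MAAGKSTLVVA")
print(f"best window starts at {start}, score {score:.2f} bits")
```

Unlike a deterministic pattern, a near miss still gets a score, just a lower one. This is why PSSM searches can be more sensitive than pattern matching on the same family.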

PRINTS

Principal authors: T.K. Attwood (University College London, UK)
Strengths: Careful manual construction of families. Deals well with the presence of multiple short conserved blocks in many proteins.
Type of pattern: short ungapped alignments, multiple per family
Family definition method: manual
Construction method: manual
Search method: ungapped PSSMs, requiring multiple hits
Current release: 17.0 (Sept 1997)
Number of distinct patterns: 4460 motifs for 800 protein families
Home page: http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/PRINTS.html
On-line analysis servers: http://www.biochem.ucl.ac.uk/cgi-bin/attwood/SearchPrintsForm2.pl
    http://www.blocks.fhcrc.org/blocks_search.html
Downloadable database: ftp://ncbi.nlm.nih.gov/repository/PRINTS
Downloadable software: ??
Recent reference: "Novel Developments With the PRINTS Protein Fingerprint Database". NAR 25:212-217, 1997
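
The idea of a fingerprint, several short motifs that must all be found, can be sketched as follows. This is a deliberate simplification: each motif is scored by crude column agreement rather than a real PSSM, every motif must be present, and motifs must appear in order without overlapping. The motifs, sequences, threshold, and function names are invented for the example and do not reflect the actual PRINTS search software.

```python
def motif_hits(motif, sequence, min_fraction=0.8):
    """Return 0-based starts of windows that agree with the motif columns.

    A window position "agrees" if its residue was observed in that column
    of the motif alignment. This is a crude stand-in for a PSSM score.
    """
    width = len(motif[0])
    allowed = [{segment[i] for segment in motif} for i in range(width)]
    hits = []
    for start in range(len(sequence) - width + 1):
        matches = sum(sequence[start + i] in allowed[i] for i in range(width))
        if matches / width >= min_fraction:
            hits.append(start)
    return hits

def match_fingerprint(motifs, sequence):
    """Place every motif left to right, in order and without overlap.

    Returns the 1-based start of each motif, or None if any motif cannot
    be placed. Taking the earliest available hit for each motif leaves the
    most room for the motifs that follow.
    """
    placed, next_free = [], 0
    for motif in motifs:
        candidates = [s for s in motif_hits(motif, sequence) if s >= next_free]
        if not candidates:
            return None
        start = candidates[0]
        placed.append(start + 1)
        next_free = start + len(motif[0])
    return placed

# Toy fingerprint of three motifs, invented for this example.
TOY_FINGERPRINT = [["GKST", "GKSS"], ["DEAD", "DEAH"], ["SAT", "TAT"]]
print(match_fingerprint(TOY_FINGERPRINT, "MMGKSTLLLDEADPPSATQ"))   # all three, in order
print(match_fingerprint(TOY_FINGERPRINT, "MMSATLLLDEADPPGKSTQ"))   # same motifs, wrong order
```

Requiring several short hits together is what lets individually weak motifs add up to a confident family assignment.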

PFAM

Principal authors: Richard Durbin et al. (Cambridge UK), Erik Sonnhammer (NCBI), and Sean Eddy et al. (Washington University, USA)
Strengths: Automated annotation of complete domains (start/end coordinates, multiple alignment to other family members) in addition to similarity detection
Type of pattern: full multiple alignments of domains
Family definition method: manual; but substantially reliant on PROSITE and SCOP
Construction method: manual; but substantially reliant on profile HMM methods
Search method: profile HMM
Current release: 2.1 (November 1997)
Number of distinct patterns: 527 domain alignments
Home pages: http://genome.wustl.edu/Pfam/
    http://www.sanger.ac.uk/Pfam/
On-line analysis servers: http://genome.wustl.edu/eddy/cgi-bin/hmm_page.cgi
    http://www.sanger.ac.uk/Pfam/HMM_search2.shtml
Downloadable databases: ftp://genome.wustl.edu/pub/databases/Pfam/
    ftp://ftp.sanger.ac.uk/pub/databases/Pfam/
Downloadable software: http://genome.wustl.edu/eddy/hmmer.html
Recent reference: "Pfam: A Comprehensive Database of Protein Families Based on Seed Alignments", Proteins 28:405-420, 1997.

Some other domain databases.

ISREC PROFILES
Philipp Bucher, Kay Hofmann. Similar to PFAM. Profiles of complete protein domain alignments. Searchable by profile HMMs.
PROCLASS
Cathy Wu. Neural net based family classification of proteins.
SBASE
Consists of single domain subsequences cut out of SwissProt. Searchable by BLAST.
PRODOM
Consists of ungapped alignments from an automatic all-vs-all clustering of SwissProt by the DOMAINER clustering algorithm. Searchable by BLAST.

Annotated bibliography

A. Bairoch, P. Bucher, K. Hofmann. (1997) The PROSITE Database, Its Status in 1997. NAR 25:217-221. Update on the status of PROSITE in the 1997 NAR database issue.

J.G. Henikoff, S. Henikoff. (1996) Blocks Database and its Applications. Meth. Enzymology 266:88-105. Review article about the BLOCKS database.

J.G. Henikoff, S. Pietrokovski, S. Henikoff. (1997) Recent Enhancements to the Blocks Database Servers. NAR 25:222-225. Update on the status of BLOCKS in the 1997 NAR database issue.

T.K. Attwood, M.E. Beck. (1994) PRINTS - A protein motif fingerprint database. Protein Engineering 7:841-848. Original reference for the PRINTS database.

T.K. Attwood, M.E. Beck, A.J. Bleasby, K. Degtyarenko, A.D. Michie, D.J. Parry-Smith. (1997) Novel Developments With the PRINTS Protein Fingerprint Database. NAR 25:212-217. Update on the status of PRINTS in the 1997 NAR database issue.

Erik L.L. Sonnhammer, Sean R. Eddy, Richard Durbin. (1997) Pfam: A Comprehensive Database of Protein Families Based on Seed Alignments. Proteins 28:405-420. Original reference for the PFAM database.


Sean Eddy, <eddy@genetics.wustl.edu>

Last modified: Mon Nov 3 11:24:28 1997