ProteinCenter User Manual

Appendix A. Data Sources

Table of Contents

A.1. Protein Data Sources
A.2. Gene Data Sources
A.3.

This appendix provides an overview of the data sources in ProteinCenter™:

A.1. Protein Data Sources

Table A.1. Protein Data Sources

Short Name  Name         Description
GI          GI           National Center for Biotechnology Information
UNI         UniProt      Universal Protein Resource
SP          Swiss-Prot   UniProt, manually annotated
TR          Trembl       UniProt, automatically annotated
REF         RefSeq       NCBI Reference Sequence Database
TBLIST      TubercuList  TubercuList, Institut Pasteur
PIR         PIR          Protein Information Resource
PDB         PDB          Brookhaven Protein Data Bank
EMB         EMBL         European Molecular Biology Laboratory
TPE         TPE          EMBL, third party annotation
TPG         TPG          GenBank, third party annotation
PRF         PRF          Protein Research Foundation
DBJ         DDBJ         DNA Data Bank of Japan
GB          GenBank      NCBI GenBank
ESBL        Ensembl      European Bioinformatics Institute
IPI         IPI          International Protein Index
SGD         SGD          Saccharomyces Genome Database
FB          FlyBase      Database of Drosophila Genes & Genomes
TAIR        TAIR         The Arabidopsis Information Resource
PLASMO      Plasmo       The Plasmodium Genome Resource
CMR         CMR          Comprehensive Microbial Resource
TBLIST      TbList       TubercuList, Institut Pasteur
PSE         PSE          Pseudomonas Genome Database
STRING      STRING       STRING Protein-Protein Interactions
TRITRYP     TriTryp      Trypanosomatidae functional genomic database

A.2. Gene Data Sources

Table A.2. Gene Data Sources

Short Name  Name             Description
Primary
ENTREZ      Entrez           NCBI Entrez Gene
EMSEMB      Ensembl          European Bioinformatics Institute
Alternate
ENTOFF      Official symbol  Entrez official symbol
ENTSYN      Entrez synonym   Entrez synonym
HGNC        HGNC             HUGO Gene Nomenclature Committee
MGI         MGI              Mouse Genome Informatics
MIM         MIM              Online Mendelian Inheritance in Man
RGD         RGD              Rat Genome Database
SGD         SGD              Saccharomyces Genome Database
WORMB       WormBase         Unified gene database of Caenorhabditis elegans
FLYB        FlyBase          Unified gene database of Drosophila
ZFIN        ZFIN             Zebrafish model organism database
IMGT        IMGT             International ImMunoGeneTics information system
ECOCYD      ECOCYC           Encyclopedia of Escherichia coli
HPRD        HPRD             Human Protein Reference Database at Hopkins University
TAIR        TAIR             The Arabidopsis Information Resource
MAIZE       MaizeGDB         Maize genetics and genomics database
VECTOR      VectorBase       Invertebrate vectors of human pathogens
XBASE       XenBase          Xenopus biology and genomics resource
PBR         PBR              Poxvirus Bioinformatics Resource Center
PATHEM      Pathema          Pathema bioinformatics resource center
API_DB      ApiDB/CryptoDB   Apicomplexan and Cryptosporidium bioinformatics resources
ECO         EcoGene          EcoGene database of Escherichia coli sequence and function
CGNC        CGNC             Chicken Gene Nomenclature Consortium
PCAP        PseudoCap        Pseudomonas aeruginosa community annotation project
AQTLDB      Animal QTLdb     Animal Quantitative Trait Locus database
BEETLB      BeetleBase       Sequence database for Tribolium genetics, genomics and developmental biology
DICTYB      DictyBase        Dictyostelium discoideum genome database
BEEB        BeeBase          Sequence data source for the bee research community
APHIDB      AphidBase        The aphid genome database
BGD         BGD              The Bovine Genome Database
MIRB        MirBase          Published miRNA sequences and annotation
NASONB      NasoniaBase      Data repository for the Nasonia Species Complex Genome Projects
VEGA        VEGA             VErtebrate Genome Annotation Database

Table A.3. Description of the Most Prominent Data Sources in ProteinCenter

Data Source    Description of Data Source    Included data
NCBI
NCBI protein sequences: GenBank, NRDB, RefSeq, etc.

Copyright Status

Most of the information available from this site is within the public domain. Public domain information on the NCBI Web pages may be freely downloaded and reproduced. However, it is requested that in any subsequent use of this work, NCBI be given appropriate acknowledgment. This site also contains some material, such as abstracts, full text of journal articles and books, and the OMIM database, which is copyright protected. For such material, the submitting authors or other copyright holders retain rights for reproduction or redistribution. All persons reproducing or redistributing this information are expected to adhere to the terms and constraints invoked by the copyright holder. Such protected material, however, may be used under the terms of "fair use" as defined in the copyright laws, which generally permit use for non-commercial educational purposes such as teaching, research, criticism, and news reporting.

Molecular Database Availability

Databases of molecular data on the NCBI Web site include such examples as nucleotide sequences (GenBank), protein sequences, macromolecular structures, molecular variation, gene expression, and mapping data. They are designed to provide and encourage access within the scientific community to sources of current and comprehensive information. Therefore, NCBI itself places no restrictions on the use or distribution of the data contained therein. However, some submitters of the original data may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted. NCBI is not in a position to assess the validity of such claims and, therefore, cannot provide comment or unrestricted permission concerning the use, copying, or distribution of the information contained in the molecular databases.
Sequences and taxonomy information; identifiers for proteins from PRF, DDBJ, PDB, etc.
IPI
IPI provides a top-level guide to the main databases that describe the proteomes of higher eukaryotic organisms: UniProt, RefSeq and Ensembl. It effectively maintains a database of cross-references between the primary data sources, provides minimally redundant yet maximally complete sets of human, mouse and rat proteins (one sequence per transcript), and maintains stable identifiers (with incremental versioning) to allow the tracking of sequences between IPI releases.

Copyright © The European Bioinformatics Institute. The IPI may be copied and redistributed freely, without advance permission, provided that this copyright statement is reproduced with each copy.

Sequences, protein identifiers and descriptions. InterPro domains. Cross-mapping to Ensembl. All IPI versions have been included in the system.
UniProt
UniProt (Universal Protein Resource) is the world's most comprehensive catalog of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.

COPYRIGHT NOTICE © 2006 UniProt Consortium. We have chosen to apply the Creative Commons Attribution-NoDerivs License to all copyrightable parts of our databases. This means that you are free to copy, distribute, display and make commercial use of these databases, provided you give us credit. However, if you intend to distribute a modified version of one of our databases, you must ask us for permission first.

Sequences (including alternatively spliced isoforms (varsplic)) and annotation. Gene ontology mapping. All UniProt versions have been included in the system.
Ensembl
Sequence database for a number of model organisms

All datasets generated by the Ensembl project are freely available to download from the ftp.ensembl.org site.

Sequences, protein identifiers and descriptions.
InterPro
InterPro is a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences.

Copyright Notice

InterPro - Integrated Resource Of Protein Families, Domains And Functional Sites. Copyright (C) 2001 The InterPro Consortium. This manual and the accompanying database may be copied and redistributed freely, without advance permission, provided that this Copyright statement is reproduced with each copy.

What does this mean? The InterPro member databases agreed that all data in InterPro and on the InterPro ftp server is freely distributable and no license agreements are required for use. For the databases that normally do require licenses for commercial use, all data from those databases that are distributed with the InterProScan package may be considered free and public and not subject to license agreements. These databases may have additional information not distributed by InterPro which is then subject to licensing.
Protein domain annotation
Entrez Gene
Entrez Gene has been implemented to supply key connections in the nexus of map, sequence, expression, structure, function, citation, and homology data

See NCBI above for copyright notice

Annotation and gene ontology mapping
Gene Ontology
The goal of the Gene Ontology project is to produce a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing. GO provides three structured networks of defined terms to describe gene product attributes. GO is one of the controlled vocabularies of the Open Biological Ontologies.

GO Redistribution and Citation Policy

GO Policy

The GO database and vocabularies are freely available to the public. The annotations provided by member organizations in the Current Annotations table are also available to the public. The GO Consortium gives permission for any of its products to be used without license for any purpose under three conditions:

1. That the Gene Ontology Consortium is clearly acknowledged as the source of the product;
2. That any GO Consortium file(s) displayed publicly include the version number(s) and/or date(s) of the relevant GO file(s) (The GO is evolving and changes will occur with time.);
3. That neither the content of a GO file(s) nor the logical relationships embedded within the GO file(s) be altered in any way.

Please address any questions about this policy to

http://www.geneontology.org/GO.cite.html

Annotation and gene ontology mapping
IntAct
IntAct provides a freely available, open source database system and analysis tools for protein interaction data. All interactions are derived from literature curation or direct user submissions and are freely available.

Availability: All IntAct data and software is freely available to all users, academic or commercial, under the terms of the Apache License, Version 2.0.

Protein-protein interactions
PFAM
Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families

Pfam - A database of protein domain family alignments and HMMs. Copyright (C) 1996-2001 The Pfam Consortium. This database is free; you can redistribute it and/or modify it under the terms of the GNU Library General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

ftp://ftp.sanger.ac.uk/pub/databases/Pfam/COPYRIGHT

Pfam domain annotation, for Pfam domains identified using rpsBLAST
NCBI BLAST package
The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

BLAST is public domain software from NCBI.

http://www.ncbi.nlm.nih.gov/blast/index.shtml

BLAST searches using BLASTp, blast2seq and rpsblast
CDD
This ftp-directory archives collections of position-specific scoring matrices which have been created for the Conserved Domain (CD-Search) service (see: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml). The PSSMs are meant to be used for compiling RPS-Blast search databases only. RPS-Blast and the software tools needed to convert files in this directory are part of the NCBI software development toolkit distribution (available at ftp://ncbi.nlm.nih.gov/toolbox).

PFAM domains in CDD format - PSSMs from a mirror of the Pfam-A seed alignment database
PrediSi
PrediSi (PREDIction of SIgnal peptides) is a software tool for predicting signal peptide sequences and their cleavage positions in bacterial and eukaryotic proteins.

PrediSi predictions are included in ProteinCenter with kind permission from Karsten Hiller, Institute for Microbiology, Technical University of Braunschweig.

Try out PrediSi at:

http://www.predisi.de/index.html

Signal peptide prediction
TMAP (EMBOSS)
This program predicts transmembrane segments in proteins, utilising the algorithm described in: "Persson, B. & Argos, P. (1994) Prediction of transmembrane segments in proteins utilising multiple sequence alignments. J. Mol. Biol. 237, 182-192."

EMBOSS applications are released under the GNU General Public Licence.

http://emboss.sourceforge.net/

Transmembrane domain prediction
Gramene
As an information resource, Gramene's purpose is to provide added value to datasets available within the public sector, which will facilitate researchers' ability to understand the rice genome and leverage the rice genomic sequence for identifying and understanding corresponding genes, pathways and phenotypes in other crop grasses.

http://www.gramene.org

GO annotation
SGD
SGD™ is a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae, which is commonly known as baker's or budding yeast.

http://www.yeastgenome.org/

Sequences, protein identifiers and descriptions for yeast proteins
FlyBase

The FlyBase project is carried out by a consortium of Drosophila researchers and computer scientists at Harvard University, University of Cambridge (UK), and Indiana University.

http://flybase.bio.indiana.edu/

Sequences and protein identifiers for Drosophila proteins
MIPS Interaction
The MIPS Mammalian Protein-Protein Interaction Database is a collection of manually curated high-quality PPI data collected from the scientific literature by expert curators.

http://mips.gsf.de/services/ppi

Protein-protein interactions
TAIR
The Arabidopsis Information Resource (TAIR) maintains a database of genetic and molecular biology data for the model higher plant Arabidopsis thaliana.

http://www.arabidopsis.org

Sequences, protein identifiers and descriptions for Arabidopsis proteins
Pseudomonas Genome Database
The Pseudomonas Genome Database (PGD) is a repository for the Gram-negative bacterium species Pseudomonas aeruginosa.

http://www.pseudomonas.com/

Sequences, protein identifiers and descriptions for Pseudomonas proteins
TubercuList
The annotated protein products from the re-annotated genome of Mycobacterium tuberculosis H37Rv.

http://genolist.pasteur.fr/TubercuList/

Sequences, protein identifiers and descriptions for Mycobacterium tuberculosis proteins from the TubercuList database
CMR
The Comprehensive Microbial Resource (CMR) contains robust annotation of all complete microbial genomes.

http://cmr.jcvi.org/cgi-bin/CMR/CmrHomePage.cgi

Sequences, protein identifiers and descriptions for proteins from the CMR database
KEGG
Molecular interaction and reaction networks for metabolism, various cellular processes, and human diseases.

http://www.genome.jp/kegg/

Pathway annotation
PlasmoDB
PlasmoDB.org hosts genomic and proteomic data (and more) for different species of the parasitic eukaryote Plasmodium, the cause of malaria. It brings together data provided by numerous laboratories worldwide (see the Data Sources page), and adds its own data analysis.

http://plasmodb.org/plasmo/

Sequences, protein identifiers and descriptions for Plasmodium proteins
STRING
STRING is a database of known and predicted protein interactions, including direct (physical) and indirect (functional) associations, derived from four sources: genomic context, co-expression, previous knowledge and high-throughput experiments.

http://string-db.org/

Protein-protein interactions
Reactome
Reactome is a free online database of biological pathways.

http://www.reactome.org/

License: Creative Commons Attribution 4.0 International License

Pathway annotation
WikiPathways
WikiPathways is an open, collaborative platform dedicated to the curation of biological pathways.

http://www.wikipathways.org/

License: Creative Commons Attribution 3.0 Unported license

Pathway annotation