This chapter gives a detailed description of the concept of a protein record in ProteinCenter™.
It explains how identifiers from different databases (protein keys) map to protein records in ProteinCenter, how multiple entries are consolidated to avoid redundancy, and how protein isoforms, fragments and outdated entries relate to each other.
In the list view of ProteinCenter™ only a single protein key and its corresponding description are displayed for each record, even though there may be several ProteinIDs from different databases that identify exactly the same protein. The choice of identifier and description is configured through the preferred type of protein key described below. In the ProteinCard it is possible to see all the protein keys identifying a particular protein (cf. Section 10.2.1, “Keys details”).
ProteinCenter™ contains more than 54 million protein sequences originating from 125 million protein records from all the popular sources, cf. Table A.1, “Protein Data Sources”.
ProteinCenter™ is a protein-centric system. All information (e.g. functional annotation, database links, genomic information) is mapped directly to the proteins, ensuring an efficient analysis of large datasets while retaining a resolution level that allows the distinction of different isoforms or native vs. processed proteins.
A protein record in ProteinCenter™ contains information about a particular protein sequence in a particular species. This means that where different alleles exist for the same protein, each allele is represented by a distinct protein record. Furthermore, the various fragments of a protein exist as separate records. It also means that, in the rare cases where proteins with exactly the same amino acid sequence occur in two or more species, the protein from each species is treated as a distinct entity, even though from a purely chemical point of view they represent the same molecule. This situation sometimes occurs between closely related species and with highly conserved proteins, e.g. histones.
This definition of a protein record is highly advantageous when working with experimental data from protein identification, since it does not compromise the resolution level of the experimental data.
The definition allows for:
Keeping track of distinct protein isoforms to the level of single amino acid changes.
Mapping results from mass spectrometry and protein chips directly to the database.
A highly detailed overview of all known isoforms.
Mapping identified fragments to full length isoforms.
Consolidation of all truly redundant entries.
The consolidation of identical sequences within a given species is an advantageous concept, which allows the system to deal with the fragmentary and redundant nature of many databases and with the merging of data from multiple databases. This consolidation is completely legitimate, since there is no difference among different database entries with respect to the protein (as a biological object). The only difference is in the associated annotation and the type of identifier (please consult Section 4.3, “Preferred accession” to understand how the user may customize the displayed type of identifiers).
Without this consolidation, certain proteins (i.e. specific sequences in a species) would have up to several hundred different entries (5 on average), all representing the same biological entity: exactly the same protein. For example, gi|4758258, gi|34921508, gi|1079483, gi|561630, gi|11527779, gi|13325290, gi|30583163, gi|37046849, gi|1091782, gi|48145655, IPI00002569.1, and the UniProt accessions Q13541 and Q6IBN3 all refer to the same human protein.
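The consolidation rule can be illustrated with a minimal Python sketch. It is not ProteinCenter's implementation; it only assumes that a record is identified by the pair (species, exact sequence). The keys, taxonomy IDs and sequences below are hypothetical.

```python
from collections import defaultdict

# Hypothetical source entries: (protein key, taxonomy ID, sequence).
entries = [
    ("KEY_A1", 9606, "MSGGSSCSQT"),
    ("KEY_A2", 9606, "MSGGSSCSQT"),   # same sequence, same species: same record
    ("KEY_B1", 10090, "MSGGSSCSQT"),  # same sequence, other species: separate record
    ("KEY_C1", 9606, "MSGGSSCSQA"),   # single amino acid change: separate record
]

def consolidate(entries):
    """Group protein keys by (taxonomy ID, exact sequence)."""
    records = defaultdict(list)
    for key, taxon, sequence in entries:
        records[(taxon, sequence)].append(key)
    return dict(records)

records = consolidate(entries)
for (taxon, sequence), keys in records.items():
    print(taxon, sequence, keys)
```

In this sketch the four entries collapse into three records, because only KEY_A1 and KEY_A2 share both species and sequence.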
This further means that when importing datasets into ProteinCenter™, the origin of the database identifiers (i.e. whether they are obtained from IPI, UniProt, PDB, NCBI nr, etc.) essentially does not matter. The same protein is recognized as the same protein regardless of its origin. This makes it possible to compare an IPI dataset to a set of GIs from a paper, or a set of UniProt accession numbers to a set of PDB accessions, or even to compare datasets of mixed protein identifiers.
Moreover, this consolidation provides a very powerful method of combining annotation from a range of database records. Annotation from IPI, UniProt, Ensembl and NCBI is consolidated and standardized to remove redundancy and provide concise information. For example, all gene ontology annotation associated with any of the original database records is consolidated into a single ProteinCenter™ protein record.
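As a rough illustration of comparing datasets with different identifier types, the sketch below translates every key to a record ID before comparing. The lookup table, the keys and the record IDs are all hypothetical, and the function name `to_records` is made up for this example.

```python
# Hypothetical lookup from protein key to an internal record ID.
key_to_record = {
    "IPI_X.1": 1, "gi|111": 1, "UP_AAA": 1,
    "IPI_Y.1": 2, "gi|222": 2,
    "gi|333": 3,
}

def to_records(keys, key_to_record):
    """Translate keys to record IDs; return the mapped set and the unmapped keys."""
    mapped = {key_to_record[k] for k in keys if k in key_to_record}
    unmapped = [k for k in keys if k not in key_to_record]
    return mapped, unmapped

ipi_dataset = ["IPI_X.1", "IPI_Y.1"]
mixed_dataset = ["gi|111", "gi|333", "gi|999"]

ipi_records, _ = to_records(ipi_dataset, key_to_record)
mixed_records, unknown = to_records(mixed_dataset, key_to_record)

# Records present in both datasets, regardless of the identifier type used.
shared = ipi_records & mixed_records
print(shared, unknown)
```

Once both datasets are expressed as record IDs, ordinary set operations (intersection, difference) give the comparison.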
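The idea of consolidating annotation can be sketched as a set union over the source entries of one record. This is an assumption about the general principle only, not a description of how ProteinCenter standardizes annotation. The keys are hypothetical and the GO identifiers serve purely as placeholders.

```python
# Hypothetical GO annotations attached to the source entries of one record.
source_annotations = {
    "KEY_A1": {"GO:0005515", "GO:0005737"},
    "KEY_A2": {"GO:0005515"},
    "KEY_A3": {"GO:0006412", "GO:0005737"},
}

def merge_annotations(source_annotations):
    """Return the de-duplicated, sorted union of all annotation terms."""
    merged = set()
    for terms in source_annotations.values():
        merged |= terms
    return sorted(merged)

record_go_terms = merge_annotations(source_annotations)
print(record_go_terms)
```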
How is this different to the protein records in UniProt?
In UniProt, several protein isoforms are intentionally merged into one record in order to minimize the redundancy of the database. Differences between sequences due to splice variants, polymorphisms, disease-causing mutations, experimental sequence modifications or simply sequencing errors are listed in the feature table of the corresponding UniProt entry, while the actual sequence represents one single isoform. This means that even splice variants that may differ considerably in sequence are merged into one entry. However, many records in the TrEMBL part of UniProt still exist as individual records, including a significant number of fragments.
In ProteinCenter™ all reported isoforms are kept as separate entries to allow a high resolution for experimental protein identification thereby allowing distinction between different isoforms even at the allelic level.
How is this different to the protein record at GenBank?
The most important source of new data in GenBank® is direct submissions from scientists. GenBank depends on its contributors to help keep the database as comprehensive, current, and accurate as possible. Hence, errors are not corrected and redundant entries are not merged. Since GenBank is primarily a nucleotide sequence repository, many proteins in GenBank (as in TrEMBL) are created from conceptual translations of nucleotide records. This results in many incomplete protein entries. An extreme case is a protein sequence consisting only of the N-terminal methionine (as in the example below), which arises when the protein is a conceptual translation of a nucleotide entry from a promoter study where the sequence of interest lies upstream of the N-terminal methionine. Likewise, records consisting solely of a sequence of "X" characters exist, denoting that the sequence is unknown.
NR is a BLAST database containing all non-redundant GenBank CDS translations + RefSeq Proteins + PDB + Swiss-Prot + PIR + PRF. In NR, identical sequences are merged into one entry. To be merged, two sequences must have identical lengths and the same residue at every position. Consequently, NR merges identical proteins from different species into a single entry. ProteinCenter takes a different approach: identical proteins from different species have separate protein records, which avoids mixing annotation from different species (the same protein may have different roles in different species).
Sequences shorter than 5 amino acids are excluded from the NR database. This means that "sequences" like 'M' in the figure above, consisting of only a single amino acid are not included in NR, even though they do have a GI. ProteinCenter™ includes sequences of any length.
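The difference between the two grouping rules can be made concrete with a small sketch. It only mimics the rules as stated above (NR: merge on sequence alone and drop sequences shorter than 5 residues; ProteinCenter: merge on species plus sequence and keep any length). The function names and entries are hypothetical.

```python
# Hypothetical entries: (protein key, taxonomy ID, sequence).
entries = [
    ("K1", 9606, "MARTKQTARK"),
    ("K2", 10090, "MARTKQTARK"),  # identical sequence in another species
    ("K3", 9606, "M"),            # single-residue "sequence"
]

def nr_style(entries, min_length=5):
    """Merge on sequence alone; skip sequences shorter than min_length."""
    groups = {}
    for key, _taxon, sequence in entries:
        if len(sequence) < min_length:
            continue
        groups.setdefault(sequence, []).append(key)
    return groups

def per_species(entries):
    """Merge on (taxonomy ID, sequence); keep sequences of any length."""
    groups = {}
    for key, taxon, sequence in entries:
        groups.setdefault((taxon, sequence), []).append(key)
    return groups

nr_groups = nr_style(entries)
pc_groups = per_species(entries)
print(len(nr_groups), len(pc_groups))
```

Here the NR-style rule yields one merged entry covering both species, while the per-species rule yields three separate records, including the single-residue one.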
In general, most scientists use protein keys to keep track of proteins, whether experimentally verified or identified in a database or the literature. To keep full control of this tracking, it is important to have some basic knowledge of both the version numbers of certain protein keys and the fact that protein keys may become obsolete/outdated. In the following, the basics of versioning and outdating are described.
The source databases, IPI, GenBank, EMBL, and DDBJ increment the entry accession version if a sequence entry is updated, corrected or extended as a result of new findings from recent experiments.
GIs do not have version numbers. The accession.version and GI systems of sequence identifiers are maintained in parallel. Therefore, when a change is made to a sequence, it receives a new GI number AND an increase to its version number. Thus, a GI is an unambiguous identifier for a specific protein sequence.
UniProt does not version its accession codes, and allows changes to the associated sequence or, in rare cases, even to the taxonomy.
When the version number is incremented for a protein key, the previous entry, which now has the lower version number, is outdated. At GenBank the associated GI is likewise outdated.
In this section, it is outlined how the system ensures that the optimal identifier and description are shown when there are multiple choices for a specific type of protein key (e.g. multiple GIs for the same protein).
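The outdating rule for versioned keys can be sketched as follows. The sketch assumes keys written as ACCESSION.VERSION with an integer version, and treats every key whose version is lower than the highest version seen for the same accession as outdated. The accessions and helper names are hypothetical.

```python
def split_versioned(key):
    """Split 'ACCESSION.VERSION' into (accession, integer version)."""
    accession, _, version = key.rpartition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not a versioned key: {key!r}")
    return accession, int(version)

def mark_outdated(keys):
    """Map each key to True if a higher version of the same accession exists."""
    parsed = {key: split_versioned(key) for key in keys}
    latest = {}
    for accession, version in parsed.values():
        latest[accession] = max(version, latest.get(accession, version))
    return {
        key: version < latest[accession]
        for key, (accession, version) in parsed.items()
    }

# Hypothetical versioned accessions.
status = mark_outdated(["ACC0001.1", "ACC0001.2", "ACC0002.1"])
print(status)
```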
For one specific protein, multiple protein keys may exist (GI accession numbers, IPI accessions etc. from different databases and sometimes even several redundant identifiers from the same database - see more in Section 4.1, “The basics of a protein record” above).
Without ProteinCenter, the user is forced to keep track of all the protein keys that are associated with the same protein, a bookkeeping exercise which often takes up a considerable amount of time. Moreover, it may even prevent the use of datasets that are reported with protein keys from databases other than the one usually used. With ProteinCenter, the bookkeeping of protein keys is handled automatically: the same protein key is shown for the same protein every time. Since an accession is also tied to a specific description, this results in a much more consistent data view.
Some researchers prefer to use UniProt for their database searches; others use IPI or GenBank. If the same protein record exists in several of these databases, a user who normally uses IPI will want to see an IPI accession code, while a user who prefers GenBank will want to see a GI. To accommodate this, the system can be set up to preferentially show keys from a particular database when several equal choices exist. This is done on the user settings page as described in Section 18.104.22.168.1, “How to choose a preferred type of protein keys”.
Even within the preferred database, there may be several different accession keys to choose between. In these cases, the system will automatically pick an accession key to show, using the criteria described below.
In cases where there are multiple protein keys of the preferred type for a protein, the description associated with the different protein identifiers may vary quite a lot for a given protein. See for example the record shown in Figure 4.2, “Multiple keys and descriptions for the same protein entry”, where the description from the "Emb" record clearly is uninformative, compared to the descriptions from PIR and RefSeq, which have probably been created at a later time.
ProteinCenter automatically picks the protein key/description pair that is expected to display the most informative description. If several NCBI GIs are available, the canonical GI is chosen according to the original data source of each GI, using the following scheme (most preferred source listed first):
Ref - NCBI RefSeq - Confirmed entry
Sp - Swiss-Prot, now part of UniProt
PIR - Protein Information Resource International, now part of UniProt
PDB - Brookhaven Protein Database
PRF - Protein Research Foundation
TPE - Third party annotation, EMBL
TPG - Third party annotation, GenBank
Gb - GenPept (GenBank protein) identifier
Emb - EBI EMBL Database
Dbj - DNA Database of Japan
Tr - TrEMBL, now part of UniProt
Ref - NCBI RefSeq - Non-confirmed entry
Live protein keys are always preferred over outdated ones. If all protein keys have been outdated, the same scheme is used to pick among the existing outdated protein keys.
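The selection scheme above can be sketched as a sort on two criteria: live before outdated, then source priority. This is a minimal illustration rather than ProteinCenter's actual code. The labels "Ref:confirmed" and "Ref:non-confirmed", the dictionary layout and the GIs are hypothetical; the priority order itself follows the list above.

```python
# Source tags in order of preference (most preferred first).
SOURCE_PRIORITY = [
    "Ref:confirmed", "Sp", "PIR", "PDB", "PRF", "TPE", "TPG",
    "Gb", "Emb", "Dbj", "Tr", "Ref:non-confirmed",
]
RANK = {source: rank for rank, source in enumerate(SOURCE_PRIORITY)}

def pick_canonical(candidates):
    """Pick one key: live keys beat outdated ones, then source priority decides."""
    if not candidates:
        raise ValueError("no candidate keys")
    # False sorts before True, so live keys (not live == False) come first.
    return min(candidates, key=lambda c: (not c["live"], RANK[c["source"]]))

# Hypothetical candidate GIs for one protein record.
candidates = [
    {"key": "gi|1", "source": "Emb", "live": True},
    {"key": "gi|2", "source": "Ref:confirmed", "live": False},
    {"key": "gi|3", "source": "PIR", "live": True},
]
chosen = pick_canonical(candidates)

# When every key is outdated, the same priority order still applies.
all_outdated = [dict(c, live=False) for c in candidates]
chosen_outdated = pick_canonical(all_outdated)
print(chosen["key"], chosen_outdated["key"])
```

In the first case the live PIR key wins over the outdated RefSeq key; in the second case, with no live keys left, the confirmed RefSeq key is chosen.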
© 2005-2017 Thermo Fisher Scientific