ProteinCenter User Manual

Chapter 7. The Lookup

Table of Contents

7.1. Lookup by GI or accession code
7.1.1. Introduction
7.1.2. How to look up a protein key
7.2. Lookup by peptide
7.2.1. Introduction
7.2.2. How to look up proteins by tryptic peptides
7.3. Lookup by gene symbol
7.3.1. Introduction
7.3.2. How to look up proteins by gene symbols
7.4. Lookup by annotation
7.4.1. Introduction
7.4.2. How to look up proteins by annotation

The Lookup component is located on the switchboard, and is identified with a binoculars icon. It allows users to look up proteins based on:

  1. Protein keys (accession numbers, GIs or entry names)

  2. Tryptic peptides (sequences)

  3. Gene symbols

  4. Annotation

Figure 7.1. The Lookup Component


  1. Entry field for protein keys.

  2. Press Search Proteins to look up protein keys and display information for the corresponding proteins.

  3. Entry field for peptide sequences.

  4. Press Search Proteins to look up tryptic peptides and display information for the corresponding proteins.

  5. Entry field for gene symbols.

  6. Press Search Proteins to look up gene symbols and display information for the corresponding proteins.

  7. Protein annotation type selection.

  8. Entry field for annotation descriptions.

  9. Press Search Annotation to look up annotation identifiers for the given description.

  10. Entry text area for annotation identifier specifications.

  11. Press Search Proteins to look up proteins annotated with the given identifiers.

As described in Chapter 4, The Protein record, each protein in ProteinCenter™ is a unified record of a number of external protein sequence records. For every species, external protein sequence records containing the same amino acid sequence are consolidated into a single protein in ProteinCenter™.

Therefore, many different accession numbers can be associated with a single protein record in ProteinCenter™, as shown in the example below. Here, all the accession numbers, the GIs and the single IPI reference are keys for the same protein sequence in the same species, and hence all these keys represent redundant entries of the same protein. In ProteinCenter™ there is only one protein record per specific isoform in a given species.

The original database entries may contain different annotation, which is captured in the single protein record in ProteinCenter™. Cross-linking and mapping different types of accession numbers represent a significant workload for many researchers who wish to compare datasets from their own experiments, from databases and from the literature. By providing a complete mapping of protein keys to protein sequences, ProteinCenter™ is the ultimate bookkeeper that allows researchers to handle and analyze data independently of the ProteinID types.

7.1. Lookup by GI or accession code

7.1.1. Introduction

The lookup menu provides the users with a fast way to look up protein entries associated with a specific protein key (accession numbers, GIs, IPI identifiers, etc.). ProteinCenter™ contains protein sequences from all the popular sources, cf. Table A.1, “Protein Data Sources”.

When performing a lookup, it is not necessary to specify whether the key is a GI or IPI identifier etc.

The protein keys should be entered in their own standard format, as in these examples:

    38197418              1NOP             2B18_HUMAN
    AAA52849              AAF69647         BAB32195.1
    ENSMUSP00000051618    IPI00167154.3    IPI00215612
    NP_996803             P13761           SYHUTP
    O70455                Q30134           XP_235312.2
    YAL002W               EFB1             AT1G44020.1

Please note that these examples do not represent a comprehensive list with regard to formats for valid protein keys.

Both primary and secondary protein keys may be used. A primary protein key is, for example, the first accession number of a UniProt record. UniProt strongly advises researchers who cite UniProt entries in their publications to always use primary accession numbers. Over time, however, a primary UniProt ProteinID may become secondary and even ambiguous. To address this potential problem, ProteinCenter™ accepts ambiguous secondary protein keys: where multiple proteins are associated with a secondary protein key, the candidate protein keys are displayed to the user.

The letters in the protein keys can be either uppercase or lowercase. Version numbers (".1", ".2", etc.) are optional. For more details on versioning, please refer to Section 4.2, “Versioning and outdated entries”.

Please note that you may only enter one protein key at a time, and that the accession keys shown in the first column of the Proteins view depend on the configuration of preferred accession keys.
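The input rules above (one key at a time, letters in either case, optional version suffix) can be illustrated with a small sketch. It is not ProteinCenter code: the helper name split_protein_key and the choice of uppercase as the canonical form are assumptions made for the example only.

```python
import re
from typing import Optional, Tuple

# A trailing ".N" (N = digits) is treated as the version suffix.
_VERSION_SUFFIX = re.compile(r"^(?P<acc>.+?)\.(?P<ver>\d+)$")


def split_protein_key(raw: str) -> Tuple[str, Optional[int]]:
    """Hypothetical helper: return (accession, version) for one protein key.

    The version is None when the key carries no ".N" suffix.
    """
    key = raw.strip().upper()  # assumption: uppercase as canonical form
    if not key:
        raise ValueError("empty protein key")
    # Only one key may be entered at a time, so reject separators.
    if any(ch.isspace() or ch == "," for ch in key):
        raise ValueError("enter one protein key at a time")
    match = _VERSION_SUFFIX.match(key)
    if match:
        return match.group("acc"), int(match.group("ver"))
    return key, None
```

For instance, split_protein_key("bab32195.1") yields the accession BAB32195 with version 1, while a key without a suffix yields a version of None.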

7.1.2. How to look up a protein key

This section takes you through a few step-by-step lookups of a protein by its accession key.

7.1.2.1. Lookup of ProteinID

In this example a UniProt accession number is looked up:

  1. In the Lookup view find the text box for entering a protein accession.

  2. Type the accession number P13073 into the box.

  3. Press Search Proteins.

The associated proteins are fetched, and in this case a single protein record is returned and displayed in the selected view (ProteinCard or Proteins). In the Proteins view the result will be presented as follows:

The individual columns in the Proteins view are described in Chapter 14, Proteins view.

Depending on the preferred choice of protein keys, different protein keys and descriptions may be shown, while it is always the exact same protein entry that is returned. In the example above the preferred source is NRDB, so a GI and the NRDB description are displayed. If the ProteinCard pane was selected prior to the lookup, the ProteinCard for this particular protein record is displayed instead.

If no proteins are found for the protein key, the Proteins view will be empty, and the messages field above the workspace window will display 'Search returned 0 matches.'

For more information regarding the choice of accession type and description to display, please refer to Section 4.3, “Preferred accession”.

7.1.2.2. Lookup a protein key with or without versioning

If the version number is included with the accession number (e.g. NP_001007236.1), the specific entry is fetched by the lookup routine, ignoring any occurrence of the accession number with other versions. This allows the user to retrieve proteins by outdated accession numbers.

In the following example, the version of the protein key is included. Hence the particular protein entry corresponding to that accession.version key is returned. This occurs independent of whether the protein entry is live or outdated, and hence independent of whether a newer version exists or not.

  1. In the Lookup menu use the lookup by protein key.

  2. Type the protein key XP_007651.14 into the box, including the version number.

  3. Press Search Proteins.

The associated proteins are fetched and a single protein record is displayed in the Proteins view (if selected):

In this example, the retrieved protein entry is outdated, due to the fact that all associated protein keys are obsolete. In the ProteinCard the protein key is linked to the newer version (see Chapter 10, The ProteinCard).

If the version number is not included (e.g. NP_001007236), the system returns all versions, e.g. proteins with keys NP_001007236.1, NP_001007236.2, etc.
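The difference between a versioned and an unversioned lookup can be sketched over a toy in-memory table. This is an illustration only and not how ProteinCenter stores its keys: the record identifiers ("record-A" and so on), the table TOY_KEYS and the function lookup are invented for the example.

```python
from typing import Dict, List, Tuple

# Toy data: (accession, version) -> record id. The record ids are invented
# placeholders; only the accession strings are taken from the text.
TOY_KEYS: Dict[Tuple[str, int], str] = {
    ("NP_001007236", 1): "record-A",
    ("NP_001007236", 2): "record-B",
    ("P13073", 1): "record-C",
}


def lookup(key: str) -> List[str]:
    """Return record ids for a protein key, with or without a version."""
    cleaned = key.strip().upper()
    accession, dot, version = cleaned.rpartition(".")
    if dot and version.isdigit():
        # Versioned key: fetch exactly that accession.version, ignoring
        # every other version of the same accession.
        hit = TOY_KEYS.get((accession, int(version)))
        return [hit] if hit else []
    # Unversioned key: return every version of the accession.
    return [rec for (acc, _ver), rec in sorted(TOY_KEYS.items()) if acc == cleaned]
```

With this toy table, a lookup of NP_001007236.1 returns a single record, whereas a lookup of NP_001007236 returns both versions.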

For more details on versioning and live vs. obsolete accession numbers, please refer to Section 4.2, “Versioning and outdated entries” and Section 10.2.1.1, “Flagging of outdated protein keys in ProteinCenter”.

7.2. Lookup by peptide

7.2.1. Introduction

The Lookup feature also allows users to look up protein entries by one or more tryptic peptides. Only proteins that contain the specified set of peptides are returned.

This functionality allows users with data derived from mass spectrometry to quickly evaluate the information content of tryptic peptides, i.e. to assess whether the peptides are information rich (pointing to only a few proteins) or not. Secondly, it allows users to identify other peptides related to the same set of proteins.

ProteinCenter™ enables instant lookup of completely cleaved tryptic peptides from any of the proteins in the database. Peptides must have a minimum length of 5 amino acids, as shorter peptides tend to return a disproportionately large number of false positives. Any peptide search returns a maximum of 10000 hits.

Examples of valid peptides:

  • LTDSVLR

  • ACEAPPTCHSYR

  • NSHLYPLIETQYCPCDK

  • NAGIQINLQSECSSEEVTEIISQFTEK

  • VVALSMSPVDDTFISGSLDK

The peptide lookup only searches ProteinCenter™ for tryptic peptides - peptides imported by users will not be searched.
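As a rough illustration of the idea, the sketch below digests toy sequences into fully cleaved peptides, indexes them, and returns only the proteins that contain every queried peptide. It rests on several assumptions: trypsin is modelled by the simplified rule "cut after every K or R" (the common exception before proline is ignored), and the function names and toy sequences are invented. It does not describe ProteinCenter's actual index.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set

MIN_PEPTIDE_LENGTH = 5  # minimum length stated in the text
MAX_HITS = 10000        # maximum number of hits stated in the text


def tryptic_peptides(sequence: str) -> List[str]:
    """Fully cleaved peptides under a simplified rule: cut after K or R."""
    return re.findall(r"[^KR]*[KR]|[^KR]+$", sequence.upper())


def build_index(proteins: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each sufficiently long peptide to the proteins that yield it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for protein_id, sequence in proteins.items():
        for peptide in tryptic_peptides(sequence):
            if len(peptide) >= MIN_PEPTIDE_LENGTH:
                index[peptide].add(protein_id)
    return index


def lookup_by_peptides(index: Dict[str, Set[str]], peptides: Iterable[str]) -> List[str]:
    """Return proteins containing all queried peptides, capped at MAX_HITS."""
    queries = [p.strip().upper() for p in peptides if p.strip()]
    if not queries:
        return []
    for query in queries:
        if len(query) < MIN_PEPTIDE_LENGTH:
            raise ValueError(f"peptide {query!r} is shorter than {MIN_PEPTIDE_LENGTH}")
    # A protein qualifies only if it appears under every queried peptide.
    hits = set.intersection(*(index.get(q, set()) for q in queries))
    return sorted(hits)[:MAX_HITS]
```

A peptide shared by many proteins returns a long list, while adding a second peptide narrows the result to the proteins that contain both, which is the "information content" assessment described above.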

7.2.2. How to look up proteins by tryptic peptides

This section takes you through the step-by-step procedure for looking up proteins by tryptic peptides.

7.2.2.1. Lookup by a single tryptic peptide

In this example a single tryptic peptide is used to look up a set of proteins:

  1. In the Lookup component locate the Tryptic Peptide List text field.

  2. Specify the peptide TGWGSR in the box.

  3. Press Search Proteins.

This runs the lookup for proteins by tryptic peptides, and returns a number of proteins:

To inspect the returned protein for a particular species, use the sorting function.

7.2.2.2. Lookup by multiple tryptic peptides

In this example three peptides are used to look up a set of proteins.

  1. In the Lookup component select the Tryptic Peptide List text field.

  2. Specify the peptides VWAFCC,YGTCFYLGR,DSVPGLR in the box.

  3. Press Search Proteins.

This runs the lookup for proteins by tryptic peptides, and returns the matching proteins:

7.3. Lookup by gene symbol

7.3.1. Introduction

Proteins can also be found by their gene symbols. ProteinCenter looks for the official symbol supplied by Entrez and Ensembl, and for alternative symbols from other sources.

7.3.2. How to look up proteins by gene symbols

Lookup by gene symbol is similar to lookup by protein key or by tryptic peptides.

  1. In the Lookup component select the Gene Symbol text field.

  2. Specify the gene symbol MSI in the box.

  3. Press Search Proteins.

This runs the lookup, returning all proteins from a gene with the symbol MSI. Here, only 5 proteins are shown, and the list has been sorted according to the Gene column.

The Gene column shows the official symbol for the gene. Note that the lookup for gene symbols is not case sensitive. Also note that the three bottom proteins have Ebp as their official symbol. This means that they must have MSI as one of their alternative symbols. This can be seen in the ProteinCard. To go to the ProteinCard, click one of the accession keys in the first column. If IPI00137471.2 is clicked, the following ProteinCard will appear (shown only in part):

This protein does indeed have the alternative gene symbol mSI, which is why it is returned by the 'MSI' lookup.
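The matching behaviour described here (case-insensitive, against both official and alternative symbols) can be sketched as follows. The record type, the function name and the two TOY entries are invented for illustration; only the IPI00137471.2 / Ebp / mSI combination is taken from the example above.

```python
from typing import List, NamedTuple, Sequence, Tuple


class GeneRecord(NamedTuple):
    protein_key: str
    official_symbol: str
    alternative_symbols: Tuple[str, ...]


# Toy records. TOY-0001 and TOY-0002 are invented placeholders.
TOY_RECORDS = [
    GeneRecord("TOY-0001", "MSI", ()),
    GeneRecord("IPI00137471.2", "Ebp", ("mSI",)),
    GeneRecord("TOY-0002", "Abc1", ("Xyz",)),
]


def lookup_by_gene_symbol(records: Sequence[GeneRecord], symbol: str) -> List[str]:
    """Return protein keys whose official or alternative symbol matches."""
    wanted = symbol.strip().casefold()  # comparison ignores letter case
    return [
        rec.protein_key
        for rec in records
        if wanted == rec.official_symbol.casefold()
        or wanted in (alt.casefold() for alt in rec.alternative_symbols)
    ]
```

In this sketch a lookup of "msi" returns both TOY-0001 (official symbol) and IPI00137471.2 (alternative symbol), mirroring why proteins whose Gene column shows Ebp appear in the 'MSI' result.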

7.4. Lookup by annotation

7.4.1. Introduction

Targeted proteomics requires the retrieval of proteins based on a particular annotation of interest. ProteinCenter implements this in a two-step process: an initial (and optional) search on annotation identifiers based on their description, followed by a protein search based on the given annotation identifiers.

7.4.2. How to look up proteins by annotation

  1. In the Lookup component, select the particular type of annotation in the Annotation drop-down box.

  2. (Optionally) Search for annotation identifiers based on their description (or name, title, GO term or similar), using the Search Annotation button. This will list the annotations with at least a partial match in the annotation text area below, along with their exact description.

  3. Add, delete or edit annotation identifiers in the annotation text area. Each identifier must be kept on its own line. The identifiers must adhere to the standard definition for that particular annotation:

    Annotation type    Description                                                 Format
    Taxonomy           Taxonomy ID as defined by NCBI                              #######
    GO category        'GO:' followed by 7 digit identifier                        GO:#######
    Enzyme Code        'EC:' followed by four dot-separated numbers                EC:#.#.#.#
    KEGG pathway       3-letter taxonomy acronym and 5 digit pathway identifier    ORG#####
    PFAM domain        'PF' followed by 5 digit identifier                         PF#####
    InterPro domain    'IPR' followed by 6 digit identifier                        IPR######

  4. Press Search Proteins.

This runs the lookup, returning all proteins with the given annotation in the entire database. The maximum number of returned results is limited to 10000 hits.
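The identifier formats listed in step 3 lend themselves to a simple pre-check before pasting a list into the annotation text area. The sketch below is an assumption-laden illustration rather than ProteinCenter's own validation: the patterns are derived only from the format column above, the taxonomy pattern accepts any run of digits, and the names ANNOTATION_PATTERNS and invalid_identifiers are invented.

```python
import re
from typing import List

# One pattern per annotation type, derived from the format descriptions.
# Assumption: a taxonomy ID may have any number of digits.
ANNOTATION_PATTERNS = {
    "Taxonomy": re.compile(r"\d+"),
    "GO category": re.compile(r"GO:\d{7}"),
    "Enzyme Code": re.compile(r"EC:\d+\.\d+\.\d+\.\d+"),
    "KEGG pathway": re.compile(r"[A-Za-z]{3}\d{5}"),
    "PFAM domain": re.compile(r"PF\d{5}"),
    "InterPro domain": re.compile(r"IPR\d{6}"),
}


def invalid_identifiers(annotation_type: str, text: str) -> List[str]:
    """Return the non-empty lines that do not match the expected format.

    Each identifier is expected on its own line, as required in step 3.
    """
    pattern = ANNOTATION_PATTERNS[annotation_type]
    lines = [line.strip() for line in text.splitlines()]
    return [line for line in lines if line and not pattern.fullmatch(line)]
```

An empty result means every line has the expected shape; it does not mean the identifier exists in the underlying annotation source.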