This chapter contains a number of tutorials. By following these tutorials you will get a taste of what types of analyses ProteinCenter can help you with. The datasets used in the tutorials can be downloaded from Proxeon's website http://www.proxeon.com/productrange/data_interpretation/Tutorials/index.html . In new ProteinCenter systems the datasets can be found in the workspace in DataSets->Proxeon->Tutorials .
Identification of proteins by mass spectrometry is usually based on the sequencing of corresponding peptides followed by database search and an assignment of these peptides to the protein sequence derived from the protein, genomic or EST databases. As a result the matched protein sequence is represented as an accession key (database identifier) in the corresponding database.
With a wealth of sequence based databases the correct assignment of sequenced peptides to the protein sequence can be a challenge. Protein databases are also very dynamic with constant updates and a significant level of redundancy.
ProteinCenter is the most comprehensive catalog of information on proteins.
Protein identification GI:6807647 — hypothetical protein [Homo sapiens]
The traditional approach to extraction of biological information includes:
Information extraction from NCBI
Perform BLAST analysis
Domain and structure analysis, for example SMART
Gene Ontology analysis, via UniProt or GOMiner
There is a significant number of software tools, both commercial and publicly available, that can perform various kinds of bioinformatical analysis based on the protein sequence.
They are often limited to a particular set of protein identifiers. For example a user would be allowed to perform bioinformatical analysis on the protein sequences derived from UNIPROT/SwissProt/NCBI nrdb databases but not on ENSEMBL or IPI databases.
Another limitation can be caused by the outdated accession codes that have been removed from the protein database. So it can take a significant amount of time to figure out the successor IDs.
ProteinCenter is based on a comprehensive master database that includes all major protein databases. Currently, it contains more than 580 million accession codes, including around 90 million outdated identifiers.
ProteinCenter's Lookup menu allows straightforward retrieval of information from a typed accession code or a list of tryptic peptides:
Type Accession Key
Go to the ProteinCard pane
The detailed information is available on the ProteinCard. ProteinCard has been developed to present a brief view of a particular protein.
It contains the most valuable information about the selected protein, including gene names (official symbols), all primary and secondary accession keys, the protein sequence in FASTA format, domain structure and annotated post-translational modifications. The current example of the hypothetical protein (GI:6807647) shows that the protein is in fact well annotated as a heat shock protein.
To retrieve all accession keys corresponding to the selected sequence click on
In the Details view on the bottom of the window the primary and secondary accession codes will be listed.
The red warning sign in front of the accession code shows that the key is outdated. See Section 4.2, “Versioning and outdated entries” for an explanation on outdated accessions and outdated proteins.
To see the amino acid sequence click
Click to present all available annotations from PFAM, Tmap and InterPro.
The Features table gives the detailed view of all the detected domains for the protein sequence, their description, localization and external links.
Similar Proteins provides a list of protein neighbors (down to 60% sequence similarity) annotated like proteins in the lookup results. In addition, information on percent similarity to the query protein is available.
Click to display protein neighbor details.
By default, ProteinCard shows the proteins with 98% similarity. A user can redefine the similarity level by typing a percentage in the Similarity field.
To see the BLAST alignment for each sequence, click on the numbers in the Sim column.
Gene Ontology (GO) in ProteinCenter is presented according to three main categories: molecular function, cellular component and biological process.
Click on the molecular function category; all of the assigned molecular function gene ontology terms will be presented in the Details section.
Clicking on the GO Id will bring up the EMBL-EBI description of the particular gene ontology term.
Click the other two categories to show their content in a similar manner.
Select the interacting proteins section. The list of interacting proteins will be presented in the Details view.
The interacting proteins are presented similarly to the Protein Neighbours. To learn more about interacting proteins, click on the accession keys. They will be displayed in ProteinCard window.
The ProteinCard menu also contains a number of links to external sources that provide additional information. Try some of them.
Try the Lookup menu with other accession keys or with a list of tryptic peptides derived from your own experiments.
With the advent of modern proteomics technology several thousands of proteins can be identified from one experiment. It is becoming increasingly clear that proteomic researchers are facing a new bottleneck: the ability to produce data has outpaced the ability to mine the data intelligently.
While the ability to generate data has exploded, many researchers are still relying on lists of protein hits — usually the direct output from the search engine — to evaluate their shotgun proteomics experiments with. This may be tedious and time consuming at best, but a more dire consequence may be that the interpretation of the data is compromised, e.g. if the protein evidence is not presented correctly.
This tutorial presents an example of bioinformatics data processing using ProteinCenter. The dataset of red blood cell proteins (hRBC) from the paper “In-depth analysis of the membrane and cytosolic proteome of red blood cells”, Blood 2006, 108(3):791-801, is used as the example for bioinformatical analysis.
Download and save the hRBC data file from http://www.proxeon.com/productrange/data_interpretation/Tutorials/index.html to your computer. Alternatively find it in the ProteinCenter workspace. In the latter case there is no need to import the data. Just select the dataset (DataSets->Proxeon->Tutorials->Datasets comparison->hRBC_proteome) and skip to point 5.
Select a destination category in the workspace tree (e.g. DataSets).
Click on the Import pane in ProteinCenter.
The Import page is dedicated to data import. ProteinCenter currently supports the following file formats:
Proteome Discoverer protXML
Mascot Distiller XML
Gene proteins CSV
ProteinPilot protein summary
ProteinPilot peptide summary
MaxQuant protein groups
Waters PLGS protein summary
Select the appropriate format, browse for the hRBC data file and select it to add it to the queue, then click to upload the hRBC dataset into ProteinCenter.
When the import is done, select the imported dataset by clicking on its name; the µLIMS pane is selected automatically. The µLIMS page summarizes the number of records added to the dataset, the number of proteins and unique proteins, as well as any errors encountered.
In the current example each record is a protein accession code.
Click on the Proteins pane.
The protein list will be presented in the Protein menu according to the selected page size. Accession keys, protein name, gene name, the length of the protein, taxonomy, GO categories, prediction of transmembrane regions and signal peptides will be displayed.
Select an appropriate Page size.
Move the cursor to the colored boxes to display the GO functions as tool tips.
To sort by gene name, move the cursor to the column header and click on it.
By clicking on a protein's accession key in the Acc. Key column its ProteinCard will be presented.
The presence of a symbol in the AS column indicates the occurrence of alternative splicing in the sequence.
To focus on the specific part of the dataset, ProteinCenter offers a substantial number of different filters.
To select a filter, click the Edit filter definitions button.
Select cell surface from the cellular component options.
Apply the filter by clicking on the Save and select button.
As a result of filtering only the proteins with cell surface annotation (first column of boxes in cellular component column) will be displayed. The system allows application of multiple filters as well as inverted filters.
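The way multiple and inverted filters combine can be pictured with a small sketch. It is only an illustration of the logic, not ProteinCenter's implementation: the record layout, the function names and the accession keys are hypothetical, and it assumes that several active filters must all be satisfied at once.

```python
def has_component(term):
    """Build a filter keeping proteins annotated with the given cellular component."""
    def keep(protein):
        return term in protein["cellular_component"]
    return keep


def inverted(filter_fn):
    """Build the inverted filter: keep exactly what filter_fn would drop."""
    def keep(protein):
        return not filter_fn(protein)
    return keep


def apply_filters(proteins, filters):
    """Keep the proteins accepted by every filter in the list (assumed semantics)."""
    return [p for p in proteins if all(f(p) for f in filters)]


# Hypothetical records, for illustration only.
proteins = [
    {"key": "A1", "cellular_component": {"cell surface", "membrane"}},
    {"key": "B2", "cellular_component": {"cytoplasm"}},
    {"key": "C3", "cellular_component": {"cell surface", "nucleus"}},
]

surface = apply_filters(proteins, [has_component("cell surface")])
surface_not_nuclear = apply_filters(
    proteins, [has_component("cell surface"), inverted(has_component("nucleus"))]
)
print([p["key"] for p in surface])              # ['A1', 'C3']
print([p["key"] for p in surface_not_nuclear])  # ['A1']
```

Modelling each filter as a predicate keeps inversion and stacking independent of what the individual filter tests.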
Statistical analysis of a list of proteins provides an overview of the dataset by summarizing essential information into histogram tables and chart representations. It is presented in the Statistics menu.
Click on the Statistics pane.
Select an entry in the Show statistics for drop-down box. Press the Analyze button. Statistics for the dataset will be calculated and displayed.
ProteinCenter calculates distributions for domain structures, transmembrane and signal peptide predictions, disease associations, enzyme and pathway annotations as well as for three GO categories: molecular function, cellular components and biological processes.
The Details view gives more in depth information about the statistical distribution for the dataset.
Click on the corresponding link to retrieve the statistics for the cellular localization distribution of proteins. For a detailed description of the statistics view please refer to Chapter 25, Statistics.
A large number of protein accession keys are annotated as either 'unknown' protein or 'hypothetical' protein. This can be misleading, since it may simply be the description given at the time when the protein sequence was submitted. Many protein records are old, and hence the protein may have been unknown at the time of submission, while in many cases even well characterized proteins are still submitted with bad descriptions.
In ProteinCenter, the Similar Proteins section gives a relatively quick way to determine whether or not an unknown protein is really unknown:
Go to the lookup menu.
Enter the accession key (in this example:
If the Proteins view is selected, it will show one protein record as shown above. The description is 'unnamed protein product'.
Keep in mind that for a given protein, ProteinCenter attempts to pick the best possible description among the accession keys available (see Section 4.3.2, “How the optimal description is picked automatically” ).
Press the accession key link to bring up the ProteinCard (alternatively select the ProteinCard pane).
In the ProteinCard, press the blue header of similar proteins, which by default finds the protein neighbors that are 98% or more similar to this protein.
This will take a little while, as it involves a BLAST against virtually all known protein sequences.
In these neighbors we find a protein with the same length (747) as our "unnamed protein" ( Note: Accession keys ).
It is 99.8% identical, in other words very likely to be an allele (or a difference originating from a sequencing error), which can be confirmed by pressing the Blast alignment link (click the link saying 99.8 in the Sim column).
Sometimes it may be useful to convert between accession keys, for example to ensure that a certain type of accession is displayed in the documentation of protein identifications.
In this example we import a list of GIs, but it may be carried out with any type of accession keys.
First create a file of accessions to be converted, and save it. It should contain the keys 4502431, 10862703, 6005850 and 4503131.
Select a destination category in the Workspace, for the imported data.
Go to the Import pane:
In the format box, select CSV.
Click to locate the file to import.
When the file is selected, it is added to the upload queue.
A folder called "convert1" is created in the selected category in the workspace. Select the imported folder when the import is done.
Go to the view (formerly named 'Exp. Data').
The imported keys are shown.
The preferred keys are displayed as set in the user settings. Remember that ProteinCenter can be configured to display either UniProt, NCBI, Ensembl, SGD, TAIR, FlyBase, PLASMO, CMR, TBLIST, PSE, STRING, TRIPTRYP or IPI keys. Details of the configuration can be found in the administration guide, Section 4.3, “Preferred accession”. Also keep in mind that if IPI is preferred, as in this example, it can only be shown if a live IPI accession exists for the given protein. More on live and outdated accessions is described in Section 10.2.1.1, “Flagging of outdated protein keys in ProteinCenter”.
Go to the Export pane.
Check the boxes for the pieces of information you want to appear in the exported table. In this case IPI, UNIPROT and GI have been checked (Canonical key, Imported key, Gene symbol, Taxonomy and Protein description are checked by default).
After pressing the export button and opening the file (for example in Excel), the original accession keys populate the first column, and other accessions and best annotated GIs appear in subsequent columns. See the Export chapter for a rigorous explanation of the export process and formats.
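The input file from the first step of this tutorial can also be produced with a short script rather than by hand. The sketch below is only one possible way to do it and rests on assumptions: the file name `convert1.csv` is chosen here to match the "convert1" folder name, and a layout of one key per line without a header row is assumed rather than confirmed by the manual.

```python
import csv

# The four GI numbers listed in the first step of this tutorial.
keys = ["4502431", "10862703", "6005850", "4503131"]


def write_accession_file(path, accessions):
    """Write one accession key per line (assumed layout, no header row)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for accession in accessions:
            writer.writerow([accession])


# Hypothetical file name; adjust the path to wherever the file should be saved.
write_accession_file("convert1.csv", keys)
```

If the import reports errors, check the file layout against the description of the supported CSV formats before retrying.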
One of the challenges of proteomics lies in the interpretation of the data: peptides are the experimental evidence in shotgun proteomics, yet researchers are mainly interested in the proteins that are present in the sample. Inferring protein identity from a peptide hit is not always straightforward, since a peptide can be derived from multiple proteins, including splice variants, homologous members of a gene family and variants derived from sequence polymorphism. Furthermore, a peptide may point towards multiple entries in a database due to sequence redundancy derived from the presence of sequence errors and partial sequences. Consequently, in most cases a peptide hit will point towards a group of proteins rather than a single protein. Protein evidence in shotgun proteomics should therefore be presented as protein groups rather than individual proteins. ProteinCenter can cluster on the peptide level, creating protein groups whose members share at least one peptide.
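The idea of grouping proteins by shared peptide evidence can be sketched in a few lines. This is not ProteinCenter's algorithm, only a minimal illustration under stated assumptions: grouping is treated as transitive (if A shares a peptide with B, and B with C, all three land in one group), the clustering level is fixed at one shared peptide, and the accession keys and peptide strings are made up.

```python
from collections import defaultdict


def cluster_by_shared_peptides(protein_peptides):
    """Group proteins that share at least one peptide (transitively)."""
    parent = {protein: protein for protein in protein_peptides}

    def find(protein):
        # Follow parent links to the group representative, shortening the path.
        while parent[protein] != protein:
            parent[protein] = parent[parent[protein]]
            protein = parent[protein]
        return protein

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Index proteins by peptide, then merge all proteins listed under one peptide.
    proteins_by_peptide = defaultdict(list)
    for protein, peptides in protein_peptides.items():
        for peptide in peptides:
            proteins_by_peptide[peptide].append(protein)
    for proteins in proteins_by_peptide.values():
        for other in proteins[1:]:
            union(proteins[0], other)

    groups = defaultdict(set)
    for protein in protein_peptides:
        groups[find(protein)].add(protein)
    # Largest groups first, similar to sorting clusters by size.
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical accessions and peptides, for illustration only.
example = {
    "P1": {"AAK", "CCR"},
    "P2": {"CCR", "DDK"},
    "P3": {"DDK"},
    "P4": {"EEK"},
}
clusters = cluster_by_shared_peptides(example)
print(clusters)
```

Here P1, P2 and P3 end up in one group because they are linked through the peptides CCR and DDK, while P4 remains a group of one.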
The two datasets used in this tutorial can be found in the Proxeon -> Tutorials -> Peptide clustering category as shown in the following figure. Alternatively, they can be downloaded from Proxeon's web site http://www.proxeon.com/productrange/data_interpretation/Tutorials
The two datasets ( BSA 50fm EpiCenter and protein mixture std ) have been created using an NRDB search without taxonomy limitation. As a result the datasets contain a significant number of accession codes for the same proteins but with different taxonomies.
A comparison between the two datasets has also been made. For more information on comparing datasets in ProteinCenter see Chapter 19, Dataset comparison . The numbers in the parentheses reveal that the BSA 50fm EpiCenter dataset contains 18 unique proteins and the protein mixture std dataset contains 449 unique proteins. The comparison folder BSA vs std protein mixture contains 453 unique proteins, which means that the two sets have 14 proteins in common.
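The count of shared proteins follows from the sizes of the two sets and of their union. A minimal sketch of that arithmetic, using only the numbers quoted above (the function name is illustrative):

```python
def shared_count(size_a, size_b, size_union):
    """Number of items common to two sets: |A| + |B| - |A union B|."""
    return size_a + size_b - size_union


# 18 proteins in one set, 449 in the other, 453 in the combined comparison.
common = shared_count(18, 449, 453)
print(common)  # 14
```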
Load the content of the comparison folder by clicking on the name ( BSA vs std protein mixture )
Select the Proteins pane. The proteins in the comparison will be displayed as shown below. Note that the accessions displayed in the Acc. Key column are dependent on the preferred key set on a per-user basis — see Section 4.3, “Preferred accession” . For additional information on the other columns displayed in this view consult Chapter 14, Proteins view .
When viewing a comparison folder, the comparison toolbar is shown below the filter toolbar. More about the comparison toolbar later in the tutorial.
Also, when viewing a comparison folder, an extra column named Dataset is included in the view. The column shows which proteins are included in which datasets. Moving the mouse pointer to one of the colored rectangles will display the name of the dataset corresponding to that color.
Out of the proteins displayed in the figure only one is in both datasets. This may look different for you as it depends on the sorting of the columns.
In the following the comparison dataset is clustered — click on the Clusters pane
The cluster toolbar will appear.
Set the clustering method to Peptide Sharing and the clustering level to 1 as displayed in Figure 2.3, “Clustering by shared peptides”. This means that protein sequences sharing at least one peptide will be clustered together.
Click the Cluster all data button to start the clustering. The nine resulting clusters will be displayed:
In the displayed view, the clusters can be sorted by their size (i.e. the number of proteins). This can be achieved by clicking on the Size column header.
By clicking on the number of proteins in a cluster, the content of the cluster will be displayed. Click on 43 to display the proteins in the cluster containing proteins from both datasets. (Clicking on 43 again will hide the content of the cluster, returning to the view displayed in Figure 2.4, “The clustered data”.)
The number of peptides for each protein in a cluster is displayed — click on one of them to display the peptides in the ProteinCard .
All of the detected peptides are presented in the Details section of the ProteinCard . When analyzing a comparison dataset, as in this case, the peptides from the individual datasets are highlighted in the same colors as the Dataset columns in the Proteins and Clusters views
To look further into the comparison between the two sets go back to the Clusters view
The compare toolbar allows the user to see what the two datasets have in common. This is done by clicking both of the white rectangles following And as shown in Figure 2.7, “Viewing only what the two datasets have in common” .
It is immediately apparent that the Bovine albumin is present in both samples.
Note that the 43 in the Size column has been clicked again to hide the contents of a cluster.
To see the clusters containing proteins from only one of the datasets, click the Not rectangle for the other set. In this example the BSA set has been excluded and the system displays the cluster that does not contain proteins from that set.
More examples of comparing datasets can be found in Section 2.6, “Comparison of proteomics datasets” .
Comparing lists of protein keys can be a tricky process, especially if the comparison is performed between different species, or on data obtained in different laboratories at different times or taken from literature sources. One needs sophisticated tools that can handle the complexity of these data, including redundancy (same protein but different accession, or alleles and fragments), different types of accession codes and outdated accession codes. The following datasets are used in this tutorial:
hRBC proteome: “In-depth analysis of the membrane and cytosolic proteome of red blood cells”. Blood 2006, 108(3):791-801.
Plasma proteome: “Challenges in deriving high-confidence protein identifications from data gathered by a HUPO plasma proteome collaborative study”. Nature Biotechnology 2006, 24(3):333-338.
CSF proteome: “Detection of Biomarkers with a Multiplex Quantitative Proteomic Platform in Cerebrospinal Fluid of Patients with Neurodegenerative Disorders”. J Alzheimers Dis 2006, 9(3):293-348.
Urine proteome: “The human urinary proteome contains more than 1500 proteins including a large proportion of membrane proteins”. Genome Biol 2006, 7(9):R80.
Tear fluid proteome: “Identification of 491 proteins in the tear fluid proteome reveals a large number of proteases and protease inhibitors”. Genome Biol 2006, 7(8):R72.
A comparison of the datasets has also been made. For more information on comparing datasets in ProteinCenter see Chapter 19, Dataset comparison . The number in parentheses shows that the five datasets contain a little less than 3000 unique proteins. Select the Comparison of body fluids folder to load the proteins in the dataset.
If the Proteins pane is selected, the content view will look as in the following figure:
Note the comparison toolbar which is shown only for comparison folders
Also note the Dataset column which is shown only for comparison folders. It makes it easy to see which proteins are present in which datasets. In the screen shot example, only one protein (thymosin-like 3) is in more than one set. The name of the dataset represented by a given color will be displayed when pointing the mouse pointer to the colored rectangle.
Finally, note that the proteins have been sorted by sequence length (the AA column) as indicated by the arrow. Using another sorting will display other proteins.
Comparisons can be performed in the Proteins view or in the Clusters view. Differences arising from allelic or splicing variations as well as dealing with fragment vs. full length sequences are processed automatically by clustering all proteins at a certain level of sequence similarity. Switch to the Clusters pane
Choose to cluster by Most homogeneous groups .
Choose a Clustering level of 98%. When clustering by most homogeneous groups the protein sequences are clustered by sequence similarity. Use a similarity of
100% — to be able to compare proteins at the level of fragments vs. full length;
98% — to be able to compare proteins at the level of alleles;
95% — to be able to compare proteins at the level of highly similar proteins — most likely alleles or splice variants;
60% — to be able to compare proteins at the level of homologous proteins.
Click the Cluster all data button to start the clustering.
In the image, the clusters have been sorted by the Size column, descending. This means that the clusters containing the most proteins will be on top.
The Dataset column is also a part of the Clusters view. Here it indicates whether any proteins from a particular dataset are present in the cluster. In this case, the top cluster contains proteins from three of the five datasets, whereas the second cluster contains proteins from a single set. To see the name of the set, just point to the colored rectangle and the name of the dataset will be displayed.
Also note that while the number of proteins is close to 3000, the number of clusters is around 2800. This means that some of the proteins in the datasets were comparable to other proteins at the level of alleles or fragments.
To display the clusters containing proteins from all five datasets, select the And rectangles one by one. The result is shown in Figure 2.12, “Clusters containing proteins from all sets”. It shows that in total 9 proteins were detected in all five body fluid proteomes.
To find out how many proteins from body fluids proteomes occur in any of the datasets except the HUPO plasma proteome, select the four other datasets in the Or boxes and the human plasma dataset in the Not boxes as displayed in Figure 2.13, “Showing clusters not containing plasma proteins (Actual clusters not displayed)” .
Slightly fewer than 2000 protein clusters match this criterion. This makes it possible to estimate the potential blood contamination of the other body fluid proteomes.
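The And, Or and Not selections can be read as a boolean test on the set of datasets that contribute proteins to each cluster. The sketch below illustrates that reading only; the exact semantics of the comparison toolbar are assumed, and the cluster names and memberships are invented.

```python
def select_clusters(cluster_datasets, require_all=(), require_any=(), exclude=()):
    """Return the names of clusters matching the And / Or / Not choices."""
    require_all, require_any, exclude = set(require_all), set(require_any), set(exclude)
    selected = []
    for name, datasets in cluster_datasets.items():
        if not require_all <= datasets:
            continue  # And: every listed dataset must contribute a protein
        if require_any and not (require_any & datasets):
            continue  # Or: at least one listed dataset must contribute a protein
        if exclude & datasets:
            continue  # Not: no listed dataset may contribute a protein
        selected.append(name)
    return selected


# Hypothetical clusters and the datasets represented in each of them.
clusters = {
    "c1": {"hRBC", "Plasma", "CSF", "Urine", "Tear"},
    "c2": {"Urine", "Tear"},
    "c3": {"Plasma"},
    "c4": {"CSF"},
}

in_all_five = select_clusters(
    clusters, require_all=["hRBC", "Plasma", "CSF", "Urine", "Tear"]
)
outside_plasma = select_clusters(
    clusters, require_any=["hRBC", "CSF", "Urine", "Tear"], exclude=["Plasma"]
)
print(in_all_five)     # ['c1']
print(outside_plasma)  # ['c2', 'c4']
```

The second call mirrors the plasma example: a cluster passes when any of the four other datasets is present and the plasma dataset is absent.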
Next, filters are applied to the protein clusters. In ProteinCenter various biological filters can be applied to the comparison dataset. This makes it possible, for example, to restrict the proteins in the set to extracellular proteins or to exclude proteins from certain species.
Click the Edit filter definitions button to define filters. The filter component will be expanded
Make sure that the filter definition GO Slim - Cellular components is selected. Remove all components except extracellular as shown in the following figure
Click the Save filter definitions button to apply the filter.
The filter definition has now been saved and selected. It is marked by an *.
Press the filter icon to disable/enable the filter on the dataset.
When the filter is enabled, the filter icon turns green
Applying the filter has reduced the number of proteins considerably. The remaining proteins are not displayed in the figure, but they are of course shown in ProteinCenter. They are exactly the proteins that are not in human plasma and are extracellular proteins.
Now switch to the Statistics pane.
Choose Cluster Anchors in the Show statistics for drop-down box.
Select a reference set and an analysis set from the sets in the comparison.
Press the Analyze button to calculate the statistics. ProteinCenter now calculates statistical distributions for all five datasets.
After the calculations have finished, the Details section will display a Venn diagram showing the overlap of proteins between the two datasets selected.
Try to check a third box to show the overlap of proteins between three of the datasets in the comparison.
You can also click on each of the links in the overview section to display the details. Figure 2.21, “Statistics for GO Slim Molecular Function (Truncated)” shows the statistical details for GO Slim Molecular Functions. Similar views are available for GO Slim Biological Process, GO Slim Cellular Component, etc.
For more information on statistics in ProteinCenter see Chapter 25, Statistics.
Accession keys can become outdated for many reasons. When you look up an accession, this can manifest itself as a message like "accession not found", or similarly when looking up a GI from NCBI's website:
This does not always imply that the protein is not a valid protein. An experimental protein identification, stored by its protein accession key, may appear outdated when using public databases like IPI, UniProt, Ensembl or NCBI. The following example shows how to check whether the outdated accession is associated with a protein that is still valid.
Assume that we have observed GI 2137747 in a mass spectrometry experiment, where we found great sequence coverage. But when searching NCBI Entrez we find that the protein has become outdated.
At NCBI try to look up the revision history for this outdated GI:
The record is dead: it has not been replaced by another record, but simply deleted. Now let's find out whether the protein is really not valid:
In ProteinCenter go to the lookup menu.
Type in the GI (2137747).
The protein is looked up and the summary displayed ( Note: Accession keys ).
The outdated column (header O) indicates that this protein is not outdated (no outdated icon). So already at this point it is evident that this is most likely a valid protein.
Press the accession key field (P49769) to open the ProteinCard.
In the ProteinCard we can see live accession keys from NCBI, IPI, Ensembl and UniProt in the Keys summary (in this case there is no live Ensembl key). Press the blue header to see the details:
The details show a lot of live primary and secondary key entries from different sources. Be aware that accession key fields, as well as these source details, may differ significantly from your actual output, because these data are continually updated.
The outdated GI we started out with appears at the bottom, but our conclusion is clear: The protein which we identified is completely valid, but the GI is outdated.
Detected peptides can be derived from multiple proteins, including splice variants, homologous members of a gene family and variants derived from sequence polymorphism. In order to distinguish different isoforms, multiple sequence alignment is required. This tutorial illustrates how this can be done by using ProteinCenter's multiple alignment viewer. Additional information about the alignment viewer can be found in Chapter 20, The Alignment Viewer.
The tutorial uses the PRIDE dataset entitled HUPO Brain Proteome Project: BPP_PilotProject_Lab_01: mouse C57/Bl6, postnatal day 54-56 (P54-56), brain number 180: Peptide Fragmentation Identifications 384. (Pride accession: 1670)
The dataset can be found in the Proxeon -> Tutorials -> Multiple alignment category as shown in the following figure. Alternatively, it can be downloaded from Proxeon's web site http://www.proxeon.com/productrange/data_interpretation/Tutorials , or from the PRIDE web site http://www.ebi.ac.uk/pride/ .
In ProteinCenter do the following steps:
Select the PRIDE_Exp_IdentOnly_Ac_1670 data folder to load the proteins in the set.
Switch to the Clusters pane. In the Clusters pane it is possible to group the proteins in the dataset according to different measures of similarity. See Chapter 16, Clusters View for more information on clustering.
Choose to cluster by shared peptide evidence by choosing Peptide sharing as the clustering method
Set the clustering level to 1. This means that proteins that share one or more peptides will be grouped together.
Click the Cluster all data button to start the clustering. The list of clusters is shown as in the following figure (only the top 5 clusters of 51 are shown)
The clusters have been sorted by the Size column (formerly named 'No'). This means that the clusters containing the most proteins will be on top. In this case only four real clusters exist. The remaining ones are all clusters of a single protein that does not share peptides with other proteins in the dataset.
To inspect the cluster click the cluster name, e.g IPI00124499.2 in this example (the naming can be different depending on your accession preference settings). The cluster alignment will appear in a new browser window as shown here (figure is truncated to the right)
In the cluster alignment overview the sequences are placed above each other with the cluster anchor (IPI00124499.2) on top. The peptides are displayed on the sequences as colored boxes. The color will change if two or more peptides overlap. The example shows cluster alignment of two sequences that share several peptides (colored boxes). It is clear that each of these sequences contains unique peptides that can distinguish two protein isoforms.
The green triangles indicate where gaps have been inserted by the sequence alignment procedure. The sequence differences are shown below the aligned sequences.
To view the alignment with sequence information, click the link Click here to show alignment with sequence information. The following figure is a truncated view of the detailed alignment:
This view includes a scrollbar below the Differences line. Use it to walk along the sequence. Differences in the protein sequences are highlighted in the Differences line. This feature allows amino acid residue substitutions to be visualized.
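What a differences line conveys can be reproduced for two already aligned sequences with a short sketch. It is not the alignment viewer's code: it assumes the sequences are given as equal-length strings with `-` marking gaps, it reports alignment columns rather than sequence positions, and the example sequences are invented.

```python
def alignment_differences(seq_a, seq_b, gap="-"):
    """Return (substitutions, gap_columns) for two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    substitutions, gap_columns = [], []
    # Columns are numbered from 1 along the alignment.
    for column, (a, b) in enumerate(zip(seq_a, seq_b), start=1):
        if a == b:
            continue
        if gap in (a, b):
            gap_columns.append(column)  # residue aligned against an inserted gap
        else:
            substitutions.append((column, a, b))  # two different residues
    return substitutions, gap_columns


# Hypothetical aligned fragments, for illustration only.
substitutions, gap_columns = alignment_differences("MKTAYIAK", "MKSAY-AK")
print(substitutions)  # [(3, 'T', 'S')]
print(gap_columns)    # [6]
```

Keeping substitutions and gaps apart mirrors the distinction between highlighted residue differences and the gap markers in the overview.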
Now view the alignment for the cluster with the anchor IPI00114375.2 (see Figure 2.23, “The result of the clustering”). As before, switch to the detailed view of the alignment. Notice that an additional line has been added, indicating that some peptides in the alignment are modified. Move the scrollbar to focus on the modification at sequence position 448 as shown in the following figure. Note the two peptides with amino acid substitutions (A and C) as highlighted in the Differences line.
Pointing the mouse at the blue C in the Modifications line will reveal details about the modification and highlight the peptide in the alignment.
Pointing the mouse to the peptide reveals details about the peptide.
For more information on clustering in ProteinCenter see Chapter 16, Clusters View.
© 2005-2017 Thermo Fisher Scientific