ProteinCenter User Manual

Chapter 2. Tutorials

Table of Contents

2.1. Single protein bioinformatical analysis
2.1.1. Example – GI:6807647
2.2. Bioinformatical and statistical analysis of a protein dataset
2.2.1. Traditional approach
2.2.2. Using ProteinCenter
2.3. Is the unknown protein really unknown?
2.4. Converting accession numbers
2.5. Bioinformatical analysis of experimental data using peptide clustering
2.6. Comparison of proteomics datasets
2.7. Outdated accession — but is the protein still valid?
2.8. Multiple alignment

This chapter contains a number of tutorials. By following them you will get a taste of the types of analyses ProteinCenter can help you with. The datasets used in the tutorials can be downloaded from Proxeon's website (http://www.proxeon.com/productrange/data_interpretation/Tutorials/index.html). In new ProteinCenter systems the datasets can also be found in the workspace under DataSets->Proxeon->Tutorials.

2.1. Single protein bioinformatical analysis

Identification of proteins by mass spectrometry is usually based on the sequencing of corresponding peptides followed by database search and an assignment of these peptides to the protein sequence derived from the protein, genomic or EST databases. As a result the matched protein sequence is represented as an accession key (database identifier) in the corresponding database.

With a wealth of sequence based databases the correct assignment of sequenced peptides to the protein sequence can be a challenge. Protein databases are also very dynamic with constant updates and a significant level of redundancy.

ProteinCenter is the most comprehensive catalog of information on proteins.

2.1.1. Example – GI:6807647

Protein identification GI:6807647 — hypothetical protein [Homo sapiens]

2.1.1.1. Traditional approach

The traditional approach to extraction of biological information includes:

  • Information extraction from NCBI

  • Perform BLAST analysis

  • Domain and structure analysis, for example SMART

  • Pathway analysis

  • Gene Ontology analysis, via UniProt or GOMiner

  • Literature resources

There is a significant number of software tools, both commercial and publicly available, that can perform various kinds of bioinformatical analysis based on the protein sequence.

They are often limited to a particular set of protein identifiers. For example a user would be allowed to perform bioinformatical analysis on the protein sequences derived from UNIPROT/SwissProt/NCBI nrdb databases but not on ENSEMBL or IPI databases.

Another limitation is caused by outdated accession codes that have been removed from the protein database: it can take a significant amount of time to figure out the successor IDs.

2.1.1.2. Using ProteinCenter

ProteinCenter is based on a comprehensive master database that includes all major protein databases. Currently, it contains more than 580 million accession codes, including around 90 million outdated identifiers.

ProteinCenter's Lookup menu allows straightforward retrieval of information from a typed accession code or a list of tryptic peptides:

  1. Go to Lookup menu

  2. Type Accession Key 6807647

  3. Press the Search Proteins button

  4. Go to the ProteinCard pane

    The detailed information is available on the ProteinCard. ProteinCard has been developed to present a brief view of a particular protein.

    It contains the most valuable information about the selected protein, including gene names (official symbols), all primary and secondary accession keys, the protein sequence in FASTA format, domain structure and annotated post-translational modifications. In the current example, the ProteinCard shows that the hypothetical protein GI:6807647 is in fact very well annotated as a heat shock protein.

  5. To retrieve all accession keys corresponding to the selected sequence click on Keys.

    In the Details view on the bottom of the window the primary and secondary accession codes will be listed.

    The red warning sign in front of the accession code shows that the key is outdated. See Section 4.2, “Versioning and outdated entries” for an explanation on outdated accessions and outdated proteins.

  6. To see the amino acid sequence click Sequence

  7. Click Features to present all available annotations from PFAM, Tmap and InterPro.

    The Features table gives the detailed view of all the detected domains for the protein sequence, their description, localization and external links.

  8. Similar Proteins provides a list of protein neighbors (down to 60% sequence similarity) annotated like proteins in the lookup results. In addition, information on percent similarity to the query protein is available.

    Click Similar Proteins to display protein neighbor details.

    By default the ProteinCard shows the proteins that are at least 98% similar. You can redefine the similarity level by typing a percentage in the Similarity field, for example 90%.

    To see the BLAST alignment for a sequence, click on its number in the Sim column.

  9. Gene Ontology (GO) in the ProteinCenter is presented according to three main terms:

    • Molecular Function

    • Biological Process

    • Cellular Component

    Click on Molecular Function. All of the assigned molecular function gene ontology terms will be presented in the Details section.

    Clicking on the GO Id will bring up the EMBL-EBI description of the particular gene ontology term.

    Click Biological Process and Cellular Component to show their content in a similar manner.

  10. Select Interactions and Pathways. The list of interacting proteins will be presented in the Details view.

    The interacting proteins are presented similarly to the Protein Neighbours. To learn more about an interacting protein, click on its accession key. It will be displayed in the ProteinCard window.

  11. The ProteinCard menu also contains a number of links to external sources that provide additional information. Try some of them.

Try using the Lookup menu with other accession keys or with a list of tryptic peptides derived from your own experiments.
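For readers who want to prepare a peptide list by hand, the following minimal Python sketch illustrates what a list of tryptic peptides is. It assumes the commonly used rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P); your search engine may apply different rules (missed cleavages, length limits). The function name and the toy sequence are made up for illustration and are not part of ProteinCenter.

```python
from __future__ import annotations

import re


def tryptic_peptides(sequence: str, min_length: int = 1) -> list[str]:
    """In silico tryptic digest (assumed rule: cut after K/R, not before P)."""
    # Zero-width split points: preceded by K or R and not followed by P.
    pieces = re.split(r"(?<=[KR])(?!P)", sequence.upper())
    # Drop the empty piece produced when the sequence ends in K or R,
    # and any peptide shorter than the requested minimum length.
    return [p for p in pieces if p and len(p) >= min_length]


# Toy sequence, purely hypothetical.
print(tryptic_peptides("MKRPLKAAR"))
```

The resulting peptides can be pasted into the Lookup menu one per line; no missed cleavages are generated in this sketch.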

2.2. Bioinformatical and statistical analysis of a protein dataset

With the advent of modern proteomics technology several thousands of proteins can be identified from one experiment. It is becoming increasingly clear that proteomic researchers are facing a new bottleneck: the ability to produce data has outpaced the ability to mine the data intelligently.

2.2.1. Traditional approach

While the ability to generate data has exploded, many researchers still rely on lists of protein hits (usually the direct output from the search engine) to evaluate their shotgun proteomics experiments. This is tedious and time consuming at best; a more dire consequence is that the interpretation of the data may be compromised, e.g. if the protein evidence is not presented correctly.

2.2.2. Using ProteinCenter

This section presents an example of bioinformatics data processing using ProteinCenter. The dataset of red blood cell proteins (hRBC) from the paper "In-depth analysis of the membrane and cytosolic proteome of red blood cells", Blood 2006, 108(3):791-801, is used as an example for bioinformatical analysis.

  1. Download and save the hRBC data file from http://www.proxeon.com/productrange/data_interpretation/Tutorials/index.html to your computer. Alternatively, find it in the ProteinCenter workspace; in that case there is no need to import the data. Just select the dataset (DataSets->Proxeon->Tutorials->Datasets comparison->hRBC_proteome) and skip to step 5.

  2. Select a destination category in the workspace tree (e.g. DataSets).

  3. Click on the Import pane in ProteinCenter.

    The Import page is dedicated to data import. ProteinCenter currently supports the following file formats:

    • ProtXML

    • ProteinProphet protXML

    • IProphet protXML

    • Proteome Discoverer protXML

    • Scaffold protXML

    • Mascot XML

    • Mascot Distiller XML

    • X! Tandem

    • PRIDE

    • BioWorks XML

    • MSQuant

    • CSV

    • Gene proteins CSV

    • Spectrum Mill

    • ProteinPilot protein summary

    • ProteinPilot peptide summary

    • MaxQuant protein groups

    • MaxQuant peptides

    • Waters PLGS protein summary

  4. Select the CSV format, browse for the hRBC data file and select it to add it to the queue, then click Import to upload the hRBC dataset into ProteinCenter.

    When the import is done select the imported dataset by clicking on its name, and the µLIMS pane is selected automatically. The µLIMS page summarizes the number of records added to the dataset, the number of proteins and unique proteins as well as any encountered errors.

    In the current example each record is a protein accession code.

  5. Click on the Proteins pane.

    The protein list will be presented in the Protein menu according to the selected page size. Accession keys, protein name, gene name, the length of the protein, taxonomy, GO categories, prediction of transmembrane regions and signal peptides will be displayed.

  6. Select an appropriate Page size , for example 30 .

  7. Move the cursor to the colored boxes to display the GO functions as tool tips.

  8. To sort by gene name move the cursor to the column header and click on Gene .

  9. By clicking on a protein's accession key in the Acc. Key column its ProteinCard will be presented.

  10. The presence of a symbol in the AS column indicates the occurrence of alternative splicing in the sequence.

  11. To focus on the specific part of the dataset, ProteinCenter offers a substantial number of different filters.

    To select a filter click the Edit filter definitions button.

  12. Select the "GO Slim - Cellular Component" and delete all lines except cell surface from the text box.

  13. Apply the filter by clicking on the Save and select filter button.

    As a result of filtering only the proteins with cell surface annotation (first column of boxes in cellular component column) will be displayed. The system allows application of multiple filters as well as inverted filters.

  14. Statistical analysis of a list of proteins provides an overview of the dataset by summarizing essential information in histogram tables and charts. It is presented in the Statistics menu.

    Click on the Statistics pane.

  15. In Show statistics for, select All Data, then press Analyze. Statistics for the dataset will be calculated and displayed.

    ProteinCenter calculates distributions for domain structures, transmembrane and signal peptide predictions, disease associations, enzyme and pathway annotations as well as for three GO categories: molecular function, cellular components and biological processes.

    The Details view gives more in depth information about the statistical distribution for the dataset.

  16. Click on GO Cellular Components to retrieve the statistics for the cellular localization distribution of proteins. For a detailed description of the statistics view please refer to Chapter 25, Statistics.
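The filtering and statistics steps above can be pictured with a small Python sketch. It is only a conceptual model, not ProteinCenter's implementation: the record layout, the protein keys P1 to P3 and their annotations are hypothetical. It shows a filter on a GO Slim cellular component term, the inverted form of that filter, and a per-term count of the kind summarized in the Statistics pane.

```python
from collections import Counter

# Hypothetical records; real annotations come from the ProteinCenter database.
proteins = [
    {"key": "P1", "cellular_component": {"cell surface", "membrane"}},
    {"key": "P2", "cellular_component": {"cytoplasm"}},
    {"key": "P3", "cellular_component": {"cell surface", "extracellular"}},
]


def filter_by_term(records, term, invert=False):
    """Keep records annotated with `term` (or those without it if inverted)."""
    return [r for r in records if (term in r["cellular_component"]) != invert]


def term_distribution(records):
    """Count how many records carry each cellular component term."""
    counts = Counter()
    for r in records:
        counts.update(r["cellular_component"])
    return counts


surface = filter_by_term(proteins, "cell surface")
not_surface = filter_by_term(proteins, "cell surface", invert=True)
print([r["key"] for r in surface], [r["key"] for r in not_surface])
print(term_distribution(proteins))
```

A protein annotated with several terms contributes to several counts, so the per-term numbers can add up to more than the number of proteins.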

2.3. Is the unknown protein really unknown?

A large number of protein accession keys are annotated as either 'unknown' protein or 'hypothetical' protein. This can be misleading, since it may simply be the description given at the time when the protein sequence was submitted. Many protein records are old, and hence the protein may have been unknown at the time of submission, while in many cases even well characterized proteins are still submitted with bad descriptions.

In ProteinCenter, the Similar Proteins feature gives a relatively quick way to determine whether or not an unknown protein is really unknown:

  1. Go to the lookup menu.

  2. Enter the accession key (in this example: BAC40627.1 ).

  3. Press the Search Proteins button.

  4. If the Proteins view is selected, it will show one protein record as shown above. The description is 'unnamed protein product'.

    Keep in mind that for a given protein, ProteinCenter attempts to pick the best possible description among the accession keys available (see Section 4.3.2, “How the optimal description is picked automatically” ).

  5. Press the accession key link to bring up the ProteinCard (alternatively select the ProteinCard pane).

  6. In the ProteinCard, press the blue header of similar proteins, which by default finds the protein neighbors that are 98% or more similar to this protein.

    This will take a little while, as it involves a BLAST against virtually all known protein sequences.

  7. In these neighbors we find a protein with the same length (747) as our "unnamed protein" ( Note: Accession keys ).

  8. It is 99.8% identical, in other words very likely to be an allele (or a difference originating from a sequencing error), which can be confirmed by pressing the Blast alignment link (click the link saying 99.8 in the Sim column).

2.4. Converting accession numbers

Sometimes it may be useful to convert between accession keys, for example to ensure that a certain type of accession is displayed in the documentation of protein identifications.

In this example we import a list of GIs, but it may be carried out with any type of accession keys.

First create a file of the accessions to be converted, and save it (for example as convert1.txt).

It should contain the keys 4502431, 10862703, 6005850 and 4503131.
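If you prefer to generate the file with a script, a minimal Python sketch follows. It assumes that the CSV import accepts a plain list with one accession key per line; the manual does not spell out the exact layout, so check the Import chapter if the import reports errors. The function names are illustrative only.

```python
from pathlib import Path

# The four GI keys used in this tutorial.
GI_KEYS = ["4502431", "10862703", "6005850", "4503131"]


def format_key_file(keys):
    """Return the file content: one key per line (assumed layout)."""
    return "".join(f"{key}\n" for key in keys)


def write_key_file(path, keys):
    """Write the keys to `path`, e.g. convert1.txt."""
    Path(path).write_text(format_key_file(keys), encoding="ascii")


print(format_key_file(GI_KEYS), end="")
```

Calling write_key_file("convert1.txt", GI_KEYS) creates the file in the current directory, ready to be selected in the Import pane.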

Select a destination category in the Workspace, for the imported data.

  1. Go to the Import pane:

  2. In the Format box, select CSV.

  3. Click Browse... to locate the file to import.

  4. When the file is selected, it is added to the upload queue.

  5. Press the Import button.

  6. A folder called "convert1" is created in the selected category in the workspace. Select the imported folder when the import is done.

  7. Go to the Protein Data view (formerly named 'Exp. Data').

  8. The imported keys are shown.

  9. The keys are displayed according to the preferred accession type in the user settings. Remember that ProteinCenter can be configured to display either UniProt, NCBI, Ensembl, SGD, TAIR, FlyBase, PLASMO, CMR, TBLIST, PSE, STRING, TRIPTRYP or IPI keys. Details of the configuration can be found in the administration guide, Section 4.3, “Preferred accession”. Also keep in mind that if IPI is preferred, as in this example, it can only be shown if a live IPI accession exists for the given protein. More on live and outdated accessions is described in Section 10.2.1.1, “Flagging of outdated protein keys in ProteinCenter”.

  10. Go to the export pane.

  11. Check the boxes for the pieces of information you want to appear in the exported table. In this case IPI, UNIPROT and GI have been checked (Canonical key, Imported key, Gene symbol, Taxonomy and Protein description are checked by default).

  12. After pressing Export and opening the file (for example in Excel), the original accession keys populate the first column, and other accessions and best annotated GIs appear in subsequent columns. See the Export chapter for a rigorous explanation of the export process and formats.

Note

Note that the export of protein accession keys only includes one key of each type. This means that even if a protein has two GI accessions, only the first one is exported.
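The effect described in the note can be sketched in a few lines of Python. This only models the behavior stated above (one key per type, the first one wins); the key values are placeholders, and the order in which ProteinCenter ranks keys of the same type is not specified here.

```python
def first_key_per_type(keys):
    """Keep only the first key encountered for each accession type."""
    exported = {}
    for key_type, value in keys:
        # setdefault leaves an existing entry untouched, so later keys
        # of the same type are ignored.
        exported.setdefault(key_type, value)
    return exported


# Placeholder keys for one protein, purely hypothetical.
protein_keys = [
    ("GI", "1111111"),
    ("UNIPROT", "EXAMPLE_UNIPROT"),
    ("GI", "2222222"),
]
print(first_key_per_type(protein_keys))
```

If you need every key of a given type, use the Keys section of the ProteinCard for the protein in question.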

2.5. Bioinformatical analysis of experimental data using peptide clustering

One of the challenges of proteomics lies in the interpretation of the data: peptides are the experimental evidence in shotgun proteomics, but researchers are mainly interested in the proteins that are present in the sample. Inferring protein identity from a peptide hit is not always straightforward, since a peptide can be derived from multiple proteins, including splicing variants, homologous members of a gene family and variants derived from sequence polymorphism. Furthermore, a peptide may point towards multiple entries in a database due to sequence redundancy caused by sequence errors and partial sequences. Consequently, in most cases a peptide hit will point towards a group of proteins rather than a single protein. Protein evidence in shotgun proteomics should therefore be presented as protein groups rather than individual proteins. ProteinCenter can cluster at the peptide level, creating protein groups whose members share at least one peptide.
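The idea of grouping proteins by shared peptides can be sketched as follows. This is a generic illustration using a union-find structure, not a description of ProteinCenter's internal algorithm: two proteins are linked when they share at least min_shared peptides, and linked proteins end up in the same group. The protein names and peptides are hypothetical.

```python
from itertools import combinations


def cluster_by_shared_peptides(protein_peptides, min_shared=1):
    """Group proteins that are linked by at least `min_shared` common peptides."""
    proteins = list(protein_peptides)
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the group representative (with path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(proteins, 2):
        if len(protein_peptides[a] & protein_peptides[b]) >= min_shared:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for p in proteins:
        groups.setdefault(find(p), []).append(p)
    # Largest groups first, members sorted by name.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Hypothetical peptide evidence.
example = {
    "protA": {"PEPTIDEK", "SHAREDR"},
    "protB": {"SHAREDR", "OTHERK"},
    "protC": {"OTHERK"},
    "protD": {"UNIQUEK"},
}
print(cluster_by_shared_peptides(example))
```

Note that protA and protC end up in the same group although they share no peptide directly; they are connected through protB. Whether a given tool groups transitively in this way is an assumption of the sketch.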

The two datasets used in this tutorial can be found in the Proxeon -> Tutorials -> Peptide clustering category as shown in the following figure. Alternatively, they can be downloaded from Proxeon's web site http://www.proxeon.com/productrange/data_interpretation/Tutorials

The two datasets ( BSA 50fm EpiCenter and protein mixture std ) have been created using an NRDB search without taxonomy limitation. As a result the datasets contain a significant number of accession codes for the same proteins but with different taxonomies.

A comparison between the two datasets has also been made. For more information on comparing datasets in ProteinCenter see Chapter 19, Dataset comparison . The numbers in the parentheses reveal that the BSA 50fm EpiCenter dataset contains 18 unique proteins and the protein mixture std dataset contains 449 unique proteins. The comparison folder BSA vs std protein mixture contains 453 unique proteins, which means that the two sets have 14 proteins in common.
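The number of shared proteins follows from the three counts by inclusion-exclusion: the proteins in common are those counted twice when the two dataset sizes are added. A short Python check, using only the numbers quoted above:

```python
def shared_count(size_a, size_b, size_union):
    """Number of items in both sets, given the set sizes and the union size."""
    return size_a + size_b - size_union


# 18 unique proteins in BSA 50fm EpiCenter, 449 in protein mixture std,
# 453 in the comparison folder.
print(shared_count(18, 449, 453))
```

The same arithmetic applies to any pairwise comparison folder in the workspace.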

  1. Load the content of the comparison folder by clicking on the name ( BSA vs std protein mixture )

  2. Select the Proteins pane. The proteins in the comparison will be displayed as shown below. Note that the accessions displayed in the Acc. Key column are dependent on the preferred key set on a per-user basis — see Section 4.3, “Preferred accession” . For additional information on the other columns displayed in this view consult Chapter 14, Proteins view .

    Figure 2.1. Proteins view for the comparison.


  3. When viewing a comparison folder, the comparison toolbar is shown below the filter toolbar. More about the comparison toolbar later in the tutorial.

  4. Also, when viewing a comparison folder, an extra column named Dataset is included in the view. The column shows which proteins are included in which datasets. Moving the mouse pointer to one of the colored rectangles will display the name of the dataset corresponding to that color.

  5. Out of the proteins displayed in the figure only one is in both datasets. This may look different for you as it depends on the sorting of the columns.

  6. Next, cluster the comparison dataset: click on the Clusters pane.

    Figure 2.2. Initial view of the Clusters pane.


  7. The cluster toolbar will appear.

  8. Set the clustering method to Peptide Sharing and the clustering level to 1 as displayed in Figure 2.3, “Clustering by shared peptides”. This means that protein sequences sharing at least one peptide will be clustered together.

    Figure 2.3. Clustering by shared peptides


  9. Click the Cluster all data button to start the clustering. The nine resulting clusters will be displayed:

    Figure 2.4. The clustered data


  10. In the displayed view, the clusters can be sorted by their size (i.e. the number of proteins). This can be achieved by clicking on the Size column header.

  11. By clicking on the number of proteins in a cluster, the content of the cluster will be displayed. Click on 43 to display the proteins in the cluster containing proteins from both datasets. (Clicking on it again will hide the content of the cluster, returning to the view displayed in Figure 2.4, “The clustered data”.)

    Figure 2.5. Viewing a single cluster


  12. The number of peptides for each protein in a cluster is displayed — click on one of them to display the peptides in the ProteinCard .

    Figure 2.6. Viewing the peptides in the ProteinCard


  13. All of the detected peptides are presented in the Details section of the ProteinCard . When analyzing a comparison dataset, as in this case, the peptides from the individual datasets are highlighted in the same colors as the Dataset columns in the Proteins and Clusters views

  14. To look further into the comparison between the two sets go back to the Clusters view

    Figure 2.7. Viewing only what the two datasets have in common


  15. The compare toolbar allows the user to see what the two datasets have in common. This is done by clicking both of the white rectangles following And as shown in Figure 2.7, “Viewing only what the two datasets have in common” .

  16. It is immediately apparent that the Bovine albumin is present in both samples.

  17. Note that the 43 in the Size column has been clicked again to hide the contents of a cluster.

  18. To see the clusters containing proteins from only one of the datasets, click the Not rectangle for the other set. In this example the BSA set has been excluded and the system displays the clusters that do not contain proteins from that set.

    Figure 2.8. Excluding clusters with proteins from a particular set


More examples of comparing datasets can be found in Section 2.6, “Comparison of proteomics datasets” .

2.6. Comparison of proteomics datasets

A comparison of lists of protein keys can be a tricky process, especially if the comparison is performed between different species, between data obtained in different laboratories at different times, or with data coming from literature sources. One needs sophisticated tools that can handle the complexity of these data, including redundancy (same protein but different accession, or alleles and fragments), different types of accession codes, and outdated accession codes. The following datasets are used in this tutorial:

  • hRBC proteome: "In-depth analysis of the membrane and cytosolic proteome of red blood cells". Blood 2006, 108(3):791-801.

  • Plasma proteome: "Challenges in deriving high-confidence protein identifications from data gathered by a HUPO plasma proteome collaborative study". Nature Biotechnology 2006, 24(3):333-338.

  • CSF proteome: "Detection of Biomarkers with a Multiplex Quantitative Proteomic Platform in Cerebrospinal Fluid of Patients with Neurodegenerative Disorders". J Alzheimers Dis 2006, 9(3):293-348.

  • Urine proteome: "The human urinary proteome contains more than 1500 proteins including a large proportion of membrane proteins". Genome Biol 2006, 7(9):R80.

  • Tear fluid proteome: "Identification of 491 proteins in the tear fluid proteome reveals a large number of proteases and protease inhibitors". Genome Biol 2006, 7(8):R72.

The five datasets used in this tutorial can be found in the Proxeon -> Tutorials -> Datasets comparison category as shown in the following figure. Alternatively, they can be downloaded from Proxeon's web site http://www.proxeon.com/productrange/data_interpretation/Tutorials/index.html

A comparison of the datasets has also been made. For more information on comparing datasets in ProteinCenter see Chapter 19, Dataset comparison . The number in parentheses shows that the five datasets contain a little less than 3000 unique proteins. Select the Comparison of body fluids folder to load the proteins in the dataset.

  1. If the Proteins pane is selected, the content view will look as in the following figure:

    Figure 2.9. Displaying the comparison in the Proteins view


  2. Note the comparison toolbar which is shown only for comparison folders

  3. Also note the Dataset column which is shown only for comparison folders. It makes it easy to see which proteins are present in which datasets. In the screen shot example, only one protein (thymosin-like 3) is in more than one set. The name of the dataset represented by a given color will be displayed when pointing the mouse pointer to the colored rectangle.

  4. Finally, note that the proteins have been sorted by sequence length (the AA column) as indicated by the arrow. Using another sorting will display other proteins.

  5. Comparisons can be performed in the Proteins view or in the Clusters view. Differences arising from allelic or splicing variations as well as dealing with fragment vs. full length sequences are processed automatically by clustering all proteins at a certain level of sequence similarity. Switch to the Clusters pane

    Figure 2.10. Setting up the clustering


  6. Choose to cluster by Most homogeneous groups .

  7. Choose a Clustering level of 98%. When clustering by most homogeneous groups the protein sequences are clustered by sequence similarity. Use a similarity of

    • 100% — to be able to compare proteins at the level of fragments vs. full length;

    • 98% — to be able to compare proteins at the level of alleles;

    • 95% — to be able to compare proteins at the level of highly similar proteins — most likely alleles or splice variants;

    • 60% — to be able to compare proteins at the level of homologous proteins.

  8. Click the Cluster all data button to start the clustering.

    Figure 2.11. The result of the clustering


  9. In the image, the clusters have been sorted by the Size column, descending. This means that the clusters containing the most proteins will be on top.

  10. The Dataset column is also a part of the Clusters view. Here it indicates whether any proteins from a particular dataset are present in the cluster. In this case, the top cluster contains proteins from three of the five datasets, whereas the second cluster contains proteins from a single set. To see the name of the set, just point to the colored rectangle and the name of the dataset will be displayed.

    Also note that the number of proteins is close to 3000, while the number of clusters is around 2800. This means that some of the proteins in the datasets were comparable to other proteins at the level of alleles or fragments.

  11. To display the clusters containing proteins from all five datasets, select the And rectangles one by one. The result is shown in Figure 2.12, “Clusters containing proteins from all sets”. It shows that in total 9 proteins were detected in all five body fluid proteomes.

    Figure 2.12. Clusters containing proteins from all sets


  12. To find out how many proteins from body fluids proteomes occur in any of the datasets except the HUPO plasma proteome, select the four other datasets in the Or boxes and the human plasma dataset in the Not boxes as displayed in Figure 2.13, “Showing clusters not containing plasma proteins (Actual clusters not displayed)” .

    Figure 2.13. Showing clusters not containing plasma proteins (Actual clusters not displayed)


  13. Slightly less than 2000 protein clusters match this criterion. This makes it possible to estimate a potential blood contamination of the other body fluid proteomes.

  14. Next, filters are applied to the protein clusters. In ProteinCenter various biological filters can be applied to the comparison dataset. This makes it possible, for example, to restrict the set to extracellular proteins or to exclude proteins from certain species.

    Click the Edit filter definitions button to define filters. The filter component will be expanded

    Figure 2.14. Editing filter definitions


  15. Make sure that the filter definition GO Slim - Cellular components is selected. Remove all components except extracellular as shown in the following figure

    Figure 2.15. Choosing extracellular


  16. Click the Save filter definitions button to apply the filter.

    Figure 2.16. The filter has been saved


  17. The filter definition has now been saved and selected. It is marked by an *.

  18. Press the filter icon to disable/enable the filter on the dataset.

    Figure 2.17. Filtering result


  19. When the filter is enabled, the filter icon turns green

  20. Applying the filter has reduced the number of proteins considerably. The remaining proteins are not displayed in the figure, but they are of course shown in ProteinCenter. They are exactly the proteins that are not in human plasma and are extracellular proteins.

  21. Now switch to the Statistics pane.

    Figure 2.18. Calculating statistics for all datasets


  22. Choose Cluster Anchors in the Show statistics for drop-down box.

  23. Select a reference set and an analysis set from the sets in the comparison.

  24. Press the Analyze button to calculate the statistics. ProteinCenter now calculates statistical distributions for all five datasets.

    After the calculations have finished, the Details section will display a Venn diagram showing the overlap of proteins between the two datasets selected.

    Figure 2.19. Venn diagram for two datasets


    Try to check a third box to show the overlap of proteins between three of the datasets in the comparison.

    Figure 2.20. Venn diagram for three datasets


    You can also click on each of the links in the overview section to display the details. Figure 2.21, “Statistics for GO Slim Molecular Function (Truncated)” shows the statistical details for GO Slim Molecular Functions. Similar views are available for GO Slim Biological Process, GO Slim Cellular Component, etc.

    Figure 2.21. Statistics for GO Slim Molecular Function (Truncated)


For more information on statistics in ProteinCenter see Chapter 25, Statistics.
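The And, Or and Not selections used in this tutorial can be thought of as set operations on the datasets represented in each cluster. The Python sketch below illustrates that reading. It is an interpretation of the toolbar behavior described above, not ProteinCenter code; the cluster names and their dataset membership are hypothetical.

```python
def select_clusters(clusters, all_of=(), any_of=(), none_of=()):
    """Return the names of clusters matching an And/Or/Not selection.

    all_of  (And): every listed dataset must be represented in the cluster.
    any_of  (Or):  at least one listed dataset must be represented.
    none_of (Not): no listed dataset may be represented.
    """
    selected = []
    for name, datasets in clusters.items():
        if not set(all_of) <= datasets:
            continue
        if any_of and not (set(any_of) & datasets):
            continue
        if set(none_of) & datasets:
            continue
        selected.append(name)
    return selected


# Hypothetical clusters and the datasets contributing proteins to them.
clusters = {
    "cluster_1": {"hRBC", "plasma", "CSF", "urine", "tear"},
    "cluster_2": {"urine", "tear"},
    "cluster_3": {"plasma"},
    "cluster_4": {"CSF"},
}

# Step 11: clusters with proteins from all five datasets.
print(select_clusters(clusters, all_of=["hRBC", "plasma", "CSF", "urine", "tear"]))
# Step 12: any of the four other datasets, but not plasma.
print(select_clusters(clusters, any_of=["hRBC", "CSF", "urine", "tear"],
                      none_of=["plasma"]))
```

The Venn diagrams in the Statistics pane present overlaps of the same kind as counts rather than as lists of clusters.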

2.7. Outdated accession — but is the protein still valid?

Accession keys can become outdated for many reasons. When you look up an accession, this can manifest itself as a message like "accession not found", or as a similar message when looking up a GI on NCBI's website.

This does not always imply that the protein is not a valid protein. An experimental protein identification, stored by its protein accession key, may appear outdated when using public databases like IPI, UniProt, Ensembl or NCBI. The following example shows how to check whether the outdated accession is associated with a protein that is still valid.

Assume that we have observed GI 2137747 in a mass spectrometry experiment, where we found great sequence coverage. But when searching NCBI Entrez we find that the protein has become outdated.

  1. At NCBI try to look up the revision history for this outdated GI:

  2. The GI is dead: the record has not been replaced by another record, but simply deleted. Now find out whether the protein is really no longer valid:

  3. In ProteinCenter go to the lookup menu.

  4. Type in the GI 2137747 .

  5. Press the Search Proteins button.

  6. The protein is looked up and the summary displayed ( Note: Accession keys ).

  7. The outdated column (header O ) indicates that this protein is not outdated (no outdated icon). So already at this point it is evident that this is most likely a valid protein. Press the accession key field ( P49769 ) to open the ProteinCard.

  8. In the ProteinCard, the Keys summary shows the live accession keys from NCBI, IPI, Ensembl and UniProt (in this case there is no live Ensembl key). Press the blue header to see the details:

  9. The details show a lot of live primary and secondary key entries from different sources. Be aware that accession key fields, as well as these source details, may differ significantly from your actual output, because these data are continually updated.

  10. The outdated GI we started out with appears at the bottom, but our conclusion is clear: The protein which we identified is completely valid, but the GI is outdated.
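The distinction between an outdated key and an outdated protein can be summarized in a small Python model. It rests on the reading used in this tutorial, namely that a protein is still valid as long as at least one of its accession keys is live; see Section 4.2 for the authoritative definitions. The data structure is hypothetical, and the source assigned to P49769 is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessionKey:
    value: str
    source: str
    outdated: bool


def live_keys(keys):
    """Return the keys that are still live."""
    return [k for k in keys if not k.outdated]


def protein_is_outdated(keys):
    """Assumed rule: a protein is outdated only if it has no live key."""
    return not live_keys(keys)


# The two keys mentioned in this tutorial; the source labels are assumed.
keys = [
    AccessionKey("2137747", "GI", outdated=True),
    AccessionKey("P49769", "UniProt", outdated=False),
]
print(protein_is_outdated(keys))
print([k.value for k in live_keys(keys)])
```

In this model the outdated GI does not make the protein outdated, which matches the conclusion of the tutorial.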

2.8. Multiple alignment

Detected peptides can be derived from multiple proteins, including splice variants, homologous members of a gene family and variants derived from sequence polymorphism. In order to distinguish different isoforms, multiple sequence alignment is required. This tutorial illustrates how this can be done by using ProteinCenter's multiple alignment viewer. Additional information about the alignment viewer can be found in Chapter 20, The Alignment Viewer .

The tutorial uses the PRIDE dataset entitled HUPO Brain Proteome Project: BPP_PilotProject_Lab_01: mouse C57/Bl6, postnatal day 54-56 (P54-56), brain number 180: Peptide Fragmentation Identifications 384. (Pride accession: 1670)

The dataset can be found in the Proxeon -> Tutorials -> Multiple alignment category as shown in the following figure. Alternatively, it can be downloaded from Proxeon's web site http://www.proxeon.com/productrange/data_interpretation/Tutorials , or from the PRIDE web site http://www.ebi.ac.uk/pride/ .

In ProteinCenter do the following steps:

  1. Select the PRIDE_Exp_IdentOnly_Ac_1670 data folder to load the proteins in the set.

  2. Switch to the Clusters pane. In the Clusters pane it is possible to group the proteins in the dataset according to different measures of similarity. See Chapter 16, Clusters View for more information on clustering.

    Figure 2.22. Clustering the proteins in the set according to shared peptide evidence

  3. Cluster by shared peptide evidence by choosing Peptide sharing as the clustering method.

  4. Set the clustering level to 1. This means that proteins that share one or more peptides will be grouped together.

  5. Click the Cluster all data button to start the clustering. The list of clusters is shown as in the following figure (only the top 5 of 51 clusters are shown).

    Figure 2.23. The result of the clustering

  6. The clusters have been sorted by the Size column (formerly named 'No'). This means that the clusters containing the most proteins will be on top. In this case only four real clusters exist. Each of the remaining clusters contains a single protein that does not share peptides with any other protein in the dataset.

  7. To inspect a cluster, click its name, e.g. IPI00124499.2 in this example (the naming can be different depending on your accession preference settings). The cluster alignment will appear in a new browser window as shown here (the figure is truncated to the right).

    In the cluster alignment overview the sequences are placed above each other with the cluster anchor (IPI00124499.2) on top. The peptides are displayed on the sequences as colored boxes. The color changes where two or more peptides overlap. The example shows the cluster alignment of two sequences that share several peptides. It is clear that each of these sequences also contains unique peptides that can distinguish the two protein isoforms.

    The green triangles indicate where gaps have been inserted by the sequence alignment procedure. The sequence differences are shown below the aligned sequences.

  8. To view the alignment with sequence information, click the link Click here to show alignment with sequence information . The following figure is a truncated view of the detailed alignment:

    This view includes a scrollbar below the Differences line. Use it to walk along the sequence. Differences between the protein sequences are highlighted in the Differences line, which makes amino acid substitutions easy to spot.

  9. Now view the alignment for the cluster with the anchor IPI00114375.2 (see Figure 2.23, “The result of the clustering”). As before, switch to the detailed view of the alignment. Notice that an additional line has been added, indicating that some peptides in the alignment are modified. Move the scrollbar to focus on the modification at sequence position 448 as shown in the following figure. Note the two peptides with amino acid substitutions (A and C) as highlighted in the Differences line.

  10. Pointing the mouse at the blue C in the Modifications line reveals details about the modification and highlights the peptide in the alignment.

  11. Pointing the mouse at the peptide reveals details about the peptide.
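
The idea behind the Differences line in steps 8 and 9 can be illustrated with a minimal sketch that compares two already aligned sequences column by column. The marker characters (`*` for a substitution, `-` for a gap) and the two toy sequences are assumptions made for this example; they do not reproduce ProteinCenter's actual rendering or the proteins in the tutorial dataset.

```python
def differences_line(aligned_a, aligned_b, gap="-"):
    """Mark each alignment column: ' ' identical, '*' substitution, '-' gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for a, b in zip(aligned_a, aligned_b):
        if a == b:
            marks.append(" ")   # same residue in both sequences
        elif a == gap or b == gap:
            marks.append("-")   # gap inserted by the alignment
        else:
            marks.append("*")   # amino acid substitution
    return "".join(marks)


# Toy aligned sequences (hypothetical, for illustration only).
seq_a = "MKTAYIAKQR"
seq_b = "MKTCYI-KQR"

print(seq_a)
print(seq_b)
print(differences_line(seq_a, seq_b))
```

In this toy example the fourth column is marked as a substitution (A versus C) and the seventh column as a gap.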

For more information on clustering in ProteinCenter, see Chapter 16, Clusters View.
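
The peptide-sharing clustering used in steps 3 to 5 can be sketched as follows. Two proteins are linked when they share at least `level` peptides, and clusters are the resulting connected groups, sorted by size as in step 6. The transitive grouping (a union-find over linked pairs) is an assumption about how such clustering might work, not a description of ProteinCenter's implementation, and the protein names and peptides are made up.

```python
from collections import defaultdict
from itertools import combinations


def cluster_by_shared_peptides(protein_peptides, level=1):
    """Group proteins that share at least `level` peptides.

    protein_peptides maps a protein name to the set of its peptides.
    Returns a list of clusters (sorted name lists), largest first.
    """
    parent = {p: p for p in protein_peptides}

    def find(p):
        # Follow parent links to the cluster representative.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Link every pair of proteins with enough shared peptides.
    for a, b in combinations(protein_peptides, 2):
        if len(protein_peptides[a] & protein_peptides[b]) >= level:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for p in protein_peptides:
        groups[find(p)].append(p)
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Hypothetical toy dataset.
dataset = {
    "protA": {"PEPTIDEK", "LMNR"},
    "protB": {"PEPTIDEK", "QWER"},
    "protC": {"QWER"},
    "protD": {"AAAK"},
}

print(cluster_by_shared_peptides(dataset, level=1))
print(cluster_by_shared_peptides(dataset, level=2))
```

At level 1, protA, protB and protC end up in one cluster because protB shares a peptide with each of the other two, while protD forms a cluster of its own. At level 2 no pair shares enough peptides, so every protein is a single-protein cluster.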