ProteinCenter User Manual

Chapter 21. Import

Table of Contents

21.1. How to import data
21.1.1. A note on ambiguous protein keys
21.2. Importing data from CSV files
21.2.1. A simple list of proteins
21.2.2. A list of proteins with peptides
21.2.3. Peptide modifications
21.2.4. Grouped proteins
21.2.5. Taxonomy limitation
21.2.6. Proteins from a list of gene identifiers
21.2.7. Quantitation
21.2.8. An example using more of the available columns
21.2.9. Including URLs and file links
21.2.10. Regional settings and the CSV format
21.2.11. Handling of incorrect values
21.3. Importing data from other software using CSV files
21.3.1. ProteinPilot from Applied BioSystems
21.3.2. MaxQuant from the Max Planck Institute of Biochemistry
21.3.3. Spectrum Mill from Agilent
21.3.4. ProteinLynx Global Server from Waters
21.3.5. VEMS III software from SDU
21.3.6. The Elucidator from Rosetta Biosoftware
21.3.7. ProteinProphet (Trans-Proteomic Pipeline) from ISB
21.3.8. Sorcerer from SageNResearch
21.4. Importing data from XML-based formats
21.4.1. ProtXML
21.4.2. Mascot XML
21.4.3. X! Tandem XML
21.4.4. PRIDE XML
21.4.5. BioWorks XML
21.4.6. Quantitation support for XML formats
21.5. Other formats
21.5.1. MSQuant
21.6. Direct upload from other applications
21.6.1. Phenyx
21.7. Identification of proteins and peptides

ProteinCenter allows you to import datasets from experiments obtained through a variety of protein identification software, from different databases, or from the literature. The datasets typically consist of lists of protein keys with optional additional information, and can be imported in the formats specified below.

ProteinCenter can import protein accession codes, GIs or other protein identifiers from any of these databases: UniProt, Swiss-Prot, Trembl, RefSeq, PlasmoDB, CMR, TubercuList, TriTrypDB, PSE, PIR, NCBI, RPF, PDB, Embl, Ensembl, TPE, TPG, PRF, SGD, Dbj, FlyBase, GenBank, TAIR and IPI.

Figure 21.1. Import pane


21.1. How to import data

This section will guide you through the import of data step by step.

Note

The batch file import functionality requires Flash to be installed. If Flash is not installed, only a limited import is available, allowing you to import one file at a time.

  1. In the workspace tree, select the category in which you want the imported dataset to be located.

    Figure 21.2. Selecting the destination category for the imported dataset


  2. Select the Import pane.

    Figure 21.3. Selecting a file for import


  3. Specify the format. Choose from Spectrum Mill, BioWorks XML, CSV, Mascot XML, PRIDE, X! Tandem, MSQuant, protXML, Mascot Distiller, ProteinPilot protein summary, ProteinPilot peptide summary, MaxQuant protein groups and MaxQuant peptides. See below for accepted formats and how to transfer data from other applications. If the file extension is '.csv', then the CSV format is automatically selected when clicking the interface.

  4. Optionally check Restrict proteins to only load the proteins already represented in checked data folders of the workspace.

  5. Click Browse to bring up the file browser, then select one or more files.

  6. Press Import to submit the import to the upload queue.

  7. Optionally press Cancel All Queued to clear the upload queue.

The upload queue will then show the status of each file import.

To see a summary of the import, select each of the resulting datasets to display the details on the μLIMS pane.

Figure 21.4. Upon import, selecting a dataset and clicking the μLIMS pane, detailed information about the dataset will be displayed


The μLIMS pane contains information about the imported file and its content.

  • Annotation - This section contains annotation obtained automatically from the imported file as well as annotation supplied by the users as shown in Chapter 22, μLIMS. The annotation stored by ProteinCenter during the import process depends on the format of the file, but it can include information about the quantitation method, the cleavage agent, etc.

  • Properties - Basic properties for the imported data.

  • Permissions - In the permissions section it can be specified which users can view, change and delete the imported data. See Section 22.2, “Permissions”.

  • Quantitation name mappings - If the imported files contain quantitation data that could not be recognized by ProteinCenter, each imported quantitation data name will be mapped to a generic name. These mappings will be displayed in this section.

  • Imported File - The complete path and name of the imported file.

  • Records - The number of protein records in the input file.

  • Proteins - The number of protein records that were recognized by ProteinCenter. In order for a protein record to be recognized by ProteinCenter, the accession key must be known to ProteinCenter.

  • Unique proteins - In many cases the imported protein records are redundant. The same protein may occur several times with different accession keys. This redundancy is removed by the import procedure. This number is also displayed in parentheses after the folder names in the workspace as shown in Figure 21.2, “Selecting the destination category for the imported dataset”. When counting the number of unique proteins, the experimental evidence is not taken into consideration. This means that if a particular protein is present twice in a dataset with different peptide evidence, it will be counted as one unique protein, but the proteins count will be increased by two.

  • Rejected records - The protein records that were not recognized by ProteinCenter. In this example all records were recognized, but records may be rejected if their protein accession keys do not exist in ProteinCenter.

  • Unique rejected records - If the same protein record is rejected more than once, it is counted as one unique rejection.

  • Unmatched peptides - The number of peptides that could not be matched to ProteinCenter's protein string. Each parsed peptide sequence must be a substring of the protein sequence identified by the accession key. A few special cases are taken into consideration:

    • Some protein sequences contain the wildcard character X. Every amino acid will match this character. As an example the peptide sequence CDEF will match on the protein sequence ACXEFG.

    • Peptide sequences can contain the symbol J, which denotes that the amino acid is either I (Isoleucine) or L (Leucine). Js will be matched with both Is and Ls, such that the peptide sequence CJEF will match both the protein sequences ACIEFG and ACLEFG.

    The peptides in the input record that do not map to the protein sequence in ProteinCenter will be listed. In this example the peptide AAA could not be matched to the protein string for the accession 4502431.

  • Warnings - A number of warnings may occur during import. If, as an example, the importer expects a number but encounters something else, a warning will be shown. The record will still be imported, but the particular data causing the warning will not be included in ProteinCenter.
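The peptide matching rules listed under "Unmatched peptides" above can be illustrated with a short Python sketch. This is not ProteinCenter's implementation; the function names are hypothetical, and only the two special cases from the text (X in the protein sequence, J in the peptide sequence) are modelled.

```python
def residues_match(peptide_residue: str, protein_residue: str) -> bool:
    """Return True if a peptide residue is compatible with a protein residue."""
    if protein_residue == "X":
        # X in the protein sequence is a wildcard and matches any amino acid.
        return True
    if peptide_residue == "J":
        # J in the peptide sequence stands for either I or L.
        return protein_residue in ("I", "L")
    return peptide_residue == protein_residue


def peptide_matches(peptide: str, protein: str) -> bool:
    """Check whether the peptide occurs as a substring of the protein sequence."""
    length = len(peptide)
    for start in range(len(protein) - length + 1):
        window = protein[start:start + length]
        if all(residues_match(p, q) for p, q in zip(peptide, window)):
            return True
    return False


print(peptide_matches("CDEF", "ACXEFG"))  # True
print(peptide_matches("CJEF", "ACIEFG"))  # True
print(peptide_matches("CJEF", "ACLEFG"))  # True
print(peptide_matches("AAA", "ACXEFG"))   # False
```

A peptide for which `peptide_matches` returns False would be the kind of record reported as an unmatched peptide.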

The imported dataset is placed in a folder with the specified name in the chosen category.

Figure 21.5. An imported file results in a new data folder


It is always possible to move the data folder to a different category as described in Section 9.1.1, “Workspace controls”.

ProteinCenter supports several data formats for import of experimental data. These are described in the sections below.

21.1.1. A note on ambiguous protein keys

A small subset of the protein keys in the database are ambiguous, in the sense that they point to multiple different proteins. Some protein keys point to the same protein sequence in different species, while others point to different sequences within the same species.

This is only the case for secondary accession codes (e.g. from UniProt), but since primary accession codes may turn into secondary accession codes over time, this issue might occur when importing old data.

ProteinCenter's dataset importer automatically picks a specific protein among those pointed to by a key. The following prioritized list is used to select between protein keys:

  1. Live keys are favored over dead ones

  2. Keys that exactly match a potential protein accession are preferred

  3. Keys referring to proteins with the most other live keys are preferred

  4. The more trusted protein sources are preferred (RefSeq, SwissProt, ...)

  5. For equal sources, the following special considerations are given:

    1. NCBI: Keys with the maximum rank of their alternative sources are preferred

    2. IPI: The newest version is preferred (for IDs ending in a version number '.V')

    3. UniProt: Base versions are favored over sequence variants (IDs ending in a variant number '-V')

    4. RefSeq: Established entries (IDs starting with 'NP_') are preferred over other entries

  6. As a final tie-breaker, an arbitrary, though consistently chosen, key is selected
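The prioritized list above can be read as a lexicographic sort key. The Python sketch below is only an illustration of that idea: the `Candidate` fields, the numeric source ranks and the example keys are all hypothetical, and the source-specific rules of step 5 are left out.

```python
from dataclasses import dataclass

# Lower rank = more trusted source. Only RefSeq and SwissProt are named in the
# list above; the numeric ranks are assumptions made for this sketch.
SOURCE_RANK = {"RefSeq": 0, "SwissProt": 1}
UNKNOWN_SOURCE_RANK = 99


@dataclass(frozen=True)
class Candidate:
    key: str             # protein key (hypothetical example values below)
    is_live: bool        # rule 1: live keys are favored over dead ones
    exact_match: bool    # rule 2: exact accession matches are preferred
    live_key_count: int  # rule 3: number of other live keys for the protein
    source: str          # rule 4: protein source


def priority(candidate: Candidate) -> tuple:
    """Build a tuple that sorts the preferred candidate first."""
    return (
        not candidate.is_live,        # False sorts before True
        not candidate.exact_match,
        -candidate.live_key_count,    # more live keys sorts first
        SOURCE_RANK.get(candidate.source, UNKNOWN_SOURCE_RANK),
        candidate.key,                # rule 6: arbitrary but consistent
    )


def pick(candidates: list) -> Candidate:
    return min(candidates, key=priority)


candidates = [
    Candidate("KEY_A", is_live=False, exact_match=True, live_key_count=5, source="RefSeq"),
    Candidate("KEY_B", is_live=True, exact_match=False, live_key_count=2, source="SwissProt"),
    Candidate("KEY_C", is_live=True, exact_match=False, live_key_count=4, source="TrEMBL"),
]
print(pick(candidates).key)  # KEY_C
```

In this example KEY_A loses because it is dead, and KEY_C wins over KEY_B because rule 3 is applied before rule 4.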

21.2. Importing data from CSV files

CSV (comma-separated values) format is a general data format suitable for any type of protein identification data.

The CSV format is a flat file format, which has become a de facto standard for exchange of data because of its simplicity. It is a tabular format, with column values separated by commas, and each record appearing on a separate line. The first line of the file contains the headers, a list of reserved names each describing the contents of the corresponding column. The most important, and only mandatory, header is KEY, or one of its synonyms in the table below (the highest prioritized takes precedence). This designates the column containing the protein key, by which the protein will eventually be identified. The table below lists the basic supported column headers, which all protein based CSV importers share:

Table 21.1. Legal column headers for all protein based CSV importers

CSV file column     Interpretation
KEY, IMPORTKEY      Identify proteins with this key, considering all primary and alternative keys and symbols (see Section 21.2.1, “A simple list of proteins”)
PEPTIDE, PEPTIDES   Peptides for this protein - multiple peptides must be quoted and comma separated (see Section 21.2.2, “A list of proteins with peptides”)
CLUSTER             Cluster name to indicate proteins that are grouped together, e.g. "indistinguishable proteins" from search engines (see Section 21.2.4, “Grouped proteins”)
TAX_ID              Limit any imported proteins to hits with this specific taxonomy ID (see Section 21.2.5, “Taxonomy limitation”)

It is relatively easy to create files in CSV format in any text editor, but it is especially convenient for users of spreadsheets (like Microsoft Excel), because spreadsheet data can be saved directly to this format. In cases where supplementary information is included, using a spreadsheet is often the more convenient solution.

The simplest dataset format is a list of protein keys, while a range of optional supplementary data can be included with each protein key.

In the following, examples of how to prepare datasets using text editors or spreadsheets are given.

21.2.1. A simple list of proteins

The data is ordered in a single column, which is given the header title "KEY" (case insensitive). The example below shows a case with 4 different protein keys.

A key column containing the protein key is mandatory when importing data, as this field stores the protein identifier. It can represent a GI, an accession number, an accession code, an IPI identifier, etc. or a mix of different types of protein keys. ProteinCenter supports a very broad selection of protein identifiers from GenBank, Swiss-Prot, Trembl, Ensembl, PIR, IPI, PDB, etc. (see Section 7.1, “Lookup by GI or accession code”).

Figure 21.6. The spreadsheet format for a dataset consisting of protein keys


To save the data as a CSV file in the spreadsheet, select save as and choose the CSV format.

Figure 21.7. The save as menu in Excel


The resulting file will look as shown below if opened in a text editor. In cases where the amount of data is limited (like this example) it is possible to create the input directly in a text editor.

Figure 21.8. The flat format in the CSV file


Usually, additional information is associated with the proteins in the dataset. ProteinCenter defines a number of predefined column headers to capture this information. In the following, some of these headers are introduced by way of examples. The complete list of possible supplementary data is found in Table B.1, “Protein supplementary data”.

21.2.2. A list of proteins with peptides

Peptides associated with a protein may be included by entering them in a column named PEPTIDES. The individual peptides are separated by commas.

In a spreadsheet it would look like this:

When creating datasets directly in a text editor, a problem is encountered because both the columns (KEY and PEPTIDES) and the individual peptides are separated by commas. This problem is handled by enclosing the peptide list in quotes like this:

KEY,PEPTIDES
Q6D6M3,IQLQDAGR
Q921G7,"GIATNDVGIQK,HHSIQPTLEGGK"
P41216,"RGLQGSFEELCR,LAQGEYIAPEK,GLQSFEELCR"
P26443,"IIAEGANGPTTPEADK,VYNEAGVTFT,LVEDLK"
Q99JY0,"DNGIRPSSLEQMAK,EAALGAGFSDK"
Q543H3,"QSDLDTLAK,FGMHLQAATPK,GYEPDPNIVK"

In the special case where only one peptide is associated with the protein, enclosing the peptide list is optional.

When a spreadsheet is used to enter the data, the quoting is done automatically.
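The same automatic quoting is available when the file is generated by a script. As a minimal sketch, Python's standard `csv` module quotes any field that contains a comma; the `records` variable and its layout are assumptions made for this example.

```python
import csv
import io

# Protein keys with their peptides, taken from the example above.
records = [
    ("Q6D6M3", ["IQLQDAGR"]),
    ("Q921G7", ["GIATNDVGIQK", "HHSIQPTLEGGK"]),
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["KEY", "PEPTIDES"])
for key, peptides in records:
    # Join the peptides with commas; the writer adds quotes where needed.
    writer.writerow([key, ",".join(peptides)])

csv_text = buffer.getvalue()
print(csv_text)
# KEY,PEPTIDES
# Q6D6M3,IQLQDAGR
# Q921G7,"GIATNDVGIQK,HHSIQPTLEGGK"
```

To write to disk instead, open the file with `newline=""` and pass the file object to `csv.writer`.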

Alternatively, you may associate peptides with a protein by specifying several lines with the same accession key. These lines give rise to a single experimental record comprising all the peptides from the individual lines. Thus, the following input will result in three experimental records.

KEY,PEPTIDES
Q6D6M3,IQLQDAGR
Q921G7,GIATNDVGIQK
Q921G7,HHSIQPTLEGGK
P41216,RGLQGSFEELCR
P41216,LAQGEYIAPEK
P41216,GLQSFEELCR

If information such as quantitation (see Section 21.2.7, “Quantitation”) or other supplementary info (see Section 21.2.8, “An example using more of the available columns”) is provided, lines will be regarded as equal only if all fields except the peptide field are equal. Peptides from equal lines will be collected in one data point. The following lines present an example:

KEY, QSE, AQR,PEPTIDES
Q6D6M3, 0.1, 0.6, IQLQDAGR
Q921G7, 0.1, 0.6, GIATNDVGIQK
Q921G7, 0.1, 0.6, HHSIQPTLEGGK
P41216, 0.1, 0.6, RGLQGSFEELCR
P41216, 0.1, 0.6, LAQGEYIAPEK
P41216, 0.2, 0.6, GLQSFEELCR

This input will create four distinct experimental data points: one representing line 1, and one representing lines 2 and 3, since they differ only in peptides. Likewise, one experimental data point will be made for lines 4 and 5. Line 6 will be an experimental data point by itself and will not be collapsed with the data point representing lines 4 and 5, as its QSE value differs from theirs.
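The collapsing rule can be checked with a small Python sketch that groups lines on every field except the peptide field. It only illustrates the rule described here and is not the importer's actual code; the function name `collapse` is hypothetical.

```python
import csv
import io

CSV_TEXT = """KEY, QSE, AQR,PEPTIDES
Q6D6M3, 0.1, 0.6, IQLQDAGR
Q921G7, 0.1, 0.6, GIATNDVGIQK
Q921G7, 0.1, 0.6, HHSIQPTLEGGK
P41216, 0.1, 0.6, RGLQGSFEELCR
P41216, 0.1, 0.6, LAQGEYIAPEK
P41216, 0.2, 0.6, GLQSFEELCR
"""


def collapse(csv_text: str) -> dict:
    """Group peptides by all remaining fields of their line."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    groups = {}
    for row in reader:
        peptide = row.pop("PEPTIDES")
        # Lines are equal when every field except the peptide is equal.
        signature = tuple(row.items())
        groups.setdefault(signature, []).append(peptide)
    return groups


groups = collapse(CSV_TEXT)
print(len(groups))  # 4
for signature, peptides in groups.items():
    print(dict(signature), peptides)
```

The last P41216 line ends up in its own group because its QSE value is 0.2 rather than 0.1.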

Peptides can also be imported with adjoining amino acids (e.g. K.SPLAQMEEERR.E).

21.2.3. Peptide modifications

Peptide modifications may be specified either by (UniMod standard) name, by (delta or absolute) mass or by a short-hand identifier.

To specify a peptide modification by name, prepend the modified amino acid with the UniMod standard title in parentheses. The following example shows a peptide with a phosphorylated histidine:

KEY, PEPTIDES
Q8WNT7, ELSDIA(Phospho)HR

To specify a peptide modification by mass, simply prepend the modified amino acid with the given mass in parentheses. Delta masses are specified by adding a plus (+) or minus (-) sign to the mass difference, while an absolute mass is assumed when the sign is omitted. The following example shows peptides with a phosphorylated histidine, specified with absolute and delta mass respectively:

KEY, PEPTIDES
Q8WNT7, ELSDIA(217.03)HR
Q5FWB7, ELSDIA(+79.96633)HR
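The three notations for a parenthesized modification (name, absolute mass, delta mass) can be told apart mechanically. The following Python sketch shows one possible way to do so; the regular expression and the function name are assumptions for illustration, not the parser used by ProteinCenter.

```python
import re

# A parenthesized specification directly in front of the modified residue.
MOD_PATTERN = re.compile(r"\(([^)]+)\)([A-Z])")


def classify_modifications(peptide: str) -> list:
    """Return (residue, specification, kind) for each modification found."""
    found = []
    for spec, residue in MOD_PATTERN.findall(peptide):
        try:
            float(spec)
        except ValueError:
            kind = "name"            # e.g. a UniMod title such as Phospho
        else:
            # A leading sign marks a delta mass; no sign means absolute mass.
            kind = "delta mass" if spec[0] in "+-" else "absolute mass"
        found.append((residue, spec, kind))
    return found


print(classify_modifications("ELSDIA(Phospho)HR"))    # [('H', 'Phospho', 'name')]
print(classify_modifications("ELSDIA(217.03)HR"))     # [('H', '217.03', 'absolute mass')]
print(classify_modifications("ELSDIA(+79.96633)HR"))  # [('H', '+79.96633', 'delta mass')]
```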

Some common peptide modifications may be imported into ProteinCenter by prepending modification identifiers immediately before the modified amino acid. Supported formats include either a single (lower-case) character, or one or two characters in parentheses. The following table shows all the supported abbreviated peptide modifications:

Table 21.2. Sequence encoded peptide modifications supported in the CSV format

Peptide modification   1 character   2 characters   Monoisotopic delta mass
acetylation            c             ac             42.01057
amidation              a             am             -0.98402
deamidation            d             de             0.98402
di-methylation         i             dm             28.03130
methylation            m             me             14.01565
oxidation              o             ox             15.99492
phosphorylation        p             ph             79.96633
sulfation              s             su             79.95682
ubiquitination         u             ub             114.04293
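As an illustration of the short-hand notation, the Python sketch below turns Table 21.2 into a lookup and extracts the plain sequence and the modified positions from an annotated peptide. The example peptide, the function name and the regular expression are hypothetical, and terminal modifications are not handled.

```python
import re

# (modification, 1-character code, 2-character code, monoisotopic delta mass)
MODIFICATIONS = [
    ("acetylation", "c", "ac", 42.01057),
    ("amidation", "a", "am", -0.98402),
    ("deamidation", "d", "de", 0.98402),
    ("di-methylation", "i", "dm", 28.03130),
    ("methylation", "m", "me", 14.01565),
    ("oxidation", "o", "ox", 15.99492),
    ("phosphorylation", "p", "ph", 79.96633),
    ("sulfation", "s", "su", 79.95682),
    ("ubiquitination", "u", "ub", 114.04293),
]
BY_CODE = {}
for name, one_char, two_char, delta in MODIFICATIONS:
    BY_CODE[one_char] = (name, delta)
    BY_CODE[two_char] = (name, delta)

# An optional code (bare lower-case letter, or one or two letters in
# parentheses) followed by the upper-case residue it modifies.
TOKEN = re.compile(r"(?:\(([a-z]{1,2})\)|([a-z]))?([A-Z])")


def parse_shorthand(peptide: str):
    """Return the plain sequence and a list of (position, residue, name, delta)."""
    sequence = []
    modifications = []
    for paren_code, bare_code, residue in TOKEN.findall(peptide):
        sequence.append(residue)
        code = paren_code or bare_code
        if code:
            name, delta = BY_CODE[code]  # raises KeyError for unknown codes
            modifications.append((len(sequence), residue, name, delta))
    return "".join(sequence), modifications


# Hypothetical peptide with a phosphorylated S and an oxidized M.
print(parse_shorthand("ELpSDIA(ox)MR"))
# ('ELSDIAMR', [(3, 'S', 'phosphorylation', 79.96633), (7, 'M', 'oxidation', 15.99492)])
```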

The figure below gives an example of peptides phosphorylated on several sites:

Modifications on N- and C-termini are specified by prepending them to the residue letters '{' and '}' respectively. The CSV parser also recognizes the terminal residue '_' at both ends of the peptide sequence. N- and C-termini only need to be given explicitly when they are modified.

Note that the PTM column is intended to be used for post-translational modifications reported by e.g. a search engine.

21.2.4. Grouped proteins

Proteins may be grouped before being imported into ProteinCenter. One example may be that a set of peptides cannot distinguish between two different splicing isoforms. In those cases it is possible to import these proteins as a group.

For this purpose the CLUSTER column is used:

The CLUSTER value (which may be thought of as the group name) could be any type — number or text — as long as proteins belonging to the same cluster have the same CLUSTER value.

Note

It would not make sense for a protein to be included in two or more distinct groups. If an attempt is made to do this (e.g. because two different keys referring to the same protein are inadvertently assigned to different groups), ProteinCenter will detect it and merge the groups. The name of the merged group is obtained by joining the individual group names separated by ‘|’ (vertical bar).
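The merging behaviour described in the note can be sketched with a small union-find structure in Python. This is an illustration only: the function name and input layout are hypothetical, the input is assumed to contain proteins that have already been resolved from their keys, and the order in which the group names are joined (alphabetical here) is an assumption.

```python
def merge_clusters(assignments: list) -> dict:
    """Map each protein to its (possibly merged) group name.

    assignments is a list of (protein, cluster_name) pairs.
    """
    parent = {}

    def find(name):
        # Find the representative of a cluster, compressing the path.
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    first_cluster = {}
    for protein, cluster in assignments:
        root = find(cluster)
        if protein in first_cluster:
            # The protein is already in a group: merge the two groups.
            parent[root] = find(first_cluster[protein])
        else:
            first_cluster[protein] = cluster

    members = {}
    for name in list(parent):
        members.setdefault(find(name), []).append(name)
    merged_name = {root: "|".join(sorted(names)) for root, names in members.items()}
    return {protein: merged_name[find(cluster)]
            for protein, cluster in first_cluster.items()}


# P1 is assigned to both A and B, so the two groups are merged.
assignments = [("P1", "A"), ("P2", "A"), ("P1", "B"), ("P3", "B"), ("P4", "C")]
print(merge_clusters(assignments))
# {'P1': 'A|B', 'P2': 'A|B', 'P3': 'A|B', 'P4': 'C'}
```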

21.2.5. Taxonomy limitation

Often, a particular species is of primary interest in a study. Use the taxonomy column (TAX_ID) to limit any imported proteins to a specific taxonomy ID on a row-by-row basis. The example CSV line below will only import the human protein for the given key, rather than also including the adenovirus protein with the same accession:

KEY, TAX_ID
Q2Y0J2, 9606

21.2.6. Proteins from a list of gene identifiers

Rather than referring to proteins directly, the "Gene proteins CSV" importer can be used to import proteins for the genes given in the file. Gene accessions can be specified as one of the types of identifiers below (the highest prioritized takes precedence), and a taxonomy field can limit the imported proteins to hits with that specific taxonomy ID:

Table 21.3. Legal column headers for Genes proteins CSV importer

CSV file column             Interpretation
KEY, IMPORTKEY, GENE        Identify proteins for genes with one of these keys, considering all primary and alternative keys and symbols
GENE Entrez                 Identify proteins for genes with this Entrez gene ID
GENE Ensembl                Identify proteins for genes with this Ensembl gene ID
GENE <alternative source>   Identify proteins for genes with any alternative gene ID, as given by Table A.2, “Gene Data Sources”
TAX_ID                      Limit any imported proteins to hits with this specific taxonomy ID

Every CSV line in this format is interpreted as a list of gene identifiers, and every protein in the ProteinCenter database associated with these genes will be imported. If several gene accessions are given on a line, the first accession returning at least one protein hit will be used.
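The "first accession with at least one hit" rule can be sketched in a few lines of Python. The lookup table stands in for the ProteinCenter database and all identifiers in it are placeholders invented for this example.

```python
def proteins_for_line(gene_accessions: list, lookup: dict):
    """Return the first accession that yields proteins, with its proteins."""
    for accession in gene_accessions:
        hits = lookup.get(accession, [])
        if hits:
            # Later accessions on the same line are not consulted.
            return accession, hits
    return None, []


# Placeholder database: gene accession -> associated proteins.
lookup = {
    "GENE_B": ["PROT_1", "PROT_2"],
    "GENE_C": ["PROT_3"],
}

print(proteins_for_line(["GENE_A", "GENE_B", "GENE_C"], lookup))
# ('GENE_B', ['PROT_1', 'PROT_2'])
```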

This easily results in an infeasible number of proteins in the dataset, as each gene can be the source of numerous gene products. An effective way to limit the import to the proteins of interest (besides limiting on taxonomy) is to choose Restrict proteins, whilst having checked the data folders containing the proteins of interest in the workspace. This is a convenient way to integrate results from both proteomics and genomics assays.

Any quantitation (see Section 21.2.7, “Quantitation”) or other supplementary data (see Section 21.2.8, “An example using more of the available columns”) will also be attached to the imported proteins.

21.2.7. Quantitation

21.2.7.1. Protein quantitation

Proteins can have a number of generic quantitation values assigned to them. Below is a simple example with just one value:

Each protein is annotated with a quantitative value (AQR, or AQR# with # denoting numbers 1 through 19), and four optional values:

  1. Standard error for the quantitation (QSE, or QSE# with # denoting numbers 1 through 19)

  2. Standard deviation for the quantitation (QSD, or QSD# with # denoting numbers 1 through 19)

  3. The P-value for the quantitation (QPV, or QPV# with # denoting numbers 1 through 19)

  4. The number of peptides involved in the quantitation (NPEP, or NPEP# with # denoting numbers 1 through 19)

ProteinCenter also supports established protein quantitation methods, such as iTRAQ or SILAC. A full list of the column headers for quantitative protein information can be found in Appendix B, Supplementary Data.
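When generating files with many generic quantitation columns, it can help to check the header names before import. The Python sketch below accepts the generic protein headers named above (AQR, QSE, QSD, QPV and NPEP, optionally followed by a number from 1 through 19); the function name is hypothetical and the check is not part of ProteinCenter.

```python
import re

QUANT_HEADER = re.compile(r"^(AQR|QSE|QSD|QPV|NPEP)(\d*)$", re.IGNORECASE)


def parse_quant_header(header: str):
    """Return (base name, index) for a generic quantitation header, else None."""
    match = QUANT_HEADER.match(header.strip())
    if match is None:
        return None
    base, number = match.group(1).upper(), match.group(2)
    if number == "":
        return base, None          # unnumbered form, e.g. AQR
    index = int(number)
    if 1 <= index <= 19:
        return base, index         # numbered form, e.g. AQR7
    return None                    # number outside the supported range


print(parse_quant_header("AQR"))    # ('AQR', None)
print(parse_quant_header("qse3"))   # ('QSE', 3)
print(parse_quant_header("NPEP19")) # ('NPEP', 19)
print(parse_quant_header("AQR20"))  # None
```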

21.2.7.2. Peptide quantitation

Each protein can have one or more peptides with generic quantitation values assigned to the peptides. Below is an example with just one quantitative value for the peptides:

Each peptide is annotated with a quantitation value (PEPTIDE_AQR, or PEPTIDE_AQR# with # denoting numbers 1 through 19), and one optional value:

  1. Standard error for the peptide quantitation (PEPTIDE_QSE, or PEPTIDE_QSE# with # denoting numbers 1 through 19)

ProteinCenter also supports established peptide quantitation methods, such as iTRAQ or SILAC. Here's an example of how iTRAQ 4 quantitative data for peptides can be imported from CSV files:

A full list of the column headers for quantitative peptide information can be found in Peptide Quantitation in Appendix B, Supplementary Data.

21.2.8. An example using more of the available columns

When preparing a multi-column dataset with various associated data, each column must be supplied with a header specifying the contents of the column as shown in Figure 21.9, “Excel view of an import file using many of the available columns”. It is very important that the header title is spelled correctly.

Figure 21.9. Excel view of an import file using many of the available columns


A given protein may occur multiple times in the same input file, in cases where it is associated with different supplementary data. In the ProteinCenter interface, only a single entry for each protein will be shown (summarizing the supplementary data) — except in cases where the same protein is pre-grouped into a number of different groups. In the latter case, the protein will occur once for every group it is included in.

A description of the types of supplementary data that may be imported, along with their required header titles, can be found in appendix Table B.1, “Protein supplementary data”.

Header titles are not case sensitive.

Only the KEY column is mandatory. It is possible to pick any combination of the remaining columns, or all of the columns.

Note that you are allowed to leave blank cells, as in the example above.

21.2.9. Including URLs and file links

The nine link columns (L1 through L9) are reserved for URLs. This is useful for referring to data in a LIMS or an on-line database, e.g. http://www.embl-ebi.ac.uk/interpro/DisplayIproEntry?ac=IPR000157. It is also possible to refer to files on a local or shared hard drive, where the latter is preferable as the datasets in ProteinCenter are likely to be viewed from a different computer than the one from which they were originally imported. When processing link columns, ProteinCenter automatically recognizes file names and transforms them into a format that generates clickable links in the display. Three formats are recognized:

  1. Windows path names, e.g. S:\Proteomics Lab\John Doe\my ticket to Odense.dat

  2. UNC path names, e.g. \\OURSAN01\phospho.wiff

  3. UNIX-style path names, e.g. /home/ms/blood.out
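When preparing link columns in a script, the three file name formats and ordinary URLs can be distinguished with simple pattern checks. The Python sketch below shows one way to do this; the patterns are assumptions, and how ProteinCenter itself recognizes and transforms the values is not specified here.

```python
import re


def classify_link(value: str) -> str:
    """Classify a link column value as url, unc, windows, unix or unknown."""
    if re.match(r"^[A-Za-z][A-Za-z0-9+.-]*://", value):
        return "url"       # e.g. http://...
    if value.startswith("\\\\"):
        return "unc"       # two leading backslashes
    if re.match(r"^[A-Za-z]:\\", value):
        return "windows"   # drive letter, colon, backslash
    if value.startswith("/"):
        return "unix"
    return "unknown"


print(classify_link(r"S:\Proteomics Lab\John Doe\my ticket to Odense.dat"))  # windows
print(classify_link(r"\\OURSAN01\phospho.wiff"))                             # unc
print(classify_link("/home/ms/blood.out"))                                   # unix
```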

21.2.10. Regional settings and the CSV format

Some regional settings use ';' (semi-colon) as the column separator for “comma-separated values” — a somewhat confusing terminology in this case. ProteinCenter automatically detects this and adjusts the processing of the file accordingly.

Warning

When the column separator is a semi-colon, the decimal separator is expected to be a comma (!), i.e. numbers should be entered as 3,14 rather than 3.14.
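The interplay between column separator and decimal separator can be illustrated with a short Python sketch. The detection heuristic (looking for a semi-colon in the header line) and the function names are assumptions for this example and may differ from what ProteinCenter does internally.

```python
import csv
import io


def read_regional_csv(text: str):
    """Detect the column separator from the header line and read the rows."""
    header_line = text.splitlines()[0]
    delimiter = ";" if ";" in header_line else ","
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter,
                            skipinitialspace=True)
    return delimiter, list(reader)


def parse_number(value: str, delimiter: str) -> float:
    """Parse a number, expecting a decimal comma in semi-colon separated files."""
    if delimiter == ";":
        value = value.replace(",", ".")
    return float(value)


delimiter, rows = read_regional_csv("KEY;AQR\nQ6D6M3;3,14\n")
print(delimiter)                                  # ;
print(parse_number(rows[0]["AQR"], delimiter))    # 3.14
```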

21.2.11. Handling of incorrect values

Generally, the CSV importer is quite tolerant. Only if a fatal error is detected in the header line, for instance a missing KEY column or duplicate column names, is the file rejected outright. Unknown columns are simply ignored with an "Unrecognized column" warning in the import summary. Errors in the actual data, typically incorrectly formatted numbers or links, result in empty values for the columns in question.
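A header line can be checked against these rules before upload. The Python sketch below mirrors the behaviour described above: a missing key column or duplicate names are fatal, while unknown columns only produce warnings. The set of known headers is limited to those in Table 21.1, and the function name and message texts are assumptions.

```python
# Basic headers from Table 21.1; a real file may use many more (see Appendix B).
KNOWN_HEADERS = {"KEY", "IMPORTKEY", "PEPTIDE", "PEPTIDES", "CLUSTER", "TAX_ID"}
KEY_HEADERS = ("KEY", "IMPORTKEY")


def check_header(columns: list, known: set = KNOWN_HEADERS) -> list:
    """Raise ValueError on fatal header errors, return warnings otherwise."""
    # Header titles are not case sensitive.
    names = [column.strip().upper() for column in columns]
    duplicates = sorted({name for name in names if names.count(name) > 1})
    if duplicates:
        raise ValueError("Duplicate column names: " + ", ".join(duplicates))
    if not any(key in names for key in KEY_HEADERS):
        raise ValueError("Missing KEY column")
    return ["Unrecognized column: " + name for name in names if name not in known]


print(check_header(["key", "Peptides", "Colour"]))
# ['Unrecognized column: COLOUR']
```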

21.3. Importing data from other software using CSV files

Other applications may export CSV files with different headers than the ones employed by ProteinCenter. In this section it is explained how these data columns are mapped, and how to transform files for import into ProteinCenter.

21.3.1. ProteinPilot from Applied BioSystems

Files exported from the ProteinPilot application can be imported into ProteinCenter without any transformations. ProteinPilot exports data in two files: one containing data about the proteins and one containing data about the peptides, called 'Protein Summary' and 'Peptide Summary' respectively. ProteinCenter cannot import data from more than one file into a single dataset, so the protein summary file and the peptide summary file must be imported separately as described below. Once both files have been imported, the resulting datasets can be merged into a single dataset (see Section 9.1.3, “How to merge datasets”).

21.3.1.1. ProteinPilot protein summary files

Protein summary files from ProteinPilot can be imported into ProteinCenter by selecting ProteinPilot protein summary in the Import format drop-down box.

Table 21.4. Data imported from ProteinPilot protein summary files

CSV file column   ProteinCenter field
N                 CLUSTER
Accession         KEY
###:XXX           ITRAQ_4_RATIO_###
PVal ###:XXX      ITRAQ_4_PVAL_###
EF ###:XXX        ITRAQ_4_EF_###
###:XXX           ITRAQ_8_RATIO_###
PVal ###:XXX      ITRAQ_8_PVAL_###
EF ###:XXX        ITRAQ_8_EF_###

Where ### denotes reporter ion (114, 115, 116, 117 for iTRAQ4 and 113, 114, 115, 116, 117, 118, 119, 121 for iTRAQ8). For an explanation of the ProteinCenter fields, refer to Appendix B, Supplementary Data.

All other fields in the protein summary file will be ignored by the import routine.

21.3.1.2. ProteinPilot peptide summary files

Peptide summary files from ProteinPilot can be imported into ProteinCenter by selecting ProteinPilot peptide summary in the Import format drop-down box.

Table 21.5. Data imported from ProteinPilot peptide summary files

CSV file column   ProteinCenter field
N                 CLUSTER
Accessions        KEY
Conf              PEPTIDE_INIT_PROB
###:XXX           PEPTIDE_ITRAQ_4_RATIO_###
%Err ###:XXX      PEPTIDE_ITRAQ_4_EPCT_###
###:XXX           PEPTIDE_ITRAQ_8_RATIO_###
%Err ###:XXX      PEPTIDE_ITRAQ_8_EPCT_###

Where ### denotes reporter ion (114, 115, 116, 117 for iTRAQ4 and 113, 114, 115, 116, 117, 118, 119, 121 for iTRAQ8). For an explanation of the ProteinCenter fields, refer to Appendix B, Supplementary Data.

All other fields in the peptide summary file will be ignored by the import routine.

21.3.2. MaxQuant from the Max Planck Institute of Biochemistry

The MaxQuant application exports a number of files for a given dataset. Quantitative data can be found in the exported files named proteinGroups.txt, peptides.txt and evidence.txt, which can all be imported into ProteinCenter. ProteinCenter cannot import data from more than one file into a single dataset, but datasets originating from different files can be imported separately and subsequently merged into a single dataset (see Section 9.1.3, “How to merge datasets”).

21.3.2.1. MaxQuant protein group files

Protein group files from MaxQuant (proteinGroups.txt) can be imported into ProteinCenter by selecting MaxQuant protein groups in the Import format drop-down box.

Table 21.6. Data imported from MaxQuant protein group files

CSV file column        ProteinCenter field
id                     CLUSTER
Protein IDs            KEY
Mascot Score           MASCOT_SCORE
Ratio H/L Normalized   SILAC_RATIO_2
Ratio H/M Normalized   SILAC_RATIO_3
Ratio M/L Normalized   SILAC_RATIO_4

For an explanation of the ProteinCenter fields, refer to Appendix B, Supplementary Data.

All other fields in the protein group file will be ignored by the import routine.

21.3.2.2. MaxQuant peptide files

Peptide files from MaxQuant (peptides.txt and evidence.txt) can be imported into ProteinCenter by selecting MaxQuant peptides in the Import format drop-down box. Peptide modifications are only imported from evidence.txt files (parsing the modified sequence column). This type of file also aggregates all supplementary data differently than peptides.txt files, by choosing the peptide data with the best Mascot score, between peptides with the same sequence and modifications.

Table 21.7. Data imported from MaxQuant peptide files

CSV file column        ProteinCenter field
Sequence               PEPTIDE
Modified Sequence      PEPTIDE
Proteins               KEY
Mascot Score           PEPTIDE_MASCOT_SCORE
Ratio H/L Normalized   PEPTIDE_SILAC_RATIO_2
Ratio H/M Normalized   PEPTIDE_SILAC_RATIO_3
Ratio M/L Normalized   PEPTIDE_SILAC_RATIO_4

For an explanation of the ProteinCenter fields, refer to Appendix B, Supplementary Data.

All other fields in the files will be ignored by the import routine.

21.3.3. Spectrum Mill from Agilent

From the protein/peptide summary page in Spectrum Mill, records containing accession keys, peptides and a variety of other data can be exported in a line-based format with semicolon-separated fields. This output may be imported directly into ProteinCenter, though only a few of the supplementary data are included (see table below). Certain common peptide modifications, such as phosphorylations and oxidations, are recognized as well.

To export summary data from Spectrum Mill to ProteinCenter follow this procedure:

  1. In the 'Review Fields' column of the protein/peptide summary page, check off Accession #, Sequence, Score, SPI (%), and Excel export.

    You may select additional fields, but they will be ignored by ProteinCenter.

    Warning

    Do not include the Sequence map field! In some versions of Spectrum Mill the presence of this field renders the export file impossible to parse. If its presence is detected, the import will be aborted by ProteinCenter.

  2. Press the Summarize button. The resulting page will look something like this:

    It includes a UNC-style filename ("\\SPECTRUM\...\peptideExport7.ssv"), which may be copied into ProteinCenter's import form.

    Note

    Remember to specify Spectrum Mill as the format of the import file.

When importing Spectrum Mill data into ProteinCenter, column headers are transformed according to this table:

Table 21.8. Data imported using the Spectrum Mill format

Spectrum Mill field   ProteinCenter field
Accession #           KEY
Sequence              PEPTIDE
Score                 PEPTIDE_SPECTRUMMILL_SCORE
SPI (%)               PEPTIDE_SCORED_PEAK_INTENSITY

21.3.4. ProteinLynx Global Server from Waters

When importing ProteinLynx data into ProteinCenter, column headers are transformed according to this table:

Table 21.9. Data imported using the ProteinLynx format

ProteinLynx field    ProteinCenter field
Accession            KEY
Peptides             NPEP
PLGS Score / Score   PLGS_SCORE
Probability (%)      PP

21.3.5. VEMS III software from SDU

VEMS III — Virtual Expert Mass Spectrometrist is a program for integrated proteome analysis. VEMS offers processing of raw data, MSMS database searches with many variable modifications, and clustering of quantitative results.

Certain VEMS results can be exported in ProteinCenter format.

VEMS allows export of protein identifications and includes quantitative data.

Note that currently only datasets with IPI references are supported for this export mode in VEMS III.

  1. In VEMS select the menu 'Analysis'

  2. Next select the menu item 'Annotation'

  3. Finally select 'Export to ProteinCenter'

More information about VEMS is available here: http://yass.sdu.dk/.

21.3.6. The Elucidator from Rosetta Biosoftware

The Elucidator Protein Expression Data Analysis System is a comprehensive solution for organizations involved in protein expression analysis. It meets the challenges of proteomics research by providing capabilities to support mass-spectrometry-based, gel-free protein biomarker discovery. The system includes raw data management, LC/MS data processing for quantitative and differential analysis, protein identification, and high level analysis at the peptide and protein level.

Protein identifications can be exported with associated data in a generic format that can be readily imported into ProteinCenter once the column headers have been modified.

For the available format options, see Table B.1, “Protein supplementary data”.

For more information about Elucidator, see http://www.rosettabio.com/.

21.3.7. ProteinProphet (Trans-Proteomic Pipeline) from ISB

ProteinCenter fully supports import of ProteinProphet files in ProtXML format, and importing them with the generic ProtXML importer is the preferred method. If this does not prove satisfactory, or if CSV output is preferred, ProteinProphet files can also be imported into ProteinCenter by saving the result in 'Excel' format (see below).

Of the various columns that may be output from ProteinProphet, 14 of the most essential columns can be easily mapped to ProteinCenter input fields. A few other columns of data may also be imported.

This is the workflow to get from ProteinProphet to ProteinCenter:

  1. In ProteinProphet, save your results as an XLS file.

  2. Open the result in Excel or another spreadsheet program.

  3. Change the names of headers according to the mapping given in this table:

    Table 21.10. Data headers in ProteinProphet and ProteinCenter

    ProteinProphet               ProteinCenter
    entry no.                    CLUSTER
    group probability            GD1
    protein                      KEY
    protein link                 L1
    protein probability          PP
    percent coverage             GD2
    xpress ratio mean            AQR
    xpress stdev                 QSD
    xpress num peptides          GI1
    xpress link                  L2
    tot num peps                 NPEP
    is nondegenerate evidence    UA
    peptide sequence             PEPTIDES
    peptide link                 L3

  4. You may import a few extra columns into ProteinCenter columns, e.g. GS# (# in 2-9), GI# (# in 2-9), PTM and IS. For further details, see Table B.1, “Protein supplementary data”.

  5. Remove columns that are not mapped and should not be imported.

  6. Save the file in the CSV format.

Note: One should ensure that there is only one KEY per entry line.

If you have multiple keys in the ProteinProphet field 'Protein', it is possible to:

  • Choose a representative to be used in column "KEY" and include the complete set in GS2.

  • Create a line for each protein, where you copy the remaining data for each protein. This way they still end up in the same group (cluster).
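
The header renaming and the second option above (one line per protein) can be sketched in Python as follows. This is only an illustration of the manual spreadsheet steps. It assumes that the ProteinProphet result has been saved as tab-separated text with the header names listed in Table 21.10, and that multiple keys in the 'protein' field are separated by commas; both the separator and the function name are assumptions, and the sample keys are invented.

```python
import csv
import io

# Header mapping taken from Table 21.10.
PROPHET_TO_PROTEINCENTER = {
    "entry no.": "CLUSTER",
    "group probability": "GD1",
    "protein": "KEY",
    "protein link": "L1",
    "protein probability": "PP",
    "percent coverage": "GD2",
    "xpress ratio mean": "AQR",
    "xpress stdev": "QSD",
    "xpress num peptides": "GI1",
    "xpress link": "L2",
    "tot num peps": "NPEP",
    "is nondegenerate evidence": "UA",
    "peptide sequence": "PEPTIDES",
    "peptide link": "L3",
}


def convert(text, in_delimiter="\t", key_separator=","):
    """Rename mapped headers, drop unmapped columns, write one KEY per line."""
    reader = csv.DictReader(io.StringIO(text), delimiter=in_delimiter)
    kept = [h for h in (reader.fieldnames or []) if h in PROPHET_TO_PROTEINCENTER]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([PROPHET_TO_PROTEINCENTER[h] for h in kept])
    for record in reader:
        # Assumption: several keys in one field are separated by key_separator.
        keys = [k.strip() for k in record["protein"].split(key_separator) if k.strip()]
        for key in keys:
            # Copy the remaining data for each protein so the CLUSTER value is shared.
            row = dict(record, protein=key)
            writer.writerow([row[h] for h in kept])
    return out.getvalue()


# Invented sample data: one entry with two keys and one unmapped column.
sample = (
    "entry no.\tprotein\tprotein probability\tunused\n"
    "1\tIPI00000001,IPI00000002\t0.99\tx\n"
)
csv_text = convert(sample)
```

Because both output lines carry the same CLUSTER value, the two proteins still end up in the same group.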

21.3.8. Sorcerer from SageNResearch

Currently, our solution for transfer of data from Sorcerer to ProteinCenter is the same as the one above for ProteinProphet.

For more information about Sorcerer, see http://www.sagenresearch.com/.

21.4. Importing data from XML-based formats

While CSV-based formats are very easy to use and quite versatile, they are limited in their expressive power. A new generation of file formats has emerged that can capture the highly structured output from, for example, search engines. These formats are based on the XML standard, and support for most of them has been added to ProteinCenter.

21.4.1. ProtXML

ProtXML files are imported in the same manner as CSV files; just choose protXML in the format drop-down box. ProteinCenter has been verified to process protXML files with schema versions 3, 4 and 5.

If the sample contains two or more peptides with the same sequence, but with different charge states, the entries are collapsed.

Example:

PepA:

<peptide 
    peptide_sequence="KDLYANTVLSGGTTM@YPGIADR"
    charge="2" initial_probability="1.00" 
    nsp_adjusted_probability="1.00"
    peptide_group_designator="a" weight="1.00" 
    is_nondegenerate_evidence="Y"
    n_tryptic_termini="2" n_sibling_peptides_bin="6" 
    n_instances="1" />

PepB:

<peptide 
    peptide_sequence="KDLYANTVLSGGTTM@YPGIADR"
    charge="3" initial_probability="1.00" 
    nsp_adjusted_probability="1.00"
    peptide_group_designator="a" weight="1.00" 
    is_nondegenerate_evidence="Y"
    n_tryptic_termini="2" n_sibling_peptides_bin="6" 
    n_instances="4" />

These two peptides will be imported as a single peptide with the sequence KDLYANTVLSGGTTMYPGIADR. The relevant supplementary values for this peptide will be an aggregation of the values for the individual peptides.
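
The collapsing of charge states can be sketched in Python as shown below. The manual does not specify how the supplementary values are aggregated, so the rules used here (collect the charges, sum n_instances, keep the highest nsp_adjusted_probability) are assumptions chosen for illustration only. The sketch also assumes peptide elements without an XML namespace and a hypothetical wrapping protein element.

```python
import re
import xml.etree.ElementTree as ET


def collapse_peptides(xml_text):
    """Group peptide elements by bare sequence, ignoring the charge state."""
    root = ET.fromstring(xml_text)
    collapsed = {}
    for pep in root.iter("peptide"):
        # Drop modification symbols such as '@' so only residue letters remain.
        sequence = re.sub(r"[^A-Z]", "", pep.get("peptide_sequence", ""))
        entry = collapsed.setdefault(
            sequence, {"charges": [], "n_instances": 0, "max_probability": 0.0}
        )
        # Assumed aggregation rules; ProteinCenter's actual rules may differ.
        entry["charges"].append(int(pep.get("charge")))
        entry["n_instances"] += int(pep.get("n_instances", "0"))
        entry["max_probability"] = max(
            entry["max_probability"],
            float(pep.get("nsp_adjusted_probability", "0")),
        )
    return collapsed


# PepA and PepB from the example above, reduced to the attributes used here.
sample = """
<protein>
  <peptide peptide_sequence="KDLYANTVLSGGTTM@YPGIADR" charge="2"
           nsp_adjusted_probability="1.00" n_instances="1" />
  <peptide peptide_sequence="KDLYANTVLSGGTTM@YPGIADR" charge="3"
           nsp_adjusted_probability="1.00" n_instances="4" />
</protein>
"""
result = collapse_peptides(sample)
```

With the two example peptides, the result holds a single entry for KDLYANTVLSGGTTMYPGIADR that records both charge states.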

21.4.1.1. ProteinProphet and iProphet protXML

Files in the ProteinProphet and iProphet protXML formats can be imported into ProteinCenter by choosing ProteinProphet protXML or IProphet protXML in the Format drop-down box.

21.4.1.2. Proteome Discoverer protXML

Files in the Proteome Discoverer protXML format can be imported into ProteinCenter by choosing Proteome Discoverer protXML in the Format drop-down box.

21.4.1.3. Scaffold protXML

Files in the Scaffold protXML format can be imported into ProteinCenter by choosing Scaffold protXML in the Format drop-down box.

For more information about Scaffold visit http://www.proteomesoftware.com/Proteome_software_prod_Scaffold.html.

21.4.2. Mascot XML

XML result files from the Mascot search engine can be imported into ProteinCenter.

ProteinCenter supports XML result files from both Mascot 2.1 and Mascot 2.2. Files from both versions can be imported by choosing Mascot XML in the file format drop-down box. Since Mascot 2.1 XML files do not specify the location of peptide modifications in the peptides, ProteinCenter datasets imported from Mascot 2.1 will not include peptide modifications. Datasets imported from Mascot 2.2 files will contain peptide modifications.

To get data from Mascot into ProteinCenter, follow these steps:

  1. In the Mascot interface, select Export Search Results in the Format As drop-down box:

    Figure 21.10. Exporting the Mascot Search Results from the Mascot application

  2. On the Export search results page, choose XML in the Export format drop-down box:

    Figure 21.11. Specifying what to include in the exported file

  3. Consider what to include in the general search information. Parts of the search information will be included in ProteinCenter by the import procedure and can subsequently be viewed in the μLIMS pane, as described in detail in Chapter 22, μLIMS. It is therefore recommended to include the Header, the Variable mod. info. and the Search parameters in the Search Information section.

  4. Scores for proteins are also included in ProteinCenter if the Score information is included in the exported Mascot file.

  5. To include peptides in ProteinCenter, make sure to include the PeptideMatchInformation:

    Figure 21.12. 


  6. Since the peptide sequences are used to map the peptides on the protein sequence, the Sequence information must be included in order to import peptide information into ProteinCenter.

  7. Having specified the information to be included, press the Export search results button to export the file.

    Figure 21.13. Exporting the search results

  8. A standard save box will be displayed. Save the file to the desired location.

  9. The exported file can now be imported into ProteinCenter as described in Section 21.1, “How to import data”, by choosing Mascot XML in the Format drop-down box.

21.4.2.1. Mascot Distiller XML

XML files exported from Mascot Distiller can be imported into ProteinCenter as described in Section 21.1, “How to import data”, by choosing Mascot Distiller XML in the Format drop-down box.

21.4.3. X! Tandem XML

The XML output from the X! Tandem software (http://www.thegpm.org/TANDEM/index.html) can be imported directly into ProteinCenter:

  1. At the top of the X! Tandem result page, a number of view options are listed. Right-click XML and choose Save Link as...:

    Figure 21.14. Saving the X! Tandem results as XML

  2. Make sure all peptide modification masses are given as monoisotopic.

  3. Once the file has been saved, it can be imported into ProteinCenter as described in Section 21.1, “How to import data”, by choosing X! Tandem in the Format drop-down box.

21.4.4. PRIDE XML

Files in the PRIDE format can be imported into ProteinCenter by choosing PRIDE in the Format drop-down box.

More information about the PRIDE database can be found at http://www.ebi.ac.uk/pride/. A number of public datasets in the PRIDE format can be found at http://www.ebi.ac.uk/pride/startBrowse.do, downloaded and imported into ProteinCenter. Since ProteinCenter does not use the spectra information, it makes more sense to download the 'Identifications only' files.

21.4.5. BioWorks XML

ProteinCenter can import files exported from the BioWorks 3.3 software. To get a dataset from BioWorks into ProteinCenter, follow these steps:

  1. In the BioWorks application, make sure that the Protein Information display option is chosen. This is done by right-clicking in the list view as displayed in the following figure:

    Figure 21.15. Selecting the Protein Information display in BioWorks 3.3

  2. Having displayed the Protein Information, export the dataset by right-clicking the list view and choosing Export -> XML...:

    Figure 21.16. Exporting to XML from BioWorks 3.3

  3. The exported file can now be imported into ProteinCenter as described in Section 21.1, “How to import data”, by choosing BioWorks XML in the Format drop-down box.

When importing a BioWorks XML file, all proteins and their matching peptides are imported into ProteinCenter, including protein and peptide scores. All variable peptide modifications are included, but fixed modifications (such as those on cysteine) are not. Information about the modifications used can be found on the μLIMS pane (see Chapter 22, μLIMS).

21.4.6. Quantitation support for XML formats

Quantitative data originating from LIBRA assays will be imported as generic quantitation ratios with standard errors (AQR#, QSE#), one for each 'mz' intensity entry. Quantitative data originating from ASAPRatio assays will be imported as a generic quantitation ratio triplet (AQR1, QSD1, NPEP1), while optional ASAPRatio_pvalue entries supplement this with an adjusted ratio, adjusted standard deviation and p-value (AQR2, QSD2, QPV2).

Proteome Discoverer XML will recognize precursor and reporter ion based quantitation on peptides as well as proteins, which will be imported as generic quantitation ratios with standard errors (AQR#, QSE#) and generic quantitation ratio triplets (AQR1, QSD1, NPEP1) respectively. Ratio variability reported as 'coefficient of variation' is converted to standard deviation or standard error.
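
The conversion from a coefficient of variation can be illustrated with the standard definitions: the coefficient of variation is the standard deviation divided by the mean, and the standard error is the standard deviation divided by the square root of the number of observations. The Python sketch below applies these definitions. It assumes the coefficient of variation is given in percent, and the manual does not state that ProteinCenter uses exactly these formulas; the function names are hypothetical.

```python
import math


def cv_to_sd(ratio, cv_percent):
    """Standard deviation from a ratio and its coefficient of variation in percent."""
    # Assumption: the CV is reported as a percentage of the ratio.
    return abs(ratio) * cv_percent / 100.0


def sd_to_se(sd, n):
    """Standard error from a standard deviation and the number of observations."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return sd / math.sqrt(n)


# Invented example: a ratio of 2.0 with a 10% CV, based on 4 observations.
sd = cv_to_sd(2.0, 10.0)
se = sd_to_se(sd, 4)
```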

21.5. Other formats

21.5.1. MSQuant

ProteinCenter offers experimental support for importing files generated by the MSQuant quantitative proteomics/mass spectrometry software.

To transfer data from MSQuant to ProteinCenter, follow these steps:

  1. In MSQuant, choose Export Proteins/Peptides... from the file menu.

  2. Depending on the results, select either all peptides or only the quantified and verified peptide subsets.

    Warning

    Leave the Include check box for protein information unchecked; otherwise the result will be rejected by ProteinCenter.

  3. Save the result as a Microsoft Office Excel Workbook (*.xls).

  4. The exported file can now be imported into ProteinCenter as described in Section 21.1, “How to import data”, by choosing MSQuant in the Format drop-down box.

    Warning

    Please don't make any changes to the Excel file exported from MSQuant before importing it into ProteinCenter.

More information on MSQuant is available from http://msquant.sourceforge.net/.

21.6. Direct upload from other applications

21.6.1. Phenyx

Data contained in Phenyx Jobs can be uploaded directly to ProteinCenter from the Proteins Overview window:

Figure 21.17. The Proteins Overview page in the Phenyx application

  1. When you click Open Job in ProteinCenter, the proteins in the Job set are uploaded to ProteinCenter, and the Clusters page is displayed in a new window:

    Figure 21.18. The uploaded proteins displayed in ProteinCenter

    The data can now be used as any other dataset in ProteinCenter. Note that only peptides marked as valid in Phenyx will be included in the ProteinCenter dataset.

  2. When you click Refresh Job in ProteinCenter, the proteins in the Job are re-uploaded to ProteinCenter. That is, a new dataset is created in ProteinCenter, and the dataset from the previous upload remains present.

  3. For a particular protein in Phenyx the ProteinCenter ProteinCard can be brought up with a single click:

    Figure 21.19. Bringing up the ProteinCard for a particular protein in ProteinCenter

    If the dataset has already been uploaded to ProteinCenter, the experimental peptides uploaded will be shown on the ProteinCard. If not, the ProteinCard will still be displayed, but without experimental data.

For more information on the Phenyx software product, visit http://www.genebio.com/products/phenyx/, or the Phenyx-ProteinCenter integration wiki: http://gbwiki.genebio.com/mediawiki/index.php/User_Manual_ProteinCenter.

21.7. Identification of proteins and peptides

The supported applications and data import formats above have different ways of reporting the quality of protein and peptide identifications. A general measure of identification is probability, but some formats report expectation values, p-values or special scores instead. The table below lists the primary measure of identification for proteins and peptides, imported from the different formats. Clicking an identification measure brings up the ProteinCenter supplementary data type representing it. The Peptide rank column indicates whether the peptide with the highest or lowest identification measure value is preferred during import. If there are no useful measures on which to base a preference, 'First' indicates that the first peptide encountered is chosen.
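
The Peptide rank preference can be sketched in Python as follows. This is an illustration only: the rank labels "Highest" and "Lowest", the dictionary layout and the function name are assumptions, and only 'First' is a value named in the text above.

```python
def pick_preferred(peptides, rank):
    """Choose one peptide among duplicates according to the Peptide rank rule."""
    if not peptides:
        raise ValueError("no peptides to choose from")
    if rank == "First":
        # No useful measure: keep the first peptide encountered.
        return peptides[0]
    if rank == "Highest":
        return max(peptides, key=lambda p: p["measure"])
    if rank == "Lowest":
        return min(peptides, key=lambda p: p["measure"])
    raise ValueError("unknown rank: " + rank)


# Invented example: the same peptide reported with different measure values.
candidates = [
    {"sequence": "LVNELTEFAK", "measure": 0.80},
    {"sequence": "LVNELTEFAK", "measure": 0.95},
    {"sequence": "LVNELTEFAK", "measure": 0.10},
]
best_probability = pick_preferred(candidates, "Highest")
best_expectation = pick_preferred(candidates, "Lowest")
first_seen = pick_preferred(candidates, "First")
```

A probability-like measure would typically use the highest value, while an expectation value or p-value would use the lowest.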

Table 21.11. Primary measure of identification for proteins and peptides