ProteinCenter User Manual

Chapter 23. Export

Table of Contents

23.1. How to export data
23.1.1. Exporting the contents of the basket
23.2. Data formats for exported data
23.2.1. Proteins CSV / Protein Data CSV formats
23.2.2. Protein Genes CSV format
23.2.3. Genes CSV / Gene Data CSV formats
23.2.4. Peptides CSV format
23.2.5. Protein FASTA format
23.2.6. Protein interactions format
23.3. Using exported data
23.3.1. Analyzing interaction networks in Cytoscape

Datasets can be exported for file storage, data exchange, archival, or further analysis in other applications.

23.1. How to export data

Figure 23.1. The Export Pane


To export a selected dataset, follow the steps below. The step numbers correspond to the numbered areas in Figure 23.1, “The Export Pane”:

  1. Select the Export pane.

  2. Preview the number of experimental data points or proteins that will be exported.

  3. Choose the export format (see Section 23.2, “Data formats for exported data”).

  4. Choose whether to include only the selected data or all data.

  5. Choose whether to include only the data from cluster anchors.

  6. Choose a predefined set of columns.

  7. If the predefined set of columns does not fit your needs, manually adjust them here.

  8. Press one of the Export buttons to export the dataset.

Upon pressing Export, the file can be saved or opened in a program of choice.

Windows allows a file type to be associated with a particular program.

23.1.1. Exporting the contents of the basket

The contents of the basket cannot be exported directly, but you can save the basket as a dataset (by clicking the save icon in the basket pane), and subsequently open and export the dataset.

23.2. Data formats for exported data

Data can currently be exported in the standard FASTA format, or as CSV flat files (Comma-Separated Values for use in e.g. spreadsheets). The following data formats are supported:

  • Protein Data CSV: One line for every experimental data point, with the optional annotation specified in Table 23.1, “Data description for proteins and protein data CSV export formats”. Exported rows are ordered according to the current sorting on the Protein Data View.

  • Proteins CSV: One line for every protein, with the optional annotation specified in Table 23.1, “Data description for proteins and protein data CSV export formats”. Data are aggregated on the protein level, which means that IMPORTKEY, CLUSTER, PEPTIDES and string type supplementary data are accumulated as lists of unique items, and numeric supplementary data are given with descriptive statistics. For comparisons, the descriptive statistics for each numeric supplementary data will be specified per dataset. Exported rows are ordered according to the current sorting on the Proteins View.

  • Protein Genes CSV: One line for every gene, with the optional annotation specified in Table 23.2, “Data description for gene-centric CSV export format, where protein annotations are aggregated onto genes”. Data are aggregated on the gene level, which means that all types of annotation in the protein-based export above are summarized over all proteins of a particular gene. String type supplementary data are accumulated as lists of unique items, and numeric supplementary data are given with descriptive statistics. For comparisons, the descriptive statistics for each numeric supplementary data will be specified per dataset. Exported rows are ordered according to the current sorting on the Genes View.

  • Proteins FASTA: One record (a header line followed by a sequence line) for every protein, with the FASTA output format specified in Section 23.2.5, “Protein FASTA format”.

  • Protein interactions SIF: One line for every protein interacting with other proteins, with the Cytoscape SIF output format specified in Section 23.2.6, “Protein interactions format”.

  • Peptides CSV: One line for every peptide (with modifications), with the optional annotation specified in Table 23.4, “Data description for peptide-centric CSV export format”. Data are aggregated at the peptide level, which means that PROTEINS and string type supplementary data are accumulated as lists of unique items, and numeric supplementary data are given as the average value. For comparisons, the average value for each numeric supplementary data will be specified per dataset. The peptide-based export is only available if the selected data contains peptides.

If the option to include only the selected data is chosen and no data is selected, the Export button will be disabled.

23.2.1. Proteins CSV / Protein Data CSV formats

Table 23.1. Data description for proteins and protein data CSV export formats

Header Title | Name & data type | Description
KEY | Preferred canonical key | The preferred canonical key, as specified by the accession preference settings (see Section 4.3, “Preferred accession”)
IMPORTKEY | Import key | The key by which the protein was originally identified
GI, UNIPROT, .. | Primary protein source key | Key for primary protein sources, as given by Table A.1, “Protein Data Sources”
STATUS | Outdated (as String) | An outdated protein (i.e. a protein with no remaining live sources) is denoted as 'Outdated', otherwise an empty field
TAX | Taxonomy | The taxonomy name
DESC | Description | Description for the protein
GENE | Primary gene identifier | The identifier from the primary source that establishes the gene (EntrezGene or Ensembl)
GENE Entrez | EntrezGene identifier | The EntrezGene identifier, if one exists for the gene
GENE Ensembl | Ensembl identifier | The Ensembl gene identifier, if one exists for the gene
GENE <alternative source> | Alternative gene keys | Gene synonyms from other data sources, as given by Table A.2, “Gene Data Sources”
CHROMOSOME | Chromosome number | Identifier for the chromosome on which the gene resides
CHROMOSOMEMAP | Chromosome map address | Detailed chromosome location
GENESUMMARY | Gene summary description | The description of the gene from the original source
Data Sets (multi-column) | Dataset names | A column for each dataset, with an 'X' in a column denoting that the protein is represented in that dataset
PFAM | PFAM IDs (multi-value) | PFAM identifiers for the protein
TM | Transmembrane domains (as Integer) | The number of transmembrane domains
SIGNAL | Signal protein (as String) | Column will contain an 'S' if protein is a signal protein - otherwise empty field
GO_CC | GO IDs (multi-value) | All GO IDs categorized under Cellular Components
GO_BP | GO IDs (multi-value) | All GO IDs categorized under Biological Processes
GO_MF | GO IDs (multi-value) | All GO IDs categorized under Molecular Functions
GO_SLIM_CC | GO slim names (multi-value) | All GO slim names categorized under Cellular Components
GO_SLIM_BP | GO slim names (multi-value) | All GO slim names categorized under Biological Processes
GO_SLIM_MF | GO slim names (multi-value) | All GO slim names categorized under Molecular Functions
ALTSPLICED | The product of a gene with alternative splice forms (as String) | If protein is from an alternatively spliced gene, an AS will be shown in field - otherwise an empty field
LENGTH | Length (as Integer) | The length of the protein sequence
MONOMASS | Monoisotopic mass | The monoisotopic mass of the protein sequence (including peptide modification masses for protein data)
AVGMASS | Average mass | The average mass of the protein sequence (including peptide modification masses for protein data)
SEQ | Protein sequence | The amino acid sequence of the protein
UNIQPEPS | Unique peptide count | Number of unique peptides (including modifications)
UNIQPEPSEQS | Unique peptide sequence count | Number of unique peptide sequences (ignoring modifications)
COVERAGE | Peptide sequence coverage | Percentage of protein sequence covered by peptide sequences
PEPTIDES | Peptide sequences (multi-value) | Peptide sequences for protein, possibly with modifications encoded (see note below)
CLUSTER | Cluster identifier | Empty if not in any group
Supplementary Data (multi-column) | Integers, Doubles, Strings | The supplementary data defined for the dataset (see appendix Table B.1, “Protein supplementary data”)

Dataset columns are only exported for comparisons. Multi-column data types can span multiple columns (datasets and supplementary data), and multi-value data types can contain multiple values (e.g. GO_MF, GO_CC, GO_BP, PFAM, PEPTIDES), enclosed in quotes and separated by commas.
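
As an illustration of the multi-value convention described above, the following Python sketch reads a CSV export and splits quoted multi-value fields into lists. The sample rows are invented for illustration, and the function name `read_export` and the set `MULTI_VALUE_COLUMNS` are assumptions, not part of ProteinCenter.

```python
import csv
import io

# Hypothetical two-row excerpt of a Proteins CSV export (values invented for illustration).
SAMPLE = '''KEY,TAX,GO_MF,PFAM
12345,Homo sapiens,"GO:0005515,GO:0003677","PF00001,PF00002"
67890,Mus musculus,,PF00003
'''

# Columns treated as multi-value in this sketch; adjust to the columns actually exported.
MULTI_VALUE_COLUMNS = {"GO_MF", "GO_CC", "GO_BP", "PFAM", "PEPTIDES"}


def read_export(text):
    """Return one dict per row, with multi-value fields split into lists."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for name in MULTI_VALUE_COLUMNS:
            if name in row:
                value = row[name]
                # An empty field becomes an empty list rather than [''].
                row[name] = [v.strip() for v in value.split(",")] if value else []
        rows.append(row)
    return rows


rows = read_export(SAMPLE)
print(rows[0]["GO_MF"])  # ['GO:0005515', 'GO:0003677']
```

The standard `csv` module handles the surrounding quotes, so the commas inside a multi-value field are not mistaken for column separators.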

Peptide sequences with modifications encoded have phosphorylation sites prepended with a 'p'. Modifications on the n- and c-terminals are prepended to residues given as 'n' and 'c' respectively, but terminals are only visible when they are modified.
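
The phosphorylation encoding can be decoded with a few lines of Python. This is a minimal sketch that assumes residues are written as upper-case letters and that a lower-case 'p' immediately precedes each phosphorylated residue; the function name `decode_peptide` and the example peptide are hypothetical, and terminal markers ('n', 'c') are simply skipped rather than interpreted.

```python
def decode_peptide(encoded):
    """Split an encoded peptide into its bare sequence and phosphosite positions.

    Assumption: residues are upper-case and a lower-case 'p' marks the
    residue that follows it as phosphorylated. Other lower-case characters
    (such as the 'n' and 'c' terminal markers) are skipped in this sketch.
    """
    residues = []
    phospho_sites = []  # 1-based positions in the bare sequence
    pending_phospho = False
    for ch in encoded:
        if ch == "p":
            pending_phospho = True
        elif ch.isupper():
            residues.append(ch)
            if pending_phospho:
                phospho_sites.append(len(residues))
                pending_phospho = False
        else:
            # Any other marker is ignored and cancels a pending 'p'.
            pending_phospho = False
    return "".join(residues), phospho_sites


# Hypothetical encoded peptide: the serine at position 3 is phosphorylated.
print(decode_peptide("AApSPEK"))  # ('AASPEK', [3])
```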

23.2.1.1. Protein supplementary data

All numeric supplementary data exported on the protein level can have several protein data values in the general case, and these will be reported as descriptive statistics with average ('avg'), standard deviation ('std'), minimum ('min'), and maximum ('max').

Textual supplementary data with multiple values (e.g. GS1, generic string 1) will be enclosed in quotes and separated by commas.

For comparisons on the protein level, the descriptive statistics for each numeric supplementary data will be specified per dataset.
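
To make the four reported statistics concrete, the sketch below computes them for a list of values. The manual does not state whether 'std' is the sample or the population standard deviation, so the sample form used here is an assumption, as is the function name `describe`.

```python
import statistics


def describe(values):
    """Return avg, std, min and max for a sequence of numbers.

    Assumption: 'std' is the sample standard deviation; a single value
    is reported with a standard deviation of 0.0.
    """
    values = list(values)
    return {
        "avg": statistics.mean(values),
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


print(describe([1.0, 2.0, 3.0]))
```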

23.2.2. Protein Genes CSV format

Table 23.2. Data description for gene-centric CSV export format, where protein annotations are aggregated onto genes

Header Title | Name & data type | Description
GENE | Primary gene identifier | The identifier from the primary source that establishes the gene (EntrezGene or Ensembl)
GENE Entrez Gene | EntrezGene identifier | The EntrezGene identifier, if one exists for the gene
GENE Ensembl | Ensembl identifier | The Ensembl gene identifier, if one exists for the gene
GENE <alternative source> | Alternative gene keys | Gene synonyms from other data sources, as given by Table A.2, “Gene Data Sources”
CHROMOSOME | Chromosome number | Identifier for the chromosome on which the gene resides
CHROMOSOMEMAP | Chromosome map address | Detailed chromosome location
GENESUMMARY | Gene summary description | The description of the gene from the original source
PROTEINKEY | Preferred canonical keys | The preferred canonical keys for the proteins of the gene, as specified by the accession preference settings (see Section 4.3, “Preferred accession”)
IMPORTKEY | Import keys | The keys by which the proteins of the gene were originally identified
GI, UNIPROT, .. | Primary protein source key | Primary protein source key for the gene's proteins, as given by Table A.1, “Protein Data Sources”
STATUS | Outdated (as String) | If no proteins of the gene are alive, this field is denoted as 'Outdated'; If only some are outdated, it will append this number in parentheses; If all proteins of the gene are alive, this field is empty
TAX | Taxonomy | The taxonomy names for the proteins of the gene
Data Sets (multi-column) | Dataset names | A column for each dataset, with an 'X' in a column denoting that the gene is represented in that dataset
PFAM | PFAM IDs (multi-value) | PFAM identifiers for all proteins of the gene
TM | Transmembrane domains (as Integer) | The combined number of transmembrane domains for all proteins of the gene
SIGNAL | Signal protein (as String) | If all proteins of the gene are signal predicted, this field is denoted as 'S'; If only some are signal predicted, it will append this number in parentheses; If none are signal predicted, this field is empty
GO_CC | GO IDs (multi-value) | All GO IDs categorized under Cellular Components for all proteins of the gene
GO_BP | GO IDs (multi-value) | All GO IDs categorized under Biological Processes for all proteins of the gene
GO_MF | GO IDs (multi-value) | All GO IDs categorized under Molecular Functions for all proteins of the gene
GO_SLIM_CC | GO slim names (multi-value) | All GO slim names categorized under Cellular Components for all proteins of the gene
GO_SLIM_BP | GO slim names (multi-value) | All GO slim names categorized under Biological Processes for all proteins of the gene
GO_SLIM_MF | GO slim names (multi-value) | All GO slim names categorized under Molecular Functions for all proteins of the gene
ALTSPLICED | The product of a gene with alternative splice forms (as String) | If all proteins of the gene are denoted as having alternative splice forms, this field is denoted as 'AS'; If only some are alternatively spliced, it will append this number in parentheses; If none are alternatively spliced, this field is empty
UNIQPEPS | Unique peptide count | Number of unique peptides (including modifications) for all proteins of the gene
UNIQPEPSEQS | Unique peptide sequence count | Number of unique peptide sequences (ignoring modifications) for all proteins of the gene
COVERAGE | Peptide sequence coverage | Average percentage of protein sequence covered by peptide sequences for all proteins of the gene
PEPTIDES | Peptide sequences (multi-value) | Peptide sequences for all proteins of the gene, possibly with modifications encoded (see note below)
CLUSTER | Cluster identifier | The cluster identifiers for all the proteins of the gene - Empty if not in any group
Supplementary Data (multi-column) | Integers, Doubles, Strings | The supplementary data defined for the dataset, aggregated on the gene (see appendix Table B.1, “Protein supplementary data”)

Dataset columns are only exported for comparisons. Multi-column data types can span multiple columns (datasets and supplementary data), and multi-value data types can contain multiple values (e.g. GO_MF, GO_CC, GO_BP, PFAM, PEPTIDES), enclosed in quotes and separated by commas.

Peptide sequences with modifications encoded have phosphorylation sites prepended with a 'p'. Modifications on the n- and c-terminals are prepended to residues given as 'n' and 'c' respectively, but terminals are only visible when they are modified.

23.2.2.1. Gene aggregated protein supplementary data

All supplementary data exported for genes are aggregated from all the proteins of each gene. Numeric supplementary data will be reported as descriptive statistics with average ('avg'), standard deviation ('std'), minimum ('min'), and maximum ('max').

Textual supplementary data with multiple values will be enclosed in quotes and separated by commas.

For comparisons, the descriptive statistics for each numeric supplementary data will be specified per dataset.

23.2.3. Genes CSV / Gene Data CSV formats

Table 23.3. Data description for gene-centric CSV export format, where genes and their annotations originate from gene data (not aggregated from proteins)

Header Title | Name & data type | Description
GENE | Primary gene identifier | The identifier from the primary source that establishes the gene (EntrezGene or Ensembl)
IMPORTKEY | Import key | The key by which the gene was originally identified
GENE Entrez Gene | EntrezGene identifier | The EntrezGene identifier, if one exists for the gene
GENE Ensembl | Ensembl identifier | The Ensembl gene identifier, if one exists for the gene
GENE <alternative source> | Alternative gene keys | Gene synonyms from other data sources, as given by Table A.2, “Gene Data Sources”
CHROMOSOME | Chromosome number | Identifier for the chromosome on which the gene resides
CHROMOSOMEMAP | Chromosome map address | Detailed chromosome location
GENESUMMARY | Gene summary description | The description of the gene from the original source
PROTEINKEY | Preferred canonical keys | The preferred canonical keys for the proteins of the gene, as specified by the accession preference settings (see Section 4.3, “Preferred accession”)
GI, UNIPROT, .. | Primary protein source key | Primary protein source key for the gene's proteins, as given by Table A.1, “Protein Data Sources”
STATUS | Outdated (as String) | An outdated gene (i.e. a gene with no remaining live sources) is denoted as 'Outdated', otherwise an empty field
TAX | Taxonomy | The taxonomy name registered for the gene
Data Sets (multi-column) | Dataset names | A column for each dataset, with an 'X' in a column denoting that the gene is represented in that dataset
PFAM | PFAM IDs (multi-value) | PFAM identifiers for all proteins of the gene
TM | Transmembrane domains (as Integer) | The combined number of transmembrane domains for all proteins of the gene
SIGNAL | Signal protein (as String) | If all proteins of the gene are signal predicted, this field is denoted as 'S'; If only some are signal predicted, it will append this number in parentheses; If none are signal predicted, this field is empty
GO_CC | GO IDs (multi-value) | All GO IDs categorized under Cellular Components for the gene
GO_BP | GO IDs (multi-value) | All GO IDs categorized under Biological Processes for the gene
GO_MF | GO IDs (multi-value) | All GO IDs categorized under Molecular Functions for the gene
GO_SLIM_CC | GO slim names (multi-value) | All GO slim names categorized under Cellular Components for the gene
GO_SLIM_BP | GO slim names (multi-value) | All GO slim names categorized under Biological Processes for the gene
GO_SLIM_MF | GO slim names (multi-value) | All GO slim names categorized under Molecular Functions for the gene
ALTSPLICED | Alternative splice forms (as String) | If the gene is the source of alternatively spliced products, this field is denoted as 'AS'; If only some are alternatively spliced, it will append this number in parentheses; If none are alternatively spliced, this field is empty
Supplementary Data (multi-column) | Integers, Doubles, Strings | The gene supplementary data defined for the dataset (see appendix Table B.2, “Gene supplementary data”)

Dataset columns are only exported for comparisons. Multi-column data types can span multiple columns (datasets and supplementary data), and multi-value data types can contain multiple values (e.g. GO categories), enclosed in quotes and separated by commas.

23.2.3.1. Gene supplementary data

All supplementary data exported for genes are aggregated from all the gene data. Numeric supplementary data will be reported as descriptive statistics with average ('avg'), standard deviation ('std'), minimum ('min'), and maximum ('max').

Textual supplementary data with multiple values will be enclosed in quotes and separated by commas.

For comparisons, the descriptive statistics for each numeric supplementary data will be specified per dataset.

23.2.4. Peptides CSV format

Table 23.4. Data description for peptide-centric CSV export format

Header Title | Name & data type | Description
PEPTIDE | Sequence String | The unmodified peptide sequence
MODIFICATIONS | List of peptide modifications | All the modifications and their positions in the peptide, encoded as modificationType@position. The modificationType is substituted with a raw mass value, if it has not been identified as a standard modification. The field is empty if the peptide is unmodified
PROTEINS | List of protein keys | All the proteins containing the peptide
Supplementary Data (multi-column) | Integers, Doubles, Strings | The peptide supplementary data defined for the dataset (see appendix Table B.3, “Peptide supplementary data”)

Dataset columns are only exported for comparisons. Multi-column data types can span multiple columns (datasets and supplementary data), and multi-value data types can contain multiple values (PEPTIDES), enclosed in quotes and separated by commas.
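
The modificationType@position encoding of the MODIFICATIONS column can be unpacked as sketched below. The manual does not specify how several modifications are separated within the field or how positions are written, so the comma separator and integer positions are assumptions, as are the function name `parse_modifications` and the example values.

```python
def parse_modifications(field, separator=","):
    """Parse a MODIFICATIONS field into (modification, position) pairs.

    Assumptions (not confirmed by the manual): entries are separated by
    commas and positions are integers.
    """
    parsed = []
    for entry in field.split(separator):
        entry = entry.strip()
        if not entry:
            continue  # unmodified peptide: the field is empty
        mod_type, _, position = entry.rpartition("@")
        try:
            # Unidentified modifications are given as a raw mass value.
            modification = float(mod_type)
        except ValueError:
            modification = mod_type
        parsed.append((modification, int(position)))
    return parsed


# Hypothetical field: a named modification and a raw mass value.
print(parse_modifications("Phospho@3,15.9949@7"))  # [('Phospho', 3), (15.9949, 7)]
```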

23.2.4.1. Peptide supplementary data

All numeric supplementary data exported on the peptide level can have several values in the general case, and these will be reported as the average value.

Textual supplementary data with multiple values, e.g. GS1 (generic string 1), will be enclosed in quotes and separated by commas.

For comparisons on the peptide level, the average value for each numeric supplementary data will be specified per dataset.

23.2.5. Protein FASTA format

To export in FASTA format, choose Proteins FASTA in the drop-down box (3) in Figure 23.1, “The Export Pane”. The procedure for exporting FASTA files is similar to the CSV export, except that the format itself is fixed, so the manual column adjustment (7) in Figure 23.1 is not shown.

Each FASTA header consists of the protein key and a description separated by a vertical bar ( | ) and a space. The sequence is printed on the following line:

>89083017| hypothetical protein MED_14403
MLNQNKETPAPQDDEEGSVYILSDN

Note that each sequence is printed on a single line, and that the protein key is chosen according to the list of preferred protein sources as discussed in Section 4.1, “The basics of a protein record”. In the example above, the preferred key was GI. If a protein does not have a GI key, the system looks for the next key in the prioritized list of preferred keys, and continues in this manner until it finds an existing key. The same algorithm is used when displaying proteins in the Proteins View (Chapter 14, Proteins view), which ensures that exported FASTA files contain the same keys as shown in the Proteins View.
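
As a minimal sketch of reading such a file back in, the following Python function splits each header at the first vertical bar into a key and a description and collects the sequence that follows. The function name `parse_fasta` and the returned dictionary layout are assumptions for illustration, not part of ProteinCenter.

```python
def parse_fasta(text):
    """Parse FASTA text with 'key| description' headers into a list of records."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header: protein key and description separated by a vertical bar.
            key, _, description = line[1:].partition("|")
            records.append(
                {"key": key.strip(), "description": description.strip(), "sequence": ""}
            )
        elif records:
            # Sequence line; appending also tolerates wrapped sequences.
            records[-1]["sequence"] += line
    return records


example = ">89083017| hypothetical protein MED_14403\nMLNQNKETPAPQDDEEGSVYILSDN\n"
for record in parse_fasta(example):
    print(record["key"], len(record["sequence"]))  # 89083017 25
```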

23.2.6. Protein interactions format

Export IntAct, MIPS and STRING interactions in Cytoscape SIF format by choosing Protein interactions SIF in the drop-down box (3) in Figure 23.1, “The Export Pane”. The procedure for exporting interaction files is similar to the other exports, except that the format itself is fixed, so the manual column adjustment (7) in Figure 23.1 is not shown. Instead, it is possible to limit the output by specifying a minimum interaction score (range: 0 to 1000).

The file has no header line, as required by the Cytoscape SIF format. Each line consists of a protein key, the generic protein-protein interaction relationship type ('pp'), and a space-delimited list of the proteins with which it interacts:

4557485 pp 7657100 Q8CL89 4557321 121039

The protein keys are chosen according to the list of preferred protein sources as discussed in Section 4.1, “The basics of a protein record”. In the example above, the preferred key was GI.

The interacting networks of the proteins can be browsed and analyzed in a graph-oriented way, by importing the file into Cytoscape as a generic SIF network.
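
Outside Cytoscape, the same file can be read with a short script. The sketch below builds a mapping from each source protein to the set of proteins it interacts with, following the line layout described above; the function name `parse_sif` and the nested dictionary structure are assumptions made for illustration.

```python
def parse_sif(text):
    """Parse SIF lines of the form 'source relation target1 target2 ...'.

    Returns {source: {relation: set_of_targets}}.
    """
    network = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank lines and nodes without interaction partners
        source, relation, targets = fields[0], fields[1], fields[2:]
        network.setdefault(source, {}).setdefault(relation, set()).update(targets)
    return network


network = parse_sif("4557485 pp 7657100 Q8CL89 4557321 121039\n")
print(sorted(network["4557485"]["pp"]))  # ['121039', '4557321', '7657100', 'Q8CL89']
```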

23.3. Using exported data

The exported data files may be imported into the system again, without any modifications to the file, using the generic CSV importer. See Chapter 21, Import for further details on how to import the data again.

Likewise, CSV export may be used for individual backup, although it's strongly recommended that this is not used as a substitute for regular backups at the system level.

Export can also be used to archive data that is no longer expected to be used.

23.3.1. Analyzing interaction networks in Cytoscape

To browse and analyze protein networks in Cytoscape, export interaction data as described in Section 23.2.6, “Protein interactions format”. Import the resulting SIF file into Cytoscape as a 'network' file (File->Import->Network (Multiple File Types)), optionally applying a preferred layout to the network graph.

Any data exported in the formats described in Table 23.1, “Data description for proteins and protein data CSV export formats” can be imported into the Cytoscape nodes as 'node attributes'. Include the preferred canonical key as the first column, because Cytoscape uses it as the node identifier to which the row of data is attached. Alternative identifiers can be imported as 'aliases', and multi-value columns are imported as lists. The necessary options are specified in Cytoscape upon selecting File->Import->Attribute from Table (Text/MS Excel).
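
If an exported CSV file does not already have the preferred canonical key in the first column, the columns can be reordered before the attribute import. The Python sketch below does this with the standard `csv` module; the function name `key_first` and the sample rows are hypothetical, and the header name KEY is taken from Table 23.1.

```python
import csv
import io


def key_first(text, key_column="KEY"):
    """Return CSV text with key_column moved to the first position."""
    reader = csv.DictReader(io.StringIO(text))
    fields = [key_column] + [f for f in reader.fieldnames if f != key_column]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Hypothetical export excerpt in which KEY is not the first column.
print(key_first("TAX,KEY,TM\nHomo sapiens,12345,2\n"))
```

Quoted multi-value fields are preserved, because the writer re-quotes any field that contains a comma.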

Subsequently, select the attributes of interest to visualize, and select individual nodes to have their attributes shown in the 'Data Panel'.

The network and protein attributes can now be navigated and manipulated with Cytoscape's built-in capabilities, such as graph algorithms and various annotation-based plugins. In a simple VizMapper visualization, for example, a ProteinCenter supplementary data type (imported as a Cytoscape attribute) can be used to color-code nodes according to the quantitative value.