ProteinCenter User Manual

Chapter 17. Profiling View

Table of Contents

17.1. Profiling
17.1.1. The profiling view
17.1.2. Profiling algorithm details
17.2. How to profile datasets
17.2.1. Profiling a comparison or dataset

17.1. Profiling

ProteinCenter offers several alternative ways of partitioning protein data (datasets, filtering, clustering), giving the user different views for directed data analysis. Profiling is another way of partitioning proteins, peptides, clusters, genes and their annotations, based on the characteristics of their supplementary data across datasets. Any set of numeric supplementary data types can be used, but quantitative measures of expression are usually the best candidates (e.g. relative ratios from iTRAQ or SILAC assays).

Profiling can operate directly on the protein, peptide, cluster, gene or chromosome level, depending on the focus of the analysis. On the protein and cluster level, profiling can identify proteins and protein groups involved in related processes and pathways, while profiling on the peptide level can distinguish similar dynamic relations for different isoforms, post-translational modifications and other modifications. Profiling can also operate indirectly on the annotations of the proteins, aggregating supplementary data values on annotation categories (e.g. KEGG pathways) instead of on the proteins themselves. The supported types of annotation include: GO Slim and GO terms (with and without more general terms), KEGG and UniProt pathways, enzyme codes (with and without more general terms), PFAM and InterPro domains, as well as diseases and keywords. In the rest of this chapter, any such set of supplementary data for a certain category (regardless of type) is termed a 'profile'.

Profiles that lack values for some of the selected supplementary data categories can, up to a user-specified limit, have these values filled in prior to profiling (a process known as imputation).

When comparisons are selected for profiling, the analysis spans datasets as well as supplementary data, enabling datasets to be treated as 'snapshots' and a comparison of these datasets as a time series of expression dynamics.

The primary action of profiling is a 'soft' clustering based on the selected supplementary data values and datasets. In conventional ('hard') clustering (see Chapter 16, Clusters View), each protein has a single, exclusive membership of one particular cluster. Soft clustering is a more general partitioning approach, in which each clusterable object (peptide/protein/cluster) belongs to every soft cluster group to a certain degree. In this regard, hard clustering can be viewed as a special type of soft clustering, where only full (1.0) or no (0.0) membership is allowed.
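The relation between soft and hard membership can be illustrated with a small sketch. The identifiers and membership values below are hypothetical, and the sketch assumes (as in standard fuzzy c-means) that the memberships of one object sum to roughly 1.0 across groups.

```python
# Hypothetical soft memberships: every protein belongs to every group to some degree.
soft_memberships = {
    "P1": [0.90, 0.07, 0.03],
    "P2": [0.20, 0.45, 0.35],
}

def harden(memberships):
    """Collapse soft memberships into hard ones: 1.0 for the best group, 0.0 elsewhere."""
    best = max(range(len(memberships)), key=lambda g: memberships[g])
    return [1.0 if g == best else 0.0 for g in range(len(memberships))]

# Hard clustering seen as the special case with only 1.0 and 0.0 memberships.
hard_memberships = {name: harden(m) for name, m in soft_memberships.items()}
```

Note that hardening discards the information that "P2" sits almost equally close to the second and third group, which is exactly what the soft representation preserves.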

17.1.1. The profiling view

The basic profiling procedure consists of selecting a suitable set of supplementary data, setting the data processing and soft clustering parameters, and browsing the resulting partitioning of profiles (see Section 17.2, “How to profile datasets” for a more elaborate how-to). In the profiling view, each soft cluster is presented as a graph depicting the profiles of all its member peptides, proteins, clusters, genes, chromosomes or annotation categories. Each profile consists of the (normalized) values across the selected supplementary data. The number of profiles shown is subject to a post-processing filter parameter (called alpha core), which filters out any profiles with a cluster membership below that threshold. This filter allows the analyst to focus on the profiles that best represent the group characteristics.
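As an illustration of the alpha core filter, the following minimal sketch keeps only the profiles whose membership in a group reaches the threshold. The function name, accessions and membership values are hypothetical; the memberships themselves are left unchanged, so no re-clustering is involved.

```python
def alpha_core(group_memberships, alpha):
    """Return the profiles whose membership in this group is at least alpha."""
    return {pid: u for pid, u in group_memberships.items() if u >= alpha}

# Hypothetical memberships of three profiles in one soft cluster group.
group = {"P12345": 0.91, "Q67890": 0.55, "O11111": 0.12}
core = alpha_core(group, alpha=0.5)
```

Raising `alpha` narrows the displayed group to its core; lowering it shows more loosely associated profiles.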

Figure 17.1. The profiling view


  1. Result Summary - An overview of the basic result statistics of the algorithm:

    • The resulting number of groups. Groups that are empty after applying the alpha core filter will not be shown, but will be counted separately here.

    • The number of excluded proteins, peptides, clusters, genes, chromosomes or annotation categories due to their profile missing too many values. Clicking this link will direct the user to a filtered view containing the proteins related to these profiles. For peptide level profiling, this view shows all proteins related to any of the excluded peptides.

    • The total number of calculated values. This is the count of values that were missing, but were imputed (estimated and substituted) before applying the algorithm.

  2. The number of profiles of this group, along with the total membership sum of all the members in the group.

  3. The legend of the graph shows the identifier for each member profile, along with its membership in this group. For protein level profiling, the identifier is the preferred accession. For peptide level profiling, the identifier is simply the peptide sequence with modifications. Cluster level profiles are identified by their anchor protein accession keys. Gene level profiles are identified by their official symbol, and chromosomes by their number or name. Annotation level profiles are represented by their most natural identifier, typically a category description. Only the 9 profiles with the highest membership are displayed. A dot on the graph color indicates that this profile has at least one imputed value.

  4. The graph depicts the profiles of all peptides/proteins/clusters/genes/chromosomes/annotation categories belonging to this group, with the x-axis representing the selected supplementary data, and the y-axis measuring their (normalized) values. If the profiling is carried out on a comparison, the number of x-axis categories will be multiplied by the number of datasets in the comparison, and each category title will have the dataset name appended.

    Clicking the graph will direct the user to a filtered proteins view containing the profiles of this group.

  5. Pressing the magnifying icon will open a large detailed view of the group graph, including a larger legend holding more information on profile identifiers.
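Two of the display rules in the list above can be sketched in a few lines: the legend is limited to the 9 profiles with the highest membership, and for a comparison the x-axis holds one category per selected data type and dataset. The function names, the label format and the ordering of the categories are assumptions made for illustration, not the application's actual behaviour.

```python
def legend_entries(group_memberships, limit=9):
    """Return the (identifier, membership) pairs with the highest membership."""
    ranked = sorted(group_memberships.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]

def x_axis_categories(data_types, comparison_datasets=None):
    """One category per data type, repeated per dataset when profiling a comparison."""
    if not comparison_datasets:
        return list(data_types)
    # Assumed label format and dataset-major ordering.
    return [f"{data_type} ({dataset})"
            for dataset in comparison_datasets
            for data_type in data_types]

# Hypothetical group with 12 members and a comparison of three datasets.
members = {f"profile_{i}": 1.0 - 0.05 * i for i in range(12)}
legend = legend_entries(members)
labels = x_axis_categories(["QR1", "QR2"], ["Day 1", "Day 2", "Day 3"])
```

With two data types and three datasets, `labels` contains six categories, matching the multiplication described in item 4.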

17.1.2. Profiling algorithm details

The ProteinCenter profiling methodology was conceived as a peptide/protein/cluster/chromosome/annotation equivalent of the gene expression profiling suggested by [Futschik & Carlisle]*, but generalized in the way it spans datasets and data dimensions.

Data pre-processing

When a profile has several values for a particular data type attributed to it (a common scenario), the resulting value is the median of those values. For proteins, clusters, genes, chromosomes and annotations, this means taking the median over all experimental data. On the peptide level, the median is taken over all instances of that peptide (considering modifications) for all experimental data.
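The median aggregation can be sketched as follows. The observation layout, the data type name `ratio_a` and the peptide key (sequence plus a modification string) are hypothetical and only serve to show that modified and unmodified forms of a peptide would be kept apart.

```python
from collections import defaultdict
from statistics import median

def aggregate_by_median(observations):
    """Reduce repeated (profile, data type) observations to a single median value."""
    grouped = defaultdict(list)
    for profile_key, data_type, value in observations:
        grouped[(profile_key, data_type)].append(value)
    return {key: median(values) for key, values in grouped.items()}

# Hypothetical observations: a protein measured three times, a modified peptide twice.
observations = [
    ("P12345", "ratio_a", 1.8),
    ("P12345", "ratio_a", 2.2),
    ("P12345", "ratio_a", 5.0),
    (("AEFVEVTK", "Phospho@T7"), "ratio_a", 0.4),
    (("AEFVEVTK", "Phospho@T7"), "ratio_a", 0.6),
]
values = aggregate_by_median(observations)
```

The median keeps the single outlying measurement (5.0) from dominating the protein's value, which a mean would not.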

Profiles with too many missing values are excluded, while those within the given threshold have their missing values imputed (estimated by the algorithm) according to the profile average. The data are subsequently Z-normalized over each profile, giving each profile zero mean and unit standard deviation. Normalizing across profiles like this (rather than across dimensions) puts the clustering emphasis on the profile shape (i.e. the relative differences between values) rather than on the absolute profile values. This is a desirable quality in profiling, especially when the main objective is to identify peptides/proteins/clusters/genes/chromosomes/annotation categories that exhibit similar quantitative dynamics.

Soft clustering

The soft clustering method used for profiling is a modified version of fuzzy c-means, which produces clusters with gradual memberships for a specified number of clusters. If the auto option is selected, a meta-algorithm attempts to deduce the optimal number of clusters automatically. It does so by running the soft clustering for every group count from 1 up to the specified maximum, and selecting the largest group count whose clustering produces no 'empty' clusters (i.e. clusters containing no members with at least 0.5 membership). The fuzzy c-means algorithm uses a membership weight exponent of 1.25, as this was found to generally strike the best balance between cluster separation and noise robustness.
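To make the procedure more concrete, here is a minimal sketch of plain (unmodified) fuzzy c-means combined with the automatic group count selection described above. It is not ProteinCenter's modified algorithm: the initialization from evenly spaced input profiles, the fixed iteration count and all function names are assumptions chosen to keep the example short and deterministic.

```python
M = 1.25  # membership weight exponent quoted in the text

def memberships(point, centers, m=M):
    """Standard fuzzy c-means membership of one profile in every group."""
    d2 = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    nearest = min(d2)
    if nearest == 0.0:  # the profile sits exactly on one or more centers
        hits = [1.0 if d == 0.0 else 0.0 for d in d2]
        return [h / sum(hits) for h in hits]
    # Scaling by the nearest distance keeps ratios in (0, 1] and avoids overflow.
    weights = [(nearest / d) ** (1.0 / (m - 1.0)) for d in d2]
    total = sum(weights)
    return [w / total for w in weights]

def fuzzy_c_means(points, n_groups, iterations=50, m=M):
    """Return one membership list per profile."""
    n, dims = len(points), len(points[0])
    # Assumption: deterministic start from evenly spaced input profiles.
    centers = [list(points[i * n // n_groups]) for i in range(n_groups)]
    for _ in range(iterations):
        u = [memberships(p, centers, m) for p in points]
        for g in range(n_groups):
            w = [row[g] ** m for row in u]
            total = sum(w)
            if total > 0.0:  # keep the old center if the group attracted nothing
                centers[g] = [sum(wk * p[d] for wk, p in zip(w, points)) / total
                              for d in range(dims)]
    return [memberships(p, centers, m) for p in points]

def has_empty_group(u, threshold=0.5):
    """A group is 'empty' if no profile reaches the threshold membership in it."""
    return any(all(row[g] < threshold for row in u) for g in range(len(u[0])))

def auto_group_count(points, max_groups):
    """Largest group count from 1 to max_groups whose clustering has no empty group."""
    best = None
    for count in range(1, min(max_groups, len(points)) + 1):
        u = fuzzy_c_means(points, count)
        if not has_empty_group(u):
            best = (count, u)
    return best

# Hypothetical profiles: three rising and three falling.
profiles = [[-1.0, 0.0, 1.0], [-1.1, 0.1, 1.0], [-0.9, 0.0, 0.9],
            [1.0, 0.0, -1.0], [1.0, 0.1, -1.1], [0.9, 0.0, -0.9]]
group_count, u = auto_group_count(profiles, max_groups=2)
```

With an exponent as low as 1.25, memberships are fairly crisp: each profile in this example belongs almost entirely to the group matching its direction.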

* Futschik, M.E. and Carlisle, B.: "Noise-robust soft clustering of gene expression time-course data". Journal of Bioinformatics and Computational Biology, Vol. 3, No. 4, pp. 965–988, 2005.

17.2. How to profile datasets

This section gives a step-by-step introduction to profiling datasets and comparisons.

17.2.1. Profiling a comparison or dataset

Select a comparison or single dataset for analysis, and go to the profiling pane:

Then follow these steps:

Figure 17.2. The profiling menu


  1. Select whether to profile all proteins in a dataset, or only the selected ones.

  2. Select whether to perform profiling on the level of proteins, peptides, clusters, genes, chromosomes or a type of annotation. This determines the basic unit of analysis, which each profile refers to.

  3. Pick a specific number of groups for the soft clustering. Regardless of the character of the data, this structure will be forced upon the clustered data (unless auto adjust is used).

  4. Decide whether or not the clustering algorithm should attempt to deduce the optimal number of clusters automatically. With this option, the specified number above is interpreted as the maximum number of clusters to consider.

  5. Set the accepted number of missing values per profile. Profiles with more missing values than this will be excluded from the clustering. Profiles whose number of missing values is within this threshold will have these values estimated by an algorithm in accordance with the rest of the profile. Profiles with no missing values will remain unaffected.

  6. Choose the (alpha core) level of the membership filter, to only present profiles with at least this degree of membership. Alpha core constitutes a post-processing filter, which allows for browsing cluster cores without re-partitioning the data. This allows the analyst to focus on the member profiles most closely resembling the group consensus.

  7. Select the supplementary data to be used as dimensions for the soft clustering algorithm. Data types pertaining to various quantitative measures of protein expression are usually the best candidates (e.g. QR#, AQR#, GD#, emPAI, etc.). Numbers in square brackets following a data type indicate the number of missing values for that type. Data types that have values for all profiles in the dataset(s) will not display missing value brackets. Selecting data types with a minimum number of missing values (or obtaining data with complete quantitation in general) will yield the most effective profiling results. The ordering of the selected supplementary data only reflects the order in which they will be displayed in the results.

  8. Press to soft cluster all profiles according to the selected supplementary data.

The result is shown as a set of clusters in the profiling view, with each graph corresponding to a soft cluster of profiles, as described above.
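The missing-value handling from step 5, together with the Z-normalization described in Section 17.1.2, can be sketched as below. This is a simplified illustration: the function name is hypothetical, missing values are represented as `None`, and the use of the population standard deviation and the treatment of flat profiles are assumptions, since the manual does not specify them.

```python
from statistics import mean, pstdev

def prepare_profile(values, max_missing):
    """Impute and Z-normalize one profile; return None if the profile is excluded."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    if missing > max_missing or not present:
        return None  # too many missing values: excluded from the clustering
    average = mean(present)
    # Imputation according to the profile average.
    filled = [average if v is None else v for v in values]
    center = mean(filled)
    spread = pstdev(filled)  # assumption: population standard deviation
    if spread == 0.0:
        return [0.0] * len(filled)  # assumption: a flat profile maps to zeros
    return [(v - center) / spread for v in filled]

kept = prepare_profile([1.0, None, 3.0], max_missing=1)
dropped = prepare_profile([1.0, None, None], max_missing=1)
```

Here `kept` is the normalized form of the filled profile 1.0, 2.0, 3.0, while `dropped` is `None` because two missing values exceed the accepted limit of one.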