ProteinCenter User Manual

Chapter 16. Clusters View

Table of Contents

16.1. The clusters view
16.2. Introduction to clustering
16.3. Types of clustering
16.3.1. Clustering based on sequence similarity
16.3.2. Clustering based on peptide sharing
16.4. The biological significance of clustering levels
16.5. The many advantages to clustering
16.5.1. Clustering reduces the complexity of analysis
16.5.2. Clustering ensures better annotation coverage
16.5.3. Clustering reduces redundancy
16.5.4. Grouping related proteins
16.5.5. Grouping of alleles, fragments, isoforms
16.5.6. Dataset comparisons
16.5.7. Peptide sharing
16.6. How to cluster datasets
16.6.1. Clustering a dataset
16.6.2. How to cluster different parts of a dataset at different clustering levels
16.6.3. Using preferred data from a particular species as anchor
16.6.4. How to uncluster a dataset
16.6.5. How to use imported clusters
16.6.6. Different clustering algorithms

16.1. The clusters view

The clusters view is an alternative way of displaying a list of proteins that makes data analysis more efficient, because the information is summarized for groups of proteins. A thorough analysis undertaken in the clusters view therefore requires fewer decisions. Related proteins are grouped together based on clusters that are either imported or derived from a clustering analysis, as described in this chapter.

In the clusters view the user is presented with an easy overview of information for a very large number of proteins. Each line corresponds to a group of proteins with summary data for all proteins in a given cluster. The clusters may be expanded to inspect the details for the individual proteins. It is also possible to sort on each group's minimum and maximum values of the supplementary data fields.

Specifically for the clusters view, filters are applied at the cluster level. For example, if using filters to isolate membrane proteins, the result will include any cluster that has at least one membrane protein (more details are given in Chapter 11, Filters).
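
The cluster-level filter rule can be illustrated with a short Python sketch. The data layout (a dictionary of cluster names mapping to member records with an `is_membrane` field) and all names are made up for illustration; this is not ProteinCenter's internal format or API.

```python
# Hypothetical in-memory clusters; names and fields are illustrative only.
clusters = {
    "cluster1": [{"acc": "prot1", "is_membrane": False},
                 {"acc": "prot2", "is_membrane": True}],
    "cluster2": [{"acc": "prot3", "is_membrane": False}],
}

def filter_clusters(clusters, predicate):
    """Keep every cluster in which at least one member satisfies the predicate."""
    return {name: members for name, members in clusters.items()
            if any(predicate(member) for member in members)}

membrane_clusters = filter_clusters(clusters, lambda m: m["is_membrane"])
print(sorted(membrane_clusters))
```

Note that the whole cluster is kept, including members that do not match the filter themselves.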

Figure 16.1. The Clusters view


Below is a short description of the most common columns in the clusters view. To manually select which columns to display, see Section 5.4.5, “Selecting which columns to display”.

These explanations refer to the details pertaining to the clusters. For details related to the individual proteins, please refer to Section 14.1, “The proteins view”.

  1. Cluster - The cluster name/identifier. This value can either be imported or assigned when clustering is performed in ProteinCenter. When assigned, it will be set to the preferred accession key of the anchor protein in the cluster.

  2. Size - The number of cluster members. Click the number to expand the data for individual members of the cluster.

  3. CL - The clustering level for the group (sequence similarity percentage / peptides in common).

  4. - Press to ungroup all proteins in the cluster.

  5. O - Cluster outdated status. One flag is shown if all proteins in the cluster are outdated, and a different flag if some (but not all) members are outdated.

  6. Description - Description for the anchor protein.

  7. S - Check to select all proteins in the cluster.

  8. - Add all proteins in the cluster to Basket.

  9. Gene - The most frequent gene symbol in the cluster. A '+' indicates that there are additional gene symbols in the cluster; hover the mouse over the text to see a summary.

  10. AA Min - Length of shortest protein in the cluster.

  11. AA Max - Length of longest protein in the cluster.

  12. AS - Flags proteins from alternatively spliced genes. One flag indicates that this applies to all proteins in the cluster, while another indicates that it applies to only some of them.

  13. Tax - The most frequent species name for proteins in the cluster. Hover the mouse over the name to see details.

  14. GO MF - Molecular Function GO annotation summarized for the cluster.

  15. GO CC - Cellular component GO annotation summarized for the cluster.

  16. GO BP - Biological process GO annotation summarized for the cluster.

  17. TM Max - The maximum number of transmembrane domains for a protein in the cluster.

  18. SP - Signal peptide prediction. One flag indicates that all cluster members carry a signal peptide, while another indicates that only some members do.

  19. Pep - Number of peptides associated with the cluster.

  20. All supplementary data with configurable statistics columns for the cluster as described in Chapter 13, Protein data view. Note that the image above has been truncated to only show some supplementary data.

  21. A cluster can be expanded to display details for its members, by clicking the integer in the No column. Clicking the integer again will hide the cluster information. Most columns in the expanded view are the same as in the protein view (see Chapter 14, Proteins view), and it is possible to expand all the way down to see experimental data.

  22. Sim - A measure of the degree of each member's relation to the group. Depending on the clustering algorithm used, this is either the sequence similarity percentage between the cluster anchor and the member, or the number of peptides the member shares with the rest of the group. Click the value to get a BLAST alignment between the member and the anchor, as shown in Figure 16.2. The anchor is marked in black without a link.

  23. - Press to exclude the protein from this cluster.
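
As an illustration of how a summary row relates to the members of a cluster, the following Python sketch derives a few of the columns described above (Size, Gene, AA Min, AA Max, TM Max, SP) from a list of member records. The record fields and gene symbols are hypothetical, and the code is only a sketch of the idea, not ProteinCenter's implementation.

```python
from collections import Counter

def summarize_cluster(members):
    """Build one summary row from member records (field names are hypothetical)."""
    lengths = [m["length"] for m in members]
    genes = Counter(m["gene"] for m in members)
    top_gene, _ = genes.most_common(1)[0]
    signal = [m["signal_peptide"] for m in members]
    return {
        "Size": len(members),
        # A '+' marks that other gene symbols also occur in the cluster.
        "Gene": top_gene + (" +" if len(genes) > 1 else ""),
        "AA Min": min(lengths),
        "AA Max": max(lengths),
        "TM Max": max(m["tm_domains"] for m in members),
        "SP": "all" if all(signal) else "some" if any(signal) else "none",
    }

members = [
    {"gene": "GENE1", "length": 120, "tm_domains": 0, "signal_peptide": True},
    {"gene": "GENE1", "length": 300, "tm_domains": 2, "signal_peptide": False},
    {"gene": "GENE2", "length": 250, "tm_domains": 1, "signal_peptide": False},
]
row = summarize_cluster(members)
print(row)
```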

More details on the clustering process are given below, after a general introduction.

Figure 16.2. The BLAST alignment between two proteins in a cluster


Keep in mind that BLAST may not always show the complete global alignment from N to C terminal. In this example only the first 125 AA are shown for the query sequence.

16.2. Introduction to clustering

Clustering a set of proteins enables efficient data analysis and allows comprehensive comparisons of different datasets. ProteinCenter enables you to analyze a dataset as a clustered set of proteins, and you may switch between clustered and unclustered analysis of the data at any time. When you switch between the clustered and unclustered views, the available functionality switches too. For example, filtering and sorting refer to individual protein annotations in the unclustered view, but to the combined cluster annotations in the clusters view.

In ProteinCenter, a protein dataset may be clustered in one of three ways:

  • Using imported groupings

  • Based on sequence similarity

  • Based on shared peptides

This chapter gives a short description of how the clustering works and explains the biological significance of different similarity levels. It also describes how clustering facilitates the following:

  • Reducing complexity when analyzing a large set of proteins

  • Increasing annotation coverage

  • Dealing with redundancy issues

  • Identification of related proteins

  • Grouping of alleles, fragments, isoforms, and homologues

  • Comparison between heterogeneous protein datasets

Finally, a tutorial is included.

16.3. Types of clustering

Two basic types of protein clustering algorithms are implemented in ProteinCenter: the first is based on sequence similarity between protein sequences, and the second is based on the sharing of peptides from experimental data. Neither type takes modifications of sequences or peptides into account, i.e. only the raw sequence of amino acids is used.

16.3.1. Clustering based on sequence similarity

All sequence similarity clustering algorithms are based on local sequence alignments of amino acid identity, which are aggregated into a single percentage similarity measure. For very large datasets (more than about 6000 proteins), prolonged execution times (more than 10 minutes) must be expected. As with all clustering techniques, there is a fundamental discrepancy: homology between sequences is a continuous measure, while each sequence can belong to only one group. Hard clustering is therefore the best attainable compromise when proteins are forced to belong to exactly one group.

It is possible to cluster at any level from 1% up to 100% sequence identity. The identity is calculated along the complete length of the shorter of the two protein sequences. A few simplified examples help to visualize this:

Two sequences that are 90% similar to each other:

Identities = 9/10 (90%)

Query:    1 MVAKLEQLIA 10
Identity    |||||| |||
Sbjct:    1 MVAKLELLIA 10

Two sequences that are 100% similar to each other:

Identities = 6/6 (100%)

Query:    1 MVAKLEQLIA 10
Identity    ||||||
Sbjct:    1 MVAKLE 6

The bottom sequence (6 AA) is 66.7% similar to the top sequence (10 AA):

Identities = 4/6 (66%)

Query:    1 MVAKLEQLIA 10
Identity    || ||
Sbjct:    1 MVVKLI 6

For example, in cases of alternative splicing, where a large chunk of a sequence has been removed (e.g. by exon skipping), the similarity may still remain as high as 100%, since all residues of the shorter sequence will remain identical to the longer isoform.
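
The three simplified examples above can be reproduced with a few lines of Python. This is a minimal sketch that assumes both sequences start at the same position and are compared without gaps, as in the examples; the function name `percent_identity` is an assumption for illustration, and ProteinCenter itself works from local alignments rather than from this direct comparison.

```python
def percent_identity(seq_a, seq_b):
    """Percent identity over the full length of the shorter sequence.

    Simplification: both sequences start at the same position and have no gaps.
    """
    shorter, longer = sorted((seq_a, seq_b), key=len)
    if not shorter:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for x, y in zip(shorter, longer) if x == y)
    return 100.0 * matches / len(shorter)

print(percent_identity("MVAKLEQLIA", "MVAKLELLIA"))         # 90.0
print(percent_identity("MVAKLEQLIA", "MVAKLE"))             # 100.0
print(round(percent_identity("MVAKLEQLIA", "MVVKLI"), 1))   # 66.7
```

Because the denominator is the length of the shorter sequence, a fragment that matches the longer sequence perfectly scores 100% regardless of how much of the longer sequence it covers.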

16.3.1.1. Sequence similarity cluster anchors

To increase the size of individual clusters and thereby minimize the number of clusters, the longest sequence is usually chosen as the anchor sequence (with few exceptions regarding preferred species and group coverage).

When a large dataset of proteins is clustered, the composition of the individual clusters will vary depending on the chosen anchor sequence. All other members of a cluster have been included based on their similarity to the anchor sequence. This means that for a dataset clustered at a 90% similarity level, all members of every cluster will be at least 90% identical to the anchor sequence of the individual cluster. However, the individual members may have less than 90% identity with each other. A rather extreme example of this is shown in Table 16.1, “Members of a cluster do not have to be similar - they only have to be similar to the anchor”, where two members share no similarity at all.

Table 16.1. Members of a cluster do not have to be similar - they only have to be similar to the anchor

Protein  Sequence (aligned)                                  Identity to anchor (%)
Anchor   MINQKKYVHFVTMYIIIFSLWIFLIPKDLNIKEIGILFLFCFATLFSCYC  -
Member1  MINQKKYVHFVTMYIIIFSLWI                              100
Member2  ______________________FLIPKDLNIKEIGILFLFCFATLFSCYC  100
Member3  MINQKKYVHFVTMYIIIFSLWIFLIPKDLNIKEIGILFLFCFATLFMINQ  92
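
The values in Table 16.1 can be checked with a small Python sketch. It assumes that all sequences are written in the same alignment frame, with "_" marking positions where a sequence has no residue, and it divides the number of identical positions by the residue count of the shorter sequence. The helper name `identity_in_frame` is hypothetical; the sketch only illustrates the arithmetic, not the alignment step.

```python
GAP = "_"

def identity_in_frame(a, b):
    """Percent identity of two sequences given in the same alignment frame."""
    width = max(len(a), len(b))
    a, b = a.ljust(width, GAP), b.ljust(width, GAP)
    # Count columns where both sequences carry the same residue.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != GAP)
    shortest = min(len(a.replace(GAP, "")), len(b.replace(GAP, "")))
    return 100.0 * matches / shortest

anchor = "MINQKKYVHFVTMYIIIFSLWIFLIPKDLNIKEIGILFLFCFATLFSCYC"
member1 = anchor[:22]                # N-terminal fragment
member2 = GAP * 22 + anchor[22:]     # C-terminal fragment
member3 = anchor[:46] + "MINQ"       # last four residues differ

for name, seq in [("Member1", member1), ("Member2", member2), ("Member3", member3)]:
    print(name, identity_in_frame(anchor, seq))

# Member1 and Member2 do not overlap, so they share no identical positions.
print("Member1 vs Member2", identity_in_frame(member1, member2))
```

All three members reach at least 90% identity to the anchor, while Member1 and Member2 have 0% identity to each other.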

16.3.2. Clustering based on peptide sharing

When a dataset contains experimental data in the form of peptides, it is possible to group proteins that share a certain number of peptides. For this type of clustering, the clustering level is defined as the minimum number of peptides a protein must have in common with the rest of the group as a whole (not with any other member in particular) to be included. Moreover, any experimental data that refer to the same protein will be grouped regardless of the number of peptides they share (although they will usually share the majority of their related peptides).

The most eligible candidate for the anchor of a peptide sharing group is the protein that has the most peptides in common with the other members of the group.
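
The two quantities described above, a member's number of peptides in common with the rest of the group and the resulting choice of anchor, can be sketched in Python as follows. The protein and peptide labels are placeholders, and the sketch scores an already formed group; it does not show how ProteinCenter builds the groups.

```python
def shared_with_rest(group, protein):
    """Number of the protein's peptides also found in any other group member."""
    others = set().union(*(peps for name, peps in group.items() if name != protein))
    return len(group[protein] & others)

# Hypothetical group: protein name -> set of peptide sequences (placeholders).
group = {
    "protA": {"pep1", "pep2", "pep3"},
    "protB": {"pep2", "pep3", "pep4"},
    "protC": {"pep3", "pep4"},
}

scores = {name: shared_with_rest(group, name) for name in group}
anchor = max(scores, key=scores.get)

clustering_level = 2  # minimum number of peptides shared with the rest of the group
qualifies = {name: score >= clustering_level for name, score in scores.items()}

print(scores)   # peptides shared with the rest of the group
print(anchor)   # member with the highest score
```

Here protB shares three peptides with the rest of the group and becomes the anchor, although it shares only two peptides with protA or protC individually.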

16.3.2.1. Clustering indistinguishable proteins

Figure 16.3. Clustering indistinguishable proteins


Another form of peptide sharing clustering is to group proteins that cannot be distinguished based on the peptide evidence they share. Any (unmodified) peptide sequence matches that were not identified as peptides are deemed "possible peptide matches" (assuming the original peptide assignment was arbitrary) and are also considered by the algorithm. The protein with the most peptides becomes the anchor of a group, and every other member has all of its peptides in common with this anchor. If several proteins have the same number of peptides, the tie is broken first by the number of assigned peptides (not including "possible peptide matches") and then by sequence length. Each member's similarity is the number of peptides it has.
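
One possible reading of this rule is sketched below in Python. It is a greedy grouping under stated assumptions, not ProteinCenter's actual algorithm: each record holds the full peptide set (assigned peptides plus "possible peptide matches"), the number of assigned peptides, and the sequence length; all names and values are hypothetical.

```python
def group_indistinguishable(proteins):
    """Greedy sketch: members are proteins whose peptides are all found in the anchor."""
    # Anchor candidates: most peptides first, then most assigned peptides, then length.
    order = sorted(proteins, key=lambda n: (-len(proteins[n]["peptides"]),
                                            -proteins[n]["assigned"],
                                            -proteins[n]["length"]))
    groups, used = {}, set()
    for anchor in order:
        if anchor in used:
            continue
        members = [n for n in order if n not in used
                   and proteins[n]["peptides"] <= proteins[anchor]["peptides"]]
        used.update(members)
        # Each member's similarity is the number of peptides it has.
        groups[anchor] = {n: len(proteins[n]["peptides"]) for n in members}
    return groups

proteins = {
    "protA": {"peptides": {"pep1", "pep2", "pep3"}, "assigned": 3, "length": 400},
    "protB": {"peptides": {"pep1", "pep2"}, "assigned": 2, "length": 380},
    "protC": {"peptides": {"pep4"}, "assigned": 1, "length": 200},
    "protD": {"peptides": {"pep1", "pep2", "pep3"}, "assigned": 2, "length": 410},
}
groups = group_indistinguishable(proteins)
print(groups)
```

In this example protA and protD have the same peptides, and protA becomes the anchor because it has more assigned peptides, even though protD is longer.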

16.4. The biological significance of clustering levels

Different sequence similarity levels may be used, depending on the purpose of an analysis. Some rough guidelines may be used to categorize the biological significance of different similarity levels (see Table 16.2, “Types of proteins that are grouped, based on clustering level”). Keep in mind that there will always be exceptions to these general categorizations.

To get more precise results, individual clusters must be inspected, but it is certainly possible to get biologically meaningful results using the automated clustering. For example, a rough distribution of the sub-cellular localization of proteins in a dataset may be obtained by clustering the dataset at 98% similarity and considering each cluster to roughly represent one protein. At the 98% similarity level some splicing variants may end up in the same cluster, but this effect is usually much smaller than the effect of the redundancy that arises when different alleles of the same protein are counted multiple times (unless the analysis is intended to resolve individual alleles).

Table 16.2. Types of proteins that are grouped, based on clustering level

Similarity level  Types of proteins grouped

100%              Native & processed sequences
                  Fragment, longer fragments & full length sequences
                  Certain splicing isoforms (e.g. exon skipping)

98-100%           Alleles (including allelic fragments)
                  All above

95-100%           Highly similar proteins
                  Proteins with sequencing errors
                  All above

80%               Closely related paralogous proteins
                  Orthologous proteins from closely related species
                  All above

60%               Homologous proteins - closer to the level of protein families
                  All above


16.5. The many advantages to clustering

The complexity of proteomics data analysis is increased, beyond the inherent complexity of cell biology, by search databases that contain a mix of different alleles, splice variants, fragments, sequencing errors and predicted sequences. This makes it hard to distinguish which proteins correspond to the same isoform, or even the same gene. Therefore, a set of 100 protein keys rarely corresponds to 100 distinct proteins, and even the simplest statistics (e.g. counting the number of proteins with a certain feature) become difficult. In addition, the required annotation is often incomplete, which leads to false negatives: proteins are not considered part of the interesting subset when in fact they are.

16.5.1. Clustering reduces the complexity of analysis

Clustering a set of proteins generally leads to fewer clusters than there are proteins in the original set. The degree of reduction depends on the level at which the data are clustered. By analyzing the annotation for clusters rather than for individual proteins, it is therefore much faster to browse through the dataset. In addition, similar proteins are always analyzed together, so the same conclusion does not have to be drawn multiple times.

16.5.2. Clustering ensures better annotation coverage

By clustering similar proteins, the annotation of these proteins is significantly enhanced, especially with respect to their function. Clusters containing broader categories of proteins may mix different annotations, which is generally not useful. Such broad groups are rarely created for annotation analysis, but they are useful if you want to group a particular irrelevant type of protein in order to disregard it (e.g. contaminants). Used in the right context, clustering is therefore a powerful way of enhancing annotation.

16.5.3. Clustering reduces redundancy

Often a set of proteins is highly redundant, due to the presence of closely related proteins. One widespread example is datasets derived from MS experiments on complex samples, where hundreds or even thousands of protein identifications may contain many closely related isoforms that could not be distinguished from each other. This redundancy must be dealt with to enable further analysis; otherwise it does not make much sense to ask which fraction of the proteins has a particular characteristic (sub-cellular localization, etc.). Clustering addresses the issue that different protein keys may represent closely related protein isoforms, or even different fragments of the same isoform. Closely related proteins are grouped together, and fragments are grouped together with longer (possibly full-length) isoforms. Subsequent analysis at the cluster level gives more reliable distributions of, e.g., sub-cellular localization.
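
The effect of redundancy on such a fraction can be shown with a small, made-up example in Python. One membrane protein is present as four redundant entries that end up in a single cluster, next to two unrelated non-membrane proteins; a cluster is counted as "membrane" if at least one member is. The data are invented for illustration only.

```python
# Hypothetical clusters: each list holds one membrane flag per member protein.
clusters = {
    "cluster1": [True, True, True, True],  # four redundant entries of one protein
    "cluster2": [False],
    "cluster3": [False],
}

all_flags = [flag for members in clusters.values() for flag in members]
per_protein = sum(all_flags) / len(all_flags)
per_cluster = sum(any(members) for members in clusters.values()) / len(clusters)

print(f"membrane fraction per protein entry: {per_protein:.2f}")  # 0.67
print(f"membrane fraction per cluster:       {per_cluster:.2f}")  # 0.33
```

Counted per entry, two thirds of the dataset appears to be membrane proteins; counted per cluster, the figure is one third.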

A set of protein identifications rarely represents a clean dataset without redundancy. This is due to the amount of bookkeeping work required when the process is not automated, as it is in ProteinCenter. Even when a few clean datasets have been created, it is not trivial to compare them without ProteinCenter, since there is no guarantee that the same identifiers are used for the same proteins. Also, two proteins in two different datasets may represent two alleles of the same protein without this being apparent from the identifiers or descriptions (keep in mind that even the same protein may have widely different descriptions associated with its various accession keys). See more about comparison of datasets in Chapter 19, Dataset comparison.

16.5.4. Grouping related proteins

Clustering at lower similarity levels may be used to isolate proteins of the same or similar protein families for detailed analysis.

To get an overview of a large dataset, it may also help to group proteins into very large clusters in order to understand the nature of the groups of proteins. One may then drill down, either inside individual clusters or by re-clustering the complete dataset. ProteinCenter enables both approaches.

Alternatively, this may be a way to isolate and remove a large group of proteins, which are deemed irrelevant to the assay at hand (e.g. proteins originating from contamination).

16.5.5. Grouping of alleles, fragments, isoforms

At high sequence similarity levels, alleles, fragments and isoforms are grouped together. This enables detailed analysis of which isoforms have been detected, or simply deals with the more basic redundancy issues mentioned above.

16.5.6. Dataset comparisons

Finally, clustering is a clear advantage when comparing datasets. Even though ProteinCenter integrates every protein record with the same sequence into one record, the overlap found when comparing two datasets will still be much smaller than when fragments, alleles, etc. are accounted for. Comparing clusters from different datasets is therefore much more useful, since it ensures a more comprehensive comparison and allows the user to decide when proteins are considered the same. This makes it possible to compare datasets at different levels of granularity, be it fragments, alleles, or even proteins with 95% sequence identity.
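
The difference between comparing records and comparing clusters can be sketched in Python. The example assumes that the two datasets have been clustered together, so that every protein maps to a cluster name; the protein names and the `cluster_of` mapping are hypothetical.

```python
# Hypothetical datasets, identified by protein record.
dataset_a = {"protA_full", "protB"}
dataset_b = {"protA_fragment", "protC"}

# Assumed result of clustering both datasets together: protein -> cluster.
cluster_of = {
    "protA_full": "cluster1",
    "protA_fragment": "cluster1",  # fragment grouped with the full-length sequence
    "protB": "cluster2",
    "protC": "cluster3",
}

record_overlap = dataset_a & dataset_b
cluster_overlap = ({cluster_of[p] for p in dataset_a}
                   & {cluster_of[p] for p in dataset_b})

print(len(record_overlap))   # 0
print(cluster_overlap)       # {'cluster1'}
```

At the record level the datasets appear unrelated, while the cluster-level comparison shows that both contain a form of the same protein.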

16.5.7. Peptide sharing

Clustering proteins that share identified peptides significantly eases the analysis of experimental data, and establishes the basis for decisions on which proteins have been identified. By using either of the peptide sharing clustering algorithms, protein hits that are indiscernible based on their peptide evidence can be grouped.

16.6. How to cluster datasets

This section gives a step-by-step introduction to clustering of datasets.

16.6.1. Clustering a dataset

Select a dataset for analysis and go to the Cluster pane.

Then follow these simple steps:

Figure 16.4. The cluster menu


  1. Pick a clustering algorithm (see Section 16.6.6, “Different clustering algorithms” for descriptions of individual algorithms).

  2. Choose the clustering level at which to cluster sequences, depending on the type of clustering algorithm (a sequence similarity percentage, or a number of peptides in common).

  3. Set additional parameters for the chosen type of clustering (e.g. the preferred species of the anchor).

  4. Press to cluster all proteins. Any previous clustering will be abandoned before applying the algorithm.

  5. If proteins have already been clustered, this button will uncluster all proteins.

  6. Use this to cluster only proteins that are not already clustered.

The result is shown as a set of clusters in the clusters view, where each line corresponds to a cluster as described above.

16.6.2. How to cluster different parts of a dataset at different clustering levels

It is possible to cluster different parts of a dataset with different clustering level criteria. This is done by combining the function that clusters previously unclustered proteins with the function that ungroups members from clusters (either individual proteins or complete clusters).

  1. Pick a clustering level and cluster all proteins.

  2. Optionally uncluster specific clusters or specific proteins (functions 4 and 23 in Figure 16.1, “The Clusters view”).

  3. Pick another clustering level (2), and cluster the remaining unclustered proteins with the Cluster all unclustered data (6) button.

16.6.3. Using preferred data from a particular species as anchor

If a dataset consists of proteins from a range of species, it is possible to specify a particular species that should be favored in the selection of the anchor sequence. This way, you can ensure that all or most anchor proteins are from the species of interest. To use a preferred species, select one from the preferred species menu (3) before selecting Cluster all data (4). This option constrains the default choice of anchor, so that the anchor becomes the longest sequence from the chosen species.
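
The effect of a preferred species on anchor selection can be sketched in Python. The function below is a simplified illustration with hypothetical record fields; in particular, falling back to the overall longest sequence when no member belongs to the preferred species is an assumption, based on the statement that all or most anchors will come from that species.

```python
def choose_anchor(members, preferred_species=None):
    """Longest member from the preferred species, else the longest member overall."""
    candidates = [m for m in members if m["species"] == preferred_species] or members
    return max(candidates, key=lambda m: m["length"])

# Hypothetical cluster members.
members = [
    {"acc": "prot_human", "species": "Homo sapiens", "length": 300},
    {"acc": "prot_mouse", "species": "Mus musculus", "length": 350},
    {"acc": "prot_rat", "species": "Rattus norvegicus", "length": 320},
]

print(choose_anchor(members)["acc"])                  # prot_mouse
print(choose_anchor(members, "Homo sapiens")["acc"])  # prot_human
print(choose_anchor(members, "Danio rerio")["acc"])   # prot_mouse
```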

16.6.4. How to uncluster a dataset

Click Uncluster all data (5) to uncluster the complete dataset.

16.6.5. How to use imported clusters

It is possible to import datasets that are already clustered, e.g. groups of proteins that are grouped based on MS evidence.

The clustering will automatically appear when a pre-clustered dataset is selected.

Imported clusters may be used in many different types of analyses, but where multiple datasets are combined (e.g. to do a comparative analysis), a clustering performed in ProteinCenter is required.

16.6.6. Different clustering algorithms

Several different clustering algorithms are available, but for most uses the default (Most Homogeneous Groups) will suffice. Sometimes a protein can belong to more than one group, and it is in these cases that the algorithms differ the most, according to the criteria described below. In all cases the anchor is the protein with the longest sequence, unless another protein that is only 1 or 2 amino acids shorter creates a better grouping according to the grouping criteria. This way, every non-anchor protein is at least as similar to its anchor as the given similarity threshold requires.

  1. Most Homogeneous Groups - Proteins are clustered to make the individual groups as homogeneous as possible, while adhering to the specified minimum sequence similarity. Each member's similarity is the sequence similarity it shares with the cluster anchor.

  2. Largest Groups - Proteins are clustered to create the largest possible groups, while adhering to the specified minimum sequence similarity. Each member's similarity is the sequence similarity it shares with the cluster anchor.

  3. Required Species - The Most Homogeneous Groups algorithm, with the added requirement that anchors have to be a specific species.

  4. Peptide Sharing - Proteins are clustered according to the number of peptides they share (clustering level (2)). Each member's similarity is the number of peptides it shares with the rest of the group members.

  5. Indistinguishable Proteins - Proteins that cannot be distinguished, based on the peptide evidence they share, are clustered. The protein with the most peptides becomes the anchor of a group. Each member's similarity is the number of peptides it has.