The clusters view is an alternative way of displaying a list of proteins for more efficient data analysis, since the information is summarized for groups of proteins. Hence, a thorough analysis undertaken in the clusters view requires fewer decisions. Related proteins are grouped together based on clusters that are either imported, or derived from a clustering analysis as described in this chapter.
In the clusters view the user is presented with an easy overview of information for a very large number of proteins. Each line corresponds to a group of proteins with summary data for all proteins in a given cluster. The clusters may be expanded to inspect the details for the individual proteins. It is also possible to sort on each group's minimum and maximum values of the supplementary data fields.
Specifically for the clusters view, filters are applied at the cluster level. For example, if using filters to isolate membrane proteins, the result will include any cluster that has at least one membrane protein (more details are given in Chapter 11, Filters).
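The cluster-level filter rule can be illustrated with a minimal Python sketch. The data model (a cluster as a list of protein records with a `membrane` field) and the function name `filter_clusters` are assumptions made for illustration only; they are not part of ProteinCenter.

```python
from typing import Callable, Dict, List

# Hypothetical data model: a protein is a dict, a cluster is a list of proteins.
Protein = Dict[str, object]


def filter_clusters(
    clusters: Dict[str, List[Protein]],
    predicate: Callable[[Protein], bool],
) -> Dict[str, List[Protein]]:
    """Keep every cluster in which at least one member satisfies the predicate."""
    return {
        name: members
        for name, members in clusters.items()
        if any(predicate(p) for p in members)
    }


# Illustrative data only: C1 holds one membrane protein, C2 holds none.
clusters = {
    "C1": [{"id": "P1", "membrane": True}, {"id": "P2", "membrane": False}],
    "C2": [{"id": "P3", "membrane": False}],
}
kept = filter_clusters(clusters, lambda p: bool(p["membrane"]))
```

In this sketch `kept` contains the whole of cluster C1, including its non-membrane member, because the filter decides per cluster and not per protein.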
Below is a short description of the most common columns in the clusters view. To manually select which columns to display, see Section 5.4.5, “Selecting which columns to display”.
These explanations refer to the details pertaining to the clusters. For details related to the actual proteins, please refer to the proteins view just above in Section 14.1, “The proteins view”.
Cluster - The cluster name/identifier. This value can either be imported or assigned when clustering is performed in ProteinCenter. When assigned, it will be set to the preferred accession key of the anchor protein in the cluster.
Size - The number of cluster members. Press to expand the data for individual members of the cluster.
CL - The clustering level for the group (sequence similarity percentage / peptides in common).
- Press to ungroup all proteins in the cluster.
O - Cluster outdated status. Flagged with if all proteins in a cluster are outdated, and with if some (but not all) members are outdated.
Description - Description for the anchor protein.
S - Check to select all proteins in the cluster.
- Add all proteins in the cluster to Basket.
Gene - The most frequent gene symbol in the cluster. A '+' indicates that there are additional gene symbols in the cluster, which can be seen by hovering over the text with the mouse to get a summary:
AA Min - Length of shortest protein in the cluster.
AA max - Length of longest protein in the cluster.
AS - Flagging of proteins from alternatively spliced genes. The flag indicates all proteins in the cluster, while indicates some in the cluster.
Tax - Most frequent species name for proteins in the cluster is shown. Hover mouse over to see details:
GO MF - Molecular Function GO annotation summarized for the cluster.
GO CC - Cellular component GO annotation summarized for the cluster.
GO BP - Biological process GO annotation summarized for the cluster.
TM Max - The maximum number of transmembrane domains for a protein in the cluster.
SP - Signal peptide prediction. The flag indicates that all cluster members carry a signal peptide, while indicates that only some members do.
Pep - Number of peptides associated with the cluster.
All supplementary data with configurable statistics columns for the cluster as described in Chapter 13, Protein data view. Note that the image above has been truncated to only show some supplementary data.
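As a rough illustration of how the summary columns described above relate to the members of a cluster, the following Python sketch derives a few of them from a list of member records. The field names (`gene`, `length`, `tm_domains`, `signal_peptide`) and the text labels used for the all/some flags are hypothetical; the sketch does not reproduce how ProteinCenter computes or displays these values.

```python
from collections import Counter
from typing import Dict, List


def flag(values: List[bool]) -> str:
    """Return 'all', 'some' or 'none' depending on how many members are flagged."""
    if values and all(values):
        return "all"
    if any(values):
        return "some"
    return "none"


def summarize_cluster(members: List[Dict[str, object]]) -> Dict[str, object]:
    """Summarize a non-empty cluster into one row of cluster-level columns."""
    genes = Counter(m["gene"] for m in members)
    top_gene, _ = genes.most_common(1)[0]
    lengths = [m["length"] for m in members]
    return {
        "Size": len(members),
        # '+' marks that further gene symbols exist in the cluster.
        "Gene": top_gene + ("+" if len(genes) > 1 else ""),
        "AA Min": min(lengths),
        "AA Max": max(lengths),
        "TM Max": max(m["tm_domains"] for m in members),
        "SP": flag([bool(m["signal_peptide"]) for m in members]),
    }


# Illustrative member records only.
members = [
    {"gene": "ABC1", "length": 320, "tm_domains": 2, "signal_peptide": True},
    {"gene": "ABC1", "length": 298, "tm_domains": 0, "signal_peptide": False},
    {"gene": "XYZ2", "length": 410, "tm_domains": 4, "signal_peptide": True},
]
summary = summarize_cluster(members)
```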
A cluster can be expanded to display details for its members by clicking the integer in the Size column. Clicking the integer again will hide the cluster information.
Most columns in the expanded view are the same as in the protein view
(see Chapter 14, Proteins view), and it is possible to expand all the
way down to see experimental data.
Sim - A measure for the degree of each member's relation to the group. Depending on the clustering algorithm used, this is either the sequence similarity percentage between the cluster anchor and any given member, or the number of peptides a member shares with the rest of the group. Press to get a blast alignment between any member and the anchor, as shown in the figure below. The anchor is marked in black without a link.
- Press to exclude the protein from this cluster.
More details on the clustering process are given below, after a general introduction.
Keep in mind that BLAST may not always show the complete global alignment from N to C terminal. In this example only the first 125 AA are shown for the query sequence.
Clustering a set of proteins enables efficient data analysis, and moreover allows for comprehensive comparisons of different datasets. ProteinCenter enables you to analyze a dataset as a clustered set of proteins, and at any time you may switch between analysis of the data as a clustered or unclustered dataset. When switching between clustered and unclustered view, the various functionalities will switch, too. For example, filtering and sorting refer to individual protein annotations in the unclustered view, but in the clustering view they refer to the combined cluster annotations.
In ProteinCenter, a protein dataset may be clustered in one of three ways:
Using imported groupings
Based on sequence similarity
Based on shared peptides
This chapter gives a short description of how the clustering actually works, and an explanation of the biological significance of different similarity levels. Furthermore, it is described how clustering facilitates the following:
Reducing complexity when analyzing a large set of proteins
Increasing annotation coverage
Dealing with redundancy issues
Identification of related proteins
Grouping of alleles, fragments, isoforms, and homologues
Comparison between heterogeneous protein datasets
Finally a tutorial is included.
There are two basic types of protein clustering algorithms implemented in ProteinCenter: The first type is based on sequence similarity between protein sequences, and the other is based on sharing of peptides from experimental data. Neither uses modifications of any type on sequences or peptides, i.e. only the raw sequence of amino acids is used.
All sequence similarity clustering algorithms are based on local sequence alignments of amino acid identity, which are aggregated to a single percentage similarity measure. For very large datasets (> ~6000 proteins), prolonged execution times (> 10 minutes) must be expected. As with all clustering techniques, there is a fundamental discrepancy: the homology-based similarity between sequences is a rather continuous measure, while any sequence can only belong to one particular group. Hence, hard clustering represents the best compromise attainable when proteins are forced to belong to only one group.
It is possible to cluster at any level from 1% and up to 100% sequence identity. The identity is calculated along the complete length of the shortest of the two protein sequences. A few simplified examples will help visualize this:
Two sequences that are 90% similar to each other:
Identities = 9/10 (90%)
Query: 1 MVAKLEQLIA 10
Identity |||||| |||
Sbjct: 1 MVAKLELLIA 10
Two sequences that are 100% similar to each other:
Identities = 6/6 (100%)
Query: 1 MVAKLEQLIA 10
Identity ||||||
Sbjct: 1 MVAKLE 6
The bottom sequence (6 AA) is 66.7% similar to the top sequence (10 AA):
Identities = 4/6 (66%)
Query: 1 MVAKLEQLIA 10
Identity || ||
Sbjct: 1 MVVKLI 6
For example, in cases of alternative splicing, where a large chunk
of a sequence has been removed (e.g. by exon skipping), the similarity
may still remain as high as 100%, since all residues of the shorter
sequence will remain identical to the longer isoform.
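The identity calculation in the simplified examples above can be sketched in a few lines of Python. This is an illustration under a strong assumption: it slides the shorter sequence along the longer one without gaps and keeps the best match count, whereas ProteinCenter works from local sequence alignments. The function name `identity_percent` is hypothetical, and the ungapped approach would not handle an exon-skipping case that requires a gap in the alignment.

```python
def identity_percent(a: str, b: str) -> float:
    """Percentage of identical residues, relative to the shorter sequence.

    Simplification: the shorter sequence is slid along the longer one
    without gaps, and the offset with the most identical positions is used.
    """
    short, long_ = sorted((a, b), key=len)
    if not short:
        raise ValueError("sequences must not be empty")
    best = 0
    for offset in range(len(long_) - len(short) + 1):
        window = long_[offset:offset + len(short)]
        matches = sum(x == y for x, y in zip(short, window))
        best = max(best, matches)
    return 100.0 * best / len(short)


# The three simplified examples from the text.
ex_90 = identity_percent("MVAKLEQLIA", "MVAKLELLIA")   # 9 of 10 identical
ex_100 = identity_percent("MVAKLEQLIA", "MVAKLE")      # 6 of 6 identical
ex_67 = identity_percent("MVAKLEQLIA", "MVVKLI")       # 4 of 6 identical
```

Because the denominator is the length of the shorter sequence, a fragment that matches the longer sequence completely scores 100% regardless of how much longer the other sequence is.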
To increase the size of individual clusters and thereby minimize the number of clusters, the longest sequence is usually chosen as the anchor sequence (with few exceptions regarding preferred species and group coverage).
When a large dataset of proteins is clustered, the composition of the individual clusters will vary depending on the chosen anchor sequence. All other members of a cluster have been included based on their similarity to the anchor sequence. This means that for a dataset clustered at a 90% similarity level, all members of every cluster will be at least 90% identical to the anchor sequence of the individual cluster. However, the individual members may have less than 90% identity with each other. A rather extreme example of this is shown in Table 16.1, “Members of a cluster do not have to be similar - they only have to be similar to the anchor”, where two members share no similarity at all.
Table 16.1. Members of a cluster do not have to be similar - they only have to be similar to the anchor
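The effect of anchor-based membership can be illustrated with a minimal greedy sketch in Python. The procedure shown (take the longest unclustered sequence as anchor, then add every remaining sequence that meets the threshold against that anchor) is an assumed simplification and not the algorithm used by ProteinCenter; the ungapped identity measure, the function names, and the example sequences are likewise hypothetical.

```python
from typing import Dict, List


def identity_percent(a: str, b: str) -> float:
    """Best ungapped identity, relative to the shorter sequence (simplified)."""
    short, long_ = sorted((a, b), key=len)
    best = 0
    for offset in range(len(long_) - len(short) + 1):
        window = long_[offset:offset + len(short)]
        best = max(best, sum(x == y for x, y in zip(short, window)))
    return 100.0 * best / len(short)


def cluster_by_anchor(seqs: Dict[str, str], threshold: float) -> Dict[str, List[str]]:
    """Greedy clustering: members only need to be similar to their anchor."""
    remaining = sorted(seqs, key=lambda k: len(seqs[k]), reverse=True)
    clusters: Dict[str, List[str]] = {}
    while remaining:
        anchor = remaining.pop(0)  # longest sequence not yet clustered
        members = [
            k for k in remaining
            if identity_percent(seqs[anchor], seqs[k]) >= threshold
        ]
        clusters[anchor] = [anchor] + members
        remaining = [k for k in remaining if k not in members]
    return clusters


# Illustrative sequences: two fragments covering different ends of the anchor.
seqs = {"full": "MVAKLEQLIA", "n_frag": "MVAKL", "c_frag": "EQLIA"}
clusters = cluster_by_anchor(seqs, threshold=90.0)
between_members = identity_percent(seqs["n_frag"], seqs["c_frag"])
```

In this example both fragments are 100% identical to the anchor and therefore join its cluster, although they have no identical positions in common with each other.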
When a dataset contains experimental data in the form of peptides, it is possible to group those proteins that share a certain number of peptides. For this type of clustering, the clustering level is defined as the least number of peptides a protein must have in common with the rest of the group as a whole (not with any other member in particular) to be included. Moreover, any experimental data that refer to the same protein will be grouped regardless of the number of peptides they share (although they will usually share the majority of their related peptides).
The most eligible candidate for an anchor in a peptide sharing group, is the protein that has the most peptides in common with the rest of the members of the group.
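The two rules above (membership by peptides shared with the group as a whole, and anchor choice by peptides shared with the remaining members) can be sketched as follows. The greedy growth strategy, the function names, and the example data are assumptions made for illustration; the actual ProteinCenter procedure may differ.

```python
from typing import Dict, List, Set


def group_by_shared_peptides(peptides: Dict[str, Set[str]], level: int) -> List[List[str]]:
    """Grow groups greedily: a protein joins if it shares at least `level`
    peptides with the pooled peptides of the group as a whole."""
    ungrouped = list(peptides)
    groups: List[List[str]] = []
    while ungrouped:
        group = [ungrouped.pop(0)]
        pool = set(peptides[group[0]])
        grew = True
        while grew:  # repeat until no further protein qualifies
            grew = False
            for name in list(ungrouped):
                if len(peptides[name] & pool) >= level:
                    group.append(name)
                    pool |= peptides[name]
                    ungrouped.remove(name)
                    grew = True
        groups.append(group)
    return groups


def pick_anchor(group: List[str], peptides: Dict[str, Set[str]]) -> str:
    """Choose the member sharing the most peptides with the rest of the group."""
    def shared_with_rest(name: str) -> int:
        rest = set().union(*(peptides[o] for o in group if o != name))
        return len(peptides[name] & rest)
    return max(group, key=shared_with_rest)


# Illustrative peptide assignments only.
peptides = {
    "A": {"p1", "p2", "p3"},
    "B": {"p2", "p3", "p4"},
    "C": {"p4", "p5"},
    "D": {"p9"},
}
groups_level_2 = group_by_shared_peptides(peptides, level=2)
groups_level_1 = group_by_shared_peptides(peptides, level=1)
anchor = pick_anchor(groups_level_1[0], peptides)
```

At level 1, protein C joins the group through the peptide it shares with B, even though it shares nothing with A; this reflects that the comparison is made against the group as a whole.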
Another form of peptide sharing clustering is to group proteins that cannot be distinguished based on the peptide evidence they share. Any (unmodified) peptide sequence matches that were not identified as peptides are deemed "possible peptide matches" (assuming the original peptide assignment was arbitrary), and are also considered in the algorithm. The protein with the most peptides becomes the anchor of a group, and any other member will have all of its peptides in common with this anchor. If several proteins have equally many peptides, the tie breakers for the anchor decision are, in order: the number of assigned peptides (not including "possible peptide matches"), and finally the sequence length. Each member's similarity is the number of peptides it has.
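A minimal Python sketch of grouping indistinguishable proteins is given below. It assumes that each protein record already lists all of its peptides (including "possible peptide matches"), and that a longer sequence wins the final tie break; the direction of that tie break, the record layout, and the function name are assumptions, not documented ProteinCenter behavior.

```python
from typing import Dict


def group_indistinguishable(proteins: Dict[str, dict]) -> Dict[str, Dict[str, int]]:
    """Group proteins whose peptides are all contained in an anchor's peptides.

    Returns {anchor: {member: similarity}}, where the similarity is the
    number of peptides the member has.
    """
    # Anchor candidates in priority order: most peptides, then most assigned
    # peptides, then sequence length (longer first, an assumption).
    order = sorted(
        proteins,
        key=lambda n: (
            len(proteins[n]["peptides"]),
            proteins[n]["assigned"],
            proteins[n]["length"],
        ),
        reverse=True,
    )
    grouped = set()
    groups: Dict[str, Dict[str, int]] = {}
    for anchor in order:
        if anchor in grouped:
            continue
        anchor_peptides = proteins[anchor]["peptides"]
        # A member must have all of its peptides in common with the anchor.
        members = [
            n for n in order
            if n not in grouped and proteins[n]["peptides"] <= anchor_peptides
        ]
        grouped.update(members)
        groups[anchor] = {n: len(proteins[n]["peptides"]) for n in members}
    return groups


# Illustrative records only.
proteins = {
    "X": {"peptides": {"p1", "p2", "p3"}, "assigned": 3, "length": 500},
    "Y": {"peptides": {"p1", "p2"}, "assigned": 2, "length": 300},
    "Z": {"peptides": {"p3", "p4"}, "assigned": 2, "length": 200},
}
groups = group_indistinguishable(proteins)
```

Here Y cannot be distinguished from X because all of its peptides also match X, whereas Z carries a peptide of its own and therefore forms a separate group.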
Different sequence similarity levels may be used, depending on the purpose of an analysis. Some rough guidelines may be used to categorize the biological significance of different similarity levels (see Table 16.2, “Types of proteins that are grouped, based on clustering level”). Keep in mind that there will always be exceptions to these general categorizations.
To get more precise results, analysis/inspection of individual clusters must be undertaken, but it is certainly possible to get biologically meaningful results using the automated clustering. For example, a rough distribution of sub-cellular localization of proteins in the dataset may be obtained by clustering a dataset at 98% similarity, and considering each cluster to roughly represent a protein. At the 98% similarity level some splicing variants may end up in the same cluster, but this effect is most often much smaller than the effect of redundancy in the dataset when different alleles (same protein) are counted multiple times (unless the resolution of the analysis was at the allelic level).
Table 16.2. Types of proteins that are grouped, based on clustering level
|Similarity level||Types of proteins grouped|
Native & processed sequences
Fragment, longer fragments & full length sequences
Certain splicing isoforms (e.g. exon skipping)
Alleles (including allelic fragments)
Highly similar proteins
Proteins with sequencing errors
Closely related paralogous proteins
Orthologous proteins from closely related species
Homologous proteins - closer to the level of protein families
The complexity of proteomics data analysis is (in addition to the complex nature of the biology of the cell) increased by search databases containing a mix of different alleles, splice variants, fragments, sequencing errors and predicted sequences. This makes it hard to distinguish which proteins correspond to the same isoform, or even the same gene. Therefore, a set of 100 protein keys rarely corresponds to 100 distinct proteins, and even the simplest statistics (e.g. counting the number of proteins with a certain feature) become difficult. Secondly, the annotation needed is often not comprehensive, leading to false negatives: proteins are not considered part of the interesting subset, when in fact they are.
Clustering a set of proteins results in at most as many clusters as there are proteins in the original set, and usually far fewer. The degree of reduction depends on the level at which the data are clustered. Hence, by analyzing the annotation for clusters rather than proteins, it is much faster to browse through the dataset. Finally, similar proteins are always analyzed together, so the same conclusion need not be drawn multiple times.
By clustering similar proteins, the annotation of these proteins is significantly enhanced, especially with respect to their function. Clusters containing broader categories of proteins may mix different annotations, which is generally not useful. Such broad groups are rarely created for annotation analysis, but are useful if you want to group a particular irrelevant type of protein in order to disregard it (e.g. contaminants). Therefore, used in the right context, clustering may provide a strong way of enhancing annotation.
Often a set of proteins is highly redundant, due to the presence of closely related proteins. One widespread example is datasets derived from MS experiments on more complex samples, where hundreds or even thousands of protein identifications may contain many closely related isoforms that could not be distinguished from each other. This redundancy must be dealt with to enable further analysis; otherwise it does not make much sense to ask which fraction of proteins has a particular characteristic (sub-cellular localization, etc.). Clustering deals with the issue that different protein keys may represent closely related protein isoforms, or even different fragments of the same isoform. Closely related proteins are grouped together, and fragments are grouped together with longer (possibly full length) isoforms. Subsequent analysis at the cluster level gives more reliable distributions of e.g. sub-cellular localization.
A set of protein identifications rarely represents a clean dataset without redundancy. This is due to the amount of work that needs to be put into the bookkeeping process when it is not automated as in ProteinCenter. Even when a few clean datasets have been created, it is not trivial to compare these datasets without ProteinCenter, since there is no guarantee that the same identifiers are used for the same proteins. Also, two proteins in two different datasets may represent two alleles of the same protein, without this being apparent from identifiers or descriptions (keep in mind that even the same protein may have widely different descriptions associated with the various accession keys). See more about comparison of datasets in Chapter 19, Dataset comparison.
Clustering at lower similarity levels may be used to isolate proteins of same or similar protein families for detailed analysis.
To get an overview of a large dataset, it may also help to group proteins into very large clusters to understand the nature of the groups of proteins. Then one may drill down, either inside individual clusters, or by re-clustering the complete dataset. ProteinCenter enables both approaches.
Alternatively, this may be a way to isolate and remove a large group of proteins, which are deemed irrelevant to the assay at hand (e.g. proteins originating from contamination).
At high sequence similarity levels, alleles, fragments and isoforms are grouped together. This enables detailed analysis of which isoforms have been detected, or simply deals with the more basic redundancy issues mentioned above.
Finally, clustering is a clear advantage when comparing datasets. Even though ProteinCenter integrates every protein record with the same sequence into one record, the overlap found when comparing two datasets will still be much smaller than when fragments, alleles, etc. are dealt with. Hence, comparing clusters from different datasets is much more useful, since this ensures a more comprehensive comparison, and allows the user to decide when proteins are considered the same. This makes it possible to compare datasets at different levels of granularity, be it fragments, alleles, or even 95% sequence identical proteins.
Clustering proteins that share identified peptides significantly eases the analysis of experimental data, and establishes the basis for decisions on which proteins have been identified. By using either of the peptide sharing clustering algorithms, protein hits that are indiscernible based on their peptide evidence can be grouped.
This section gives a step-by-step introduction to clustering of datasets.
Select a dataset for analysis, and go to the Cluster pane:
Then follow these simple steps:
Pick a clustering algorithm (see Section 16.6.6, “Different clustering algorithms” for descriptions of individual algorithms).
Choose the clustering level at which to cluster sequences, depending on the type of clustering algorithm:
Sequence similarity percentage - Any level in the range 1-100% identity can be applied (60-100% for large datasets, see above). See Table 16.2, “Types of proteins that are grouped, based on clustering level” for the meaning of different sequence similarity based clustering levels.
Shared peptides - The number of peptides that should be shared by grouped proteins.
Set additional parameters for the chosen type of clustering (e.g. the preferred species of the anchor).
Press to cluster all proteins. Any previous clustering will be abandoned before applying the algorithm.
If proteins have already been clustered, this button will uncluster all proteins.
Use this to cluster only proteins that are not already clustered.
The result is shown as a set of clusters in the clusters view, where each line corresponds to a cluster as described above.
It is possible to cluster different parts of a dataset with different clustering level criteria. This can be done by using the functionality that allows clustering of previously unclustered proteins, combined with the functionality to ungroup members from clusters (either for individual proteins, or for complete clusters).
Pick a clustering level and cluster all proteins.
Optionally uncluster specific clusters, or specific proteins (functions 4 and 23 in Figure 16.1, “The Clusters view”).
Pick another clustering level (1), and cluster all proteins again with the button.
If a dataset consists of proteins from a range of species, it is possible to specify a particular species that should be favored in the selection of anchor sequence. This way, you may ensure that all or most anchor proteins are from the species of interest. To use a preferred species select one from the, before selecting . Using this option provides a constraint on the default choice of anchor, and results in the anchor being the longest sequence from the chosen species.
It is possible to import datasets that are already clustered, e.g. groups of proteins that are grouped based on MS evidence.
The clustering will automatically appear when a pre-clustered dataset is selected.
Imported clusters may be used in many different types of analyses, but where multiple datasets are combined (e.g. to do a comparative analysis), a clustering performed in ProteinCenter is required.
Several different clustering algorithms are available, but for most uses the default () will suffice. Sometimes a protein can belong to more than one group, and it is in these cases that the algorithms differ the most, according to the criteria described below. In all cases the anchor is the protein with the longest sequence, unless another protein that is only 1 or 2 amino acids shorter creates a better grouping according to the grouping criteria. This way all non-anchor proteins are always at least as similar to their respective anchor proteins as the given similarity threshold requires.
- Proteins are clustered to make the individual groups as homogeneous as possible, while adhering to the specified minimum sequence similarity. Each member's similarity is the sequence similarity it shares with the cluster anchor.
- Proteins are clustered to create the largest possible groups, while adhering to the specified minimum sequence similarity. Each member's similarity is the sequence similarity it shares with the cluster anchor.
- The algorithm, with the added requirement that anchors have to be from a specific species.
- Proteins are clustered according to the number of peptides they share (clustering level (1)). Each member's similarity is the number of peptides it shares with the rest of the group members.
- Proteins that cannot be distinguished based on the peptide evidence they share are clustered. The protein with the most peptides becomes the anchor of a group. Each member's similarity is the number of peptides it has.
© 2005-2017 Thermo Fisher Scientific