ProteinCenter User Manual

Chapter 11. Filters

Table of Contents

11.1. Introduction to filters
11.2. How to use filters
11.3. Categories of filters
11.3.1. Accession Key
11.3.2. Alternatively Spliced Gene
11.3.3. Cluster Anchor
11.3.4. Cluster Size
11.3.5. Disease
11.3.6. Enzyme
11.3.7. Gene Id
11.3.8. Alternative Gene Key
11.3.9. Gene Symbols
11.3.10. Gene Official Symbol
11.3.11. Gene Summary
11.3.12. Chromosome
11.3.13. GO Cellular Component
11.3.14. GO Molecular Function
11.3.15. GO Biological Process
11.3.16. GO ID
11.3.17. GO Source
11.3.18. GO Evidence Code
11.3.19. InterPro Accession
11.3.20. InterPro Description
11.3.21. Interaction Source
11.3.22. Outdated
11.3.23. Pathway
11.3.24. Peptide Modification
11.3.25. Peptide Modification (Unannotated)
11.3.26. Peptide Length
11.3.27. Peptide Sequence
11.3.28. Peptide Unique Count
11.3.29. PFAM Accession
11.3.30. PFAM Description
11.3.31. Post Translational Modification
11.3.32. Protein Description
11.3.33. Protein Function
11.3.34. Protein Keyword
11.3.35. Protein Sequence
11.3.36. Protein Sequence Length
11.3.37. Signal
11.3.38. Source Database
11.3.39. Taxonomy
11.3.40. Transmembrane Count
11.3.41. Supplementary data

Putting a long list of proteins into a biological context is often a matter of identifying subsets of proteins that are associated with a specific function, a certain cellular compartment, a particular disease, etc. The filter functionality in ProteinCenter provides an easy way of drilling down to different subsets of data based on such criteria. Different types of protein filters can be set on different kinds of views. ProteinCenter defines five views: Peptides, Experimental Data, Proteins, Genes and Clusters. Each of these views has its own filters, contained in the filter bar shown at the top of the view.

For more details on how to use filters, please refer to Section 11.2, “How to use filters”.

11.1. Introduction to filters

Filtering allows the user to exclude or isolate subsets of data by applying one or more filters to a dataset. This ensures that analysis time is spent on the relevant data, while the remaining data stay hidden (without being removed from the dataset).

The filters can be applied at five different levels - Peptides, Experimental Data, Proteins, Genes and Clusters.

  1. The peptide level: The individual peptides (considering modifications) are filtered. When applying the filters, peptides that do not fulfill the filter criteria are hidden.

  2. The experimental data and protein level: The individual experimental data entries or proteins are filtered. When applying the filters, entries that do not fulfill the filter criteria are hidden.

  3. The gene level: The individual genes are filtered. When applying the filters, genes that do not fulfill the filter criteria are hidden.

  4. The protein cluster level: The complete clusters are filtered. When applying the filters, clusters where none of the members fulfill the filter criteria are hidden. Hence, a cluster of, e.g., hundreds of proteins will be included if just a single member fulfills the criteria.

Multiple filters may be combined, analogously to making complex queries. The implicit operator between filter declarations is logical AND, i.e. the data must fulfill all criteria to pass the combined filter. For instance, one may filter to only see membrane proteins from IPI above a certain length in the particular dataset, by applying all 3 filters simultaneously.
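The implicit AND between filter definitions can be illustrated with a minimal Python sketch. This is not ProteinCenter code: the `Protein` record, the `apply_filters` helper and the accession values are hypothetical and only mirror the membrane / IPI / length example above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Protein:
    # Hypothetical record; field names are illustrative only.
    accession: str
    source: str               # e.g. "IPI" or "SP"
    go_components: List[str]  # GO Cellular Component terms
    length: int               # protein sequence length


Predicate = Callable[[Protein], bool]


def apply_filters(proteins: Iterable[Protein], filters: List[Predicate]) -> List[Protein]:
    """Return the proteins that satisfy every filter (implicit logical AND)."""
    return [p for p in proteins if all(f(p) for f in filters)]


# Made-up example data.
proteins = [
    Protein("IPI00000001", "IPI", ["membrane"], 850),
    Protein("IPI00000002", "IPI", ["nucleus"], 900),
    Protein("P12345", "SP", ["membrane"], 1200),
    Protein("IPI00000003", "IPI", ["membrane"], 120),
]

filters = [
    lambda p: "membrane" in p.go_components,  # GO Cellular Component
    lambda p: p.source == "IPI",              # Source Database
    lambda p: p.length > 500,                 # Protein Sequence Length
]

visible = apply_filters(proteins, filters)
print([p.accession for p in visible])
```

Only the first protein passes all three filters. The original list is left untouched, which mirrors the fact that filtered data are hidden rather than removed from the dataset.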

Applying filters at the level of clusters ensures a more comprehensive filtering, since one entry may be annotated with a feature that is not annotated in a closely related protein.
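As a rough illustration of the cluster-level rule (a cluster stays visible if at least one member fulfills the criteria), consider the following Python sketch. The dictionary layout, the `filter_clusters` helper and the example data are assumptions made for this illustration, not part of ProteinCenter.

```python
from typing import Callable, Dict, List

Predicate = Callable[[dict], bool]


def filter_clusters(clusters: Dict[str, List[dict]], predicate: Predicate) -> List[str]:
    """Return the ids of clusters in which at least one member passes the predicate."""
    return [
        cluster_id
        for cluster_id, members in clusters.items()
        if any(predicate(member) for member in members)
    ]


# Made-up example: protein B has no disease annotation, but shares a cluster with A.
clusters = {
    "cluster_1": [
        {"id": "A", "disease": "Alzheimer disease"},
        {"id": "B", "disease": None},
    ],
    "cluster_2": [
        {"id": "C", "disease": None},
    ],
}


def has_disease(member: dict) -> bool:
    return member["disease"] is not None


visible_clusters = filter_clusters(clusters, has_disease)
print(visible_clusters)
```

Filtering at the protein level would hide protein B, whereas filtering at the cluster level keeps it visible together with its annotated relative A.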

11.2. How to use filters

Figure 11.1. The Filter menu


  1. Turn filters on or off. The icon is grey if filters are disabled. When filters are enabled, the icon is green, while an inverted filter will render it yellow.

  2. A status bar showing the number of selected filters. Use mouse-over to see more details on the filters configured.

  3. Edit filter definition - Opens a menu where filters can be defined (see the expanded view of this section below)

  4. Invert filter - Inverts the selection specified by the filter. The filter icon turns yellow.

  5. Disable all filter definitions - the simple way to clear/reset all filter definitions.

Upon clicking "Edit filter definition" (marked 3 in the figure above), the following menu is shown:

Figure 11.2. The Filter definition menu


  1. Drop-down menu with the categories of filters, described below in Section 11.3, “Categories of filters”

  2. Drop-down menu with filter operators. The list of operators depends on the type of filter selected (e.g. some apply to text, while others apply only to numbers).

    • one of - data that match any of the values for a given category

    • all of - data that match all values for a given category

    • none of - data that match none of the values for a given category

    • like - data that partially match a textual value. For example, 'mitoch' will match everything containing mitochondria, mitochondrial, etc.

    • = - data that match one specific value exactly

    • > - data that are larger than a specific value

    • < - data that are less than a specific value

    • >= - data that are equal to or larger than a specific value

    • <= - data that are equal to or less than a specific value

  3. A text box for specifying values for a given filter and operator. Each value must be written on a separate line. For filters with a tractable number of possible values, this box already contains all possible values for the filter.

  4. Save a filter definition for a given filter category (i.e. the filter is applied to the data).

  5. Disable this particular filter definition.

  6. Reset to default values - this will reset the input field for filters with a tractable number of possible values.
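The operator semantics listed under item 2 can be summarized in a small Python sketch. It is only an illustration under stated assumptions: the names `SET_OPERATORS`, `NUMERIC_OPERATORS`, `parse_values`, `matches` and `matches_number` are hypothetical, and treating 'like' as a case-insensitive substring match is an assumption, since the manual only states that it matches partially.

```python
import operator
from typing import List, Set

# Text operators: `have` is the set of annotation values on a record,
# `want` is the list of values entered in the text box.
SET_OPERATORS = {
    "one of": lambda have, want: any(w in have for w in want),
    "all of": lambda have, want: all(w in have for w in want),
    "none of": lambda have, want: not any(w in have for w in want),
    # Assumption: 'like' is a case-insensitive substring match.
    "like": lambda have, want: any(w.lower() in h.lower() for h in have for w in want),
}

NUMERIC_OPERATORS = {
    "=": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def parse_values(text_box: str) -> List[str]:
    """Split the text box into values, one per line, ignoring blank lines."""
    return [line.strip() for line in text_box.splitlines() if line.strip()]


def matches(annotations: Set[str], op: str, text_box: str) -> bool:
    """Evaluate a text operator against the annotation values of one record."""
    return SET_OPERATORS[op](annotations, parse_values(text_box))


def matches_number(value: float, op: str, threshold: float) -> bool:
    """Evaluate a numeric operator against a single number."""
    return NUMERIC_OPERATORS[op](value, threshold)


print(matches({"mitochondrial matrix"}, "like", "mitoch"))
print(matches({"nucleus", "cytoplasm"}, "all of", "nucleus\nmembrane"))
print(matches_number(500, ">=", 500))
```

Under these assumptions, inverting a filter corresponds to negating the result of `matches` or `matches_number`.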

11.3. Categories of filters

In this section the different categories of filters are described along with some measurements of the comprehensiveness of the annotation.

For all categories of annotation, various approaches have been used to make the annotation coverage as comprehensive as possible. This allows for an efficient analysis of datasets of thousands of proteins.

11.3.1. Accession Key

Protein keys may be used to filter for specific proteins. The function is analogous to the lookup of protein keys (see Section 7.2, “Lookup by peptide”), but it allows you to include a list of protein keys for increased flexibility. The basic Accession Key filter considers both primary and secondary accessions, the Accession Key (Imported) filter only looks at the imported accession, while the Accession Key (Preferred) filter operates solely on the preferred accession for the proteins.

Although this functionality can be used to compare a dataset to a smaller set of proteins, this can be done in a more optimal way in the comparison module - please refer to Chapter 19, Dataset comparison.

11.3.2. Alternatively Spliced Gene

The splice variants filter allows the user to isolate those entries that arise from genes known to be alternatively spliced.

This may help in cases where further analysis is needed to establish which isoforms have been identified.

Despite the fact that the annotation of splice variants derived from Entrez Gene & UniProt is not comprehensive, the application of this filter on a clustered set of data is a very powerful way to identify and evaluate cases of alternative splicing in the dataset.

When filtering on proteins, the operator is always "=" and the input must be a boolean value, e.g. "yes" or "no".

When filtering on clusters, more operators are available and input must be a number, e.g. select operator ">=" and input "2" to show clusters containing two or more proteins that arise from genes known to be alternatively spliced.
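The difference between the two levels can be sketched in Python as follows. The helper names and the boolean flags are hypothetical; the sketch only assumes that each protein carries a yes/no flag saying whether its gene is known to be alternatively spliced.

```python
from typing import List


def protein_passes(is_alternatively_spliced: bool, value: str) -> bool:
    """Protein level: the operator is always '=', the value is 'yes' or 'no'."""
    wanted = value.strip().lower() == "yes"
    return is_alternatively_spliced == wanted


def spliced_member_count(members: List[bool]) -> int:
    """Cluster level: count the members that arise from alternatively spliced genes."""
    return sum(1 for flag in members if flag)


# Made-up cluster with three proteins, two of them from alternatively spliced genes.
cluster = [True, False, True]

# Operator '>=' with input '2', as in the example above.
cluster_passes = spliced_member_count(cluster) >= 2
print(cluster_passes)
```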

11.3.3. Cluster Anchor

Match all proteins that are the anchor of a cluster in the dataset.

11.3.4. Cluster Size

Match clusters by their size, i.e. the number of proteins in the cluster.

11.3.5. Disease

The disease category is a consolidation of the disease annotation from UniProt.

This annotation allows you to filter for proteins associated with one or more specific diseases. However, be aware that only a small fraction of proteins carry disease annotation. Here is an example:

Applying this filter returns all proteins in the dataset which have a disease annotation containing the word 'Alzheimer'.

Using the 'like' operator and setting the value to "." (just the punctuation) will bring up all proteins in the dataset annotated with some sort of disease annotation.

Note that the * in front of Disease indicates that this filter is selected.

11.3.6. Enzyme

This annotation allows for filtering proteins involved in enzymatic processes. The Enzyme filter identifies a particular enzyme based on its IUBMB (International Union of Biochemistry and Molecular Biology) accepted name. The Enzyme Code Tree filter uses the IUBMB Enzyme Code Nomenclature (EC:X.X.X.X), and matches any enzyme of the same or a more specific kind, i.e. any enzyme downwards in the EC hierarchy.

Specific enzymes, or groups of enzymes, may be filtered as follows:

Use the single dot expression to show all proteins with enzyme annotation:

Conversely, inverting the filter above will filter out all enzymes from the data view.
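The "same or more specific" behaviour of the Enzyme Code Tree filter can be pictured as a prefix comparison on the levels of the EC number. The Python sketch below is an assumption about how such matching could work, not the actual ProteinCenter implementation; `ec_levels` and `ec_tree_match` are hypothetical names.

```python
from typing import List


def ec_levels(code: str) -> List[str]:
    """Split an EC code such as 'EC:3.4.21.5' into its levels.

    An optional 'EC:' prefix is removed and unspecified levels
    written as '-' (e.g. '3.4.-.-') are dropped.
    """
    code = code.strip()
    if code.upper().startswith("EC:"):
        code = code[3:]
    return [part for part in code.split(".") if part and part != "-"]


def ec_tree_match(filter_code: str, enzyme_code: str) -> bool:
    """True if enzyme_code equals filter_code or lies below it in the EC hierarchy."""
    wanted = ec_levels(filter_code)
    have = ec_levels(enzyme_code)
    return have[: len(wanted)] == wanted


print(ec_tree_match("EC:3.4", "EC:3.4.21.5"))       # more specific enzyme
print(ec_tree_match("EC:3.4.21.5", "EC:3.4.21.4"))  # sibling enzyme
```

The levels are compared one by one rather than as a plain string prefix, so that a filter on "3.4" does not accidentally match "3.40.1.1".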

11.3.7. Gene Id

Match all row objects with specific primary gene ID.

11.3.8. Alternative Gene Key

Match all row objects with specific alternative gene key (alternative gene keys are listed on the 2nd line on the ProteinCard).

11.3.9. Gene Symbols

Match all row objects with specific primary or alternative gene key.

11.3.10. Gene Official Symbol

Match all row objects with specific Entrez Gene official symbol.

11.3.11. Gene Summary

Match all row objects with specific text in the gene summary field.

11.3.12. Chromosome

Match all row objects with specific chromosome identifier (number or name).

11.3.13. GO Cellular Component

Cellular Component is one of the three basic types of Gene Ontology terms. A GOslim defined specifically for ProteinCenter reduces the many GO terms of type 'Cellular Component' to a manageable set of approximately 20 high-level terms that can be used for filtering. Details of the GOslim are described in Appendix D, GOslim Categories.

Click the reset button to set the input field to all possible Cellular Component terms in current dataset. Then remove the ones that are not of interest.

11.3.14. GO Molecular Function

Molecular Function is the second of the Gene Ontology basic types. As with Cellular Component, a GOslim reduces the many different 'Molecular Function' terms to a manageable set of approximately 20 high-level terms that can be used for filtering. Details of the GOslim are described in Appendix D, GOslim Categories.

Click the reset button to set the input field to all possible Molecular Function terms in current dataset. Then remove the ones that are not of interest.

11.3.15. GO Biological Process

Biological Process is the third basic type of Gene Ontology terms. Please refer to Appendix D, GOslim Categories for details.

Click the reset button to set the input field to all possible Biological Process terms in current dataset. Then remove the ones that are not of interest.

11.3.16. GO ID

It is possible to filter by one or more GO IDs, thereby setting up very specific filters.

To find interesting GO IDs, or sets of GO IDs, use the tools available through the Gene Ontology website at http://www.geneontology.org/GO.tools.shtml

For an example see http://www.godatabase.org/cgi-bin/amigo/go.cgi

11.3.17. GO Source

GO based filters can be augmented with this filter to only consider GO annotations from certain sources. This can be relevant when certain types of GO information are considered uninteresting or unreliable.

Click the reset button to set the input field to all possible sources represented in the current dataset. The input field is automatically reset in this way, when the filter is initially enabled, serving as a basis for a new filter specification.

11.3.18. GO Evidence Code

GO based filters can be augmented with this filter to only consider GO annotations with certain evidence codes. This can be relevant when certain types of GO information are considered uninteresting or unreliable.

Click the reset button to set the input field to all possible evidence codes represented in the current dataset. The input field is automatically reset in this way, when the filter is initially enabled, serving as a basis for a new filter specification.

11.3.19. InterPro Accession

Filtering can be done by InterPro accessions.

Individual accessions and sets of accessions of interest can be found here: http://www.ebi.ac.uk/interpro/

11.3.20. InterPro Description

Filtering can be done by InterPro descriptions.

Individual descriptions and sets of descriptions of interest can be found here: http://www.ebi.ac.uk/interpro/

11.3.21. Interaction Source

Interactions from IntAct, MIPS and STRING are annotated to proteins in ProteinCenter.

Choose the following to filter for proteins with annotated interactions:

Click the reset button to set the input field to all Interaction sources in current dataset, and edit these values.

11.3.22. Outdated

Outdated entries (i.e. entries where all accession keys associated with the given protein are obsolete) can be isolated by setting the outdated filter to 'true'.

Setting the filter to 'false' will only display live entries.

Keep in mind that a protein is only outdated if all protein keys referring to this protein are obsolete. For more details on outdated entries see Section 10.2.1.1, “Flagging of outdated protein keys in ProteinCenter”.
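The rule that a protein is outdated only when every one of its keys is obsolete can be expressed as a short Python sketch. The `is_outdated` helper and the key names are hypothetical, and treating a protein without any keys as not outdated is an assumption of this sketch.

```python
from typing import Dict


def is_outdated(key_is_obsolete: Dict[str, bool]) -> bool:
    """A protein is outdated only if all of its protein keys are obsolete.

    Assumption: a protein with no keys at all is treated as not outdated.
    """
    return bool(key_is_obsolete) and all(key_is_obsolete.values())


# Made-up keys: one live key is enough to keep the protein live.
partly_obsolete = {"KEY_A": True, "KEY_B": False}
fully_obsolete = {"KEY_C": True, "KEY_D": True}

print(is_outdated(partly_obsolete))
print(is_outdated(fully_obsolete))
```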

11.3.23. Pathway

Pathway annotation in ProteinCenter is retrieved from KEGG and UniProt.

This annotation allows you to filter for proteins associated with one or more specific pathways. UniProt Pathway and KEGG Pathway identify pathways based on their title and description, while KEGG Pathway Id uses the KEGG ID nomenclature (org#####, where org signifies a 3-letter organism abbreviation and ##### signifies a 5-digit integer ID).

Using the 'like' operator and setting the value to "." (just the punctuation) will bring up all proteins in the dataset annotated with some sort of pathway annotation.
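When preparing a list of values for the KEGG Pathway Id filter, it can be useful to check that each value follows the org##### pattern described above. The Python sketch below is a simple format check under the assumption that the organism abbreviation is written in lowercase letters; the helper name is hypothetical and the check says nothing about whether the pathway actually exists.

```python
import re

# Three lowercase letters (organism abbreviation) followed by five digits.
KEGG_PATHWAY_ID = re.compile(r"[a-z]{3}\d{5}")


def looks_like_kegg_pathway_id(value: str) -> bool:
    """Check only the format of a KEGG pathway id, not its existence."""
    return KEGG_PATHWAY_ID.fullmatch(value.strip()) is not None


print(looks_like_kegg_pathway_id("hsa04110"))
print(looks_like_kegg_pathway_id("04110"))
```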

11.3.24. Peptide Modification

Match all proteins found by peptides with a specific modification. Modifications not recognized at import time as a standard UniMod modification are matched by specifying the modification as 'Unknown'.

Figure 11.3. Filtering by peptide modifications


Click 'reset filter' to set the value to all possible values in selected dataset.

11.3.25. Peptide Modification (Unannotated)

Match all proteins found by peptides with a specific unannotated modification, i.e. proteins that already have the equivalent UniProt PTM feature are filtered out. Modifications not recognized at import time as a standard UniMod modification are matched by specifying the modification as 'Unknown'.

Figure 11.4. Filtering by peptide modifications


Click 'reset filter' to set the value to all possible values in selected dataset.

11.3.26. Peptide Length

Match all proteins having peptides with a specific length.

11.3.27. Peptide Sequence

Match all proteins having peptides with specific sequence.

11.3.28. Peptide Unique Count

Match all proteins with a specific number of peptides.

11.3.29. PFAM Accession

Filtering can be done by PFAM accession number. Format: "PF*****", where each '*' is a digit, e.g. PF00051.

More information about PFAM can be found here: http://www.sanger.ac.uk/Software/Pfam/
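A list of accessions can be checked against the "PF" plus five digits format before it is pasted into the filter. The Python sketch below is only a format check with a hypothetical helper name; it does not verify that an accession exists in PFAM.

```python
import re

# "PF" followed by exactly five digits, e.g. PF00051.
PFAM_ACCESSION = re.compile(r"PF\d{5}")


def looks_like_pfam_accession(value: str) -> bool:
    """Check only the format of a PFAM accession."""
    return PFAM_ACCESSION.fullmatch(value.strip()) is not None


candidates = ["PF00051", "PF51", "IPR000001"]
valid = [c for c in candidates if looks_like_pfam_accession(c)]
print(valid)
```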

11.3.30. PFAM Description

Filtering can be done by PFAM descriptions. E.g. description for PFAM accession PF00051:

"Kringle domain. Kringle domains have been found in plasminogen, hepatocyte growth factors, prothrombin, and apolipoprotein A. Structure is disulfide-rich, nearly all-beta."

So entering "Kringle" as input value and using the 'like' operator will match these proteins.

More information about PFAM can be found here: http://www.sanger.ac.uk/Software/Pfam/

11.3.31. Post Translational Modification

This category allows the user to filter proteins with one or more particular types of post translational modifications (PTMs).

For now, only PTM annotations from UniProt are included, so the coverage is limited to what UniProt annotates.

The UniProt documentation on PTM types can be consulted if a particular type of PTM is of interest (e.g. all entries with annotated formylation sites, as here):

Click the reset button to set the input field to all PTMs in the current dataset. Then remove the ones that are not of interest.

Alternatively, the 'like' operator is helpful when searching for a group of PTMs (e.g. phosphorylations, as here):

Or all proteins annotated with PTMs can be found this way:

11.3.32. Protein Description

Using the filter on the protein description allows you to make a text match against the annotated protein description. However, since many protein descriptions are not carefully annotated, this is not a sensitive way of filtering for certain types of proteins.

11.3.33. Protein Function

Match all proteins with specific text in the Protein Function field (see all the functions on the ProteinCard).

11.3.34. Protein Keyword

Match all proteins with specific text in the Protein Keyword fields (see all the keywords on the ProteinCard).

11.3.35. Protein Sequence

Match all proteins with specific amino acids in the Protein sequence.

11.3.36. Protein Sequence Length

Using the filter on the protein sequence length allows the user to match proteins of a certain length, or larger or smaller than a certain length. The idea is to allow analysis of the fraction of large or small proteins in the experimental dataset, e.g. to allow validation of an experimental procedure.
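The kind of check this filter supports, i.e. what fraction of the dataset lies above a given length, can be sketched in a few lines of Python. The helper name, the threshold and the example lengths are made up for illustration.

```python
from typing import List


def fraction_longer_than(lengths: List[int], threshold: int) -> float:
    """Fraction of proteins whose sequence length exceeds the threshold."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n > threshold) / len(lengths)


# Made-up sequence lengths for four proteins.
lengths = [120, 850, 900, 1200]
fraction = fraction_longer_than(lengths, 500)
print(fraction)
```

In ProteinCenter itself, the equivalent figure would come from comparing the number of proteins shown with the '>' filter applied against the total number of proteins in the dataset.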

11.3.37. Signal

To show only proteins predicted to have a signal peptide, set the filter value for Signal to 'true'.

Signal peptides are predicted by PrediSi (see the PrediSi web site).

See relevant papers here - including original release paper and an external evaluation.

11.3.38. Source Database

This filter may be used to analyze proteins (or clusters including proteins) from a specific database.

The sources included in this filter include:

  1. IPI = IPI

  2. SP = Swiss-Prot section of UniProt

  3. TR = TrEMBL section of UniProt

  4. GI = GenBank

  5. ESBL = Ensembl

  6. DBJ = DNA Database of Japan

  7. EMB = EBI EMBL Database

  8. GB = GenPept (GenBank protein) identifier

  9. PDB = Brookhaven Protein Database

  10. PIR = Protein Information Resource International, now part of UniProt

  11. PRF = Protein Research Foundation

  12. REF = NCBI RefSeq - Non-confirmed entry

  13. UNI = UNIPROT identifiers

  14. SGD = SGD (Saccharomyces Genome Database) identifiers

  15. FB = FlyBase identifiers

  16. TAIR = Arabidopsis identifiers

  17. PLASMO = PlasmoDB keys

11.3.39. Taxonomy

The taxonomy filter allows filtering by one or more species, in cases where proteins originate from multiple species.

Click reset filter to set the value to all possible values in selected dataset.

Species names for all species may be found through the NCBI Taxonomy Browser accessible at: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi

11.3.40. Transmembrane Count

To filter proteins predicted (by TMAP) to contain transmembrane domains, specify that proteins should contain > 0 transmembrane domains.

For information about TMAP see http://emboss.sourceforge.net/apps/tmap.html

11.3.41. Supplementary data

Filters based on supplementary data cover all the types defined in Appendix B, Supplementary Data. Only the supplementary data types that are actually represented in the dataset are shown, i.e. if a specific type of supplementary data is missing from the filter selection, the dataset lacks that particular type of data.

11.3.41.1. Textual supplementary data

All textual supplementary data fields are treated as text, and the filters use text matching to include data that match one or more text strings.

11.3.41.2. Numeric supplementary data

All numeric supplementary data fields are treated as numbers, and it is possible to filter by values equal to (=), larger than (>, >=) or less than (<, <=) the value specified as argument.

For each protein there may be multiple numbers for a given field, e.g. when the same protein exists multiple times as experimental data in the same dataset, but with different quantification values. In these cases the filter always searches the maximum values when using the "larger than" (>, >=) type operators, and likewise the minimum values when using the "less than" (<, <=) type operators.
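The handling of multiple values per protein can be illustrated with the Python sketch below. The function name and data are hypothetical. The behaviour for "larger than" and "less than" operators follows the description above, while the handling of "=" (matching if any value is equal) is an assumption, since the manual does not specify it.

```python
import operator
from typing import List

OPS = {
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def passes(values: List[float], op: str, threshold: float) -> bool:
    """Evaluate a numeric filter for a protein with several values in one field."""
    if not values:
        return False
    if op in (">", ">="):
        # "Larger than" operators are evaluated against the maximum value.
        return OPS[op](max(values), threshold)
    if op in ("<", "<="):
        # "Less than" operators are evaluated against the minimum value.
        return OPS[op](min(values), threshold)
    if op == "=":
        # Assumption: '=' matches if any of the values equals the threshold.
        return any(v == threshold for v in values)
    raise ValueError(f"unknown operator: {op}")


# Made-up quantification values for one protein seen twice in the dataset.
ratios = [0.5, 2.0]
print(passes(ratios, ">", 1.0))
print(passes(ratios, "<", 1.0))
```

With these rules the same protein can pass both a "> 1.0" and a "< 1.0" filter, because each operator looks at a different extreme of its values.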