Putting a long list of proteins into a biological context is often a matter of identifying subsets of proteins that are associated with a specific function, a certain cellular compartment, a particular disease, etc. The filter functionality in ProteinCenter provides an easy way of drilling down to different subsets of data based on such criteria. Different types of protein filters can be set on different kinds of views. In ProteinCenter five views are defined: Peptides, Experimental Data, Proteins, Genes and Clusters. Each of these views has its own filters, contained in the filter bar shown at the top of each view.
For more details on how to use filters, please refer to Section 11.2, “How to use filters”.
Filtering allows the user to exclude or isolate subsets of data by applying one or more filters to a dataset. This ensures that the time spent analyzing the data is spent on the relevant data, while the remaining data remains hidden (without being removed from the dataset).
The filters can be applied at five different levels - Peptides, Experimental Data, Proteins, Genes and Clusters.
The peptide level: The individual peptides (considering modifications) are filtered. When applying the filters, peptides that do not fulfill the filter criteria are hidden.
The experimental data and protein level: The individual experimental data entries or proteins are filtered. When applying the filters, experimental data entries or proteins that do not fulfill the filter criteria are hidden.
The gene level: The individual genes are filtered. When applying the filters, genes that do not fulfill the filter criteria are hidden.
The protein cluster level: The complete clusters are filtered. When applying the filters, clusters where none of the members fulfill the filter criteria are hidden. Hence, a cluster of e.g. hundreds of proteins will be included if just a single member fulfills the criteria.
Multiple filters may be combined, analogously to making complex queries. The implicit operator between filter declarations is logical AND, i.e. the data must fulfill all criteria to pass the combined filter. For instance, one may filter to only see membrane proteins from IPI above a certain length in the particular dataset, by applying all 3 filters simultaneously.
Applying filters at the level of clusters ensures a more comprehensive filtering, since one entry may be annotated with a feature that is not annotated in a closely related protein.
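The combination rules described above can be illustrated with a small Python sketch. It is not ProteinCenter's implementation: the `Protein` class, its fields and the example values are hypothetical, and the sketch assumes that a cluster passes when at least one member satisfies the whole combined (AND) filter.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Protein:
    # Hypothetical, simplified protein record used only for illustration.
    key: str
    source: str            # e.g. "IPI" or "SP"
    length: int            # sequence length
    components: List[str]  # cellular component annotations


Predicate = Callable[[Protein], bool]


def passes_all(protein: Protein, filters: List[Predicate]) -> bool:
    """Implicit logical AND: every filter must accept the protein."""
    return all(f(protein) for f in filters)


def filter_proteins(proteins: List[Protein], filters: List[Predicate]) -> List[Protein]:
    """Protein level: hide every protein that fails the combined filter."""
    return [p for p in proteins if passes_all(p, filters)]


def filter_clusters(clusters: List[List[Protein]], filters: List[Predicate]) -> List[List[Protein]]:
    """Cluster level: keep the complete cluster if any single member passes."""
    return [c for c in clusters if any(passes_all(p, filters) for p in c)]


# Three filters applied simultaneously: membrane proteins from IPI above a length.
filters = [
    lambda p: "membrane" in p.components,
    lambda p: p.source == "IPI",
    lambda p: p.length > 300,
]

a = Protein("P1", "IPI", 450, ["membrane"])
b = Protein("P2", "IPI", 120, ["membrane"])
c = Protein("P3", "SP", 500, ["nucleus"])

visible_proteins = filter_proteins([a, b, c], filters)
visible_clusters = filter_clusters([[a, b], [c]], filters)
```

In this sketch only `P1` is visible at the protein level, while at the cluster level the whole cluster containing `P1` and `P2` stays visible because one member fulfills all criteria.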
Turn filters on or off. The icon is grey if filters are disabled. When filters are enabled, the icon is green, while an inverted filter will render it yellow.
A status bar showing the number of selected filters. Use mouse-over to see more details on the filters configured.
Edit filter definition - Opens a menu where filters can be defined (see the expanded view of this section below)
Invert filter - Inverts the selection specified by the filter. The filter icon turns yellow.
Disable all filter definitions - the simple way to clear/reset all filter definitions.
Upon clicking "Edit filter definition" (marked 3 in the figure above), the following menu is shown:
Drop-down menu with the categories of filters, described below in Section 11.3, “Categories of filters”
Drop-down menu with filter operators. The list of operators depends on the type of filter selected (e.g. some apply to text, while others apply only to numbers).
one of - data that match any of the values for a given category
all of - data that match all values for a given category
none of - data that match none of the values for a given category
like - data that partially match a textual value. For example, a single partial term can match everything containing mitochondria, mitochondrial, and similar terms.
= - data that match one specific value exactly
> - data that are larger than a specific value
< - data that are less than a specific value
>= - data that are equal to or larger than a specific value
<= - data that are equal to or less than a specific value
A text box for specifying values for a given filter and operator. Each value must be written on a separate line. For filters with a tractable number of possible values, this box already contains all possible values for the filter.
Save a filter definition for a given filter category (i.e. the filter is applied to the data).
Disable this particular filter definition.
Reset to default values - this will reset the input field for filters with a tractable number of possible values.
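The filter operators listed above can be sketched in Python as follows. This is only an illustration of the described semantics, not the actual implementation. All function names are hypothetical, and the sketch assumes that 'like' behaves as a case-insensitive regular-expression search, which would be consistent with a single dot matching any non-empty annotation.

```python
import operator
import re
from typing import Iterable, Set


def one_of(annotations: Set[str], values: Iterable[str]) -> bool:
    """Match if any of the given values is annotated."""
    return any(v in annotations for v in values)


def all_of(annotations: Set[str], values: Iterable[str]) -> bool:
    """Match only if every given value is annotated."""
    return all(v in annotations for v in values)


def none_of(annotations: Set[str], values: Iterable[str]) -> bool:
    """Match only if none of the given values is annotated."""
    return not any(v in annotations for v in values)


def like(annotations: Set[str], pattern: str) -> bool:
    """Partial text match (assumption: case-insensitive regex search)."""
    rx = re.compile(pattern, re.IGNORECASE)
    return any(rx.search(a) for a in annotations)


# Numeric operators map directly onto the standard comparison functions.
NUMERIC_OPERATORS = {
    "=": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def compare(value: float, op: str, threshold: float) -> bool:
    """Apply one of the numeric operators to a single value."""
    return NUMERIC_OPERATORS[op](value, threshold)


# Hypothetical annotations of one protein.
annotations = {"mitochondrial membrane", "nucleus"}
```

With these hypothetical annotations, `like(annotations, "mitochondri")` matches, and `like(annotations, ".")` matches any protein that has at least one non-empty annotation.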
In this section the different categories of filters are described along with some measurements of the comprehensiveness of the annotation.
For all categories of annotation, various approaches have been used to achieve comprehensive annotation coverage. This allows for an efficient analysis of datasets of thousands of proteins.
Protein keys may be used to filter for specific proteins. The function is analogous to the lookup of protein keys (see Section 7.2, “Lookup by peptide”), but it allows you to include a list of protein keys for increased flexibility. The basic Accession Key filter considers both primary and secondary accessions, the Accession Key (Imported) filter only looks at the imported accession, while the Accession Key (Preferred) filter operates solely on the preferred accession for the proteins.
Although this functionality can be used to compare a dataset to a smaller set of proteins, this can be done more efficiently in the comparison module - please refer to Chapter 19, Dataset comparison.
The splice variants filter allows the user to isolate those entries that arise from genes known to be alternatively spliced.
This may help in cases where further analysis is needed to establish which isoforms have been identified.
Despite the fact that the annotation of splice variants derived from Entrez Gene & UniProt is not comprehensive, the application of this filter on a clustered set of data is a very powerful way to identify and evaluate cases of alternative splicing in the dataset.
When filtering on proteins, the operator is always "=" and the input must be a boolean value, e.g. "yes" or "no".
When filtering on clusters, more operators are available and input must be a number, e.g. select operator ">=" and input "2" to show clusters containing two or more proteins that arise from genes known to be alternatively spliced.
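The difference between the protein-level and cluster-level behavior of this filter can be sketched as below. The function names and the boolean flag per protein are hypothetical, and the sketch assumes that the cluster-level number is the count of members arising from alternatively spliced genes.

```python
import operator
from typing import List

OPERATORS = {
    "=": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def protein_passes(from_spliced_gene: bool, value: str) -> bool:
    """Protein level: operator is always '=' and the input is 'yes' or 'no'."""
    wanted = value.strip().lower() == "yes"
    return from_spliced_gene == wanted


def cluster_passes(member_flags: List[bool], op: str, value: int) -> bool:
    """Cluster level: compare the number of flagged members with the input."""
    count = sum(member_flags)  # True counts as 1, False as 0
    return OPERATORS[op](count, value)
```

For example, `cluster_passes([True, False, True], ">=", 2)` keeps a cluster with two members from alternatively spliced genes.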
The disease category is a consolidation of the disease annotation from UniProt.
This annotation allows you to isolate proteins associated with one or more specific diseases. However, be aware that only a small subset of proteins is annotated. For example, a filter can be set up to return all proteins in the dataset which have a disease annotation containing a given word.
Using the 'like' operator and setting the value to "." (just the punctuation) brings up all proteins in the dataset that have any disease annotation.
Note that the * in front of Disease indicates that this filter is selected.
This annotation allows for filtering proteins involved in enzymatic processes. The Enzyme filter identifies a particular enzyme based on its IUBMB (International Union of Biochemistry and Molecular Biology) accepted name. The Enzyme Code Tree filter uses the IUBMB Enzyme Code Nomenclature (EC:X.X.X.X), and matches any enzyme of the same or a more specific kind, i.e. any enzyme downwards in the EC hierarchy.
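The "same or more specific" matching of the Enzyme Code Tree filter can be pictured as a component-wise prefix test on EC numbers. The following Python sketch is an assumption about how such matching could work, not the product's code; the function name and the handling of `-` placeholders are hypothetical.

```python
from typing import List


def _ec_parts(code: str) -> List[str]:
    """Split an EC number into its specified levels, e.g. 'EC:3.4.-.-' -> ['3', '4']."""
    code = code.strip()
    if code.upper().startswith("EC:"):
        code = code[3:]
    # Drop empty parts and '-' placeholders that stand for unspecified levels.
    return [part for part in code.split(".") if part and part != "-"]


def ec_matches(filter_code: str, enzyme_code: str) -> bool:
    """True if enzyme_code equals filter_code or lies below it in the EC hierarchy."""
    wanted = _ec_parts(filter_code)
    actual = _ec_parts(enzyme_code)
    return len(wanted) <= len(actual) and actual[: len(wanted)] == wanted
```

Comparing level by level, rather than comparing raw strings, avoids treating `EC:3.40.1.1` as a child of `EC:3.4`.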
Specific enzymes, or groups of enzymes, may be filtered as follows:
Use the single dot expression to show all proteins with enzyme annotation:
Conversely, inverting the filter above will filter out all enzymes from the data view.
Match all row objects with a specific alternative gene key (alternative gene keys are listed on the second line of the ProteinCard).
Cellular Component is one of the three basic types of Gene Ontology terms. A GOslim defined specifically for ProteinCenter reduces the many GO terms of type 'Cellular Component' to a manageable set of approximately 20 high-level terms that can be used for filtering. Details of the GOslim are described in Appendix D, GOslim Categories.
Click the reset button to set the input field to all possible Cellular Component terms in the current dataset. Then remove the ones that are not of interest.
Molecular Function is the second of the Gene Ontology basic types. As with Cellular Component, a GOslim reduces the many different 'Molecular Function' terms to a manageable set of approximately 20 high-level terms that can be used for filtering. Details of the GOslim are described in Appendix D, GOslim Categories.
Click the reset button to set the input field to all possible Molecular Function terms in the current dataset. Then remove the ones that are not of interest.
Biological Process is the third basic type of Gene Ontology terms. Please refer to Appendix D, GOslim Categories for details.
Click the reset button to set the input field to all possible Biological Process terms in the current dataset. Then remove the ones that are not of interest.
It is possible to filter by one or more GO IDs, thereby setting up very specific filters.
To find interesting GO IDs, or sets of GO IDs, use the tools available through the Gene Ontology website at http://www.geneontology.org/GO.tools.shtml
For an example see http://www.godatabase.org/cgi-bin/amigo/go.cgi
GO based filters can be augmented with this filter to only consider GO annotations from certain sources. This can be relevant when certain types of GO information are considered uninteresting or unreliable.
Click the reset button to set the input field to all possible sources represented in the current dataset. The input field is automatically reset in this way, when the filter is initially enabled, serving as a basis for a new filter specification.
GO based filters can be augmented with this filter to only consider GO annotations with certain evidence codes. This can be relevant when certain types of GO information are considered uninteresting or unreliable.
Click the reset button to set the input field to all possible evidence codes represented in the current dataset. The input field is automatically reset in this way, when the filter is initially enabled, serving as a basis for a new filter specification.
Filtering can be done by InterPro accessions.
Individual accessions and sets of accessions of interest can be found here: http://www.ebi.ac.uk/interpro/
Filtering can be done by InterPro descriptions.
Individual descriptions and sets of descriptions of interest can be found here: http://www.ebi.ac.uk/interpro/
Choose the following to filter for proteins with annotated interactions:
Click the reset button to set the input field to all Interaction sources in the current dataset, and edit these values.
Outdated entries (i.e. all accession keys associated with the given protein are obsolete) can be isolated by setting the outdated filter to "true".
Setting the filter to 'false' will only display live entries.
Keep in mind that a protein is only outdated if all protein keys referring to this protein are obsolete. For more details on outdated entries see Section 10.2.1.1, “Flagging of outdated protein keys in ProteinCenter”.
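The rule that a protein counts as outdated only when every one of its keys is obsolete can be expressed as a short Python sketch. The function names are hypothetical, and treating a protein without any keys as not outdated is an assumption made for the example.

```python
from typing import Iterable


def is_outdated(key_is_obsolete: Iterable[bool]) -> bool:
    """A protein is outdated only if all of its protein keys are obsolete."""
    flags = list(key_is_obsolete)
    # Assumption: a protein with no keys at all is not considered outdated.
    return bool(flags) and all(flags)


def passes_outdated_filter(key_is_obsolete: Iterable[bool], filter_value: bool) -> bool:
    """filter_value=True shows outdated entries, False shows live entries."""
    return is_outdated(key_is_obsolete) == filter_value
```

A protein with one obsolete and one live key is therefore still treated as live.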
Pathway annotation in ProteinCenter is retrieved from KEGG and UniProt.
This annotation allows you to isolate proteins associated with one or more specific pathways. UniProt Pathway and KEGG Pathway identify pathways based on their title and description, while KEGG Pathway Id uses the KEGG ID nomenclature (org#####, where org signifies a 3-letter organism abbreviation and ##### signifies a 5-digit integer ID).
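When preparing a list of values for the KEGG Pathway Id filter, it can help to check that each ID follows the org##### pattern described above. The Python sketch below is a hypothetical helper, not part of ProteinCenter. It assumes lower-case organism abbreviations of exactly three letters, as stated in this section.

```python
import re
from typing import Tuple

# Three lower-case letters followed by five digits, e.g. "hsa04110".
KEGG_PATHWAY_ID = re.compile(r"([a-z]{3})(\d{5})")


def is_kegg_pathway_id(text: str) -> bool:
    """Check that the text follows the org##### nomenclature."""
    return KEGG_PATHWAY_ID.fullmatch(text.strip()) is not None


def split_kegg_pathway_id(text: str) -> Tuple[str, str]:
    """Return (organism abbreviation, numeric ID) or raise ValueError."""
    match = KEGG_PATHWAY_ID.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a KEGG pathway ID: {text!r}")
    return match.group(1), match.group(2)
```

For instance, `split_kegg_pathway_id("hsa04110")` separates the organism abbreviation `hsa` from the numeric part `04110`.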
Using the 'like' operator and setting the value to "." (just the punctuation) will bring up all proteins in the dataset annotated with some sort of pathway annotation.
Match all proteins found by peptides with a specific modification. Modifications not recognized at import time as a standard UniMod modification are matched by specifying the modification as 'Unknown'.
Click 'reset filter' to set the value to all possible values in the selected dataset.
Match all proteins found by peptides with a specific unannotated modification, i.e. proteins that have the equivalent UniProt PTM feature are filtered out. Modifications not recognized at import time as a standard UniMod modification are matched by specifying the modification as 'Unknown'.
Click 'reset filter' to set the value to all possible values in the selected dataset.
Filtering can be done by PFAM accession number. Format: "PF*****" where each '*' is a digit, e.g. PF00051.
More information about PFAM can be found here: http://www.sanger.ac.uk/Software/Pfam/
Filtering can be done by PFAM descriptions. E.g. description for PFAM accession PF00051:
"Kringle domain. Kringle domains have been found in plasminogen, hepatocyte growth factors, prothrombin, and apolipoprotein A. Structure is disulfide-rich, nearly all-beta."
So entering "Kringle" as the input value and using the 'like' operator will match these proteins.
More information about PFAM can be found here: http://www.sanger.ac.uk/Software/Pfam/
This category allows the user to filter proteins with one or more particular types of post translational modifications (PTMs).
For now, only PTM annotations from UniProt are included, so the annotation coverage is limited accordingly.
The UniProt documentation for types of PTMs can be used if a particular type of PTM is of interest (e.g. all entries with annotated formylation sites, as here):
Click the reset button to set the input field to all PTMs in the current dataset. Then remove the ones that are not of interest.
Alternatively, the 'like' operator will be helpful if searching for a group of PTMs (e.g. phosphorylations, as here):
Or all proteins annotated with PTMs can be found this way:
Using the filter on the protein description allows you to make a text match against the annotated protein description. However, since many protein descriptions are not carefully annotated, this is not a sensitive way of filtering certain types of proteins.
Match all proteins with specific text in the Protein Function field (see all the functions on the ProteinCard).
Match all proteins with specific text in the Protein Keyword fields (see all the keywords on the ProteinCard).
Using the filter on the protein sequence length allows the user to match proteins of a certain length, or larger or smaller than a certain length. The idea is to allow analysis of the fraction of large or small proteins in the experimental dataset, e.g. to allow validation of an experimental procedure.
Filter in all proteins predicted to have a signal peptide by setting the filter value for signal to 'true'.
Signal peptides are predicted by PrediSi (see the PrediSi web site).
See relevant papers here - including original release paper and an external evaluation.
This filter may be used to analyze proteins (or clusters including proteins) from a specific database.
This filter covers the following sources:
IPI = IPI
SP = Swiss-Prot section of UniProt
TR = TrEMBL section of UniProt
GI = GenBank
ESBL = Ensembl
DBJ = DNA Database of Japan
EMB = EBI EMBL Database
GB = GenPept (GenBank protein) identifier
PDB = Brookhaven Protein Database
PIR = Protein Information Resource International, now part of UniProt
PRF = Protein Research Foundation
REF = NCBI RefSeq - Non-confirmed entry
UNI = UniProt identifiers
SGD = SGD (Saccharomyces Genome Database) identifiers
FB = FlyBase identifiers
TAIR = Arabidopsis identifiers
PLASMO = PlasmoDB keys
The taxonomy filter allows filtering by one or more species in cases where proteins originate from multiple species.
Click reset filter to set the value to all possible values in the selected dataset.
Species names for all species may be found through the NCBI Taxonomy Browser accessible at: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi
To filter proteins predicted (by TMAP) to contain transmembrane domains, specify that proteins should contain > 0 transmembrane domains.
For information about TMAP see http://emboss.sourceforge.net/apps/tmap.html
Filters based on supplementary data cover all the types defined in Appendix B, Supplementary Data. Only the supplementary data types that are actually represented in the dataset are shown, i.e. if a specific type of supplementary data is missing from the filter selection, it is because the dataset lacks that particular type of data.
All textual supplementary data fields are treated as text, and the filters function as text matching to include data that matches one or more text strings.
All numeric supplementary data fields are treated as numbers, and it is possible to filter by values equal to (=), larger than (>, >=) or less than (<, <=) the value specified as argument.
For each protein there may be multiple numbers for a given field, e.g. when the same protein exists multiple times as experimental data in the same dataset, but with different quantification values. In these cases the filter always searches the maximum values when using the "larger than" (>, >=) type operators, and likewise the minimum values when using the "less than" (<, <=) type operators.
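The handling of multiple numeric values per protein can be sketched as follows. This is an illustration of the rule described above rather than the actual implementation. The function name is hypothetical, and two points are assumptions: that "=" matches when any of the values equals the argument, and that a protein without values never matches.

```python
import operator
from typing import Sequence

OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def numeric_filter_passes(values: Sequence[float], op: str, argument: float) -> bool:
    """Apply a numeric supplementary data filter to all values of one protein."""
    if not values:
        return False  # assumption: nothing to match against
    if op in (">", ">="):
        # "Larger than" operators are evaluated against the maximum value.
        return OPERATORS[op](max(values), argument)
    if op in ("<", "<="):
        # "Less than" operators are evaluated against the minimum value.
        return OPERATORS[op](min(values), argument)
    if op == "=":
        # Assumption: equality holds if any single value equals the argument.
        return any(v == argument for v in values)
    raise ValueError(f"unsupported operator: {op!r}")
```

A protein quantified as 0.5, 2.0 and 3.5 therefore passes both `> 3.0` (via its maximum) and `< 1.0` (via its minimum).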
© 2005-2017 Thermo Fisher Scientific