ProteinCenter User Manual

Chapter 25. Statistics

Table of Contents

25.1. The Statistics view
25.2. Introduction to the Statistics view
25.2.1. Summary view
25.2.2. Details view
25.3. Types of statistics
25.3.1. Statistics based on all data
25.3.2. Statistics based on selected data
25.3.3. Statistics based on cluster anchors
25.4. How to calculate statistics
25.4.1. How to calculate statistics for a single dataset
25.4.2. How to calculate statistics for comparison datasets
25.4.3. Significance statistics
25.4.4. Significance statistics theory
25.5. How to copy images from ProteinCenter
25.5.1. How to use images from ProteinCenter in MS PowerPoint
25.5.2. How to use images from ProteinCenter in Adobe Photoshop

25.1. The Statistics view

The Statistics view provides a high-level overview of datasets by summarizing essential information into histogram tables and chart representations. It is a good starting point for interpreting the biological meaning of the data.

The Statistics view also enables you to compare quantitatively the characteristics of several datasets with side-by-side comparisons, including identification of over- and under-represented features.

25.2. Introduction to the Statistics view

The Statistics view is split into a specification bar (on the top, framed in yellow in the figure), a summary view (the middle part, framed in green), and a details view (the lower part, framed in red).

Figure 25.1. The Statistics View for a single dataset ('TestSet')

25.2.1. Summary view

The summary view in the upper part of the screen is split into a number of summary sections, for which more detailed information may be displayed in the details view by clicking the header ('General Info', 'Taxonomy', etc.). Each summary section contains one column of statistical data per dataset, and each column header carries the dataset name as its title.

The following list describes the different summary sections:

  1. Select and analysis bar. Click Analyze to show statistics for one or more datasets, with the selected options and datasets. The options in this pane are described in Section 25.4, “How to calculate statistics”.

  2. General Info section - shows the number of proteins, the number of experimental data, the number of peptides, and the number of unique peptide sequences that the statistics are based on.

  3. Taxonomy section - shows the number of different species.

  4. Domains section - shows the number of proteins that have at least one PFAM domain, InterPro domain, transmembrane (TM) domain, or signal peptide.

  5. Annotation section - shows the total number of genes, chromosomes, enzymes, SwissProt keywords, and diseases associated with the dataset.

  6. Sequence Length header. Click header to see distribution of protein sequence lengths (number of amino acids).

  7. Pathways section - shows the total number of KEGG and UniProt pathways that the dataset proteins are involved in. KEGG pathways are only available if the license grants that right.

  8. GO Molecular Functions header. Click header to see summarized Gene Ontology molecular function annotation (see Section 11.3.14, “GO Molecular Function”).

  9. GO Cellular Components header. Click header to see summarized Gene Ontology cellular component annotation (see Section 11.3.13, “GO Cellular Component”).

  10. GO Biological Processes header. Click header to see summarized Gene Ontology biological process annotation (see Section 11.3.15, “GO Biological Process”).

  11. Supplementary Info header. Click header to see statistical values for numerical supplementary data (see Section 25.2.2.3, “Binned histograms and descriptive statistics”).

Clicking the header of any of these sections will display details for that particular group of statistics in a graphical view below. More specific information on the details views for the individual sections is found in the subsections of this chapter.

25.2.2. Details view

The Details view gives more in-depth information about the statistical distribution for one or more datasets. The view is updated when a header is clicked in the summary view described above.

25.2.2.1. Protein comparison statistics

For the General Information page, details are presented as two Venn diagrams and accompanying tables listing the sets, counts and percentages:

Figure 25.2. Details for General Information

  1. Select datasets to compare. Either 2 or 3 datasets can be compared with each other at a time.

  2. Toggle selection for Venn representation. Select Percentage to be able to do a relative comparison between datasets.

  3. A Venn diagram of 3 datasets. The datasets are always named A, B and C. Click a number to see the corresponding subset of proteins, selected proteins or protein anchors. Click the magnifying glass to see the diagram at a higher resolution. The circle sizes are not scaled to the numbers, so a circle is still drawn for a count of zero. The colors of the circles are the same as the dataset colors used in the protein list view.

  4. List of all possible subsets with number of proteins and percentage.

    • A - number of all proteins in A. Here 60% of all proteins are in dataset A (the dataset named: venn1).

    • B - number of all proteins in B

    • C - number of all proteins in C

    • AB - number of proteins A and B have in common

    • AC - number of proteins A and C have in common

    • BC - number of proteins B and C have in common

    • AB-C - number of all proteins in A and B, that are not in C

    • AC-B - number of all proteins in A and C, that are not in B

    • BC-A - number of all proteins in B and C, that are not in A

    • A-BC - number of all proteins only in A

    • B-AC - number of all proteins only in B

    • C-AB - number of all proteins only in C

    • AB-C - number of proteins A and B have in common, that are not in C

    • AC-B - number of proteins A and C have in common, that are not in B

    • BC-A - number of proteins B and C have in common, that are not in A

    • ABC - number of proteins A, B and C have in common

    • ABC - number of all proteins in A, B and C

    Click on the subsets to see the protein list or cluster list.

  5. The peptides section shows exactly the same type of information, but is based on the peptides contained in the comparison.
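
The subset definitions in the list above map naturally onto set operations. The following Python sketch is only an illustration and not ProteinCenter code: the function name, the protein identifiers and the reading of "all proteins in A and B" as a union are assumptions. Percentages are taken relative to all distinct proteins in the comparison, which is how the 60% figure for dataset A is read here.

```python
def venn_subsets(a, b, c):
    """Count a few of the three-dataset Venn subsets (hypothetical helper)."""
    a, b, c = set(a), set(b), set(c)
    regions = {
        "A": a,                                   # all proteins in A
        "AB (in common)": a & b,                  # shared by A and B
        "AB-C (all, not in C)": (a | b) - c,      # assumed reading: union minus C
        "AB-C (in common, not in C)": (a & b) - c,
        "A-BC (only in A)": a - (b | c),
        "ABC (in common)": a & b & c,
        "ABC (all proteins)": a | b | c,
    }
    total = len(a | b | c)  # assumed denominator: all distinct proteins
    return {
        name: (len(members), 100.0 * len(members) / total if total else 0.0)
        for name, members in regions.items()
    }


# Made-up identifiers, purely for illustration.
venn = venn_subsets({"P1", "P2", "P3"}, {"P2", "P3", "P4"}, {"P3", "P5"})
for name, (count, percent) in venn.items():
    print(f"{name}: {count} ({percent:.0f}%)")
```

With these example sets, dataset A holds 3 of the 5 distinct proteins, i.e. 60%.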

25.2.2.2. Histogram statistics

Most details are presented as histogram statistics:

Figure 25.3. Details for GO Cellular Components

  1. Toggle selection for chart representation. Select Percentage to be able to do a relative comparison between datasets in the bar chart.

  2. Bar chart representation of histogram entries for one or more datasets.

  3. Table representation of the histogram. Histogram entries are shown in the first column, followed by two columns for each dataset showing the number of proteins with the corresponding characteristic (in absolute numbers and relative to the total number of proteins). Click on the column headers to sort the entries accordingly.

  4. Pie chart representation of the histogram. Small values are summarized into a special pie section titled 'other' (if such entries exist in the dataset). Click the image to get a high resolution view.

    Note

    The contents of pie charts might sum up to more than 100%, as each protein can have several GO annotations.

  5. Cell image representation of the histogram. The rectangles shown on the cell are filled according to the percentage of proteins in the dataset, annotated with that GO type. A cross is shown if no proteins in the dataset have a given annotation. Click the image to get a high resolution view.

The bar chart and the table can be collapsed and expanded by clicking the caption bar with the - or + sign.

Proteins that have no annotation for the characteristic in question are counted in a separate histogram entry named 'Unannotated'.
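
The way histogram entries are counted can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the actual implementation: the function name and the example annotations are made up. Each protein contributes at most once per entry, unannotated proteins go to 'Unannotated', and percentages are relative to the total number of proteins, which is why they can sum to more than 100%.

```python
from collections import Counter


def annotation_histogram(annotations):
    """Map each entry to (count, percent of all proteins). Hypothetical helper.

    annotations: dict of protein id -> iterable of annotation terms.
    """
    if not annotations:
        return {}
    counts = Counter()
    for terms in annotations.values():
        unique_terms = set(terms)  # a protein counts once per term
        if unique_terms:
            counts.update(unique_terms)
        else:
            counts["Unannotated"] += 1
    total = len(annotations)
    return {term: (n, 100.0 * n / total) for term, n in counts.items()}


# Made-up example data.
histogram = annotation_histogram({
    "P1": ["nucleus", "cytoplasm"],
    "P2": ["nucleus"],
    "P3": [],
    "P4": ["membrane"],
})
percent_sum = sum(percent for _, percent in histogram.values())
print(histogram["nucleus"], percent_sum)
```

In this example the percentages add up to 125%, because P1 carries two annotations.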

It is easy to copy tables and charts from the browser, and paste them into other tools like Excel™ or Word™ for further processing (see Section 25.5, “How to copy images from ProteinCenter”).

25.2.2.3. Binned histograms and descriptive statistics

Binned histograms and descriptive statistical values are calculated for numerical (integers, decimals, ratios, etc.) supplementary data (see appendix Table B.1, “Protein supplementary data” for a complete list).

The descriptive statistical values are presented as a table per imported numerical column:

  • Min - the minimum value of the column in the dataset.

  • Max - the maximum value of the column in the dataset.

  • Sum - the sum of all values of the column in the dataset.

  • Mean - the average of all values of the column in the dataset.

  • Std.dev - the standard deviation of all values of the column in the dataset.

The columns that follow in the tables hold the values for each dataset.

The binned histograms are shown next to the descriptive statistics tables.
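
For readers who want to reproduce these values outside ProteinCenter, the following Python sketch computes the same five descriptive statistics and a simple binned histogram. It is an illustration under assumptions: the manual does not state whether Std.dev is the sample or the population standard deviation (the sample version is used here), and the equal-width binning and the number of bins are arbitrary choices.

```python
import statistics


def describe(values):
    """Descriptive statistics for one numerical column (hypothetical helper)."""
    values = list(values)
    return {
        "Min": min(values),
        "Max": max(values),
        "Sum": sum(values),
        "Mean": statistics.fmean(values),
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "Std.dev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


def binned_histogram(values, bins=5):
    """Count values in equal-width bins between the minimum and maximum."""
    values = list(values)
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / bins or 1.0  # avoid zero width
    counts = [0] * bins
    for value in values:
        index = min(int((value - lowest) / width), bins - 1)
        counts[index] += 1
    return counts


# Made-up supplementary data column.
scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
summary = describe(scores)
bin_counts = binned_histogram(scores, bins=2)
print(summary, bin_counts)
```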

25.3. Types of statistics

Although the vast majority of knowledge in ProteinCenter™ is annotated directly to proteins, statistics are supported for all types of datasets (protein, gene, mixed). For gene-based or mixed datasets, wherever this chapter refers to proteins, selected proteins, and anchor proteins, read "protein products of the" genes, selected genes, and anchor genes respectively.

For every type of dataset there are three basic types of statistics, depending on whether the basic unit of calculation is any protein in the dataset, selected proteins, or only the anchor proteins of every cluster.

25.3.1. Statistics based on all data

This mode of statistics operation includes all proteins in the dataset, and descriptive statistics are calculated for all their supplementary data. For gene-based datasets this pertains to all protein products of the gene, and the supplementary data statistics are given on the gene data level.

25.3.2. Statistics based on selected data

This mode of statistics operation works as above, but includes only the selected proteins. For gene-based datasets this pertains to all protein products of the selected genes.

25.3.3. Statistics based on cluster anchors

This mode of statistics operation includes only proteins that are declared anchors in a cluster. Descriptive statistics are also only calculated for the supplementary data of anchor proteins.

For datasets with particular characteristics, the main advantages of this type of statistics are twofold:

  • More reliable statistics will be obtained for clusters that represent several instances of essentially identical proteins. This reduces or eliminates the redundancy of protein data, and therefore produces more accurate statistics.

  • Often anchor proteins have better annotations than the rest of a cluster, and more reliable statistics will result from focusing entirely on the anchors.
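
The effect of redundancy on the counts can be illustrated with a small Python sketch. The record layout (`cluster`, `is_anchor`, `keywords`) and the function name are assumptions made for this example only and do not reflect ProteinCenter's internal data model.

```python
from collections import Counter


def keyword_counts(proteins, anchors_only=False):
    """Count keyword occurrences per protein, optionally for anchors only."""
    counts = Counter()
    for protein in proteins:
        if anchors_only and not protein["is_anchor"]:
            continue
        counts.update(set(protein["keywords"]))
    return counts


# Made-up data: cluster 1 holds three essentially identical proteins.
proteins = [
    {"id": "P1", "cluster": 1, "is_anchor": True, "keywords": ["Kinase"]},
    {"id": "P2", "cluster": 1, "is_anchor": False, "keywords": ["Kinase"]},
    {"id": "P3", "cluster": 1, "is_anchor": False, "keywords": ["Kinase"]},
    {"id": "P4", "cluster": 2, "is_anchor": True, "keywords": ["Transport"]},
]
all_counts = keyword_counts(proteins)
anchor_counts = keyword_counts(proteins, anchors_only=True)
print(all_counts, anchor_counts)
```

Based on all data, 'Kinase' appears three times as often as 'Transport'. Based on anchors only, both keywords are counted once.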

25.4. How to calculate statistics

This section gives a step-by-step introduction to calculating statistics for datasets.

For very large datasets, it can take several minutes to retrieve data and calculate statistics. However, this information is cached for the duration of a session, so that the execution time for a dataset is minimal after the initial calculation.

25.4.1. How to calculate statistics for a single dataset

To calculate statistics for a single dataset follow these steps:

  1. Select the dataset in the Workspace.

  2. Select the Statistics pane. The statistics specification bar appears at the top of the pane, as shown in Figure 25.4, “Calculating statistics for a single folder”.

    Figure 25.4. Calculating statistics for a single folder

  3. Choose what the statistics calculation should be based upon:

    • All Data includes data for all proteins in the dataset.

    • Selected Data includes data for the selected proteins in the dataset.

    • Cluster Anchors only includes the data for all anchor proteins from each cluster.

  4. Select the Significance Level (FDR): a value between 0 and 100%. For the current example, we choose a typical value of 5.0, i.e. about 5% of the features reported as significant are expected to be false positives. This value is used as a threshold for the significance statistics results (see Section 25.4.4, “Significance statistics theory” for more information).

  5. Select the reference set used for comparative statistics. Reference sets are pre-calculated datasets, specified by a source and a taxonomy. Hence a reference set is selected by first choosing a source and subsequently a taxonomy:

    Figure 25.5. 


  6. Select the source database from the following options:

    • A subset of the database sources given in Table A.1, “Protein Data Sources”

    • All - Reference set based on all proteins in the ProteinCenter database. Available only for some species

    • None - Indicates that no source database has been selected. Statistics will be calculated and displayed for the selected dataset, without a reference set and any comparative statistics

    In the current example All source databases have been selected.

  7. Select the species from a ranked list of taxonomy names. The first part of this list consists of all taxonomies represented in the dataset, ordered by their prevalence and annotated with the number of proteins with each taxonomy in square brackets. This is followed by an alphabetically sorted list of the remaining choices, which are most likely irrelevant for this dataset, unless you are conducting a cross-species analysis.

    Since the current dataset (YeastExp1) contains only yeast proteins, it makes sense to compare against all yeast proteins in the database. Another choice would be the SGD source database.

  8. Press the Analyze button to start the statistics calculation.

Please note that the reference sets are part of the weekly data updates. This means that new reference sets may be added over time, and existing ones will be recalculated.

25.4.2. How to calculate statistics for comparison datasets

For comparison datasets, the statistics calculation is similar to single datasets, but the reference set is chosen among the datasets in the comparison.

Figure 25.6. Calculating statistics for comparison

Now follow these 4 steps:

  1. Choose what the statistics calculation should be based upon:

    • Proteins includes data for all proteins in the dataset.

    • Selected Proteins includes data for the selected proteins in the dataset.

    • Cluster Anchors only includes the data for all anchor proteins from each cluster.

  2. Select the significance Level (FDR): A value between 0 and 100%. For the current example, we choose the typical value of 5.0. This value is used as a threshold for the significance statistics results (see Section 25.4.4, “Significance statistics theory” for more information).

  3. Select the dataset to be analyzed, and the dataset to be used as a reference.

  4. Press Analyze to start processing.

The result of comparing a yeast dataset with the whole yeast reference dataset could look like this:

Figure 25.7. Statistics for the comparison of YeastExp1 and YeastRef

Notice that all proteins from the analysis dataset are contained in the reference set, and that the analysis dataset contains 8.4% of all yeast proteins.

The General Information details are shown as the default view.

25.4.3. Significance statistics

When statistics for a comparison dataset have been calculated, it is possible to view feature distributions that are significantly different between the selected datasets in the comparison.

Both over- and under-represented features are listed in tables for PFAM annotated proteins, signal peptide proteins, transmembrane proteins, enzymes, genes, chromosomes, SwissProt keywords, InterPro and KEGG pathways, and all the GO Categories. Features are recorded on a per-protein basis, i.e. any protein in a dataset can contribute to any feature category count with either 0 or 1 occurrences. Features with a hierarchical structure (e.g. all the GO terms) also contribute with the entire generalized category hierarchy above the particular category annotated for a protein. For GO features this means that any GO category annotated for a protein, will also contribute with all GO categories of a more general nature in the GO hierarchy (i.e. following any is_a or part_of relationship upwards in the GO hierarchy).
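
The per-protein counting with hierarchy propagation described above can be sketched as follows. This Python example is an illustration under assumptions: the function names and the tiny three-level hierarchy are made up, and the `parents` mapping stands in for the is_a and part_of relationships of the real GO hierarchy.

```python
def with_ancestors(terms, parents):
    """Return the given terms plus every more general term above them."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen


def feature_counts(protein_terms, parents):
    """Count, per category, the proteins annotated with it or a descendant.

    Each protein contributes 0 or 1 to any category count.
    """
    counts = {}
    for terms in protein_terms.values():
        for term in with_ancestors(terms, parents):
            counts[term] = counts.get(term, 0) + 1
    return counts


# Toy hierarchy (child -> more general parents), for illustration only.
parents = {
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "binding": ["molecular function"],
}
protein_terms = {
    "P1": ["kinase activity"],
    "P2": ["binding", "kinase activity"],
    "P3": ["binding"],
}
go_counts = feature_counts(protein_terms, parents)
print(go_counts)
```

Note that P2 reaches 'molecular function' through two paths but is still counted only once for that category.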

As an example, click GO Molecular Functions to see over- and under-represented GO categories of this kind:

Figure 25.8. Over- and under-represented Molecular functions in YeastExp1 compared to YeastRef

Both significance statistics tables list the name of the analysis dataset and the reference dataset, as well as the dataset features analyzed. Both tables contain the following elements:

  1. Description: Description of a feature (some descriptions are also links to more information about that feature).

  2. Occurrence: Bar showing the percentage of occurrence of that specific feature in a dataset. The Red bar represents the analysis dataset, and the Green bar the reference dataset (hover over the bar to see the percentage numbers). All percentages are given relative to the total number of annotated proteins, i.e. not counting proteins without any annotation of this kind.

  3. Count: The number of times this feature occurs in the analysis dataset. Click to see the proteins with this feature*.

  4. Ref. Count: The number of times this feature occurs in the reference dataset. Click to see the proteins with this feature*.

  5. Raw p-value: The raw p-value indicating the significance of this difference in feature occurrence between the datasets.

  6. FDR p-value: The FDR corrected version of the raw p-value (see Section 25.4.4, “Significance statistics theory” for explanation). Change the Significance Level (FDR) if more or fewer significant results should be shown. At the full 100%, all features will be shown, along with their FDR p-values.

  7. CSV export: Click the icon to export all significance statistics to a comma-separated text file.

P-values range between 0 and 1, where the scientific notation 0E0 signifies zero, while 1E0 signifies 1 (or 100%). In case the analysis set is not a true subset of the selected reference set, and consequently has a higher analysis count than reference count for a certain category, the calculated (infinitely small) p-values will show up in red as a warning.

* Note that a filter will be created to match the proteins. If a filter has already been defined, then it will be overwritten with this one. The view will switch to the Proteins pane or the Clusters pane, to show the list of matching proteins or protein anchors respectively.

25.4.3.1. KEGG pathways

As with other types of significance statistics, clicking the description link of a KEGG pathway opens a map of that particular pathway. In addition, for comparisons, every protein annotated with the pathway is highlighted in a special color on the map: red for proteins from the analysis set only, green for proteins from the reference set only, and yellow for proteins represented in both sets. The yellow color is darker for exact protein matches between the analysis and reference set, while a lighter hue denotes that proteins from both sets merely occupy the same pathway element in the map (a KEGG pathway element signifies a collection of one or more proteins/genes, which are grouped on the map to represent a pathway concept at a certain detail level).

For non-comparison datasets, where the reference set is pre-calculated, only the analysis set will have its annotated proteins highlighted in Red.

25.4.4. Significance statistics theory

25.4.4.1. Distribution and significance of feature category occurrences

To estimate whether a certain category of a feature is disproportionately represented in a dataset, a statistical test is carried out for every category. The formal premise of the statistical test is that a particular category count is the result of sampling an underlying reference distribution. If this base hypothesis can be rejected for a category, the category is characterized as over- or under-represented respectively. Because two arbitrary datasets rarely constitute a true subset-superset relationship, the combined category data for both sets is used as the reference set.

Sampling such a reference set (without replacement) results in a hypergeometric distribution of occurrences, which can readily be tested by estimating the corresponding p-value for the number of occurrences in the analysis dataset. The further an analysis category count lies towards the extreme ends of the reference set distribution, the more unlikely that event is, and hence the smaller the resulting p-value. In this respect, the p-value for a single category can be regarded as the probability that a count at least this extreme would occur by chance alone, and not as a result of the category being disproportionately represented in the analysis set. The smaller the p-value, the more inclined you would be to reject the original hypothesis that the analysis set distribution is unexceptional.

To reach a decision on whether to claim an event with an estimated p-value as significant or not, a significance threshold must be chosen prior to the assay.
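
The hypergeometric test described in this section can be sketched in Python using only the standard library. This is a minimal illustration, not ProteinCenter's implementation: the function names and parameter names are assumptions, and whether the application reports a one-sided or two-sided p-value is not stated in this manual, so both one-sided tails are shown.

```python
from math import comb


def hypergeom_pmf(k, population, with_feature, sample):
    """Probability of exactly k feature occurrences in the sample."""
    return (
        comb(with_feature, k)
        * comb(population - with_feature, sample - k)
        / comb(population, sample)
    )


def tail_p_values(k, population, with_feature, sample):
    """Return (over, under): P(X >= k) and P(X <= k) under the null.

    population:   number of proteins in the reference set
    with_feature: reference proteins that carry the feature category
    sample:       number of proteins in the analysis set
    k:            analysis proteins that carry the feature category
    """
    lowest = max(0, sample - (population - with_feature))
    highest = min(sample, with_feature)
    if not lowest <= k <= highest:
        raise ValueError("k is outside the possible range of counts")
    over = sum(
        hypergeom_pmf(i, population, with_feature, sample)
        for i in range(k, highest + 1)
    )
    under = sum(
        hypergeom_pmf(i, population, with_feature, sample)
        for i in range(lowest, k + 1)
    )
    return over, under


# Made-up numbers: 10 reference proteins, 4 with the feature,
# 3 proteins in the analysis set, 2 of them with the feature.
over_p, under_p = tail_p_values(2, population=10, with_feature=4, sample=3)
print(over_p, under_p)
```

A small `over` value suggests over-representation of the category in the analysis set, and a small `under` value suggests under-representation.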

25.4.4.2. Correction for multiple statistical testing

The p-value for a certain statistical test is applicable when only a single scenario is investigated. However, we are testing the significance of distribution anomalies for numerous feature categories at the same time. In this case, the original p-value can no longer be used as a good indicator of significance, because it is too lenient when multiple statistical tests are applied in series. If a global measure of significance is needed, the p-values must therefore be corrected for multiple statistical testing.

The correction used in ProteinCenter™ is based on a method suggested by [Benjamini and Hochberg]*. This method corrects p-values based on the False Discovery Rate (FDR), i.e. the expected proportion of false positives among all the positives found. For the significance statistics of any feature, this means that the FDR p-value threshold can be interpreted as the probability that any given significant anomaly found is a false positive. When applying multiple statistical testing, the FDR p-value is to be used as the threshold for deciding on the significance of the category abundance anomalies.

Note that the described transformation of p-values is monotonic, i.e. any numerical ordering between original p-values is retained.
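
As an illustration of this kind of correction, the following Python sketch computes the standard Benjamini-Hochberg adjusted p-values. It is a textbook version given under the assumption that ProteinCenter follows the usual step-up procedure; the actual implementation may differ in detail, and the function name and example values are made up.

```python
def benjamini_hochberg(p_values):
    """Return FDR-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending raw p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest raw p-value down to the smallest, so that the
    # adjusted values never increase as the raw values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for four feature categories.
raw = [0.01, 0.04, 0.03, 0.20]
fdr = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(fdr) if q <= 0.05]  # 5% level
print(fdr, significant)
```

In this example three raw p-values are below 0.05, but only the first category remains significant at a 5% FDR level. The ordering of the raw p-values is preserved, although distinct raw values can end up with equal FDR p-values.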

* Benjamini, Y. and Hochberg, Y.: "Controlling the false discovery rate: A practical and powerful approach to multiple testing". J. Roy. Statist. Soc. Ser. B 57, 289–300, 1995.

25.5. How to copy images from ProteinCenter

This section gives a step-by-step example of how to use images from the Statistics pane in other applications like MS PowerPoint or Adobe Photoshop.

25.5.1. How to use images from ProteinCenter in MS PowerPoint

Select a dataset, go to the Statistics pane, select the options and click Analyze to see the statistics.

Then follow these simple steps:

  1. Click on an image to view it in better resolution (on Venn Diagrams click on the Magnifying glass).

  2. Right click on the larger image and select Copy in IE. In Firefox select 'Copy Image'.

  3. Open MS PowerPoint.

  4. Select 'Paste Special' when using IE. In Firefox just select 'Paste'.

  5. Select 'Device Independent Bitmap' when using IE.

The high resolution image should now be displayed in MS PowerPoint.

25.5.2. How to use images from ProteinCenter in Adobe Photoshop

Select a dataset, go to the Statistics pane, select the options and click Analyze to see the statistics.

Then follow these simple steps:

  1. Click on an image to view it in better resolution (on Venn Diagrams click on the Magnifying glass).

  2. Right click on the larger image and select Copy in IE. In Firefox select 'Copy Image'.

  3. Open Adobe Photoshop and create a new image.

  4. Select 'Paste'.

The high resolution image should now be displayed in Adobe Photoshop.