maximtrp / Scikit Posthocs

Licence: mit
Multiple Pairwise Comparisons (Post Hoc) Tests in Python

.. image:: images/logo.png

===============

.. image:: https://img.shields.io/circleci/build/github/maximtrp/scikit-posthocs
    :target: https://app.circleci.com/pipelines/github/maximtrp/scikit-posthocs
.. image:: https://img.shields.io/readthedocs/scikit-posthocs.svg
    :target: https://scikit-posthocs.readthedocs.io
.. image:: http://joss.theoj.org/papers/10.21105/joss.01169/status.svg
    :target: https://doi.org/10.21105/joss.01169
.. image:: https://codecov.io/gh/maximtrp/scikit-posthocs/branch/master/graph/badge.svg
    :target: https://codecov.io/gh/maximtrp/scikit-posthocs
.. image:: https://app.codacy.com/project/badge/Grade/50d2a82a6dd84b51b515cebf931067d7
    :target: https://www.codacy.com/gh/maximtrp/scikit-posthocs/dashboard?utm_source=github.com&utm_medium=referral&utm_content=maximtrp/scikit-posthocs&utm_campaign=Badge_Grade
.. image:: https://pepy.tech/badge/scikit-posthocs
    :target: https://pepy.tech/project/scikit-posthocs
.. image:: https://img.shields.io/github/issues/maximtrp/scikit-posthocs.svg
    :target: https://github.com/maximtrp/scikit-posthocs/issues
.. image:: https://img.shields.io/pypi/v/scikit-posthocs.svg
    :target: https://pypi.python.org/pypi/scikit-posthocs/
.. image:: https://img.shields.io/conda/vn/conda-forge/scikit-posthocs.svg
    :target: https://anaconda.org/conda-forge/scikit-posthocs

scikit-posthocs is a Python package that provides post hoc tests for pairwise multiple comparisons. These tests are usually performed in statistical data analysis to assess the differences between group levels once an ANOVA test has produced a statistically significant result.

scikit-posthocs is tightly integrated with Pandas DataFrames and NumPy arrays to ensure fast computations and convenient data import and storage.

This package will be useful for statisticians, data analysts, and researchers who use Python in their work.

Background
----------

The Python statistical ecosystem comprises multiple packages. However, it still has numerous gaps and is surpassed by R in the range of available packages and capabilities.

`SciPy <https://www.scipy.org/>`_ (version 1.2.0) offers Student, Wilcoxon, and Mann-Whitney tests, which are not adapted to multiple pairwise comparisons. `Statsmodels <http://statsmodels.sourceforge.net/>`_ (version 0.9.0) features the TukeyHSD test, which needs some extra steps to fit smoothly into a data analysis pipeline. Statsmodels also has good helper methods: ``allpairtest`` (adapts an external function such as ``scipy.stats.ttest_ind`` to multiple pairwise comparisons) and ``multipletests`` (adjusts p values to minimize type I and II errors). `PMCMRplus <https://rdrr.io/cran/PMCMRplus/>`_ is a very good R package that has no rivals in Python, as it offers more than 40 tests (including post hoc tests) for factorial and block design data. PMCMRplus was an inspiration and a reference for scikit-posthocs.

scikit-posthocs attempts to improve Python's statistical capabilities by offering many parametric and nonparametric post hoc tests along with outlier detection and basic plotting methods.

Features
--------

.. image:: images/flowchart.png
    :alt: Tests Flowchart

  • Omnibus tests:

    • Durbin test (for balanced incomplete block design).
    • Mack-Wolfe test.
    • Hayter (OSRT) test.
  • Parametric pairwise multiple comparisons tests:

    • Scheffe test.
    • Student T test.
    • Tamhane T2 test.
    • TukeyHSD test.
  • Non-parametric tests for factorial design:

    • Conover test.
    • Dunn test.
    • Dwass, Steel, Critchlow, and Fligner test.
    • Mann-Whitney test.
    • Nashimoto and Wright (NPM) test.
    • Nemenyi test.
    • van der Waerden test.
    • Wilcoxon test.
  • Non-parametric tests for block design:

    • Conover test.
    • Durbin and Conover test.
    • Miller test.
    • Nemenyi test.
    • Quade test.
    • Siegel test.
  • Outlier detection tests:

    • Simple test based on interquartile range (IQR).
    • Grubbs test.
    • Tietjen-Moore test.
    • Generalized Extreme Studentized Deviate test (ESD test).
  • Other tests:

    • Anderson-Darling test.
  • Global null hypothesis tests:

    • Fisher's combination test.
    • Simes test.
  • Plotting functionality (e.g. significance plots).

All post hoc tests are capable of p adjustments for multiple pairwise comparisons.
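To see why such adjustments matter, consider how quickly the number of pairwise comparisons, and with it the chance of at least one false positive, grows with the number of groups. The sketch below is plain Python and independent of scikit-posthocs. The helper names are hypothetical, and the family-wise error rate formula assumes independent tests, which is a simplification.

```python
from itertools import combinations


def pairwise_labels(groups):
    """Return every unordered pair of group labels."""
    return list(combinations(groups, 2))


def unadjusted_fwer(alpha, n_tests):
    """Probability of at least one type I error across n_tests
    independent tests, each run at level alpha (simplifying assumption)."""
    return 1.0 - (1.0 - alpha) ** n_tests


species = ["setosa", "versicolor", "virginica"]
pairs = pairwise_labels(species)
print(len(pairs))                                   # 3 comparisons for 3 groups
print(round(unadjusted_fwer(0.05, len(pairs)), 4))  # 0.1426
```

With ten groups there are already 45 pairs, so leaving p values unadjusted quickly becomes untenable.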

Dependencies
------------

  • `NumPy and SciPy packages <https://www.scipy.org/>`_
  • `Statsmodels <http://statsmodels.sourceforge.net/>`_
  • `Pandas <http://pandas.pydata.org/>`_
  • `Matplotlib <https://matplotlib.org/>`_
  • `Seaborn <https://seaborn.pydata.org/>`_

Compatibility
-------------

The package is compatible with Python 3 only.

Install
-------

You can install the package from PyPI:

.. code:: bash

  pip install scikit-posthocs

Examples
--------

Parametric ANOVA with post hoc tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here is a simple example of a one-way analysis of variance (ANOVA)
with post hoc tests used to compare *sepal width* means of three
groups (three iris species) in the *iris* dataset.

To begin, we will import the dataset using the statsmodels
``get_rdataset()`` method.

.. code:: python

  >>> import statsmodels.api as sa
  >>> import statsmodels.formula.api as sfa
  >>> import scikit_posthocs as sp
  >>> df = sa.datasets.get_rdataset('iris').data
  >>> df.columns = df.columns.str.replace('.', '')
  >>> df.head()
      SepalLength   SepalWidth   PetalLength   PetalWidth Species
  0           5.1          3.5           1.4          0.2  setosa
  1           4.9          3.0           1.4          0.2  setosa
  2           4.7          3.2           1.3          0.2  setosa
  3           4.6          3.1           1.5          0.2  setosa
  4           5.0          3.6           1.4          0.2  setosa

Now, we will build a model and run ANOVA using the statsmodels ``ols()``
and ``anova_lm()`` methods. Columns ``Species`` and ``SepalWidth``
contain independent (predictor) and dependent (response) variable
values, respectively.

.. code:: python

  >>> lm = sfa.ols('SepalWidth ~ C(Species)', data=df).fit()
  >>> anova = sa.stats.anova_lm(lm)
  >>> print(anova)
                 df     sum_sq   mean_sq         F        PR(>F)
  C(Species)    2.0  11.344933  5.672467  49.16004  4.492017e-17
  Residual    147.0  16.962000  0.115388       NaN           NaN
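The columns of this table are tied together by simple arithmetic: each mean square is the sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. A minimal plain-Python sketch of the one-way decomposition follows. ``one_way_anova`` is a hypothetical helper written for illustration, not part of statsmodels or scikit-posthocs, and the toy groups are made up.

```python
def one_way_anova(groups):
    """Return (ss_between, ss_within, df_between, df_within, F)."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of observations around their own group mean
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat


# Toy data, made up for illustration
toy = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
ss_b, ss_w, df_b, df_w, f_toy = one_way_anova(toy)
print(df_b, df_w, round(f_toy, 2))

# The same ratio recovers F from the mean_sq column printed above
f_from_table = 5.672467 / 0.115388
print(round(f_from_table, 2))
```

The p value in the ``PR(>F)`` column additionally requires the F distribution, which is left out of this sketch.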

The results tell us that there is a significant difference between
group means (p = 4.49e-17), but they do not tell us which exact group pairs
differ in their means. To obtain pairwise group differences, we will carry
out a posteriori (post hoc) analysis using the ``scikit-posthocs`` package.
Student's t test applied pairwise gives us the following p values:

.. code:: python

  >>> sp.posthoc_ttest(df, val_col='SepalWidth', group_col='Species', p_adjust='holm')
                    setosa    versicolor     virginica
  setosa     -1.000000e+00  5.535780e-15  8.492711e-09
  versicolor  5.535780e-15 -1.000000e+00  1.819100e-03
  virginica   8.492711e-09  1.819100e-03 -1.000000e+00

Remember to use a `FWER controlling procedure <https://en.wikipedia.org/wiki/Family-wise_error_rate#Controlling_procedures>`_,
such as the Holm procedure, when making multiple comparisons. As seen from this
table, significant differences in group means are obtained for all group pairs.
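To show what such a procedure does to raw p values, here is a minimal plain-Python sketch of the Holm step-down adjustment. The function name ``holm_adjust`` and the input p values are hypothetical; in practice the ``p_adjust='holm'`` argument used above takes care of the adjustment.

```python
def holm_adjust(pvals):
    """Holm step-down adjustment of a list of raw p values.

    The smallest p value is multiplied by m, the next by m - 1, and so on.
    A running maximum keeps the adjusted values monotone, and results are
    capped at 1.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * pvals[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]   # hypothetical raw p values
print([round(p, 4) for p in holm_adjust(raw)])
```

Note that the largest raw p value (0.04) ends up at the same adjusted value as the second smallest one, because adjusted values are never allowed to decrease along the ordered sequence.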

Non-parametric ANOVA with post hoc tests

If normality and other `assumptions <https://en.wikipedia.org/wiki/One-way_analysis_of_variance>`_ are violated, one can use the non-parametric Kruskal-Wallis H test (one-way non-parametric ANOVA) to test whether the samples came from the same distribution.

Let's use the same dataset just to demonstrate the procedure. The Kruskal-Wallis test is implemented in the SciPy package. The ``scipy.stats.kruskal`` method accepts array-like structures, but not DataFrames.

.. code:: python

  >>> import scipy.stats as ss
  >>> import statsmodels.api as sa
  >>> import scikit_posthocs as sp
  >>> df = sa.datasets.get_rdataset('iris').data
  >>> df.columns = df.columns.str.replace('.', '')
  >>> data = [df.loc[ids, 'SepalWidth'].values for ids in df.groupby('Species').groups.values()]

``data`` is a list of 1D arrays containing sepal width values, one array per species. Now we can run the Kruskal-Wallis analysis of variance.

.. code:: python

  >>> H, p = ss.kruskal(*data)
  >>> p
  1.5692820940316782e-14

The p value tells us that we may reject the null hypothesis that the population medians of all of the groups are equal. To learn which groups (species) differ in their medians, we need to run post hoc tests. ``scikit-posthocs`` provides many of the non-parametric tests mentioned above. Let's choose Conover's test.

.. code:: python

  >>> sp.posthoc_conover(df, val_col='SepalWidth', group_col='Species', p_adjust = 'holm')
                    setosa    versicolor     virginica
  setosa     -1.000000e+00  2.278515e-18  1.293888e-10
  versicolor  2.278515e-18 -1.000000e+00  1.881294e-03
  virginica   1.293888e-10  1.881294e-03 -1.000000e+00

Pairwise comparisons show that we may reject the null hypothesis (p < 0.01) for each pair of species and conclude that all groups (species) differ in their sepal widths.
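For intuition about the omnibus step, the Kruskal-Wallis H statistic can be sketched in plain Python: pool all observations, rank them, and compare the rank sums of the groups. The helpers below are hypothetical and use made-up toy data. The sketch omits the tie correction, so it is not expected to match ``scipy.stats.kruskal`` exactly on data with ties (such as the iris measurements), and the p value shortcut is valid only for three groups (two degrees of freedom).

```python
import math


def average_ranks(values):
    """Rank values from 1 to N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_h(groups):
    """Kruskal-Wallis H statistic without the tie correction."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


toy = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # made-up, perfectly separated groups
h = kruskal_h(toy)
p = math.exp(-h / 2)  # chi-square survival function for df = 2 only
print(round(h, 3), round(p, 4))
```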

Block design
~~~~~~~~~~~~

In the block design case, we have a primary factor (e.g. treatment) and a
blocking factor (e.g. age or gender). A blocking factor is also called a
*nuisance* factor, and it is usually a source of variability that needs to be
accounted for.

An example scenario is testing the effect of four fertilizers on crop yield in
four cornfields. We can represent the results with a matrix in which rows
correspond to the blocking factor (field), columns correspond to the
primary factor (fertilizer), and each cell holds the measured yield.

The following dataset is artificial and created just for demonstration
of the procedure:

.. code:: python

  >>> import numpy as np
  >>> data = np.array([[ 8.82, 11.8 , 10.37, 12.08],
  ...                  [ 8.92,  9.58, 10.59, 11.89],
  ...                  [ 8.27, 11.46, 10.24, 11.6 ],
  ...                  [ 8.83, 13.25,  8.33, 11.51]])

First, we need to perform an omnibus test: the Friedman rank sum test. It is
implemented in the ``scipy.stats`` subpackage:

.. code:: python

  >>> import scipy.stats as ss
  >>> ss.friedmanchisquare(*data.T)
  FriedmanchisquareResult(statistic=8.700000000000003, pvalue=0.03355726870553798)
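The statistic and p value reported above can be reproduced by hand, which makes the role of rows (blocks) and columns (treatments) explicit: values are ranked within each block, and the test compares the rank sums of the treatments. This is a plain-Python sketch with hypothetical helper names. It assumes no ties within a block, and the survival function shortcut holds only for three degrees of freedom (four treatments).

```python
import math

data = [[8.82, 11.8, 10.37, 12.08],
        [8.92, 9.58, 10.59, 11.89],
        [8.27, 11.46, 10.24, 11.6],
        [8.83, 13.25, 8.33, 11.51]]


def friedman_statistic(matrix):
    """Friedman chi-square for a blocks-by-treatments matrix (no ties)."""
    n_blocks, n_groups = len(matrix), len(matrix[0])
    rank_sums = [0] * n_groups
    for row in matrix:
        # Rank the treatments within this block, 1 = smallest value
        order = sorted(range(n_groups), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    stat = (12.0 / (n_blocks * n_groups * (n_groups + 1))
            * sum(r ** 2 for r in rank_sums)
            - 3 * n_blocks * (n_groups + 1))
    return stat, rank_sums


def chi2_sf_df3(x):
    """Chi-square survival function for exactly 3 degrees of freedom."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))


stat, rank_sums = friedman_statistic(data)
p_value = chi2_sf_df3(stat)  # 4 treatments give 3 degrees of freedom
print(rank_sums)             # [5, 12, 8, 15]
print(round(stat, 4), round(p_value, 4))
```

The first and last treatments have the most extreme rank sums (5 and 15), which is consistent with that pair showing the smallest p value in the post hoc comparison.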

We can reject the null hypothesis that our treatments have the same
distribution, because the p value is less than 0.05. A number of post hoc tests
are available in the ``scikit-posthocs`` package for unreplicated block design
data. In the following example, Nemenyi's test is used:

.. code:: python

  >>> import scikit_posthocs as sp
  >>> sp.posthoc_nemenyi_friedman(data)
            0         1         2         3
  0 -1.000000  0.220908  0.823993  0.031375
  1  0.220908 -1.000000  0.670273  0.823993
  2  0.823993  0.670273 -1.000000  0.220908
  3  0.031375  0.823993  0.220908 -1.000000

This function returns a DataFrame with p values obtained in pairwise
comparisons between all treatments.
One can also pass a DataFrame and specify the names of columns containing
dependent variable values, blocking and primary factor values.
The following code creates a DataFrame with the same data:

.. code:: python

  >>> import pandas as pd
  >>> data = pd.DataFrame.from_dict({
  ...     'blocks': {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3,
  ...                8: 0, 9: 1, 10: 2, 11: 3, 12: 0, 13: 1, 14: 2, 15: 3},
  ...     'groups': {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1,
  ...                8: 2, 9: 2, 10: 2, 11: 2, 12: 3, 13: 3, 14: 3, 15: 3},
  ...     'y': {0: 8.82, 1: 8.92, 2: 8.27, 3: 8.83, 4: 11.8, 5: 9.58,
  ...           6: 11.46, 7: 13.25, 8: 10.37, 9: 10.59, 10: 10.24, 11: 8.33,
  ...           12: 12.08, 13: 11.89, 14: 11.6, 15: 11.51}})
  >>> data
      blocks  groups      y
  0        0       0   8.82
  1        1       0   8.92
  2        2       0   8.27
  3        3       0   8.83
  4        0       1  11.80
  5        1       1   9.58
  6        2       1  11.46
  7        3       1  13.25
  8        0       2  10.37
  9        1       2  10.59
  10       2       2  10.24
  11       3       2   8.33
  12       0       3  12.08
  13       1       3  11.89
  14       2       3  11.60
  15       3       3  11.51

This is a *melted* and ready-to-use DataFrame. Do not forget to pass the
``melted=True`` argument:

.. code:: python

  >>> sp.posthoc_nemenyi_friedman(data, y_col='y', block_col='blocks', group_col='groups', melted=True)
            0         1         2         3
  0 -1.000000  0.220908  0.823993  0.031375
  1  0.220908 -1.000000  0.670273  0.823993
  2  0.823993  0.670273 -1.000000  0.220908
  3  0.031375  0.823993  0.220908 -1.000000


Data types
~~~~~~~~~~

Internally, ``scikit-posthocs`` uses NumPy ndarrays and pandas DataFrames to
store and process data. Python lists, NumPy ndarrays, and pandas DataFrames
are supported as *input* data types. Below are usage examples of various
input data structures.

Lists and arrays
^^^^^^^^^^^^^^^^

.. code:: python

  >>> x = [[1,2,1,3,1,4], [12,3,11,9,3,8,1], [10,22,12,9,8,3]]
  >>> # or
  >>> x = np.array([[1,2,1,3,1,4], [12,3,11,9,3,8,1], [10,22,12,9,8,3]])
  >>> sp.posthoc_conover(x, p_adjust='holm')
            1         2         3
  1 -1.000000  0.057606  0.007888
  2  0.057606 -1.000000  0.215761
  3  0.007888  0.215761 -1.000000

You can check how it is processed with a hidden function ``__convert_to_df()``:

.. code:: python

  >>> sp.__convert_to_df(x)
  (    vals  groups
   0      1       1
   1      2       1
   2      1       1
   3      3       1
   4      1       1
   5      4       1
   6     12       2
   7      3       2
   8     11       2
   9      9       2
   10     3       2
   11     8       2
   12     1       2
   13    10       3
   14    22       3
   15    12       3
   16     9       3
   17     8       3
   18     3       3, 'vals', 'groups')

It returns a tuple of a DataFrame representation and names of the columns
containing dependent (``vals``) and independent (``groups``) variable values.
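The long format shown above is easy to picture in plain Python: every observation becomes one record carrying its value and the 1-based number of the sample it came from. The helper below is a hypothetical illustration of that layout, not the package's internal code.

```python
def to_long_format(groups):
    """Flatten a list of samples into (value, group) records.

    Groups are numbered from 1 in the order they are given.
    """
    return [(value, label)
            for label, sample in enumerate(groups, start=1)
            for value in sample]


x = [[1, 2, 1, 3, 1, 4], [12, 3, 11, 9, 3, 8, 1], [10, 22, 12, 9, 8, 3]]
records = to_long_format(x)
print(len(records))   # 19
print(records[6])     # (12, 2)
```

Because each record carries its own group label, samples of unequal length (6, 7, and 6 values here) need no padding.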

*Block design* matrix passed as a NumPy ndarray is processed with a hidden
``__convert_to_block_df()`` function:

.. code:: python

  >>> data = np.array([[ 8.82, 11.8 , 10.37, 12.08],
                       [ 8.92,  9.58, 10.59, 11.89],
                       [ 8.27, 11.46, 10.24, 11.6 ],
                       [ 8.83, 13.25,  8.33, 11.51]])
  >>> sp.__convert_to_block_df(data)
  (    blocks groups      y
   0        0      0   8.82
   1        1      0   8.92
   2        2      0   8.27
   3        3      0   8.83
   4        0      1  11.80
   5        1      1   9.58
   6        2      1  11.46
   7        3      1  13.25
   8        0      2  10.37
   9        1      2  10.59
   10       2      2  10.24
   11       3      2   8.33
   12       0      3  12.08
   13       1      3  11.89
   14       2      3  11.60
   15       3      3  11.51, 'y', 'groups', 'blocks')

DataFrames
^^^^^^^^^^

If you are using DataFrames, you need to pass column names containing variable
values to a post hoc function:

.. code:: python

  >>> import statsmodels.api as sa
  >>> import scikit_posthocs as sp
  >>> df = sa.datasets.get_rdataset('iris').data
  >>> df.columns = df.columns.str.replace('.', '')
  >>> sp.posthoc_conover(df, val_col='SepalWidth', group_col='Species', p_adjust = 'holm')

``val_col`` and ``group_col`` arguments specify the names of the columns
containing dependent (response) and independent (grouping) variable values.


Significance plots
------------------

P values can be plotted using a heatmap:

.. code:: python

  >>> pc = sp.posthoc_conover(x, val_col='values', group_col='groups')
  >>> heatmap_args = {'linewidths': 0.25, 'linecolor': '0.5', 'clip_on': False, 'square': True, 'cbar_ax_bbox': [0.80, 0.35, 0.04, 0.3]}
  >>> sp.sign_plot(pc, **heatmap_args)

.. image:: images/plot-conover.png

Custom colormap applied to a plot:

.. code:: python

  >>> pc = sp.posthoc_conover(x, val_col='values', group_col='groups')
  >>> # Format: diagonal, non-significant, p<0.001, p<0.01, p<0.05
  >>> cmap = ['1', '#fb6a4a',  '#08306b',  '#4292c6', '#c6dbef']
  >>> heatmap_args = {'cmap': cmap, 'linewidths': 0.25, 'linecolor': '0.5', 'clip_on': False, 'square': True, 'cbar_ax_bbox': [0.80, 0.35, 0.04, 0.3]}
  >>> sp.sign_plot(pc, **heatmap_args)

.. image:: images/plot-conover-custom-cmap.png
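The five categories named in the colormap comment above (diagonal, non-significant, p<0.001, p<0.01, p<0.05) amount to binning a p value matrix. The sketch below shows that binning in plain Python, using the Nemenyi p values from the block design example. The function name, the labels, and the use of strict inequalities are assumptions made for illustration; this is not the code behind ``sign_plot()``.

```python
def significance_labels(p_matrix):
    """Map a square matrix of p values to category labels.

    Diagonal cells are identified by position, not by their stored value.
    """
    labels = []
    for i, row in enumerate(p_matrix):
        out = []
        for j, p in enumerate(row):
            if i == j:
                out.append('diagonal')
            elif p < 0.001:
                out.append('p < 0.001')
            elif p < 0.01:
                out.append('p < 0.01')
            elif p < 0.05:
                out.append('p < 0.05')
            else:
                out.append('NS')
        labels.append(out)
    return labels


nemenyi = [[-1.0, 0.220908, 0.823993, 0.031375],
           [0.220908, -1.0, 0.670273, 0.823993],
           [0.823993, 0.670273, -1.0, 0.220908],
           [0.031375, 0.823993, 0.220908, -1.0]]
labels = significance_labels(nemenyi)
print(labels[0])  # ['diagonal', 'NS', 'NS', 'p < 0.05']
```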

Citing
------

If you want to cite *scikit-posthocs*, please refer to the publication in
the `Journal of Open Source Software <http://joss.theoj.org>`_:

Terpilowski, M. (2019). scikit-posthocs: Pairwise multiple comparison tests in
Python. Journal of Open Source Software, 4(36), 1169, https://doi.org/10.21105/joss.01169

.. code::

  @ARTICLE{Terpilowski2019,
    title    = {scikit-posthocs: Pairwise multiple comparison tests in Python},
    author   = {Terpilowski, Maksim},
    journal  = {The Journal of Open Source Software},
    volume   = {4},
    number   = {36},
    pages    = {1169},
    year     = {2019},
    doi      = {10.21105/joss.01169}
  }

Acknowledgement
---------------

Thorsten Pohlert, PMCMR author and maintainer