pandas-profiling / Pandas Profiling

Licence: mit
Create HTML profiling reports from pandas DataFrame objects

Programming Languages

python
139335 projects - #7 most used programming language
Jupyter Notebook
11667 projects
HTML
75241 projects
CSS
56736 projects
Batchfile
5799 projects
Makefile
30231 projects

Projects that are alternatives of or similar to Pandas Profiling

Sweetviz
Visualize and compare datasets, target values and associations, with one line of code.
Stars: ✭ 1,851 (-77.78%)
Mutual labels:  data-science, statistics, data-analysis, pandas, pandas-dataframe, exploratory-data-analysis, eda, exploration, data-exploration, data-profiling
Data Science Hacks
Data Science Hacks consists of tips, tricks to help you become a better data scientist. Data science hacks are for all - beginner to advanced. Data science hacks consist of python, jupyter notebook, pandas hacks and so on.
Stars: ✭ 273 (-96.72%)
Mutual labels:  jupyter-notebook, data-science, data-analysis, pandas, pandas-dataframe
datatile
A library for managing, validating, summarizing, and visualizing data.
Stars: ✭ 419 (-94.97%)
Mutual labels:  pandas, data-analysis, data-exploration, data-quality, data-profiling
Spark R Notebooks
R on Apache Spark (SparkR) tutorials for Big Data analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 109 (-98.69%)
Mutual labels:  jupyter-notebook, data-science, jupyter, data-analysis, exploratory-data-analysis
Great expectations
Always know what to expect from your data.
Stars: ✭ 5,808 (-30.27%)
Mutual labels:  data-science, exploratory-data-analysis, eda, data-quality, data-profiling
Machine Learning With Python
Practice and tutorial-style notebooks covering wide variety of machine learning techniques
Stars: ✭ 2,197 (-73.62%)
Mutual labels:  artificial-intelligence, jupyter-notebook, data-science, statistics, pandas
Code
Compilation of R and Python programming codes on the Data Professor YouTube channel.
Stars: ✭ 287 (-96.55%)
Mutual labels:  jupyter-notebook, data-science, pandas, exploratory-data-analysis
Datascience course
Curso de Data Science em Português
Stars: ✭ 294 (-96.47%)
Mutual labels:  artificial-intelligence, jupyter-notebook, data-science, data-analysis
Cookbook 2nd
IPython Cookbook, Second Edition, by Cyrille Rossant, Packt Publishing 2018
Stars: ✭ 704 (-91.55%)
Mutual labels:  jupyter-notebook, data-science, jupyter, data-analysis
Quantitative Notebooks
Educational notebooks on quantitative finance, algorithmic trading, financial modelling and investment strategy
Stars: ✭ 356 (-95.73%)
Mutual labels:  jupyter-notebook, data-science, jupyter, data-analysis
Ai Learn
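An artificial intelligence learning roadmap: nearly 200 hands-on cases and projects with free accompanying course materials, from zero-background introduction to job-ready practice. Covers popular areas including Python, mathematics, machine learning, data analysis, deep learning, computer vision, natural language processing, PyTorch tensorflow machine-learning, deep-learning data-analysis data-mining mathematics data-science artificial-intelligence python tensorflow tensorflow2 caffe keras pytorch algorithm numpy pandas matplotlib seaborn nlp cv, and more.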
Stars: ✭ 4,387 (-47.33%)
Mutual labels:  artificial-intelligence, data-science, data-analysis, pandas
Data Science
Collection of useful data science topics along with code and articles
Stars: ✭ 315 (-96.22%)
Mutual labels:  artificial-intelligence, jupyter-notebook, data-science, data-analysis
Gophernotes
The Go kernel for Jupyter notebooks and nteract.
Stars: ✭ 3,100 (-62.78%)
Mutual labels:  artificial-intelligence, jupyter-notebook, data-science, jupyter
Just Pandas Things
An ongoing list of pandas quirks
Stars: ✭ 660 (-92.08%)
Mutual labels:  jupyter-notebook, data-science, pandas, pandas-dataframe
data-analysis-using-python
Data Analysis Using Python: A Beginner’s Guide Featuring NYC Open Data
Stars: ✭ 81 (-99.03%)
Mutual labels:  pandas-dataframe, exploratory-data-analysis, pandas, data-analysis
Evidently
Interactive reports to analyze machine learning models during validation or production monitoring.
Stars: ✭ 304 (-96.35%)
Mutual labels:  html-report, jupyter-notebook, data-science, pandas-dataframe
Stats Maths With Python
General statistics, mathematical programming, and numerical/scientific computing scripts and notebooks in Python
Stars: ✭ 381 (-95.43%)
Mutual labels:  jupyter-notebook, data-science, statistics, pandas
Lux
Python API for Intelligent Visual Data Discovery
Stars: ✭ 787 (-90.55%)
Mutual labels:  data-science, jupyter, pandas, exploratory-data-analysis
Data Science Your Way
Ways of doing Data Science Engineering and Machine Learning in R and Python
Stars: ✭ 530 (-93.64%)
Mutual labels:  jupyter-notebook, data-science, jupyter, exploratory-data-analysis
Imodels
Interpretable ML package 🔍 for concise, transparent, and accurate predictive modeling (sklearn-compatible).
Stars: ✭ 194 (-97.67%)
Mutual labels:  artificial-intelligence, jupyter-notebook, data-science, statistics

Pandas Profiling

Documentation | Slack | Stack Overflow | Latest changelog

Generates profile reports from a pandas DataFrame.

The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

  • Type inference: detect the types of columns in a dataframe.
  • Essentials: type, unique values, missing values
  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram
  • Correlations: highlighting of highly correlated variables; Spearman, Pearson and Kendall matrices
  • Missing values: matrix, count, heatmap and dendrogram of missing values
  • Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
  • File and Image analysis: extract file sizes, creation dates and dimensions, and scan for truncated images or those containing EXIF information.
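
Several of the quantile and descriptive statistics listed above can be reproduced by hand with plain pandas, which may help when reading the report. The snippet below is a minimal sketch on made-up data; it does not use pandas-profiling itself, and the report's exact definitions (for example the degrees of freedom used for the standard deviation) may differ from the pandas defaults used here.

```python
import pandas as pd

# Illustrative data only; not taken from any of the example datasets.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0], name="a")

# Quartiles with pandas' default linear interpolation.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])

stats = {
    "minimum": values.min(),
    "Q1": q1,
    "median": median,
    "Q3": q3,
    "maximum": values.max(),
    "range": values.max() - values.min(),
    "interquartile range": q3 - q1,
    "mean": values.mean(),
    "standard deviation": values.std(),  # sample standard deviation (ddof=1)
    # Median of the absolute deviations from the median.
    "median absolute deviation": (values - median).abs().median(),
    "coefficient of variation": values.std() / values.mean(),
    "skewness": values.skew(),
    "kurtosis": values.kurt(),
}

for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

The single outlier (100.0) pulls the mean and standard deviation far from the median and interquartile range, which is the kind of contrast these statistics are meant to expose.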

Announcements

Spark backend in progress: We are happy to announce that we're nearing v1 of the Spark backend for generating profile reports. Beta testers wanted! The Spark backend will be released as a pre-release of this package.

Monitoring time series?: I'd like to draw your attention to popmon. Whereas pandas-profiling allows you to explore patterns in a single dataset, popmon allows you to uncover temporal patterns. It's worth checking out!

Support pandas-profiling: The development of pandas-profiling relies completely on contributions. If you find value in the package, we welcome you to support the project directly through GitHub Sponsors! Please help us continue to support this package. Find more information: Sponsor the project on GitHub


Contents: Examples | Installation | Documentation | Large datasets | Command line usage | Advanced usage | Support | Go beyond | Support the project | Types | How to contribute | Editor Integration | Dependencies


Examples

The following examples can give you an impression of what the package can do:

  • Census Income (US Adult Census data relating income)
  • NASA Meteorites (comprehensive set of meteorite landings)
  • Titanic (the "Wonderwall" of datasets)
  • NZA (open data from the Dutch Healthcare Authority)
  • Stata Auto (1978 Automobile data)
  • Vektis (Vektis Dutch Healthcare data)
  • Colors (a simple colors dataset)
  • UCI Bank Dataset (banking marketing dataset)
  • RDW (vehicle registration data from RDW, the Dutch DMV: 10 million rows, 71 features)

Specific features:

Tutorials:

Installation

Using pip

You can install using the pip package manager by running

pip install pandas-profiling[notebook]

Alternatively, you can install the latest version directly from GitHub:

pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

Using conda

You can install using the conda package manager by running

conda install -c conda-forge pandas-profiling

From source

Download the source code by cloning the repository or by pressing 'Download ZIP' on this page.

Install by navigating to the proper directory and running:

python setup.py install

Documentation

The documentation for pandas_profiling can be found here. Previous documentation is still available here.

Getting started

Start by loading in your pandas DataFrame, e.g. by using:

import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])

To generate the report, run:

profile = ProfileReport(df, title="Pandas Profiling Report")

Explore deeper

You can configure the profile report in any way you like. The example code below loads the explorative configuration, which includes many features for text (length distribution, Unicode information), files (file size, creation time) and images (dimensions, EXIF information). If you are interested in the exact settings used, you can compare them with the default configuration file.

profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)

Learn more about configuring pandas-profiling on the Advanced usage page.

Jupyter Notebook

We recommend generating reports interactively by using the Jupyter notebook. There are two interfaces: through widgets and through an HTML report.

Notebook Widgets

This is achieved by simply displaying the report. In the Jupyter Notebook, run:

profile.to_widgets()

The HTML report can be included in a Jupyter notebook:

HTML

Run the following code:

profile.to_notebook_iframe()

Saving the report

If you want to generate an HTML report file, assign the ProfileReport to a variable and use the to_file() function:

profile.to_file("your_report.html")

Alternatively, you can obtain the data as JSON:

# As a string
json_data = profile.to_json()

# As a file
profile.to_file("your_report.json")
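
The two to_file() calls above differ only in the file extension, which suggests that the output format follows the suffix of the file name. The sketch below builds on that reading: report_path and save_report are hypothetical helpers (not part of pandas-profiling) that construct the file names and write one file per requested format. It assumes `profile` is a ProfileReport created as shown earlier.

```python
from pathlib import Path

# Formats shown in this README; treat anything else as unsupported here.
SUPPORTED_FORMATS = ("html", "json")


def report_path(stem: str, fmt: str = "html") -> Path:
    """Build an output path such as 'your_report.html' (hypothetical helper)."""
    fmt = fmt.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported report format: {fmt!r}")
    return Path(f"{stem}.{fmt}")


def save_report(profile, stem: str, formats=SUPPORTED_FORMATS):
    """Write the report once per format and return the paths written.

    `profile` is expected to be a ProfileReport; only its to_file()
    method, as used above, is called.
    """
    written = []
    for fmt in formats:
        target = report_path(stem, fmt)
        profile.to_file(str(target))
        written.append(target)
    return written


print(report_path("your_report"))          # your_report.html
print(report_path("your_report", "json"))  # your_report.json
```

With a report at hand, `save_report(profile, "your_report")` would write both files in one call.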

Large datasets

Version 2.4 introduces minimal mode.

This is a default configuration that disables expensive computations (such as correlations and duplicate row detection).

Use the following syntax:

profile = ProfileReport(large_dataset, minimal=True)
profile.to_file("output.html")
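
Minimal mode can be combined with profiling a random sample instead of the full dataset. The following is a sketch under that assumption: sample_for_profiling and profile_minimal are hypothetical helper names, the row limit is arbitrary, and sampling trades exactness of the statistics for speed.

```python
import numpy as np
import pandas as pd


def sample_for_profiling(df: pd.DataFrame, max_rows: int = 10_000, seed: int = 0) -> pd.DataFrame:
    """Return df unchanged if it is small, otherwise a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


def profile_minimal(df: pd.DataFrame):
    """Build a minimal-mode report; pandas-profiling is imported only when called."""
    from pandas_profiling import ProfileReport

    return ProfileReport(df, minimal=True)


# Synthetic stand-in for a large dataset.
rng = np.random.default_rng(0)
large_dataset = pd.DataFrame(rng.random((50_000, 5)), columns=["a", "b", "c", "d", "e"])

subset = sample_for_profiling(large_dataset)
print(subset.shape)  # (10000, 5)

# With pandas-profiling installed:
# profile_minimal(subset).to_file("output.html")
```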

Benchmarks are available here.

Command line usage

For CSV files in a standard format that pandas can read directly, you can use the pandas_profiling executable.

Run the following for information about options and arguments.

pandas_profiling -h
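
The executable can also be driven from a script. The sketch below assumes the argument order shown in the PyCharm integration section of this README (input file first, output report second); build_command is a hypothetical helper and the CSV path is a placeholder. Check `pandas_profiling -h` for the options your installed version actually supports.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(csv_path: str, executable: str = "pandas_profiling") -> list:
    """Return the argument list: executable, input CSV, output HTML report."""
    csv_file = Path(csv_path)
    # Drop every extension, so 'data.csv.gz' becomes 'data'.
    stem = csv_file.name.split(".")[0]
    report = csv_file.with_name(f"{stem}_report.html")
    return [executable, str(csv_file), str(report)]


cmd = build_command("data/titanic.csv")  # placeholder path
print(cmd)

# Run only when the executable is on PATH and the input file exists.
if shutil.which(cmd[0]) and Path(cmd[1]).exists():
    subprocess.run(cmd, check=True)
```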

Advanced usage

A set of options is available in order to adapt the report generated.

  • title (str): Title for the report ('Pandas Profiling Report' by default).
  • pool_size (int): Number of workers in thread pool. When set to zero, it is set to the number of CPUs available (0 by default).
  • progress_bar (bool): If True, pandas-profiling will display a progress bar.
  • infer_dtypes (bool): When True (default), the dtypes of variables are inferred by visions using the typeset logic (for instance, a column that has integers stored as strings will be analyzed as if it were numeric).

More settings can be found in the default configuration file and minimal configuration file.

You can find the configuration docs on the advanced usage page.

Example

profile = df.profile_report(
    title="Pandas Profiling Report", plot={"histogram": {"bins": 8}}
)
profile.to_file("output.html")
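
The options listed above can be collected in one dictionary and passed as keyword arguments, which keeps the configuration in a single place. This is a sketch that assumes the options are accepted by the ProfileReport constructor in the same way that title and plot are accepted by df.profile_report() above; build_report is a hypothetical helper and the values are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])

# Option names come from the list above; the values are illustrative.
report_options = {
    "title": "Pandas Profiling Report",
    "pool_size": 2,           # number of workers in the thread pool
    "progress_bar": False,    # suppress the progress bar
    "infer_dtypes": True,     # keep type inference switched on
    "plot": {"histogram": {"bins": 8}},
}


def build_report(frame: pd.DataFrame, options: dict):
    """Create a report; pandas-profiling is imported only when called."""
    from pandas_profiling import ProfileReport

    return ProfileReport(frame, **options)


# With pandas-profiling installed:
# profile = build_report(df, report_options)
# profile.to_file("output.html")
```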

Support

Need help? Want to share a perspective? Want to report a bug? Ideas for collaboration? You can reach out via the following channels:

  • Stack Overflow: ideal for asking questions on how to use the package
  • Github Issues: bugs, proposals for change, feature requests
  • Slack: general chat, questions, collaboration
  • Email: project collaboration or sponsoring

Go beyond

Popmon

For many real-world problems we are interested in how the data changes over time. The excellent package popmon allows you to profile and monitor data trends over time and generates reports in a similar fashion to what you're used to with pandas-profiling. Inspecting the report often reveals patterns that go undetected during standard data exploration. Moreover, popmon can be used to monitor the stability of the input and output of machine learning models. The package is fully open-source and you can find it here!

To learn more on Popmon, have a look at these resources here
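
To make the idea of a temporal pattern concrete, the sketch below computes per-month statistics with plain pandas on made-up data. It does not use popmon and is not a substitute for it; it only illustrates the kind of shift that a single whole-dataset profile would average away.

```python
import pandas as pd

# Made-up measurements: the level jumps in March.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2021-01-05", "2021-01-20", "2021-02-03",
             "2021-02-17", "2021-03-01", "2021-03-15"]
        ),
        "value": [1.0, 1.2, 1.1, 1.3, 5.0, 5.4],
    }
)

# Group rows by calendar month and summarize each month separately.
monthly = df.groupby(df["date"].dt.to_period("M"))["value"].agg(["mean", "std", "count"])
print(monthly)

# Month-over-month change in the mean highlights the jump.
print(monthly["mean"].diff())
```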

Great Expectations

Profiling your data is closely related to data validation: often validation rules are defined in terms of well-known statistics. For that purpose, pandas-profiling integrates with Great Expectations. This is a world-class open-source library that helps you maintain data quality and improve communication about data between teams. Great Expectations allows you to create Expectations (which are basically unit tests for your data) and Data Docs (conveniently shareable HTML data reports). pandas-profiling features a method to create a suite of Expectations based on the results of your ProfileReport, which you can store and use to validate another (or future) dataset.

You can find more details on the Great Expectations integration here
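
The principle of deriving validation rules from profile statistics can be sketched in plain pandas. The example below does not use the Great Expectations API or the pandas-profiling integration: learn_ranges and validate are hypothetical helpers, and the only rule is that numeric values in a new dataset stay within the minimum and maximum observed in a reference dataset.

```python
import pandas as pd


def learn_ranges(reference: pd.DataFrame) -> dict:
    """Record (min, max) for every numeric column of the reference data."""
    numeric = reference.select_dtypes(include="number")
    return {col: (numeric[col].min(), numeric[col].max()) for col in numeric.columns}


def validate(candidate: pd.DataFrame, ranges: dict) -> dict:
    """Check, per column, that all values fall inside the learned range."""
    return {
        col: bool(candidate[col].between(low, high).all())
        for col, (low, high) in ranges.items()
    }


# Made-up reference and future datasets.
reference = pd.DataFrame({"age": [22, 35, 58], "fare": [7.25, 71.28, 8.05]})
future = pd.DataFrame({"age": [30, 41], "fare": [9.5, 512.33]})

ranges = learn_ranges(reference)
print(validate(future, ranges))  # {'age': True, 'fare': False}
```

Here the rule for fare fails because 512.33 lies outside the range seen in the reference data, which is the kind of signal an Expectation would surface.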

Supporting open source

Maintaining and developing the open-source code for pandas-profiling, with millions of downloads and thousands of users, would not be possible without the support of our gracious sponsors.

Lambda Labs

Lambda workstations, servers, laptops, and cloud services power engineers and researchers at Fortune 500 companies and 94% of the top 50 universities. Lambda Cloud offers 4 & 8 GPU instances starting at $1.50 / hr. Pre-installed with TensorFlow, PyTorch, Ubuntu, CUDA, and cuDNN.

We would like to thank our generous Github Sponsors supporters who make pandas-profiling possible:

Martin Sotir, Stephanie Rivera, abdulAziz

More info if you would like to appear here: Github Sponsor page

Types

Types are a powerful abstraction for effective data analysis that goes beyond the logical data types (integer, float etc.). pandas-profiling currently recognizes the following types: Boolean, Numerical, Date, Categorical, URL, Path, File and Image.

We have developed a type system for Python, tailored for data analysis: visions. Choosing an appropriate typeset can both improve the overall expressiveness and reduce the complexity of your analysis/code. To learn more about pandas-profiling's type system, check out the default implementation here. In the meantime, user customized summarizations and type definitions are now fully supported - if you have a specific use-case please reach out with ideas or a PR!
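
As a rough illustration of why inferred types can differ from the stored dtype, the sketch below checks whether a column of strings could be read as numbers. It uses plain pandas rather than visions, so it is not the library's implementation; looks_numeric is a hypothetical helper and the data is made up.

```python
import pandas as pd


def looks_numeric(column: pd.Series) -> bool:
    """True if every non-missing value in the column parses as a number."""
    non_null = column.dropna()
    if non_null.empty:
        return False
    # Values that cannot be parsed become NaN instead of raising an error.
    converted = pd.to_numeric(non_null, errors="coerce")
    return bool(converted.notna().all())


df = pd.DataFrame(
    {
        "id": ["1", "2", "3"],                  # integers stored as strings
        "city": ["Utrecht", "Delft", "Breda"],  # genuinely textual
    }
)

print({name: looks_numeric(df[name]) for name in df.columns})
# {'id': True, 'city': False}
```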

Contributing

Read about getting involved in the Contribution Guide.

A low-threshold way to ask questions or start contributing is to reach out on the pandas-profiling Slack. Join the Slack community.

Editor integration

PyCharm integration

  1. Install pandas-profiling via the instructions above
  2. Locate your pandas-profiling executable.
    • On macOS / Linux / BSD:
      $ which pandas_profiling
      (example) /usr/local/bin/pandas_profiling
    • On Windows:
      $ where pandas_profiling
      (example) C:\ProgramData\Anaconda3\Scripts\pandas_profiling.exe
  3. In PyCharm, go to Settings (or Preferences on macOS) > Tools > External tools
  4. Click the + icon to add a new external tool
  5. Insert the following values
    • Name: Pandas Profiling
    • Program: The location obtained in step 2
    • Arguments: "$FilePath$" "$FileDir$/$FileNameWithoutAllExtensions$_report.html"
    • Working Directory: $ProjectFileDir$

To use the PyCharm Integration, right click on any dataset file:

External Tools > Pandas Profiling.

Other integrations

Other editor integrations may be contributed via pull requests.

Dependencies

The profile report is written in HTML and CSS, which means pandas-profiling requires a modern browser.

You need Python 3 to run this package. Other dependencies can be found in the requirements files:

Filename                 Requirements
requirements.txt         Package requirements
requirements-dev.txt     Requirements for development
requirements-test.txt    Requirements for testing
setup.py                 Requirements for Widgets etc.