
polyaxon / datatile

License: Apache-2.0
A library for managing, validating, summarizing, and visualizing data.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to datatile

Pandas Profiling
Create HTML profiling reports from pandas DataFrame objects
Stars: ✭ 8,329 (+1887.83%)
Mutual labels:  pandas, data-analysis, data-exploration, data-quality, data-profiling
ipython-notebooks
A collection of Jupyter notebooks exploring different datasets.
Stars: ✭ 43 (-89.74%)
Mutual labels:  pandas, data-analysis, matplotlib, data-exploration
Sweetviz
Visualize and compare datasets, target values and associations, with one line of code.
Stars: ✭ 1,851 (+341.77%)
Mutual labels:  pandas, data-analysis, data-exploration, data-profiling
Mlcourse.ai
Open Machine Learning Course
Stars: ✭ 7,963 (+1800.48%)
Mutual labels:  plotly, pandas, data-analysis, matplotlib
Edaviz
edaviz - Python library for Exploratory Data Analysis and Visualization in Jupyter Notebook or Jupyter Lab
Stars: ✭ 220 (-47.49%)
Mutual labels:  plotly, pandas, data-analysis, matplotlib
re-data
re_data - fix data issues before your users & CEO would discover them 😊
Stars: ✭ 955 (+127.92%)
Mutual labels:  data-analysis, data-quality-checks, data-quality, data-quality-monitoring
traceml
Engine for ML/Data tracking, visualization, dashboards, and model UI for Polyaxon.
Stars: ✭ 445 (+6.21%)
Mutual labels:  plotly, matplotlib, data-profiling, mlops
pandas-workshop
An introductory workshop on pandas with notebooks and exercises for following along.
Stars: ✭ 161 (-61.58%)
Mutual labels:  pandas, data-analysis, dataframes
Data Analysis
Mainly a summary of web-scraping and data-analysis projects, plus modeling, machine learning, and model evaluation.
Stars: ✭ 142 (-66.11%)
Mutual labels:  pandas, data-analysis, matplotlib
Exploratory Data Analysis Visualization Python
Data analysis and visualization with the PyData ecosystem: Pandas, Matplotlib, NumPy, and Seaborn
Stars: ✭ 78 (-81.38%)
Mutual labels:  plotly, pandas, matplotlib
jun
JUN - python pandas, plotly, seaborn support & dataframes manipulation over erlang
Stars: ✭ 21 (-94.99%)
Mutual labels:  plotly, pandas, dataframes
Lantern
Data exploration glue
Stars: ✭ 292 (-30.31%)
Mutual labels:  plotly, pandas, matplotlib
PandasVersusExcel
Introduction to data analysis with Python, for beginner data analysts
Stars: ✭ 120 (-71.36%)
Mutual labels:  pandas, data-analysis, matplotlib
Ai Learn
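AI learning roadmap: a collection of nearly 200 hands-on cases and projects, with free accompanying course materials, from absolute beginner to job-ready practice. Covers popular areas including Python, mathematics, machine learning, data analysis, deep learning, computer vision, natural language processing, PyTorch tensorflow machine-learning,deep-learning data-analysis data-mining mathematics data-science artificial-intelligence python tensorflow tensorflow2 caffe keras pytorch algorithm numpy pandas matplotlib seaborn nlp cv, and more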
Stars: ✭ 4,387 (+947.02%)
Mutual labels:  pandas, data-analysis, matplotlib
data-analysis-using-python
Data Analysis Using Python: A Beginner’s Guide Featuring NYC Open Data
Stars: ✭ 81 (-80.67%)
Mutual labels:  pandas, data-analysis, matplotlib
optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Stars: ✭ 1,351 (+222.43%)
Mutual labels:  dask, data-exploration, data-profiling
python-data-visualization
Curated Python Notebooks for Data Visualization
Stars: ✭ 22 (-94.75%)
Mutual labels:  plotly, data-analysis, matplotlib
Django-Data-quality-system
Data governance and data quality checking/monitoring platform (Django + jQuery + MySQL)
Stars: ✭ 143 (-65.87%)
Mutual labels:  data-quality-checks, data-quality, data-quality-monitoring
Dexplot
Simple plotting library that wraps Matplotlib and integrated with DataFrames
Stars: ✭ 208 (-50.36%)
Mutual labels:  plotly, pandas, matplotlib
Udacity-Data-Analyst-Nanodegree
Repository for the projects needed to complete the Data Analyst Nanodegree.
Stars: ✭ 31 (-92.6%)
Mutual labels:  pandas, data-analysis, matplotlib



datatile


A library for managing, summarizing, and visualizing data.

N.B.1: pandas-summary was renamed to datatile, a more ambitious project with several planned features and enhancements: support for visualizations, quality checks, linking summaries to versions, and integrations with third-party libraries.

Installation

The module can be easily installed with pip:

> pip install datatile

This module depends on numpy and pandas. If matplotlib is also installed, some optional visualisations become available.

Tests

To run the tests, execute the command:

> python setup.py test

Usage

DataFrameSummary

An extension to the pandas DataFrame describe() function.

The module contains the DataFrameSummary object, which extends describe() with:

  • properties
    • dfs.columns_stats: counts, uniques, missing, missing_perc, and type per column
    • dfs.columns_types: a count of the columns of each type
    • dfs[column]: a more in-depth summary of the column
  • functions
    • summary(): extends the output of describe() with the values from columns_stats

DataFrameSummary expects a pandas DataFrame to summarise.

from datatile.summary.df import DataFrameSummary

dfs = DataFrameSummary(df)

Getting the column types:

dfs.columns_types


numeric     9
bool        3
categorical 2
unique      1
date        1
constant    1
dtype: int64
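
The labels in this output (numeric, bool, categorical, unique, date, constant) suggest a per-column classification. The sketch below shows one way such labels could be assigned with plain pandas. The function name sketch_column_type, the order of the rules, and the sample data are assumptions made for illustration; they are not taken from datatile's source and the library's actual rules may differ.

```python
import pandas as pd
from pandas.api import types as ptypes


def sketch_column_type(col: pd.Series) -> str:
    """Assign one of the labels shown above (hypothetical rules, not datatile's)."""
    values = col.dropna()
    n_unique = values.nunique()
    if n_unique <= 1:
        return "constant"          # a single repeated value (or no values at all)
    if ptypes.is_bool_dtype(col):  # checked before numeric: bool also counts as numeric
        return "bool"
    if ptypes.is_numeric_dtype(col):
        return "numeric"
    if ptypes.is_datetime64_any_dtype(col):
        return "date"
    if n_unique == len(values):
        return "unique"            # every non-missing value is distinct
    return "categorical"


# Hypothetical sample frame with one column per label.
sample = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "flag": [True, False, True, True],
    "score": [0.5, 1.5, 1.5, 2.0],
    "when": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-02", "2021-01-03"]),
    "group": ["x", "x", "y", "y"],
    "source": ["csv", "csv", "csv", "csv"],
})

labels = pd.Series({name: sketch_column_type(sample[name]) for name in sample.columns})
print(labels.value_counts())  # same shape as the columns_types output: label -> count
```

Counting the labels with value_counts() gives a Series shaped like the columns_types output, one row per label.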

Getting the column stats:

dfs.columns_stats


                      A            B        C              D              E
counts             5802         5794     5781           5781           4617
uniques            5802            3     5771            128            121
missing               0            8       21             21           1185
missing_perc         0%        0.14%    0.36%          0.36%         20.42%
types            unique  categorical  numeric        numeric        numeric
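
In the table above, counts plus missing gives the same total (5802) for every column, and missing_perc matches missing divided by that total. The following sketch rebuilds those four rows with plain pandas to make the relationship concrete. The function name sketch_columns_stats, the two-decimal percentage format, and the sample data are assumptions for illustration; this is not datatile's implementation.

```python
import pandas as pd


def sketch_columns_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Rebuild counts, uniques, missing and missing_perc with plain pandas."""
    counts = df.count()        # non-missing values per column
    uniques = df.nunique()     # distinct non-missing values per column
    missing = df.isna().sum()  # missing values per column
    # Share of missing values, formatted as a percentage string (format is an assumption).
    missing_perc = (missing / len(df) * 100).map("{:.2f}%".format)
    # One row per statistic, one column per DataFrame column.
    return pd.DataFrame({
        "counts": counts,
        "uniques": uniques,
        "missing": missing,
        "missing_perc": missing_perc,
    }).T


# Hypothetical sample frame with one missing value in column A.
sample = pd.DataFrame({"A": [1.0, 2.0, None, 4.0], "B": ["x", "x", "y", "y"]})
stats = sketch_columns_stats(sample)
print(stats)
```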

Getting a single column summary, e.g. for a numerical column:

# a column can also be accessed by its position, e.g. dfs[1]
dfs['A']

std                                                                 0.2827146
max                                                                  1.072792
min                                                                         0
variance                                                           0.07992753
mean                                                                0.5548516
5%                                                                  0.1603367
25%                                                                 0.3199776
50%                                                                 0.4968588
75%                                                                 0.8274732
95%                                                                  1.011255
iqr                                                                 0.5074956
kurtosis                                                            -1.208469
skewness                                                            0.2679559
sum                                                                  3207.597
mad                                                                 0.2459508
cv                                                                  0.5095319
zeros_num                                                                  11
zeros_perc                                                               0,1%
deviating_of_mean                                                          21
deviating_of_mean_perc                                                  0.36%
deviating_of_median                                                        21
deviating_of_median_perc                                                0.36%
top_correlations                         {u'D': 0.702240243124, u'E': -0.663}
counts                                                                   5781
uniques                                                                  5771
missing                                                                    21
missing_perc                                                            0.36%
types                                                                 numeric
Name: A, dtype: object
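
Several rows in this summary can be derived from others. The short check below recomputes iqr, cv, variance, and sum from the values printed above, assuming the usual definitions (iqr as the 75% quantile minus the 25% quantile, cv as std divided by mean, variance as std squared, sum as mean times counts). These definitions are inferred from the numbers, not confirmed from datatile's source.

```python
import math

# Values copied from the numeric column summary above.
std, mean = 0.2827146, 0.5548516
q25, q75 = 0.3199776, 0.8274732
counts = 5781

iqr = q75 - q25          # interquartile range
cv = std / mean          # coefficient of variation
variance = std ** 2      # square of the standard deviation
total = mean * counts    # sum over the non-missing values

# Compare each derived value with the row printed in the summary.
reported = {"iqr": 0.5074956, "cv": 0.5095319, "variance": 0.07992753, "sum": 3207.597}
derived = {"iqr": iqr, "cv": cv, "variance": variance, "sum": total}
for name, value in derived.items():
    matches = math.isclose(value, reported[name], rel_tol=1e-5)
    print(f"{name}: derived={value:.7g} reported={reported[name]} match={matches}")
```

The small differences in the last digits come from rounding in the printed summary.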

Future development

Summaries

  • Add summary analysis between columns, i.e. dfs[[1, 2]]

Visualizations

  • Add summary visualization with matplotlib.
  • Add summary visualization with plotly.
  • Add summary visualization with altair.
  • Add predefined profiling.

Catalog and Versions

  • Add the ability to persist a summary and link it to a specific version.
  • Integrate with quality libraries.