All Projects → ResidentMario → Missingno

ResidentMario / Missingno

Licence: mit
Missing data visualization module for Python.

Programming Languages

python
139335 projects - #7 most used programming language
TeX
3793 projects

Projects that are alternatives of or similar to Missingno

Sweetviz
Visualize and compare datasets, target values and associations, with one line of code.
Stars: ✭ 1,851 (-38.69%)
Mutual labels:  data-analysis, pandas, data-visualization
Edaviz
edaviz - Python library for Exploratory Data Analysis and Visualization in Jupyter Notebook or Jupyter Lab
Stars: ✭ 220 (-92.71%)
Mutual labels:  data-analysis, pandas, data-visualization
Data Science Hacks
Data Science Hacks consists of tips, tricks to help you become a better data scientist. Data science hacks are for all - beginner to advanced. Data science hacks consist of python, jupyter notebook, pandas hacks and so on.
Stars: ✭ 273 (-90.96%)
Mutual labels:  data-analysis, pandas, data-visualization
Dtale Desktop
Build a data visualization dashboard with simple snippets of python code
Stars: ✭ 128 (-95.76%)
Mutual labels:  data-analysis, pandas, data-visualization
Dat8
General Assembly's 2015 Data Science course in Washington, DC
Stars: ✭ 1,516 (-49.78%)
Mutual labels:  data-analysis, pandas, data-visualization
Seaborn Tutorial
This repository is my attempt to help Data Science aspirants gain necessary Data Visualization skills required to progress in their career. It includes all the types of plot offered by Seaborn, applied on random datasets.
Stars: ✭ 114 (-96.22%)
Mutual labels:  data-analysis, pandas, data-visualization
Data Forge Ts
The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
Stars: ✭ 967 (-67.97%)
Mutual labels:  data-analysis, pandas, data-visualization
Deepgraph
Analyze Data with Pandas-based Networks. Documentation:
Stars: ✭ 232 (-92.32%)
Mutual labels:  data-analysis, pandas, data-visualization
Pbpython
Code, Notebooks and Examples from Practical Business Python
Stars: ✭ 1,724 (-42.89%)
Mutual labels:  data-analysis, pandas, data-visualization
Dtale
Visualizer for pandas data structures
Stars: ✭ 2,864 (-5.13%)
Mutual labels:  data-analysis, pandas, data-visualization
Dexplot
Simple plotting library that wraps Matplotlib and integrated with DataFrames
Stars: ✭ 208 (-93.11%)
Mutual labels:  pandas, data-visualization
Awkward 1.0
Manipulate JSON-like data with NumPy-like idioms.
Stars: ✭ 203 (-93.28%)
Mutual labels:  data-analysis, pandas
Python Novice Inflammation
Programming with Python
Stars: ✭ 199 (-93.41%)
Mutual labels:  data-analysis, data-visualization
Discovery
Frontend framework for rapid data (JSON) analysis, sharable serverless reports and dashboards
Stars: ✭ 199 (-93.41%)
Mutual labels:  data-analysis, data-visualization
Deep Learning Machine Learning Stock
Stock for Deep Learning and Machine Learning
Stars: ✭ 240 (-92.05%)
Mutual labels:  data-analysis, data-visualization
Helicalinsight
Helical Insight software is world’s first Open Source Business Intelligence framework which helps you to make sense out of your data and make well informed decisions.
Stars: ✭ 214 (-92.91%)
Mutual labels:  data-analysis, data-visualization
Cjworkbench
The data journalism platform with built in training
Stars: ✭ 244 (-91.92%)
Mutual labels:  data-analysis, data-visualization
Data Science Notebook
📖 每一个伟大的思想和行动都有一个微不足道的开始
Stars: ✭ 196 (-93.51%)
Mutual labels:  data-analysis, pandas
Amazing Feature Engineering
Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered as applied machine learning itself.
Stars: ✭ 218 (-92.78%)
Mutual labels:  data-analysis, data-visualization
Datav
📊https://datav.io is a modern APM, provide observability for your business, application and infrastructure. It's also a lightweight alternative to Grafana.
Stars: ✭ 2,757 (-8.68%)
Mutual labels:  data-analysis, data-visualization

missingno PyPi version t

Messy datasets? Missing values? missingno provides a small toolset of flexible and easy-to-use missing data visualizations and utilities that allows you to get a quick visual summary of the completeness (or lack thereof) of your dataset. Just pip install missingno to get started.

quickstart

This quickstart uses a sample of the NYPD Motor Vehicle Collisions Dataset dataset.

import pandas as pd
collisions = pd.read_csv("https://raw.githubusercontent.com/ResidentMario/missingno-data/master/nyc_collision_factors.csv")

matrix

The msno.matrix nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.

import missingno as msno
%matplotlib inline
msno.matrix(collisions.sample(250))

alt text

At a glance, date, time, the distribution of injuries, and the contribution factor of the first vehicle appear to be completely populated, while geographic information seems mostly complete, but spottier.

The sparkline at right summarizes the general shape of the data completeness and points out the rows with the maximum and minimum nullity in the dataset.

This visualization will comfortably accommodate up to 50 labelled variables. Past that range labels begin to overlap or become unreadable, and by default large displays omit them.

If you are working with time-series data, you can specify a periodicity using the freq keyword parameter:

null_pattern = (np.random.random(1000).reshape((50, 20)) > 0.5).astype(bool)
null_pattern = pd.DataFrame(null_pattern).replace({False: None})
msno.matrix(null_pattern.set_index(pd.period_range('1/1/2011', '2/1/2015', freq='M')) , freq='BQ')

alt text

bar

msno.bar is a simple visualization of nullity by column:

msno.bar(collisions.sample(1000))

alt text

You can switch to a logarithmic scale by specifying log=True. bar provides the same information as matrix, but in a simpler format.

heatmap

The missingno correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another:

msno.heatmap(collisions)

alt text

In this example, it seems that reports which are filed with an OFF STREET NAME variable are less likely to have complete geographic data.

Nullity correlation ranges from -1 (if one variable appears the other definitely does not) to 0 (variables appearing or not appearing have no effect on one another) to 1 (if one variable appears the other definitely also does).

Variables that are always full or always empty have no meaningful correlation, and so are silently removed from the visualization—in this case for instance the datetime and injury number columns, which are completely filled, are not included.

Entries marked <1 or >-1 have a correlation that is close to being exactingly negative or positive, but is still not quite perfectly so. This points to a small number of records in the dataset which are erroneous. For example, in this dataset the correlation between VEHICLE CODE TYPE 3 and CONTRIBUTING FACTOR VEHICLE 3 is <1, indicating that, contrary to our expectation, there are a few records which have one or the other, but not both. These cases will require special attention.

The heatmap works great for picking out data completeness relationships between variable pairs, but its explanatory power is limited when it comes to larger relationships and it has no particular support for extremely large datasets.

dendrogram

The dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap:

msno.dendrogram(collisions)

alt text

The dendrogram uses a hierarchical clustering algorithm (courtesy of scipy) to bin variables against one another by their nullity correlation (measured in terms of binary distance). At each step of the tree the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.

To interpret this graph, read it from a top-down perspective. Cluster leaves which linked together at a distance of zero fully predict one another's presence—one variable might always be empty when another is filled, or they might always both be filled or both empty, and so on. In this specific example the dendrogram glues together the variables which are required and therefore present in every record.

Cluster leaves which split close to zero, but not at it, predict one another very well, but still imperfectly. If your own interpretation of the dataset is that these columns actually are or ought to be match each other in nullity (for example, as CONTRIBUTING FACTOR VEHICLE 2 and VEHICLE TYPE CODE 2 ought to), then the height of the cluster leaf tells you, in absolute terms, how often the records are "mismatched" or incorrectly filed—that is, how many values you would have to fill in or drop, if you are so inclined.

As with matrix, only up to 50 labeled columns will comfortably display in this configuration. However the dendrogram more elegantly handles extremely large datasets by simply flipping to a horizontal configuration.

configuration

For more advanced configuration details for your plots, refer to the CONFIGURATION.md file in this repository.

contributing

For thoughts on features or bug reports see Issues. If you're interested in contributing to this library, see details on doing so in the CONTRIBUTING.md file in this repository.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].