
0011001011 / Vizuka

Explore high-dimensional datasets and how your algo handles specific regions.



Data visualization

This is a collection of tools to represent and navigate through high-dimensional datasets.

  • t-SNE is the default algorithm used to construct the 2D space.
  • The module should be agnostic of the data provided.
  • It ships with MNIST for quick testing.
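
For intuition, the default projection step can be sketched with scikit-learn's t-SNE (an illustration only, not Vizuka's internal code):

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy stand-in for a preprocessed feature space: 100 samples, 50 features
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 50))

# Project into 2D; perplexity must stay below the number of samples
x_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(x)
print(x_2d.shape)  # (100, 2)
```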

For commercial use and user support, please contact Sofian Medbouhi ([email protected]), who offers a business version with additional features.

[screenshot: zoom view]

Usage

How to install?

$ pip install vizuka

or clone the repo :)

build-essential is required for the wordcloud

# apt-get install build-essential

How to run?

$ vizuka # launch the visualization tool

# For a quick working example with MNIST run:
$ vizuka --mnist
# This downloads MNIST, fits a basic logistic regression, and projects it in 2D with t-SNE

$ vizuka --show-required-files
# Shows the format of the files you need to launch a data viz

But you don't want to use the MNIST toy dataset, right? Here is a complete working example:

# EXAMPLE:
# you have your preprocessed data in 		~/data/set/preprocessed_MYDATASET01.npz
#                 and predictions in 		~/data/models/predict_MYDATASET01.npz
#		  optionally the raw dataset in ~/data/set/raw_data_MYDATASET01.npz
# Run:
$ vizuka-reduce --path ~/data --version MYDATASET01 # projects in 2D
$ vizuka 	--path ~/data --version MYDATASET01

By default Vizuka looks for the data in __package__/data/, but you can force your own location with the --path argument.

  • Note that if you are effectively doing big data, you should install MulticoreTSNE (used by vizuka/dimension_reduction/tSNE.py), unless you want to discover that t-SNE crashes with a segfault. Installation instructions can be found in requirements/requirements.apt.

Plugin structure

Add your plugins in vizuka/plugins/

You can define your own:

  • dimension reduction algorithm
  • heatmaps
  • clustering engines
  • cluster delimiter (frontiers)
  • cluster viewer (cf last image)

The only thing to do: use vizuka/plugins/dimension_reduction/ (for example) and follow the instructions from How_to_add_dimension_reduction_plugin.py

Your plugin will then be available everywhere in Vizuka, both in the GUI and on the command line, without further adjustment.
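
A plugin typically boils down to a class wrapping a projection method. The class and method names below are assumptions for illustration only; the real interface is specified in How_to_add_dimension_reduction_plugin.py:

```python
import numpy as np
from sklearn.decomposition import PCA

class MyPCAReducer:
    """Hypothetical plugin sketch: the actual base class and method names
    are defined in How_to_add_dimension_reduction_plugin.py."""

    def __init__(self, n_components=2):
        self.engine = PCA(n_components=n_components)

    def reduce(self, x):
        # Project the preprocessed inputs into a 2D space
        return self.engine.fit_transform(x)

x = np.random.default_rng(0).normal(size=(50, 10))
x_2d = MyPCAReducer().reduce(x)
print(x_2d.shape)  # (50, 2)
```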

What will I get?

A nice tool to draw clusters, explore the distributions inside them, and zoom in. Example with the MNIST toy dataset (vizuka --mnist); for a real-life example, please scroll down:

[screenshots: zoom view, color view, cluster view]

How to use?

Navigate inside the 2D space and look at the data by selecting it in the main window (the big one). Data is grouped by cluster; you can select clusters individually (left click).

The main window represents all the data in the 2D space. Blue points are correctly predicted samples, red ones are mispredicted, and green ones belong to the special class (by default, the label 0).

Below it are three subplots:

  • a summary of the data inside the selected buckets (see navigation)
  • a heatmap of the red/blue/green representation
  • a heatmap of the cross-entropy between each bucket's empirical distribution and the global empirical distribution
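
The cross-entropy heatmap compares each bucket's empirical class distribution against the global one; the quantity itself is simple to compute (a sketch of the measure, not Vizuka's implementation):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i); eps guards against log(0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log(q + eps)))

global_dist = np.array([0.5, 0.3, 0.2])    # class distribution over all data
bucket_dist = np.array([0.9, 0.05, 0.05])  # class distribution inside one bucket

# Each bucket of the heatmap is colored by this value
print(round(cross_entropy(bucket_dist, global_dist), 4))  # 0.7645
```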

Data viz navigation:

  • left click selects a bucket of data
  • right click resets all in-memory buckets

Other options:

  • clusterize with an algorithm: a simple grid, KMeans, DBSCAN...
  • visualize the distribution of classes inside selected clusters
  • visualize the distribution of features inside selected clusters
  • filter by predicted class or by real class
  • filter by any feature you may have in your raw (non-preprocessed) dataset
  • export: export the data you selected to an output.csv
  • cluster borders: draw borders between clusters based on the Bhattacharyya similarity measure, or simply draw them all
  • choose a different set of predictions to display
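
The Bhattacharyya measure mentioned above rates how much two cluster distributions overlap; it can be sketched as follows (the measure itself, not Vizuka's border-drawing code):

```python
import numpy as np

def bhattacharyya_coefficient(p, q):
    """BC(p, q) = sum_i sqrt(p_i * q_i): 1 for identical distributions,
    0 for distributions with disjoint support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(np.sqrt(p * q)))

cluster_a = np.array([0.7, 0.2, 0.1])
cluster_b = np.array([0.1, 0.2, 0.7])

# Similar clusters score close to 1, dissimilar ones close to 0
print(round(bhattacharyya_coefficient(cluster_a, cluster_a), 4))  # 1.0
print(round(bhattacharyya_coefficient(cluster_a, cluster_b), 4))  # 0.7292
```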

What does it need to be executed?

$ vizuka --show-required-files

VERSION: string that identifies your dataset (default is MNIST_example)
PATH	: data/ is located in /home/sofian/data_viz/manakin-ml-analytics/vizuka, change with --path
FORMAT	: all are .npz file

REQUIRED:
=========
	 + data/set/preprocessed_inputs_VERSION.npz
	 ------------------------------------------
		 x:	 preprocessed inputs, your feature space
		 y:	 outputs to be predicted, the "true" class
		 NB:	 this is the only mandatory file, the following is highly recommended:


OPTIONAL BUT USEFUL:
===================
	 + data/models/predict_VERSION.npz -> optional but recommended
	 -------------------------------------------------------------
		 y:	 predictions returned by your algorithm
		 NB:	 should be same formatting as in preprocessed_inputs_VERSION["y"])
				 if you don't have one, use --force-no-predict


	 + data/set/raw_data_VERSION.npz -> optional
	 --------------------------
		 x:		 array of inputs BEFORE preprocessing
					 probably human-readable, thus useful for visualization
		 columns:	 the name of the columns variable in x
		 NB:	 this file is used if you run vizuka with
			    --feature-name-to-display COLUMN_NAME:PLOTTER COLUMN_NAME2:PLOTTER2 or
			    (see help for details)


GENERATED BY VIZUKA:
====================
	 + data/reduced/algoname#VERSION#PARAM1_NAME::VALUE#PARAM2_NAME::VALUE.npz
	 ------------------------------------------------------------------------
		 x2D:	 projections of the preprocessed inputs x in a 2D space
		 NB:	 you can change default projection parameters and works with several ones
			 see vizuka-reduce

Typical use-case:

You have your preprocessed data? Cool, this is the only mandatory file you need. Place it in the folder data/set/preprocessed_inputs_VERSION.npz, VERSION being a string specific to this particular dataset. It must contain at least the key 'x', holding the vectors you learn from, and ideally the key 'y' with the correct outputs. If you also have your own predictions, place them inside data/models/ under the npz key "y" (the default file loaded is predict_VERSION.npz, cf --show-required-files).
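
Building these npz files from NumPy arrays is straightforward. A minimal sketch, using MYDATASET01 as the VERSION string and random arrays as stand-ins for your data:

```python
import os
import numpy as np

version = "MYDATASET01"
rng = np.random.default_rng(0)

x = rng.normal(size=(1000, 20))          # preprocessed feature vectors
y = rng.integers(0, 10, size=1000)       # the "true" classes
y_pred = rng.integers(0, 10, size=1000)  # your algorithm's predictions

os.makedirs("data/set", exist_ok=True)
os.makedirs("data/models", exist_ok=True)

# Only the first file is mandatory; predictions are recommended
np.savez("data/set/preprocessed_inputs_%s.npz" % version, x=x, y=y)
np.savez("data/models/predict_%s.npz" % version, y=y_pred)

loaded = np.load("data/set/preprocessed_inputs_%s.npz" % version)
print(sorted(loaded.files))  # ['x', 'y']
```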

Optionally you can add a raw_data_VERSION.npz file containing the raw, non-preprocessed data. The vectors should be under the key "x" and the names of the human-readable "features" under the key "columns".
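
The raw-data file follows the same pattern; the column names below are invented purely for illustration:

```python
import os
import numpy as np

version = "MYDATASET01"

# Human-readable inputs BEFORE preprocessing, plus their column names
raw_x = np.array([["alice", "42"], ["bob", "17"]])
columns = np.array(["name", "age"])

os.makedirs("data/set", exist_ok=True)
np.savez("data/set/raw_data_%s.npz" % version, x=raw_x, columns=columns)

reloaded = np.load("data/set/raw_data_%s.npz" % version)
print([str(c) for c in reloaded["columns"]])  # ['name', 'age']
```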

Now you may want to launch Vizuka! First project your preprocessed space in 2D with vizuka-reduce, then visualize with vizuka.

And take some coffee.

Or two.

Or three, Vizuka is busy reducing the dimension.

...

Congratulations! Now you may want to display your 2D data, as you are able to browse your embedded space. Maybe you want to look for a specific cluster. Explore the data with the graph options, zoom in and out, and use the provided filters to find an interesting area.

When you are satisfied, click to select clusters. This is quite inefficient, as you will select small rectangular tiles one by one on a grid, so you may want to Clusterize using KMeans or DBSCAN instead.

Great, now you can select whole clusters of data at once. But what's in there? Use the Cluster exploration menu for that. When you are done, click on the export button to get a nicely formatted csv (assuming you provided "raw" data) containing the data in the clusters you selected.

You finished your session but still want to dive into the clusters later? Just select Save clusterization to save your session.

Default parameters

See config.py

Real life example

[screenshots: zoom view, cluster view]
