mbernico / Snape

Licence: apache-2.0
Snape is a convenient artificial dataset generator that wraps sklearn's make_classification and make_regression and then adds in 'realism' features such as complex formatting, varying scales, categorical variables, and missing values.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Snape

Php Ml
PHP-ML - Machine Learning library for PHP
Stars: ✭ 7,900 (+4996.77%)
Mutual labels:  dataset, classification, regression
Openml R
R package to interface with OpenML
Stars: ✭ 81 (-47.74%)
Mutual labels:  dataset, classification, regression
Dataset
Crop/Weed Field Image Dataset
Stars: ✭ 98 (-36.77%)
Mutual labels:  dataset, classification
Universal Data Tool
Collaborate & label any type of data, images, text, or documents, in an easy web interface or desktop app.
Stars: ✭ 1,356 (+774.84%)
Mutual labels:  dataset, classification
Autoannotationtool
A labeling tool that aims to reduce semantic segmentation labeling time; rectangle and polygon annotation are supported
Stars: ✭ 113 (-27.1%)
Mutual labels:  dataset, classification
Thundersvm
ThunderSVM: A Fast SVM Library on GPUs and CPUs
Stars: ✭ 1,282 (+727.1%)
Mutual labels:  classification, regression
Lossfunctions.jl
Julia package of loss functions for machine learning.
Stars: ✭ 89 (-42.58%)
Mutual labels:  classification, regression
Gpstuff
GPstuff - Gaussian process models for Bayesian analysis
Stars: ✭ 106 (-31.61%)
Mutual labels:  classification, regression
Pytsetlinmachine
Implements the Tsetlin Machine, Convolutional Tsetlin Machine, Regression Tsetlin Machine, Weighted Tsetlin Machine, and Embedding Tsetlin Machine, with support for continuous features, multigranularity, and clause indexing
Stars: ✭ 80 (-48.39%)
Mutual labels:  classification, regression
Machine Learning Projects
This repository consists of all my Machine Learning Projects.
Stars: ✭ 135 (-12.9%)
Mutual labels:  classification, regression
Tiny ml
NumPy implementations of the algorithms in Zhou Zhihua's book "Machine Learning" and some other traditional machine learning algorithms
Stars: ✭ 129 (-16.77%)
Mutual labels:  classification, regression
Machine Learning With Python
Practice and tutorial-style notebooks covering wide variety of machine learning techniques
Stars: ✭ 2,197 (+1317.42%)
Mutual labels:  classification, regression
Ml
A high-level machine learning and deep learning library for the PHP language.
Stars: ✭ 1,270 (+719.35%)
Mutual labels:  classification, regression
Dlcv for beginners
Companion code for the book "Deep Learning and Computer Vision"
Stars: ✭ 1,244 (+702.58%)
Mutual labels:  classification, regression
Machine Learning Algorithms
A curated list of almost all machine learning algorithms and deep learning algorithms grouped by category.
Stars: ✭ 92 (-40.65%)
Mutual labels:  classification, regression
Neuroflow
Artificial Neural Networks for Scala
Stars: ✭ 105 (-32.26%)
Mutual labels:  classification, regression
Benchmarks
Comparison tools
Stars: ✭ 139 (-10.32%)
Mutual labels:  classification, regression
Mlbox
MLBox is a powerful Automated Machine Learning python library.
Stars: ✭ 1,199 (+673.55%)
Mutual labels:  classification, regression
Pointclouddatasets
3D point cloud datasets in HDF5 format, containing uniformly sampled 2048 points per shape.
Stars: ✭ 80 (-48.39%)
Mutual labels:  dataset, classification
Mlr
Machine Learning in R
Stars: ✭ 1,542 (+894.84%)
Mutual labels:  classification, regression

Snape

Snape is a convenient artificial dataset generator that wraps sklearn's make_classification and make_regression and then adds in 'realism' features such as complex formatting, varying scales, categorical variables, and missing values.
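
To make the 'realism' features concrete, here is a minimal sketch of how clean numeric columns might be turned into messier ones. It is not Snape's actual implementation: the helper names add_dollar, add_percent, and blank_out are hypothetical, and only the Python standard library is used.

```python
import random

def add_dollar(values):
    """Render numbers as currency strings such as '$1,234.50' (hypothetical helper)."""
    return ["${:,.2f}".format(v) for v in values]

def add_percent(values):
    """Render fractions as percent strings such as '12.5%' (hypothetical helper)."""
    return ["{:.1f}%".format(v * 100) for v in values]

def blank_out(values, pct_missing, seed=0):
    """Replace roughly pct_missing of the entries with None to mimic missing data."""
    rng = random.Random(seed)
    return [None if rng.random() < pct_missing else v for v in values]

prices = [1234.5, 0.99, 250000.0]
rates = [0.125, 0.5]

print(add_dollar(prices))  # ['$1,234.50', '$0.99', '$250,000.00']
print(add_percent(rates))  # ['12.5%', '50.0%']
```

A student receiving such columns would need to strip the symbols and handle the gaps before modeling, which is the kind of cleanup these features are meant to exercise.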

Motivation

Snape was primarily created for academic and educational settings. It has been used to create datasets that are unique per student and per assignment for various homework assignments. It has also been used to create class-wide assessments in conjunction with 'Kaggle In the Classroom.'

Other users have suggested non-academic use cases as well, including 'interview screening problems,' model comparison, etc.

Installation

Via Github

git clone https://github.com/mbernico/snape.git
cd snape
python setup.py install

Via pip

Coming Soon...

Quick Start

Snape can run either as a Python module or as a command line application.

Command Line Usage

Creating a Dataset

From the main directory in the git repo:

python snape/make_dataset.py -c example/config_classification.json

This uses the configuration file example/config_classification.json to create an artificial dataset called 'my_dataset' (the name is specified in the JSON config; more on this later).

The dataset will consist of three files:

  • my_dataset_train.csv (80% of the artificial dataset with all dependent and independent variables)
  • my_dataset_test.csv (20% of the artificial dataset with only the independent variables present)
  • my_dataset_testkey.csv (the same 20% as _test, including the dependent variables)

Note that if a star schema is generated, additional csv files will be generated. There will be one extra csv file per dimension, but only the main 'fact table' dataset will be split into test and train files.
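
The three-file layout can be sketched as follows. This is an illustration under stated assumptions, not Snape's own code: the function split_dataset and the target column name "y" are hypothetical, and rows are plain dictionaries rather than CSV files.

```python
import random

def split_dataset(rows, target="y", test_frac=0.2, seed=42):
    """Split rows (a list of dicts) into train, test, and testkey tables."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_frac))
    testkey = shuffled[:n_test]   # held-out rows, target included
    train = shuffled[n_test:]     # remaining rows, target included
    # The student-facing test table drops the target column.
    test = [{k: v for k, v in row.items() if k != target} for row in testkey]
    return train, test, testkey

rows = [{"x1": i, "x2": i * 2, "y": i % 2} for i in range(10)]
train, test, testkey = split_dataset(rows)
print(len(train), len(test), len(testkey))  # 8 2 2
```

The test and testkey tables hold the same rows in the same order, so a predictions file built from the test table lines up row by row with the key.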

The train and test files can be given to a student. The student can respond with a file of predictions, which can be scored against the testkey as follows:

Scoring a Dataset

python snape/score_dataset.py -p example/student_predictions.csv -k example/student_testkey.csv

Snape's score_dataset.py will attempt to detect the problem type and then score it, printing some metrics:

Problem Type Detection: binary
---Binary Classification Score---
             precision    recall  f1-score   support

          0       0.81      0.99      0.89      1601
          1       0.50      0.06      0.11       399

avg / total       0.75      0.80      0.73      2000
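
To help read this report: F1 is the harmonic mean of precision and recall, and the 'avg / total' row is the support-weighted mean of the per-class rows. The short sketch below reproduces both from the rounded numbers printed above; the variable and function names are only illustrative.

```python
# Per-class rows copied from the report above: (precision, recall, f1, support).
report = {
    0: (0.81, 0.99, 0.89, 1601),
    1: (0.50, 0.06, 0.11, 399),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

total = sum(row[3] for row in report.values())

def weighted(col):
    """Support-weighted mean of one metric column."""
    return sum(row[col] * row[3] for row in report.values()) / total

avg = [round(weighted(c), 2) for c in range(3)]
print(round(f1(0.50, 0.06), 2))  # 0.11
print(avg, total)                # [0.75, 0.8, 0.73] 2000
```

Because class 0 has about four times the support of class 1, the averages stay high even though recall for class 1 is only 0.06.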

Python Module Usage

Creating a Dataset

from snape.make_dataset import make_dataset

# configuration json examples can be found in doc
conf = {
    "type": "classification",
    "n_classes": 2,
    "n_samples": 1000,
    "n_features": 10,
    "out_path": "./",
    "output": "my_dataset",
    "n_informative": 3,
    "n_duplicate": 0,
    "n_redundant": 0,
    "n_clusters": 2,
    "weights": [0.8, 0.2],
    "pct_missing": 0.00,
    "insert_dollar": "Yes",
    "insert_percent": "Yes",
    "n_categorical": 0,
    "star_schema": "No",
    "label_list": []
}

make_dataset(config=conf)
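
A configuration dictionary like this can also be saved to a JSON file for later reuse. The sketch below uses only the standard library; the file name my_config.json is hypothetical, and it is an assumption that the command line -c option accepts a file with the same keys as this dictionary.

```python
import json
import os
import tempfile

# Abbreviated for illustration; the full example above lists more keys.
conf = {
    "type": "classification",
    "n_samples": 1000,
    "output": "my_dataset",
    "weights": [0.8, 0.2],
}

path = os.path.join(tempfile.mkdtemp(), "my_config.json")  # hypothetical file name
with open(path, "w") as fh:
    json.dump(conf, fh, indent=4)

# Read the file back to confirm it round-trips unchanged.
with open(path) as fh:
    loaded = json.load(fh)

print(loaded == conf)  # True
```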

Scoring a Dataset

from snape.score_dataset import score_dataset

# a dataset's testkey can be compared to a prediction file using score_dataset()
results = score_dataset(y_file="student_testkey.csv", y_hat_file="student_predictions.csv")
# results is a tuple of (a_primary_metric, classification_report)
print("AUC = " + str(results[0]))
print(results[1])
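
For readers unfamiliar with AUC, the following standalone sketch shows what the metric measures: the fraction of (positive, negative) pairs in which the positive example receives the higher score. It is an illustration only and not how score_dataset computes its result; the function name auc and the sample data are made up for the example.

```python
def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(y_true, scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which makes AUC less sensitive to class imbalance than raw accuracy.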

Dataset Generation Config

  1. Classification JSON
  2. Regression JSON

Why Snape?

Snape is primarily used for creating complex datasets that challenge students and teach defense against the dark arts of machine learning. :)