
explosion / ml-datasets

License: MIT
🌊 Machine learning dataset loaders for testing and example scripts

Programming Languages

Python

Projects that are alternatives of or similar to ml-datasets

cifair
A duplicate-free variant of the CIFAR test set.
Stars: ✭ 13 (-67.5%)
Mutual labels:  datasets, machine-learning-datasets
Projects
πŸͺ End-to-end NLP workflows from prototype to production
Stars: ✭ 397 (+892.5%)
Mutual labels:  spacy, datasets
spacy-server
🦜 Containerized HTTP API for industrial-strength NLP via spaCy and sense2vec
Stars: ✭ 58 (+45%)
Mutual labels:  spacy
parlitools
A collection of useful tools for UK politics
Stars: ✭ 22 (-45%)
Mutual labels:  datasets
time-series-classification
Classifying time series using feature extraction
Stars: ✭ 75 (+87.5%)
Mutual labels:  datasets
json2python-models
Generate Python model classes (pydantic, attrs, dataclasses) based on JSON datasets with typing module support
Stars: ✭ 119 (+197.5%)
Mutual labels:  datasets
spacy-french-models
French models for spacy
Stars: ✭ 22 (-45%)
Mutual labels:  spacy
dw-jdbc
JDBC driver for data.world
Stars: ✭ 17 (-57.5%)
Mutual labels:  datasets
datasets
πŸ€— The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
Stars: ✭ 13,870 (+34575%)
Mutual labels:  datasets
ake-datasets
Large, curated set of benchmark datasets for evaluating automatic keyphrase extraction algorithms.
Stars: ✭ 125 (+212.5%)
Mutual labels:  datasets
Dataset-Sentimen-Analisis-Bahasa-Indonesia
This repository is a collection of datasets for Indonesian-language sentiment analysis. If you use the datasets in this repository for research, please cite the journal article associated with each dataset. The available datasets have been used in several studies whose results have been published…
Stars: ✭ 38 (-5%)
Mutual labels:  datasets
datasets
The primary repository for all of the CORGIS Datasets
Stars: ✭ 19 (-52.5%)
Mutual labels:  datasets
airy
πŸ’¬ Open source conversational platform to power conversations with an open source Live Chat, Messengers like Facebook Messenger, WhatsApp and more - πŸ’Ž UI from Inbox to dashboards - πŸ€– Integrations to Conversational AI / NLP tools and standard enterprise software - ⚑ APIs, WebSocket, Webhook - πŸ”§ Create any conversational experience
Stars: ✭ 299 (+647.5%)
Mutual labels:  spacy
spacy hunspell
✏️ Hunspell extension for spaCy 2.0.
Stars: ✭ 94 (+135%)
Mutual labels:  spacy
multi-task-defocus-deblurring-dual-pixel-nimat
Reference github repository for the paper "Improving Single-Image Defocus Deblurring: How Dual-Pixel Images Help Through Multi-Task Learning". We propose a single-image deblurring network that incorporates the two sub-aperture views into a multitask framework. Specifically, we show that jointly learning to predict the two DP views from a single …
Stars: ✭ 29 (-27.5%)
Mutual labels:  datasets
DaCy
DaCy: The State of the Art Danish NLP pipeline using SpaCy
Stars: ✭ 66 (+65%)
Mutual labels:  spacy
alter-nlu
Natural language understanding library for chatbots with intent recognition and entity extraction.
Stars: ✭ 45 (+12.5%)
Mutual labels:  spacy
agile
🌌 Global State and Logic Library for JavaScript/Typescript applications
Stars: ✭ 90 (+125%)
Mutual labels:  spacy
bert-tensorflow-pytorch-spacy-conversion
Instructions for how to convert a BERT Tensorflow model to work with HuggingFace's pytorch-transformers, and spaCy. This walk-through uses DeepPavlov's RuBERT as an example.
Stars: ✭ 26 (-35%)
Mutual labels:  spacy
dataset
dataset is a command line tool, Go package, shared library and Python package for working with JSON objects as collections
Stars: ✭ 21 (-47.5%)
Mutual labels:  datasets

Machine learning dataset loaders for testing and examples

Loaders for various machine learning datasets for testing and example scripts. Previously in thinc.extra.datasets.

Setup and installation

The package can be installed via pip:

pip install ml-datasets

Loaders

Loaders can be imported directly or used via their string name (which is useful if they're set via command line arguments). Some loaders may take arguments – see the source for details.

# Import directly
from ml_datasets import imdb
train_data, dev_data = imdb()

# Load via registry
from ml_datasets import loaders
imdb_loader = loaders.get("imdb")
train_data, dev_data = imdb_loader()
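
Because loaders can be looked up by their string name, the dataset choice can be wired to a script's arguments. A minimal sketch (the --dataset flag and argparse wiring are illustrative, not part of the package):

# Hypothetical script: pick a loader by its string name (e.g. "imdb", "dbpedia")
import argparse
from ml_datasets import loaders

parser = argparse.ArgumentParser()
parser.add_argument("--dataset", default="imdb")
args = parser.parse_args()

loader = loaders.get(args.dataset)  # look up the loader in the registry
train_data, dev_data = loader()     # call it just like the imported function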

Available loaders

NLP datasets

ID / Function      | Description                                  | NLP task                                   | From URL
imdb               | IMDB sentiment dataset                       | Binary classification: sentiment analysis  | βœ“
dbpedia            | DBPedia ontology dataset                     | Multi-class single-label classification    | βœ“
cmu                | CMU movie genres dataset                     | Multi-class, multi-label classification    | βœ“
quora_questions    | Duplicate Quora questions dataset            | Detecting duplicate questions              | βœ“
reuters            | Reuters dataset (texts not included)         | Multi-class multi-label classification     | βœ“
snli               | Stanford Natural Language Inference corpus   | Recognizing textual entailment             | βœ“
stack_exchange     | Stack Exchange dataset                       | Question answering                         |
ud_ancora_pos_tags | Universal Dependencies Spanish AnCora corpus | POS tagging                                | βœ“
ud_ewtb_pos_tags   | Universal Dependencies English EWT corpus    | POS tagging                                | βœ“
wikiner            | WikiNER data                                 | Named entity recognition                   |

Other ML datasets

ID / Function | Description | ML task           | From URL
mnist         | MNIST data  | Image recognition | βœ“

Dataset details

IMDB

Each instance contains the text of a movie review, and a sentiment expressed as 0 or 1.

train_data, dev_data = ml_datasets.imdb()
for text, annot in train_data[0:5]:
    print(f"Review: {text}")
    print(f"Sentiment: {annot}")

Property            | Training         | Dev
# Instances         | 25000            | 25000
Label values        | {0, 1}           | {0, 1}
Labels per instance | Single           | Single
Label distribution  | Balanced (50/50) | Balanced (50/50)
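
To confirm the balanced split, the label counts can be tallied directly from the returned (text, annotation) tuples; a quick sketch using collections.Counter:

from collections import Counter
import ml_datasets

train_data, dev_data = ml_datasets.imdb()
print(Counter(annot for _, annot in train_data))  # expect roughly equal counts for 0 and 1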

DBPedia

Each instance contains an ontological description, and a classification into one of 14 distinct labels.

train_data, dev_data = ml_datasets.dbpedia()
for text, annot in train_data[0:5]:
    print(f"Text: {text}")
    print(f"Category: {annot}")

Property            | Training | Dev
# Instances         | 560000   | 70000
Label values        | 1-14     | 1-14
Labels per instance | Single   | Single
Label distribution  | Balanced | Balanced
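
Like the other text loaders, dbpedia returns lists of (text, annotation) tuples, which can be unzipped into parallel lists if a model expects texts and labels separately; a small sketch:

import ml_datasets

train_data, dev_data = ml_datasets.dbpedia()
train_texts, train_labels = zip(*train_data)  # parallel tuples of texts and category labels
dev_texts, dev_labels = zip(*dev_data)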

CMU

Each instance contains a movie description, and a classification into a list of appropriate genres.

train_data, dev_data = ml_datasets.cmu()
for text, annot in train_data[0:5]:
    print(f"Text: {text}")
    print(f"Genres: {annot}")

Property            | Training                                                                                       | Dev
# Instances         | 41793                                                                                          | 0
Label values        | 363 different genres                                                                           | -
Labels per instance | Multiple                                                                                       | -
Label distribution  | Imbalanced: 147 labels with fewer than 20 examples, while Drama occurs more than 19000 times   | -
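
Because each annotation is a list of genres, the imbalance can be inspected by counting genre occurrences across all instances; a sketch, assuming the annotations are iterable lists of genre strings as shown above:

from collections import Counter
import ml_datasets

train_data, _ = ml_datasets.cmu()
genre_counts = Counter(genre for _, genres in train_data for genre in genres)
print(genre_counts.most_common(5))            # "Drama" should dominate
rare = [g for g, n in genre_counts.items() if n < 20]
print(len(rare))                              # roughly 147 rarely occurring genres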

Quora

train_data, dev_data = ml_datasets.quora_questions()
for questions, annot in train_data[0:50]:
    q1, q2 = questions
    print(f"Question 1: {q1}")
    print(f"Question 2: {q2}")
    print(f"Similarity: {annot}")

Each instance contains two Quora questions and a label indicating whether or not they are duplicates (0: no, 1: yes). The ground-truth labels contain some noise: they are not guaranteed to be perfect.

Property            | Training                 | Dev
# Instances         | 363859                   | 40429
Label values        | {0, 1}                   | {0, 1}
Labels per instance | Single                   | Single
Label distribution  | Imbalanced: 63% label 0  | Imbalanced: 63% label 0
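
Since label 1 marks duplicates, the training data can be filtered down to just the duplicate question pairs; a short sketch:

import ml_datasets

train_data, dev_data = ml_datasets.quora_questions()
duplicates = [questions for questions, label in train_data if label == 1]
print(len(duplicates), "duplicate question pairs")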

Registering loaders

Loaders can be registered externally using the loaders registry as a decorator. For example:

@ml_datasets.loaders("my_custom_loader")
def my_custom_loader():
    return load_some_data()

assert "my_custom_loader" in ml_datasets.loaders
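
Once registered, the custom loader behaves like the built-in ones and can be retrieved by name, a usage sketch built on the registry calls shown above:

my_loader = ml_datasets.loaders.get("my_custom_loader")
data = my_loader()
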
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].