Top 203 datasets open source projects

Pydataset
Instant access to many datasets in Python.
Entity Recognition Datasets
A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
Conversational Datasets
Large datasets for conversational AI
Ogb
Benchmark datasets, data loaders, and evaluators for graph machine learning
Audino
Open source audio annotation tool for humans™
Awesome Transit
Community list of transit APIs, apps, datasets, research, and software 🚌🌟🚋🌟🚂
Datasets For Recommender Systems
A repository of high-quality, topic-centric public data sources for Recommender Systems (RS)
Annotated Semantic Relationships Datasets
A collection of free, public annotated datasets of relationships between entities/nominals (Portuguese and English)
Loghub
A large collection of system log datasets for AI-powered log analytics
Datasets
Machine learning datasets used in tutorials on MachineLearningMastery.com
Voice datasets
🔊 A comprehensive list of open-source datasets for voice and sound computing (50+ datasets).
Awesome Dataset Tools
🔧 A curated list of awesome dataset tools
Chinese Nlp Corpus
Collections of Chinese NLP corpora
Geobr
Easy access to official spatial data sets of Brazil in R and Python
Projects
🪐 End-to-end NLP workflows from prototype to production
Awesome Holistic 3d
A list of papers and resources (data, code, etc.) for holistic 3D reconstruction in computer vision
Video Understanding Dataset
A collection of recent video understanding datasets, under construction!
Paperrobot
Code for PaperRobot: Incremental Draft Generation of Scientific Ideas
Animal Matting
GitHub repository for the paper "End-to-end Animal Image Matting"
Dr.sure
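🏫 Deep learning study notes, plus notes on experience using TensorFlow and PyTorch. Dr. Sure adds the latest techniques he comes across to the project from time to time; criticism and corrections are welcome.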
Chakin
Simple downloader for pre-trained word vectors
Akshare
AKShare is an elegant and simple open-source financial data interface library for Python, built for human beings!
Awesome Segmentation Saliency Dataset
A collection of datasets for segmentation / saliency detection. PRs welcome 😄
Medical Datasets
Tracking medical datasets, with a focus on medical imaging
Cleora
Cleora AI is a general-purpose model for efficient, scalable learning of stable and inductive entity embeddings for heterogeneous relational data.
Tdc
Therapeutics Data Commons: Machine Learning Datasets and Tasks for Therapeutics
Cluecorpus2020
Large-scale pre-training corpus for Chinese (100G)
Meglass
An eyeglass face dataset collected and cleaned for face recognition evaluation, CCBR 2018.
Hub
Dataset format for AI. Build, manage, & visualize datasets for deep learning. Stream data in real time to PyTorch/TensorFlow & version-control it. https://activeloop.ai
Roapi
Create full-fledged APIs for static datasets without writing a single line of code.
newsletter-archive
Markdown archive & RSS/Atom feeds for Data Is Plural.
datasets
TFDS data loaders for sign language datasets.
covid-19-data-cleanup
Scripts to clean up data from https://github.com/CSSEGISandData/COVID-19
NetEmb-Datasets
A collection of real-world networks/graphs for Network Embedding
databrewer-recipes
DataBrewer Recipes Repository.
recurrent-defocus-deblurring-synth-dual-pixel
Reference GitHub repository for the paper "Learning to Reduce Defocus Blur by Realistically Modeling Dual-Pixel Data". We propose a procedure to generate realistic DP data synthetically. Our synthesis approach mimics the optical image formation found on DP sensors and can be applied to virtual scenes rendered with standard computer software. Lev…
podium
Podium: a framework-agnostic Python NLP library for data loading and preprocessing
disent
🧶 Modular VAE disentanglement framework for Python built with PyTorch Lightning ▸ Including metrics and datasets ▸ With strongly supervised, weakly supervised and unsupervised methods ▸ Easily configured and run with Hydra config ▸ Inspired by disentanglement_lib
TSForecasting
Implementations of the experiments run on a set of publicly available datasets used in time series forecasting research.
dplace-data
The data repository for the D-PLACE Project (Database of Places, Language, Culture and Environment)
ml4se
A curated list of papers, theses, datasets, and tools related to the application of Machine Learning for Software Engineering
opendatasets
A Python library for downloading datasets from Kaggle, Google Drive, and other online sources.
databrewer
The missing datasets manager. Like Homebrew, but for datasets. A CLI tool to search for and discover datasets!
ck-env
CK repository with components and automation actions to enable portable workflows across diverse platforms including Linux, Windows, macOS and Android. It includes software detection plugins and meta packages (code, data sets, models, scripts, etc.) with the possibility of multiple versions coexisting in a user or system environment.
RData.jl
Read R data files from Julia