dbcollection / dbcollection

Licence: MIT License
A collection of popular datasets for deep learning.

Programming Languages

python
139335 projects - #7 most used programming language
Jupyter Notebook
11667 projects

Projects that are alternatives of or similar to dbcollection

Persian Swear Words
A dataset of inappropriate and offensive Persian words for filtering text
Stars: ✭ 95 (+265.38%)
Mutual labels:  dataset, datasets
Awesome Json Datasets
A curated list of awesome JSON datasets that don't require authentication.
Stars: ✭ 2,421 (+9211.54%)
Mutual labels:  dataset, datasets
Exposure correction
Reference code for the paper "Learning Multi-Scale Photo Exposure Correction", CVPR 2021.
Stars: ✭ 98 (+276.92%)
Mutual labels:  dataset, datasets
Colour
Colour Science for Python
Stars: ✭ 1,131 (+4250%)
Mutual labels:  dataset, datasets
Datasets
TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
Stars: ✭ 3,094 (+11800%)
Mutual labels:  dataset, datasets
Atis dataset
The ATIS (Airline Travel Information System) Dataset
Stars: ✭ 81 (+211.54%)
Mutual labels:  dataset, datasets
Aesthetics
Image Aesthetics Toolkit - includes Fisher Vector implementation, AVA (Image Aesthetic Visual Analysis) dataset and fast multi-threaded downloader
Stars: ✭ 113 (+334.62%)
Mutual labels:  dataset, datasets
Voice datasets
🔊 A comprehensive list of open-source datasets for voice and sound computing (50+ datasets).
Stars: ✭ 494 (+1800%)
Mutual labels:  dataset, datasets
Retriever
Quickly download, clean up, and install public datasets into a database management system
Stars: ✭ 241 (+826.92%)
Mutual labels:  dataset, datasets
Datasets
source{d} datasets ("big code") for source code analysis and machine learning on source code
Stars: ✭ 231 (+788.46%)
Mutual labels:  dataset, datasets
French Sentiment Analysis Dataset
A collection of over 1.5 Million tweets data translated to French, with their sentiment.
Stars: ✭ 35 (+34.62%)
Mutual labels:  dataset, datasets
craft-text-detector
Packaged, Pytorch-based, easy to use, cross-platform version of the CRAFT text detector
Stars: ✭ 151 (+480.77%)
Mutual labels:  anaconda, pypi
Label Studio
Label Studio is a multi-type data labeling and annotation tool with standardized output format
Stars: ✭ 7,264 (+27838.46%)
Mutual labels:  dataset, datasets
Openml R
R package to interface with OpenML
Stars: ✭ 81 (+211.54%)
Mutual labels:  dataset, datasets
Awesome Twitter Data
A list of Twitter datasets and related resources.
Stars: ✭ 533 (+1950%)
Mutual labels:  dataset, datasets
Wb srgb
White balance camera-rendered sRGB images (CVPR 2019) [Matlab & Python]
Stars: ✭ 101 (+288.46%)
Mutual labels:  dataset, datasets
Awesome Segmentation Saliency Dataset
A collection of some datasets for segmentation / saliency detection. Welcome to PR...😄
Stars: ✭ 315 (+1111.54%)
Mutual labels:  dataset, datasets
Doccano
Open source annotation tool for machine learning practitioners.
Stars: ✭ 5,600 (+21438.46%)
Mutual labels:  dataset, datasets
Automated Resume Screening System
Automated Resume Screening System using Machine Learning (With Dataset)
Stars: ✭ 224 (+761.54%)
Mutual labels:  dataset, datasets
Elyra
Elyra extends JupyterLab Notebooks with an AI centric approach.
Stars: ✭ 839 (+3126.92%)
Mutual labels:  anaconda, pypi

dbcollection

Join the chat at https://gitter.im/dbcollection/dbcollection

dbcollection is a library for downloading, parsing and managing datasets via simple methods. It was built from the ground up to be cross-platform (Windows, Linux, macOS) and cross-language (Python, Lua, Matlab, etc.). This is achieved by using the popular HDF5 file format to store (meta)data of manually parsed datasets and the power of Python for scripting. By doing so, this library can target any platform that supports Python and any language that has bindings for HDF5.

This package makes it easy to manage and load datasets by using HDF5 files to store metadata. Because all the necessary metadata is stored on disk, managing big or small datasets has an equal or very similar impact on the system's resource usage. Also, once a dataset is set up, it is set up forever! This means users can reuse any previously set up dataset as many times as needed without having to set it up again each time it is used.
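
The "set up once, reuse afterwards" idea can be illustrated with a small caching sketch. It is not dbcollection's implementation: it stores metadata as JSON rather than HDF5 to stay dependency-free, and `load_or_setup` and `parse_dataset` are hypothetical names used only for this illustration.

```python
import json
import os
import tempfile


def load_or_setup(cache_file, setup_fn):
    """Return cached metadata, running `setup_fn` only if no cache exists."""
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f)
    metadata = setup_fn()
    with open(cache_file, "w") as f:
        json.dump(metadata, f)
    return metadata


calls = []


def parse_dataset():
    """Pretend to do the expensive one-time parsing step (hypothetical)."""
    calls.append(1)
    return {"train": {"num_images": 60000}, "test": {"num_images": 10000}}


cache_dir = tempfile.mkdtemp()
cache_file = os.path.join(cache_dir, "mnist_metadata.json")

first = load_or_setup(cache_file, parse_dataset)   # parses and writes the cache
second = load_or_setup(cache_file, parse_dataset)  # reads the cache from disk

print(len(calls))       # 1: the parsing step ran only once
print(first == second)  # True
```

The second call never touches the parsing function, which is the behaviour the paragraph above describes for datasets that have already been set up.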

dbcollection lets users focus on more important tasks, such as prototyping new models or testing them on different datasets, instead of spending time managing datasets or creating and modifying scripts to load and fetch data. It does so by building on the work of the community that shared these resources.

WARNING - Project Inactivity

This project will not receive new contributions until further notice, due to lack of personal time and interest. Moreover, other solutions work well enough for most use cases (for example, Kaggle). Therefore, consider this project deprecated unless enough people show interest in a centralised solution like dbcollection.

Main features

Here are some of the key features dbcollection provides:

  • Simple API to load/download/setup/manage datasets.
  • Simple API to fetch data from a dataset.
  • Store and pull data from disk or from memory; you choose!
  • Datasets only need to be set up and processed once, so the next time you use one it loads instantly!
  • Cross-platform (Windows, Linux, macOS).
  • Cross-language (Python, Lua/Torch7, Matlab).
  • Easily extensible to other languages that support the HDF5 file format.
  • Concurrent/parallel data access thanks to HDF5.
  • Contains a diverse (and growing!) list of popular datasets for machine learning and deep learning tasks (object detection, action recognition, human pose estimation, etc.)

Supported languages

  • Python (2.7 or >=3.5)
  • Lua/Torch7 (link)
  • Matlab (>=2014a) (link)

Package installation

From PyPI

Installing dbcollection with pip is simple. Run the following command:

$ pip install dbcollection

From source

To install dbcollection from source, follow these steps:

  • Clone the repo to your hard drive:
$ git clone --recursive https://github.com/dbcollection/dbcollection
  • cd to the dbcollection folder and run the command:
$ python setup.py install
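
To confirm that either installation route worked, a quick check such as the following can help. It is a minimal sketch for Python 3 that uses only the standard library; the helper name `is_installed` is hypothetical and not part of dbcollection.

```python
import importlib.util


def is_installed(package_name):
    """Return True if `package_name` can be found by the import system."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Prints True when dbcollection is importable in the current environment.
    print(is_installed("dbcollection"))
```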

Getting started

Basic usage

Using the module is straightforward. To import it, run:

>>> import dbcollection as dbc

To load a dataset, you only need a single method call. It returns a data loader object that can then be used to fetch data.

>>> mnist = dbc.load('mnist')

This data loader object contains information about the dataset’s name, task, data, cache paths, set splits, and some methods for querying and loading data from the HDF5 metadata file.

For example, if you want to know how the data is structured inside the metadata file, you can simply do the following:

>>> mnist.info()

> Set: test
   - classes,        shape = (10, 2),          dtype = uint8
   - images,         shape = (10000, 28, 28),  dtype = uint8,  (in 'object_ids', position = 0)
   - labels,         shape = (10000,),         dtype = uint8,  (in 'object_ids', position = 1)
   - object_fields,  shape = (2, 7),           dtype = uint8
   - object_ids,     shape = (10000, 2),       dtype = uint8

   (Pre-ordered lists)
   - list_images_per_class,  shape = (10, 1135),  dtype = int32

> Set: train
   - classes,        shape = (10, 2),          dtype = uint8
   - images,         shape = (60000, 28, 28),  dtype = uint8,  (in 'object_ids', position = 0)
   - labels,         shape = (60000,),         dtype = uint8,  (in 'object_ids', position = 1)
   - object_fields,  shape = (2, 7),           dtype = uint8
   - object_ids,     shape = (60000, 2),       dtype = uint8

   (Pre-ordered lists)
   - list_images_per_class,  shape = (10, 6742),  dtype = int32
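
The shapes above hint at how the metadata is laid out. The sketch below works on toy data in plain Python lists and rests on assumptions that this README does not state: that string fields such as `object_fields` are stored as zero-padded ASCII codes (which would explain `shape = (2, 7)` for the names `images` and `labels`), that each row of `object_ids` holds one index per field in the order given by `object_fields`, and that pre-ordered lists are padded with -1 to a common length. The helper names are hypothetical.

```python
def decode_padded_ascii(row):
    """Turn a zero-padded row of ASCII codes into a Python string."""
    return "".join(chr(code) for code in row if code != 0)


def unpad(indices, pad_value=-1):
    """Drop the padding entries from a fixed-length index list."""
    return [i for i in indices if i != pad_value]


# Toy stand-ins for the HDF5 fields (assumed layout, see the note above).
object_fields = [
    [105, 109, 97, 103, 101, 115, 0],  # "images" plus one padding byte
    [108, 97, 98, 101, 108, 115, 0],   # "labels" plus one padding byte
]
labels = [5, 0, 4]                      # one class id per sample
object_ids = [[0, 0], [1, 1], [2, 2]]   # row = (image index, label index)
list_images_per_class = [[1, -1], [-1, -1]]  # toy rows, padded with -1

field_names = [decode_padded_ascii(row) for row in object_fields]
label_pos = field_names.index("labels")

# Look up the label of the third object through object_ids.
third_label = labels[object_ids[2][label_pos]]

print(field_names)
print(third_label)
print(unpad(list_images_per_class[0]))
```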

To fetch data samples from a field, call a method with the set name, the field name and the row id(s) you want to select. For example, to retrieve the first 10 images:

>>> imgs = mnist.get('train', 'images', range(10))
>>> imgs.shape
(10, 28, 28)
Note: For more information about using this module, please check the documentation or the available notebooks for guidance.
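
Building on the `get` call above, mini-batches can be read by requesting consecutive ranges of row ids. This is a minimal sketch: `iter_batches` is a hypothetical helper, and it assumes only the `get(set_name, field, indices)` call shown above. `FakeLoader` is a stand-in so the example runs without downloading anything; with dbcollection installed, the object returned by `dbc.load('mnist')` would take its place.

```python
def iter_batches(loader, set_name, field, num_rows, batch_size):
    """Yield successive batches of `field` from `set_name`."""
    for start in range(0, num_rows, batch_size):
        stop = min(start + batch_size, num_rows)
        yield loader.get(set_name, field, range(start, stop))


class FakeLoader:
    """Stand-in exposing the same `get` call as the example above."""

    def __init__(self, data):
        self._data = data

    def get(self, set_name, field, indices):
        rows = self._data[set_name][field]
        return [rows[i] for i in indices]


loader = FakeLoader({"train": {"labels": [5, 0, 4, 1, 9, 2, 1]}})
batches = list(iter_batches(loader, "train", "labels", num_rows=7, batch_size=3))
print(batches)  # [[5, 0, 4], [1, 9, 2], [1]]
```

The last batch is shorter than `batch_size` whenever the number of rows is not an exact multiple of it.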

Notebooks

For a more practical introduction to managing datasets and fetching data with dbcollection, the notebooks/ folder contains Python notebooks with hands-on tutorials.

Documentation

The package documentation is hosted on Read The Docs.

It provides a more detailed guide on how to use this package as well as additional information that you might find relevant about this project.

Contributing

All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome. If you would like to see additional languages being supported, please consider contributing to the project.

If you are interested in fixing issues and contributing directly to the code base, please see the document How to Contribute.

Feedback

For now, use GitHub issues for feature requests and bug reports, or our Gitter room for any other questions you may have.

License

MIT License
