
edwardlib / Observations

Licence: other
Tools for loading standard data sets in machine learning

Programming Languages

python

Projects that are alternatives of or similar to Observations

Machine Learning With Python
Practice and tutorial-style notebooks covering wide variety of machine learning techniques
Stars: ✭ 2,197 (+1056.32%)
Mutual labels:  data-science, statistics
Aulas
Classes from the São Paulo School of Artificial Intelligence
Stars: ✭ 166 (-12.63%)
Mutual labels:  data-science, statistics
Tidyversity
🎓 Tidy tools for academics
Stars: ✭ 155 (-18.42%)
Mutual labels:  statistics, science
Collapse
Advanced and Fast Data Transformation in R
Stars: ✭ 184 (-3.16%)
Mutual labels:  data-science, statistics
Virgilio
Virgilio is developed and maintained by these awesome people. You can email us virgilio.datascience (at) gmail.com or join the Discord chat.
Stars: ✭ 13,200 (+6847.37%)
Mutual labels:  data-science, statistics
Data Science Question Answer
A repo for data science related questions and answers
Stars: ✭ 2,000 (+952.63%)
Mutual labels:  data-science, statistics
Pzad
The "Applied Problems of Data Analysis" course (CMC Faculty, Lomonosov Moscow State University)
Stars: ✭ 160 (-15.79%)
Mutual labels:  data-science, education
Rpy2
Interface to use R from Python
Stars: ✭ 132 (-30.53%)
Mutual labels:  data-science, statistics
Awesome Python Applications
💿 Free software that works great, and also happens to be open-source Python.
Stars: ✭ 13,275 (+6886.84%)
Mutual labels:  education, science
Covid19 Severity Prediction
Extensive and accessible COVID-19 data + forecasting for counties and hospitals. 📈
Stars: ✭ 170 (-10.53%)
Mutual labels:  data-science, statistics
Design Of Experiment Python
Design-of-experiment (DOE) generator for science, engineering, and statistics
Stars: ✭ 143 (-24.74%)
Mutual labels:  statistics, science
Docker Galaxy Stable
🐳📊📚 Docker Images tracking the stable Galaxy releases.
Stars: ✭ 179 (-5.79%)
Mutual labels:  data-science, science
Book
This book serves as an introduction to a whole new way of thinking systematically about geographic data, using geographical analysis and computation to unlock new insights hidden within data.
Stars: ✭ 141 (-25.79%)
Mutual labels:  data-science, statistics
Uncertainty Metrics
An easy-to-use interface for measuring uncertainty and robustness.
Stars: ✭ 145 (-23.68%)
Mutual labels:  data-science, statistics
Interactive machine learning
IPython widgets, interactive plots, interactive machine learning
Stars: ✭ 140 (-26.32%)
Mutual labels:  data-science, statistics
Zigzag
Python library for identifying the peaks and valleys of a time series.
Stars: ✭ 156 (-17.89%)
Mutual labels:  data-science, statistics
Awesome Scientific Python
A curated list of awesome scientific Python resources
Stars: ✭ 127 (-33.16%)
Mutual labels:  data-science, science
Lifelines
Survival analysis in Python
Stars: ✭ 1,766 (+829.47%)
Mutual labels:  data-science, statistics
Data Science Toolkit
Collection of stats, modeling, and data science tools in Python and R.
Stars: ✭ 169 (-11.05%)
Mutual labels:  data-science, statistics
Datasets For Good
List of datasets to apply stats/machine learning/technology to the world of social good.
Stars: ✭ 174 (-8.42%)
Mutual labels:  data-science, education

Observations


Announcement (September 16, 2018): Observations is in the process of being replaced by TensorFlow Datasets. Unlike Observations, TensorFlow Datasets is more performant, provides pipelining for >2 GB data sets as well as all of Tensor2Tensor's data sets, and interfaces better with tf.data. We're working to carry over all of Observations' features, such as its relatively simple API, support for all of Observations' data sets, and a method to return NumPy arrays instead of TensorFlow Tensors.

Observations provides a one-line Python API for loading standard data sets in machine learning. It automates the process of downloading, extracting, loading, and preprocessing data, which helps keep workflows reproducible and aligned with sensible standards.

It can be used in two ways.

1. As a package

Install it.

pip install observations

Import it.

from observations import svhn

(x_train, y_train), (x_test, y_test) = svhn("~/data")

All functions take as input a filepath and optional preprocessing arguments. They return a tuple in the form of training data, test data, and validation data (if available). Each element in the tuple is typically a NumPy array, a tuple of NumPy arrays (e.g., features and labels), or a string (text). See the API for details.
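For example, both return styles can be unpacked directly. A short sketch, assuming svhn returns a (features, labels) tuple per split and enwik8 (shown below) returns raw text with a validation split:

from observations import enwik8, svhn

# Splits whose elements are (features, labels) tuples of NumPy arrays.
(x_train, y_train), (x_test, y_test) = svhn("~/data")

# Splits whose elements are strings of text, including a validation split.
text_train, text_test, text_valid = enwik8("~/data")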

2. As source code

Copy and paste the functions relevant to your experiments into your codebase.

def enwik8(path):
  ...

x_train, x_test, x_valid = enwik8("~/data")

Each function has minimal dependencies. For example, enwik8.py only depends on core libraries and the external function maybe_download_and_extract in util.py. The functions are designed to be easy to read and hack at.
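Because each function is self-contained, hacking on a copied function is straightforward. As a purely hypothetical example, one could wrap the copied enwik8 above to return integer character IDs instead of raw text:

import numpy as np

def enwik8_ids(path):
  """Hypothetical wrapper: encode enwik8's text as integer character IDs."""
  x_train, x_test, x_valid = enwik8(path)  # the copied function from above
  # Build the vocabulary from the training split only.
  vocab = {char: i for i, char in enumerate(sorted(set(x_train)))}
  def encode(text):
    return np.array([vocab.get(char, 0) for char in text])
  return encode(x_train), encode(x_test), encode(x_valid)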

FAQ

Which approach should I take?

It depends on your use case.

  1. As a package, dozens of data sets are at your disposal. The package establishes sensible standards for conveniently loading data sets and thus experimenting with them quickly.
  2. As source code, you have complete flexibility—from the initial download all the way to preprocessing the data as NumPy arrays.

How do I use minibatches of data?

The data loading functions return the full data set. Generating minibatches is up to you.

One helpful utility is

import numpy as np

def generator(array, batch_size):
  """Generate batches with respect to the array's first axis."""
  start = 0  # pointer to where we are in iteration
  while True:
    stop = start + batch_size
    diff = stop - array.shape[0]
    if diff <= 0:
      batch = array[start:stop]
      start += batch_size
    else:
      # Wrap around the end of the array to fill the batch.
      batch = np.concatenate((array[start:], array[:diff]))
      start = diff
    yield batch

To use it, simply write

from observations import cifar10
(x_train, y_train), (x_test, y_test) = cifar10("~/data")
x_train_data = generator(x_train, 256)

for batch in x_train_data:
  ...  # operate on batch

batch = next(x_train_data)  # alternatively, advance the iterator manually

There's also an extended version. It takes a list of arrays as input and yields a list of batches.

import numpy as np

def generator(arrays, batch_size):
  """Generate batches, one with respect to each array's first axis."""
  starts = [0] * len(arrays)  # pointers to where we are in iteration
  while True:
    batches = []
    for i, array in enumerate(arrays):
      start = starts[i]
      stop = start + batch_size
      diff = stop - array.shape[0]
      if diff <= 0:
        batch = array[start:stop]
        starts[i] += batch_size
      else:
        # Wrap around the end of the array to fill the batch.
        batch = np.concatenate((array[start:], array[:diff]))
        starts[i] = diff
      batches.append(batch)
    yield batches

To use it, simply write

from observations import cifar10
(x_train, y_train), (x_test, y_test) = cifar10("~/data")
train_data = generator([x_train, y_train], 256)

for x_batch, y_batch in train_data:
  ...  # operate on batch

x_batch, y_batch = next(train_data)  # alternatively, advance the iterator manually
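Both generators cycle over the data indefinitely, so a bare for loop never terminates. To run a fixed number of steps, say one epoch, bound the iteration yourself; for example, with itertools (assuming a batch size of 256 as above):

import itertools

batches_per_epoch = x_train.shape[0] // 256
for x_batch, y_batch in itertools.islice(train_data, batches_per_epoch):
  ...  # operate on one epoch's worth of batches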

Contributing

We'd like your help! Any pull requests that help maintain existing functions or add new ones are appreciated. We follow Edward's standards for style and documentation.

Each function takes as input a filepath and optional preprocessing arguments. All necessary packages that aren't from the Python Standard Library, NumPy, or six are imported inside the function's body. The functions proceed as follows (a sketch follows the list):

  1. Check if the extracted file(s) exist in the filepath. If they do, skip to step 4.
  2. Check if the compressed file(s) exist in the filepath. If they don't, download them.
  3. Extract the compressed file(s).
  4. Load the data into memory.
    • For data sets larger than 1 GB, the function terminates with a message advising the user to load the files in batches.
  5. Preprocess the data.
  6. Return a tuple in the form of training data, test data, and validation data (if available). Each element in the tuple is typically a NumPy array, a tuple of NumPy arrays (e.g., features and labels), or a string (text).
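
Putting the steps together, here is a minimal sketch of what a new loader might look like. The data set name, URL, and file name are hypothetical, and we assume maybe_download_and_extract from util.py (mentioned above) takes the target path and source URL:

import os

import numpy as np

from observations.util import maybe_download_and_extract


def toy_lines(path):
  """Load a hypothetical data set of comma-separated numeric rows."""
  path = os.path.expanduser(path)
  filename = 'toy_lines.txt'  # hypothetical extracted file
  if not os.path.exists(os.path.join(path, filename)):
    # Steps 1-3: download and extract only what isn't already present.
    url = 'https://example.com/toy_lines.txt.gz'  # hypothetical URL
    maybe_download_and_extract(path, url)

  # Step 4: load the data into memory.
  x = np.loadtxt(os.path.join(path, filename), delimiter=',')

  # Step 5: preprocess the data (here, simple feature scaling).
  x = x / np.abs(x).max()

  # Step 6: return training and test data.
  split = int(0.8 * x.shape[0])
  return x[:split], x[split:]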