robustness-gym / meerkat

Licence: Apache-2.0 license
Flexible data structures for complex machine learning datasets.

Programming Languages

python

Meerkat provides fast and flexible data structures for working with complex machine learning datasets.

Getting Started | What is Meerkat? | Docs | Contributing | Blogpost | About

⚡️ Quickstart

pip install meerkat-ml

Optional: some parts of Meerkat rely on optional dependencies. If you know which optional dependencies you'd like to install, you can do so using something like pip install "meerkat-ml[dev,text]" instead. See setup.py for a full list of optional dependencies.

Installing from dev: pip install "meerkat-ml[text] @ git+https://github.com/robustness-gym/meerkat@dev"

Load a dataset into a DataPanel and get going!

import meerkat as mk
from meerkat.contrib.imagenette import download_imagenette

download_imagenette(".")
dp = mk.DataPanel.from_csv("imagenette2-160/imagenette.csv")
dp["img"] = mk.ImageColumn.from_filepaths(dp["img_path"])

dp[["label", "split", "img"]].lz[:3]

To learn more, continue following along in our introductory tutorial.

💡 What is Meerkat?

Meerkat makes it easier for ML practitioners to interact with high-dimensional, multi-modal data. It provides simple abstractions for data inspection, model evaluation and model training supported by efficient and robust IO under the hood.

Meerkat's core contribution is the DataPanel, a simple columnar data abstraction. The Meerkat DataPanel can house columns of arbitrary type – from integers and strings to complex, high-dimensional objects like videos, images, medical volumes and graphs.

DataPanel loads high-dimensional data lazily. A full high-dimensional dataset won't typically fit in memory. Behind the scenes, DataPanel handles this by only materializing these objects when they are needed.

import meerkat as mk

# Images are NOT read from disk at DataPanel creation...
dp = mk.DataPanel({
    'text': ['The quick brown fox.', 'Jumped over.', 'The lazy dog.'],
    'image': mk.ImageColumn.from_filepaths(['fox.png', 'jump.png', 'dog.png']),
    'label': [0, 1, 0]
}) 

# ...only at this point is "fox.png" read from disk
dp["image"][0]
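The lazy-loading idea can be illustrated without Meerkat at all. The following is a minimal sketch in plain Python, not Meerkat's actual implementation: the names `LazyColumn` and `fake_loader` are hypothetical, and the loader only pretends to read a file so that the example runs anywhere.

```python
from typing import Callable, List, Sequence, Union


class LazyColumn:
    """Toy column (hypothetical, not a Meerkat class) that loads items on access."""

    def __init__(self, filepaths: Sequence[str], loader: Callable[[str], object]):
        self.filepaths = list(filepaths)
        self.loader = loader

    def __len__(self) -> int:
        return len(self.filepaths)

    def __getitem__(self, index: Union[int, slice]):
        if isinstance(index, slice):
            # Slicing stays lazy: return a new column over the selected paths.
            return LazyColumn(self.filepaths[index], self.loader)
        # Integer access is the only place where the loader runs.
        return self.loader(self.filepaths[index])


loaded: List[str] = []  # records which paths were actually "read"


def fake_loader(path: str) -> str:
    # Stand-in for reading and decoding an image from disk.
    loaded.append(path)
    return f"<image from {path}>"


col = LazyColumn(["fox.png", "jump.png", "dog.png"], loader=fake_loader)
print(loaded)   # nothing has been loaded at construction time
first = col[0]  # only now is "fox.png" passed to the loader
print(first, loaded)
```

Because the column keeps only paths and a loader, creating it or slicing it costs almost nothing, and the expensive read is deferred to the moment a single item is requested.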

DataPanel supports advanced indexing. Using indexing patterns similar to those of Pandas and NumPy, we can access a subset of a DataPanel's rows and columns.

import meerkat as mk
import numpy as np

dp = ... # create DataPanel

# Pull a column out of the DataPanel
new_col: mk.ImageColumn = dp["image"]

# Create a new DataPanel from a subset of the columns in an existing one
new_dp: mk.DataPanel = dp[["image", "label"]] 

# Create a new DataPanel from a subset of the rows in an existing one
new_dp: mk.DataPanel = dp[10:20] 
new_dp: mk.DataPanel = dp[np.array([0,2,4,8])]

# Pull a column out of the DataPanel and get a subset of its rows 
new_col: mk.ImageColumn = dp["image"][10:20]
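To make the indexing rules concrete, here is a minimal sketch of how a columnar container could dispatch on the type of the index. It is an assumption-laden toy, not Meerkat's code: `ToyPanel` is a hypothetical class backed by a dict of lists, and a plain list of integers stands in for a NumPy index array.

```python
from typing import Dict


class ToyPanel:
    """Toy columnar container (hypothetical): a dict of equal-length lists."""

    def __init__(self, columns: Dict[str, list]):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.columns = {name: list(values) for name, values in columns.items()}

    def __len__(self) -> int:
        return len(next(iter(self.columns.values()), []))

    def __getitem__(self, key):
        if isinstance(key, str):
            # A single name selects one column.
            return self.columns[key]
        if isinstance(key, slice):
            # A slice selects a contiguous range of rows.
            return ToyPanel({n: v[key] for n, v in self.columns.items()})
        if isinstance(key, (list, tuple)) and all(isinstance(k, str) for k in key):
            # A list of names selects a subset of the columns.
            return ToyPanel({n: self.columns[n] for n in key})
        if isinstance(key, (list, tuple)) and all(isinstance(k, int) for k in key):
            # A list of integers selects arbitrary rows.
            return ToyPanel({n: [v[i] for i in key] for n, v in self.columns.items()})
        raise TypeError(f"unsupported index type: {type(key).__name__}")


panel = ToyPanel({"text": ["a", "b", "c", "d"], "label": [0, 1, 0, 1]})
labels = panel["label"]     # one column
subset = panel[["label"]]   # new panel with a subset of the columns
rows = panel[1:3]           # new panel with a contiguous subset of the rows
picked = panel[[0, 3]]      # new panel with hand-picked rows
```

The design point is that a string returns a column, while every other supported index returns a new panel, so selections can be chained.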

DataPanel supports map, update and filter operations. When training and evaluating our models, we often perform operations on each example in our dataset (e.g. compute a model's prediction on each example, tokenize each sentence, compute a model's embedding for each example) and store the results. The DataPanel makes it easy to perform these operations and produce new columns (via DataPanel.map), store the columns alongside the original data (via DataPanel.update), and extract an important subset of the dataset (via DataPanel.filter). Under the hood, data loading is multiprocessed so that costly I/O doesn't bottleneck our computation. Consider the example below, where we update a DataPanel with two new columns holding model predictions and probabilities.

# A simple evaluation loop using Meerkat
import meerkat as mk
import torch
import torch.nn as nn

dp: mk.DataPanel = ... # get DataPanel
model: nn.Module = ... # get the model
model.to(0).eval() # move the model to GPU 0 and switch to evaluation mode

@torch.no_grad()
def predict(batch: dict):
    # Move the inputs to the model's device, then bring the result back to the CPU
    probs = torch.softmax(model(batch["input"].to(0)), dim=-1).cpu()
    return {"probs": probs, "pred": probs.argmax(dim=-1)}

# updated_dp has two new `TensorColumn`s: one for probabilities and one
# for predictions
updated_dp: mk.DataPanel = dp.update(function=predict, batch_size=128, is_batched_fn=True)
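The semantics of batched map, update and filter can be sketched in plain Python on a dict of lists. This is an illustration under stated assumptions rather than Meerkat's implementation: the helpers `iter_batches`, `batched_map`, `batched_update` and `batched_filter` are hypothetical names, and the sketch runs sequentially, leaving out the multiprocessed data loading described above.

```python
from typing import Callable, Dict, Iterator, List

Batch = Dict[str, list]


def iter_batches(data: Batch, batch_size: int) -> Iterator[Batch]:
    """Yield consecutive row batches from a dict of equal-length lists."""
    n_rows = len(next(iter(data.values())))
    for start in range(0, n_rows, batch_size):
        yield {name: values[start:start + batch_size] for name, values in data.items()}


def batched_map(data: Batch, function: Callable[[Batch], Batch], batch_size: int) -> Batch:
    """Apply `function` to each batch and concatenate the columns it returns."""
    outputs: Batch = {}
    for batch in iter_batches(data, batch_size):
        for name, values in function(batch).items():
            outputs.setdefault(name, []).extend(values)
    return outputs


def batched_update(data: Batch, function: Callable[[Batch], Batch], batch_size: int) -> Batch:
    """Return a copy of `data` with the mapped columns stored alongside it."""
    return {**data, **batched_map(data, function, batch_size)}


def batched_filter(data: Batch, predicate: Callable[[Batch], List[bool]], batch_size: int) -> Batch:
    """Keep only the rows for which `predicate` returns True."""
    keep: List[bool] = []
    for batch in iter_batches(data, batch_size):
        keep.extend(predicate(batch))
    return {name: [v for v, k in zip(values, keep) if k] for name, values in data.items()}


data = {"text": ["The quick brown fox.", "Jumped over.", "The lazy dog."], "label": [0, 1, 0]}

# "update": compute a word count per example and store it next to the original columns.
updated = batched_update(data, lambda b: {"n_words": [len(t.split()) for t in b["text"]]}, batch_size=2)

# "filter": keep only the examples with at most three words.
short = batched_filter(updated, lambda b: [n <= 3 for n in b["n_words"]], batch_size=2)
print(updated["n_words"], short["text"])
```

In this sketch, map produces only the new columns, update merges them with the existing ones, and filter uses a per-row boolean mask to select rows, which mirrors the division of labor described for `DataPanel.map`, `DataPanel.update` and `DataPanel.filter`.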

✉️ About

Meerkat is being developed at Stanford's Hazy Research Lab. Please reach out to kgoel [at] cs [dot] stanford [dot] edu if you would like to use or contribute to Meerkat.