
HazyResearch / reef

License: Apache-2.0
Automatically labeling training data

Programming Languages

  • Jupyter Notebook (11,667 projects)
  • Python (139,335 projects; #7 most used programming language)

Projects that are alternatives of or similar to reef

Fullstackmachinelearning
Mostly free resources for end-to-end machine learning engineering, including open courses from CalTech, Columbia, Berkeley, MIT, and Stanford (in alphabetical order).
Stars: ✭ 39 (-61.76%)
Mutual labels:  stanford
Journalism Syllabi
Computer-Assisted Reporting and Data Journalism Syllabuses, compiled by Dan Nguyen
Stars: ✭ 136 (+33.33%)
Mutual labels:  stanford
Cs231a Notes
The course notes for Stanford's CS231A course on computer vision
Stars: ✭ 230 (+125.49%)
Mutual labels:  stanford
Cs193p Ios9 Solutions
My solutions to the assignments for Stanford's CS193P: Developing iOS 9 Apps with Swift [Spring 2016]
Stars: ✭ 42 (-58.82%)
Mutual labels:  stanford
Stanford Tensorflow Tutorials
This repository contains code examples for Stanford's course: TensorFlow for Deep Learning Research.
Stars: ✭ 10,098 (+9800%)
Mutual labels:  stanford
Cs193p Fall 2017
These are the lectures, slides, reading assignments, and problem sets for the Developing Apps for iOS 11 with Swift 4 CS193p course offered at the Stanford School of Engineering and available on iTunes U.
Stars: ✭ 141 (+38.24%)
Mutual labels:  stanford
Stanford self driving car code
Stanford Code From Cars That Entered DARPA Grand Challenges
Stars: ✭ 687 (+573.53%)
Mutual labels:  stanford
stanford-beamer-presentation
This is an unofficial LaTeX Beamer presentation template for Stanford University.
Stars: ✭ 47 (-53.92%)
Mutual labels:  stanford
Cs193p 2020 Swiftui
📘 Stanford CS193p Spring 2020 - Developing Apps for iOS (SwiftUI)
Stars: ✭ 135 (+32.35%)
Mutual labels:  stanford
Cs224n 2019
My completed implementation solutions for CS224N 2019
Stars: ✭ 178 (+74.51%)
Mutual labels:  stanford
Actionroguelike
Third-person Action Roguelike made in Unreal Engine C++ (for Stanford CS193U 2020)
Stars: ✭ 1,121 (+999.02%)
Mutual labels:  stanford
Pynlp
A pythonic wrapper for Stanford CoreNLP.
Stars: ✭ 103 (+0.98%)
Mutual labels:  stanford
Stanford Cs229
Python solutions to the problem sets of Stanford's graduate course on Machine Learning, taught by Prof. Andrew Ng
Stars: ✭ 151 (+48.04%)
Mutual labels:  stanford
Simple Cryptography
Scripts that illustrate basic cryptography concepts based on the Coursera Stanford Cryptography I course and more.
Stars: ✭ 40 (-60.78%)
Mutual labels:  stanford
Weld
High-performance runtime for data analytics applications
Stars: ✭ 2,709 (+2555.88%)
Mutual labels:  stanford
Stanford dbclass
Collection of my solutions to the (infamous) dbclass (2014 version) offered by Stanford.
Stars: ✭ 35 (-65.69%)
Mutual labels:  stanford
Datasciencecoursera
Data Science repo and blog for Johns Hopkins Coursera courses. Please let me know if you have any questions.
Stars: ✭ 1,928 (+1790.2%)
Mutual labels:  stanford
MCIS wsss
Code for ECCV 2020 paper (oral): Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation
Stars: ✭ 151 (+48.04%)
Mutual labels:  weakly-supervised-learning
Stanford Cs231
Resources for students in the Udacity's Machine Learning Engineer Nanodegree to work through Stanford's Convolutional Neural Networks for Visual Recognition course (CS231n).
Stars: ✭ 249 (+144.12%)
Mutual labels:  stanford
Cs253.stanford.edu
CS 253 Web Security course at Stanford University
Stars: ✭ 155 (+51.96%)
Mutual labels:  stanford

Reef: Overcoming the Barrier to Labeling Training Data

Code for the VLDB 2019 paper Snuba: Automating Weak Supervision to Label Training Data.

Reef is an automated system for labeling training data based on a small labeled dataset. Reef utilizes ideas from program synthesis to automatically generate a set of interpretable heuristics that are then used to label unlabeled training data efficiently.

Installation

Reef uses Python 2. The Python package requirements are listed in requirements.txt. If you have Snorkel installed, you can set the corresponding flag to True; otherwise, a simple version of learning heuristic accuracies is included in this repo as well.

Reef Workflow Overview

The inputs to Reef are the following:

  • A labeled dataset, which contains a numerical feature matrix and a vector of ground-truth labels (currently, only binary classification is supported)
  • An unlabeled dataset, which contains a numerical feature matrix
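Concretely, the two inputs can be thought of as NumPy arrays like the following (a minimal sketch; the array names and shapes are illustrative, not Reef's actual API):

```python
import numpy as np

# Labeled set: numerical feature matrix plus binary ground-truth labels
X_labeled = np.random.rand(100, 10)            # 100 points, 10 numerical features
y_labeled = np.random.choice([-1, 1], size=100)  # binary labels, e.g. -1 / +1

# Unlabeled set: numerical feature matrix only, over the same feature space
X_unlabeled = np.random.rand(5000, 10)

print(X_labeled.shape, y_labeled.shape, X_unlabeled.shape)
```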

Reef follows the workflow below to label training data automatically. The overall process is encoded in the notebook [1] generate_reef_labels.ipynb and in the main file program_synthesis/heuristic_generator.py.

  1. Using the labeled dataset, Reef generates heuristics such as decision trees or small logistic regression models. The synthesis code is in program_synthesis/synthesizer.py.
    1. A heuristic is generated for each possible combination of c features, where c is the cardinality. For example, with c=1 and 10 features, 10 heuristics are generated.
    2. For each generated heuristic, a beta parameter is calculated. This represents the minimum confidence level at which the heuristic will assign a label, and it is chosen by maximizing the F1 score on the labeled dataset.
  2. These heuristics are passed to a pruner, which selects the best heuristic by maximizing a combination of its F1 score on the labeled dataset and its diversity, measured by how many points it labels that previously selected heuristics do not.
  3. The selected heuristic and previously chosen heuristics are then passed to the verifier, which learns accuracies for the heuristics based on the labels they assign to the unlabeled dataset.
  4. Finally, Reef calculates the probabilistic labels the heuristics assign to the labeled dataset and passes datapoints with low-confidence labels back to the synthesizer. This procedure repeats iteratively.
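Steps 1 and 1b above can be sketched as follows. This is a simplified, illustrative reconstruction, not Reef's actual code: the function names, the depth-1 decision trees, and the beta grid are all assumptions; the real logic lives in program_synthesis/synthesizer.py and program_synthesis/heuristic_generator.py.

```python
from itertools import combinations

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def f1(preds, y):
    """F1 of the +1 class, counting abstentions (0) against recall."""
    tp = np.sum((preds == 1) & (y == 1))
    fp = np.sum((preds == 1) & (y == -1))
    fn = np.sum((preds != 1) & (y == 1))
    return 2.0 * tp / (2 * tp + fp + fn) if tp else 0.0

def synthesize(X, y, c=1):
    """Step 1: fit one shallow heuristic per combination of c features."""
    heuristics = []
    for feats in combinations(range(X.shape[1]), c):
        h = DecisionTreeClassifier(max_depth=1).fit(X[:, feats], y)
        heuristics.append((feats, h))
    return heuristics

def apply_with_beta(h, feats, X, beta):
    """Assign +1/-1 only when confidence exceeds 0.5 + beta; else abstain (0)."""
    probs = h.predict_proba(X[:, feats])[:, 1]
    preds = np.zeros(X.shape[0])
    preds[probs >= 0.5 + beta] = 1
    preds[probs <= 0.5 - beta] = -1
    return preds

def find_beta(h, feats, X, y):
    """Step 1b: choose beta by maximizing F1 on the labeled dataset."""
    betas = np.linspace(0.0, 0.45, 10)
    scores = [f1(apply_with_beta(h, feats, X, b), y) for b in betas]
    return betas[int(np.argmax(scores))]

# Toy demo: label depends only on feature 0, so the first heuristic is perfect.
rng = np.random.RandomState(0)
X_lab = rng.rand(200, 5)
y_lab = np.where(X_lab[:, 0] > 0.5, 1, -1)

feats, h = synthesize(X_lab, y_lab, c=1)[0]
beta = find_beta(h, feats, X_lab, y_lab)
print(round(f1(apply_with_beta(h, feats, X_lab, beta), y_lab), 2))  # prints 1.0
```

The pruner and verifier (steps 2–4) would then score each candidate by combining this F1 with coverage diversity, and learn per-heuristic accuracies over the unlabeled set.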

Tutorial

The tutorial notebooks are based on a text-based plot-classification dataset. We go through generating heuristics with Reef and then train a simple LSTM model to see how an end model trained with Reef's labels compares to one trained with ground-truth training labels.
