
cgnorthcutt / rankpruning

License: MIT
🧹 Formerly for binary classification with noisy labels. Replaced by cleanlab.

Programming Languages

Jupyter Notebook
11667 projects
Python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to rankpruning

Cleanlab
The standard package for machine learning with noisy labels, finding mislabeled data, and uncertainty quantification. Works with most datasets and models.
Stars: ✭ 2,526 (+3018.52%)
Mutual labels:  machine-learning-algorithms, semi-supervised-learning, learning-with-confident-examples
Minerva Training Materials
Learn advanced data science on real-life, curated problems
Stars: ✭ 37 (-54.32%)
Mutual labels:  training, machine-learning-algorithms
machine-learning-templates
Template codes and examples for Python machine learning concepts
Stars: ✭ 40 (-50.62%)
Mutual labels:  machine-learning-algorithms
github-user-rank-extension
Your Github fame is getting closer with every open-source project you've built and promoted, with every new follower starring, using and forking your solution. This extension supplements every Github developer profile with language bars that show how far they've advanced on their road to the glory among %that_programming_language% community memb…
Stars: ✭ 38 (-53.09%)
Mutual labels:  ranking
densratio py
A Python Package for Density Ratio Estimation
Stars: ✭ 112 (+38.27%)
Mutual labels:  machine-learning-algorithms
shortest-tutorial-ever
A list of the shortest tutorials ever.
Stars: ✭ 14 (-82.72%)
Mutual labels:  training
RobustPCA
No description or website provided.
Stars: ✭ 15 (-81.48%)
Mutual labels:  machine-learning-algorithms
data sciences campaign
[Data Scientist Series Courses]
Stars: ✭ 91 (+12.35%)
Mutual labels:  machine-learning-algorithms
ml course
"Learning Machine Learning" Course, Bogotá, Colombia 2019 #LML2019
Stars: ✭ 22 (-72.84%)
Mutual labels:  machine-learning-algorithms
lolo
A random forest
Stars: ✭ 37 (-54.32%)
Mutual labels:  machine-learning-algorithms
Statistical-Learning-using-R
This is a Statistical Learning application which will consist of various Machine Learning algorithms and their implementation in R done by me and their in-depth interpretation. Documents and reports related to the below mentioned techniques can be found on my Rpubs profile.
Stars: ✭ 27 (-66.67%)
Mutual labels:  machine-learning-algorithms
Pro-GNN
Implementation of the KDD 2020 paper "Graph Structure Learning for Robust Graph Neural Networks"
Stars: ✭ 202 (+149.38%)
Mutual labels:  semi-supervised-learning
cytrone
CyTrONE: Integrated Cybersecurity Training Framework
Stars: ✭ 72 (-11.11%)
Mutual labels:  training
metric-transfer.pytorch
Deep Metric Transfer for Label Propagation with Limited Annotated Data
Stars: ✭ 49 (-39.51%)
Mutual labels:  semi-supervised-learning
scikit-learn-intelex
Intel(R) Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application
Stars: ✭ 887 (+995.06%)
Mutual labels:  machine-learning-algorithms
temporal-ensembling-semi-supervised
Keras implementation of temporal ensembling(semi-supervised learning)
Stars: ✭ 22 (-72.84%)
Mutual labels:  semi-supervised-learning
Time-Series-Analysis-and-Forecasting-with-Python
No description or website provided.
Stars: ✭ 24 (-70.37%)
Mutual labels:  machine-learning-algorithms
ngs-in-bioc
A course on Analysing Next Generation (/High Throughput etc..) Sequencing data using Bioconductor
Stars: ✭ 37 (-54.32%)
Mutual labels:  training
studio-lab-examples
Example notebooks for working with SageMaker Studio Lab. Sign up for an account at the link below!
Stars: ✭ 319 (+293.83%)
Mutual labels:  training
OLSTEC
OnLine Low-rank Subspace tracking by TEnsor CP Decomposition in Matlab: Version 1.0.1
Stars: ✭ 30 (-62.96%)
Mutual labels:  machine-learning-algorithms

UPDATE!! This package, rankpruning, is now deprecated. You should instead use cleanlab, the official Python framework for machine learning and deep learning with noisy labels, available here: https://github.com/cleanlab/cleanlab/. cleanlab generalizes to any dataset, number of classes, model, and framework, including scikit-learn, pytorch, tensorflow, fasttext, and others. For a familiar interface with rank pruning, start with the cleanlab/classification.py file.
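
For example, a minimal migration sketch (hedged: the class in cleanlab/classification.py has been renamed across releases, e.g. LearningWithNoisyLabels and later CleanLearning, so adjust the import to your installed version):

import numpy as np
from sklearn.linear_model import LogisticRegression
# Assumption: a recent cleanlab release; older releases name this class
# LearningWithNoisyLabels instead of CleanLearning.
from cleanlab.classification import CleanLearning

X = np.random.randn(200, 2)        # toy feature matrix
s = (X[:, 0] > 0).astype(int)      # toy (noisy) binary labels

cl = CleanLearning(clf=LogisticRegression())
cl.fit(X, s)                       # same fit/predict pattern as RankPruning
pred = cl.predict(X)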

You should only use this package if you are a research scientist wishing to reproduce the results of our UAI 2017 publication, Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels. The paper is available here: http://auai.org/uai2017/proceedings/papers/35.pdf

rankpruning is a Python package for state-of-the-art binary classification with mislabeled training examples. This machine learning package implements the Rank Pruning algorithm and other methods for P̃Ñ learning (binary classification where some fraction of positive example labels and some fraction of negative example labels are uniformly randomly flipped). Rank Pruning is theoretically grounded and trivial to use. The Rank Pruning algorithm (Curtis G. Northcutt, Tailin Wu, & Isaac L. Chuang, 2017) was published in the proceedings of Uncertainty in Artificial Intelligence (UAI) 2017 (http://auai.org/uai2017/proceedings/papers/35.pdf). The RankPruning() class:

  • works with any probabilistic classifier (e.g. neural network, logistic regression)
  • is fast (time-efficient), taking about 2-3 times the training time of the classifier
  • also computes the fraction of noise in the positive and negative sets (see the sketch after this list)
  • provides state-of-the-art (as of 2017) F1 score, AUC-PR, accuracy, etc., for binary classification with mislabeled training data (P̃Ñ learning)
  • also works well when noise examples drawn from a third distribution are mixed into the training data.
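
For example, a hedged sketch of inspecting the estimated noise fractions after fitting; the attribute names rh1 and rh0 are assumptions for illustration, so check the RankPruning docstring for the exact names in your version:

import numpy as np
from sklearn.linear_model import LogisticRegression
from rankpruning import RankPruning

X = np.random.randn(1000, 2)                            # toy feature matrix
s = (X[:, 0] + np.random.randn(1000) > 0).astype(int)   # toy noisy labels

rp = RankPruning(clf=LogisticRegression())
rp.fit(X, s)
# Assumption: attribute names for the estimated noise rates.
print("Estimated P(s=0|y=1):", rp.rh1)  # fraction of positives mislabeled
print("Estimated P(s=1|y=0):", rp.rh0)  # fraction of negatives mislabeled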

A tutorial is provided at tutorial/tutorial.ipynb. A Jupyter Notebook (.ipynb) is used so you can view the tutorial output without installing tutorial-specific dependencies. We provide both Jupyter Notebook and Python implementations of most files for portability and ease of use.

Citation

If you find this repository helpful, please cite us: http://auai.org/uai2017/proceedings/papers/35.pdf

@inproceedings{northcutt2017rankpruning,
  author = {Northcutt, Curtis G. and Wu, Tailin and Chuang, Isaac L.},
  title = {Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels},
  booktitle = {Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence},
  series = {UAI'17},
  year = {2017},
  location = {Sydney, Australia},
  numpages = {10},
  url = {http://auai.org/uai2017/proceedings/papers/35.pdf},
  publisher = {AUAI Press},
}

Classification with Rank Pruning is easy.

from sklearn.linear_model import LogisticRegression
from rankpruning import RankPruning

rp = RankPruning(clf=LogisticRegression())  # or a CNN(), NaiveBayes(), etc.
rp.fit(X, s)  # X: feature matrix, s: noisy binary (0/1) labels
pred = rp.predict(X)

It is trained with:

  1. a matrix X of training examples (sometimes called a feature matrix), with each row in X comprising a unique training example and each column comprising a single dimension of the examples' feature representation.
  2. a vector s of binary (0 or 1) labels where an unknown fraction of labels may be mislabeled (flipped)
  3. ANY probabilistic classifier clf as long as it has clf.predict_proba(), clf.predict(), and clf.fit() defined.

Ideally, given the training feature matrix X and noisy labels s (instead of the hidden, true labels y), Rank Pruning fits clf as if you had called clf.fit(X, y) rather than clf.fit(X, s), even though y is not available.
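
As an illustration, here is a minimal sketch of that interface. The ThresholdClassifier below is a hypothetical toy, not part of the package; any scikit-learn-style classifier (e.g. LogisticRegression, GaussianNB) already provides these three methods:

import numpy as np

class ThresholdClassifier:
    """Hypothetical toy classifier exposing fit, predict, and predict_proba."""

    def fit(self, X, y):
        # Place the decision cutoff midway between the class means
        # of the first feature.
        self.cutoff_ = (X[y == 1, 0].mean() + X[y == 0, 0].mean()) / 2.0
        return self

    def predict_proba(self, X):
        # Squash the signed distance to the cutoff into a probability.
        p1 = 1.0 / (1.0 + np.exp(-(X[:, 0] - self.cutoff_)))
        return np.column_stack([1.0 - p1, p1])

    def predict(self, X):
        return (self.predict_proba(X)[:, 1] >= 0.5).astype(int)

# RankPruning(clf=ThresholdClassifier()) would then work like any other clf.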

How does Rank Pruning work?

rankpruning is based on a joint research effort between the Massachusetts Institute of Technology's Department of Electrical Engineering and Computer Science, Office of Digital Learning, and Department of Physics. The Rank Pruning algorithm is theoretically grounded and trivial to use. rankpruning embodies the "learning with confident examples" paradigm and works as follows:

  1. estimate the fraction of mislabeling in both the positive and negative sets
  2. use these estimates to rank examples by confidence of being correctly labeled
  3. prune out likely mislabeled data
  4. train on the pruned set (an intended subset of the correctly labeled training data); a toy sketch of these steps follows this list
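
The following toy sketch illustrates these four steps under simplifying assumptions (class-mean probability thresholds stand in for the paper's noise-rate estimators); it is not the package's exact implementation:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def rank_prune_sketch(X, s, clf=None):
    if clf is None:
        clf = LogisticRegression()
    # Step 1: out-of-sample P(s=1|x) estimated by cross-validation.
    psx = cross_val_predict(clf, X, s, cv=3, method="predict_proba")[:, 1]
    # Mean predicted probability within each noisy class serves as a
    # simple confidence threshold (a stand-in for the paper's estimators).
    lower = psx[s == 1].mean()  # confident-positive threshold
    upper = psx[s == 0].mean()  # confident-negative threshold
    # Steps 2-3: rank examples by confidence and prune likely-mislabeled ones.
    keep = ((s == 1) & (psx >= lower)) | ((s == 0) & (psx <= upper))
    # Step 4: train on the pruned, likely correctly labeled subset.
    return clf.fit(X[keep], s[keep])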

Installation

To use the rankpruning package just run:

$ pip install git+https://github.com/cgnorthcutt/rankpruning.git

If you'd like to explore the tutorial, test files, or make changes, clone the repo and run:

$ git clone https://github.com/cgnorthcutt/rankpruning.git
$ cd rankpruning
$ pip install -e .

Python Usage

import rankpruning

# RankPruning() class for classification with mislabeled training data
from rankpruning import RankPruning

# module containing other prior-art methods for P̃Ñ learning
from rankpruning import other_pnlearning_methods

If you wish to use the tutorial_and_testing package, a few additional dependencies are needed. See below.

Dependencies

rankpruning requires sklearn and numpy; both are installed automatically when you install rankpruning.

Since Rank Pruning works with any probabilistic classifier, we provide a CNN (convolutional neural network). Using this classifier requires two additional dependencies.

To use our CNN with conda:

# Linux/Mac OS X, Python 2.7/3.4/3.5, CPU only:
$ conda install -c conda-forge tensorflow
$ conda install 'keras>=2.0.0' # Requires version 2.0.0 or greater

With pip, first follow the instructions for installing tensorflow at https://www.tensorflow.org/install, then install keras (version 2.0.0 or greater) using:

$ sudo pip install 'keras>=2.0.0' # Requires version 2.0.0 or greater

We also provide a basic tutorial to test out Rank Pruning. The tutorial and testing examples also depend on the following additional packages:

  • pandas
  • matplotlib
  • jupyter
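
With those installed, you can open the tutorial notebook locally:

$ jupyter notebook tutorial/tutorial.ipynb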

Simple Example: Comparing Rank Pruning with other models for P̃Ñ learning.

from __future__ import print_function
from rankpruning import RankPruning, other_pnlearning_methods
import numpy as np

# Libraries used only for the purpose of this example
from numpy.random import multivariate_normal
from sklearn.metrics import precision_recall_fscore_support as prfs
from sklearn.metrics import accuracy_score as acc
from sklearn.linear_model import LogisticRegression

# Create the training dataset with positive and negative examples
# drawn from two-dimensional Gaussian distributions.
neg = multivariate_normal(mean=[2,2], cov=[[10,-1.5],[-1.5,5]], size=1000)
pos = multivariate_normal(mean=[5,5], cov=[[1.5,1.3],[1.3,4]], size=500)
X = np.concatenate((neg, pos))
y = np.concatenate((np.zeros(len(neg)), np.ones(len(pos))))

# For this example, choose the following mislabeling noise rates.
frac_pos2neg = 0.8 # rh1, P(s=0|y=1) in literature
frac_neg2pos = 0.15 # rh0, P(s=1|y=0) in literature

# Generate s, the observed noisy label vector. A fixed fraction of each
# class is flipped; since the examples are drawn i.i.d., this is equivalent
# to flipping labels uniformly at random at the rates above.
s = y * (np.cumsum(y) <= (1 - frac_pos2neg) * sum(y))
s_only_neg_mislabeled = 1 - (1 - y) * (np.cumsum(1 - y) <= (1 - frac_neg2pos) * sum(1 - y))
s[y==0] = s_only_neg_mislabeled[y==0]

# Create testing dataset:
neg_test = multivariate_normal(mean=[2,2], cov=[[10,-1.5],[-1.5,5]], size=2000)
pos_test = multivariate_normal(mean=[5,5], cov=[[1.5,1.3],[1.3,4]], size=1000)
X_test = np.concatenate((neg_test, pos_test))
y_test = np.concatenate((np.zeros(len(neg_test)), np.ones(len(pos_test))))

# We choose logistic regression, but Rank Pruning can use
# any probabilistic classifier, such as a CNN() or NaiveBayes().
clf = LogisticRegression()

# Initialize models:
models = {
  "Baseline" : other_pnlearning_methods.BaselineNoisyPN(clf = clf),
  "Rank Pruning" : RankPruning(clf = clf),
  "Rank Pruning (noise rates given)": RankPruning(frac_pos2neg, frac_neg2pos, clf = clf),
  "Elk08 (noise rates given)": other_pnlearning_methods.Elk08(e1 = 1 - frac_pos2neg, clf = clf),
  "Liu16 (noise rates given)": other_pnlearning_methods.Liu16(frac_pos2neg, frac_neg2pos, clf = clf),
  "Nat13 (noise rates given)": other_pnlearning_methods.Nat13(frac_pos2neg, frac_neg2pos, clf = clf),
}

# For the models, fit on (X, s) and predict on X_test:
for key, model in models.items():
  model.fit(X, s)
  pred = model.predict(X_test)
  pred_proba = model.predict_proba(X_test) # Produces P(y=1|x)

  print("\n%s Model Performance:\n==============================\n" % key)
  print(
    "Accuracy:", acc(y_test, pred), "|", 
    "Precision:", prfs(y_test, pred)[0], "|", 
    "Recall:", prfs(y_test, pred)[1], "|",
    "F1:", prfs(y_test, pred)[2]
  )

More examples

For more examples, see the tutorial_and_testing module.
