knodle / knodle

Licence: Apache-2.0 license
A PyTorch-based open-source framework that provides methods for improving weakly annotated data and allows researchers to efficiently develop and compare their own methods.

Knodle (Knowledge-supervised Deep Learning Framework) - a new framework for weak supervision with neural networks. It provides a modularization for separating weak data annotations, powerful deep learning models, and methods for improving weakly supervised training.

More details about Knodle are in our recent paper.


Installation

pip install knodle

Usage

Knodle offers several methods for denoising and improving weak supervision sources. Examples can be seen in the tutorials folder.

There are four mandatory inputs for knodle:

  1. model_input_x: Your model features (e.g. TF-IDF values) without any labels. Shape: (n_instances x n_features)
  2. mapping_rules_labels_t: This matrix maps all weak rules to a label. Shape: (n_rules x n_classes)
  3. rule_matches_z: This matrix shows all applied rules on your dataset. Shape: (n_instances x n_rules)
  4. model: A PyTorch model which can take your provided model_input_x as input. Examples are in the model folder.
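
To make the expected shapes concrete, here is a minimal sketch with made-up toy values. It uses plain NumPy arrays only for illustration. The sizes, the random features, and the one-hot rule-to-class layout are assumptions; the exact container types Knodle expects (for example PyTorch tensors or datasets) should be taken from the tutorials.

```python
import numpy as np

# Toy sizes, chosen arbitrarily for illustration.
n_instances, n_features, n_rules, n_classes = 4, 5, 3, 2

# model_input_x: one feature row per instance (random stand-in for TF-IDF values).
rng = np.random.default_rng(0)
model_input_x = rng.random((n_instances, n_features))

# mapping_rules_labels_t: one row per rule, assumed one-hot over the class it votes for.
mapping_rules_labels_t = np.array([
    [1, 0],  # rule 0 -> class 0
    [0, 1],  # rule 1 -> class 1
    [0, 1],  # rule 2 -> class 1
])

# rule_matches_z: entry (i, j) is 1 if rule j matched instance i, else 0.
rule_matches_z = np.array([
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 0],  # no rule matched this instance
])

print(model_input_x.shape, mapping_rules_labels_t.shape, rule_matches_z.shape)
```

The inner dimension n_rules is shared by rule_matches_z and mapping_rules_labels_t, so the two matrices can be multiplied to obtain per-class vote counts for every instance.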

If you know which denoising method you want to use, you can directly call the corresponding module (the list of currently supported methods is provided below).

Example for training the baseline classifier:

from knodle.model.logistic_regression_model import LogisticRegressionModel
from knodle.trainer.baseline.majority import MajorityVoteTrainer

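# Number of target classes (binary classification in this example)
NUM_OUTPUT_CLASSES = 2

# The input size of the model is the number of features per instance
model = LogisticRegressionModel(model_input_x.shape[1], NUM_OUTPUT_CLASSES)

# X_dev and Y_dev are the development features and gold labels, prepared beforehand
trainer = MajorityVoteTrainer(
    model=model,
    mapping_rules_labels_t=mapping_rules_labels_t,
    model_input_x=model_input_x,
    rule_matches_z=rule_matches_z,
    dev_model_input_x=X_dev,
    dev_gold_labels_y=Y_dev,
)

# Train on the majority-vote labels
trainer.train()

# Evaluate on the held-out test features and gold labels
trainer.test(X_test, Y_test)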

A more detailed example of classifier training is here.
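
What a majority vote over the rule matches amounts to can be sketched in a few lines of NumPy. This is an illustrative sketch, not Knodle's implementation: the function name majority_vote, the use of -1 for instances without any matching rule, and the tie-breaking by lowest class index are all assumptions made for this example.

```python
import numpy as np

def majority_vote(rule_matches_z, mapping_rules_labels_t):
    """Return per-class vote counts and the winning class per instance."""
    # (n_instances x n_rules) @ (n_rules x n_classes) -> (n_instances x n_classes)
    votes = rule_matches_z @ mapping_rules_labels_t
    # argmax breaks ties in favour of the lowest class index (a simplification).
    labels = votes.argmax(axis=1)
    # Mark instances that no rule matched with -1 (an assumption of this sketch).
    labels[votes.sum(axis=1) == 0] = -1
    return votes, labels

# Toy inputs: 4 instances, 3 rules, 2 classes.
rule_matches_z = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 0]])
mapping_rules_labels_t = np.array([[1, 0], [0, 1], [0, 1]])

votes, labels = majority_vote(rule_matches_z, mapping_rules_labels_t)
print(votes)
print(labels)
```

The third instance receives one vote per class and the fourth receives none. How such ties and unmatched instances are actually treated is a design decision of the trainer and its config.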

Main Principles

The framework provides a simple tensor-driven abstraction based on PyTorch, allowing researchers to efficiently develop and compare their methods. The emergence of machine learning software frameworks is the biggest enabler for the widespread adoption of machine learning and its speed of development. With Knodle we want to empower researchers in a similar fashion.

Knodle's main goals:

  • Data abstraction. The interface is a tensor-driven data abstraction which unifies a large number of input variants and is applicable to a large number of tasks.
  • Method independence. We distinguish between the weak supervision and the prediction model. This enables comparability and accounts for domain-specific inductive biases.
  • Accessibility. High-level access to the library makes it easy to test existing methods, incorporate new ones, and benchmark them against each other.

Datasets

Knodle also includes a selection of well-known datasets from prior work in weak supervision. The Knodle ecosystem provides modular access to datasets and denoising methods (which can, in turn, be combined with arbitrary deep learning models), enabling easy experimentation.

Datasets currently provided in Knodle:

  • Spam Dataset - a dataset based on the YouTube comments dataset from Alberto et al. (2015). The task is to classify whether a text is relevant to the video or is spam, such as advertisement.
  • Spouse Dataset - a relation extraction dataset based on the Signal Media One-Million News Articles Dataset from Corney et al. (2016).
  • IMDb Dataset - a dataset that consists of short movie reviews. The task is to determine whether a review holds a positive or negative sentiment.
  • TAC-based Relation Extraction Dataset - a dataset built over the Knowledge Base Population (KBP) challenges in the Text Analysis Conference (TAC). For development and test purposes, the KBP corpus annotated via crowdsourcing and human labeling is used (Zhang et al. (2017)). The training is done on a weakly supervised noisy dataset based on TAC KBP corpora (Surdeanu (2013)).

All datasets are added to the Knodle framework in the tensor format described above and can be downloaded here. To see how the datasets were created, please have a look at the dedicated tutorial.

Denoising Methods

There are several denoising methods available.

| Trainer Name | Module | Description |
|---|---|---|
| MajorityVoteTrainer | knodle.trainer.baseline | This builds the baseline for all methods. No denoising takes place. The final label is decided by a simple majority vote, and the provided model is trained with these labels. |
| AutoTrainer | knodle.trainer | This incorporates all denoising methods currently provided in Knodle. |
| KNNAggregationTrainer | knodle.trainer.knn_aggregation | This method looks at the similarities between the sentences' feature values. The intuition is that similar samples should be activated by the same rules, which is justified by a smoothness assumption on the target space. Similar sentences therefore receive the same rule matches. This counteracts the problem of missing rules for certain labels. |
| WSCrossWeighTrainer | knodle.trainer.wscrossweigh | This method weighs the training samples based on how reliable their labels are. The less reliable sentences (i.e. sentences whose weak labels are possibly wrong) are detected using the DS-CrossWeigh method, which is similar to k-fold cross-validation, and receive reduced weights in further training. This counteracts the problem of wrongly classified sentences. |
| SnorkelTrainer | knodle.trainer.snorkel | A wrapper of the Snorkel system, which incorporates both generative and discriminative Snorkel steps in a single call. |
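
The neighbour-based idea behind KNNAggregationTrainer can be illustrated with a small NumPy sketch. It is not Knodle's implementation: the function name, the Euclidean distance, the value of k, and the choice to take the union of the neighbours' rule matches are assumptions made for this example.

```python
import numpy as np

def knn_propagate_rule_matches(model_input_x, rule_matches_z, k=2):
    """Give each instance the union of the rule matches of its k nearest instances."""
    # Pairwise squared Euclidean distances, shape (n_instances x n_instances).
    diffs = model_input_x[:, None, :] - model_input_x[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    # Indices of the k closest instances; an instance is normally its own
    # nearest neighbour because its distance to itself is 0.
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # rule_matches_z[neighbours] has shape (n_instances x k x n_rules);
    # the max over the k axis is the union of binary match vectors.
    return rule_matches_z[neighbours].max(axis=1)

# Two tight clusters; only one instance per cluster is matched by a rule.
model_input_x = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
rule_matches_z = np.array([[1, 0], [0, 0], [0, 1], [0, 0]])

denoised_z = knn_propagate_rule_matches(model_input_x, rule_matches_z, k=2)
print(denoised_z)
```

In this toy case the previously unmatched instances inherit the rule match of their close neighbour, which is the effect the description refers to as counteracting missing rules.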

Each of the methods has its own default config file, which will be used in training if no custom config is provided.
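
The cross-validation-style reweighting described for WSCrossWeighTrainer can be illustrated as follows. This is a heavily simplified sketch, not the DS-CrossWeigh algorithm itself: it splits the data by sample index, uses a nearest-centroid classifier as a stand-in for the real model, and multiplies the weight by a fixed penalty on disagreement. The function name and all parameters are hypothetical, and how Knodle actually forms its folds is not covered here.

```python
import numpy as np

def crossweigh_like_weights(x, weak_labels, n_classes, n_folds=2, penalty=0.5):
    """Lower the weight of samples whose weak label disagrees with a held-out prediction."""
    n = len(x)
    weights = np.ones(n)
    folds = np.arange(n) % n_folds  # simple round-robin fold assignment
    for fold in range(n_folds):
        train, held_out = folds != fold, folds == fold
        # Nearest-centroid stand-in model: one mean feature vector per class,
        # computed from the training part only. Assumes every class occurs there.
        centroids = np.stack(
            [x[train & (weak_labels == c)].mean(axis=0) for c in range(n_classes)]
        )
        # Predict the held-out samples by their closest class centroid.
        dists = ((x[held_out][:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        predicted = dists.argmin(axis=1)
        # Reduce the weight where the prediction contradicts the weak label.
        disagree = predicted != weak_labels[held_out]
        weights[np.flatnonzero(held_out)[disagree]] *= penalty
    return weights

# One-dimensional toy features; the last sample looks like class 0 but is labelled 1.
x = np.array([[0.0], [0.2], [0.1], [0.3], [5.0], [5.2], [5.1], [0.15]])
weak_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

weights = crossweigh_like_weights(x, weak_labels, n_classes=2)
print(weights)
```

Only the suspicious last sample ends up with a reduced weight; such weights would then scale each sample's contribution to the training loss.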

Tutorials

We also provide basic tutorials that explain how to use the framework. All of them are stored in the examples folder and are divided into two groups:

  • tutorials that demonstrate how to prepare the input data for the Knodle framework...
    • ... using the well-known IMDb dataset as an example. A weakly supervised dataset is created by incorporating keywords as weak sources (link).
    • ... using a TAC-based dataset in .conll format as an example. A relation extraction dataset is created using entity pairs from Freebase as weak sources (link).
  • tutorials on how to work with the Knodle framework...
    • ... using AutoTrainer as an example. This trainer is to be called when the user wants to train a weak classifier without committing to a specific denoising method, and would rather try all methods currently provided in Knodle (link).
    • ... using WSCrossWeighTrainer as an example. With this trainer, a weak classifier is trained with the WSCrossWeigh denoising method (link).
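
As a rough illustration of the data preparation idea, keywords can be turned into the two weak supervision matrices as sketched below. The keywords, reviews, class ids, and the tokenize helper are made up for this example and are not taken from the IMDb tutorial.

```python
import re
import numpy as np

# Hypothetical keyword rules: keyword -> class id (0 = negative, 1 = positive).
keyword_to_class = {"great": 1, "wonderful": 1, "boring": 0, "awful": 0}
reviews = [
    "A great cast and a wonderful script",
    "Boring plot, awful pacing",
    "Great idea but boring execution",
    "I watched it yesterday",
]
n_classes = 2
keywords = list(keyword_to_class)  # one rule per keyword, in insertion order

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))

# rule_matches_z[i, j] = 1 if keyword j occurs in review i.
tokens = [tokenize(review) for review in reviews]
rule_matches_z = np.array([[int(kw in t) for kw in keywords] for t in tokens])

# mapping_rules_labels_t[j, c] = 1 if keyword j votes for class c.
mapping_rules_labels_t = np.zeros((len(keywords), n_classes), dtype=int)
for j, kw in enumerate(keywords):
    mapping_rules_labels_t[j, keyword_to_class[kw]] = 1

print(rule_matches_z)
print(mapping_rules_labels_t)
```

The third review matches one positive and one negative keyword, and the fourth matches none, which is exactly the kind of noise and sparsity the denoising methods are meant to address.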

Compatibility

Currently the package is tested on Python 3.7. It is possible to add further versions. The CI/CD pipeline needs to be updated in that case.

Structure

The structure of the code is as follows:

knodle
├── knodle
│    ├── evaluation
│    ├── model
│    ├── trainer
│    │    ├── baseline
│    │    ├── knn_aggregation
│    │    ├── snorkel
│    │    ├── wscrossweigh
│    │    └── utils
│    ├── transformation
│    └── utils
├── tests
│    ├── data
│    ├── evaluation
│    ├── trainer
│    │    ├── baseline
│    │    ├── wscrossweigh
│    │    ├── snorkel
│    │    └── utils
│    └── transformation
└── examples
     ├── data_preprocessing
     │    ├── imdb_dataset
     │    └── tac_based_dataset
     └── training
          ├── simple_auto_trainer
          └── wscrossweigh

License

Licensed under the Apache 2.0 License.

Contact

If you notice a problem in the code, you can report it by submitting an issue.

If you want to share your feedback with us or take part in the project, contact us via [email protected].

And don't forget to follow @knodle_ai on Twitter :)

Citation

@misc{sedova2021knodle,
      title={Knodle: Modular Weakly Supervised Learning with PyTorch},
      author={Anastasiia Sedova and Andreas Stephan and Marina Speranskaya and Benjamin Roth},
      year={2021},
      eprint={2104.11557},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Acknowledgments

This research was funded by the WWTF through the project “Knowledge-infused Deep Learning for Natural Language Processing” (WWTF Vienna Research Group VRG19-008).
