JohnGiorgi / Declutr

Licence: apache-2.0
The corresponding code for our paper "DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations". Do not hesitate to open an issue if you run into any trouble!

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Declutr

TCE
This repository contains the code implementation used in the paper Temporally Coherent Embeddings for Self-Supervised Video Representation Learning (TCE).
Stars: ✭ 51 (-54.05%)
Mutual labels:  metric-learning, representation-learning
Catalyst
Accelerated deep learning R&D
Stars: ✭ 2,804 (+2426.13%)
Mutual labels:  natural-language-processing, metric-learning
Good Papers
I try my best to keep updated cutting-edge knowledge in Machine Learning/Deep Learning and Natural Language Processing. These are my notes on some good papers
Stars: ✭ 248 (+123.42%)
Mutual labels:  natural-language-processing, representation-learning
disent
🧶 Modular VAE disentanglement framework for python built with PyTorch Lightning ▸ Including metrics and datasets ▸ With strongly supervised, weakly supervised and unsupervised methods ▸ Easily configured and run with Hydra config ▸ Inspired by disentanglement_lib
Stars: ✭ 41 (-63.06%)
Mutual labels:  metric-learning, representation-learning
Swem
The Tensorflow code for this ACL 2018 paper: "Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms"
Stars: ✭ 279 (+151.35%)
Mutual labels:  natural-language-processing, representation-learning
Pointglr
Global-Local Bidirectional Reasoning for Unsupervised Representation Learning of 3D Point Clouds (CVPR 2020)
Stars: ✭ 86 (-22.52%)
Mutual labels:  representation-learning, metric-learning
Knowledge Graphs
A collection of research on knowledge graphs
Stars: ✭ 845 (+661.26%)
Mutual labels:  natural-language-processing, representation-learning
Codesearchnet
Datasets, tools, and benchmarks for representation learning of code.
Stars: ✭ 1,378 (+1141.44%)
Mutual labels:  natural-language-processing, representation-learning
Transformers
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Stars: ✭ 55,742 (+50118.02%)
Mutual labels:  natural-language-processing
Ampligraph
Python library for Representation Learning on Knowledge Graphs https://docs.ampligraph.org
Stars: ✭ 1,662 (+1397.3%)
Mutual labels:  representation-learning
Nltk
NLTK Source
Stars: ✭ 10,309 (+9187.39%)
Mutual labels:  natural-language-processing
Ua Gec
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Stars: ✭ 108 (-2.7%)
Mutual labels:  natural-language-processing
Awesome Emotion Recognition In Conversations
A comprehensive reading list for Emotion Recognition in Conversations
Stars: ✭ 111 (+0%)
Mutual labels:  natural-language-processing
Allennlp
An open-source NLP research library, built on PyTorch.
Stars: ✭ 10,699 (+9538.74%)
Mutual labels:  natural-language-processing
Commonsense Rc
Code for Yuanfudao at SemEval-2018 Task 11: Three-way Attention and Relational Knowledge for Commonsense Machine Comprehension
Stars: ✭ 112 (+0.9%)
Mutual labels:  natural-language-processing
Linguistic Style Transfer
Neural network parametrized objective to disentangle and transfer style and content in text
Stars: ✭ 106 (-4.5%)
Mutual labels:  natural-language-processing
Chatbot
A Russian-language chatbot
Stars: ✭ 106 (-4.5%)
Mutual labels:  natural-language-processing
Deep Nlp Seminars
Materials for deep NLP course
Stars: ✭ 113 (+1.8%)
Mutual labels:  natural-language-processing
Opus Mt
Open neural machine translation models and web services
Stars: ✭ 111 (+0%)
Mutual labels:  natural-language-processing
Xlnet extension tf
XLNet Extension in TensorFlow
Stars: ✭ 109 (-1.8%)
Mutual labels:  natural-language-processing


DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations

The corresponding code for our paper: DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations. Results on SentEval are presented below (as averaged scores on the downstream and probing task test sets), along with existing state-of-the-art methods.

| Model | Requires labelled data? | Parameters | Embed. dim. | Downstream (-SNLI) | Probing | Δ |
|---|---|---|---|---|---|---|
| InferSent V2 | Yes | 38M | 4096 | 76.00 | 72.58 | -3.06 |
| Universal Sentence Encoder | Yes | 147M | 512 | 78.89 | 66.70 | -0.17 |
| Sentence Transformers ("roberta-base-nli-mean-tokens") | Yes | 125M | 768 | 77.19 | 63.22 | -1.87 |
| Transformer-small (DistilRoBERTa-base) | No | 82M | 768 | 72.58 | 74.57 | -6.48 |
| Transformer-base (RoBERTa-base) | No | 125M | 768 | 72.70 | 74.19 | -6.36 |
| DeCLUTR-small (DistilRoBERTa-base) | No | 82M | 768 | 77.41 | 74.71 | -1.65 |
| DeCLUTR-base (RoBERTa-base) | No | 125M | 768 | 79.06 | 74.65 | -- |

Transformer-* denotes the same underlying architecture and pretrained weights as DeCLUTR-* before continued pretraining with our contrastive objective. Both Transformer-* and DeCLUTR-* use mean pooling over their token-level embeddings to produce a fixed-length sentence representation. Downstream scores are computed without considering performance on SNLI (denoted "Downstream (-SNLI)"), as InferSent, USE, and Sentence Transformers all train on SNLI. Δ: difference from the DeCLUTR-base downstream score.
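
Concretely, mean pooling computes the sentence embedding e from the token-level embeddings h_1, ..., h_T and the attention mask m_1, ..., m_T (with m_i ∈ {0, 1}) as

e = \frac{\sum_{i=1}^{T} m_i h_i}{\sum_{i=1}^{T} m_i}

which is exactly the computation performed by the pooling code in the 🤗 Transformers example below.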

Notebooks

The easiest way to get started is to follow along with one of our notebooks:

  • Training your own model (Open In Colab)
  • Embedding text with a pretrained model (Open In Colab)
  • Evaluating a model with SentEval (Open In Colab)

Installation

This repository requires Python 3.6.1 or later.

Setting up a virtual environment

Before installing, you should create and activate a Python virtual environment (for example, with venv or conda).

Installing the library and dependencies

If you don't plan on modifying the source code, install from git using pip

pip install git+https://github.com/JohnGiorgi/DeCLUTR.git

Otherwise, clone the repository locally and then install

git clone https://github.com/JohnGiorgi/DeCLUTR.git
cd DeCLUTR
pip install --editable .

Gotchas

  • If you plan on training your own model, you should also install PyTorch with CUDA support by following the instructions for your system on the PyTorch website.

Usage

Preparing a dataset

A dataset is simply a file containing one item of text (a document, a scientific paper, etc.) per line. For demonstration purposes, we have provided a script that will download the WikiText-103 dataset and match our minimal preprocessing

python scripts/preprocess_wikitext_103.py path/to/output/wikitext-103/train.txt --min-length 2048

See scripts/preprocess_openwebtext.py for a script that can be used to recreate the (much larger) dataset used in our paper.

You can specify the train set path in the configs under "train_data_path".
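
For reference, "train_data_path" is a top-level field of the AllenNLP training config; a minimal, hypothetical fragment of the config might look like

// Hypothetical fragment; see training_config/declutr.jsonnet for the real config.
{
    "train_data_path": "path/to/output/wikitext-103/train.txt",
    // ... "dataset_reader", "model", "trainer", etc. are configured here as well.
}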

Gotchas

  • A training dataset should contain documents with a minimum of num_anchors * max_span_len * 2 whitespace tokens. This is required to sample spans according to our sampling procedure. See the dataset reader and/or our paper for more details on these hyperparameters. A minimal filtering sketch is shown below.
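
As an illustration, the following sketch drops documents that are too short to sample spans from. The hyperparameter values are placeholders, chosen so the minimum matches the --min-length 2048 used above; substitute the values from your own config.

# Minimal sketch: keep only documents long enough for span sampling.
# Placeholder hyperparameters; use the values from your own config.
num_anchors = 2
max_span_len = 512
min_tokens = num_anchors * max_span_len * 2  # 2048 whitespace tokens

with open("train.txt") as infile, open("train.filtered.txt", "w") as outfile:
    for document in infile:
        if len(document.split()) >= min_tokens:
            outfile.write(document)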

Training

To train the model, use the allennlp train command with our declutr.jsonnet config. For example, to train DeCLUTR-small, run the following

# This can be (almost) any model from https://huggingface.co/ that supports masked language modelling.
TRANSFORMER_MODEL="distilroberta-base"

allennlp train "training_config/declutr.jsonnet" \
    --serialization-dir "output" \
    --overrides "{'train_data_path': 'path/to/your/dataset/train.txt'}" \
    --include-package "declutr"

The --overrides flag allows you to override any field in the config with a JSON-formatted string, but you can equivalently update the config itself if you prefer. During training, models, vocabulary, configuration, and log files will be saved to the directory provided by --serialization-dir. This can be changed to any directory you like.

Multi-GPU training

To train on more than one GPU, provide a list of CUDA devices in your call to allennlp train. For example, to train with four CUDA devices with IDs 0, 1, 2, 3

--overrides "{'distributed.cuda_devices': [0, 1, 2, 3]}"

Training with mixed-precision

If your GPU supports it, mixed-precision will be used automatically during training and inference.

Embedding

You can embed text with a trained model in one of three ways:

  1. As a library: import and initialize an object from this repo, which can be used to embed sentences/paragraphs.
  2. 🤗 Transformers: load our pretrained model with the 🤗 Transformers library.
  3. Bulk embed: embed all text in a given text file with a simple command-line interface.

Available pre-trained models: "declutr-small" and "declutr-base".

As a library

To use the model as a library, import Encoder and pass it some text (it accepts both strings and lists of strings)

from declutr import Encoder

# This can be a path on disk to a model you have trained yourself OR
# the name of one of our pretrained models.
pretrained_model_or_path = "declutr-small"

encoder = Encoder(pretrained_model_or_path)
embeddings = encoder([
    "A smiling costumed woman is holding an umbrella.",
    "A happy woman in a fairy costume holds an umbrella."
])

These embeddings can then be used, for example, to compute the semantic similarity between sentences or paragraphs

from scipy.spatial.distance import cosine

semantic_sim = 1 - cosine(embeddings[0], embeddings[1])

See the list of available PRETRAINED_MODELS in declutr/encoder.py

python -c "from declutr.encoder import PRETRAINED_MODELS ; print(list(PRETRAINED_MODELS.keys()))"

🤗 Transformers

Our pretrained models are also hosted with 🤗 Transformers, so they can be used like any other model in that library. Here is a simple example:

import torch
from scipy.spatial.distance import cosine

from transformers import AutoModel, AutoTokenizer

# Load the model
tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-small")
model = AutoModel.from_pretrained("johngiorgi/declutr-small")

# Prepare some text to embed
texts = [
    "A smiling costumed woman is holding an umbrella.",
    "A happy woman in a fairy costume holds an umbrella.",
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Embed the text
with torch.no_grad():
    sequence_output = model(**inputs)[0]

# Mean pool the token-level embeddings to get sentence-level embeddings
embeddings = torch.sum(
    sequence_output * inputs["attention_mask"].unsqueeze(-1), dim=1
) / torch.clamp(torch.sum(inputs["attention_mask"], dim=1, keepdim=True), min=1e-9)

# Compute a semantic similarity via the cosine distance
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])
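
If you are embedding many texts at once, the pairwise similarities can also be computed in one shot; the following sketch continues the example above:

import torch.nn.functional as F

# L2-normalize the embeddings; the matrix product then contains the
# cosine similarity between every pair of texts.
normalized = F.normalize(embeddings, p=2, dim=1)
pairwise_sim = normalized @ normalized.T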

Bulk embed a file

To embed all text in a given file with a trained model, run the following command

allennlp predict "output" "path/to/input.txt" \
    --output-file "output/embeddings.jsonl" \
    --batch-size 32 \
    --cuda-device 0 \
    --use-dataset-reader \
    --overrides "{'dataset_reader.num_anchors': null}" \
    --include-package "declutr"

This will:

  1. Load the model serialized to "output" with the "best" weights (i.e. the ones that achieved the lowest loss during training).
  2. Use that model to embed the text in the provided input file ("path/to/input.txt").
  3. Save the embeddings to disk as a JSON lines file ("output/embeddings.jsonl")

The text embeddings are stored in the field "embeddings" in "output/embeddings.jsonl".
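
To load the saved embeddings back into Python, something like the following sketch should work (assuming one JSON object per line with an "embeddings" field, as described above):

import json

import numpy as np

embeddings = []
with open("output/embeddings.jsonl") as f:
    for line in f:
        embeddings.append(json.loads(line)["embeddings"])
embeddings = np.asarray(embeddings)  # shape: (num_texts, embedding_dim)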

Evaluating with SentEval

SentEval is a library for evaluating the quality of sentence embeddings. We provide a script to evaluate our model against SentEval, along with a notebook that documents the process of evaluating a trained model. Broadly, the steps are the following:

First, clone the SentEval repository and download the transfer task datasets (you only need to do this once)

git clone https://github.com/facebookresearch/SentEval.git
cd SentEval/data/downstream/
./get_transfer_data.bash
cd ../../../

See the SentEval repository for full details.

Then you can run our script to evaluate a trained model against SentEval

python scripts/run_senteval.py allennlp "SentEval" "output" \
    --output-filepath "output/senteval_results.json" \
    --cuda-device 0 \
    --include-package "declutr"

The results will be saved to "output/senteval_results.json". This can be changed to any path you like.
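
Since the results file is plain JSON, a minimal sketch for inspecting it (assuming the default path above):

import json

with open("output/senteval_results.json") as f:
    results = json.load(f)

# Top-level keys (e.g. task names, depending on the script's output format).
print(list(results))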

Pass the flag --prototyping-config to get a proxy of the results while dramatically reducing computation time.

For a list of commands, run

python scripts/run_senteval.py --help

For help with a specific command, e.g. allennlp, run

python scripts/run_senteval.py allennlp --help

Gotchas

  • Evaluating the "SNLI" task of SentEval will fail without this fix.

Citing

If you use DeCLUTR in your work, please consider citing our preprint

@article{Giorgi2020DeCLUTRDC,
  title={DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations},
  author={John M Giorgi and Osvald Nitski and Gary D. Bader and Bo Wang},
  journal={ArXiv},
  year={2020},
  volume={abs/2006.03659}
}