cambridgeltl / mirror-bert

License: MIT
[EMNLP 2021] Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels.

Programming Languages: Python, Shell

Projects that are alternatives of or similar to mirror-bert

kwx
BERT, LDA, and TFIDF based keyword extraction in Python
Stars: ✭ 33 (-41.07%)
Mutual labels:  unsupervised-learning, bert
SimCLR
Pytorch implementation of "A Simple Framework for Contrastive Learning of Visual Representations"
Stars: ✭ 65 (+16.07%)
Mutual labels:  unsupervised-learning, contrastive-learning
PIC
Parametric Instance Classification for Unsupervised Visual Feature Learning, NeurIPS 2020
Stars: ✭ 41 (-26.79%)
Mutual labels:  unsupervised-learning, contrastive-learning
Awesome Sentence Embedding
A curated list of pretrained sentence and word embedding models
Stars: ✭ 1,973 (+3423.21%)
Mutual labels:  unsupervised-learning, bert
Simclr
SimCLRv2 - Big Self-Supervised Models are Strong Semi-Supervised Learners
Stars: ✭ 2,720 (+4757.14%)
Mutual labels:  unsupervised-learning, contrastive-learning
Revisiting-Contrastive-SSL
Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations. [NeurIPS 2021]
Stars: ✭ 81 (+44.64%)
Mutual labels:  unsupervised-learning, contrastive-learning
CLSA
Official implementation of "Contrastive Learning with Stronger Augmentations"
Stars: ✭ 48 (-14.29%)
Mutual labels:  unsupervised-learning, contrastive-learning
ViCC
[WACV'22] Code repository for the paper "Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting", https://arxiv.org/abs/2106.10137.
Stars: ✭ 33 (-41.07%)
Mutual labels:  unsupervised-learning, contrastive-learning
Online-Category-Learning
ML algorithm for real-time classification
Stars: ✭ 67 (+19.64%)
Mutual labels:  unsupervised-learning
GCL
List of Publications in Graph Contrastive Learning
Stars: ✭ 25 (-55.36%)
Mutual labels:  contrastive-learning
awesome-graph-self-supervised-learning-based-recommendation
A curated list of awesome graph & self-supervised-learning-based recommendation.
Stars: ✭ 37 (-33.93%)
Mutual labels:  contrastive-learning
PCLNet
Unsupervised Learning for Optical Flow Estimation Using Pyramid Convolution LSTM.
Stars: ✭ 29 (-48.21%)
Mutual labels:  unsupervised-learning
trove
Weakly supervised medical named entity classification
Stars: ✭ 55 (-1.79%)
Mutual labels:  bert
label-studio-transformers
Label data using HuggingFace's transformers and automatically get a prediction service
Stars: ✭ 117 (+108.93%)
Mutual labels:  bert
rasa milktea chatbot
Chatbot with a Chinese BERT model, based on the Rasa framework (Chinese chatbot combining BERT intent analysis, built on Rasa)
Stars: ✭ 97 (+73.21%)
Mutual labels:  bert
SkeletonMerger
Code repository for paper `Skeleton Merger: an Unsupervised Aligned Keypoint Detector`.
Stars: ✭ 49 (-12.5%)
Mutual labels:  unsupervised-learning
ml-ai
ML-AI Community | Open Source | Built in Bharat for the World | Data science problem statements and solutions
Stars: ✭ 32 (-42.86%)
Mutual labels:  unsupervised-learning
bert-squeeze
🛠️ Tools for Transformers compression using PyTorch Lightning ⚡
Stars: ✭ 56 (+0%)
Mutual labels:  bert
berserker
Berserker - BERt chineSE woRd toKenizER
Stars: ✭ 17 (-69.64%)
Mutual labels:  bert
BERT-for-Chinese-Question-Answering
No description or website provided.
Stars: ✭ 75 (+33.93%)
Mutual labels:  bert

Mirror-BERT

UPDATE: see the follow-up work Trans-Encoder (ICLR'22), a state-of-the-art unsupervised model for STS.

Code repo for the EMNLP 2021 paper:
Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders
by Fangyu Liu, Ivan Vulić, Anna Korhonen, and Nigel Collier.

Mirror-BERT is an unsupervised contrastive learning method that converts pretrained language models (PLMs) into universal text encoders. It takes a PLM and a txt file containing raw text as input, and outputs a strong text embedding model, in just 20-30 seconds. It works well not only for sentence, but also for word and phrase representation learning.
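At its core, the method feeds each string through the encoder twice and treats the two randomized views as a positive pair in an InfoNCE loss (the paper additionally uses random span masking and drophead as augmentations, omitted here for brevity). Below is a minimal illustrative sketch of that objective, not the repo's actual training code; the temperature value is a made-up placeholder.

```python
# Minimal sketch of the self-duplication + InfoNCE idea (illustrative only).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.train()  # keep dropout active: its randomness is what makes the two views differ

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]  # [CLS] pooling

texts = ["a raw sentence from your corpus", "another raw sentence"]
z1, z2 = embed(texts), embed(texts)  # two dropout-randomized views of the same strings

# InfoNCE: each sentence's other view is its positive; the rest of the
# batch act as in-batch negatives.
tau = 0.04  # temperature (illustrative value, not necessarily the paper's setting)
sims = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / tau
loss = F.cross_entropy(sims, torch.arange(len(texts)))
loss.backward()
```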

Huggingface pretrained models

Sentence encoders:

| model | STS avg. |
| --- | --- |
| baseline: sentence-bert (supervised) | 74.89 |
| mirror-bert-base-uncased-sentence | 74.51 |
| mirror-roberta-base-sentence | 75.08 |
| mirror-bert-base-uncased-sentence-drophead | 75.16 |
| mirror-roberta-base-sentence-drophead | 76.67 |

Word encoder:

| model | Multi-SimLex (ENG) |
| --- | --- |
| baseline: fasttext | 52.80 |
| mirror-bert-base-uncased-word | 55.60 |

(Note that the released models will not exactly replicate the numbers in the paper, since the reported numbers are averages of three runs.)
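Since the released checkpoints are standard Hugging Face models, they can presumably also be loaded with vanilla transformers, without this repo's wrapper. A minimal sketch using first-token pooling, matching the agg_mode="cls" used in the Encode section below:

```python
# Sketch (not the repo's own API): load a released checkpoint directly
# with transformers and pool the first-token ([CLS]/<s>) vector.
import torch
from transformers import AutoModel, AutoTokenizer

name = "cambridgeltl/mirror-roberta-base-sentence-drophead"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

batch = tokenizer(["Two sentences", "to embed."], padding=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**batch).last_hidden_state[:, 0]  # first-token vector
print(embeddings.shape)  # e.g. torch.Size([2, 768])
```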

Train

For training sentence representations:

```bash
>> ./mirror_scripts/mirror_sentence_bert.sh 0,1
```

where 0,1 are GPU indices. This script should complete in 20-30 seconds on two NVIDIA 2080Ti/3090 GPUs. If you encounter an out-of-memory error, consider reducing max_length in the script. Scripts for replicating other models are available in mirror_scripts/.

Custom data: For training with your custom corpus, simply set --train_dir in the script to your own txt file (one sentence per line). When you have raw sentences from your target domain, we recommend always using that in-domain data for optimal performance. E.g., if you aim to create a conversational encoder, sample 10k utterances to train your model (see the sketch below)!
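A hypothetical helper for producing that one-sentence-per-line file; the file names "utterances.txt" and "train_10k.txt" are made up for illustration:

```python
# Sample ~10k in-domain sentences into the txt format expected by --train_dir.
import random

with open("utterances.txt") as f:
    sentences = [line.strip() for line in f if line.strip()]

random.seed(0)
sample = random.sample(sentences, min(10_000, len(sentences)))

with open("train_10k.txt", "w") as f:
    f.write("\n".join(sample) + "\n")
```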

Supervised training: Organise your training data in the format text1||text2 and store it one pair per line in a txt file. Then turn on the --pairwise option; text1 and text2 will be regarded as a positive pair in contrastive learning. You can be creative in finding such training pairs, and ideally they come from your application domain. E.g., to build an e-commerce QA encoder, the question||answer pairs from the Amazon question-answer dataset could work quite well. Example training script: mirror_scripts/mirror_sentence_roberta_supervised_amazon_qa.sh. Note that when tuned on your in-domain data, you shouldn't expect the model to remain good at STS; evaluate it on your in-domain task instead. A sketch of the pairwise file format follows.
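A hypothetical sketch of writing the text1||text2 format; `qa_pairs` and "train_pairs.txt" stand in for your own data and file name:

```python
# Write (question, answer) pairs one per line, as expected with --pairwise.
qa_pairs = [
    ("Does this kettle switch off automatically?", "Yes, it shuts off at the boil."),
    ("Is the cable braided?", "No, it is plain rubber."),
]
with open("train_pairs.txt", "w") as f:
    for question, answer in qa_pairs:
        f.write(f"{question}||{answer}\n")
```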

Word-level training: Use mirror_scripts/mirror_word_bert.sh.

Encode

It's easy to compute your own sentence embeddings:

```python
from src.mirror_bert import MirrorBERT

model_name = "cambridgeltl/mirror-roberta-base-sentence-drophead"
mirror_bert = MirrorBERT()
mirror_bert.load_model(path=model_name, use_cuda=True)

embeddings = mirror_bert.get_embeddings([
    "I transform pre-trained language models into universal text encoders.",
], agg_mode="cls")
print(embeddings.shape)
```
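A follow-up usage sketch: ranking candidate sentences against a query by cosine similarity. It assumes `mirror_bert` from the snippet above; torch.as_tensor guards against the embeddings coming back as NumPy arrays rather than torch tensors:

```python
import torch
import torch.nn.functional as F

query = torch.as_tensor(mirror_bert.get_embeddings(
    ["How do I turn a PLM into a text encoder?"], agg_mode="cls"))
docs = torch.as_tensor(mirror_bert.get_embeddings([
    "Mirror-BERT converts pretrained language models into text encoders.",
    "The weather is nice today.",
], agg_mode="cls"))

scores = F.cosine_similarity(query, docs)  # shape (2,); higher = more similar
print(scores)
```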

Evaluate

Evaluate sentence representations:

```bash
>> python evaluation/eval.py \
	--model_dir "cambridgeltl/mirror-roberta-base-sentence-drophead" \
	--agg_mode "cls" \
	--dataset sent_all
```

Evaluate word representations:

```bash
>> python evaluation/eval.py \
	--model_dir "cambridgeltl/mirror-bert-base-uncased-word" \
	--agg_mode "cls" \
	--dataset multisimlex_ENG
```

To test models on other languages, replace ENG with your target language code. See here for all supported languages on Multi-SimLex.
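A hedged convenience sketch for sweeping several languages; the codes below are examples only, so check the evaluation code and the Multi-SimLex release for the exact supported set:

```python
# Run the word-level evaluation for several Multi-SimLex languages.
import subprocess

for lang in ["ENG", "FRA", "RUS"]:  # example language codes (verify against the repo)
    subprocess.run([
        "python", "evaluation/eval.py",
        "--model_dir", "cambridgeltl/mirror-bert-base-uncased-word",
        "--agg_mode", "cls",
        "--dataset", f"multisimlex_{lang}",
    ], check=True)
```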

Citation

```bibtex
@inproceedings{liu-etal-2021-fast,
    title = "Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders",
    author = "Liu, Fangyu  and
      Vuli{\'c}, Ivan  and
      Korhonen, Anna  and
      Collier, Nigel",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.109",
    pages = "1442--1459",
}
```