cambridgeltl / mirror-bert

License: MIT
[EMNLP 2021] Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels.

Programming Languages: Python, Shell

Projects that are alternatives of or similar to mirror-bert

kwx
BERT, LDA, and TFIDF based keyword extraction in Python
Stars: ✭ 33 (-41.07%)
Mutual labels:  unsupervised-learning, bert
SimCLR
Pytorch implementation of "A Simple Framework for Contrastive Learning of Visual Representations"
Stars: ✭ 65 (+16.07%)
Mutual labels:  unsupervised-learning, contrastive-learning
PIC
Parametric Instance Classification for Unsupervised Visual Feature Learning, NeurIPS 2020
Stars: ✭ 41 (-26.79%)
Mutual labels:  unsupervised-learning, contrastive-learning
Awesome Sentence Embedding
A curated list of pretrained sentence and word embedding models
Stars: ✭ 1,973 (+3423.21%)
Mutual labels:  unsupervised-learning, bert
Simclr
SimCLRv2 - Big Self-Supervised Models are Strong Semi-Supervised Learners
Stars: ✭ 2,720 (+4757.14%)
Mutual labels:  unsupervised-learning, contrastive-learning
Revisiting-Contrastive-SSL
Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations. [NeurIPS 2021]
Stars: ✭ 81 (+44.64%)
Mutual labels:  unsupervised-learning, contrastive-learning
CLSA
Official implementation of "Contrastive Learning with Stronger Augmentations"
Stars: ✭ 48 (-14.29%)
Mutual labels:  unsupervised-learning, contrastive-learning
ViCC
[WACV'22] Code repository for the paper "Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting", https://arxiv.org/abs/2106.10137.
Stars: ✭ 33 (-41.07%)
Mutual labels:  unsupervised-learning, contrastive-learning
Online-Category-Learning
ML algorithm for real-time classification
Stars: ✭ 67 (+19.64%)
Mutual labels:  unsupervised-learning
GCL
List of Publications in Graph Contrastive Learning
Stars: ✭ 25 (-55.36%)
Mutual labels:  contrastive-learning
awesome-graph-self-supervised-learning-based-recommendation
A curated list of awesome graph & self-supervised-learning-based recommendation.
Stars: ✭ 37 (-33.93%)
Mutual labels:  contrastive-learning
PCLNet
Unsupervised Learning for Optical Flow Estimation Using Pyramid Convolution LSTM.
Stars: ✭ 29 (-48.21%)
Mutual labels:  unsupervised-learning
trove
Weakly supervised medical named entity classification
Stars: ✭ 55 (-1.79%)
Mutual labels:  bert
label-studio-transformers
Label data using HuggingFace's transformers and automatically get a prediction service
Stars: ✭ 117 (+108.93%)
Mutual labels:  bert
rasa milktea chatbot
Chatbot with a Chinese BERT model, based on the Rasa framework (Chinese chatbot combining BERT intent analysis, built on Rasa)
Stars: ✭ 97 (+73.21%)
Mutual labels:  bert
SkeletonMerger
Code repository for paper `Skeleton Merger: an Unsupervised Aligned Keypoint Detector`.
Stars: ✭ 49 (-12.5%)
Mutual labels:  unsupervised-learning
ml-ai
ML-AI Community | Open Source | Built in Bharat for the World | Data science problem statements and solutions
Stars: ✭ 32 (-42.86%)
Mutual labels:  unsupervised-learning
bert-squeeze
🛠️ Tools for Transformers compression using PyTorch Lightning ⚡
Stars: ✭ 56 (+0%)
Mutual labels:  bert
berserker
Berserker - BERt chineSE woRd toKenizER
Stars: ✭ 17 (-69.64%)
Mutual labels:  bert
BERT-for-Chinese-Question-Answering
No description or website provided.
Stars: ✭ 75 (+33.93%)
Mutual labels:  bert

Mirror-BERT

UPDATE: see the follow-up work Trans-Encoder (ICLR'22), a state-of-the-art unsupervised model for STS.

Code repo for the EMNLP 2021 paper:
Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders
by Fangyu Liu, Ivan Vulić, Anna Korhonen, and Nigel Collier.

Mirror-BERT is an unsupervised contrastive learning method that converts pretrained language models (PLMs) into universal text encoders. It takes a PLM and a txt file containing raw text as input, and outputs a strong text embedding model, in just 20-30 seconds. It works well not only for sentence, but also for word and phrase representation learning.
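At its core, the method feeds each string through the encoder twice and treats the two randomized views as a positive pair in an InfoNCE loss (the paper additionally uses random span masking and drophead as augmentations, omitted here for brevity). Below is a minimal illustrative sketch of that objective, not the repo's actual training code; the temperature value is a made-up placeholder.

```python
# Minimal sketch of the self-duplication + InfoNCE idea (illustrative only).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.train()  # keep dropout active: its randomness is what makes the two views differ

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]  # [CLS] pooling

texts = ["a raw sentence from your corpus", "another raw sentence"]
z1, z2 = embed(texts), embed(texts)  # two dropout-randomized views of the same strings

# InfoNCE: each sentence's other view is its positive; the rest of the
# batch act as in-batch negatives.
tau = 0.04  # temperature (illustrative value, not necessarily the paper's setting)
sims = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / tau
loss = F.cross_entropy(sims, torch.arange(len(texts)))
loss.backward()
```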

Huggingface pretrained models

Sentence encoders:

| model | STS avg. |
| --- | --- |
| baseline: sentence-bert (supervised) | 74.89 |
| mirror-bert-base-uncased-sentence | 74.51 |
| mirror-roberta-base-sentence | 75.08 |
| mirror-bert-base-uncased-sentence-drophead | 75.16 |
| mirror-roberta-base-sentence-drophead | 76.67 |

Word encoder:

| model | Multi-SimLex (ENG) |
| --- | --- |
| baseline: fasttext | 52.80 |
| mirror-bert-base-uncased-word | 55.60 |

(Note that the released models will not exactly replicate the numbers in the paper, since the reported numbers are averages of three runs.)
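Since the released checkpoints are standard Hugging Face models, they can presumably also be loaded with vanilla transformers, without this repo's wrapper. A minimal sketch using first-token pooling, matching the agg_mode="cls" used in the Encode section below:

```python
# Sketch (not the repo's own API): load a released checkpoint directly
# with transformers and pool the first-token ([CLS]/<s>) vector.
import torch
from transformers import AutoModel, AutoTokenizer

name = "cambridgeltl/mirror-roberta-base-sentence-drophead"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

batch = tokenizer(["Two sentences", "to embed."], padding=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**batch).last_hidden_state[:, 0]  # first-token vector
print(embeddings.shape)  # e.g. torch.Size([2, 768])
```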

Train

For training sentence representations:

```bash
>> ./mirror_scripts/mirror_sentence_bert.sh 0,1
```

where 0,1 are GPU indices. This script should complete in 20-30 seconds on two NVIDIA 2080Ti/3090 GPUs. If you encounter an out-of-memory error, consider reducing max_length in the script. Scripts for replicating other models are available in mirror_scripts/.

Custom data: For training with your custom corpus, simply set --train_dir in the script to your own txt file (one sentence per line). When you have raw sentences from your target domain, we recommend always using that in-domain data for optimal performance. E.g., if you aim to create a conversational encoder, sample 10k utterances to train your model (see the sketch below)!
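A hypothetical helper for producing that one-sentence-per-line file; the file names "utterances.txt" and "train_10k.txt" are made up for illustration:

```python
# Sample ~10k in-domain sentences into the txt format expected by --train_dir.
import random

with open("utterances.txt") as f:
    sentences = [line.strip() for line in f if line.strip()]

random.seed(0)
sample = random.sample(sentences, min(10_000, len(sentences)))

with open("train_10k.txt", "w") as f:
    f.write("\n".join(sample) + "\n")
```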

Supervised training: Organise your training data in the format text1||text2 and store it one pair per line in a txt file. Then turn on the --pairwise option; text1 and text2 will be regarded as a positive pair in contrastive learning. You can be creative in finding such training pairs, and ideally they come from your application domain. E.g., to build an e-commerce QA encoder, the question||answer pairs from the Amazon question-answer dataset could work quite well. Example training script: mirror_scripts/mirror_sentence_roberta_supervised_amazon_qa.sh. Note that when tuned on your in-domain data, you shouldn't expect the model to remain good at STS; evaluate it on your in-domain task instead. A sketch of the pairwise file format follows.
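A hypothetical sketch of writing the text1||text2 format; `qa_pairs` and "train_pairs.txt" stand in for your own data and file name:

```python
# Write (question, answer) pairs one per line, as expected with --pairwise.
qa_pairs = [
    ("Does this kettle switch off automatically?", "Yes, it shuts off at the boil."),
    ("Is the cable braided?", "No, it is plain rubber."),
]
with open("train_pairs.txt", "w") as f:
    for question, answer in qa_pairs:
        f.write(f"{question}||{answer}\n")
```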

Word-level training: Use mirror_scripts/mirror_word_bert.sh.

Encode

It's easy to compute your own sentence embeddings:

```python
from src.mirror_bert import MirrorBERT

model_name = "cambridgeltl/mirror-roberta-base-sentence-drophead"
mirror_bert = MirrorBERT()
mirror_bert.load_model(path=model_name, use_cuda=True)

embeddings = mirror_bert.get_embeddings([
    "I transform pre-trained language models into universal text encoders.",
], agg_mode="cls")
print(embeddings.shape)
```
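A follow-up usage sketch: ranking candidate sentences against a query by cosine similarity. It assumes `mirror_bert` from the snippet above; torch.as_tensor guards against the embeddings coming back as NumPy arrays rather than torch tensors:

```python
import torch
import torch.nn.functional as F

query = torch.as_tensor(mirror_bert.get_embeddings(
    ["How do I turn a PLM into a text encoder?"], agg_mode="cls"))
docs = torch.as_tensor(mirror_bert.get_embeddings([
    "Mirror-BERT converts pretrained language models into text encoders.",
    "The weather is nice today.",
], agg_mode="cls"))

scores = F.cosine_similarity(query, docs)  # shape (2,); higher = more similar
print(scores)
```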

Evaluate

Evaluate sentence representations:

```bash
>> python evaluation/eval.py \
	--model_dir "cambridgeltl/mirror-roberta-base-sentence-drophead" \
	--agg_mode "cls" \
	--dataset sent_all
```

Evaluate word representations:

```bash
>> python evaluation/eval.py \
	--model_dir "cambridgeltl/mirror-bert-base-uncased-word" \
	--agg_mode "cls" \
	--dataset multisimlex_ENG
```

To test models on other languages, replace ENG with your target language code. See here for all supported languages on Multi-SimLex.
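A hedged convenience sketch for sweeping several languages; the codes below are examples only, so check the evaluation code and the Multi-SimLex release for the exact supported set:

```python
# Run the word-level evaluation for several Multi-SimLex languages.
import subprocess

for lang in ["ENG", "FRA", "RUS"]:  # example language codes (verify against the repo)
    subprocess.run([
        "python", "evaluation/eval.py",
        "--model_dir", "cambridgeltl/mirror-bert-base-uncased-word",
        "--agg_mode", "cls",
        "--dataset", f"multisimlex_{lang}",
    ], check=True)
```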

Citation

```bibtex
@inproceedings{liu-etal-2021-fast,
    title = "Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders",
    author = "Liu, Fangyu  and
      Vuli{\'c}, Ivan  and
      Korhonen, Anna  and
      Collier, Nigel",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.109",
    pages = "1442--1459",
}
```