
microsoft / COCO-LM

License: MIT license
[NeurIPS 2021] COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining

Programming Languages

Python
Shell
Cuda
C++
Cython
Lua

Projects that are alternatives of or similar to COCO-LM

Tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Stars: ✭ 5,077 (+4557.8%)
Mutual labels: transformers, language-model, natural-language-understanding
Deberta
The implementation of DeBERTa
Stars: ✭ 541 (+396.33%)
Mutual labels: representation-learning, language-model, natural-language-understanding
Revisiting-Contrastive-SSL
Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations. [NeurIPS 2021]
Stars: ✭ 81 (-25.69%)
Mutual labels: representation-learning, pretraining, contrastive-learning
Simclr
SimCLRv2 - Big Self-Supervised Models are Strong Semi-Supervised Learners
Stars: ✭ 2,720 (+2395.41%)
Mutual labels: representation-learning, contrastive-learning
simclr-pytorch
PyTorch implementation of SimCLR: supports multi-GPU training and closely reproduces results
Stars: ✭ 89 (-18.35%)
Mutual labels: representation-learning, contrastive-learning
SimCLR
PyTorch implementation of "A Simple Framework for Contrastive Learning of Visual Representations"
Stars: ✭ 65 (-40.37%)
Mutual labels: representation-learning, contrastive-learning
TCE
This repository contains the code implementation used in the paper Temporally Coherent Embeddings for Self-Supervised Video Representation Learning (TCE).
Stars: ✭ 51 (-53.21%)
Mutual labels: representation-learning, contrastive-learning
wechsel
Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.
Stars: ✭ 39 (-64.22%)
Mutual labels: transformers, language-model
KB-ALBERT
A Korean ALBERT model specialized for the economy/finance domain, provided by KB Kookmin Bank
Stars: ✭ 215 (+97.25%)
Mutual labels: transformers, language-model
PLBART
Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].
Stars: ✭ 151 (+38.53%)
Mutual labels: representation-learning, language-model
language-planner
Official Code for "Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents"
Stars: ✭ 84 (-22.94%)
Mutual labels: transformers, language-model
Haystack
🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.
Stars: ✭ 3,409 (+3027.52%)
Mutual labels: transformers, language-model
CodeT5
Code for CodeT5: a new code-aware pre-trained encoder-decoder model.
Stars: ✭ 390 (+257.8%)
Mutual labels: representation-learning, language-model
gnn-lspe
Source code for GNN-LSPE (Graph Neural Networks with Learnable Structural and Positional Representations), ICLR 2022
Stars: ✭ 165 (+51.38%)
Mutual labels: transformers, representation-learning
minicons
Utility for analyzing Transformer based representations of language.
Stars: ✭ 28 (-74.31%)
Mutual labels: transformers, language-model
backprop
Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
Stars: ✭ 229 (+110.09%)
Mutual labels: transformers, language-model
object-aware-contrastive
Object-aware Contrastive Learning for Debiased Scene Representation (NeurIPS 2021)
Stars: ✭ 44 (-59.63%)
Mutual labels: representation-learning, contrastive-learning
Clue
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+2124.77%)
Mutual labels: transformers, language-model
label-studio-transformers
Label data using HuggingFace's transformers and automatically get a prediction service
Stars: ✭ 117 (+7.34%)
Mutual labels: transformers, natural-language-understanding
text2class
Multi-class text categorization using state-of-the-art pre-trained contextualized language models, e.g. BERT
Stars: ✭ 15 (-86.24%)
Mutual labels: transformers, natural-language-understanding

COCO-LM

This repository contains the scripts for fine-tuning COCO-LM pretrained models on GLUE and SQuAD 2.0 benchmarks.

Paper: COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining

Overview

We provide the scripts in two versions, based on two widely used open-source codebases: the Fairseq Library and the Huggingface Transformers Library. The two code versions are mostly equivalent in functionality, and you are free to use either of them. Note, however, that the fairseq version is what we used in our experiments and will best reproduce the results in the paper; the huggingface version was implemented later to provide compatibility with the Huggingface Transformers Library and may yield slightly different results.

Please follow the README files under the two directories for running the code.
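For orientation only, the sketch below shows what GLUE fine-tuning through the standard Huggingface Transformers Trainer API generally looks like. It is not the repository's own script: the checkpoint path, task choice, and hyperparameters are placeholders rather than the settings used in the paper, so refer to the directory READMEs for the actual entry points and recommended hyperparameters.

# Illustrative sketch: fine-tuning a sequence-classification model on a GLUE task
# with the Hugging Face Trainer. The checkpoint path below is a placeholder, not a
# published COCO-LM model id; see the huggingface/ directory README for specifics.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "path/to/cocolm-checkpoint"  # placeholder path
task = "rte"  # any GLUE task name accepted by the datasets library

raw = load_dataset("glue", task)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize(batch):
    # RTE is a sentence-pair task; single-sentence tasks pass only one text field.
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, max_length=128)

encoded = raw.map(tokenize, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

args = TrainingArguments(output_dir="rte-finetune",
                         per_device_train_batch_size=16,
                         learning_rate=2e-5,
                         num_train_epochs=3)
trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"],
                  tokenizer=tokenizer)  # dynamic padding is handled by the default collator
trainer.train()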

GLUE Fine-Tuning Results

The General Language Understanding Evaluation (GLUE) benchmark is a collection of sentence- or sentence-pair language understanding tasks for evaluating and analyzing natural language understanding systems.

GLUE dev set results of COCO-LM base++ and large++ models are as follows (median of 5 different random seeds):

Model MNLI-m/mm QQP QNLI SST-2 CoLA RTE MRPC STS-B AVG
COCO-LM base++ 90.2/90.0 92.2 94.2 94.6 67.3 87.4 91.2 91.8 88.6
COCO-LM large++ 91.4/91.6 92.8 95.7 96.9 73.9 91.0 92.2 92.7 90.8

GLUE test set results of COCO-LM base++ and large++ models are as follows (no ensemble, task-specific tricks, etc.):

Model MNLI-m/mm QQP QNLI SST-2 CoLA RTE MRPC STS-B AVG
COCO-LM base++ 89.8/89.3 89.8 94.2 95.6 68.6 82.3 88.5 90.3 87.4
COCO-LM large++ 91.6/91.1 90.5 95.8 96.7 70.5 89.2 88.4 91.8 89.3
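For reference, accuracy is the standard GLUE metric for most of these tasks, with CoLA scored by Matthews correlation and STS-B by Pearson/Spearman correlation. As a small, hedged illustration (not part of this repository's scripts), the evaluate library can compute these scores from model predictions:

# Illustrative only: computing GLUE scores with the evaluate library.
# The predictions and references below are dummy values standing in for model outputs.
import evaluate

cola_metric = evaluate.load("glue", "cola")   # Matthews correlation
stsb_metric = evaluate.load("glue", "stsb")   # Pearson / Spearman correlation

print(cola_metric.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0]))
# -> {'matthews_correlation': ...}
print(stsb_metric.compute(predictions=[0.5, 2.7, 4.1], references=[1.0, 3.0, 4.0]))
# -> {'pearson': ..., 'spearmanr': ...}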

SQuAD 2.0 Fine-Tuning Results

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

SQuAD 2.0 dev set results of COCO-LM base++ and large++ models are as follows (median of 5 different random seeds):

Model EM F1
COCO-LM base++ 85.4 88.1
COCO-LM large++ 88.2 91.0
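EM (exact match) and F1 are the standard SQuAD 2.0 scores. As a small, hedged illustration (again, not part of this repository's scripts), the squad_v2 metric in the evaluate library computes them from predicted answer strings; the example id and texts below are dummy values:

# Illustrative only: SQuAD 2.0 exact-match / F1 scoring with the evaluate library.
import evaluate

squad_v2 = evaluate.load("squad_v2")
predictions = [{"id": "ex-0",
                "prediction_text": "a segment of text",
                "no_answer_probability": 0.0}]
references = [{"id": "ex-0",
               "answers": {"text": ["a segment of text"], "answer_start": [42]}}]
print(squad_v2.compute(predictions=predictions, references=references))
# -> {'exact': 100.0, 'f1': 100.0, ...}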

Citation

If you find the code and models useful for your research, please cite the following paper:

@inproceedings{meng2021cocolm,
  title={{COCO-LM}: Correcting and contrasting text sequences for language model pretraining},
  author={Meng, Yu and Xiong, Chenyan and Bajaj, Payal and Tiwary, Saurabh and Bennett, Paul and Han, Jiawei and Song, Xia},
  booktitle={Conference on Neural Information Processing Systems},
  year={2021}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].