
ChunyuanLI / Optimus

Optimus: the first large-scale pre-trained VAE language model


Projects that are alternatives to or similar to Optimus

PCPM
Presenting Collection of Pretrained Models. Links to pretrained models in NLP and voice.
Stars: ✭ 21 (-88.33%)
Mutual labels:  pretrained-models, language-model
Awesome Sentence Embedding
A curated list of pretrained sentence and word embedding models
Stars: ✭ 1,973 (+996.11%)
Mutual labels:  language-model, pretrained-models
gap-text2sql
GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training
Stars: ✭ 83 (-53.89%)
Mutual labels:  pretrained-models, language-model
open clip
An open source implementation of CLIP.
Stars: ✭ 1,534 (+752.22%)
Mutual labels:  pretrained-models, language-model
Azureml Bert
End-to-End recipes for pre-training and fine-tuning BERT using Azure Machine Learning Service
Stars: ✭ 342 (+90%)
Mutual labels:  language-model, pretrained-models
Electra
Pre-trained Chinese ELECTRA model: pretraining a Chinese model via adversarial learning
Stars: ✭ 132 (-26.67%)
Mutual labels:  language-model, pretrained-models
Transformers
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Stars: ✭ 55,742 (+30867.78%)
Mutual labels:  language-model, pretrained-models
Clue
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus, and leaderboard
Stars: ✭ 2,425 (+1247.22%)
Mutual labels:  language-model, pretrained-models
Speecht
Open-source speech-to-text software written in TensorFlow
Stars: ✭ 152 (-15.56%)
Mutual labels:  language-model
Lotclass
[EMNLP 2020] Text Classification Using Label Names Only: A Language Model Self-Training Approach
Stars: ✭ 160 (-11.11%)
Mutual labels:  language-model
Electra pytorch
Pretrain and finetune ELECTRA with fastai and huggingface. (Results of the paper replicated!)
Stars: ✭ 149 (-17.22%)
Mutual labels:  language-model
Transformer Lm
Transformer language model (GPT-2) with sentencepiece tokenizer
Stars: ✭ 154 (-14.44%)
Mutual labels:  language-model
Lazynlp
Library to scrape and clean web pages to create massive datasets.
Stars: ✭ 1,985 (+1002.78%)
Mutual labels:  language-model
Sylvester Flows
Stars: ✭ 152 (-15.56%)
Mutual labels:  vae
Gpt Neo
An implementation of model parallel GPT2 & GPT3-like models, with the ability to scale up to full GPT3 sizes (and possibly more!), using the mesh-tensorflow library.
Stars: ✭ 1,252 (+595.56%)
Mutual labels:  language-model
Awd Lstm Lm
LSTM and QRNN Language Model Toolkit for PyTorch
Stars: ✭ 1,834 (+918.89%)
Mutual labels:  language-model
Factorvae
PyTorch implementation of FactorVAE, proposed in Disentangling by Factorising (http://arxiv.org/abs/1802.05983)
Stars: ✭ 176 (-2.22%)
Mutual labels:  vae
Indic Bert
BERT-based Multilingual Model for Indian Languages
Stars: ✭ 160 (-11.11%)
Mutual labels:  language-model
F Lm
Language Modeling
Stars: ✭ 156 (-13.33%)
Mutual labels:  language-model
Bert Ner Tf
Named Entity Recognition with BERT using TensorFlow 2.0
Stars: ✭ 155 (-13.89%)
Mutual labels:  pretrained-models

Optimus: the first pre-trained Big VAE language model

This repository contains source code necessary to reproduce the results presented in the EMNLP 2020 paper Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space.

The network architecture of Optimus: an encoder for representation learning and a decoder for generation. Sentences are organized and manipulated in a pre-trained, compact, and smooth latent space.
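To make the design concrete, here is a minimal PyTorch sketch of the encoder-latent-decoder wiring. All class and argument names are illustrative, not the repository's API; the actual implementation lives in code/examples/big_ae/modules/vae.py.

import torch
import torch.nn as nn

class OptimusStyleVAE(nn.Module):
    # Sketch of the design: a BERT-style encoder maps a sentence to a Gaussian
    # latent code z, and a GPT-2-style decoder generates text conditioned on z.
    # The encoder/decoder call interfaces below are hypothetical.
    def __init__(self, encoder, decoder, hidden_size, latent_size):
        super().__init__()
        self.encoder = encoder                            # e.g. a BERT model
        self.decoder = decoder                            # e.g. a GPT-2 model
        # project the sentence vector to the mean and log-variance of z
        self.to_mu_logvar = nn.Linear(hidden_size, 2 * latent_size)

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps, so gradients flow through mu and sigma
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, enc_input_ids, dec_input_ids):
        cls_vec = self.encoder(enc_input_ids)[0][:, 0]    # [CLS] sentence vector
        mu, logvar = self.to_mu_logvar(cls_vec).chunk(2, dim=-1)
        z = self.reparameterize(mu, logvar)
        logits = self.decoder(dec_input_ids, latent=z)    # decoder consumes z
        return logits, mu, logvar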

For more on this project, see the Microsoft Research Blog post.

News

May 21, 2020: Releasing a demo for latent space manipulation, including sentence interpolation and analogy. Check out the website.

May 20, 2020: The latent space manipulation code has been cleaned up and released. See the instructions at doc/optimius_for_snli.md.

May 13, 2020: The fine-tuning code for language modeling is released. See the instructions at doc/optimus_finetune_language_models.md.

Contents

There are four steps to use this codebase to reproduce the results in the paper.

  1. Dependencies
  2. Prepare datasets
  3. Model training
    1. Pre-training on sentences in Wikipedia
    2. Language Modeling
    3. Guided Language Generation
    4. Low-resource Language Understanding
  4. Collect and plot results

Dependencies

Pull the docker image from Docker Hub: chunyl/pytorch-transformers:v2. Please see the instructions at doc/env.md.
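For example, pulling the image looks like:

$ docker pull chunyl/pytorch-transformers:v2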

The project is organized into the following structure, with essential files and folders shown. The output folder saves the model checkpoints.

├── Optimus
│   ├── code
│   │   ├── examples
│   │   │   └── big_ae
│   │   │       ├── modules
│   │   │       │   ├── vae.py
│   │   │       │   └── ...
│   │   │       ├── run_lm_vae_pretraining_phdist_beta.py
│   │   │       ├── run_lm_vae_training.py
│   │   │       └── ...
│   │   ├── pytorch_transformers
│   │   │   ├── modeling_bert.py
│   │   │   ├── modeling_gpt2.py
│   │   │   └── ...
│   │   └── scripts
│   │       ├── scripts_docker
│   │       ├── scripts_local
│   │       └── scripts_philly
│   ├── data
│   │   └── datasets
│   │       ├── wikipedia_json_64_filtered
│   │       │   └── ...
│   │       ├── snli_data
│   │       └── ...
│   └── output
│       ├── pretrain
│       ├── LM
│       └── ...

Prepare Datasets

Please download or prepare the data by following the instructions at data/download_datasets.md.

Model Training

1. Pre-training on sentences in Wikipedia

We pre-trained our models on Philly (a Microsoft internal compute cluster); the code is specialized for multi-node, multi-GPU compute on that platform. The main pre-training script is run_lm_vae_pretraining_phdist_beta.py. You may need to adjust the distributed training scripts for your own environment.
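For reference, a launch on a single multi-GPU machine would look roughly like the following; the launcher flags are standard torch.distributed options, while the script-specific arguments (data paths, output directory, etc.) are placeholders to be filled in from the script itself:

$ python -m torch.distributed.launch --nproc_per_node=8 \
      code/examples/big_ae/run_lm_vae_pretraining_phdist_beta.py \
      <script-specific arguments>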

2. Language Modeling

For a fair comparison with existing VAE language models, we consider a model with latent dimension 32. The pre-trained model is fine-tuned on four commonly used datasets for one epoch. Please see the details at doc/optimus_finetune_language_models.md.
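The quantity optimized during fine-tuning is the β-VAE objective: token reconstruction loss plus a β-weighted KL term. A minimal sketch, simplified relative to the repository's implementation in code/examples/big_ae/modules/vae.py:

import torch

def beta_vae_loss(rec_loss, mu, logvar, beta=1.0):
    # Negative ELBO: reconstruction loss plus beta-weighted KL divergence
    # between the posterior N(mu, sigma^2) and the standard normal prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    return rec_loss + beta * kl.mean()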

3. Guided Language Generation

Latent Space Manipulation. To ensure good performance, we consider a model with latent dimension 768. The pre-trained model is fine-tuned on the SNLI dataset, where sentences show related patterns. Please see the details at doc/optimius_for_snli.md.
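Both manipulations offered by the demo reduce to vector arithmetic on latent codes: interpolation walks the line between two codes, and an analogy a : b :: c : d is solved as z_d = z_b - z_a + z_c. A minimal sketch, where encode and decode are hypothetical wrappers around the fine-tuned encoder and decoder:

import torch

def interpolate(z_a, z_b, num_steps=10):
    # linear interpolation between the latent codes of two sentences
    return [(1 - t) * z_a + t * z_b for t in torch.linspace(0.0, 1.0, num_steps)]

def analogy(z_a, z_b, z_c):
    # sentence analogy a : b :: c : d, solved as z_d = z_b - z_a + z_c
    return z_b - z_a + z_c

# usage (encode/decode are hypothetical):
#   z_a, z_b = encode(sent_a), encode(sent_b)
#   sentences = [decode(z) for z in interpolate(z_a, z_b)]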

4. Low-resource Language Understanding

Collect and Plot Results

Once the networks are trained and the results are saved, key results can be extracted using a Python script. The results can be plotted using the included IPython notebook plots/main_plots.ipynb. Start the IPython Notebook server:

$ cd plots
$ ipython notebook

Select the main_plots.ipynb notebook and execute the included code. Note that our extracted results are already copied into the notebook, so running it without modification will output the figures in the paper. If you have run your own training and wish to plot the results, you will have to organize them in the same format.

Questions?

Please drop me (Chunyuan) a line if you have any questions.

@inproceedings{li2020_Optimus,
  title={Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space},
  author={Li, Chunyuan and Gao, Xiang and Li, Yuan and Li, Xiujun and Peng, Baolin and Zhang, Yizhe and Gao, Jianfeng},
  booktitle={EMNLP},
  year={2020}
}