
icoxfog417 / tying-wv-and-wc

License: MIT
Implementation for "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling"

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to tying-wv-and-wc

wechsel
Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.
Stars: ✭ 39 (+0%)
Mutual labels:  language-model
mongolian-nlp
Useful resources for Mongolian NLP
Stars: ✭ 119 (+205.13%)
Mutual labels:  language-model
language-planner
Official Code for "Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents"
Stars: ✭ 84 (+115.38%)
Mutual labels:  language-model
query completion
Personalized Query Completion
Stars: ✭ 24 (-38.46%)
Mutual labels:  language-model
CoLAKE
COLING'2020: CoLAKE: Contextualized Language and Knowledge Embedding
Stars: ✭ 86 (+120.51%)
Mutual labels:  language-model
Deep-NLP-Resources
Curated list of all NLP Resources
Stars: ✭ 65 (+66.67%)
Mutual labels:  language-model
Romanian-Transformers
This repo is the home of Romanian Transformers.
Stars: ✭ 60 (+53.85%)
Mutual labels:  language-model
gpt-j
A GPT-J API to use with python3 to generate text, blogs, code, and more
Stars: ✭ 101 (+158.97%)
Mutual labels:  language-model
cscg
Code Generation as a Dual Task of Code Summarization.
Stars: ✭ 28 (-28.21%)
Mutual labels:  language-model
chainer-notebooks
Jupyter notebooks for Chainer hands-on
Stars: ✭ 23 (-41.03%)
Mutual labels:  language-model
gdc
Code for the ICLR 2021 paper "A Distributional Approach to Controlled Text Generation"
Stars: ✭ 94 (+141.03%)
Mutual labels:  language-model
gpt-j-api
API for the GPT-J language model 🦜. Including a FastAPI backend and a streamlit frontend
Stars: ✭ 248 (+535.9%)
Mutual labels:  language-model
Black-Box-Tuning
ICML'2022: Black-Box Tuning for Language-Model-as-a-Service
Stars: ✭ 99 (+153.85%)
Mutual labels:  language-model
minGPT-TF
A minimal TF2 re-implementation of the OpenAI GPT training
Stars: ✭ 36 (-7.69%)
Mutual labels:  language-model
FNet-pytorch
Unofficial implementation of Google's FNet: Mixing Tokens with Fourier Transforms
Stars: ✭ 204 (+423.08%)
Mutual labels:  language-model
PCPM
Presenting Collection of Pretrained Models. Links to pretrained models in NLP and voice.
Stars: ✭ 21 (-46.15%)
Mutual labels:  language-model
open clip
An open source implementation of CLIP.
Stars: ✭ 1,534 (+3833.33%)
Mutual labels:  language-model
CodeT5
Code for CodeT5: a new code-aware pre-trained encoder-decoder model.
Stars: ✭ 390 (+900%)
Mutual labels:  language-model
Word-Prediction-Ngram
Next Word Prediction using n-gram Probabilistic Model with various Smoothing Techniques
Stars: ✭ 25 (-35.9%)
Mutual labels:  language-model
bert-movie-reviews-sentiment-classifier
Build a Movie Reviews Sentiment Classifier with Google's BERT Language Model
Stars: ✭ 12 (-69.23%)
Mutual labels:  language-model

Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling

Implementation for "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling"

This paper utilizes the diversity of word meanings to train a deep neural network.

Summary of Paper

Motivation

In language modeling (predicting the next word in a sequence), we want to express the diversity of word meanings.
For example, when predicting the word after "Banana is delicious ___", the answer is "fruit", but "sweets" or "food" would also be acceptable. Ordinary one-hot targets are not suitable for this, because every word other than the exact answer is ignored, including similar words.

motivation.PNG

If we teach with a "distribution" over words instead of a one-hot vector, we can convey this variety.

Method

So we teach the model with a "distribution over words". This distribution is derived from the answer word and the embedding lookup matrix.

formulation.PNG

architecture.PNG
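Below is a minimal sketch of how such a teacher distribution could be built from the answer word and the embedding lookup matrix. The function name, temperature value, and random embeddings are only illustrative assumptions; the repository's actual code may differ.

```python
import numpy as np

def teacher_distribution(answer_id, embedding, temperature=1.0):
    """Soft target built from the answer word and the embedding lookup matrix.

    embedding: (vocab_size, dim) embedding lookup matrix
    answer_id: index of the ground-truth next word
    Returns a probability distribution over the vocabulary that puts mass on
    words whose embeddings are similar to the answer word's embedding.
    """
    answer_vec = embedding[answer_id]                  # (dim,)
    scores = embedding @ answer_vec / temperature      # similarity to every word
    scores -= scores.max()                             # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# Illustrative usage with random embeddings
vocab_size, dim = 10000, 200
E = np.random.default_rng(0).normal(size=(vocab_size, dim))
y_soft = teacher_distribution(42, E, temperature=2.0)  # soft label for word 42
```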

When this distribution-based loss is used, the input embedding and the output projection matrix can be shown to be equivalent.

equivalence.PNG

Using the distribution-based loss together with the constraint that the input embedding equals the output projection (weight tying) improves the perplexity of the model.
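One way to impose this tying constraint in Keras is to reuse the embedding matrix (transposed) as the output projection. The layer below is only an illustrative sketch using tf.keras, not the repository's exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

class TiedOutput(layers.Layer):
    """Output projection that reuses the embedding weights (transposed),
    so the input embedding and the output projection share one matrix."""

    def __init__(self, embedding_layer, **kwargs):
        super().__init__(**kwargs)
        self.embedding_layer = embedding_layer

    def build(self, input_shape):
        vocab_size = self.embedding_layer.input_dim
        self.bias = self.add_weight(name="bias", shape=(vocab_size,),
                                    initializer="zeros")

    def call(self, hidden):
        # hidden: (batch, time, dim); embeddings: (vocab, dim) -> (batch, time, vocab)
        logits = tf.einsum("btd,vd->btv", hidden, self.embedding_layer.embeddings)
        return tf.nn.softmax(logits + self.bias)

# Wiring it into a small language model (sizes are placeholders)
vocab_size, dim = 10000, 200
tokens = layers.Input(shape=(None,), dtype="int32")
embed = layers.Embedding(vocab_size, dim)
hidden = layers.LSTM(dim, return_sequences=True)(embed(tokens))
probs = TiedOutput(embed)(hidden)
model = tf.keras.Model(tokens, probs)
```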

Experiments

Implementation

Result

result.PNG

  • Ran 15 epochs on the Penn Treebank dataset.
    • The perplexity scores are large, so I am not fully confident in this implementation. Pull requests are welcome!
  • augmentedmodel works better than the baseline (onehotmodel), and augmentedmodel_tying outperforms the baseline as well!
  • You can run this experiment with python train.py

I also implemented a stateful LSTM version. Its result is as follows.

stateful_result.PNG

The perplexity improves (though the curve is jagged), and the tying method loses a little of its effect.
Using a stateful LSTM in Keras is quite hard (especially calling reset_states around the validation set), so there may be some limitations in this implementation.
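For reference, one way to handle the reset is a Keras callback that clears the LSTM states at epoch boundaries and before validation; this is only a sketch of the idea, not the exact workaround used here.

```python
import tensorflow as tf

class ResetStatesCallback(tf.keras.callbacks.Callback):
    """Reset the stateful LSTM's hidden states so that states from the
    previous epoch or from the validation pass do not leak forward."""

    def on_epoch_begin(self, epoch, logs=None):
        self.model.reset_states()

    def on_test_begin(self, logs=None):
        # clear states before the validation pass as well
        self.model.reset_states()

# Usage sketch: model.fit(x, y, shuffle=False, callbacks=[ResetStatesCallback()])
```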

Additional validation

  • At the beginning of training, the embedding matrix used to produce the "teacher distribution" is not trained yet, so the proposed method starts with a small handicap.
    • However, no delay in training was observed.
  • Gradually increasing the temperature (alpha) may improve training speed (see the sketch after this list).
  • Using pre-trained word vectors, or fixing the embedding matrix weights for some interval (like the fixed-target technique in reinforcement learning; please refer to Deep Reinforcement Learning), may also help training.
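As a sketch of the temperature idea above, a simple schedule could raise alpha over the first epochs; the shape and values below are only assumptions and have not been validated.

```python
def alpha_schedule(epoch, alpha_start=0.5, alpha_end=2.0, warmup_epochs=10):
    """Linearly raise the temperature alpha during the first warmup epochs,
    then keep it fixed (all values here are placeholders)."""
    if epoch >= warmup_epochs:
        return alpha_end
    return alpha_start + (alpha_end - alpha_start) * epoch / warmup_epochs

# e.g. recompute the teacher distribution's temperature each epoch:
# alpha = alpha_schedule(current_epoch)
```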

By the way, the official PyTorch language-model example already uses the tying method! Don't be afraid to use it!
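In that example, tying boils down to sharing one parameter between the input embedding and the output projection, roughly like this simplified PyTorch sketch (sizes are placeholders):

```python
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 10000, 200, 200  # hidden size must equal embedding size for tying

encoder = nn.Embedding(vocab_size, emb_dim)
decoder = nn.Linear(hidden_dim, vocab_size)
decoder.weight = encoder.weight  # the output projection reuses the input embedding matrix
```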
