Mean-Max AAE

Sentence encoder and training code for the paper Learning Universal Sentence Representations with Mean-Max Attention Autoencoder.

Dependencies

This code is written in Python. To use it you will need:

NLTK (the tokenize=True option below relies on the NLTK tokenizer)
NumPy (m.encode returns numpy arrays)
TensorFlow (assumed from the graph-building and CUDA_VISIBLE_DEVICES usage below)

Download datasets

The pre-processed Toronto BookCorpus we used for training our model is available here.

To download the GloVe word vectors:

curl -Lo data/glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
unzip data/glove.840B.300d.zip -d data/
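
After unzipping, data/glove.840B.300d.txt stores one token per line followed by 300 space-separated values. As a minimal loading sketch (the load_glove helper below is illustrative, not part of this repository; the build_emb step later presumably handles this internally):

import numpy as np

def load_glove(path='data/glove.840B.300d.txt'):
    """Read GloVe vectors into a dict of token -> 300-d numpy array."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            # A few tokens in glove.840B contain spaces, so split off the
            # trailing 300 numeric fields rather than taking the first field.
            parts = line.rstrip().split(' ')
            token = ' '.join(parts[:-300])
            vectors[token] = np.asarray(parts[-300:], dtype=np.float32)
    return vectors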

To get all the transfer task datasets, run (from data/):

./get_transfer_data.bash

This will automatically download and preprocess the transfer task datasets and store them in data/.

Sentence encoder

We provide a simple interface to encode English sentences. Get started with the following steps:

1) Download our Mean-Max AAE models, put them in the SentEncoding directory, and decompress:

unzip models.zip

2) Make sure you have the NLTK tokenizer by running the following once:

import nltk
nltk.download('punkt')
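
The tokenize=True flag in the calls below presumably uses this NLTK tokenizer; as a quick sanity check:

import nltk
print(nltk.word_tokenize("Mean-Max AAE encodes sentences."))
# ['Mean-Max', 'AAE', 'encodes', 'sentences', '.']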

3) Load our pre-trained model:

import os
import master

# Select the GPU before the graph is built and the session is created.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

m = master.Master('conf.json')
m.creat_graph()
m.prepare()

4) Build the vocabulary of word vectors (i.e., keep only those needed):

vocab = m.build_vocab(sentences, tokenize=True)
m.build_emb(vocab)

where sentences is your list of n sentences.

5) Encode your sentences:

embeddings = m.encode(sentences, tokenize=True)

This outputs a numpy array with n vectors of dimension 4096.
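
Putting the steps together, a minimal end-to-end sketch (the example sentences and the cosine-similarity check at the end are illustrative additions, not part of the original interface):

import os
import numpy as np
import master

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Load the pre-trained Mean-Max AAE encoder.
m = master.Master('conf.json')
m.creat_graph()
m.prepare()

sentences = ["A man is playing a guitar.",
             "Someone is playing an instrument."]

# Restrict the word vectors to the vocabulary of our sentences.
vocab = m.build_vocab(sentences, tokenize=True)
m.build_emb(vocab)

embeddings = m.encode(sentences, tokenize=True)  # shape: (2, 4096)

# Cosine similarity between the two sentence vectors.
a, b = embeddings
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))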

Reference

If you found this code useful, please cite the following paper:

@inproceedings{zhang2018learning,
  title={Learning Universal Sentence Representations with Mean-Max Attention Autoencoder},
  author={Zhang, Minghua and Wu, Yunfang and Li, Weikang and Li, Wei},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  pages={4514--4523},
  year={2018}
}