facebookresearch / InferSent

Licence: other
InferSent sentence embeddings

Programming Languages

  • Jupyter Notebook: 11,667 projects
  • Python: 139,335 projects (#7 most used programming language)

Projects that are alternatives to or similar to InferSent

Recsys course at polimi
This is the official repository for the Recommender Systems course at Politecnico di Milano.
Stars: ✭ 180 (-91.74%)
Mutual labels:  jupyter-notebook
Wibd Workshops 2018
Stars: ✭ 181 (-91.69%)
Mutual labels:  jupyter-notebook
Deeplearning.ai Note
NetEase Cloud Classroom has finally released the officially authorized Chinese version of Andrew Ng's "Deep Learning Specialization"; these are my own notes and code for it. The NetEase study link is below.
Stars: ✭ 181 (-91.69%)
Mutual labels:  jupyter-notebook
Quant Notes
Quantitative Interview Preparation Guide, updated version here ==>
Stars: ✭ 180 (-91.74%)
Mutual labels:  jupyter-notebook
Face and emotion detection
Stars: ✭ 181 (-91.69%)
Mutual labels:  jupyter-notebook
Pytorch Vae
A CNN Variational Autoencoder (CNN-VAE) implemented in PyTorch
Stars: ✭ 181 (-91.69%)
Mutual labels:  jupyter-notebook
Yolov3 Tf2
YoloV3 Implemented in Tensorflow 2.0
Stars: ✭ 2,327 (+6.79%)
Mutual labels:  jupyter-notebook
Subpixel
subpixel: A subpixel convnet for super resolution with Tensorflow
Stars: ✭ 2,114 (-2.98%)
Mutual labels:  jupyter-notebook
Lets Plot Kotlin
Kotlin API for Lets-Plot - an open-source plotting library for statistical data.
Stars: ✭ 181 (-91.69%)
Mutual labels:  jupyter-notebook
Machinelearning ng
Study code (Fall 2017) for Andrew Ng's Stanford Machine Learning course on Coursera.
Stars: ✭ 181 (-91.69%)
Mutual labels:  jupyter-notebook
Keras Segnet
SegNet model implemented using keras framework
Stars: ✭ 180 (-91.74%)
Mutual labels:  jupyter-notebook
Nlp profiler
A simple NLP library allows profiling datasets with one or more text columns. When given a dataset and a column name containing text data, NLP Profiler will return either high-level insights or low-level/granular statistical information about the text in that column.
Stars: ✭ 181 (-91.69%)
Mutual labels:  jupyter-notebook
Eeg Notebooks v0.1
Previous version of eeg-notebooks
Stars: ✭ 181 (-91.69%)
Mutual labels:  jupyter-notebook
Pulmonary Nodules Maskrcnn
Mask R-CNN for Pulmonary Nodules Diagnosis, using TensorFlow. Tianchi Medical AI Competition: intelligent pulmonary nodule detection with Mask R-CNN (segmentation + classification).
Stars: ✭ 180 (-91.74%)
Mutual labels:  jupyter-notebook
Scatteract
Project which implements extraction of data from scatter plots
Stars: ✭ 181 (-91.69%)
Mutual labels:  jupyter-notebook
Anime Gan Tensorflow
Anime image generation based on BigGAN, implemented in TensorFlow. All training data has been open-sourced.
Stars: ✭ 180 (-91.74%)
Mutual labels:  jupyter-notebook
Libfm in keras
This notebook shows how to implement LibFM in Keras and how it was used in the Talking Data competition on Kaggle.
Stars: ✭ 181 (-91.69%)
Mutual labels:  jupyter-notebook
Academiccontent
Free tech resources for faculty, students, researchers, life-long learners, and academic community builders for use in tech based courses, workshops, and hackathons.
Stars: ✭ 2,196 (+0.78%)
Mutual labels:  jupyter-notebook
Girls In Ai
Free learn-to-code series: Python for beginners, data analysis, machine learning, deep learning, and hands-on Kaggle practice.
Stars: ✭ 2,309 (+5.97%)
Mutual labels:  jupyter-notebook
Deeptoxic
Top 1% solution to the Toxic Comment Classification Challenge on Kaggle.
Stars: ✭ 180 (-91.74%)
Mutual labels:  jupyter-notebook

InferSent

InferSent is a sentence embedding method that provides semantic representations for English sentences. It is trained on natural language inference data and generalizes well to many different tasks.

We provide our pre-trained English sentence encoder from our paper and our SentEval evaluation toolkit.

Recent changes: removed train_nli.py and kept only the pre-trained models, for simplicity; I no longer have time to maintain the repo beyond simple scripts for getting sentence embeddings.

Dependencies

This code is written in Python. Dependencies include the packages below; a quick import check follows the list:

  • Python 2/3
  • PyTorch (recent version)
  • NLTK >= 3
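
A quick way to sanity-check the environment (a minimal sketch, assuming the usual PyPI package names torch and nltk):

import sys
import nltk
import torch

print('Python:', sys.version.split()[0])
print('PyTorch:', torch.__version__)
print('NLTK:', nltk.__version__)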

Download word vectors

Download GloVe (V1) or fastText (V2) vectors:

# GloVe vectors (used by InferSent V1)
mkdir GloVe
curl -Lo GloVe/glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
unzip GloVe/glove.840B.300d.zip -d GloVe/

# fastText vectors (used by InferSent V2)
mkdir fastText
curl -Lo fastText/crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
unzip fastText/crawl-300d-2M.vec.zip -d fastText/

Use our sentence encoder

We provide a simple interface to encode English sentences. See demo.ipynb for a practical example. Get started with the following steps:

0.0) Download our InferSent models (V1, trained with GloVe, and V2, trained with fastText) [147MB]:

mkdir encoder
curl -Lo encoder/infersent1.pkl https://dl.fbaipublicfiles.com/infersent/infersent1.pkl
curl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl

Note that infersent1 was trained with GloVe vectors (which were themselves trained on text preprocessed with the PTB tokenizer), while infersent2 was trained with fastText vectors (trained on text preprocessed with the Moses tokenizer). V2 also removes the zero-padding from max-pooling, which was inconvenient when embedding sentences outside of their original batches.
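
Since each model version must be paired with its matching word vectors, a small hypothetical helper like the one below keeps the two consistent (paths assume the download commands from the previous section):

# Hypothetical helper: pair each InferSent version with its word vectors.
W2V_BY_VERSION = {
    1: 'GloVe/glove.840B.300d.txt',   # GloVe (PTB-tokenized training text)
    2: 'fastText/crawl-300d-2M.vec',  # fastText (Moses-tokenized training text)
}
V = 2
MODEL_PATH = 'encoder/infersent%s.pkl' % V
W2V_PATH = W2V_BY_VERSION[V]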

0.1) Make sure you have the NLTK tokenizer by running the following once:

import nltk
nltk.download('punkt')

1) Load our pre-trained model (in encoder/):

import torch
from models import InferSent

V = 2
MODEL_PATH = 'encoder/infersent%s.pkl' % V
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))

2) Set word vector path for the model:

W2V_PATH = 'fastText/crawl-300d-2M.vec'
infersent.set_w2v_path(W2V_PATH)

3) Build the vocabulary of word vectors (i.e. keep only those needed):

infersent.build_vocab(sentences, tokenize=True)

where sentences is your list of n sentences. You can update the vocabulary with infersent.update_vocab(sentences), or directly load the K most common English words with infersent.build_vocab_k_words(K=100000). If tokenize is True (the default), sentences will be tokenized with NLTK.
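
For example, to skip building a corpus-specific vocabulary you can load the most frequent English words up front and grow the vocabulary later (a sketch using the methods named above; the example sentence is arbitrary):

# Load the 100K most common English words from the word-vector file...
infersent.build_vocab_k_words(K=100000)
# ...then extend the vocabulary as new sentences arrive.
infersent.update_vocab(['A new sentence with previously unseen words.'], tokenize=True)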

4) Encode your sentences (list of n sentences):

embeddings = infersent.encode(sentences, tokenize=True)

This outputs a numpy array with n vectors of dimension 4096. Speed is around 1000 sentences per second with batch size 128 on a single GPU.
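
As a quick check of the output (a sketch; the two example sentences are arbitrary), you can inspect the array shape and compute a cosine similarity with numpy:

import numpy as np

embeddings = infersent.encode(['A man is playing a guitar.',
                               'Someone plays an instrument.'], tokenize=True)
print(embeddings.shape)  # (2, 4096)

# Cosine similarity between the two sentence vectors as a rough
# semantic-relatedness score.
a, b = embeddings
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))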

5) Visualize the importance that our model attributes to each word:

We provide a function to visualize the importance of each word in the encoding of a sentence:

infersent.visualize('A man plays an instrument.', tokenize=True)

Model

[Figure: the InferSent architecture, a bidirectional LSTM encoder with max pooling over hidden states.]

Evaluate the encoder on transfer tasks

To evaluate the model on transfer tasks, see SentEval. Be mindful to choose the same tokenization used for training the encoder. You should obtain the following test results for the baselines and the InferSent models:

Model        | MR   | CR   | SUBJ | MPQA | STS14   | STS Benchmark | SICK Relatedness | SICK Entailment | SST  | TREC | MRPC
InferSent1   | 81.1 | 86.3 | 92.4 | 90.2 | .68/.65 | 75.8/75.5     | 0.884            | 86.1            | 84.6 | 88.2 | 76.2/83.1
InferSent2   | 79.7 | 84.2 | 92.7 | 89.4 | .68/.66 | 78.4/78.4     | 0.888            | 86.3            | 84.3 | 90.8 | 76.0/83.8
SkipThought  | 79.4 | 83.1 | 93.7 | 89.3 | .44/.45 | 72.1/70.2     | 0.858            | 79.5            | 82.9 | 88.4 | -
fastText-BoV | 78.2 | 80.2 | 91.8 | 88.0 | .65/.63 | 70.2/68.3     | 0.823            | 78.9            | 82.3 | 83.4 | 74.4/82.4
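
As a rough illustration, SentEval plugs the encoder in through a batcher function; the sketch below assumes SentEval is installed, its task data lives under PATH_TO_DATA, and infersent is the model loaded in the steps above:

import senteval

def prepare(params, samples):
    # Build the InferSent vocabulary from all sentences of the task.
    params.infersent.build_vocab([' '.join(s) for s in samples], tokenize=False)

def batcher(params, batch):
    # SentEval hands over batches of tokenized sentences.
    sentences = [' '.join(s) for s in batch]
    return params.infersent.encode(sentences, tokenize=False)

params = {'task_path': 'PATH_TO_DATA', 'usepytorch': True, 'kfold': 10,
          'infersent': infersent}
se = senteval.engine.SE(params, batcher, prepare)
print(se.eval(['MR', 'CR', 'SUBJ']))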

Reference

Please consider citing [1] if you found this code useful.

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data (EMNLP 2017)

[1] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

@InProceedings{conneau-EtAl:2017:EMNLP2017,
  author    = {Conneau, Alexis  and  Kiela, Douwe  and  Schwenk, Holger  and  Barrault, Lo\"{i}c  and  Bordes, Antoine},
  title     = {Supervised Learning of Universal Sentence Representations from Natural Language Inference Data},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
  month     = {September},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  pages     = {670--680},
  url       = {https://www.aclweb.org/anthology/D17-1070}
}
