
yannvgn / Laserembeddings

Licence: bsd-3-clause
LASER multilingual sentence embeddings as a pip package

Programming Languages

python

Projects that are alternatives to or similar to Laserembeddings

Hub
A library for transfer learning by reusing parts of TensorFlow models.
Stars: ✭ 3,007 (+2305.6%)
Mutual labels:  transfer-learning, embeddings
Keras-Application-Zoo
Reference implementations of popular DL models missing from keras-applications & keras-contrib
Stars: ✭ 31 (-75.2%)
Mutual labels:  embeddings, transfer-learning
Deep-Learning-Experiments-implemented-using-Google-Colab
Colab Compatible FastAI notebooks for NLP and Computer Vision Datasets
Stars: ✭ 16 (-87.2%)
Mutual labels:  embeddings, transfer-learning
Bpemb
Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
Stars: ✭ 909 (+627.2%)
Mutual labels:  embeddings, multilingual
Dogbreed gluon
kaggle Dog Breed Identification
Stars: ✭ 116 (-7.2%)
Mutual labels:  transfer-learning
Magnitude
A fast, efficient universal vector embedding utility package.
Stars: ✭ 1,394 (+1015.2%)
Mutual labels:  embeddings
Wagtailtrans
A Wagtail add-on for supporting multilingual sites
Stars: ✭ 103 (-17.6%)
Mutual labels:  multilingual
Hdc.caffe
Complete Code for "Hard-Aware-Deeply-Cascaded-Embedding"
Stars: ✭ 98 (-21.6%)
Mutual labels:  embeddings
Snca.pytorch
Improving Generalization via Scalable Neighborhood Component Analysis
Stars: ✭ 124 (-0.8%)
Mutual labels:  transfer-learning
Blade Build
Blade is a powerful build system from Tencent, supports many mainstream programming languages, such as C/C++, java, scala, python, protobuf...
Stars: ✭ 1,722 (+1277.6%)
Mutual labels:  multilingual
Awesome Embedding Models
A curated list of awesome embedding models tutorials, projects and communities.
Stars: ✭ 1,486 (+1088.8%)
Mutual labels:  embeddings
Bigcidian
Pronunciation lexicon covering both English and Chinese languages for Automatic Speech Recognition.
Stars: ✭ 99 (-20.8%)
Mutual labels:  multilingual
Dna2vec
dna2vec: Consistent vector representations of variable-length k-mers
Stars: ✭ 117 (-6.4%)
Mutual labels:  embeddings
Scikit Fusion
scikit-fusion: Data fusion via collective latent factor models
Stars: ✭ 103 (-17.6%)
Mutual labels:  embeddings
Sigir2020 peterrec
Parameter-Efficient Transfer from Sequential Behaviors for User Modeling and Recommendation
Stars: ✭ 121 (-3.2%)
Mutual labels:  transfer-learning
Fastrtext
R wrapper for fastText
Stars: ✭ 103 (-17.6%)
Mutual labels:  embeddings
Diff2vec
Reference implementation of Diffusion2Vec (Complenet 2018) built on Gensim and NetworkX.
Stars: ✭ 108 (-13.6%)
Mutual labels:  embeddings
Keras transfer cifar10
Object classification with CIFAR-10 using transfer learning
Stars: ✭ 120 (-4%)
Mutual labels:  transfer-learning
Convolutional Handwriting Gan
ScrabbleGAN: Semi-Supervised Varying Length Handwritten Text Generation (CVPR20)
Stars: ✭ 107 (-14.4%)
Mutual labels:  transfer-learning
Ml Ai Experiments
All my experiments with AI and ML
Stars: ✭ 107 (-14.4%)
Mutual labels:  embeddings

LASER embeddings


Out-of-the-box multilingual sentence embeddings.

laserembeddings maps similar sentences in any language to nearby language-agnostic embeddings

laserembeddings is a pip-packaged, production-ready port of Facebook Research's LASER (Language-Agnostic SEntence Representations) to compute multilingual sentence embeddings.

Version 1.1.1 is here! What's new?

  • An issue with PyTorch 1.7.0 was fixed (#32) 🐛 Thank you, @niklaskorz

Context

LASER is a collection of scripts and models created by Facebook Research to compute multilingual sentence embeddings for zero-shot cross-lingual transfer.

What does that mean? LASER can transform sentences into language-independent vectors: similar sentences are mapped to vectors that are close in terms of cosine distance, regardless of the input language.

That is great, especially if you don't have training sets for the language(s) you want to process: you can build a classifier on top of LASER embeddings, train it on whatever language(s) you have in your training data, and let it classify texts in any language.
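
To make this concrete, here is a minimal sketch of that zero-shot workflow. It is an illustration, not part of the package: it assumes you have a handful of labeled English sentences and that scikit-learn is installed (scikit-learn is not a dependency of laserembeddings).

from laserembeddings import Laser
from sklearn.linear_model import LogisticRegression

laser = Laser()

# train a classifier on English embeddings only
train_sentences = ['I love this movie.', 'This movie is terrible.']
train_labels = [1, 0]  # 1 = positive, 0 = negative
X_train = laser.embed_sentences(train_sentences, lang='en')
clf = LogisticRegression().fit(X_train, train_labels)

# classify a French sentence without any French training data:
# its embedding lands close to that of the similar English sentence
X_test = laser.embed_sentences(["J'adore ce film."], lang='fr')
print(clf.predict(X_test))  # likely [1]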

The aim of this package is to make LASER as easy to use and easy to deploy as possible: zero configuration, production-ready, and just two commands to get started.

👉 👉 👉 For detailed information, have a look at the amazing LASER repository, read its presentation article and its research paper. 👈 👈 👈

Getting started

Prerequisites

You'll need Python 3.6+ and PyTorch. Please refer to PyTorch installation instructions.

Installation

pip install laserembeddings

Chinese language

Chinese is not supported by default. If you need to embed Chinese sentences, please install laserembeddings with the "zh" extra. This extra includes jieba.

pip install laserembeddings[zh]

Japanese language

Japanese is not supported by default. If you need to embed Japanese sentences, please install laserembeddings with the "ja" extra. This extra includes mecab-python3 and the ipadic dictionary, which is used in the original LASER project.

If you have issues running laserembeddings on Japanese sentences, please refer to mecab-python3 documentation for troubleshooting.

pip install laserembeddings[ja]

Downloading the pre-trained models

python -m laserembeddings download-models

This will download the models to the default data directory next to the source code of the package. Use python -m laserembeddings download-models path/to/model/directory to download the models to a specific location.

Usage

from laserembeddings import Laser

laser = Laser()

# if all sentences are in the same language:

embeddings = laser.embed_sentences(
    ['let your neural network be polyglot',
     'use multilingual embeddings!'],
    lang='en')  # lang is only used for tokenization

# embeddings is a N*1024 (N = number of sentences) NumPy array

If the sentences are not in the same language, you can pass a list of language codes:

embeddings = laser.embed_sentences(
    ['I love pasta.',
     "J'adore les pâtes.",
     'Ich liebe Pasta.'],
    lang=['en', 'fr', 'de'])
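
Since similar sentences map to close vectors, you can sanity-check the result with plain NumPy by computing pairwise cosine similarities on the returned array (a small illustrative snippet, reusing the embeddings variable from above):

import numpy as np

# L2-normalize the rows; dot products of unit vectors are cosine similarities
normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
print(normalized @ normalized.T)  # off-diagonal values should be high for these translations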

If you downloaded the models into a specific directory:

from laserembeddings import Laser

path_to_bpe_codes = ...
path_to_bpe_vocab = ...
path_to_encoder = ...

laser = Laser(path_to_bpe_codes, path_to_bpe_vocab, path_to_encoder)

# you can also supply file objects instead of file paths

If you want to pull the models from S3:

from io import BytesIO, StringIO
from laserembeddings import Laser
import boto3

s3 = boto3.resource('s3')
MODELS_BUCKET = ...

f_bpe_codes = StringIO(s3.Object(MODELS_BUCKET, 'path_to_bpe_codes.fcodes').get()['Body'].read().decode('utf-8'))
f_bpe_vocab = StringIO(s3.Object(MODELS_BUCKET, 'path_to_bpe_vocabulary.fvocab').get()['Body'].read().decode('utf-8'))
f_encoder = BytesIO(s3.Object(MODELS_BUCKET, 'path_to_encoder.pt').get()['Body'].read())

laser = Laser(f_bpe_codes, f_bpe_vocab, f_encoder)
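
Note that the BPE codes and vocabulary are text files, hence the StringIO wrappers, while the encoder is a binary PyTorch checkpoint, hence BytesIO.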

What are the differences from the original implementation?

Some dependencies of the original project have been replaced with pure-python dependencies, to make this package easy to install and deploy.

Here's a summary of the differences:

| Part of the pipeline | LASER dependency (original project) | laserembeddings dependency (this package) | Reason |
|---|---|---|---|
| Normalization / tokenization | Moses | Sacremoses 0.0.35 (seemingly the closest version to the Moses version used to train the models) | Moses is implemented in Perl |
| BPE encoding | fastBPE | subword-nmt | fastBPE cannot be installed via pip and requires compiling C++ code |
| Japanese segmentation (optional) | MeCab / JapaneseTokenizer | mecab-python3 + ipadic dictionary | mecab-python3 ships wheels for major platforms (no compilation needed) |
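
As a rough illustration only (the actual preprocessing is handled internally by the package and may differ in its details), the replacement normalization / tokenization / BPE steps look approximately like this; the BPE codes file name is a placeholder:

from sacremoses import MosesPunctNormalizer, MosesTokenizer
from subword_nmt.apply_bpe import BPE

normalizer = MosesPunctNormalizer(lang='en')
tokenizer = MosesTokenizer(lang='en')

with open('93langs.fcodes', encoding='utf-8') as f:  # placeholder path to the BPE codes
    bpe = BPE(f)

normalized = normalizer.normalize('Let your neural network be polyglot!')
tokenized = tokenizer.tokenize(normalized.lower(), return_str=True)
print(bpe.process_line(tokenized))  # subword units, ready to be fed to the encoder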

Will I get the exact same embeddings?

For most languages, in most cases, yes.

Slight (and sometimes not-so-slight 🙄) differences exist for some languages, due to differences in the tokenizer implementations.

An exhaustive comparison of the embeddings produced by LASER and by laserembeddings is generated automatically and updated for each new release.
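
If you want a quick spot check of your own, comparing two embedding matrices computed for the same sentences boils down to the average row-wise cosine similarity. Here is a hypothetical helper (emb_a and emb_b are N*1024 NumPy arrays, one from LASER, one from laserembeddings):

import numpy as np

def mean_cosine_similarity(emb_a, emb_b):
    # average cosine similarity between corresponding rows of two (N, 1024) matrices
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))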

FAQ

How can I train the encoder?

You can't. LASER models are pre-trained and do not need to be fine-tuned. The embeddings are generic and perform well without fine-tuning. See https://github.com/facebookresearch/LASER/issues/3#issuecomment-404175463.

Credits

Thanks a lot to the creators of LASER for open-sourcing the code of LASER and releasing the pre-trained models. All the kudos should go to them 👏.

A big thanks to the creators of Sacremoses and Subword Neural Machine Translation for their great packages.

Testing

The first thing you'll need is Poetry. Please refer to the installation guidelines.

Clone this repository and install the project:

poetry install -E zh -E ja

To run the tests:

poetry run pytest

Testing the similarity between the embeddings computed with LASER and laserembeddings

First, install the project with the extra dependencies (Chinese and Japanese support):

poetry install -E zh -E ja

Then, download the test data:

poetry run python -m laserembeddings download-test-data

👉 If you want to know more about the contents and the generation of the test data, check out the laserembeddings-test-data repository.

Then, run the test with the SIMILARITY_TEST environment variable set to 1:

SIMILARITY_TEST=1 poetry run pytest tests/test_laser.py

Now, have a coffee ☕️ and wait for the test to finish.

The similarity report will be generated here: tests/report/comparison-with-LASER.md.
