dayyass / muse-as-service

Licence: MIT license
REST API for sentence tokenization and embedding using Multilingual Universal Sentence Encoder.

My public talk about this project at Sberloga:
Web-service for Sentence Embeddings

What is MUSE?

MUSE stands for Multilingual Universal Sentence Encoder, a multilingual extension (16 supported languages) of the Universal Sentence Encoder (USE).
The MUSE model encodes sentences into fixed-size embedding vectors.
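
Fixed-size vectors are convenient because any two sentences can be compared directly. The sketch below shows one common way to do that, cosine similarity, on toy 4-dimensional vectors that stand in for real embeddings. The function name cosine_similarity and the toy values are illustrative assumptions, not part of this project.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy stand-ins for sentence embeddings (illustrative values only).
v1 = [1.0, 0.0, 2.0, 0.0]
v2 = [2.0, 0.0, 4.0, 0.0]  # same direction as v1
v3 = [0.0, 3.0, 0.0, 0.0]  # orthogonal to v1

same_direction = cosine_similarity(v1, v2)  # close to 1.0
orthogonal = cosine_similarity(v1, v3)      # exactly 0.0
```

The same function works on rows of a NumPy array, since it only iterates over the values.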

What is MUSE as Service?

MUSE as Service is a REST API for sentence tokenization and embedding using the MUSE model from TensorFlow Hub.

It is written using Flask and Gunicorn.

Why do I need it?

The MUSE model from TensorFlow Hub requires the following packages to be installed:

  • tensorflow
  • tensorflow-hub
  • tensorflow-text

These packages take up more than 1GB of memory. The model itself takes up another 280MB.

When several projects (several virtual environments) or several teammates (several model copies on different computers) work with the MUSE model, it is more memory-efficient to deploy a single instance of the model in one environment that everyone can access.

This is what MUSE as Service is made for! ❤️

Requirements

Python >= 3.6

Installation

To install MUSE as Service run:

# clone repo (https/ssh)
git clone https://github.com/dayyass/muse-as-service.git
# git clone git@github.com:dayyass/muse-as-service.git

# install dependencies (preferably in a venv)
cd muse-as-service
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip && pip install -r requirements.txt

Before using the service you need to:

  • download the MUSE model by executing the following command:
    python models/download_muse.py

Launch the Service

To build a Docker image of the service, parametrized by the gunicorn.conf.py file, run:

docker build -t muse_as_service .

NOTE: instead of building the Docker image, you can pull it from Docker Hub.

To launch the service (either locally or on a server), run a Docker container:

docker run -d -p {host_port}:{container_port} --name muse_as_service muse_as_service

NOTE: container_port must match the port set in the gunicorn.conf.py file.
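
To make the port relationship concrete, here is a minimal sketch of what a gunicorn.conf.py could look like. The values are hypothetical and are not copied from this repository's file; bind, workers and timeout are standard Gunicorn settings.

```python
# gunicorn.conf.py (illustrative values, NOT the project's actual file)

# Address and port Gunicorn listens on inside the container.
# The port here (5000, an assumption) is the {container_port} in `docker run -p`.
bind = "0.0.0.0:5000"

# Number of worker processes. Unless preload_app is enabled, each worker
# imports the application separately, so the model is loaded once per worker.
workers = 1

# Seconds before an unresponsive worker is restarted (hypothetical value;
# loading a large model can take a while on startup).
timeout = 120
```

With a file like this, docker run -d -p 8080:5000 ... would expose the service on host port 8080.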

You can also launch the service without Docker, although running it inside a Docker container is preferable:

  • Gunicorn: gunicorn --config gunicorn.conf.py app:app (parametrized with gunicorn.conf.py file)
  • Flask: python app.py --host {host} --port {port} (default host 0.0.0.0 and port 5000)

It is also possible to launch the service using systemd.

GPU support

MUSE as Service supports GPU inference. To launch the service with GPU support you need to:

  • install the NVIDIA Container Toolkit
  • set the CUDA_VISIBLE_DEVICES environment variable to select a GPU device if needed (e.g. export CUDA_VISIBLE_DEVICES=0)
  • launch the service with the docker run command above (after docker build), adding the --gpus all parameter

NOTE: since TensorFlow 2.1, the tensorflow pip package includes GPU support, so a separate tensorflow-gpu package is not needed.

NOTE: depending on the installed CUDA version, you may need a different tensorflow version (the default, tensorflow==2.3.0, supports CUDA 10.1). See the TF/CUDA compatibility table to choose the right one and pip install it.

Usage

Since the service usually runs on a server, it is important to restrict access to it.

For this reason, MUSE as Service uses token-based authorization with JWT for users stored in the SQLite database app.db.

Initially the database has only one user with:

  • username: "admin"
  • password: "admin"

To add a new user with a username and password run:

python src/muse_as_service/database/add_user.py --username {username} --password {password}

NOTE: no passwords are stored in the database, only their hashes.
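
As an illustration of storing a hash rather than the password itself, here is a minimal sketch using only the Python standard library (salted PBKDF2). This is an assumption for illustration: the hashing scheme this project actually uses may differ, and the names hash_password, verify_password and ITERATIONS are hypothetical.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # hypothetical work factor


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest); only these two values would be stored."""
    salt = os.urandom(16) if salt is None else salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected_digest: bytes) -> bool:
    """Re-hash the candidate password and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


salt, digest = hash_password("admin")
is_valid = verify_password("admin", salt, digest)    # True
is_invalid = verify_password("wrong", salt, digest)  # False
```

Since only the salt and digest are stored, a leaked database does not directly reveal the original passwords.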

To remove a user by username run:

python src/muse_as_service/database/remove_user.py --username {username}

MUSE as Service has the following endpoints:

- /login         - POST request with `username` and `password` to get tokens (access and refresh)
- /logout        - POST request to remove tokens (access and refresh)
- /token/refresh - POST request to refresh access token (refresh token required)
- /tokenize      - GET request for `sentence` tokenization (access token required)
- /embed         - GET request for `sentence` embedding (access token required)
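
The /token/refresh endpoint lets a long-running client recover when its access token expires. The sketch below retries a GET request once after refreshing. It assumes that the service answers an expired access token with HTTP 401 and that tokens are carried by the session (for example as cookies); both are assumptions, and get_with_refresh is a hypothetical helper, not part of the project.

```python
from typing import Any, Dict


def get_with_refresh(
    session: Any, base_url: str, endpoint: str, params: Dict[str, Any]
) -> Any:
    """GET an endpoint; on 401, refresh the access token once and retry.

    `session` is expected to behave like a logged-in requests.Session.
    """
    response = session.get(url=f"{base_url}{endpoint}", params=params)
    if response.status_code == 401:  # assumed status for an expired token
        refresh = session.post(url=f"{base_url}/token/refresh")
        refresh.raise_for_status()  # refresh token invalid: give up
        response = session.get(url=f"{base_url}{endpoint}", params=params)
    response.raise_for_status()
    return response.json()
```

A call could look like get_with_refresh(session, "http://localhost:5000", "/embed", {"sentence": sentences}), where session has already posted to /login.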

You can use the Python requests package to work with HTTP requests:

import numpy as np
import requests

# params
ip = "localhost"
port = 5000

sentences = ["This is sentence example.", "This is yet another sentence example."]

# start session
session = requests.Session()

# login
response = session.post(
    url=f"http://{ip}:{port}/login",
    json={"username": "admin", "password": "admin"},
)

# tokenizer
response = session.get(
    url=f"http://{ip}:{port}/tokenize",
    params={"sentence": sentences},
)
tokenized_sentence = response.json()["tokens"]

# embedder
response = session.get(
    url=f"http://{ip}:{port}/embed",
    params={"sentence": sentences},
)
embedding = np.array(response.json()["embedding"])

# logout
response = session.post(
    url=f"http://{ip}:{port}/logout",
)

# close session
session.close()

# results
print(tokenized_sentence)  # [
# ["▁This", "▁is", "▁sentence", "▁example", "."],
# ["▁This", "▁is", "▁yet", "▁another", "▁sentence", "▁example", "."]
# ]
print(embedding.shape)  # (2, 512)

However, it is better to use the built-in client MUSEClient for sentence tokenization and embedding. It wraps the functionality of the Python requests package and provides a simpler interface.

To install the built-in client run:
pip install muse-as-service

Instead of calling the endpoints listed above directly, MUSEClient provides the following methods:

- login    - method to log in with `username` and `password`
- logout   - method to log out (login required)
- tokenize - method for `sentence` tokenization (login required)
- embed    - method for `sentence` embedding (login required)

Usage example:

from muse_as_service import MUSEClient

# params
ip = "localhost"
port = 5000

sentences = ["This is sentence example.", "This is yet another sentence example."]

# init client
client = MUSEClient(ip=ip, port=port)

# login
client.login(username="admin", password="admin")

# tokenizer
tokenized_sentence = client.tokenize(sentences)

# embedder
embedding = client.embed(sentences)

# logout
client.logout()

# results
print(tokenized_sentence)  # [
# ["▁This", "▁is", "▁sentence", "▁example", "."],
# ["▁This", "▁is", "▁yet", "▁another", "▁sentence", "▁example", "."]
# ]
print(embedding.shape)  # (2, 512)
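
Because /embed is a GET request, the sentences travel in the query string, so a very long list may run into URL length limits on some servers or proxies. One possible workaround is to send the sentences in batches. The helper below is a hypothetical sketch: embed_in_batches and the default batch size of 32 are assumptions, not part of MUSEClient.

```python
from typing import Any, Callable, List, Sequence


def embed_in_batches(
    embed_fn: Callable[[List[str]], Any],
    sentences: Sequence[str],
    batch_size: int = 32,
) -> List[Any]:
    """Call embed_fn on consecutive slices and collect one row per sentence."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    rows: List[Any] = []
    for start in range(0, len(sentences), batch_size):
        batch = list(sentences[start:start + batch_size])
        # Iterating the result yields one embedding per sentence, whether it
        # is a list of lists or a 2-D NumPy array.
        rows.extend(embed_fn(batch))
    return rows
```

With the client from the example above, this could be called as embed_in_batches(client.embed, sentences) and, if needed, converted back with np.array(...).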

Tests

To use pre-commit hooks run:
pre-commit install

Before running tests and code coverage, you need to:

  • run app.py in the background:
    python app.py &

To launch tests run:
python -m unittest discover

To measure code coverage run:
coverage run -m unittest discover && coverage report -m

NOTE: since the Flask application was launched in the background, stop it after running tests and code coverage with the following command:

kill $(ps aux | grep '[a]pp.py' | awk '{print $2}')
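
When app.py is started in the background, the tests may begin before the server accepts connections, since loading the model can take time. A small readiness check such as the sketch below can be run before the tests. wait_for_port is a hypothetical helper, and the timeout values are arbitrary.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not ready yet, retry until the deadline
    return False
```

For the Flask defaults described above, wait_for_port("localhost", 5000) would block until the service is reachable or 30 seconds have passed.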

MUSE supported languages

The MUSE model supports the following languages:

  • Arabic
  • Chinese-simplified
  • Chinese-traditional
  • Dutch
  • English
  • French
  • German
  • Italian
  • Japanese
  • Korean
  • Polish
  • Portuguese
  • Russian
  • Spanish
  • Thai
  • Turkish

Citation

If you use muse-as-service in a scientific publication, we would appreciate references to the following BibTeX entry:

@misc{dayyass2021muse,
    author       = {El-Ayyass, Dani},
    title        = {Multilingual Universal Sentence Encoder REST API},
    howpublished = {\url{https://github.com/dayyass/muse-as-service}},
    year         = {2021}
}