
simonepri / lm-scorer

License: MIT License
📃 Language Model based sentences scoring library

Programming Languages

python
139,335 projects - #7 most used programming language

Projects that are alternatives of or similar to lm-scorer

Lingvo
Stars: ✭ 2,361 (+794.32%)
Mutual labels:  lm, language-model
python-arpa
🐍 Python library for n-gram models in ARPA format
Stars: ✭ 35 (-86.74%)
Mutual labels:  lm, language-model
osdg-tool
OSDG is an open-source tool that maps and connects activities to the UN Sustainable Development Goals (SDGs) by identifying SDG-relevant content in any text. The tool is available online at www.osdg.ai. API access available for research purposes.
Stars: ✭ 22 (-91.67%)
Mutual labels:  ml
leetspeek
Open and collaborative content from leet hackers!
Stars: ✭ 11 (-95.83%)
Mutual labels:  ml
SENet-for-Weakly-Supervised-Relation-Extraction
No description or website provided.
Stars: ✭ 39 (-85.23%)
Mutual labels:  ml
industrial-ml-datasets
A curated list of datasets publicly available for machine learning research in the area of manufacturing
Stars: ✭ 45 (-82.95%)
Mutual labels:  ml
pypmml
Python PMML scoring library
Stars: ✭ 65 (-75.38%)
Mutual labels:  ml
KB-ALBERT
A Korean ALBERT model specialized for the economics/finance domain, provided by KB Kookmin Bank (KB국민은행)
Stars: ✭ 215 (-18.56%)
Mutual labels:  language-model
gap-text2sql
GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training
Stars: ✭ 83 (-68.56%)
Mutual labels:  language-model
CharLM
Character-aware Neural Language Model implemented in PyTorch
Stars: ✭ 32 (-87.88%)
Mutual labels:  language-model
community
README for Rekcurd projects
Stars: ✭ 16 (-93.94%)
Mutual labels:  ml
calm
Context Aware Language Models
Stars: ✭ 29 (-89.02%)
Mutual labels:  language-model
VickyBytes
Subscribe to this GitHub repo to access the latest tech talks, tech demos, learning materials & modules, and developer community updates!
Stars: ✭ 48 (-81.82%)
Mutual labels:  ml
personality-prediction
Experiments for automated personality detection using Language Models and psycholinguistic features on various famous personality datasets including the Essays dataset (Big-Five)
Stars: ✭ 109 (-58.71%)
Mutual labels:  language-model
deprecated-coalton-prototype
Coalton is (supposed to be) a dialect of ML embedded in Common Lisp.
Stars: ✭ 209 (-20.83%)
Mutual labels:  ml
DevSoc21
Official website for DEVSOC 21, our annual flagship hackathon.
Stars: ✭ 15 (-94.32%)
Mutual labels:  ml
asr24
24-hour Automatic Speech Recognition
Stars: ✭ 27 (-89.77%)
Mutual labels:  language-model
gym-rs
OpenAI's Gym written in pure Rust for blazingly fast performance
Stars: ✭ 34 (-87.12%)
Mutual labels:  ml
TrackMania AI
Racing game AI
Stars: ✭ 65 (-75.38%)
Mutual labels:  ml
card-scanner-flutter
A Flutter package for fast, accurate, and secure credit card & debit card scanning
Stars: ✭ 82 (-68.94%)
Mutual labels:  ml

lm-scorer

[Badges: PyPI version · Open in Colab · lint status · macOS/Ubuntu test status · code style · linter · type checker · test runner · task runner · build tool · project license]

📃 Language Model based sentences scoring library

Synopsis

This package provides a simple programming interface to score sentences using different ML language models.

A simple CLI is also available for quick prototyping.
You can run it locally or directly on Colab using this notebook.

Do you believe that this is useful? Has it saved you time? Or maybe you simply like it?
If so, support this work with a Star ⭐️.

Install

pip install lm-scorer

Usage

import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer

# Available models
list(LMScorer.supported_model_names())
# => ["gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl", distilgpt2"]

# Load model to cpu or cuda
device = "cuda:0" if torch.cuda.is_available() else "cpu"
batch_size = 1
scorer = LMScorer.from_pretrained("gpt2", device=device, batch_size=batch_size)

# Return token probabilities (provide log=True to return log probabilities)
scorer.tokens_score("I like this package.")
# => (scores, ids, tokens)
# scores = [0.018321, 0.0066431, 0.080633, 0.00060745, 0.27772, 0.0036381]
# ids    = [40,       588,       428,      5301,       13,      50256]
# tokens = ["I",      "Ġlike",   "Ġthis",  "Ġpackage", ".",     "<|endoftext|>"]

# Compute sentence score as the product of tokens' probabilities
scorer.sentence_score("I like this package.", reduce="prod")
# => 6.0231e-12

# Compute sentence score as the mean of tokens' probabilities
scorer.sentence_score("I like this package.", reduce="mean")
# => 0.064593

# Compute sentence score as the geometric mean of tokens' probabilities
scorer.sentence_score("I like this package.", reduce="gmean")
# => 0.013489

# Compute sentence score as the harmonic mean of tokens' probabilities
scorer.sentence_score("I like this package.", reduce="hmean")
# => 0.0028008

# Get the log of the sentence score.
scorer.sentence_score("I like this package.", log=True)
# => -25.835

# Score multiple sentences.
scorer.sentence_score(["Sentence 1", "Sentence 2"])
# => [1.1508e-11, 5.6645e-12]

# NB: Computations are done in log space so they should be numerically stable.
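
# For reference, the reduce strategies above can be reproduced from the token
# probabilities returned by tokens_score. The following is a minimal illustrative
# sketch in plain Python (not part of the library's API), using the rounded scores
# from the example above, so the results only approximately match the outputs shown.

import math

# Token probabilities from the tokens_score example above (rounded).
probs = [0.018321, 0.0066431, 0.080633, 0.00060745, 0.27772, 0.0036381]
n = len(probs)

prod = math.exp(sum(math.log(p) for p in probs))       # ≈ 6.0231e-12, reduce="prod"
mean = sum(probs) / n                                   # ≈ 0.064593,  reduce="mean"
gmean = math.exp(sum(math.log(p) for p in probs) / n)   # ≈ 0.013489,  reduce="gmean"
hmean = n / sum(1 / p for p in probs)                   # ≈ 0.0028008, reduce="hmean"

log_prod = sum(math.log(p) for p in probs)              # ≈ -25.835, i.e. log=True with reduce="prod"

# Working with sums of log probabilities, as in log_prod above, is what keeps the
# computation numerically stable for long sentences.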

CLI

[lm-scorer CLI demo]

The pip package includes a CLI that you can use to score sentences.

usage: lm-scorer [-h] [--model-name MODEL_NAME] [--tokens] [--log-prob]
                 [--reduce REDUCE] [--batch-size BATCH_SIZE]
                 [--significant-figures SIGNIFICANT_FIGURES] [--cuda CUDA]
                 [--debug]
                 sentences-file-path

Get sentences probability using a language model.

positional arguments:
  sentences-file-path   A file containing sentences to score, one per line. If
                        - is given as filename it reads from stdin instead.

optional arguments:
  -h, --help            show this help message and exit
  --model-name MODEL_NAME, -m MODEL_NAME
                        The pretrained language model to use. Can be one of:
                        gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2.
  --tokens, -t          If provided it provides the probability of each token
                        of each sentence.
  --log-prob, -lp       If provided log probabilities are returned instead.
  --reduce REDUCE, -r REDUCE
                        Reduce strategy applied on token probabilities to get
                        the sentence score. Available strategies are: prod,
                        mean, gmean, hmean.
  --batch-size BATCH_SIZE, -b BATCH_SIZE
                        Number of sentences to process in parallel.
  --significant-figures SIGNIFICANT_FIGURES, -sf SIGNIFICANT_FIGURES
                        Number of significant figures to use when printing
                        numbers.
  --cuda CUDA           If provided it runs the model on the given cuda
                        device.
  --debug               If provided it provides additional logging in case of
                        errors.
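
For example, assuming a plain-text file named sentences.txt with one sentence per line (the file name is only illustrative), the CLI could be invoked as follows; the exact output format depends on the installed version:

# Score the sentences in a file using the default model
lm-scorer sentences.txt

# Read from stdin, print per-token probabilities, and reduce with the geometric mean
echo "I like this package." | lm-scorer --tokens --reduce gmean -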

Development

You can install this library locally for development using the commands below. If you don't have it already, you need to install poetry first.

# Clone the repo
git clone https://github.com/simonepri/lm-scorer
# CD into the created folder
cd lm-scorer
# Create a virtualenv and install the required dependencies using poetry
poetry install

You can then run commands inside the virtualenv by using poetry run COMMAND.
Alternatively, you can open a shell inside the virtualenv using poetry shell.

If you wish to contribute to this project, run the following commands locally before opening a PR and check that no error is reported (warnings are fine).

# Run the code formatter
poetry run task format
# Run the linter
poetry run task lint
# Run the static type checker
poetry run task types
# Run the tests
poetry run task test

Authors

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License - see the license file for details.
