
microsoft / Lmchallenge

Licence: other
A library & tools to evaluate predictive language models.

Programming Languages

python

Projects that are alternatives of or similar to Lmchallenge

Lightnlp
A deep learning framework for natural language processing, based on PyTorch and torchtext.
Stars: ✭ 739 (+1472.34%)
Mutual labels:  language-model
Spago
Self-contained Machine Learning and Natural Language Processing library in Go
Stars: ✭ 854 (+1717.02%)
Mutual labels:  language-model
Pytorch Cpp
C++ Implementation of PyTorch Tutorials for Everyone
Stars: ✭ 1,014 (+2057.45%)
Mutual labels:  language-model
Lm Lstm Crf
Empower Sequence Labeling with Task-Aware Language Model
Stars: ✭ 778 (+1555.32%)
Mutual labels:  language-model
Spacy Transformers
🛸 Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy
Stars: ✭ 919 (+1855.32%)
Mutual labels:  language-model
Mrsr
MRSR - Matlab Recommender Systems Research is a software framework for evaluating collaborative filtering recommender systems in Matlab.
Stars: ✭ 13 (-72.34%)
Mutual labels:  evaluation
Keras Language Modeling
📖 Some language modeling tools for Keras
Stars: ✭ 666 (+1317.02%)
Mutual labels:  language-model
Gpt2 French
GPT-2 French demo
Stars: ✭ 47 (+0%)
Mutual labels:  language-model
Bert language understanding
Pre-training of Deep Bidirectional Transformers for Language Understanding: pre-train TextCNN
Stars: ✭ 933 (+1885.11%)
Mutual labels:  language-model
Boilerplate Dynet Rnn Lm
Boilerplate code for quickly getting set up to run language modeling experiments
Stars: ✭ 37 (-21.28%)
Mutual labels:  language-model
Nlg Eval
Evaluation code for various unsupervised automated metrics for Natural Language Generation.
Stars: ✭ 822 (+1648.94%)
Mutual labels:  evaluation
Humour.ai Language Model That Can Crack Jokes
Language Model that makes you Laugh.
Stars: ✭ 18 (-61.7%)
Mutual labels:  language-model
Facealignmentcompare
Empirical Study of Recent Face Alignment Methods
Stars: ✭ 15 (-68.09%)
Mutual labels:  evaluation
Pykaldi
A Python wrapper for Kaldi
Stars: ✭ 756 (+1508.51%)
Mutual labels:  language-model
Nlp Library
curated collection of papers for the nlp practitioner 📖👩‍🔬
Stars: ✭ 1,025 (+2080.85%)
Mutual labels:  language-model
Nlp chinese corpus
Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+14061.7%)
Mutual labels:  language-model
Lispy
Short and sweet LISP editing
Stars: ✭ 856 (+1721.28%)
Mutual labels:  evaluation
Django Access
Django-Access - the application introducing dynamic evaluation-based instance-level (row-level) access rights control for Django
Stars: ✭ 47 (+0%)
Mutual labels:  evaluation
Ab3dmot
(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"
Stars: ✭ 1,032 (+2095.74%)
Mutual labels:  evaluation
Benchbot
BenchBot is a tool for seamlessly testing & evaluating semantic scene understanding tools in both realistic 3D simulation & on real robots
Stars: ✭ 29 (-38.3%)
Mutual labels:  evaluation

Language Model Challenge (LM Challenge)

A library & tools to evaluate predictive language models. This is a guide for users of LM Challenge.

What is LM Challenge for?

It is hard to compare language model performance in general: some models output probabilities, others scores; some model words, others morphemes, characters or bytes; and vocabulary coverage varies. Comparing them fairly is therefore difficult, so in LM Challenge we have some very simple 'challenge games' that evaluate (and help compare) language models over a test corpus.

LM Challenge is for researchers and engineers who wish to set a standard for fair comparison of very different language model architectures. It requires a little work to wrap your model in a standard API, but we believe this is often better than writing & testing evaluation tools afresh for each project/investigation.

Note: most of the LM Challenge tools are word-based (although all can be applied to sub-word "character compositional" word models). Additionally, we assume the language model is "forward contextual": it predicts a word or character based only on the preceding words/characters.

Getting Started

Install LM Challenge from the published Python package:

pip3 install --user lmchallenge

(Or, from this repository: python3 setup.py install --user.)

Setup: LM Challenge needs a model to evaluate. We include an example ngram model implementation in sample. Download data & build models (this may take a couple of minutes):

cd sample/
./prepare.sh

Model REPL: Now you can use the example script to evaluate a very basic ngram model (see ngram.py, which you may find useful if integrating your own prediction model). Note that this command will not terminate, as it launches an interactive program:

python3 ngram.py words data/words.3gram

The program accepts commands consisting of a single word followed by a hard TAB character and any arguments, for example:

> predict<TAB>
=    0.0000    The    -1.0000    In    -2.0000...

This produces start-of-line predictions, each with an attached score. To query with word context, try the following (making sure you leave a trailing space at the end of the query, after "favourite"):

> predict<TAB>My favourite 
of    0.0000    song    -1.0000    the    -2.0000...

This provides next-word prediction based on a context. There is more to the API (see formats for more details), but since you won't usually be using the API directly, let's move on to running LM Challenge over this model (exit the predictor with Ctrl+D to return to your shell).
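The sample ngram.py already implements this interface, but to make its shape concrete, here is a minimal sketch of a predictor REPL. It handles only the predict command shown above, returns a fixed candidate list with made-up scores, and ignores everything else; the full contract (other commands, scoring conventions) is defined in the formats documentation, not here.

import sys

# Toy "model": a fixed candidate list with illustrative scores.
CANDIDATES = [("the", 0.0), ("of", -1.0), ("and", -2.0)]

def predict(context):
    # A real model would condition on `context`; this sketch ignores it.
    return CANDIDATES

for line in sys.stdin:
    # Commands are a single word, then a hard TAB, then any arguments.
    command, _, argument = line.rstrip("\n").partition("\t")
    if command == "predict":
        # Respond with tab-separated candidate/score pairs on a single line.
        print("\t".join("{}\t{:.4f}".format(w, s) for w, s in predict(argument)))
    else:
        # Other commands exist (see the formats documentation); a real
        # predictor must handle them as well.
        print()
    sys.stdout.flush()

You could point lmc run at a script like this in exactly the same way as at ngram.py, although with a fixed candidate list the resulting statistics would of course be uninteresting.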

Evaluation: To run LM Challenge for this model, we'll pipe some text into lmc run, and save the result:

mkdir out
head -n 10 data/wiki.test.tokens | lmc run "python3 ngram.py words data/words.3gram" wc > out/w3.wc.log

The resulting log contains all of the original text, and can be queried using the lmc utilities. Note: jq here is optional, but it is a very convenient program for working with JSON.

lmc stats out/w3.wc.log | jq .

You should see some statistics about the model - in particular completion & prediction. Now let's try comparing with a less powerful model:

head -n 10 data/wiki.test.tokens | lmc run "python3 ngram.py words data/words.2gram" wc > out/w2.wc.log
lmc stats out/*.wc.log | jq .

The aggregate prediction and completion stats should be slightly different for the two models. But we can get a better picture by inspecting the logs in detail:

lmc pretty out/w3.wc.log

This shows a pretty-printed dump of the data, formatted according to how well the model performed on each token. We can also pretty-print the difference between two models:

lmc diff out/w3.wc.log out/w2.wc.log

Filter the log for only capitalized words, and print summary statistics:

lmc grep "^[A-Z][a-z]+$" out/w3.wc.log | lmc stats | jq .

You should notice that capitalized words are (in this small, statistically insignificant example) much harder to predict than words in general.
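All of these lmc utilities work over the same JSON-lines log, so you can also poke at it directly from Python (or any other JSON-aware tool). The sketch below deliberately assumes nothing about the field names; it just parses each line and reports which keys the first record carries (see the data format notes for what they mean):

import json

# Each line of an LM Challenge log is a single JSON object (JSON lines).
with open("out/w3.wc.log") as log_file:
    records = [json.loads(line) for line in log_file]

print("log entries:", len(records))
print("fields in the first record:", sorted(records[0].keys()))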

Other challenges: The other LM Challenge games can be run & inspected in a similar way; see lmc run --help.

Running LM Challenge

LM Challenge is quite flexible - it can be used in a variety of ways:

  1. Command Line Interface
  2. Python API
  3. Log file format

1. Command Line Interface

This is the simplest way of using LM Challenge, and it works if your model is implemented in any language that supports piped stdin/stdout. See the Getting Started guide above, and the CLI help:

lmc --help
lmc run --help

2. Python API

If your model runs in Python 3, and you wish to script evaluation in Python, you can use the API directly:

import lmchallenge as lmc
help(lmc)

Our documentation (as in help(lmc)) includes a tutorial for getting started with Python. We don't yet publish the HTML, but it has been tested with pdoc:

$ pdoc --http
# use your browser to view generated documentation
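Beyond help(lmc), the one Python entry point this README names explicitly is the log validator, lmchallenge.validate.validate (used under "Log file format" below). Its signature is not documented here, so the safest sketch is simply to ask Python for it:

import lmchallenge.validate

# Signature and docstring of the validator referenced in the
# "Log file format" section; consult these before calling it.
help(lmchallenge.validate.validate)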

3. Log file format

If you require batching or distribution for sufficient evaluation speed, you can write the LM Challenge log files yourself. This means you can use LM Challenge to process & analyse the data, without imposing a particular execution model. To do this:

  1. Write JSONlines files that contain lmchallenge log data (a minimal writer is sketched after this list):
    • See data formats notes that describe the log format.
    • (Optionally) use the JSON schema that formally describes an acceptable log datum.
    • (Optionally) use the CLI lmc validate (or Python API lmchallenge.validate.validate) to check that your log conforms to the schema.
    • Note that log files can often be concatenated if they were generated in parallel.
  2. Use the lmchallenge tools to analyse the logs (everything except lmc run).
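As a starting point for step 1, writing JSON lines is just one json.dumps call per record. The field names in this sketch are placeholders only; the real required fields are defined by the data format notes and the JSON schema, so validate the output before analysing it.

import json

# Placeholder records: these keys are NOT the real log schema -- replace
# them with the fields required by the LM Challenge data format notes.
records = [
    {"placeholder_field": "value", "another_placeholder": 123},
]

with open("out/my_model.wc.log", "w") as log_file:
    for record in records:
        # One JSON object per line, no pretty-printing.
        log_file.write(json.dumps(record) + "\n")

# Then check the result against the schema, either with the CLI
# ("lmc validate", see its --help) or with lmchallenge.validate.validate
# from Python.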

The details

An LM challenge game is a runnable Python module that evaluates one or more language models on some task, over some test text.

The challenge games we have are:

  • wc - Word Completion Challenge - a Next Word Prediction / Completion task (generates Hit@N & completion ratios)
  • we|ce - Word|Character Entropy Challenges - a language probability distribution task (generates cross entropy given a defined vocabulary)
  • wr - Word Reranking Challenge - a correction task (generates accuracy)

Test text is pure text data (as typed & understood by real actual humans!). LM Challenge does not define test text - we expect it to be provided; it is the other thing you need to decide on in order to evaluate a language model.

A language model is an executable process that responds to commands from an LM Challenge game in a specific text format; it usually comprises a pre-trained model of the same language as the test text.

Word Completion wc

The Word Completion task scans through words in the test text, at each point querying the language model for next-word predictions & word completions.

cat DATA | lmc run "PREDICTOR" wc > LOG

The model should aim to predict the correct next word before other words (i.e. with as low a rank as possible) or, failing that, to predict it in the top two completions given as short a typed prefix as possible. Statistics available from wc include the following (a small worked example follows below):

  • next-word-prediction
    • Hit@N - ratio of correct predictions obtained with rank below N
    • MRR (Mean Reciprocal Rank) - the mean of 1/rank over all words
  • completion
    • characters - ratio of characters that were completed (e.g. if typing "hello" and it is predicted after you type "he", 3 of the 5 characters are completed, for a ratio of 0.6)
    • tokens - ratio of tokens that were completed before they were fully typed

Note that the flag --next-word-only may be used to speed up evaluation, by skipping all prefixes, only evaluating the model's next-word-prediction performance (so that completion stats are not generated).
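To make the Hit@N and MRR definitions above concrete, here is the arithmetic on a toy list of prediction ranks (1 = the correct word was the top prediction, None = it was never predicted). This is only an illustration of the definitions, not the lmc implementation:

# Rank of the correct next word for five tokens.
ranks = [1, 3, None, 2, 1]

def hit_at(n, ranks):
    # Ratio of tokens whose correct word appeared within the top n predictions.
    return sum(1 for r in ranks if r is not None and r <= n) / len(ranks)

def mrr(ranks):
    # Mean of 1/rank, counting never-predicted tokens as 0.
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

print("Hit@1:", hit_at(1, ranks))   # 2/5 = 0.4
print("Hit@3:", hit_at(3, ranks))   # 4/5 = 0.8
print("MRR:  ", mrr(ranks))         # (1 + 1/3 + 0 + 1/2 + 1) / 5 ≈ 0.57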

Word/Character Entropy we|ce

The Word/Character Entropy task produces stats that are analogous to the standard cross-entropy/perplexity measures used for evaluating language models. These evaluators scan through text, at each point querying the language model for a normalized log-probability for the current word.

cat DATA | lmc run "PREDICTOR" we > LOG
cat DATA | lmc run "PREDICTOR" ce > LOG

It is important to note that the entropy metric can only be compared between models that share a common vocabulary: if the vocabularies differ, the entropy tasks differ, and the models should not be compared. A model must therefore generate a "fair" normalized log-probability over its vocabulary (and, if a word is not in the vocabulary, omit the score from the results). It should not merge "equivalence classes" of words (except by general agreement with every other model being evaluated). Examples of this would be normalizing capitalization to give "fish" the same score as "Fish", or giving many words a shared "out of vocabulary" score (such that, if you were to calculate p("fish") + p("Fish") + p(everything else), it would not sum to one). Simply omitting any words that cannot be scored (e.g. OOV words) is safe, as this contributes to a special "entropy fingerprint" which checks that two models successfully scored the same set of words and are therefore comparable under the entropy metric.
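As a rough sketch of the arithmetic (assuming base-2 logs and a simple per-token mean; the exact normalisation and the real fingerprint construction belong to the lmc implementation, not this example), comparing two models might look like this:

import math

# Probabilities each model assigned to the words it could score, keyed by
# token position; unscored (e.g. OOV) words are simply omitted.
scored_a = {0: 0.10, 1: 0.02, 3: 0.25}
scored_b = {0: 0.08, 1: 0.05, 3: 0.30}

def cross_entropy(scored):
    # Mean of -log2(p) over the words the model scored.
    return sum(-math.log2(p) for p in scored.values()) / len(scored)

# Fingerprint idea: only compare models that scored exactly the same words.
assert scored_a.keys() == scored_b.keys(), "models scored different word sets"

print("model A entropy:", cross_entropy(scored_a))
print("model B entropy:", cross_entropy(scored_b))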

Word Reranking wr

The Word Reranking task emulates a sloppy typist entering text, using the language model to correct the input after it has been typed. This challenge requires a list of words to use as correction candidates for corrupted words (this should be a large set of valid words in the target language). Text from the data source is first corrupted (as if by the sloppy typist); the corrupted text is then fed into a search for nearby candidate words, which are scored according to the language model under evaluation. The evaluator measures corrected, un-corrected and mis-corrected results.

cat DATA | lmc run "PREDICTOR" wr VOCAB > LOG

The aim of the model is to assign a high score to the correct word and low scores to all other words. We evaluate this by mixing the score from the language model with an input score for each word (using a minimum score for words that are not scored by the language model), then ranking based on the combined score. If the top-ranked prediction is the correct word, the example counts as a success; otherwise it counts as a failure. The input score is the log-probability of the particular corrupted text being produced from this word, under the same error model that was used to corrupt the true word. In order to be robust against different ranges of scores from language models, we optimize the input and language model mixing parameters before counting statistics (this is done automatically, but requires the optional dependency scipy). The accuracy aggregate measures the maximum proportion of correct top predictions, using the optimum mixing proportions.
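As a toy illustration of that ranking step (the linear mixing below, and its weights, are assumptions made for the sketch; the real parameterisation and its optimisation are handled inside lmc, with scipy):

# Candidate corrections for one corrupted input. Scores are made up.
candidates = [
    # (word, input_score, lm_score) -- lm_score None means "not scored".
    ("fish", -1.2, -4.0),   # the intended word
    ("fist", -0.8, -9.5),   # closer to the corrupted text, unlikely in context
    ("wish", -2.0, -5.0),
    ("fisk", -1.5, None),
]

LM_WEIGHT = 0.5        # illustrative mixing parameter; lmc optimises this
MIN_LM_SCORE = -20.0   # stand-in for words the language model cannot score

def combined(candidate):
    word, input_score, lm_score = candidate
    if lm_score is None:
        lm_score = MIN_LM_SCORE
    return input_score + LM_WEIGHT * lm_score

best = max(candidates, key=combined)
print("top-ranked correction:", best[0])   # a success if it is the true word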

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
