All Projects → VKCOM → Youtokentome

VKCOM / Youtokentome

Licence: mit
Unsupervised text tokenizer focused on computational efficiency

Projects that are alternatives of or similar to Youtokentome

Vncorenlp
A Vietnamese natural language processing toolkit (NAACL 2018)
Stars: ✭ 354 (-51.37%)
Mutual labels:  natural-language-processing, word-segmentation
Toiro
A comparison tool of Japanese tokenizers
Stars: ✭ 95 (-86.95%)
Mutual labels:  natural-language-processing, word-segmentation
Sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
Stars: ✭ 5,540 (+660.99%)
Mutual labels:  natural-language-processing, word-segmentation
Pycantonese
Cantonese Linguistics and NLP in Python
Stars: ✭ 147 (-79.81%)
Mutual labels:  natural-language-processing, word-segmentation
Pythainlp
Thai Natural Language Processing in Python.
Stars: ✭ 582 (-20.05%)
Mutual labels:  natural-language-processing, word-segmentation
Awesome Relation Extraction
📖 A curated list of awesome resources dedicated to Relation Extraction, one of the most important tasks in Natural Language Processing (NLP).
Stars: ✭ 656 (-9.89%)
Mutual labels:  natural-language-processing
Bertsearch
Elasticsearch with BERT for advanced document search.
Stars: ✭ 684 (-6.04%)
Mutual labels:  natural-language-processing
Conv Emotion
This repo contains implementation of different architectures for emotion recognition in conversations.
Stars: ✭ 646 (-11.26%)
Mutual labels:  natural-language-processing
This Word Does Not Exist
This Word Does Not Exist
Stars: ✭ 640 (-12.09%)
Mutual labels:  natural-language-processing
Text2vec
Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
Stars: ✭ 715 (-1.79%)
Mutual labels:  natural-language-processing
Ai Series
📚 [.md & .ipynb] Series of Artificial Intelligence & Deep Learning, including Mathematics Fundamentals, Python Practices, NLP Application, etc. 💫 人工智能与深度学习实战,数理统计篇 | 机器学习篇 | 深度学习篇 | 自然语言处理篇 | 工具实践 Scikit & Tensoflow & PyTorch 篇 | 行业应用 & 课程笔记
Stars: ✭ 702 (-3.57%)
Mutual labels:  natural-language-processing
Nltk data
NLTK Data
Stars: ✭ 675 (-7.28%)
Mutual labels:  natural-language-processing
Dl Nlp Readings
My Reading Lists of Deep Learning and Natural Language Processing
Stars: ✭ 656 (-9.89%)
Mutual labels:  natural-language-processing
Madewithml
Learn how to responsibly deliver value with ML.
Stars: ✭ 29,253 (+3918.27%)
Mutual labels:  natural-language-processing
Quanteda
An R package for the Quantitative Analysis of Textual Data
Stars: ✭ 647 (-11.13%)
Mutual labels:  natural-language-processing
Machine Learning
머신러닝 입문자 혹은 스터디를 준비하시는 분들에게 도움이 되고자 만든 repository입니다. (This repository is intented for helping whom are interested in machine learning study)
Stars: ✭ 705 (-3.16%)
Mutual labels:  natural-language-processing
Nlp Recipes
Natural Language Processing Best Practices & Examples
Stars: ✭ 5,783 (+694.37%)
Mutual labels:  natural-language-processing
Pplm
Plug and Play Language Model implementation. Allows to steer topic and attributes of GPT-2 models.
Stars: ✭ 674 (-7.42%)
Mutual labels:  natural-language-processing
Keras Attention
Visualizing RNNs using the attention mechanism
Stars: ✭ 697 (-4.26%)
Mutual labels:  natural-language-processing
Language Detection
A language detection library for PHP. Detects the language from a given text string.
Stars: ✭ 665 (-8.65%)
Mutual labels:  natural-language-processing

PyPI Downloads Code style: black GitHub Build Status

YouTokenToMe

YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [Sennrich et al.]. Our implementation is much faster in training and tokenization than Hugging Face, fastBPE and SentencePiece. In some test cases, it is 90 times faster. Check out our benchmark results.

Key advantages:

  • Multithreading for training and tokenization
  • The algorithm has O(N) complexity, where N is the length of training data
  • Highly efficient implementation in C++
  • Python wrapper and command-line interface

Extra features:

As well as in the algorithm from the original paper, ours does not consider tokens that cross word boundaries. Just like in SentencePiece, all space symbols were replaced by meta symbol "▁" (U+2581). It allows sequences of tokens to be converted back to text and for word boundaries to be restored.

For example, the phrase Blazingly fast tokenization! can be tokenized into

['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']

Installation

pip install youtokentome

Python interface

Example

Let's start with a self-contained example.

import random

import youtokentome as yttm

train_data_path = "train_data.txt"
model_path = "example.model"

# Generating random file with training data
# 10000 lines with 100 characters in each line
n_lines = 10000
n_characters = 100
with open(train_data_path, "w") as fout:
    for _ in range(n_lines):
        print("".join([random.choice("abcd ") for _ in range(n_characters)]), file=fout)

# Generating random text
test_text = "".join([random.choice("abcde ") for _ in range(100)])

# Training model
yttm.BPE.train(data=train_data_path, vocab_size=5000, model=model_path)

# Loading model
bpe = yttm.BPE(model=model_path)

# Two types of tokenization
print(bpe.encode([test_text], output_type=yttm.OutputType.ID))
print(bpe.encode([test_text], output_type=yttm.OutputType.SUBWORD))

 

Training model

youtokentome.BPE.train(data, model, vocab_size, coverage, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)

Trains BPE model and saves to file.

Args:

  • data: string, path to file with training data
  • model: string, path to where the trained model will be saved
  • vocab_size: int, number of tokens in the final vocabulary
  • coverage: float, fraction of characters covered by the model. Must be in the range [0, 1]. A good value to use is about 0.9999.
  • n_threads: int, number of parallel threads used to run. If -1 is passed, then all available threads are going to be used. Note that the number of threads is limited by 8 (see benchmark).
  • pad_id: int, reserved id for padding
  • unk_id: int, reserved id for unknown symbols
  • bos_id: int, reserved id for begin of sentence token
  • eos_id: int, reserved id for end of sentence token

Returns: Class youtokentome.BPE with the loaded model.

 

Model loading

youtokentome.BPE(model, n_threads=-1)

Class constructor. Loads the trained model.

  • model: string, path to the trained model
  • n_threads: int, number of parallel threads used to run. If equal to -1, then the maximum number of threads available will be used.

 

Methods

Class youtokentome.BPE has the following methods:

encode

encode(self, sentences, output_type=yttm.OutputType.ID, bos=False, eos=False, reverse=False, dropout_prob=0)

Args:

  • sentences: list of strings, sentences for tokenization.
  • output_type: enum, sentence can be tokenized to ids or subwords. Use OutputType.ID for ids and OutputType.SUBWORD for subwords.
  • bos: bool, if True then token “beginning of sentence” will be added
  • eos: bool, if True then token “end of sentence” will be added
  • reverse: bool, if True the output sequence of tokens will be reversed
  • dropout_prob: float, BPE-dropout probability (the probability of a merge being dropped). Must be in the range [0, 1].

Returns: If output_type is equal to youtokentome.OutputType.ID or youtokentome.OutputType.SUBWORD then a list of lists of integers or list of lists of strings will be returned respectively.

 

vocab

vocab(self)

Returns: A list vocab_size strings. The i-th string in the list corresponds to i-th subword.

 

vocab_size

vocab_size(self)

Returns: int. Size of vocabulary.

 

subword_to_id

subword_to_id(self, subword)

Args:

  • subword: string.

Returns: Integer from the range [0, vocab_size-1]. Id of subword or, if there is no such subword in the vocabulary, unk_id will be returned.

 

id_to_subword

id_to_subword(self, id)

Args:

  • id: int, must be in the range [0, vocab_size-1]

Returns: string. Subword from vocabulary by id.

 

decode

decode(self, ids, ignore_ids=None)

Convert each id to subword and concatenate with space symbol.

Args:

  • ids: list of lists of integers. All integers must be in the range [0, vocab_size-1]
  • ignore_ids: collection of integers. These indices would be ignored during the decoding. All integers must be in the range [0, vocab_size-1] [default: None]

Returns: List of strings.

Command line interface

Example

$ yttm bpe --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000
$ yttm encode --model OUTPUT_MODEL_FILE --output_type subword < TEST_DATA_FILE > ENCODED_DATA 

Supported commands

YouTokenToMe supports the following commands:

$ yttm --help

Usage: yttm [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  bpe     Train BPE model.
  decode  Decode ids to text.
  encode  Encode text to ids or subwords.
  vocab   Print list of learned subwords.

Command bpe allows you to train Byte Pair Encoding model based on a text file.

$ yttm bpe --help

Usage: yttm bpe [OPTIONS]

  Train BPE model.

Options:
  --data PATH           Training data file path.  [required]
  --model PATH          Output model file path.  [required]
  --vocab_size INTEGER  Number of tokens in the final vocabulary.  [required]
  --coverage FLOAT      Fraction of characters covered by the model.  [default: 1.0]
  --n_threads INTEGER   Number of threads.  [default: -1]
  --pad_id INTEGER      Padding token id.  [default: 0]
  --unk_id INTEGER      Unknown token id.  [default: 1]
  --bos_id INTEGER      'Begin of sentence' token id.  [default: 2]
  --eos_id INTEGER      'End of sentence' token id.  [default: 3]
  --help                Show this message and exit.

Apply BPE encoding for a corpus of sentences. Use stdin for input and stdout for output.

By default, encoding works in parallel using n_threads threads. Number of threads is limited by 8 (see benchmark).

With the --stream option, --n_threads will be ignored and all sentences will be processed one by one. Each sentence will be tokenized and written to the stdout before the next sentence is read.

$ yttm encode --help

Usage: yttm encode [OPTIONS]

  Encode text to ids or subwords.

Options:
  --model PATH         Path to file with learned model.  [required]
  --output_type TEXT   'id' or 'subword'.  [required]
  --n_threads INTEGER  Number of threads.  [default: -1]
  --bos                Add tab 'begin of sentence'.
  --eos                Add tab 'end of sentence'.
  --reverse            Reverse output sequence of tokens.
  --stream             Process each line before reading the next one.
  --dropout_prob       BPE-dropout probability (the probability of a merge being dropped). [default: 0]
  --help               Show this message and exit.

Print vocabulary. This can be useful for understanding the model.

$ yttm vocab --help

Usage: yttm vocab [OPTIONS]

  Print list of learned subwords.

Options:
  --model PATH  Path to file with learned model.  [required]
  --verbose     Add merging rules.
  --help        Show this message and exit.

Convert ids back to text. Use stdin for input and stdout for output.

$ yttm decode --help

Usage: yttm decode [OPTIONS]

  Decode ids to text.

Options:
  --model PATH  Path to file with learned model.  [required]
  --ignore_ids  List of indices to ignore for decoding. Example: --ignore_ids=1,2,3
  --help        Show this message and exit.
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].