
OpenNMT / Tokenizer

License: MIT
Fast and customizable text tokenization library with BPE and SentencePiece support

Programming Languages

Python
C++

Projects that are alternatives of or similar to Tokenizer

Thot
Thot toolkit for statistical machine translation
Stars: ✭ 53 (-59.85%)
Mutual labels:  tokenizer, natural-language-processing, machine-translation
Stringi
THE String Processing Package for R (with ICU)
Stars: ✭ 204 (+54.55%)
Mutual labels:  natural-language-processing, unicode, icu
Py Nltools
A collection of basic python modules for spoken natural language processing
Stars: ✭ 46 (-65.15%)
Mutual labels:  tokenizer, natural-language-processing
Mtnt
Code for the collection and analysis of the MTNT dataset
Stars: ✭ 48 (-63.64%)
Mutual labels:  natural-language-processing, machine-translation
Fasttext multilingual
Multilingual word vectors in 78 languages
Stars: ✭ 1,067 (+708.33%)
Mutual labels:  natural-language-processing, machine-translation
Nlg Eval
Evaluation code for various unsupervised automated metrics for Natural Language Generation.
Stars: ✭ 822 (+522.73%)
Mutual labels:  natural-language-processing, machine-translation
String To Tree Nmt
Source code and data for the paper "Towards String-to-Tree Neural Machine Translation"
Stars: ✭ 16 (-87.88%)
Mutual labels:  natural-language-processing, machine-translation
Greynir
The greynir.is natural language processing website for Icelandic
Stars: ✭ 47 (-64.39%)
Mutual labels:  tokenizer, natural-language-processing
Opennmt Tf
Neural machine translation and sequence learning using TensorFlow
Stars: ✭ 1,223 (+826.52%)
Mutual labels:  natural-language-processing, machine-translation
Deep Learning Drizzle
Drench yourself in Deep Learning, Reinforcement Learning, Machine Learning, Computer Vision, and NLP by learning from these exciting lectures!!
Stars: ✭ 9,717 (+7261.36%)
Mutual labels:  natural-language-processing, machine-translation
Kadot
Kadot, the unsupervised natural language processing library.
Stars: ✭ 108 (-18.18%)
Mutual labels:  tokenizer, natural-language-processing
Texar Pytorch
Integrating the Best of TF into PyTorch, for Machine Learning, Natural Language Processing, and Text Generation. This is part of the CASL project: http://casl-project.ai/
Stars: ✭ 636 (+381.82%)
Mutual labels:  natural-language-processing, machine-translation
Open Korean Text
Open Korean Text Processor - An Open-source Korean Text Processor
Stars: ✭ 438 (+231.82%)
Mutual labels:  tokenizer, natural-language-processing
Icu
The new home of the ICU project source code.
Stars: ✭ 1,011 (+665.91%)
Mutual labels:  unicode, icu
Nlp Progress
Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
Stars: ✭ 19,518 (+14686.36%)
Mutual labels:  natural-language-processing, machine-translation
Bytenet Tensorflow
ByteNet for character-level language modelling
Stars: ✭ 319 (+141.67%)
Mutual labels:  natural-language-processing, machine-translation
Nonautoreggenprogress
Tracking the progress in non-autoregressive generation (translation, transcription, etc.)
Stars: ✭ 118 (-10.61%)
Mutual labels:  natural-language-processing, machine-translation
Sacremoses
Python port of Moses tokenizer, truecaser and normalizer
Stars: ✭ 293 (+121.97%)
Mutual labels:  tokenizer, machine-translation
Zhihu
This repo contains the source code for my personal column (https://zhuanlan.zhihu.com/zhaoyeyu), implemented with Python 3.6. It includes Natural Language Processing and Computer Vision projects such as text generation, machine translation, and deep convolutional GANs, along with other hands-on example code.
Stars: ✭ 3,307 (+2405.3%)
Mutual labels:  natural-language-processing, machine-translation
Comet
A Neural Framework for MT Evaluation
Stars: ✭ 58 (-56.06%)
Mutual labels:  natural-language-processing, machine-translation


Tokenizer

Tokenizer is a fast, generic, and customizable text tokenization library for C++ and Python with minimal dependencies.

Overview

By default, the Tokenizer applies a simple tokenization based on Unicode types. It can be customized in several ways:

  • Reversible tokenization
    Marking joints or spaces by annotating tokens or injecting modifier characters.
  • Subword tokenization
    Support for training and using BPE and SentencePiece models.
  • Advanced text segmentation
    Split digits, segment on case or alphabet change, segment each character of selected alphabets, etc.
  • Case management
    Lowercase text and return case information as a separate feature or inject case modifier tokens.
  • Protected sequences
    Sequences can be protected against tokenization with the special characters ⦅ and ⦆.

See the available options for an overview of supported features.
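
As an illustration, the sketch below combines a few of these options through the Python API (the option names follow the available options documentation; verify them against your installed version):

import pyonmttok

# A sketch combining several of the options listed above.
tokenizer = pyonmttok.Tokenizer(
    "aggressive",            # advanced text segmentation mode
    joiner_annotate=True,    # reversible tokenization via joiner characters
    case_markup=True,        # lowercase and inject case modifier tokens
    segment_numbers=True,    # split digits
)

tokens, _ = tokenizer.tokenize("Hello World 2021!")
print(tokens)
# Reversible tokenization: detokenization restores the original string.
print(tokenizer.detokenize(tokens))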

Using

The Tokenizer can be used from Python, C++, or the command line. Each mode exposes the same set of options.

Python API

pip install pyonmttok

>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)
>>> tokens, _ = tokenizer.tokenize("Hello World!")
>>> tokens
['Hello', 'World', '■!']
>>> tokenizer.detokenize(tokens)
'Hello World!'

See the Python API description for more details.
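
The subword support mentioned in the overview is also exposed through the Python API. The following is a minimal sketch of training and applying a BPE model with the learner interface from the Python API documentation; file names are placeholders and the exact signatures should be checked against your installed version.

import pyonmttok

# Tokenizer used both for training the subword model and for applying it.
tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)

# Train a BPE model on a plain text corpus ("train.txt" is a placeholder path).
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)
learner.ingest_file("train.txt")
bpe_tokenizer = learner.learn("bpe.model")  # returns a tokenizer configured with the new model

# Tokenize with the trained subword model.
tokens, _ = bpe_tokenizer.tokenize("Hello World!")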

C++ API

#include <string>
#include <vector>

#include <onmt/Tokenizer.h>

using namespace onmt;

int main() {
  // Same configuration as the Python example: conservative mode with joiner annotation.
  Tokenizer tokenizer(Tokenizer::Mode::Conservative, Tokenizer::Flags::JoinerAnnotate);
  std::vector<std::string> tokens;
  tokenizer.tokenize("Hello World!", tokens);
  // tokens now contains: "Hello", "World", "■!"
  return 0;
}

See the Tokenizer class for more details.
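
One possible way to compile and link this example against the libOpenNMTTokenizer library produced by the build described below (the file name and paths are placeholders; depending on how the library was built, additional libraries such as ICU may also need to be linked):

g++ -std=c++11 example.cc -o example \
    -I/path/to/Tokenizer/include -L/path/to/Tokenizer/build \
    -lOpenNMTTokenizer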

Command line clients

$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate
Hello World ■!
$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate | cli/detokenize
Hello World!

Use the -h flag to list the available options.

Development

Dependencies

Compiling

CMake and a compiler that supports the C++11 standard are required to compile the project.

git submodule update --init
mkdir build
cd build
cmake ..
make

The build produces the dynamic library libOpenNMTTokenizer and the tokenization clients in cli/.

  • To compile only the library, use the -DLIB_ONLY=ON flag.
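
For example, to build only the library without the command line clients:

cmake -DLIB_ONLY=ON ..
make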

Testing

The tests use Google Test, which is included as a Git submodule. Run the tests with:

mkdir build
cd build
cmake -DBUILD_TESTS=ON ..
make
test/onmt_tokenizer_test ../test/data