
alvations / Sacremoses

License: MIT
Python port of Moses tokenizer, truecaser and normalizer

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Sacremoses

Thot
Thot toolkit for statistical machine translation
Stars: ✭ 53 (-81.91%)
Mutual labels:  tokenizer, machine-translation
Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Stars: ✭ 132 (-54.95%)
Mutual labels:  tokenizer, machine-translation
ilmulti
Tooling to play around with multilingual machine translation for Indian Languages.
Stars: ✭ 19 (-93.52%)
Mutual labels:  tokenizer, machine-translation
NLP Toolkit
Library of state-of-the-art models (PyTorch) for NLP tasks
Stars: ✭ 92 (-68.6%)
Mutual labels:  machine-translation
simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Stars: ✭ 32 (-89.08%)
Mutual labels:  tokenizer
Hebrew-Tokenizer
A very simple python tokenizer for Hebrew text.
Stars: ✭ 16 (-94.54%)
Mutual labels:  tokenizer
Jumanpp
Juman++ (a Morphological Analyzer Toolkit)
Stars: ✭ 254 (-13.31%)
Mutual labels:  tokenizer
tokenizer
Tokenize CSS according to the CSS Syntax
Stars: ✭ 52 (-82.25%)
Mutual labels:  tokenizer
ArabicProcessingCog
A Python package that does stemming, tokenization, sentence breaking, segmentation, normalization, and POS tagging for the Arabic language.
Stars: ✭ 19 (-93.52%)
Mutual labels:  tokenizer
cang-jie
Chinese tokenizer for tantivy, based on jieba-rs
Stars: ✭ 48 (-83.62%)
Mutual labels:  tokenizer
PaddleTokenizer
A Chinese tokenizer (word segmentation engine) based on deep neural networks, implemented with PaddlePaddle
Stars: ✭ 14 (-95.22%)
Mutual labels:  tokenizer
bredon
A modern CSS value compiler in JavaScript
Stars: ✭ 39 (-86.69%)
Mutual labels:  tokenizer
Attention-Visualization
Visualization for simple attention and Google's multi-head attention.
Stars: ✭ 54 (-81.57%)
Mutual labels:  machine-translation
mystem-scala
Morphological analyzer `mystem` (Russian language) wrapper for JVM languages
Stars: ✭ 21 (-92.83%)
Mutual labels:  tokenizer
SpeechTransProgress
Tracking the progress in end-to-end speech translation
Stars: ✭ 139 (-52.56%)
Mutual labels:  machine-translation
transformer-pytorch
A PyTorch implementation of Transformer in "Attention is All You Need"
Stars: ✭ 77 (-73.72%)
Mutual labels:  machine-translation
banglanmt
This repository contains the code and data of the paper titled "Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation" published in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 16 - November 20, 2020.
Stars: ✭ 91 (-68.94%)
Mutual labels:  machine-translation
text2text
Text2Text: Cross-lingual natural language processing and generation toolkit
Stars: ✭ 188 (-35.84%)
Mutual labels:  tokenizer
Natural-Language-Processing
Contains various architectures and novel paper implementations for Natural Language Processing tasks like Sequence Modelling and Neural Machine Translation.
Stars: ✭ 48 (-83.62%)
Mutual labels:  machine-translation
Transformer
A Pytorch Implementation of "Attention is All You Need" and "Weighted Transformer Network for Machine Translation"
Stars: ✭ 271 (-7.51%)
Mutual labels:  machine-translation

Sacremoses

Build Status

License

MIT License.

Install

pip install -U sacremoses

NOTE: Sacremoses only supports Python 3 now (sacremoses>=0.0.41). If you're using Python 2, the last possible version is sacremoses==0.0.40.

Usage (Python)

Tokenizer and Detokenizer

>>> from sacremoses import MosesTokenizer, MosesDetokenizer
>>> mt = MosesTokenizer(lang='en')
>>> text = u'This, is a sentence with weird\xbb symbols\u2026 appearing everywhere\xbf'
>>> expected_tokenized = u'This , is a sentence with weird \xbb symbols \u2026 appearing everywhere \xbf'
>>> tokenized_text = mt.tokenize(text, return_str=True)
>>> tokenized_text == expected_tokenized
True


>>> mt, md = MosesTokenizer(lang='en'), MosesDetokenizer(lang='en')
>>> sent = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [ ] & You're gonna shake it off? Don't?"
>>> expected_tokens = [u'This', u'ain', u'&apos;t', u'funny', u'.', u'It', u'&apos;s', u'actually', u'hillarious', u',', u'yet', u'double', u'Ls', u'.', u'&#124;', u'&#91;', u'&#93;', u'&lt;', u'&gt;', u'&#91;', u'&#93;', u'&amp;', u'You', u'&apos;re', u'gonna', u'shake', u'it', u'off', u'?', u'Don', u'&apos;t', u'?']
>>> expected_detokens = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [] & You're gonna shake it off? Don't?"
>>> mt.tokenize(sent) == expected_tokens
True
>>> md.detokenize(expected_tokens) == expected_detokens
True
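
The &apos;, &#91;, &amp; etc. in expected_tokens come from XML escaping, which the tokenizer applies by default. A minimal sketch of turning it off, assuming an `escape` keyword argument on `tokenize` (not shown in the original examples above):

>>> mt.tokenize("It's a [test] & more.", escape=False, return_str=True)
"It 's a [ test ] & more ."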

Truecaser

>>> from sacremoses import MosesTruecaser, MosesTokenizer

# Train a new truecaser from a 'big.txt' file.
>>> mtr = MosesTruecaser()
>>> mtok = MosesTokenizer(lang='en')

# Save the truecase model to 'big.truecasemodel' using `save_to`
>>> tokenized_docs = [mtok.tokenize(line) for line in open('big.txt')]
>>> mtr.train(tokenized_docs, save_to='big.truecasemodel')

# Save the truecase model to 'big.truecasemodel' after training
# (just in case you forgot to use `save_to`)
>>> mtr = MosesTruecaser()
>>> mtr.train('big.txt')
>>> mtr.save_model('big.truecasemodel')

# Truecase a string after training a model.
>>> mtr = MosesTruecaser()
>>> mtr.train('big.txt')
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES")
['the', 'adventures', 'of', 'Sherlock', 'Holmes']

# Load a model and truecase a string using the trained model.
>>> mtr = MosesTruecaser('big.truecasemodel')
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES")
['the', 'adventures', 'of', 'Sherlock', 'Holmes']
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES", return_str=True)
'the ADVENTURES OF SHERLOCK HOLMES'
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES", return_str=True, use_known=True)
'the adventures of Sherlock Holmes'
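
To undo truecasing (e.g. to restore sentence-initial capitalisation for display), a detruecaser is also available; a minimal sketch, assuming `MosesDetruecaser` is exposed by the package as the CLI `detruecase` command below suggests:

>>> from sacremoses import MosesDetruecaser
>>> mdt = MosesDetruecaser()
>>> mdt.detruecase("the adventures of Sherlock Holmes")
['The', 'adventures', 'of', 'Sherlock', 'Holmes']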

Normalizer

>>> from sacremoses import MosesPunctNormalizer
>>> mpn = MosesPunctNormalizer()
>>> mpn.normalize('THIS EBOOK IS OTHERWISE PROVIDED TO YOU “AS-IS.”')
'THIS EBOOK IS OTHERWISE PROVIDED TO YOU "AS-IS."'
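
The normalizer can also be configured at construction time; a minimal sketch, assuming the constructor accepts flags mirroring the CLI options shown later (`pre_replace_unicode_punct` and `post_remove_control_chars` are assumptions here):

>>> mpn = MosesPunctNormalizer(
...     lang='en',
...     pre_replace_unicode_punct=True,   # assumed flag, counterpart of the CLI -p option
...     post_remove_control_chars=True,   # assumed flag, counterpart of the CLI -c option
... )
>>> mpn.normalize('«AS-IS»')
'"AS-IS"'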

Usage (CLI)

Since version 0.0.42, the CLI supports chaining commands into a pipeline, so there are global options that should be set before calling the commands:

  • language
  • processes
  • encoding
  • quiet
$ pip install -U "sacremoses>=0.0.42"

$ sacremoses --help
Usage: sacremoses [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...

Options:
  -l, --language TEXT      Use language specific rules when tokenizing
  -j, --processes INTEGER  No. of processes.
  -e, --encoding TEXT      Specify encoding of file.
  -q, --quiet              Disable progress bar.
  --version                Show the version and exit.
  -h, --help               Show this message and exit.

Commands:
  detokenize
  detruecase
  normalize
  tokenize
  train-truecase
  truecase

Pipeline

Example to chain the following commands:

  • normalize with -c option to remove control characters.
  • tokenize with -a option for aggressive dash split rules.
  • truecase with -a option to indicate that model is for ASR
    • if big.truemodel exists, load the model with -m option,
    • otherwise train a model and save it with -m option to big.truemodel file.
  • redirect the console output to the big.txt.norm.tok.true file.
cat big.txt | sacremoses -l en -j 4 \
    normalize -c tokenize -a truecase -a -m big.truemodel \
    > big.txt.norm.tok.true
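
The same pipeline can be approximated from Python; a rough sketch in which the `post_remove_control_chars`, `aggressive_dash_splits`, `is_asr` and `save_to` arguments are assumed to behave like their CLI counterparts, and the load-the-model-if-it-exists branch is omitted:

from sacremoses import MosesPunctNormalizer, MosesTokenizer, MosesTruecaser

mpn = MosesPunctNormalizer(lang='en', post_remove_control_chars=True)  # like `normalize -c` (assumed flag)
mtok = MosesTokenizer(lang='en')
mtr = MosesTruecaser(is_asr=True)  # like `truecase -a` (assumed flag)

# Normalize and tokenize every line, train a truecase model on the result,
# then truecase each line and write it out.
with open('big.txt') as fin:
    tokenized = [mtok.tokenize(mpn.normalize(line), aggressive_dash_splits=True) for line in fin]

mtr.train(tokenized, save_to='big.truemodel')
with open('big.txt.norm.tok.true', 'w') as fout:
    for tokens in tokenized:
        fout.write(mtr.truecase(' '.join(tokens), return_str=True) + '\n')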

Tokenizer

$ sacremoses tokenize --help
Usage: sacremoses tokenize [OPTIONS]

Options:
  -a, --aggressive-dash-splits   Triggers dash split rules.
  -x, --xml-escape               Escape special characters for XML.
  -p, --protected-patterns TEXT  Specify file with patterns to be protected in
                                 tokenisation.
  -c, --custom-nb-prefixes TEXT  Specify a custom non-breaking prefixes file,
                                 add prefixes to the default ones from the
                                 specified language.
  -h, --help                     Show this message and exit.


 $ sacremoses -l en -j 4 tokenize  < big.txt > big.txt.tok
100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 24363.39it/s]

 $ wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/basic-protected-patterns
 $ sacremoses -l en -j 4 tokenize -p basic-protected-patterns < big.txt > big.txt.tok
100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 22183.94it/s]
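
The protected-patterns behaviour is also reachable from the Python API; a minimal sketch, assuming `tokenize` accepts a `protected_patterns` list of regular expressions, as the `-p` option above suggests:

>>> from sacremoses import MosesTokenizer
>>> mt = MosesTokenizer(lang='en')
>>> # Matches of the (assumed) protected_patterns regexes are kept as single tokens.
>>> mt.tokenize('See http://example.com/foo-bar now.', protected_patterns=[r'\S+://\S+'], return_str=True)
'See http://example.com/foo-bar now .'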

Detokenizer

$ sacremoses detokenize --help
Usage: sacremoses detokenize [OPTIONS]

Options:
  -x, --xml-unescape  Unescape special characters for XML.
  -h, --help          Show this message and exit.

 $ sacremoses -l en -j 4 detokenize < big.txt.tok > big.txt.tok.detok
100%|██████████████████████████████████| 128457/128457 [00:16<00:00, 7931.26it/s]

Truecase

$ sacremoses truecase --help
Usage: sacremoses truecase [OPTIONS]

Options:
  -m, --modelfile TEXT            Filename to save/load the modelfile.
                                  [required]
  -a, --is-asr                    A flag to indicate that model is for ASR.
  -p, --possibly-use-first-token  Use the first token as part of truecase
                                  training.
  -h, --help                      Show this message and exit.

$ sacremoses -j 4 truecase -m big.model < big.txt.tok > big.txt.tok.true
100%|██████████████████████████████████| 128457/128457 [00:09<00:00, 14257.27it/s]

Detruecase

$ sacremoses detruecase --help
Usage: sacremoses detruecase [OPTIONS]

Options:
  -j, --processes INTEGER  No. of processes.
  -a, --is-headline        Whether the file contains headlines.
  -e, --encoding TEXT      Specify encoding of file.
  -h, --help               Show this message and exit.

$ sacremoses -j 4 detruecase  < big.txt.tok.true > big.txt.tok.true.detrue
100%|█████████████████████████████████| 128457/128457 [00:04<00:00, 26945.16it/s]

Normalize

$ sacremoses normalize --help
Usage: sacremoses normalize [OPTIONS]

Options:
  -q, --normalize-quote-commas  Normalize quotations and commas.
  -d, --normalize-numbers       Normalize numbers.
  -p, --replace-unicode-puncts  Replace Unicode punctuation BEFORE
                                normalization.
  -c, --remove-control-chars    Remove control characters AFTER normalization.
  -h, --help                    Show this message and exit.

$ sacremoses -j 4 normalize < big.txt > big.txt.norm
100%|██████████████████████████████████| 128457/128457 [00:09<00:00, 13096.23it/s]