M4t1ss / parallel-corpora-tools

License: MIT
Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.

Programming languages: PHP, shell

Corpora Cleaning Tools

Tools for filtering and cleaning parallel and monolingual corpora in order to train better (neural) machine translation systems.

These tools were inspired by the Data Filtering and Data Pre-processing sections of Tilde's WMT17 paper. This repository includes some of the more basic scripts, which help remove most of the junk from parallel corpora.
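To give a feel for what "basic" junk removal on a parallel corpus can look like, here is a minimal Python sketch. It is not one of the repository's scripts: the function name `clean_parallel`, the choice of checks, and the `max_ratio` threshold of 3.0 are all assumptions made for illustration.

```python
def clean_parallel(pairs, max_ratio=3.0):
    """Yield (source, target) sentence pairs that pass a few basic checks.

    Hypothetical helper for illustration only; the checks and the default
    max_ratio are assumptions, not the repository's actual settings.
    """
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop pairs where either side is empty.
        if not src or not tgt:
            continue
        # Drop pairs where the "translation" is a copy of the source.
        if src == tgt:
            continue
        # Drop pairs where either side has no alphabetic characters at all.
        if not any(c.isalpha() for c in src) or not any(c.isalpha() for c in tgt):
            continue
        # Drop pairs whose token counts differ by more than max_ratio.
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Drop exact duplicate pairs, keeping the first occurrence.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        yield src, tgt
```

Because the function is a generator over pairs, it can be fed from two files read in lockstep, for example `clean_parallel(zip(open("corpus.en"), open("corpus.lv")))` (file names hypothetical), without loading the whole corpus into memory apart from the set of pairs already seen.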

Tools included

  • parallel - tools for parallel corpora
  • mono - tools for monolingual corpora

Requirements

pip install subword-nmt
pip install langid
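The `langid` package listed above identifies the language of a piece of text. As a rough sketch of how it could be used to keep only sentence pairs in the expected languages (an assumption about usage, not the repository's own code), the hypothetical function below calls `langid.classify`, which returns a `(language_code, score)` tuple. The `classify` parameter is an illustrative addition so that a different identifier can be swapped in.

```python
def filter_by_language(pairs, src_lang, tgt_lang, classify=None):
    """Yield pairs whose detected languages match src_lang and tgt_lang.

    Hypothetical helper for illustration. `classify` is any callable that
    maps a string to a (language_code, score) tuple; when omitted, it
    defaults to langid.classify.
    """
    if classify is None:
        import langid  # third-party: pip install langid
        classify = langid.classify
    for src, tgt in pairs:
        # Keep the pair only if both sides are in the expected language.
        if classify(src)[0] == src_lang and classify(tgt)[0] == tgt_lang:
            yield src, tgt
```

Language identification is least reliable on very short segments, so in practice a filter like this may wrongly drop some short but valid sentence pairs.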

Publications

If you use this tool, please cite the following paper:

Matīss Rikters (2018). "Impact of Corpora Quality on Neural Machine Translation." In Proceedings of the 8th Conference Human Language Technologies - The Baltic Perspective (Baltic HLT 2018).

@inproceedings{Rikters2018BalticHLT,
	author = {Rikters, Matīss},
	booktitle={Proceedings of the 8th Conference Human Language Technologies - The Baltic Perspective (Baltic HLT 2018)},
	title = {{Impact of Corpora Quality on Neural Machine Translation}},
	address={Tartu, Estonia},
	year = {2018}
}