
jerinphilip / ilmulti

License: MIT
Tooling to play around with multilingual machine translation for Indian Languages.

Programming Languages

python
139335 projects - #7 most used programming language
UrWeb
3 projects
ocaml
1615 projects
shell
77523 projects

Projects that are alternatives to or similar to ilmulti

Thot
Thot toolkit for statistical machine translation
Stars: ✭ 53 (+178.95%)
Mutual labels:  tokenizer, machine-translation
Sacremoses
Python port of Moses tokenizer, truecaser and normalizer
Stars: ✭ 293 (+1442.11%)
Mutual labels:  tokenizer, machine-translation
Indian ParallelCorpus
Curated list of publicly available parallel corpus for Indian Languages
Stars: ✭ 23 (+21.05%)
Mutual labels:  indian-languages, multilingual-translation
Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Stars: ✭ 132 (+594.74%)
Mutual labels:  tokenizer, machine-translation
NiuTrans.NMT
A Fast Neural Machine Translation System. It is developed in C++ and resorts to NiuTensor for fast tensor APIs.
Stars: ✭ 112 (+489.47%)
Mutual labels:  machine-translation
dynmt-py
Neural machine translation implementation using dynet's python bindings
Stars: ✭ 17 (-10.53%)
Mutual labels:  machine-translation
farasapy
A Python implementation of Farasa toolkit
Stars: ✭ 69 (+263.16%)
Mutual labels:  tokenizer
packetevents
PacketEvents is a powerful packet library. Our packet wrappers are efficient and easy to use. We support many protocol versions. (1.8+)
Stars: ✭ 235 (+1136.84%)
Mutual labels:  wrappers
vscode-blockman
VSCode extension to highlight nested code blocks
Stars: ✭ 233 (+1126.32%)
Mutual labels:  tokenizer
omegat-tencent-plugin
This is a plugin to allow OmegaT to source machine translations from Tencent Cloud.
Stars: ✭ 31 (+63.16%)
Mutual labels:  machine-translation
berserker
Berserker - BERt chineSE woRd toKenizER
Stars: ✭ 17 (-10.53%)
Mutual labels:  tokenizer
elasticsearch-plugins
Some native scoring script plugins for elasticsearch
Stars: ✭ 30 (+57.89%)
Mutual labels:  tokenizer
Deep-NLP-Resources
Curated list of all NLP Resources
Stars: ✭ 65 (+242.11%)
Mutual labels:  machine-translation
neural tokenizer
Tokenize English sentences using neural networks.
Stars: ✭ 64 (+236.84%)
Mutual labels:  tokenizer
mtdata
A tool that locates, downloads, and extracts machine translation corpora
Stars: ✭ 95 (+400%)
Mutual labels:  machine-translation
rustfst
Rust re-implementation of OpenFST - library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). A Python binding is also available.
Stars: ✭ 104 (+447.37%)
Mutual labels:  tokenizer
deepl-rb
A simple ruby gem for the DeepL API
Stars: ✭ 38 (+100%)
Mutual labels:  machine-translation
ReductionWrappers
R wrappers to connect Python dimensional reduction tools and single cell data objects (Seurat, SingleCellExperiment, etc...)
Stars: ✭ 31 (+63.16%)
Mutual labels:  wrappers
jargon
Tokenizers and lemmatizers for Go
Stars: ✭ 98 (+415.79%)
Mutual labels:  tokenizer
Machine-Translation-v2
English-to-Chinese machine text translation
Stars: ✭ 48 (+152.63%)
Mutual labels:  machine-translation

ilmulti

This repository houses the tooling used to create the models on the leaderboard of WAT-Tasks. We provide wrappers around models trained with pytorch/fairseq for translation. Installation and usage instructions are provided below.

Installation

The code is tested to work with the fairseq fork branched from v0.8.0 and torch version 1.0.0.

# --user is optional

# Check requirements.txt; the packages needed for translation,
# fairseq-ilmt@lrec-2020 and torch, are not enabled by default.
python3 -m pip install -r requirements.txt --user  

# Once the requirements are installed, install ilmulti as a library.

python3 setup.py install --user 

Downloading models: the script scripts/download-and-setup-models.sh downloads the model and dictionary files required to run examples/mm_all.py. Which models to download can be configured in the script.

A working example using the wrappers in this code can be found in this colab notebook. Thanks, @Nimishasri.

Usage

from ilmulti.translator import from_pretrained

# Load a pretrained multiway model by its tag.
translator = from_pretrained(tag='mm-all')

# Translate a sentence into Hindi ('hi').
sample = translator("The quick brown fox jumps over the lazy dog", tgt_lang='hi')
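The structure of the returned object is not documented in this README, so the line below only prints it for inspection rather than describing the API:

print(sample)  # the exact output structure is left unspecified here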

The code works with three main components:

1. Segmenter

Also known as a sentence tokenizer: it segments a block of text into sentences, accounting for some Indian-language delimiters. Two implementations are available; a sketch of the underlying Punkt technique follows the list.

  1. PatternSegmenter: a somewhat crude, rule-based implementation contributed by Binu Jasim.
  2. PunktSegmenter: an unsupervised, learnt PunktTokenizer that replaced the rule-based one.
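
The segmenter's own API is not shown in this README, so the following is a minimal sketch of only the underlying Punkt technique, using NLTK; the corpus path and variable names are assumptions made for the illustration.

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Unsupervised training: Punkt learns abbreviations, collocations and
# likely sentence starters from raw text, with no labelled boundaries.
raw_text = open("corpus.txt").read()  # assumed monolingual corpus

trainer = PunktTrainer()
trainer.train(raw_text, finalize=True)

tokenizer = PunktSentenceTokenizer(trainer.get_params())
sentences = tokenizer.tokenize("Dr. Smith went to Delhi. He came back on Monday.")
print(sentences)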

2. Tokenization

We use SentencePiece as an unsupervised tokenizer for Indian languages, which works surprisingly well in our experiments. Models trained on whatever corpora we could find for each language are available in sentencepiece/models, with vocabularies of 4000 and 8000 units.

Training a joint SentencePiece model over all languages leads to character-level tokenization for under-represented languages, and since there isn't much to gain given the difference in scripts, we use an individual tokenizer for each language. The combined vocabulary nevertheless stays below 4000 × |#languages|, since some common English code-mixed tokens are shared across languages. This also makes the MT system somewhat robust to code-mixed inputs. A sketch of training one per-language model follows.
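
The exact training settings are not given in this README; the snippet below is a minimal sketch of training and loading one per-language model with the sentencepiece library, where the corpus path and model prefix are assumptions.

import sentencepiece as spm

# Train a per-language model (assumed corpus path and prefix; the vocab
# size mirrors the 4000-unit models mentioned above).
spm.SentencePieceTrainer.train(
    input="corpora/hi.txt",      # assumed monolingual Hindi corpus
    model_prefix="hi-4000",
    vocab_size=4000,
)

# Load the trained model and tokenize a sentence into subword pieces.
sp = spm.SentencePieceProcessor(model_file="hi-4000.model")
pieces = sp.encode("यह एक उदाहरण वाक्य है।", out_type=str)
print(pieces)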

3. Translator

Translator is a wrapper around fairseq which we have reused for some web interfaces and demos. A hedged sketch of how the three components compose is shown below.
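
None of the wrapper's internals are spelled out in this README, so the sketch below only illustrates how the three components plausibly compose (segment, then tokenize, then translate); every name in it is a hypothetical stand-in, not ilmulti's actual API.

# A hedged sketch: all names here are hypothetical stand-ins.
def translate_block(text, tgt_lang, segment, tokenize, detokenize, translate_fn):
    """Compose the three components: segment -> tokenize -> translate."""
    translated = []
    for sentence in segment(text):            # 1. Segmenter: block -> sentences
        subwords = tokenize(sentence)         # 2. SentencePiece: sentence -> pieces
        hypothesis = translate_fn(" ".join(subwords), tgt_lang)  # 3. fairseq model
        translated.append(detokenize(hypothesis))  # undo subword segmentation
    return " ".join(translated)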
