sharad461 / nepali-translator

Licence: Apache-2.0 License
Neural Machine Translation on the Nepali-English language pair

Nepali Translator

Neural Machine Translation (NMT) on the Nepali-English language pair.

This project makes two contributions: it extends and cleans the publicly available parallel data, and it improves the baseline scores for supervised MT on the pair.

A report on this project is available here.

The parallel data we prepared can be found here.

The data_cleaning directory contains the scripts that implement the cleaning methods discussed in the report.

The translator directory contains a working interface for the translator.

Updates

Towards the end of 2019 some additional work was carried out under the project, described here. The models reported in the paper can be found here. I will also add a link to the bigger corpus soon.

As of Feb 2021, the model files have a few compatibility issues with more recent versions of the packages. To avoid them, use the following package versions:

  • torch 1.3.0
  • fairseq 0.9.0
  • portalocker 2.0.0
  • sacrebleu 1.4.14
  • sacremoses 0.0.43
  • sentencepiece 0.1.91
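The pinned versions listed above can be turned into pip requirement specifiers and compared against an environment. The following is a minimal sketch using only the standard library; the helper names `requirement_lines` and `find_mismatches` are illustrative and are not part of this repository.

```python
# Version pins copied from the note above.
# The helper names below are hypothetical, chosen for illustration only.
PINNED = {
    "torch": "1.3.0",
    "fairseq": "0.9.0",
    "portalocker": "2.0.0",
    "sacrebleu": "1.4.14",
    "sacremoses": "0.0.43",
    "sentencepiece": "0.1.91",
}


def requirement_lines(pins):
    """Render the pins as pip requirement specifiers, one per package."""
    return ["{}=={}".format(name, version) for name, version in pins.items()]


def find_mismatches(pins, installed):
    """Return {package: (wanted, found)} for every package whose version differs.

    `installed` maps package names to version strings; a package that is
    absent from `installed` is reported with found=None.
    """
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }


# Print the pins in requirements-file form.
print("\n".join(requirement_lines(PINNED)))
```

The printed lines can be saved to a requirements file and installed with `pip install -r`. `find_mismatches` takes the installed versions as a plain dictionary, so it does not depend on any particular way of querying the environment.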

Results

Find the more recent results in the paper linked above.

The BLEU scores of 7.6 and 4.3 (for supervised methods) that Guzman et al. report in their paper are on their devtest set. They release two more sets: a validation set, called the dev set, and a test set released in October 2019. The report linked above gives scores on the dev set only; to score their model on that set, we reproduced it using their implementation. Here we report the scores on both the dev and devtest sets.

On dev set

| Models | Corpus size | NE-EN | EN-NE |
| --- | --- | --- | --- |
| Guzman et al. (2019) | 564k | 5.24 | 2.98 |
| This work | 150k | 12.26 | 6.0 |

On devtest set

| Models | NE-EN | EN-NE |
| --- | --- | --- |
| Guzman et al. (2019) | 7.6 | 4.3 |
| This work | 14.51 | 6.58 |

The devtest results are from models trained with a vocabulary size of 2500.
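To read the two result tables as improvements over the baseline, the absolute BLEU differences can be computed directly from the reported scores. This is a small arithmetic sketch; the scores are copied from the tables, and the function name `bleu_gains` is illustrative.

```python
# BLEU scores copied from the dev and devtest tables: (NE-EN, EN-NE).
SCORES = {
    "dev": {"Guzman et al. (2019)": (5.24, 2.98), "This work": (12.26, 6.0)},
    "devtest": {"Guzman et al. (2019)": (7.6, 4.3), "This work": (14.51, 6.58)},
}


def bleu_gains(scores, baseline="Guzman et al. (2019)", system="This work"):
    """Return {eval_set: (NE-EN gain, EN-NE gain)} in absolute BLEU points."""
    gains = {}
    for eval_set, rows in scores.items():
        base, ours = rows[baseline], rows[system]
        # Round to two decimals to hide floating-point noise.
        gains[eval_set] = tuple(round(o - b, 2) for o, b in zip(ours, base))
    return gains


for eval_set, (ne_en, en_ne) in bleu_gains(SCORES).items():
    print("{}: NE-EN +{:.2f}, EN-NE +{:.2f}".format(eval_set, ne_en, en_ne))
```

Note that the dev-set comparison is between models trained on corpora of different sizes (564k versus 150k), so the differences describe the reported systems rather than a controlled experiment.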

Requirements

  • fairseq
  • sentencepiece
  • sacremoses
  • sacrebleu
  • flask
  • indic_nlp_library

Fairseq is used for training, sentencepiece to learn BPE over the corpus, sacremoses to process English text, sacrebleu to score the models, and flask to serve the interface. The Indic NLP Library handles the Nepali text.

All the libraries can be installed using pip.

To run the translator interface, clone the Indic NLP Library into translator/app/modules/.

The cleaning scripts also use other libraries, such as python-docx and lxml.

Preparing the translator

After training a model using the fairseq implementation of Transformer, copy the checkpoint file to translator/app/models/ and rename it en-ne.pt or ne-en.pt, according to the translation direction of the checkpoint. Alternatively, the checkpoint files that produce the results in the report are available here; copy those .pt files to translator/app/models/.
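The copy-and-rename step can be sketched in a few lines of standard-library Python. The function name `install_checkpoint` is hypothetical and not part of this repository, and the default `models_dir` assumes the script runs from the repository root.

```python
import shutil
from pathlib import Path

# The two file names the translator expects, per the instructions above.
VALID_DIRECTIONS = ("en-ne", "ne-en")


def install_checkpoint(checkpoint, direction, models_dir="translator/app/models"):
    """Copy a trained checkpoint to <models_dir>/<direction>.pt and return the new path.

    `models_dir` defaults to a path relative to the repository root (assumption).
    """
    if direction not in VALID_DIRECTIONS:
        raise ValueError("direction must be one of {}".format(VALID_DIRECTIONS))
    source = Path(checkpoint)
    if not source.is_file():
        raise FileNotFoundError(str(source))
    target_dir = Path(models_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / (direction + ".pt")
    # copy2 keeps the original file in place and preserves its metadata.
    shutil.copy2(str(source), str(target))
    return target
```

For example, `install_checkpoint("checkpoints/checkpoint_best.pt", "ne-en")` would place a Nepali-to-English checkpoint at translator/app/models/ne-en.pt; the source path here is only an example and depends on where training saved its checkpoints.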

Once the requirements and models are in place, run python app/app.py from the translator directory.
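Before starting the interface, it can help to check that the files described above are where the app expects them. The sketch below rests on two assumptions: that both translation directions are wanted, and that the cloned Indic NLP Library directory is named `indic_nlp_library`. The function name `missing_prerequisites` is illustrative.

```python
from pathlib import Path


def missing_prerequisites(
    translator_dir="translator",
    directions=("en-ne", "ne-en"),         # assumption: both directions are used
    library_dir_name="indic_nlp_library",  # assumption: name of the cloned directory
):
    """Return the expected paths that do not exist yet (empty list means ready)."""
    app = Path(translator_dir) / "app"
    required = [app / "app.py"]
    required += [app / "models" / (d + ".pt") for d in directions]
    required.append(app / "modules" / library_dir_name)
    return [str(path) for path in required if not path.exists()]
```

Calling `missing_prerequisites()` from the repository root returns an empty list when everything is in place. Pass a single direction, such as `directions=("ne-en",)`, if only one model has been installed.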

Details on training are available in the fairseq repository and documentation. The FLORES GitHub repository is also useful.

Sample translations

NE-EN

| Type | Sentence |
| --- | --- |
| Source | ठूला गोदामहरुले, यस क्षेत्रका साना साना धेरै निर्माता हरु द्वारा बनाईएका जुत्ताहरु भण्डार गर्न थाले । |
| Reference | Large warehouses began to stock footwear in warehouses , made by many small manufacturers from the area . |
| System | Large warehouses began to store shoe made by small producers of this area . |

| Type | Sentence |
| --- | --- |
| Source | प्राविधिक लेखकहरूले पनि व्यापारिक, पेशागत वा घरेलु प्रयोगका लागि विभिन्न कार्यविधिहरूका बारे लेख्दछन्। |
| Reference | Technical writers also write various procedures for business , professional or domestic use . |
| System | Technical authors also write about various procedures for commercial , professional or domestic use . |

EN-NE

| Type | Sentence |
| --- | --- |
| Source | Obama's language is sophisticated , Putin speaks directly and prefers to use punctuation and statistics , but both have the same ability to win the audience's heart . |
| Reference | ओबामाको भाषा परिस्कृत छ , पुटिन ठाडो भाषामा तुक्का र तथ्याङ्क प्रयोग गरेर बोल्न रुचाउँछन् , तर दुवैसँग श्रोताको हृदयलाई तरंगित गर्ने समान क्षमता छ । |
| System | ओबामाको भाषा परिस्कृत छ , पुटिन प्रत्यक्ष रूपमा वाचन र तथ्याङ्क प्रयोग गर्न प्राथमिकता दिन्छ , तर दुवै श्रोताको मुटु जित्न एउटै क्षमता छ । |

| Type | Sentence |
| --- | --- |
| Source | Litti Chokha is prepared by stuffing buckwheat flour mixed with various spices in dough and toasting it in fire , and is served with spice paste . |
| Reference | लिट्टी चोखा - लिट्टी जुन आंटा भित्र सत्तू तथा मसला हालेर आगोमा सेकेर बनाईन्छ , को चोखे सँग पस्किइन्छ । |
| System | लोती चोखोका विभिन्न मसला मिसाएर बकवाहेट फूल मिसाएर तयार पारिन्छ र यसलाई आगोमा टाँस्न र मसला टाँस्ने सेवा गरिन्छ । |

Citation

If you use any part of this project in your work, please cite:

@techreport{nepali-translator-2019,
  title={Nepali Translator},
  author={Duwal, Sharad and Manandhar, Amir and Maskey, Saurav and Hada, Subash},
  institution={Kathmandu University},
  year={2019}
}

Or you can cite the aforementioned paper.

This project was completed in July 2019 for the sixth semester of the Computer Science program at Kathmandu University.
