masakhane-io / Masakhane Mt

Licence: mit
Machine Translation for Africa

Projects that are alternatives of or similar to Masakhane Mt

Fasttext multilingual
Multilingual word vectors in 78 languages
Stars: ✭ 1,067 (+651.41%)
Mutual labels:  machine-translation
Deep Learning Drizzle
Drench yourself in Deep Learning, Reinforcement Learning, Machine Learning, Computer Vision, and NLP by learning from these exciting lectures!!
Stars: ✭ 9,717 (+6742.96%)
Mutual labels:  machine-translation
Gtos
Code for AAAI2020 paper "Graph Transformer for Graph-to-Sequence Learning"
Stars: ✭ 129 (-9.15%)
Mutual labels:  machine-translation
Comet
A Neural Framework for MT Evaluation
Stars: ✭ 58 (-59.15%)
Mutual labels:  machine-translation
Opennmt Tf
Neural machine translation and sequence learning using TensorFlow
Stars: ✭ 1,223 (+761.27%)
Mutual labels:  machine-translation
Mt Paper Lists
MT paper lists (by conference)
Stars: ✭ 105 (-26.06%)
Mutual labels:  machine-translation
Mtnt
Code for the collection and analysis of the MTNT dataset
Stars: ✭ 48 (-66.2%)
Mutual labels:  machine-translation
Subword Nmt
Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
Stars: ✭ 1,819 (+1180.99%)
Mutual labels:  machine-translation
Niutrans.smt
NiuTrans.SMT is an open-source statistical machine translation system developed by a joint team from NLP Lab. at Northeastern University and the NiuTrans Team. The NiuTrans system is fully developed in C++ language. So it runs fast and uses less memory. Currently it supports phrase-based, hierarchical phrase-based and syntax-based (string-to-tree, tree-to-string and tree-to-tree) models for research-oriented studies.
Stars: ✭ 90 (-36.62%)
Mutual labels:  machine-translation
Cluedatasetsearch
搜索所有中文NLP数据集,附常用英文NLP数据集
Stars: ✭ 2,112 (+1387.32%)
Mutual labels:  machine-translation
Udacity Natural Language Processing Nanodegree
Tutorials and my solutions to the Udacity NLP Nanodegree
Stars: ✭ 73 (-48.59%)
Mutual labels:  machine-translation
Transformers without tears
Transformers without Tears: Improving the Normalization of Self-Attention
Stars: ✭ 80 (-43.66%)
Mutual labels:  machine-translation
Opus Mt
Open neural machine translation models and web services
Stars: ✭ 111 (-21.83%)
Mutual labels:  machine-translation
Thot
Thot toolkit for statistical machine translation
Stars: ✭ 53 (-62.68%)
Mutual labels:  machine-translation
Awesome Ai Services
An overview of the AI-as-a-service landscape
Stars: ✭ 133 (-6.34%)
Mutual labels:  machine-translation
Machine Translation
Stars: ✭ 51 (-64.08%)
Mutual labels:  machine-translation
En Fr Mlt Tensorflow
English-French Machine Language Translation in Tensorflow
Stars: ✭ 99 (-30.28%)
Mutual labels:  machine-translation
Pytorch Dual Learning
Implementation of Dual Learning NMT on PyTorch
Stars: ✭ 141 (-0.7%)
Mutual labels:  machine-translation
Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Stars: ✭ 132 (-7.04%)
Mutual labels:  machine-translation
Nonautoreggenprogress
Tracking the progress in non-autoregressive generation (translation, transcription, etc.)
Stars: ✭ 118 (-16.9%)
Mutual labels:  machine-translation

Masakhane - A living collection of NLP projects for Africans, by Africans

MASAKHANE is a research effort for NLP for African languages that is OPEN SOURCE, CONTINENT-WIDE, DISTRIBUTED and ONLINE. This GitHub repository houses the data, code, results and research for building open baseline NLP results for African languages.

Website: masakhane.io

Goals

  • For Africa: To build and facilitate a community of NLP researchers, connect and grow it, spurring and sharing further research, build helpful tools for applications in government, medicine, science and education, to enable language preservation and increase its global visibility and relevance.

  • For NLP Research: To build data sets and tools to facilitate NLP research on African languages, and to pose new research problems to enrich the NLP research landscape.

  • For the global research community: To discover best practices for distributed research, to be applied by other emerging research communities.

Hall of Fame for our Contributors

Progress

  • See our pre-print to be published at Findings of EMNLP 2020 here
  • Look at our submitted machine translation benchmarks here! Can't see your language? Please submit a benchmark!
  • Check out our paper to be published at AfricaNLP Workshop @ ICLR 2020
  • Check out papers written by our participants here
  • Find out more about our current initiatives
  • Look at our list of community documents
  • Read our weekly meeting notes
  • Follow our publication on Medium

How can I contribute?

There are many ways to contribute to MASAKHANE.

  1. TRAIN A MODEL - Contribute a trained model and related code for your language
  2. ANALYSIS - Contribute analysis of data/models for any African language. You do not need any technical experience for this! If you're a linguist, we can pair you up with a machine translation practitioner so you can contribute analysis together
  3. DATA - Help build or find datasets for your language
  4. DOCUMENTATION - Help document our discussions and progress. This is VERY much needed. Or contribute to the documentation of the base "notebook" to improve the experience of others
  5. MENTORSHIP - Provide advice or help tune models for their languages and datasets, or help people get started
  6. ADMIN - Working with so many researchers can be quite a challenge! Help out with administrative tasks
  7. COMPUTE - Help with infrastructure and compute! Do you have spare compute to donate? Let us know! We're always looking for more!
  8. BRAINSTORM - Join our weekly meetings, provide advice or ideas
  9. STORY-TELLING - Tell our stories to the world by doing talks about the community, contributing to our Medium publication, or engaging with media outlets
  10. MLOps & ML Engineering - Do you enjoy delving into the MLOps side of machine learning? Are you a software developer looking to hone your ML engineering abilities? Join us to help build tools that support our reproducibility, data gathering, and model sharing!

Want more details? Check out our current initiatives

How do I join?

  1. Join our Slack

  2. Request to join our Google Group

  3. Send us your details so we can feature you on our webpage, masakhane.io. Please email the following to [email protected]:

    • Your Full Name
    • A preferred social media link
    • The language(s) you'll be working on (or your general relevant specialty, if you're an expert at machine translation and would like to boost the community through that)
    • A picture
    • Your affiliation and role.

Please be patient when waiting for a response from our email address. We're very behind on our administration in the time of COVID-19.

Building your first machine translation model

Typically, if you have some programming experience, we encourage you to start your journey with Masakhane by building a baseline for your language. Feeling nervous about submitting or not sure where to start? Please join our weekly meeting and we will pair you with a mentor!

1. Have a look at the example code

We have an example colab notebook which trains a model for English-to-Zulu translation. You can select it by going to the GitHub section when opening a new project.

2. Finding data for my language?!

This is a huge challenge, but luckily we have a place to start! At ACL 2019, this paper was published. The short story? Turns out the Jehovah's Witness community has been translating many many documents and not all of them are religious. And their language representation is DIVERSE.

Check out this spreadsheet HERE to see if your language is featured, then go to Opus to find the links to the data: http://opus.nlpl.eu/JW300.php

We also provide a script for easy downloading and BPE-preprocessing of JW300 data from OPUS: jw300_utils/get_jw300.py. It requires the installation of the opustools-pkg Python package. Example: For downloading and pre-processing the Acholi (ach) and the Nyaneka (nyk) portions of JW300, call the script like this: python get_jw300.py ach nyk --output_dir jw300
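
If you prefer to drive the download from Python (for example from a notebook cell), the call above can be wrapped as in the following minimal sketch. The helper names build_get_jw300_command and download_jw300 are hypothetical and not part of the repository; the sketch only assumes the command-line form shown above and that get_jw300.py is in the current working directory.

```python
import subprocess
import sys
from typing import List, Sequence


def build_get_jw300_command(
    lang_codes: Sequence[str],
    output_dir: str = "jw300",
    script_path: str = "get_jw300.py",  # assumed to sit in the working directory
) -> List[str]:
    """Build the argument list for one get_jw300.py call (hypothetical helper)."""
    if not lang_codes:
        raise ValueError("pass at least one language code, e.g. 'ach'")
    # Mirrors: python get_jw300.py <lang> [<lang> ...] --output_dir <dir>
    return [sys.executable, script_path, *lang_codes, "--output_dir", output_dir]


def download_jw300(
    lang_codes: Sequence[str],
    output_dir: str = "jw300",
    script_path: str = "get_jw300.py",
) -> None:
    """Run the script; raises CalledProcessError if it exits with an error."""
    command = build_get_jw300_command(lang_codes, output_dir, script_path)
    subprocess.run(command, check=True)
```

Calling download_jw300(["ach", "nyk"]) from inside jw300_utils should then be equivalent to the command-line example, provided opustools-pkg is installed.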

Can't find your language in the JW300 dataset?

Then we still have some options! Our community has been searching far and wide! Join our Slack and Google Group to discuss a way forward!

3. Run the notebook!

Your next step is to use the JW300 dataset in the Colab notebook and run it. Most pieces of advice are within the notebook itself. We are constantly improving that notebook and are open to any recommendations. Struggled to get going? Then let's work together to build a notebook that's easier to use! Create a GitHub issue or email us!

4. It's done! I have results! Now what?

Amazing! You've created your first baseline. Now we need to get the code, data and results into this GitHub repository.

In order for us to consider your result submission official, we need a couple of things:

  1. The notebook that will run the code. The notebook MUST run on someone else's account, and the data that it uses should be publicly accessible (i.e. if I download the notebook and run it, it must work, so it shouldn't use any private files). If you're wondering how to do this, don't fear! Drop us a line and we will work together to make sure the submission is all good! :)

  2. The test sets - in order to replicate this and test against your results, we need saved test sets uploaded separately.

  3. A README.md that describes (a) the data used, which is especially important if it's a combination of sources, (b) any interesting changes to the model, and (c) ideally some analysis of sample sentences from the final model

  4. The model itself. This can be in the form of a google drive or dropbox link. We will be finding a home for our trained models soon. For models to be used for transfer learning, further trained, or deployed, you need to provide:

    1. a checkpoint with the parameters (.ckpt file),
    2. the source and target vocabulary (src_vocab.txt, trg_vocab.txt),
    3. the configuration file (config.yaml),
    4. and if applicable: the BPE codes or scripts for your pre-processing pipeline. Joey NMT saves the first three in the model directory.
  5. The results - the train, dev, and test set BLEU scores

We will be further expanding our analysis techniques, so it's super important that we have a copy of the model and test sets now. That way we don't need to rerun the training just to do the analysis.
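
Before submitting, it can help to check the first requirement (no private files) mechanically. The sketch below scans the code cells of a parsed .ipynb file for a few markers that often indicate account-specific data access in Colab. The marker list and the function names are assumptions chosen for illustration, not an official Masakhane check. A hit is only a prompt for manual review: mounting Google Drive to save checkpoints, for instance, is not a problem as long as the input data is public.

```python
import json
from typing import Dict, List

# Assumed markers of private or account-specific access in a Colab notebook.
PRIVATE_MARKERS = ("drive.mount(", "/content/drive", "files.upload(")


def find_private_references(notebook: Dict) -> List[str]:
    """Return code lines in a parsed .ipynb that contain one of PRIVATE_MARKERS."""
    hits = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # markdown cells may mention Drive without using it
        source = cell.get("source", [])
        # The notebook format stores source as a list of lines or a single string.
        lines = source.splitlines() if isinstance(source, str) else source
        for line in lines:
            if any(marker in line for marker in PRIVATE_MARKERS):
                hits.append(f"cell {index}: {line.strip()}")
    return hits


def check_notebook_file(path: str) -> List[str]:
    """Load a notebook from disk and report lines worth reviewing."""
    with open(path, encoding="utf-8") as handle:
        return find_private_references(json.load(handle))
```

For example, check_notebook_file("notebook.ipynb") returns an empty list when none of the assumed markers appear in any code cell.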

Once you have all of the above, please create a pull request into the repository. See guidelines here.

Structure of my PR:

Also see this as an example for the structure of your contribution

Structure:

/benchmarks
 /<src-lang>-<tgt-lang>
   /<technique> -- this could be "jw300-baseline" or "fine-tuned-baseline" or "nig-newspaper-dataset"
     - notebook.ipynb
     - README.md
     - test.src
     - test.tgt
     - results.txt
     - src_vocab.txt
     - trg_vocab.txt
     - src.bpe
     - [trg.bpe if the bpe model is not joint with src]
     - config.yaml
     - any other files, if you have any

Example:

/benchmarks
  /en-xh
    /xhnavy-data-baseline
      - notebook.ipynb
      - README.md
      - test.xh
      - test.en
      - results.txt
      - src_vocab.txt
      - trg_vocab.txt
      - en-xh.4000.bpe
      - config.yaml
      - preprocessing.py

Here is a link to a pull request that has the relevant things.
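
To catch missing files before opening a pull request, the structure listed above can be checked with a short script. This is a minimal sketch: the function name missing_benchmark_files is hypothetical, the file names are copied from the structure above, and the BPE files are deliberately not checked because they only apply to some pre-processing pipelines.

```python
from pathlib import Path
from typing import List

# Fixed file names taken from the structure listed above.
REQUIRED_FILES = (
    "notebook.ipynb",
    "README.md",
    "results.txt",
    "src_vocab.txt",
    "trg_vocab.txt",
    "config.yaml",
)


def missing_benchmark_files(benchmark_dir: str) -> List[str]:
    """Return the required entries that are absent from one benchmark folder."""
    root = Path(benchmark_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # Test sets are matched by pattern because their extensions vary
    # (test.src/test.tgt in the template, test.xh/test.en in the example).
    if len(list(root.glob("test.*"))) < 2:
        missing.append("test.<src> and test.<tgt>")
    return missing
```

Running missing_benchmark_files("benchmarks/en-xh/xhnavy-data-baseline") on a complete contribution returns an empty list; anything else names the files still to be added.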

Feeling nervous about contributing your first pull request or unsure how to proceed? Please don't feel discouraged! Drop us an email or a slack message and we will work together to get your contribution in ship shape!

5. I've got a baseline. What do I do to improve it?

Cool! So there are many ways to improve results. We've highlighted a few of these in this document. Got other ideas? Drop us a line or submit a PR!

Notes about Model Deployment

We'd like to highlight that NONE of the trained models are suitable for production usage. In our paper here we explore the performance effects of training such a model on the JW300 dataset - the models are still unable to generalize to non-religious domains. As a rule, one should never deploy an NLP model in a domain that it has not been trained for. And even if it IS trained on the relevant domain, a model should be analysed in detail to understand the biases and potential harms. These models aim to serve as WORK IN PROGRESS to spur more research, and to better understand the failure of such systems.

Code of Conduct

See Code of Conduct

Reference

Bibtex

@article{nekoto2020participatory,
  title={Participatory research for low-resourced machine translation: A case study in african languages},
  author={{$\forall$}, { } and Nekoto, Wilhelmina and Marivate, Vukosi and Matsila, Tshinondiwa and Fasubaa, Timi and Kolawole, Tajudeen and Fagbohungbe, Taiwo and Akinola, Solomon Oluwole and Muhammad, Shamsuddee Hassan and Kabongo, Salomon and Osei, Salomey and others},
  journal={Findings of EMNLP},
  year={2020}
}