
Helsinki-NLP / Opus Mt

Licence: MIT
Open neural machine translation models and web services

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to Opus Mt

Opennmt Tf
Neural machine translation and sequence learning using TensorFlow
Stars: ✭ 1,223 (+1001.8%)
Mutual labels:  natural-language-processing, machine-translation, neural-machine-translation
Mtbook
Machine Translation: Foundations and Models (《机器翻译:基础与模型》) by Tong Xiao and Jingbo Zhu
Stars: ✭ 2,307 (+1978.38%)
Mutual labels:  natural-language-processing, machine-translation, neural-machine-translation
Nematus
Open-Source Neural Machine Translation in TensorFlow
Stars: ✭ 730 (+557.66%)
Mutual labels:  machine-translation, neural-machine-translation
Nlg Eval
Evaluation code for various unsupervised automated metrics for Natural Language Generation.
Stars: ✭ 822 (+640.54%)
Mutual labels:  natural-language-processing, machine-translation
Mtnt
Code for the collection and analysis of the MTNT dataset
Stars: ✭ 48 (-56.76%)
Mutual labels:  natural-language-processing, machine-translation
Thumt
An open-source neural machine translation toolkit developed by Tsinghua Natural Language Processing Group
Stars: ✭ 550 (+395.5%)
Mutual labels:  machine-translation, neural-machine-translation
Texar Pytorch
Integrating the Best of TF into PyTorch, for Machine Learning, Natural Language Processing, and Text Generation. This is part of the CASL project: http://casl-project.ai/
Stars: ✭ 636 (+472.97%)
Mutual labels:  natural-language-processing, machine-translation
Sockeye
Sequence-to-sequence framework with a focus on Neural Machine Translation based on Apache MXNet
Stars: ✭ 990 (+791.89%)
Mutual labels:  machine-translation, neural-machine-translation
Joeynmt
Minimalist NMT for educational purposes
Stars: ✭ 420 (+278.38%)
Mutual labels:  machine-translation, neural-machine-translation
Comet
A Neural Framework for MT Evaluation
Stars: ✭ 58 (-47.75%)
Mutual labels:  natural-language-processing, machine-translation
Thot
Thot toolkit for statistical machine translation
Stars: ✭ 53 (-52.25%)
Mutual labels:  natural-language-processing, machine-translation
Nlp Tutorial
A list of NLP(Natural Language Processing) tutorials
Stars: ✭ 1,188 (+970.27%)
Mutual labels:  natural-language-processing, neural-machine-translation
Sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
Stars: ✭ 5,540 (+4890.99%)
Mutual labels:  natural-language-processing, neural-machine-translation
Opennmt Py
Open Source Neural Machine Translation in PyTorch
Stars: ✭ 5,378 (+4745.05%)
Mutual labels:  machine-translation, neural-machine-translation
Mt Paper Lists
MT paper lists (by conference)
Stars: ✭ 105 (-5.41%)
Mutual labels:  machine-translation, neural-machine-translation
Nmt Keras
Neural Machine Translation with Keras
Stars: ✭ 501 (+351.35%)
Mutual labels:  machine-translation, neural-machine-translation
String To Tree Nmt
Source code and data for the paper "Towards String-to-Tree Neural Machine Translation"
Stars: ✭ 16 (-85.59%)
Mutual labels:  natural-language-processing, machine-translation
Tf Seq2seq
Sequence to sequence learning using TensorFlow.
Stars: ✭ 387 (+248.65%)
Mutual labels:  natural-language-processing, neural-machine-translation
Neuralmonkey
An open-source tool for sequence learning in NLP built on TensorFlow.
Stars: ✭ 400 (+260.36%)
Mutual labels:  machine-translation, neural-machine-translation
Fasttext multilingual
Multilingual word vectors in 78 languages
Stars: ✭ 1,067 (+861.26%)
Mutual labels:  natural-language-processing, machine-translation
OPUS-MT

Tools and resources for open translation services

This repository includes two setups: a Tornado-based web application and a WebSocket-based service for Linux (both described below).

There are also scripts for training models, but those are currently only useful in the computing environment used by the University of Helsinki, with CSC as the IT service provider.

Please cite the following paper if you use the OPUS-MT software and models:

@InProceedings{TiedemannThottingal:EAMT2020,
  author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
  title = {{OPUS-MT} — {B}uilding open translation services for the {W}orld},
  booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
  year = {2020},
  address = {Lisbon, Portugal}
 }

Installation of the Tornado-based Web-App

Download the latest version from GitHub:

git clone https://github.com/Helsinki-NLP/Opus-MT.git
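
Then change into the cloned directory (the directory name follows from the repository URL) before running the steps below:

cd Opus-MT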

Option 1: Manual setup

Install Marian MT by following the documentation at https://marian-nmt.github.io/docs/. After the installation, marian-server is expected to be on the PATH; if it is not, place it in /usr/local/bin.

Install the prerequisites; using a virtual environment is recommended:

pip install -r requirements.txt

Download the translation models from https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/models and place them in the models directory.

Then edit services.json to point to those models.

Finally, start the web server:

python server.py

By default, it will use port 8888. Point your browser to localhost:8888 to get the web interface. The languages configured in services.json will be available.
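
Once the server is up, you can also query it programmatically. The following is a minimal sketch in Python; the endpoint path and parameter names (/api/translate, source, target, q) are assumptions for illustration and should be checked against the handlers in server.py:

import requests

# Ask the web app (assumed endpoint) for an English-to-Finnish
# translation of a short sentence.
response = requests.get(
    "http://localhost:8888/api/translate",
    params={"source": "en", "target": "fi", "q": "Hello, world!"},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # the JSON payload should contain the translation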

Option 2: Using Docker

docker-compose up

Then point your browser to localhost:8888.

Configuration

The server.py program accepts a configuration file in JSON format. By default it tries to use config.json in the current directory, but you can supply a custom one with the -c flag.
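
For example, to start the server with a custom configuration file (the file name here is just a placeholder):

python server.py -c my-config.json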

An example configuration file looks like this:

{
    "en": {
        "es": {
            "configuration": "./models/en-es/decoder.yml",
            "host": "localhost",
            "port": "10001"
        },
        "fi": {
            "configuration": "./models/en-fi/decoder.yml",
            "host": "localhost",
            "port": "10002"
        }
    }
}

This example configuration provides MT services for the en->es and en->fi language pairs.

  • configuration points to a YAML file containing the decoder configuration used by marian-server. If this value is not provided, Opus-MT assumes that the service is already running on the remote host and port given in the other options. If a value is provided, a new subprocess running marian-server is spawned (see the sketch after this list).
  • host: the host where the marian-server instance runs.
  • port: the port on which marian-server listens.
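
For illustration, here is a rough Python sketch of how such a configuration could be consumed. This is not the actual server.py logic; it assumes marian-server accepts the --config and --port options described in the Marian documentation:

import json
import subprocess

# Load the language-pair configuration (same structure as the example above).
with open("config.json") as f:
    config = json.load(f)

processes = []
for source, targets in config.items():
    for target, options in targets.items():
        # A "configuration" entry means a local marian-server instance
        # should be spawned on the configured port; otherwise a remote
        # service is assumed to be reachable at the given host and port.
        if "configuration" in options:
            processes.append(subprocess.Popen([
                "marian-server",
                "--config", options["configuration"],
                "--port", options["port"],
            ]))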

Installation of a websocket service on Ubuntu

Another option is to set up translation services using WebSockets and Linux services. Detailed information is available in doc/WebSocketServer.md.

Public MT models

We store public models (CC-BY 4.0 licensed) at https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/models. They should all be compatible with the OPUS-MT services, and you can install them by specifying the language pair. The installation script takes the latest model in that directory. For additional customisation you need to adjust the installation procedures (in the Makefile or elsewhere).

There are also development versions of models, which are often more experimental and of lower quality, but they cover additional language pairs. They can be downloaded from https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/work-spm/models.

Train MT models

There is a Makefile for training new models from OPUS data in the Opus-MT-train repository, but it is heavily customized for the work environment at CSC and for University of Helsinki projects. It will (hopefully) become more generic in the future, so that it can run in other environments and setups as well.

Known issues

  • Most automatic evaluations are made on simple, short sentences from the Tatoeba data collection; those scores will be overly optimistic when the models are run on other, more realistic data sets.
  • Some (older) test results are not reliable because they use software localisation data (namely GNOME system messages) that overlaps heavily with other localisation data (e.g. Ubuntu system messages) included in the training data.
  • All current models are trained without data filtering, data augmentation (such as back-translation), domain adaptation, or other optimisation procedures; there is no quality control besides the automatic evaluation on automatically selected test sets. For some language pairs there are at least benchmark scores from official WMT test sets.
  • Most models are trained for at most 72 hours on 1 or 4 GPUs; not all of them converged before this time limit.
  • Validation and early stopping are based on automatically selected validation data, often from Tatoeba; this validation data is not representative of many applications.

To-Do and wish list

  • more languages and language pairs
  • better and more multilingual models
  • optimize translation performance
  • add backtranslation data
  • domain-specific models
  • GPU enabled container
  • dockerized fine-tuning
  • document-level models
  • load-balancing and other service optimisations
  • public MT service network
  • feedback loop and personalisation

Links and related work

Acknowledgements

The work is supported by the FoTran project, funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 771113), and by the MeMAD project, funded by the European Union's Horizon 2020 Research and Innovation Programme under grant agreement No 780069. We are also grateful for the generous computational resources provided by CSC (IT Center for Science, Finland).
