anoopkunchukuttan / geomm

License: GPL-3.0
Geometry-aware Multilingual Embeddings

Languages: Python, shell

Geometry-aware Multilingual Embedding

Code for learning multilingual embeddings using the method reported in:

Pratik Jawanpuria, Arjun Balgovind, Anoop Kunchukuttan, Bamdev Mishra. Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach. Transactions of the Association for Computational Linguistics (TACL), Volume 7, pp. 107-120, 2019.

Environment Setup

Do the following steps in order:

  1. Clone the repository

  2. Create a Python virtual environment without TensorFlow (if TensorFlow is present, Pymanopt raises out-of-memory errors).

  3. pip install numpy scipy ipdb

  4. pip install git+https://github.com/pymanopt/pymanopt.git --upgrade

  5. In the Pymanopt code (located at C:\Anaconda\envs\ENVRNMT_NAME\Lib\site-packages\pymanopt\tools\autodiff on Windows, or the Linux equivalent), add the parameter allow_input_downcast=True to the theano.function calls at lines 46, 49, 101, and 104.

  6. conda install theano pygpu

  7. In Users\USER_NAME, create a file named .theanorc.txt with the following content:

     [global]
     device = cuda
     floatX = float32
    
  8. Install cupy based on your CUDA version

  9. Two GPUs are needed
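
Step 7 can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name `write_theanorc` and the choice to pass the target path explicitly are assumptions made for illustration and are not part of this repository. It writes the same three settings listed in step 7.

```python
import configparser
import tempfile
from pathlib import Path


def write_theanorc(path):
    """Write the Theano settings from step 7 to `path` (hypothetical helper)."""
    config = configparser.ConfigParser()
    # Keep option names case-sensitive; by default configparser would
    # lowercase "floatX" to "floatx".
    config.optionxform = str
    config["global"] = {"device": "cuda", "floatX": "float32"}
    path = Path(path)
    with path.open("w") as handle:
        config.write(handle)
    return path


if __name__ == "__main__":
    # Demo: write into a temporary directory and show the result,
    # so that no existing configuration file is overwritten.
    with tempfile.TemporaryDirectory() as tmp:
        target = write_theanorc(Path(tmp) / ".theanorc.txt")
        print(target.read_text())
```

To create the real file, one would pass the location from step 7 instead, for example `write_theanorc(Path.home() / ".theanorc.txt")`.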

Note: When using this setup with Pymanopt, import CuPy before importing Theano. Theano sometimes throws an error saying that it cannot find the correct CUDA version; importing CuPy first fixes the issue.
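
The import order from the note can be made explicit at the top of a script. The sketch below is illustrative only: `import_in_order` is a hypothetical helper, not part of this repository or of Pymanopt. It imports the named modules in sequence and reports any that cannot be imported.

```python
import importlib


def import_in_order(module_names):
    """Import modules one by one, in the given order (hypothetical helper).

    Returns a dict mapping each name to the imported module, or to None
    if the module could not be imported.
    """
    loaded = {}
    for name in module_names:
        try:
            loaded[name] = importlib.import_module(name)
        except ImportError as exc:
            print(f"Could not import {name}: {exc}")
            loaded[name] = None
    return loaded


if __name__ == "__main__":
    # CuPy first, then Theano, as the note recommends.
    modules = import_in_order(["cupy", "theano"])
```

In an ordinary script the same effect is obtained by placing `import cupy` on the line before `import theano`.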

Datasets

The datasets can be downloaded by running the following commands in vecmap_data/ and muse_data/, respectively:

./get_vecmap_data.sh
./get_muse_data.sh

Reproducing Results

The results that have been reported in Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach can be reproduced by running the following scripts:

  • Results of the GeoMM algorithm reported in Tables 1, 2, and 6:

      ./geomm_results.sh
    
  • Results of the GeoMM-Multi algorithm reported in Tables 1, 2, and 6:

      ./geomm_multi_results.sh
    
  • Results of the GeoMM-Semi algorithm reported in Table 7:

      ./geomm_semi_results.sh
    

Note: Because our code uses CUDA and FP32 precision, it may not be possible to reproduce our results exactly, owing to minor numerical variations in GPU operations. The effect on the final results is negligible: the variations we have observed usually lie within an error margin of 0.1 or 0.2.

Note: We have added geomm_optimized.py, which can replace geomm.py in all use cases. It reduces the time taken for the en-es pair from 188.5 seconds to 6.5 seconds.

GeoMM Embeddings

We provide GeoMM bilingual and multilingual embeddings. These are normalized embeddings in the latent space. The embeddings are made available under the Creative Commons Attribution-NonCommercial 4.0 International License.
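
If "normalized" is taken to mean unit length (an assumption here), the dot product of two embeddings equals their cosine similarity, so a nearest-neighbour lookup across languages reduces to comparing dot products. The following dependency-free sketch illustrates the idea with made-up toy vectors; the words, values, and function names are invented for illustration and are not taken from the released files.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def nearest_neighbour(query, candidates):
    """Return the candidate word whose vector has the largest dot product with `query`.

    For unit-length vectors the dot product is the cosine similarity.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return max(candidates, key=lambda word: dot(query, candidates[word]))


# Toy 3-dimensional vectors, invented for illustration only.
english = {
    "dog": normalize([0.9, 0.1, 0.0]),
    "house": normalize([0.0, 0.2, 0.9]),
}
spanish = {
    "perro": normalize([0.8, 0.2, 0.1]),
    "casa": normalize([0.1, 0.1, 0.9]),
}

if __name__ == "__main__":
    for word, vector in english.items():
        print(word, "->", nearest_neighbour(vector, spanish))
```

With real embeddings the same lookup would normally be vectorized, for example with NumPy matrix products, instead of looping in pure Python.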

MUSE Dataset

These embeddings have been trained jointly using en-XX MUSE bilingual dictionaries and Wikipedia FastText embeddings.

de en es fr ru zh

VecMap Dataset

These embeddings have been trained jointly using en-XX bilingual dictionaries and embeddings from the VecMap dataset.

de en es fi it

English-Indian language bilingual embeddings

These bilingual embeddings have been trained using the CommonCrawl+Wikipedia FastText Embeddings and the MUSE bilingual dictionaries.

en-hi en-bn en-ta

Acknowledgements

The data-processing part of our code was taken from Mikel Artetxe's VecMap repository.

References

Please cite Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach if you found the resources in this repository useful.

@article{jawanpuria2018learning,
  title={Learning multilingual word embeddings in latent metric space: a geometric approach},
  author={Jawanpuria, Pratik and Balgovind, Arjun and Kunchukuttan, Anoop and Mishra, Bamdev},
  journal={Transactions of the Association for Computational Linguistics (TACL)},
  volume={7},
  pages={107--120},
  year={2019}
}