
CPJKU / wechsel

License: MIT
Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.

Programming Languages

Python
139,335 projects; #7 most used programming language

Projects that are alternatives to or similar to wechsel

backprop
Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
Stars: ✭ 229 (+487.18%)
Mutual labels:  transformers, transfer-learning, language-model, bert
Haystack
🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.
Stars: ✭ 3,409 (+8641.03%)
Mutual labels:  transformers, transfer-learning, language-model, bert
Tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Stars: ✭ 5,077 (+12917.95%)
Mutual labels:  transformers, language-model, bert
ParsBigBird
Persian BERT for long-range sequences
Stars: ✭ 58 (+48.72%)
Mutual labels:  transformers, transfer-learning, bert
Clue
Chinese Language Understanding Evaluation Benchmark (CLUE): datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+6117.95%)
Mutual labels:  transformers, language-model, bert
oreilly-bert-nlp
This repository contains code for the O'Reilly Live Online Training for BERT
Stars: ✭ 19 (-51.28%)
Mutual labels:  transformers, transfer-learning, bert
policy-data-analyzer
Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.
Stars: ✭ 22 (-43.59%)
Mutual labels:  transformers, bert
erc
Emotion recognition in conversation
Stars: ✭ 34 (-12.82%)
Mutual labels:  transformers, bert
Fast Bert
Super easy library for BERT-based NLP models
Stars: ✭ 1,678 (+4202.56%)
Mutual labels:  transformers, bert
Nlp Architect
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks
Stars: ✭ 2,768 (+6997.44%)
Mutual labels:  transformers, bert
bangla-bert
Bangla-Bert is a pretrained BERT model for the Bengali language
Stars: ✭ 41 (+5.13%)
Mutual labels:  transformers, bert
Spark Nlp
State of the Art Natural Language Processing
Stars: ✭ 2,518 (+6356.41%)
Mutual labels:  transformers, bert
COCO-LM
[NeurIPS 2021] COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining
Stars: ✭ 109 (+179.49%)
Mutual labels:  transformers, language-model
HugsVision
HugsVision is an easy-to-use HuggingFace wrapper for state-of-the-art computer vision
Stars: ✭ 154 (+294.87%)
Mutual labels:  transformers, bert
nlp workshop odsc europe20
Extensive tutorials for the Advanced NLP Workshop at the Open Data Science Conference Europe 2020. We leverage machine learning, deep learning and deep transfer learning to solve popular NLP tasks including NER, classification, recommendation / information retrieval, summarization, language translation, Q&A and T…
Stars: ✭ 127 (+225.64%)
Mutual labels:  transformers, transfer-learning
Pytorch Sentiment Analysis
Tutorials on getting started with PyTorch and TorchText for sentiment analysis.
Stars: ✭ 3,209 (+8128.21%)
Mutual labels:  transformers, bert
KB-ALBERT
A Korean ALBERT model specialized for the economics/finance domain, provided by KB Kookmin Bank
Stars: ✭ 215 (+451.28%)
Mutual labels:  transformers, language-model
Romanian-Transformers
This repo is the home of Romanian Transformers.
Stars: ✭ 60 (+53.85%)
Mutual labels:  language-model, bert
classy
classy is a simple-to-use library for building high-performance Machine Learning models in NLP.
Stars: ✭ 61 (+56.41%)
Mutual labels:  transformers, bert
OpenDialog
An open-source package for Chinese open-domain conversational chatbots (a Chinese chit-chat dialogue system with one-click deployment as a WeChat chatbot)
Stars: ✭ 94 (+141.03%)
Mutual labels:  transformers, bert

WECHSEL

Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models, to appear in NAACL 2022.

arXiv: https://arxiv.org/abs/2112.06598

Models from the paper are available on the HuggingFace Hub.
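They load like any other transformers checkpoint. A minimal sketch (the model ID below is illustrative; check the Hub for the exact names published with the paper):

from transformers import AutoModel, AutoTokenizer

# illustrative model ID; see the HuggingFace Hub for the models
# actually published with the paper
tokenizer = AutoTokenizer.from_pretrained("benjamin/roberta-base-wechsel-swahili")
model = AutoModel.from_pretrained("benjamin/roberta-base-wechsel-swahili")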

Installation

We distribute a Python package via PyPI:

pip install wechsel

Alternatively, clone the repository, install the dependencies from requirements.txt, and use the code in wechsel/ directly; a sketch, assuming the repository URL matches the project name above:
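
git clone https://github.com/CPJKU/wechsel
cd wechsel
pip install -r requirements.txt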

Example usage

Transferring English roberta-base to Swahili:

import torch
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset
from wechsel import WECHSEL, load_embeddings

# load the English source model and its tokenizer
source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

# train a Swahili tokenizer with the same vocabulary size on the OSCAR corpus
target_tokenizer = source_tokenizer.train_new_from_iterator(
    load_dataset("oscar", "unshuffled_deduplicated_sw", split="train")["text"],
    vocab_size=len(source_tokenizer)
)

# set up WECHSEL with fastText embeddings for both languages and an
# English-Swahili bilingual dictionary
wechsel = WECHSEL(
    load_embeddings("en"),
    load_embeddings("sw"),
    bilingual_dictionary="swahili"
)

# initialize embeddings for the new target vocabulary from the source embeddings
target_embeddings, info = wechsel.apply(
    source_tokenizer,
    target_tokenizer,
    model.get_input_embeddings().weight.detach().numpy(),
)

# swap the transferred embeddings into the model
model.get_input_embeddings().weight.data = torch.from_numpy(target_embeddings)

# use `model` and `target_tokenizer` to continue training in Swahili!
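
For example, a minimal sketch of persisting the transferred model and tokenizer so pretraining can pick them up (the output directory name is illustrative; for masked-language-model pretraining you would typically reload the checkpoint with AutoModelForMaskedLM, which initializes a fresh LM head):

# save the transferred model and tokenizer ("roberta-base-sw" is an
# illustrative output directory)
model.save_pretrained("roberta-base-sw")
target_tokenizer.save_pretrained("roberta-base-sw")

# reload with an LM head for masked-language-model pretraining;
# the head weights are newly initialized
from transformers import AutoModelForMaskedLM
mlm_model = AutoModelForMaskedLM.from_pretrained("roberta-base-sw")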

Bilingual dictionaries

We distribute 3276 bilingual dictionaries from English to other languages for use with WECHSEL in dicts/.
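The bilingual_dictionary argument in the example above selects one of these by name. As a hedged sketch (assuming the package also accepts a file path and that the files live under dicts/data/; the exact layout and behaviour may differ), you could point WECHSEL at a dictionary file directly:

from wechsel import WECHSEL, load_embeddings

# hypothetical: pass a dictionary file by path instead of by name
# (the path and the path-accepting behaviour are assumptions, not confirmed API)
wechsel = WECHSEL(
    load_embeddings("en"),
    load_embeddings("sw"),
    bilingual_dictionary="dicts/data/swahili.txt",
)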

Citation

Please cite WECHSEL as

@misc{minixhofer2021wechsel,
      title={WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models}, 
      author={Benjamin Minixhofer and Fabian Paischer and Navid Rekabsaz},
      year={2021},
      eprint={2112.06598},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Acknowledgments

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). We thank Andy Koh and Artus Krohn-Grimberghe for providing additional computational resources. The ELLIS Unit Linz, the LIT AI Lab, and the Institute for Machine Learning are supported by the Federal State of Upper Austria. We acknowledge support by the project INCONTROL-RL (FFG-881064).
