
jxhe / cross-lingual-struct-flow

License: MIT
PyTorch implementation of ACL paper https://arxiv.org/abs/1906.02656

Programming Languages

Python
139335 projects - #7 most used programming language
Jupyter Notebook
11667 projects
Shell
77523 projects

Projects that are alternatives of or similar to cross-lingual-struct-flow

nlp-cheat-sheet-python
NLP Cheat Sheet, Python, spacy, LexNLP, NLTK, tokenization, stemming, sentence detection, named entity recognition
Stars: ✭ 69 (+200%)
Mutual labels:  pos-tagging, dependency-parsing
datalinguist
Stanford CoreNLP in idiomatic Clojure.
Stars: ✭ 93 (+304.35%)
Mutual labels:  pos-tagging, dependency-parsing
DiffuseVAE
A combination of VAE's and Diffusion Models for efficient, controllable and high-fidelity generation from low-dimensional latents
Stars: ✭ 81 (+252.17%)
Mutual labels:  latent-variable-models
sticker2
Further developed as SyntaxDot: https://github.com/tensordot/syntaxdot
Stars: ✭ 14 (-39.13%)
Mutual labels:  dependency-parsing
CLSP
Code and data for EMNLP 2018 paper "Cross-lingual Lexical Sememe Prediction"
Stars: ✭ 19 (-17.39%)
Mutual labels:  cross-lingual
sinling
A collection of NLP tools for Sinhalese (සිංහල).
Stars: ✭ 38 (+65.22%)
Mutual labels:  pos-tagging
sequence labeling tf
Sequence Labeling in Tensorflow
Stars: ✭ 18 (-21.74%)
Mutual labels:  pos-tagging
lava
Latent Variable Models in R https://kkholst.github.io/lava/
Stars: ✭ 28 (+21.74%)
Mutual labels:  latent-variable-models
deepnlp
An NLP practice project from my early days
Stars: ✭ 11 (-52.17%)
Mutual labels:  dependency-parsing
exams-qa
A Multi-subject High School Examinations Dataset for Cross-lingual and Multilingual Question Answering
Stars: ✭ 25 (+8.7%)
Mutual labels:  cross-lingual
TweebankNLP
[LREC 2022] An off-the-shelf pre-trained Tweet NLP Toolkit (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Tweebank-NER dataset
Stars: ✭ 84 (+265.22%)
Mutual labels:  pos-tagging
SynThai
Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning
Stars: ✭ 41 (+78.26%)
Mutual labels:  pos-tagging
gum
Repository for the Georgetown University Multilayer Corpus (GUM)
Stars: ✭ 71 (+208.7%)
Mutual labels:  pos-tagging
Paribhasha
paribhasha.herokuapp.com/
Stars: ✭ 21 (-8.7%)
Mutual labels:  pos-tagging
mixed-language-training
Attention-Informed Mixed-Language Training for Zero-shot Cross-lingual Task-oriented Dialogue Systems (AAAI-2020)
Stars: ✭ 29 (+26.09%)
Mutual labels:  cross-lingual
LatentDiffEq.jl
Latent Differential Equations models in Julia.
Stars: ✭ 34 (+47.83%)
Mutual labels:  latent-variable-models
syntaxnet
Syntaxnet Parsey McParseface wrapper for POS tagging and dependency parsing
Stars: ✭ 77 (+234.78%)
Mutual labels:  pos-tagging
BiaffineDependencyParsing
BERT+Self-attention Encoder ; Biaffine Decoder ; Pytorch Implement
Stars: ✭ 67 (+191.3%)
Mutual labels:  dependency-parsing
Cross-Lingual-MRC
Cross-Lingual Machine Reading Comprehension (EMNLP 2019)
Stars: ✭ 66 (+186.96%)
Mutual labels:  cross-lingual
wink-nlp
Developer friendly Natural Language Processing ✨
Stars: ✭ 312 (+1256.52%)
Mutual labels:  pos-tagging

Cross-lingual structured flow model for zero-shot syntactic transfer

This is a PyTorch implementation of the paper:

Cross-Lingual Syntactic Transfer through Unsupervised Adaptation of Invertible Projections
Junxian He, Zhisong Zhang, Taylor Berg-Kirkpatrick, Graham Neubig
ACL 2019

The structured flow model is a generative model that can be trained in a supervised fashion on labeled data in another language, and then adapted with unsupervised training that directly maximizes the likelihood of the target language. In this way, it is able to transfer shared linguistic knowledge from the source language while also learning language-specific knowledge from the unlabeled target language.
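
The paragraph above can be made concrete with a toy, self-contained PyTorch sketch of the two objectives (illustrative only, not the repo's training code; it omits the Markov-structured prior and the invertible projection described in the paper): supervised training on the labeled source language maximizes the joint log-likelihood log p(x, y) of word embeddings and tags, while unsupervised adaptation on the target language maximizes the marginal log-likelihood log p(x) = log sum_y p(x, y).

# Toy sketch of the two training objectives; all names and shapes are illustrative.
import torch
import torch.nn as nn

class ToyGenerativeTagger(nn.Module):
    def __init__(self, num_tags, emb_dim):
        super().__init__()
        self.tag_logits = nn.Parameter(torch.zeros(num_tags))      # prior over tags
        self.means = nn.Parameter(torch.randn(num_tags, emb_dim))  # per-tag Gaussian means
        self.log_std = nn.Parameter(torch.zeros(emb_dim))          # shared diagonal std

    def log_joint(self, x, y):
        # log p(x, y) for word embeddings x of shape (B, D) and observed tags y of shape (B,)
        prior = torch.log_softmax(self.tag_logits, dim=-1)[y]
        emission = torch.distributions.Normal(self.means[y], self.log_std.exp())
        return prior + emission.log_prob(x).sum(-1)

    def log_marginal(self, x):
        # log p(x) = logsumexp_y log p(x, y), used when tags are unobserved
        prior = torch.log_softmax(self.tag_logits, dim=-1)                 # (T,)
        emission = torch.distributions.Normal(self.means.unsqueeze(0),     # (1, T, D)
                                              self.log_std.exp())
        ll = emission.log_prob(x.unsqueeze(1)).sum(-1)                     # (B, T)
        return torch.logsumexp(prior + ll, dim=-1)

model = ToyGenerativeTagger(num_tags=17, emb_dim=300)
x_src, y_src = torch.randn(8, 300), torch.randint(0, 17, (8,))   # labeled source batch
x_tgt = torch.randn(8, 300)                                      # unlabeled target batch
supervised_loss = -model.log_joint(x_src, y_src).mean()          # maximize log p(x, y)
unsupervised_loss = -model.log_marginal(x_tgt).mean()            # maximize log p(x)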

Please contact [email protected] if you have any questions.

Requirements

  • Python >= 3.6
  • PyTorch >= 0.4

Additional requirements can be installed via:

pip install -r requirements.txt

Data

Download Universal Dependencies 2.2 here (ud-treebanks-v2.2.tgz), put the file ud-treebanks-v2.2.tgz into the top-level directory of this repo, and run:

$ tar -xvzf ud-treebanks-v2.2.tgz
$ rm ud-treebanks-v2.2.tgz
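
As an optional sanity check (not part of the original instructions), you can confirm the extraction with a few lines of Python; this assumes the archive unpacks into a ud-treebanks-v2.2/ directory with one folder per treebank, e.g. UD_English-EWT.

# Count the extracted treebanks and check that the English EWT treebank is present.
from pathlib import Path

root = Path("ud-treebanks-v2.2")
treebanks = sorted(p.name for p in root.iterdir() if p.is_dir())
print(f"{len(treebanks)} treebanks extracted")
print("UD_English-EWT present:", "UD_English-EWT" in treebanks)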

Prepare Embeddings

fastText

The fastText embeddings can be downloaded from the Facebook fastText repo. Note that the fastText repo provides several versions of pretrained fastText embeddings; the embeddings must be downloaded from the given link, since the alignment matrices we use (from here) were learned on this specific version of the fastText embeddings. Download the fastText model bin file and put it into the fastText_data folder.

Taking English as an example, download and preprocess the fastText embeddings as follows:

$ cd fastText_data
$ wget https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.zip
$ unzip wiki.en.zip
$ cd ..

# create a subset of embedding dict for faster embedding loading and memory efficiency
$ python scripts/compress_vec_dict.py --lang en

$ rm fastText_data/wiki.en.vec
$ rm fastText_data/wiki.en.zip

The argument to --lang is the language's short code; the list of short codes and their corresponding languages is given in statistics/lang_list.txt.
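
Before deleting wiki.en.vec, you can optionally peek at it to confirm the download: the .vec files are plain text with a "num_words dim" header followed by one "word v1 ... v300" line per word. This snippet is an optional check, not part of the repo's pipeline, and assumes the English file downloaded by the commands above.

# Read the header and the first few embeddings from a fastText .vec file.
import numpy as np

def peek_vec(path, n=3):
    with open(path, encoding="utf-8") as f:
        num_words, dim = map(int, f.readline().split())
        print(f"{num_words} words, {dim} dimensions")
        for _ in range(n):
            parts = f.readline().rstrip().split(" ")
            word, vec = parts[0], np.asarray(parts[1:], dtype=np.float32)
            print(word, vec.shape)

peek_vec("fastText_data/wiki.en.vec")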

multilingual BERT (mBERT)

$ CUDA_VISIBLE_DEVICES=xx python scripts/create_cwr.py --lang [language code]

This command pre-computes the BERT contextualized word representations with pytorch-pretrained-BERT for each sentence in the corresponding treebank. The embeddings are saved as hdf5 files in bert-base-multilingual-cased (created automatically). The first time it is executed, this command also downloads and caches the pretrained multilingual BERT model.
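
Once the script has finished, the resulting hdf5 file can be inspected with h5py to check how many items were stored and the shape of each representation. The file name below is a hypothetical example (the script writes into bert-base-multilingual-cased/), and each stored item is assumed to be a dataset of shape (num_tokens, hidden_dim).

# Inspect a precomputed contextualized-representation file (file name is hypothetical).
import h5py

with h5py.File("bert-base-multilingual-cased/en_ewt-ud-train.hdf5", "r") as f:
    keys = list(f.keys())
    print(f"{len(keys)} stored items")
    first = f[keys[0]][()]                   # load the first stored array
    print("example shape:", first.shape)     # e.g. (num_tokens, hidden_dim)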

POS Tagging

Several training scripts are provided (note that the supervised training scripts must be run before the unsupervised ones):

# supervised training on English with fastText
# [gpu_id] is an integer number
$ ./scripts/run_supervised_tagger.sh [gpu_id]

# unsupervised training on other languages with fastText
$ ./scripts/run_unsupervised_tagger.sh [gpu_id] [language code]




# supervised training on English with mBERT
$ ./scripts/run_supervised_bert_tagger.sh [gpu_id]

# unsupervised training on other languages with mBERT
$ ./scripts/run_unsupervised_bert_tagger.sh [gpu_id] [language code]

Trained models and logs are saved in outputs/tagging.

Dependency Parsing

Several training scripts are provided (note that the supervised training scripts must be run before the unsupervised ones):

# supervised training on English with fastText
# [gpu_id] is an integer number
$ ./scripts/run_supervised_parser.sh [gpu_id]

# unsupervised training on distant languages with fastText
$ ./scripts/run_unsupervised_parser_distant.sh [gpu_id] [language code]

# unsupervised training on nearby languages with fastText
$ ./scripts/run_unsupervised_parser_nearby.sh [gpu_id] [language code]




# supervised training on English with mBERT
# [gpu_id] is an integer number
$ ./scripts/run_supervised_bert_parser.sh [gpu_id]

# unsupervised training on distant languages with mBERT
$ ./scripts/run_unsupervised_bert_parser_distant.sh [gpu_id] [language code]

# unsupervised training on nearby languages with mBERT
$ ./scripts/run_unsupervised_bert_parser_nearby.sh [gpu_id] [language code]

Trained models and logs are saved in outputs/parsing.

Acknowledgement

This project would not be possible without the URIEL linguistic database, pre-computed fastText alignment matrix, Google's pretrained multilingual BERT model, and the huggingface transformers.

Reference

@inproceedings{he19acl,
    title = {Cross-Lingual Syntactic Transfer through Unsupervised Adaptation of Invertible Projections},
    author = {Junxian He and Zhisong Zhang and Taylor Berg-Kirkpatrick and Graham Neubig},
    booktitle = {Proceedings of ACL},
    year = {2019}
}
