
danlou / LMMS

Licence: other
Language Modelling Makes Sense - WSD (and more) with Contextual Embeddings

Programming Languages

python

Projects that are alternatives of or similar to LMMS

Hierarchical-Word-Sense-Disambiguation-using-WordNet-Senses
Word Sense Disambiguation using Word Specific models, All word models and Hierarchical models in Tensorflow
Stars: ✭ 33 (-58.23%)
Mutual labels:  wordnet, word-sense-disambiguation
Nlp Tutorial
Natural Language Processing Tutorial for Deep Learning Researchers
Stars: ✭ 9,895 (+12425.32%)
Mutual labels:  paper, bert
ADL2019
Applied Deep Learning (2019 Spring) @ NTU
Stars: ✭ 20 (-74.68%)
Mutual labels:  bert, contextual-embeddings
BERT-embedding
A simple wrapper class for extracting features(embedding) and comparing them using BERT in TensorFlow
Stars: ✭ 24 (-69.62%)
Mutual labels:  bert, contextual-embeddings
ganbert-pytorch
Enhancing the BERT training with Semi-supervised Generative Adversarial Networks in Pytorch/HuggingFace
Stars: ✭ 60 (-24.05%)
Mutual labels:  bert
heinsen routing
Official implementation of "An Algorithm for Routing Capsules in All Domains" (Heinsen, 2019) in PyTorch.
Stars: ✭ 41 (-48.1%)
Mutual labels:  paper
pluGET
📦 Powerful Package manager which updates plugins & server software for minecraft servers
Stars: ✭ 87 (+10.13%)
Mutual labels:  paper
Orion
Mixin loader for Paper
Stars: ✭ 46 (-41.77%)
Mutual labels:  paper
bert-AAD
Adversarial Adaptation with Distillation for BERT Unsupervised Domain Adaptation
Stars: ✭ 27 (-65.82%)
Mutual labels:  bert
MiniVox
Code for our ACML and INTERSPEECH papers: "Speaker Diarization as a Fully Online Bandit Learning Problem in MiniVox".
Stars: ✭ 15 (-81.01%)
Mutual labels:  paper
Text-Summarization-Repo
A repository organizing the main research topics in text summarization, must-read papers, and available models and data, together with recommended resources.
Stars: ✭ 213 (+169.62%)
Mutual labels:  paper
muse-as-service
REST API for sentence tokenization and embedding using Multilingual Universal Sentence Encoder.
Stars: ✭ 45 (-43.04%)
Mutual labels:  bert
sdn-nfv-papers
This is a paper list about Resource Allocation in Network Functions Virtualization (NFV) and Software-Defined Networking (SDN).
Stars: ✭ 40 (-49.37%)
Mutual labels:  paper
SentimentAnalysis
(BOW, TF-IDF, Word2Vec, BERT) Word Embeddings + (SVM, Naive Bayes, Decision Tree, Random Forest) Base Classifiers + Pre-trained BERT on Tensorflow Hub + 1-D CNN and Bi-Directional LSTM on IMDB Movie Reviews Dataset
Stars: ✭ 40 (-49.37%)
Mutual labels:  bert
TiDB-A-Raft-based-HTAP-Database
Unofficial! English original and Chinese translation of the paper.
Stars: ✭ 42 (-46.84%)
Mutual labels:  paper
PhD
Incremental Methods of Deep Learning for Detection and Classifcation in a Robotics Environment
Stars: ✭ 13 (-83.54%)
Mutual labels:  paper
pFedMe
Personalized Federated Learning with Moreau Envelopes (pFedMe) using Pytorch (NeurIPS 2020)
Stars: ✭ 196 (+148.1%)
Mutual labels:  paper
ghiaseddin
Author's implementation of the paper "Deep Relative Attributes" (ACCV 2016)
Stars: ✭ 41 (-48.1%)
Mutual labels:  paper
SportPaper
Performance-tuned Minecraft 1.8 spigot server
Stars: ✭ 122 (+54.43%)
Mutual labels:  paper
Pytorch-NLU
Pytorch-NLU, a Chinese text classification and sequence labeling toolkit. It supports multi-class and multi-label classification of Chinese long and short texts, as well as sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.
Stars: ✭ 151 (+91.14%)
Mutual labels:  bert

Language Modelling Makes Sense (LMMS)

This repository includes the code related to the "LMMS Reloaded: Transformer-based Sense Embeddings for Disambiguation and Beyond" paper.

If you're interested in the code for the original LMMS paper from ACL 2019, switch to the LMMS_ACL19 branch.

This code is designed to use the transformers package (v3.0.2) and the fairseq package (v0.9.0, only for RoBERTa models; more details in the paper).

Table of Contents

  • Installation
  • Download Sense Embeddings
  • Create Sense Embeddings
  • Evaluation
  • Demos
  • References

Installation

Prepare Environment

This project was developed on Python 3.6.5 from the Anaconda distribution (v4.6.2). As such, the pip requirements assume you already have the packages that ship with Anaconda (numpy, etc.). After cloning the repository, we recommend creating and activating a new environment to avoid conflicts with existing installations on your system:

$ git clone https://github.com/danlou/LMMS.git
$ cd LMMS
$ conda create -n LMMS python=3.6.5
$ conda activate LMMS
# $ conda deactivate  # to exit environment when done with project

Additional Packages

To install the additional packages used by this project, run:

$ pip install -r requirements.txt

The WordNet package for NLTK isn't installed by pip, but we can install it easily with:

$ python -c "import nltk; nltk.download('wordnet')"

External Data

If you want to evaluate the sense embeddings on WSD or USM, you need the WSD Evaluation Framework.

$ cd external/wsd_eval  # from repo home
$ wget http://lcl.uniroma1.it/wsdeval/data/WSD_Evaluation_Framework.zip
$ unzip WSD_Evaluation_Framework.zip

For evaluation on the WiC dataset:

$ cd external/wic  # from repo home
$ wget https://pilehvar.github.io/wic/package/WiC_dataset.zip
$ unzip WiC_dataset.zip

Details about downloading GWCS and our WordNet subset of SID will be added soon.

If you want to represent embeddings using annotations from UWA, you must download SemCor+UWA10 from this link, extract the .zip, and place the folder in external/uwa/.

Download Sense Embeddings

You can download the main LMMS-SP embeddings we produced for the paper from the links below.

These sense embeddings should be used with the Transformer model of the same name.

Tasks comparing or combining LMMS-SP embeddings with contextual embeddings also need to use the corresponding sets of layer weights in data/weights/ (specific to each Sense Profile).
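
For illustration, here is a minimal sketch of how such per-layer weights can be applied to contextual embeddings with the transformers package. The weights file is assumed here to contain one whitespace-separated weight per Transformer layer, excluding the input embedding layer; the exact file format is an assumption, not a specification.

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed format: one float per Transformer layer, whitespace-separated.
weights = np.loadtxt("data/weights/lmms-sp-wsd.albert-xxlarge-v2.weights.txt")

tok = AutoTokenizer.from_pretrained("albert-xxlarge-v2")
model = AutoModel.from_pretrained("albert-xxlarge-v2", output_hidden_states=True)

enc = tok("The bank raised interest rates.", return_tensors="pt")
with torch.no_grad():
    # In transformers v3.0.2 the model returns a tuple; the last element holds
    # the hidden states of every layer when output_hidden_states=True.
    hidden_states = model(**enc)[-1]

# Weighted sum over layers (skipping the input embeddings), one vector per token.
layers = torch.stack(hidden_states[1:])            # (num_layers, 1, seq_len, dim)
w = torch.tensor(weights, dtype=layers.dtype).view(-1, 1, 1, 1)
token_vecs = (w * layers).sum(dim=0).squeeze(0)    # (seq_len, dim)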

We distribute the sense embeddings as '.txt' files in the standard GloVe format (a minimal loading sketch follows the list of models below).

Place the downloaded sense embeddings in data/vectors/<model_name>/.

bert-large-cased

xlnet-large-cased

roberta-large

albert-xxlarge-v2
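
Since the files follow the plain GloVe text format (one entry per line: the sensekey followed by whitespace-separated floats), they can be loaded without special tooling. A minimal loading sketch (the filename below is illustrative):

import numpy as np

def load_sense_vectors(path):
    # GloVe format: "<sensekey> v1 v2 ... vN" per line.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

sense_vecs = load_sense_vectors("data/vectors/albert-xxlarge-v2/lmms-sp-wsd.albert-xxlarge-v2.vectors.txt")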

Create Sense Embeddings

The creation of LMMS-SP sense embeddings involves a series of steps, each with a corresponding script.

Below you'll find usage descriptions for all the scripts, along with the exact commands to run to replicate the results in the paper (using albert-xxlarge-v2 as an example).

These steps assume that layer weights have already been determined for each Sense Profile. The create_sense_weights.py script can be used to convert layer performance into weights.
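
As a rough illustration of the idea (an assumption about the conversion, not necessarily the formula used by create_sense_weights.py), per-layer probing scores can be normalised into weights that sum to one:

import numpy as np

# Hypothetical per-layer probing scores (e.g., validation F1 for each layer).
layer_scores = np.array([0.55, 0.61, 0.67, 0.70, 0.72, 0.71,
                         0.69, 0.66, 0.64, 0.62, 0.60, 0.58])

# Normalise scores into weights summing to 1 (a softmax over scores is another option).
layer_weights = layer_scores / layer_scores.sum()
np.savetxt("data/weights/example.weights.txt", layer_weights)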

1. embed_annotations.py - Bootstrap sense embeddings from annotated corpora

Usage description.

$ python scripts/embed_annotations.py -h
usage: embed_annotations.py [-h] [-nlm_id NLM_ID]
                            [-sense_level {synset,sensekey}]
                            [-weights_path WEIGHTS_PATH]
                            [-eval_fw_path EVAL_FW_PATH] -dataset
                            {semcor,semcor_uwa10} [-batch_size BATCH_SIZE]
                            [-max_seq_len MAX_SEQ_LEN]
                            [-subword_op {mean,first,sum}] [-layers LAYERS]
                            [-layer_op {mean,max,sum,concat,ws}]
                            [-max_instances MAX_INSTANCES] -out_path OUT_PATH

Create sense embeddings from annotated corpora.

optional arguments:
  -h, --help            show this help message and exit
  -nlm_id NLM_ID        HF Transformers model name (default: bert-large-cased)
  -sense_level {synset,sensekey}
                        Representation Level (default: sensekey)
  -weights_path WEIGHTS_PATH
                        Path to layer weights (default: )
  -eval_fw_path EVAL_FW_PATH
                        Path to WSD Evaluation Framework (default:
                        external/wsd_eval/WSD_Evaluation_Framework/)
  -dataset {semcor,semcor_uwa10}
                        Name of dataset (default: semcor)
  -batch_size BATCH_SIZE
                        Batch size (default: 16)
  -max_seq_len MAX_SEQ_LEN
                        Maximum sequence length (default: 512)
  -subword_op {mean,first,sum}
                        Subword Reconstruction Strategy (default: mean)
  -layers LAYERS        Relevant NLM layers (default: -1 -2 -3 -4)
  -layer_op {mean,max,sum,concat,ws}
                        Operation to combine layers (default: sum)
  -max_instances MAX_INSTANCES
                        Maximum number of examples for each sense (default:
                        inf)
  -out_path OUT_PATH    Path to resulting vector set (default: None)

Example usage:

$ python scripts/embed_annotations.py -nlm_id albert-xxlarge-v2 -sense_level sensekey -dataset semcor_uwa10 -weights_path data/weights/lmms-sp-wsd.albert-xxlarge-v2.weights.txt -layer_op ws -out_path data/vectors/sc_uwa10-sp-wsd.albert-xxlarge-v2.vectors.txt

To represent synsets instead of sensekeys, you may use the option '-sense_level synset'.
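
Conceptually, this step pools the contextual embeddings of every sense-annotated token into a single vector per sensekey by averaging. A minimal sketch of that pooling, assuming contextual vectors have already been computed for each annotated token (e.g. with a weighted sum over NLM layers, as with the '-layer_op ws' option):

from collections import defaultdict
import numpy as np

def average_by_sense(annotated_tokens):
    # annotated_tokens: iterable of (sensekey, contextual_vector) pairs.
    sums, counts = {}, defaultdict(int)
    for sensekey, vec in annotated_tokens:
        sums[sensekey] = sums.get(sensekey, 0) + vec
        counts[sensekey] += 1
    return {sk: sums[sk] / counts[sk] for sk in sums}

# Toy example; real vectors come from the NLM's contextual embeddings.
toy = [("earth%1:17:00::", np.ones(4)), ("earth%1:17:00::", np.zeros(4))]
print(average_by_sense(toy))  # {'earth%1:17:00::': array([0.5, 0.5, 0.5, 0.5])}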

2. extend_sensekeys.py - Propagate supervised representations (from annotations) through WordNet

Usage description.

$ python scripts/extend_sensekeys.py -h
usage: extend_sensekeys.py [-h] -sup_sv_path SUP_SV_PATH
                           [-ext_mode {synset,hypernym,lexname}] -out_path
                           OUT_PATH

Propagates supervised sense embeddings through WordNet.

optional arguments:
  -h, --help            show this help message and exit
  -sup_sv_path SUP_SV_PATH
                        Path to supervised sense vectors
  -ext_mode {synset,hypernym,lexname}
                        Max abstraction level
  -out_path OUT_PATH    Path to resulting extended vector set

Example usage:

$ python scripts/extend_sensekeys.py -sup_sv_path data/vectors/sc_uwa10-sp-wsd.albert-xxlarge-v2.vectors.txt -ext_mode lexname -out_path data/vectors/sc_uwa10-extended-sp-wsd.albert-xxlarge-v2.vectors.txt

To extend synsets instead of sensekeys, use the extend_synsets.py script in a similar fashion.
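
The propagation strategy follows the original LMMS paper: sensekeys missing from the annotated corpora inherit averaged embeddings from sensekeys that share their synset, then their hypernyms, then their lexname. A rough sketch of the synset-level pass using NLTK (the function name and fallback handling are illustrative, not the script's API):

import numpy as np
from nltk.corpus import wordnet as wn

def propagate_synset_level(sense_vecs):
    # For sensekeys without vectors, use the average of vectors from the same synset.
    extended = dict(sense_vecs)
    for synset in wn.all_synsets():
        keys = [lemma.key() for lemma in synset.lemmas()]
        known = [sense_vecs[k] for k in keys if k in sense_vecs]
        if not known:
            continue  # these would be covered by later hypernym / lexname passes
        synset_vec = np.mean(known, axis=0)
        for k in keys:
            extended.setdefault(k, synset_vec)
    return extended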

3. embed_glosses.py - Create sense embeddings based on WordNet's glosses and lemmas

Usage description.

$ python scripts/embed_glosses.py -h
usage: embed_glosses.py [-h] [-nlm_id NLM_ID] [-sense_level {synset,sensekey}]
                        [-subword_op {mean,first,sum}] [-layers LAYERS]
                        [-layer_op {mean,sum,concat,ws}]
                        [-weights_path WEIGHTS_PATH] [-batch_size BATCH_SIZE]
                        [-max_seq_len MAX_SEQ_LEN] -out_path OUT_PATH

Creates sense embeddings based on glosses and lemmas.

optional arguments:
  -h, --help            show this help message and exit
  -nlm_id NLM_ID        HF Transformers model name
  -sense_level {synset,sensekey}
                        Representation Level
  -subword_op {mean,first,sum}
                        Subword Reconstruction Strategy
  -layers LAYERS        Relevant NLM layers
  -layer_op {mean,sum,concat,ws}
                        Operation to combine layers
  -weights_path WEIGHTS_PATH
                        Path to layer weights
  -batch_size BATCH_SIZE
                        Batch size
  -max_seq_len MAX_SEQ_LEN
                        Maximum sequence length
  -out_path OUT_PATH    Path to resulting vector set

Example usage:

$ python scripts/embed_glosses.py -nlm_id albert-xxlarge-v2 -sense_level sensekey -weights_path data/weights/lmms-sp-wsd.albert-xxlarge-v2.weights.txt -layer_op ws -out_path data/vectors/glosses-sp-wsd.albert-xxlarge-v2.vectors.txt

To represent synsets instead of sensekeys, you may use the option '-sense_level synset'.

For a better understanding of what strings we're actually composing to generate these sense embeddings, here are a few examples:

Sensekey (sk) | Embedded String (sk's lemma, all lemmas, tokenized gloss)
earth%1:17:00:: | earth - Earth , earth , world , globe - the 3rd planet from the sun ; the planet we live on
globe%1:17:00:: | globe - Earth , earth , world , globe - the 3rd planet from the sun ; the planet we live on
disturb%2:37:00:: | disturb - disturb , upset , trouble - move deeply
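
A sketch of how such a string can be composed from WordNet with NLTK, following the examples above (the exact template and tokenization used by embed_glosses.py may differ):

from nltk.corpus import wordnet as wn

def gloss_string(sensekey):
    # Compose "<sk's lemma> - <all synset lemmas> - <gloss>" for a given sensekey.
    lemma = wn.lemma_from_key(sensekey)
    synset = lemma.synset()
    all_lemmas = " , ".join(l.name().replace("_", " ") for l in synset.lemmas())
    # Note: the scripts also tokenize the gloss (spaces around punctuation), omitted here.
    return "%s - %s - %s" % (lemma.name().replace("_", " "), all_lemmas, synset.definition())

print(gloss_string("earth%1:17:00::"))
# earth - Earth , earth , world , globe - the 3rd planet from the sun; the planet we live on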

4. merge_avg.py - Merging gloss and extended representations

Usage description.

$ python scripts/merge_avg.py -h
usage: merge_avg.py [-h] -v1_path V1_PATH -v2_path V2_PATH [-v3_path V3_PATH]
                    -out_path OUT_PATH

Averages and normalizes vector .txt files.

optional arguments:
  -h, --help          show this help message and exit
  -v1_path V1_PATH    Path to vector set 1
  -v2_path V2_PATH    Path to vector set 2
  -v3_path V3_PATH    Path to vector set 3. Missing vectors are imputed from
                      v2 (optional)
  -out_path OUT_PATH  Path to resulting vector set

Example usage:

$ python scripts/merge_avg.py -v1_path data/vectors/sc_uwa10-extended-sp-wsd.albert-xxlarge-v2.vectors.txt -v2_path data/vectors/glosses-sp-wsd.albert-xxlarge-v2.vectors.txt -out_path data/vectors/lmms-sp-wsd.albert-xxlarge-v2.vectors.txt
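
The merging operation itself is simple; a minimal sketch (illustrative, not merge_avg.py itself) of averaging two vector sets and L2-normalising the result, shown here only for keys present in both sets:

import numpy as np

def merge_avg(vecs1, vecs2):
    # Average two {key: vector} dicts and L2-normalise each merged vector.
    merged = {}
    for key in set(vecs1) & set(vecs2):
        avg = (vecs1[key] + vecs2[key]) / 2.0
        merged[key] = avg / np.linalg.norm(avg)
    return merged

a = {"earth%1:17:00::": np.array([1.0, 0.0])}
b = {"earth%1:17:00::": np.array([0.0, 1.0])}
print(merge_avg(a, b))  # {'earth%1:17:00::': array([0.70710678, 0.70710678])}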

Evaluation

Each of the 5 tasks tackled in the paper has its own evaluation script in evaluation/.

We refer to the start of each evaluation script for example usage and more details.

Demos

To make application to downstream tasks easier, we also prepared demonstration files showcasing bare-bones applications of LMMS-SP for disambiguation and matching using WordNet (a minimal sketch of the underlying nearest-neighbour matching follows the list below).

  • demo_disambiguation.py: Loads a Transformer model, LMMS SP-WSD sense embeddings, and spaCy (for lemmatization and POS-tagging), and applies them to disambiguate a particular word in an example sentence.
  • demo_matching.py: Loads a Transformer model and LMMS SP-USM sense embeddings, and applies them to match sensekeys and synsets to a particular word/span in an example sentence.
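
At its core, this matching is nearest-neighbour search: the contextual embedding of the target word is compared against the sense embeddings of its candidate senses, and the closest candidate wins. A minimal sketch of that step (assuming the contextual and sense vectors have already been computed and share the same dimensionality):

import numpy as np
from nltk.corpus import wordnet as wn

def disambiguate(context_vec, lemma, pos, sense_vecs):
    # Return the candidate sensekey whose embedding is closest (cosine) to context_vec.
    candidates = [l.key() for s in wn.synsets(lemma, pos=pos) for l in s.lemmas()
                  if l.name().lower() == lemma.lower() and l.key() in sense_vecs]

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    return max(candidates, key=lambda sk: cosine(context_vec, sense_vecs[sk]))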

References

Artificial Intelligence Journal (AIJ)

Current version featuring Sense Profiles, probing analysis, and extensive evaluation (ScienceDirect, arXiv (preprint)).

@article{LOUREIRO2022103661,
title = {LMMS reloaded: Transformer-based sense embeddings for disambiguation and beyond},
journal = {Artificial Intelligence},
volume = {305},
pages = {103661},
year = {2022},
issn = {0004-3702},
doi = {https://doi.org/10.1016/j.artint.2022.103661},
url = {https://www.sciencedirect.com/science/article/pii/S0004370222000017},
author = {Daniel Loureiro and Alípio {Mário Jorge} and Jose Camacho-Collados},
keywords = {Semantic representations, Neural language models},
abstract = {Distributional semantics based on neural approaches is a cornerstone of Natural Language Processing, with surprising connections to human meaning representation as well. Recent Transformer-based Language Models have proven capable of producing contextual word representations that reliably convey sense-specific information, simply as a product of self-supervision. Prior work has shown that these contextual representations can be used to accurately represent large sense inventories as sense embeddings, to the extent that a distance-based solution to Word Sense Disambiguation (WSD) tasks outperforms models trained specifically for the task. Still, there remains much to understand on how to use these Neural Language Models (NLMs) to produce sense embeddings that can better harness each NLM's meaning representation abilities. In this work we introduce a more principled approach to leverage information from all layers of NLMs, informed by a probing analysis on 14 NLM variants. We also emphasize the versatility of these sense embeddings in contrast to task-specific models, applying them on several sense-related tasks, besides WSD, while demonstrating improved performance using our proposed approach over prior work focused on sense embeddings. Finally, we discuss unexpected findings regarding layer and model performance variations, and potential applications for downstream tasks.}
}

ACL 2019

The original LMMS paper (ACL Anthology, arXiv).

@inproceedings{loureiro-jorge-2019-language,
    title = "Language Modelling Makes Sense: Propagating Representations through {W}ord{N}et for Full-Coverage Word Sense Disambiguation",
    author = "Loureiro, Daniel  and
      Jorge, Al{\'\i}pio",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1569",
    doi = "10.18653/v1/P19-1569",
    pages = "5682--5691"
}

EMNLP 2020

Where we improve LMMS sense embeddings using automatic annotations for unambiguous words (UWA corpus) (ACL Anthology, arXiv).

@inproceedings{loureiro-camacho-collados-2020-dont,
    title = "Don{'}t Neglect the Obvious: On the Role of Unambiguous Words in Word Sense Disambiguation",
    author = "Loureiro, Daniel  and
      Camacho-Collados, Jose",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.283",
    doi = "10.18653/v1/2020.emnlp-main.283",
    pages = "3514--3520"
}

SemDeep-5 at IJCAI 2019

Application of LMMS for the Word-in-Context (WiC) Challenge (ACL Anthology, arXiv).

@inproceedings{loureiro-jorge-2019-liaad,
    title = "{LIAAD} at {S}em{D}eep-5 Challenge: Word-in-Context ({W}i{C})",
    author = "Loureiro, Daniel  and
      Jorge, Al{\'\i}pio",
    booktitle = "Proceedings of the 5th Workshop on Semantic Deep Learning (SemDeep-5)",
    month = aug,
    year = "2019",
    address = "Macau, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-5801",
    pages = "1--5",
}