nlpub / hyperstar

Licence: MIT license
Hyperstar: Negative Sampling Improves Hypernymy Extraction Based on Projection Learning.

Hyperstar: Negative Sampling Improves Hypernymy Extraction Based on Projection Learning

We present a new approach to the extraction of hypernyms based on projection learning and word embeddings. In contrast to classification-based approaches, projection-based methods require no candidate hyponym-hypernym pairs. While it is natural to use both positive and negative training examples in supervised relation extraction, the impact of negative examples on hypernym prediction has not been studied so far. In this paper, we show that explicit negative examples used for regularization of the model significantly improve performance compared to the state-of-the-art approach of Fu et al. (2014) on three datasets from different languages.

This repository contains the implementation of our approach, called Hyperstar. The dataset produced in our study is available on Zenodo for both English and Russian.

Citation

In case this software, the study, or the dataset was useful for you, please cite our EACL 2017 paper.

@inproceedings{Ustalov:17:eacl,
  author    = {Ustalov, Dmitry and Arefyev, Nikolay and Biemann, Chris and Panchenko, Alexander},
  title     = {{Negative Sampling Improves Hypernymy Extraction Based on Projection Learning}},
  booktitle = {Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume~2, Short Papers},
  series    = {EACL~2017},
  month     = {April},
  year      = {2017},
  address   = {Valencia, Spain},
  publisher = {Association for Computational Linguistics},
  pages     = {543--550},
  isbn      = {978-1-945626-35-7},
  doi       = {10.18653/v1/E17-2087},
  language  = {english},
}

Reproducibility

To reproduce our experimental results, you need dictionaries (LexNet for English, Parsed Wiktionary for Russian) and word embeddings (Google News for English, Russian Distributional Thesaurus for Russian). Since our implementation uses Python 3 and TensorFlow 0.12, please install them, too.

python3 -m venv venv
./venv/bin/pip3 install -r requirements.txt

We prepared the Docker image nlpub/hyperstar that contains the necessary dependencies for running our software. However, we recommend using a virtualenv instead.

Please make sure you specify the correct word embedding model in every invocation of the Hyperstar scripts.

Preparation

The input dictionaries should be transformed into the format used by Hyperstar, and words for which no embedding is available should be excluded. This is achieved by running the ./dictionary.en.py script for English and ./dictionary.ru.py for Russian. Then, the word embeddings should be dumped for further processing using the ./prepare.py script. These scripts might take a significant amount of time, but they are executed only once. Finally, the vector space should be separated into clusters using the ./cluster.py -k 1 script, where any number of clusters can be specified instead of 1. Clustering was found to be very useful for improving the results, and this step is required: it is not possible to proceed without it.
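The filtering part of this step can be illustrated with a minimal sketch. It is not the code of ./dictionary.en.py or ./dictionary.ru.py; the function name filter_pairs, the pair layout, and the toy vocabulary are assumptions made for illustration only.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (hyponym, hypernym)


def filter_pairs(pairs: Iterable[Pair], vocabulary: Set[str]) -> List[Pair]:
    """Keep only the pairs in which both words have an embedding."""
    return [(hypo, hyper) for hypo, hyper in pairs
            if hypo in vocabulary and hyper in vocabulary]


# Hypothetical toy data: "wug" has no embedding, so its pair is dropped.
vocabulary = {"cat", "animal", "oak", "tree"}
pairs = [("cat", "animal"), ("oak", "tree"), ("wug", "animal")]

kept = filter_pairs(pairs, vocabulary)
print(kept)
```

In a real run, the vocabulary would come from the word embedding model passed to the scripts rather than from a hand-written set.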

Training

The original approach by Fu et al. (2014) learns a matrix that transforms an input hyponym vector into its hypernym vector. This approach is implemented as a baseline, while Hyperstar features various regularized approaches:

  • baseline, the original approach
  • regularized_hyponym that penalizes the matrix for transforming the hypernyms back to the hyponyms
  • regularized_synonym that penalizes the matrix for transforming the hypernyms back to the synonyms of the hyponyms
  • regularized_hypernym that promotes the matrix for transforming the hyponym synonyms to the hypernyms
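To make the difference between baseline and regularized_hyponym more concrete, here is a minimal sketch in plain Python. The exact form of the penalty (the squared dot product between the twice-projected hyponym and the hyponym itself) is an assumption based on the description above, not a copy of the repository's TensorFlow code. The helper names and the toy matrices are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project(x: Vector, phi: Matrix) -> Vector:
    """Row vector times matrix: x @ phi."""
    return [sum(x[i] * phi[i][j] for i in range(len(x)))
            for j in range(len(phi[0]))]


def dot(a: Vector, b: Vector) -> float:
    return sum(p * q for p, q in zip(a, b))


def baseline_loss(x: Vector, y: Vector, phi: Matrix) -> float:
    """Squared distance between the projected hyponym and its hypernym."""
    return sum((p - q) ** 2 for p, q in zip(project(x, phi), y))


def regularized_hyponym_loss(x: Vector, y: Vector, phi: Matrix,
                             lambdac: float = 0.10) -> float:
    """Baseline loss plus an assumed penalty for mapping back to the hyponym."""
    back = project(project(x, phi), phi)  # apply the matrix twice
    return baseline_loss(x, y, phi) + lambdac * dot(back, x) ** 2


x, y = [1.0, 0.0], [0.0, 1.0]        # toy hyponym and hypernym vectors
swap = [[0.0, 1.0], [1.0, 0.0]]      # maps x to y, but also y back to x
one_way = [[0.0, 1.0], [0.0, 0.0]]   # maps x to y, and y to zero

print(baseline_loss(x, y, swap), regularized_hyponym_loss(x, y, swap))
print(baseline_loss(x, y, one_way), regularized_hyponym_loss(x, y, one_way))
```

Both toy matrices fit the pair perfectly under the baseline loss, but only the symmetric one is penalized by the regularizer, which is the behaviour the list above describes. Plain lists keep the sketch dependency-free; real training would use a numerical library.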

The training script, ./train.py, accepts the following parameters:

  • --model=MODEL, where MODEL is the desired approach described above
  • --gpu=1 that tells the program to use a GPU when possible
  • --num_epochs=300 that specifies the number of training epochs
  • --batch_size=2048 that specifies the batch size
  • --stddev=0.01 that specifies the standard deviation for initializing the transformation matrices
  • --lambdac=0.10 that specifies the regularization coefficient
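When several runs differ only in one parameter, the invocations can be generated programmatically. The sketch below only builds argument lists and does not run anything. The helper train_command is hypothetical, and treating the values shown in the list above as defaults is an assumption.

```python
from typing import List


def train_command(model: str, gpu: int = 1, num_epochs: int = 300,
                  batch_size: int = 2048, stddev: float = 0.01,
                  lambdac: float = 0.10) -> List[str]:
    """Build the argument list for one ./train.py run (does not execute it)."""
    return [
        "./train.py",
        f"--model={model}",
        f"--gpu={gpu}",
        f"--num_epochs={num_epochs}",
        f"--batch_size={batch_size}",
        f"--stddev={stddev}",
        f"--lambdac={lambdac}",
    ]


# One command per regularization coefficient; the values are illustrative.
commands = [train_command("regularized_hyponym", lambdac=value)
            for value in (0.05, 0.10, 0.50)]

for command in commands:
    print(" ".join(command))
```

Each list can then be passed to subprocess.run(command, check=True) from the repository directory.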

The trained models are written to MODEL.k%d.trained files, one file per cluster, where %d is the cluster number. The data for further evaluation are written to the MODEL.test.npz file.
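The naming pattern makes it easy to collect the per-cluster files of one model. The following sketch relies only on the MODEL.k%d.trained pattern stated above. The file names in the toy list are hypothetical, and whether cluster numbering starts at 0 or 1 is not assumed.

```python
import re
from typing import Dict, Iterable

TRAINED = re.compile(r"^(?P<model>.+)\.k(?P<cluster>\d+)\.trained$")


def models_by_cluster(filenames: Iterable[str], model: str) -> Dict[int, str]:
    """Map each cluster number to the trained file of the given model."""
    found = {}
    for name in filenames:
        match = TRAINED.match(name)
        if match and match.group("model") == model:
            found[int(match.group("cluster"))] = name
    return found


# Hypothetical directory listing after training two models.
files = [
    "baseline.k1.trained",
    "baseline.k2.trained",
    "baseline.test.npz",
    "regularized_hyponym.k1.trained",
]

print(models_by_cluster(files, "baseline"))
```

In practice the list would come from os.listdir() on the directory where ./train.py was run.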

Evaluation

The evaluation script requires a previously trained model: ./evaluate.py path-to-the-trained-model. It is also possible to study how well (usually poorly) the unmodified embeddings represent the subsumptions by running ./identity.py.
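The intuition behind comparing against unmodified embeddings can be sketched with toy data. This is not the code of ./evaluate.py or ./identity.py. Ranking candidates by cosine similarity, the two-dimensional vectors, and the "learned" matrix are all assumptions chosen for illustration.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return sum(p * q for p, q in zip(a, b)) / norm


def project(x: Vector, phi: List[Vector]) -> Vector:
    """Row vector times matrix: x @ phi."""
    return [sum(x[i] * phi[i][j] for i in range(len(x)))
            for j in range(len(phi[0]))]


def rank(query: Vector, candidates: Dict[str, Vector]) -> List[str]:
    """Order candidate words from most to least similar to the query."""
    return sorted(candidates,
                  key=lambda word: cosine(query, candidates[word]),
                  reverse=True)


# Hypothetical embeddings: "kitten" is close to "cat", "animal" is not.
embeddings = {"cat": [1.0, 0.1], "kitten": [0.9, 0.2], "animal": [0.1, 1.0]}
candidates = {w: v for w, v in embeddings.items() if w != "cat"}

identity = [[1.0, 0.0], [0.0, 1.0]]  # leaves the hyponym vector unchanged
learned = [[0.0, 1.0], [0.0, 0.0]]   # hypothetical trained projection

by_identity = rank(project(embeddings["cat"], identity), candidates)
by_learned = rank(project(embeddings["cat"], learned), candidates)
print(by_identity, by_learned)
```

With the identity matrix the nearest candidate is a word similar to the hyponym itself, while the toy projection moves the query towards the hypernym. This is one way to read why unmodified embeddings tend to score poorly on this task.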

It is possible to reuse our post-processing scripts for parameter tuning (./enumerate.sh), evaluation log parsing (./parse-logs.awk sz100-validation.log >sz100-validation.tsv), and data visualization (R --no-save <evaluate.R).

Copyright

Copyright (c) 2016–2017 Dmitry Ustalov and others. See LICENSE for details.
