
thunlp / Character-enhanced-Sememe-Prediction

License: MIT
Code accompanying Incorporating Chinese Characters of Words for Lexical Sememe Prediction (ACL 2018): https://arxiv.org/abs/1806.06349


Character-enhanced-Sememe-Prediction

Table of contents

  1. Introduction
  2. Usage
  3. References

Introduction

The code for Incorporating Chinese Characters of Words for Lexical Sememe Prediction (ACL 2018) [1].

Usage

Dependency Requirements

The Python version to be used for each Python file is explicitly designated in the shell scripts.

  1. Python 2.7 (for running the main code)
  2. Python 3 (for converting the pickle files dumped by SPWE and SPSE; required only by CSP.sh)
  3. NumPy > 1.0
  4. To manage your dependency environment, we strongly encourage installing Anaconda.
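The Python 3 step exists because SPWE and SPSE dump their model files under Python 2, whose pickles Python 3 cannot read directly. A minimal sketch of such a conversion is shown below; the function name and protocol choice are illustrative, not the repository's actual script (CSP.sh performs its own conversion):

```python
import pickle

def convert_py2_pickle(src, dst):
    """Load a Python 2 pickle under Python 3 and re-dump it.

    encoding='latin1' lets Python 3 decode the byte strings that
    Python 2 pickles contain; protocol=2 keeps the output readable
    by both Python versions.
    """
    with open(src, "rb") as f:
        obj = pickle.load(f, encoding="latin1")
    with open(dst, "wb") as f:
        pickle.dump(obj, f, protocol=2)
```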

Preparation 

  1. Prepare a file that contains pre-trained Chinese word embeddings (in Google word2vec text format). We recommend a vocabulary of at least 200,000 words and at least 200 dimensions. Training your embeddings on a large corpus (20 GB or more is recommended) yields much better results.

  2. Rename the word embedding file embedding_200.txt and put it in the repository root directory.

mv path/to/file/your_word_vec.txt ./embedding_200.txt
  3. Prepare a file that contains pre-trained Chinese character embeddings (in CWE format; see paper [2] and code). We recommend at least 200 dimensions; likewise, training on a large corpus (20 GB or more is recommended) yields much better results.

  4. Rename the character embedding file char_embedding_200.txt and put it in the repository root directory.

mv path/to/file/your_character_embedding_file.txt ./char_embedding_200.txt
  5. Run data_generator.sh; it will automatically generate the evaluation data set and the other data files required for training.
./data_generator.sh
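Before training, you can sanity-check that an embedding file follows the word2vec text format assumed above: a header line giving vocabulary size and dimension, then one token per line followed by its vector. This checker is an illustrative sketch, not part of the repository:

```python
def check_embedding_file(path, min_words=200000, min_dim=200):
    """Verify a word2vec-text-format embedding file and return (vocab, dim).

    The default thresholds mirror the recommendations above; lower them
    for character embedding files, whose vocabularies are much smaller.
    """
    with open(path, encoding="utf-8") as f:
        vocab_size, dim = map(int, f.readline().split())
        assert vocab_size >= min_words, "too few words"
        assert dim >= min_dim, "embedding dimension too small"
        for i, line in enumerate(f):
            parts = line.rstrip("\n").split(" ")
            # one token plus `dim` float components per line
            assert len(parts) == dim + 1, f"malformed line {i + 2}"
    return vocab_size, dim
```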

Training and Prediction

  1. Run SPWCF.sh / SPCSE.sh; the corresponding model will be automatically learned and evaluated.
./SPWCF.sh
./SPCSE.sh
  2. Our model also needs SPWE and SPSE (see paper [3] and code for details). Run SPWE and SPSE to obtain the model files model_SPWE and model_SPSE, and copy them to the root directory of this repository.
mv path/to/file/model_SPWE ./
mv path/to/file/model_SPSE ./
  3. Run CSP.sh; the corresponding model will be automatically learned and evaluated.
./CSP.sh
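Conceptually, CSP is an ensemble: it combines the sememe scores produced by the word-based models (SPWE/SPSE) with those of the character-based models (SPWCF/SPCSE) and ranks candidate sememes by the combined score [1]. The sketch below illustrates that weighted combination only; the function name and the weight lam are assumptions for illustration, not the repository's actual code or tuned values:

```python
def csp_score(scores_word, scores_char, lam=0.5):
    """Rank sememes by a weighted sum of word- and character-based scores.

    scores_word / scores_char: dicts mapping sememe -> score from the
    word-based and character-based models respectively. Returns sememes
    sorted by lam * word_score + (1 - lam) * char_score, best first.
    """
    sememes = set(scores_word) | set(scores_char)
    combined = {s: lam * scores_word.get(s, 0.0)
                   + (1 - lam) * scores_char.get(s, 0.0)
                for s in sememes}
    return sorted(combined, key=combined.get, reverse=True)
```

In the paper, such an ensemble lets character information recover sememes for rare or out-of-vocabulary words where word embeddings alone are unreliable.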

References

[1] Huiming Jin, Hao Zhu, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Fen Lin, and Leyu Lin. 2018. Incorporating Chinese Characters of Words for Lexical Sememe Prediction. In Proceedings of ACL.

[2] Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huan-Bo Luan. 2015. Joint Learning of Character and Word Embeddings. In Proceedings of IJCAI.

[3] Ruobing Xie, Xingchi Yuan, Zhiyuan Liu, and Maosong Sun. 2017. Lexical Sememe Prediction via Word Embeddings and Matrix Factorization. In Proceedings of IJCAI.
