thunlp / sememe_prediction

License: MIT
Codes for Lexical Sememe Prediction via Word Embeddings and Matrix Factorization (IJCAI 2017).

Programming Languages

python
C++
shell

Labels

Projects that are alternatives to or similar to sememe_prediction

CLSP
Code and data for EMNLP 2018 paper "Cross-lingual Lexical Sememe Prediction"
Stars: ✭ 19 (-67.8%)
Mutual labels:  sememe
Character-enhanced-Sememe-Prediction
Code accompanying Incorporating Chinese Characters of Words for Lexical Sememe Prediction (ACL2018) https://arxiv.org/abs/1806.06349
Stars: ✭ 22 (-62.71%)
Mutual labels:  sememe
SE-WRL-SAT
Revised Version of SAT Model in "Improved Word Representation Learning with Sememes"
Stars: ✭ 46 (-22.03%)
Mutual labels:  sememe
BabelNet-Sememe-Prediction
Code and data of the AAAI-20 paper "Towards Building a Multilingual Sememe Knowledge Base: Predicting Sememes for BabelNet Synsets"
Stars: ✭ 18 (-69.49%)
Mutual labels:  sememe
SDLM-pytorch
Code accompanying EMNLP 2018 paper Language Modeling with Sparse Product of Sememe Experts
Stars: ✭ 27 (-54.24%)
Mutual labels:  sememe

Sememe Prediction

The code for "Lexical Sememe Prediction via Word Embeddings and Matrix Factorization" (IJCAI 2017).

Running Requirement

Memory: at least 8GB; 16GB or more is recommended.

Storage: at least 15GB; 20GB or more is recommended.

Dependency: Python 3.5 or later, with NumPy installed.
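A quick way to confirm your interpreter meets these requirements (a hypothetical sanity check, not part of the repository):

```python
import sys

import numpy as np

# Verify the stated requirements: Python >= 3.5 with NumPy importable.
assert sys.version_info >= (3, 5), "Python 3.5 or later is required"
print("numpy", np.__version__, "found")
```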

How to Run

  1. Prepare a file containing pre-trained Chinese word embeddings (in Google Word2Vec text format). We recommend a vocabulary of at least 200,000 words and at least 200 dimensions. Training your embeddings on a large corpus (20GB or more is recommended) yields much better results. For example, SogouT (password: f2ul).

  2. Rename the word embedding file to 'embedding_200.txt' and place it in the project's root directory.
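For reference, a minimal sketch of the expected file layout, assuming the standard Word2Vec text format: a header line with the vocabulary size and dimension, then one word and its space-separated vector per line. The tiny toy file written here is illustrative only:

```python
import numpy as np

# Write a toy embedding file in Word2Vec text format (2 words, 3 dimensions).
with open("toy_embedding.txt", "w", encoding="utf-8") as f:
    f.write("2 3\n")
    f.write("apple 0.1 0.2 0.3\n")
    f.write("pear 0.4 0.5 0.6\n")

# Parse it back the way a loader for embedding_200.txt would.
embeddings = {}
with open("toy_embedding.txt", encoding="utf-8") as f:
    vocab_size, dim = map(int, f.readline().split())
    for line in f:
        parts = line.split()
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float64)

assert len(embeddings) == vocab_size
print(embeddings["pear"].shape)  # (3,)
```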

(For application users: prepare a file named "hownet.txt_test" listing all target words, one word per line, in UTF-8 encoding without BOM. Then run application_SPWE[SPSE/SPASE].sh; you can skip the following instructions and find the results in output_SPWE[SPSE/SPASE].)
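For instance, building a three-word hownet.txt_test could look like this (the words are arbitrary examples; the commented command is the actual entry point):

```shell
# Create a target-word list, one word per line, UTF-8 without BOM.
printf '%s\n' 苹果 电脑 科学 > hownet.txt_test
wc -l < hownet.txt_test

# Then, for the SPWE model:
#   bash application_SPWE.sh   # predictions appear in output_SPWE
```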

  1. Run data_generator.sh; it will automatically generate the evaluation data set and the other data files required during training.

  2. Run SPSE.sh/SPWE.sh/SPASE.sh. The corresponding model will be trained and evaluated automatically. (The SPASE model takes a long time to train: on a typical machine with a 12-core CPU, about 3 days. For better performance, we suggest rewriting it in C++; Model.cpp serves as a simple example.)

  3. Run Ensemble_Model.sh after you have run SPSE.sh and SPWE.sh.

(Please check Ensemble_Model.sh for more information on running other combinations of models; only combining 2 models at a time is supported.)
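The steps above chain naturally. A sketch of the intended order, with a hypothetical `run` placeholder standing in for the repository's scripts so the sequence can be shown on its own:

```shell
# 'run' is a stand-in; replace each call with e.g. `bash SPWE.sh` in a real run.
# Chaining with '&&' stops the pipeline as soon as one step fails.
run() { echo "running $1"; }
run data_generator.sh && run SPWE.sh && run SPSE.sh && run Ensemble_Model.sh
```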

Data Set

hownet.txt is a Chinese knowledge base with annotated word-sense-sememe information.

Evaluation Set

After you have run data_generator.sh, you will find the files 'hownet.txt_test' and 'hownet.txt_answer' in the directory; together they form the evaluation set. Its size is 10% of the portion of embedding_200.txt that is annotated in hownet.txt, and it is chosen by random sampling.
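Conceptually, the split works like the following sketch (toy data; the real generator operates on the words shared by embedding_200.txt and hownet.txt):

```python
import random

# Toy stand-in for the words that appear in embedding_200.txt and are
# annotated in hownet.txt.
annotated = [f"word{i}" for i in range(100)]

# Hold out a random 10% as the evaluation set; keep the rest for training.
random.seed(0)
held_out = set(random.sample(annotated, k=len(annotated) // 10))
train = [w for w in annotated if w not in held_out]
print(len(held_out), len(train))  # 10 90
```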

Result Files

Feel free to inspect the files whose names start with 'output_'; they contain the sememe predictions for the evaluation set.

You can also use Python's pickle library to load the files whose names start with 'model_'. For more information, please refer to Ensemble_model.py.
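Loading works with plain pickle. Here a stand-in dictionary is dumped and reloaded the same way; the actual structure of the 'model_' files is defined by the training scripts (see Ensemble_model.py), not assumed here:

```python
import pickle

# Write a stand-in model file, then reload it with pickle.load.
stub = {"sememe_scores": {"apple": [0.9, 0.1]}}
with open("model_demo.pkl", "wb") as f:
    pickle.dump(stub, f)

with open("model_demo.pkl", "rb") as f:
    model = pickle.load(f)

print(model["sememe_scores"]["apple"])  # [0.9, 0.1]
```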

Cite

If you use the code, please cite this paper:

Ruobing Xie, Xingchi Yuan, Zhiyuan Liu, Maosong Sun. Lexical Sememe Prediction via Word Embeddings and Matrix Factorization. The 26th International Joint Conference on Artificial Intelligence (IJCAI 2017).
