
txsun1997 / CoLAKE

License: MIT
COLING'2020: CoLAKE: Contextualized Language and Knowledge Embedding

Programming Languages

python
shell

Projects that are alternatives of or similar to CoLAKE

IEAJKE
Code and data for our paper "Iterative Entity Alignment via Joint Knowledge Embeddings"
Stars: ✭ 43 (-50%)
Mutual labels:  knowledge-graph, knowledge-embedding
gpt-j-api
API for the GPT-J language model 🦜. Including a FastAPI backend and a streamlit frontend
Stars: ✭ 248 (+188.37%)
Mutual labels:  language-model
Romanian-Transformers
This repo is the home of Romanian Transformers.
Stars: ✭ 60 (-30.23%)
Mutual labels:  language-model
minGPT-TF
A minimal TF2 re-implementation of the OpenAI GPT training
Stars: ✭ 36 (-58.14%)
Mutual labels:  language-model
news-graph
Key information extraction from text and graph visualization
Stars: ✭ 83 (-3.49%)
Mutual labels:  knowledge-graph
query completion
Personalized Query Completion
Stars: ✭ 24 (-72.09%)
Mutual labels:  language-model
yang-db
YANGDB Open-source, Scalable, Non-native Graph database (Powered by Elasticsearch)
Stars: ✭ 92 (+6.98%)
Mutual labels:  knowledge-graph
GGNN Reasoning
PyTorch implementation for Graph Gated Neural Network (for Knowledge Graphs)
Stars: ✭ 34 (-60.47%)
Mutual labels:  knowledge-graph
LM-CNLC
Chinese Natural Language Correction via Language Model
Stars: ✭ 15 (-82.56%)
Mutual labels:  language-model
semantic-python-overview
(subjective) overview of projects which are related both to python and semantic technologies (RDF, OWL, Reasoning, ...)
Stars: ✭ 406 (+372.09%)
Mutual labels:  knowledge-graph
neno
NENO is a note-taking app that helps you create your personal knowledge graph.
Stars: ✭ 65 (-24.42%)
Mutual labels:  knowledge-graph
PCPM
Presenting Collection of Pretrained Models. Links to pretrained models in NLP and voice.
Stars: ✭ 21 (-75.58%)
Mutual labels:  language-model
gdc
Code for the ICLR 2021 paper "A Distributional Approach to Controlled Text Generation"
Stars: ✭ 94 (+9.3%)
Mutual labels:  language-model
Capricorn
Provides powerful NLP capabilities and a low-code way to build chatbots
Stars: ✭ 14 (-83.72%)
Mutual labels:  knowledge-graph
skipchunk
Extracts a latent knowledge graph from text and indexes/queries it in Elasticsearch or Solr
Stars: ✭ 18 (-79.07%)
Mutual labels:  knowledge-graph
amie plus
AMIE+ association rule mining
Stars: ✭ 24 (-72.09%)
Mutual labels:  knowledge-graph
wechsel
Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.
Stars: ✭ 39 (-54.65%)
Mutual labels:  language-model
ChineseTextAnalysisResouce
A collection of resources for Chinese text analysis
Stars: ✭ 71 (-17.44%)
Mutual labels:  knowledge-graph
KR-EAR
Knowledge Representation Learning with Entities, Attributes and Relations
Stars: ✭ 109 (+26.74%)
Mutual labels:  knowledge-embedding
Shukongdashi
An expert system for fault diagnosis in the CNC (computer numerical control) domain, built in Python using knowledge graphs, natural language processing, and convolutional neural networks
Stars: ✭ 109 (+26.74%)
Mutual labels:  knowledge-graph

CoLAKE

Source code for the paper "CoLAKE: Contextualized Language and Knowledge Embedding". If you have any problems reproducing the experiments, please feel free to contact us or open an issue.

Prepare your environment

We recommend creating a new environment.

conda create --name colake python=3.7
source activate colake

CoLAKE is built on fastNLP and Hugging Face's transformers, and uses fitlog to record experiments.

git clone https://github.com/fastnlp/fastNLP.git
cd fastNLP/ && python setup.py install && cd ..
git clone https://github.com/fastnlp/fitlog.git
cd fitlog/ && python setup.py install && cd ..
pip install transformers==2.11
pip install scikit-learn

To re-train CoLAKE, you may need mixed CPU-GPU training to handle the large number of entities. Our implementation is based on the KVStore provided by DGL. In addition, to reproduce the experiments on link prediction, you may also need DGL-KE.

pip install dgl==0.4.3
pip install dglke
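
The idea behind mixed CPU-GPU training is to keep the large entity embedding table (over 3M rows) in CPU memory and move only the rows needed by the current batch onto the GPU. Below is a minimal PyTorch sketch of this idea; it is an illustration, not the repo's DGL KVStore implementation, and the sizes and names are placeholders.

import torch

# Illustrative sizes only; CoLAKE's entity table has more than 3M rows.
NUM_ENTITIES, DIM = 3_000_000, 200

# The large embedding table stays in CPU memory ...
ent_emb = torch.nn.Embedding(NUM_ENTITIES, DIM, sparse=True)

# ... and only the rows needed by the current batch are moved to the GPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def lookup(entity_ids: torch.Tensor) -> torch.Tensor:
    rows = ent_emb(entity_ids)                 # gather on CPU
    return rows.to(device, non_blocking=True)  # ship only this batch to the GPU

batch = torch.randint(0, NUM_ENTITIES, (128,))
print(lookup(batch).shape)                     # torch.Size([128, 200])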

Reproduce the experiments

1. Download the model and entity embeddings

Download the pre-trained CoLAKE model and the embeddings for more than 3M entities. To reproduce the experiments on LAMA and LAMA-UHN, you only need to download the model. You can use download_gdrive.py in this repo to download files from Google Drive directly to your server:

mkdir model
python download_gdrive.py 1MEGcmJUBXOyxKaK6K88fZFyj_IbH9U5b ./model/model.bin
python download_gdrive.py 1_FG9mpTrOnxV2NolXlu1n2ihgSZFXHnI ./model/entities.npy

Alternatively, you can use gdown:

pip install gdown
gdown https://drive.google.com/uc?id=1MEGcmJUBXOyxKaK6K88fZFyj_IbH9U5b
gdown https://drive.google.com/uc?id=1_FG9mpTrOnxV2NolXlu1n2ihgSZFXHnI
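
As a quick sanity check, you can load the two downloaded files with NumPy and PyTorch. This is a minimal sketch; the paths follow the download_gdrive.py commands above, and what model.bin contains is not inspected here.

import numpy as np
import torch

ent_emb = np.load('./model/entities.npy')                          # embedding matrix for >3M entities
checkpoint = torch.load('./model/model.bin', map_location='cpu')   # CoLAKE weights

print('entity embedding matrix:', ent_emb.shape)
print('checkpoint object:', type(checkpoint))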

2. Run the experiments

Download the datasets for the experiments in the paper from Google Drive:

python download_gdrive.py 1UNXICdkB5JbRyS5WTq6QNX4ndpMlNob6 ./data.tar.gz
tar -xzvf data.tar.gz
cd finetune/

FewRel

python run_re.py --debug --gpu 0

Open Entity

python run_typing.py --debug --gpu 0

LAMA and LAMA-UHN

cd ../lama/
python eval_lama.py

Re-train CoLAKE

1. Download the data

Download the latest wiki dump (XML format):

wget -c https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Download the knowledge graph (Wikidata5M):

wget -c https://www.dropbox.com/s/6sbhm0rwo4l73jq/wikidata5m_transductive.tar.gz?dl=1
tar -xzvf wikidata5m_transductive.tar.gz

Download the Wikidata5M entity & relation aliases:

wget -c https://www.dropbox.com/s/lnbhc8yuhit4wm5/wikidata5m_alias.tar.gz?dl=1
tar -xzvf wikidata5m_alias.tar.gz
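
The extracted Wikidata5M files are plain text: each line of the triple files is a tab-separated (head, relation, tail) triple of Wikidata IDs, and each line of the alias files is an ID followed by its tab-separated aliases. Below is a minimal parsing sketch; the file names are assumed from the archives above and may need adjusting.

def read_triples(path):
    """Yield (head, relation, tail) Wikidata IDs from a tab-separated triple file."""
    with open(path, encoding='utf-8') as f:
        for line in f:
            head, rel, tail = line.rstrip('\n').split('\t')
            yield head, rel, tail

def read_aliases(path):
    """Map a Wikidata ID to its list of aliases."""
    aliases = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split('\t')
            aliases[parts[0]] = parts[1:]
    return aliases

triples = list(read_triples('wikidata5m_transductive_train.txt'))
entity_alias = read_aliases('wikidata5m_entity.txt')
print(len(triples), len(entity_alias))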

2. Preprocess the data

Preprocess wiki dump:

mkdir pretrain_data
# process xml-format wiki dump
python preprocess/WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 -o pretrain_data/output -l --min_text_length 100 --filter_disambig_pages -it abbr,b,big --processes 4
# Modify anchors
python preprocess/extract.py 4
python preprocess/gen_data.py 4
# Count entity & relation frequency and generate vocabs
python statistic.py

3. Train CoLAKE

Initialize the entity and relation embeddings with the average of the RoBERTa BPE embeddings of the entity and relation aliases:

cd pretrain/
python init_ent_rel.py
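
As a minimal sketch of this initialization (an illustration of the idea, not the repo's init_ent_rel.py), each entity or relation vector is the mean of RoBERTa's input BPE embeddings over the tokens of its alias:

import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')
bpe_emb = model.get_input_embeddings().weight          # (vocab_size, hidden_dim)

def alias_embedding(alias: str) -> torch.Tensor:
    """Average the RoBERTa BPE embeddings of the alias tokens."""
    token_ids = tokenizer.encode(alias, add_special_tokens=False)
    return bpe_emb[token_ids].mean(dim=0)              # (hidden_dim,)

print(alias_embedding('Barack Obama').shape)           # torch.Size([768])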

Train CoLAKE with mixed CPU-GPU:

./run_pretrain.sh

Cite

If you use the code and model, please cite this paper:

@inproceedings{sun2020colake,
  author = {Tianxiang Sun and Yunfan Shao and Xipeng Qiu and Qipeng Guo and Yaru Hu and Xuanjing Huang and Zheng Zhang},
  title = {CoLAKE: Contextualized Language and Knowledge Embedding},
  booktitle = {Proceedings of the 28th International Conference on Computational Linguistics, {COLING}},
  year = {2020}
}
