gnn4dr / Drkg

Licence: apache-2.0
A knowledge graph and a set of tools for drug repurposing

Drug Repurposing Knowledge Graph (DRKG)

Drug Repurposing Knowledge Graph (DRKG) is a comprehensive biological knowledge graph relating genes, compounds, diseases, biological processes, side effects and symptoms. DRKG includes information from six existing databases (DrugBank, Hetionet, GNBR, STRING, IntAct and DGIdb) and data collected from recent publications, particularly those related to COVID-19. It includes 97,238 entities belonging to 13 entity types and 5,874,261 triplets belonging to 107 edge types. Each of these 107 edge types describes a type of interaction between one of the 17 entity-type pairs (multiple types of interaction are possible between the same entity pair), as depicted in the figure below. The repository also includes a set of notebooks showing how to explore and analyze DRKG using statistical methods or machine learning methods such as knowledge graph embedding.

DRKG schema
Figure: Interactions in the DRKG. The number next to an edge indicates the number of relation-types for that entity-pair in DRKG.

Statistics of DRKG

The type-wise distribution of the entities in DRKG and their original data source(s) is shown in the following table.

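| Entity type | DrugBank | GNBR | Hetionet | STRING | IntAct | DGIdb | Bibliography | Total Entities |
|---|---|---|---|---|---|---|---|---|
| Anatomy | - | - | 400 | - | - | - | - | 400 |
| Atc | 4,048 | - | - | - | - | - | - | 4,048 |
| Biological Process | - | - | 11,381 | - | - | - | - | 11,381 |
| Cellular Component | - | - | 1,391 | - | - | - | - | 1,391 |
| Compound | 9,708 | 11,961 | 1,538 | - | 153 | 6,348 | 6,250 | 24,313 |
| Disease | 1,182 | 4,746 | 257 | - | - | - | 33 | 5,103 |
| Gene | 4,973 | 27,111 | 19,145 | 18,316 | 16,321 | 2,551 | 3,181 | 39,220 |
| Molecular Function | - | - | 2,884 | - | - | - | - | 2,884 |
| Pathway | - | - | 1,822 | - | - | - | - | 1,822 |
| Pharmacologic Class | - | - | 345 | - | - | - | - | 345 |
| Side Effect | - | - | 5,701 | - | - | - | - | 5,701 |
| Symptom | - | - | 415 | - | - | - | - | 415 |
| Tax | - | 215 | - | - | - | - | - | 215 |
| Total | 19,911 | 44,033 | 45,279 | 18,316 | 16,474 | 8,899 | 9,464 | 97,238 |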

The following table shows the number of triplets between different entity-type pairs, for DRKG as a whole and for each data source.

| Entity-type pair | DrugBank | GNBR | Hetionet | STRING | IntAct | DGIdb | Bibliography | Total interactions |
|---|---|---|---|---|---|---|---|---|
| (Gene, Gene) | - | 66,722 | 474,526 | 1,496,708 | 254,346 | - | 58,629 | 2,350,931 |
| (Compound, Gene) | 24,801 | 80,803 | 51,429 | - | 1,805 | 26,290 | 25,666 | 210,794 |
| (Disease, Gene) | - | 95,399 | 27,977 | - | - | - | 461 | 123,837 |
| (Atc, Compound) | 15,750 | - | - | - | - | - | - | 15,750 |
| (Compound, Compound) | 1,379,271 | - | 6,486 | - | - | - | - | 1,385,757 |
| (Compound, Disease) | 4,968 | 77,782 | 1,145 | - | - | - | - | 83,895 |
| (Gene, Tax) | - | 14,663 | - | - | - | - | - | 14,663 |
| (Biological Process, Gene) | - | - | 559,504 | - | - | - | - | 559,504 |
| (Disease, Symptom) | - | - | 3,357 | - | - | - | - | 3,357 |
| (Anatomy, Disease) | - | - | 3,602 | - | - | - | - | 3,602 |
| (Disease, Disease) | - | - | 543 | - | - | - | - | 543 |
| (Anatomy, Gene) | - | - | 726,495 | - | - | - | - | 726,495 |
| (Gene, Molecular Function) | - | - | 97,222 | - | - | - | - | 97,222 |
| (Compound, Pharmacologic Class) | - | - | 1,029 | - | - | - | - | 1,029 |
| (Cellular Component, Gene) | - | - | 73,566 | - | - | - | - | 73,566 |
| (Gene, Pathway) | - | - | 84,372 | - | - | - | - | 84,372 |
| (Compound, Side Effect) | - | - | 138,944 | - | - | - | - | 138,944 |
| Total | 1,424,790 | 335,369 | 2,250,197 | 1,496,708 | 256,151 | 26,290 | 84,756 | 5,874,261 |

Download DRKG

To analyze DRKG, you can download it directly with the following command:

wget https://dgl-data.s3-us-west-2.amazonaws.com/dataset/DRKG/drkg.tar.gz

If you use our notebooks provided in this repository, you don't need to download the file manually. The notebooks can automatically download the file for you.

When you untar drkg.tar.gz, you will see the following files:

./drkg.tsv
./entity2src.tsv
./relation_glossary.tsv
./embed
./embed/DRKG_TransE_l2_relation.npy
./embed/relations.tsv
./embed/entities.tsv
./embed/Readme.md
./embed/DRKG_TransE_l2_entity.npy
./embed/mol_contextpred.npy
./embed/mol_masking.npy
./embed/mol_infomax.npy
./embed/mol_edgepred.npy

DRKG dataset

The whole dataset contains four parts:

  • drkg.tsv, a TSV file containing the original DRKG in the format of (h, r, t) triplets.
  • embed, a folder containing the pretrained knowledge graph embeddings, trained using the entire drkg.tsv as the training set, and pretrained GNN-based molecule embeddings computed from molecule SMILES.
  • entity2src.tsv, a file mapping entities in DRKG to their original sources.
  • relation_glossary.tsv, a file containing the glossary of the relations in DRKG and other associated information with sources (if available).
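
As a minimal sketch of working with the triplet file, the snippet below reads a headerless, tab-separated file of (h, r, t) rows and counts distinct entities per entity type. The helper names (`load_triplets`, `entity_type`, `count_entity_types`) are hypothetical, and the `Type::identifier` naming scheme for entities is an assumption that should be checked against the actual contents of drkg.tsv.

```python
from collections import Counter


def load_triplets(path):
    """Read a headerless TSV of (head, relation, tail) rows into a list of tuples."""
    triplets = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 3:  # skip blank or malformed lines
                triplets.append(tuple(parts))
    return triplets


def entity_type(name):
    """Return the part before '::' (ASSUMED naming scheme 'Type::identifier')."""
    return name.split("::", 1)[0]


def count_entity_types(triplets):
    """Count distinct entities per type, looking at both heads and tails."""
    entities = {e for h, _, t in triplets for e in (h, t)}
    return Counter(entity_type(e) for e in entities)


# Example usage (path as listed in the archive contents above):
# triplets = load_triplets("./drkg.tsv")
# print(len(triplets), count_entity_types(triplets))
```

If the naming assumption holds, the per-type counts can be compared with the "Total Entities" column of the entity table.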

Pretrained DRKG embedding

The DRKG embedding is trained using the TransE_l2 model with a dimension size of 400. There are four files:

  • DRKG_TransE_l2_entity.npy, NumPy binary data, storing the entity embedding.
  • DRKG_TransE_l2_relation.npy, NumPy binary data, storing the relation embedding.
  • entities.tsv, mapping from entity_name to entity_id.
  • relations.tsv, mapping from relation_name to relation_id.

To use the pretrained embedding, one can use np.load to load the entity embeddings and relation embeddings separately:

import numpy as np
entity_emb = np.load('./embed/DRKG_TransE_l2_entity.npy')
rel_emb = np.load('./embed/DRKG_TransE_l2_relation.npy')
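
To get from an entity or relation name to its embedding vector, the mapping files can be combined with the loaded arrays. The sketch below assumes each row of entities.tsv and relations.tsv is `name<TAB>id` and that the id is the row index into the corresponding array; `load_id_map` and `lookup` are hypothetical helper names, not part of the repository.

```python
import numpy as np


def load_id_map(tsv_path):
    """Read 'name<TAB>id' rows (ASSUMED column order) into a dict {name: int id}."""
    mapping = {}
    with open(tsv_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:  # skip blank or malformed lines
                mapping[parts[0]] = int(parts[1])
    return mapping


def lookup(emb, id_map, name):
    """Return the embedding row for `name`, assuming ids index rows of `emb`."""
    return emb[id_map[name]]


# Example usage with the files listed above:
# entity_emb = np.load('./embed/DRKG_TransE_l2_entity.npy')
# entity_ids = load_id_map('./embed/entities.tsv')
# some_name = next(iter(entity_ids))        # any entity name present in the file
# vec = lookup(entity_emb, entity_ids, some_name)
```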

Pretrained Molecule Embedding

We also provide molecule embeddings for most small-molecule drugs in DrugBank using pre-trained GNNs. In particular, Strategies for Pre-training Graph Neural Networks develops multiple approaches for pre-training GNN-based molecular representations, combining supervised molecular property prediction with self-supervised learning approaches. We employ their method to compute four variants of molecule embeddings using DGL-LifeSci.

  • mol_contextpred.npy: From a model pre-trained to predict surrounding graph structures of molecular subgraphs
  • mol_infomax.npy: From a model pre-trained to maximize the mutual information between local node representations and a global graph representation
  • mol_edgepred.npy: From a model pre-trained to encourage nearby nodes to have similar representations and to enforce that disparate nodes have distinct representations
  • mol_masking.npy: From a model pre-trained to predict randomly masked node and edge attributes

Tools to analyze DRKG

We analyze DRKG with some deep learning frameworks, including DGL (a framework for graph neural networks) and DGL-KE (a library for computing knowledge graph embeddings). Please follow the instructions below to install the deep learning frameworks.

Install PyTorch

Currently all notebooks use PyTorch as the deep learning backend. To install other versions of PyTorch, please go to Install PyTorch.

sudo pip3 install torch==1.5.0+cu101 torchvision==0.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Install DGL

Please install DGL (a framework for graph neural networks) with the following command. It installs DGL with CUDA support.

sudo pip3 install dgl-cu101

For installing other versions of DGL, please go to Install DGL

Install DGL-KE

If you want to train the model with the notebooks (e.g., Train_embeddings.ipynb or Edge_score_analysis.ipynb) in the "Knowledge Graph Embedding Based Analysis of DRKG" section, you need to install both DGL and DGL-KE. DGL-KE works with DGL >= 0.4.3 (either CPU or GPU).

sudo pip3 install dglke

Notebooks for analyzing DRKG

We provide a set of notebooks to analyze DRKG. Some of the notebooks use the tools installed in the previous section.

Basic Graph Analysis of DRKG

To evaluate the structural similarity between a pair of relation types, we compute their Jaccard similarity coefficient and the overlap between the two edge types via the overlap coefficient. This analysis is given in a notebook in this repository.
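
The two coefficients can be sketched as follows. This is an illustration under an assumption, not the notebook's code: each relation type is represented here by the set of (head, tail) pairs it connects, and the function names are hypothetical. The Jaccard coefficient is |A ∩ B| / |A ∪ B| and the overlap coefficient is |A ∩ B| / min(|A|, |B|).

```python
from collections import defaultdict


def pairs_by_relation(triplets):
    """Group (head, tail) pairs by relation type (ASSUMED representation)."""
    pairs = defaultdict(set)
    for h, r, t in triplets:
        pairs[r].add((h, t))
    return dict(pairs)


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def overlap(a, b):
    """Overlap coefficient of two sets; 0.0 when either is empty."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# Toy example with made-up relation names:
toy = [("a", "r1", "b"), ("a", "r1", "c"), ("a", "r2", "b")]
groups = pairs_by_relation(toy)
print(jaccard(groups["r1"], groups["r2"]), overlap(groups["r1"], groups["r2"]))
```

The overlap coefficient reaches 1 whenever one relation's pairs are a subset of the other's, which the Jaccard coefficient alone would not reveal for relations of very different sizes.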

Knowledge Graph Embedding Based Analysis of DRKG

We analyze the extracted DRKG by learning a TransE KGE model that utilizes the $\ell_2$ distance. As DRKG combines information from different data sources, we want to verify that meaningful entity and relation embeddings can be generated using knowledge graph embedding technology.
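
For reference, the TransE score with the $\ell_2$ distance is commonly written as below, where $\mathbf{h}$, $\mathbf{r}$ and $\mathbf{t}$ are the embeddings of the head entity, relation and tail entity. The margin $\gamma$ follows the usual DGL-KE formulation; its presence and value are an assumption here, since this section does not state them. A higher score indicates a more plausible triplet.

```latex
f(h, r, t) = \gamma - \lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert_2
```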

We split the edge triplets into training, validation and test sets (90%, 5% and 5%, respectively) and train the KGE model as shown in the Train_embeddings.ipynb notebook.
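
Independently of the notebook, a random 90/5/5 split can be sketched as follows. The function name, the fixed seed and the use of a uniform random permutation are assumptions made for illustration; the notebook may split the data differently.

```python
import numpy as np


def split_triplets(triplets, train_frac=0.90, valid_frac=0.05, seed=0):
    """Shuffle triplets and split them into train/valid/test lists.

    Whatever is left after the train and valid portions goes to test.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(triplets))
    n_train = int(len(triplets) * train_frac)
    n_valid = int(len(triplets) * valid_frac)
    train = [triplets[i] for i in order[:n_train]]
    valid = [triplets[i] for i in order[n_train:n_train + n_valid]]
    test = [triplets[i] for i in order[n_train + n_valid:]]
    return train, valid, test


# Toy example with synthetic triplets:
toy = [(f"h{i}", "r", f"t{i}") for i in range(100)]
train, valid, test = split_triplets(toy)
print(len(train), len(valid), len(test))
```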

Finally, we obtain the entity and relation embeddings for DRKG. Various embedding-based analyses are provided in the following notebooks:

Drug Repurposing Using Pretrained Model for COVID-19

We present an example of using the pretrained DRKG model for drug repurposing for COVID-19. In the example, we directly use the pretrained model provided in the DRKG dataset and propose 100 drugs for COVID-19. The following notebook provides the details:
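
As a rough illustration of the idea (not the notebook's code), candidate drugs can be ranked by how well the triplet (drug, treatment relation, disease) scores under TransE with the l2 distance. In this sketch the function names, the margin `gamma=12.0` and the choice of which relation and disease embeddings to use are all assumptions; the notebook may combine several relations and disease entities and post-process the scores.

```python
import numpy as np


def transe_l2_score(head, rel, tail, gamma=12.0):
    """TransE score gamma - ||h + r - t||_2 (gamma is an ASSUMED margin)."""
    return gamma - np.linalg.norm(head + rel - tail, ord=2, axis=-1)


def rank_drugs(drug_emb, treat_rel_emb, disease_emb, top_k=100):
    """Return indices and scores of the top_k drugs, best first.

    drug_emb:      (n_drugs, dim) array of candidate drug embeddings
    treat_rel_emb: (dim,) embedding of a treatment-type relation
    disease_emb:   (dim,) embedding of the target disease entity
    """
    scores = transe_l2_score(drug_emb, treat_rel_emb, disease_emb)
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]


# Toy example with 2-dimensional embeddings:
drugs = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
order, scores = rank_drugs(drugs, np.array([1.0, 0.0]), np.array([2.0, 0.0]), top_k=2)
print(order, scores)
```

The returned indices refer to rows of `drug_emb`, so they still need to be mapped back to entity names through the entity id mapping.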

DRKG with DGL

We provide a notebook with an example of using DRKG with the Deep Graph Library (DGL).

The following notebook provides an example of building a heterograph from DRKG in DGL; and some examples of queries on the DGL heterograph:
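
Separately from that notebook, the data preparation behind a heterograph can be sketched in plain Python: triplets are grouped by (source type, relation, destination type), and each entity receives an integer id that is local to its type. The function name is hypothetical and the `Type::identifier` entity naming is an assumption. The resulting edge dictionary has the shape that `dgl.heterograph` expects once the id lists are converted to tensors; that final call is left out so the sketch runs without DGL installed.

```python
from collections import defaultdict


def build_hetero_edges(triplets):
    """Group triplets into per-edge-type (src_ids, dst_ids) lists.

    Returns (edges, node_ids) where
      edges:    {(src_type, relation, dst_type): ([src ids], [dst ids])}
      node_ids: {entity_type: {entity_name: local integer id}}
    """
    node_ids = defaultdict(dict)
    edges = defaultdict(lambda: ([], []))
    for h, r, t in triplets:
        # ASSUMED naming scheme: 'Type::identifier'
        h_type = h.split("::", 1)[0]
        t_type = t.split("::", 1)[0]
        # Assign the next free id within each type on first sight.
        h_id = node_ids[h_type].setdefault(h, len(node_ids[h_type]))
        t_id = node_ids[t_type].setdefault(t, len(node_ids[t_type]))
        src, dst = edges[(h_type, r, t_type)]
        src.append(h_id)
        dst.append(t_id)
    return dict(edges), dict(node_ids)


# Toy example with made-up entities and relation names:
toy = [
    ("Compound::A", "treats", "Disease::X"),
    ("Compound::B", "treats", "Disease::X"),
    ("Gene::1", "interacts", "Gene::2"),
]
edges, node_ids = build_hetero_edges(toy)
print(edges)
```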

Additional Information for DrugBank

Some additional information about compounds from DrugBank is included in drugbank_info, including the type and weight of drugs, and the SMILES of small-molecule drugs.

Licence

This project is licensed under the Apache-2.0 License. However, DRKG integrates data from many resources, and users should consider the licensing of each source (see this table). We apply a license attribute on a per-node and per-edge basis for sources with defined licenses.

Cite

Please cite our dataset if you use this code and data in your work.

@misc{drkg2020,
  author = {Ioannidis, Vassilis N. and Song, Xiang and Manchanda, Saurav and Li, Mufei and Pan, Xiaoqin
            and Zheng, Da and Ning, Xia and Zeng, Xiangxiang and Karypis, George},
  title = {DRKG - Drug Repurposing Knowledge Graph for Covid-19},
  howpublished = "\url{https://github.com/gnn4dr/DRKG/}",
  year = {2020}
}

A preprint describing this work will be available soon.
