vedashree29296 / PyEmbeo

Licence: other
graph embeddings for neo4j in python


PyEmbeo

(Graph embeddings for Neo4j in Python)

INTRODUCTION

NEO4J

Graph databases are a powerful way to represent real-world data in a simple and intuitive manner. They can effectively capture inherent relationships within the data and provide meaningful insights that cannot be obtained using traditional relational databases.

Neo4j is a leading graph database platform that offers great capabilities for storing and querying large-scale enterprise data and can be easily scaled up to accommodate millions of nodes without hindering performance. Moreover, it has great community support and a large number of plugins available for carrying out various tasks. Head over to their official website for more information.

WHAT ARE GRAPH EMBEDDINGS?

Machine learning on graph data has been the talk of the town for quite a while now. With the advantages of using graphs being quite evident, applying machine learning algorithms to graphs enables tasks such as graph analysis, link prediction, and clustering.

Graph embeddings are a way to encode graph data as vectors that effectively capture structural information, such as the graph topology and the node-to-node relationships in the graph database. These embeddings can then be ingested by ML algorithms to perform various tasks.

HOW CAN GRAPH EMBEDDINGS BE USED?

Graph embeddings can be used for a variety of tasks, including machine learning tasks. For example, the embeddings of two nodes can be used to determine whether a relationship could exist between them. Or, given a particular node, embeddings can be used to find similar nodes and rank them using similarity search algorithms. Common applications include knowledge graph completion and drug discovery, where new relations can be discovered between two nodes, and link prediction and recommendation systems, such as social network analysis where potential new friendships can be found.
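As a small illustration of the ranking idea, similar nodes can be scored with plain cosine similarity between their embedding vectors. The node ids and vectors below are made up for the example:

```python
import numpy as np

# Hypothetical embeddings for three nodes (ids and values are illustrative only).
embeddings = {
    "node_1": np.array([0.9, 0.1, 0.3]),
    "node_2": np.array([0.8, 0.2, 0.4]),
    "node_3": np.array([-0.5, 0.7, 0.1]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_similar(query_id, embeddings):
    # Score every other node against the query node, highest similarity first.
    query = embeddings[query_id]
    scores = {
        node_id: cosine_similarity(query, vec)
        for node_id, vec in embeddings.items()
        if node_id != query_id
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_similar("node_1", embeddings))  # node_2 ranks above node_3
```

In practice the vectors come from the trained model and the search is delegated to an index such as FAISS, but the ranking principle is the same.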

PyEmbeo

PyEmbeo is a Python project that creates graph embeddings for a Neo4j graph database. The URL of the Neo4j database is passed to the script through a command line interface to generate graph embeddings. Other parameters (such as the number of epochs for training) can be configured by creating or editing the "config.yml" file. (See config_link for all the configurable parameters.) The obtained embeddings can then be used to perform other tasks such as similarity search, scoring, or ranking. (Note: currently only the similarity search task has been implemented; other tasks are still in development.)

Installation and Setup


REQUIREMENTS

  • Neo4j database and py2neo

  • conda (or miniconda)

  • python >=3.5

Also, ensure that the APOC plugin for Neo4j is installed and configured for your database. Make sure the following lines are added to the 'neo4j.conf' file:

apoc.import.file.enabled=true

apoc.export.file.enabled=true

dbms.security.procedures.whitelist=apoc.*

apoc.import.file.use_neo4j_config=false

STEPS FOR INSTALLATION:

  • Clone the repository and navigate inside the directory:

git clone <link>

cd ./PyEmbeo

  • Create a conda environment and activate it by running:

conda env create -f requirements.yml

  • This creates a conda environment called pyembeo and installs all the requirements. Activate the environment by executing:

conda activate pyembeo

Usage


Training:

PyEmbeo uses torchbiggraph to generate graph embeddings. PyTorch-BigGraph is a tool that can create graph embeddings for very large, multi-relational graphs without the need for computing resources such as GPUs. For more details, you can refer to the PyTorch-BigGraph documentation.

The script uses the config.yml file to configure all the training parameters. The file has been preconfigured with default parameters, and only a minimal set of parameters needs to be passed through the command line. However, the parameters can be tweaked by editing the config.yml file.

The command line interface takes the following parameters:

  • project_name : This is the root directory that will store the required data and embedding checkpoint files.

  • url : The URL of the Neo4j database, in the format bolt (or http)://(IP of the database):(port number). By default the URL is configured to bolt://localhost:7687. You will then be prompted to enter the username and password to connect to the database.

  • config_path: This is an optional parameter that specifies the path to a 'config.yml' file in case the default parameters have been edited.

To get all the parameters execute: python embed.py --help

To launch the training script for creating graph embeddings execute the following command from the project directory: python embed.py train --project_name=sampleproject --url=bolt://localhost:7687

This will create a folder called sampleproject in the current directory, which will store all the required data and checkpoint files.

Once training is done, the embeddings will be saved to the sampleproject/model directory.
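The saved embeddings are HDF5 files. As a minimal sketch of how one might read them back with h5py, the snippet below writes a toy file in the same shape (an "embeddings" dataset of num_nodes x dimensions) and loads it; the real checkpoint file names under sampleproject/model depend on your entity types and checkpoint version, so the name here is illustrative only:

```python
import h5py   # pip install h5py
import numpy as np

# Write a toy HDF5 file shaped like a PyTorch-BigGraph embeddings checkpoint:
# a single "embeddings" dataset of num_nodes x EMBEDDING_DIMENSIONS.
with h5py.File("toy_embeddings.h5", "w") as f:
    f.create_dataset("embeddings",
                     data=np.random.rand(10, 400).astype("float32"))

# Loading looks the same for a real checkpoint file under sampleproject/model.
with h5py.File("toy_embeddings.h5", "r") as f:
    vectors = np.array(f["embeddings"])

print(vectors.shape)  # (10, 400)
```

Each row of the loaded array is the embedding vector for one node, in the order given by the entity_names files described under Storage format below.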

Similarity Search:

A common task using graph embeddings is performing a similarity search to return similar nodes, which can then be used to find undiscovered relationships.

PyEmbeo uses FAISS, a library for fast similarity search over large numbers of vectors. A similarity search can be triggered by passing the node id of a particular node (any other property can also be passed, but it will be computationally heavy). More details can be found at: official documentation or this post and this post

The similarity search script takes arguments similar to the training script, along with a few extra ones:

  • project_name : This is the root directory that will store the required data and embedding checkpoint files.

  • url : The URL of the Neo4j database, in the format bolt (or http)://(IP of the database):(port number). By default the URL is configured to bolt://localhost:7687

  • node: This specifies the node id of any node present in the graph.

To get all the parameters execute: python task.py --help

The script first creates FAISS indexes if they are not already created and then returns the n most similar nodes for the given node (default n = 5).

To execute the similarity search task, execute the following command from the project directory: python task.py similarity --project_name=sampleproject --node=1234 --url=bolt://localhost:7687/

Storage format:

A root directory with the name given by the --project_name argument is created, along with its subfolders:

|-- my_project_name/
|------ data/
|---------- graph_partitioned/
|----------------- edges.h5 files
|---------- files related to the nodes (.json, .txt, .tsv files)
|------ model/
|---------- index/
|----------------- .index files
|---------- config.json
|---------- embeddings files (.h5 and .txt files)
|------ metadata.json

data/ : stores all the data-related files:

  • entity_names (.json) files store lists of the node ids of the entities
  • entity_count (.txt) files store the total count of entities
  • graph.tsv stores the graph data in TSV format, which is used as input for training the graph embeddings
  • graph_partitioned/ edges (.h5) files store the edge lists

model/: stores the checkpoint and embeddings files created during training.

  • config.json is a configuration file, created from the config.yml file, that is used by torchbiggraph for training
  • embeddings (.h5) files store the graph embeddings
  • checkpoint_version (.txt) stores the latest checkpoint version of the embeddings

metadata.json stores data about the number of nodes, labels, and types of relationships.

Configuration Options:

Default parameters can be overridden by editing or creating a config.yml file. Most of the parameters are used by torchbiggraph, and more details about each can be found at: ......... Some of the editable parameters include:

  • EMBEDDING_DIMENSIONS: size of the embedding vectors (defaults to 400)

  • EPOCHS: number of training iterations to perform (defaults to 20)

  • NUM_PARTITIONS: the number of partitions to divide the nodes into. This is used by torchbiggraph, which will divide the nodes of a particular type accordingly. (defaults to 1)

torchbiggraph uses the concept of operators and comparators for scoring while training the graph embeddings. More details can be found at: comparators and operators

  • operator: can be 'none', 'diagonal', 'translation', 'complex_diagonal', 'affine' or 'linear'. Defaults to 'complex_diagonal'
  • comparator: can be 'dot', 'cos', 'l2' or 'squared_l2'. Defaults to 'dot'

The similarity search parameters can also be tweaked accordingly:

  • FAISS_INDEX_NAME: the type of index to use for similarity search. Defaults to IndexIVFFlat. Currently only the IVFFlat and FlatL2 index types are supported. See index types for details on the types of indexes
  • NEAREST_NEIGHBORS: number of similar nodes to return. Defaults to 5
  • NUM_CLUSTER: number of clusters created by the clustering algorithm while building the index
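For reference, a minimal config.yml using the parameters listed above might look like the following sketch. Only the key names documented above are taken from the project; the NUM_CLUSTER value and the exact file layout are assumptions:

```yaml
# Illustrative config.yml for PyEmbeo (values shown are the documented defaults,
# except NUM_CLUSTER, which is an assumed example value).
EMBEDDING_DIMENSIONS: 400
EPOCHS: 20
NUM_PARTITIONS: 1
operator: complex_diagonal
comparator: dot
FAISS_INDEX_NAME: IndexIVFFlat
NEAREST_NEIGHBORS: 5
NUM_CLUSTER: 100
```

Pass the path to this file via the --config_path option when its values differ from the defaults.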