benedekrozemberczki / Sine

Licence: gpl-3.0
A PyTorch Implementation of "SINE: Scalable Incomplete Network Embedding" (ICDM 2018).

Scalable Incomplete Network Embedding

A PyTorch implementation of Scalable Incomplete Network Embedding (ICDM 2018).

Abstract

Attributed network embedding aims to learn low-dimensional vector representations for nodes in a network, where each node contains rich attributes/features describing node content. Because network topology structure and node attributes often exhibit high correlation, incorporating node attribute proximity into network embedding is beneficial for learning good vector representations. In reality, large-scale networks often have incomplete/missing node content or linkages, yet existing attributed network embedding algorithms all operate under the assumption that networks are complete. Thus, their performance is vulnerable to missing data and suffers from poor scalability. In this paper, we propose a Scalable Incomplete Network Embedding (SINE) algorithm for learning node representations from incomplete graphs. SINE formulates a probabilistic learning framework that separately models pairs of node-context and node-attribute relationships. Different from existing attributed network embedding algorithms, SINE provides greater flexibility to make the best of useful information and mitigate negative effects of missing information on representation learning. A stochastic gradient descent based online algorithm is derived to learn node representations, allowing SINE to scale up to large-scale networks with high learning efficiency. We evaluate the effectiveness and efficiency of SINE through extensive experiments on real-world networks. Experimental results confirm that SINE outperforms state-of-the-art baselines in various tasks, including node classification, node clustering, and link prediction, under settings with missing links and node attributes. SINE is also shown to be scalable and efficient on large-scale networks with millions of nodes/edges and high-dimensional node features.

This repository provides an implementation of SINE as described in the paper:

SINE: Scalable Incomplete Network Embedding. Daokun Zhang, Jie Yin, Xingquan Zhu, Chengqi Zhang. ICDM, 2018. [Paper]

The SINE model is also available in the [Karate Club] framework.

The original C implementation is available [here].

Requirements

The codebase is implemented in Python 3.5.2. The package versions used for development are listed below.

networkx          2.4
tqdm              4.28.1
numpy             1.15.4
pandas            0.23.4
texttable         1.5.0
scipy             1.1.0
argparse          1.1.0
torch             1.1.0
torchvision       0.3.0

Datasets

The code takes an input graph in a CSV file. Every row represents an edge between two nodes, given as two node IDs separated by a comma. The first row is a header. Nodes should be indexed starting with 0. Sample graphs for the `Twitch Brasilians` and `Wikipedia Chameleons` datasets are included in the `input/` directory.
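As a rough illustration of the edge list format described above, the following sketch parses a small in-memory CSV into an adjacency dictionary using only the standard library. The header names `id_1,id_2`, the helper name `read_edge_list`, and the choice to treat edges as undirected are assumptions made for this example; they are not taken from the repository.

```python
import csv
import io
from collections import defaultdict


def read_edge_list(handle):
    """Read a two-column edge list with a header row into an adjacency dict.

    Hypothetical helper for illustration; edges are assumed to be undirected.
    """
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    adjacency = defaultdict(set)
    for row in reader:
        if not row:
            continue  # ignore blank lines
        source, target = int(row[0]), int(row[1])
        adjacency[source].add(target)
        adjacency[target].add(source)
    # Sort neighbor sets so the result is deterministic.
    return {node: sorted(neighbors) for node, neighbors in adjacency.items()}


# Tiny made-up graph; the header names are placeholders.
sample = io.StringIO("id_1,id_2\n0,1\n0,2\n1,2\n2,3\n")
graph = read_edge_list(sample)
print(graph)
```

For a real file, the same function could be called on `open("input/chameleon_edges.csv", newline="")`.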

The feature matrix is stored as a **sparse binary** matrix. For simplicity, it is a JSON file. Nodes are the keys of the JSON and features are the values: for each node, the IDs of its nonzero feature columns are stored as elements of a list. The feature matrix is structured as:

{ 0: [0, 1, 38, 1968, 2000, 52727],
  1: [10000, 20, 3],
  2: [],
  ...
  n: [2018, 10000]}
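To make the sparse binary layout concrete, here is a minimal sketch that turns such a JSON object into coordinate (row, column) index lists. The function name `features_to_coo` and the sample data are hypothetical, and the matrix shape is inferred from the largest IDs seen, which is an assumption. Note that in a valid JSON file the node keys are strings, so they are converted to integers here.

```python
import json


def features_to_coo(raw_json):
    """Convert a node -> feature-ID-list JSON object into COO-style indices.

    Hypothetical helper: returns row indices, column indices, and a shape
    inferred from the largest node ID and feature ID present.
    """
    features = json.loads(raw_json)
    rows, cols = [], []
    for node, feature_ids in features.items():
        for feature_id in feature_ids:
            rows.append(int(node))       # JSON keys arrive as strings
            cols.append(int(feature_id))
    n_rows = max((int(node) for node in features), default=-1) + 1
    n_cols = max(cols, default=-1) + 1
    return rows, cols, (n_rows, n_cols)


# Made-up example with three nodes; node 2 has no features.
raw = '{"0": [0, 1, 38], "1": [20, 3], "2": []}'
rows, cols, shape = features_to_coo(raw)
print(rows, cols, shape)
```

Every stored entry has the value 1, so with SciPy installed the index lists could be passed to `scipy.sparse.coo_matrix(([1] * len(rows), (rows, cols)), shape=shape)`.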

Options

Learning of the embedding is handled by the `src/main.py` script which provides the following command line arguments.

Input and output options

  --edge-path    STR     Input graph path.           Default is `input/chameleon_edges.csv`.
  --feature-path STR     Input Features path.        Default is `input/chameleon_features.json`.
  --output-path  STR     Embedding path.             Default is `output/chameleon_sine.csv`.
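The input and output options above could be declared with `argparse` roughly as follows. This is only a sketch: the flag names and defaults come from the table, while the function name `build_parser` and the help strings are assumptions, and the actual parser in the repository may be organized differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented input/output flags (sketch)."""
    parser = argparse.ArgumentParser(description="Run SINE (illustrative sketch).")
    parser.add_argument("--edge-path", type=str,
                        default="input/chameleon_edges.csv",
                        help="Input graph path.")
    parser.add_argument("--feature-path", type=str,
                        default="input/chameleon_features.json",
                        help="Input features path.")
    parser.add_argument("--output-path", type=str,
                        default="output/chameleon_sine.csv",
                        help="Embedding path.")
    return parser


# Parse an explicit argument list so the example runs without a command line.
args = build_parser().parse_args(["--output-path", "output/custom.csv"])
print(args.edge_path, args.feature_path, args.output_path)
```

`argparse` maps dashes in flag names to underscores, so `--edge-path` is read as `args.edge_path`.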

Model options

  --dimensions              INT       Number of embedding dimensions.        Default is 128.
  --budget                  INT       Sampling budget.                       Default is 10^5.
  --noise-samples           INT       Number of noise samples.               Default is 5.
  --batch-size              INT       Number of source nodes per batch.      Default is 32.
  --walk-length             INT       Truncated random walk length.          Default is 80.  
  --number-of-walks         INT       Number of walks per source node.       Default is 10.
  --window-size             INT       Skip-gram window size.                 Default is 5.
  --learning-rate           FLOAT     Learning rate value.                   Default is 0.001.
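To give a feel for what the walk length, number of walks, and window size options control, the sketch below generates truncated random walks on a toy graph and extracts node-context pairs within a skip-gram window. It is not the repository's sampling code: the function names, the toy graph, and the convention that walk length counts nodes rather than steps are all assumptions made for illustration.

```python
import random


def truncated_random_walk(adjacency, start, walk_length, rng):
    """Uniform random walk that stops after walk_length nodes or at a dead end."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = adjacency.get(walk[-1], [])
        if not neighbors:
            break  # isolated node or dead end
        walk.append(rng.choice(neighbors))
    return walk


def window_pairs(walk, window_size):
    """Pair each node with the nodes at most window_size positions away."""
    pairs = []
    for i, source in enumerate(walk):
        lo = max(0, i - window_size)
        hi = min(len(walk), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((source, walk[j]))
    return pairs


# Toy undirected graph; small values stand in for the documented defaults.
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
rng = random.Random(42)
number_of_walks, walk_length, window_size = 2, 5, 2

walks = [truncated_random_walk(adjacency, node, walk_length, rng)
         for _ in range(number_of_walks) for node in adjacency]
pairs = window_pairs(walks[0], window_size)
print(len(walks), len(pairs))
```

Larger values of `walk_length` and `number_of_walks` produce more node-context pairs per source node, while a larger `window_size` pairs each node with more distant neighbors along the walk.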

Examples

The following commands learn a graph embedding and write the embedding to disk. The node representations are ordered by node ID.

Creating a SINE embedding of the default dataset with the default hyperparameter settings and saving the embedding at the default path.

python src/main.py

Creating a SINE embedding of the default dataset with 256 dimensions.

python src/main.py --dimensions 256

Creating a SINE embedding of the default dataset with a low sampling budget.

python src/main.py --budget 1000

Creating an embedding of another densely structured dataset, the Twitch Brasilians, and saving the output at a custom path.

python src/main.py --edge-path input/ptbr_edges.csv --feature-path input/ptbr_features.json --output-path output/ptbr_sine.csv
