benedekrozemberczki / MUSAE

License: GPL-3.0
The reference implementation of "Multi-scale Attributed Node Embedding".

Programming Languages

python

Projects that are alternatives of or similar to MUSAE

Sense2vec
🦆 Contextually-keyed word vectors
Stars: ✭ 1,184 (+1478.67%)
Mutual labels:  word2vec, gensim
Product-Categorization-NLP
Multi-Class Text Classification for products based on their description with Machine Learning algorithms and Neural Networks (MLP, CNN, Distilbert).
Stars: ✭ 30 (-60%)
Mutual labels:  word2vec, gensim
walklets
A lightweight implementation of Walklets from "Don't Walk, Skip! Online Learning of Multi-scale Network Embeddings" (ASONAM 2017).
Stars: ✭ 94 (+25.33%)
Mutual labels:  word2vec, gensim
Word2VecAndTsne
Scripts demo-ing how to train a Word2Vec model and reduce its vector space
Stars: ✭ 45 (-40%)
Mutual labels:  word2vec, gensim
Nlp In Practice
Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
Stars: ✭ 790 (+953.33%)
Mutual labels:  word2vec, gensim
doc2vec-api
Document embedding and machine learning scripts for beginners
Stars: ✭ 92 (+22.67%)
Mutual labels:  word2vec, gensim
word2vec-pt-br
Implementation and a model generated by training (trigram) on the Portuguese (pt-br) Wikipedia
Stars: ✭ 34 (-54.67%)
Mutual labels:  word2vec, gensim
Shallowlearn
An experiment about re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and nice API. Written in Python and fully compatible with Scikit-learn.
Stars: ✭ 196 (+161.33%)
Mutual labels:  word2vec, gensim
Word2vec Tutorial
A tutorial on training Chinese word vectors
Stars: ✭ 426 (+468%)
Mutual labels:  word2vec, gensim
Lmdb Embeddings
Fast word vectors with little memory usage in Python
Stars: ✭ 404 (+438.67%)
Mutual labels:  word2vec, gensim
word-embeddings-from-scratch
Creating word embeddings from scratch and visualizing them on TensorBoard. Using trained embeddings in Keras.
Stars: ✭ 22 (-70.67%)
Mutual labels:  word2vec, gensim
Tadw
An implementation of "Network Representation Learning with Rich Text Information" (IJCAI '15).
Stars: ✭ 43 (-42.67%)
Mutual labels:  word2vec, gensim
Aravec
AraVec is a pre-trained distributed word representation (word embedding) open source project which aims to provide the Arabic NLP research community with free to use and powerful word embedding models.
Stars: ✭ 239 (+218.67%)
Mutual labels:  word2vec, gensim
biovec
ProtVec can be used in protein interaction predictions, structure prediction, and protein data visualization.
Stars: ✭ 23 (-69.33%)
Mutual labels:  word2vec, gensim
Gemsec
The TensorFlow reference implementation of 'GEMSEC: Graph Embedding with Self Clustering' (ASONAM 2019).
Stars: ✭ 210 (+180%)
Mutual labels:  word2vec, gensim
RolX
An alternative implementation of Recursive Feature and Role Extraction (KDD11 & KDD12)
Stars: ✭ 52 (-30.67%)
Mutual labels:  word2vec, gensim
Splitter
A Pytorch implementation of "Splitter: Learning Node Representations that Capture Multiple Social Contexts" (WWW 2019).
Stars: ✭ 177 (+136%)
Mutual labels:  word2vec, gensim
Germanwordembeddings
Toolkit to obtain and preprocess German corpora, train models using word2vec (gensim) and evaluate them with generated test sets
Stars: ✭ 189 (+152%)
Mutual labels:  word2vec, gensim
wordfish-python
Extract relationships from standardized terms in a corpus of interest with deep learning 🐟
Stars: ✭ 19 (-74.67%)
Mutual labels:  word2vec, gensim
Twitter sentiment analysis word2vec convnet
Twitter Sentiment Analysis with Gensim Word2Vec and Keras Convolutional Network
Stars: ✭ 24 (-68%)
Mutual labels:  word2vec, gensim

MUSAE

The reference implementation of "Multi-scale Attributed Node Embedding".

Abstract

We present network embedding algorithms that capture information about a node from the local distribution over node attributes around it, as observed over random walks following an approach similar to Skip-gram. Observations from neighborhoods of different sizes are either pooled (AE) or encoded distinctly in a multi-scale approach (MUSAE). Capturing attribute-neighborhood relationships over multiple scales is useful for a diverse range of applications, including latent feature identification across disconnected networks with similar attributes. We prove theoretically that matrices of node-feature pointwise mutual information are implicitly factorized by the embeddings. Experiments show that our algorithms are robust, computationally efficient and outperform comparable models on social, web and citation network datasets.
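
To make the pooled versus multi-scale distinction concrete, the sketch below collects (node, feature) training pairs from a single walk. It is an illustration only, not the repository's code; walk_feature_pairs and the toy data are hypothetical.

from collections import defaultdict

def walk_feature_pairs(walk, features, window=3, multi_scale=True):
    """Collect (source node, feature) pairs from one random walk.

    Pooled (AE): neighbours at all distances up to `window` share one stream.
    Multi-scale (MUSAE): each distance r gets its own stream of pairs.
    """
    pairs = defaultdict(list)  # scale -> list of (source node, feature) pairs
    for i, node in enumerate(walk):
        for r in range(1, window + 1):
            if i + r < len(walk):
                scale = r if multi_scale else 0
                for feature in features[walk[i + r]]:
                    pairs[scale].append((node, feature))
    return dict(pairs)

toy_features = {0: [10], 1: [11], 2: [12], 3: [13]}  # node -> feature ids
print(walk_feature_pairs([0, 1, 2, 3], toy_features, window=2))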

The second-order random walk sampling method was taken from the reference implementation of Node2Vec.

The datasets are also available on SNAP.

The model is now also available in the package Karate Club.

This repository provides the reference implementations for MUSAE and AE as described in the paper:

Multi-scale Attributed Node Embedding. Benedek Rozemberczki, Carl Allen, and Rik Sarkar. arXiv, 2019. https://arxiv.org/abs/1909.13021

Table of Contents

  1. Citing
  2. Requirements
  3. Datasets
  4. Logging
  5. Options
  6. Examples

Citing

If you find MUSAE useful in your research, please consider citing the following paper:

@misc{rozemberczki2019multiscale,
      title = {{Multi-scale Attributed Node Embedding}},
      author = {Benedek Rozemberczki and Carl Allen and Rik Sarkar},
      year = {2019},
      eprint = {1909.13021},
      archivePrefix = {arXiv},
      primaryClass = {cs.LG}
}

Requirements

The codebase is implemented in Python 3.5.2. The package versions used for development are listed below.

networkx          2.4
tqdm              4.28.1
numpy             1.15.4
pandas            0.23.4
texttable         1.5.0
scipy             1.1.0
argparse          1.1.0
gensim            3.6.0
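
Assuming pip is available, the pinned versions above can be installed in one step (a convenience command, not taken from the repository; argparse ships with Python 3's standard library, so it is omitted):

$ pip install networkx==2.4 tqdm==4.28.1 numpy==1.15.4 pandas==0.23.4 texttable==1.5.0 scipy==1.1.0 gensim==3.6.0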

Datasets
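
The repository's own dataset description is not included in this extract. As an illustration only (inferred from the default paths under Options below, not quoted from the repository), the two inputs are an edge list CSV and a node-feature JSON. Hypothetical miniature versions:

  id_1,id_2
  0,1
  0,2
  1,3

  {"0": [4, 17, 23], "1": [4, 8], "2": [23], "3": [17]}

Here each CSV row names a pair of node identifiers, JSON keys are node identifiers, and JSON values list the feature identifiers present at each node; the header names and exact schema are assumptions.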

Logging

The models are defined so that parameter settings and runtimes are logged. Specifically, we log the following:

1. Hyperparameter settings.     We save each hyperparameter used in the experiment.
2. Optimization runtime.        We measure the time needed for optimization, in seconds.
3. Sampling runtime.            We measure the time needed for sampling, in seconds.
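
A minimal sketch of this style of logging; the helper and field names below are hypothetical, not the repository's schema:

import json
import time

def timed(fn):
    """Run fn() and return (result, elapsed wall-clock seconds)."""
    start = time.time()
    result = fn()
    return result, time.time() - start

# Hypothetical stand-ins for the sampling and optimization phases.
walks, sampling_time = timed(lambda: [[0, 1, 2, 3]] * 5)
_, optimization_time = timed(lambda: sum(len(w) for w in walks))

log = {
    "hyperparameters": {"walk-number": 5, "walk-length": 80},
    "sampling_time_seconds": sampling_time,
    "optimization_time_seconds": optimization_time,
}
with open("example_log.json", "w") as handle:
    json.dump(log, handle, indent=2)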

Options

Learning the embedding is handled by the src/main.py script which provides the following command line arguments.

Input and output options

  --graph-input      STR   Input edge list csv.     Default is `input/edges/chameleon_edges.csv`.
  --features-input   STR   Input features json.     Default is `input/features/chameleon_features.json`.
  --output           STR   Embedding output path.   Default is `output/chameleon_embedding.csv`.
  --log              STR   Log output path.         Default is `logs/chameleon.json`.

Random walk options

  --sampling      STR       Random walker order (first/second).              Default is `first`.
  --P             FLOAT     Return hyperparameter for second-order walk.     Default is 1.0.
  --Q             FLOAT     In-out hyperparameter for second-order walk.     Default is 1.0.
  --walk-number   INT       Walks per source node.                           Default is 5.
  --walk-length   INT       Truncated random walk length.                    Default is 80.
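
The --P and --Q flags behave like Node2Vec's return and in-out hyperparameters. Below is a minimal, self-contained sketch of one second-order step, assuming an undirected NetworkX graph; it is illustrative, not the repository's sampler.

import networkx as nx
import numpy as np

def second_order_step(graph, prev, curr, p=1.0, q=1.0):
    """Sample the next node of a Node2Vec-style second-order walk."""
    neighbours = list(graph.neighbors(curr))
    weights = np.array([
        1.0 / p if x == prev                 # return to the previous node
        else 1.0 if graph.has_edge(x, prev)  # stay near the previous node
        else 1.0 / q                         # move outward
        for x in neighbours
    ])
    probs = weights / weights.sum()
    return neighbours[np.random.choice(len(neighbours), p=probs)]

graph = nx.karate_club_graph()
walk = [0, list(graph.neighbors(0))[0]]
while len(walk) < 10:  # a tiny stand-in for --walk-length
    walk.append(second_order_step(graph, walk[-2], walk[-1], p=1.0, q=0.5))
print(walk)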

Model options

  --model                 STR        Pooled or multi-scale model (AE/MUSAE).      Default is `musae`.
  --base-model            STR        Use of Doc2Vec base model.                   Default is `null`.
  --approximation-order   INT        Matrix powers approximated.                  Default is 3.
  --dimensions            INT        Number of dimensions.                        Default is 32.
  --down-sampling         FLOAT      Down-sampling rate for frequent features.    Default is 0.001.
  --exponent              FLOAT      Downsampling exponent of frequency.          Default is 0.75.
  --alpha                 FLOAT      Initial learning rate.                       Default is 0.05.
  --min-alpha             FLOAT      Final learning rate.                         Default is 0.025.
  --min-count             INT        Minimal occurrence of features.              Default is 1.
  --negative-samples      INT        Number of negative samples per node.         Default is 5.
  --workers               INT        Number of cores used for optimization.       Default is 4.
  --epochs                INT        Gradient descent epochs.                     Default is 5.
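
Several of these flags map naturally onto gensim 3.6 Word2Vec parameters. The call below is an informed sketch under that assumption, with toy walks; it is not a quote of the repository's training code.

from gensim.models import Word2Vec

# Toy "sentences": walks whose tokens are node and feature identifiers.
walks = [["0", "feat_4", "1", "feat_8", "2", "feat_4"]] * 20

model = Word2Vec(
    walks,
    size=32,           # --dimensions
    window=3,          # --approximation-order (context size along the walk)
    alpha=0.05,        # --alpha
    min_alpha=0.025,   # --min-alpha
    min_count=1,       # --min-count
    sample=0.001,      # --down-sampling
    ns_exponent=0.75,  # --exponent
    negative=5,        # --negative-samples
    workers=4,         # --workers
    iter=5,            # --epochs (gensim 3.x calls this `iter`)
    sg=1,              # skip-gram, matching the abstract
)
print(model.wv["feat_4"][:5])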

Examples

Training a MUSAE model for 10 epochs.

$ python src/main.py --epochs 10

Setting the embedding dimensionality explicitly (32 is also the default).

$ python src/main.py --dimensions 32
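
The flags documented above compose. For example, second-order sampling with a stronger outward bias (values chosen purely for illustration):

$ python src/main.py --sampling second --P 1.0 --Q 0.5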

License

MUSAE is released under the GNU General Public License v3.0.