Lab41 / altair

Licence: Apache-2.0 License
Assessing Source Code Semantic Similarity with Unsupervised Learning

Programming Languages

python
139335 projects - #7 most used programming language
Jupyter Notebook
11667 projects
javascript
184084 projects - #8 most used programming language

Projects that are alternatives of or similar to altair

Word2vec
Python interface to Google word2vec
Stars: ✭ 2,370 (+5542.86%)
Mutual labels:  word2vec, doc2vec
ML2017FALL
Machine Learning (EE 5184) in NTU
Stars: ✭ 66 (+57.14%)
Mutual labels:  rnn, unsupervised-learning
Gemsec
The TensorFlow reference implementation of 'GEMSEC: Graph Embedding with Self Clustering' (ASONAM 2019).
Stars: ✭ 210 (+400%)
Mutual labels:  word2vec, unsupervised-learning
Danmf
A sparsity aware implementation of "Deep Autoencoder-like Nonnegative Matrix Factorization for Community Detection" (CIKM 2018).
Stars: ✭ 161 (+283.33%)
Mutual labels:  word2vec, unsupervised-learning
RolX
An alternative implementation of Recursive Feature and Role Extraction (KDD11 & KDD12)
Stars: ✭ 52 (+23.81%)
Mutual labels:  word2vec, unsupervised-learning
Tensorflow Tutorials
Provides source code for practicing TensorFlow step by step, from the basics to applications
Stars: ✭ 2,096 (+4890.48%)
Mutual labels:  word2vec, rnn
doc2vec-api
document embedding and machine learning script for beginners
Stars: ✭ 92 (+119.05%)
Mutual labels:  word2vec, doc2vec
Tadw
An implementation of "Network Representation Learning with Rich Text Information" (IJCAI '15).
Stars: ✭ 43 (+2.38%)
Mutual labels:  word2vec, unsupervised-learning
DeepLearning-Lab
Code lab for deep learning. Including rnn,seq2seq,word2vec,cross entropy,bidirectional rnn,convolution operation,pooling operation,InceptionV3,transfer learning.
Stars: ✭ 83 (+97.62%)
Mutual labels:  word2vec, rnn
chainer-notebooks
Jupyter notebooks for Chainer hands-on
Stars: ✭ 23 (-45.24%)
Mutual labels:  word2vec, rnn
Graphwavemachine
A scalable implementation of "Learning Structural Node Embeddings Via Diffusion Wavelets (KDD 2018)".
Stars: ✭ 151 (+259.52%)
Mutual labels:  word2vec, unsupervised-learning
Embedding
Summary of embedding model code and study notes
Stars: ✭ 25 (-40.48%)
Mutual labels:  word2vec, doc2vec
Skip Thoughts.torch
Porting of Skip-Thoughts pretrained models from Theano to PyTorch & Torch7
Stars: ✭ 146 (+247.62%)
Mutual labels:  word2vec, rnn
Chameleon recsys
Source code of CHAMELEON - A Deep Learning Meta-Architecture for News Recommender Systems
Stars: ✭ 202 (+380.95%)
Mutual labels:  word2vec, rnn
Text Summarizer
Python Framework for Extractive Text Summarization
Stars: ✭ 96 (+128.57%)
Mutual labels:  word2vec, unsupervised-learning
GE-FSG
Graph Embedding via Frequent Subgraphs
Stars: ✭ 39 (-7.14%)
Mutual labels:  word2vec, doc2vec
Bagofconcepts
Python implementation of bag-of-concepts
Stars: ✭ 18 (-57.14%)
Mutual labels:  word2vec, unsupervised-learning
Neural Networks
All about Neural Networks!
Stars: ✭ 34 (-19.05%)
Mutual labels:  word2vec, rnn
doc2vec-golang
doc2vec , word2vec, implemented by golang. word embedding representation
Stars: ✭ 33 (-21.43%)
Mutual labels:  word2vec, doc2vec
Product-Categorization-NLP
Multi-Class Text Classification for products based on their description with Machine Learning algorithms and Neural Networks (MLP, CNN, Distilbert).
Stars: ✭ 30 (-28.57%)
Mutual labels:  word2vec, doc2vec

Altair

Read our project findings on our blog, then try the demo below.

Assessing Source Code Similarity with Unsupervised Learning

How do you determine what a segment of source code does?

How do you search a corpus for source code that you want to use?

Altair is Lab41's exploration of representing source code and its associated features in a vector space. We are interested in generating robust source code embeddings for Python, much as Word2Vec creates word embeddings for written text. You can read about our early experimentation with word embeddings for source code on the Lab41 blog.
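
To make the analogy with word embeddings concrete: a Doc2Vec-style model needs each script as a sequence of tokens. The sketch below shows one way to get such a sequence with Python's standard `tokenize` module. The helper name `code_tokens` and the choice to drop comments and layout tokens are assumptions for illustration; Altair's actual preprocessing may differ.

```python
import io
import tokenize


def code_tokens(source):
    """Return the lexical tokens of a Python source string,
    skipping comments and pure-layout tokens."""
    skip = {
        tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
        tokenize.INDENT, tokenize.DEDENT, tokenize.ENCODING,
        tokenize.ENDMARKER,
    }
    reader = io.StringIO(source).readline
    return [tok.string
            for tok in tokenize.generate_tokens(reader)
            if tok.type not in skip]


sample = "def add(a, b):\n    return a + b  # sum\n"
print(code_tokens(sample))
# ['def', 'add', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '+', 'b']
```

A token list like this plays the role that a list of words plays for written text.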

Our primary use case for source code representation and similarity calculation is enabling meaningful recommendations of code to coders. We believe that similar techniques could be useful for code security analysis, code authorship, and code plagiarism detection.

Altair Demo via Docker!

  1. Download a pickle file containing the Gensim Doc2Vec vectors for 200,000 Python scripts from GitHub here. In this example we saved the downloaded pickle file to ~/models/

  2. Build the container

docker build -f Dockerfile.demo -t altair.demo .
  3. Run the container

docker run -v ~/models/:/altair/altair/models/github/ -p 5000:5000 altair.demo

  4. Open a browser and go to

http://0.0.0.0:5000/

If your browser cannot reach 0.0.0.0, try http://localhost:5000/ instead.

You should see the Altair home page below.

altair home screen

  5. This demonstration expects a URL that points to raw Python code. Let's test Altair on Lab41's Magnolia (speaker separation in audio) project by entering the following URL in the white input box:

https://raw.githubusercontent.com/Lab41/Magnolia/master/src/features/spectral_features.py

Press 'run'. Altair should recommend audio analysis Python scripts, as in the screenshot below.

altair results screen

  6. Let's do one more. Try Lab41's Pelops (car reidentification via computer vision) project by entering the following URL in the white input box:

https://raw.githubusercontent.com/Lab41/pelops/master/pelops/features/hog.py

Press 'run'. Altair should recommend computer vision Python scripts, as in the screenshot below.

altair results screen
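
The demo expects raw-file URLs like the two used in the steps above. If you only have the regular GitHub page URL for a script, a small helper can rewrite it into the raw form. This is a minimal sketch: the function name `to_raw_url` is hypothetical, and the `github.com/<owner>/<repo>/blob/<branch>/<path>` page layout is an assumption about the URL you start from.

```python
from urllib.parse import urlsplit


def to_raw_url(url):
    """Convert a github.com 'blob' page URL into its
    raw.githubusercontent.com form; return other URLs unchanged."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Assumed shape: <owner>/<repo>/blob/<branch>/<path to file>
    if (parts.netloc == "github.com"
            and len(segments) > 4
            and segments[2] == "blob"):
        owner, repo, _, *rest = segments
        return ("https://raw.githubusercontent.com/"
                + "/".join([owner, repo, *rest]))
    return url


page = "https://github.com/Lab41/pelops/blob/master/pelops/features/hog.py"
print(to_raw_url(page))
```

URLs that are already raw, or that do not match the assumed layout, are returned as-is.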

Make Your Own Altair: Docker container to Vectorize a Folder of Python Scripts (*.py)

The Docker container uses a Doc2Vec model trained on 1 million Python scripts from GitHub. Its output is a dictionary of vectors saved as a pickle file in the "out" volume. A distance measure (e.g., cosine distance) can then be used to locate similar vectors in the output.
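
As a rough illustration of the distance step, the sketch below ranks entries of a vector dictionary by cosine similarity (cosine distance is 1 minus this value). The toy dictionary, its file-name keys, and the helper names are assumptions for illustration; the real output has many more dimensions and its exact key format is not described here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


def most_similar(query_key, vectors, top_n=3):
    """Return the top_n (key, similarity) pairs closest to query_key."""
    query = vectors[query_key]
    scored = [(key, cosine_similarity(query, vec))
              for key, vec in vectors.items() if key != query_key]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:top_n]


# Toy stand-in for the dictionary of vectors (hypothetical keys and values).
vectors = {
    "audio_fft.py": [0.9, 0.1, 0.0],
    "audio_mfcc.py": [0.8, 0.2, 0.1],
    "web_server.py": [0.0, 0.1, 0.9],
}
print(most_similar("audio_fft.py", vectors, top_n=2))
```

For a large output dictionary, a vectorized implementation (for example with NumPy) would be considerably faster than this pure-Python loop.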

Build the container

docker build -f Dockerfile.vectorize_folder -t altair.vectorize_folder .

Run the container

docker run -v /dirwithPythonScripts/:/in -v /dirtoSaveOutput/:/out altair.vectorize_folder

Run the container with custom settings (e.g., use a Doc2Vec model trained on 500k Python scripts from GitHub and specify the output file name)

docker run -v /dirwithPythonScripts/:/in -v /dirtoSaveOutput/:/out altair.vectorize_folder /altair/models/doc2vec_trainedmodel_cbow_docs500000_negative10_mincount500_minlen2000_win5.pkl /in /out/myoutput.pkl
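
Once the container has finished, the pickle file in the output directory can be inspected from Python. The sketch below assumes only that the file holds a dictionary whose values are vectors; the helper names and the example path are hypothetical. Only unpickle files you trust, since `pickle.load` can execute arbitrary code.

```python
import pickle


def load_vectors(path):
    """Load the pickled dictionary of vectors written to the out volume."""
    with open(path, "rb") as handle:
        vectors = pickle.load(handle)
    if not isinstance(vectors, dict):
        raise TypeError(
            f"expected a dict of vectors, got {type(vectors).__name__}")
    return vectors


def describe(vectors):
    """Summarize how many vectors there are and which lengths occur."""
    lengths = {len(vec) for vec in vectors.values()}
    return {"count": len(vectors), "vector_lengths": sorted(lengths)}


# Example usage (path is an assumption matching the command above):
# print(describe(load_vectors("/dirtoSaveOutput/myoutput.pkl")))
```

If the vectors were saved as NumPy arrays, NumPy must be installed in the environment that loads the file.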

Prerequisites

Local Computing Components

  • git
  • python3
  • pip
  • conda

Installation

Cloning the repository

Clone the Altair repository from the command line, then cd into the directory:

git clone https://github.com/Lab41/altair.git
cd altair

Conda

Anaconda is a free Python distribution that includes more than 400 of the most popular Python packages for science, math, engineering, and data analysis. It also includes conda, a cross-platform package and environment manager that some see as a successor to pip.

Before getting started, you’ll need both conda and gcc installed on your system. Download the Anaconda installer for Python 3 by entering the following on a Linux command line (current as of Feb 2017):

wget https://repo.continuum.io/archive/Anaconda3-4.3.0-Linux-x86_64.sh
bash Anaconda3-4.3.0-Linux-x86_64.sh

Once that’s done, you can create a new environment on your system by calling:

conda env create -f environment.yml

Note: If the conda command is not found, start a new shell to refresh your path.

After it finishes, you should have a new conda environment named altair containing all of the dependencies. Activate it by calling:

source activate altair
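
To confirm that the activated environment is the one Python is actually using, a quick import check can help. This is a minimal sketch: the module list is an assumption (Gensim is mentioned in this README; check environment.yml for the authoritative dependency list), and `missing_modules` is a hypothetical helper.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names
            if importlib.util.find_spec(name) is None]


# Assumed dependencies; see environment.yml for the real list.
required = ["gensim", "numpy"]

print("Python", sys.version.split()[0])
print("Missing:", missing_modules(required) or "none")
```

If anything is reported missing, the environment is probably not active in the current shell.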

Check out the preprocessing README.md to find out where you can obtain our training and testing data.

Notes

Per Gensim, reproducibility between interpreter launches requires use of the PYTHONHASHSEED environment variable to control hash randomization in Python 3.
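
The effect of PYTHONHASHSEED can be seen by starting fresh interpreters and comparing string hashes. The sketch below is illustrative only; the helper name and the seed value 0 are assumptions, and any fixed seed behaves the same way.

```python
import os
import subprocess
import sys


def string_hash_in_new_interpreter(text, seed):
    """Start a fresh Python process with PYTHONHASHSEED set
    and return hash(text) as computed in that process."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)


first = string_hash_in_new_interpreter("altair", seed=0)
second = string_hash_in_new_interpreter("altair", seed=0)
print(first == second)  # True: a fixed seed gives the same hash on every launch
```

In practice this means setting the variable before Python starts, for example by prefixing the command that launches your training script with PYTHONHASHSEED=0. The Gensim documentation also notes that fully deterministic runs require limiting the model to a single worker thread.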