
spcl / ncc

License: BSD-3-Clause
Neural Code Comprehension: A Learnable Representation of Code Semantics

Programming Languages

Python: 139,335 projects (#7 most used programming language)

Projects that are alternatives of or similar to ncc

Awesome Embedding Models
A curated list of awesome embedding-model tutorials, projects, and communities.
Stars: ✭ 1,486 (+817.28%)
Mutual labels:  embeddings, embedding-models
LSCDetection
Data Sets and Models for Evaluation of Lexical Semantic Change Detection
Stars: ✭ 17 (-89.51%)
Mutual labels:  embeddings
llvm-semantics
Formal semantics of LLVM IR in K
Stars: ✭ 42 (-74.07%)
Mutual labels:  llvm-ir
deep-char-cnn-lstm
Deep Character CNN LSTM Encoder with Classification and Similarity Models
Stars: ✭ 20 (-87.65%)
Mutual labels:  embeddings
joern
Open-source code analysis platform for C/C++/Java/Binary/JavaScript/Python/Kotlin based on code property graphs
Stars: ✭ 968 (+497.53%)
Mutual labels:  code-analysis
VarCLR
VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning
Stars: ✭ 30 (-81.48%)
Mutual labels:  embeddings
dpar
Neural network transition-based dependency parser (in Rust)
Stars: ✭ 41 (-74.69%)
Mutual labels:  embeddings
muse-as-service
REST API for sentence tokenization and embedding using Multilingual Universal Sentence Encoder.
Stars: ✭ 45 (-72.22%)
Mutual labels:  embeddings
entity-network
Tensorflow implementation of "Tracking the World State with Recurrent Entity Networks" [https://arxiv.org/abs/1612.03969] by Henaff, Weston, Szlam, Bordes, and LeCun.
Stars: ✭ 58 (-64.2%)
Mutual labels:  embeddings
datastories-semeval2017-task6
Deep-learning model presented in "DataStories at SemEval-2017 Task 6: Siamese LSTM with Attention for Humorous Text Comparison".
Stars: ✭ 20 (-87.65%)
Mutual labels:  embeddings
Archived-SANSA-ML
SANSA Machine Learning Layer
Stars: ✭ 39 (-75.93%)
Mutual labels:  embeddings
doc
Design documents related to the decompilation pipeline.
Stars: ✭ 23 (-85.8%)
Mutual labels:  llvm-ir
towhee
Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.
Stars: ✭ 821 (+406.79%)
Mutual labels:  embeddings
STransE
STransE: a novel embedding model of entities and relationships in knowledge bases (NAACL 2016)
Stars: ✭ 50 (-69.14%)
Mutual labels:  embedding-models
validating-binary-decompilation
Scalable Validator for Binary Lifters
Stars: ✭ 41 (-74.69%)
Mutual labels:  llvm-ir
info-retrieval
Information Retrieval in High Dimensional Data (class deliverables)
Stars: ✭ 33 (-79.63%)
Mutual labels:  embeddings
navec
Compact high quality word embeddings for Russian language
Stars: ✭ 118 (-27.16%)
Mutual labels:  embeddings
Deep-Learning-Experiments-implemented-using-Google-Colab
Colab Compatible FastAI notebooks for NLP and Computer Vision Datasets
Stars: ✭ 16 (-90.12%)
Mutual labels:  embeddings
word2vec-tsne
Google News and Leo Tolstoy: Visualizing Word2Vec Word Embeddings using t-SNE.
Stars: ✭ 59 (-63.58%)
Mutual labels:  embeddings
SentimentAnalysis
(BOW, TF-IDF, Word2Vec, BERT) Word Embeddings + (SVM, Naive Bayes, Decision Tree, Random Forest) Base Classifiers + Pre-trained BERT on Tensorflow Hub + 1-D CNN and Bi-Directional LSTM on IMDB Movie Reviews Dataset
Stars: ✭ 40 (-75.31%)
Mutual labels:  embeddings

Neural Code Comprehension: A Learnable Representation of Code Semantics

ncc (Neural Code Comprehension) is a general machine learning technique for learning semantics from raw code in virtually any programming language. It relies on inst2vec, an embedding space and graph representation of LLVM IR statements and their context.

[Figure: ncc scheme]
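
To make this concrete, the following minimal sketch shows how trained inst2vec embeddings could be consumed from Python. The file names, pickle layout, and example statement are illustrative assumptions, not the repository's exact artifacts:

import pickle

import numpy as np

# Assumed artifact names and layout; adjust to the actual training outputs.
with open("emb.p", "rb") as f:
    emb = np.asarray(pickle.load(f))    # shape: (vocabulary_size, embedding_dim)
with open("vocabulary/dic_pickle", "rb") as f:
    stmt_to_index = pickle.load(f)      # preprocessed LLVM IR statement -> row index

# inst2vec abstracts identifiers and literals before lookup; a vocabulary
# entry might look roughly like this after preprocessing:
stmt = "<%ID> = add nsw i32 <%ID>, <%ID>"
vec = emb[stmt_to_index[stmt]]          # the statement's embedding vector
print(vec.shape)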

This repository contains the code used in the following paper:

Neural Code Comprehension: A Learnable Representation of Code Semantics, Tal Ben-Nun, Alice Shoshana Jakobovits, Torsten Hoefler

Please cite as:

@incollection{ncc,
title = {Neural Code Comprehension: A Learnable Representation of Code Semantics},
author = {Ben-Nun, Tal and Jakobovits, Alice Shoshana and Hoefler, Torsten},
booktitle = {Advances in Neural Information Processing Systems 31},
editor = {S. Bengio and H. Wallach and H. Larochelle and K. Grauman and N. Cesa-Bianchi and R. Garnett},
pages = {3588--3600},
year = {2018},
publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/7617-neural-code-comprehension-a-learnable-representation-of-code-semantics.pdf}
}

Code

Requirements

For training inst2vec embeddings:

  • GNU / Linux or Mac OS
  • Python (3.6.5)
    • tensorflow (1.7.0), or preferably tensorflow-gpu (1.7.0)
    • networkx (2.1)
    • scipy (1.1.0)
    • absl-py (0.2.2)
    • jinja2 (2.10)
    • bokeh (0.12.16)
    • umap (0.1.1)
    • sklearn (0.0)
    • wget (3.2)

Additionally, for training ncc models:

  • GNU / Linux or Mac OS
  • Python (3.6.5)
    • labm8 (0.1.2)
    • keras (2.2.0)

Running the code

1. Training inst2vec embeddings

By default, inst2vec is trained on publicly available code. Additional datasets are available on demand and may be added manually to the training data. For more information on how to do this, as well as on the datasets in general, see datasets.

$ python train_inst2vec.py --helpfull # to see the full list of options
$ python train_inst2vec.py \
>  # --context_width ... (default: 2)
>  # --data ... (default: data/, generated automatically; you may provide your own)
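
The --context_width flag bounds how far apart two statements may lie in the contextual flow graph (XFG) while still forming a skip-gram (target, context) training pair. The sketch below illustrates the idea on a plain networkx graph; the repository's actual XFG construction from data and control dependences is more involved:

import networkx as nx  # networkx is among the listed requirements

def context_pairs(xfg, context_width=2):
    """Yield (target, context) node pairs within context_width hops.

    Illustrative only: the real pipeline builds the XFG from LLVM IR
    data/control flow and preprocesses statements before pairing.
    """
    for node in xfg.nodes:
        reach = dict(nx.single_source_shortest_path_length(xfg, node, cutoff=context_width))
        for ctx, dist in reach.items():
            if dist > 0:  # skip the node itself
                yield node, ctx

# Tiny example: a chain of four statements s0 - s1 - s2 - s3.
g = nx.path_graph(["s0", "s1", "s2", "s3"])
print(sorted(context_pairs(g)))  # s0 pairs with s1 and s2, and so on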

Alternatively, you may skip this step and use pre-trained embeddings.

2. Evaluating inst2vec embeddings

$ python train_inst2vec.py \
> --embeddings_file ... (path to the embeddings p-file to evaluate)
> --vocabulary_folder ... (path to the associated vocabulary folder)
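
A quick qualitative check on trained embeddings is to inspect nearest neighbors: statements with similar semantics should be close in cosine distance. The sketch below reuses the assumed artifact layout from the earlier sketch and scikit-learn, which is among the listed requirements:

import pickle

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

with open("emb.p", "rb") as f:                 # assumed artifact names, as above
    emb = np.asarray(pickle.load(f))
with open("vocabulary/dic_pickle", "rb") as f:
    stmt_to_index = pickle.load(f)
index_to_stmt = {i: s for s, i in stmt_to_index.items()}

query = "<%ID> = add nsw i32 <%ID>, <%ID>"     # hypothetical vocabulary entry
sims = cosine_similarity(emb[stmt_to_index[query]][None, :], emb)[0]
for i in np.argsort(-sims)[1:6]:               # top-5 neighbors, excluding the query
    print("%.3f  %s" % (sims[i], index_to_stmt[i]))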

3. Training on tasks with ncc

We provide the code for training three downstream tasks using the same neural architecture (ncc) and inst2vec embeddings.

Algorithm classification

Task: Classify applications into 104 classes given their raw code.
Code and classes provided by https://sites.google.com/site/treebasedcnn/ (see Convolutional neural networks over tree structures for programming language processing)

Train:

$ python train_task_classifyapp.py --helpfull # to see the full list of options
$ python train_task_classifyapp.py

Alternatively, display results from a pre-trained model.
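
For orientation, here is a minimal Keras sketch of an ncc-style classifier: statement indices pass through a frozen embedding layer initialized with the inst2vec matrix, an LSTM summarizes the sequence, and a softmax head picks one of the 104 classes. All sizes are assumed for illustration; see train_task_classifyapp.py for the actual model:

import numpy as np
from keras.layers import Dense, Embedding, LSTM
from keras.models import Sequential

VOCAB_SIZE, EMB_DIM, MAX_LEN, NUM_CLASSES = 8565, 200, 500, 104  # assumed sizes

# Placeholder for the pre-trained inst2vec matrix; load the real p-file here.
inst2vec = np.zeros((VOCAB_SIZE, EMB_DIM), dtype=np.float32)

model = Sequential([
    Embedding(VOCAB_SIZE, EMB_DIM, weights=[inst2vec],
              input_length=MAX_LEN, trainable=False),
    LSTM(64),                                  # summarize the statement sequence
    Dense(NUM_CLASSES, activation="softmax"),  # one of the 104 algorithm classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()

The same skeleton, with a different output head (and, for device mapping, auxiliary inputs), underlies the two tasks below.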

Optimal device mapping prediction

Task: Predict the best-performing compute device (e.g., CPU, GPU) for a given OpenCL kernel.
Code and classes provided by https://github.com/ChrisCummins/paper-end2end-dl (see End-to-end Deep Learning of Optimization Heuristics)

Train:

$ python train_task_devmap.py --helpfull # to see the full list of options
$ python train_task_devmap.py

Alternatively, display results from a pre-trained model.
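
The device mapping head differs from the classifier above mainly in its inputs and output: following the cited DeepTune-style setup, the sequence summary is concatenated with auxiliary kernel features (e.g., data transfer size and work-group size) before a two-way softmax. Names and sizes below are again assumptions:

import numpy as np
from keras.layers import Concatenate, Dense, Embedding, Input, LSTM
from keras.models import Model

VOCAB_SIZE, EMB_DIM, MAX_LEN = 8565, 200, 500                 # assumed sizes
inst2vec = np.zeros((VOCAB_SIZE, EMB_DIM), dtype=np.float32)  # placeholder weights

code_in = Input(shape=(MAX_LEN,), name="statements")
aux_in = Input(shape=(2,), name="aux_features")   # e.g., transfer size, work-group size
x = Embedding(VOCAB_SIZE, EMB_DIM, weights=[inst2vec], trainable=False)(code_in)
x = LSTM(64)(x)
x = Concatenate()([x, aux_in])
out = Dense(2, activation="softmax", name="cpu_or_gpu")(x)

model = Model(inputs=[code_in, aux_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])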

Optimal thread coarsening factor prediction

Task: Predict the optimal thread coarsening factor for a given OpenCL kernel.
Code and classes provided by https://github.com/ChrisCummins/paper-end2end-dl (see End-to-end Deep Learning of Optimization Heuristics)

Train:

$ python train_task_threadcoarsening.py --helpfull # to see the full list of options
$ python train_task_threadcoarsening.py

Alternatively, display results from a pre-trained model.
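
Since the cited work selects the coarsening factor from a fixed candidate set ({1, 2, 4, 8, 16, 32}), the task reduces to a six-way classification over the same ncc architecture; a trivial sketch of the label encoding:

# Candidate coarsening factors from the cited work, treated as class labels.
FACTORS = (1, 2, 4, 8, 16, 32)

def factor_to_class(factor):
    """Map a coarsening factor to its class index (e.g., 8 -> 3)."""
    return FACTORS.index(factor)

def class_to_factor(cls):
    """Map a predicted class index back to a coarsening factor."""
    return FACTORS[cls]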

Contact

We would be thrilled if you used and built upon this work. Contributions, comments, and issues are welcome!

License

ncc is published under the New BSD (BSD-3-Clause) license; see LICENSE.
