ALFA-group / adversarial-code-generation

Licence: other
Source code for the ICLR 2021 work "Generating Adversarial Computer Programs using Optimized Obfuscations"

Programming Languages

python
139335 projects - #7 most used programming language
shell
77523 projects
Makefile
30231 projects
java
68154 projects - #9 most used programming language

Projects that are alternatives of or similar to adversarial-code-generation

embeddings-for-trees
Set of PyTorch modules for developing and evaluating different algorithms for embedding trees.
Stars: ✭ 19 (+18.75%)
Mutual labels:  ml4code, ml4se
minimal-nmt
A minimal nmt example to serve as an seq2seq+attention reference.
Stars: ✭ 36 (+125%)
Mutual labels:  seq2seq
AdverseDrive
Attacking Vision based Perception in End-to-end Autonomous Driving Models
Stars: ✭ 24 (+50%)
Mutual labels:  adversarial-machine-learning
sentence2vec
Deep sentence embedding using Sequence to Sequence learning
Stars: ✭ 23 (+43.75%)
Mutual labels:  seq2seq
classifier multi label seq2seq attention
multi-label,classifier,text classification,多标签文本分类,文本分类,BERT,ALBERT,multi-label-classification,seq2seq,attention,beam search
Stars: ✭ 26 (+62.5%)
Mutual labels:  seq2seq
type4py
Type4Py: Deep Similarity Learning-Based Type Inference for Python
Stars: ✭ 41 (+156.25%)
Mutual labels:  ml4se
procedural-advml
Task-agnostic universal black-box attacks on computer vision neural network via procedural noise (CCS'19)
Stars: ✭ 47 (+193.75%)
Mutual labels:  adversarial-machine-learning
CVAE Dial
CVAE_XGate model in paper "Xu, Dusek, Konstas, Rieser. Better Conversations by Modeling, Filtering, and Optimizing for Coherence and Diversity"
Stars: ✭ 16 (+0%)
Mutual labels:  seq2seq
Deep-Learning-Tensorflow
Gathers Tensorflow deep learning models.
Stars: ✭ 50 (+212.5%)
Mutual labels:  seq2seq
submodlib
Summarize Massive Datasets using Submodular Optimization
Stars: ✭ 36 (+125%)
Mutual labels:  combinatorial-optimization
dynmt-py
Neural machine translation implementation using dynet's python bindings
Stars: ✭ 17 (+6.25%)
Mutual labels:  seq2seq
tiro
TIRO - A hybrid iterative deobfuscation framework for Android applications
Stars: ✭ 20 (+25%)
Mutual labels:  program-analysis
adversarial-recommender-systems-survey
The goal of this survey is two-fold: (i) to present recent advances on adversarial machine learning (AML) for the security of RS (i.e., attacking and defense recommendation models), (ii) to show another successful application of AML in generative adversarial networks (GANs) for generative applications, thanks to their ability for learning (high-…
Stars: ✭ 110 (+587.5%)
Mutual labels:  adversarial-machine-learning
seq3
Source code for the NAACL 2019 paper "SEQ^3: Differentiable Sequence-to-Sequence-to-Sequence Autoencoder for Unsupervised Abstractive Sentence Compression"
Stars: ✭ 121 (+656.25%)
Mutual labels:  seq2seq
keras-chatbot-web-api
Simple keras chat bot using seq2seq model with Flask serving web
Stars: ✭ 51 (+218.75%)
Mutual labels:  seq2seq
ThinkMatch
Code & pretrained models of novel deep graph matching methods.
Stars: ✭ 639 (+3893.75%)
Mutual labels:  combinatorial-optimization
GHOST
General meta-Heuristic Optimization Solving Toolkit
Stars: ✭ 28 (+75%)
Mutual labels:  combinatorial-optimization
chatbot
kbqa task-oriented qa seq2seq ir neo4j jena seq2seq tf chatbot chat
Stars: ✭ 32 (+100%)
Mutual labels:  seq2seq
transformer
A PyTorch Implementation of "Attention Is All You Need"
Stars: ✭ 28 (+75%)
Mutual labels:  seq2seq
surveyor
A symbolic debugger for C/C++ (via LLVM), machine code, and JVM programs
Stars: ✭ 14 (-12.5%)
Mutual labels:  program-analysis

Generating Adversarial Computer Programs using Optimized Obfuscations

Code repository for the paper Generating Adversarial Computer Programs using Optimized Obfuscations, published at ICLR 2021.

Link to paper - https://openreview.net/forum?id=PH5PH9ZO_4

Slides -

Citation

@inproceedings{
shashank2021generating,
title={Generating Adversarial Computer Programs using Optimized Obfuscations},
author={Shashank Srikant and Sijia Liu and Tamara Mitrovska and Shiyu Chang and Quanfu Fan and Gaoyuan Zhang and Una-May O{'R}eilly},
booktitle={International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=PH5PH9ZO_4}
}

Abstract

Machine learning (ML) models that learn and predict properties of computer programs are increasingly being adopted and deployed. These models have demonstrated success in applications such as auto-completing code, summarizing large programs, and detecting bugs and malware in programs. In this work, we investigate principled ways to adversarially perturb a computer program to fool such learned models, and thus determine their adversarial robustness. We use program obfuscations, which have conventionally been used to avoid attempts at reverse engineering programs, as adversarial perturbations. These perturbations modify programs in ways that do not alter their functionality but can be crafted to deceive an ML model when making a decision. We provide a general formulation for an adversarial program that allows applying multiple obfuscation transformations to a program in any language. We develop first-order optimization algorithms to efficiently determine two key aspects -- which parts of the program to transform, and what transformations to use. We show that it is important to optimize both these aspects to generate the best adversarially perturbed program. Due to the discrete nature of this problem, we also propose using randomized smoothing to improve the attack loss landscape to ease optimization. We evaluate our work on Python and Java programs on the problem of program summarization. We show that our best attack proposal achieves an improvement over a state-of-the-art attack generation approach for programs trained on a seq2seq model. We further show that our formulation is better at training models that are robust to adversarial attacks.
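The randomized-smoothing idea mentioned in the abstract can be illustrated with a toy sketch. This is an illustrative assumption only: the paper applies smoothing to a discrete attack objective, not to the 1-D surrogate loss used here. The idea is that averaging the loss over random perturbations yields a smoother surface that is easier to optimize with first-order methods.

```python
import random

# Toy 1-D surrogate for a jagged attack loss (illustrative assumption only;
# the paper's objective is defined over discrete program transformations).
def loss(x):
    return abs(x - 3) + (0.5 if int(x) % 2 == 0 else 0.0)

def smoothed_loss(x, mu=0.5, samples=200, seed=0):
    """Estimate E[loss(x + mu * u)] with u ~ Uniform(-1, 1)."""
    rng = random.Random(seed)
    total = sum(loss(x + mu * rng.uniform(-1.0, 1.0)) for _ in range(samples))
    return total / samples
```

The smoothed estimate averages out the discontinuous jumps of `loss`, at the cost of extra evaluations per query.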

Authors

Shashank Srikant, Sijia Liu, Tamara Mitrovska, Shiyu Chang, Quanfu Fan, Gaoyuan Zhang, and Una-May O’Reilly.

If you face issues running this codebase, please open an issue on this repository and include as much information as possible to help reproduce it: the exact command you ran, the configuration you used, the output you saw, etc. See posts like these, which describe how to communicate problems effectively via GitHub issues.

To discuss any other details on the method we introduce, contact Shashank ([email protected]), Sijia ([email protected]), or Una-May ([email protected]).

Talk video

If you want to get a big-picture understanding of our formulation, please watch this talk Shashank presented at the MIT-IBM Watson AI Lab.

Link - https://www.youtube.com/watch?v=b23HNPilfB4

Instructions

Our codebase builds on the well-documented codebase released by the authors of Semantic Robustness of Models of Source Code.

The repository provides a number of Makefile commands to download datasets, transform their ASTs, train code models, and finally attack and evaluate the performance on a test set. For full details on the directory structure and the organization of the files in this repository, see section Directory Structure.

Our attack formulation is mainly implemented in the following files -

The following instructions reproduce the results reported in our work:

  • Download and normalize datasets
make download-datasets
make normalize-datasets
  • Create the transformed datasets
make apply-transforms-sri-py150
make apply-transforms-c2s-java-small
make extract-transformed-tokens
  • Run normal training for seq2seq
./experiments/normal_seq2seq_train.sh

This command will train the model on the Python dataset. To train on Java, replace sri/py150 with c2s/java-small inside the script.

  • Create adversarial datasets and evaluate
./experiments/run_attack_0.sh
./experiments/run_attack_1.sh

These scripts contain commands for various experiments. For each experiment we first create an adversarial dataset by using our attack and then evaluate the trained seq2seq model on the generated dataset. The numbers 0 and 1 in the script names refer to the GPU on which the experiments are run.
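The GPU pinning these scripts encode in their names can also be reproduced from Python; here is a minimal sketch, assuming the attack scripts, like most PyTorch code, respect the `CUDA_VISIBLE_DEVICES` environment variable (the script path is taken from above):

```python
import os
import subprocess

# Restrict an experiment to GPU 0 (assumption: the scripts honor
# CUDA_VISIBLE_DEVICES, as standard PyTorch code does).
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")

# Uncomment to launch the actual experiment script:
# subprocess.run(["./experiments/run_attack_0.sh"], env=env, check=True)
print(env["CUDA_VISIBLE_DEVICES"])
```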

This generates files in the directory ./datasets/adversarial/experiment-configuration/tokens/dataset-name/, where experiment-configuration is the same as short_name from the run_attack_* scripts and dataset-name is either sri/py150 or c2s/java-small. Within this directory,

  • targets-test-gradient.json contains the tokens which our attack algorithm recommends. This JSON file has the following format --
{
    "transforms.Combined" : {
        "33": {
            "@R_1@": "arquillian",
            "system . out . println ( @R_4@ ) ;": "",
            "@R_3@": "fs",
            "if ( false ) { int @R_6@ = 1 ; } ;": "",
            "@R_2@": "snapshot name",
            "if ( false ) { int @R_5@ = 1 ; } ;": ""
        }
    }
}

In the example above, transforms.Combined is the transform name and 33 is an index for some program in the dataset.
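A small sketch of how one might iterate over this file. The keys mirror the example above; the snippet inlines a fragment of the JSON so it runs standalone, but in practice you would `json.load()` the `targets-test-gradient.json` file from the adversarial-dataset directory described above.

```python
import json

# Inline fragment mirroring the targets-test-gradient.json format shown above.
raw = """
{
    "transforms.Combined": {
        "33": {
            "@R_1@": "arquillian",
            "@R_2@": "snapshot name"
        }
    }
}
"""

targets = json.loads(raw)
for transform, programs in targets.items():       # e.g. "transforms.Combined"
    for index, replacements in programs.items():  # program index, e.g. "33"
        for site, token in sorted(replacements.items()):
            # each site (e.g. "@R_1@") maps to the token the attack chose
            print(f"{transform}[{index}]: {site} -> {token!r}")
```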

  • The file gradient-targeting/test.tsv contains the test set with the attacks identified by our formulation inserted in them.

  • Collect all results into a csv table and generate plots

python experiments/collect_results.py 1

Directory Structure

The instructions that follow have been adapted from Ramakrishnan et al.'s codebase. This repository contains the following directories:

./datasets

Note: the datasets are all much too large to be included in this GitHub repo. This is simply the structure as it would exist on disk once our framework is set up.

./datasets
  + ./raw            # The four datasets in "raw" form
  + ./normalized     # The four datasets in the "normalized" JSON-lines representation 
  + ./preprocess
    + ./tokens       # The four datasets in a representation suitable for token-level models
    + ./ast-paths    # The four datasets in a representation suitable for code2seq
  + ./transformed    # The four datasets transformed via our code-transformation framework 
    + ./normalized   # Transformed datasets normalized back into the JSON-lines representation
    + ./preprocessed # Transformed datasets preprocessed into:
      + ./tokens     # ... a representation suitable for token-level models
      + ./ast-paths  # ... a representation suitable for code2seq
  + ./adversarial    # Datasets in the format < source, target, transformed-variant #1, #2, ..., #K >
    + ./tokens       # ... in a token-level representation
    + ./ast-paths    # ... in an ast-paths representation

./models

We provide two machine-learning-on-code models, both trained on the code summarization task. The seq2seq model has been modified to incorporate our attack formulation and includes an adversarial training loop. The branch pytorch-code2seq implements our attack formulation on code2seq; this is work in progress.

./models
  + ./pytorch-seq2seq   # seq2seq model implementation
  + ./pytorch-code2seq  # code2seq model implementation, available on branch pytorch-code2seq. This is WIP.

./results

This directory stores results that are small enough to be checked into GitHub. It is generated automatically once the codebase is set up.

./scripts

This directory contains a large number of scripts for various chores related to running and maintaining this code-transformation infrastructure.

./tasks

This directory houses the implementations of various pieces of our core framework:

./tasks
  + ./astor-apply-transforms
  + ./depth-k-test-seq2seq
  + ./download-c2s-dataset
  + ./download-csn-dataset
  + ./extract-adv-dataset-c2s
  + ./extract-adv-dataset-tokens
  + ./generate-baselines
  + ./integrated-gradients-seq2seq
  + ./normalize-raw-dataset
  + ./preprocess-dataset-c2s
  + ./preprocess-dataset-tokens
  + ./spoon-apply-transforms
  + ./test-model-seq2seq
  + ./train-model-seq2seq

./vendor

This directory contains dependencies in the form of git submodules.

Makefile

We have one overarching Makefile that can be used to drive a number of the data-generation, training, testing, and evaluation tasks.

download-datasets                  (DS-1) Downloads all prerequisite datasets
normalize-datasets                 (DS-2) Normalizes all downloaded datasets
extract-ast-paths                  (DS-3) Generate preprocessed data in a form usable by code2seq style models. 
extract-tokens                     (DS-3) Generate preprocessed data in a form usable by seq2seq style models. 
apply-transforms-c2s-java-med      (DS-4) Apply our suite of transforms to code2seq's java-med dataset.
apply-transforms-c2s-java-small    (DS-4) Apply our suite of transforms to code2seq's java-small dataset.
apply-transforms-csn-java          (DS-4) Apply our suite of transforms to CodeSearchNet's java dataset.
apply-transforms-csn-python        (DS-4) Apply our suite of transforms to CodeSearchNet's python dataset.
apply-transforms-sri-py150         (DS-4) Apply our suite of transforms to SRI Lab's py150k dataset.
extract-transformed-ast-paths      (DS-6) Extract preprocessed representations (ast-paths) from our transformed (normalized) datasets
extract-transformed-tokens         (DS-6) Extract preprocessed representations (tokens) from our transformed (normalized) datasets
extract-adv-datasets-tokens        (DS-7) Extract preprocessed adversarial datasets (representations: tokens)
docker-cleanup                     (MISC) Cleans up old and out-of-sync Docker images.
submodules                         (MISC) Ensures that submodules are setup.
help                               (MISC) This help.
test-model-seq2seq                 (TEST) Tests the seq2seq model on a selected dataset.
train-model-seq2seq                (TRAIN) Trains the seq2seq model on a selected dataset.