
HuantWang / FUNDED_NISL

Licence: other
FUNDED is a novel learning framework for building vulnerability detection models.

Programming Languages

python
139335 projects - #7 most used programming language
java
68154 projects - #9 most used programming language
scala
5932 projects
shell
77523 projects
Batchfile
5799 projects
c
50402 projects - #5 most used programming language

Projects that are alternatives of or similar to FUNDED NISL

xssfinder
Toolset for detecting reflected xss in websites
Stars: ✭ 105 (+114.29%)
Mutual labels:  vulnerability-detection
wazuh-packages
Wazuh - Tools for packages creation
Stars: ✭ 54 (+10.2%)
Mutual labels:  vulnerability-detection
wazuh-cloudformation
Wazuh - Amazon AWS Cloudformation
Stars: ✭ 32 (-34.69%)
Mutual labels:  vulnerability-detection
zdh server
Data collection platform ZDH, ETL processing service
Stars: ✭ 53 (+8.16%)
Mutual labels:  datacollection
T-XPLOITER
T-XPLOITER is a Perl program to detect and (even) exploit websites. Why the name T-XPLOITER? T means Triple, XPLOITER means Exploiter. The program has 3 features and functions to detect and (even) exploit websites, just check it out :).
Stars: ✭ 13 (-73.47%)
Mutual labels:  vulnerability-detection
wazuh-ansible
Wazuh - Ansible playbook
Stars: ✭ 166 (+238.78%)
Mutual labels:  vulnerability-detection
gradejs
GradeJS analyzes production Webpack bundles without having access to the source code of a website. Instantly see vulnerabilities, outdated packages, and more just by entering a web application URL.
Stars: ✭ 362 (+638.78%)
Mutual labels:  vulnerability-detection
GNNSCVulDetector
Smart Contract Vulnerability Detection Using Graph Neural Networks (IJCAI-20 Accepted)
Stars: ✭ 42 (-14.29%)
Mutual labels:  vulnerability-detection
iust deep fuzz
Advanced file format fuzzer based on deep neural language models.
Stars: ✭ 36 (-26.53%)
Mutual labels:  vulnerability-detection
scan-cli-plugin
Docker Scan is a Command Line Interface to run vulnerability detection on your Dockerfiles and Docker images
Stars: ✭ 135 (+175.51%)
Mutual labels:  vulnerability-detection
MixewayScanner
Mixeway Scanner is a Spring Boot application that aggregates integrations with a number of open-source vulnerability scanners, of both SAST and DAST types
Stars: ✭ 15 (-69.39%)
Mutual labels:  vulnerability-detection
patton-cli
The knife of the Admin & Security auditor
Stars: ✭ 42 (-14.29%)
Mutual labels:  vulnerability-detection
quick-scripts
A collection of my quick and dirty scripts for vulnerability POC and detections
Stars: ✭ 73 (+48.98%)
Mutual labels:  vulnerability-detection
DGFraud-TF2
A Deep Graph-based Toolbox for Fraud Detection in TensorFlow 2.X
Stars: ✭ 84 (+71.43%)
Mutual labels:  graphneuralnetwork
dr checker 4 linux
Port of "DR.CHECKER : A Soundy Vulnerability Detection Tool for Linux Kernel Drivers" to Clang/LLVM 10 and Linux Kernel
Stars: ✭ 34 (-30.61%)
Mutual labels:  vulnerability-detection
aparoid
Static and dynamic Android application security analysis
Stars: ✭ 62 (+26.53%)
Mutual labels:  vulnerability-detection
vulnerability-db
Vulnerability database and package search for sources such as OSV, NVD, GitHub and npm.
Stars: ✭ 36 (-26.53%)
Mutual labels:  vulnerability-detection
PyCPU
Central Processing Unit Information Gathering Tool
Stars: ✭ 19 (-61.22%)
Mutual labels:  vulnerability-detection
vulnerablecode
A free and open vulnerabilities database and the packages they impact. And the tools to aggregate and correlate these vulnerabilities. Sponsored by NLnet https://nlnet.nl/project/vulnerabilitydatabase/ for https://www.aboutcode.org/ Chat at https://gitter.im/aboutcode-org/vulnerablecode Docs at https://vulnerablecode.readthedocs.org/
Stars: ✭ 269 (+448.98%)
Mutual labels:  vulnerability-detection
CARE-GNN
Code for CIKM 2020 paper Enhancing Graph Neural Network-based Fraud Detectors against Camouflaged Fraudsters
Stars: ✭ 121 (+146.94%)
Mutual labels:  graphneuralnetwork

FUNDED

Using graph neural networks and open-source repositories to detect code vulnerabilities. This is an implementation of the model described in:

Huanting Wang, Guixin Ye, Zhanyong Tang, Shin Hwei Tan, Songfang Huang, Dingyi Fang, Yansong Feng, Lizhong Bian and Zheng Wang, "Combining Graph-based Learning with Automated Data Collection for Code Vulnerability Detection"

FUNDED is a novel learning framework for building vulnerability detection models. It leverages advances in graph neural networks (GNNs) to develop a graph-based learning method that captures and reasons about a program's control, data, and call dependencies.
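To make the idea concrete, the sketch below shows one way a program could be represented as a graph with several typed edge relations before being fed to a GNN. It is a minimal illustration only; the class name and edge labels are ours, not FUNDED's actual data structures (see FUNDED/data for those).

# Minimal illustration (not FUNDED's actual data classes): a program graph with
# several typed edge relations, which is the kind of input a GNN can reason over.
from typing import Dict, List, Tuple

class ProgramGraph:
    """Nodes are numbered program entities; each relation keeps its own list of
    (source, target) pairs, one list per edge type."""

    def __init__(self, num_nodes: int):
        self.num_nodes = num_nodes
        self.edges: Dict[str, List[Tuple[int, int]]] = {
            "ast_child": [],      # syntactic structure
            "control_flow": [],   # control dependencies
            "data_flow": [],      # data dependencies
            "call": [],           # call dependencies
        }

    def add_edge(self, relation: str, src: int, dst: int) -> None:
        self.edges[relation].append((src, dst))

# A tiny three-node fragment: node 0 declares a variable used by node 2,
# and control flows from node 1 to node 2.
g = ProgramGraph(num_nodes=3)
g.add_edge("ast_child", 0, 1)
g.add_edge("control_flow", 1, 2)
g.add_edge("data_flow", 0, 2)
print(g.edges)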

Check our paper for detailed information.

November 2020 - The paper was accepted to IEEE TIFS!

More datasets are available here!

Contents

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

Install the necessary dependencies before running the project. The Software entries are needed for data preprocessing, while the Python Libraries list the environment we have tested. For more details, please refer to requirements.txt:

Software:
Python Libraries:

Setup


This section gives the steps, explanations and examples for getting the project running.

1) Clone this repo

$ git clone git@github.com:HuantWang/FUNDED_NISL.git

2) Install Prerequisites

$ pip install -r requirements.txt

3) Run the testcase

$ cd NISL_TIFS2021/FUNDED/cli
$ CUDA_VISIBLE_DEVICES=2 python train.py GGNN GraphBinaryClassification ../data/data/CWE-77
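If you want to train on several CWE datasets in one go, a small wrapper like the sketch below can repeat the command above. It is only a convenience example; the dataset paths and GPU index are placeholders to adjust for your setup.

# Hypothetical convenience wrapper around the command above: train on several
# CWE datasets in turn. Dataset paths and the GPU index are placeholders.
import os
import subprocess

CWE_DIRS = ["../data/data/CWE-77", "../data/data/CWE-399"]  # adjust to your data

for data_dir in CWE_DIRS:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES="2")  # pick an available GPU
    subprocess.run(
        ["python", "train.py", "GGNN", "GraphBinaryClassification", data_dir],
        env=env,
        check=True,
    )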



GNN Detection module

This part describes the GNN detection model's source code structure and a partial sample dataset.

Detection Structure

├── LICENSE
├── README.md                       <- The top-level README for developers using this project.
├── requirements.txt                <- The python environment for developers using this project.
├── FUNDED
│   ├── cli     
│   │   ├── train.py                <- entry point for training models.
│   │   ├── test.py                 <- tests a trained model on the given data.
│   │   ├── __init__.py
│   ├── cli_utils     
│   │   ├── default_hypers	
│   │   │   ├── GraphBinaryClassification_GGNN.json
│   │   ├── dataset_utils.py	
│   │   ├── model_utils.py	
│   │   ├── param_helpers.py	
│   │   ├── task_utils.py
│   │   ├── training_utils.py
│   │   ├── __init__.py	
│   ├── data                       
│   │   ├── data	
│   │   │   ├── data_preprocess.py
│   │   │   ├── our_map_all.txt
│   │   │   ├── __init__.py
│   │   ├── graph_dataset.py	
│   │   ├── jsonl_graph_dataset.py	
│   │   ├── jsonl_graph_property_dataset.py	
│   │   ├── __init__.py	
│   ├── layers                      
│   │   ├── message_passing	
│   │   │   ├── ggnn.py
│   │   │   ├── gnn_edge_mlp.py
│   │   │   ├── gnn_film.py
│   │   │   ├── message_passing.py
│   │   │   ├── __init__.py
│   │   ├── gnn.py
│   │   ├── graph_global_exchange.py	
│   │   ├── nodes_to_graph_representation.py
│   │   ├── __init__.py	
│   ├── models   
│   │   ├── graph_binary_classification_task.py
│   │   ├── graph_regression_task.py
│   │   ├── graph_task_model.py 
│   │   ├── node_multiclass_task.py
│   │   ├── __init__.py	
│   ├── utils                          
│   │   ├── activation.py
│   │   ├── constants.py
│   │   ├── gather_dense_gradient.py
│   │   ├── param_helpers.py
│   └── └── __init__.py	
└────── __init__.py

Data Preprocessing

To construct the AST, we use Soot for Java, ANTLR for Swift and PHP, and Joern for C/C++.

c/c++


For C/C++, we download datasets for different CWE types from SARD, CVE, and GitHub.

The specific steps of data preprocessing are as follows:

Warning: Modify the paths in the code to point to your own data.

  1. Slicing data
$ cd FUNDED_NISL/Edge_processing/slicec_7edges_funcblock/src/main/java/slice
  • Run ClassifyFileOfProject.java to extract all the C files.
  • Run Main.java to slice the data at the function level.
  2. Extracting different edge relationships

We then traverse all of the source code's AST nodes, which have been parsed by CDT. While traversing, all nodes are numbered in sequence, and the relationships between different edges are derived according to specific rules.
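As a toy illustration of this numbering-and-edge-extraction idea (the project itself uses CDT and Joern on C/C++ code), the sketch below numbers the nodes of a small Python AST in visit order and records parent-child edges between the numbers.

# Toy analogy only: the project parses C/C++ with CDT and Joern, but the same
# idea -- number AST nodes in visit order, then record typed edges between the
# numbers -- can be sketched with Python's own ast module.
import ast

source = "x = 1\ny = x + 2\n"
tree = ast.parse(source)

node_ids = {node: i for i, node in enumerate(ast.walk(tree))}  # node -> sequential id

ast_edges = []  # (parent_id, child_id) pairs, one edge type among several
for node, nid in node_ids.items():
    for child in ast.iter_child_nodes(node):
        ast_edges.append((nid, node_ids[child]))

print(ast_edges)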

$ cd FUNDED_NISL/Edge_processing/slicec_7edges_funcblock/src/main/java/sevenEdges
  • Use Joern to extract all control flows and data flows from the source code; see the Joern documentation for details.
  • Run Main.java to extract the remaining edge types.
  • Run concateJoern.java to concatenate all edges.

We provide a demo dataset for data preprocessing.

java


For Java, we download data from SARD, CVE, and GitHub.

Following the same approach as for C/C++ above, we construct all edge relationships using Soot and JDT.

Warning: Replace the paths with your own data paths.

$ cd NISL_TIFS2021/EdgesGenerationAndDataPreprocess/Java_jdt_AST_CDFG/src/main/java/yoshikihigo/tinypdg/
$ java Main.java sourceFilePath saveFilePath

PHP and Swift


For PHP and Swift, we collect datasets from SARD, CVE, and GitHub.

We then extract edge nodes from the AST constructed with ANTLR.

$ cd NISL_TIFS2021/EdgesGenerationAndDataPreprocess/php_swift/src/php/main
$ java TestPhp.java sourceFilePath saveFilePath

$ cd NISL_TIFS2021/EdgesGenerationAndDataPreprocess/php_swift/src/swift3/main
$ java TestSwift3.java sourceFilePath saveFilePath

Dataset


The datasets used are HERE.
The edges dataset contains C-language data for 44 different CWE types. Through script processing, we obtain the final model inputs. For example, data/data/CWE-399 and data/data/CWE-400 contain test datasets whose graphs consist of the AST, CFG, and PDG. A parsing sketch follows the example rows below.

Fields

cwe file_id target contents
399 0a2a9a6f-779e-47b4-823e-43eccd125b4f.c$$$0 0 1,2 1,3 2,7,9 (1,9,0)(2,8,1)(3,7,2) ...
399 1b733c0b-30d5-4cc2-9431-8695795abfed.c$$$1 1 6,7 4,5 1,4,9 (2,7,0)(3,5,1)(4,2,2) ...
399 3e9bebda-cef3-4988-9543-a5e5473849c2.c$$$0 0 1,2 3,5 3,5,8 (1,2,0)(2,6,1)(4,8,2) ...
399 8bcbb6c4-3f3f-471c-b2dc-ab9151bb22f8.c$$$2 1 2,7 2,9 2,3,7 (6,7,0)(1,5,1)(6,9,2) ...
399 53ee12a1-ba49-41f2-a163-c2b662a4db27.c$$$0 0 4,5 7,8 3,6,8 (5,8,0)(3,6,1)(7,8,2) ...
... ... ...
400 8388fdcf-40cf-4e59-9f11-17d9e320efd8.c$$$4 0 1,7 2,5 3,4,8 (4,7,0)(5,8,1)(2,9,2) ...
400 91978dee-4ee4-428b-8576-ffb49e8dc12a.c$$$6 1 2,3 3,8 3,7,9 (3,6,0)(4,6,1)(2,8,2) ...
400 113353a8-f804-4aff-a81a-15f20e638d4b.c$$$1 1 4,6 4,7 5,6,7 (3,7,0)(4,5,1)(8,9,2) ...
400 b7b5ae35-d478-4c51-96c2-8f107fc08fde.c$$$3 1 2,5 7,8 1,7,8 (5,8,0)(3,6,1)(2,8,2) ...
400 e831aff3-bd88-4ef7-a5b0-2d87e1b20fbe.c$$$0 0 6,8 2,8 4,6,9 (6,9,0)(1,5,1)(1,4,2) ...
... ... ...
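Assuming each row follows the layout above (CWE id, file id, target label, then the edge contents), a single row can be split into its fields as in the minimal sketch below; the project's actual loaders live under FUNDED/data, so this only illustrates the field structure.

# Minimal parsing sketch, assuming the row layout shown above:
# <cwe> <file_id> <target> <contents...>.
def parse_row(line: str) -> dict:
    cwe, file_id, target, contents = line.split(maxsplit=3)
    return {
        "cwe": int(cwe),
        "file_id": file_id,     # source file name plus a $$$<index> suffix
        "target": int(target),  # 0/1 label (presumably 1 marks a vulnerable sample)
        "contents": contents,   # whitespace-separated edge groups
    }

row = parse_row("399 0a2a9a6f-779e-47b4-823e-43eccd125b4f.c$$$0 0 1,2 1,3 2,7,9")
print(row["cwe"], row["target"])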

Results

Example results of training on the sample dataset CWE-400, with the model checkpoint saved at 60 epochs.

Dataset parameters: {

 "max_nodes_per_batch": 128,
 "num_fwd_edge_types": 7, 
 "add_self_loop_edges": true, 
 "tie_fwd_bkwd_edges": true,
 "threshold_for_classification": 0.5

}

Model parameters: {

 "gnn_aggregation_function": "sum", 
 "gnn_message_activation_function": "ReLU", 
 "gnn_hidden_dim": 256, 
 "gnn_use_target_state_as_input": false, 
 "gnn_normalize_by_num_incoming": true, 
 "gnn_num_edge_MLP_hidden_layers": 1, 
 "gnn_num_aggr_MLP_hidden_layers": null, 
 "gnn_message_calculation_class": "RGIN", 
 "gnn_initial_node_representation_activation": "tanh", 
 "gnn_dense_intermediate_layer_activation": "tanh", 
 "gnn_num_layers": 5, "gnn_dense_every_num_layers": 10000, 
 "gnn_residual_every_num_layers": 2, 
 "gnn_use_inter_layer_layernorm": true, 
 "gnn_layer_input_dropout_rate": 0.2, 
 "gnn_global_exchange_mode": "gru", 
 "gnn_global_exchange_every_num_layers": 10000, 
 "gnn_global_exchange_weighting_fun": "softmax", 
 "gnn_global_exchange_num_heads": 4, 
 "gnn_global_exchange_dropout_rate": 0.2, 
 "optimizer": "Adam", "learning_rate": 0.001, 
 "learning_rate_decay": 0.98, "momentum": 0.85, 
 "gradient_clip_value": 1.0, 
 "use_intermediate_gnn_results": false, 
 "graph_aggregation_num_heads": 16, 
 "graph_aggregation_hidden_layers": [128], 
 "graph_aggregation_dropout_rate": 0.2

}
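The model parameters above mirror the defaults shipped under FUNDED/cli_utils/default_hypers. If you want to experiment with different values, one simple option is to load that JSON file and apply overrides before training, as in the sketch below; it assumes the file is a flat JSON dictionary of parameters, so adapt the merge if the actual file is nested.

# Sketch only: load the defaults shipped under FUNDED/cli_utils/default_hypers
# and apply a few overrides before experimenting.
import json

DEFAULTS = "FUNDED/cli_utils/default_hypers/GraphBinaryClassification_GGNN.json"

with open(DEFAULTS) as f:
    hypers = json.load(f)

hypers.update({"gnn_hidden_dim": 128, "learning_rate": 0.0005})  # illustrative overrides

print(json.dumps(hypers, indent=2))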

== Running on test dataset
Loading data from ../data/data/tem_CWE-77/ast.
Loading data from ../data/data/tem_CWE-77/cdfg.
Restoring best model state from trained_model/GGNN_GraphBinaryClassification__2020-11-30_10-41-23_best.pkl.
NoneCP_test  Accuracy = 0.915|precision = 0.846 | recall = 1.000 | f1 = 0.917
== Running on test dataset
Loading data from ../data/data/tem_CWE-77/new/ast.
Loading data from ../data/data/tem_CWE-77/new/cdfg.
Restoring best model state from trained_model/GGNN_GraphBinaryClassification__2020-11-30_10-44-23_best.pkl.
CP_test  Accuracy = 0.942|precision = 0.893 | recall = 1.000 | f1 = 0.943

Tuning

We use NNI (Neural Network Intelligence) for hyperparameter tuning in this project.

$ pip install nni

Add a search_space.json file to the working directory and specify the parameters to be tuned; this is already configured in the project.

search_space.json

{
 "max_nodes_per_batch":{ "_type": "choice", "_value": [32,64,128]},
 "gnn_hidden_dim":{ "_type": "choice", "_value": [4,8,16,...]},
 "gnn_num_layers": { "_type": "choice", "_value": [2,4,8,...] },
 "graph_aggregation_num_heads":{ "_type": "choice", "_value": [4,8,16,32,...]
},
 "graph_aggregation_hidden_layers":{ "_type": "choice", "_value": [32,64,128,256,...] },
 "graph_aggregation_dropout_rate":{ "_type": "choice", "_value": [0.1,0.2,0.5,...] },
 "learning_rate": { "_type": "choice", "_value": [0.01,0.001,0.0001,...] }
}
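At run time, NNI hands one sampled configuration from this search space to the trial script. The skeleton below shows NNI's standard trial API (nni.get_next_parameter and nni.report_final_result); the train function is only a placeholder for plugging the sampled values into FUNDED's training run.

# Minimal NNI trial skeleton. nni.get_next_parameter() and
# nni.report_final_result() are NNI's standard trial APIs; train() is a
# placeholder for running FUNDED with the sampled hyperparameters.
import nni

def train(params: dict) -> float:
    # Placeholder: plug the sampled params into the training run and return
    # the validation metric to maximize (e.g. F1 or accuracy).
    raise NotImplementedError

if __name__ == "__main__":
    params = nni.get_next_parameter()  # one sample drawn from search_space.json
    metric = train(params)
    nni.report_final_result(metric)    # reported back to the TPE tuner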

Define the configuration file in YAML format, which declares the search space and the path of the trial file. It also provides other information, such as the tuning algorithm's parameters, the maximum number of trials, and the maximum duration.

config.yml

authorName: NNI Example
experimentName: CWE-77
trialConcurrency: 1
maxExecDuration: 110h # max executable time
maxTrialNum: 500 # max trial num
trainingServicePlatform: local
searchSpacePath: search_space.json # path of search space
useAnnotation: false
tuner:
    builtinTunerName: TPE
    classArgs:
        optimize_mode: maximize # choices: maximize, minimize
    gpuIndices: "1" # specify GPUof optimizer
trial:
    command: python3 train.py GGNN GraphBinaryClassification ../data/data/CWE-77 --patience 100 # execute commands
    codeDir: .
    gpuNum: 0
logDir: ~/nni # log directory
localConfig:
    gpuIndices: "0" # specify GPU number
    useActiveGpu: true

Run NNI

nnictl create --config config.yml --port 8080

Wait for the message INFO: Successfully started experiment! in the command line; it indicates that the experiment has started successfully.

For more details, see https://github.com/Microsoft/nni

Data collection module

Collection Structure

├── EnsembleLearning.py
├── InputData_New.py                       
├── stopwords.txt
├── sample.zip

Ready for training

  • Download our pretrained w2v model here
  • We also provide a dataset, sample.zip; unzip it before use

Prepare data

  • You can extract features from commits, or just use our sample.zip

Train your own ensemble classifier

  • Use EnsembleLearning.py to train your own ensemble model; an illustrative sketch follows the command below

Warning: Replace the path with your own data path.

python EnsembleLearning.py 
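For orientation, the sketch below shows what a small voting ensemble over commit-derived features might look like using scikit-learn. It is illustrative only and not the pipeline implemented in EnsembleLearning.py; the random features stand in for commit embeddings produced with the pretrained w2v model.

# Illustrative only -- not the pipeline implemented in EnsembleLearning.py.
# A small soft-voting ensemble over pre-extracted commit features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Random features stand in for commit embeddings (e.g. from the pretrained
# w2v model); labels mark vulnerability-fixing commits.
X = np.random.rand(200, 100)
y = np.random.randint(0, 2, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100)),
        ("svm", SVC(probability=True)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
print("held-out accuracy:", ensemble.score(X_test, y_test))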

License

Distributed under the NISL License. See LICENSE for more information.

Contact

Huanting Wang - [email protected]; [email protected]

Citation

@ARTICLE{Wang2020FUNDED,
  author = {H. {Wang} and G. {Ye} and Z. {Tang} and S. H. {Tan} and S. {Huang} and D. {Fang} and Y. {Feng} and L. {Bian} and Z. {Wang}},
  journal = {IEEE Transactions on Information Forensics and Security}, 
  title = {Combining Graph-Based Learning With Automated Data Collection for Code Vulnerability Detection}, 
  year = {2021},
  volume = {16},
  pages = {1943-1958},
  doi = {10.1109/TIFS.2020.3044773},
  ieeeid = {9293321},
  publisher = {IEEE},
  keywords = {Software Vulnerability, Code Vulnerability Detection, Deep Learning, Deep Graph Neural Networks},
}