mprhode / malware-prediction-rnn

Licence: Apache-2.0
RNN implementation with Keras for machine activity data to predict malware


malware-prediction-rnn

RNN implementation in Keras to predict malware from machine activity data - code for experiments in Early Stage Malware Prediction Using Recurrent Neural Networks

Data (data_2.csv) is available at http://doi.org/10.17035/d.2018.0050524986

Experiments from the paper are set out in order in run_experiments

Implementation uses Keras v2.0.6 and Python >= 3.4

If you use our code in your research please cite:

@article{RHODE2018578,
title = "Early-stage malware prediction using recurrent neural networks",
journal = "Computers & Security",
volume = "77",
pages = "578 - 594",
year = "2018",
issn = "0167-4048",
doi = "10.1016/j.cose.2018.05.010",
url = "http://www.sciencedirect.com/science/article/pii/S0167404818305546",
author = "Matilda Rhode and Pete Burnap and Kevin Jones",
}

Experiment

The basic Experiment class in experiments > Experiments takes a dictionary of hyperparameters and the data as objects (either a tuple for k-fold cross-validation, or four separate train/test input/label items). Results are stored in a folder as a comma-separated values (CSV) file.

  • parameters: A dictionary of hyperparameters such as those in experiments > Configs. The keys relate to the RNN implementation, and each value can be either a list or a dictionary of possible values with associated relative weights for choosing them. The latter is intended to aid biased random searches, e.g. params = {..., "dropout": {0.2: 0.5, 0.1: 0.25, 0.3: 0.25}, ...} will bias the random search to choose 0.2 half of the time, and 0.1 or 0.3 a quarter of the time each.

  • search_algorithm: {"grid", "random"}

    • Grid search will explore every possible combination of parameters supplied to the Experiment. Grid search will keep running until all options have been exhausted.

    • Random search will randomly select a configuration from the possible combinations of parameters; the choice can be biased by using dictionaries whose values represent relative weights between the keys (see Configurations / RNN hyperparameters for more). Random search will run until the num_experiments parameter in Experiment.run() is reached (default=100).

  • x_train: sequential (3D) tensor of train input data supplied for a train-test experiment

  • y_train: sequential (2D) tensor of train label data supplied for a train-test experiment, corresponding to the indices of the x_train data

  • x_test: sequential (3D) tensor of test input data supplied for a train-test experiment

  • y_test: sequential (2D) tensor of test label data supplied for a train-test experiment, corresponding to the indices of the x_test data

  • data: tuple of (input, label) data for k-fold cross validation experiment

  • folds: integer to determine k in k-fold validation, defaults to 10. Must be an integer (or left default) for k-fold validation experiment along with data tuple

  • thresholding: Boolean to determine whether a k-fold test is cut short when accuracy falls below the threshold, defaults to False. When thresholding=True, the threshold automatically increases if the average accuracy across the k folds is greater than the threshold; the new threshold is the minimum of the set of k-fold accuracies.

  • threshold: 0 <= float < 1 which determines the accuracy cut-off during a k-fold experiment. If a fold achieves lower than threshold, the remaining folds are not run and the next configuration begins. Automatically increases to the minimum of the set of k-fold accuracies if a k-fold experiment achieves a higher average accuracy (across the k folds) than threshold.

  • folder_name: String to name folder in which csv file results are stored
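The thresholding behaviour above can be sketched in a few lines. This is an illustrative standalone helper, not the repository's implementation; `evaluate_fold` is a hypothetical callback standing in for training and scoring one fold:

```python
def run_kfold_with_threshold(evaluate_fold, folds=10, threshold=0.0):
    """Sketch of the thresholding logic described above (hypothetical helper).

    evaluate_fold(k) returns the accuracy of fold k. If any fold falls below
    `threshold`, the remaining folds are skipped. If the run completes and the
    mean accuracy beats the threshold, the threshold is raised to the minimum
    of the fold accuracies, so later configurations must clear a higher bar.
    """
    accuracies = []
    for k in range(folds):
        acc = evaluate_fold(k)
        accuracies.append(acc)
        if acc < threshold:  # cut the experiment short
            return accuracies, threshold
    mean_acc = sum(accuracies) / len(accuracies)
    if mean_acc > threshold:
        threshold = min(accuracies)  # new threshold: worst fold of the best run so far
    return accuracies, threshold
```

A strong configuration therefore tightens the threshold for everything that follows, while a weak one is abandoned after its first failing fold.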

Increase_Snaphot_Experiment

Increase the temporal distance between input features. Add "steps" to the parameters dictionary to increase the time interval between data points; it should be an integer <= sequence_length.
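The effect of widening the snapshot interval can be illustrated with plain sequence slicing (a sketch of the idea, not the repository's code; the function name is hypothetical):

```python
def widen_snapshot_interval(sequences, steps):
    """Keep every `steps`-th snapshot of each machine-activity sequence,
    increasing the time interval between the inputs the RNN sees.

    `sequences` is a list of samples, each a list of per-timestep feature
    vectors; `steps=1` leaves the data unchanged.
    """
    return [seq[::steps] for seq in sequences]
```

With 10 snapshots per sample and steps=2, each sample is reduced to 5 snapshots covering the same wall-clock span.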

Ensemble_configurations

Average the results of multiple RNN models. The Experiment will only search the sequence_length space, and will take the first value provided for all other hyperparameters if more than one is supplied.

  • Pass a list of parameter dictionaries in place of parameters to Ensemble_configurations class to average the results of multiple models. Only the first element in the list of possible parameters will be used if more than one is supplied.
  • batch_size: int can be passed to Ensemble_configurations to use the same batch_size across models
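The core ensembling step is a per-sample mean of the models' predicted probabilities. A minimal sketch (the function name is hypothetical, and each inner list holds one model's predictions for the same test samples):

```python
def ensemble_average(per_model_probs):
    """Average the malware probabilities predicted by several models.

    per_model_probs: list of lists, one list of probabilities per model,
    all aligned to the same test samples. Returns one averaged probability
    per sample.
    """
    n_models = len(per_model_probs)
    return [sum(probs) / n_models for probs in zip(*per_model_probs)]
```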

Ensemble_sub_sequences

Average the results of classifying all sub-sequences and the entire data sequence. The Experiment will only search the sequence_length space, and will take the first value provided for all other hyperparameters if more than one is supplied.
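One plausible reading of "sub-sequences" is the set of prefixes of each sample, from the first snapshot up to the full sequence; whether the repository also uses other contiguous windows is an assumption here. A sketch under that assumption:

```python
def prefix_subsequences(sequence):
    """All prefix sub-sequences of one sample: [s[:1], s[:2], ..., s].

    Assumption: "sub-sequences" means prefixes of increasing length; each
    prefix can then be classified and the resulting probabilities averaged.
    """
    return [sequence[:i] for i in range(1, len(sequence) + 1)]
```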

Omit_test_data

Leave all possible combinations of input features out of training to see the impact of their omission. Trains a model, then sequentially omits all possible combinations of 1, 2, 3, ..., n features, where n is the total number of features, giving 2047 combinations for the 11 features used in the paper.
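Enumerating those combinations is a standard itertools exercise; every non-empty subset of n features gives 2^n - 1 omission sets, hence 2047 for n=11 (illustrative sketch, not the repository's code):

```python
from itertools import combinations

def feature_omission_sets(n_features):
    """Every non-empty combination of feature indices to leave out.

    For n features there are 2**n - 1 such sets (2047 for the 11 features
    used in the paper): all combinations of size 1, 2, ..., n.
    """
    indices = range(n_features)
    return [combo
            for size in range(1, n_features + 1)
            for combo in combinations(indices, size)]
```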

Omit training data

Leave a single feature out of testing and training.

  • supply "leave_out_features" to parameters dictionary to omit a single feature from training and testing

RNN implementation

Takes a dictionary of parameters, the training data, and the testing data as input. The data are used to determine the shape of the RNN layers. Possible configuration options are outlined in Configurations / RNN hyperparameters.
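Inferring the layer shapes from the data amounts to reading the dimensions of the 3D input tensor and 2D label tensor. A dependency-free sketch of that step (hypothetical helper; assumes nested-list tensors shaped samples x timesteps x features and samples x outputs):

```python
def rnn_shapes_from_data(x_train, y_train):
    """Derive the RNN input shape and output width from the data.

    x_train: 3D samples x timesteps x features (nested lists).
    y_train: 2D samples x outputs.
    Returns ((timesteps, features), n_outputs), i.e. the recurrent layer's
    input shape and the size of the final dense layer.
    """
    timesteps = len(x_train[0])
    features = len(x_train[0][0])
    n_outputs = len(y_train[0])
    return (timesteps, features), n_outputs
```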

Configurations / RNN hyperparameters

The configuration dictionaries used in the paper are stored in experiments > Configs. The parameters which can be edited and passed into an experiment are detailed in the table below. N.B. these ranges are wider than the limitations of the random-search configuration; see the commented code for details of each hyperparameter.

| Parameter | Possible values | Notes |
|---|---|---|
| "layer_type" | "GRU", "LSTM" | fixed as "GRU" in Configs |
| "loss" | "binary_crossentropy" | - |
| "kernel_initializer" | "lecun_uniform" | can also be any of the initialisers listed in Keras |
| "recurrent_initializer" | "lecun_uniform" | can also be any of the initialisers listed in Keras |
| "depth" | integer >= 1 | - |
| "bidirectional" | Boolean | - |
| "hidden_neurons" | integer >= 1 | - |
| "learning_rate" | 0 <= float <= 1 | will default to 0.001 if "adam" optimiser used |
| "optimiser" | "adam", "sgd" | - |
| "dropout" | 0 <= float < 1 | - |
| "b_l1_reg" | 0 <= float < 1 | - |
| "b_l2_reg" | 0 <= float < 1 | - |
| "r_l1_reg" | 0 <= float < 1 | - |
| "r_l2_reg" | 0 <= float < 1 | - |
| "epochs" | integer > 1 | - |
| "sequence_length" | 1 < integer < 300 | - |
| "batch_size" | 1 < integer < 59 | - |
| "description" | string to describe the parameters | only needed for Ensemble_configurations |
| "step" | integer >= 1 | only needed for Increase_Snaphot_Experiment |
| "leave_out_feature" | 0 <= integer < number of input features (here 11) | not necessary for code to work |

Formatting hyperparameter configurations

Hyperparameters should be supplied as a dictionary with the parameter name as the key and the value(s) stored in a list or as the keys of a dictionary. If using dictionaries, the dictionary values are relative weights representing the frequency with which each key should be chosen in a random search (the weights are ignored in a grid search). Lists and dictionaries can be mixed together, e.g.:

{
    # more parameters up here
    "dropout": [0, 0.1, 0.2, 0.3],
    "optimiser": {"adam": 0.75, "sgd": 0.25},  # equivalent to {"adam": 3, "sgd": 1} as weights are relative
    "epochs": list(range(0, 1000)),
    # more parameters down here
}
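Expanding such a mixed list/dict search space is straightforward to sketch. The helpers below are illustrative (not the repository's API): grid search enumerates every combination and ignores the weights, while random search honours them via weighted sampling:

```python
import itertools
import random

def grid_configurations(parameters):
    """Expand a mixed list/dict search space into every combination
    for a grid search. Iterating a dict yields its keys, so the
    weights are ignored, as described above."""
    names = list(parameters)
    options = [list(values) for values in parameters.values()]
    return [dict(zip(names, combo)) for combo in itertools.product(*options)]

def random_configuration(parameters, rng=random):
    """Draw one configuration for a random search, using the relative
    weights whenever a dict of {value: weight} is supplied."""
    config = {}
    for name, values in parameters.items():
        if isinstance(values, dict):
            choices, weights = zip(*values.items())
            config[name] = rng.choices(choices, weights=weights)[0]
        else:
            config[name] = rng.choice(values)
    return config
```

For example, `{"dropout": [0, 0.1], "optimiser": {"adam": 3, "sgd": 1}}` expands to four grid configurations, while `random_configuration` picks "adam" roughly three times as often as "sgd".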
  • get_all(): returns the search space used for random search in the paper
  • get_A(), get_B(), get_C(): return configurations A, B, and C respectively, as outlined in the paper
  • get_A_B_C(): returns configurations A, B, and C as values in a dictionary, keys are "A", "B", and "C"