BinaryResearch / centrifuge-toolkit

License: MIT
Tool for visualizing and empirically analyzing information encoded in binary files

Centrifuge

Centrifuge makes it easy to use visualization, statistics and machine learning to analyze information in binary files.


This tool implements two new approaches to analysis of file data:

  1. DBSCAN, an unsupervised machine learning algorithm, is used to find clusters of byte sequences based on their statistical properties (features). Byte sequences that encode the same data type, e.g. machine code, typically have similar properties. As a result, clusters are often representative of a specific data type. Each cluster can be extracted and analysed further.

  2. The specific data type of a cluster can often be identified without using machine learning by measuring the Wasserstein distance between its byte value distribution and a data type reference distribution. If this distance is less than a set threshold for a particular data type, that cluster will be identified as that data type. Currently, reference distributions exist for high entropy data, UTF-8 English, and machine code targeting various CPU architectures.

These two approaches are used together in sequence: first DBSCAN finds clusters, then the Wasserstein distances between the clusters' data and the reference distributions are measured to identify their data type. To identify the target CPU of any machine code discovered in the file, Centrifuge uses ISAdetect.
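The identification step of the second approach can be sketched as follows. This is a minimal illustration, not Centrifuge's actual code: it assumes SciPy's wasserstein_distance, uses a flat distribution as a hypothetical stand-in for the high entropy reference, and picks an arbitrary threshold. The names byte_histogram, distance_to_reference and THRESHOLD are invented for this example.

```python
import numpy as np
from scipy.stats import wasserstein_distance

BYTE_VALUES = np.arange(256)


def byte_histogram(data: bytes) -> np.ndarray:
    """Relative frequency of each byte value (0-255) in `data`."""
    counts = np.bincount(np.frombuffer(data, dtype=np.uint8), minlength=256)
    return counts / counts.sum()


def distance_to_reference(data: bytes, reference: np.ndarray) -> float:
    """Wasserstein distance between the byte distribution of `data` and a reference."""
    return wasserstein_distance(
        BYTE_VALUES, BYTE_VALUES,
        u_weights=byte_histogram(data),
        v_weights=reference,
    )


# Hypothetical stand-in for a "high entropy" reference: every byte value equally likely.
uniform_reference = np.full(256, 1 / 256)

rng = np.random.default_rng(0)
random_blob = rng.integers(0, 256, size=65536, dtype=np.uint8).tobytes()
text_blob = b"The quick brown fox jumps over the lazy dog. " * 1000

THRESHOLD = 5.0  # arbitrary value for this sketch, not a Centrifuge default

d_random = distance_to_reference(random_blob, uniform_reference)
d_text = distance_to_reference(text_blob, uniform_reference)

for name, dist in [("random bytes", d_random), ("ASCII text", d_text)]:
    verdict = "match" if dist < THRESHOLD else "no match"
    print(f"{name}: distance {dist:.3f} -> {verdict}")
```

Passing the byte values 0-255 as the support and the histograms as weights lets the distance be computed directly from two 256-bin distributions rather than from raw samples.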

Required Libraries

All required libraries come bundled with Anaconda.

Note: Developed in a Linux environment. Not tested on Windows or macOS.

Usage

Detailed walkthroughs can be found in the notebooks. Code snippets are located in the scripts folder.

Overview of the Approach

The first step is file partitioning and feature measurement.
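As a rough sketch of what partitioning and feature measurement can look like, the snippet below splits a byte string into fixed-size blocks and computes a few statistics per block. The block size, the choice of features and the function names are assumptions made for illustration; the notebooks document what Centrifuge actually measures.

```python
import math
from collections import Counter


def shannon_entropy(block: bytes) -> float:
    """Shannon entropy of the byte values, in bits per byte (0.0 to 8.0)."""
    n = len(block)
    return -sum((c / n) * math.log2(c / n) for c in Counter(block).values())


def measure_features(data: bytes, block_size: int = 1024) -> list:
    """Split `data` into full blocks and return one feature row (dict) per block.

    A trailing partial block is dropped so every row covers the same number of bytes.
    """
    rows = []
    for offset in range(0, len(data) - block_size + 1, block_size):
        block = data[offset:offset + block_size]
        rows.append({
            "offset": offset,
            "entropy": shannon_entropy(block),
            "zero_ratio": block.count(0) / block_size,
            "printable_ratio": sum(32 <= b < 127 for b in block) / block_size,
            "mean": sum(block) / block_size,
        })
    return rows


# Synthetic input: a zero-filled block followed by a block of printable text.
sample = bytes(1024) + b"hello firmware! " * 64
for row in measure_features(sample):
    print(row)
```

Each row is one observation, so the list can be turned into a data frame or passed to a clustering algorithm.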

DBSCAN can then be used to find clusters in the file data.
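A minimal sketch of the clustering step, using scikit-learn's DBSCAN on synthetic feature rows. The two feature columns, the standardization step, and the eps and min_samples values are all assumptions chosen so that this toy data separates cleanly; real files need these parameters tuned by hand.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic block features, columns: (entropy, printable_ratio).
# The two groups loosely imitate "machine code" and "text" blocks.
code_like = rng.normal(loc=[6.0, 0.35], scale=[0.05, 0.02], size=(60, 2))
text_like = rng.normal(loc=[4.3, 0.95], scale=[0.05, 0.02], size=(40, 2))
features = np.vstack([code_like, text_like])

# Standardize so that no single feature dominates the distance computation.
scaled = StandardScaler().fit_transform(features)

# eps and min_samples are hand-picked for this synthetic data.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(scaled)

# scikit-learn labels noise points as -1; every other label is a cluster id.
n_clusters = len(set(labels.tolist()) - {-1})
print(f"clusters found: {n_clusters}")
```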

Once clusters have been found, the data in the clusters can be identified.

The feature observations of each cluster are stored in a separate data frame, one per cluster (e.g., if six clusters are found, there will be six data frames). The output of DBSCAN is also saved in a data frame. This means custom analysis of any or all clusters can easily be performed at any time after DBSCAN identifies clusters in the file data.
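One way to get from a single table of labelled observations to one data frame per cluster is a pandas groupby. The column names and values below are made up for illustration and do not reflect Centrifuge's internal layout.

```python
import pandas as pd

# Hypothetical DBSCAN output: one row per block, with its assigned cluster label.
# scikit-learn uses the label -1 for noise points that belong to no cluster.
results = pd.DataFrame({
    "offset":  [0, 1024, 2048, 3072, 4096, 5120],
    "entropy": [5.9, 6.1, 4.2, 4.4, 7.99, 6.0],
    "cluster": [0, 0, 1, 1, -1, 0],
})

# Build one data frame per cluster, skipping noise.
cluster_frames = {
    int(label): frame.drop(columns="cluster").reset_index(drop=True)
    for label, frame in results[results["cluster"] != -1].groupby("cluster")
}

for label, frame in cluster_frames.items():
    print(f"Cluster {label}: {len(frame)} blocks, "
          f"mean entropy {frame['entropy'].mean():.2f}")
```

Because each cluster lives in its own data frame, later analysis can target a single cluster without re-running the clustering.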

Example Output

Output of bash.identify_cluster_data_types(), as seen in Introduction to Centrifuge:

Searching for machine code
--------------------------------------------------------------------

[+] Checking Cluster 4 for possible match
[+] Closely matching CPU architecture reference(s) found for Cluster 4
[+] Sending sample to https://isadetect.com/
[+] response:

{
   "prediction": {
       "architecture": "amd64",
       "endianness": "little",
       "wordsize": 64
   },
   "prediction_probability": 1.0
}


Searching for utf8-english data
-------------------------------------------------------------------

[+] UTF-8 (english) detected in Cluster 3
   Wasserstein distance to reference: 16.337275669642857

[+] UTF-8 (english) detected in Cluster 5
   Wasserstein distance to reference: 11.878225097656252


Searching for high entropy data
-------------------------------------------------------------------

[+] High entropy data found in Cluster 1
   Wasserstein distance to reference: 0.48854199218749983
[*] This distance suggests the data in this cluster could be
   a) encrypted
   b) compressed via LZMA with maximum compression level
   c) something else that is random or close to random.

File Data Visualization

More pictures can be found in the gallery.

Example Use Cases

  • Determining whether a file contains a particular type of data.

    An entropy scan is useful for discovering compressed or encrypted data, but what about other data types such as machine code, symbol tables, sections of hardcoded ASCII strings, etc.? Centrifuge takes advantage of the fact that in binary files, information encoded in a particular way is stored contiguously, and uses scikit-learn's implementation of DBSCAN to locate these regions.

  • Analyzing files with no metadata such as magic numbers, headers or other format information.

    This includes most firmware, as well as corrupt files. Centrifuge does not depend on metadata or signatures of any kind.

  • Investigating differences between different types of data using statistical methods or machine learning, or building a model or "profile" of a specific data type.

    Does machine code differ in a systematic way from other types of information encoded in binary files? Can compressed data be distinguished from encrypted data? These questions can be investigated in an empirical way using Centrifuge.

  • Visualizing information in files using Python libraries such as Seaborn, Matplotlib and Altair.

    Rather than generate elaborate 2D or 3D visual representations of file contents using space-filling curves or cylindrical coordinate systems, Centrifuge creates data frames that contain the feature measurements of each cluster. The information in these data frames can be easily visualized with boxplots, violin plots, pairplots, histograms, density plots, scatterplots, barplots, cumulative distribution function (CDF) plots, etc.
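The question of whether compressed data can be told apart from encrypted data can be approached empirically along these lines. The sketch below is not taken from Centrifuge: it uses zlib instead of LZMA, pseudo-random bytes as a stand-in for encrypted data, and two features (Shannon entropy and a chi-square statistic against a flat distribution) chosen for illustration. It prints the measurements without drawing a conclusion from them.

```python
import math
import random
import zlib
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte values, in bits per byte (0.0 to 8.0)."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def chi_square_vs_uniform(data: bytes) -> float:
    """Chi-square statistic of the byte counts against a flat distribution."""
    expected = len(data) / 256
    counts = Counter(data)
    return sum((counts.get(v, 0) - expected) ** 2 / expected for v in range(256))


rng = random.Random(0)
vocabulary = ["firmware", "header", "section", "kernel", "table", "symbol",
              "string", "offset", "cluster", "entropy", "binary", "code",
              "data", "file", "byte", "value"]

# Compressible input: a long stream of words drawn at random from a small vocabulary.
text = " ".join(rng.choice(vocabulary) for _ in range(200_000)).encode("ascii")
compressed = zlib.compress(text, 9)

# Stand-in for encrypted data: pseudo-random bytes of the same length.
random_like = bytes(rng.getrandbits(8) for _ in range(len(compressed)))

for name, blob in [("zlib-compressed", compressed), ("pseudo-random", random_like)]:
    print(f"{name}: entropy {shannon_entropy(blob):.4f}, "
          f"chi-square {chi_square_vs_uniform(blob):.1f}")
```

Running the same measurements over many samples of each kind, and over real encrypted and LZMA-compressed data, is what would turn this into an actual experiment.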

Dataset

The ISAdetect dataset was used to create the i386, AMD64, MIPSEL, MIPS64EL, ARM64, ARMEL, PowerPC, PPC64, and SH4 reference distributions.

Todo

  • Adding the ability to use OPTICS for automatic clustering. It would be nice to automate the entire workflow, going straight from an input file to data type identification. Currently this is not possible because eps and min_samples need to be adjusted manually in order to ensure meaningful results when using DBSCAN.
  • Improving the UTF-8 English data reference distribution. Rather than derive it from text extracted from an ebook, samples should be drawn from hard-coded text data in executable binaries.
  • Creating reference distributions for AVR and Xtensa.
  • Updating the code with docstrings and comments.
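For reference, scikit-learn already ships an OPTICS implementation, and the first item could start from something like the sketch below. It is an assumption about how such an integration might look, not existing Centrifuge code. The feature columns are synthetic, and min_samples still has to be chosen even though no eps value is required.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Synthetic block features, columns: (entropy, printable_ratio).
features = np.vstack([
    rng.normal(loc=[6.0, 0.35], scale=[0.05, 0.02], size=(60, 2)),
    rng.normal(loc=[4.3, 0.95], scale=[0.05, 0.02], size=(40, 2)),
])
scaled = StandardScaler().fit_transform(features)

# Unlike DBSCAN, OPTICS does not need a fixed eps: it orders points by
# reachability distance and extracts clusters from that ordering.
optics = OPTICS(min_samples=5)
labels = optics.fit_predict(scaled)

# As with DBSCAN, the label -1 marks points not assigned to any cluster.
found = sorted(set(labels.tolist()) - {-1})
print(f"cluster labels found: {found}")
```

Whether the extracted clusters are meaningful on real file data would need to be checked against the manually tuned DBSCAN results.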