
shoyo / acoustic-keylogger

Licence: other
Pipeline of a keylogging attack using just an audio signal and unsupervised learning.

Programming Languages: Python, Dockerfile

Projects that are alternatives of or similar to acoustic-keylogger

Minisom
🔴 MiniSom is a minimalistic implementation of the Self Organizing Maps
Stars: ✭ 801 (+901.25%)
Mutual labels:  clustering, unsupervised-learning
Awesome Community Detection
A curated list of community detection research papers with implementations.
Stars: ✭ 1,874 (+2242.5%)
Mutual labels:  clustering, unsupervised-learning
Bagofconcepts
Python implementation of bag-of-concepts
Stars: ✭ 18 (-77.5%)
Mutual labels:  clustering, unsupervised-learning
dti-clustering
(NeurIPS 2020 oral) Code for "Deep Transformation-Invariant Clustering" paper
Stars: ✭ 60 (-25%)
Mutual labels:  clustering, unsupervised-learning
Spectralcluster
Python re-implementation of the spectral clustering algorithm in the paper "Speaker Diarization with LSTM"
Stars: ✭ 220 (+175%)
Mutual labels:  clustering, unsupervised-learning
L2c
Learning to Cluster. A deep clustering strategy.
Stars: ✭ 262 (+227.5%)
Mutual labels:  clustering, unsupervised-learning
Text Summarizer
Python Framework for Extractive Text Summarization
Stars: ✭ 96 (+20%)
Mutual labels:  clustering, unsupervised-learning
treecut
Find nodes in hierarchical clustering that are statistically significant
Stars: ✭ 26 (-67.5%)
Mutual labels:  clustering, unsupervised-learning
Gemsec
The TensorFlow reference implementation of 'GEMSEC: Graph Embedding with Self Clustering' (ASONAM 2019).
Stars: ✭ 210 (+162.5%)
Mutual labels:  clustering, unsupervised-learning
Keras deep clustering
How to do Unsupervised Clustering with Keras
Stars: ✭ 202 (+152.5%)
Mutual labels:  clustering, unsupervised-learning
ML2017FALL
Machine Learning (EE 5184) in NTU
Stars: ✭ 66 (-17.5%)
Mutual labels:  clustering, unsupervised-learning
LabelPropagation
A NetworkX implementation of Label Propagation from a "Near Linear Time Algorithm to Detect Community Structures in Large-Scale Networks" (Physical Review E 2008).
Stars: ✭ 101 (+26.25%)
Mutual labels:  clustering, unsupervised-learning
MVGL
TCyb 2018: Graph learning for multiview clustering
Stars: ✭ 26 (-67.5%)
Mutual labels:  clustering, unsupervised-learning
Unsupervised Classification
SCAN: Learning to Classify Images without Labels (ECCV 2020), incl. SimCLR.
Stars: ✭ 605 (+656.25%)
Mutual labels:  clustering, unsupervised-learning
Unsupervised-Learning-in-R
Workshop (6 hours): Clustering (Hdbscan, LCA, Hopach), dimension reduction (UMAP, GLRM), and anomaly detection (isolation forests).
Stars: ✭ 34 (-57.5%)
Mutual labels:  clustering, unsupervised-learning
Self Supervised Learning Overview
📜 Self-Supervised Learning from Images: Up-to-date reading list.
Stars: ✭ 73 (-8.75%)
Mutual labels:  clustering, unsupervised-learning
LinearCorex
Fast, linear version of CorEx for covariance estimation, dimensionality reduction, and subspace clustering with very under-sampled, high-dimensional data
Stars: ✭ 39 (-51.25%)
Mutual labels:  clustering, unsupervised-learning
machine-learning-course
Machine Learning Course @ Santa Clara University
Stars: ✭ 17 (-78.75%)
Mutual labels:  clustering, unsupervised-learning
Danmf
A sparsity aware implementation of "Deep Autoencoder-like Nonnegative Matrix Factorization for Community Detection" (CIKM 2018).
Stars: ✭ 161 (+101.25%)
Mutual labels:  clustering, unsupervised-learning
Revisiting-Contrastive-SSL
Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations. [NeurIPS 2021]
Stars: ✭ 81 (+1.25%)
Mutual labels:  clustering, unsupervised-learning

Acoustic Keylogging Attack - Research


Overview

A keyboard acoustic emanations attack is a form of side-channel attack in which an attacker extracts what a victim typed on their keyboard using only the audio signal of the typing. Such a keylogging attack can steal passwords and other confidential information through nothing more than a compromised microphone, and thus has severe security implications.

During my time in university, I explored whether many of the techniques that have become de facto standards in speech recognition and natural language processing translate to an implementation of an audio-based keylogging attack. With the advent of several open-source machine learning libraries over the past decade, I wondered whether such an attack is becoming increasingly accessible to implement and deserves more awareness from the general public.

I've since graduated from university and am no longer actively researching this topic. However, the results and methodology are described below for fellow researchers who are interested in investigating this topic further.

Objective

Evaluate the threat of a keyboard acoustic emanations attack in the current machine learning landscape by creating a proof-of-concept pipeline for executing such an attack and measuring its accuracy, practicality, and accessibility.

  • Accuracy: How well does the pipeline approximate typed keys?
  • Practicality: How robust is the pipeline under realistic conditions?
  • Accessibility: How easily can the pipeline be built? (in terms of prerequisite knowledge and required technology)

Extent of Research

An essential component of the pipeline is the ability to distinguish between the sounds emitted by each key on a keyboard. There must be quantitative evidence that keys (or groups of keys) emit distinct sounds, and that they do so consistently under regular conditions.

For certain keyboards and typing patterns, the results suggest that the emitted key sounds can indeed be clustered by the position of the key on the keyboard. In the plots below, each point represents a single keystroke's 13,230-sample audio vector (a 0.3 s slice sampled at 44,100 Hz) mapped to a 2-D vector space.

Figure: t-SNE clusters formed by keystroke sounds generated by a VP3 mechanical keyboard with Cherry MX Brown switches

Figure: t-SNE clusters formed by keystroke sounds generated by a MacBook Pro 2016 with Apple butterfly switches

Methodology

The results above were produced by processing audio data of non-overlapping keystrokes (i.e. the key-up of each keystroke occurs before the key-down of the next keystroke). Each keystroke is converted into a mel-frequency cepstral coefficient (MFCC) feature vector and embedded with t-SNE. Note that the points above are colored and labeled to visualize the accuracy of the clustering; in practice, the attacker would need to perform further processing (cluster labeling) after this step, which may or may not accurately predict which key belongs to which cluster.
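
As a rough sketch of these two steps, the snippet below uses librosa and scikit-learn as stand-ins for this repository's own helper functions (which live in the acoustic_keylogger package, described next). The random placeholder array stands in for real extracted keystroke slices, and the MFCC parameters are illustrative assumptions.

    # Sketch of the feature-extraction and embedding steps described above,
    # using librosa and scikit-learn as stand-ins for this repository's own
    # helpers. Shapes follow the text: each keystroke is a 13,230-sample
    # slice (0.3 s at 44,100 Hz).
    import numpy as np
    import librosa
    from sklearn.manifold import TSNE

    SAMPLE_RATE = 44100

    def extract_features(keystroke):
        """Flatten the MFCCs of a single keystroke slice into one vector."""
        mfcc = librosa.feature.mfcc(y=keystroke, sr=SAMPLE_RATE, n_mfcc=16)
        return mfcc.flatten()

    # Placeholder for real extracted keystrokes: (n_keystrokes, 13230).
    keystrokes = np.random.randn(100, 13230).astype(np.float32)
    features = np.stack([extract_features(k) for k in keystrokes])

    # Embed the per-keystroke feature vectors into 2-D for plotting.
    embedded = TSNE(n_components=2, perplexity=30).fit_transform(features)
    print(embedded.shape)  # (100, 2)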

The functions used to process the data are located in the acoustic_keylogger package. I don't have any immediate plans to write external documentation on how to use the package, but each function in it has relatively detailed docstrings so please refer to them if you'd like to use acoustic_keylogger in your own research.

Cluster labeling involves taking as input a sequence of "cluster IDs" and assigning a key type to each cluster ID. If we can make assumptions about the content that was typed (such as its language or topic), then we can treat this as a time-series prediction problem to be solved with hidden Markov models or recurrent neural networks.
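
For a concrete feel of this step's input and output, here is a naive frequency-matching baseline, not the HMM/RNN approach: it assumes one cluster per key and simply matches cluster frequency ranks against English letter frequencies. The cluster-ID sequence shown is made up for illustration.

    # Naive frequency-matching baseline for cluster labeling. This is far
    # weaker than the HMM/RNN approaches mentioned above; it only shows the
    # input (cluster IDs) and output (a cluster-to-key mapping) of the step.
    from collections import Counter
    from typing import Dict, List

    # Approximate English letters in descending frequency order (assumed).
    ENGLISH_FREQ_ORDER = "etaoinshrdlcumwfgypbvkjxqz"

    def label_clusters(cluster_ids: List[int]) -> Dict[int, str]:
        """Map each cluster ID to a letter by matching frequency ranks."""
        ranked = [cid for cid, _ in Counter(cluster_ids).most_common()]
        return {cid: ENGLISH_FREQ_ORDER[i] for i, cid in enumerate(ranked)}

    # Hypothetical output of the clustering step:
    ids = [3, 1, 3, 0, 2, 3, 1, 1, 3]
    mapping = label_clusters(ids)
    print("".join(mapping[i] for i in ids))  # best-guess decoded text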

The Pipeline

This is the pipeline that was being implemented when this project was actively developed.

  • (Done) Data Collection - Gathering a diverse dataset of typing sounds recorded under realistic conditions

  • (Done) Keystroke Detection - Identifying all of the keystroke sounds in a given audio file (a minimal sketch of this step follows the list)

  • (Done) Keystroke Feature Extraction - Preprocessing each keystroke sound for further analysis

  • (Done) Clustering - Forming clusters with the preprocessed keystroke data

  • Predictive Cluster Labeling - Identifying which clusters correspond to which key type

  • Iterative Pseudo-labeled Supervised Training - Training a classifier using the predicted labels and iterating
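
As referenced in the Keystroke Detection item above, here is a minimal sketch of one way that step could work, detecting non-overlapping keystrokes with a fixed amplitude threshold. The threshold and the assumption that a keystroke fits in a 0.3 s window are illustrative; they are not values taken from the acoustic_keylogger package.

    # Minimal keystroke-detection sketch: slice out 0.3 s windows starting
    # wherever the signal first crosses an amplitude threshold. Both the
    # threshold and window length are illustrative assumptions.
    from typing import List
    import numpy as np

    SAMPLE_RATE = 44100
    SLICE_LEN = int(0.3 * SAMPLE_RATE)  # 13,230 samples, as described above

    def detect_keystrokes(signal: np.ndarray, threshold: float = 0.1) -> List[np.ndarray]:
        """Return non-overlapping 0.3 s slices that begin at amplitude spikes."""
        slices = []
        i = 0
        while i <= len(signal) - SLICE_LEN:
            if abs(signal[i]) > threshold:       # spike suggests a key-down
                slices.append(signal[i:i + SLICE_LEN])
                i += SLICE_LEN                   # keystrokes don't overlap
            else:
                i += 1
        return slices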

This pipeline is modeled after the research described in Keyboard Acoustic Emanations Revisited by L. Zhuang, F. Zhou, and J. D. Tygar (2005). I highly recommend reading this paper for those who want to explore this field of research.

Setting up

Option 0 - Using your own research environment

I assume many readers of this repository already have their own environment for conducting numerical research (with Jupyter, NumPy, etc.).

If you'd like to tinker around in your own environment, simply copy the acoustic_keylogger package onto your own machine, (possibly) update your PYTHONPATH, and import the functions as needed.
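
For example, instead of setting a global PYTHONPATH, the path can be appended at runtime. The path below is a placeholder for wherever you cloned the repository.

    import sys
    sys.path.append("/path/to/repo/acoustic-keylogger")  # placeholder path

    import acoustic_keylogger  # top-level package from this repository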

Option 1 - Docker

This project uses a Python development environment and a PostgreSQL database to manage the various audio data. I chose Postgres for its variable-length array support, but feel free to edit the config to use your preferred database. This option spins up the Jupyter environment and the database with Docker Compose.
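
As an illustration of the variable-length array support mentioned above, here is a small SQLAlchemy sketch; the table and column names are hypothetical and do not reflect this repository's actual schema.

    # Hypothetical SQLAlchemy model illustrating Postgres ARRAY columns,
    # which can store audio vectors of differing lengths in one schema.
    # Table/column names are made up; this is not the repo's actual schema.
    from sqlalchemy import Column, Integer, create_engine
    from sqlalchemy.dialects.postgresql import ARRAY, DOUBLE_PRECISION
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class KeystrokeRecording(Base):
        __tablename__ = "keystroke_recordings"
        id = Column(Integer, primary_key=True)
        samples = Column(ARRAY(DOUBLE_PRECISION))  # variable-length audio vector

    # Connection URL format mirrors the TEST_DATABASE_URL shown below.
    engine = create_engine("postgresql+psycopg2://postgres@localhost:5432")
    Base.metadata.create_all(engine)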

  • Install Docker.
  • Build images with $ docker-compose build. This is only required when dependencies or the Docker config are updated.

This step will install all dependencies for env (such as Jupyter, TensorFlow, NumPy, etc.) and mount your local file system into the env Docker container.

  • Spin up the database and development environment with $ docker-compose up.

This opens the database for connections and serves the Jupyter notebook at http://localhost:8888.

Option 2 - No Docker

Docker adds memory overhead and, with the current setup, comes with a few quirks in the development environment (like having to manually open the Jupyter notebook). Using Docker for small tweaks is often overkill, so I'm leaving this option here.

  • Install the latest version of Python that's compatible with the dependencies (currently 3.8, due to TensorFlow 2.4.1). To downgrade Python without overriding your current version, you can install conda and run

      $ conda install python=<version>
    
  • Set up a Python virtual environment. You can use conda, pipenv, virtualenvwrapper etc. for managing multiple environments.

  • Install dependencies with

      $ pip install -r requirements.txt  
    
  • Make sure Python can find the acoustic_keylogger package with

      $ export PYTHONPATH=/path/to/repo/acoustic-keylogger/
    

    and can connect to the test database with

      $ export TEST_DATABASE_URL=postgresql+psycopg2://postgres@acoustic-keylogger_db_1:5432
    

    You can add these commands to your shell configuration file (~/.bashrc, ~/.zshrc, etc.) so they persist across terminal sessions.

  • Open Jupyter notebook with

      $ jupyter notebook
    

This option can be simpler if you're unfamiliar with Docker or don't need to access the database (though the latter should still be possible using local Postgres commands).

Testing

Tests are implemented for the acoustic_keylogger package, which contains various functions for audio processing and data management. These tests are contained in tests/test_acoustic_keylogger.

To run tests with the Docker configuration (Option 1), execute:

$ docker-compose run env pytest -q tests

To run tests with no Docker configuration (Option 2), execute:

$ python -m pytest -q tests

Note: Both of the commands above are assumed to be executed from the root directory of this repository.

Relevant Research Papers

Many research papers on keyboard acoustic emanations attacks were published in the mid-2000s. Some of this research, such as Keyboard Acoustic Emanations Revisited by L. Zhuang, F. Zhou, and J. D. Tygar (2005), demonstrated extremely accurate results (96% of characters recovered from a 10-minute sound recording) even without labeled training data.

Supervised Methods

  • Keyboard Acoustic Emanations by D. Asonov and R. Agrawal (IEEE S&P 2004)

Unsupervised Methods

  • Keyboard Acoustic Emanations Revisited by L. Zhuang, F. Zhou, and J. D. Tygar (ACM CCS 2005)
