1.4 Billion Text Credentials Analysis (NLP)

Using deep learning and NLP to analyze a large corpus of clear text passwords.

Objectives:

  • Train a generative model.
  • Understand how people change their passwords over time: hello123 -> h@llo123 -> h@llo!23.

Disclaimer: for research purposes only.

In the press

Get the data

  • Download any Torrent client.
  • Here is a magnet link you can find on Reddit:
    • magnet:?xt=urn:btih:7ffbcd8cee06aba2ce6561688cf68ce2addca0a3&dn=BreachCompilation&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fglotorrents.pw%3A6969&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337
  • Checksum list is available here: checklist.chk
  • ./count_total.sh in BreachCompilation should display something like 1,400,553,870 rows.

Get started (processing + deep learning)

Process the data and run the first deep learning model:

# Make sure to install the Python dependencies first. A virtual env is recommended here.
# virtualenv -p python3 venv3; source venv3/bin/activate; pip install -r requirements.txt
# Remove "--max_num_files 100" to process the whole dataset (a few hours and 50 GB of free disk space are required).
./process_and_train.sh <BreachCompilation path>

Data (explanation)

INPUT:   BreachCompilation/
         BreachCompilation is organized as:

         - a/          - folder of emails starting with a
         - a/a         - file of emails starting with aa
         - a/b
         - a/d
         - ...
         - z/
         - ...
         - z/y
         - z/z

OUTPUT: - BreachCompilationAnalysis/edit-distance/1.csv
        - BreachCompilationAnalysis/edit-distance/2.csv
        - BreachCompilationAnalysis/edit-distance/3.csv
        [...]
        > cat 1.csv
            1 ||| samsung94 ||| samsung94@
            1 ||| 040384alexej ||| 040384alexey
            1 ||| HoiHalloDoeii14 ||| hoiHalloDoeii14
            1 ||| hoiHalloDoeii14 ||| hoiHalloDoeii13
            1 ||| hoiHalloDoeii13 ||| HoiHalloDoeii13
            1 ||| 8znachnuu ||| 7znachnuu
        EXPLANATION: edit-distance/ contains the password pairs sorted by edit distance.
        1.csv contains all pairs with edit distance = 1 (exactly one addition, substitution or deletion),
        2.csv all pairs with edit distance = 2, and so on.
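The pairs in these CSVs can be checked with a standard dynamic-programming Levenshtein distance (a sketch; the repository's own edit-distance code may differ):

```python
def edit_distance(a, b):
    # Classic DP Levenshtein distance: additions, deletions and
    # substitutions all cost 1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[-1] + 1,                  # addition
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# A 1.csv line uses " ||| " as separator: distance ||| password_1 ||| password_2
line = "1 ||| samsung94 ||| samsung94@"
dist, p1, p2 = line.split(" ||| ")
assert edit_distance(p1, p2) == int(dist)
```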

        - BreachCompilationAnalysis/reduce-passwords-on-similar-emails/99_per_user.json
        - BreachCompilationAnalysis/reduce-passwords-on-similar-emails/9j_per_user.json
        - BreachCompilationAnalysis/reduce-passwords-on-similar-emails/9a_per_user.json
        [...]
        > cat 96_per_user.json
        {
            "1.0": [
                {
                    "edit_distance": [
                        0,
                        1
                    ],
                    "email": "[email protected]",
                    "password": [
                        "090698d",
                        "090698D"
                    ]
                },
                {
                    "edit_distance": [
                        0,
                        1
                    ],
                    "email": "[email protected]",
                    "password": [
                        "5555555555q",
                        "5555555555Q"
                    ]
                }
            ]
        }
        EXPLANATION: reduce-passwords-on-similar-emails/ contains files sharded by the first two characters of
        the email address; for example, [email protected] is located in 96_per_user.json.
        Each file lists all the passwords grouped by user and by edit distance.
        For example, [email protected] had two passwords, 090698d and 090698D, with an edit distance of 1 between them.
        The edit_distance and password arrays have the same length: each entry gives the distance from the
        previous password, hence the leading 0 (the first password's distance from itself).
        These files are useful for modeling how users change passwords over time.
        We cannot recover which password came first, but a shortest Hamiltonian path algorithm is run
        to find the most probable password ordering for a user. For example:
        hello => hello1 => hell@1 => hell@11 is the shortest path.
        We assume users are lazy by nature and prefer to change as few characters as possible.
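The ordering heuristic above can be sketched as a brute-force shortest Hamiltonian path over a user's password set — pick the permutation whose consecutive edit distances sum to the minimum. O(n!) is fine for the handful of passwords a single user has, though the repository's actual implementation may differ:

```python
from itertools import permutations

def levenshtein(a, b):
    # Compact DP edit distance (cost 1 per addition/deletion/substitution).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def likely_ordering(passwords):
    """Shortest Hamiltonian path: the permutation minimizing the total edit
    distance between consecutive passwords (the assumed 'laziest' sequence).
    The path and its reverse cost the same, so the true first password
    still cannot be recovered."""
    cost = lambda p: sum(levenshtein(p[i], p[i + 1]) for i in range(len(p) - 1))
    return min(permutations(passwords), key=cost)
```

For the example above, `likely_ordering(["hell@11", "hello", "hell@1", "hello1"])` recovers the hello => hello1 => hell@1 => hell@11 chain (or its reverse), with total cost 3.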

Run the data processing alone:

python3 run_data_processing.py --breach_compilation_folder <BreachCompilation path> --output_folder ~/BreachCompilationAnalysis

If the dataset is too big for your machine, set --max_num_files to a value between 0 and 2000.

  • Make sure you have enough free memory (8 GB should be enough).
  • It took about 1h30m to run on an Intel(R) Core(TM) i7-6900K CPU @ 3.20GHz (single thread).
  • The uncompressed output is around 45 GB.