
davidsvy / Neural-Scam-Artist

License: MIT
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.

Programming Languages

python

Projects that are alternatives to or similar to Neural-Scam-Artist

Reader
Extract clean(er), readable text from web pages via Mercury Web Parser.
Stars: ✭ 75 (+316.67%)
Mutual labels:  web-scraping, readability
Transformer Temporal Tagger
Code and data from the paper "BERT Got a Date: Introducing Transformers to Temporal Tagging"
Stars: ✭ 55 (+205.56%)
Mutual labels:  transformer, huggingface
trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Stars: ✭ 711 (+3850%)
Mutual labels:  web-scraping, readability
tensorflow-ml-nlp-tf2
Practice materials for "Natural Language Processing Starting with TensorFlow 2 and Machine Learning (from Logistic Regression to BERT and GPT-3)" (in Korean)
Stars: ✭ 245 (+1261.11%)
Mutual labels:  transformer, gpt2
Bertviz
Tool for visualizing attention in the Transformer model (BERT, GPT-2, Albert, XLNet, RoBERTa, CTRL, etc.)
Stars: ✭ 3,443 (+19027.78%)
Mutual labels:  transformer, gpt2
Datasketch
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble
Stars: ✭ 1,635 (+8983.33%)
Mutual labels:  lsh, minhash
TabFormer
Code & Data for "Tabular Transformers for Modeling Multivariate Time Series" (ICASSP, 2021)
Stars: ✭ 209 (+1061.11%)
Mutual labels:  transformer, huggingface
transformer-models
Deep Learning Transformer models in MATLAB
Stars: ✭ 90 (+400%)
Mutual labels:  transformer, gpt2
zero-administration-inference-with-aws-lambda-for-hugging-face
Zero administration inference with AWS Lambda for 🤗
Stars: ✭ 19 (+5.56%)
Mutual labels:  transformer, huggingface
minhash-lsh
Minhash LSH in Golang
Stars: ✭ 20 (+11.11%)
Mutual labels:  lsh, minhash
finetune-gpt2xl
Guide: Finetune GPT2-XL (1.5 Billion Parameters) and finetune GPT-NEO (2.7 B) on a single GPU with Huggingface Transformers using DeepSpeed
Stars: ✭ 353 (+1861.11%)
Mutual labels:  gpt2, huggingface
web-poet
Web scraping Page Objects core library
Stars: ✭ 67 (+272.22%)
Mutual labels:  web-scraping
kaggle-champs
Code for the CHAMPS Predicting Molecular Properties Kaggle competition
Stars: ✭ 49 (+172.22%)
Mutual labels:  transformer
cape
Continuous Augmented Positional Embeddings (CAPE) implementation for PyTorch
Stars: ✭ 29 (+61.11%)
Mutual labels:  transformer
Frost
A backup program that does deduplication, compression, encryption
Stars: ✭ 25 (+38.89%)
Mutual labels:  deduplication
deduplication
Fast multi-threaded content-dependent chunking deduplication for Buffers in C++ with a reference implementation in Javascript. Ships with extensive tests, a fuzz test and a benchmark.
Stars: ✭ 59 (+227.78%)
Mutual labels:  deduplication
core
The complete web scraping toolkit for PHP.
Stars: ✭ 1,110 (+6066.67%)
Mutual labels:  web-scraping
crawlzone
Crawlzone is a fast asynchronous internet crawling framework for PHP.
Stars: ✭ 70 (+288.89%)
Mutual labels:  web-scraping
lsh-semantic-similarity
Locality Sensitive Hashing for semantic similarity (Python 3.x)
Stars: ✭ 16 (-11.11%)
Mutual labels:  lsh
Variational-Transformer
Variational Transformers for Diverse Response Generation
Stars: ✭ 79 (+338.89%)
Mutual labels:  transformer

Neural Scam Artist

TL;DR
A dataset of scam emails is scraped from an anti-fraud website. The dataset is then deduplicated using MinHash and LSH. The deduplicated dataset is used for fine-tuning GPT-2.

Comic stolen from Agent-X Comics.

📖 Table of Contents

☁️ Project Description
📁 Shared Files
🧰 Requirements
⚙️ Installation
🧻 Usage

☁️ Project Description

Objective

The goal of this project is to create a new dataset of fraudulent emails that can advance research on intelligent email assistants.

Web Scraper

Data is scraped from the website https://antifraudintl.org/. First, a set of thread URLs is collected and stored. Then, each thread is searched for emails. At most one email is kept per thread, as the rest are duplicates. Metadata (Subject, Date, etc.) is removed, and the resulting dataset is stored in a CSV file.
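
A minimal sketch of how such a scraper could look, assuming requests and BeautifulSoup; the CSS selector, column names, and output path are hypothetical stand-ins, not the project's actual code:

import csv
import requests
from bs4 import BeautifulSoup

def scrape_thread(url):
    # Fetch one thread page; raise on HTTP errors.
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Hypothetical selector: posts are assumed to sit in elements with
    # class "message-body". Only the first post is kept, since later
    # posts in a thread mostly repeat the same email.
    posts = soup.select(".message-body")
    return posts[0].get_text("\n").strip() if posts else None

def build_dataset(thread_urls, out_path="dataset.csv"):
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "email"])
        for url in thread_urls:
            email = scrape_thread(url)
            if email:
                writer.writerow([url, email])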

Deduplication

To avoid the quadratic cost of comparing every pair of documents, a cheaper alternative is used: MinHash and LSH via the datasketch library. For each document, this method efficiently locates its approximate nearest neighbors. Because this still produces a large number of false negatives (i.e. duplicate documents classified as non-duplicates), the approach is extended by building a duplicate graph. Nodes in this graph represent documents, and two nodes are connected by an edge if their documents have been classified as duplicates. To deduplicate the dataset, the connected components of the graph are located, and a single node is selected from each component. A readability criterion is used for the selection.
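
A condensed sketch of this pipeline using datasketch for MinHash/LSH and networkx for the duplicate graph (networkx is an assumption; the threshold, num_perm, and the length-based selection below are illustrative stand-ins for the project's actual settings and readability criterion):

from datasketch import MinHash, MinHashLSH
import networkx as nx

def minhash(text, num_perm=128):
    # Hash the set of word tokens; real pipelines often use shingles.
    m = MinHash(num_perm=num_perm)
    for token in set(text.split()):
        m.update(token.encode("utf8"))
    return m

def deduplicate(docs, threshold=0.9, num_perm=128):
    # Index every document in an LSH table keyed by its position.
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    hashes = [minhash(doc, num_perm) for doc in docs]
    for i, m in enumerate(hashes):
        lsh.insert(i, m)

    # Duplicate graph: an edge connects two documents that LSH
    # reports as near-duplicates.
    graph = nx.Graph()
    graph.add_nodes_from(range(len(docs)))
    for i, m in enumerate(hashes):
        for j in lsh.query(m):
            if i != j:
                graph.add_edge(i, j)

    # Keep one document per connected component. The project selects
    # by a readability criterion; document length is a stand-in here.
    return [max(component, key=lambda i: len(docs[i]))
            for component in nx.connected_components(graph)]

Taking connected components makes duplicate detection transitive: if A matches B and B matches C, all three land in one component even when LSH misses the A-C pair, which is how the graph compensates for the false negatives.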

GPT-2

A small pretrained GPT-2 model from the Huggingface library is fine-tuned on the deduplicated dataset. A collection of randomly selected (not cherry-picked) generated samples can be found here.
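
A minimal fine-tuning sketch with the Huggingface Trainer API; the file name, column name, and hyperparameters are placeholders rather than the values from configs/gpt2_train.yaml:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical file/column names: one deduplicated email per CSV row.
dataset = load_dataset("csv", data_files="deduplicated.csv")["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["email"], truncation=True, max_length=512),
    batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    # mlm=False gives the causal (next-token) objective GPT-2 needs.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()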

📁 Shared Files

Resource               Size       #Samples   Link
Full dataset           128.5 MB   85,160     Link
Deduplicated dataset   74.2 MB    58,227     Link
Thread URLs            6.4 MB     95,324     Link
GPT-2 Checkpoints      ~1.5 GB    -          Link

🧰 Requirements

See requirements.txt.

⚙️ Installation

$ git clone https://github.com/davidsvy/Neural-Scam-Artist
$ cd Neural-Scam-Artist
$ pip install -r requirements.txt

🧻 Usage

To generate the dataset (~3 hours on Colab):

$ python create_dataset.py [-c configs/create_dataset.yaml]

To deduplicate the dataset (~30 minutes on Colab):

$ python deduplicate_dataset.py [-c configs/deduplicate_dataset.yaml]

To train GPT-2 (~3 hours/epoch on Colab with a K80):

$ python gpt2_train.py [-c configs/gpt2_train.yaml]

To generate text with GPT-2:

$ python gpt2_sample.py [-c configs/gpt2_sample.yaml]
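
Under the hood, sampling reduces to something like the following sketch; the checkpoint path, prompt, and decoding parameters are illustrative, not the defaults from configs/gpt2_sample.yaml:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("checkpoints")  # fine-tuned weights

inputs = tokenizer("Dear friend,", return_tensors="pt")
# Nucleus (top-p) sampling; do_sample=True makes decoding stochastic.
outputs = model.generate(**inputs, max_length=200, do_sample=True,
                         top_p=0.95,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))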