
alex000kim / Nsfw_data_scraper

License: MIT
Collection of scripts to aggregate image data for the purposes of training an NSFW Image Classifier



NSFW Data Scraper

Note: use with caution - the dataset is noisy

Description

This is a set of scripts that automatically collects tens of thousands of images for the following (loosely defined) categories, to be used later for training an image classifier:

  • porn - pornography images
  • hentai - hentai images, but also includes pornographic drawings
  • sexy - sexually explicit images, but not pornography. Think nude photos, playboy, bikini, etc.
  • neutral - safe for work neutral images of everyday things and people
  • drawings - safe for work drawings (including anime)

Here is what each script (located under the scripts directory) does:

  • 1_get_urls_.sh - iterates through the text files under scripts/source_urls and downloads the URLs of images for each of the 5 categories above. The Ripme application does all the heavy lifting. The source URLs are mostly links to various subreddits, but could point to any website that Ripme supports. Note: I already ran this script for you, and its outputs are located in the raw_data directory. There is no need to rerun it unless you edit the files under scripts/source_urls.
  • 2_download_from_urls_.sh - downloads the actual images for the URLs found in the text files in the raw_data directory.
  • 3_optional_download_drawings_.sh - (optional) script that downloads SFW anime images from the Danbooru2018 database.
  • 4_optional_download_neutral_.sh - (optional) script that downloads SFW neutral images from the Caltech256 dataset.
  • 5_create_train_.sh - creates the data/train directory and copies all *.jpg and *.jpeg files into it from raw_data. It also removes corrupted images.
  • 6_create_test_.sh - creates the data/test directory and moves N=2000 random files for each class from data/train to data/test (change this number inside the script if you need a different train/test split). Alternatively, you can run it multiple times; each run moves another N images per class from data/train to data/test.
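
The train/test split performed by 6_create_test_.sh can be sketched in Python as follows. This is a minimal illustration, not the script itself: the function name move_random_to_test and its parameters are hypothetical, and it assumes one subdirectory per class under the training directory.

```python
import random
import shutil
from pathlib import Path


def move_random_to_test(train_dir, test_dir, n_per_class=2000, seed=None):
    """Move up to n_per_class random files per class from train_dir to test_dir.

    Hypothetical helper: assumes train_dir contains one subdirectory per class.
    Returns a dict mapping each class name to the number of files moved.
    """
    rng = random.Random(seed)
    train_dir, test_dir = Path(train_dir), Path(test_dir)
    moved = {}
    for class_dir in sorted(p for p in train_dir.iterdir() if p.is_dir()):
        files = sorted(p for p in class_dir.iterdir() if p.is_file())
        # Never ask for more files than the class actually has.
        chosen = rng.sample(files, min(n_per_class, len(files)))
        target = test_dir / class_dir.name
        target.mkdir(parents=True, exist_ok=True)
        for src in chosen:
            shutil.move(str(src), str(target / src.name))
        moved[class_dir.name] = len(chosen)
    return moved


# Example call (paths follow the layout described above):
# move_random_to_test("data/train", "data/test", n_per_class=2000)
```

As with the script, calling the function again moves a further batch of files out of the training directory, so the test set grows with each run.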

Prerequisites

  • Docker

How to collect data

$ docker build . -t docker_nsfw_data_scraper
Sending build context to Docker daemon  426.3MB
Step 1/3 : FROM ubuntu:18.04
 ---> 775349758637
Step 2/3 : RUN apt update  && apt upgrade -y  && apt install wget rsync imagemagick default-jre -y
 ---> Using cache
 ---> b2129908e7e2
Step 3/3 : ENTRYPOINT ["/bin/bash"]
 ---> Using cache
 ---> d32c5ae5235b
Successfully built d32c5ae5235b
Successfully tagged docker_nsfw_data_scraper:latest
$ # The next command may run for several hours; consider leaving it to run overnight
$ docker run -v $(pwd):/root/nsfw_data_scraper docker_nsfw_data_scraper scripts/runall.sh
Getting images for class: neutral
...
...
$ ls data
test  train
$ ls data/train/
drawings  hentai  neutral  porn  sexy
$ ls data/test/
drawings  hentai  neutral  porn  sexy
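
Since the dataset is noisy and class sizes can differ, it may help to check how many images ended up in each class. The sketch below is not part of the repository: count_images is a hypothetical helper that assumes the data/train and data/test layout shown above and counts only *.jpg and *.jpeg files.

```python
from pathlib import Path

# Extensions kept by the train-set creation step described in this README.
IMAGE_SUFFIXES = {".jpg", ".jpeg"}


def count_images(data_dir):
    """Return {split: {class_name: image_count}} for a data directory.

    Hypothetical helper: assumes data_dir/<split>/<class>/<image files>.
    """
    counts = {}
    for split_dir in sorted(p for p in Path(data_dir).iterdir() if p.is_dir()):
        counts[split_dir.name] = {
            class_dir.name: sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
            for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir())
        }
    return counts


# Example call: count_images("data") returns a dict with "test" and "train" keys.
```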

How to train a CNN model

  • Install fastai: conda install -c pytorch -c fastai fastai
  • Run train_model.ipynb top to bottom

Results

I was able to train a CNN classifier to 91% accuracy with the following confusion matrix:

[Image: confusion matrix of the trained classifier]

As expected, drawings and hentai are confused with each other more frequently than with other classes, and the same is true of the porn and sexy categories.
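
For readers who want to see how an accuracy figure and the most-confused class pairs are read off a confusion matrix, here is a minimal sketch. The counts in example_cm are made up for illustration and are not the author's results. The function names are hypothetical, and the sketch assumes that rows are true classes and columns are predicted classes.

```python
CLASSES = ["drawings", "hentai", "neutral", "porn", "sexy"]

# Made-up counts for illustration only (rows: true class, columns: predicted).
example_cm = [
    [8, 2, 0, 0, 0],
    [1, 9, 0, 0, 0],
    [0, 0, 10, 0, 0],
    [0, 0, 0, 7, 3],
    [0, 0, 1, 2, 7],
]


def accuracy(cm):
    """Fraction of samples on the diagonal, i.e. correctly classified."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def most_confused(cm, labels):
    """Return (true, predicted, count) for non-zero off-diagonal cells, largest first."""
    pairs = [
        (labels[i], labels[j], cm[i][j])
        for i in range(len(cm))
        for j in range(len(cm))
        if i != j and cm[i][j] > 0
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

With these made-up counts, the largest off-diagonal cells fall within the drawings/hentai and porn/sexy pairs, which is the kind of pattern described above.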
