surge-ai / toxicity

License: MIT

The world's largest social media toxicity dataset.

The Toxicity Dataset

by Surge AI, the world's most powerful NLP data labeling platform and workforce

Saving the internet is fun. Combing through thousands of online comments to build a toxicity dataset isn't. That's why we're creating the world's largest dataset of social media toxicity — so you can skip the slog and get to work.

We hope you find this sample of our dataset useful, whether you want to flag hateful speech, develop content moderation tools, or build classifiers to detect toxic messages.

Interested in the full dataset of toxicity to train your ML models, or toxicity in other languages (Spanish, French, German, Japanese, Portuguese, and 17+ more)? We work with top AI and Safety companies around the world to build human-powered datasets to train stunning ML. Reach out to [email protected]!

Dataset

This repo contains 500 toxic and 500 non-toxic comments from a variety of popular social media platforms. The file toxicity_en.csv holds all 1,000 English examples as a spreadsheet. Rather than operating under a strict definition of toxicity, we asked our team to identify comments that they personally found toxic.

Columns

text: the text of the comment
is_toxic: whether or not the comment is toxic
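
As a quick sanity check on the two columns, here is a minimal sketch that counts the labels in a CSV with this layout. It assumes the file has a header row naming `text` and `is_toxic`; the helper name `label_counts` and the demo rows are hypothetical, and since the encoding of `is_toxic` is not described here, values are counted exactly as they appear.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def label_counts(handle: TextIO) -> Counter:
    """Count each distinct value in the is_toxic column of an open CSV file."""
    reader = csv.DictReader(handle)
    # Assumption: the first row is a header naming the two documented columns.
    missing = {"text", "is_toxic"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    # The label encoding is not documented here, so values are counted as found.
    return Counter((row["is_toxic"] or "").strip() for row in reader)


# Placeholder rows for illustration only; they are not taken from the dataset.
demo = io.StringIO(
    "text,is_toxic\n"
    "placeholder comment one,label_a\n"
    "placeholder comment two,label_b\n"
    "placeholder comment three,label_a\n"
)
demo_counts = label_counts(demo)
print(demo_counts)
```

To run it on the real file, pass a handle such as `open("toxicity_en.csv", newline="", encoding="utf-8")` (the UTF-8 encoding is an assumption). If the file matches the description above, the result should show two distinct values with 500 rows each.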

Future

We'll be adding more languages and annotations (e.g., augmenting each comment with a severity ranking, adding categories, etc.) over time. You can also check out our other free datasets here.

If you're also interested in a dataset of profanity, check out our obscenity list.
