vzhou842 / Profanity Check

Licence: MIT

A fast, robust Python library to check for offensive language in strings.

Labels

scikit-learn sklearn

Projects that are alternatives of or similar to Profanity Check

Python Flask Sklearn Docker Template

A simple example of python api for real time machine learning, using scikit-learn, Flask and Docker

Stars: ✭ 117 (-66.95%)

Mutual labels: scikit-learn, sklearn

sklearn-pmml-model

A library to parse and convert PMML models into Scikit-learn estimators.

Stars: ✭ 71 (-79.94%)

Mutual labels: scikit-learn, sklearn

Data Science algorithms for Qlik implemented as a Python Server Side Extension (SSE).

Stars: ✭ 135 (-61.86%)

Mutual labels: scikit-learn, sklearn

A repository for recording the machine learning code

Stars: ✭ 75 (-78.81%)

Mutual labels: scikit-learn, sklearn

In Persian, for contributing to scikit-learn

Stars: ✭ 19 (-94.63%)

Mutual labels: scikit-learn, sklearn

Facial Expression Recognition Svm

Training SVM classifier to recognize people expressions (emotions) on Fer2013 dataset

Stars: ✭ 110 (-68.93%)

Mutual labels: scikit-learn, sklearn

imbalanced-ensemble

Class-imbalanced / long-tailed ensemble learning in Python. Modular, flexible, and extensible.

Stars: ✭ 199 (-43.79%)

Mutual labels: scikit-learn, sklearn

AiLearning: Machine Learning (ML), Deep Learning (DL), Natural Language Processing (NLP)

Stars: ✭ 32,316 (+9028.81%)

Mutual labels: scikit-learn, sklearn

sklearn-audio-classification

An in-depth analysis of audio classification on the RAVDESS dataset. Feature engineering, hyperparameter optimization, model evaluation, and cross-validation with a variety of ML techniques and MLP

Stars: ✭ 31 (-91.24%)

Mutual labels: scikit-learn, sklearn

A Streamlit application to play with machine learning models directly from the browser

Stars: ✭ 48 (-86.44%)

Mutual labels: scikit-learn, sklearn

Mlatimperial2017

Materials for the course of machine learning at Imperial College organized by Yandex SDA

Stars: ✭ 71 (-79.94%)

Mutual labels: scikit-learn, sklearn

SciKIt-learn Pipeline in PAndas

Stars: ✭ 33 (-90.68%)

Mutual labels: scikit-learn, sklearn

Transpile trained scikit-learn estimators to C, Java, JavaScript and others.

Stars: ✭ 1,014 (+186.44%)

Mutual labels: scikit-learn, sklearn

Sklearn Evaluation

Machine learning model evaluation made easy: plots, tables, HTML reports, experiment tracking and Jupyter notebook analysis.

Stars: ✭ 294 (-16.95%)

Mutual labels: scikit-learn, sklearn

🧙 A web app to generate template code for machine learning

Stars: ✭ 948 (+167.8%)

Mutual labels: scikit-learn, sklearn

a delightful machine learning tool that allows you to train, test, and use models without writing code

Stars: ✭ 2,956 (+735.03%)

Mutual labels: scikit-learn, sklearn

Hyperparameter hunter

Easy hyperparameter optimization and automatic result saving across machine learning algorithms and libraries

Stars: ✭ 648 (+83.05%)

Mutual labels: scikit-learn, sklearn

Machinelearningstocks

Using python and scikit-learn to make stock predictions

Stars: ✭ 897 (+153.39%)

Mutual labels: scikit-learn, sklearn

Code for determining optimal number of clusters for K-means algorithm using the 'elbow criterion'

Stars: ✭ 35 (-90.11%)

Mutual labels: scikit-learn, sklearn

Kaio-machine-learning-human-face-detection

Machine Learning project a case study focused on the interaction with digital characters, using a character called "Kaio", which, based on the automatic detection of facial expressions and classification of emotions, interacts with humans by classifying emotions and imitating expressions

Stars: ✭ 18 (-94.92%)

Mutual labels: scikit-learn, sklearn

profanity-check

A fast, robust Python library to check for profanity or offensive language in strings. Read more about how and why profanity-check was built in this blog post. You can also test out profanity-check in your browser.

How It Works

profanity-check uses a linear SVM model trained on 200k human-labeled samples of clean and profane text strings. Its model is simple but surprisingly effective, meaning profanity-check is both robust and extremely performant.

Why Use profanity-check?

No Explicit Blacklist

Many profanity detection libraries use a hard-coded list of bad words to detect and filter profanity. For example, profanity uses this wordlist, and even better-profanity still uses a wordlist. The glaring issue with this approach is that any word missing from the list goes undetected, so while these libraries might be performant, they are not very accurate.

A simple example for which profanity-check is better is the phrase "You cocksucker" - profanity thinks this is clean because it doesn't have "cocksucker" in its wordlist.
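The failure mode can be reproduced with a minimal sketch of a wordlist filter. This is not the code of profanity or better-profanity; the two-word `WORDLIST` and the `wordlist_flags` helper are hypothetical, and real libraries ship much longer lists and may normalize text differently.

```python
import re

# Hypothetical, deliberately tiny wordlist for illustration only.
WORDLIST = {"jerk", "scum"}

def wordlist_flags(text: str) -> bool:
    """Return True if any whole token of `text` appears in WORDLIST."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return any(token in WORDLIST for token in tokens)

print(wordlist_flags("you scum"))     # True: the word is listed
print(wordlist_flags("you scumbag"))  # False: the compound is not listed
```

Adding more entries narrows the gap but never closes it, since every unlisted variant still slips through.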

Performance

Other libraries like profanity-filter use more sophisticated methods that are much more accurate, but at the cost of performance. A benchmark (performed December 2018 on a new 2018 MacBook Pro) using a Kaggle dataset of Wikipedia comments yielded roughly the following results:

Package	1 Prediction (ms)	10 Predictions (ms)	100 Predictions (ms)
profanity-check	0.2	0.5	3.5
profanity-filter	60	1200	13000
profanity	0.3	1.2	24

profanity-check is roughly 300 to 3,700 times faster than profanity-filter in this benchmark (60 / 0.2 = 300 for a single prediction, 13000 / 3.5 ≈ 3,700 for 100 predictions)!

Accuracy

This table speaks for itself:

Package	Test Accuracy	Balanced Test Accuracy	Precision	Recall	F1 Score
profanity-check	95.0%	93.0%	86.1%	89.6%	0.88
profanity-filter	91.8%	83.6%	85.4%	70.2%	0.77
profanity	85.6%	65.1%	91.7%	30.8%	0.46
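The F1 column can be cross-checked against the Precision and Recall columns, since F1 is the harmonic mean of the two. The sketch below only redoes that arithmetic on the figures from the table above; it does not rerun the benchmark, and the `f1_score` helper is written here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the accuracy table.
reported = {
    "profanity-check": (0.861, 0.896),
    "profanity-filter": (0.854, 0.702),
    "profanity": (0.917, 0.308),
}

for package, (precision, recall) in reported.items():
    print(f"{package}: F1 = {f1_score(precision, recall):.2f}")
# profanity-check: F1 = 0.88
# profanity-filter: F1 = 0.77
# profanity: F1 = 0.46
```

The last row shows why precision alone is misleading: profanity has the highest precision in the table, but its low recall drags its F1 score down to 0.46.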

See the How section below for more details on the dataset used for these results.

Installation

$ pip install profanity-check

Usage

from profanity_check import predict, predict_prob

predict(['predict() takes an array and returns a 1 for each string if it is offensive, else 0.'])
# [0]

predict(['fuck you'])
# [1]

predict_prob(['predict_prob() takes an array and returns the probability each string is offensive'])
# [0.08686173]

predict_prob(['go to hell, you scum'])
# [0.7618861]

Note that both predict() and predict_prob() return numpy arrays.

More on How/Why It Works

How

Special thanks to the authors of the datasets used in this project. profanity-check was trained on a combined dataset from 2 sources:

t-davidson/hate-speech-and-offensive-language, used in their paper Automated Hate Speech Detection and the Problem of Offensive Language
the Toxic Comment Classification Challenge on Kaggle.

profanity-check relies heavily on the excellent scikit-learn library. It's mostly powered by the scikit-learn classes CountVectorizer, LinearSVC, and CalibratedClassifierCV. It uses a bag-of-words model to vectorize input strings before feeding them to a linear classifier.
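To make the bag-of-words step concrete, here is a standard-library sketch of the idea behind count vectorization. It is not CountVectorizer itself, whose tokenization and options differ; the helper names `tokenize`, `fit_vocabulary`, and `to_count_vector` are made up for this illustration.

```python
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a simplification of real tokenizers)."""
    return text.lower().split()

def fit_vocabulary(texts: list[str]) -> dict[str, int]:
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({token for text in texts for token in tokenize(text)})
    return {token: index for index, token in enumerate(tokens)}

def to_count_vector(text: str, vocabulary: dict[str, int]) -> list[int]:
    """Count occurrences of each vocabulary token; unknown tokens are ignored."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector

vocabulary = fit_vocabulary(["you are nice", "you are you"])
print(vocabulary)                                   # {'are': 0, 'nice': 1, 'you': 2}
print(to_count_vector("you you rock", vocabulary))  # [0, 0, 2]
```

Word order is discarded and only per-word counts remain, which is what lets a linear classifier assign one weight per word.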

Why

One simplified way you could think about why profanity-check works is this: during the training process, the model learns which words are "bad" and how "bad" they are because those words will appear more often in offensive texts. Thus, it's as if the training process is picking the "bad" words out of all possible words and using those to make future predictions. This is better than just relying on arbitrary word blacklists chosen by humans!
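This intuition can be sketched with a toy linear model. The example below trains a simple perceptron, which stands in for the linear SVM and is not the library's actual training procedure; the six training sentences are made up, and `train_word_weights` and `score_text` are hypothetical helpers.

```python
from collections import defaultdict

# Toy, made-up training data: (text, label) with 1 = offensive, 0 = clean.
TRAINING_DATA = [
    ("you are a jerk", 1),
    ("what a stupid jerk", 1),
    ("you are stupid", 1),
    ("you are a friend", 0),
    ("what a nice day", 0),
    ("you are nice", 0),
]

def train_word_weights(data, epochs=10):
    """Perceptron training: one weight per word plus a bias term."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, label in data:
            words = text.lower().split()
            score = bias + sum(weights[word] for word in words)
            predicted = 1 if score > 0 else 0
            error = label - predicted  # -1, 0, or +1
            if error:
                # Nudge the bias and every word in the misclassified text.
                bias += error
                for word in words:
                    weights[word] += error
    return dict(weights), bias

def score_text(text, weights, bias):
    """Linear score: the bias plus the weights of known words."""
    return bias + sum(weights.get(word, 0.0) for word in text.lower().split())

weights, bias = train_word_weights(TRAINING_DATA)
for word in ("jerk", "stupid", "nice", "friend"):
    print(word, weights[word])
print(score_text("you are a jerk", weights, bias) > 0)  # True: scored as offensive
```

Words that appear only in offensive examples end up with positive weights, while neutral words such as "you" and "are" stay near zero, so nobody has to pick the "bad" words by hand.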

Caveats

This library is far from perfect. For example, it has a hard time picking up on less common variants of swear words like "f4ck you" or "you b1tch" because they don't appear often enough in the training corpus. Never treat any prediction from this library as unquestionable truth, because it does and will make mistakes. Instead, use this library as a heuristic.
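One way to treat predictions as a heuristic is to act automatically only on confident scores and route the uncertain middle band to a human. The sketch below assumes the probability comes from predict_prob(); the `triage` function and its 0.5 and 0.9 thresholds are hypothetical and would need tuning on your own data.

```python
def triage(probability: float, review_at: float = 0.5, block_at: float = 0.9) -> str:
    """Map an offensiveness probability to 'allow', 'review', or 'block'."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= block_at:
        return "block"
    if probability >= review_at:
        return "review"
    return "allow"

# The two probabilities shown in the Usage section.
print(triage(0.08686173))  # allow
print(triage(0.7618861))   # review
```

With these assumed thresholds, "go to hell, you scum" (0.76) would be queued for review rather than blocked outright.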

Stars: ✭ 354
