futurice / spice-hate_speech_detection

Licence: MIT license
A SPICE-program funded project where the goal is to detect hate speech in social media.

Automatic hate speech detection

Setup

  1. Install requirements
  • python3
  • python packages: pandas, sklearn, fasttext, sqlalchemy, ...
  2. Configure collector
  • Edit hiit_collector.py.example and save it as hiit_collector.py
  3. Configure PostgreSQL
  • Edit postgre_keys.py.example and save it as postgre_keys.py
  4. Get the data

Usage:

Collect new data

usage:

    collector.py [-h] [--user USER] [--password PASSWORD] [--hostname HOSTNAME]
                 [--outdir OUTDIR] [--startdate STARTDATE] [--enddate ENDDATE]

    optional arguments:
      -h, --help            show this help message and exit
      --user USER           Username
      --password PASSWORD   Password
      --hostname HOSTNAME   Hostname
      --outdir OUTDIR       Directory to store data
      --startdate STARTDATE Startdate as YYYY-MM-DD
      --enddate ENDDATE     Enddate as YYYY-MM-DD

Example:

./collector.py --startdate 2017-03-01 --enddate 2017-03-15
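
For readers who want to see how the options above fit together, here is a minimal sketch of an argument parser that accepts the same flags. It is not the code in collector.py: the helper names `parse_date` and `build_parser` are made up for illustration, and the actual script may set defaults or validate input differently.

```python
import argparse
from datetime import date, datetime


def parse_date(value: str) -> date:
    """Convert a YYYY-MM-DD string to a date (hypothetical helper)."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected YYYY-MM-DD, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags listed in the usage text.

    Assumption: no defaults are set here; the real collector.py may
    read them from hiit_collector.py instead.
    """
    parser = argparse.ArgumentParser(prog="collector.py")
    parser.add_argument("--user", help="Username")
    parser.add_argument("--password", help="Password")
    parser.add_argument("--hostname", help="Hostname")
    parser.add_argument("--outdir", help="Directory to store data")
    parser.add_argument("--startdate", type=parse_date,
                        help="Startdate as YYYY-MM-DD")
    parser.add_argument("--enddate", type=parse_date,
                        help="Enddate as YYYY-MM-DD")
    return parser


# Same arguments as the example command above, passed as a list.
args = build_parser().parse_args(
    ["--startdate", "2017-03-01", "--enddate", "2017-03-15"]
)
print(args.startdate, args.enddate)
```

Parsing the dates up front means a malformed value is rejected before any data is requested.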

Train predictor

Example:

./predict.py --inputdir data/incoming --outdir data/output/ --featurename bow --featurefile data/models/feature_extractor_bow.pkl --predictor data/models/fasttext_svm.pkl

Predict hate speech

Example:

./predict.py --inputdir data/incoming --outdir data/output/ --featurename bow --featurefile data/models/feature_extractor_bow.pkl --predictor data/models/bow_svm.pkl
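
The command above points at two pickled files, a feature extractor and a predictor. As a rough sketch of how two such objects could be combined, the snippet below loads pickles and chains them. It assumes the objects expose scikit-learn style `transform` and `predict` methods, which this README does not confirm; `load_pickle` and `predict_texts` are hypothetical names, not functions from predict.py.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load one pickled object. Only unpickle files you trust."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


def predict_texts(extractor, predictor, texts):
    """Extract features from texts, then classify them.

    Assumption: `extractor.transform` and `predictor.predict` exist,
    as they do on scikit-learn objects. The objects stored in the
    project's .pkl files may have a different interface.
    """
    features = extractor.transform(texts)
    return list(predictor.predict(features))


# Possible usage with the paths from the example command:
# extractor = load_pickle("data/models/feature_extractor_bow.pkl")
# predictor = load_pickle("data/models/bow_svm.pkl")
# labels = predict_texts(extractor, predictor, ["some message"])
```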

Sync data

Example:

./sync.py --inputdir data/output/

TODO

  1. CNN on embedding matrix (cf. Willi)
  2. Stemming and stop words for BoW
  3. Study SVM factors (with BoW)
  4. Mezadona ? To Models
  5. Plot t-SNE manifolds for the Wikipedia model and the Twitter model
  • Highlight hate words
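
To illustrate the stop-word item in the list above, here is a small sketch of a bag-of-words count that skips stop words, using only the standard library. The stop-word list is a made-up placeholder and the tokenizer is deliberately naive. Stemming is left out because it needs a language-specific stemmer, and nothing here reflects how the project's own BoW feature extractor works.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = frozenset({"a", "an", "and", "is", "of", "the", "to"})


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def bag_of_words(text, stop_words=STOP_WORDS):
    """Count tokens in text, ignoring any token in stop_words."""
    return Counter(tok for tok in tokenize(text) if tok not in stop_words)


print(bag_of_words("The cat and the hat"))
```

Dropping very common words shrinks the vocabulary, which is one reason to try it before studying the SVM further.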

DONE:

  1. Tried Naive Bayes classifier with BoW
  • Gaussian Naive Bayes performed comparably to RF, but worse than SVM
  • With FastText it performed poorly