
collab-uniba / Emotion_and_Polarity_SO

Licence: other
An emotion classifier of text containing technical content from the SE domain

Programming Languages

OpenEdge ABL
179 projects
Java
68154 projects - #9 most used programming language

Projects that are alternatives of or similar to Emotion and Polarity SO

Senti4SD
An emotion-polarity classifier specifically trained on developers' communication channels
Stars: ✭ 41 (-44.59%)
Mutual labels:  sentiment-analysis, sentiment, emotion, polarity
hfusion
Multimodal sentiment analysis using hierarchical fusion with context modeling
Stars: ✭ 42 (-43.24%)
Mutual labels:  sentiment-analysis, emotion-detection, emotion-recognition
chronist
Long-term analysis of emotion, age, and sentiment using Lifeslice and text records.
Stars: ✭ 23 (-68.92%)
Mutual labels:  sentiment-analysis, sentiment, emotion
XED
XED multilingual emotion datasets
Stars: ✭ 34 (-54.05%)
Mutual labels:  sentiment-analysis, emotion-detection, emotion-recognition
sklearn-audio-classification
An in-depth analysis of audio classification on the RAVDESS dataset. Feature engineering, hyperparameter optimization, model evaluation, and cross-validation with a variety of ML techniques and MLP
Stars: ✭ 31 (-58.11%)
Mutual labels:  emotion, emotion-detection, emotion-recognition
sentiment-analysis-using-python
Large Data Analysis Course Project
Stars: ✭ 23 (-68.92%)
Mutual labels:  classifier, sentiment-analysis, sentiment
afinn-111
AFINN 111 (list of English words rated for valence) in JSON
Stars: ✭ 44 (-40.54%)
Mutual labels:  sentiment, polarity
Personal-Emotional-Stylized-Dialog
A Paper List for Personalized, Emotional, and stylized Dialog
Stars: ✭ 112 (+51.35%)
Mutual labels:  sentiment, emotion
Resnet-Emotion-Recognition
Identifies emotion(s) from user facial expressions
Stars: ✭ 21 (-71.62%)
Mutual labels:  emotion, emotion-recognition
Twitter Sentiment Analysis
This script can tell you the sentiments of people regarding to any events happening in the world by analyzing tweets related to that event
Stars: ✭ 94 (+27.03%)
Mutual labels:  sentiment-analysis, sentiment
AffectiveTweets
A WEKA package for analyzing emotion and sentiment of tweets.
Stars: ✭ 74 (+0%)
Mutual labels:  sentiment, emotion
brand-sentiment-analysis
Scripts utilizing Heartex platform to build brand sentiment analysis from the news
Stars: ✭ 21 (-71.62%)
Mutual labels:  sentiment-analysis, sentiment
Ml Classify Text Js
Machine learning based text classification in JavaScript using n-grams and cosine similarity
Stars: ✭ 38 (-48.65%)
Mutual labels:  classifier, sentiment-analysis
Nlp.js
An NLP library for building bots, with entity extraction, sentiment analysis, automatic language identify, and so more
Stars: ✭ 4,670 (+6210.81%)
Mutual labels:  classifier, sentiment-analysis
Whatsapp-analytics
performing sentiment analysis on the whatsapp chats.
Stars: ✭ 20 (-72.97%)
Mutual labels:  sentiment-analysis, sentiment
Sentiment
AFINN-based sentiment analysis for Node.js.
Stars: ✭ 2,469 (+3236.49%)
Mutual labels:  sentiment-analysis, sentiment
LSTM-sentiment-analysis
LSTM sentiment analysis. Please look at my another repo for SVM and Naive algorithem
Stars: ✭ 19 (-74.32%)
Mutual labels:  sentiment-analysis, sentiment
RECCON
This repository contains the dataset and the PyTorch implementations of the models from the paper Recognizing Emotion Cause in Conversations.
Stars: ✭ 126 (+70.27%)
Mutual labels:  emotion, emotion-recognition
GroupDocs.Classification-for-.NET
GroupDocs.Classification-for-.NET samples and showcase (text and documents classification and sentiment analysis)
Stars: ✭ 38 (-48.65%)
Mutual labels:  sentiment-analysis, sentiment
Text Analytics With Python
Learn how to process, classify, cluster, summarize, understand syntax, semantics and sentiment of text data with the power of Python! This repository contains code and datasets used in my book, "Text Analytics with Python" published by Apress/Springer.
Stars: ✭ 1,132 (+1429.73%)
Mutual labels:  sentiment-analysis, sentiment

EmoTxt

A toolkit for emotion detection from technical text. It is part of the Collab Emotion Mining Toolkit (EMTk).

Fair Use Policy

Please cite the following paper if you intend to use our tool in your own research:

F. Calefato, F. Lanubile, N. Novielli. “EmoTxt: A Toolkit for Emotion Recognition from Text.” In Proceedings of the Seventh International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACII Workshops 2017), San Antonio, USA, Oct. 23–26, 2017, pp. 79–80, ISBN: 978-1-5386-0563-9.

Installation

You will need to install the Git LFS extension to check out this project. Once Git LFS is installed and initialized, simply run:

$ git clone https://github.com/collab-uniba/Emotion_and_Polarity_SO.git
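
If Git LFS has never been set up for your user account, it typically needs to be initialized once before cloning. Assuming the git-lfs package is already installed on your system, this is usually a single command:

$ git lfs install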

Requirements

  • RAM: 8 GB

  • Python 2.7.x

    • Libraries
      • nltk, numpy, scikit_learn, scipy, pattern
        • Installation: open the command line and run $ pip install -r requirements.txt

        • Complete the nltk installation: run the Python interpreter and enter the following commands:

          >>> import nltk        
          >>> nltk.download()
          
  • Java 8+

    • Maven 3.x
      • If you want to build the jar yourself, run the following commands (an illustrative end-to-end sketch is shown after this Requirements list):
        cd java
        mvn clean install
      • The fat jar will be generated in the java/target folder with the name EmotionAndPolarity-0.0.1-SNAPSHOT-jar-with-dependencies.jar. Rename it to Emotion_and_Polarity_SO.jar and move it directly under the java folder.
  • R

    • Libraries:
      • caret, LiblineaR, e1071
        • Installation: open the command line and run

          $ Rscript requirements.R
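
For example, the optional Java build step described above could look like the following sketch; the paths are illustrative and assume the Maven layout mentioned in the Requirements list (the jar is built in java/target, then renamed and moved directly under the java folder):

          cd java
          mvn clean install
          mv target/EmotionAndPolarity-0.0.1-SNAPSHOT-jar-with-dependencies.jar Emotion_and_Polarity_SO.jar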

Usage and dataset

In the following, we first show how to train a new model for emotion classification and then how to test the model on unseen data.

For testing purposes, you can use the sample.csv input file available in the root of the repo. For other, more complex examples, look at the dataset files available under the subfolder ./java/DatasetSO/StackOverflowCSV.

If you are looking for the entire experimental dataset of ~5K Stack Overflow posts annotated with emotion, it is available from this repository.

Training a new model for emotion classification (70%/30% train/test split)

$ sh train.sh -i file.csv -d delimiter [-g] [-p] -e emotion 

where:

  • -i file.csv: the input file, coded in UTF-8 without BOM, containing the input corpus. Please note that gold labels are required for each item in the dataset. The format of the input file is the following:

    id;label;text
    ...
    22;NO;"""Excellent! This is exactly what I needed. Thanks!"""
    23;YES;"""FEAR!!!!!!!!!!!"""
    ...
    
  • -d delimiter: the delimiter used in the csv file (values in {c, sc}, where c stands for comma and sc for semicolon). Please note that all the example files provided here use the semicolon as delimiter, so -d sc is a mandatory option during tests.

  • -g: enables the extraction of n-grams (i.e., unigrams and bigrams). N-gram extraction is mandatory the first time you train a classification model for a given emotion on your own dataset. Because n-gram extraction is computationally expensive, it should be skipped if you retrain the model for the same emotion using the same input file.

  • -p: enables the extraction of features regarding politeness, mood and modality. Because this is computationally expensive, the switch is off by default.

  • -e emotion: the specific emotion for which you want to train a classification model, with values in {joy, anger, sadness, love, surprise, fear}.
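
For instance, a first training run on the provided sample file, targeting the joy emotion (chosen here purely as an example; any emotion from the list above works), could look like this:

$ sh train.sh -i sample.csv -d sc -g -e joy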

As a result, the script will generate the following output files:

  • An output folder named training_<file.csv>_<emotion>/, containing:
    • n-grams/: a subfolder containing the extracted n-grams
    • idfs/: a subfolder containing the IDFs computed for n-grams and WordNet Affect emotion words
    • feature-<emotion>.csv: a .csv file with the features extracted from the input corpus and used for training the model
    • liblinear/:
      • there are two subfolders: DownSampling/ and NoDownSampling/. Each one contains:
        • trainingSet.csv
        • testingSet.csv
        • eight models trained with liblinear, named model_<emotion>_<IDMODEL>.Rda, where IDMODEL is the ID of the liblinear model, with values in {0,...,7}
        • performance_<emotion>_<IDMODEL>.txt, containing the results of the parameter tuning for the model (best C) as performed by caret, the confusion matrix, and the precision, recall, and F-measure for the best cost for the specific emotion
        • predictions_<emotion>_<IDMODEL>.csv, containing the test instances with the predicted labels for the specific emotion
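
For the illustrative training run above (sh train.sh -i sample.csv -d sc -g -e joy), the output layout would therefore look roughly as follows (names are derived from the description above; IDMODEL ranges over 0–7):

    training_sample.csv_joy/
      n-grams/
      idfs/
      feature-joy.csv
      liblinear/
        DownSampling/
          trainingSet.csv
          testingSet.csv
          model_joy_<IDMODEL>.Rda
          performance_joy_<IDMODEL>.txt
          predictions_joy_<IDMODEL>.csv
        NoDownSampling/
          (same files as DownSampling/)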

Emotion detection

$ sh classify.sh -i file.csv -d delimiter -e emotion [-m model] [-f idf] [-o n-grams] [-l] [-p]

where:

  • -i file.csv: the input csv file, with a header and coded in UTF-8 without BOM, containing the corpus to be classified; the format of the input file is the following:

    id;label;text
    ...
    22;NO;"""Excellent! This is exactly what I needed. Thanks!"""
    23;YES;"""FEAR!!!!!!!!!!!"""
    ...
    
  • -d delimiter: the delimiter used in the csv file (values in {c, sc}, where c stands for comma and sc for semicolon). Please note that all the example files provided here use the semicolon as delimiter, so -d sc is a mandatory option during tests.

  • -e emotion: the specific emotion to be detected in the input file or text, defined in {joy, anger, sadness, love, surprise, fear}.

  • -m model: the model file learnt during the training step (e.g., model-anger.rda). If you don't specify a model name, the default model will be used, that is, the one learnt on our Stack Overflow gold standard.

  • -o n-grams: if you specify a model name using -m (i.e., you don't want to use the default model for a given emotion), you are also required to provide the path to the folder containing the dictionaries extracted during the training step. This folder includes the n-grams, i.e., UnigramsList.txt and BigramsList.txt.

  • -f idf: if you specify a model name using -m (i.e., you don't want to use the default model for a given emotion), you are also required to specify the path to the folder containing the dictionaries with the IDFs computed during the training step. The folder includes IDFs for n-grams (uni- and bi-grams) and for the WordNet Affect lists of emotion words.

  • -l: if present, indicates that <file.csv> contains a gold label in the column label.

  • -p: enables the extraction of features regarding politeness, mood and modality. Because this is computationally expensive, the switch is off by default.
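
For instance, to classify sample.csv with a custom joy model produced by the training step, you might run something like the following; the paths are illustrative and assume the training output layout described above (here, the NoDownSampling model with ID 0):

$ sh classify.sh -i sample.csv -d sc -e joy \
    -m training_sample.csv_joy/liblinear/NoDownSampling/model_joy_0.Rda \
    -o training_sample.csv_joy/n-grams \
    -f training_sample.csv_joy/idfs \
    -l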

As a result, the script will create an output folder named classification_<file.csv>_<emotion> containing:

  • predictions_<emotion>.csv: a csv file with header, containing a binary prediction (yes/no) for each line of the input corpus:

    id;predicted
    ...
    22;NO
    23;YES
    ...
    
  • performance_<emotion>.txt: a file created only if the input corpus <file.csv> contains the column label; the file contains several performance metrics (precision, recall, F1, and the confusion matrix).

For example, if you wanted to detect anger in the input file sample.csv, you would have to run:

$ sh classify.sh -i sample.csv -d sc -p -e anger