ReadMe

This repository contains the data, code, pre-trained models and experiment results for the paper "SEntiMoji: An Emoji-Powered Learning Approach for Sentiment Analysis in Software Engineering".

SEntiMoji

This study proposes SEntiMoji, which leverages texts containing emoji from both GitHub and Twitter to improve sentiment analysis and emotion detection in the software engineering (SE) domain. SEntiMoji has been shown to significantly outperform the existing SE-customized sentiment analysis and emotion detection methods on representative benchmark datasets.

Overview

data/ contains the data used in this study. It contains two subfolders:
- GitHub_data/ contains the processed emoji-texts used to train SEntiMoji.
- benchmark_dataset/ contains the benchmark datasets used for evaluation, covering both the sentiment analysis task and the emotion detection task.
  - Datasets for sentiment analysis: the Jira, Stack Overflow, Code Review, and Java Library datasets.
  - Datasets for emotion detection: the Jira Emotion Dataset (for binary classification), the Jira Deva Dataset (for multiclass classification), the Stack Overflow Emotion Dataset (for binary classification).
code/ contains the scripts for the SEntiMoji model. The variants of SEntiMoji share the same scripts.
- SEntiMoji_script/ contains the representation learning code (Deepmoji/deepmoji), the pipeline code for training and evaluating (pipeline.py), the files mapping labels to class indexes (label2index/), and vocabulary dicts for each pre-trained representation model (vocabulary/).
- Mtest.py is responsible for the McNemar’s test.
trained_model/ contains the pre-trained embeddings, representation models, and final sentiment classifier. It contains three subfolders:
- word_embeddings/ contains the word embeddings trained on GitHub posts.
- representation_model/ contains the pre-trained representation models used for SEntiMoji (i.e., model_SEntiMoji.hdf5), SEntiMoji-G (i.e., model_SEntiMoji-G.hdf5), and SEntiMoji-T (i.e., model_SEntiMoji-T.hdf5).
⚠️ Since the models and embeddings exceed the GitHub file size limit, we use Git LFS to manage these large files. If you download the project with git clone alone, these large files are not included, so you will get an error when you load them. Download them in one of the following two ways:
1. Install Git LFS first, then run git lfs pull to download the large files.
2. Open each file on the GitHub website and click the download button to download it directly.
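Before loading a model, it can help to check whether a file is still a Git LFS pointer instead of the real data. The following is a minimal sketch: the helper names is_lfs_pointer and find_lfs_pointers are hypothetical, and the default .hdf5 suffix is an assumption taken from the model file names listed above (the embedding files may use a different extension). It relies only on the fact that a Git LFS pointer file starts with a fixed version line.

```python
from pathlib import Path

# Every Git LFS pointer file begins with this version line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` looks like a Git LFS pointer."""
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def find_lfs_pointers(root, suffixes=(".hdf5",)):
    """List files under `root` with the given suffixes that are still pointers.

    The default suffix is an assumption based on the model file names above.
    """
    return sorted(
        str(p)
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in suffixes and is_lfs_pointer(p)
    )
```

If find_lfs_pointers("trained_model") returns a non-empty list, those files still need to be fetched with git lfs pull or downloaded manually.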
result/ contains the detailed results of five-fold cross-validation (summarized in the sheets of result_5fold_sentiment.xlsx and result_5fold_emotion.xlsx) instead of the mean performance shown in the paper. In addition, for each dataset, we show the predicted labels for all folds. In each result file, the first column is the text, the second column is the predicted label, and the third column is the ground truth label.
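The three-column layout described above (text, predicted label, ground truth label) is enough to recompute per-fold metrics. The sketch below assumes the rows of one result file have already been loaded into (text, predicted, gold) tuples; how to load them depends on the file format, which is not specified here. The function name score_rows is hypothetical, and the metric definitions are the usual per-class precision, recall and F1, not necessarily the exact procedure used in the paper.

```python
from collections import Counter


def score_rows(rows):
    """Compute accuracy and per-class (precision, recall, F1).

    `rows` is an iterable of (text, predicted_label, gold_label) tuples.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    total = correct = 0
    for _text, predicted, gold in rows:
        total += 1
        if predicted == gold:
            correct += 1
            tp[gold] += 1
        else:
            fp[predicted] += 1  # predicted class gains a false positive
            fn[gold] += 1       # gold class gains a false negative
    report = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        report[label] = (precision, recall, f1)
    accuracy = correct / total if total else 0.0
    return accuracy, report
```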

Running SEntiMoji

We assume that you're using Python 3.6 with pip installed. As a backend you need to install either Theano (version 0.9+) or TensorFlow (version 1.3+). To install the dependencies, open the command line and run: pip install -r requirements.txt
To train a sentiment classifier or emotion detector based on the SEntiMoji model (or one of its variants), run the scripts in the code/SEntiMoji_script directory.

Train model on provided benchmark datasets.
- For the sentiment classification task, you have to specify the pretrained model name, task and dataset name on the command line. For example, to train and evaluate the classifier on the Jira dataset using the SEntiMoji representation model, run: python pipeline.py --model SEntiMoji --task sentiment --benchmark_dataset_name Jira
- For the emotion detection task, you have to specify the pretrained model name, task, dataset name and emotion type on the command line. For example, to train and evaluate the classifier on the Jira LOVE dataset using the SEntiMoji representation model, run: python pipeline.py --model SEntiMoji --task emotion --benchmark_dataset_name Jira --emotion_type love
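When comparing several representation models on the same benchmark, these commands can be assembled programmatically. The following is a minimal sketch that uses only the flags shown above. The helper name build_pipeline_command is hypothetical, and passing SEntiMoji-T and SEntiMoji-G to --model is an assumption based on the variant names, so check python pipeline.py --help for the accepted values first.

```python
def build_pipeline_command(model, task, dataset, emotion_type=None):
    """Return the argv list for one pipeline.py run (hypothetical helper)."""
    if task not in ("sentiment", "emotion"):
        raise ValueError("task must be 'sentiment' or 'emotion'")
    command = [
        "python", "pipeline.py",
        "--model", model,
        "--task", task,
        "--benchmark_dataset_name", dataset,
    ]
    if task == "emotion":
        # The emotion task additionally needs the emotion type, e.g. "love".
        if emotion_type is None:
            raise ValueError("the emotion task also needs an emotion type")
        command += ["--emotion_type", emotion_type]
    return command


if __name__ == "__main__":
    # Variant names passed to --model are an assumption; see the lead-in.
    for model in ("SEntiMoji", "SEntiMoji-T", "SEntiMoji-G"):
        print(" ".join(build_pipeline_command(model, "sentiment", "Jira")))
```

Each returned list can be passed to subprocess.run with cwd set to the code/SEntiMoji_script directory.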
Train model on your own dataset.
- Your training data file should contain two columns separated by \t: one for the text and the other for the class label. Create a new folder to hold the training data file.
- For training, you have to specify the pretrained model name, task, data directory and data filename on the command line. For example, if you save the training data in ./data/train.txt and want to train and evaluate a classifier using the SEntiMoji representation model, run: python pipeline.py --use_own_dataset --model SEntiMoji --task sentiment --own_dataset_dir ./data/ --own_dataset_file train.txt
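A two-column, tab-separated file like the one described above can be produced with a few lines of Python. This is a minimal sketch, assuming the text comes first and the label second (the column order is an assumption) and that there is no header row; the helper name write_train_file is hypothetical. Tabs and line breaks inside a text are collapsed to single spaces so that each sample stays on one line with exactly one separator.

```python
import os


def write_train_file(samples, directory, filename):
    """Write (text, label) pairs as tab-separated lines and return the path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, "w", encoding="utf-8") as handle:
        for text, label in samples:
            # Collapse tabs and newlines so they cannot break the layout.
            clean = " ".join(str(text).split())
            handle.write(f"{clean}\t{label}\n")
    return path


if __name__ == "__main__":
    examples = [
        ("Thanks, this fix works great!", "positive"),
        ("This API is\tbroken again.", "negative"),
    ]
    print(write_train_file(examples, "./data", "train.txt"))
```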

If you want to try another model or dataset, just change the command line arguments. Run python pipeline.py --help to see detailed descriptions of the command line arguments.

To classify new data with a trained model, run the scripts in the code/SEntiMoji_script directory. You have to specify the path of the trained model, the path of your test data, the number of classes and the name of the pretrained model you used for training. Run: python classify.py --model_path path_of_obtained_model --test_file_path path_of_test_file --nb_classes number_of_classes --pretrained_model {SEntiMoji,SEntiMoji-T,SEntiMoji-G}

⚠️ Please note that the number of classes and the name of the pretrained model must match the settings used for training. For example, if the pretrained model you used in training is SEntiMoji and your training data is for binary classification, you should set pretrained_model=SEntiMoji and nb_classes=2.

If you want to perform McNemar's test to compare the results of two classifiers, you can run Mtest.py in the code/ directory. You have to specify the names of the two methods, the dataset name and the task name as command line arguments.

For the sentiment classification task: for example, to run McNemar's test on the results of SEntiMoji and SEntiMoji-T on the Jira dataset, run: python Mtest.py --methodA SEntiMoji --methodB SEntiMoji-T --dataset Jira --task sentiment
For the emotion detection task: for example, to run McNemar's test on the results of SEntiMoji and SEntiMoji-T on the Jira LOVE dataset, run: python Mtest.py --methodA SEntiMoji --methodB SEntiMoji-T --dataset Jira --task emotion --emotion_type love
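McNemar's test compares two classifiers evaluated on the same items by looking only at the items where exactly one of them is correct. The sketch below shows the standard chi-square form with continuity correction. It illustrates the test itself and is not necessarily how Mtest.py computes it; the function name mcnemar is hypothetical.

```python
import math


def mcnemar(gold, pred_a, pred_b):
    """Return (b, c, statistic, p_value) for two classifiers on the same items.

    b counts items where A is correct and B is wrong; c counts the reverse.
    """
    if not (len(gold) == len(pred_a) == len(pred_b)):
        raise ValueError("all three label sequences must have the same length")
    b = sum(1 for g, a, o in zip(gold, pred_a, pred_b) if a == g and o != g)
    c = sum(1 for g, a, o in zip(gold, pred_a, pred_b) if a != g and o == g)
    if b + c == 0:
        # The classifiers never disagree in correctness: no evidence either way.
        return b, c, 0.0, 1.0
    # Chi-square statistic with continuity correction, 1 degree of freedom.
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 d.o.f.: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return b, c, statistic, p_value
```

A small p-value suggests that the two classifiers make errors on different items more often than chance would explain. When b + c is small, an exact binomial test is usually preferred over the chi-square approximation.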

If you want to try another model or dataset, just change the command line arguments. Run python Mtest.py --help to see detailed descriptions of the command line arguments.

Declaration

We upload all the benchmark datasets to this repository for convenience. As they were not generated and released by us, we do not claim any rights on them. If you use any of them, please make sure you comply with the licenses under which they were released, and consider citing the original papers. The scripts of the baseline methods (SentiStrength, SentiStrength-SE, SentiCR, Senti4SD, EmoTxt, DEVA) are not included in this repository. You can download them from their respective homepages.
The large-scale Tweets used to train DeepMoji are not released by Felbo et al. due to licensing restrictions. Therefore, this repository includes the pre-trained DeepMoji model that they released rather than the raw Tweet corpus.
The large-scale GitHub data were collected by Lu et al. and have not been released publicly. With their consent, this repository releases only the processed emoji-texts used to train our model, to increase reproducibility and replicability.

License

This code and the pretrained models are licensed under the MIT license (https://mit-license.org).

Citation

Please consider citing the following paper when using our code or pretrained models for your application.

@inproceedings{chencao2019,
  title={SEntiMoji: An Emoji-Powered Learning Approach for Sentiment Analysis in Software Engineering},
  author={Zhenpeng Chen and Yanbin Cao and Xuan Lu and Qiaozhu Mei and Xuanzhe Liu},
  booktitle={Proceedings of the 2019 ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE'19},
  year={2019}
}
