sarnthil / Unify Emotion Datasets

License: MIT
A Survey and Experiments on Annotated Corpora for Emotion Classification in Text

Programming language: Python

Requirements:

System packages

  • Python 3.6+
  • git

Installing Python dependencies

  • pip3 install requests sh click
  • pip3 install regex docopt numpy scikit-learn scipy, only needed if you want to use classify_xvsy_logreg.py (note that the PyPI package is named scikit-learn, not sklearn)
  • git clone git@github.com:sarnthil/unify-emotion-datasets.git

This will create a new folder called unify-emotion-datasets.

Running the two scripts

First run the script that downloads all obtainable datasets:

  • cd unify-emotion-datasets # go inside the repository
  • python3 download_datasets.py

Please read the instructions carefully: you will be asked to read and confirm the license and terms of use of each dataset. If a dataset cannot be downloaded directly, you will be given instructions on how to obtain it.

Then run the script that unifies the downloaded datasets, which will be located in unify-emotion-datasets/datasets/:

python3 create_unified_dataset.py

This will create a new file called unified-dataset.jsonl in the same folder.
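Each line of unified-dataset.jsonl is one JSON object. As a minimal sketch of how to consume it, the snippet below reads a JSON-lines file with the standard library; the field names (source, emotions) are assumptions based on the jq examples later in this README, so adjust them to the actual schema:

```python
import json

def iter_instances(path):
    """Yield one JSON object per line of a JSON-lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Demonstration on inline sample lines (field names are assumptions):
sample = [
    '{"source": "tec", "emotions": {"surprise": 1.0}}',
    '{"source": "ssec", "emotions": {"surprise": null}}',
]
sources = [json.loads(line)["source"] for line in sample]
```

In practice you would call iter_instances("unified-dataset.jsonl") and filter or group the yielded objects as needed.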

We also advise you to cite the papers corresponding to the datasets you use. You can find the corresponding BibTeX entries in the file datasets/README.md or while running download_datasets.py.

Paper/Reference

An Analysis of Annotated Corpora for Emotion Classification in Text

If you plan to use this corpus, please use this citation:

@inproceedings{Bostan2018,
  author = {Bostan, Laura Ana Maria and Klinger, Roman},
  title = {An Analysis of Annotated Corpora for Emotion Classification in Text},
  booktitle = {Proceedings of the 27th International Conference on Computational Linguistics},
  year = {2018},
  publisher = {Association for Computational Linguistics},
  pages = {2104--2119},
  location = {Santa Fe, New Mexico, USA},
  url = {http://aclweb.org/anthology/C18-1179},
  pdf = {http://aclweb.org/anthology/C18-1179.pdf}
}

Experimenting with classification

If you want to reuse the code for the emotion classification task, see the script classify_xvsy_logreg.py:

python3 classify_xvsy_logreg.py --help will show you the following:

Classify using MaxEnt algorithm

Usage:
    classify_xvsy_logreg.py [options] <first> <second>
    classify_xvsy_logreg.py [options] --all-vs <second>

Options:
    -j --json=<JSONFILE>  Filename of the json file [default: ../unified.jsonl]
    -a --all-vs=<dataset>  Dataset name of the testing data
    -d --debug            Use a small word list and a fast classifier
    -o --output=<OUTPUT>  Output folder [default: .]
    -m --force-multi      Force using multi-label classification
    -k --keep-last        Quit immediately if results file found

For example, if you want to train on TEC and test on EmoInt, do the following:

python3 classify_xvsy_logreg.py -d tec emoint 

The dataset names are the ones used in the source field of unified-dataset.jsonl.
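To run several train/test combinations in a row, you can build the command lines programmatically. A small sketch, where the dataset pairs are hypothetical examples and must match values of the source field:

```python
# Hypothetical (train, test) dataset pairs; names must match the
# "source" field values in unified-dataset.jsonl.
pairs = [("tec", "emoint"), ("tec", "crowdflower")]

# Build one command line per pair, mirroring the example invocation above.
cmds = [
    ["python3", "classify_xvsy_logreg.py", "-d", train, test]
    for train, test in pairs
]
# Each command can then be executed with subprocess.run(cmd, check=True).
```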

Tip

Use jq for easy interaction with unified-dataset.jsonl.

Examples of how to use it for various tasks:

  • select the instances whose source is crowdflower or tec: jq 'select(.source=="crowdflower" or .source=="tec")' <unified-dataset.jsonl | less
  • count how often instances are annotated with high surprise, per dataset: jq 'select(.emotions.surprise > 0.5) | .source' <unified-dataset.jsonl | sort | uniq -c
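The second jq filter can also be mirrored in plain Python. A sketch under the same assumptions as the jq examples (an emotions mapping with a surprise score that may be null, and a 0.5 threshold):

```python
import json
from collections import Counter

def count_high_surprise(lines, threshold=0.5):
    """Count, per source, instances whose surprise score exceeds threshold."""
    counts = Counter()
    for line in lines:
        obj = json.loads(line)
        surprise = (obj.get("emotions") or {}).get("surprise")
        if surprise is not None and surprise > threshold:
            counts[obj["source"]] += 1
    return counts

# Demonstration on inline sample lines (schema is an assumption):
sample = [
    '{"source": "tec", "emotions": {"surprise": 1.0}}',
    '{"source": "tec", "emotions": {"surprise": 0.0}}',
    '{"source": "crowdflower", "emotions": {"surprise": 0.8}}',
]
counts = count_high_surprise(sample)
```

On the real file, pass the open file handle instead of the sample list.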