oliverguhr / german-sentiment

Licence: MIT license
A data set and model for German sentiment classification.

Broad-Coverage German Sentiment Classification Model for Dialog Systems

This repository contains the code and data for the paper "Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems", published at LREC 2020.

Usage

If you would like to use the models in your own projects, please head over to the german-sentiment-lib repository. It contains a Python package that provides an easy-to-use interface.

Data Sets

We trained our models on a combination of self-created and existing data sets to cover a broad variety of topics and domains.

| Data Set | Positive Samples | Neutral Samples | Negative Samples | Total Samples |
| --- | ---: | ---: | ---: | ---: |
| Emotions | 188 | 28 | 1,090 | 1,306 |
| filmstarts | 40,049 | 0 | 15,610 | 55,659 |
| GermEval-2017 | 1,371 | 16,309 | 5,845 | 23,525 |
| holidaycheck | 3,135,449 | 0 | 388,744 | 3,524,193 |
| Leipzig Wikipedia Corpus 2016 | 0 | 1,000,000 | 0 | 1,000,000 |
| PotTS | 3,448 | 2,487 | 1,569 | 7,504 |
| SB10k | 1,716 | 4,628 | 1,130 | 7,474 |
| SCARE | 538,103 | 0 | 197,279 | 735,382 |
| Sum | 3,720,324 | 1,023,452 | 611,267 | 5,355,043 |
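
The combined data set is heavily skewed towards positive samples. As a quick check of the arithmetic, the following sketch recomputes the totals and class shares from the numbers in the table above. The `counts` dictionary is only an illustration and is not part of the repository.

```python
# Sample counts copied from the table above: (positive, neutral, negative).
counts = {
    "Emotions": (188, 28, 1_090),
    "filmstarts": (40_049, 0, 15_610),
    "GermEval-2017": (1_371, 16_309, 5_845),
    "holidaycheck": (3_135_449, 0, 388_744),
    "Leipzig Wikipedia Corpus 2016": (0, 1_000_000, 0),
    "PotTS": (3_448, 2_487, 1_569),
    "SB10k": (1_716, 4_628, 1_130),
    "SCARE": (538_103, 0, 197_279),
}

# Sum each class over all data sets.
positive, neutral, negative = (sum(column) for column in zip(*counts.values()))
total = positive + neutral + negative

# Print the absolute count and the share of each class.
for name, value in [("positive", positive), ("neutral", neutral), ("negative", negative)]:
    print(f"{name:>8}: {value:>9,} ({value / total:.1%})")
print(f"{'total':>8}: {total:>9,}")
```

The shares come out at roughly 69% positive, 19% neutral and 11% negative, which is why results are reported for both a balanced and an unbalanced variant of the data.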

The data sets, excluding SCARE, can be downloaded from here. For legal reasons, we cannot distribute the SCARE data set ourselves. If you are interested in this data, please obtain it from its authors and integrate it using our provided scripts to create the combined data set.

The unprocessed data set can be downloaded from here (1.5 GB). It contains all hotel and movie reviews, plus a set of neutral German texts.

The Filmstarts data set consists of 71,229 user-written movie reviews in German. We collected this data from the German website filmstarts.de using a web crawler. Users can rate a movie with 0.5 to 5 stars. With 40,049 documents, most reviews in this data set are positive; only 15,610 are negative. All data was downloaded between the 15th and 16th of October 2018 and contains reviews up to this date.

The holidaycheck data set contains hotel reviews from the German website holidaycheck.de. Users of this website can write a general review and rate their hotel. Additionally, they can review and rate six specific aspects: location & surroundings, rooms, service, cuisine, sports & entertainment, and hotel. A full review therefore contains seven texts with their associated star ratings, which range from zero to six stars. In total, we downloaded 4,832,001 text-rating pairs for hotels from ten destinations: Egypt, Bulgaria, China, Greece, India, Majorca, Mexico, Tenerife, Thailand and Tunisia. The reviews were obtained from November to December 2018 and contain reviews up to this date. After removing all reviews with no stars or four stars, the data set contains 3,524,193 text-rating pairs.
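
The filtering step described above can be sketched as follows. This is a minimal illustration, not the repository's preprocessing script: the function name, the tuple layout and the example reviews are assumptions. The mapping from the remaining star ratings to sentiment labels is left out because it is not specified here.

```python
from typing import Iterable, List, Tuple

# Ratings that are removed from the holidaycheck data, as described above.
EXCLUDED_STARS = {0, 4}

def drop_excluded_ratings(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Keep only text-rating pairs whose rating is neither 0 nor 4 stars."""
    return [(text, stars) for text, stars in pairs if stars not in EXCLUDED_STARS]

# Made-up example reviews, not taken from the data set.
reviews = [
    ("Tolles Hotel, sehr freundliches Personal.", 6),
    ("Zimmer war in Ordnung.", 4),
    ("", 0),
    ("Das Essen war leider kalt.", 2),
]
filtered = drop_excluded_ratings(reviews)
print(filtered)
```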

The Emotions data set contains a list of utterances that we recorded during "Wizard of Oz" experiments with service robots. We noticed that people used insults while talking to the robot. Since most of these words are filtered on social media and review platforms, other data sets do not contain them. We used synonym replacement as a data augmentation technique to generate new utterances based on our recordings. Besides negative feedback, this data set also contains positive feedback and phrases about sexual identity and orientation that were labelled as neutral. Overall, this data set contains 1,306 examples.
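
To illustrate the idea of synonym replacement, here is a minimal sketch. It is not the augmentation code used for this data set: the `augment` function, the hand-written synonym table and the example utterance are all hypothetical.

```python
from itertools import product
from typing import Dict, List

def augment(utterance: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Return all variants of `utterance` obtained by swapping words for synonyms."""
    # For every word, list the word itself followed by its known synonyms.
    options = [[word] + synonyms.get(word.lower(), []) for word in utterance.split()]
    # Combine one option per position; the first combination is the original.
    variants = [" ".join(choice) for choice in product(*options)]
    return variants[1:]

# Hypothetical synonym table and utterance, for illustration only.
synonyms = {"super": ["toll", "klasse"]}
new_utterances = augment("Das hast du super gemacht", synonyms)
print(new_utterances)
```

Each generated utterance keeps the label of the recording it was derived from, so a handful of recordings can yield many training examples.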

Trained Models

You can download our trained models for FastText and BERT here (6 GB). With these models we achieved the following results:

BERT

| Data Set | Balanced | Unbalanced |
| --- | ---: | ---: |
| SCARE | 0.9409 | 0.9436 |
| GermEval-2017 | 0.7727 | 0.7885 |
| holidaycheck | 0.9552 | 0.9775 |
| SB10k | 0.6930 | 0.6720 |
| filmstarts | 0.9062 | 0.9219 |
| PotTS | 0.6423 | 0.6502 |
| emotions | 0.9652 | 0.9621 |
| Leipzig Wikipedia Corpus 2016 | 0.9983 | 0.9981 |
| combined | 0.9636 | 0.9744 |

Micro-averaged F1 scores for BERT trained on the balanced and unbalanced data sets.

FastText

| Data Set | Balanced | Unbalanced |
| --- | ---: | ---: |
| SCARE | 0.9071 | 0.9083 |
| GermEval-2017 | 0.6970 | 0.6980 |
| holidaycheck | 0.9296 | 0.9639 |
| SB10k | 0.6862 | 0.6213 |
| filmstarts | 0.8206 | 0.8432 |
| PotTS | 0.5268 | 0.5416 |
| emotions | 0.9913 | 0.9773 |
| Leipzig Wikipedia Corpus 2016 | 0.9883 | 0.9886 |
| combined | 0.9405 | 0.9573 |

Micro-averaged F1 scores for FastText trained on the balanced and unbalanced data sets.
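
For readers unfamiliar with the metric, the sketch below shows how a micro-averaged F1 score can be computed: true positives, false positives and false negatives are pooled over all classes before precision and recall are calculated. This is not the evaluation script from this repository, and the labels and predictions are made up.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1 once."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, prediction in zip(y_true, y_pred):
            if prediction == label and truth == label:
                tp += 1  # correctly predicted this class
            elif prediction == label:
                fp += 1  # predicted this class, but the truth differs
            elif truth == label:
                fn += 1  # missed an example of this class
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up gold labels and predictions, for illustration only.
y_true = ["positive", "negative", "neutral", "negative", "positive"]
y_pred = ["positive", "neutral", "neutral", "negative", "negative"]

score = micro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"micro F1 = {score:.4f}, accuracy = {accuracy:.4f}")
```

When every text receives exactly one label, as in this example, the micro-averaged F1 score equals the accuracy, so frequent classes dominate the score on unbalanced data.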

Setup

We recommend installing this project in a Python virtual environment. To create and activate the virtual environment, run these three commands:

pip3 install virtualenv
python3 -m venv ./venv
source venv/bin/activate

Make sure that you are using a recent Python version by running "python -V". You should be running at least Python 3.6.

python -V
> Python 3.6.8
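
If you prefer to check the version from within Python, for example at the top of your own script, a small guard like the following can be used. It is only a sketch and not part of this repository; the names `MIN_VERSION` and `is_supported` are made up for this example.

```python
import sys

MIN_VERSION = (3, 6)  # minimum version stated in this README

def is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given interpreter version is at least MIN_VERSION."""
    return tuple(version_info[:2]) >= MIN_VERSION

# Stop early with a readable message instead of failing later on a syntax error.
if not is_supported():
    raise SystemExit(
        f"Python {MIN_VERSION[0]}.{MIN_VERSION[1]}+ is required, "
        f"found {sys.version.split()[0]}"
    )
print("Python version OK:", sys.version.split()[0])
```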

Next, install the needed python packages.

pip install -r requirements.txt

To reproduce the results, you need to download our models and data. We provide a script that downloads all required files:

sh download-models-and-data.sh

Paper & Citation

You can read the paper here. Please cite us if you find this useful.

@InProceedings{guhr-EtAl:2020:LREC,
  author    = {Guhr, Oliver  and  Schumann, Anne-Kathrin  and  Bahrmann, Frank  and  Böhme, Hans Joachim},
  title     = {Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems},
  booktitle      = {Proceedings of The 12th Language Resources and Evaluation Conference},
  month          = {May},
  year           = {2020},
  address        = {Marseille, France},
  publisher      = {European Language Resources Association},
  pages     = {1620--1625},
  url       = {https://www.aclweb.org/anthology/2020.lrec-1.202/}
}

If you use the combined data set for your work, you can use this list to cite all the contained data sets:

@LanguageResource{sanger_scare_2016,
	address = {Portorož, Slovenia},
	title = {{SCARE} ― {The} {Sentiment} {Corpus} of {App} {Reviews} with {Fine}-grained {Annotations} in {German}},
	url = {https://www.aclweb.org/anthology/L16-1178},	
	urldate = {2019-11-07},
	booktitle = {Proceedings of the {Tenth} {International} {Conference} on {Language} {Resources} and {Evaluation} ({LREC}'16)},
	publisher = {European Language Resources Association (ELRA)},
	author = {Sänger, Mario and Leser, Ulf and Kemmerer, Steffen and Adolphs, Peter and Klinger, Roman},
	year = {2016},
	pages = {1114--1121}
}

@LanguageResource{sidarenka_potts:_2016,
	address = {Paris, France},
	title = {{PotTS}: {The} {Potsdam} {Twitter} {Sentiment} {Corpus}},
	isbn = {978-2-9517408-9-1},
	language = {english},
	booktitle = {Proceedings of the {Tenth} {International} {Conference} on {Language} {Resources} and {Evaluation} ({LREC} 2016)},
	publisher = {European Language Resources Association (ELRA)},
	author = {Sidarenka, Uladzimir},
	editor = {Calzolari, Nicoletta and Choukri, Khalid and Declerck, Thierry and Goggi, Sara and Grobelnik, Marko and Maegaard, Bente and Mariani, Joseph and Mazo, Helene and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios},
	year = {2016},
	note = {event-place: Portorož, Slovenia}
}

@LanguageResource{cieliebak_twitter_2017,
	address = {Valencia, Spain},
	title = {A {Twitter} {Corpus} and {Benchmark} {Resources} for {German} {Sentiment} {Analysis}},
	url = {https://www.aclweb.org/anthology/W17-1106},
	doi = {10.18653/v1/W17-1106},
	urldate = {2019-11-07},
	booktitle = {Proceedings of the {Fifth} {International} {Workshop} on {Natural} {Language} {Processing} for {Social} {Media}},
	publisher = {Association for Computational Linguistics},
	author = {Cieliebak, Mark and Deriu, Jan Milan and Egger, Dominic and Uzdilli, Fatih},
	month = apr,
	year = {2017},
	pages = {45--51}
}

@LanguageResource{wojatzki_germeval_2017,
	address = {Berlin, Germany},
	title = {{GermEval} 2017: {Shared} {Task} on {Aspect}-based {Sentiment} in {Social} {Media} {Customer} {Feedback}},
	booktitle = {Proceedings of the {GermEval} 2017 – {Shared} {Task} on {Aspect}-based {Sentiment} in {Social} {Media} {Customer} {Feedback}},
	author = {Wojatzki, Michael and Ruppert, Eugen and Holschneider, Sarah and Zesch, Torsten and Biemann, Chris},
	year = {2017},
	pages = {1--12}	
}

@inproceedings{goldhahn-etal-2012-building,
    title = "Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages",
    author = "Goldhahn, Dirk  and
      Eckart, Thomas  and
      Quasthoff, Uwe",
    booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
    month = may,
    year = "2012",
    address = "Istanbul, Turkey",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/327_Paper.pdf",
    pages = "759--765"
}