
Helsinki-NLP / XED

Licence: other
XED multilingual emotion datasets

Programming Languages

  • Jupyter Notebook
  • Perl
  • Makefile

Projects that are alternatives of or similar to XED

sklearn-audio-classification
An in-depth analysis of audio classification on the RAVDESS dataset. Feature engineering, hyperparameter optimization, model evaluation, and cross-validation with a variety of ML techniques and MLP
Stars: ✭ 31 (-8.82%)
Mutual labels:  classification, emotion-detection, emotion-recognition
hfusion
Multimodal sentiment analysis using hierarchical fusion with context modeling
Stars: ✭ 42 (+23.53%)
Mutual labels:  sentiment-analysis, emotion-detection, emotion-recognition
Emotion and Polarity SO
An emotion classifier of text containing technical content from the SE domain
Stars: ✭ 74 (+117.65%)
Mutual labels:  sentiment-analysis, emotion-detection, emotion-recognition
Mem absa
Aspect Based Sentiment Analysis using End-to-End Memory Networks
Stars: ✭ 189 (+455.88%)
Mutual labels:  sentiment-analysis, classification
Machine Learning From Scratch
Succinct Machine Learning algorithm implementations from scratch in Python, solving real-world problems (Notebooks and Book). Examples of Logistic Regression, Linear Regression, Decision Trees, K-means clustering, Sentiment Analysis, Recommender Systems, Neural Networks and Reinforcement Learning.
Stars: ✭ 42 (+23.53%)
Mutual labels:  sentiment-analysis, classification
Deep Atrous Cnn Sentiment
Deep-Atrous-CNN-Text-Network: End-to-end word level model for sentiment analysis and other text classifications
Stars: ✭ 64 (+88.24%)
Mutual labels:  sentiment-analysis, classification
awesome-text-classification
Text classification meets word embeddings.
Stars: ✭ 27 (-20.59%)
Mutual labels:  sentiment-analysis, classification
Text Cnn Tensorflow
Convolutional Neural Networks for Sentence Classification (TextCNN), implemented in TensorFlow
Stars: ✭ 232 (+582.35%)
Mutual labels:  sentiment-analysis, classification
Text tone analyzer
A system that analyzes the sentiment of texts and utterances.
Stars: ✭ 15 (-55.88%)
Mutual labels:  sentiment-analysis, emotion-detection
STEP
Spatial Temporal Graph Convolutional Networks for Emotion Perception from Gaits
Stars: ✭ 39 (+14.71%)
Mutual labels:  emotion-detection, emotion-recognition
converse
Conversational text Analysis using various NLP techniques
Stars: ✭ 147 (+332.35%)
Mutual labels:  sentiment-analysis, emotion-recognition
Ml Classify Text Js
Machine learning based text classification in JavaScript using n-grams and cosine similarity
Stars: ✭ 38 (+11.76%)
Mutual labels:  sentiment-analysis, classification
DeepSentiPers
Repository for the experiments described in the paper named "DeepSentiPers: Novel Deep Learning Models Trained Over Proposed Augmented Persian Sentiment Corpus"
Stars: ✭ 17 (-50%)
Mutual labels:  sentiment-analysis, classification
ntua-slp-semeval2018
Deep-learning models of NTUA-SLP team submitted in SemEval 2018 tasks 1, 2 and 3.
Stars: ✭ 79 (+132.35%)
Mutual labels:  sentiment-analysis, emotion-recognition
textlytics
Text processing library for sentiment analysis and related tasks
Stars: ✭ 25 (-26.47%)
Mutual labels:  sentiment-analysis, classification
CLUEmotionAnalysis2020
CLUE Emotion Analysis Dataset: a fine-grained sentiment analysis dataset
Stars: ✭ 3 (-91.18%)
Mutual labels:  sentiment-analysis, emotion-recognition
emotic
PyTorch implementation of Emotic CNN methodology to recognize emotions in images using context information.
Stars: ✭ 57 (+67.65%)
Mutual labels:  emotion-detection, emotion-recognition
Hemuer
An AI Tool to record expressions of users as they watch a video and then visualize the funniest parts of it!
Stars: ✭ 22 (-35.29%)
Mutual labels:  emotion-detection, emotion-recognition
COVID-19-Tweet-Classification-using-Roberta-and-Bert-Simple-Transformers
Rank 1 / 216
Stars: ✭ 24 (-29.41%)
Mutual labels:  sentiment-analysis, classification
AIML-Human-Attributes-Detection-with-Facial-Feature-Extraction
This is a Human Attributes Detection program with facial feature extraction. It detects facial coordinates using a FaceNet model and uses an MXNet facial attribute extraction model to extract 40 types of facial attributes. The solution also detects emotion, age and gender along with the facial attributes.
Stars: ✭ 48 (+41.18%)
Mutual labels:  emotion-detection, emotion-recognition

XED

This is the XED dataset. The dataset consists of emotion-annotated movie subtitles from OPUS. The annotations use Plutchik's 8 core emotions, and the data is multilabel. The original annotations were sourced mainly for English and Finnish, with the rest created via annotation projection to aligned subtitles in 41 additional languages; 31 of those languages are included in the final dataset (more than 950 annotated subtitle lines). The dataset is an ongoing project with forthcoming additions such as machine-translated datasets. Please let us know if you find any errors or come across other issues with the datasets!

Format

The files are formatted as follows:

sentence1\tlabel1,label2
sentence2\tlabel2,label3,label4...

The numbers indicate the emotions in ascending alphabetical order: anger:1, anticipation:2, disgust:3, fear:4, joy:5, sadness:6, surprise:7, trust:8, with neutral:0 where applicable. Note that if you use our BERT code with the original 1-8 labels, it re-maps them to 0-7 by switching trust from 8 to 0.
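
For illustration, here is a minimal Python sketch, not part of the official XED code, that reads a file in this format and maps the numeric labels back to emotion names (the helper names are hypothetical, and the label re-mapping mirrors the trust:8->0 switch described above):

    # Minimal sketch (not part of the official XED code): read a file in the
    # tab-separated format above and map numeric labels to emotion names.
    EMOTIONS = {
        0: "neutral", 1: "anger", 2: "anticipation", 3: "disgust", 4: "fear",
        5: "joy", 6: "sadness", 7: "surprise", 8: "trust",
    }

    def load_xed(path):
        """Yield (sentence, [emotion names]) pairs from an XED-style TSV file."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    continue
                sentence, labels = line.split("\t")
                yield sentence, [EMOTIONS[int(x)] for x in labels.split(",")]

    def to_bert_label(label):
        """Re-map the original 1-8 labels to 0-7 as the BERT code does (trust: 8 -> 0)."""
        return 0 if label == 8 else label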

Metadata can be found in the metadata file and the projection "pairs" files. More detailed metadata is available on the OPUS website; we recommend using OPUS Tools. Compatible augmentation data created by expert annotators is available for a selection of languages in the following repos:

NB! The number of annotated subtitle lines is not the same as listed in the original paper: the paper reports the number of annotations, not the number of lines with annotations, which is how the files here are organized.

Evaluations

We used BERT to test the robustness of the annotations.
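
As a rough sketch of how such a multilabel BERT classifier can be set up with the Hugging Face transformers library (this is not the BERT code referenced above; the model name, example sentence, and labels are illustrative assumptions):

    # Hedged sketch of multilabel fine-tuning on XED-style data with Hugging Face
    # transformers; NOT the authors' released code, and the model choice
    # ("bert-base-multilingual-cased") is an illustrative assumption.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased",
        num_labels=8,
        problem_type="multi_label_classification",  # uses BCE-with-logits loss
    )

    sentences = ["I can't believe you did that!"]
    # multi-hot vector over the 8 emotions (after the trust 8 -> 0 re-mapping)
    labels = torch.tensor([[1., 0., 0., 0., 0., 0., 0., 1.]])

    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    out = model(**batch, labels=labels)
    out.loss.backward()  # plug this into a standard training loop or the Trainer API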

English annotated data

Number of annotations: 24164 + 9384 neutral
Number of unique data points: 17530 + 6420 neutral
Number of emotions: 8 (+pos, neg, neu)
Number of annotators: 108 (63 active)
DATA                                      F1      ACCURACY
English without NER, BERT                 0.530   0.538
English with NER, BERT                    0.536   0.544
English NER with neutral, BERT            0.467   0.529
English NER binary with surprise, BERT    0.679   0.765
English NER true binary, BERT             0.838   0.840
English NER, one-vs-rest Linear SVC       0.502   0.650-0.789 / class
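
For reference, the one-vs-rest Linear SVC baseline above could be approximated with a scikit-learn pipeline along these lines (the TF-IDF features and toy examples are assumptions, not the authors' exact setup):

    # Rough sketch of a one-vs-rest Linear SVC baseline for multilabel emotion
    # classification; feature extraction and the toy data are illustrative
    # assumptions, not the authors' exact pipeline.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.svm import LinearSVC

    sentences = ["I can't believe you did that!", "Everything is going to be fine."]
    labels = [[1, 7], [5, 8]]  # anger+surprise, joy+trust (original 1-8 scheme)

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(labels)  # multilabel indicator matrix

    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        OneVsRestClassifier(LinearSVC()),
    )
    clf.fit(sentences, Y)
    print(mlb.inverse_transform(clf.predict(["You did what?!"])))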

Multilingual projections

Results for the other languages with more than 950 lines, using SVM (the 1label to 4+labels columns give the percentage of lines that carry that many labels):

LANG SIZE AVG_LEN ANGER ANTICIP. DISGUST FEAR JOY SADNESS SURPRISE TRUST 1label 2labels 3labels 4+labels F1_SVM
AR 3590 30.02 1012 839 478 565 561 536 615 589 65.01 26.94% 6.74% 1.31% 0.5729
BG 6974 41.3 1923 1630 891 1051 1174 1112 1166 1239 64.01 27.89% 6.62% 1.48% 0.6069
BR 12295 38.49 3228 2846 1641 1821 2128 2025 2121 2098 64.69 27.02% 6.66% 1.63% 0.6726
BS 2443 33.13 632 571 294 367 428 394 397 399 65.98 26.65% 6.47% 0.9% 0.5854
CN 1395 10.92 315 315 140 180 288 221 242 266 66.31 27.46% 5.16% 1.08% 0.5004
CS 6511 29.94 1728 1615 807 1035 1045 1011 1110 1091 64.64 27.42% 6.63% 1.31% 0.6263
DA 1838 31.03 447 472 193 218 350 282 294 351 66.59 26.17% 6.2% 1.03% 0.5989
DE 5503 50.24 1492 1304 742 790 938 889 905 904 64.96 27.11% 6.6% 1.33% 0.6059
EL 8083 35.22 2238 1956 1070 1162 1369 1273 1345 1367 64.25 27.58% 6.73% 1.45% 0.6192
ES 11303 35.69 3007 2631 1482 1765 1902 1810 1959 1924 64.52 27.22% 6.59% 1.66% 0.676
ET 1476 28.66 370 396 144 218 280 210 222 255 65.58 27.57% 6.17% 0.68% 0.5449
FI 8289 29.11 2175 2010 1014 1281 1503 1243 1383 1447 64.3 27.8% 6.38% 1.52% 0.5859
FR 7306 41.27 1946 1726 994 1127 1256 1200 1198 1259 63.63 28.02% 6.86% 1.49% 0.6257
HE 4449 28.97 1244 1078 551 658 791 681 754 783 63.34 28.37% 6.74% 1.55% 0.598
HR 5941 31.7 1494 1408 724 978 1029 947 991 1052 64.13 28.24% 6.26% 1.36% 0.6503
HU 5777 32.07 1539 1378 715 925 937 899 989 1028 64.19 27.77% 6.63% 1.42% 0.5978
IS 977 29.55 236 230 121 124 175 168 134 180 66.84 27.12% 5.32% 0.72% 0.5416
IT 6552 44.65 1783 1514 887 1092 1011 1122 1065 1104 63.58 28.4% 6.59% 1.42% 0.6907
MK 300 28.9 58 100 33 36 61 53 64 52 58.67 31.0% 9.67% 0.67% 0.4961
NL 5333 33.93 1392 1337 658 822 878 857 942 927 64.22 27.21% 6.86% 1.71% 0.614
NO 4257 31.1 1051 1029 500 584 822 678 731 712 65.09 27.93% 5.68% 1.29% 0.5771
PL 7179 32.44 1966 1707 964 1121 1206 1119 1199 1220 64.03 27.72% 6.69% 1.56% 0.6233
PT 7220 33.72 1890 1710 906 1101 1260 1210 1234 1257 63.85 27.87% 6.86% 1.43% 0.6203
RO 9474 36.88 2543 2181 1258 1433 1563 1568 1579 1608 64.9 27.07% 6.58% 1.45% 0.6387
RU 2377 32.45 564 590 268 423 376 395 416 405 64.7 27.6% 6.6% 1.09% 0.5976
SK 975 59.82 256 234 99 168 168 153 152 159 65.44 28.0% 5.54% 1.03% 0.5305
SL 2680 29.19 679 694 278 402 456 416 481 419 65.52 27.61% 5.6% 1.27% 0.6015
SR 8984 31.69 2365 2163 1131 1282 1652 1399 1519 1565 64.3 27.58% 6.72% 1.39% 0.6566
SV 4905 44.34 1273 1160 591 691 815 831 866 827 65.3 27.01% 6.48% 1.2% 0.6218
TR 9202 35.95 2423 2243 1212 1339 1610 1469 1589 1628 63.64 28.03% 6.71% 1.63% 0.608
VI 956 34.53 245 224 128 141 187 150 144 178 63.28 28.56% 7.11% 1.05% 0.5594

Publications

You can read more about it in the following paper:

Öhman, E., Pàmies, M., Kajava, K. and Tiedemann, J., 2020. XED: A Multilingual Dataset for Sentiment Analysis and Emotion Detection. In Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020).

@inproceedings{ohman2020xed,
  title={XED: A Multilingual Dataset for Sentiment Analysis and Emotion Detection},
  author={{\"O}hman, Emily and P{\`a}mies, Marc and Kajava, Kaisla and Tiedemann, J{\"o}rg},
  booktitle={The 28th International Conference on Computational Linguistics (COLING 2020)},
  year={2020}
}

Please cite this paper if you use the dataset.

Some preliminary and related work has also been discussed in the following papers:

  • Öhman, E., Kajava, K., Tiedemann, J. and Honkela, T., 2018, October. Creating a dataset for multilingual fine-grained emotion-detection using gamification-based annotation. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (pp. 24-30).
  • Öhman, E.S. and Kajava, K.S., 2018. Sentimentator: Gamifying fine-grained sentiment annotation. Digital Humanities in the Nordic Countries 2018.
  • Kajava, K.S., Öhman, E.S., Hui, P. and Tiedemann, J., 2020. Emotion Preservation in Translation: Evaluating Datasets for Annotation Projection. In Digital Humanities in the Nordic Countries 2020. CEUR Workshop Proceedings.
  • Öhman, E., 2020. Challenges in Annotation: Annotator Experiences from a Crowdsourced Emotion Annotation Task. In Digital Humanities in the Nordic Countries 2020. CEUR Workshop Proceedings.

If you publish something using our dataset, feel free to contact us and we can add a link to your publication in this repo.

License: Creative Commons Attribution 4.0 International License (CC-BY)
