Korean HateSpeech Dataset

We provide the first human-annotated Korean corpus for toxic speech detection, together with a large unlabeled corpus.
The data consists of comments from a Korean entertainment news aggregation platform.

Dataset description

The dataset consists of three parts: 1) labeled, 2) unlabeled, and 3) news_title.

1. labeled

There are 9,381 human-labeled comments in total. They are split into a training set of 7,896 comments, a validation set of 471, and a test set of 974. (We keep the test set labels undisclosed for fair comparison of prediction models; models can be evaluated via the Kaggle submissions described later in this document.) Each comment is annotated on two aspects, the existence of social bias and hate speech, given that hate speech is closely related to bias.

For social bias, we provide gender, others, and none labels. Considering the context of Korean entertainment news, where public figures encounter stereotypes mostly intertwined with gender, we place more weight on this prevalent bias. We also add a binary label indicating whether a comment contains gender bias. For hate speech, we introduce hate, offensive, and none labels.

| comments | contain_gender_bias | bias | hate |
|---|---|---|---|
| 송중기 시대극은 믿고본다. 첫회 신선하고 좋았다. | False | none | none |
| 지현우 나쁜놈 | False | none | offensive |
| 알바쓰고많이만들면되지 돈욕심없으면골목식당왜나온겨 기댕기게나하고 산에가서팔어라 | False | none | hate |
| 설마 ㅈ 현정 작가 아니지?? | True | gender | hate |

Detailed definitions are described in the annotation guideline.
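
To sanity-check the split sizes and label distribution described above, here is a minimal sketch using the koco loader introduced in the Usage section below (assuming the library is installed):

from collections import Counter

import koco  # kocohub's dataset loader, shown in the Usage section

# Load the labeled train/dev splits and verify the sizes stated above.
train_dev = koco.load_dataset('korean-hate-speech', mode='train_dev')
print(len(train_dev['train']), len(train_dev['dev']))  # expected: 7896 471

# Tally the annotation labels in the training split.
bias_counts = Counter(ex['bias'] for ex in train_dev['train'])
hate_counts = Counter(ex['hate'] for ex in train_dev['train'])
gender_flags = Counter(ex['contain_gender_bias'] for ex in train_dev['train'])

print(bias_counts)    # keys: 'gender', 'others', 'none'
print(hate_counts)    # keys: 'hate', 'offensive', 'none'
print(gender_flags)   # keys: True, False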

2. unlabeled

Since labeled data is limited, we additionally provide 2,033,893 unlabeled comments.
This unlabeled corpus can be used in various ways: pretraining a language model, semi-supervised learning, and so on.
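
As one illustration of the pretraining use case, the sketch below dumps the unlabeled comments into a plain-text corpus file, one comment per line (the file name and format are our own choices, not part of the dataset):

import koco

# Load the ~2M unlabeled comments and write them out one per line,
# a common input format for language-model pretraining scripts.
unlabeled = koco.load_dataset('korean-hate-speech', mode='unlabeled')

with open('unlabeled_comments.txt', 'w', encoding='utf-8') as f:
    for ex in unlabeled:
        f.write(ex['comments'].replace('\n', ' ').strip() + '\n')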

3. news_title

We release news titles for each comments. To fully understand meaning of the comments, context is must be required.
For the entertainment news, both title and contents can be used for the context. However, we only provide the news articles' title, due to the legal issue.
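
A minimal sketch of using the title as context is shown below; the simple concatenation and the [SEP] separator string are illustrative assumptions, not a prescribed input format:

import koco

train_dev = koco.load_dataset('korean-hate-speech', mode='train_dev')

def build_input(ex, sep=' [SEP] '):
    # Prepend the news title so a classifier sees the comment in context;
    # the separator token is an arbitrary choice here.
    return ex['news_title'] + sep + ex['comments']

texts = [build_input(ex) for ex in train_dev['train']]
labels = [ex['hate'] for ex in train_dev['train']]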

Usage

koco is a library to easily access kocohub datasets.
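To the best of our knowledge, the package is distributed on PyPI under the same name, so pip install koco should be enough to set it up.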

For korean-hate-speech, we can load the datasets with the following code:

>>> import koco

>>> train_dev = koco.load_dataset('korean-hate-speech', mode='train_dev')
>>> type(train_dev)
dict
>>> train_dev.keys()
dict_keys(['train', 'dev'])
>>> train_dev['train'][33]
{'comments': '2,30대 골빈여자들은 이 기사에 다 모이는건가ㅋㅋㅋㅋ 이래서 여자는 투표권 주면 안된다. 엠넷사전투표나 하고 살아야지 계집들은',
 'contain_gender_bias': True,
 'bias': 'gender',
 'hate': 'hate',
 'news_title': '"“8년째 연애 중”…‘인생술집’ 블락비 유권♥전선혜, 4살차 연상연하 커플"'}

>>> unlabeled = koco.load_dataset('korean-hate-speech', mode='unlabeled')
>>> type(unlabeled)
list
>>> unlabeled[33]
{'comments': '이주연님 되게 이쁘시다 오빠 오래가요 잘어울려 주연님 울오빠 잘부탁해요',
 'news_title': '"[단독] 지드래곤♥이주연, 제주도 데이트…2018년 1호 커플 탄생"'}

>>> test = koco.load_dataset('korean-hate-speech', mode='test')
>>> type(test)
list
>>> test[33]
{'comments': '끝낼때도 됐지 요즘같은 분위기엔 성드립 잘못쳤다가 난리. 그동안 잘봤습니다',
 'news_title': '[단독] ‘SNL 코리아’ 공식적인 폐지 확정…아름다운 종료'}
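
Putting the pieces together, here is a minimal baseline sketch; scikit-learn, the character n-gram features, and the macro-F1 metric are our own illustrative choices, not part of the official benchmarks (the paper's baselines use CharCNN, BiLSTM, and BERT):

import koco
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

train_dev = koco.load_dataset('korean-hate-speech', mode='train_dev')
train_texts = [ex['comments'] for ex in train_dev['train']]
train_labels = [ex['hate'] for ex in train_dev['train']]
dev_texts = [ex['comments'] for ex in train_dev['dev']]
dev_labels = [ex['hate'] for ex in train_dev['dev']]

# Character n-grams avoid the need for a Korean tokenizer in this quick baseline.
model = make_pipeline(
    TfidfVectorizer(analyzer='char_wb', ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

dev_pred = model.predict(dev_texts)
print('macro F1 on dev:', f1_score(dev_labels, dev_pred, average='macro'))

# The test split has no labels; its predictions are what the Kaggle
# competitions below evaluate (check each competition page for the exact
# submission format).
test = koco.load_dataset('korean-hate-speech', mode='test')
test_pred = model.predict([ex['comments'] for ex in test])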

Kaggle competition

We host Kaggle competitions to provide an easy-to-use leaderboard system. There are three competitions:

  1. Gender-bias detection: www.kaggle.com/c/korean-gender-bias-detection
  2. Bias detection: www.kaggle.com/c/korean-bias-detection
  3. Hate speech detection: www.kaggle.com/c/korean-hate-speech-detection

Feel free to participate πŸŽ‰

Annotation Guideline

Contributors

The main contributors of the work are:

*: Equal Contribution

Note that this project is independent research and was not supported by any organization.
Instead, we had an individual sponsor, Hyunjoong Kim, whom we sincerely thank for providing financial support ❤️

References

If you find this dataset useful, feel free to cite our publication BEEP! Korean Corpus of Online News Comments for Toxic Speech Detection, which was accepted at SocialNLP@ACL 2020:

@inproceedings{moon-etal-2020-beep,
    title = "{BEEP}! {K}orean Corpus of Online News Comments for Toxic Speech Detection",
    author = "Moon, Jihyung  and
      Cho, Won Ik  and
      Lee, Junbum",
    booktitle = "Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.socialnlp-1.4",
    pages = "25--31",
    abstract = "Toxic comments in online platforms are an unavoidable social issue under the cloak of anonymity. Hate speech detection has been actively done for languages such as English, German, or Italian, where manually labeled corpus has been released. In this work, we first present 9.4K manually labeled entertainment news comments for identifying Korean toxic speech, collected from a widely used online news platform in Korea. The comments are annotated regarding social bias and hate speech since both aspects are correlated. The inter-annotator agreement Krippendorff{'}s alpha score is 0.492 and 0.496, respectively. We provide benchmarks using CharCNN, BiLSTM, and BERT, where BERT achieves the highest score on all tasks. The models generally display better performance on bias identification, since the hate speech detection is a more subjective issue. Additionally, when BERT is trained with bias label for hate speech detection, the prediction score increases, implying that bias and hate are intertwined. We make our dataset publicly available and open competitions with the corpus and benchmarks.",
}

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
