Korean HateSpeech Dataset
We provide the first human-annotated Korean corpus for toxic speech detection, together with a large unlabeled corpus. The data consists of comments from a Korean entertainment news aggregation platform.
Dataset description
The dataset consists of three parts: 1) `labeled`, 2) `unlabeled`, and 3) `news_title`.
labeled
1. There are 9,381 human-labeled comments in total, split into a training set of 7,896, a validation set of 471, and a test set of 974 comments. (Test-set labels are withheld for fair comparison of prediction models; models can be evaluated via the Kaggle submissions described later in this document.) Each comment is annotated on two aspects, the existence of social bias and hate speech, given that hate speech is closely related to bias.
For social bias, we provide `gender`, `others`, and `none` labels. Considering the context of Korean entertainment news, where public figures mostly face stereotypes intertwined with gender, we weigh this prevalent kind of bias more heavily. We also add a binary label indicating whether or not a comment contains gender bias.
For hate speech, we introduce `hate`, `offensive`, and `none` labels.
| comments | contain_gender_bias | bias | hate |
| --- | --- | --- | --- |
| 송중기 시대극은 믿고본다. 첫회 신선하고 좋았다. | False | none | none |
| 지현우 나쁜놈 | False | none | offensive |
| 알바쓰고많이만들면되지 돈욕심없으면골목식당왜나온겨 기댕기게나하고 산에가서팔어라 | False | none | hate |
| 설마 ㅈ 현정 작가 아니지?? | True | gender | hate |
Detailed definitions are described in the annotation guideline.
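Note that the two bias annotations are closely related: in the examples above, `contain_gender_bias` is `True` exactly when `bias` is `gender`. A minimal sanity-check sketch over loaded rows (the helper function is ours, not part of `koco`):

```python
def check_gender_bias_consistency(rows):
    """Return rows where the binary gender-bias flag disagrees with
    the three-way bias label (expected: True iff bias == 'gender')."""
    return [
        row for row in rows
        if row['contain_gender_bias'] != (row['bias'] == 'gender')
    ]

# Toy rows mirroring the table above (not the real dataset):
rows = [
    {'contain_gender_bias': False, 'bias': 'none', 'hate': 'none'},
    {'contain_gender_bias': False, 'bias': 'none', 'hate': 'offensive'},
    {'contain_gender_bias': True, 'bias': 'gender', 'hate': 'hate'},
]
assert check_gender_bias_consistency(rows) == []
```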
unlabeled
2. We additionally provide 2,033,893 `unlabeled` comments, since the labeled data is limited.
This unlabeled dataset can be used in various ways: pretraining a language model, semi-supervised learning, and so on.
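For example, for language-model pretraining one might flatten the unlabeled comments into a plain-text corpus, one comment per line. This is only a sketch: the writer function is ours, and `examples` stands for the list returned by `koco.load_dataset('korean-hate-speech', mode='unlabeled')`.

```python
def write_pretraining_corpus(examples, path):
    """Write one comment per line, skipping empty strings,
    as expected by most LM-pretraining pipelines."""
    with open(path, 'w', encoding='utf-8') as f:
        for ex in examples:
            text = ex['comments'].strip()
            if text:
                f.write(text + '\n')

# Toy usage with fake examples (real data comes from koco):
write_pretraining_corpus(
    [{'comments': 'comment one'}, {'comments': ''}, {'comments': 'comment two'}],
    'corpus.txt',
)
```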
news_title
3. We release the news title for each comment, since context is required to fully understand a comment's meaning. For entertainment news, both the title and the article body can serve as context; however, due to legal issues, we only provide the articles' titles.
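One common way to use the titles is to concatenate each title with its comment before feeding a classifier, e.g. with a separator token. A sketch under our own conventions (the separator string and helper name are not part of the dataset or `koco`):

```python
def with_context(example, sep=' [SEP] '):
    """Prepend the news title to the comment so the model sees
    the (limited) context the annotators had."""
    return example['news_title'] + sep + example['comments']

# Toy example with placeholder strings:
example = {'comments': 'some comment', 'news_title': 'some title'}
assert with_context(example) == 'some title [SEP] some comment'
```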
Usage
`koco` is a library for easily accessing kocohub datasets. For `korean-hate-speech`, the datasets can be loaded as follows:
>>> import koco
>>> train_dev = koco.load_dataset('korean-hate-speech', mode='train_dev')
>>> type(train_dev)
dict
>>> train_dev.keys()
dict_keys(['train', 'dev'])
>>> train_dev['train'][33]
{'comments': '2,30대 골빈여자들은 이 기사에 다 모이는건가ㅋㅋㅋㅋ 이래서 여자는 투표권 주면 안된다. 엠넷사전투표나 하고 살아야지 계집들은',
'contain_gender_bias': True,
'bias': 'gender',
'hate': 'hate',
'news_title': '"‘8년째 연애 중’…‘인생술집’ 블락비 유권♥전선혜, 4살차 연상연하 커플"'}
>>> unlabeled = koco.load_dataset('korean-hate-speech', mode='unlabeled')
>>> type(unlabeled)
list
>>> unlabeled[33]
{'comments': 'μ΄μ£Όμ°λ λκ² μ΄μμλ€ μ€λΉ μ€λκ°μ μμ΄μΈλ € μ£Όμ°λ μΈμ€λΉ μλΆνν΄μ',
'news_title': '"[단독] 지드래곤♥이주연, 제주도 데이트…2018년 1호 커플 탄생"'}
>>> test = koco.load_dataset('korean-hate-speech', mode='test')
>>> type(test)
list
>>> test[33]
{'comments': 'λλΌλλ λμ§ μμ¦κ°μ λΆμκΈ°μ μ±λ립 μλͺ»μ³€λ€κ° λ리. κ·Έλμ μλ΄€μ΅λλ€',
'news_title': '[단독] ‘SNL 코리아’ 공식적인 폐지 확정…아름다운 종료'}
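The returned splits are plain Python lists of dicts, so turning one into classifier inputs takes only a list comprehension per field. A sketch (the helper name is ours; with real data, `train_dev['train']` from above would be passed in):

```python
def to_xy(split, label_key='hate'):
    """Split a list of labeled examples into parallel lists of
    texts and labels for a given annotation field."""
    texts = [ex['comments'] for ex in split]
    labels = [ex[label_key] for ex in split]
    return texts, labels

# Toy usage; with real data: texts, labels = to_xy(train_dev['train'])
split = [
    {'comments': 'a', 'hate': 'none'},
    {'comments': 'b', 'hate': 'hate'},
]
texts, labels = to_xy(split)
assert texts == ['a', 'b'] and labels == ['none', 'hate']
```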
Kaggle competition
We host Kaggle competitions to provide an easy-to-use leaderboard. There are three competitions:
- Gender-bias detection: www.kaggle.com/c/korean-gender-bias-detection
- Bias detection: www.kaggle.com/c/korean-bias-detection
- Hate speech detection: www.kaggle.com/c/korean-hate-speech-detection
Feel free to participate!
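Predictions on the test split are submitted as a CSV file. The exact column layout is defined by each competition's sample submission on Kaggle; the sketch below assumes a hypothetical two-column `id,label` format purely for illustration, so check the real sample submission before using it.

```python
import csv

def write_submission(predictions, path):
    """Write (id, label) pairs in a hypothetical id,label CSV layout;
    the competition's sample submission defines the real columns."""
    with open(path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['id', 'label'])
        writer.writerows(predictions)

# Toy usage with made-up predictions:
write_submission([(0, 'none'), (1, 'hate')], 'submission.csv')
```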
Annotation Guideline
Contributors
The main contributors of the work are Jihyung Moon*, Won Ik Cho*, and Junbum Lee.
*: Equal Contribution
Note that this project is independent research and was not supported by any organization. Instead, we had an individual sponsor, Hyunjoong Kim, whom we sincerely thank for providing financial support ❤️
References
If you find this dataset useful, feel free to cite our publication "BEEP! Korean Corpus of Online News Comments for Toxic Speech Detection", accepted at SocialNLP@ACL 2020:
@inproceedings{moon-etal-2020-beep,
title = "{BEEP}! {K}orean Corpus of Online News Comments for Toxic Speech Detection",
author = "Moon, Jihyung and
Cho, Won Ik and
Lee, Junbum",
booktitle = "Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.socialnlp-1.4",
pages = "25--31",
abstract = "Toxic comments in online platforms are an unavoidable social issue under the cloak of anonymity. Hate speech detection has been actively done for languages such as English, German, or Italian, where manually labeled corpus has been released. In this work, we first present 9.4K manually labeled entertainment news comments for identifying Korean toxic speech, collected from a widely used online news platform in Korea. The comments are annotated regarding social bias and hate speech since both aspects are correlated. The inter-annotator agreement Krippendorff{'}s alpha score is 0.492 and 0.496, respectively. We provide benchmarks using CharCNN, BiLSTM, and BERT, where BERT achieves the highest score on all tasks. The models generally display better performance on bias identification, since the hate speech detection is a more subjective issue. Additionally, when BERT is trained with bias label for hate speech detection, the prediction score increases, implying that bias and hate are intertwined. We make our dataset publicly available and open competitions with the corpus and benchmarks.",
}
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.