ko-nlp / Open-korean-corpora

Licence: other
Open Korean NLP Dataset Curation for the Users All Around the Globe

Open Korean Corpora: A Living Document for Korean NLP Dataset Curation

Overview

  • Korean, a language with 80M speakers, is often overlooked in NLP research
  • The limited availability of public datasets and tasks has hindered investigation
  • Even the publicly available datasets often lack English documentation and are hard to discover
  • Our work tackles this problem by curating a living document of open resources for the Korean language

NLP-OSS @ EMNLP 2020

We will be in the live session and monitoring the slide chat during EMNLP 2020. If you have any questions, or simply want to say hello, please drop by!

Public Institutions

Multiple government-funded institutions create datasets for the Korean language:

  • National Institute of Korean Language (NIKL)
  • Electronics and Telecommunications Research Institute (ETRI)
  • NIA AI HUB

Generally, government-funded datasets tend to heavily restrict access for non-Korean citizens.

Thus, Open Corpora here denotes a dataset that is freely accessible and downloadable, requiring at most a simple sign-in.

Open Dataset for Korean NLP

Our work focuses on curating open Korean corpora under the following criteria:

  • Documentation status
  • License for use and distribution

Documentation and License

For documentation status (Docu.), we use the following labels:

  • doc - Does the corpus have any documentation on the usage?
  • art - Does the corpus have a related article?
  • inter - Does the corpus have an internationally available publication?

License

For License, we check the following:

  • Commercially available (com), academic use only (acad), unknown (unk)
  • Redistribution is available with/without modification (rd and rd/mod-x), neither (no), unknown (unk)
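The two-part codes above can be split mechanically. The sketch below is only an illustration, not part of the project: the function name `parse_license` and the `License` tuple are hypothetical, and it assumes that `(mod-x)` marks redistribution without modification, which is our reading of the legend.

```python
from typing import NamedTuple

# Labels taken from the legend above.
USAGE = {"com", "acad", "unk"}
REDISTRIBUTION = {"rd", "no", "unk"}


class License(NamedTuple):
    usage: str            # "com", "acad", or "unk"
    redistribution: str   # "rd", "no", or "unk"
    modifiable: bool      # True only for "rd" without the (mod-x) flag


def parse_license(code: str) -> License:
    """Split a code such as 'com/rd (mod-x)' into its parts (hypothetical helper)."""
    # Assumption: "(mod-x)" means the corpus may be redistributed only unmodified.
    no_modification = "(mod-x)" in code
    usage, _, redistribution = code.replace("(mod-x)", "").partition("/")
    usage, redistribution = usage.strip(), redistribution.strip()
    if usage not in USAGE or redistribution not in REDISTRIBUTION:
        raise ValueError(f"unrecognized license code: {code!r}")
    modifiable = redistribution == "rd" and not no_modification
    return License(usage, redistribution, modifiable)


print(parse_license("com/rd (mod-x)"))
print(parse_license("acad/no"))
```

Keeping usage and redistribution as separate fields makes it easy to ask questions such as "which corpora can be used commercially?" without string matching on the combined code.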

Other Attributes

  • In Provider, we note whether the dataset is provided by universities or institutes (Academia), by companies or their research groups (Industry), or by a combination of these for a competition (Competition).
  • In Volume, (w) denotes words, (s) denotes sentences, (p) denotes pairs (either document or sentence pairs), (d) denotes dialogues, (h) denotes hours, and (u) denotes speech utterances.
  • In Goal, Eval is noted if the corpus is intended for evaluation.
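The Volume notation can also be read programmatically. The following is a rough sketch under stated assumptions: `parse_volume` is a hypothetical helper, a part without its own unit (as in `150K / 50K (s)`) is assumed to share the next unit to its right, and a trailing `+` ("more than") is dropped.

```python
import re

# Unit letters from the legend above.
UNITS = {"w": "words", "s": "sentences", "p": "pairs",
         "d": "dialogues", "h": "hours", "u": "utterances"}
SCALE = {"": 1, "K": 1_000, "M": 1_000_000}

# One part of a Volume cell, e.g. "150K", "50K (s)", "12+ (h)", "181 speakers".
PART = re.compile(r"^([\d,]+)([KM]?)\+?\s*(?:\((\w)\)|(speakers?))?$")


def parse_volume(volume: str) -> list:
    """Return (count, unit) pairs for a Volume cell (hypothetical helper)."""
    parsed = []
    for part in volume.split("/"):
        match = PART.match(part.strip())
        if match is None:
            raise ValueError(f"unrecognized volume part: {part!r}")
        digits, scale, unit, speakers = match.groups()
        count = int(digits.replace(",", "")) * SCALE[scale]
        name = "speakers" if speakers else (UNITS[unit] if unit else None)
        parsed.append((count, name))

    # Assumption: a part with no unit shares the unit of the next part.
    result, following = [], None
    for count, name in reversed(parsed):
        name = name or following
        following = name
        result.append((count, name))
    return result[::-1]


print(parse_volume("150K / 50K (s)"))
print(parse_volume("12+ (h) / 13K (u) / 1 speaker"))
```

The counts are approximate by construction, since the table itself rounds most volumes to thousands or millions.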

View at a Glance

The table below describes the open Korean corpora investigated so far. It will be updated as our survey progresses and as pull requests arrive. You can visit here for the Korean description and more information regarding government-driven databases.

| No. | Dataset | Typical Usage | Provider | Docu. | License | Volume | Goal | Lang. |
|---|---|---|---|---|---|---|---|---|
| 1 | KAIST Morpho-Syntactically Annotated Corpus | Morphological analysis | Academia | art | acad/no | 70M (w) | - | ko |
| 2 | KAIST Korean Tree-Tagging Corpus | Tree parsing | Academia | inter | acad/no | 30K (s) | - | ko |
| 3 | UD Korean KAIST | Dependency parsing | Academia | inter | acad/no | 27K (s) | - | ko |
| 4 | PKT-UD | Dependency parsing | Academia | inter | acad/no | 5K (s) | - | ko |
| 5 | KMOU NER | NER | Academia | art | acad/rd | 24K (s) | - | ko |
| 6 | AIR x NAVER NER | NER | Competition | doc | acad/no | 90K (s) | - | ko |
| 7 | AIR x NAVER SRL | SRL | Competition | doc | acad/no | 35K (s) | - | ko |
| 8 | Question Pair | Paraphrase detection | Academia | doc | com/rd | 10K (p) | - | ko |
| 9 | KorNLI | NLI | Industry | inter | com/rd | 1,000K (p) | - | ko |
| 10 | KorSTS | STS | Industry | inter | com/rd | 8,500 (p) | - | ko |
| 11 | ParaKQC | STS | Academia | inter | com/rd | 540K (p) | - | ko |
| 12 | NSMC | Sentiment analysis | Academia | doc | com/rd | 150K / 50K (s) | - | ko |
| 13 | BEEP! | Hate speech detection | Academia | inter | com/rd | 8K / 500 / 1,000 (s) | - | ko |
| 14 | 3i4K | Speech act classification | Academia | inter | com/rd | 55K / 6K (s) | - | ko |
| 15 | KorQuAD 1.0 | QA | Industry | inter | com/rd (mod-x) | 60K / 5K / 4K (p) | - | ko |
| 16 | KorQuAD 2.0 | QA | Industry | art | com/rd (mod-x) | 80K / 10K / 10K (p) | - | ko |
| 17 | Sci-news-sum-kr | Summarization | Academia | doc | acad/rd | 50 (p) | Eval | ko |
| 18 | sae4K | Summarization | Academia | inter | com/rd | 50K (p) | - | ko |
| 19 | Korean Parallel Corpora | MT | Academia | inter | com/rd (mod-x) | 97K (p) | - | ko, en |
| 20 | KAIST Translation Evaluation Set | MT | Academia | doc | acad/no | 3K (p) | Eval | ko, en |
| 21 | KAIST Chinese-Korean Multilingual Corpus | MT | Academia | doc | acad/no | 60K (p) | - | ko, zh |
| 22 | Transliteration Dataset | Transliteration | Academia | doc | com/rd | 35K (p) | - | ko, en |
| 23 | KAIST Transliteration Evaluation Set | Transliteration | Academia | doc | acad/no | 7K (p) | Eval | ko, en |
| 24 | SIGMORPHON G2P | G2P conversion | Competition | inter | com/rd | 3,600 / 450 / 450 (p) | - | ko, en, hy, bg, fr, ka, hi, hu, is, lt, el |
| 25 | PAWS-X | Paraphrase detection | Industry | inter | com/rd | 5K / 2K / 2K (p) | - | ko, fr, es, de, zh, ja |
| 26 | TyDi-QA | QA | Industry | inter | com/rd | 11K / 1,698 / 1,722 (p) | - | ko, en, ar, bn, fi, ja, id, sw, ru, te, th |
| 27 | XPersona | Dialog | Academia | inter | com/rd | 299 (d) / 4,684 (s) | - | ko, en, it, fr, id, zh, ja |
| 28 | KSS | ASR | Academia | doc | acad/rd | 12+ (h) / 13K (u) / 1 speaker | - | ko |
| 29 | Zeroth | ASR | Industry | doc | com/rd | 51+ (h) / 27K (s) / 46K (u) / 181 speakers | - | ko |
| 30 | ClovaCall | ASR | Industry | inter | acad/no | 80+ (h) / 60K (u) / 11K speakers | - | ko |
| 31 | Pansori-TedXKR | ASR | Academia | inter | acad/rd (mod-x) | 3+ (h) / 3K (u) / 41 speakers | - | ko |
| 32 | ProSem | SLU | Academia | inter | com/rd | 6+ (h) / 3,500 (s) / 7K (u) / 2 speakers | - | ko |
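As one way to use the table, the sketch below filters a handful of rows by license and goal. It is an illustration only: the `Corpus` class and both helper functions are hypothetical names, and `SAMPLE` copies just six rows from the table rather than the full list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Corpus:
    name: str
    usage: str
    provider: str
    license: str
    goal: str


# Six rows copied from the table above (not the full list).
SAMPLE = [
    Corpus("KMOU NER", "NER", "Academia", "acad/rd", "-"),
    Corpus("KorNLI", "NLI", "Industry", "com/rd", "-"),
    Corpus("NSMC", "Sentiment analysis", "Academia", "com/rd", "-"),
    Corpus("KorQuAD 1.0", "QA", "Industry", "com/rd (mod-x)", "-"),
    Corpus("Sci-news-sum-kr", "Summarization", "Academia", "acad/rd", "Eval"),
    Corpus("ClovaCall", "ASR", "Industry", "acad/no", "-"),
]


def commercially_usable(corpora):
    """Names of corpora whose license code starts with 'com'."""
    return [c.name for c in corpora if c.license.split("/")[0] == "com"]


def evaluation_sets(corpora):
    """Names of corpora whose Goal column is 'Eval'."""
    return [c.name for c in corpora if c.goal == "Eval"]


print(commercially_usable(SAMPLE))
print(evaluation_sets(SAMPLE))
```

The license codes summarize our survey and are not legal advice, so check each dataset's own license text before relying on a filter like this.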

Citing

To cite our work, please use the following: (Also available as cho-etal-2020-open in anthology.bib)

@inproceedings{cho-etal-2020-open,
    title = "Open {K}orean Corpora: A Practical Report",
    author = "Cho, Won Ik  and
      Moon, Sangwhan  and
      Song, Youngsook",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.12",
    pages = "85--93",
    abstract = "Korean is often referred to as a low-resource language in the research community. While this claim is partially true, it is also because the availability of resources is inadequately advertised and curated. This work curates and reviews a list of Korean corpora, first describing institution-level resource development, then further iterate through a list of current open datasets for different types of tasks. We then propose a direction on how open-source dataset construction and releases should be done for less-resourced languages to promote research.",
}

Contributing

Please read the contributor guidelines before sending a pull request.
