Top 107 corpus open source projects

dialogue-datasets
A collection of open dialogue corpora and some useful data-processing utilities.
SpiCE-Corpus
An open-access corpus of conversational bilingual speech in Cantonese and English
OpenDialog
An Open-Source Package for Chinese Open-domain Conversational Chatbot (a Chinese chit-chat dialogue system with one-click deployment of a WeChat chit-chat bot)
OneStopEnglishCorpus
No description or website provided.
folia
FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for proces…
cljs-corpus
A greppable archive of ClojureScript code
bible-corpus
A multilingual parallel corpus created from translations of the Bible.
PoetryCorpus
A poetic corpus of the Russian language
pdf-corpus
Python script to quickly create hand-crafted PDF files
CBLUE
CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark (a benchmark for Chinese medical information processing)
egret-wenda-corpus
A Public Corpus for Machine Learning
TV4Dialog
No description or website provided.
LanguageCodes
We present a list of languages with their codes, families, regions, etc. We also present a list of multilingual corpora (with URLs).
thaigov-corpus
A project to collect news from Thai government websites
When-in-Rome
A meta-corpus of functional harmonic analysis.
malay-dataset
Text corpus for Bahasa Malaysia, https://malaya.readthedocs.io/en/latest/Dataset.html
open2ch-dialogue-corpus
A dialogue corpus created by crawling Open2ch (おーぷん2ちゃんねる)
nytwit
New York Times Word Innovation Types dataset
ocr2text
Convert a PDF via OCR to a TXT file in UTF-8 encoding
OpenConvert
Text conversion tool (from e.g. Word, HTML, txt) to the corpus formats TEI or FoLiA
opensource-voice-tools
A repo listing known open source voice tools, ordered by where they sit in the voice stack
Chatbot-Training-Corpus
A summary of text corpora that can be used for hands-on chatbot training, covering both Chinese and English
tvsub
TVsub: DCU-Tencent Chinese-English Dialogue Corpus
Speech-Corpus-Collection
A Collection of Speech Corpus for ASR and TTS
proiel-treebank
Official releases of the PROIEL treebank of ancient Indo-European languages
DANeS
DANeS is an open-source E-newspaper dataset created in collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)
rclc
Rich Context leaderboard competition, including the corpus and current SOTA for required tasks.
german-nouns
A list of ~100,000 German nouns and their grammatical properties, compiled from WiktionaryDE as a CSV file, plus a module to look up the data and parse compound words.
megs
A merged version of multiple open-source German speech datasets.