DeepSentiPers: Repository for the experiments described in the paper "DeepSentiPers: Novel Deep Learning Models Trained Over Proposed Augmented Persian Sentiment Corpus"
dialogue-datasets: Collects open dialogue corpora and some useful data-processing utilities.
SpiCE-Corpus: An open-access corpus of conversational bilingual speech in Cantonese and English
OpenDialog: An open-source package for Chinese open-domain conversational chatbots (a Chinese chit-chat dialogue system with one-click deployment of a WeChat chit-chat bot)
folia: FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for proces…
KWDLC: Kyoto University Web Document Leads Corpus
bible-corpus: A multilingual parallel corpus created from translations of the Bible.
pdf-corpus: Python script to quickly create hand-crafted PDF files
CBLUE: Chinese medical information processing benchmark. CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark
jrte-corpus: Japanese Realistic Textual Entailment Corpus (NLP 2020, LREC 2020)
LanguageCodes: We present a list of languages with their codes, families, regions, etc. We also present a list of multilingual corpora (with URLs).
BSD: The Business Scene Dialogue corpus
textbox: Text collections made available by the CLiGS group.
malay-dataset: Text corpus for Bahasa Malaysia, https://malaya.readthedocs.io/en/latest/Dataset.html
gum: Repository for the Georgetown University Multilayer Corpus (GUM)
nytwit: New York Times Word Innovation Types dataset
ocr2text: Convert a PDF via OCR to a TXT file in UTF-8 encoding
OpenConvert: Text conversion tool (from e.g. Word, HTML, txt) to the corpus formats TEI or FoLiA
opensource-voice-tools: A repo listing known open source voice tools, ordered by where they sit in the voice stack
trafilatura: Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
tvsub: TVsub: DCU-Tencent Chinese-English Dialogue Corpus
proiel-treebank: Official releases of the PROIEL treebank of ancient Indo-European languages
DANeS: DANeS is an open-source E-newspaper dataset built in collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)
rclc: Rich Context leaderboard competition, including the corpus and current SOTA for required tasks.
german-nouns: A list of ~100,000 German nouns and their grammatical properties, compiled from WiktionaryDE as a CSV file, plus a module to look up the data and parse compound words.
megs: A merged version of multiple open-source German speech datasets.