kanji-frequency: Kanji usage frequency data collected from various sources
Stars: ✭ 92 (+39.39%)
OpenConvert: Text conversion tool from formats such as Word, HTML, or txt to the corpus formats TEI or FoLiA
Stars: ✭ 20 (-69.7%)
nippon: Japanese N5-N2 grammar notes 🍻
Stars: ✭ 84 (+27.27%)
Nlvr: Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.
Stars: ✭ 192 (+190.91%)
german-nouns: A list of ~100,000 German nouns and their grammatical properties, compiled from WiktionaryDE as a CSV file, plus a module to look up the data and parse compound words.
Stars: ✭ 101 (+53.03%)
gum: Repository for the Georgetown University Multilayer Corpus (GUM)
Stars: ✭ 71 (+7.58%)
tvsub: TVsub, the DCU-Tencent Chinese-English Dialogue Corpus
Stars: ✭ 40 (-39.39%)
Wp2txt: WP2TXT extracts plain text from a Wikipedia dump file (encoded in XML, compressed with Bzip2), stripping all MediaWiki markup and other metadata.
Stars: ✭ 145 (+119.7%)
Code Docstring Corpus: Preprocessed Python functions and docstrings for automated code documentation (code2doc) and automated code generation (doc2code) tasks.
Stars: ✭ 137 (+107.58%)
Nihonoari-App: A small, minimalist Japanese kana training app
Stars: ✭ 66 (+0%)
thaigov-corpus: Project to collect news from the Thai government website
Stars: ✭ 19 (-71.21%)
megs: A merged version of multiple open-source German speech datasets.
Stars: ✭ 21 (-68.18%)
nytwit: New York Times Word Innovation Types dataset
Stars: ✭ 21 (-68.18%)
Nlp bahasa resources: A curated list of datasets and usable library resources for NLP in Bahasa Indonesia
Stars: ✭ 158 (+139.39%)
trafilatura: Python & command-line tool to gather text on the Web (web crawling/scraping, extraction of text, metadata, comments)
Stars: ✭ 711 (+977.27%)
Prosody: Helsinki Prosody Corpus and a system for predicting prosodic prominence from text
Stars: ✭ 139 (+110.61%)
When-in-Rome: A meta-corpus of functional harmonic analysis.
Stars: ✭ 35 (-46.97%)
bisemantic: Text pair classification
Stars: ✭ 12 (-81.82%)
Khcoder: KH Coder, for quantitative content analysis or text mining
Stars: ✭ 126 (+90.91%)
Dialog corpus: Corpus for training Chinese and English dialogue (chatbot) systems
Stars: ✭ 1,662 (+2418.18%)
DANeS: DANeS is an open-source e-newspaper dataset built in collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)
Stars: ✭ 64 (-3.03%)
malay-dataset: Text corpus for Bahasa Malaysia, https://malaya.readthedocs.io/en/latest/Dataset.html
Stars: ✭ 189 (+186.36%)
rclc: Rich Context leaderboard competition, including the corpus and current SOTA for required tasks.
Stars: ✭ 20 (-69.7%)
kotoba: A Discord bot for helping with learning Japanese.
Stars: ✭ 118 (+78.79%)
AtCoderClans: [Unofficial] A collection of links that make AtCoder more fun. It gathers unofficial services, tools, libraries, articles, and more created by volunteers.
Stars: ✭ 74 (+12.12%)
jmdict-simplified: JMdict, JMnedict, Kanjidic, KRADFILE/RADKFILE in JSON format
Stars: ✭ 96 (+45.45%)
jaco-js: Japanese character optimizer for JavaScript
Stars: ✭ 72 (+9.09%)
google-news-scraper: Google News Scraper for languages like Japanese, Chinese... [VPN Support]
Stars: ✭ 88 (+33.33%)
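Chinese Names Corpus: Chinese personal-name corpus and name generator. Covers Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, and English names. Usable for Chinese word segmentation and person-name entity recognition.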
Stars: ✭ 3,053 (+4525.76%)
ocr2text: Convert a PDF via OCR to a TXT file in UTF-8 encoding
Stars: ✭ 90 (+36.36%)
Weibo terminater: Final Weibo crawler. Scrapes anything from Weibo: comments, post contents, followers, anything. The Terminator
Stars: ✭ 2,295 (+3377.27%)
TV4Dialog: No description or website provided.
Stars: ✭ 33 (-50%)
Efaqa Corpus Zh: ❤️ Emotional First Aid Dataset, a psychological counseling Q&A and chatbot corpus
Stars: ✭ 170 (+157.58%)
opensource-voice-tools: A repo listing known open source voice tools, ordered by where they sit in the voice stack
Stars: ✭ 21 (-68.18%)
BSD: The Business Scene Dialogue corpus
Stars: ✭ 51 (-22.73%)
Clue: Chinese Language Understanding Evaluation Benchmark, with datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+3574.24%)
Awesome Chatbot: Awesome Chatbot Projects, Corpus, Papers, Tutorials. Chinese Chatbot =>:
Stars: ✭ 1,785 (+2604.55%)
Senti4SD: An emotion-polarity classifier specifically trained on developers' communication channels
Stars: ✭ 41 (-37.88%)
textbox: Text collections made available by the CLiGS group.
Stars: ✭ 19 (-71.21%)
limelight: A PHP Japanese language text analyzer and parser.
Stars: ✭ 76 (+15.15%)
LanguageCodes: A list of languages with their codes, families, regions, etc., plus a list of multilingual corpora (with URLs).
Stars: ✭ 70 (+6.06%)
banglabert: This repository contains the official release of the model "BanglaBERT" and the associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla", accepted in Findings of the Annual Conference of the North American Chap…
Stars: ✭ 186 (+181.82%)
proiel-treebank: Official releases of the PROIEL treebank of ancient Indo-European languages
Stars: ✭ 30 (-54.55%)