Chinese Names Corpus中文人名语料库。人名生成器。中文姓名,姓氏,名字,称呼,日本人名,翻译人名,英文人名。可用于中文分词、人名实体识别。
Stars: ✭ 3,053 (+13173.91%)
FakenewscorpusA dataset of millions of news articles scraped from a curated list of data sources.
Stars: ✭ 255 (+1008.7%)
CoarijCorpus of Annual Reports in Japan
Stars: ✭ 55 (+139.13%)
Dialog corpus用于训练中英文对话系统的语料库 Datasets for Training Chatbot System
Stars: ✭ 1,662 (+7126.09%)
Dataset Listlists of text corpus and more (mainly Japanese)
Stars: ✭ 84 (+265.22%)
Nlp chinese corpus大规模中文自然语言处理语料 Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+28839.13%)
ProsodyHelsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text
Stars: ✭ 139 (+504.35%)
Ua GecUA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Stars: ✭ 108 (+369.57%)
Clue中文语言理解测评基准 Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+10443.48%)
Nlp bahasa resourcesA Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia
Stars: ✭ 158 (+586.96%)
malay-datasetText corpus for Bahasa Malaysia, https://malaya.readthedocs.io/en/latest/Dataset.html
Stars: ✭ 189 (+721.74%)
PoetryCorpusПоэтический корпус русского языка
Stars: ✭ 40 (+73.91%)
gumRepository for the Georgetown University Multilayer Corpus (GUM)
Stars: ✭ 71 (+208.7%)
ocr2textConvert a PDF via OCR to a TXT file in UTF-8 encoding
Stars: ✭ 90 (+291.3%)
pdf-corpusPython script to quickly create hand-crafted PDF files
Stars: ✭ 17 (-26.09%)
opensource-voice-toolsA repo listing known open source voice tools, ordered by where they sit in the voice stack
Stars: ✭ 21 (-8.7%)
textboxText collections made available by the CLiGS group.
Stars: ✭ 19 (-17.39%)
nytwitNew York Times Word Innovation Types dataset
Stars: ✭ 21 (-8.7%)
DictGenerate使用Go语言编写的社工字典生成器(The social engineering dictionary generator written by Go)
Stars: ✭ 64 (+178.26%)
OpenConvertText conversion tool (from e.g. Word, HTML, txt) to corpus formats TEI or FoLiA)
Stars: ✭ 20 (-13.04%)
trafilaturaPython & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Stars: ✭ 711 (+2991.3%)
CBLUE中文医疗信息处理基准CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark
Stars: ✭ 379 (+1547.83%)
tvsubTVsub: DCU-Tencent Chinese-English Dialogue Corpus
Stars: ✭ 40 (+73.91%)
foliaFoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for proces…
Stars: ✭ 56 (+143.48%)
JSON-pathFind the path of a key / value in a JSON hierarchy easily.
Stars: ✭ 88 (+282.61%)
jrte-corpusJapanese Realistic Textual Entailment Corpus (NLP 2020, LREC 2020)
Stars: ✭ 66 (+186.96%)
proiel-treebankOfficial releases of the PROIEL treebank of ancient Indo-European languages
Stars: ✭ 30 (+30.43%)
dialogue-datasetscollect the open dialog corpus and some useful data processing utils.
Stars: ✭ 24 (+4.35%)
SpiCE-CorpusAn open-access corpus of conversational bilingual speech in Cantonese and English
Stars: ✭ 33 (+43.48%)
thai-languagecomputer tools for thai language
Stars: ✭ 20 (-13.04%)
django-serializable-modelDjango classes to make your models, managers, and querysets serializable, with built-in support for related objects in ~150 LoC
Stars: ✭ 15 (-34.78%)
DANeSDANeS is an open-source E-newspaper dataset by collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)
Stars: ✭ 64 (+178.26%)
TV4DialogNo description or website provided.
Stars: ✭ 33 (+43.48%)
rclcRich Context leaderboard competition, including the corpus and current SOTA for required tasks.
Stars: ✭ 20 (-13.04%)
LanguageCodesWe present a list of languages with their codes, families, regions and etc. We also present a list of multi-lingual corpora (with urls).
Stars: ✭ 70 (+204.35%)
german-nounsA list of ~100,000 German nouns and their grammatical properties compiled from WiktionaryDE as CSV file. Plus a module to look up the data and parse compound words.
Stars: ✭ 101 (+339.13%)
envsEasy access of environment variables from Python with support for typing (ex. booleans, strings, lists, tuples, integers, floats, and dicts). Now with CLI settings file converter.
Stars: ✭ 25 (+8.7%)
megsA merged version of multiple open-source German speech datasets.
Stars: ✭ 21 (-8.7%)
OpenDialogAn Open-Source Package for Chinese Open-domain Conversational Chatbot (中文闲聊对话系统,一键部署微信闲聊机器人)
Stars: ✭ 94 (+308.7%)
cljs-corpusA greppable archive of ClojureScript code
Stars: ✭ 37 (+60.87%)