Top 107 corpus open source projects

Chinese Names Corpus
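Chinese personal names corpus and name generator. Covers Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, and English names. Useful for Chinese word segmentation and person-name entity recognition.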
Awesome Deeplearning Resources
Deep learning and deep reinforcement learning research papers and some code
Weibo terminater
The final Weibo crawler: scrape anything from Weibo, including comments, post contents, and followers. The Terminator.
Nlvr
Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.
Efaqa Corpus Zh
❤️ Emotional First Aid Dataset: a corpus of psychological counseling Q&A and chatbot dialogues
Nlp bahasa resources
A curated list of datasets and usable library resources for NLP in Bahasa Indonesia
Wp2txt
WP2TXT extracts plain text data from a Wikipedia dump file (encoded in XML, compressed with Bzip2), stripping all MediaWiki markup and other metadata.
Clue
Chinese Language Understanding Evaluation (CLUE) benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Prosody
Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text
Code Docstring Corpus
Preprocessed Python functions and docstrings for automated code documentation (code2doc) and automated code generation (doc2code) tasks.
Awesome Chatbot
Awesome chatbot projects, corpus, papers, tutorials. Chinese Chatbot =>:
Khcoder
KH Coder: for Quantitative Content Analysis or Text Mining
Dialog corpus
Datasets for training Chinese and English chatbot (dialogue) systems
Sejong Corpus
Korean Sejong corpus download and simple analysis
Colibri Core
Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller`` which allows you to build, view, manipulate and query pattern models.
Datasets
Poetry-related datasets developed by the THUAIPoet (Jiuge) group.
Ua Gec
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Pansori
Tools for ASR Corpus Generation from Online Video
Pubmed Rct
PubMed 200k RCT dataset: a large dataset for sequential sentence classification.
Lexicon Thai
Thai word lexicon
Chi Corpus
Mr. Chi's corpus
Pyclue
Python toolkit for the Chinese Language Understanding Evaluation (CLUE) benchmark
Dataset List
Lists of text corpora and more (mainly Japanese)
Russian news corpus
Corpus of stemmed/lemmatized (morphologically normalized) texts from Russian mass media
Blacklab
A corpus retrieval engine based on Apache Lucene
Coarij
Corpus of Annual Reports in Japan
Mitie chinese wikipedia corpus
Pre-trained Wikipedia corpus by MITIE
Chatterbot Corpus
A multilingual dialog corpus
Typing Assistant
Typing Assistant autocompletes words and suggests predictions for the next word, which makes typing faster and more intelligent and reduces effort.
Lyrics Corpora
An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and billboard charts
Company Names Corpus
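Company names corpus and organization names corpus. Covers company short names, abbreviations, brand terms, and enterprise names. Useful for Chinese word segmentation and organization-name entity recognition.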
Naive Bayes Classifier
The Naive Bayes classifier is a classification algorithm. This project uses the Bernoulli and multinomial Naive Bayes models to classify text documents as ham or spam.
Seq2seq Chatbot
Chatbot in 200 lines of code using TensorLayer
Quanteda
An R package for the Quantitative Analysis of Textual Data
Cluepretrainedmodels
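A collection of high-quality Chinese pre-trained models: state-of-the-art large models, the fastest small models, and models specialized for similarity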
Small Chinese Corpus
Some useful small Chinese corpus datasets
Chinese Nlp Corpus
Collections of Chinese NLP corpora
Corpora
A collection of small corpuses of interesting data for the creation of bots and similar stuff.
Wordless
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Fuzzdata
Fuzzing resources for feeding various fuzzers with input. 🔧
Cluecorpus2020
Large-scale pre-training corpus for Chinese (100 GB)
Korpora
Korean corpus repository
Fakenewscorpus
A dataset of millions of news articles scraped from a curated list of data sources.
Medical-Names-Corpus
Medical corpus. Corpus of medical institution names. Standard drug codes.
wordfish-python
Extract relationships from standardized terms from a corpus of interest with deep learning 🐟
fastmorph
Fast corpus search engine originally made for the Corpus of Written Tatar language
open-discourse
Open Discourse is the first fully comprehensive corpus of the plenary proceedings of the federal German Parliament (Bundestag).
Species-Names-Corpus
Species names corpus: plant names and animal names.