Weibo terminaterFinal Weibo crawler: scrape anything from Weibo, including comments, post contents, and followers. The Terminator
NlvrCornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.
Nlp bahasa resourcesA Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia
Wp2txtWP2TXT extracts plain text data from Wikipedia dump files (encoded in XML and compressed with Bzip2), stripping all MediaWiki markup and other metadata.
ClueChinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
ProsodyHelsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text
Code Docstring CorpusPreprocessed Python functions and docstrings for automated code documentation (code2doc) and automated code generation (doc2code) tasks.
Awesome ChatbotAwesome chatbot projects, corpora, papers, and tutorials.
KhcoderKH Coder: for Quantitative Content Analysis or Text Mining
Dialog corpusDatasets for training Chinese and English dialogue (chatbot) systems
Colibri CoreColibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.
DatasetsPoetry-related datasets developed by THUAIPoet (Jiuge) group.
Ua GecUA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
PansoriTools for ASR Corpus Generation from Online Video
Pubmed RctPubMed 200k RCT dataset: a large dataset for sequential sentence classification.
PycluePython toolkit for the Chinese Language Understanding Evaluation (CLUE) benchmark
Ja.text8Japanese text8 corpus for word embedding.
Russian news corpusRussian mass media stemmed texts corpus: a corpus of lemmatized (morphologically normalized) texts from Russian mass media
BlacklabA corpus retrieval engine based on Apache Lucene
CoarijCorpus of Annual Reports in Japan
Typing AssistantTyping Assistant provides the ability to autocomplete words and suggests predictions for the next word. This makes typing faster, more intelligent and reduces effort.
Lyrics CorporaAn unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and billboard charts
Naive Bayes ClassifierNaive Bayes is a classification algorithm. This project uses the Bernoulli and multinomial Naive Bayes models to classify text documents as ham or spam.
QuantedaAn R package for the Quantitative Analysis of Textual Data
Awesome Persian Nlp IrCurated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
CorporaA collection of small corpuses of interesting data for the creation of bots and similar stuff.
WordlessAn Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
FuzzdataFuzzing resources for feeding various fuzzers with input. 🔧
Cluecorpus2020Large-scale pre-training corpus for Chinese (100 GB)
FakenewscorpusA dataset of millions of news articles scraped from a curated list of data sources.
wordfish-pythonextract relationships from standardized terms from corpus of interest with deep learning 🐟
fastmorphFast corpus search engine originally made for the Corpus of Written Tatar language
open-discourseOpen Discourse is the first fully comprehensive corpus of the plenary proceedings of the federal German Parliament (Bundestag).