Top 107 corpus open source projects

Chinese Names Corpus

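Chinese names corpus and name generator. Covers Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, and English names. Useful for Chinese word segmentation and person-name entity recognition.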

✭ 3,053

dataset ner corpus dict names

Awesome Deeplearning Resources

Deep learning and deep reinforcement learning research papers and some code

✭ 2,483

deep-learning nlp video neural-network reinforcement-learning paper code corpus modelzoo

Weibo terminater

The final Weibo crawler: scrape anything from Weibo, including comments, post contents, and followers. The Terminator.

✭ 2,295

python chatbot chinese scraper weibo corpus sina

Nlvr

Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.

✭ 192

html machine-learning computer-vision natural-language-processing corpus

Efaqa Corpus Zh

❤️ Emotional First Aid Dataset: a psychological counseling Q&A and chatbot corpus

✭ 170

python natural-language-processing natural-language-understanding corpus psychology

Nlp bahasa resources

A Curated List of Datasets and Usable Library Resources for NLP in Bahasa Indonesia

✭ 158

library nlp natural-language-processing dataset sentiment-analysis packages corpus

Indonesian Nlp Resources

Data resources for Indonesian NLP

✭ 143

nlp dataset crawler sentiment-analysis named-entity-recognition corpus pos-tagging

Wp2txt

WP2TXT extracts plain text from Wikipedia dump files (XML compressed with Bzip2), stripping all MediaWiki markup and other metadata.

✭ 145

ruby nlp wikipedia corpus

Clue

Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard

✭ 2,425

python jupyter-notebook shell pytorch tensorflow dataset benchmark chinese language-model pretrained-models nlu corpus glue transformers albert bert roberta chineseglue

Prosody

Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text

✭ 139

python machine-learning pytorch natural-language-processing dataset speech-synthesis corpus sequence-labeling

Gossiping Chinese Corpus

Chinese question-answer corpus from the PTT Gossiping board

✭ 137

jupyter-notebook dataset chatbot dialog question-answering corpus chinese-nlp

Code Docstring Corpus

Preprocessed Python functions and docstrings for automated code documentation (code2doc) and automated code generation (doc2code) tasks.

✭ 137

python code-generation corpus neural-machine-translation documentation-generator

Awesome Chatbot

Awesome Chatbot Projects, Corpus, Papers, Tutorials. Chinese Chatbot =>:

✭ 1,785

python tensorflow awesome tutorial chatbot seq2seq corpus seq2seq-model seq2seq-chatbot

Khcoder

KH Coder: for Quantitative Content Analysis or Text Mining

✭ 126

perl visualization text-mining corpus

Cluedatasetsearch

Search all Chinese NLP datasets, with commonly used English NLP datasets included

✭ 2,112

python shell nlp chinese text-classification sentiment-analysis knowledge-graph datasets ner machine-translation qa corpus text-summarization match text-similarity machine-reading-comprehension

Dialog corpus

Datasets for training Chinese and English chatbot (dialogue) systems

✭ 1,662

python dataset chatbot dialog system corpus

Awesome Hungarian Nlp

A curated list of NLP resources for Hungarian

✭ 121

awesome awesome-list nlp natural-language-processing parser dataset named-entity-recognition text-mining information-retrieval natural-language-understanding nlu corpus information-extraction

Sejong Corpus

Download and simple analysis of the Korean Sejong corpus

✭ 116

python shell linux mac korean corpus morphological-analysis

Colibri Core

Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.

✭ 112

python library nlp text-processing corpus linguistics

Datasets

Poetry-related datasets developed by the THUAIPoet (Jiuge) group.

✭ 111

chinese corpus

Ua Gec

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

✭ 108

python natural-language-processing dataset corpus

Pansori

Tools for ASR Corpus Generation from Online Video

✭ 106

python speech-recognition corpus

Pubmed Rct

PubMed 200k RCT dataset: a large dataset for sequential sentence classification.

✭ 101

machine-learning nlp corpus medical

Lexicon Thai

Thai word lexicon

✭ 96

python corpus

Chi Corpus

Mr. Chi's corpus

✭ 96

python corpus

Pyclue

Python toolkit for the Chinese Language Understanding Evaluation (CLUE) benchmark

✭ 91

python language-model tiny corpus

Dataset List

Lists of text corpora and more (mainly Japanese)

✭ 84

dataset corpus

Ja.text8

Japanese text8 corpus for word embedding.

✭ 79

python deep-learning machine-learning natural-language-processing word2vec corpus

Russian news corpus

Corpus of stemmed (lemmatized, morphologically normalized) Russian mass media texts

✭ 76

machine-learning nlp ml text word2vec nlp-machine-learning russian corpus articles

Blacklab

A corpus retrieval engine based on Apache Lucene

✭ 69

java corpus

Coarij

Corpus of Annual Reports in Japan

✭ 55

python natural-language-processing dataset finance corpus

Mitie chinese wikipedia corpus

Pre-trained Wikipedia corpus by MITIE

✭ 43

nlp nlp-machine-learning corpus

Chatterbot Corpus

A multilingual dialog corpus

✭ 964

python language yaml dialog corpus

Typing Assistant

Typing Assistant autocompletes words and suggests predictions for the next word. This makes typing faster and more intelligent, and reduces effort.

✭ 32

javascript python css nlp natural-language-processing keyboard prediction corpus autocompletion

Lyrics Corpora

An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and Billboard charts

✭ 13

python music corpus lyrics songs python-api

Company Names Corpus

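Company names corpus and organization names corpus. Covers company short names, abbreviations, brand terms, and enterprise names. Useful for Chinese word segmentation and organization-name entity recognition.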

✭ 868

dataset ner corpus dict

Naive Bayes Classifier

Naive Bayes is a classification algorithm. This project uses the Bernoulli and Multinomial Naive Bayes formulations to classify text documents as ham or spam.

✭ 6

java algorithm eclipse corpus

Insuranceqa Corpus Zh

🚁 Insurance industry corpus for chatbots

✭ 821

python machine-learning natural-language-processing dataset chatbot question-answering natural-language-understanding corpus

Seq2seq Chatbot

Chatbot in 200 lines of code using TensorLayer

✭ 777

python tensorflow nlp bot chat chatbot lstm rnn corpus tensorlayer

Nlp chinese corpus

Large Scale Chinese Corpus for NLP