Sejong Corpus: Korean Sejong corpus download and simple analysis
Stars: ✭ 116 (+213.51%)
Colibri Core: Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, of either fixed or dynamic size) in a quick and memory-efficient way. At its core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.
Stars: ✭ 112 (+202.7%)
Datasets: Poetry-related datasets developed by the THUAIPoet (Jiuge) group.
Stars: ✭ 111 (+200%)
Ua Gec: UA-GEC, a grammatical error correction and fluency corpus for the Ukrainian language
Stars: ✭ 108 (+191.89%)
Pansori: Tools for ASR corpus generation from online video
Stars: ✭ 106 (+186.49%)
Pubmed Rct: PubMed 200k RCT dataset, a large dataset for sequential sentence classification.
Stars: ✭ 101 (+172.97%)
Pyclue: Python toolkit for the Chinese Language Understanding Evaluation (CLUE) benchmark
Stars: ✭ 91 (+145.95%)
Dataset List: Lists of text corpora and more (mainly Japanese)
Stars: ✭ 84 (+127.03%)
Ja.text8: Japanese text8 corpus for word embedding.
Stars: ✭ 79 (+113.51%)
Russian news corpus: Corpus of lemmatized (morphologically normalized) texts from Russian mass media
Stars: ✭ 76 (+105.41%)
Blacklab: A corpus retrieval engine based on Apache Lucene
Stars: ✭ 69 (+86.49%)
Coarij: Corpus of Annual Reports in Japan
Stars: ✭ 55 (+48.65%)
Typing Assistant: Typing Assistant autocompletes words and suggests predictions for the next word, which makes typing faster and reduces effort.
Stars: ✭ 32 (-13.51%)
Lyrics Corpora: An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and Billboard charts
Stars: ✭ 13 (-64.86%)
Naive Bayes Classifier: A Naive Bayes classifier is a classification algorithm. This one uses the Bernoulli and multinomial Naive Bayes equations to classify text documents as ham or spam.
Stars: ✭ 6 (-83.78%)
Seq2seq Chatbot: Chatbot in 200 lines of code using TensorLayer
Stars: ✭ 777 (+2000%)
Nlp chinese corpus: Large-scale Chinese corpus for NLP
Stars: ✭ 6,656 (+17889.19%)
Quanteda: An R package for the quantitative analysis of textual data
Stars: ✭ 647 (+1648.65%)
Awesome Persian Nlp Ir: Curated list of Persian natural language processing and information retrieval tools and resources
Stars: ✭ 460 (+1143.24%)
Bookcorpus: Crawl BookCorpus
Stars: ✭ 443 (+1097.3%)
Corpora: A collection of small corpora of interesting data for the creation of bots and similar stuff.
Stars: ✭ 4,293 (+11502.7%)
Wordless: An integrated corpus tool with multilingual support for the study of language, literature, and translation
Stars: ✭ 378 (+921.62%)
Fuzzdata: Fuzzing resources for feeding various fuzzers with input. 🔧
Stars: ✭ 376 (+916.22%)
Cluecorpus2020: Large-scale pre-training corpus for Chinese (100G)
Stars: ✭ 278 (+651.35%)
Korpora: Korean corpus repository
Stars: ✭ 270 (+629.73%)
Fakenewscorpus: A dataset of millions of news articles scraped from a curated list of data sources.
Stars: ✭ 255 (+589.19%)
Indian ParallelCorpus: Curated list of publicly available parallel corpora for Indian languages
Stars: ✭ 23 (-37.84%)
wordfish-python: Extract relationships between standardized terms from a corpus of interest with deep learning 🐟
Stars: ✭ 19 (-48.65%)
fastmorph: Fast corpus search engine originally made for the Corpus of Written Tatar
Stars: ✭ 14 (-62.16%)
open-discourse: Open Discourse is the first fully comprehensive corpus of the plenary proceedings of the federal German Parliament (Bundestag).
Stars: ✭ 47 (+27.03%)
DeepSentiPers: Repository for the experiments described in the paper "DeepSentiPers: Novel Deep Learning Models Trained Over Proposed Augmented Persian Sentiment Corpus"
Stars: ✭ 17 (-54.05%)
dialogue-datasets: Collects open dialogue corpora and some useful data-processing utilities.
Stars: ✭ 24 (-35.14%)
Filipino-Text-Benchmarks: Open-source benchmark datasets and pretrained transformer models in the Filipino language.
Stars: ✭ 22 (-40.54%)
SpiCE-Corpus: An open-access corpus of conversational bilingual speech in Cantonese and English
Stars: ✭ 33 (-10.81%)
OpenDialog: An open-source package for Chinese open-domain conversational chatbots (a Chinese chit-chat dialogue system with one-click deployment of a WeChat chit-chat bot)
Stars: ✭ 94 (+154.05%)
Text classification: All kinds of text classification models and more with deep learning
Stars: ✭ 7,179 (+19302.7%)
Dense BiLSTM: TensorFlow implementation of Densely Connected Bidirectional LSTM with applications to sentence classification
Stars: ✭ 48 (+29.73%)