Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e., patterns with one or more gaps, of either fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.

Stars: ✭ 112 (+460%)

Mutual labels: corpus

tvsub

TVsub: DCU-Tencent Chinese-English Dialogue Corpus

Stars: ✭ 40 (+100%)

Mutual labels: corpus

text-classification-cn

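Chinese text classification in practice, based on the Sogou news corpus, using traditional machine learning methods as well as pre-trained models.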

Stars: ✭ 81 (+305%)

Mutual labels: corpus

proiel-treebank

Official releases of the PROIEL treebank of ancient Indo-European languages

Stars: ✭ 30 (+50%)

Mutual labels: corpus

lucene-geo-gazetteer

Uses Apache Lucene, OpenNLP and GeoNames to extract locations from text and geocode them.

Stars: ✭ 34 (+70%)

Mutual labels: opennlp

Probabilistic-RNN-DA-Classifier

Probabilistic Dialogue Act Classification for the Switchboard Corpus using an LSTM model

Stars: ✭ 22 (+10%)

Mutual labels: corpus

Datasets

Poetry-related datasets developed by THUAIPoet (Jiuge) group.

Stars: ✭ 111 (+455%)

Mutual labels: corpus

german-nouns

A list of ~100,000 German nouns and their grammatical properties, compiled from WiktionaryDE as a CSV file, plus a module to look up the data and parse compound words.

Stars: ✭ 101 (+405%)

Mutual labels: corpus

bible-corpus

A multilingual parallel corpus created from translations of the Bible.

Stars: ✭ 115 (+475%)

Mutual labels: corpus

Dialogue-Corpus

No description or website provided.

Stars: ✭ 27 (+35%)

Mutual labels: corpus

BSD

The Business Scene Dialogue corpus

Stars: ✭ 51 (+155%)

Mutual labels: corpus

Awesome Deeplearning Resources

Deep learning and deep reinforcement learning research papers and some code

Stars: ✭ 2,483 (+12315%)

Mutual labels: corpus

CBLUE

CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark (a benchmark for Chinese medical information processing)

Stars: ✭ 379 (+1795%)

Mutual labels: corpus

Nlvr

Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.

Stars: ✭ 192 (+860%)

Mutual labels: corpus

textbox

Text collections made available by the CLiGS group.

Stars: ✭ 19 (-5%)

Mutual labels: corpus

Nlp bahasa resources

A curated list of datasets and usable library resources for NLP in Bahasa Indonesia

Stars: ✭ 158 (+690%)

Mutual labels: corpus

cljs-corpus

A greppable archive of ClojureScript code

Stars: ✭ 37 (+85%)

Mutual labels: corpus

Wp2txt

WP2TXT extracts plain text data from Wikipedia dump files (encoded in XML and compressed with Bzip2), stripping all MediaWiki markup and other metadata.

Stars: ✭ 145 (+625%)

Mutual labels: corpus

open2ch-dialogue-corpus

A dialogue corpus created by crawling Open2ch (おーぷん2ちゃんねる)

Stars: ✭ 65 (+225%)

Mutual labels: corpus

Prosody

Helsinki Prosody Corpus and a system for predicting prosodic prominence from text

Stars: ✭ 139 (+595%)

Mutual labels: corpus

toSkoy

An app that converts Thai into Skoy language (latest version). In plain English: a one-way encryption algorithm for the Thai language, which only Thai people could understand.

Stars: ✭ 52 (+160%)

Mutual labels: thai-language

Code Docstring Corpus

Preprocessed Python functions and docstrings for automated code documentation (code2doc) and automated code generation (doc2code) tasks.

Stars: ✭ 137 (+585%)

Mutual labels: corpus

nytwit

New York Times Word Innovation Types dataset

Stars: ✭ 21 (+5%)

Mutual labels: corpus

Khcoder

KH Coder: for Quantitative Content Analysis or Text Mining