Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.

Stars: ✭ 112 (-11.11%)

Mutual labels: corpus

Tidytext

Text mining using tidy tools ✨📄✨

Stars: ✭ 975 (+673.81%)

Mutual labels: text-mining

Lexicon

A data package containing lexicons and dictionaries for text analysis

Stars: ✭ 87 (-30.95%)

Mutual labels: text-mining

Typing Assistant

Typing Assistant autocompletes words and suggests predictions for the next word. This makes typing faster and more intelligent, and reduces effort.

Stars: ✭ 32 (-74.6%)

Mutual labels: corpus

Scattertext

Beautiful visualizations of how language differs among document types.

Stars: ✭ 1,722 (+1266.67%)

Mutual labels: text-mining

Lyrics Corpora

An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and billboard charts

Stars: ✭ 13 (-89.68%)

Mutual labels: corpus

Ja.text8

Japanese text8 corpus for word embedding.

Stars: ✭ 79 (-37.3%)

Mutual labels: corpus

Pansori

Tools for ASR Corpus Generation from Online Video

Stars: ✭ 106 (-15.87%)

Mutual labels: corpus

Autophrase

AutoPhrase: Automated Phrase Mining from Massive Text Corpora

Stars: ✭ 835 (+562.7%)

Mutual labels: text-mining

Python nlp tutorial

This repository provides everything to get started with Python for Text Mining / Natural Language Processing (NLP)

Stars: ✭ 72 (-42.86%)

Mutual labels: text-mining

Insuranceqa Corpus Zh

🚁 Insurance industry corpus, chatbot

Stars: ✭ 821 (+551.59%)

Mutual labels: corpus

Nlp In Practice

Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.

Stars: ✭ 790 (+526.98%)

Mutual labels: text-mining

Konlpy

Python package for Korean natural language processing.

Stars: ✭ 1,098 (+771.43%)

Mutual labels: text-mining

Text predictor

Char-level RNN LSTM text generator📄.

Stars: ✭ 99 (-21.43%)

Mutual labels: text-mining

Ngram

Fast n-Gram Tokenization

Stars: ✭ 55 (-56.35%)

Mutual labels: text-mining

Textcluster

Short text clustering preprocessing module

Stars: ✭ 115 (-8.73%)

Mutual labels: text-mining

Spark Nkp

Natural Korean Processor for Apache Spark

Stars: ✭ 50 (-60.32%)

Mutual labels: text-mining

Chi Corpus

Mr. Chi's corpus

Stars: ✭ 96 (-23.81%)

Mutual labels: corpus

Tadw

An implementation of "Network Representation Learning with Rich Text Information" (IJCAI '15).

Stars: ✭ 43 (-65.87%)

Mutual labels: text-mining

Keywords2vec

Stars: ✭ 121 (-3.97%)

Mutual labels: text-mining

Gsoc2018 3gm

💫 Automated codification of Greek Legislation with NLP

Stars: ✭ 36 (-71.43%)

Mutual labels: text-mining

Pyclue

Python toolkit for the Chinese Language Understanding Evaluation (CLUE) benchmark

Stars: ✭ 91 (-27.78%)

Mutual labels: corpus

Metasra Pipeline

MetaSRA: normalized sample-specific metadata for the Sequence Read Archive

Stars: ✭ 33 (-73.81%)

Mutual labels: text-mining

Genius

Easily access song lyrics from Genius in a tibble.

Stars: ✭ 111 (-11.9%)

Mutual labels: text-mining

Chatterbot Corpus

A multilingual dialog corpus

Stars: ✭ 964 (+665.08%)

Mutual labels: corpus

R Text Data

List of textual data sources to be used for text mining in R

Stars: ✭ 85 (-32.54%)

Mutual labels: text-mining

Tidy Text Mining

Manuscript of the book "Tidy Text Mining with R" by Julia Silge and David Robinson

Stars: ✭ 961 (+662.7%)

Mutual labels: text-mining

Dialog corpus

Datasets for training Chinese and English chatbot (dialogue) systems

Stars: ✭ 1,662 (+1219.05%)

Mutual labels: corpus

Spider

A configurable web spider with an easy-to-use web console

Stars: ✭ 954 (+657.14%)

Mutual labels: text-mining

Dataset List

Lists of text corpora and more (mainly Japanese)

Stars: ✭ 84 (-33.33%)

Mutual labels: corpus

Company Names Corpus

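Company name corpus. Organization name corpus. Company short names, abbreviations, brand terms, and enterprise names. Can be used for Chinese word segmentation and organization named-entity recognition.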

Stars: ✭ 868 (+588.89%)

Mutual labels: corpus

Ua Gec

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

Stars: ✭ 108 (-14.29%)

Mutual labels: corpus

Bagofconcepts

Python implementation of bag-of-concepts

Stars: ✭ 18 (-85.71%)

Mutual labels: text-mining

Russian news corpus

Russian mass media stemmed texts corpus (a corpus of lemmatized, morphologically normalized texts from Russian mass media)

Stars: ✭ 76 (-39.68%)

Mutual labels: corpus

Naive Bayes Classifier

The Naive Bayes classifier is a classification algorithm. It uses the Bernoulli and Multinomial Naive Bayes models to classify text documents as ham or spam.

Stars: ✭ 6 (-95.24%)

Mutual labels: corpus

Sejong Corpus

Korean sejong corpus download and simple analysis

Stars: ✭ 116 (-7.94%)

Mutual labels: corpus

Rake Nltk

Python implementation of the Rapid Automatic Keyword Extraction algorithm using NLTK.

Stars: ✭ 793 (+529.37%)

Mutual labels: text-mining

Blacklab

A corpus retrieval engine based on Apache Lucene