trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Stars: ✭ 711 (+464.29%)
malay-dataset
Text corpus for Bahasa Malaysia, https://malaya.readthedocs.io/en/latest/Dataset.html
Stars: ✭ 189 (+50%)
Friend.ly
A social media platform with a friend recommendation engine based on personality trait extraction
Stars: ✭ 41 (-67.46%)
Coarij
Corpus of Annual Reports in Japan
Stars: ✭ 55 (-56.35%)
Lda Topic Modeling
A PureScript, browser-based implementation of LDA topic modeling.
Stars: ✭ 91 (-27.78%)
Datasets
Poetry-related datasets developed by the THUAIPoet (Jiuge) group.
Stars: ✭ 111 (-11.9%)
Orange3 Text
🍊 📄 Text Mining add-on for Orange3
Stars: ✭ 83 (-34.13%)
Nlppln
NLP pipeline software using the Common Workflow Language
Stars: ✭ 31 (-75.4%)
Pipeit
PipeIt is a text transformation, conversion, cleansing and extraction tool.
Stars: ✭ 57 (-54.76%)
Colibri Core
Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e., patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.
Stars: ✭ 112 (-11.11%)
Tidytext
Text mining using tidy tools ✨📄✨
Stars: ✭ 975 (+673.81%)
Lexicon
A data package containing lexicons and dictionaries for text analysis
Stars: ✭ 87 (-30.95%)
Typing Assistant
Typing Assistant autocompletes words and suggests predictions for the next word, which makes typing faster and more intelligent and reduces effort.
Stars: ✭ 32 (-74.6%)
Scattertext
Beautiful visualizations of how language differs among document types.
Stars: ✭ 1,722 (+1266.67%)
Lyrics Corpora
An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and Billboard charts
Stars: ✭ 13 (-89.68%)
Ja.text8
Japanese text8 corpus for word embedding.
Stars: ✭ 79 (-37.3%)
Pansori
Tools for ASR Corpus Generation from Online Video
Stars: ✭ 106 (-15.87%)
Autophrase
AutoPhrase: Automated Phrase Mining from Massive Text Corpora
Stars: ✭ 835 (+562.7%)
Python nlp tutorial
This repository provides everything to get started with Python for Text Mining / Natural Language Processing (NLP)
Stars: ✭ 72 (-42.86%)
Nlp In Practice
Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
Stars: ✭ 790 (+526.98%)
Konlpy
Python package for Korean natural language processing.
Stars: ✭ 1,098 (+771.43%)
Text predictor
Char-level RNN LSTM text generator 📄.
Stars: ✭ 99 (-21.43%)
Ngram
Fast n-Gram Tokenization
Stars: ✭ 55 (-56.35%)
Textcluster
Preprocessing module for short-text clustering
Stars: ✭ 115 (-8.73%)
Spark Nkp
Natural Korean Processor for Apache Spark
Stars: ✭ 50 (-60.32%)
Tadw
An implementation of "Network Representation Learning with Rich Text Information" (IJCAI '15).
Stars: ✭ 43 (-65.87%)
Gsoc2018 3gm
💫 Automated codification of Greek Legislation with NLP
Stars: ✭ 36 (-71.43%)
Pyclue
Python toolkit for the Chinese Language Understanding Evaluation (CLUE) benchmark
Stars: ✭ 91 (-27.78%)
Metasra Pipeline
MetaSRA: normalized sample-specific metadata for the Sequence Read Archive
Stars: ✭ 33 (-73.81%)
Genius
Easily access song lyrics from Genius in a tibble.
Stars: ✭ 111 (-11.9%)
R Text Data
List of textual data sources to be used for text mining in R
Stars: ✭ 85 (-32.54%)
Tidy Text Mining
Manuscript of the book "Tidy Text Mining with R" by Julia Silge and David Robinson
Stars: ✭ 961 (+662.7%)
Dialog corpus
Corpora for training Chinese and English dialogue (chatbot) systems
Stars: ✭ 1,662 (+1219.05%)
Spider
A configurable web spider with an easy-to-use web console
Stars: ✭ 954 (+657.14%)
Dataset List
Lists of text corpora and more (mainly Japanese)
Stars: ✭ 84 (-33.33%)
Ua Gec
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Stars: ✭ 108 (-14.29%)
Bagofconcepts
Python implementation of bag-of-concepts
Stars: ✭ 18 (-85.71%)
Russian news corpus
Corpus of lemmatized (morphologically normalized) texts from Russian mass media
Stars: ✭ 76 (-39.68%)
Naive Bayes Classifier
A Naive Bayes classifier is a classification algorithm. This one uses the Bernoulli and Multinomial Naive Bayes models to classify text documents as ham or spam.
Stars: ✭ 6 (-95.24%)
Sejong Corpus
Korean Sejong corpus download and simple analysis
Stars: ✭ 116 (-7.94%)
Rake Nltk
Python implementation of the Rapid Automatic Keyword Extraction algorithm using NLTK.
Stars: ✭ 793 (+529.37%)
Blacklab
A corpus retrieval engine based on Apache Lucene
Stars: ✭ 69 (-45.24%)
Learning Social Media Analytics With R
This repository contains code and bonus content which will be added from time to time for the book "Learning Social Media Analytics with R" by Packt
Stars: ✭ 102 (-19.05%)
Seq2seq Chatbot
Chatbot in 200 lines of code using TensorLayer
Stars: ✭ 777 (+516.67%)
Pyphonetics
A Python 3 phonetics library.
Stars: ✭ 61 (-51.59%)
Nlp chinese corpus
Large-scale Chinese corpus for NLP
Stars: ✭ 6,656 (+5182.54%)
Text2vec
Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
Stars: ✭ 715 (+467.46%)
Cogcomp Nlpy
CogComp's light-weight Python NLP annotators
Stars: ✭ 115 (-8.73%)