stepthom / Text_mining_resources

Resources for learning about Text Mining and Natural Language Processing

Uncle Steve's Big List of Text Analytics and NLP Resources

 ____ ____ ____ ____ _________ ____ ____ ____ ____ ____ ____ 
||t |||e |||x |||t |||       |||m |||i |||n |||i |||n |||g ||
||__|||__|||__|||__|||_______|||__|||__|||__|||__|||__|||__||
|/__\|/__\|/__\|/__\|/_______\|/__\|/__\|/__\|/__\|/__\|/__\|

A curated list of resources for learning about natural language processing, text analytics, and unstructured data.

Table of Contents

Books

R

Python

General

Blogs

Blog Articles, Papers, Case Studies

General

Biases in NLP

Scraping

Cleaning

Stop Words

Stemming

Dimensionality Reduction

Sarcasm Detection

Document Classification

Entity and Information Extraction

Document Clustering and Document Similarity

Concept Analysis/Topic Modeling

Sentiment Analysis

Methods

Challenges

Politics

Stock Market

Applications

Tools and Technology

Text Summarization

Machine Translation

Q&A Systems, Chatbots

Fuzzy Matching, Probabilistic Matching, Record Linkage, Etc.

Word and Document Embeddings

Deep Learning

Capsule Networks

Knowledge Graphs

NLP Conferences

Benchmarks

  • SQuAD leaderboard. A list of the strongest-performing NLP models on the Stanford Question Answering Dataset (SQuAD).
    • SQuAD 1.0 paper (Last updated October 2016). SQuAD v1.1 includes over 100,000 question and answer pairs based on Wikipedia articles.
    • SQuAD 2.0 paper (October 2018). The second generation of SQuAD adds unanswerable questions: the model must recognize when the given passage does not support any answer, and abstain instead of guessing.
  • GLUE leaderboard.
    • GLUE paper (September 2018). A collection of nine NLP tasks including single-sentence tasks (e.g. check if grammar is correct, sentiment analysis), similarity and paraphrase tasks (e.g. determine if two questions are equivalent), and inference tasks (e.g. determine whether a premise contradicts a hypothesis).
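As a rough illustration of how unanswerable questions change scoring, here is a minimal sketch of an exact-match metric in which an empty list of gold answers means "unanswerable" and the model is expected to return an empty string. The function name, the example records, and the light normalization are assumptions made for illustration; the official SQuAD evaluation script normalizes more aggressively (for example, it also strips punctuation and articles).

```python
def exact_match(prediction, gold_answers):
    """Return 1 if the prediction matches a gold answer, else 0.

    An empty gold_answers list marks the question as unanswerable,
    in which case only an empty prediction counts as correct.
    """
    def normalize(text):
        # Simplified normalization: lowercase and collapse whitespace.
        return " ".join(text.lower().split())

    if not gold_answers:
        return int(normalize(prediction) == "")
    return int(any(normalize(prediction) == normalize(g) for g in gold_answers))


# Hypothetical predictions, not taken from the real dataset.
examples = [
    {"gold": ["Denver Broncos"], "pred": "denver  broncos"},  # answerable, correct
    {"gold": [], "pred": ""},                                  # unanswerable, model abstains
    {"gold": [], "pred": "Paris"},                             # unanswerable, model guesses
]

scores = [exact_match(e["pred"], e["gold"]) for e in examples]
em = sum(scores) / len(scores)
print(f"Exact match: {em:.2f}")
```

Under this scheme a model that always guesses is penalized on every unanswerable question, which is the behavior SQuAD 2.0 was designed to expose.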

Online courses

Udemy

Stanford

Coursera

DataCamp

Others

APIs and Libraries

  • R packages
    • tm: Text Mining.
    • lsa: Latent Semantic Analysis.
    • lda: Collapsed Gibbs Sampling Methods for Topic Models.
    • textir: Inverse Regression for Text Analysis.
    • corpora: Statistics and data sets for corpus frequency data.
    • tau: Text Analysis Utilities.
    • tidytext: Text mining using dplyr, ggplot2, and other tidy tools.
    • Sentiment140: Sentiment text analysis
    • sentimentr: Lexicon-based sentiment analysis.
    • cleanNLP: A tidy data model for natural language processing.
    • RSentiment: Lexicon-based sentiment analysis. Contains support for negation detection and sarcasm.
    • text2vec: Fast and memory-friendly tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), similarities.
    • fastTextR: Interface to the fastText library.
    • LDAvis: Interactive visualization of topic models.
    • keras: Interface to Keras, a high-level neural networks 'API'. (RStudio Blog: TensorFlow for R)
    • rtweet: Client for accessing Twitter’s REST and stream APIs. (21 Recipes for Mining Twitter Data with rtweet)
    • topicmodels: Interface to the C code for Latent Dirichlet Allocation (LDA).
    • textmineR: Aid for text mining in R, with a syntax that should be familiar to experienced R users.
    • wordVectors: Creating and exploring word2vec and other word embedding models.
    • gtrendsR: Interface for retrieving and displaying the information returned online by Google Trends.
    • textstem: Tools that stem and lemmatize text.
    • NLPutils: Utilities for Natural Language Processing.
    • udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing using UDPipe.
  • Python modules
    • NLTK: Natural Language Toolkit.
    • scikit-learn: Machine Learning in Python
    • spaCy: Industrial-Strength Natural Language Processing in Python.
    • textblob: Simplified Text processing.
    • Gensim: Topic Modeling for humans.
    • Pattern.en: A fast part-of-speech tagger for English, sentiment analysis, tools for English verb conjugation and noun singularization & pluralization, and a WordNet interface.
    • textmining: Python Text Mining utilities.
    • Scrapy: Open source and collaborative framework for extracting the data you need from websites.
    • lda2vec: Tools for interpreting natural language.
    • PyText: A deep-learning based NLP modeling framework built on PyTorch.
    • sent2vec: General purpose unsupervised sentence representations.
    • flair: A very simple framework for state-of-the-art Natural Language Processing (NLP)
    • word_forms: Accurately generate all possible forms of an English word, e.g. "election" --> "elect", "electoral", "electorate", etc.
    • AllenNLP: Open-source NLP research library, built on PyTorch.
    • Beautiful Soup: Parse HTML and XML documents. Useful for webscraping.
    • BigARTM: Fast topic modeling platform.
    • Scattertext: Beautiful visualizations of how language differs among document types.
    • embeddings: Pretrained word embeddings in Python.
    • fastText: Library for efficient learning of word representations and sentence classification.
    • Google Seq2Seq: A general-purpose encoder-decoder framework for Tensorflow that can be used for Machine Translation, Text Summarization, Conversational Modeling, Image Captioning, and more.
    • polyglot: A natural language pipeline that supports multilingual applications.
    • textacy: NLP, before and after spaCy
    • Glove-Python: A “toy” implementation of GloVe in Python. Includes a paragraph embedder.
    • Bert As A Service: Client/server package for sentence encoding, i.e. mapping a variable-length sentence to a fixed-length vector. Designed to provide a scalable, production-ready service while also letting researchers apply BERT quickly.
    • Keras-BERT: A Keras Implementation of BERT
    • Paragraph embedding scripts and Pre-trained models: Scripts for training and testing paragraph vectors, with links to some pre-trained Doc2Vec and Word2Vec models
    • Texthero: Text preprocessing, representation and visualization from zero to hero.
  • Apache Tika: A content analysis toolkit.
  • Apache Spark: A fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.
    • MLlib: Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. For NLP, it provides methods for LDA, Word2Vec, and TFIDF.
    • LDA: latent Dirichlet allocation
    • Word2Vec: An Estimator that takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel transforms each document into a vector by averaging the vectors of all words in the document.
    • TFIDF: term frequency-inverse document frequency
  • HDF5: an open source file format that supports large, complex, heterogeneous data. Requires no configuration.
    • h5py: Python HDF5 package
  • Stanford CoreNLP: a suite of core NLP tools
  • Stanford Parser: A probabilistic natural language parser.
  • Stanford POS Tagger: A Parts-of-Speech tagger.
  • Stanford Named Entity Recognizer: Recognizes proper nouns (things, places, organizations) and labels them as such.
  • Stanford Classifier: A softmax classifier.
  • Stanford OpenIE: Extracts relationships between words in a sentence (e.g. Mark Zuckerberg; founded; Facebook).
  • Stanford Topic Modeling Toolbox
  • MALLET: MAchine Learning for LanguagE Toolkit
  • Apache OpenNLP: Machine learning based toolkit for text NLP.
  • Streamcrab: Real-time Twitter sentiment analyzer engine. http://www.streamcrab.com
  • TextRazor API: Extract Meaning from your Text.
  • fastText. Library for fast text representation and classification. Facebook.
  • Comparison of Top 6 Python NLP Libraries.
  • PyCaret's NLP Module. PyCaret is an open source, low-code machine learning library in Python that aims to reduce the cycle time from hypothesis to insights. PyCaret's founder, Moez Ali, is a Smith alumnus (MMA 2020).
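Two of the ideas mentioned in the list above, TF-IDF weighting and turning a document into a vector by averaging its word vectors, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy corpus, the two-dimensional word vectors, and the function names are made up, and the IDF used here is the textbook log(N / df). Libraries such as Spark MLlib and scikit-learn typically use smoothed variants and should be preferred for real work.

```python
import math
from collections import Counter

# Toy corpus: each document is already tokenized (hypothetical data).
docs = [
    ["text", "mining", "with", "python"],
    ["topic", "modeling", "with", "lda"],
    ["text", "classification", "with", "python"],
]


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


def average_vector(doc, word_vectors):
    """Average the vectors of the words in doc, skipping unknown words."""
    known = [word_vectors[w] for w in doc if w in word_vectors]
    if not known:
        return []
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Hypothetical 2-dimensional word vectors; real embeddings are learned from data.
word_vectors = {"text": [1.0, 0.0], "mining": [0.0, 1.0], "python": [1.0, 1.0]}

weights = tf_idf(docs)
doc_vec = average_vector(docs[0], word_vectors)

print(weights[0])
print(doc_vec)
```

In this sketch a term that appears in every document (here "with") gets a weight of zero, while rarer terms such as "mining" score higher. Words without a vector are simply skipped when averaging, which is one common but not universal design choice.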

Products

Cloud

Getting Data out of PDFs

Online Demos and Tools

Datasets

Lexicons for Sentiment Analysis

Misc

Meta

Other Curated Lists

Contribute

Contributions are more than welcome! Please read the contribution guidelines first.

License

CC0

To the extent possible under law, @stepthom has waived all copyright and related or neighboring rights to this work.
