textvec / Textvec

Licence: MIT
Text vectorization tool to outperform TFIDF for classification tasks

WHAT: Supervised text vectorization tool

Textvec is a text vectorization tool that aims to implement all the "classic" text vectorization NLP methods in Python. The main idea of this project is to show alternatives to the excellent TFIDF method, which is highly overused for supervised tasks. All interfaces are similar to scikit-learn, so you should be able to test the performance of these supervised methods with just a few changes.

Textvec is compatible with: Python 2.7-3.7.


WHY: Comparison with TFIDF

As reported in several articles [1, 2], supervised methods outperform unsupervised ones on almost every dataset, yet most text classification examples on the internet ignore that fact.

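Method   | IMDB_bin | RT_bin | Airlines Sentiment_bin | Airlines Sentiment_multiclass | 20news_multiclass
---------|----------|--------|------------------------|-------------------------------|------------------
TF       | 0.8984   | 0.7571 | 0.9194                 | 0.8084                        | 0.8206
TFIDF    | 0.9052   | 0.7717 | 0.9259                 | 0.8118                        | 0.8575
TFPF     | 0.8813   | 0.7403 | 0.9212                 | NA                            | NA
TFRF     | 0.8797   | 0.7412 | 0.9194                 | NA                            | NA
TFICF    | 0.8984   | 0.7642 | 0.9199                 | 0.8125                        | 0.8292
TFBINICF | 0.8984   | 0.7571 | 0.9194                 | NA                            | NA
TFCHI2   | 0.8898   | 0.7398 | 0.9108                 | NA                            | NA
TFGR     | 0.8850   | 0.7065 | 0.8956                 | NA                            | NA
TFRRF    | 0.8879   | 0.7506 | 0.9194                 | NA                            | NA
TFOR     | 0.9092   | 0.7806 | 0.9207                 | NA                            | NA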

Here is a comparison for binary classification on the IMDB sentiment dataset. Labels are sorted by accuracy score, and the heatmap shows the correlation between the different approaches. As you can see, some methods are good candidates for ensembling models or for feature selection.

[Figure: Binary comparison]

For more dataset benchmarks (rotten tomatoes, airline sentiment) see Binary classification quality comparison
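
To make the difference between supervised and unsupervised weighting concrete, here is a minimal sketch in plain Python. It is not textvec's implementation: the function names `idf_weights` and `icf_weights` and the toy corpus are made up for illustration, and the ICF formula used here, log(1 + C / cf), is only one common formulation that may differ from what `TfIcfVectorizer` actually computes.

```python
import math
from collections import defaultdict


def idf_weights(docs):
    """Unsupervised weight: log(N / df). Class labels are never consulted."""
    n_docs = len(docs)
    doc_freq = defaultdict(int)
    for tokens in docs:
        for term in set(tokens):
            doc_freq[term] += 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def icf_weights(docs, labels):
    """Supervised weight: log(1 + C / cf), where cf is the number of
    classes that have at least one document containing the term.
    This formula is an assumption, not taken from textvec."""
    classes_with_term = defaultdict(set)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            classes_with_term[term].add(label)
    n_classes = len(set(labels))
    return {
        term: math.log(1 + n_classes / len(classes))
        for term, classes in classes_with_term.items()
    }


# Toy corpus: "great" and "film" each occur in two of the four documents,
# but "great" occurs only in positive reviews.
docs = [
    "great film great cast".split(),
    "great plot".split(),
    "boring film".split(),
    "boring plot weak cast".split(),
]
labels = ["pos", "pos", "neg", "neg"]

idf = idf_weights(docs)
icf = icf_weights(docs, labels)

print("idf:", idf["great"], idf["film"])  # identical document frequency
print("icf:", icf["great"], icf["film"])  # "great" is tied to a single class
```

IDF cannot tell the two terms apart because it only counts documents, while the label-aware weight ranks the class-specific term "great" above the class-neutral term "film".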


Install:

From PyPI:

pip install textvec

Source code:

git clone https://github.com/textvec/textvec
cd textvec
pip install .

HOW: Examples

The usage is similar to scikit-learn:

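from sklearn.feature_extraction.text import CountVectorizer
from textvec.vectorizers import TfBinIcfVectorizer

# train_data is assumed to be a DataFrame-like object with a `text` column;
# y holds the class label of each training document.
cvec = CountVectorizer().fit(train_data.text)

# Supervised vectorizers are fitted on the count matrix together with the labels.
tficf_vec = TfBinIcfVectorizer(sublinear_tf=True)
tficf_vec.fit(cvec.transform(train_data.text), y)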

For more detailed examples see Basic example and other notebooks in Examples
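
The reason `fit` takes labels while `transform` does not can be sketched with a small stand-in class. This is a hypothetical toy, not textvec code: `ToyTfRfVectorizer`, the three-word vocabulary, and the count rows are invented for illustration, and the weight log2(2 + a / max(1, c)) follows one published formulation of relevance frequency for binary labels, which may differ from what `TfrfVectorizer` does internally.

```python
import math


class ToyTfRfVectorizer:
    """Hypothetical stand-in (not part of textvec) that mirrors the
    fit(counts, y) / transform(counts) contract for binary labels."""

    def fit(self, counts, y):
        n_terms = len(counts[0])
        pos_docs = [0] * n_terms  # positive documents containing each term
        neg_docs = [0] * n_terms  # negative documents containing each term
        for row, label in zip(counts, y):
            for j, tf in enumerate(row):
                if tf > 0:
                    if label == 1:
                        pos_docs[j] += 1
                    else:
                        neg_docs[j] += 1
        # Relevance frequency: terms concentrated in the positive class
        # receive a larger weight.
        self.weights_ = [
            math.log2(2 + a / max(1, c)) for a, c in zip(pos_docs, neg_docs)
        ]
        return self

    def transform(self, counts):
        # No labels needed here: the learned weights rescale raw counts.
        return [[tf * w for tf, w in zip(row, self.weights_)] for row in counts]


# Columns correspond to the assumed vocabulary ["boring", "film", "great"].
train_counts = [
    [0, 1, 2],  # "great film great"  -> positive
    [0, 0, 1],  # "great"             -> positive
    [1, 1, 0],  # "boring film"       -> negative
    [1, 0, 0],  # "boring"            -> negative
]
y = [1, 1, 0, 0]

vec = ToyTfRfVectorizer().fit(train_counts, y)
features = vec.transform([[0, 1, 1]])  # unseen document "great film"
print(vec.weights_)
print(features)
```

Labels are only needed once, to learn the per-term weights on the training set. Unseen documents are then transformed with those fixed weights, which is why the vectorizer can be applied to test data whose labels are unknown.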

Currently implemented methods:

  • TfIcfVectorizer
  • TforVectorizer
  • TfgrVectorizer
  • TfigVectorizer
  • Tfchi2Vectorizer
  • TfrfVectorizer
  • TfrrfVectorizer
  • TfBinIcfVectorizer
  • TfpfVectorizer
  • SifVectorizer
  • TfbnsVectorizer

Most of these vectorization techniques are described in articles [1, 2, 3]. If you see a method with a wrong name or reference, please contribute a fix!


TODO

  • [ ] Docs

REFERENCE
