Alternatives and detailed information of text-classification-baseline

dayyass / text-classification-baseline

Licence: MIT license

Pipeline for quickly building TF-IDF + LogReg text classification baselines.

Programming Languages

python

139335 projects - #7 most used programming language

Makefile

30231 projects

Dockerfile

14818 projects

Projects that are alternatives of or similar to text-classification-baseline

text-classification-cn

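Chinese text classification in practice, based on the Sogou news corpus, using traditional machine learning methods as well as pre-trained models.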

Stars: ✭ 81 (+47.27%)

Mutual labels: text-classification, logistic-regression, tf-idf

Nlp In Practice

Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.

Stars: ✭ 790 (+1336.36%)

Mutual labels: text-classification, tf-idf

Nlp Recipes

Natural Language Processing Best Practices & Examples

Stars: ✭ 5,783 (+10414.55%)

Mutual labels: text-classification, text

Text classification

Text Classification Algorithms: A Survey

Stars: ✭ 1,276 (+2220%)

Mutual labels: text-classification, logistic-regression

FNet-pytorch

Unofficial implementation of Google's FNet: Mixing Tokens with Fourier Transforms

Stars: ✭ 204 (+270.91%)

Mutual labels: text-classification, text

text2class

Multi-class text categorization using state-of-the-art pre-trained contextualized language models, e.g. BERT

Stars: ✭ 15 (-72.73%)

Mutual labels: text-classification, text

Nlp Experiments In Pytorch

PyTorch repository for text categorization and NER experiments in Turkish and English.

Stars: ✭ 35 (-36.36%)

Mutual labels: text-classification, text

Textclassification

several methods for text classification

Stars: ✭ 180 (+227.27%)

Mutual labels: logistic-regression, tf-idf

Textvec

Text vectorization tool to outperform TFIDF for classification tasks

Stars: ✭ 167 (+203.64%)

Mutual labels: text-classification, tf-idf

Fake news detection

Fake News Detection in Python

Stars: ✭ 194 (+252.73%)

Mutual labels: text-classification, logistic-regression

Utox

µTox the lightest and fluffiest Tox client

Stars: ✭ 820 (+1390.91%)

Mutual labels: fast, text

Text-Classification-LSTMs-PyTorch

The aim of this repository is to show a baseline model for text classification by implementing a LSTM-based model coded in PyTorch. In order to provide a better understanding of the model, it will be used a Tweets dataset provided by Kaggle.

Stars: ✭ 45 (-18.18%)

Mutual labels: text-classification, baseline

Kevinpro-NLP-demo

All NLP you Need Here. A personal collection of fun NLP demos; currently includes PyTorch implementations of 13 NLP applications.

Stars: ✭ 117 (+112.73%)

Mutual labels: text-classification, baseline

Artificial Adversary

🗣️ Tool to generate adversarial text examples and test machine learning models against them

Stars: ✭ 348 (+532.73%)

Mutual labels: text-classification, text

baseline

New method for creating leading on the web

Stars: ✭ 31 (-43.64%)

Mutual labels: text, baseline

Text Classification Benchmark

Text classification benchmark.

Stars: ✭ 18 (-67.27%)

Mutual labels: text-classification, logistic-regression

Nlp Pretrained Model

A collection of Natural language processing pre-trained models.

Stars: ✭ 122 (+121.82%)

Mutual labels: text-classification, text

bns-short-text-similarity

📖 Use Bi-normal Separation to find document vectors which is used to compute similarity for shorter sentences.

Stars: ✭ 24 (-56.36%)

Mutual labels: text-classification, tf-idf

Nepali-News-Classifier

Text Classification of Nepali Language Document. This Mini Project was done for the partial fulfillment of NLP Course : COMP 473.

Stars: ✭ 13 (-76.36%)

Mutual labels: text-classification, tf-idf

probabilistic nlg

Tensorflow Implementation of Stochastic Wasserstein Autoencoder for Probabilistic Sentence Generation (NAACL 2019).

Stars: ✭ 28 (-49.09%)

Mutual labels: text


Text Classification Baseline

Pipeline for quickly building text classification baselines with TF-IDF + LogReg.

Usage

Instead of writing custom code for a specific text classification task, you only need to:

install pipeline:

pip install text-classification-baseline

run pipeline:

either in terminal:

text-clf-train --path_to_config config.yaml

or in python:

import text_clf

model, target_names_mapping = text_clf.train(path_to_config="config.yaml")

NOTE: the config file is described in the Config section.

No data preparation is needed, only a CSV file with two raw columns (with arbitrary names):

text
target

The target can be in any format, including text; it does not have to be integers from 0 to n_classes-1.
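
As a rough illustration of the expected input, the sketch below writes a tiny two-column CSV with the Python standard library and reads it back. The file name train.csv, the column names review and sentiment, and the rows themselves are made up for this example; the chosen column names would then be set as text_column and target_column in the config.

```python
import csv
import os
import tempfile

# Hypothetical rows: one raw text column and one raw target column.
rows = [
    {"review": "Great product, works as described.", "sentiment": "positive"},
    {"review": "Broke after two days.", "sentiment": "negative"},
    {"review": "Does the job, nothing special.", "sentiment": "neutral"},
]

# Write to a temporary folder to avoid touching the working directory.
csv_path = os.path.join(tempfile.mkdtemp(), "train.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["review", "sentiment"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back to confirm the layout: a header plus one row per document.
with open(csv_path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(loaded[0]["sentiment"])
```

Note that the targets here are plain strings, which matches the statement above that integer labels are not required.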

Config

The user interface consists of two files:

config.yaml - general configuration with sklearn TF-IDF and LogReg parameters
hyperparams.py - sklearn GridSearchCV parameters
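
For orientation, a grid for sklearn's GridSearchCV is a dictionary mapping parameter names to lists of candidate values, with pipeline parameters addressed as <step>__<parameter>. The sketch below assumes the pipeline steps are named tf-idf and logreg (mirroring the config sections) and that the variable is called param_grid; the actual layout of hyperparams.py may differ. It only enumerates the settings a grid search would try and does not train anything.

```python
from itertools import product

# Hypothetical grid; the step names "tf-idf" and "logreg" are assumptions.
param_grid = {
    "tf-idf__ngram_range": [(1, 1), (1, 2)],
    "tf-idf__min_df": [1, 5],
    "logreg__C": [0.1, 1.0, 10.0],
}

# Expand the grid into the individual parameter settings.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(candidates))  # 2 * 2 * 3 = 12 settings
print(candidates[0])
```

With cross-validation, each setting is fitted once per fold, so the grid size directly multiplies training time.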

Change config.yaml and hyperparams.py to create the desired configuration, then train a text classification model with the following command:

terminal:

text-clf-train --path_to_config config.yaml

python:

import text_clf

model, target_names_mapping = text_clf.train(path_to_config="config.yaml")

Default config.yaml:

seed: 42
path_to_save_folder: models
experiment_name: model

# data
data:
  train_data_path: data/train.csv
  test_data_path: data/test.csv
  sep: ','
  text_column: text
  target_column: target_name_short

# preprocessing
# (included in resulting model pipeline, so preserved for inference)
preprocessing:
  lemmatization: null  # pymorphy2

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  n_jobs: -1

# grid-search
grid-search:
  do_grid_search: false
  grid_search_params_path: hyperparams.py

NOTE: grid search is disabled by default; to use it, set do_grid_search: true.

NOTE: tf-idf and logreg are the parameters of sklearn TfidfVectorizer and LogisticRegression respectively, so you can parameterize instances of these classes however you want. The same logic applies to grid-search, which is sklearn GridSearchCV parameterized with hyperparams.py.
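
To make this correspondence concrete, the sketch below treats the tf-idf and logreg sections as plain dictionaries of keyword arguments. It assumes the config has already been loaded into a dict (hard-coded here instead of parsing YAML) and that a string such as (1, 1) is converted to a tuple with ast.literal_eval. Both are illustrative choices, not a description of the library's internals, and the helper to_kwargs is hypothetical.

```python
import ast

# The default config sections, written as the dict a YAML loader would yield.
# Note that YAML reads "(1, 1)" as a string, not a tuple.
config = {
    "tf-idf": {"lowercase": True, "ngram_range": "(1, 1)", "max_df": 1.0, "min_df": 1},
    "logreg": {"penalty": "l2", "C": 1.0, "class_weight": "balanced",
               "solver": "saga", "n_jobs": -1},
}


def to_kwargs(section):
    """Copy a config section, turning a tuple-like ngram_range string into a tuple."""
    kwargs = dict(section)
    value = kwargs.get("ngram_range")
    if isinstance(value, str):
        kwargs["ngram_range"] = ast.literal_eval(value)
    return kwargs


tfidf_kwargs = to_kwargs(config["tf-idf"])
logreg_kwargs = to_kwargs(config["logreg"])

# These could then be passed on as TfidfVectorizer(**tfidf_kwargs)
# and LogisticRegression(**logreg_kwargs).
print(tfidf_kwargs["ngram_range"])
```

Any other keyword accepted by the two sklearn classes could be added to the corresponding section in the same way.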

Output

After training the model, the pipeline saves the following files:

model.joblib - sklearn pipeline with TF-IDF and LogReg steps
target_names.json - mapping from encoded target labels (0 to n_classes-1) to their names
config.yaml - config that was used to train the model
hyperparams.py - grid-search parameters (if grid-search was used)
logging.txt - logging file
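
The label mapping can be pictured as follows. This is a minimal sketch, assuming labels are encoded in sorted order and stored as a JSON object keyed by the encoded label; the label values are invented and the real layout of target_names.json may differ.

```python
import json

# Hypothetical raw targets as they might appear in the target column.
targets = ["sports", "politics", "sports", "tech", "politics"]

# Encode each distinct label as an integer from 0 to n_classes-1.
classes = sorted(set(targets))
label_to_id = {name: idx for idx, name in enumerate(classes)}
id_to_label = {idx: name for name, idx in label_to_id.items()}

# JSON object keys are always strings, so integer keys come back as "0", "1", ...
restored = json.loads(json.dumps(id_to_label))

# Map encoded predictions back to readable names.
predicted_ids = [2, 0]
predicted_names = [restored[str(i)] for i in predicted_ids]
print(predicted_names)
```

The string-key detail matters in practice: after a round trip through JSON, looking up an integer key directly would fail.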

Additional functions

text_clf.token_frequency.get_token_frequency(path_to_config) -
get token frequency of train dataset according to the config file parameters
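
Conceptually, token frequency is a count of how often each token occurs in the training texts. The sketch below is not the library's implementation: it uses collections.Counter with lowercasing and the default token pattern of sklearn's TfidfVectorizer (tokens of two or more word characters) on made-up texts. The function name token_frequency is hypothetical.

```python
import re
from collections import Counter

# Default token pattern of sklearn's TfidfVectorizer: 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def token_frequency(texts, lowercase=True):
    """Count token occurrences over an iterable of raw texts."""
    counts = Counter()
    for text in texts:
        if lowercase:
            text = text.lower()
        counts.update(TOKEN_PATTERN.findall(text))
    return counts


texts = ["The cat sat on the mat", "A cat is a cat"]
freq = token_frequency(texts)
print(freq.most_common(2))
```

Such counts are a quick way to choose sensible min_df and max_df values before training.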

Only for binary classifiers:

text_clf.pr_roc_curve.get_precision_recall_curve(path_to_model_folder) -
get precision and recall metrics for precision-recall curve
text_clf.pr_roc_curve.get_roc_curve(path_to_model_folder) -
get false positive rate (fpr) and true positive rate (tpr) metrics for ROC curve
text_clf.pr_roc_curve.plot_precision_recall_curve(precision, recall) -
plot precision-recall curve
text_clf.pr_roc_curve.plot_roc_curve(fpr, tpr) -
plot ROC curve
text_clf.pr_roc_curve.plot_precision_recall_f1_curves_for_thresholds(precision, recall, thresholds) -
plot precision, recall, f1-score curves for probability thresholds
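
To show what the threshold curves represent, the sketch below computes precision, recall and F1 for a few probability thresholds in plain Python. The labels, scores and thresholds are toy values, the helper precision_recall_f1 is hypothetical, and the code does not call the library's functions.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Binary metrics when predicting positive for scores >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    # Convention used here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.6]

for threshold in (0.3, 0.5, 0.75):
    p, r, f1 = precision_recall_f1(y_true, scores, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Raising the threshold trades recall for precision, which is why plotting all three metrics against the threshold helps when picking an operating point.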

Requirements

Python >= 3.6

Citation

If you use text-classification-baseline in a scientific publication, we would appreciate references to the following BibTeX entry:

@misc{dayyass2021textclf,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training text classification baselines},
    howpublished = {\url{https://github.com/dayyass/text-classification-baseline}},
    year         = {2021}
}
