SeanLee97 / clfzoo

Licence: MIT license
A library of deep text classifiers.

/ clfzoo /

clfzoo is a toolkit for text classification. It implements several baseline models: TextCNN, TextRNN, RCNN, Transformer, HAN, and DPCNN. It also provides a unified, friendly API to train, predict with, and test the models. Code contributions and suggestions are welcome.

Requirements

python3+
numpy
scikit-learn
tensorflow>=1.6.0

Installation

git clone https://github.com/SeanLee97/clfzoo.git
cd clfzoo

Overview

project
│    README.md
│
└─── docs
│
└─── clfzoo    # models
│   │  base.py       # base model template
│   │  config.py     # default configuration
│   │  dataloader.py
│   │  instance.py   # data instance
│   │  vocab.py      # vocabulary
│   │  libs          # layers and functions
│   │  dpcnn         # DPCNN model implementation
│   │   │  __init__.py  # model APIs
│   │   │  model.py     # model
│   │  ...           # other model implementations
└───examples
    │   ...

Data Preparation

Each line is one document, in the format "label \t sentence": the label and the sentence are separated by a tab. The default tokenizer splits on whitespace, so the words in each sentence must be separated by spaces.

English sample:

greeting    how are you.

Chinese sample:

打招呼  你 最近 过得 怎样 啊 ?
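
As a rough illustration of the format above, the following sketch writes two samples to a tab-separated file and reads them back using whitespace tokenization. The helper names `write_samples` and `read_samples` and the temporary file path are hypothetical and are not part of clfzoo.

```python
import os
import tempfile


def write_samples(path, samples):
    """Write (label, tokens) pairs as 'label<TAB>sentence' lines."""
    with open(path, "w", encoding="utf-8") as f:
        for label, tokens in samples:
            # Join words with single spaces so a whitespace tokenizer can split them.
            f.write(label + "\t" + " ".join(tokens) + "\n")


def read_samples(path):
    """Parse 'label<TAB>sentence' lines back into (label, tokens) pairs."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip empty lines
            label, sentence = line.split("\t", 1)
            samples.append((label, sentence.split()))
    return samples


samples = [
    ("greeting", ["how", "are", "you", "."]),
    ("打招呼", ["你", "最近", "过得", "怎样", "啊", "?"]),
]
# Hypothetical location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "train.txt")
write_samples(path, samples)
loaded = read_samples(path)
```

Chinese text has no spaces between words, so it has to be segmented into space-separated tokens before being written in this format.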

Usage

train

# import model api
import clfzoo.textcnn as clf  

# import model config
from clfzoo.config import ConfigTextCNN

"""define model config

You can override any hyperparameter defined in the base model config (here, ConfigTextCNN).
"""

class Config(ConfigTextCNN):
    def __init__(self):
        # the parent class __init__ must be called
        super(Config, self).__init__()

    # the dataset paths are required
    train_file = '/path/to/train'
    dev_file = '/path/to/test'

    # ... other hyperparameters

# `training` is a flag that enables training mode.
clf.model(Config(), training=True)

# start to train
clf.train()

The training log is written to log.txt; the model weights and checkpoint summaries are written to the models folder.

predict

Predict the labels and probability scores.

import clfzoo.textcnn as clf
from clfzoo.config import ConfigTextCNN

class Config(ConfigTextCNN):
    def __init__(self):
        super(Config, self).__init__()
    
    # use the same hyperparameters as in training

# inject config to model
clf.model(Config())

"""
Input: a list
    each item is a sentence string whose words are separated by spaces (Chinese sentences must be word-segmented beforehand)
"""
datas = ['how are u ?', 'what is the weather ?', ...]

"""
Return: a list
    [('label 1', 'score 1'), ('label 2', 'score 2'), ...]
"""
preds = clf.predict(datas)
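
A small sketch of consuming the returned list. It assumes each prediction is a `(label, score)` pair aligned with the input order, and that the score can be converted to a float. The `preds` values below are made-up stand-ins rather than real model output, and `confident_predictions` is a hypothetical helper, not a clfzoo function.

```python
def confident_predictions(sentences, preds, threshold=0.5):
    """Pair each sentence with its prediction and keep those at or above threshold."""
    kept = []
    for sentence, (label, score) in zip(sentences, preds):
        score = float(score)  # assumption: scores may arrive as strings or floats
        if score >= threshold:
            kept.append((sentence, label, score))
    return kept


# Stand-in values for illustration only; real values would come from clf.predict(datas).
datas = ["how are u ?", "what is the weather ?"]
preds = [("greeting", 0.92), ("weather", "0.41")]

for sentence, label, score in confident_predictions(datas, preds):
    print(f"{sentence!r} -> {label} ({score:.2f})")
```

Filtering by a score threshold is one simple way to route low-confidence inputs to a fallback; the value 0.5 here is arbitrary.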

test

Predict the labels and probability scores, and compute evaluation metrics. To calculate the metrics, you must provide the ground-truth labels.

import clfzoo.textcnn as clf
from clfzoo.config import ConfigTextCNN

class Config(ConfigTextCNN):
    def __init__(self):
        super(Config, self).__init__()
    
    # use the same hyperparameters as in training

# inject config to model
clf.model(Config())

"""
Input: a list
    each item is a sentence string whose words are separated by spaces (Chinese sentences must be word-segmented beforehand)
"""
datas = ['how are u ?', 'what is the weather ?', ...]
labels = ['greeting', 'weather', ...]

"""
Return: a tuple
    - predicts: a list
        [('label 1', 'score 1'), ('label 2', 'score 2'), ...]
    - metrics: a dict
        {'recall': '', 'precision': '', 'f1': '', 'accuracy': ''}
"""
preds, metrics = clf.test(datas, labels)
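
To make the metric names concrete, here is a dependency-free sketch that computes accuracy and macro-averaged precision, recall, and F1 from gold and predicted labels. It is an illustration only: the averaging mode that clfzoo uses is not stated here, so macro averaging is an assumption, and `macro_metrics` is a hypothetical helper rather than a clfzoo function.

```python
def macro_metrics(gold, pred):
    """Return accuracy and macro-averaged precision/recall/F1 for two label lists."""
    assert len(gold) == len(pred) and gold, "need equally sized, non-empty lists"
    labels = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        # Per-label counts of true positives, false positives, false negatives.
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(g == p for g, p in zip(gold, pred)) / len(gold),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels for illustration only.
gold = ["greeting", "weather", "weather", "greeting"]
pred = ["greeting", "weather", "greeting", "greeting"]
metrics = macro_metrics(gold, pred)
```

With macro averaging, every label contributes equally regardless of how many samples it has, which matters on datasets with many unevenly sized classes.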

Benchmark Results

Here we use the smp2017-ECDT dataset as an example. It is a Chinese, short-text, multi-label (31 labels) dataset.

We train all models for 20 epochs and calculate the metrics with sklearn's metrics functions. fastText is a strong baseline for text classification, so its results are included for comparison.

Model        Precision  Recall  F1
fasttext     0.81       0.81    0.81
TextCNN      0.83       0.84    0.83
TextRNN      0.84       0.83    0.82
RCNN         0.86       0.85    0.85
DPCNN        0.87       0.85    0.85
Transformer  0.74       0.67    0.68
HAN          TODO       TODO    TODO

Note: Transformer and HAN do not perform well yet. We will fix the bugs and update their results later.

Contributors

Reference

Some code modules from

Papers

Contact Us

For any questions, please email xmlee97#gmail.com