
RandolphVI / Multi Label Text Classification

License: apache-2.0
About: Multi-Label Text Classification Based on Neural Networks.

Programming Languages

python
139335 projects - #7 most used programming language
python3
1442 projects

Projects that are alternatives of or similar to Multi Label Text Classification

detecting-offensive-language-in-tweets
Detecting cyberbullying in tweets using Machine Learning
Stars: ✭ 19 (-95.21%)
Mutual labels:  text-classification
Bert seq2seq
A PyTorch implementation of BERT for seq2seq tasks using the UniLM scheme; it can also do automatic summarization, text classification, sentiment analysis, NER, and part-of-speech tagging, and supports article continuation with GPT2.
Stars: ✭ 298 (-24.94%)
Mutual labels:  text-classification
Nlp Projects
word2vec, sentence2vec, machine reading comprehension, dialog system, text classification, pretrained language model (i.e., XLNet, BERT, ELMo, GPT), sequence labeling, information retrieval, information extraction (i.e., entity, relation and event extraction), knowledge graph, text generation, network embedding
Stars: ✭ 360 (-9.32%)
Mutual labels:  text-classification
Bertweet
BERTweet: A pre-trained language model for English Tweets (EMNLP-2020)
Stars: ✭ 282 (-28.97%)
Mutual labels:  text-classification
Textfooler
A Model for Natural Language Attack on Text Classification and Inference
Stars: ✭ 298 (-24.94%)
Mutual labels:  text-classification
Text Classification Cnn Rnn
CNN-RNN Chinese text classification, based on TensorFlow
Stars: ✭ 3,613 (+810.08%)
Mutual labels:  text-classification
Lbl2Vec
Lbl2Vec learns jointly embedded label, document and word vectors to retrieve documents with predefined topics from an unlabeled document corpus.
Stars: ✭ 25 (-93.7%)
Mutual labels:  text-classification
Bert Multitask Learning
BERT for Multitask Learning
Stars: ✭ 380 (-4.28%)
Mutual labels:  text-classification
Text Cnn
CNN Chinese text classification with Word2vec word-vector embeddings
Stars: ✭ 298 (-24.94%)
Mutual labels:  text-classification
Text mining resources
Resources for learning about Text Mining and Natural Language Processing
Stars: ✭ 358 (-9.82%)
Mutual labels:  text-classification
Chinese Text Classification
Chinese-Text-Classification: Chinese text classification implemented with a TensorFlow CNN (convolutional neural network). QQ group: 522785813; WeChat group QR code: http://www.tensorflownews.com/
Stars: ✭ 284 (-28.46%)
Mutual labels:  text-classification
Bert For Sequence Labeling And Text Classification
This is template code for using BERT for sequence labeling and text classification, making it easier to apply BERT to more tasks. Currently the template covers CoNLL-2003 named entity recognition, Snips slot filling, and intent prediction.
Stars: ✭ 293 (-26.2%)
Mutual labels:  text-classification
Snips Nlu
Snips Python library to extract meaning from text
Stars: ✭ 3,583 (+802.52%)
Mutual labels:  text-classification
2018 Dc Datagrand Textintelprocess
2018 DC "DataGrand Cup" Text Intelligence Processing Challenge: champion (1st/3131)
Stars: ✭ 260 (-34.51%)
Mutual labels:  text-classification
Spacy Streamlit
👑 spaCy building blocks and visualizers for Streamlit apps
Stars: ✭ 360 (-9.32%)
Mutual labels:  text-classification
cnn-text-classification-keras
Convolutional Neural Network for Text Classification in Keras
Stars: ✭ 14 (-96.47%)
Mutual labels:  text-classification
Gather Deployment
Gathers scalable tensorflow and infrastructure deployment
Stars: ✭ 326 (-17.88%)
Mutual labels:  text-classification
Zhihu Text Classification
[2017 Zhihu Kanshan Cup, multi-label text classification] Team ye (6th place) solution
Stars: ✭ 392 (-1.26%)
Mutual labels:  text-classification
Rmdl
RMDL: Random Multimodel Deep Learning for Classification
Stars: ✭ 375 (-5.54%)
Mutual labels:  text-classification
Artificial Adversary
🗣️ Tool to generate adversarial text examples and test machine learning models against them
Stars: ✭ 348 (-12.34%)
Mutual labels:  text-classification

Deep Learning for Multi-Label Text Classification

Python Version Build Status Codacy Badge License Issues

This repository is my research project, and also a study of TensorFlow and deep learning models (FastText, CNN, LSTM, etc.).

The main objective of the project is to solve the multi-label text classification problem with deep neural networks. Accordingly, each data label is a multi-hot vector such as [0, 1, 0, ..., 1, 1].
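As a minimal sketch of that label format (the helper name `to_multi_hot` is illustrative, not from this repo), a list of label indices can be turned into such a multi-hot vector like this:

```python
def to_multi_hot(labels_index, num_classes):
    """Convert a list of label indices into a multi-hot vector.

    E.g. labels_index=[1, 3] with num_classes=5 -> [0, 1, 0, 1, 0].
    """
    vec = [0] * num_classes
    for idx in labels_index:
        vec[idx] = 1
    return vec

print(to_multi_hot([1, 3], 5))  # [0, 1, 0, 1, 0]
```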

Requirements

  • Python 3.6
  • Tensorflow 1.15.0
  • Tensorboard 1.15.0
  • Sklearn 0.19.1
  • Numpy 1.16.2
  • Gensim 3.8.3
  • Tqdm 4.49.0

Project

The project structure is below:

.
├── Model
│   ├── test_model.py
│   ├── text_model.py
│   └── train_model.py
├── data
│   ├── word2vec_100.model.* [Need Download]
│   ├── Test_sample.json
│   ├── Train_sample.json
│   └── Validation_sample.json
├── utils
│   ├── checkmate.py
│   ├── data_helpers.py
│   └── param_parser.py
├── LICENSE
├── README.md
└── requirements.txt

Innovation

Data part

  1. Supports both Chinese and English data (tokenize with jieba or nltk).
  2. Supports loading your own pre-trained word vectors (via gensim).
  3. Adds embedding visualization based on TensorBoard (you need to create metadata.tsv first).
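For the metadata.tsv mentioned in step 3, TensorBoard's embedding projector expects one token per line, in the same order as the rows of the embedding matrix. A minimal sketch (the helper name `write_metadata` is hypothetical, not this repo's API):

```python
def write_metadata(vocab, path):
    """Write one vocabulary token per line, matching the embedding row order,
    as expected by TensorBoard's embedding projector."""
    with open(path, "w", encoding="utf-8") as f:
        for word in vocab:
            f.write(word + "\n")

write_metadata(["<PAD>", "pore", "water", "pressure"], "metadata.tsv")
```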

Model part

  1. Adds the correct L2 loss calculation.
  2. Adds gradient clipping to prevent gradient explosion.
  3. Adds exponential learning-rate decay.
  4. Adds a new Highway layer (which is useful, judging by model performance).
  5. Adds a Batch Normalization layer.
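To illustrate items 2 and 3 above, here is a plain-Python sketch of the underlying math (the same idea as TensorFlow's `tf.clip_by_global_norm` and `tf.train.exponential_decay`; the function names here are illustrative, not this repo's code):

```python
import math

def exponential_decay(lr, global_step, decay_steps, decay_rate):
    """decayed_lr = lr * decay_rate ** (global_step / decay_steps)"""
    return lr * decay_rate ** (global_step / decay_steps)

def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients so their combined L2 norm does not exceed clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= clip_norm:
        return grads, global_norm
    scale = clip_norm / global_norm
    return [[g * scale for g in grad] for grad in grads], global_norm
```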

Code part

  1. You can train the model from scratch or restore it from a checkpoint in train.py.
  2. You can predict labels via a threshold or via top-K in train.py and test.py.
  3. Computes the evaluation metrics AUC & AUPRC.
  4. test.py creates a prediction file containing the predicted values and predicted labels for the test-set data.
  5. Adds other useful data-preprocessing functions in data_helpers.py.
  6. Uses logging to record the whole run (parameter display, model training info, etc.).
  7. checkmate.py can save the best n checkpoints, whereas tf.train.Saver only keeps the last n.
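The threshold and top-K prediction strategies from item 2 can be sketched as follows (a minimal illustration, not the repo's actual functions):

```python
def predict_by_threshold(scores, threshold=0.5):
    """Return every label index whose predicted score reaches the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]

def predict_by_topk(scores, k):
    """Return the indices of the k highest-scoring labels."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

scores = [0.9, 0.1, 0.6, 0.4]
print(predict_by_threshold(scores))   # [0, 2]
print(predict_by_topk(scores, 2))     # [0, 2]
```

Thresholding lets the number of predicted labels vary per document, while top-K fixes it; which fits better depends on your label distribution.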

Data

See the data format in the /data folder, which includes the sample files. For example:

{"testid": "3935745", "features_content": ["pore", "water", "pressure", "metering", "device", "incorporating", "pressure", "meter", "force", "meter", "influenced", "pressure", "meter", "device", "includes", "power", "member", "arranged", "control", "pressure", "exerted", "pressure", "meter", "force", "meter", "applying", "overriding", "force", "pressure", "meter", "stop", "influence", "force", "meter", "removing", "overriding", "force", "pressure", "meter", "influence", "force", "meter", "resumed"], "labels_index": [526, 534, 411], "labels_num": 3}
  • "testid": just the id.
  • "features_content": the segmented words (after stopword removal).
  • "labels_index": the label indices of the record.
  • "labels_num": the number of labels.
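Each record is one JSON object per line, so it can be read with the standard library (a shortened version of the sample record above):

```python
import json

line = ('{"testid": "3935745", "features_content": ["pore", "water", "pressure"], '
        '"labels_index": [526, 534, 411], "labels_num": 3}')
record = json.loads(line)

# "labels_num" is redundant but must agree with "labels_index".
print(record["labels_index"])  # [526, 534, 411]
print(record["labels_num"] == len(record["labels_index"]))  # True
```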

Text Segment

  1. You can use the nltk package if you are dealing with English text data.

  2. You can use the jieba package if you are dealing with Chinese text data.
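Whichever tokenizer you use, the outcome should look like the "features_content" field above: lowercased tokens with stopwords removed. A dependency-free stand-in (the regex tokenizer and the tiny stopword set are illustrative only; use nltk or jieba in practice):

```python
import re

STOPWORDS = {"a", "an", "the", "of", "and", "is"}  # illustrative subset

def segment(text):
    """Lowercase, split into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(segment("The pore water pressure of the device"))
# ['pore', 'water', 'pressure', 'device']
```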

Data Format

This repository can be used with other (text classification) datasets in two ways:

  1. Convert your dataset into the same format as the sample.
  2. Modify the data-preprocessing code in data_helpers.py.

Either way, the right approach depends on your data and task.
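For option 1, converting a record of your own into the sample format is straightforward (the helper name `to_sample_format` is hypothetical, not part of this repo):

```python
import json

def to_sample_format(doc_id, tokens, label_indices):
    """Serialize one document as a line in the repository's sample JSON format."""
    return json.dumps({
        "testid": str(doc_id),
        "features_content": tokens,
        "labels_index": label_indices,
        "labels_num": len(label_indices),
    })

print(to_sample_format(42, ["pore", "water"], [526, 534]))
```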

🤔 Before opening a new issue about the data format, please check the data_sample.json file and read the other open issues first; someone may have already asked the same question.

Pre-trained Word Vectors

You can download the Word2vec model file (dim = 100). Make sure the files are unzipped and placed under the /data folder.

You can pre-train your own word vectors (on your corpus) in several ways:

  • Use the gensim package to pre-train word vectors.
  • Use the GloVe tools to pre-train word vectors.
  • Or even use a fastText network to pre-train word vectors.

Usage

See Usage.

Network Structure

FastText

References:


TextANN

References:

  • Personal ideas 🙃

TextCNN

References:


TextRNN

Warning: the model is usable but not finished yet 🤪!

TODO

  1. Add BN-LSTM cell unit.
  2. Add attention.

References:


TextCRNN

References:

  • Personal ideas 🙃

TextRCNN

References:

  • Personal ideas 🙃

TextHAN

References:


TextSANN

Warning: the model is usable but not finished yet 🤪!

TODO

  1. Add attention penalization loss.
  2. Add visualization.

References:


About Me

黄威,Randolph

SCU SE Bachelor; USTC CS Ph.D.

Email: [email protected]

My Blog: randolph.pro

LinkedIn: randolph's linkedin

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].