FernandoLpz / Text-Classification-LSTMs-PyTorch

Licence: other

The aim of this repository is to show a baseline model for text classification by implementing an LSTM-based model in PyTorch. To provide a better understanding of the model, a Tweets dataset provided by Kaggle is used.

Programming Languages

python

139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Text-Classification-LSTMs-PyTorch

Awesome-Pytorch-Tutorials

Awesome Pytorch Tutorials

Stars: ✭ 23 (-48.89%)

Mutual labels: pytorch-tutorial, pytorch-nlp, pytorch-implementation

Applied Text Mining In Python

Repo for Applied Text Mining in Python (coursera) by University of Michigan

Stars: ✭ 59 (+31.11%)

Mutual labels: text-mining, text-classification, text-processing

support-tickets-classification

This case study shows how to create a model for text analysis and classification and deploy it as a web service in Azure cloud in order to automatically classify support tickets. This project is a proof of concept made by Microsoft (Commercial Software Engineering team) in collaboration with Endava http://endava.com/en

Stars: ✭ 142 (+215.56%)

Mutual labels: text-mining, text-classification, text-processing

Text Mining

Text Mining in Python

Stars: ✭ 18 (-60%)

Mutual labels: text-mining, text-classification, text-processing

Pytorch Seq2seq

Tutorials on implementing a few sequence-to-sequence (seq2seq) models with PyTorch and TorchText.

Stars: ✭ 3,418 (+7495.56%)

Mutual labels: pytorch-tutorial, pytorch-nlp, pytorch-implementation

nlp classification

Implementing nlp papers relevant to classification with PyTorch, gluonnlp

Stars: ✭ 224 (+397.78%)

Mutual labels: text-classification, pytorch-nlp, pytorch-implementation

Artificial Adversary

🗣️ Tool to generate adversarial text examples and test machine learning models against them

Stars: ✭ 348 (+673.33%)

Mutual labels: text-mining, text-classification, text-processing

Textcluster

Short-text clustering preprocessing module (Short text cluster)

Stars: ✭ 115 (+155.56%)

Mutual labels: text-mining, text-processing

Cogcomp Nlpy

CogComp's light-weight Python NLP annotators

Stars: ✭ 115 (+155.56%)

Mutual labels: text-mining, text-processing

Awesome Text Classification

Awesome-Text-Classification: projects, papers, tutorials.

Stars: ✭ 158 (+251.11%)

Mutual labels: text-mining, text-classification

Hdltex

HDLTex: Hierarchical Deep Learning for Text Classification

Stars: ✭ 191 (+324.44%)

Mutual labels: text-mining, text-classification

Nlp In Practice

Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.

Stars: ✭ 790 (+1655.56%)

Mutual labels: text-mining, text-classification

Udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit

Stars: ✭ 160 (+255.56%)

Mutual labels: text-mining, tokenizer

Pyss3

A Python package implementing a new machine learning model for text classification with visualization tools for Explainable AI

Stars: ✭ 191 (+324.44%)

Mutual labels: text-mining, text-classification

Pipeit

PipeIt is a text transformation, conversion, cleansing and extraction tool.

Stars: ✭ 57 (+26.67%)

Mutual labels: text-mining, text-processing

Xioc

Extract indicators of compromise from text, including "escaped" ones.

Stars: ✭ 148 (+228.89%)

Mutual labels: text-mining, text-processing

Tokenizers

Fast, Consistent Tokenization of Natural Language Text

Stars: ✭ 161 (+257.78%)

Mutual labels: text-mining, tokenizer

Cnn Text Classification Keras

Text Classification by Convolutional Neural Network in Keras

Stars: ✭ 213 (+373.33%)

Mutual labels: text-mining, text-classification

Shallowlearn

An experiment about re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and nice API. Written in Python and fully compatible with Scikit-learn.

Stars: ✭ 196 (+335.56%)

Mutual labels: text-mining, text-classification

clustext

Easy, fast clustering of texts

Stars: ✭ 18 (-60%)

Mutual labels: text-mining, text-classification


Text Classification through LSTMs

If you want to delve into the details regarding how the text was pre-processed, how the sequences were generated, how the neural network was built from the LSTMCells and how the model was trained, I highly recommend reading the blog: Text Classification with PyTorch

1. Data

As mentioned above, the dataset consists of Tweets regarding fake news. The raw dataset contains some unnecessary columns, which are removed in the preprocessing step. In the end, we work with a dataset whose head looks like this:

id	text	target
1	Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all	1
2	SOOOO PUMPED FOR ABLAZE ???? @southridgelife	0
3	INEC Office in Abia Set Ablaze - http://t.co/3ImaomknnA	1
4	Building the perfect tracklist to life leave the streets ablaze	0

This raw dataset can be found in data/tweets.csv.
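
The column-dropping step described above could look roughly like the sketch below, which keeps only `id`, `text` and `target`. The extra column names (`keyword`, `location`), the helper `load_tweets` and the inline sample are assumptions for illustration; the repository's actual preprocessing code may differ.

```python
import io

import pandas as pd

# Inline stand-in for data/tweets.csv. The `keyword` and `location`
# columns are assumed here purely for illustration.
RAW_CSV = """id,keyword,location,text,target
1,,,Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all,1
2,ablaze,,SOOOO PUMPED FOR ABLAZE ???? @southridgelife,0
"""


def load_tweets(source):
    """Read the raw CSV and keep only the columns the model needs."""
    raw = pd.read_csv(source)
    return raw[["id", "text", "target"]]


tweets = load_tweets(io.StringIO(RAW_CSV))
print(tweets.head())
```

To run the same function against the real file, pass the path instead: `load_tweets("data/tweets.csv")`.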

2. The model

As noted above, the aim of this repository is to provide a baseline model for text classification. The model consists of two stacked LSTM layers followed by two linear layers. The dataset is preprocessed with a token-based technique, and the resulting tokens are then fed to an embedding layer.

[Image: pipeline of the model]
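
The pipeline described in this section (tokens, embedding, two stacked LSTM layers, two linear layers) can be sketched as follows. This is a minimal illustration, not the repository's code: it uses `nn.LSTM` rather than hand-wired LSTM cells, and the names `build_vocab`, `encode` and `TweetClassifier`, the layer sizes, and the use of id 0 for both padding and unknown tokens are all assumptions.

```python
import torch
import torch.nn as nn


def build_vocab(texts, max_words):
    """Map the most frequent tokens to ids; id 0 is reserved for padding/unknown."""
    counts = {}
    for text in texts:
        for token in text.lower().split():
            counts[token] = counts.get(token, 0) + 1
    # Sort by descending frequency, breaking ties alphabetically.
    most_common = sorted(counts, key=lambda t: (-counts[t], t))[: max_words - 1]
    return {token: idx + 1 for idx, token in enumerate(most_common)}


def encode(text, vocab, max_len):
    """Turn a text into a fixed-length list of token ids (truncate, then zero-pad)."""
    ids = [vocab.get(token, 0) for token in text.lower().split()][:max_len]
    return ids + [0] * (max_len - len(ids))


class TweetClassifier(nn.Module):
    """Embedding -> two stacked LSTM layers -> two linear layers -> sigmoid."""

    def __init__(self, max_words, hidden_dim=128, lstm_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(max_words, hidden_dim, padding_idx=0)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, num_layers=lstm_layers,
                            batch_first=True)
        self.fc1 = nn.Linear(hidden_dim, hidden_dim * 2)
        self.fc2 = nn.Linear(hidden_dim * 2, 1)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)         # (batch, max_len, hidden_dim)
        _, (hidden, _) = self.lstm(embedded)         # hidden: (layers, batch, hidden_dim)
        features = torch.relu(self.fc1(hidden[-1]))  # top layer's final hidden state
        return torch.sigmoid(self.fc2(features)).squeeze(1)  # (batch,) probabilities


texts = ["Building the perfect tracklist", "Set Ablaze"]
vocab = build_vocab(texts, max_words=1000)
batch = torch.tensor([encode(t, vocab, max_len=20) for t in texts])
model = TweetClassifier(max_words=1000)
probs = model(batch)
print(probs.shape)
```

Because the sequences are zero-padded at the end, the final hidden state here is taken after the padding positions; a more careful version would pack the sequences or pad on the left.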

3. Dependencies

This model was developed with the following package versions:

torch==1.0.1.post2
torchtext==0.6.0
tensorflow==1.12.0
Keras==2.0.0
numpy==1.15.4
pandas==1.0.3

4. How to use it

The model can be executed easily by typing:

python main.py

You can define some hyperparameters manually, such as:

 main.py [-h] [--epochs EPOCHS] [--learning_rate LEARNING_RATE]
         [--hidden_dim HIDDEN_DIM] [--lstm_layers LSTM_LAYERS]
         [--batch_size BATCH_SIZE] [--test_size TEST_SIZE]
         [--max_len MAX_LEN] [--max_words MAX_WORDS]
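
For readers who want to see how such a command line is typically wired up, here is a minimal `argparse` sketch that reproduces the flags listed above. The argument types and every default value are assumptions chosen for illustration (the flag names come from the usage text); the actual `main.py` may define them differently.

```python
import argparse


def build_parser():
    """Recreate the flags from the usage text; types and defaults are guesses."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning_rate", type=float, default=0.01)
    parser.add_argument("--hidden_dim", type=int, default=128)
    parser.add_argument("--lstm_layers", type=int, default=2)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--test_size", type=float, default=0.2)
    parser.add_argument("--max_len", type=int, default=20)
    parser.add_argument("--max_words", type=int, default=1000)
    return parser


# Parse an explicit argument list so the example runs without a shell.
args = build_parser().parse_args(["--epochs", "5", "--hidden_dim", "64"])
print(args.epochs, args.hidden_dim, args.learning_rate)
```

Flags that are not passed fall back to their defaults, so only the hyperparameters being changed need to appear on the command line.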

5. Demo

The following configuration was used to achieve the best results:

python -B main.py --epochs 10 --learning_rate 0.01 --hidden_dim 128 --lstm_layers 2 --batch_size 64

It produced the following output:

Epoch: 1, loss: 0.53032, Train accuracy: 0.59376, Test accuracy: 0.63099
Epoch: 2, loss: 0.43361, Train accuracy: 0.63251, Test accuracy: 0.72948
Epoch: 3, loss: 0.36803, Train accuracy: 0.76141, Test accuracy: 0.75509
Epoch: 4, loss: 0.26117, Train accuracy: 0.80821, Test accuracy: 0.77807
Epoch: 5, loss: 0.19844, Train accuracy: 0.83547, Test accuracy: 0.77741
Epoch: 6, loss: 0.16377, Train accuracy: 0.86453, Test accuracy: 0.77216
Epoch: 7, loss: 0.02130, Train accuracy: 0.88391, Test accuracy: 0.75509
Epoch: 8, loss: 0.00315, Train accuracy: 0.89704, Test accuracy: 0.74787
Epoch: 9, loss: 0.02075, Train accuracy: 0.91018, Test accuracy: 0.76428
Epoch: 10, loss: 0.01348, Train accuracy: 0.92808, Test accuracy: 0.75378
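
One way to read this output is to parse it and compare the two accuracy columns. The sketch below (the helper `parse_log` and the regular expression are illustrative, not part of the repository) finds the epoch with the highest test accuracy and the train/test gap at the last epoch. In this run the test accuracy peaks early while the train accuracy keeps climbing, which suggests the model starts to overfit.

```python
import re

LOG = """\
Epoch: 1, loss: 0.53032, Train accuracy: 0.59376, Test accuracy: 0.63099
Epoch: 2, loss: 0.43361, Train accuracy: 0.63251, Test accuracy: 0.72948
Epoch: 3, loss: 0.36803, Train accuracy: 0.76141, Test accuracy: 0.75509
Epoch: 4, loss: 0.26117, Train accuracy: 0.80821, Test accuracy: 0.77807
Epoch: 5, loss: 0.19844, Train accuracy: 0.83547, Test accuracy: 0.77741
Epoch: 6, loss: 0.16377, Train accuracy: 0.86453, Test accuracy: 0.77216
Epoch: 7, loss: 0.02130, Train accuracy: 0.88391, Test accuracy: 0.75509
Epoch: 8, loss: 0.00315, Train accuracy: 0.89704, Test accuracy: 0.74787
Epoch: 9, loss: 0.02075, Train accuracy: 0.91018, Test accuracy: 0.76428
Epoch: 10, loss: 0.01348, Train accuracy: 0.92808, Test accuracy: 0.75378
"""

PATTERN = re.compile(
    r"Epoch: (\d+), loss: ([\d.]+), Train accuracy: ([\d.]+), Test accuracy: ([\d.]+)"
)


def parse_log(log):
    """Convert each log line into a dict of numeric values."""
    rows = []
    for epoch, loss, train_acc, test_acc in PATTERN.findall(log):
        rows.append({
            "epoch": int(epoch),
            "loss": float(loss),
            "train": float(train_acc),
            "test": float(test_acc),
        })
    return rows


rows = parse_log(LOG)
best = max(rows, key=lambda r: r["test"])  # epoch with the highest test accuracy
last = rows[-1]
gap = last["train"] - last["test"]         # train/test gap at the final epoch
print(f"best test accuracy {best['test']:.5f} at epoch {best['epoch']}")
print(f"train/test gap at epoch {last['epoch']}: {gap:.5f}")
```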

[Image: learning curves for the run above]

6. Future work

As mentioned, the aim of this repository is to provide a baseline for the text classification task. It is important to note that text classification goes well beyond a two-stacked LSTM architecture in which texts are preprocessed with a token-based methodology. Recent works have shown impressive results with transformer-based architectures (e.g. BERT). Nevertheless, staying with the current approach, the proposed model can be improved by replacing the token-based methodology with a word-embedding-based model (e.g. word2vec via gensim). Likewise, bidirectional LSTMs can be applied in order to capture more context (in both the forward and backward directions).
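
As a rough idea of what the bidirectional variant might look like, the sketch below feeds pre-computed word vectors into a bidirectional `nn.LSTM` and classifies from the concatenated final states of both directions. The class name `BiLSTMClassifier`, the dimensions, and the random tensor standing in for word2vec vectors are all assumptions; this is not part of the repository.

```python
import torch
import torch.nn as nn


class BiLSTMClassifier(nn.Module):
    """Bidirectional stacked LSTM over word vectors, followed by one linear layer."""

    def __init__(self, embedding_dim=100, hidden_dim=128, lstm_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=lstm_layers,
                            batch_first=True, bidirectional=True)
        # Forward and backward states are concatenated, hence hidden_dim * 2.
        self.classifier = nn.Linear(hidden_dim * 2, 1)

    def forward(self, embedded):
        _, (hidden, _) = self.lstm(embedded)  # hidden: (layers * 2, batch, hidden_dim)
        # Final states of the top layer: forward direction at [-2], backward at [-1].
        summary = torch.cat([hidden[-2], hidden[-1]], dim=1)
        return torch.sigmoid(self.classifier(summary)).squeeze(1)


vectors = torch.randn(4, 20, 100)  # stand-in for pre-trained word vectors
probs = BiLSTMClassifier()(vectors)
print(probs.shape)
```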

The question remains open: how can semantics be learned? What is semantics? Would DL-based models be capable of learning semantics?
