FernandoLpz / Text-Classification-LSTMs-PyTorch

Licence: other

Text Classification through LSTMs

The aim of this repository is to show a baseline model for text classification by implementing an LSTM-based model in PyTorch. To provide a better understanding of the model, a Tweets dataset provided by Kaggle is used.

If you want to delve into the details of how the text was pre-processed, how the sequences were generated, how the neural network was built from LSTMCells, and how the model was trained, I highly recommend reading the blog post: Text Classification with PyTorch

1. Data

As mentioned above, the dataset consists of Tweets regarding fake news. The raw dataset contains some unnecessary columns, which are removed in the preprocessing step. In the end, we work with a dataset whose head looks like this:

| id | text | target |
|----|------|--------|
| 1 | Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all | 1 |
| 2 | SOOOO PUMPED FOR ABLAZE ???? @southridgelife | 0 |
| 3 | INEC Office in Abia Set Ablaze - http://t.co/3ImaomknnA | 1 |
| 4 | Building the perfect tracklist to life leave the streets ablaze | 0 |

This raw dataset can be found in data/tweets.csv.
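
The column-dropping step mentioned in this section could be sketched roughly as follows. This is only an illustration, not the repository's preprocessing code: the `keyword` and `location` column names are assumptions standing in for the "unnecessary columns", and an in-memory string replaces data/tweets.csv so the snippet runs on its own.

```python
import csv
import io

# Stand-in for data/tweets.csv. The `keyword` and `location` columns are
# ASSUMED examples of the unnecessary columns; they are left empty here.
RAW_CSV = """id,keyword,location,text,target
1,,,Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all,1
2,,,SOOOO PUMPED FOR ABLAZE ???? @southridgelife,0
"""

# Columns kept after preprocessing, as shown in the table above.
KEEP = ("id", "text", "target")


def load_rows(handle, keep=KEEP):
    """Read CSV rows and keep only the selected columns."""
    reader = csv.DictReader(handle)
    return [{name: row[name] for name in keep} for row in reader]


rows = load_rows(io.StringIO(RAW_CSV))
for row in rows:
    print(row["id"], row["text"], row["target"])
```

With pandas (listed in the dependencies), the same idea would be a plain column selection on the loaded DataFrame.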

2. The model

As already noted, the aim of this repository is to provide a baseline model for text classification. The model consists of two stacked LSTM layers followed by two linear layers. The dataset is preprocessed with a token-based technique, and the resulting tokens are then fed to an embedding layer. The following image describes the pipeline of the model.
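
In code, the architecture described in this section could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the repository's implementation: it uses `nn.LSTM` with `num_layers=2` instead of hand-wired LSTMCells, and the class name `TweetClassifier`, the vocabulary size, the embedding size, and the width of the first linear layer are all hypothetical.

```python
import torch
import torch.nn as nn


class TweetClassifier(nn.Module):
    """Embedding -> two stacked LSTM layers -> two linear layers (sketch)."""

    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128, lstm_layers=2):
        super().__init__()
        # Index 0 is reserved for padding (an assumption of this sketch).
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            num_layers=lstm_layers, batch_first=True)
        self.fc1 = nn.Linear(hidden_dim, hidden_dim * 2)
        self.fc2 = nn.Linear(hidden_dim * 2, 1)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)      # hidden: (layers, batch, hidden_dim)
        features = torch.relu(self.fc1(hidden[-1]))  # final state of the top layer
        # One probability per tweet for the binary target.
        return torch.sigmoid(self.fc2(features)).squeeze(1)


model = TweetClassifier()
batch = torch.randint(1, 1000, (4, 20))  # 4 tweets, 20 token ids each
probs = model(batch)
print(probs.shape)
```

Taking `hidden[-1]` uses the final hidden state of the top LSTM layer as the sentence representation, which is one common choice for sequence classification.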

3. Dependencies

This model was developed with the following package versions:

torch==1.0.1.post2
torchtext==0.6.0
tensorflow==1.12.0
Keras==2.0.0
numpy==1.15.4
pandas==1.0.3

4. How to use it

The model can be executed easily by typing:

python main.py

You can define some hyperparameters manually, such as:

 main.py [-h] [--epochs EPOCHS] [--learning_rate LEARNING_RATE]
         [--hidden_dim HIDDEN_DIM] [--lstm_layers LSTM_LAYERS]
         [--batch_size BATCH_SIZE] [--test_size TEST_SIZE]
         [--max_len MAX_LEN] [--max_words MAX_WORDS]
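
A command-line interface with these flags could be declared with `argparse` along the following lines. This is a hedged sketch: the flag names come from the usage text above, but every default value is an assumption (the first five mirror the demo command, the last three are placeholders), and `build_parser` is a hypothetical helper name.

```python
import argparse


def build_parser():
    """Declare the hyperparameter flags listed in the usage text."""
    parser = argparse.ArgumentParser(prog="main.py")
    # Defaults below are ASSUMED, not taken from the repository.
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning_rate", type=float, default=0.01)
    parser.add_argument("--hidden_dim", type=int, default=128)
    parser.add_argument("--lstm_layers", type=int, default=2)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--test_size", type=float, default=0.2)
    parser.add_argument("--max_len", type=int, default=20)
    parser.add_argument("--max_words", type=int, default=1000)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is testable.
args = build_parser().parse_args(["--epochs", "5", "--learning_rate", "0.001"])
print(args.epochs, args.learning_rate, args.hidden_dim)
```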

5. Demo

The following configuration was used to achieve the best results

python -B main.py --epochs 10 --learning_rate 0.01 --hidden_dim 128 --lstm_layers 2 --batch_size 64

producing the following output:

Epoch: 1, loss: 0.53032, Train accuracy: 0.59376, Test accuracy: 0.63099
Epoch: 2, loss: 0.43361, Train accuracy: 0.63251, Test accuracy: 0.72948
Epoch: 3, loss: 0.36803, Train accuracy: 0.76141, Test accuracy: 0.75509
Epoch: 4, loss: 0.26117, Train accuracy: 0.80821, Test accuracy: 0.77807
Epoch: 5, loss: 0.19844, Train accuracy: 0.83547, Test accuracy: 0.77741
Epoch: 6, loss: 0.16377, Train accuracy: 0.86453, Test accuracy: 0.77216
Epoch: 7, loss: 0.02130, Train accuracy: 0.88391, Test accuracy: 0.75509
Epoch: 8, loss: 0.00315, Train accuracy: 0.89704, Test accuracy: 0.74787
Epoch: 9, loss: 0.02075, Train accuracy: 0.91018, Test accuracy: 0.76428
Epoch: 10, loss: 0.01348, Train accuracy: 0.92808, Test accuracy: 0.75378
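
The log can also be read programmatically. The sketch below parses the ten lines above and reports the epoch with the highest test accuracy together with the train/test gap at the last epoch; in this run, test accuracy peaks at epoch 4 while train accuracy keeps rising. The names `PATTERN` and `parse_log` are hypothetical and not part of the repository.

```python
import re

LOG = """\
Epoch: 1, loss: 0.53032, Train accuracy: 0.59376, Test accuracy: 0.63099
Epoch: 2, loss: 0.43361, Train accuracy: 0.63251, Test accuracy: 0.72948
Epoch: 3, loss: 0.36803, Train accuracy: 0.76141, Test accuracy: 0.75509
Epoch: 4, loss: 0.26117, Train accuracy: 0.80821, Test accuracy: 0.77807
Epoch: 5, loss: 0.19844, Train accuracy: 0.83547, Test accuracy: 0.77741
Epoch: 6, loss: 0.16377, Train accuracy: 0.86453, Test accuracy: 0.77216
Epoch: 7, loss: 0.02130, Train accuracy: 0.88391, Test accuracy: 0.75509
Epoch: 8, loss: 0.00315, Train accuracy: 0.89704, Test accuracy: 0.74787
Epoch: 9, loss: 0.02075, Train accuracy: 0.91018, Test accuracy: 0.76428
Epoch: 10, loss: 0.01348, Train accuracy: 0.92808, Test accuracy: 0.75378
"""

PATTERN = re.compile(
    r"Epoch: (\d+), loss: ([\d.]+), "
    r"Train accuracy: ([\d.]+), Test accuracy: ([\d.]+)"
)


def parse_log(text):
    """Turn each log line into a dict of numeric fields."""
    records = []
    for match in PATTERN.finditer(text):
        epoch, loss, train_acc, test_acc = match.groups()
        records.append({
            "epoch": int(epoch),
            "loss": float(loss),
            "train": float(train_acc),
            "test": float(test_acc),
        })
    return records


records = parse_log(LOG)
best = max(records, key=lambda r: r["test"])   # epoch with highest test accuracy
last = records[-1]
gap = last["train"] - last["test"]             # train/test gap at the final epoch
print("best epoch:", best["epoch"], "test accuracy:", best["test"])
print("final train/test gap:", round(gap, 5))
```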

So the learning curves will look like:

6. Future work

As mentioned, the aim of this repository is to provide a baseline for the text classification task. It is important to note that the problem of text classification goes beyond a two-stacked LSTM architecture in which texts are preprocessed with a token-based methodology. Recent works have shown impressive results by implementing transformer-based architectures (e.g. BERT). Nevertheless, following this thread, the proposed model can be improved by replacing the token-based methodology with a model based on word embeddings (e.g. word2vec with gensim). Likewise, bidirectional LSTMs can be applied in order to capture more context (in both the forward and backward directions).
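
As a rough illustration of the bidirectional idea, the sketch below switches on `bidirectional=True` in `nn.LSTM` and concatenates the final forward and backward states of the top layer before classification. It is not part of the repository; the tensor sizes and variable names are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim = 64, 128  # ASSUMED sizes for this sketch

lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
               batch_first=True, bidirectional=True)

# Random stand-in for an embedded batch: 4 tweets, 20 tokens each.
embedded = torch.randn(4, 20, embed_dim)
outputs, (hidden, _) = lstm(embedded)

# outputs: (batch, seq_len, 2 * hidden_dim), both directions per time step.
# hidden:  (num_layers * 2, batch, hidden_dim); the last two entries are the
# forward and backward final states of the top layer.
final = torch.cat([hidden[-2], hidden[-1]], dim=1)  # (batch, 2 * hidden_dim)

# The classifier input doubles in width compared with a one-directional LSTM.
classifier = nn.Linear(2 * hidden_dim, 1)
probs = torch.sigmoid(classifier(final)).squeeze(1)
print(outputs.shape, final.shape, probs.shape)
```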

The question remains open: how can semantics be learned? What is semantics? Would DL-based models be capable of learning semantics?
