
aniass / Product-Categorization-NLP

Licence: other
Multi-class text classification of products based on their descriptions, using Machine Learning algorithms and Neural Networks (MLP, CNN, DistilBERT).

Programming Languages: Jupyter Notebook, Python

Product Categorization

Multi-Class Text Classification of products based on their description

General info

The goal of the project is to categorize products from their descriptions using Machine Learning and Deep Learning algorithms (MLP, CNN, DistilBERT). Additionally, we built Doc2vec and Word2vec models, performed Topic Modeling (with LDA) and carried out EDA (data exploration, data aggregation and data cleaning).

Dataset

The dataset comes from http://makeup-api.herokuapp.com/ and was obtained through its API. The extraction itself is covered in my previous project, Extracting Data using API.
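
As a rough illustration of how such data could be pulled and reduced to (description, category) pairs, here is a minimal sketch. The endpoint path `/api/v1/products.json`, the field names `description` and `product_type`, and both helper names are assumptions made for this example; they are not taken from the project, so check them against the API before relying on them.

```python
import json
from urllib.request import urlopen

# Assumed endpoint path; verify against the API documentation.
API_URL = "http://makeup-api.herokuapp.com/api/v1/products.json"


def fetch_products(url=API_URL, timeout=30):
    """Download the product list as a list of dicts (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def to_labeled_texts(records, text_key="description", label_key="product_type"):
    """Keep only records with a non-empty description and a non-empty category."""
    pairs = []
    for record in records:
        text = (record.get(text_key) or "").strip()
        label = (record.get(label_key) or "").strip()
        if text and label:
            pairs.append((text, label))
    return pairs


# Offline illustration with made-up records shaped like the assumed API output.
sample = [
    {"description": "Creamy red lipstick", "product_type": "lipstick"},
    {"description": None, "product_type": "mascara"},
    {"description": "  ", "product_type": "blush"},
]
labeled = to_labeled_texts(sample)
```

Records without a usable description are dropped here because a text classifier has nothing to learn from them; the project may handle missing values differently.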

Motivation

The aim of the project is multi-class text classification of make-up products based on their descriptions. Given a text as input, we predict the product category. There are five categories, corresponding to different types of make-up products. We used different feature extraction methods (such as Word2vec and Doc2vec) and various Machine Learning/Deep Learning algorithms, compared their predictions, and chose the most accurate approach for our problem.

Project contains:

  • Multi-class text classification with ML algorithms - Text_analysis.ipynb
  • Text classification with DistilBERT model - Bert_products.ipynb
  • Text classification with MLP and Convolutional Neural Network (CNN) models - Text_nn.ipynb
  • Text classification with Doc2vec model - Doc2vec.ipynb
  • Word2vec model - Word2vec.ipynb
  • LDA - Topic modeling - LDA_Topic_modeling.ipynb
  • EDA analysis - Products_analysis.ipynb
  • Python scripts to clean data and ML model - clean_data.py, text_model.py
  • data, models - data and models used in the project.
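
The cleaning step listed above can be pictured with a small sketch. This is not the content of clean_data.py; it is a hypothetical `clean_text` helper using only the standard library, with a deliberately tiny stop-word list chosen for the example (a real pipeline would more likely use the NLTK list).

```python
import html
import re

# Tiny illustrative stop-word list (assumption); real pipelines use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "for", "with", "of", "to", "is", "in", "that", "this"}

TAG_RE = re.compile(r"<[^>]+>")         # HTML tags such as <p> or <br/>
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # anything that is not a lowercase letter or whitespace


def clean_text(text):
    """Normalize a product description into lowercase, space-separated tokens."""
    text = html.unescape(text)           # "&amp;" -> "&"
    text = TAG_RE.sub(" ", text)         # drop HTML tags
    text = text.lower()
    text = NON_LETTER_RE.sub(" ", text)  # drop digits and punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS and len(t) > 1]
    return " ".join(tokens)


cleaned = clean_text("<p>The BEST long-lasting lipstick &amp; liner, 100% vegan!</p>")
print(cleaned)  # best long lasting lipstick liner vegan
```

Product descriptions returned by web APIs often carry markup and entities, which is why the sketch unescapes and strips tags before tokenizing.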

Summary

We began with data analysis and pre-processing of the dataset. We then used several text representations, such as BoW and TF-IDF, and trained Word2vec and Doc2vec models on our data. We experimented with several Machine Learning algorithms (Logistic Regression, Linear SVM, Multinomial Naive Bayes, Random Forest, Gradient Boosting) as well as MLP and Convolutional Neural Network (CNN) models, using different combinations of text representations and embeddings. We also applied transfer learning with a pretrained DistilBERT model from the Hugging Face Transformers library.
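
One of the combinations described above, BoW counts reweighted by TF-IDF and fed to a linear SVM, could look roughly like this in scikit-learn. It is a sketch under stated assumptions: the six descriptions and three labels are made up for illustration, default hyperparameters are used, and the project's notebooks may configure these steps differently.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up training data; the real project uses product descriptions from the API.
texts = [
    "long lasting matte lipstick with rich red color",
    "creamy lipstick that moisturizes lips",
    "volumizing black mascara for longer lashes",
    "waterproof mascara that lifts and curls lashes",
    "liquid foundation with full coverage for all skin types",
    "lightweight foundation that evens skin tone",
]
labels = ["lipstick", "lipstick", "mascara", "mascara", "foundation", "foundation"]

model = make_pipeline(
    CountVectorizer(stop_words="english"),  # bag-of-words term counts
    TfidfTransformer(),                     # reweight counts by TF-IDF
    LinearSVC(random_state=0),              # linear SVM, one-vs-rest for multi-class
)
model.fit(texts, labels)

prediction = model.predict(["matte red lipstick for lips"])[0]
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary fitted on training data only, which avoids leaking information from validation or test texts.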

Our experiments show that all tested models reach high accuracy overall, with similar results. The SVM (BOW + TF-IDF) and MLP models give the best accuracy on the validation set. Logistic Regression performed very well with both BOW + TF-IDF and Doc2vec, with accuracy close to that of the MLP. The CNN with word embeddings is also comparable to the MLP (0.93). Transfer learning with the DistilBERT model gave results similar to the previous models, with an accuracy of 93% on the test set. This shows that the more complex models did not give better results for our problem than simple Machine Learning models such as SVM.

Model                 Embeddings              Accuracy
CNN                   Word embedding          0.93
DistilBERT            DistilBERT tokenizer    0.93
MLP                   Word embedding          0.93
SVM                   Doc2vec (DBOW)          0.93
SVM                   BOW + TF-IDF            0.93
Logistic Regression   Doc2vec (DBOW)          0.91
Gradient Boosting     BOW + TF-IDF            0.91
Logistic Regression   BOW + TF-IDF            0.91
Random Forest         BOW + TF-IDF            0.91
Naive Bayes           BOW + TF-IDF            0.90
Logistic Regression   Doc2vec (DM)            0.89
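
For reference, the accuracy figures above are the fraction of descriptions assigned to the correct category. The sketch below computes that metric plus a per-class breakdown in plain Python; the label lists are made up for illustration and are not the project's predictions.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true class, the share of its items that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up labels for illustration only.
y_true = ["lipstick", "lipstick", "mascara", "foundation", "mascara"]
y_pred = ["lipstick", "mascara", "mascara", "foundation", "mascara"]

overall = accuracy(y_true, y_pred)          # 4 of 5 correct
by_class = per_class_recall(y_true, y_pred)
```

With only five categories, a per-class view is cheap to compute and can reveal a weak class that a single overall accuracy figure would hide.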

The project is created with:

  • Python 3.6/3.8
  • libraries: NLTK, gensim, Keras, TensorFlow, Hugging Face transformers, scikit-learn, pandas, numpy, seaborn, pyLDAvis.

Running the project:

  • To run this project use Jupyter Notebook or Google Colab.