i008 / nyyelp

Licence: other
Predicting Yelp review ratings using recurrent neural networks

Programming Languages

Jupyter Notebook
Python

Projects that are alternatives of or similar to nyyelp

imessage-chatbot
💬 Recurrent neural network -- generates messages in your style of speech! Trained on imessage data. Sqlite3, TensorFlow, Flask, Twilio SMS, AWS.
Stars: ✭ 33 (+65%)
Mutual labels:  recurrent-neural-networks
DeepSeparation
Keras Implementation and Experiments with Deep Recurrent Neural Networks for Source Separation
Stars: ✭ 19 (-5%)
Mutual labels:  recurrent-neural-networks
lyrics-generator
Generating lyrics with a recurrent neural network
Stars: ✭ 36 (+80%)
Mutual labels:  recurrent-neural-networks
Conversational-AI-Chatbot-using-Practical-Seq2Seq
A simple open domain generative based chatbot based on Recurrent Neural Networks
Stars: ✭ 17 (-15%)
Mutual labels:  recurrent-neural-networks
Deep-Learning
This repo provides projects on deep-learning mainly using Tensorflow 2.0
Stars: ✭ 22 (+10%)
Mutual labels:  recurrent-neural-networks
LSM
Liquid State Machines in Python and NEST
Stars: ✭ 39 (+95%)
Mutual labels:  recurrent-neural-networks
course-content-dl
NMA deep learning course
Stars: ✭ 537 (+2585%)
Mutual labels:  recurrent-neural-networks
Singing-Voice-Separation-RNN
Singing-Voice Separation From Monaural Recordings Using Deep Recurrent Neural Networks
Stars: ✭ 44 (+120%)
Mutual labels:  recurrent-neural-networks
automatic-personality-prediction
[AAAI 2020] Modeling Personality with Attentive Networks and Contextual Embeddings
Stars: ✭ 43 (+115%)
Mutual labels:  recurrent-neural-networks
CS231n
PyTorch/Tensorflow solutions for Stanford's CS231n: "CNNs for Visual Recognition"
Stars: ✭ 47 (+135%)
Mutual labels:  recurrent-neural-networks
recsys2019
The complete code and notebooks used for the ACM Recommender Systems Challenge 2019
Stars: ✭ 26 (+30%)
Mutual labels:  recurrent-neural-networks
classifying-cancer
A Python-Tensorflow neural network for classifying cancer data
Stars: ✭ 30 (+50%)
Mutual labels:  recurrent-neural-networks
entailment-neural-attention-lstm-tf
(arXiv:1509.06664) Reasoning about Entailment with Neural Attention.
Stars: ✭ 43 (+115%)
Mutual labels:  recurrent-neural-networks
Meetup-Content
Entirety.ai Intuition to Implementation Meetup Content.
Stars: ✭ 33 (+65%)
Mutual labels:  recurrent-neural-networks
YelpDatasetSQL
Working with the Yelp Dataset in Azure SQL and SQL Server
Stars: ✭ 16 (-20%)
Mutual labels:  yelp-dataset
rnn darts fastai
Implement Differentiable Architecture Search (DARTS) for RNN with fastai
Stars: ✭ 21 (+5%)
Mutual labels:  recurrent-neural-networks
STORN-keras
This is a STORN (Stochastical Recurrent Neural Network) implementation for keras!
Stars: ✭ 23 (+15%)
Mutual labels:  recurrent-neural-networks
deeptrolldetector
Deep troll uses a deep learning model that identifies whether an audio contains the Gemidao troll (AAAWN OOOWN NHAAA AWWWWN AAAAAH).
Stars: ✭ 20 (+0%)
Mutual labels:  recurrent-neural-networks
dts
A Keras library for multi-step time-series forecasting.
Stars: ✭ 130 (+550%)
Mutual labels:  recurrent-neural-networks
spikeRNN
No description or website provided.
Stars: ✭ 28 (+40%)
Mutual labels:  recurrent-neural-networks

Project objective:

Predict the review rating (how many stars the reviewer gave to a given POI) from the textual content of the review, using pretrained GloVe word embeddings and a two-layer GRU (LSTM) recurrent neural network. Visualize clusters in 2D space using PCA and t-SNE.

Simple Yelp dataset exploration

Yelp data: https://www.yelp.de/dataset_challenge
The notebook used to load and explore basic properties of the dataset can be found here:
https://github.com/i008/nyyelp/blob/master/exploration.ipynb
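
For orientation, a minimal sketch of loading the reviews with pandas, assuming the line-delimited review.json file shipped with the Yelp dataset release (the file name and path are assumptions):

```python
import pandas as pd

# The Yelp dataset ships reviews as line-delimited JSON; adjust the path
# to wherever review.json was extracted (path is an assumption).
reviews = pd.read_json("review.json", lines=True)

# Basic properties: dataset size and the star-rating distribution.
print(reviews.shape)
print(reviews["stars"].value_counts().sort_index())
```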

NLP project

The notebook for the NLP part of this project can be found here:
https://github.com/i008/nyyelp/blob/master/nlp.ipynb

Steps (rough plan):

  1. Process text data from the Yelp review documents (a sketch of this step follows the list)
  • filter to keep English-only reviews
  • balance the dataset (use a similar amount of data for each star rating: 1, 2, 3, 4, 5)
  • balance the dataset by review length, so that the length distribution of reviews for each rating class is at least similar (bucketing might be a good idea)
  2. Process text for deep learning
  • bring the GloVe embeddings into a usable form
  • tokenize each review
  • transform reviews into tokenized sequences
  • pad sequences to a fixed length
  • prepare the embedding weight matrix to be used in the Embedding layer (we will not train this layer)
  • split into train and test sets
  3. Deep learning
  • choose a loss function: regression (RMSE) and/or classification (log loss)
  • prepare the architecture (2 LSTM/GRU layers -> Dense)
  • train, hope for the best
  4. Post learning
  • extract features from the last LSTM layer
  • cast them into 2D space using t-SNE
  5. Remarks
  • we have to limit the amount of data because of memory issues; a possible solution is to save the sequences to HDF5 and flow them from disk during training
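
As a sketch of step 1, here is one way the language filter and class balancing could look. The notebooks define the actual tooling, so langdetect for the English filter is an assumption, and reviews is the DataFrame from the loading sketch above:

```python
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def is_english(text):
    """True when langdetect classifies the review text as English."""
    try:
        return detect(text) == "en"
    except LangDetectException:  # raised for empty/undetectable text
        return False

english = reviews[reviews["text"].apply(is_english)]

# Balance the star-rating classes by downsampling each class to the size
# of the smallest one.
per_class = english["stars"].value_counts().min()
balanced = (
    english.groupby("stars", group_keys=False)
           .apply(lambda g: g.sample(per_class, random_state=0))
)
```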

Summary

Text data:

  • For performance and memory reasons, the final model was trained on a balanced, downsampled (by a factor of 10) subset of the reviews, leaving around 150,000 reviews in total.
  • Review data was tokenized with a 20,000-word vocabulary.
  • Max sequence length was set to 1000.
  • Embedding dimensions were set to 100 (the GloVe-100d representation was used), which means every word is a vector of shape (100,).
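
A minimal sketch of this preprocessing using the Keras tokenizer and the parameters listed above; glove.6B.100d.txt is the standard file name from the GloVe download, and balanced is the DataFrame from the earlier sketch:

```python
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

MAX_WORDS, MAX_LEN, EMBED_DIM = 20000, 1000, 100  # values from the summary

tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(balanced["text"])
X = pad_sequences(tokenizer.texts_to_sequences(balanced["text"]),
                  maxlen=MAX_LEN)
y = to_categorical(balanced["stars"] - 1, num_classes=5)  # one-hot labels

# Build the embedding weight matrix for the (frozen) Embedding layer.
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

embedding_matrix = np.zeros((MAX_WORDS, EMBED_DIM))
for word, i in tokenizer.word_index.items():
    if i < MAX_WORDS and word in glove:
        embedding_matrix[i] = glove[word]
```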

Deep Learning Model

  • The Embedding layer used GloVe (http://nlp.stanford.edu/projects/glove/) weights; training of this layer was disabled.
  • The objective function was log loss, which means it was treated as a classification task, although regression would also be a valid option.
  • After the 3rd-4th epoch the model starts to overfit.
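
A minimal Keras sketch of the model described above: a frozen GloVe Embedding layer feeding two recurrent layers and a softmax over the five rating classes. The layer sizes and optimizer are assumptions; everything else follows the description:

```python
from keras.models import Sequential
from keras.layers import Embedding, GRU, Dense

model = Sequential([
    # Frozen embedding layer initialized with the GloVe weight matrix.
    Embedding(MAX_WORDS, EMBED_DIM, weights=[embedding_matrix],
              input_length=MAX_LEN, trainable=False),
    GRU(128, return_sequences=True),  # layer sizes are assumptions
    GRU(128),
    Dense(5, activation="softmax"),   # one class per star rating
])

# Log loss == categorical cross-entropy over the 5 classes.
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

model.fit(X, y, validation_split=0.1, epochs=5, batch_size=128)
```

Given the overfitting observed after the 3rd-4th epoch, an EarlyStopping callback on validation loss would be a natural addition to the training loop.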

Results

fig1. Confusion matrix

fig2. Classification report

fig3. Learning process

As the figures above show, the model managed to learn something and achieved decent accuracy (~0.62).

Key takeaways:

  • Extreme (1- and 5-star) reviews are easier to classify and understand; people write them in a distinctive way.
  • As fig1 shows, the model makes mistakes almost exclusively "by one star". This is a very good sign and suggests that the gradation of the data was learned to some degree.

t-SNE and PCA on GRU features

t-SNE and PCA are dimensionality reduction techniques; applied to the features from the last GRU layer, they give us a 2D picture of the problem we are tackling.

fig4. t-SNE

fig5. PCA
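
A minimal sketch of this step: read out the activations of the last GRU layer through an intermediate Keras model and project them to 2D. The layer index is an assumption tied to the model sketch above:

```python
from keras.models import Model
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Intermediate model exposing the last GRU layer's activations;
# layers[-2] assumes the architecture sketched above.
feature_model = Model(inputs=model.input, outputs=model.layers[-2].output)
features = feature_model.predict(X)  # a held-out split would be used in practice

coords_pca = PCA(n_components=2).fit_transform(features)
# t-SNE is slow on large sets; a few thousand points suffice for a plot.
coords_tsne = TSNE(n_components=2).fit_transform(features[:5000])
```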

Learning Performance

  • One epoch (~150,000 reviews) with a batch size of 128 reviews took around 15 minutes to train on a GTX 1070.

Docker app (try it yourself)

docker run i008/nyyelp:latest python predict.py --review "this place really sucks, food is terrible"

number of stars: 1

docker run i008/nyyelp:latest python predict.py --review "i have mixed feeling about this place, on one hand its good on the other not really"

number of stars: 3

TODO, Ideas

  • Treat this problem as regression and see what happens (a sketch follows this list).
  • Cluster the GRU features using DBSCAN or similar methods; this might give interesting results.
  • Use more data to train the models.
  • Balance the dataset by the length of the reviews.
  • Interactive plot for t-SNE (clicking a data point should show the review; plotly?)
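
For the regression idea in the list above, a minimal sketch of the variant, reusing the constants and embedding matrix from the earlier sketches (layer sizes remain assumptions):

```python
from keras.models import Sequential
from keras.layers import Embedding, GRU, Dense

# Regression variant: a single linear unit predicts the star rating
# directly, trained with MSE (an RMSE-style objective).
reg_model = Sequential([
    Embedding(MAX_WORDS, EMBED_DIM, weights=[embedding_matrix],
              input_length=MAX_LEN, trainable=False),
    GRU(128, return_sequences=True),
    GRU(128),
    Dense(1, activation="linear"),
])
reg_model.compile(loss="mse", optimizer="adam")
```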