
haoopeng / Cnn Yelp Challenge 2016 Sentiment Classification

IPython Notebook for training a word-level Convolutional Neural Network model for the sentiment classification task on the Yelp Challenge 2016 review dataset.

CNN-yelp-challenge-2016-sentiment-classification

This repository trains a word-level Convolutional Neural Network (CNN) model for sentiment classification on the Yelp Challenge 2016 review dataset, using standard deep learning packages.

The task is defined on the yelp_academic_dataset_review.json file (5 million rows) from the challenge. Two of its fields are used: "stars" and "text". The "text" field is the customer's raw review text, and the "stars" field is the customer's rating for that review, ranging from 1 to 5.
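The extraction of these two fields can be sketched as follows. This is a minimal sketch, not the repository's json-csv.py: it assumes the file stores one JSON object per line, and the function name reviews_to_csv and the rule for dropping rows with missing values are illustrative assumptions.

```python
import csv
import json


def reviews_to_csv(json_path, csv_path):
    """Copy the "stars" and "text" fields of each JSON line into a CSV file.

    Assumes one JSON object per line (an assumption about the file layout).
    Returns the number of rows written.
    """
    written = 0
    with open(json_path, encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["stars", "text"])
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            stars, text = record.get("stars"), record.get("text")
            if stars is None or not text:
                continue  # drop rows with a missing rating or empty text
            writer.writerow([stars, text])
            written += 1
    return written
```

A call such as reviews_to_csv("yelp_academic_dataset_review.json", "review.csv") would then produce a two-column file that the later steps can sample from.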

The model architecture is described in the Components section. For the first layer, I experimented with both word2vec embeddings and Keras' built-in embedding layer.

To train the model in a reasonable time, I randomly sampled 1 million data points and ended up with 399850 samples after removing missing values. The class distribution of this subset is shown in Table 1.

Stars    1        2        3        4         5
Count    46906    34283    50678    106067    161916
Share    11.7%    8.6%     12.7%    26.5%     40.5%

I applied the model to a binary classification task and a multi-class (5-star) classification task.

In the binary setting, reviews with more than 2 stars are treated as positive samples and all others as negative. After 2 epochs of training, the model reached 77.91% training accuracy and 77.61% validation accuracy (see the Components section).
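The labeling rule can be written in a few lines. This is a small sketch, with binary_label as a hypothetical helper name; the class counts are copied from the class distribution of the sampled subset given earlier, and the last lines compute the share of positive reviews those counts imply.

```python
def binary_label(stars):
    """Return 1 (positive) for ratings above 2 stars, otherwise 0 (negative)."""
    return 1 if stars > 2 else 0


# Class counts of the sampled subset, keyed by star rating.
counts = {1: 46906, 2: 34283, 3: 50678, 4: 106067, 5: 161916}

labels = [binary_label(stars) for stars in sorted(counts)]
print(labels)  # [0, 0, 1, 1, 1]

# Share of reviews that become positive under this rule.
total = sum(counts.values())
positive = sum(n for stars, n in counts.items() if binary_label(stars) == 1)
positive_share = positive / total
print(round(positive_share, 3))  # 0.797
```

Because roughly four out of five sampled reviews are positive under this rule, the positive share is a useful reference point when reading the accuracy figures.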

For the multi-class classification task, it achieved ~40% accuracy on the test set after 1 epoch of training. The result is shown in train_multi_class.ipynb.

Feel free to continue my work, and let me know if you obtain better results!

Requirements

  • Keras: pip install keras (1.0.3)
  • Theano: pip install theano (0.8.0.dev0)

Components

This repository contains the following components:

  • json-csv.py: The data preprocessing script. It converts the yelp_academic_dataset_review.json file to a CSV file named review.csv.
  • Word2VecUtility.py: Borrowed from Kaggle's word2vec tutorial. It splits a review into a list of words or a list of sentences.
  • word2vec_model.ipynb: Trains a word2vec model on the review data. Each word is represented by a 300-dimensional vector. The trained model is named 300features_40minwords_10context.
  • train_with_word2vec_embedding.ipynb: Trains a 1D CNN for sentiment classification using the word2vec embedding (the embedded dataset has shape (N, 50, 300); see the Details section). Unfortunately, my machine could not finish training because of memory issues, so I switched to Keras' built-in embedding layer instead.
  • train_keras_embedding.ipynb: Trains a model similar to the previous one; the only difference is the embedding layer. The architecture is: Embedding layer - Dropout - Convolution1D - MaxPooling1D - Fully connected layer - Dropout - ReLU activation - Sigmoid (with binary cross-entropy loss). It was trained on 319880 samples and validated on 79970 samples (train acc: 0.7791 and val_acc: 0.7761 after 2 epochs of training).
  • train_multi_class.ipynb: Trains a multi-class classification model with the same architecture on the same subset. It achieved ~40% validation accuracy after 1 epoch of training.
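To make the Convolution1D and MaxPooling1D steps of that architecture concrete, here is a dependency-free sketch of what one convolution filter followed by max pooling computes on a sequence of word vectors. It is an illustration only, not the notebooks' Keras code: the function names, the toy 2-dimensional embeddings, the filter width, and the pool length are all assumptions, and the bias term and activation are left out.

```python
def conv1d_valid(sequence, kernel):
    """Slide one filter over a sequence of word vectors (stride 1, no padding).

    sequence: list of T vectors of size D.
    kernel:   list of K vectors of size D (a filter spanning K words).
    Returns a list of T - K + 1 activations.
    """
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = sum(w * x
                    for k_vec, x_vec in zip(kernel, window)
                    for w, x in zip(k_vec, x_vec))
        outputs.append(total)
    return outputs


def max_pool1d(values, pool_length):
    """Keep the maximum of each non-overlapping block of pool_length values."""
    return [max(values[i:i + pool_length])
            for i in range(0, len(values) - pool_length + 1, pool_length)]


# Toy review: 6 words with 2-dimensional embeddings (the notebooks use 50 x 300).
review = [[1, 0], [0, 1], [2, 1], [0, 0], [1, 1], [3, 0]]
kernel = [[1, 0], [0, 1]]  # one hypothetical filter spanning 2 words

features = conv1d_valid(review, kernel)
pooled = max_pool1d(features, 2)
print(features)  # [2, 1, 2, 1, 1]
print(pooled)    # [2, 2]
```

In the real model many such filters run in parallel, and their pooled outputs feed the fully connected layer.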

Details

To train CNN models on textual data, we need to represent the dataset as 2-D matrices (just like training CNN models on images). There are many ways to do this. In this task, I tried two approaches: (i) using the word2vec embedding and (ii) using Keras' built-in embedding layer.

Word2vec embedding

With a word2vec model, we can transform each review into a fixed-length sequence of words, using truncation and padding, with each word represented by its word vector.

For example, we can set max_length = 50 (the maximum number of words per review) and the word2vec vocabulary size to 5000. All word indices from the word2vec model are shifted up by 3 because 0, 1, and 2 are reserved: 0 is the padding index, 1 marks the beginning of every review, and 2 replaces any word outside the vocabulary. Reviews with fewer than 50 words are padded with 0 at the beginning, and longer reviews are truncated to keep only the first 50 words. Each word is then mapped to its word vector, so each review ends up as a (50, 300) matrix.
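The indexing scheme can be sketched as below. This is a minimal illustration rather than the notebook's code: encode_review and the vocab dictionary (word to 0-based rank in the word2vec vocabulary) are hypothetical names, and the sketch assumes the start marker counts toward the max_length positions.

```python
PAD, START, OOV = 0, 1, 2  # reserved indices
INDEX_OFFSET = 3           # real word indices start after the reserved ones


def encode_review(words, vocab, max_length=50):
    """Turn a list of words into a fixed-length list of integer indices.

    vocab maps each known word to its 0-based rank (an assumed layout).
    """
    ids = [START]
    for word in words:
        rank = vocab.get(word)
        ids.append(OOV if rank is None else rank + INDEX_OFFSET)
    ids = ids[:max_length]                         # keep only the first positions
    return [PAD] * (max_length - len(ids)) + ids   # pad with 0 at the beginning


vocab = {"good": 0, "food": 1}
short = encode_review(["good", "pizza", "food"], vocab, max_length=6)
print(short)  # [0, 0, 1, 3, 2, 4]

long_review = encode_review(["good"] * 10, vocab, max_length=4)
print(long_review)  # [1, 3, 3, 3]
```

Each index in the result can then be looked up in the word2vec model to build the (50, 300) matrix for a review.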

Keras embedding layer

This is similar to the previous case, where word2vec is used to represent a review as a matrix. The Keras embedding layer can be thought of as a word embedding trained end-to-end with the rest of the network (it is the first layer of the model architecture).
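Conceptually, an embedding layer is a table with one row per index, and a review's matrix is obtained by looking up its rows. The sketch below shows only that lookup, under assumed names (make_embedding, embed) and an assumed small random initialization; in Keras the table's values are learned by backpropagation, which is not shown here.

```python
import random


def make_embedding(vocab_size, dim, seed=0):
    """Create a vocab_size x dim table of small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]


def embed(ids, table):
    """Replace each integer index with its row of the embedding table."""
    return [table[i] for i in ids]


table = make_embedding(vocab_size=10, dim=4)  # toy sizes for illustration
matrix = embed([0, 0, 1, 3, 2, 4], table)
print(len(matrix), len(matrix[0]))  # 6 4
```

With 50 indices per review and 300-dimensional rows, the same lookup yields the (50, 300) shape described for the word2vec case.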

Replication

To replicate my results, download the dataset, then run json-csv.py and word2vec_model.ipynb to sample the exact 399850 reviews used in this task. Run train_keras_embedding.ipynb to train a CNN model with the Keras embedding layer. You can also run train_with_word2vec_embedding.ipynb to use the word2vec embedding instead (you need to train a word2vec model beforehand). Make sure you have the sampled dataset before you train the model, or feel free to experiment on all 5 million reviews :)

If you would like to predict the 5-class star rating for each review, see my experiments in train_multi_class.ipynb.
