SrinidhiRaghavan / AI-Sentiment-Analysis-on-IMDB-Dataset

Licence: other
Sentiment Analysis using Stochastic Gradient Descent on 50,000 Movie Reviews Compiled from the IMDB Dataset

AI-Sentiment-Analysis-on-IMDB-Dataset

Introduction

With large volumes of online review data available (Amazon, IMDB, etc.), sentiment analysis is becoming increasingly important. This project builds a sentiment classifier that labels the polarity of a piece of text as either positive or negative.

Getting the Dataset

The "Large Movie Review Dataset"(*) is used for this project. The dataset is compiled from a collection of 50,000 IMDB reviews, with no more than 30 reviews per movie. The numbers of positive and negative reviews are equal. Negative reviews have a score of 4 or less out of 10, while positive reviews have a score of 7 or more out of 10. Neutral reviews are not included. The 50,000 reviews are divided evenly into a training set and a test set.

The training dataset is stored in the archive aclImbdb.tar. It can also be downloaded from: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz.

The test dataset is stored in the folder named 'test'.

Data Preprocessing

The training dataset in the aclImdb folder has two sub-directories: pos/ for positive texts and neg/ for negative ones. Use only these two directories. The first task is to combine both into a single CSV file, "imdb_tr.csv", which is the output of this preprocessing step. The CSV file has three columns: "row_number", "text" and "polarity". The "text" column contains the review texts from the aclImdb dataset, and the "polarity" column holds the sentiment labels, 1 for positive and 0 for negative. In addition, common English stopwords should be removed. A list of English stopwords ('stopwords.en') is provided with the code for reference.
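The preprocessing step described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from driver_3: the function signatures, the whitespace tokenization, the row numbering from 0, and passing the stopwords in as a ready-made set (rather than parsing 'stopwords.en', whose format is not described here) are all assumptions.

```python
import csv
from pathlib import Path


def remove_stopwords(sentence, stopwords):
    """Return the sentence with every stopword dropped (case-insensitive)."""
    kept = [word for word in sentence.split() if word.lower() not in stopwords]
    return " ".join(kept)


def imdb_data_preprocess(train_dir, stopwords, out_path="imdb_tr.csv"):
    """Merge train_dir/pos and train_dir/neg into one CSV file.

    Assumed layout: one review per .txt file inside pos/ and neg/.
    Returns the number of reviews written.
    """
    rows = []
    for folder, polarity in (("pos", 1), ("neg", 0)):
        for review_file in sorted(Path(train_dir, folder).glob("*.txt")):
            text = review_file.read_text(encoding="utf-8")
            rows.append((remove_stopwords(text, stopwords), polarity))

    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["row_number", "text", "polarity"])
        for row_number, (text, polarity) in enumerate(rows):
            writer.writerow([row_number, text, polarity])
    return len(rows)
```

Under these assumptions, calling imdb_data_preprocess("aclImdb/train", stopwords) would write imdb_tr.csv to the working directory, with positive reviews first.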

Data Representations Used

Unigram, Bigram, TF-IDF
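To make these representations concrete, here is a small from-scratch sketch. The project itself relies on scikit-learn vectorizers, so this is only an illustration of the idea. The helper names, the lowercase whitespace tokenization, the choice to let the bigram representation also include unigrams, and the idf formula ln(N / df) + 1 are assumptions; scikit-learn uses a smoothed, normalized variant by default, so its numbers will differ.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Return the n-grams of a text as space-joined strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_features(text, max_n):
    """Count all n-grams of length 1..max_n (1 = unigram, 2 = bigram)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(text, n))
    return counts


def tfidf_features(documents, max_n):
    """Reweight raw counts by idf = ln(N / df) + 1 (one textbook variant)."""
    counted = [count_features(doc, max_n) for doc in documents]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for counts in counted for term in counts)
    total = len(documents)
    return [
        {term: tf * (math.log(total / doc_freq[term]) + 1.0)
         for term, tf in counts.items()}
        for counts in counted
    ]
```

Terms that appear in every review (such as "movie") keep a weight equal to their raw count, while rarer terms are boosted, which is why TF-IDF tends to emphasize the more distinctive words.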

Algorithmic Overview

In this project, we train a stochastic gradient descent (SGD) classifier. SGD is used instead of batch gradient descent because batch gradient descent becomes prohibitively expensive on very large datasets: every data point must be processed for each update. SGD instead estimates the gradient from a small random subset of the data at each step, and in practice it performs about as well. This is the central idea of SGD, and it is particularly handy for text data, since text corpora are often very large.

A good description of this algorithm can be found at: https://en.wikipedia.org/wiki/Stochastic_gradient_descent.
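The update rule can be illustrated with a short from-scratch sketch. It is not the classifier used in driver_3: the logistic (log) loss, the learning rate, the epoch count, and the function names are all assumptions chosen for illustration (scikit-learn's SGDClassifier, for instance, defaults to a hinge loss). Each review is represented as a sparse dict mapping feature to value.

```python
import math
import random


def _score(weights, bias, sample):
    """Linear score of one sparse sample (a dict of feature -> value)."""
    return bias + sum(weights.get(f, 0.0) * v for f, v in sample.items())


def sgd_train(samples, labels, epochs=20, learning_rate=0.1, seed=0):
    """Fit a logistic-regression model with one-example-at-a-time SGD."""
    rng = random.Random(seed)
    weights, bias = {}, 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the reviews in a fresh random order
        for i in order:
            # Clamp the score so math.exp cannot overflow.
            score = max(-30.0, min(30.0, _score(weights, bias, samples[i])))
            predicted = 1.0 / (1.0 + math.exp(-score))
            error = predicted - labels[i]  # gradient of the log loss w.r.t. the score
            # Update from this single example only, not the whole dataset.
            for feature, value in samples[i].items():
                weights[feature] = weights.get(feature, 0.0) - learning_rate * error * value
            bias -= learning_rate * error
    return weights, bias


def sgd_predict(weights, bias, sample):
    """Return 1 (positive) if the linear score is above zero, else 0."""
    return 1 if _score(weights, bias, sample) > 0 else 0
```

Because each step touches only the features present in one review, the cost of an update depends on the length of that review rather than on the size of the corpus.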

Functions used in the driver_3 file

imdb_data_preprocess : Explores the neg and pos folders from aclImdb/train and creates an imdb_tr.csv file in the required format

remove_stopwords : Takes a sentence and the stopwords as inputs and returns the sentence without any stopwords

unigram_process : Takes the data to be fit as the input and returns a unigram vectorizer as output

bigram_process : Takes the data to be fit as the input and returns a bigram vectorizer as output

tfidf_process : Takes the data to be fit as the input and returns a TF-IDF vectorizer as output

retrieve_data : Takes a CSV file as the input and returns the corresponding arrays of labels and data as output

stochastic_descent : Applies stochastic gradient descent to the training data and returns the predicted labels

accuracy : Computes the accuracy, as a percentage, given the training and test labels

write_txt : Writes the given data to a text file
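As a rough illustration of the last two helpers, the sketch below shows one plausible shape for them. The signatures, the argument order, and the one-label-per-line output format are assumptions, not taken from driver_3.

```python
def accuracy(true_labels, predicted_labels):
    """Return the percentage of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("cannot score an empty set of labels")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return 100.0 * correct / len(true_labels)


def write_txt(labels, path):
    """Write one label per line to a text file (assumed output format)."""
    with open(path, "w", encoding="utf-8") as handle:
        for label in labels:
            handle.write(f"{label}\n")
```

For example, accuracy([1, 0, 1, 1], [1, 0, 0, 1]) evaluates to 75.0, since three of the four labels agree.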

Environment

Language : Python 3

Libraries : scikit-learn, pandas

How to Execute?

Run python driver_3.py

Results

Output files are:

unigram.output

unigramtfidf.output

bigram.output

bigramtfidf.output

In these files, 1 denotes a positive label and 0 a negative label.
