xiangzhemeng / Kaggle-Twitter-Sentiment-Analysis

Kaggle Twitter Sentiment Analysis Competition


Projects that are alternatives to or similar to Kaggle-Twitter-Sentiment-Analysis

Sentiment-analysis-amazon-Products-Reviews
NLP with NLTK for Sentiment analysis amazon Products Reviews
Stars: ✭ 37 (+105.56%)
Mutual labels:  sentiment-analysis, text-classification
ML2017FALL
Machine Learning (EE 5184) in NTU
Stars: ✭ 66 (+266.67%)
Mutual labels:  sentiment-analysis, text-classification
overview-and-benchmark-of-traditional-and-deep-learning-models-in-text-classification
NLP tutorial
Stars: ✭ 41 (+127.78%)
Mutual labels:  sentiment-analysis, text-classification
Spark Nlp
State of the Art Natural Language Processing
Stars: ✭ 2,518 (+13888.89%)
Mutual labels:  sentiment-analysis, text-classification
Text tone analyzer
A system that analyzes the sentiment of texts and statements.
Stars: ✭ 15 (-16.67%)
Mutual labels:  sentiment-analysis, text-classification
Chinese ulmfit
Chinese ULMFiT for sentiment analysis and text classification
Stars: ✭ 208 (+1055.56%)
Mutual labels:  sentiment-analysis, text-classification
text analysis tools
Chinese text analysis toolkit (including text classification, text clustering, text similarity, keyword extraction, key-phrase extraction, sentiment analysis, text correction, text summarization, topic keywords, synonyms and near-synonyms, and event triple extraction)
Stars: ✭ 410 (+2177.78%)
Mutual labels:  sentiment-analysis, text-classification
Dan Jurafsky Chris Manning Nlp
My solutions to the Natural Language Processing course taught by Dan Jurafsky and Chris Manning in Winter 2012.
Stars: ✭ 124 (+588.89%)
Mutual labels:  sentiment-analysis, text-classification
TLA
A comprehensive tool for linguistic analysis of communities
Stars: ✭ 47 (+161.11%)
Mutual labels:  sentiment-analysis, text-classification
awesome-text-classification
Text classification meets word embeddings.
Stars: ✭ 27 (+50%)
Mutual labels:  sentiment-analysis, text-classification
Onnxt5
Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.
Stars: ✭ 143 (+694.44%)
Mutual labels:  sentiment-analysis, text-classification
NewsMTSC
Target-dependent sentiment classification in news articles reporting on political events. Includes a high-quality data set of over 11k sentences and a state-of-the-art classification model.
Stars: ✭ 54 (+200%)
Mutual labels:  sentiment-analysis, text-classification
Rcnn Text Classification
Tensorflow Implementation of "Recurrent Convolutional Neural Network for Text Classification" (AAAI 2015)
Stars: ✭ 127 (+605.56%)
Mutual labels:  sentiment-analysis, text-classification
Cnn Text Classification Keras
Text Classification by Convolutional Neural Network in Keras
Stars: ✭ 213 (+1083.33%)
Mutual labels:  sentiment-analysis, text-classification
Cluedatasetsearch
Search all Chinese NLP datasets, with commonly used English NLP datasets also included
Stars: ✭ 2,112 (+11633.33%)
Mutual labels:  sentiment-analysis, text-classification
Subject-and-Sentiment-Analysis
Topic and sentiment recognition of user opinions in the automotive industry
Stars: ✭ 24 (+33.33%)
Mutual labels:  sentiment-analysis, text-classification
Rnn Text Classification Tf
Tensorflow Implementation of Recurrent Neural Network (Vanilla, LSTM, GRU) for Text Classification
Stars: ✭ 114 (+533.33%)
Mutual labels:  sentiment-analysis, text-classification
Context
ConText v4: Neural networks for text categorization
Stars: ✭ 120 (+566.67%)
Mutual labels:  sentiment-analysis, text-classification
sarcasm-detection-for-sentiment-analysis
Sarcasm Detection for Sentiment Analysis
Stars: ✭ 21 (+16.67%)
Mutual labels:  sentiment-analysis, text-classification
COVID-19-Tweet-Classification-using-Roberta-and-Bert-Simple-Transformers
Rank 1 / 216
Stars: ✭ 24 (+33.33%)
Mutual labels:  sentiment-analysis, text-classification

Twitter Sentiment Analysis (Text classification)

Team: Hello World

Team Members: Sung Lin Chan, Xiangzhe Meng, Süha Kagan Köse

This repository is the final project of CS-433 Machine Learning Fall 2017 at EPFL. The private competition was hosted on Kaggle (EPFL ML Text Classification), where we had a complete dataset of 2,500,000 tweets: half of the tweets have positive labels and the other half have negative labels. Our task was to build a classifier to predict the labels of a test dataset of 10,000 tweets. This README illustrates the implementation of the classifier and presents the procedure to reproduce our work. The details of our implementation are given in the report. Ultimately, we ranked 9th out of 63 teams on the leaderboard.

Project Specification

See the project specification on the EPFL Machine Learning Course CS-433 GitHub page.

Hardware Environment

In this project, we use two instances on GCP (Google Cloud Platform): one to accelerate neural network training with a GPU, and one to speed up text preprocessing with multiprocessing.

For neural network training:

  • GPU Platform:
    • CPU: 6 vCPUs Intel Broadwell
    • RAM: 22.5 GB
    • GPU: 1 x NVIDIA Tesla P100
    • OS: Ubuntu 16.04 LTS

For text preprocessing:

  • Pure CPU Platform:
    • CPU: 24 vCPUs Intel Broadwell
    • RAM: 30 GB
    • OS: Ubuntu 16.04 LTS

Dependencies

All the scripts in this project run in Python 3.5.2, the default version on a GCP instance. As the neural network framework, we used Keras, a high-level neural networks API, with TensorFlow as the backend.

The NVIDIA GPU CUDA version is 8.0 and the cuDNN version is v6.0. Although there are newer versions of CUDA and cuDNN at this time, we use the stable versions recommended by the official TensorFlow website. For more information and an installation guide on setting up a GPU environment for TensorFlow, please see here.
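
As a quick check that the GPU environment is set up correctly (assuming TensorFlow 1.4 with GPU support is installed), you can list the devices TensorFlow sees; the Tesla P100 should appear as a GPU device:

    # List the devices visible to TensorFlow (TF 1.x API);
    # the Tesla P100 should show up as "/device:GPU:0".
    from tensorflow.python.client import device_lib

    print(device_lib.list_local_devices())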

  • [Scikit-Learn] (0.19.1) - Install the scikit-learn library with pip

    $ sudo pip3 install scikit-learn
  • [Gensim] (3.2.0) - Install the Gensim library

    $ sudo pip3 install gensim
  • [FastText] (0.8.3) - Install the FastText implementation

    $ sudo pip3 install fasttext
  • [NLTK] (3.2.5) - Install NLTK and download all packages

    // Install
    $ sudo pip3 install nltk
    
    // Download packages
    $ python3
    >>> import nltk
    >>> nltk.download()
  • [TensorFlow] (1.4.0) - Install TensorFlow. Depending on your platform, choose either the CPU-only version or the GPU version:

    // Without GPU version
    $ sudo pip3 install tensorflow
    
    // With GPU version
    $ sudo pip3 install tensorflow-gpu
  • [Keras] (1.4.0) - Install Keras

    $ sudo pip3 install keras
  • [XGBoost] (0.6a2) - Install XGBoost

    $ sudo pip3 install xgboost
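
As a sanity check after installation, you can print the installed versions from a Python shell and compare them against the ones listed above (the fasttext package is left out here, as it may not expose a version attribute):

    # Print installed versions to confirm they match the ones listed above.
    import sklearn, gensim, nltk, tensorflow, keras, xgboost

    for lib in (sklearn, gensim, nltk, tensorflow, keras, xgboost):
        print(lib.__name__, lib.__version__)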

Folder / Files

  • segmenter.py:
    Helper function for the preprocessing step.

  • data_loading.py:
    Helper function for loading the original dataset and outputting pandas DataFrame objects as pickles.

  • data_preprocessing.py:
    Preprocessing module. Takes the output of data_loading.py and outputs preprocessed tweets.

  • cnn_training.py:
    Module of three CNN models. Takes the output of data_preprocessing.py and generates results used as the input of xgboost_training.py.

  • xgboost_training.py:
    Module of the XGBoost model. Takes the output of cnn_training.py and generates the prediction result (see the sketch after this list).

  • run.py:
    Script for running the modules data_loading.py, data_preprocessing.py, cnn_training.py and xgboost_training.py.

  • data:
    This folder contains the necessary metadata and the intermediate files produced while running our scripts.

    • tweets: Contains the original train and test datasets downloaded from Kaggle.
    • dictionary: Contains the text files for text preprocessing.
    • pickles: Contains the intermediate files of preprocessed text, used as the input of the CNN models.
    • xgboost: Contains the intermediate output files of the CNN models; these are the input of the XGBoost model.
    • output: Contains the output file in Kaggle submission format from run.py.

    Note: The files inside tweets and dictionary are essential for running the scripts from scratch. Download tweets and dictionary. Then unzip the downloaded file and move the extracted tweets and dictionary folders into the data/ directory.

    If you want to skip the preprocessing step and the CNN training step, download the preprocessed data and the pretrained model. Then unzip the downloaded file and move all the extracted folders into the data/ directory.

  • othermodels:

    The files in this folder are the models we explored before arriving at our best model.

    • keras_nn_model.py: This is the classifier using an NN model, with GloVe as the word representation method. Each tweet is represented by the average of its word embeddings and fed into the NN model.

    • fastText_model.py: This is the classifier using FastText. The word representation is the FastText English pre-trained model.

    • svm_model.py: This is the classifier using a support vector machine. The word representation is TF-IDF, computed with Scikit-Learn's built-in method.
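
To make the hand-off between cnn_training.py and xgboost_training.py concrete, here is a minimal sketch of the stacking idea, assuming each model file holds one CNN's per-tweet predictions. The file names match those listed in the reproduction notes below; the label loading is a hypothetical placeholder, not our exact code:

    # Minimal sketch of the CNN -> XGBoost stacking step (illustrative only).
    import numpy as np
    import xgboost as xgb

    # Each train_model*.txt / test_model*.txt holds one CNN's predictions;
    # stacking them column-wise yields the meta-features for XGBoost.
    train_X = np.column_stack(
        [np.loadtxt('data/xgboost/train_model%d.txt' % i) for i in (1, 2, 3)])
    test_X = np.column_stack(
        [np.loadtxt('data/xgboost/test_model%d.txt' % i) for i in (1, 2, 3)])

    # Hypothetical placeholder: the real labels come from the preprocessed pickles.
    train_y = np.loadtxt('data/xgboost/train_labels.txt')

    clf = xgb.XGBClassifier()
    clf.fit(train_X, train_y)
    predictions = clf.predict(test_X)  # basis of the Kaggle submission file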

Reproduce Our Best Score on Kaggle

Here are our steps from the original dataset to the Kaggle submission file, in order. We have modularized each step into its own .py file, so they can be executed individually. For your convenience, we provide run.py, which runs the modules with a simple command; a minimal sketch of its dispatch follows the list below.

  1. Transform the dataset into a pandas DataFrame - data_loading.py
  2. Preprocess the dataset - data_preprocessing.py
  3. Train the CNN models - cnn_training.py
  4. Train the XGBoost model and generate the submission file - xgboost_training.py
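
The actual orchestration lives in run.py; as an illustration of the -m dispatch described below, it could look roughly like this (the run() entry points are hypothetical stand-ins for the real module functions):

    # Minimal sketch of run.py's -m dispatch (illustrative, not the exact script).
    import argparse

    import data_loading, data_preprocessing, cnn_training, xgboost_training

    parser = argparse.ArgumentParser()
    parser.add_argument('-m', choices=['all', 'cnn', 'xgboost'], required=True)
    mode = parser.parse_args().m

    if mode == 'all':                # start from the raw tweets
        data_loading.run()           # hypothetical entry points; the real
        data_preprocessing.run()     # module functions may be named differently
    if mode in ('all', 'cnn'):
        cnn_training.run()
    xgboost_training.run()           # every option ends with the XGBoost step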

First, make sure all the essential data is put into the "data/" directory.

Second, there are three options to generate the Kaggle submission file. We recommend the first option, which takes less than 10 minutes to reproduce the result with pretrained models.

  • If you want to skip the preprocessing step and the CNN training step, execute run.py with the -m argument "xgboost":

    $ python3 run.py -m xgboost

  Note: Make sure that test_model1.txt, test_model2.txt, test_model3.txt, train_model1.txt, train_model2.txt and train_model3.txt are in the "data/xgboost" directory in order to launch run.py successfully.

  • If you want to skip the preprocessing step and start from the CNN training step, execute run.py with the -m argument "cnn":

    $ python3 run.py -m cnn

Note: Make sure that train_clean.pkl and test_clean.pkl are in the "data/pickles" directory in order to launch run.py successfully.

  • If you want to run all the steps from scratch, execute run.py with the -m argument "all":

    $ python3 run.py -m all

Note: Our preprocessing step requires a large amount of CPU resources. It is a multiprocessing step and will occupy all CPU cores. It took one hour to finish this step on a 24-vCPU GCP instance, and an extra one and a half hours to finish the CNN training step with the NVIDIA P100.
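
For reference, this parallelism follows the standard multiprocessing pattern; a minimal sketch, where clean_tweet is a hypothetical stand-in for our actual per-tweet preprocessing:

    # Minimal sketch of the multiprocessing pattern used in preprocessing.
    from multiprocessing import Pool, cpu_count

    def clean_tweet(tweet):
        # Hypothetical stand-in: the real code performs segmentation,
        # dictionary lookups, and the other cleaning steps.
        return tweet.lower()

    if __name__ == '__main__':
        tweets = ['Example tweet one', 'Example tweet two']
        with Pool(cpu_count()) as pool:  # occupies all available CPU cores
            cleaned = pool.map(clean_tweet, tweets)
        print(cleaned)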

Finally, you can find prediction.csv in the "data/output" directory.

Contributors

  • Sung Lin Chan
  • Xiangzhe Meng
  • Süha Kagan Köse

License: MIT
