aspk / Quora_question_pairs_NLP_Kaggle

Licence: other
Quora Kaggle competition: natural language processing using word2vec embeddings, with scikit-learn and XGBoost for training

Programming Languages

Jupyter Notebook
11667 projects
python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Quora question pairs NLP Kaggle

datascienv
datascienv is a package that helps you set up your environment in a single line of code with all dependencies; it also includes pyforest, which provides a single-line import of all required ML libraries
Stars: ✭ 53 (+211.76%)
Mutual labels:  scikit-learn, xgboost
handson-ml
Jupyter notebooks containing the examples and exercises from the book "Hands-On Machine Learning" (Korean edition).
Stars: ✭ 285 (+1576.47%)
Mutual labels:  scikit-learn, xgboost
Nyoka
Nyoka is a Python library to export ML/DL models into PMML (PMML 4.4.1 Standard).
Stars: ✭ 127 (+647.06%)
Mutual labels:  scikit-learn, xgboost
Tpot
A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
Stars: ✭ 8,378 (+49182.35%)
Mutual labels:  scikit-learn, xgboost
Auto viml
Automatically Build Multiple ML Models with a Single Line of Code. Created by Ram Seshadri. Collaborators Welcome. Permission Granted upon Request.
Stars: ✭ 216 (+1170.59%)
Mutual labels:  scikit-learn, xgboost
Doc2vec
📓 Long(er) text representation and classification using Doc2Vec embeddings
Stars: ✭ 92 (+441.18%)
Mutual labels:  scikit-learn, nlp-machine-learning
Stacking
Stacked Generalization (Ensemble Learning)
Stars: ✭ 173 (+917.65%)
Mutual labels:  scikit-learn, xgboost
Openscoring
REST web service for the true real-time scoring (<1 ms) of Scikit-Learn, R and Apache Spark models
Stars: ✭ 536 (+3052.94%)
Mutual labels:  scikit-learn, xgboost
Eli5
A library for debugging/inspecting machine learning classifiers and explaining their predictions
Stars: ✭ 2,477 (+14470.59%)
Mutual labels:  scikit-learn, xgboost
Hyperactive
A hyperparameter optimization and data collection toolbox for convenient and fast prototyping of machine-learning models.
Stars: ✭ 182 (+970.59%)
Mutual labels:  scikit-learn, xgboost
Mljar Supervised
Automated Machine Learning Pipeline with Feature Engineering and Hyper-Parameters Tuning 🚀
Stars: ✭ 961 (+5552.94%)
Mutual labels:  scikit-learn, xgboost
Machine-Learning-Models
In this repository I implemented machine learning methods ranging from simple to complex, trying to build template-style code.
Stars: ✭ 30 (+76.47%)
Mutual labels:  xgboost, nlp-machine-learning
Machine Learning Alpine
Alpine Container for Machine Learning
Stars: ✭ 30 (+76.47%)
Mutual labels:  scikit-learn, xgboost
Auto ml
[UNMAINTAINED] Automated machine learning for analytics & production
Stars: ✭ 1,559 (+9070.59%)
Mutual labels:  scikit-learn, xgboost
Hyperparameter hunter
Easy hyperparameter optimization and automatic result saving across machine learning algorithms and libraries
Stars: ✭ 648 (+3711.76%)
Mutual labels:  scikit-learn, xgboost
M2cgen
Transform ML models into a native code (Java, C, Python, Go, JavaScript, Visual Basic, C#, R, PowerShell, PHP, Dart, Haskell, Ruby, F#, Rust) with zero dependencies
Stars: ✭ 1,962 (+11441.18%)
Mutual labels:  scikit-learn, xgboost
Dtreeviz
A python library for decision tree visualization and model interpretation.
Stars: ✭ 1,857 (+10823.53%)
Mutual labels:  scikit-learn, xgboost
Autoviz
Automatically Visualize any dataset, any size with a single line of code. Created by Ram Seshadri. Collaborators Welcome. Permission Granted upon Request.
Stars: ✭ 310 (+1723.53%)
Mutual labels:  scikit-learn, xgboost
Mars
Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and Python functions.
Stars: ✭ 2,308 (+13476.47%)
Mutual labels:  scikit-learn, xgboost
go-ml-benchmarks
⏱ Benchmarks of machine learning inference for Go
Stars: ✭ 27 (+58.82%)
Mutual labels:  scikit-learn, xgboost

Duplicate question detection using Word2Vec, XGBoost and Autoencoders

In this post, I tackle the problem of classifying question pairs based on whether or not they are duplicates. This is important for companies like Quora or Stack Overflow, where many newly posted questions are duplicates of questions already answered. If an algorithm spots a duplicate question, the user can be directed to it and reach the answer faster.

An example of two duplicate questions is 'How do I read and find my YouTube comments?' and 'How can I see all my YouTube comments?'; an example of non-duplicate questions is 'What's causing someone to be jealous?' and 'What can I do to avoid being jealous of someone?'. Two approaches are applied to this problem:

  1. A sequence encoder trained with an autoencoder approach, followed by dynamic pooling for classification
  2. A bag-of-words model with logistic regression and XGBoost classifiers

The bag-of-words model with ngrams = 4 and min_df = 0 achieves an accuracy of 82% with XGBoost, compared to 89.5%, the best accuracy reported in the literature, obtained with a BiLSTM and attention. The encoder approach implemented here achieves 63.8% accuracy, lower than the other approaches. I still found it interesting because of the autoencoder implementation and because the approach compares similarity between phrases as well as words for variable-length sequences. Perhaps the accuracy could be improved by changing the dimensions of the dynamically pooled matrix, by a different approach to cleaning the data, or by spell checking.

Classifiers can be compared using three different evaluation metrics: log loss, AUC, and accuracy. Log loss, or cross-entropy loss, indicates how different the probability distribution output by the classifier is from the true distribution of the class labels. The receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate; an area under the curve (AUC) of 0.5 corresponds to a random classifier, and the higher the AUC, the better the classifier. Accuracy is a simple metric that calculates the fraction of correctly predicted labels.
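
As a quick, hedged illustration, all three metrics can be computed with scikit-learn; the labels and probabilities below are made-up placeholders, not model output.

```python
# Placeholder sketch: comparing a classifier on log loss, AUC and accuracy.
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score, accuracy_score

y_true = np.array([0, 1, 1, 0, 1])             # true duplicate / non-duplicate labels
y_prob = np.array([0.2, 0.8, 0.6, 0.3, 0.9])   # predicted probability of "duplicate"
y_pred = (y_prob > 0.5).astype(int)            # hard labels from a 0.5 threshold

print("log loss:", log_loss(y_true, y_prob))        # cross-entropy between predicted and true distributions
print("AUC:", roc_auc_score(y_true, y_prob))        # 0.5 corresponds to a random classifier
print("accuracy:", accuracy_score(y_true, y_pred))  # fraction of correctly predicted labels
```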

In this post, I use accuracy as the metric for comparison, as there is no specific reason to do otherwise.

BOW model

As shown in the figure, as min_df is increased from 0 to 600 the accuracy decreases from 80% to 72% for ngrams = 4. min_df thresholds the n-grams admitted to the vocabulary: any n-gram whose document frequency in the corpus is below min_df is ignored. N-grams beyond 4 are not used, as there is a negligible change in accuracy when going from 3 to 4. A tf-idf vectorizer is used instead of a count vectorizer to speed up computation, and it also increases the accuracy by a small amount (less than 1% for one data point). An accuracy of 82% is obtained by running the same input through XGBoost.
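
A minimal sketch of this BOW pipeline follows. The two-row placeholder data, the way the two questions' tf-idf vectors are concatenated, and the XGBoost settings are all illustrative assumptions rather than the exact configuration used here (min_df=1 below keeps every n-gram, the same effect as min_df = 0).

```python
# Hedged sketch of the bag-of-words + XGBoost pipeline.
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier

# Placeholder data in the Kaggle competition's format.
train = pd.DataFrame({
    "question1": ["How do I read and find my YouTube comments?",
                  "What's causing someone to be jealous?"],
    "question2": ["How can I see all my YouTube comments?",
                  "What can I do to avoid being jealous of someone?"],
    "is_duplicate": [1, 0],
})

# Word n-grams up to length 4; min_df=1 keeps every n-gram seen in the corpus.
vectorizer = TfidfVectorizer(ngram_range=(1, 4), min_df=1)
vectorizer.fit(pd.concat([train.question1, train.question2]))

# One simple feature representation: concatenate the tf-idf vectors of both questions.
X = hstack([vectorizer.transform(train.question1),
            vectorizer.transform(train.question2)]).tocsr()
y = train.is_duplicate

clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
clf.fit(X, y)
print(clf.predict_proba(X)[:, 1])  # probability that each pair is a duplicate
```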

For the BOW model parameter sweep, the vocabulary size ranges from 703,912 (ngrams = 4 and min_df = 0) to 1,018 (ngrams = 1 and min_df = 600).

Auto-encoder and Dynamic Pooling CNN classifier

The figure above shows the implemented model, which is similar to Socher et al. The Word2Vec embeddings are generated with a vocabulary size of 100,000, following the TensorFlow Word2Vec open-source release, using the skip-gram model. In these embeddings, words that share similar contexts have a smaller cosine distance. The key problem is dealing with questions of different lengths. The information content of a sentence is compressed by training an autoencoder. The main motivation behind this approach is to find similarity between sentences by comparing the entire sentences as well as the phrases within them. The problem of different lengths is circumvented by upsampling and dynamic pooling, as described below.
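
The embeddings themselves are trained following the TensorFlow word2vec_basic.py tutorial linked in the references. As a rough stand-in, the same kind of skip-gram embedding can be sketched with gensim; the toy sentences and the 128-dimensional size below are assumptions.

```python
# Hedged sketch: skip-gram Word2Vec embeddings via gensim (gensim >= 4 API).
from gensim.models import Word2Vec

questions = [
    ["how", "do", "i", "read", "and", "find", "my", "youtube", "comments"],
    ["how", "can", "i", "see", "all", "my", "youtube", "comments"],
]

# sg=1 selects the skip-gram architecture; vector_size is the embedding dimension.
model = Word2Vec(sentences=questions, vector_size=128, window=5, min_count=1, sg=1)

# Words sharing similar contexts should end up at a small cosine distance.
print(model.wv.similarity("read", "see"))
```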

Sentences are encoded using the approach shown in the left figure. The three words and the two encodings are taken as input to generate the similarity matrix. The autoencoder is trained as shown in the right figure using TensorFlow; the right figure describes the encoder-decoder architecture. I used a single-layer neural network for both the encoder and the decoder; multiple hidden layers could also be considered. Multiple batches of words are concatenated and fed into the encoder, and in the ideal case the output of the decoder matches the input. The mean squared error loss of the network is minimized with a gradient descent optimizer with a learning rate of 0.1, and an L2 regularization coefficient of 1e-4 is used for the encoder and decoder weights.
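
Below is a minimal Keras sketch of such a single-layer encoder/decoder with the same loss, optimizer, and regularization settings; the post uses lower-level TensorFlow code, and the 128-dimensional embeddings, tanh activation, and random placeholder batch here are assumptions.

```python
# Hedged sketch of the word-pair autoencoder: MSE loss, SGD (lr = 0.1), L2 = 1e-4.
import numpy as np
import tensorflow as tf

embed_dim = 128                                   # assumed word2vec embedding size
reg = tf.keras.regularizers.l2(1e-4)              # L2 coefficient from the post

inputs = tf.keras.Input(shape=(2 * embed_dim,))   # two concatenated word embeddings
encoded = tf.keras.layers.Dense(embed_dim, activation="tanh",
                                kernel_regularizer=reg, name="encoder")(inputs)
decoded = tf.keras.layers.Dense(2 * embed_dim, kernel_regularizer=reg,
                                name="decoder")(encoded)

autoencoder = tf.keras.Model(inputs, decoded)
autoencoder.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1), loss="mse")

# In the ideal case the decoder output reproduces the input batch of word pairs.
pairs = np.random.rand(32, 2 * embed_dim).astype("float32")  # placeholder batch
autoencoder.fit(pairs, pairs, epochs=1, verbose=0)
```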

The autoencoder here takes any two words for training and can be batch trained. This differs from the approach of Socher et al., where the entire sentence is encoded and then decoded by unfolding it back into the question. An unfolding autoencoder is difficult, perhaps even impossible, to implement in graph-mode TensorFlow; dynamic computational graph tools such as PyTorch could be a better fit for implementing the full approach.

The entire sentence with its intermediate encodings is used as input to the upsampling and dynamic pooling phase. In the upsampling phase, the shorter vector of the question pair is upsampled by repeating randomly chosen encodings until it matches the length of the other question's encodings. A pairwise similarity matrix is then generated, and this variable-dimension matrix is pooled into a matrix of npool x npool; I used npool = 28. This matrix is fed into a CNN classifier to decide whether the pair is duplicate or not. A hyperparameter optimization of npool could also increase the accuracy. The accuracy of this model is 63.8%.
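
A hedged numpy sketch of the upsampling and dynamic pooling step is shown below; the cosine similarity, the block-averaging pooling rule, and the placeholder encodings are illustrative assumptions rather than the exact implementation.

```python
# Hedged sketch: upsample the shorter question, build a similarity matrix,
# and pool it into a fixed npool x npool grid for the CNN classifier.
import numpy as np

def upsample(enc, target_len):
    """Repeat randomly chosen encodings so the shorter question matches the longer one."""
    idx = np.sort(np.random.choice(len(enc), size=target_len, replace=True))
    return enc[idx]

def similarity_matrix(enc_a, enc_b):
    """Pairwise cosine similarity between two sequences of word/phrase encodings."""
    a = enc_a / np.linalg.norm(enc_a, axis=1, keepdims=True)
    b = enc_b / np.linalg.norm(enc_b, axis=1, keepdims=True)
    return a @ b.T

def dynamic_pool(sim, npool=28):
    """Average-pool a variable-size matrix into a fixed npool x npool matrix."""
    # If a dimension is smaller than npool, repeat entries so no pooling bin is empty.
    if sim.shape[0] < npool:
        sim = sim[np.linspace(0, sim.shape[0] - 1, npool).round().astype(int), :]
    if sim.shape[1] < npool:
        sim = sim[:, np.linspace(0, sim.shape[1] - 1, npool).round().astype(int)]
    rows = np.array_split(np.arange(sim.shape[0]), npool)
    cols = np.array_split(np.arange(sim.shape[1]), npool)
    return np.array([[sim[np.ix_(r, c)].mean() for c in cols] for r in rows])

q1_enc = np.random.rand(9, 128)    # placeholder: 9 word/phrase encodings of size 128
q2_enc = np.random.rand(40, 128)   # placeholder: a longer question
if len(q1_enc) < len(q2_enc):
    q1_enc = upsample(q1_enc, len(q2_enc))

pooled = dynamic_pool(similarity_matrix(q1_enc, q2_enc))
print(pooled.shape)                # (28, 28), the fixed-size input to the CNN
```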

Issues

I faced some issues with sklearn's logistic regression: the model output the right class labels but the wrong probabilities. I haven't figured out a solution to this problem. There was no such problem with XGBoost.
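
For reference, one way to sanity-check this kind of behaviour is to compare the hard labels from predict() with a 0.5 threshold on predict_proba(); the synthetic data below is just a stand-in for the question-pair features.

```python
# Hedged diagnostic sketch: do the predicted labels agree with the probabilities?
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)   # placeholder features/labels
clf = LogisticRegression(max_iter=1000).fit(X, y)

labels = clf.predict(X)
probs = clf.predict_proba(X)[:, 1]
print("labels match thresholded probabilities:",
      np.array_equal(labels, (probs > 0.5).astype(int)))
```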

References

Best question pair matching method: Wang, Zhiguo, Wael Hamza, and Radu Florian. "Bilateral multi-perspective matching for natural language sentences." arXiv preprint arXiv:1702.03814 (2017).

Understanding cross-entropy loss and visualizing information: http://colah.github.io/posts/2015-09-Visual-Information/

Unfolding recursive autoencoder approach: Socher, Richard, et al. "Dynamic pooling and unfolding recursive autoencoders for paraphrase detection." Advances in Neural Information Processing Systems. 2011.

Word2Vec embeddings TensorFlow open-source release: https://github.com/tensorflow/tensorflow/blob/r1.9/tensorflow/examples/tutorials/word2vec/word2vec_basic.py
TensorFlow: https://www.tensorflow.org/
