aerdem4 / Kaggle Quora Dup

Licence: mit
Solution to Kaggle's Quora Duplicate Question Detection Competition

The competition can be found at https://www.kaggle.com/c/quora-question-pairs. I ranked 23rd (top 1%) among 3307 teams with this solution. It is a relatively lightweight model compared with the other top solutions.

Prerequisites

  • This code is written in Python 3.5 and was tested on a machine with an Intel i5-6300HQ processor and an Nvidia GeForce GTX 950M. Keras is used with the TensorFlow backend and GPU support.

Pipeline

  • First, run the nlp_feature_extraction.py and non_nlp_feature_extraction.py scripts. They may take an hour to finish.
  • Then run model.py, which may take around 5 hours to make 10 different predictions on the test set.
  • Finally, ensemble and postprocess the predictions with postprocess.py.
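
The steps above can be chained in a small driver script. This is only a sketch: the script names come from the list above (the second one is assumed to be `non_nlp_feature_extraction.py`), while `build_commands` and `run_pipeline` are hypothetical helpers, and running everything from the repository root with the current interpreter is an assumption.

```python
import subprocess
import sys

# Script names are taken from the pipeline steps; their order matters because
# model.py needs the extracted features and postprocess.py needs the predictions.
PIPELINE = [
    "nlp_feature_extraction.py",
    "non_nlp_feature_extraction.py",
    "model.py",
    "postprocess.py",
]


def build_commands(python=sys.executable):
    """Return one command per pipeline step, in execution order."""
    return [[python, script] for script in PIPELINE]


def run_pipeline(dry_run=True):
    """Print each step and, unless dry_run is set, run it and stop on failure."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a step exits with an error.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

With `dry_run=True` the script only prints the commands, which is useful for checking the order before committing to the multi-hour training step.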

Model Explanation

  • Questions are preprocessed so that different ways of writing the same thing are unified. This way, the LSTM does not learn different things from these variants.
  • Words that occur more than 100 times in the train set are collected. The rest are considered rare words and are replaced by the word "memento", which is my favorite movie by C. Nolan. Since "memento" is irrelevant to almost anything, it is basically a placeholder. The number of rare words shared by both questions of a pair, and how many of those are numeric, are used as features. This whole process leads to better generalization in the LSTM, so that it cannot overfit particular pairs just by using these rare words.
  • The features mentioned above are merged with the NLP and non-NLP features. As a result, 4+15+6=25 features are prepared for the network.
  • The train data is divided into 10 folds. In every run, one fold is kept as the validation set for early stopping. So every run trains on a set that differs by one fold from the others, which can contribute to model variance. Since we are going to ensemble the models, a reasonable increase in model variance is something we may want. During the competition, I also did more 10-fold runs with different model parameters for better ensembling.
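
The rare-word replacement and the fold scheme described above could look roughly like the following. This is a minimal sketch, not the code from the repository: whitespace tokenization, the function names, the exact definition of the two rare-word features, and the interleaved fold assignment are all assumptions made for illustration.

```python
from collections import Counter

PLACEHOLDER = "memento"  # placeholder token named in the text
MIN_COUNT = 100          # keep words seen more than 100 times in the train set


def build_vocab(questions, min_count=MIN_COUNT):
    """Collect words that occur more than `min_count` times."""
    counts = Counter(word for q in questions for word in q.split())
    return {word for word, n in counts.items() if n > min_count}


def rare_word_features(q1, q2, vocab):
    """Mask rare words and count shared rare words (assumed feature definition)."""
    rare1 = {w for w in q1.split() if w not in vocab}
    rare2 = {w for w in q2.split() if w not in vocab}
    common_rare = rare1 & rare2
    features = {
        "common_rare": len(common_rare),
        "common_rare_numeric": sum(w.isdigit() for w in common_rare),
    }
    masked1 = " ".join(w if w in vocab else PLACEHOLDER for w in q1.split())
    masked2 = " ".join(w if w in vocab else PLACEHOLDER for w in q2.split())
    return masked1, masked2, features


def fold_splits(n_samples, n_folds=10):
    """Yield (train_idx, val_idx) pairs; each run holds out a different fold."""
    indices = list(range(n_samples))
    for k in range(n_folds):
        val = indices[k::n_folds]  # interleaved assignment, an assumption
        val_set = set(val)
        train = [i for i in indices if i not in val_set]
        yield train, val
```

The features are computed before masking because, once both questions contain only the placeholder, the information about which rare words they had in common is lost.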

Network Architecture

[Figure: network architecture diagram]

Postprocessing

What made my model successful? BETTER GENERALIZATION

  • All the features are question-order independent. When you swap the first and the second question, the feature matrix does not change. For example, instead of using question1_frequency and question2_frequency, I used min_frequency and max_frequency.
  • Feature values are bounded when necessary. For example, the number of neighbors is set to 5 for everything above 5, because I did not want to overfit on a particular pair with a specific number of neighbors such as 76.
  • Features generated by the LSTM are also question-order independent. Both questions share the same LSTM layer. After the LSTM layer, the outputs for question1 and question2 are merged with commutative operations, namely the square of the difference and the summation.
  • I think good preprocessing of the questions also leads to better generalization.
  • Replacing the rare words with a placeholder before the LSTM is another thing that I did for better generalization.
  • The neural network is not very big and has a reasonable amount of dropout and Gaussian noise.
  • Different NN predictions are ensembled at the end.
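
The order-independence ideas above can be illustrated in plain Python. This is a sketch under assumptions: the helper names are hypothetical, the encoded question vectors are stand-ins for LSTM outputs, and concatenating the two merged vectors is an assumed choice rather than something the text specifies.

```python
def order_independent(value_q1, value_q2):
    """Turn a per-question feature pair into (min, max), so swapping has no effect."""
    return min(value_q1, value_q2), max(value_q1, value_q2)


def clip_feature(value, upper=5):
    """Bound a count feature, e.g. treat every neighbor count above 5 as 5."""
    return min(value, upper)


def commutative_merge(h1, h2):
    """Merge two encoded questions with squared difference and sum.

    Both operations give the same result when h1 and h2 are swapped.
    """
    squared_diff = [(a - b) ** 2 for a, b in zip(h1, h2)]
    summed = [a + b for a, b in zip(h1, h2)]
    return squared_diff + summed  # list concatenation (assumed)


# Swapping the questions leaves every derived value unchanged.
print(order_independent(3, 7) == order_independent(7, 3))
print(commutative_merge([1.0, 2.0], [3.0, 5.0]) == commutative_merge([3.0, 5.0], [1.0, 2.0]))
```

Because every input to the network is symmetric in the two questions, the model cannot learn anything from which question happens to come first.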