
YuriyGuts / kaggle-quora-question-pairs

License: MIT

Programming Languages

Jupyter Notebook
11667 projects

Projects that are alternatives to, or similar to, kaggle-quora-question-pairs

Open Solution Home Credit
Open solution to the Home Credit Default Risk challenge 🏡
Stars: ✭ 397 (+281.73%)
Mutual labels:  competition, kaggle
Painters
🎨 Winning solution for the Painter by Numbers competition on Kaggle.
Stars: ✭ 257 (+147.12%)
Mutual labels:  competition, kaggle
Data-Science-Hackathon-And-Competition
Grandmaster in MachineHack (best rank: 3rd) | Top 70 in Analytics Vidhya & Zindi | Expert at Kaggle | Hack AI
Stars: ✭ 165 (+58.65%)
Mutual labels:  competition, kaggle
Open Solution Mapping Challenge
Open solution to the Mapping Challenge 🌎
Stars: ✭ 291 (+179.81%)
Mutual labels:  competition, kaggle
Data Science Bowl 2018
End-to-end one-class instance segmentation based on U-Net architecture for Data Science Bowl 2018 in Kaggle
Stars: ✭ 56 (-46.15%)
Mutual labels:  competition, kaggle
Open Solution Toxic Comments
Open solution to the Toxic Comment Classification Challenge
Stars: ✭ 154 (+48.08%)
Mutual labels:  competition, kaggle
open-solution-cdiscount-starter
Open solution to the Cdiscount’s Image Classification Challenge
Stars: ✭ 20 (-80.77%)
Mutual labels:  competition, kaggle
infinity
Infinity is a simple online puzzle hunt/jeopardy-style CTF platform.
Stars: ✭ 11 (-89.42%)
Mutual labels:  competition
Kaggle-Sea-Lions-Solution
NOAA Fisheries Steller Sea Lion Population Count
Stars: ✭ 13 (-87.5%)
Mutual labels:  kaggle
kaggle-satellite-imagery-feature-detection
Satellite Imagery Feature Detection (68 out of 419)
Stars: ✭ 29 (-72.12%)
Mutual labels:  kaggle
kaggle-tools
Some tools that I often find myself using in Kaggle challenges.
Stars: ✭ 33 (-68.27%)
Mutual labels:  kaggle
imaterialist-furniture-2018
Kaggle competition
Stars: ✭ 76 (-26.92%)
Mutual labels:  kaggle
Kaggle-Competition-Sberbank
Top 1% rankings (22/3270) code sharing for Kaggle competition Sberbank Russian Housing Market: https://www.kaggle.com/c/sberbank-russian-housing-market
Stars: ✭ 31 (-70.19%)
Mutual labels:  kaggle
fastknn
Fast k-Nearest Neighbors Classifier for Large Datasets
Stars: ✭ 64 (-38.46%)
Mutual labels:  kaggle
Flag-Capture
Solutions and write-ups from security-based competitions, also known as Capture The Flag (CTF) competitions
Stars: ✭ 84 (-19.23%)
Mutual labels:  competition
hackathon
Training Center's hackathon repository
Stars: ✭ 20 (-80.77%)
Mutual labels:  competition
TablutCompetition
Software for the Tablut Students Competition
Stars: ✭ 17 (-83.65%)
Mutual labels:  competition
Kaggle-Passenger-Screening-Challenge-Solution
10th place solution to the $1,500,000 Kaggle Passenger Screening Challenge sponsored by the Department of Homeland Security.
Stars: ✭ 19 (-81.73%)
Mutual labels:  kaggle
kaggle-plasticc
Solution to Kaggle's PLAsTiCC Astronomical Classification Competition
Stars: ✭ 50 (-51.92%)
Mutual labels:  kaggle
d2l-java
The Java implementation of Dive into Deep Learning (D2L.ai)
Stars: ✭ 94 (-9.62%)
Mutual labels:  kaggle

kaggle-quora-question-pairs

My solution to Kaggle Quora Question Pairs competition (Top 2%, Private LB log loss 0.13497).

Overview

The solution uses a mixture of purely statistical features, classical NLP features, and deep learning. Almost 200 handcrafted features are combined with out-of-fold predictions from four neural networks with different architectures.
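
As a generic illustration of the out-of-fold scheme (a sketch only, not the repository's actual code; make_model, X, and y are hypothetical placeholders):

    # Out-of-fold (OOF) stacking sketch: each model trains on K-1 folds and
    # predicts the held-out fold, so every training row receives a prediction
    # from a model that never saw it during training.
    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    def oof_predictions(make_model, X, y, n_splits=5, seed=42):
        oof = np.zeros(len(X))
        folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for train_idx, val_idx in folds.split(X, y):
            model = make_model()                      # fresh model per fold
            model.fit(X[train_idx], y[train_idx])
            oof[val_idx] = model.predict_proba(X[val_idx])[:, 1]
        return oof  # one leak-free meta-feature column for the final GBM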

The final model is a GBM (LightGBM), trained with early stopping and a very small learning rate, using stratified K-fold cross-validation.
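
A minimal sketch of that final stage, assuming illustrative parameters and hypothetical X_train, y_train, and X_test arrays (the actual notebook may differ):

    # Final-stage GBM sketch: LightGBM with a small learning rate and early
    # stopping, evaluated with stratified K-fold CV; fold models are averaged.
    import numpy as np
    import lightgbm as lgb
    from sklearn.model_selection import StratifiedKFold

    params = {"objective": "binary", "metric": "binary_logloss", "learning_rate": 0.01}
    test_preds = []
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for train_idx, val_idx in folds.split(X_train, y_train):
        dtrain = lgb.Dataset(X_train[train_idx], label=y_train[train_idx])
        dvalid = lgb.Dataset(X_train[val_idx], label=y_train[val_idx])
        model = lgb.train(params, dtrain, num_boost_round=10000,
                          valid_sets=[dvalid],
                          callbacks=[lgb.early_stopping(stopping_rounds=200)])
        test_preds.append(model.predict(X_test, num_iteration=model.best_iteration))
    submission = np.mean(test_preds, axis=0)  # average the fold models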

[Diagram: overall solution structure]

Reproducing the Solution

Hardware Requirements

Almost all of the code (except some third-party scripts) can efficiently utilize multi-core machines, although some notebooks are memory-hungry. All code has been tested on a machine with 64 GB RAM. For the non-neural notebooks, a c4.8xlarge AWS instance should work well.

For neural networks, a GPU is highly recommended. On a GTX 1080 Ti, it takes about 8-9 hours to complete all 4 "neural" notebooks.

You'll need about 30 GB of free disk space to store the pre-trained word embeddings and the extracted features.

Software Requirements

  1. Python >= 3.6.
  2. LightGBM (compiled from source; sample build commands below).
  3. FastText (compiled from source).
  4. Python packages from requirements.txt.
  5. (Recommended) NVIDIA CUDA and a GPU version of TensorFlow.
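
For reference, the native dependencies are typically built roughly like this (illustrative commands only; exact steps may vary by version, so consult each project's documentation):

    $ git clone --recursive https://github.com/microsoft/LightGBM
    $ cd LightGBM && mkdir build && cd build && cmake .. && make -j4
    $ git clone https://github.com/facebookresearch/fastText.git
    $ cd fastText && make
    $ pip install -r requirements.txt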

Environment Provisioning

You can spin up a fresh Ubuntu 16.04 AWS instance and use Ansible to perform all the necessary software installation and configuration (except the GPU-related components).

  1. Make sure to open the ports 22 and 8888 on the target machine.
  2. Navigate to the provisioning directory.
  3. Edit config.yml:
    • jupyter_plaintext_password: the password to set for the Jupyter server on the target machine.
    • kaggle_username, kaggle_password: your Kaggle credentials (required to download the competition datasets). Alternatively, download the datasets to the data folder manually.
  4. Edit inventory.ini and specify your instance DNS and the private key file (*.pem) used to access it (see the sample below).
  5. Run:
    $ ansible-galaxy install -r requirements.yml
    $ ansible-playbook playbook.yml -i inventory.ini
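
For illustration, config.yml and inventory.ini might end up looking like this (all values are placeholders to replace with your own):

    # config.yml
    jupyter_plaintext_password: "pick-a-strong-password"
    kaggle_username: "your-kaggle-username"
    kaggle_password: "your-kaggle-password"

    # inventory.ini
    ec2-203-0-113-10.compute-1.amazonaws.com ansible_user=ubuntu ansible_ssh_private_key_file=~/.ssh/your-key.pem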
    

Running the Code

Automatic

Run run-all.sh from the repository root. Check notebooks/output for execution progress and data/submissions for the final results.
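
That is, assuming the script is executable:

    $ ./run-all.sh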

Manual

Start a Jupyter server in the notebooks directory. If you used the Ansible playbook, the server will already be running on port 8888.
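
If you are not using the playbook, one way to start the server (assuming Jupyter is installed):

    $ cd notebooks
    $ jupyter notebook --port 8888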

Run the notebooks in the following order:

  1. Preprocessing.

    1) preproc-tokenize-spellcheck.ipynb
    2) preproc-extract-unique-questions.ipynb
    3) preproc-embeddings-fasttext.ipynb
    4) preproc-nn-sequences-fasttext.ipynb
    
  2. Feature extraction.

    Run all feature-*.ipynb notebooks in arbitrary order.

    Note: for faster execution, run all feature-oofp-nn-*.ipynb notebooks on a machine with a GPU and NVIDIA CUDA.

  3. Prediction.

    Run classify-lightgbm-cv-pred.ipynb. The output file will be saved as DATETIME-submission-draft-CVSCORE.csv.
