hengluchang / Quora-Paraphrase-Question-Identification

Licence: other
Paraphrase question identification using Feature Fusion Network (FFN).

Paraphrase Question Identification using Feature Fusion Network

Identify question pairs that have the same meaning. The Feature Fusion Network (FFN) learns rich features not only from sentence representations but also from hand-crafted features (HCFs).
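The fusion idea can be illustrated with a minimal sketch in plain Python. It assumes, purely for illustration, that each question is represented by the sum of its word vectors and that fusion is a simple concatenation with the hand-crafted feature vector. The toy embeddings, the function names `sentence_vector` and `fuse`, and the two example feature values are hypothetical and are not taken from the project code or paper.

```python
# Illustrative sketch only: toy embeddings and a concatenation-based fusion.
# The real FFN architecture is described in the project paper and may differ.

TOY_EMBEDDINGS = {
    "how": [0.1, 0.3],
    "learn": [0.5, -0.2],
    "python": [0.4, 0.9],
}


def sentence_vector(tokens, embeddings, dim=2):
    """Sum the word vectors of known tokens (unknown tokens are skipped)."""
    total = [0.0] * dim
    for token in tokens:
        vector = embeddings.get(token)
        if vector is not None:
            total = [a + b for a, b in zip(total, vector)]
    return total


def fuse(q1_vector, q2_vector, hand_crafted):
    """Concatenate both sentence vectors with the hand-crafted features."""
    return list(q1_vector) + list(q2_vector) + list(hand_crafted)


q1 = sentence_vector(["how", "learn", "python"], TOY_EMBEDDINGS)
q2 = sentence_vector(["learn", "python", "fast"], TOY_EMBEDDINGS)

# Two made-up hand-crafted values, e.g. a shared-word count and a ratio.
fused = fuse(q1, q2, hand_crafted=[2.0, 0.5])
print(len(fused))  # 2 + 2 + 2 values passed to a downstream classifier
```

In a trained network the sentence vectors would come from learned layers rather than a fixed lookup table; the sketch only shows where the hand-crafted features join the learned representation.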

For more detailed information, please see our project research paper: Paraphrase Question Identification using Feature Fusion Network.

Model architecture

Results

  • 0.895 testing accuracy for FFN (trained for 100 epochs)

Requirements

  • Python 3.5 for running FFN
  • Python 2.7 for running Random Forest (RF) baseline

Package dependencies

RF baseline

  • scikit-learn 0.18
  • nltk
  • pandas

FFN

  • numpy 1.11
  • matplotlib 1.5
  • Keras 1.2
  • scikit-learn 0.18
  • h5py 2.6
  • hdf5 1.8
  • TensorFlow 0.10

How to run

$ git clone https://github.com/hengluchang/Quora-Paraphrase-Question-Identification

Run Random Forest baseline

  • Create a folder named "dataset".
$ cd Quora-Paraphrase-Question-Identification
$ mkdir -p dataset
  • Go to the Kaggle Quora Question Pairs website, download train.csv.zip and test.csv.zip, and unzip both. Place train.csv and test.csv in the dataset directory.

  • Create 10 hand-crafted features (HCFs). This creates train_10features.csv and test_10features.csv in the dataset directory.

$ cd ..
$ python feature_gen.py ../dataset/train.csv ../dataset/test.csv
  • Run the Random Forest baseline on these 10 HCFs. This should give roughly 0.84 testing accuracy.
$ python run_baseline.py ../dataset/train_10features.csv
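To give a feel for what a hand-crafted feature for a question pair can look like, here is a minimal sketch in plain Python. The three features below (length difference, shared-word count, Jaccard overlap) and the helper names `tokenize` and `pair_features` are assumptions chosen for illustration; the 10 features actually produced by feature_gen.py are not listed in this README and may be different.

```python
# Hypothetical hand-crafted features for one question pair; the 10 features
# actually produced by feature_gen.py are not listed here and may differ.

PUNCTUATION = "?.,!\"'"


def tokenize(question):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    words = (w.strip(PUNCTUATION) for w in question.lower().split())
    return [w for w in words if w]


def pair_features(q1, q2):
    """Return a few simple similarity features for two questions."""
    t1, t2 = tokenize(q1), tokenize(q2)
    s1, s2 = set(t1), set(t2)
    shared = s1 & s2
    union = s1 | s2
    return {
        "len_diff": abs(len(t1) - len(t2)),
        "shared_words": len(shared),
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


features = pair_features(
    "How do I learn Python?",
    "How can I learn Python quickly?",
)
print(features)
```

Features of this kind are cheap to compute per row, which is why they suit a tree-based baseline such as a Random Forest.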

Run Feature Fusion Network (FFN)

  • Train FFN w/o HCF
$ python3 train_noHCF.py -i <QUESTION_PAIRS_FILE> -t <TEST_QUESTION_PAIRS_FILE> -g <GLOVE_FILE> -w <MODEL_WEIGHTS_FILE> -e <WORD_EMBEDDING_MATRIX_FILE> -n <NB_WORDS_DATA_FILE>

For instance:

$ python3 train_noHCF.py -i train_rebalanced.csv -t test.csv -g glove.840B.300d.txt -w question_pairs_weights_100epoch_test10_val10_dropout20_sumOP_noAVG_rebalanced.h5  -e word_embedding_matrix_trainANDtest_rebalanced.npy -n nb_words_trainANDtest_rebalanced.json
  • Train FFN
$ python3 train_HCF.py -i <QUESTION_PAIRS_FILE> -t <TEST_QUESTION_PAIRS_FILE> -f <HCF_FILE> -g <GLOVE_FILE> -w <MODEL_WEIGHTS_FILE> -e <WORD_EMBEDDING_MATRIX_FILE> -n <NB_WORDS_DATA_FILE>

For instance:

$ python3 train_HCF.py -i train_rebalanced.csv -t test.csv -f train_rebalanced_10features.csv -g glove.840B.300d.txt -w question_pairs_weights_100epoch_test10_val20_dropout20_sumOP_noAVG_HCF_rebalanced.h5  -e word_embedding_matrix_trainANDtest_rebalanced.npy -n nb_words_trainANDtest_rebalanced.json
  • Test FFN w/o HCF
$ python3 test_noHCF.py -i <QUESTION_PAIRS_FILE> -o <RESULT_FILE> -e <WORD_EMBEDDING_MATRIX_FILE> -n <NB_WORDS_DATA_FILE> -w <MODEL_WEIGHTS_FILE>

For instance:

$ python3 test_noHCF.py -i test.csv  -o result_question_pairs_weights_100epoch_test10_val10_dropout20_sumOP_noAVG_rebalanced.csv -e word_embedding_matrix_trainANDtest_rebalanced.npy -n nb_words_trainANDtest_rebalanced.json -w question_pairs_weights_100epoch_test10_val10_dropout20_sumOP_noAVG_rebalanced.h5 
  • Test FFN
$ python3 test_HCF.py -i <QUESTION_PAIRS_FILE> -f <HCF_FILE> -o <RESULT_FILE> -e <WORD_EMBEDDING_MATRIX_FILE> -n <NB_WORDS_DATA_FILE> -w <MODEL_WEIGHTS_FILE>

For instance:

$ python3 test_sum_HCF.py -i test.csv -f test_10features.csv -o result_question_pairs_weights_100epoch_test10_val10_dropout20_sumOP_noAVG_HCF_rebalanced.csv -e word_embedding_matrix_trainANDtest_rebalanced.npy -n nb_words_trainANDtest_rebalanced.json -w question_pairs_weights_100epoch_test10_val10_dropout20_sumOP_noAVG_HCF_rebalanced.h5
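For readers adapting these scripts, the short flags used in the test command above could be parsed with `argparse` roughly as follows. This is a sketch, not the project's actual parser: only the short flags (`-i`, `-f`, `-o`, `-e`, `-n`, `-w`) come from the commands in this README, while the long option names, help strings, and the shortened file names `result.csv` and `weights.h5` are hypothetical.

```python
import argparse

# Sketch of a parser for the flags shown above. The long option names and
# help strings are hypothetical; only the short flags come from the commands.


def build_parser():
    parser = argparse.ArgumentParser(
        description="Test FFN with hand-crafted features (sketch)"
    )
    parser.add_argument("-i", "--input", required=True,
                        help="question pairs CSV")
    parser.add_argument("-f", "--features", required=True,
                        help="hand-crafted features CSV")
    parser.add_argument("-o", "--output", required=True,
                        help="result CSV to write")
    parser.add_argument("-e", "--embedding", required=True,
                        help="word embedding matrix (.npy)")
    parser.add_argument("-n", "--nb-words", required=True,
                        help="word count data file (.json)")
    parser.add_argument("-w", "--weights", required=True,
                        help="model weights (.h5)")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs as-is.
args = build_parser().parse_args([
    "-i", "test.csv",
    "-f", "test_10features.csv",
    "-o", "result.csv",
    "-e", "word_embedding_matrix_trainANDtest_rebalanced.npy",
    "-n", "nb_words_trainANDtest_rebalanced.json",
    "-w", "weights.h5",
])
print(args.input, args.features, args.nb_words)
```

Marking every option as required makes a missing file argument fail immediately with a usage message instead of surfacing later as a file-not-found error.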

Reference
