HouJP / Kaggle Quora Question Pairs

Kaggle: Quora Question Pairs, 4th/3396 (https://www.kaggle.com/c/quora-question-pairs)

Labels

kaggle feature-engineering

Kaggle: Quora Question Pairs (Coming Soon)

Authors: Liang Pang, Yixing Fan, Jianpeng Hou, Xinyu Yue, Guocheng Niu

Categories

Abstract
Summary

Abstract

In the Quora Question Pairs Challenge, we were asked to build a model that classifies whether the two questions in a pair are duplicates (that is, different wordings of the same question). Our final submission was a stacked ensemble of multiple models. It scored 0.11450 on the Public LB and 0.11768 on the Private LB (with post-processing), ranking 4th out of 3396 teams. This document describes our team's solution, which can be divided into four parts: Pre-processing, Feature Engineering, Modeling and Post-processing.

Summary

Our solution consisted of four main parts: Pre-processing, Feature Engineering, Modeling and Post-processing. In addition, we developed a lightweight machine learning framework, FeatWheel, to help with ML jobs such as feature extraction and feature merging.

In pre-processing, we processed the question text with text cleaning, word stemming, stop word removal and shared word removal, which produced several different versions of the original data. In feature engineering, we extracted features from these versions of the data. The features fall into three categories: Statistical Features, NLP Features and Graph Features. In modeling, we built deep models, boosting models (using XGBoost and LightGBM) and linear models (Linear Regression), and combined them with a multi-layer stacking system. Because the distributions of the training data and the test data are quite different, we post-processed the predictions: we split the data into parts according to clique size and rescaled the predictions within each part.
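As an illustration of how pre-processing can yield several versions of the same question pair, here is a minimal sketch. It is not the team's code: the function names (`clean`, `make_versions`), the regex-based cleaning and the tiny stop word list are all hypothetical, and stemming is left out to keep the example free of dependencies.

```python
import re

# Hypothetical, deliberately tiny stop word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "what", "how", "do", "i", "to", "of", "in"}


def clean(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def make_versions(q1, q2):
    """Return several token-level versions of one question pair."""
    t1, t2 = clean(q1), clean(q2)
    # Version with stop words removed.
    n1 = [w for w in t1 if w not in STOP_WORDS]
    n2 = [w for w in t2 if w not in STOP_WORDS]
    # Version with words shared by both questions removed
    # (computed here on top of the stop word removal, which is an assumption).
    shared = set(n1) & set(n2)
    d1 = [w for w in n1 if w not in shared]
    d2 = [w for w in n2 if w not in shared]
    return {
        "cleaned": (t1, t2),
        "no_stop_words": (n1, n2),
        "no_shared_words": (d1, d2),
    }


versions = make_versions("How do I learn Python?", "What is the best way to learn Python?")
```

Each entry of `versions` can then be fed to the same feature extractors, so one feature definition produces one feature per version of the data.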

Flowchart

The flowchart of our method is shown as follows:

Submission

Submissions were evaluated on the log loss between the predicted values and the ground truth. Specifically, the best single model we obtained during the competition was an XGBoost model with the tree booster, which scored 0.12653 on the Public LB and 0.13067 on the Private LB (without post-processing). Our final submission was a stacked ensemble of multiple models. It scored 0.11450 on the Public LB and 0.11768 on the Private LB (with post-processing), ranking 4th out of 3396 teams.
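For readers unfamiliar with the metric, the following is a minimal sketch of binary log loss in plain Python. The function name and the clipping constant `eps` are choices made for this example, not details taken from the competition's scoring code.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip probabilities away from 0 and 1 so the logarithm stays finite.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)
```

Lower is better: predicting 0.5 for every pair gives a loss of ln 2 (about 0.693), and confident wrong predictions are penalised heavily, which is why rescaling predictions to match the test distribution can matter for this metric.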

Deep Model

Please see TextNet and TextNet-Model. For a TensorFlow version, please check out MatchZoo.
