yew1eb / Dm Competition Getting Started

Getting-started practice for data mining competitions (Kaggle, Data Castle, Analytics Vidhya, DrivenData)


Data Mining Competition Getting Started


Analytics Vidhya

AV Loan Prediction url

A small practice problem: predict from user attributes whether a housing loan will be granted; a binary classification task.
There are 11 features (Gender, Married, Dependents, Education, Self_Employed, ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History, Property_Area). Loan_ID is the user ID and Loan_Status is the target to predict. The features include both numeric and categorical types.
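A minimal sketch of the mixed numeric/categorical setup described above: one-hot encode the categorical columns and fit a logistic regression baseline. The data values below are made up; only the column names come from the competition schema.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy frame mimicking a few of the competition's columns (values are invented)
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Married": ["Yes", "No", "Yes", "No"],
    "ApplicantIncome": [5000, 3000, 4000, 6000],
    "LoanAmount": [130.0, 70.0, 110.0, 150.0],
    "Loan_Status": ["Y", "N", "Y", "N"],
})

# One-hot encode the categorical columns; numeric columns pass through unchanged
X = pd.get_dummies(df.drop(columns="Loan_Status"))
y = (df["Loan_Status"] == "Y").astype(int)

clf = LogisticRegression().fit(X, y)
preds = clf.predict(X)
print(preds)
```

In the real competition you would also impute missing values and evaluate on a held-out split rather than the training data.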

Data Castle

微额借款用户人品预测大赛 url

Same setup as above; the difference is that this one has considerably more features.

Kaggle

Digit Recognizer url

A multi-class classification practice problem.

Titanic: Machine Learning from Disaster url

A binary classification problem: submit 0/1 labels. The evaluation metric is accuracy.

Bag of Words Meets Bags of Popcorn url

A binary text sentiment classification problem. The evaluation metric is AUC. See also: http://www.cnblogs.com/lijingpeng/p/5787549.html
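For a text sentiment task like this, a common baseline is TF-IDF features plus logistic regression, scored by AUC. A minimal sketch on invented toy reviews (not the competition data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Toy reviews standing in for the movie-review data; labels: 1 = positive
texts = [
    "great movie, loved it",
    "terrible plot and acting",
    "wonderful film, would watch again",
    "awful, a boring waste of time",
]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)          # sparse TF-IDF matrix
clf = LogisticRegression().fit(X, labels)

# AUC is computed from predicted probabilities, not hard 0/1 labels
scores = clf.predict_proba(X)[:, 1]
auc = roc_auc_score(labels, scores)
print(auc)
```

Since AUC ranks by score, submitting `predict_proba` outputs rather than thresholded labels is what the metric rewards.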

Display Advertising Challenge url

This is an ad CTR prediction competition sponsored by the well-known advertising company Criteo. The data comprises 40 million training samples and 5 million test samples; the features are 13 numeric features and 26 categorical features, and the evaluation metric is logloss.

The standard industry approach to CTR is LR, with heavy feature combination/transformation that can reach hundreds of millions of dimensions, so LR was my first choice here too. I filled missing values with the mode and one-hot encoded the 26 categorical features. Plotting the numeric features with pandas showed they are far from normally distributed, with heavy skew, so instead of scaling to [0, 1] I split each one into 6 intervals at the five quantile points (min, 25%, median, 75%, max), with negative/oversized values assigned to intervals 1 and 6 as outliers, and then one-hot encoded everything together. The final feature space was about 1 million dimensions, and the training file was 20+ GB.

Pitfalls to watch out for:

1. Implement one-hot encoding yourself unless your machine has enough memory (sklearn requires loading everything into numpy, non-sparse).
2. Train LR with SGD or mini-batches in out-of-core mode (http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html#example-applications-plot-out-of-core-classification-py), unless, again, your memory is large enough.
3. Think twice before code. With this much data, rerunning after a mid-way failure is very costly in time.

I found that sklearn's LR and liblinear's LR behave quite differently: sklearn's L2 regularization beats its L1, while liblinear's L1 beats its L2. My understanding is that this comes from their different optimization methods. The best result was liblinear's L1-regularized LR with logloss = 0.46601 (LB 227th/718), which matches the intuition that lasso produces sparse models.

I also tried xgboost on its own: logloss = 0.46946, likely because GBRT handles high-dimensional sparse features poorly. Facebook has a paper that feeds GBRT outputs as transformed features into a downstream linear classifier with good results; see "Practical Lessons from Predicting Clicks on Ads at Facebook".

I only tried LR as a baseline. There are many further approaches; see the winners' solutions in the forum, e.g.: 1. the Vowpal Wabbit tool, which does not require distinguishing categorical from numeric features; 2. the libFFM tool for feature crosses; 3. the feature hashing trick; 4. adding each feature's average click-through rate as a new feature; 5. multi-model ensembles.
