luoda888 / Tianchi Diabetes Top12

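Tianchi Precision Medicine Competition: Genetic Risk Prediction for Diabetes

Top12 approach. Because the preliminary and final rounds posed very different tasks, only a few ideas for the final round are given here, as a starting point for discussion.

Feature engineering

New feature construction

1. Build features from the four arithmetic operations (addition, subtraction, multiplication, division) to capture interactions between features (with interpretable gene antagonism and gene synergy in mind)
2. Build numeric features from each feature itself, such as powers and roots
3. Build features with a polynomial-features package (this did not perform well on the online leaderboard)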
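
Items 1 and 2 above can be sketched roughly as follows. This is only an illustration, not the code used in the competition: the helper name `add_constructed_features`, the toy column names, and the `eps` guard against division by zero are all assumptions.

```python
import numpy as np
import pandas as pd


def add_constructed_features(df, cols, eps=1e-6):
    """Return a copy of df with pairwise arithmetic and per-column power/root features."""
    out = df.copy()
    for i, a in enumerate(cols):
        # Per-feature transforms: square and square root (of the absolute value,
        # so that negative inputs do not produce NaN).
        out[f"{a}_pow2"] = df[a] ** 2
        out[f"{a}_sqrt"] = np.sqrt(df[a].abs())
        for b in cols[i + 1:]:
            # Pairwise interactions from the four arithmetic operations.
            out[f"{a}_add_{b}"] = df[a] + df[b]
            out[f"{a}_sub_{b}"] = df[a] - df[b]
            out[f"{a}_mul_{b}"] = df[a] * df[b]
            out[f"{a}_div_{b}"] = df[a] / (df[b] + eps)  # eps avoids division by zero
    return out


# Toy data; the column names are made up for illustration.
toy = pd.DataFrame({"f1": [1.0, 4.0], "f2": [2.0, 8.0]})
features = add_constructed_features(toy, ["f1", "f2"])
print(features.columns.tolist())
```

The number of pairwise features grows quadratically with the number of input columns, which is one reason a selection step is needed afterwards.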

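Handling missing values

1. Inspect the data distribution; for features whose distribution is not long-tailed, fill missing values with the mean or median
2. Treat a feature with missing values as the label and consider the Label Propagation algorithm to fill it in a semi-supervised way
3. Models such as GBDT are not used for imputation because, for features with many missing values (40%-75%), there is no guarantee that the filled values follow the same distribution as the data
4. Drop features with more than 75% missing values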
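
A minimal sketch of items 1 and 4, assuming pandas. The helper name `handle_missing` and the toy data are hypothetical, and as a simplification the median fill is applied to every remaining column without checking whether its distribution is long-tailed. Label Propagation (item 2) is not covered here.

```python
import numpy as np
import pandas as pd


def handle_missing(df, drop_threshold=0.75):
    """Drop columns that are mostly missing, then median-fill the remaining gaps."""
    missing_ratio = df.isna().mean()  # fraction of missing values per column
    kept = df.loc[:, missing_ratio <= drop_threshold]
    return kept.fillna(kept.median())


toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 5.0],           # 25% missing: filled with the median
    "b": [np.nan, np.nan, np.nan, np.nan],  # 100% missing: dropped
})
cleaned = handle_missing(toy)
print(cleaned)
```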

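Model selection

It is easy to see that the dataset for this task is small, so stacking complex models may lead to overfitting. We therefore select the best features greedily. The basic framework is: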

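cv_mean = Choose_Best_Feature(now_feature)  # mean CV score with the newest candidate appended
if cv_mean < the_last_best:
    now_feature.pop()  # the candidate lowered the score: drop it
else:
    the_last_best = cv_mean  # keep the candidate and update the best score
    print('Now CV:',cv_mean)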
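
The check above sits inside a loop over candidate features. A self-contained sketch of that loop is given here under stated assumptions: `greedy_select` and `toy_score` are hypothetical names, and `toy_score` is a stand-in for a real cross-validated scorer such as Choose_Best_Feature.

```python
def greedy_select(candidates, score_fn):
    """Greedy forward selection: keep a candidate only if it does not lower the score."""
    now_feature = []
    the_last_best = float("-inf")
    for feature in candidates:
        now_feature.append(feature)
        cv_mean = score_fn(now_feature)
        if cv_mean < the_last_best:
            now_feature.pop()        # candidate hurt the score: discard it
        else:
            the_last_best = cv_mean  # candidate kept: update the best score
    return now_feature, the_last_best


# Stand-in scorer for illustration only: rewards "useful" features and
# penalises the rest. A real scorer would return a cross-validated metric.
USEFUL = {"f1", "f3"}


def toy_score(features):
    return sum(1.0 if f in USEFUL else -0.5 for f in features)


selected, best = greedy_select(["f1", "f2", "f3", "f4"], toy_score)
print(selected, best)
```

Each candidate costs one call to the scorer, so the total cost is one full cross-validation per candidate feature.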

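In the Choose_Best_Feature module, one new feature is added at a time, the overall CV score is computed, and the best value is updated along the way. This approach has two clear weaknesses. First, the selection is somewhat blind: a greedy search gets stuck in a local optimum, so the resulting feature set may be only approximately optimal. One remedy is to add a simulated-annealing mechanism, in which a random number that satisfies some condition is allowed to change the best value. Second, with a lot of data and many features, the time cost becomes unaffordable. The author therefore offers a small trick for reference, which comes in two variants: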

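# Alias assumed from usage; the original snippet does not show its imports
from pandas import DataFrame as DF

def get_pic(model,feature_name):
    # Rank the features by the fitted model's importance scores, highest first
    ans = DF()
    ans['name'] = feature_name
    ans['score'] = model.feature_importances_
    print(ans[ans['score']>0].shape)  # number of features with non-zero importance
    return ans.sort_values(by=['score'],ascending=False).reset_index(drop=True)

# Train the three models (lgb_model, xgb_model, gbc_model) first.
# Method 1: take the top K features by feature_importances_ from each model, then take their union.
# Method 2 (commented out below): take their intersection instead.
nums = 45
feature_name1 = train_data[feature_name].columns
get_ans_face = list(set(get_pic(lgb_model,feature_name1).head(nums)['name'])|set(get_pic(xgb_model,feature_name1).head(nums)['name'])|set(get_pic(gbc_model,feature_name1).head(nums)['name']))
# get_ans_face = list(set(get_pic(lgb_model,feature_name1).head(nums)['name'])&set(get_pic(xgb_model,feature_name1).head(nums)['name'])&set(get_pic(gbc_model,feature_name1).head(nums)['name']))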
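
To make the difference between the two variants concrete, here is a small standalone sketch. The importance scores are made up and stand in for three fitted models, and `top_k` is a hypothetical helper.

```python
def top_k(importances, k):
    """Names of the k features with the highest importance score."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return set(ranked[:k])


# Made-up importance scores standing in for three fitted models.
lgb_imp = {"f1": 0.5, "f2": 0.3, "f3": 0.1, "f4": 0.1}
xgb_imp = {"f1": 0.4, "f3": 0.4, "f2": 0.1, "f4": 0.1}
gbc_imp = {"f4": 0.6, "f1": 0.3, "f2": 0.05, "f3": 0.05}

k = 2
tops = [top_k(imp, k) for imp in (lgb_imp, xgb_imp, gbc_imp)]
union = set.union(*tops)                # method 1: any model ranks the feature in its top k
intersection = set.intersection(*tops)  # method 2: every model ranks the feature in its top k
print(sorted(union), sorted(intersection))
```

For the same k, the union is never smaller than the intersection, so the two variants call for different values of k to end up with a comparable number of features.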

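Empirically, the first method needs a smaller nums and the second a larger one. The stronger features selected this way are then fed into the greedy selection described above, which picks out a better feature set. Inside Choose_Best_Feature, the author measures the effect of adding New_Feature by the mean of the CV scores of three models (Xgboost, Lightgbm, GBDT), so that online and offline scores rise and fall together.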

# Aliases assumed from usage; the original snippet does not show its imports
import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.model_selection import cross_val_score as cv

def get_model(nums,cv_fold):
    # Keep only the features ranked in the top `nums` by all three models (intersection)
    feature_name1 = train_data[feature_name].columns
    get_ans_face = list(set(get_pic(gbc_model,feature_name1).head(nums)['name'])&set(get_pic(xgb_model,feature_name1).head(nums)['name'])&set(get_pic(lgb_model,feature_name1).head(nums)['name']))
    print('New Feature: ',len(get_ans_face))

    # LightGBM: mean F1 over cv_fold folds, then refit on the full training set
    new_lgb_model = lgb.LGBMClassifier(objective='binary',n_estimators=300,max_depth=3,min_child_samples=6,learning_rate=0.102,random_state=1)
    cv_model = cv(new_lgb_model, train_data[get_ans_face], train_label,  cv=cv_fold, scoring='f1')
    new_lgb_model.fit(train_data[get_ans_face], train_label)
    m1 = cv_model.mean()

    # XGBoost: same procedure on the raw NumPy values
    new_xgb_model1 = xgb.XGBClassifier(objective='binary:logistic',n_estimators=300,max_depth=4,learning_rate=0.101,random_state=1)
    cv_model = cv(new_xgb_model1, train_data[get_ans_face].values, train_label,  cv=cv_fold, scoring='f1')
    new_xgb_model1.fit(train_data[get_ans_face].values, train_label)
    m2 = cv_model.mean()

    # sklearn GBDT: missing values are filled with the constant 7 before fitting
    new_gbc_model = GBC(n_estimators=310,subsample=1,min_samples_split=2,max_depth=3,learning_rate=0.1900,min_weight_fraction_leaf=0.1)
    kkk = train_data[get_ans_face].fillna(7)
    cv_model = cv(new_gbc_model, kkk, train_label,  cv=cv_fold, scoring='f1')
    new_gbc_model.fit(kkk,train_label)
    m3 = cv_model.mean()

    print((m1+m2+m3)/3)  # mean CV F1 across the three models

    # Average the three models' class probabilities on the test set
    pro1 = new_lgb_model.predict_proba(test_data[get_ans_face])
    pro2 = new_xgb_model1.predict_proba(test_data[get_ans_face].values)
    pro3 = new_gbc_model.predict_proba(test_data[get_ans_face].fillna(7))
    ans = (pro1+pro2+pro3)/3
    return ans

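There is also a trick worth considering at the final submission stage. Feeding the selected feature set into the three tree models gives predictions Ans1, Ans2, Ans3 and probabilities P1, P2, P3. Voting over Ans1, Ans2 and Ans3 gives Ans4, and blending the probabilities P1, P2 and P3 gives Ans5. A linear model that performs well offline, trained on the same feature set, produces Ans6. Voting over Ans4, Ans5 and Ans6 then gives Ans7, which worked well in the author's experience.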
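
The blending scheme just described can be sketched with NumPy as below. All probabilities and the Ans6 labels are made-up stand-ins, the 0.5 decision threshold is an assumption, and `majority_vote` is a hypothetical helper; the text does not say how the probabilities were blended, so a plain average is used here.

```python
import numpy as np


def majority_vote(*label_arrays):
    """Element-wise majority vote over an odd number of 0/1 label arrays."""
    stacked = np.vstack(label_arrays)
    return (stacked.sum(axis=0) * 2 > len(label_arrays)).astype(int)


# Made-up positive-class probabilities from three tree models for four samples.
p1 = np.array([0.9, 0.55, 0.6, 0.2])
p2 = np.array([0.8, 0.55, 0.3, 0.1])
p3 = np.array([0.7, 0.10, 0.2, 0.6])
threshold = 0.5  # assumed decision threshold

ans1, ans2, ans3 = [(p > threshold).astype(int) for p in (p1, p2, p3)]
ans4 = majority_vote(ans1, ans2, ans3)               # vote over the hard labels
ans5 = ((p1 + p2 + p3) / 3 > threshold).astype(int)  # average the probabilities
ans6 = np.array([1, 0, 1, 0])                        # stand-in for a linear model's labels
ans7 = majority_vote(ans4, ans5, ans6)               # final vote
print(ans4, ans5, ans6, ans7)
```

In this toy data the second sample shows how the two blends can disagree: two models vote positive with low confidence, so the vote says 1 while the averaged probability says 0.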

If you find the author's tricks worth borrowing, please give the repo a Star!
