
juliandewit / Kaggle_ndsb2017

Licence: mit
Kaggle datascience bowl 2017


Kaggle National Data Science Bowl 2017 2nd place code

This is the source code for my part of the 2nd place solution to the National Data Science Bowl 2017 hosted by Kaggle.com. For documentation about the approach go to: http://juliandewit.github.io/kaggle-ndsb2017/
Note that this is my part of the code.
The work of my teammate Daniel Hammack can be found here: https://github.com/dhammack/DSB2017

Dependencies & data

The solution is built using Keras with a TensorFlow backend on 64-bit Windows. In addition I used scikit-learn, pydicom, SimpleITK, BeautifulSoup, OpenCV and XGBoost. All in all it was quite an engineering effort.

General

The source is cleaned up as much as possible. However, I was afraid that results would not be 100% reproducible if I changed too much. Therefore some pieces could be a bit cleaner. I also left in some bugs that I found while cleaning up (see the end of this document).

The solution relies on manual labels, generated labels and 2 resulting submissions from team member Daniel Hammack. These files are all in the "resources" directory. All other file locations can be configured in settings.py. The raw patient data must be downloaded from the Kaggle website and the LUNA16 website.

Trained models as provided to Kaggle after phase 1 are also provided through the following download: https://retinopaty.blob.core.windows.net/ndsb3/trained_models.rar

The solution is a combination of nodule detectors/malignancy regressors. My two parts are trained on LUNA16 data with a mix of positive and negative labels, plus malignancy information from the LIDC dataset. My second part also uses some manual annotations made on the NDSB3 trainset. Predictions are generated from the raw nodule/malignancy predictions combined with the location information and general "mass" information. Masses are not nodules but large suspicious tissues present in the CT images. The masses are detected with a U-net trained on manual labels.

The 3rd and 4th parts of the solution come from Daniel Hammack. The final solution is a blend of the 4 different parts. Blending is done by taking a simple average.

Preprocessing

First run step1_preprocess_ndsb.py. This will extract all the NDSB DICOM files, scale them to 1x1x1 mm, and make a directory containing .png slice images. Lung segmentation mask images are also generated. They will be used later in the process for faster predicting.

Then run step1_preprocess_luna16.py. This will extract all the LUNA source files, scale them to 1x1x1 mm, and make a directory containing .png slice images. Lung segmentation mask images are also generated. This step also generates various CSV files for positive and negative examples.
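To make the 1x1x1 mm rescaling concrete, here is a minimal sketch of the arithmetic involved. It is not the code from the preprocessing scripts. The function name `resample_plan` and the example spacing values are hypothetical; in practice the voxel spacing would be read from the DICOM or LUNA16 headers.

```python
def resample_plan(shape, spacing_mm, target_mm=1.0):
    """Return (zoom_factors, new_shape) for resampling a CT volume.

    shape      -- voxel counts per axis, e.g. (z, y, x)
    spacing_mm -- physical voxel size per axis in millimetres
    target_mm  -- desired isotropic voxel size (1 mm in this README)
    """
    if len(shape) != len(spacing_mm):
        raise ValueError("shape and spacing_mm must have the same length")
    # A voxel of 2.5 mm must be stretched by a factor 2.5 to reach 1 mm voxels.
    zoom = tuple(s / target_mm for s in spacing_mm)
    new_shape = tuple(max(1, round(n * z)) for n, z in zip(shape, zoom))
    return zoom, new_shape


# Hypothetical scan: 120 slices of 512x512 pixels, 2.5 mm apart, 0.7 mm pixels.
zoom, new_shape = resample_plan((120, 512, 512), (2.5, 0.7, 0.7))
print(zoom)       # (2.5, 0.7, 0.7)
print(new_shape)  # (300, 358, 358)
```

The zoom factors could then be handed to an interpolation routine (for example `scipy.ndimage.zoom`) to produce the rescaled volume; which routine the original scripts use is not stated here.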

The nodule detectors are trained on positive and negative 3D cubes, which must be generated from the LUNA16 and NDSB datasets. step1b_preprocess_make_train_cubes.py takes the different CSV files and cuts out 3D cubes from the patient slices. The cubes are saved in different directories. resources/step1_preprocess_mass_segmenter.py generates the mass U-net trainset. It can be run, but the generated resized images and labels are provided in this archive, so this step does not need to be run. However, this file can be used to regenerate the train data.
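As an illustration of cutting a cube out of a patient volume, the following sketch crops a fixed-size cube around a labelled coordinate and shifts it inward when it would cross a border. The helper names `cube_bounds` and `crop_cube`, the nested-list volume and the cube size are assumptions made for this example; the actual script works on the stored slice images and may handle borders differently.

```python
def cube_bounds(center, cube_size, volume_shape):
    """Return per-axis (start, stop) index pairs for a cube around `center`.

    The cube is shifted inward when it would cross a volume border, so it
    always has the full `cube_size` extent. Raises ValueError if the volume
    is smaller than the cube along any axis.
    """
    bounds = []
    for c, n in zip(center, volume_shape):
        if n < cube_size:
            raise ValueError("volume smaller than cube")
        start = int(round(c)) - cube_size // 2
        start = min(max(start, 0), n - cube_size)
        bounds.append((start, start + cube_size))
    return bounds


def crop_cube(volume, center, cube_size):
    """Crop a cube from a volume stored as nested lists indexed [z][y][x]."""
    shape = (len(volume), len(volume[0]), len(volume[0][0]))
    (z0, z1), (y0, y1), (x0, x1) = cube_bounds(center, cube_size, shape)
    return [[row[x0:x1] for row in plane[y0:y1]] for plane in volume[z0:z1]]


# Tiny synthetic 6x6x6 volume whose voxel value encodes its coordinates.
volume = [[[100 * z + 10 * y + x for x in range(6)] for y in range(6)]
          for z in range(6)]
cube = crop_cube(volume, center=(0, 3, 5), cube_size=4)
print(cube[0][0][0])  # 12, i.e. the voxel at z=0, y=1, x=2
```

With real CT data a NumPy array and slicing would be the natural choice; plain lists are used here only to keep the sketch dependency-free.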

Training neural nets

First train the 3D convnets that detect nodules and predict malignancy. This can be done by running the step2_train_nodule_detector.py file. It trains models on various combinations of positive and negative labels. The resulting models (NAMES) are stored in the ./workdir directory and the final results are copied to the models folder. The mass detector can be trained using step2_train_mass_segmenter.py. It trains 3 folds, and the final models are stored in the models (names) folder. Training the 3D convnets takes around 10 hours per model. The 3 mass detector folds take around 8 hours in total.

Predicting neural nets

Once trained or downloaded through the URL (https://retinopaty.blob.core.windows.net/ndsb3/trained_models.rar), the models are placed in the ./models/ directory. From there the nodule detector step3_predict_nodules.py can be run to detect nodules in a 3D grid per patient. The detected nodules and predicted malignancy are stored per patient in a separate directory. The mass detector has already been run as part of step2_train_mass_segmenter.py, which stores a CSV with the estimated masses per patient.
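Scanning a patient "in a 3D grid" can be pictured as sliding a cube over the volume at a fixed step and scoring each position. The sketch below only shows that idea. The cube size, step, threshold and the `fake_predict` stand-in for the trained network are all hypothetical and are not taken from step3_predict_nodules.py.

```python
from itertools import product


def grid_origins(volume_shape, cube_size, step):
    """Return an iterator over (z, y, x) corners of cubes tiling the volume.

    Cubes are placed every `step` voxels; only cubes that fit entirely
    inside the volume are produced.
    """
    axes = [range(0, n - cube_size + 1, step) for n in volume_shape]
    return product(*axes)


def detect(volume_shape, cube_size, step, predict, threshold=0.5):
    """Run `predict(origin)` on every grid cube and keep confident hits."""
    hits = []
    for origin in grid_origins(volume_shape, cube_size, step):
        p = predict(origin)
        if p >= threshold:
            hits.append((origin, p))
    return hits


# Stand-in for a trained network: pretends a nodule sits in one grid cube.
def fake_predict(origin):
    return 0.9 if origin == (12, 24, 24) else 0.1


hits = detect((48, 60, 60), cube_size=32, step=12, predict=fake_predict)
print(hits)  # [((12, 24, 24), 0.9)]
```

A step smaller than the cube size makes neighbouring cubes overlap, which reduces the chance that a nodule is only ever seen at the edge of a cube, at the cost of more predictions per patient.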

Training of submissions, combining submissions for final submission.

Based on the per-patient CSVs, the masses.csv and other metadata, we train an XGBoost model to generate submissions (step4_train_submissions.py). There are 3 levels of submissions. First come the per-model submissions (level 1). Different models are combined in level 2, and Daniel's submissions are added. These level 2 submissions are then combined (averaged) into one final submission. Below are the different models that will be generated/combined.

  • Level 1:
    Luna16_fs (trained on full luna16 set)
    Luna16_ndsbposneg v1 (trained on luna16 + manual pos/neg labels in ndsb)
    Luna16_ndsbposneg v2 (trained on luna16 + manual pos/neg labels in ndsb)
    Daniel model 1
    Daniel model 2
    The two posneg versions and the two Daniel models are each averaged into one level 2 model

  • Level 2:
    Luna16_fs
    Luna16_ndsbposneg
    Daniel

These 3 models will be averaged into 1 final_submission.csv
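The level structure above amounts to averaging in two stages. The sketch below shows this with made-up probabilities for two patients. The function `average_predictions` and the dictionary representation are assumptions for illustration (step4_train_submissions.py works with CSV files), and the grouping of the level 1 models follows one reading of the list above.

```python
from statistics import mean


def average_predictions(*submissions):
    """Average cancer probabilities per patient across submissions.

    Each submission is a dict mapping patient id -> probability.
    All submissions must cover the same patients.
    """
    ids = set(submissions[0])
    if any(set(s) != ids for s in submissions):
        raise ValueError("submissions cover different patients")
    return {pid: mean(s[pid] for s in submissions) for pid in sorted(ids)}


# Hypothetical level 1 predictions for two patients.
luna16_fs = {"p1": 0.25, "p2": 0.75}
posneg_v1 = {"p1": 0.5, "p2": 0.5}
posneg_v2 = {"p1": 0.0, "p2": 1.0}
daniel_1 = {"p1": 0.25, "p2": 0.25}
daniel_2 = {"p1": 0.75, "p2": 0.75}

# Level 2: merge the two posneg versions and the two Daniel models.
posneg = average_predictions(posneg_v1, posneg_v2)
daniel = average_predictions(daniel_1, daniel_2)

# Final: simple average of the three level 2 models.
final = average_predictions(luna16_fs, posneg, daniel)
print(final)
```

Averaging in two stages gives each level 2 model equal weight in the final result, whereas a flat average over all five level 1 models would give Luna16_fs only one fifth of the weight.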

Bugs and suggestions.

First of all, during cleanup I noticed that I missed 10% of the LUNA16 patients because I overlooked subset0. That might be a 100,000 dollar mistake. For reproducibility reasons I kept the bug in. In settings.py you can adjust the code to also take this subset into account.

Suggestions for improvement would be:

  • Take the 10% extra LUNA16 candidates.
  • Use different blends of the positive and negative labels
  • Other neural network architectures.
  • Etc.