
basveeling / Pcam

License: MIT
The PatchCamelyon (PCam) deep learning classification benchmark.

Programming Languages

Python

Projects that are alternatives of or similar to Pcam

Fashion Mnist
A MNIST-like fashion product database. Benchmark πŸ‘‡
Stars: ✭ 9,675 (+2745.59%)
Mutual labels:  dataset, benchmark
Hpatches Benchmark
Python & Matlab code for local feature descriptor evaluation with the HPatches dataset.
Stars: ✭ 129 (-62.06%)
Mutual labels:  dataset, benchmark
Core50
CORe50: a new Dataset and Benchmark for Continual Learning
Stars: ✭ 91 (-73.24%)
Mutual labels:  dataset, benchmark
Caffenet Benchmark
Evaluation of the CNN design choices performance on ImageNet-2012.
Stars: ✭ 700 (+105.88%)
Mutual labels:  dataset, benchmark
Weatherbench
A benchmark dataset for data-driven weather forecasting
Stars: ✭ 227 (-33.24%)
Mutual labels:  dataset, benchmark
Datasets
A repository of pretty cool datasets that I collected for network science and machine learning research.
Stars: ✭ 302 (-11.18%)
Mutual labels:  dataset, benchmark
Tape
Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology.
Stars: ✭ 295 (-13.24%)
Mutual labels:  dataset, benchmark
Okutama Action
Okutama-Action: An Aerial View Video Dataset for Concurrent Human Action Detection
Stars: ✭ 36 (-89.41%)
Mutual labels:  dataset, benchmark
Hand pose action
Dataset and code for the paper "First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations", CVPR 2018.
Stars: ✭ 173 (-49.12%)
Mutual labels:  dataset, benchmark
Clue
δΈ­ζ–‡θ―­θ¨€η†θ§£ζ΅‹θ―„εŸΊε‡† Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+613.24%)
Mutual labels:  dataset, benchmark
Medmnist
[ISBI'21] MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis
Stars: ✭ 338 (-0.59%)
Mutual labels:  dataset, benchmark
MaskedFaceRepresentation
Masked face recognition focuses on identifying people using their facial features while they are wearing masks. We introduce benchmarks on face verification based on masked face images for the development of COVID-safe protocols in airports.
Stars: ✭ 17 (-95%)
Mutual labels:  benchmark, dataset
Pglib Opf
Benchmarks for the Optimal Power Flow Problem
Stars: ✭ 114 (-66.47%)
Mutual labels:  dataset, benchmark
Sensaturban
πŸ”₯Urban-scale point cloud dataset (CVPR 2021)
Stars: ✭ 135 (-60.29%)
Mutual labels:  dataset, benchmark
BIRL
BIRL: Benchmark on Image Registration methods with Landmark validations
Stars: ✭ 66 (-80.59%)
Mutual labels:  benchmark, dataset
Deeperforensics 1.0
[CVPR 2020] A Large-Scale Dataset for Real-World Face Forgery Detection
Stars: ✭ 338 (-0.59%)
Mutual labels:  dataset, benchmark
Covid19 twitter
Covid-19 Twitter dataset for non-commercial research use and pre-processing scripts - under active development
Stars: ✭ 304 (-10.59%)
Mutual labels:  dataset
Oltpbench
Database Benchmarking Framework
Stars: ✭ 317 (-6.76%)
Mutual labels:  benchmark
Css10
CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages
Stars: ✭ 302 (-11.18%)
Mutual labels:  dataset
Cob
Continuous Benchmark for Go Project
Stars: ✭ 326 (-4.12%)
Mutual labels:  benchmark

PatchCamelyon (PCam)

That which is measured, improves. - Karl Pearson

The PatchCamelyon benchmark is a new and challenging image classification dataset. It consists of 327,680 color images (96 × 96 px) extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating the presence of metastatic tissue. PCam provides a new benchmark for machine learning models: bigger than CIFAR-10, smaller than ImageNet, trainable on a single GPU.

Example images from PCam. Green boxes indicate tumor tissue in the center region, which dictates a positive label.


Why PCam

Fundamental machine learning advancements are predominantly evaluated on straightforward natural-image classification datasets: think MNIST, CIFAR, SVHN. Medical imaging is becoming one of the major applications of ML, and we believe it deserves a spot on the list of go-to ML datasets, both to challenge future work and to steer developments in directions that benefit this domain.

We think PCam can play a role in this. It packs the clinically relevant task of metastasis detection into a straightforward binary image classification task, akin to CIFAR-10 and MNIST. Models can easily be trained on a single GPU in a couple of hours, and achieve competitive scores on the Camelyon16 tasks of tumor detection and WSI diagnosis. Furthermore, the balance between task difficulty and tractability makes it a prime candidate for fundamental machine learning research on topics such as active learning, model uncertainty, and explainability.

Download

The data is stored in gzipped HDF5 files and can be downloaded using the links below. Each split consists of a data file and a target file. An additional meta CSV file describes which Camelyon16 slide each patch was extracted from, but this information is not used in training for or evaluating the benchmark. Please report any download problems via a GitHub issue.

Download all at once from Google Drive.

| Name | Content | Size | Link | MD5 Checksum |
|------|---------|------|------|--------------|
| camelyonpatch_level_2_split_train_x.h5.gz | training images | 6.1 GB | Download | 1571f514728f59376b705fc836ff4b63 |
| camelyonpatch_level_2_split_train_y.h5.gz | training labels | 21 KB | Download | 35c2d7259d906cfc8143347bb8e05be7 |
| camelyonpatch_level_2_split_valid_x.h5.gz | valid images | 0.8 GB | Download | d8c2d60d490dbd479f8199bdfa0cf6ec |
| camelyonpatch_level_2_split_valid_y.h5.gz | valid labels | 3.0 KB | Download | 60a7035772fbdb7f34eb86d4420cf66a |
| camelyonpatch_level_2_split_test_x.h5.gz | test images | 0.8 GB | Download | d5b63470df7cfa627aeec8b9dc0c066e |
| camelyonpatch_level_2_split_test_y.h5.gz | test labels | 3.0 KB | Download | 2b85f58b927af9964a4c15b8f7e8f179 |
| camelyonpatch_level_2_split_train_meta.csv | training meta | | Download | 5a3dd671e465cfd74b5b822125e65b0a |
| camelyonpatch_level_2_split_valid_meta.csv | valid meta | | Download | 3455fd69135b66734e1008f3af684566 |
| camelyonpatch_level_2_split_test_meta.csv | test meta | | Download | 67589e00a4a37ec317f2d1932c7502ca |
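After downloading, the MD5 checksums in the table above can be verified before decompressing. A minimal standard-library sketch (it assumes the files sit in the current directory; the helper name is ours):

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Expected digests, copied from the table above.
EXPECTED = {
    "camelyonpatch_level_2_split_train_x.h5.gz": "1571f514728f59376b705fc836ff4b63",
    "camelyonpatch_level_2_split_valid_x.h5.gz": "d8c2d60d490dbd479f8199bdfa0cf6ec",
}

for name, md5 in EXPECTED.items():
    try:
        print(name, "OK" if md5_of_file(name) == md5 else "CORRUPT")
    except FileNotFoundError:
        print(name, "not downloaded")
```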

Mirror on Zenodo:

https://zenodo.org/record/2546921

Baidu AI Studio:

https://aistudio.baidu.com/aistudio/datasetdetail/30060

Usage and Tips

Keras Example

A general data loader for Keras:

from keras.utils import HDF5Matrix
from keras.preprocessing.image import ImageDataGenerator

x_train = HDF5Matrix('camelyonpatch_level_2_split_train_x.h5', 'x')
y_train = HDF5Matrix('camelyonpatch_level_2_split_train_y.h5', 'y')

datagen = ImageDataGenerator(
              preprocessing_function=lambda x: x / 255.,
              width_shift_range=4,   # randomly shift images horizontally
              height_shift_range=4,  # randomly shift images vertically
              horizontal_flip=True,  # randomly flip images horizontally
              vertical_flip=True)    # randomly flip images vertically

# `model` is a compiled Keras model; `batch_size` is defined elsewhere.
model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
                    steps_per_epoch=len(x_train) // batch_size,
                    epochs=1024)
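Note that `HDF5Matrix` and `fit_generator` belong to older Keras releases and have since been removed from current ones. The batching arithmetic is framework-agnostic, though; a minimal stdlib-only sketch of it (the helper names are ours, not from Keras):

```python
def slice_batches(data, batch_size):
    """Yield consecutive batches from anything sliceable (lists, h5py datasets)."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

def steps_per_epoch(n_examples, batch_size):
    """Number of *full* batches per epoch, as in `len(x_train) // batch_size`."""
    return n_examples // batch_size

batches = list(slice_batches(list(range(10)), batch_size=4))
# The trailing partial batch [8, 9] is yielded by the generator but not
# counted by steps_per_epoch(10, 4) == 2, matching the Keras snippet above.
```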

Details

Numbers

The dataset is divided into a training set of 262,144 (2^18) examples, and a validation set and a test set of 32,768 (2^15) examples each. There is no overlap in WSIs between the splits, and all splits have a 50/50 balance between positive and negative examples.
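As a quick sanity check, the split sizes above are exact powers of two and sum to the dataset total quoted earlier:

```python
train, valid, test = 2**18, 2**15, 2**15  # split sizes from the text above
assert (train, valid, test) == (262_144, 32_768, 32_768)
total = train + valid + test
print(total)  # 327680: the full PCam dataset
```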

Labeling

A positive label indicates that the center 32 × 32 px region of a patch contains at least one pixel of tumor tissue. Tumor tissue in the outer region of the patch does not influence the label. This outer region is provided to enable the design of fully convolutional models that do not use any zero-padding, ensuring consistent behavior when applied to a whole-slide image. This is, however, not a requirement for the PCam benchmark.
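The labeling rule can be expressed directly: only the center 32 × 32 px region decides the label. A stdlib-only sketch, assuming a tumor mask given as a 96 × 96 nested list of booleans (the function name is illustrative, not from the PCam code):

```python
PATCH = 96
CENTER = 32
OFFSET = (PATCH - CENTER) // 2  # 32: the center region spans rows/cols 32..63

def pcam_label(tumor_mask):
    """Return 1 if any tumor pixel lies in the center 32x32 region, else 0.

    tumor_mask: 96x96 nested list of bools (True = tumor pixel).
    Tumor pixels outside the center region are ignored, per the PCam rule.
    """
    return int(any(
        tumor_mask[r][c]
        for r in range(OFFSET, OFFSET + CENTER)
        for c in range(OFFSET, OFFSET + CENTER)
    ))

empty = [[False] * PATCH for _ in range(PATCH)]
edge_only = [row[:] for row in empty]
edge_only[0][0] = True       # tumor in the outer region only -> negative
center_hit = [row[:] for row in empty]
center_hit[48][48] = True    # tumor inside the center region -> positive
```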

Patch selection

PCam is derived from the Camelyon16 Challenge [2], which contains 400 H&E-stained WSIs of sentinel lymph node sections. The slides were acquired and digitized at two different centers using a 40× objective (resulting in a pixel resolution of 0.243 microns). We undersample this at 10× to increase the field of view. We follow the train/test split from the Camelyon16 challenge [2], and further hold out 20% of the training WSIs for the validation set. To prevent selecting background patches, slides are converted to HSV, blurred, and patches are filtered out if their maximum pixel saturation lies below 0.07 (validated to not discard tumor data in the training set). The patch-based dataset is sampled by iteratively choosing a WSI and selecting a positive or negative patch with probability p. Patches are rejected following a stochastic hard-negative mining scheme with a small CNN, and p is adjusted to retain a balance close to 50/50.
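The background filter described above (convert to HSV, keep a patch only if its maximum saturation reaches 0.07) can be sketched with the standard library's colorsys; the blurring step is omitted and the function name is illustrative:

```python
import colorsys

SAT_THRESHOLD = 0.07  # saturation threshold from the patch-selection procedure

def is_tissue_patch(pixels):
    """Return True if any pixel's HSV saturation reaches the threshold.

    pixels: iterable of (r, g, b) tuples with channels in [0, 1].
    Near-white/grey slide background has saturation ~0 and is filtered out.
    """
    max_sat = max(colorsys.rgb_to_hsv(r, g, b)[1] for r, g, b in pixels)
    return max_sat >= SAT_THRESHOLD

background = [(0.95, 0.95, 0.95)] * 4     # grey pixels: saturation 0
stained = background + [(0.8, 0.4, 0.6)]  # one H&E-pink pixel: saturation 0.5
```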

Statistics

Coming soon

Contact

For problems and questions not fit for a GitHub issue, please email Bas Veeling.

Citing PCam

If you use PCam in a scientific publication, we would appreciate references to the following paper:

[1] B. S. Veeling, J. Linmans, J. Winkens, T. Cohen, M. Welling. "Rotation Equivariant CNNs for Digital Pathology". arXiv:1806.03962

A citation of the original Camelyon16 dataset paper is appreciated as well:

[2] Ehteshami Bejnordi et al. Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer. JAMA: The Journal of the American Medical Association, 318(22), 2199–2210. doi:10.1001/jama.2017.14585

BibLaTeX entry:

@ARTICLE{Veeling2018-qh,
  title         = "Rotation Equivariant {CNNs} for Digital Pathology",
  author        = "Veeling, Bastiaan S and Linmans, Jasper and Winkens, Jim and
                   Cohen, Taco and Welling, Max",
  month         =  jun,
  year          =  2018,
  archivePrefix = "arXiv",
  primaryClass  = "cs.CV",
  eprint        = "1806.03962"
}

Benchmark

| Name | Reference | Augmentations | Acc | AUC | NLL | FROC* |
|------|-----------|---------------|-----|-----|-----|-------|
| GDensenet | [1] | Following Liu et al. | 89.8 | 96.3 | 0.260 | 75.8 (64.3, 87.2) |
| Add yours | | | | | | |

* Performance on Camelyon16 tumor detection task, not part of the PCam benchmark.

Contributing

Contributions with example scripts for other frameworks are welcome!

License

The data is provided under the CC0 License, following the license of Camelyon16.

The rest of this repository is under the MIT License.

Acknowledgements

  • Babak Ehteshami Bejnordi, Geert Litjens, Jeroen van der Laak for their input on the configuration of this dataset.
  • README derived from Fashion-MNIST.