modestyachts / Cifar 10.1

Licence: mit

Release of CIFAR-10.1, a new test set for CIFAR-10.

CIFAR-10.1

This repository contains the CIFAR-10.1 dataset, which is a new test set for CIFAR-10.

CIFAR-10.1 contains roughly 2,000 new test images that were sampled after multiple years of research on the original CIFAR-10 dataset. The data collection for CIFAR-10.1 was designed to minimize distribution shift relative to the original dataset. We describe the creation of CIFAR-10.1 in the paper "Do CIFAR-10 Classifiers Generalize to CIFAR-10?". The images in CIFAR-10.1 are a subset of the TinyImages dataset.

Using the Dataset

Dataset Releases

There are currently two versions of the CIFAR-10.1 dataset:

v4 is the first version of our dataset on which we tested any classifier, so the v4 dataset is independent of the classifiers we evaluate. The numbers reported in the main sections of our paper use this version of the dataset. It was built from the top 25 TinyImages keywords for each class, which led to a slight class imbalance. The largest difference is that ships make up only 8% of the test set instead of 10%. v4 contains 2,021 images.
v6 is derived from a slightly improved keyword allocation that is exactly class balanced. This version of the dataset corresponds to the results in Appendix D of our paper. v6 contains 2,000 images.

The overlap between v4 and v6 is more than 90% of the respective datasets. Moreover, the classification accuracies are very close (see Appendix D of our paper). For future experiments, we recommend the v6 version of our dataset.

Missing version numbers correspond to internal releases during our quality control process (e.g., near-duplicate removal) or potential variants of our dataset we did not pursue further.
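The class balance discussed above can be checked directly from a label array. The following is a minimal sketch, assuming the labels are stored as integer class indices from 0 to 9 (the README does not state the encoding); `class_fractions` is a hypothetical helper and not part of this repository.

```python
import numpy as np


def class_fractions(labels, num_classes=10):
    """Return the fraction of examples that belong to each class index.

    Assumes `labels` holds non-negative integer class indices.
    """
    labels = np.asarray(labels, dtype=np.int64)
    counts = np.bincount(labels, minlength=num_classes)
    return counts / counts.sum()


# Synthetic labels for illustration only; these are not the CIFAR-10.1 labels.
demo_labels = np.repeat(np.arange(10), 5)
print(class_fractions(demo_labels))
```

On an exactly class-balanced set such as v6, every entry should equal 0.1; on v4, the entry for the ship class should be close to 0.08.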

Loading the Dataset

The datasets directory contains the dataset files in the NumPy binary format:

The v4 files are cifar10.1_v4_data.npy and cifar10.1_v4_labels.npy.
The v6 files are cifar10.1_v6_data.npy and cifar10.1_v6_labels.npy.

The notebooks directory contains a short notebook, inspect_dataset_simple.ipynb, for browsing the CIFAR-10.1 dataset. The notebook loads the dataset with a utility function from utils.py in the code directory.
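For readers who want to read the files without the notebook, here is a minimal sketch that loads one version with NumPy. It relies only on the file names listed above; `load_cifar10_1_files` is a hypothetical helper written for this example, not the utility function in utils.py, and it makes no assumption about array shapes beyond the images and labels having the same length.

```python
import os

import numpy as np


def load_cifar10_1_files(datasets_dir, version="v6"):
    """Load the image and label arrays for one CIFAR-10.1 version.

    `datasets_dir` is the path to the repository's datasets directory.
    """
    if version not in ("v4", "v6"):
        raise ValueError(f"unknown version {version!r}; expected 'v4' or 'v6'")
    data_path = os.path.join(datasets_dir, f"cifar10.1_{version}_data.npy")
    labels_path = os.path.join(datasets_dir, f"cifar10.1_{version}_labels.npy")
    images = np.load(data_path)
    labels = np.load(labels_path)
    # Sanity check: one label per image.
    if len(images) != len(labels):
        raise ValueError("image and label arrays have different lengths")
    return images, labels
```

A call such as `images, labels = load_cifar10_1_files("datasets", version="v6")` should then return 2,000 images and labels, matching the release description above.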

Dataset Creation Pipeline

WARNING: This is currently a work in progress; some parts may be incomplete.

This repository contains code to replicate the creation process of CIFAR-10.1. The dataset creation process has several stages outlined below. We describe the process here at a high level. If you have questions about any individual steps, please do not hesitate to contact Rebecca Roelofs ([email protected]) and Ludwig Schmidt ([email protected]).

1. Extracting Data from TinyImages

Since the TinyImages dataset is quite large (around 280 GB), we first extract the relevant data for further processing. In particular, we require the following information:

The TinyImages keyword for each image in CIFAR-10.
All images in TinyImages belonging to these keywords.

We have automated these two steps via two scripts in the code directory:

find_all_cifar10_keywords.sh
build_tinyimage_subset.sh

We recommend running these scripts on a machine with at least 1 TB of RAM, e.g., an x1.16xlarge instance on AWS. After downloading the TinyImages dataset, running the scripts will take about 30 hours.

The scripts will produce the following data files, all of which are stored in the other_data folder:

cifar10_keywords.json
cifar10_keywords_unique.json
tinyimage_subset_indices.json
tinyimage_subset_data.pickle

2. Collecting Candidate Images

After downloading the relevant subset of TinyImages (keywords and image data) to a local machine, we can now assemble a set of candidate images for the new dataset. We proceed in two steps:

2.1 Keyword counts for the new dataset

The notebook generate_keyword_counts.ipynb decides which keywords we want to include in the new dataset and determines the number of images we require for each of these keywords.

2.2 Labeling new images

Once we know the number of new images we require for each keyword, we can collect corresponding images from TinyImages. We used two notebooks for this process:

The first labeler (or set of labelers) uses labeling_ui.ipynb to collect a set of candidate images.
The second labeler (or set of labelers) verifies this selection via the labeling_ui_subselect.ipynb notebook.

3. Assembling a New Dataset

Given a pool of new candidate images, we can now sample a new dataset from this pool. We have the following notebooks for this step:

sample_subselected_indices_v4.ipynb samples the pool of labeled images and creates the new dataset for v4
sample_subselected_indices.ipynb samples the pool of labeled images and creates the new dataset for v6 or v7

After sampling a new dataset, it is necessary to run some final checks via the check_dataset_ui.ipynb notebook. In particular, this notebook checks for near-duplicates both within the new test set and in CIFAR-10 (a new test set would not be interesting if it contains many near-duplicates of the original test set). In our experience, the process involves a few round-trips of sampling a new test set, checking for near-duplicates, and adding the near-duplicates to the blacklist. Sometimes it is necessary to collect a few additional images for keywords with many near-duplicates (using the notebooks from Step 2 above).
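The sample-check-blacklist round trip can be illustrated with a small sketch. This is not the code from the sampling notebooks: the data layout (a dict from keyword to candidate image indices, a dict from keyword to required count, and a set of blacklisted indices) and the name `sample_by_keyword` are assumptions made for this example.

```python
import random


def sample_by_keyword(pool, counts, blacklist=frozenset(), seed=0):
    """Sample the required number of images per keyword, skipping blacklisted ones.

    pool:      dict mapping keyword -> list of candidate image indices
    counts:    dict mapping keyword -> number of images to sample
    blacklist: indices flagged as near-duplicates in earlier rounds
    """
    rng = random.Random(seed)
    selected = {}
    for keyword, n in counts.items():
        candidates = [i for i in pool.get(keyword, []) if i not in blacklist]
        if len(candidates) < n:
            # Signals that more images must be labeled for this keyword.
            raise ValueError(
                f"keyword {keyword!r}: need {n} images, "
                f"only {len(candidates)} candidates left"
            )
        selected[keyword] = rng.sample(candidates, n)
    return selected
```

Each round would add newly found near-duplicates to the blacklist and resample; the `ValueError` corresponds to the case where additional images have to be collected for a keyword.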

In order to avoid re-computing L2 distances to CIFAR-10, the notebook compute_distances_to_cifar10.ipynb computes all top-10 nearest neighbors between our TinyImages subset and CIFAR-10. Running this notebook takes only a few minutes when executed on 100 m5.4xlarge instances via PyWren.
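For intuition, a top-k nearest-neighbor search under L2 distance can be sketched on a single machine with NumPy. This brute-force version is a stand-in for illustration only; it is not the PyWren-based notebook, and the function name and array layout (one image per row, flattened) are assumptions.

```python
import numpy as np


def top_k_l2_neighbors(queries, references, k=10):
    """For each query image, return indices and L2 distances of its k nearest references."""
    q = np.asarray(queries, dtype=np.float64).reshape(len(queries), -1)
    r = np.asarray(references, dtype=np.float64).reshape(len(references), -1)
    # Squared distances via ||q - r||^2 = ||q||^2 + ||r||^2 - 2 q.r
    sq = (q ** 2).sum(axis=1)[:, None] + (r ** 2).sum(axis=1)[None, :] - 2.0 * q @ r.T
    # Clip tiny negative values caused by floating-point rounding.
    np.maximum(sq, 0.0, out=sq)
    k = min(k, r.shape[0])
    idx = np.argsort(sq, axis=1)[:, :k]
    dist = np.sqrt(np.take_along_axis(sq, idx, axis=1))
    return idx, dist
```

The full distance matrix is held in memory here, so this only works for small inputs; at the scale of the TinyImages subset the work has to be split into batches or distributed, as the notebook does.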

4. Inspecting Model Predictions (Extra Step)

After assembling a final dataset, we ran a broad range of classifiers on the new test set via our CIFAR-10 model test bed. The notebook inspect_model_predictions.ipynb explores the resulting predictions and displays a Pandas dataframe including the original and new accuracy for each model.

Intermediate Data Files

In order to run only individual steps of the process outlined above, we provide all intermediate data files. They are stored in the S3 bucket cifar-10-1 and can be downloaded with the script other_data/download.py. The script requires Boto 3, which can be installed via pip: pip install boto3.
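The script other_data/download.py is the supported way to fetch these files. For illustration only, the sketch below shows how a whole bucket could be mirrored with Boto 3. The function names are hypothetical, the object keys are discovered by listing rather than assumed, and whether the cifar-10-1 bucket permits anonymous listing and download is an assumption that has not been verified here.

```python
import os


def make_anonymous_s3_client():
    """Create an S3 client that sends unsigned (anonymous) requests."""
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    return boto3.client("s3", config=Config(signature_version=UNSIGNED))


def download_bucket(client, bucket, dest_dir):
    """Download every object in `bucket` into `dest_dir`, keeping key paths."""
    os.makedirs(dest_dir, exist_ok=True)
    downloaded = []
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                # Skip "directory" placeholder keys.
                continue
            target = os.path.join(dest_dir, key)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            client.download_file(bucket, key, target)
            downloaded.append(target)
    return downloaded
```

Under these assumptions, `download_bucket(make_anonymous_s3_client(), "cifar-10-1", "other_data")` would mirror the bucket locally. Passing the client in as an argument keeps the function usable with credentials or with a stub client.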

License

Unless noted otherwise in individual files, the code in this repository is released under the MIT license (see the LICENSE file). The LICENSE file does not apply to the actual image and label data in the datasets folder. The image data is part of the Tiny Images dataset and can be used the same way as the Tiny Images dataset.

Citing CIFAR-10.1

To cite the CIFAR-10.1 dataset, please use the following references:

@article{recht2018cifar10.1,
  author = {Benjamin Recht and Rebecca Roelofs and Ludwig Schmidt and Vaishaal Shankar},
  title = {Do CIFAR-10 Classifiers Generalize to CIFAR-10?},
  year = {2018},
  note = {\url{https://arxiv.org/abs/1806.00451}},
}

@article{torralba2008tinyimages, 
  author = {Antonio Torralba and Rob Fergus and William T. Freeman}, 
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
  title = {80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recognition}, 
  year = {2008}, 
  volume = {30}, 
  number = {11}, 
  pages = {1958-1970}
}
