
igorbrigadir / Downloadconceptualcaptions

Licence: MIT
Reliably download millions of images efficiently

Projects that are alternatives of or similar to Downloadconceptualcaptions

Dalr
Implementation of "Domain-adaptive deep network compression", ICCV 2017
Stars: ✭ 28 (+0%)
Mutual labels:  jupyter-notebook
Tetrahedra
Stars: ✭ 29 (+3.57%)
Mutual labels:  jupyter-notebook
Symbolic Metamodeling
Codebase for "Demystifying Black-box Models with Symbolic Metamodels", NeurIPS 2019.
Stars: ✭ 29 (+3.57%)
Mutual labels:  jupyter-notebook
Icyface offline
offline part of icyface
Stars: ✭ 28 (+0%)
Mutual labels:  jupyter-notebook
Gpufilter
GPU Recursive Filtering
Stars: ✭ 28 (+0%)
Mutual labels:  jupyter-notebook
Financial Machine Learning Articles
Contains the code for my financial machine learning articles
Stars: ✭ 29 (+3.57%)
Mutual labels:  jupyter-notebook
Deep learning projects
Stars: ✭ 28 (+0%)
Mutual labels:  jupyter-notebook
Kaggle Santander Customer Transaction Prediction 5th Place Partial Solution
Kaggle Competition notebooks
Stars: ✭ 29 (+3.57%)
Mutual labels:  jupyter-notebook
Adv fin ml exercises
Experimental solutions to selected exercises from the book [Advances in Financial Machine Learning by Marcos Lopez De Prado]
Stars: ✭ 944 (+3271.43%)
Mutual labels:  jupyter-notebook
Kaggle
A repository for code from competitions held on Kaggle.
Stars: ✭ 29 (+3.57%)
Mutual labels:  jupyter-notebook
Keras Faster Rcnn
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Stars: ✭ 28 (+0%)
Mutual labels:  jupyter-notebook
Python plotting snippets
Tips and tricks for plotting in python
Stars: ✭ 28 (+0%)
Mutual labels:  jupyter-notebook
Mlnet Workshop
ML.NET Workshop to predict car sales prices
Stars: ✭ 29 (+3.57%)
Mutual labels:  jupyter-notebook
Neural networks
This is the code for "Neural Networks - The Math of Intelligence #4" by Siraj Raval on Youtube
Stars: ✭ 28 (+0%)
Mutual labels:  jupyter-notebook
Artee.ai
AI Generated Tees
Stars: ✭ 29 (+3.57%)
Mutual labels:  jupyter-notebook
Imageretrieval
Stars: ✭ 28 (+0%)
Mutual labels:  jupyter-notebook
Ismir2020 u nets svs
A PyTorch Implementation of the paper - Choi, Woosung, et al. "Investigating u-nets with various intermediate blocks for spectrogram-based singing voice separation." 21th International Society for Music Information Retrieval Conference, ISMIR. 2020.
Stars: ✭ 29 (+3.57%)
Mutual labels:  jupyter-notebook
Chatbot
Chatbot based on Rasa Framework
Stars: ✭ 29 (+3.57%)
Mutual labels:  jupyter-notebook
Eci2019 Nlp
Stars: ✭ 29 (+3.57%)
Mutual labels:  jupyter-notebook
Tensorflow In Practice Specialization
DeepLearning.AI TensorFlow Developer Professional Certificate Specialization
Stars: ✭ 29 (+3.57%)
Mutual labels:  jupyter-notebook

Download Conceptual Captions Data

Download the data files from https://ai.google.com/research/ConceptualCaptions/download and place them in this folder:

Train_GCC-training.tsv: Training Split (3,318,333 image-caption pairs)

Validation_GCC-1.1.0-Validation.tsv: Validation Split (15,840 image-caption pairs)

The Test Split (~12,500 human-approved image-caption pairs) is not public.
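As a quick sanity check before downloading, the TSV files can be read with the standard csv module. This is only a sketch: it assumes each line holds a caption and an image URL separated by a tab, with no header row, and the function name read_caption_url_pairs and the sample rows are made up for illustration.

```python
import csv
import io


def read_caption_url_pairs(handle):
    """Yield (caption, url) tuples from a tab-separated file object.

    Assumption: two columns per line (caption, then URL) and no header.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip malformed lines rather than guess at their meaning
        caption, url = row
        yield caption, url


# A tiny in-memory sample standing in for the real file (made-up rows).
sample = io.StringIO(
    "a dog on a beach\thttp://example.com/dog.jpg\n"
    "city skyline at night\thttp://example.com/city.jpg\n"
)
pairs = list(read_caption_url_pairs(sample))
print(len(pairs), "pairs read")
```

For the real files, pass a handle opened with open("Train_GCC-training.tsv", newline="", encoding="utf-8") instead of the in-memory sample.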

Run download_data.py.

Images will be saved to the training and validation folders. You can stop and resume the download at any time. The settings for splitting downloads into chunks and threads are not optimal, but they maxed out my connection, so I kept them as they are.
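The chunk-and-thread idea with stop/resume can be sketched roughly as follows. This is not the code in download_data.py; the names chunked, download_all and fake_fetch, and the default chunk size and worker count, are assumptions chosen for illustration. The fetch function is passed in so the sketch runs without network access.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def download_all(jobs, fetch, out_dir, chunk_size=1000, workers=16):
    """Download (filename, url) jobs in chunks using a thread pool.

    Files that already exist are skipped, which is what makes resuming cheap.
    `fetch(url)` must return the response body as bytes.
    """
    os.makedirs(out_dir, exist_ok=True)

    def one(job):
        name, url = job
        path = os.path.join(out_dir, name)
        if os.path.exists(path):
            return name, "skipped"
        try:
            data = fetch(url)
        except Exception as exc:  # record the failure and keep going
            return name, f"error: {exc}"
        tmp = path + ".part"
        with open(tmp, "wb") as f:
            f.write(data)
        os.replace(tmp, path)  # only complete files get the final name
        return name, "ok"

    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in chunked(jobs, chunk_size):
            results.extend(pool.map(one, chunk))
    return results


def fake_fetch(url):
    """Stand-in for a real HTTP request (hypothetical, for the demo only)."""
    return b"bytes for " + url.encode()


with tempfile.TemporaryDirectory() as folder:
    jobs = [("0.jpg", "http://example.com/0.jpg"),
            ("1.jpg", "http://example.com/1.jpg")]
    first = download_all(jobs, fake_fetch, folder, chunk_size=1, workers=2)
    second = download_all(jobs, fake_fetch, folder, chunk_size=1, workers=2)
    print(first)
    print(second)
```

Writing to a temporary ".part" file and renaming it afterwards means an interrupted run never leaves a half-written file that a later run would mistake for a finished download.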

Note: A previous version of this script used a different file naming scheme. If you are resuming a download that was started with that version, you will get duplicates.

Some images will fail to download, and some URLs will return web pages instead of images. These will need to be cleaned up later. After the validation download finishes, see downloaded_validation_report.tsv for HTTP errors. Based on the validation set results, around 8% of images are gone. Setting the user agent might fix some errors too; I am not sure whether any sites reject requests based on it.
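One possible way to do the cleanup is to check the first bytes of each downloaded file against well-known image signatures and flag anything else, such as HTML pages. This is a sketch, not part of the script: the helper names and the user agent string are hypothetical, and the signature list only covers JPEG, PNG, GIF and WebP. The last helper shows how a user agent could be set with urllib.request, without claiming that it fixes any particular error.

```python
import os
import tempfile
import urllib.request

# Leading bytes of JPEG, PNG and GIF files.
IMAGE_SIGNATURES = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n", b"GIF87a", b"GIF89a")


def looks_like_image(head):
    """Return True if the leading bytes match a known image signature."""
    if head.startswith(IMAGE_SIGNATURES):
        return True
    # WebP files start with "RIFF", four size bytes, then "WEBP".
    return head[:4] == b"RIFF" and head[8:12] == b"WEBP"


def find_non_images(folder):
    """List file names in `folder` whose content does not look like an image."""
    suspects = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            head = f.read(16)
        if not looks_like_image(head):
            suspects.append(name)
    return suspects


def build_request(url, user_agent="Mozilla/5.0 (compatible; example-downloader)"):
    """Build a request with an explicit User-Agent header (value is made up)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "good.jpg"), "wb") as f:
        f.write(b"\xff\xd8\xff\xe0 fake jpeg body")
    with open(os.path.join(folder, "page.jpg"), "wb") as f:
        f.write(b"<!DOCTYPE html><html>not found</html>")
    flagged = find_non_images(folder)
    print(flagged)
```

The function only reports suspect files; deleting or moving them is left to the caller so nothing is removed by accident.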

Downloading the training data should take about a day or two. Keep an eye on disk space.
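Free disk space can be checked from Python with shutil.disk_usage, for example before starting or periodically during a long run. A minimal sketch: the helper names are hypothetical, and the 100 GiB default is an arbitrary placeholder, not a measured size for this dataset.

```python
import shutil


def free_gib(path="."):
    """Return the free space, in GiB, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 2**30


def enough_space(path=".", required_gib=100.0):
    """Return True if at least `required_gib` GiB are free (threshold is a placeholder)."""
    return free_gib(path) >= required_gib


print(f"{free_gib('.'):.1f} GiB free")
```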
