
amineHorseman / Images Web Crawler

License: GPL-3.0
This package is a complete tool for creating a large dataset of images (designed especially, but not only, for machine learning enthusiasts). It can crawl the web, download images, rename / resize / convert the images and merge folders.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to Images Web Crawler

Deep learning projects
Stars: ✭ 28 (-45.1%)
Mutual labels:  dataset, image-classification, image-processing
Ipyplot
IPyPlot is a small Python package offering fast and efficient plotting of images inside Python notebooks. It uses IPython with HTML to provide a faster, richer and more interactive way of displaying large numbers of images.
Stars: ✭ 152 (+198.04%)
Mutual labels:  image-classification, image-processing, images
Chafa
📺🗿 Terminal graphics for the 21st century.
Stars: ✭ 774 (+1417.65%)
Mutual labels:  image-processing, images
Dmsmsgrcg
A photo OCR project that aims to output DMS messages contained in sign structure images.
Stars: ✭ 18 (-64.71%)
Mutual labels:  image-classification, image-processing
Sv Images
Image manipulation library with an HTTP based API.
Stars: ✭ 7 (-86.27%)
Mutual labels:  image-processing, images
Label Studio
Label Studio is a multi-type data labeling and annotation tool with standardized output format
Stars: ✭ 7,264 (+14143.14%)
Mutual labels:  dataset, image-classification
Oblique
With Oblique explore new styles of displaying images
Stars: ✭ 633 (+1141.18%)
Mutual labels:  image-processing, images
Python Compare Images
This repository is mainly about comparing two images. The technique used is SSIM, i.e. the Structural Similarity Index Measure. We use some of the built-in functions available in Python's skimage library to measure the SSIM value. Along with SSIM we also measure the MSE (Mean Squared Error). To learn more about the SSIM technique, see: https://en.wikipedia.org/wiki/Structural_similarity
Stars: ✭ 25 (-50.98%)
Mutual labels:  image-processing, images
Trashnet
Dataset of images of trash; Torch-based CNN for garbage image classification
Stars: ✭ 368 (+621.57%)
Mutual labels:  dataset, image-classification
Pytorch Toolbelt
PyTorch extensions for fast R&D prototyping and Kaggle farming
Stars: ✭ 942 (+1747.06%)
Mutual labels:  image-classification, image-processing
Albumentations
Fast image augmentation library and an easy-to-use wrapper around other libraries. Documentation: https://albumentations.ai/docs/ Paper about the library: https://www.mdpi.com/2078-2489/11/2/125
Stars: ✭ 9,353 (+18239.22%)
Mutual labels:  image-classification, image-processing
Multidigitmnist
Combine multiple MNIST digits to create datasets with 100/1000 classes for few-shot learning/meta-learning
Stars: ✭ 48 (-5.88%)
Mutual labels:  dataset, image-classification
Cvat
Powerful and efficient Computer Vision Annotation Tool (CVAT)
Stars: ✭ 6,557 (+12756.86%)
Mutual labels:  dataset, image-classification
Awesome Project Ideas
Curated list of Machine Learning, NLP, Vision, Recommender Systems Project Ideas
Stars: ✭ 6,114 (+11888.24%)
Mutual labels:  dataset, image-classification
Caer
High-performance Vision library in Python. Scale your research, not boilerplate.
Stars: ✭ 452 (+786.27%)
Mutual labels:  image-classification, image-processing
Concise Ipython Notebooks For Deep Learning
IPython notebooks for solving problems like classification, segmentation and generation using the latest deep learning algorithms on different publicly available text and image datasets.
Stars: ✭ 23 (-54.9%)
Mutual labels:  image-classification, image-processing
Openexr
The OpenEXR project provides the specification and reference implementation of the EXR file format, the professional-grade image storage format of the motion picture industry.
Stars: ✭ 992 (+1845.1%)
Mutual labels:  image-processing, images
Face recognition
🍎 My own face recognition with deep neural networks.
Stars: ✭ 328 (+543.14%)
Mutual labels:  image-classification, image-processing
Sianet
An easy to use C# deep learning library with CUDA/OpenCL support
Stars: ✭ 353 (+592.16%)
Mutual labels:  image-classification, image-processing
Cometa
Super fast, on-demand and on-the-fly image processing.
Stars: ✭ 8 (-84.31%)
Mutual labels:  image-processing, images

Web Image Crawler & Dataset Builder

This package is a complete tool for creating a large dataset of images (designed especially, but not only, for machine learning enthusiasts). With this package you can:

  • Download a large number of images using a list of keywords, and organize the images in subfolders
  • Rename and order the files automatically
  • Resize the images to the desired dimensions
  • Crop the images
  • Convert images to the desired format
  • Merge several subfolders of images into one single big folder
  • Convert images to grayscale
  • Encode the dataset in a single array file
  • Generate labels automatically from subfolder names
  • Flatten the images

The current version can crawl and download images from Google Image Search and Flickr Search through the official APIs. More search engines will be added later (e.g. Bing, Yahoo...).

Dependencies

Please make sure the following Python packages are installed before using this package:

pip install --upgrade google-api-python-client
pip install --upgrade flickrapi
pip install --upgrade scipy

Note: shutil, urllib and json are part of Python's standard library and do not need to be installed with pip.
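As a quick sanity check (a minimal sketch, not part of the package), you can verify that the third-party dependencies import correctly:

# Verify that the third-party dependencies are importable
# (google-api-python-client installs the googleapiclient module).
import googleapiclient
import flickrapi
import scipy
print("dependencies OK")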

How to use?

This package can be used in different manners depending on what you want to do (a complete example can be found in the sample.py file):

1. Crawl the web and download images

from web_crawler import WebCrawler

keywords = ["cats", "dogs", "birds"]
api_keys = {'google': ('XXXXXXXXXXXXXXXXXXXXXXXX', 'YYYYYYYYY'),
            'flickr': ('XXXXXXXXXXXXXXXXXXXXXXXX', 'YYYYYYYYY')} # replace XXX.. and YYY.. with your own keys
images_nbr = 20 # number of images to fetch per keyword
download_folder = "./data" # folder in which the images will be stored

### Crawl and download images ###
crawler = WebCrawler(api_keys)

# Crawl the web and collect URLs:
crawler.collect_links_from_web(keywords, images_nbr, remove_duplicated_links=True)

# Save URLs to download them later (optional):
crawler.save_urls(download_folder + "/links.txt")
# crawler.save_urls_to_json(download_folder + "/links.json")

# Download the images:
crawler.download_images(keywords, target_folder=download_folder)

For each keyword, this program will crawl Google Image Search and Flickr to collect 20 images and save them in the download_folder.

Note that in this case the program will consume 6 queries from your Google Search Engine quota. That's because Google's API limits the number of images per query to 10 (20 images * 3 keywords / 10 images per query => 6 queries).
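To double-check that arithmetic, the expected query count can be estimated with a minimal standalone sketch (not part of the package; it assumes the free tier's limit of 10 images per query):

# Estimate how many Google API queries a crawl will consume
# (assumption: the API returns at most 10 images per query).
import math
keywords = ["cats", "dogs", "birds"]
images_nbr = 20  # images requested per keyword
queries_needed = len(keywords) * math.ceil(images_nbr / 10)
print(queries_needed)  # 6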

To test the program, make sure to replace the values of the 'api_keys' variable with your own keys.

2. Download images from an existing list of URLs

from web_crawler import WebCrawler

keywords = ["cats", "dogs", "birds"]
api_keys = {'google': ('XXXXXXXXXXXXXXXXXXXXXXXX', 'YYYYYYYYY'),
            'flickr': ('XXXXXXXXXXXXXXXXXXXXXXXX', 'YYYYYYYYY')} # replace XXX.. and YYY.. with your own keys
download_folder = "./data" # folder in which the images will be stored

### Download images from saved URLs ###
crawler = WebCrawler(api_keys)

# Load URLs from a file:
crawler.load_urls(download_folder + "/links.txt")
# crawler.load_urls_from_json(download_folder + "/links.json")

# Download the images:
crawler.download_images(keywords, target_folder=download_folder)

3. Rename the downloaded files

from dataset_builder import DatasetBuilder
source_folder = "./data"
target_folder = "./data_renamed"
dataset_builder = DatasetBuilder()
dataset_builder.rename_files(source_folder)

This program will read all .jpg, .jpeg and .png files from source_folder, copy them to target_folder, and rename them according to this pattern: 1.jpg, 2.jpg, 3.jpg...

You can also specify the target_folder and the accepted extensions by passing extra arguments to the last command (default extensions: .jpg, .jpeg and .png):

dataset_builder.rename_files(source_folder, target_folder, extensions=('.png', '.gif'))

If your files have no extensions (this can happen with images downloaded using a browser), you can simply pass an empty string in the 'extensions' argument:

dataset_builder.rename_files(source_folder, target_folder, extensions='')

4. Resize the images

from dataset_builder import DatasetBuilder
source_folder = "./data"
target_folder = "./data_resized"
dataset_builder = DatasetBuilder()
dataset_builder.reshape_images(source_folder, target_folder)

This will resize the downloaded images to the default size of 128x128. To change the height and width to a custom size you can pass them as extra parameters:

dataset_builder.reshape_images(source_folder, target_folder, width=64, height=64)

You can also specify the image extensions (default: .jpg, .jpeg and .png):

dataset_builder.reshape_images(source_folder, target_folder, width=64, height=64, extensions=('.png', '.gif'))

If your files have no extensions:

dataset_builder.reshape_images(source_folder, target_folder, width=64, height=64, extensions='')

5. Crop the images

from dataset_builder import DatasetBuilder
source_folder = "./data"
target_folder = "./data_cropped"
dataset_builder = DatasetBuilder()
dataset_builder.crop_images(source_folder, target_folder, height=55, width=55)
# or with explicit extensions:
dataset_builder.crop_images(source_folder, target_folder, height=55, width=55, extensions=('.jpg', '.jpeg', '.png', '.gif'))

This performs a center crop on the images; the new image dimensions are height x width.
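For illustration only, this kind of center crop can be sketched with Pillow (Pillow is not a dependency of this package, and the file path below is hypothetical):

# Minimal sketch of a 55x55 center crop (illustration only, not the package's code).
from PIL import Image

img = Image.open("./data/1.jpg")  # hypothetical example file
w, h = img.size
left, top = (w - 55) // 2, (h - 55) // 2
cropped = img.crop((left, top, left + 55, top + 55))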

6. Merge images in one single folder

Sometimes it's useful to have all the images in one single folder (especially for unsupervised learning datasets).

The following code will take all the images in the source subfolders and copy them to the target folder.

Note that the images will be renamed to avoid overwriting files that have the same name, and also because most datasets use the following naming format: 1.jpg, 2.jpg, 3.jpg...

from dataset_builder import DatasetBuilder
source_folder = "./data"
target_folder = "./data_merged"
dataset_builder = DatasetBuilder()
dataset_builder.merge_folders(source_folder, target_folder, extensions=('.jpg', '.jpeg', '.png', '.gif'))

7. Convert images to grayscale

Some Machine Learning algorithms need grayscale images:

from dataset_builder import DatasetBuilder
source_folder = "./data"
target_folder = "./data_merged"
dataset_builder = DatasetBuilder()
dataset_builder.convert_to_grayscale(source_folder, target_folder, extensions=('.jpg', '.jpeg', '.png', '.gif'))

If your files have no extensions:

dataset_builder.convert_to_grayscale(source_folder, target_folder, extensions='')

8. Convert the images' format

In case you want to change the images to a specified format (e.g. convert all images to PNG):

from dataset_builder import DatasetBuilder
source_folder = "./data"
target_folder = "./data_png_format"
dataset_builder = DatasetBuilder()
dataset_builder.convert_format(source_folder, target_folder, new_extension='.png', extensions=('.jpg', '.jpeg', '.png', '.gif'))

If your files have no extensions:

dataset_builder.convert_format(source_folder, target_folder, new_extension='.png', extensions='')

9. Convert dataset to a single file

Many datasets in Machine Learning are encoded in a single array file containing all the data (e.g. MNIST, CIFAR-10...).

The following lines merge all the images into a single numpy array stored on disk as "data.npy".

from dataset_builder import DatasetBuilder
source_folder = "./data"
target_folder = "./data_single_file"
dataset_builder = DatasetBuilder()
dataset_builder.convert_to_single_file(source_folder, target_folder, flatten=False)
#dataset_builder.convert_to_single_file(source_folder, target_folder, flatten=False, extensions=('.jpg', '.jpeg', '.png', '.gif'))

If you want to automatically generate labels from the image subfolders, set the create_labels_file argument to True. In this case, two files will be generated: data.npy and labels.npy:

source_folder = download_folder
target_folder = download_folder + "_single_file"
dataset_builder.convert_to_single_file(source_folder, target_folder, flatten=False, create_labels_file=True)

If you want the images to be flattened during the operation, set the optional argument flatten to True. In that case, the images will be grouped into a 2-D matrix, where each row contains one flattened image.

source_folder = download_folder
target_folder = download_folder + "_single_file"
dataset_builder.convert_to_single_file(source_folder, target_folder, flatten=True, create_labels_file=True)
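The generated files can then be loaded back with numpy, for example (a minimal sketch assuming the files were written to target_folder under the default names data.npy and labels.npy):

# Load the generated dataset back into memory.
import numpy as np

data = np.load(target_folder + "/data.npy")
labels = np.load(target_folder + "/labels.npy")
print(data.shape, labels.shape)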

Note about API limitations

This package is not intended to simulate a browser in order to bypass the API limitations of the search engines.

  • Google Search API is limited to 100 queries per day, with 10 images per query (in the free version).
  • Flickr API is limited to 3,600 queries per hour with 200 images per query, and returns at most 4,000 results for each keyword.
  • Bing API is limited to 5,000 queries per month.
  • Yahoo! API is limited to 50 images per query, and returns at most 1,000 results for each keyword.

TODO:

Feel free to contribute to this package, or propose your ideas:

  • Add more search engines (Bing, Yahoo...)
  • Test on Python 3.x
  • Change crawling and download method?
  • Detect duplicated or very similar images?

Please report any issue here.
