
openzim / zimit

License: GPL-3.0
Make a ZIM file from any Web site and surf offline!

Programming Languages

python
Dockerfile

Projects that are alternatives of or similar to zimit

Android-Web-Scraper
Android Web Scraper is a simple library for Android web automation. You can perform web tasks in the background to fetch website data programmatically.
Stars: ✭ 38 (-43.28%)
Mutual labels:  webscraping
metacritic api
PHP Metacritic API - Mirrored by my GitLab
Stars: ✭ 31 (-53.73%)
Mutual labels:  webscraping
bing-ip2hosts
bing-ip2hosts is a Bing.com web scraper that discovers websites by IP address
Stars: ✭ 99 (+47.76%)
Mutual labels:  webscraping
Mimo-Crawler
A web crawler that uses Firefox and js injection to interact with webpages and crawl their content, written in nodejs.
Stars: ✭ 22 (-67.16%)
Mutual labels:  webscraping
image-crawler
An image scraper that scrapes images from unsplash.com
Stars: ✭ 12 (-82.09%)
Mutual labels:  webscraping
koishi
Python wrapper for the unofficial scraped API of the satori testing system.
Stars: ✭ 13 (-80.6%)
Mutual labels:  webscraping
ir
Project that automatically calculates income tax (Imposto de Renda) on Bovespa stock-exchange trades. Tags: canal eletronico do investidor, CEI, selenium, bovespa, IRPF, IR, imposto de renda, finance, yahoo finance, acao, fii, etf, python, crawler, webscraping, calculadora ir
Stars: ✭ 120 (+79.1%)
Mutual labels:  webscraping
OkanimeDownloader
Scrape your favorite Anime from Okanime.com without effort
Stars: ✭ 13 (-80.6%)
Mutual labels:  webscraping
scrapism
a work-in-progress guide to web scraping as an artistic and critical practice
Stars: ✭ 43 (-35.82%)
Mutual labels:  webscraping
NYTimes-iOS
🗽 NY Times is a minimal News 🗞 iOS app 📱 built to demonstrate the use of SwiftSoup and CoreData with SwiftUI🔥
Stars: ✭ 152 (+126.87%)
Mutual labels:  webscraping
newspaperjs
News extraction and scraping. Article Parsing
Stars: ✭ 59 (-11.94%)
Mutual labels:  webscraping
newsemble
API for fetching data from news websites.
Stars: ✭ 42 (-37.31%)
Mutual labels:  webscraping
hk0weather
Web scraper project to collect useful Hong Kong weather data from the HKO website
Stars: ✭ 49 (-26.87%)
Mutual labels:  webscraping
node-libzim
Binding to libzim, read/write ZIM files in JavaScript
Stars: ✭ 23 (-65.67%)
Mutual labels:  zim
phomber
Phomber is an information-gathering tool that reverse-searches phone numbers and retrieves their details, written in Python 3.
Stars: ✭ 59 (-11.94%)
Mutual labels:  webscraping
blog.brasil.io
The Brasil.IO blog
Stars: ✭ 24 (-64.18%)
Mutual labels:  webscraping
TrackPurchase
Scrape payment histories from various shopping platforms with just a few lines of code!
Stars: ✭ 19 (-71.64%)
Mutual labels:  webscraping
allitebooks.com
Download all the ebooks from allitebooks.com, with an indexed CSV
Stars: ✭ 24 (-64.18%)
Mutual labels:  webscraping
FisherMan
CLI program that collects information from facebook user profiles via Selenium.
Stars: ✭ 117 (+74.63%)
Mutual labels:  webscraping
Bitcoin-Bar
Physical Bitcoin Stat Ticker
Stars: ✭ 32 (-52.24%)
Mutual labels:  webscraping

Zimit

Zimit is a scraper that creates a ZIM file from any Web site.


⚠️ Important: this tool uses warc2zim to create ZIM files and thus requires the ZIM reader to support Service Workers. At the time of zimit:1.0, that's mostly kiwix-android and kiwix-serve. Note that Service Workers have protocol restrictions as well, so you'll need to run it either from localhost or over HTTPS.
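Because of those Service Worker restrictions, a quick way to check a produced ZIM is to serve it over localhost with kiwix-serve. A minimal sketch, assuming the output file is /output/myzimfile.zim (the name used in the Usage example further down):

```shell
# Serve the ZIM on http://localhost:8080/ so Service Workers can register.
kiwix-serve --port=8080 /output/myzimfile.zim
```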

Technical background

This version of Zimit runs a single-site headless-Chrome based crawl in a Docker container and produces a ZIM of the crawled content.

The system extends the crawling system in Browsertrix Crawler and converts the crawled WARC files to ZIM using warc2zim.

zimit.py is the entrypoint for the system.

After the crawl is done, warc2zim is used to write a ZIM file to the /output directory, which can be mounted as a volume.

With the --keep flag, the crawled WARC files are also kept in a temp directory inside /output.

Usage

zimit is intended to be run in Docker.

To build locally run:

docker build -t openzim/zimit .

The image accepts the following parameters:

  • --url URL - the URL to be crawled (required)
  • --workers N - number of crawl workers to be run in parallel
  • --wait-until - Puppeteer setting for how long to wait for page load. See page.goto waitUntil options. The default is load, but for static sites, --wait-until domcontentloaded may be used to speed up the crawl (to avoid waiting for ads to load for example).
  • --name - Name of ZIM file (defaults to the hostname of the URL)
  • --output - output directory (defaults to /output)
  • --limit U - Limit capture to at most U URLs
  • --exclude <regex> - skip crawling URLs that match the regex. Can be specified multiple times. For example, --exclude="(\?q=|signup-landing\?|\?cid=)" excludes any URL containing ?q=, signup-landing?, or ?cid=.
  • --scroll [N] - if set, activates a simple auto-scroll behavior on each page, scrolling for up to N seconds
  • --keep - if set, keep the WARC files in a temp directory inside the output directory
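The --exclude pattern above can be sanity-checked outside of a crawl. The sketch below uses grep -E purely to illustrate which hypothetical URLs the example regex would match (zimit itself evaluates the regex inside the crawler, not with grep):

```shell
# The example pattern from the --exclude option above.
pattern='(\?q=|signup-landing\?|\?cid=)'

# Hypothetical URLs: the first two match the pattern and would be
# excluded from the crawl; the last one does not match.
for url in \
    'https://example.com/search?q=offline' \
    'https://example.com/signup-landing?ref=home' \
    'https://example.com/article/zim-files'
do
    if echo "$url" | grep -E -q "$pattern"; then
        echo "excluded: $url"
    else
        echo "crawled:  $url"
    fi
done
```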

The following is an example usage. The --cap-add and --shm-size flags are needed to run Chrome in Docker.

Example command:

docker run -v /output:/output --cap-add=SYS_ADMIN --cap-add=NET_ADMIN \
       --shm-size=1gb openzim/zimit zimit --url URL --name myzimfile --workers 2 --waitUntil domcontentloaded

puppeteer-cluster monitoring output is enabled by default and prints the crawl status to the Docker log.

Note: the image automatically filters out a large number of ads using the three blocklists from anudeepND. If you don't want this filtering, disable the image's entrypoint in your container (docker run --entrypoint="" openzim/zimit ...).

Nota bene

A first version of a generic HTTP scraper was created in 2016 during the Wikimania Esino Lario Hackathon.

That version is now considered outdated and was archived in the 2016 branch.

License

GPLv3 or later, see LICENSE for more details.
