
openzim / zimit

License: GPL-3.0
Make a ZIM file from any Web site and surf offline!

Programming Languages

python
Dockerfile

Projects that are alternatives of or similar to zimit

Android-Web-Scraper
Android Web Scraper is a simple library for Android web automation. You can perform web tasks in the background to fetch website data programmatically.
Stars: ✭ 38 (-43.28%)
Mutual labels:  webscraping
metacritic api
PHP Metacritic API - Mirrored by my GitLab
Stars: ✭ 31 (-53.73%)
Mutual labels:  webscraping
bing-ip2hosts
bing-ip2hosts is a Bing.com web scraper that discovers websites by IP address
Stars: ✭ 99 (+47.76%)
Mutual labels:  webscraping
Mimo-Crawler
A web crawler that uses Firefox and js injection to interact with webpages and crawl their content, written in nodejs.
Stars: ✭ 22 (-67.16%)
Mutual labels:  webscraping
image-crawler
An image scraper that scrapes images from unsplash.com
Stars: ✭ 12 (-82.09%)
Mutual labels:  webscraping
koishi
Python wrapper for the unofficial scraped API of the satori testing system.
Stars: ✭ 13 (-80.6%)
Mutual labels:  webscraping
ir
Project that automatically calculates income tax (Imposto de Renda) on Bovespa stock-exchange trades. Tags: canal eletronico do investidor, CEI, selenium, bovespa, IRPF, IR, imposto de renda, finance, yahoo finance, acao, fii, etf, python, crawler, webscraping, calculadora ir
Stars: ✭ 120 (+79.1%)
Mutual labels:  webscraping
OkanimeDownloader
Scrape your favorite Anime from Okanime.com without effort
Stars: ✭ 13 (-80.6%)
Mutual labels:  webscraping
scrapism
a work-in-progress guide to web scraping as an artistic and critical practice
Stars: ✭ 43 (-35.82%)
Mutual labels:  webscraping
NYTimes-iOS
🗽 NY Times is a minimal News 🗞 iOS app 📱 built to demonstrate the use of SwiftSoup and CoreData with SwiftUI🔥
Stars: ✭ 152 (+126.87%)
Mutual labels:  webscraping
newspaperjs
News extraction and scraping. Article Parsing
Stars: ✭ 59 (-11.94%)
Mutual labels:  webscraping
newsemble
API for fetching data from news websites.
Stars: ✭ 42 (-37.31%)
Mutual labels:  webscraping
hk0weather
Web scraper project to collect useful Hong Kong weather data from the HKO website
Stars: ✭ 49 (-26.87%)
Mutual labels:  webscraping
node-libzim
Binding to libzim, read/write ZIM files in JavaScript
Stars: ✭ 23 (-65.67%)
Mutual labels:  zim
phomber
Phomber is an information-gathering tool that reverse-searches phone numbers and retrieves their details, written in Python 3.
Stars: ✭ 59 (-11.94%)
Mutual labels:  webscraping
blog.brasil.io
The Brasil.IO blog
Stars: ✭ 24 (-64.18%)
Mutual labels:  webscraping
TrackPurchase
Scrape payment histories from various shopping platforms with just a few lines of code!
Stars: ✭ 19 (-71.64%)
Mutual labels:  webscraping
allitebooks.com
Download all the ebooks from allitebooks.com, with an indexed CSV
Stars: ✭ 24 (-64.18%)
Mutual labels:  webscraping
FisherMan
CLI program that collects information from facebook user profiles via Selenium.
Stars: ✭ 117 (+74.63%)
Mutual labels:  webscraping
Bitcoin-Bar
Physical Bitcoin Stat Ticker
Stars: ✭ 32 (-52.24%)
Mutual labels:  webscraping

Zimit

Zimit is a scraper that creates a ZIM file from any Web site.


⚠️ Important: this tool uses warc2zim to create ZIM files and thus requires the ZIM reader to support Service Workers. At the time of zimit:1.0, that's mostly kiwix-android and kiwix-serve. Note that Service Workers have protocol restrictions as well, so you'll need to run it either from localhost or over HTTPS.
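Because of those Service Worker restrictions, a quick way to check a produced ZIM is to serve it over localhost with kiwix-serve. A minimal sketch, assuming the output file is /output/myzimfile.zim (the name used in the Usage example further down):

```shell
# Serve the ZIM on http://localhost:8080/ so Service Workers can register.
kiwix-serve --port=8080 /output/myzimfile.zim
```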

Technical background

This version of Zimit runs a single-site headless-Chrome based crawl in a Docker container and produces a ZIM of the crawled content.

The system extends the crawling system in Browsertrix Crawler and converts the crawled WARC files to ZIM using warc2zim.

zimit.py is the entrypoint for the system.

After the crawl is done, warc2zim is used to write a ZIM file to the /output directory, which can be mounted as a volume.

With the --keep flag, the crawled WARC files are also kept in a temp directory inside /output.

Usage

zimit is intended to be run in Docker.

To build locally run:

docker build -t openzim/zimit .

The image accepts the following parameters:

  • --url URL - the URL to be crawled (required)
  • --workers N - number of crawl workers to be run in parallel
  • --wait-until - Puppeteer setting for how long to wait for page load. See page.goto waitUntil options. The default is load, but for static sites, --wait-until domcontentloaded may be used to speed up the crawl (to avoid waiting for ads to load for example).
  • --name - Name of ZIM file (defaults to the hostname of the URL)
  • --output - output directory (defaults to /output)
  • --limit U - Limit capture to at most U URLs
  • --exclude <regex> - skip crawling URLs that match the regex. Can be specified multiple times. For example, --exclude="(\?q=|signup-landing\?|\?cid=)" excludes any URL containing ?q=, signup-landing?, or ?cid=.
  • --scroll [N] - if set, activates a simple auto-scroll behavior on each page, scrolling for up to N seconds
  • --keep - if set, keep the WARC files in a temp directory inside the output directory
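The --exclude pattern above can be sanity-checked outside of a crawl. The sketch below uses grep -E purely to illustrate which hypothetical URLs the example regex would match (zimit itself evaluates the regex inside the crawler, not with grep):

```shell
# The example pattern from the --exclude option above.
pattern='(\?q=|signup-landing\?|\?cid=)'

# Hypothetical URLs: the first two match the pattern and would be
# excluded from the crawl; the last one does not match.
for url in \
    'https://example.com/search?q=offline' \
    'https://example.com/signup-landing?ref=home' \
    'https://example.com/article/zim-files'
do
    if echo "$url" | grep -E -q "$pattern"; then
        echo "excluded: $url"
    else
        echo "crawled:  $url"
    fi
done
```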

The following is an example usage. The --cap-add and --shm-size flags are needed to run Chrome in Docker.

Example command:

docker run -v /output:/output --cap-add=SYS_ADMIN --cap-add=NET_ADMIN \
       --shm-size=1gb openzim/zimit zimit --url URL --name myzimfile --workers 2 --waitUntil domcontentloaded

puppeteer-cluster monitoring output is enabled by default and prints the crawl status to the Docker log.

Note: the image automatically filters out a large number of ads using the three blocklists from anudeepND. If you don't want this filtering, disable the image's entrypoint in your container (docker run --entrypoint="" openzim/zimit ...).

Nota bene

A first version of a generic HTTP scraper was created in 2016 during the Wikimania Esino Lario Hackathon.

That version is now considered outdated and was archived in the 2016 branch.

License

GPLv3 or later, see LICENSE for more details.
