openzim / gutenberg

License: GPL-3.0
Scraper for downloading the entire ebook repository of Project Gutenberg

Gutenberg Offline

This scraper downloads the whole Project Gutenberg library and puts it in a ZIM file, a clean and user-friendly format for storing content for offline use.

Setting up the environment

We recommend using virtualenv with Python 3.6 or later.

Install the dependencies

GNU/Linux

sudo apt-get install python-pip python-dev libxml2-dev libxslt-dev advancecomp jpegoptim pngquant p7zip-full gifsicle curl zip
sudo pip install virtualenv

macOS

sudo easy_install pip
sudo pip install virtualenv
brew install advancecomp jpegoptim pngquant p7zip gifsicle

Set up the project

git clone git@github.com:kiwix/gutenberg.git
cd gutenberg
virtualenv gut-env  # or any name you want
./gut-env/bin/pip install -r requirements.pip

Working in the environment

  • Activate the environment: source gut-env/bin/activate
  • Quit the environment: deactivate
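
To confirm that the environment is active before running the scraper, a small check like the one below can help. It is only a sketch: the function names `in_virtualenv` and `python_is_supported` are hypothetical and not part of this project, and the detection relies on how virtualenv and venv set `sys.prefix`.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter appears to run inside a virtualenv."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    base = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base


def python_is_supported(minimum=(3, 6)) -> bool:
    """Return True when the running Python is at least `minimum`."""
    return sys.version_info[:2] >= tuple(minimum)


if __name__ == "__main__":
    print("virtualenv active:", in_virtualenv())
    print("Python 3.6+:", python_is_supported())
```

Run it with the interpreter you intend to use, for example `./gut-env/bin/python`, to see what that interpreter reports.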

Getting started

Once the environment is set up, run the main script, gutenberg2zim. It downloads, processes, and exports the content.

./gutenberg2zim

Arguments

You can also pass arguments to customize the content. Want only books with IDs 100 to 200? Only books in French, in English, or in both? No problem. You can also include or exclude book formats, add bookshelves, and enable searching books by title to enrich the user experience.

./gutenberg2zim -l en,fr -f pdf --books 100-200 --bookshelves --title-search

This downloads the English and French books with IDs 100 to 200, in HTML (the default) and PDF formats.
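
To make the `--books` syntax concrete, here is a minimal sketch of how a value such as `100-200` or `5,7,100-103` could be expanded into individual book IDs. The function `parse_book_ids` is hypothetical and is not taken from the scraper's code; in particular, treating intervals as inclusive on both ends is an assumption based on the "IDs 100 to 200" wording.

```python
def parse_book_ids(spec: str) -> list:
    """Expand a comma-separated list of IDs and dash intervals into sorted IDs.

    Assumption: intervals are inclusive on both ends.
    """
    ids = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                start, end = end, start  # accept reversed intervals
            ids.update(range(start, end + 1))
        else:
            ids.add(int(part))
    return sorted(ids)


if __name__ == "__main__":
    print(parse_book_ids("5,7,100-103"))
```

Under the inclusive assumption, `100-200` covers 101 books.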

You can find the full arguments list below:

-h --help                       Display this help message
-y --wipe-db                    Empty cached book metadata
-F --force                      Redo step even if target already exists

-l --languages=<list>           Comma-separated list of lang codes to filter export to (preferably ISO 639-1, else ISO 639-3)
-f --formats=<list>             Comma-separated list of formats to filter export to (epub, html, pdf, all)

-r --rdf-folder=<folder>        Don't download rdf-files.tar.bz2 and use extracted folder instead
-e --static-folder=<folder>     Folder to use as, or write to, the static HTML output
-z --zim-file=<file>            Write ZIM into this file path
-t --zim-title=<title>          Set ZIM title
-n --zim-desc=<description>     Set ZIM description
-d --dl-folder=<folder>         Folder to read downloaded ebooks from, or write them to
-u --rdf-url=<url>              Alternative rdf-files.tar.bz2 URL
-b --books=<ids>                Execute the processes for specific books, separated by commas, or dashes for intervals
-c --concurrency=<nb>           Number of concurrent processes for download and parsing tasks
--dlc=<nb>                      Number of concurrent *download* processes (overrides --concurrency); useful if the server blocks high-rate requests
-m --one-language-one-zim=<folder> When more than one language is selected, create one ZIM for each language (and one with all)
--no-index                      Do NOT create full-text index within ZIM file
--check                         Check dependencies
--prepare                       Download & extract rdf-files.tar.bz2
--parse                         Parse all RDF files and fill-up the DB
--download                      Download ebooks based on filters
--export                        Export downloaded content to zim-friendly static HTML
--dev                           Exports *just* Home+JS+CSS files (overwritten by --zim step)
--zim                           Create a ZIM file
--title-search                  Add field to search a book by title and directly jump to it
--bookshelves                   Add bookshelves
--optimization-cache=<url>      URL with credentials to S3 bucket for using as optimization cache
--use-any-optimized-version     Try to use any optimized version found on optimization cache
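
If you drive the scraper from another script, the command line can be assembled programmatically. The sketch below only uses flags listed above (`-l`, `-f`, `--books`, `--bookshelves`, `--title-search`); the helper `build_command` and its parameter names are hypothetical, and it builds the argument list without running anything.

```python
import shlex


def build_command(languages=None, formats=None, books=None,
                  bookshelves=False, title_search=False,
                  executable="./gutenberg2zim"):
    """Assemble a gutenberg2zim argument list from a few common options."""
    cmd = [executable]
    if languages:
        cmd += ["-l", ",".join(languages)]  # comma-separated language codes
    if formats:
        cmd += ["-f", ",".join(formats)]    # comma-separated formats
    if books:
        cmd += ["--books", books]           # IDs and/or dash intervals
    if bookshelves:
        cmd.append("--bookshelves")
    if title_search:
        cmd.append("--title-search")
    return cmd


if __name__ == "__main__":
    cmd = build_command(["en", "fr"], ["pdf"], "100-200",
                        bookshelves=True, title_search=True)
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be passed to `subprocess.run` as is. Passing a list rather than a single string avoids shell quoting problems.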

License

GPLv3 or later, see LICENSE for more details.
