
peterk / warcworker

Licence: GPL-3.0
A dockerized, queued, high-fidelity web archiver based on Squidwarc

Programming Languages

Python, Dockerfile, HTML, JavaScript, CSS

Projects that are alternatives of or similar to warcworker

munin-indexer
A social media open post web archiving tool
Stars: ✭ 16 (-66.67%)
Mutual labels:  archiving, preservation, webarchiving, high-fidelity-preservation
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Stars: ✭ 52 (+8.33%)
Mutual labels:  archiving, webarchiving
Warc
Golang WARC (Web ARChive) Library
Stars: ✭ 25 (-47.92%)
Mutual labels:  archiving
Reprozip
ReproZip is a tool that simplifies the process of creating reproducible experiments from command-line executions, a frequently-used common denominator in computational science.
Stars: ✭ 231 (+381.25%)
Mutual labels:  archiving
Wal G
Archival and Restoration for Postgres
Stars: ✭ 1,974 (+4012.5%)
Mutual labels:  archiving
Static Filez
Build compressed archives for static files and serve them over HTTP
Stars: ✭ 33 (-31.25%)
Mutual labels:  archiving
Wikipedia Mirror
🌐 Guide and tools to run a full offline mirror of Wikipedia.org with three different approaches: Nginx caching proxy, Kiwix + ZIM dump, and MediaWiki/XOWA + XML dump
Stars: ✭ 160 (+233.33%)
Mutual labels:  archiving
Django Urlarchivefield
A custom Django model field that automatically archives a URL
Stars: ✭ 5 (-89.58%)
Mutual labels:  archiving
anchorage
Save your bookmark collection in the Internet Archive, or locally.
Stars: ✭ 19 (-60.42%)
Mutual labels:  archiving
Libarchive
Multi-format archive and compression library
Stars: ✭ 1,625 (+3285.42%)
Mutual labels:  archiving
Archivebot
ArchiveBot, an IRC bot for archiving websites
Stars: ✭ 218 (+354.17%)
Mutual labels:  archiving
I7j Pdfhtml
pdfHTML is an iText 7 add-on for Java that allows you to easily convert HTML and CSS into standards compliant PDFs that are accessible, searchable and usable for indexing.
Stars: ✭ 104 (+116.67%)
Mutual labels:  archiving
Paperless
Scan, index, and archive all of your paper documents
Stars: ✭ 7,662 (+15862.5%)
Mutual labels:  archiving
Jarchivelib
A simple archiving and compression library for Java
Stars: ✭ 162 (+237.5%)
Mutual labels:  archiving
Crocoite
Web archiving using Google Chrome
Stars: ✭ 30 (-37.5%)
Mutual labels:  archiving
Archiveror
Archiveror will help you preserve the webpages you love. 💾
Stars: ✭ 246 (+412.5%)
Mutual labels:  archiving
Itext7
iText 7 for Java represents the next level of SDKs for developers that want to take advantage of the benefits PDF can bring. Equipped with a better document engine, high and low-level programming capabilities and the ability to create, edit and enhance PDF documents, iText 7 can be a boon to nearly every workflow.
Stars: ✭ 913 (+1802.08%)
Mutual labels:  archiving
Cli
A tiny CLI for HedgeDoc
Stars: ✭ 94 (+95.83%)
Mutual labels:  archiving
Archiveis
A simple Python wrapper for the archive.is capturing service
Stars: ✭ 140 (+191.67%)
Mutual labels:  archiving
chronicle-etl
📜 A CLI toolkit for extracting and working with your digital history
Stars: ✭ 78 (+62.5%)
Mutual labels:  archiving

Warcworker

A dockerized, queued, high-fidelity web archiver based on Squidwarc (headless Chrome), RabbitMQ, and a small web frontend. Using the scripting abilities of Squidwarc, you can add scripts that are run for a specific job (e.g. srcset enrichment, comment expansion, etc.). Please note that Warcworker is not a crawler: it will not crawl a website automatically, so you have to use other software to build lists of URLs to send to Warcworker.

screenshot of Warcworker

Installation

Copy .env_example to .env, then update the values in .env for your setup.

Start the stack with docker-compose up -d --scale worker=3 (this runs three worker containers), then wait a minute for everything to start up.
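
Instead of waiting a fixed minute, you can poll the frontend until it answers. The following is a minimal sketch and not part of Warcworker: the function names frontend_is_up and wait_until_ready are hypothetical, and it assumes the web frontend answers plain HTTP requests at http://0.0.0.0:5555 (substitute the address where your frontend is actually reachable).

```python
import time
import urllib.error
import urllib.request


def frontend_is_up(url="http://0.0.0.0:5555", timeout=5):
    """Return True if an HTTP request to the frontend gets any response.

    The default URL is an assumption based on the address used in this README.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if only with an error status.
        return True
    except OSError:
        # Connection refused, timeout, name resolution failure, etc.
        return False


def wait_until_ready(check=frontend_is_up, attempts=30, delay=2.0, sleep=time.sleep):
    """Call check() up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling wait_until_ready() with the defaults tries for roughly a minute before giving up. The check and sleep parameters are injectable so the loop can be exercised without a running stack.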

Archiving and playback

Open the web frontend at http://0.0.0.0:5555 to enter URLs for archiving. You can prefill the text fields with the url and description request parameters. Play back the resulting WARC files with Webrecorder Player.
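
A prefilled link can also be generated from a script. This is a small sketch under stated assumptions: prefill_url is a hypothetical helper name, and the base address is the one used in this README. Only the url and description parameter names come from the text above.

```python
from urllib.parse import urlencode


def prefill_url(page_url, description="", base="http://0.0.0.0:5555"):
    """Build a frontend link whose text fields are prefilled.

    `base` is an assumption; point it at wherever your frontend is served.
    """
    params = {"url": page_url}
    if description:
        params["description"] = description
    # urlencode percent-encodes the values so they survive as query parameters
    return base + "?" + urlencode(params)


print(prefill_url("https://example.com/", "Example page"))
```

Opening the printed link in a browser should show the form with both fields already filled in.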

Using

Bookmarklet

Add a bookmarklet to your browser with the following link:

javascript:window.open('http://0.0.0.0:5555?url='+encodeURIComponent(location.href) + '&description=' + encodeURIComponent(document.title));window.focus();

Now you have two-click web archiving from your browser.

Command line

To use from the command line with curl:

curl -d "scripts=srcset&scripts=scroll_everything&url=https://www.peterkrantz.com/" -X POST http://0.0.0.0:5555/process/
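
The same request can be prepared from Python using only the standard library. This is a hedged sketch rather than an official client: build_process_request is a hypothetical name, and the form fields (scripts, url) and the /process/ endpoint are taken from the curl command above. The scripts field is repeated once per script, in the order given.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_process_request(url, scripts=(), endpoint="http://0.0.0.0:5555/process/"):
    """Build (but do not send) a form-encoded POST for the /process/ endpoint."""
    # One "scripts" field per script name, preserving the caller's order
    fields = [("scripts", name) for name in scripts]
    fields.append(("url", url))
    body = urlencode(fields).encode("ascii")
    return Request(endpoint, data=body, method="POST")


req = build_process_request(
    "https://www.peterkrantz.com/",
    scripts=["srcset", "scroll_everything"],
)
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

To actually queue the job, pass the request to urllib.request.urlopen(req, timeout=120) once the stack is running.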

Archivenow handler

To use Warcworker from archivenow, add a handler file handlers/ww_handler.py like this:

import requests

class WW_handler(object):

    def __init__(self):
        self.enabled = True
        self.name = 'Warcworker'
        self.api_required = False

    def push(self, uri_org, p_args=None):
        # p_args is accepted but not used by this handler
        msg = ''
        try:
            # add scripts in the order you want them to be run on the page
            payload = {"url": uri_org, "scripts": ["scroll_everything", "srcset"]}

            # requests sends a list value as repeated form fields (scripts=...&scripts=...)
            r = requests.post('http://0.0.0.0:5555/process/', timeout=120,
                              data=payload,
                              allow_redirects=True)

            # raise on 4xx/5xx responses so they are reported as errors below
            r.raise_for_status()
            return "%s added to queue" % uri_org

        except Exception as e:
            msg = "Error (" + self.name + "): " + str(e)
        return msg