ReproZip is a tool that simplifies the process of creating reproducible experiments from command-line executions, a frequently-used common denominator in computational science.

Stars: ✭ 231 (+381.25%)

Mutual labels: archiving

Wal G

Archival and Restoration for Postgres

Stars: ✭ 1,974 (+4012.5%)

Mutual labels: archiving

Static Filez

Build compressed archives for static files and serve them over HTTP

Stars: ✭ 33 (-31.25%)

Mutual labels: archiving

Wikipedia Mirror

🌐 Guide and tools to run a full offline mirror of Wikipedia.org with three different approaches: Nginx caching proxy, Kiwix + ZIM dump, and MediaWiki/XOWA + XML dump

Stars: ✭ 160 (+233.33%)

Mutual labels: archiving

Django Urlarchivefield

A custom Django model field that automatically archives a URL

Stars: ✭ 5 (-89.58%)

Mutual labels: archiving

anchorage

Save your bookmark collection in the Internet Archive, or locally.

Stars: ✭ 19 (-60.42%)

Mutual labels: archiving

Libarchive

Multi-format archive and compression library

Stars: ✭ 1,625 (+3285.42%)

Mutual labels: archiving

Archivebot

ArchiveBot, an IRC bot for archiving websites

Stars: ✭ 218 (+354.17%)

Mutual labels: archiving

I7j Pdfhtml

pdfHTML is an iText 7 add-on for Java that allows you to easily convert HTML and CSS into standards compliant PDFs that are accessible, searchable and usable for indexing.

Stars: ✭ 104 (+116.67%)

Mutual labels: archiving

Paperless

Scan, index, and archive all of your paper documents

Stars: ✭ 7,662 (+15862.5%)

Mutual labels: archiving

Jarchivelib

A simple archiving and compression library for Java

Stars: ✭ 162 (+237.5%)

Mutual labels: archiving

Crocoite

Web archiving using Google Chrome

Stars: ✭ 30 (-37.5%)

Mutual labels: archiving

Archiveror

Archiveror will help you preserve the webpages you love. 💾

Stars: ✭ 246 (+412.5%)

Mutual labels: archiving

Itext7

iText 7 for Java represents the next level of SDKs for developers that want to take advantage of the benefits PDF can bring. Equipped with a better document engine, high and low-level programming capabilities and the ability to create, edit and enhance PDF documents, iText 7 can be a boon to nearly every workflow.

Stars: ✭ 913 (+1802.08%)

Mutual labels: archiving

Cli

A tiny CLI for HedgeDoc

Stars: ✭ 94 (+95.83%)

Mutual labels: archiving

Archiveis

A simple Python wrapper for the archive.is capturing service

Stars: ✭ 140 (+191.67%)

Mutual labels: archiving

chronicle-etl

📜 A CLI toolkit for extracting and working with your digital history

Stars: ✭ 78 (+62.5%)

Mutual labels: archiving


Warcworker

A dockerized, queued, high-fidelity web archiver based on Squidwarc (Chrome headless), RabbitMQ and a small web frontend. Using the scripting abilities of Squidwarc, you can add scripts that should be run for a specific job (e.g. srcset enrichment, comment expansion, etc.). Please note that Warcworker is not a crawler: it will not crawl a website automatically, so you have to use other software to build lists of URLs to send to Warcworker.

Installation

Copy .env_example to .env, then update the values in .env.

Start the stack with docker-compose up -d --scale worker=3, then wait about a minute for everything to start up.
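Instead of waiting a fixed minute, a script can poll until the front end accepts connections. The following is a minimal sketch, not part of Warcworker: the helper name wait_for_port is hypothetical, and it assumes the front end is published on port 5555 of the local machine, as in the URLs used elsewhere in this README. A successful TCP connection only shows that the port is open, not that the workers are ready.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError (refused, unreachable, timed out) on failure
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, wait_for_port("127.0.0.1", 5555, timeout=90) could be called right after docker-compose up returns.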

Archiving and playback

Open the web front end at http://0.0.0.0:5555 to enter URLs for archiving. You can prefill the text fields with the url and description request parameters. Play back the resulting WARC files with Webrecorder Player.
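As an illustration of the url and description request parameters, the sketch below builds a link that opens the front end with both fields prefilled. The function name prefill_link is hypothetical, and the base address is assumed to be the one shown above. Spaces are percent-encoded as %20, which mirrors what encodeURIComponent produces in a browser.

```python
from urllib.parse import quote, urlencode

# Assumption: the front end address used throughout this README.
FRONTEND = "http://0.0.0.0:5555"


def prefill_link(url: str, description: str = "", base: str = FRONTEND) -> str:
    """Build a front-end link whose form fields are prefilled via query parameters."""
    params = {"url": url}
    if description:
        params["description"] = description
    # quote_via=quote encodes spaces as %20 instead of '+'
    return f"{base}/?{urlencode(params, quote_via=quote)}"


print(prefill_link("https://example.org/a page", "Example page"))
```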

Using

Bookmarklet

Add a bookmarklet to your browser with the following link:

javascript:window.open('http://0.0.0.0:5555?url='+encodeURIComponent(location.href) + '&description=' + encodeURIComponent(document.title));window.focus();

Now you have two-click web archiving from your browser.

Command line

To use from the command line with curl:

curl -d "scripts=srcset&scripts=scroll_everything&url=https://www.peterkrantz.com/" -X POST http://0.0.0.0:5555/process/
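Because Warcworker does not crawl, a list of URLs produced by other software has to be submitted one at a time. The sketch below mirrors the curl call using only the Python standard library. The names build_job_request and submit_all are hypothetical, the endpoint is the one from the curl example, and treating the HTTP status code as the result is an assumption, since the response format of /process/ is not described here.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: same endpoint as the curl example.
PROCESS_URL = "http://0.0.0.0:5555/process/"


def build_job_request(url, scripts=("srcset", "scroll_everything"), endpoint=PROCESS_URL):
    """Build (but do not send) a form-encoded POST equivalent to the curl call."""
    # doseq=True repeats the key for each list item: scripts=srcset&scripts=scroll_everything
    body = urlencode({"scripts": list(scripts), "url": url}, doseq=True)
    return Request(endpoint, data=body.encode("ascii"), method="POST")


def submit_all(urls, **kwargs):
    """Queue each URL in turn and return the HTTP status code per URL."""
    statuses = {}
    for url in urls:
        # urlopen raises urllib.error.HTTPError on 4xx/5xx responses
        with urlopen(build_job_request(url, **kwargs), timeout=120) as resp:
            statuses[url] = resp.status
    return statuses
```

With the stack running, submit_all(["https://www.peterkrantz.com/"]) would queue a single page. Keeping request construction separate from sending makes the payload easy to inspect without a running instance.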

Archivenow handler

To use Warcworker from archivenow, add a handler file handlers/ww_handler.py like this:

import requests


class WW_handler(object):
    """archivenow handler that queues a URL in Warcworker."""

    def __init__(self):
        self.enabled = True
        self.name = 'Warcworker'
        self.api_required = False

    def push(self, uri_org, p_args=[]):
        msg = ''
        try:
            # add scripts in the order you want them to be run on the page
            payload = {"url": uri_org, "scripts": ["scroll_everything", "srcset"]}

            # form-encoded POST; the list value is sent as repeated "scripts" fields
            r = requests.post('http://0.0.0.0:5555/process/', timeout=120,
                              data=payload,
                              allow_redirects=True)

            # raise on 4xx/5xx so the failure is reported in the message below
            r.raise_for_status()
            return "%s added to queue" % uri_org

        except Exception as e:
            msg = "Error (" + self.name + "): " + str(e)
        return msg


peterk / warcworker
