edgi-govdata-archiving / archivers-harvesting-tools

License: GPL-3.0
ARCHIVED: Collection of scripts and code snippets for data harvesting after generating the zip starter

Programming Languages

python
139335 projects
ruby
36898 projects
shell
77523 projects

Labels

archiving

Projects that are alternatives of or similar to archivers-harvesting-tools

Jarchivelib
A simple archiving and compression library for Java
Stars: ✭ 162 (+422.58%)
Mutual labels:  archiving
warcworker
A dockerized, queued high fidelity web archiver based on Squidwarc
Stars: ✭ 48 (+54.84%)
Mutual labels:  archiving
munin-indexer
A social media open post web archiving tool
Stars: ✭ 16 (-48.39%)
Mutual labels:  archiving
Archivebot
ArchiveBot, an IRC bot for archiving websites
Stars: ✭ 218 (+603.23%)
Mutual labels:  archiving
anchorage
Save your bookmark collection in the Internet Archive, or locally.
Stars: ✭ 19 (-38.71%)
Mutual labels:  archiving
jupyter-archive
A Jupyter/Jupyterlab extension to make, download and extract archive files.
Stars: ✭ 57 (+83.87%)
Mutual labels:  archiving
Archiveis
A simple Python wrapper for the archive.is capturing service
Stars: ✭ 140 (+351.61%)
Mutual labels:  archiving
paperless-ng
A supercharged version of paperless: scan, index and archive all your physical documents
Stars: ✭ 4,840 (+15512.9%)
Mutual labels:  archiving
chronicle-etl
📜 A CLI toolkit for extracting and working with your digital history
Stars: ✭ 78 (+151.61%)
Mutual labels:  archiving
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Stars: ✭ 52 (+67.74%)
Mutual labels:  archiving
Reprozip
ReproZip is a tool that simplifies the process of creating reproducible experiments from command-line executions, a frequently-used common denominator in computational science.
Stars: ✭ 231 (+645.16%)
Mutual labels:  archiving
Unifiedarchive
UnifiedArchive - an archive manager with a unified way for different formats. Supports all basic (listing, reading, extracting and creation) and specific features (compression level, password-protection). Bundled with console program for working with archives.
Stars: ✭ 246 (+693.55%)
Mutual labels:  archiving
deptoolkit
The Toolkit API, app, and browser extension. Start preserving now.
Stars: ✭ 40 (+29.03%)
Mutual labels:  archiving
Pdf Archiver
A tool for tagging files and archiving tasks.
Stars: ✭ 182 (+487.1%)
Mutual labels:  archiving
archiveis
A simple Python wrapper for the archive.is capturing service
Stars: ✭ 152 (+390.32%)
Mutual labels:  archiving
Wikipedia Mirror
🌐 Guide and tools to run a full offline mirror of Wikipedia.org with three different approaches: Nginx caching proxy, Kiwix + ZIM dump, and MediaWiki/XOWA + XML dump
Stars: ✭ 160 (+416.13%)
Mutual labels:  archiving
savepagenow
A simple Python wrapper and command-line interface for archive.org’s "Save Page Now" capturing service
Stars: ✭ 140 (+351.61%)
Mutual labels:  archiving
fimfarchive
Preserves stories from Fimfiction
Stars: ✭ 15 (-51.61%)
Mutual labels:  archiving
i7n-pdfhtml
pdfHTML is an iText 7 add-on for C# (.NET) that allows you to easily convert HTML and CSS into standards compliant PDFs that are accessible, searchable and usable for indexing.
Stars: ✭ 111 (+258.06%)
Mutual labels:  archiving
irc-docs
Collected IRC protocol documentation
Stars: ✭ 47 (+51.61%)
Mutual labels:  archiving

Harvesting Tools

A collection of scripts and code snippets for data harvesting after generating the zip starter.

We welcome tools written in any language, especially if they cover use cases we haven't described. To add a new tool to this repository, please review our Contributing Guidelines.

Usage

  • Familiarize yourself with the harvesting instructions in the DataRescue workflow repo. Within the Archivers app, click Download Zip Starter on the detail page for the URL you have checked out
  • Unzip the zip starter file
  • Choose a tool that seems likely to be helpful in capturing this particular resource, and copy the contents of its directory in this repo to the tools directory (scripted in the sketch after this list), e.g. with:
    cp -r harvesting-tools/TOOLNAME/* RESOURCEUUID/tools/
    
  • Adjust the base URL for the dataset, along with any other relevant variables, and tweak the content of the tool as necessary
  • After the dataset has been harvested, proceed with the further steps in the harvesting instructions
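
If it helps to script the unzip-and-copy steps, here is a minimal Python sketch. The archive name, TOOLNAME, and RESOURCEUUID paths are placeholders standing in for your actual checkout:

    import shutil
    import zipfile
    from pathlib import Path

    # Placeholder paths: substitute the zip starter you downloaded
    # and the tool you chose from this repo.
    starter_zip = Path("RESOURCEUUID.zip")
    tool_dir = Path("harvesting-tools/TOOLNAME")

    # Unzip the zip starter file.
    with zipfile.ZipFile(starter_zip) as zf:
        zf.extractall(starter_zip.stem)

    # Copy the chosen tool's contents into the tools directory.
    dest = Path(starter_zip.stem) / "tools"
    dest.mkdir(parents=True, exist_ok=True)
    for item in tool_dir.iterdir():
        target = dest / item.name
        if item.is_dir():
            shutil.copytree(item, target, dirs_exist_ok=True)  # Python 3.8+
        else:
            shutil.copy2(item, target)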

Matching Tools and Datasets

Each tool in this repo has a fairly specific use case. Your choice will depend on the shape and size of the data you're dealing with. Some datasets will require more creativity and more elaborate tools. If you write a new tool, please add it to the repo.

wget-loop for largely static resources

If you encounter a page that links to lots of data (for example, a "downloads" page), this approach may well work. Only use this approach when you encounter static files: PDFs, .zip archives, .csv datasets, etc.

The tricky part of this approach is generating a list of urls to download from the page.

  • If you're skilled at using scripts in combination with HTML parsers (for example, Python's wonderful Beautiful Soup package), go for it; see the sketch after this list.
  • If the URLs you're trying to access are dynamically generated by JavaScript in the browser environment, we've also included the jquery-url-extraction guide, which walks you through extracting URLs directly from the browser console.
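
As a rough illustration of the first option, here is a minimal Python sketch, assuming a hypothetical downloads page at BASE_URL: it collects links to data files with Beautiful Soup, then fetches each one. Treat it as a starting point, not a drop-in tool:

    import os
    from urllib.parse import urljoin
    from urllib.request import urlretrieve

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical downloads page; replace with the resource you checked out.
    BASE_URL = "https://example.gov/data/downloads/"
    DATA_EXTENSIONS = (".pdf", ".zip", ".csv", ".xls")

    soup = BeautifulSoup(requests.get(BASE_URL).text, "html.parser")

    # Collect absolute URLs for every link that looks like a data file.
    urls = [
        urljoin(BASE_URL, a["href"])
        for a in soup.find_all("a", href=True)
        if a["href"].lower().endswith(DATA_EXTENSIONS)
    ]

    os.makedirs("data", exist_ok=True)
    for url in urls:
        filename = os.path.join("data", url.rsplit("/", 1)[-1])
        print("fetching", url)
        urlretrieve(url, filename)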

download_ftp_tree.py for FTP datasets

Government datasets are often stored on FTP; this script will capture FTP directories and subdirectories.

PLEASE NOTE that the Internet Archive has captured over 100 TB of government FTP resources since December 2016. Be sure to check the URL using check-ia/url-check, the Wayback Machine Extension, or your own tool that uses the Wayback Machine's API (example 1, example 2 w/ wildcard); a sketch of such a check follows. If the FTP directory you're looking at has not been saved to the Internet Archive, be sure that it has also been nominated as a web crawl seed.
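
For a self-rolled check, a minimal Python sketch against the Wayback Machine's availability endpoint (a real, documented API; the URL being checked is just an example) might look like:

    import requests

    def is_archived(url):
        """Ask the Wayback Machine whether it holds a snapshot of `url`."""
        resp = requests.get(
            "https://archive.org/wayback/available",
            params={"url": url},
            timeout=30,
        )
        resp.raise_for_status()
        # An empty "archived_snapshots" object means no capture exists.
        return "closest" in resp.json().get("archived_snapshots", {})

    # Example FTP directory; substitute the resource you checked out.
    print(is_archived("ftp://ftp.example.gov/pub/data/"))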

Whether it has been saved or not, you may decide to download it for chain-of-custody preservation reasons. If so, this script should do what you need.
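
The repo's download_ftp_tree.py is the tool to reach for here. Purely to illustrate the recursive-mirror idea, a stripped-down sketch using Python's standard ftplib (host and directory are placeholders) could look like:

    import os
    from ftplib import FTP, error_perm

    def mirror(ftp, remote_dir, local_dir):
        """Recursively copy remote_dir and its subdirectories to local_dir."""
        os.makedirs(local_dir, exist_ok=True)
        ftp.cwd(remote_dir)
        for name in ftp.nlst():
            if name in (".", ".."):
                continue
            local_path = os.path.join(local_dir, name)
            try:
                # If cwd succeeds, `name` is a directory: recurse into it.
                ftp.cwd(name)
                ftp.cwd("..")
                mirror(ftp, remote_dir + "/" + name, local_path)
                ftp.cwd(remote_dir)
            except error_perm:
                # Otherwise treat it as a file and download it.
                with open(local_path, "wb") as fh:
                    ftp.retrbinary("RETR " + name, fh.write)

    # Placeholder host and directory; substitute the real FTP resource.
    ftp = FTP("ftp.example.gov")
    ftp.login()  # anonymous login
    mirror(ftp, "/pub/data", "data")
    ftp.quit()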

Ruby/Watir for full browser automation

The last resort for harvesting should be driving a full web browser. It is slower than other approaches such as wget, curl, or a headless browser, and this implementation is prone to saving the resulting page before it has finished loading. There is a Ruby example in tools/example-hacks/watir.rb.
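
The repo's example uses Ruby and Watir; to keep this document's sketches in one language, here is the same idea with Python and Selenium (an analogous library, not the repo's tool), waiting for the document to finish loading before saving it. The URL and output filename are placeholders:

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    # Placeholder URL; substitute the page you need to capture.
    URL = "https://example.gov/dynamic-report"

    driver = webdriver.Firefox()
    try:
        driver.get(URL)
        # Guard against saving the page before it has finished loading:
        # wait until the browser reports the document is fully loaded.
        WebDriverWait(driver, 30).until(
            lambda d: d.execute_script("return document.readyState") == "complete"
        )
        with open("page.html", "w", encoding="utf-8") as fh:
            fh.write(driver.page_source)
    finally:
        driver.quit()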

Identify Data Links & acquire them via WARCFactory

For search results from large document sets, you may need to do more sophisticated "scraping" and "crawling"; check out tools built at previous events, such as the EIS WARC archiver or the EPA Search Utils, for ideas on how to proceed.
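
If you end up writing WARCs yourself, one approachable route is the warcio library's capture_http helper, which records traffic made with the requests library into a WARC file. A minimal sketch follows; the URLs are placeholders, and note that requests must be imported after capture_http for the capture to work:

    from warcio.capture_http import capture_http
    import requests  # must be imported after capture_http

    # Placeholder crawl: record each fetched page into a WARC file.
    urls = [
        "https://example.gov/search?page=1",
        "https://example.gov/search?page=2",
    ]

    with capture_http("harvest.warc.gz"):
        for url in urls:
            requests.get(url)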

API scrape / Custom Solution

If you encounter an API, chances are you'll have to build some sort of custom solution, like epa-envirofacts-scraper, or investigate a social angle, for example asking someone with greater access for a database dump. Be sure to include your code in the tools directory of your zipfile, and if it has any likelihood of general application, please add it to this repo.
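
The shape of a custom API scraper is usually the same: page through the endpoint, save each response, and stop when it runs dry. A minimal Python sketch against a hypothetical paginated JSON endpoint (the URL and parameter names are made up for illustration):

    import json

    import requests

    # Hypothetical endpoint and paging parameters; adapt to the real API.
    ENDPOINT = "https://api.example.gov/v1/records"
    PAGE_SIZE = 100

    with open("records.jsonl", "w", encoding="utf-8") as out:
        page = 1
        while True:
            resp = requests.get(
                ENDPOINT,
                params={"page": page, "per_page": PAGE_SIZE},
                timeout=60,
            )
            resp.raise_for_status()
            records = resp.json()
            if not records:
                break  # the API ran out of pages
            for record in records:
                out.write(json.dumps(record) + "\n")
            page += 1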

Explore Other Tools

The utils directory is for scripts that have been useful in the past but may not have very general application. You still might find something you like!

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].