Heritrix3: Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Archivebox: 🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
warc: ⚙️ A Rust library for reading and writing WARC files
node-warc: Parse and create Web ARChive (WARC) files with Node.js
warc: 📇 Tools to work with the Web Archive ecosystem in R
wget-lua: Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
wail: 🐋 One-Click User Instigated Preservation
CommonCrawlDocumentDownload: A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing frameworks such as Apache POI and Apache Tika.