Top 10 warc open source projects

Heritrix3
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Archivebox
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
warc
⚙️ A Rust library for reading and writing WARC files
warc
📇 Tools to Work with the Web Archive Ecosystem in R
mixnode-warcreader-php
Read Web ARChive (WARC) files in PHP.
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
CommonCrawlDocumentDownload
A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-testing of frameworks like Apache POI and Apache Tika
1-10 of 10 warc projects