Top 10 warc open source projects

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

✭ 2,104

java javascript HTML Rich Text Format FreeMarker PostScript warc heritrix webcrawling

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

⚙️ A Rust library for reading and writing WARC files

✭ 26

rust rust-library warc

Parse and create Web ARChive (WARC) files with Node.js

✭ 69

javascript warc web-archiving webarchive web-archives webarchiving warc-files chrome-remote-interface pupeteer

📇 Tools to Work with the Web Archive Ecosystem in R

✭ 21

r C++ rstats warc warc-files r-cyber warc-ecosystem

mixnode-warcreader-php

Read Web ARChive (WARC) files in PHP.

✭ 20

PHP warc webarchive

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

✭ 52

c python perl shell Module Management System M4 crawler scraper downloader spider ftp scraping crawling archiving wget crawl zstd crawlers warc webarchiving archiveteam wget-lua

🐋 One-Click User Instigated Preservation

✭ 107

electron warc web-archiving high-fidelity-preservation browser-based-presrevation

CommonCrawlDocumentDownload

A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing frameworks such as Apache POI and Apache Tika.

✭ 43

java shell mime-types warc cdx-files commoncrawl

chatnoir-resiliparse

A robust web archive analytics toolkit

✭ 26

cython python c web bigdata extraction warc webarchive htmlparser