22 open source projects that are alternatives to or similar to warc

node-warc
Parse And Create Web ARChive (WARC) files with node.js
Stars: ✭ 69 (+228.57%)
Mutual labels:  warc, warc-files
greynoise
Query the 'GreyNoise Intelligence' 'API' in R
Stars: ✭ 15 (-28.57%)
Mutual labels:  r-cyber
mixnode-warcreader-php
Read Web ARChive (WARC) files in PHP.
Stars: ✭ 20 (-4.76%)
Mutual labels:  warc
xattrs
🗃 Work With Filesystem Object Extended Attributes — https://hrbrmstr.github.io/xattrs/index.html
Stars: ✭ 17 (-19.05%)
Mutual labels:  r-cyber
urlscan
👀 Analyze Websites and Resources They Request
Stars: ✭ 21 (+0%)
Mutual labels:  r-cyber
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Stars: ✭ 52 (+147.62%)
Mutual labels:  warc
wail
🐋 One-Click User Instigated Preservation
Stars: ✭ 107 (+409.52%)
Mutual labels:  warc
htmlunit
🕸🧰☕️Tools to Scrape Dynamic Web Content via the 'HtmlUnit' Java Library
Stars: ✭ 39 (+85.71%)
Mutual labels:  r-cyber
CommonCrawlDocumentDownload
A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing frameworks like Apache POI and Apache Tika.
Stars: ✭ 43 (+104.76%)
Mutual labels:  warc
mhn
🍯 Analyze and Visualize Data from Modern Honey Network Servers with R
Stars: ✭ 16 (-23.81%)
Mutual labels:  r-cyber
reapr
🕸→ℹ️ Reap Information from Websites
Stars: ✭ 14 (-33.33%)
Mutual labels:  r-cyber
gdns
Tools to work with the Google DNS over HTTPS API in R
Stars: ✭ 23 (+9.52%)
Mutual labels:  r-cyber
chatnoir-resiliparse
A robust web archive analytics toolkit
Stars: ✭ 26 (+23.81%)
Mutual labels:  warc
curlconverter
➰ ➡️ ➖ Translate cURL command lines into parameters for use with httr or actual httr calls (R)
Stars: ✭ 86 (+309.52%)
Mutual labels:  r-cyber
pdfbox
📄◻️ Create, Manipulate and Extract Data from PDF Files (R Apache PDFBox wrapper)
Stars: ✭ 46 (+119.05%)
Mutual labels:  r-cyber
wayback
⏪ Tools to Work with the Various Internet Archive Wayback Machine APIs
Stars: ✭ 52 (+147.62%)
Mutual labels:  r-cyber
Heritrix3
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Stars: ✭ 2,104 (+9919.05%)
Mutual labels:  warc
Archivebox
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
Stars: ✭ 12,383 (+58866.67%)
Mutual labels:  warc
warc
⚙️ A Rust library for reading and writing WARC files
Stars: ✭ 26 (+23.81%)
Mutual labels:  warc
shodan
🌑 R package to work with the Shodan API
Stars: ✭ 16 (-23.81%)
Mutual labels:  r-cyber
webhose
🔨 Tools to Work with the 'webhose.io' 'API' in R
Stars: ✭ 12 (-42.86%)
Mutual labels:  r-cyber
jericho
📔 Extract plain or structured text from HTML content in R
Stars: ✭ 14 (-33.33%)
Mutual labels:  r-cyber