A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing frameworks such as Apache POI and Apache Tika.

Stars: ✭ 43 (+104.76%)

Mutual labels: warc

mhn

🍯 Analyze and Visualize Data from Modern Honey Network Servers with R

Stars: ✭ 16 (-23.81%)

Mutual labels: r-cyber

reapr

🕸→ℹ️ Reap Information from Websites

Stars: ✭ 14 (-33.33%)

Mutual labels: r-cyber

gdns

Tools to work with the Google DNS over HTTPS API in R

Stars: ✭ 23 (+9.52%)

Mutual labels: r-cyber

chatnoir-resiliparse

A robust web archive analytics toolkit

Stars: ✭ 26 (+23.81%)

Mutual labels: warc

curlconverter

➰ ➡️ ➖ Translate cURL command lines into parameters for use with httr or actual httr calls (R)

Stars: ✭ 86 (+309.52%)

Mutual labels: r-cyber

pdfbox

📄◻️ Create, Manipulate and Extract Data from PDF Files (R Apache PDFBox wrapper)

Stars: ✭ 46 (+119.05%)

Mutual labels: r-cyber

wayback

⏪ Tools to Work with the Various Internet Archive Wayback Machine APIs

Stars: ✭ 52 (+147.62%)

Mutual labels: r-cyber

Heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

Stars: ✭ 2,104 (+9919.05%)

Mutual labels: warc

Archivebox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Stars: ✭ 12,383 (+58866.67%)

Mutual labels: warc

warc

⚙️ A Rust library for reading and writing WARC files

Stars: ✭ 26 (+23.81%)

Mutual labels: warc

shodan

🌑 R package to work with the Shodan API

Stars: ✭ 16 (-23.81%)

Mutual labels: r-cyber

webhose

🔨 Tools to Work with the 'webhose.io' 'API' in R

Stars: ✭ 12 (-42.86%)

Mutual labels: r-cyber

jericho

📔 Extract plain or structured text from HTML content in R

Stars: ✭ 14 (-33.33%)

Mutual labels: r-cyber
