
N0taN3rd / Squidwarc

License: apache-2.0
Squidwarc is a high-fidelity, user-scriptable, archival crawler that uses Chrome or Chromium with or without a head.

Programming Languages

javascript
184084 projects - #8 most used programming language

Projects that are alternatives of or similar to Squidwarc

Headless Chrome Crawler
Distributed crawler powered by Headless Chrome
Stars: ✭ 5,129 (+4003.2%)
Mutual labels:  crawler, crawling, chrome, puppeteer, headless-chrome
Puppeteer Sharp Extra
Plugin framework for PuppeteerSharp
Stars: ✭ 39 (-68.8%)
Mutual labels:  chrome, puppeteer, headless-chrome, chrome-headless
Jvppeteer
Headless Chrome For Java (Java 爬虫)
Stars: ✭ 193 (+54.4%)
Mutual labels:  crawler, chrome, puppeteer, chrome-headless
Webster
a reliable high-level web crawling & scraping framework for Node.js.
Stars: ✭ 364 (+191.2%)
Mutual labels:  crawler, crawling, puppeteer, headless-chrome
bots-zoo
No description or website provided.
Stars: ✭ 59 (-52.8%)
Mutual labels:  crawler, crawling, puppeteer
puppet-master
Puppeteer as a service hosted on Saasify.
Stars: ✭ 25 (-80%)
Mutual labels:  crawling, headless-chrome, puppeteer
Puppeteer Walker
a puppeteer walker 🕷 🕸
Stars: ✭ 78 (-37.6%)
Mutual labels:  crawler, chrome, puppeteer
Ferret
Declarative web scraping
Stars: ✭ 4,837 (+3769.6%)
Mutual labels:  crawler, crawling, chrome
Cdp4j
cdp4j - Chrome DevTools Protocol for Java
Stars: ✭ 232 (+85.6%)
Mutual labels:  crawling, chrome, chrome-headless
Apify Js
Apify SDK — The scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.
Stars: ✭ 3,154 (+2423.2%)
Mutual labels:  crawling, puppeteer, headless-chrome
Sms Boom
利用chrome的headless模式,模拟用户注册进行短信轰炸机
Stars: ✭ 507 (+305.6%)
Mutual labels:  chrome, puppeteer, chrome-headless
Puppeteer Lambda Starter Kit
Starter Kit for running Headless-Chrome by Puppeteer on AWS Lambda.
Stars: ✭ 563 (+350.4%)
Mutual labels:  chrome, puppeteer, headless-chrome
Url To Pdf Api
Web page PDF/PNG rendering done right. Self-hosted service for rendering receipts, invoices, or any content.
Stars: ✭ 6,544 (+5135.2%)
Mutual labels:  chrome, puppeteer, headless-chrome
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (+36.8%)
Mutual labels:  crawler, crawling, puppeteer
Awesome Puppeteer
A curated list of awesome puppeteer resources.
Stars: ✭ 1,728 (+1282.4%)
Mutual labels:  crawling, puppeteer, headless-chrome
Rendora
dynamic server-side rendering using headless Chrome to effortlessly solve the SEO problem for modern javascript websites
Stars: ✭ 1,853 (+1382.4%)
Mutual labels:  crawler, puppeteer, chrome-headless
Api
API that uncovers the technologies used on websites and generates thumbnail from screenshot of website
Stars: ✭ 189 (+51.2%)
Mutual labels:  chrome, headless-chrome, chrome-headless
Puppeteer Extra
💯 Teach puppeteer new tricks through plugins.
Stars: ✭ 3,397 (+2617.6%)
Mutual labels:  chrome, puppeteer, headless-chrome
Puppeteer Deep
Puppeteer, Headless Chrome;爬取《es6标准入门》、自动推文到掘金、站点性能分析;高级爬虫、自动化UI测试、性能分析;
Stars: ✭ 1,033 (+726.4%)
Mutual labels:  chrome, puppeteer, headless-chrome
Gowitness
🔍 gowitness - a golang, web screenshot utility using Chrome Headless
Stars: ✭ 996 (+696.8%)
Mutual labels:  chrome, headless-chrome, chrome-headless

Squidwarc

Squidwarc is a high-fidelity, user-scriptable, archival crawler that uses Chrome or Chromium with or without a head.

Squidwarc aims to address the need for a high-fidelity crawler akin to Heritrix while still being easy enough for the personal archivist to set up and use.

Squidwarc does not seek (at the moment) to dethrone Heritrix as the queen of wide archival crawls, but rather seeks to address Heritrix's shortcomings, namely:

  • No JavaScript execution
  • Everything is plain text
  • Requiring configuration to know how to preserve the web
  • Setup time and technical knowledge required of its users

For more information about this see

Squidwarc is built using Node.js, node-warc, and chrome-remote-interface.

If running a crawler through the command line is not your thing, then Squidwarc highly recommends warcworker, a web front end for Squidwarc by @peterk.

If you are unable to install Node on your system but have Docker, then you can use the provided Dockerfile or Compose file.

If you have neither then Squidwarc highly recommends WARCreate or WAIL. WARCreate did this first and if it had not Squidwarc would not exist 💕

If recording the web is what you seek, Squidwarc highly recommends Webrecorder.

Out Of The Box Crawls

Page Only

Preserves only the page; no links are followed

Page + Same Domain Links

The Page Only option plus preservation of all links found on the page that are on the same domain as the page

Page + All internal and external links

The Page + Same Domain Links option plus all links from other domains
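These crawl types correspond to the `mode` field of the crawl configuration file shown further below. As a sketch only (the exact mode strings other than `page-only` are assumptions inferred from the option names above; consult the manual for the authoritative values):

```json
{
  "mode": "page-only"
}
```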

Usage

Squidwarc uses a bootstrapping script to install dependencies. First, get the latest version from source:

$ git clone https://github.com/N0taN3rd/Squidwarc
$ cd Squidwarc

Then run the bootstrapping script to install the dependencies:

$ ./bootstrap.sh

Once the dependencies have been installed you can start a pre-configured (but customizable) crawl with either:

$ ./run-crawler.sh -c conf.json

or:

$ node index.js -c conf.json

Config file

The conf.json example below is shown with annotations for explanation; note that the annotations (comments) are not valid JSON and must be removed before the file is used.

For more detailed information about the crawl configuration file and its fields, please consult the manual available online.

{
  "mode": "page-only", // the mode you wish to crawl using
  "depth": 1, // how many hops out do you wish to crawl

  // path to the script you want Squidwarc to run per page. See `userFns.js` for more information
  "script": "./userFns.js",
  // the crawl's starting points
  "seeds": [
    "https://www.instagram.com/visit_berlin/"
  ],

  "warc": {
    "naming": "url", // currently this is the only option supported do not change.....
    "append": false // do you want this crawl to use a save all preserved data to a single WARC or WARC per page
  },

  // The Chrome instance we are to connect to is running on host, port.
  // Must match the --remote-debugging-port=<port> set when Squidwarc is connecting to an already running instance of Chrome.
  // localhost is the default host when only --remote-debugging-port is set.
  "connect": {
    "launch": true, // if you want Squidwarc to attempt to launch the version of Chrome already on your system or not
    "host": "localhost",
    "port": 9222
  },

  // time is in milliseconds
  "crawlControl": {
    "globalWait": 60000, // maximum time spent visiting a page
    "inflightIdle": 1000, // how long to wait for until network idle is determined when there are only `numInflight` (no response recieved) requests
    "numInflight": 2, // when there are only N inflight (no response recieved) requests start network idle count down
    "navWait": 8000 // wait at maximum 8 seconds for Chrome to navigate to a page
  }
}
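Because the annotated example above is not valid JSON, the comments must be removed before the file can be parsed. A naive helper like the one below (not part of Squidwarc; it will mangle `//` sequences inside string values such as URLs, so use it only on comment lines like the example's) illustrates the point:

```javascript
// Strip trailing // line comments so the annotated example parses as JSON.
// Caveat: naive — breaks on "//" inside string values (e.g. "https://...").
function stripComments (text) {
  return text
    .split('\n')
    .map(line => line.replace(/\/\/.*$/, ''))
    .join('\n')
}

const annotated = `{
  "mode": "page-only",
  "depth": 1
}`

const conf = JSON.parse(stripComments(annotated))
console.log(conf.mode, conf.depth)
```

In practice, simply keep a comment-free copy of conf.json for the crawler and treat the annotated version as documentation.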

JavaScript Style Guide

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].