
N0taN3rd / Squidwarc

License: apache-2.0
Squidwarc is a high-fidelity, user-scriptable, archival crawler that uses Chrome or Chromium with or without a head.

Programming Languages

javascript
184084 projects - #8 most used programming language

Projects that are alternatives of or similar to Squidwarc

Headless Chrome Crawler
Distributed crawler powered by Headless Chrome
Stars: ✭ 5,129 (+4003.2%)
Mutual labels:  crawler, crawling, chrome, puppeteer, headless-chrome
Puppeteer Sharp Extra
Plugin framework for PuppeteerSharp
Stars: ✭ 39 (-68.8%)
Mutual labels:  chrome, puppeteer, headless-chrome, chrome-headless
Jvppeteer
Headless Chrome For Java (Java 爬虫)
Stars: ✭ 193 (+54.4%)
Mutual labels:  crawler, chrome, puppeteer, chrome-headless
Webster
a reliable high-level web crawling & scraping framework for Node.js.
Stars: ✭ 364 (+191.2%)
Mutual labels:  crawler, crawling, puppeteer, headless-chrome
bots-zoo
No description or website provided.
Stars: ✭ 59 (-52.8%)
Mutual labels:  crawler, crawling, puppeteer
puppet-master
Puppeteer as a service hosted on Saasify.
Stars: ✭ 25 (-80%)
Mutual labels:  crawling, headless-chrome, puppeteer
Puppeteer Walker
a puppeteer walker 🕷 🕸
Stars: ✭ 78 (-37.6%)
Mutual labels:  crawler, chrome, puppeteer
Ferret
Declarative web scraping
Stars: ✭ 4,837 (+3769.6%)
Mutual labels:  crawler, crawling, chrome
Cdp4j
cdp4j - Chrome DevTools Protocol for Java
Stars: ✭ 232 (+85.6%)
Mutual labels:  crawling, chrome, chrome-headless
Apify Js
Apify SDK — The scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.
Stars: ✭ 3,154 (+2423.2%)
Mutual labels:  crawling, puppeteer, headless-chrome
Sms Boom
利用chrome的headless模式,模拟用户注册进行短信轰炸机
Stars: ✭ 507 (+305.6%)
Mutual labels:  chrome, puppeteer, chrome-headless
Puppeteer Lambda Starter Kit
Starter Kit for running Headless-Chrome by Puppeteer on AWS Lambda.
Stars: ✭ 563 (+350.4%)
Mutual labels:  chrome, puppeteer, headless-chrome
Url To Pdf Api
Web page PDF/PNG rendering done right. Self-hosted service for rendering receipts, invoices, or any content.
Stars: ✭ 6,544 (+5135.2%)
Mutual labels:  chrome, puppeteer, headless-chrome
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (+36.8%)
Mutual labels:  crawler, crawling, puppeteer
Awesome Puppeteer
A curated list of awesome puppeteer resources.
Stars: ✭ 1,728 (+1282.4%)
Mutual labels:  crawling, puppeteer, headless-chrome
Rendora
dynamic server-side rendering using headless Chrome to effortlessly solve the SEO problem for modern javascript websites
Stars: ✭ 1,853 (+1382.4%)
Mutual labels:  crawler, puppeteer, chrome-headless
Api
API that uncovers the technologies used on websites and generates thumbnail from screenshot of website
Stars: ✭ 189 (+51.2%)
Mutual labels:  chrome, headless-chrome, chrome-headless
Puppeteer Extra
💯 Teach puppeteer new tricks through plugins.
Stars: ✭ 3,397 (+2617.6%)
Mutual labels:  chrome, puppeteer, headless-chrome
Puppeteer Deep
Puppeteer, Headless Chrome;爬取《es6标准入门》、自动推文到掘金、站点性能分析;高级爬虫、自动化UI测试、性能分析;
Stars: ✭ 1,033 (+726.4%)
Mutual labels:  chrome, puppeteer, headless-chrome
Gowitness
🔍 gowitness - a golang, web screenshot utility using Chrome Headless
Stars: ✭ 996 (+696.8%)
Mutual labels:  chrome, headless-chrome, chrome-headless

Squidwarc

Squidwarc is a high-fidelity, user-scriptable, archival crawler that uses Chrome or Chromium with or without a head.

Squidwarc aims to address the need for a high-fidelity crawler akin to Heritrix while still being easy enough for the personal archivist to set up and use.

Squidwarc does not seek (at the moment) to dethrone Heritrix as the queen of wide archival crawls, but rather seeks to address Heritrix's shortcomings, namely:

  • No JavaScript execution
  • Everything is plain text
  • Requiring configuration to know how to preserve the web
  • Setup time and technical knowledge required of its users

For more information about this see

Squidwarc is built using Node.js, node-warc, and chrome-remote-interface.

If running a crawler through the command line is not your thing, then Squidwarc highly recommends warcworker, a web front end for Squidwarc by @peterk.

If you are unable to install Node on your system but have Docker, then you can use the provided Dockerfile or Compose file.

If you have neither then Squidwarc highly recommends WARCreate or WAIL. WARCreate did this first and if it had not Squidwarc would not exist 💕

If recording the web is what you seek, Squidwarc highly recommends Webrecorder.

Out Of The Box Crawls

Page Only

Preserves only the page; no links are followed

Page + Same Domain Links

The Page Only option plus preservation of all links found on the page that are on the same domain as the page

Page + All internal and external links

The Page + Same Domain Links option plus all links from other domains
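These crawl types correspond to the `mode` field of the crawl configuration file shown further below. As a sketch only (the exact mode strings other than `page-only` are assumptions inferred from the option names above; consult the manual for the authoritative values):

```json
{
  "mode": "page-only"
}
```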

Usage

Squidwarc uses a bootstrapping script to install dependencies. First, get the latest version from source:

$ git clone https://github.com/N0taN3rd/Squidwarc
$ cd Squidwarc

Then run the bootstrapping script to install the dependencies:

$ ./bootstrap.sh

Once the dependencies have been installed you can start a pre-configured (but customizable) crawl with either:

$ ./run-crawler.sh -c conf.json

or:

$ node index.js -c conf.json

Config file

The conf.json example below is shown with annotations for explanation; note that the annotations (comments) are not valid JSON and must be removed before the file is used.

For more detailed information about the crawl configuration file and its fields, please consult the manual available online.

{
  "mode": "page-only", // the mode you wish to crawl using
  "depth": 1, // how many hops out do you wish to crawl

  // path to the script you want Squidwarc to run per page. See `userFns.js` for more information
  "script": "./userFns.js",
  // the crawl's starting points
  "seeds": [
    "https://www.instagram.com/visit_berlin/"
  ],

  "warc": {
    "naming": "url", // currently this is the only option supported do not change.....
    "append": false // do you want this crawl to use a save all preserved data to a single WARC or WARC per page
  },

  // The Chrome instance we are to connect to is running on host, port.
  // Must match the --remote-debugging-port=<port> set when Squidwarc is connecting to an already running instance of Chrome.
  // localhost is the default host when only --remote-debugging-port is set.
  "connect": {
    "launch": true, // if you want Squidwarc to attempt to launch the version of Chrome already on your system or not
    "host": "localhost",
    "port": 9222
  },

  // time is in milliseconds
  "crawlControl": {
    "globalWait": 60000, // maximum time spent visiting a page
    "inflightIdle": 1000, // how long to wait for until network idle is determined when there are only `numInflight` (no response recieved) requests
    "numInflight": 2, // when there are only N inflight (no response recieved) requests start network idle count down
    "navWait": 8000 // wait at maximum 8 seconds for Chrome to navigate to a page
  }
}
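Because the annotated example above is not valid JSON, the comments must be removed before the file can be parsed. A naive helper like the one below (not part of Squidwarc; it will mangle `//` sequences inside string values such as URLs, so use it only on comment lines like the example's) illustrates the point:

```javascript
// Strip trailing // line comments so the annotated example parses as JSON.
// Caveat: naive — breaks on "//" inside string values (e.g. "https://...").
function stripComments (text) {
  return text
    .split('\n')
    .map(line => line.replace(/\/\/.*$/, ''))
    .join('\n')
}

const annotated = `{
  "mode": "page-only",
  "depth": 1
}`

const conf = JSON.parse(stripComments(annotated))
console.log(conf.mode, conf.depth)
```

In practice, simply keep a comment-free copy of conf.json for the crawler and treat the annotated version as documentation.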

JavaScript Style Guide

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].