License: MIT
A crawler based on PhantomJS. Allows discovery of dynamic content and supports custom scrapers.

Programming Languages

javascript
184084 projects - #8 most used programming language

Projects that are alternatives of or similar to crawlkit

Headless Chrome Crawler
Distributed crawler powered by Headless Chrome
Stars: ✭ 5,129 (+22200%)
Mutual labels:  scraper, crawling
Newspaper
News, full-text, and article metadata extraction in Python 3. Advanced docs:
Stars: ✭ 11,545 (+50095.65%)
Mutual labels:  scraper, crawling
Scrapyrt
HTTP API for Scrapy spiders
Stars: ✭ 637 (+2669.57%)
Mutual labels:  scraper, crawling
Crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
Stars: ✭ 440 (+1813.04%)
Mutual labels:  scraper, crawling
web-crawler
Python Web Crawler with Selenium and PhantomJS
Stars: ✭ 19 (-17.39%)
Mutual labels:  scraper, phantomjs
Dataflowkit
Extract structured data from websites.
Stars: ✭ 456 (+1882.61%)
Mutual labels:  scraper, crawling
Lambda Phantom Scraper
PhantomJS/Node.js web scraper for AWS Lambda
Stars: ✭ 93 (+304.35%)
Mutual labels:  scraper, phantomjs
img-cli
An interactive command-line interface built in Node.js for downloading one or more images to disk from a URL
Stars: ✭ 15 (-34.78%)
Mutual labels:  phantomjs, crawling
Goose Parser
Universal scraping tool that lets you extract data using multiple environments
Stars: ✭ 211 (+817.39%)
Mutual labels:  scraper, phantomjs
Colly
Elegant Scraper and Crawler Framework for Golang
Stars: ✭ 15,535 (+67443.48%)
Mutual labels:  scraper, crawling
bots-zoo
No description or website provided.
Stars: ✭ 59 (+156.52%)
Mutual labels:  scraper, crawling
proxycrawl-python
ProxyCrawl Python library for scraping and crawling
Stars: ✭ 51 (+121.74%)
Mutual labels:  scraper, crawling
kick-off-web-scraping-python-selenium-beautifulsoup
A tutorial-based introduction to web scraping with Python.
Stars: ✭ 18 (-21.74%)
Mutual labels:  scraper, phantomjs
Ferret
Declarative web scraping
Stars: ✭ 4,837 (+20930.43%)
Mutual labels:  scraper, crawling
Mimo-Crawler
A web crawler that uses Firefox and js injection to interact with webpages and crawl their content, written in nodejs.
Stars: ✭ 22 (-4.35%)
Mutual labels:  scraper, crawling
Lulu
[Unmaintained] A simple and clean video/music/image downloader 👾
Stars: ✭ 789 (+3330.43%)
Mutual labels:  scraper, crawling
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (+643.48%)
Mutual labels:  scraper, crawling
diffbot-php-client
[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library
Stars: ✭ 53 (+130.43%)
Mutual labels:  scraper, crawling
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Stars: ✭ 52 (+126.09%)
Mutual labels:  scraper, crawling
InstagramLocationScraper
No description or website provided.
Stars: ✭ 13 (-43.48%)
Mutual labels:  scraper

CrawlKit

A crawler based on PhantomJS. Allows discovery of dynamic content and supports custom scrapers. For all your ajaxy crawling & scraping needs.

  • Parallel crawling/scraping via Phantom pooling.
  • Custom-defined link discovery.
  • Custom-defined runners (scrape, test, validate, etc.)
  • Follows redirects — and because it is based on PhantomJS, JavaScript redirects are followed as well as <meta> refresh redirects.
  • Streaming
  • Resilient to PhantomJS crashes
  • Ignores page errors

Install

npm install crawlkit --save

Usage

const CrawlKit = require('crawlkit');
const anchorFinder = require('crawlkit/finders/genericAnchors');

const crawler = new CrawlKit('http://your/page');
crawler.setFinder({
    getRunnable: () => anchorFinder
});

crawler.crawl()
    .then((results) => {
        console.log(JSON.stringify(results, null, 2));
    }, (err) => console.error(err));

Also, have a look at the samples.
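Beyond finders, pages can be processed with custom runners. The following is only a sketch of a hypothetical `title` runner, assuming the runner interface described in the API docs: `getCompanionFiles()` lists extra scripts to inject into the page, and `getRunnable()` returns a function that is evaluated inside the page and reports back via `window.callPhantom(error, result)`.

```javascript
// Sketch of a hypothetical custom runner named 'title' (assumed interface;
// see the API docs for the authoritative contract).
const titleRunner = {
    // No companion scripts need to be injected for this runner.
    getCompanionFiles: () => [],
    // The returned function runs inside the PhantomJS page context.
    getRunnable: () => function extractTitle() {
        // Report the page title back to the crawler.
        window.callPhantom(null, document.title);
    },
};

// Wiring it up (assumes `crawler` is a CrawlKit instance as in the Usage example):
// crawler.addRunner('title', titleRunner);
```

The runner's result then appears in the crawl results keyed by the name it was registered under.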

API

See the API docs (published) or the docs on doclets.io (live).

Debugging

CrawlKit uses debug for logging. In short, set the environment variable DEBUG="*" before starting your app to get all the logs. A saner configuration for large pages is probably DEBUG="*:info,*:error,-crawlkit:pool*".
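The same configuration can also be applied from code rather than the shell. A minimal sketch, assuming `DEBUG` is assigned before `debug` (and therefore CrawlKit) is first required, since `debug` reads the variable at load time:

```javascript
// Enable verbose CrawlKit logging for this process only.
// Assumption: this must run before `debug`/`crawlkit` are first required,
// because `debug` reads process.env.DEBUG when it is loaded.
process.env.DEBUG = '*:info,*:error,-crawlkit:pool*';

// const CrawlKit = require('crawlkit'); // require only after DEBUG is set
```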

Contributing

Please contribute away :)

Please add tests for new functionality and adapt existing tests when behaviour changes.

The commit messages need to follow the conventional changelog format so that semantic-release picks the correct semver version. It is probably easiest to install commitizen via npm install -g commitizen and commit your changes with git cz.

Available runners

Products using CrawlKit
