License: MIT
A crawler based on PhantomJS. Allows discovery of dynamic content and supports custom scrapers.

Programming Languages

javascript
184084 projects - #8 most used programming language

Projects that are alternatives of or similar to crawlkit

Headless Chrome Crawler
Distributed crawler powered by Headless Chrome
Stars: ✭ 5,129 (+22200%)
Mutual labels:  scraper, crawling
Newspaper
News, full-text, and article metadata extraction in Python 3. Advanced docs:
Stars: ✭ 11,545 (+50095.65%)
Mutual labels:  scraper, crawling
Scrapyrt
HTTP API for Scrapy spiders
Stars: ✭ 637 (+2669.57%)
Mutual labels:  scraper, crawling
Crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
Stars: ✭ 440 (+1813.04%)
Mutual labels:  scraper, crawling
web-crawler
Python Web Crawler with Selenium and PhantomJS
Stars: ✭ 19 (-17.39%)
Mutual labels:  scraper, phantomjs
Dataflowkit
Extract structured data from websites.
Stars: ✭ 456 (+1882.61%)
Mutual labels:  scraper, crawling
Lambda Phantom Scraper
PhantomJS/Node.js web scraper for AWS Lambda
Stars: ✭ 93 (+304.35%)
Mutual labels:  scraper, phantomjs
img-cli
An interactive command-line interface built in Node.js for downloading one or more images to disk from a URL
Stars: ✭ 15 (-34.78%)
Mutual labels:  phantomjs, crawling
Goose Parser
Universal scraping tool that lets you extract data using multiple environments
Stars: ✭ 211 (+817.39%)
Mutual labels:  scraper, phantomjs
Colly
Elegant Scraper and Crawler Framework for Golang
Stars: ✭ 15,535 (+67443.48%)
Mutual labels:  scraper, crawling
bots-zoo
No description or website provided.
Stars: ✭ 59 (+156.52%)
Mutual labels:  scraper, crawling
proxycrawl-python
ProxyCrawl Python library for scraping and crawling
Stars: ✭ 51 (+121.74%)
Mutual labels:  scraper, crawling
kick-off-web-scraping-python-selenium-beautifulsoup
A tutorial-based introduction to web scraping with Python.
Stars: ✭ 18 (-21.74%)
Mutual labels:  scraper, phantomjs
Ferret
Declarative web scraping
Stars: ✭ 4,837 (+20930.43%)
Mutual labels:  scraper, crawling
Mimo-Crawler
A web crawler that uses Firefox and js injection to interact with webpages and crawl their content, written in nodejs.
Stars: ✭ 22 (-4.35%)
Mutual labels:  scraper, crawling
Lulu
[Unmaintained] A simple and clean video/music/image downloader 👾
Stars: ✭ 789 (+3330.43%)
Mutual labels:  scraper, crawling
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (+643.48%)
Mutual labels:  scraper, crawling
diffbot-php-client
[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library
Stars: ✭ 53 (+130.43%)
Mutual labels:  scraper, crawling
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Stars: ✭ 52 (+126.09%)
Mutual labels:  scraper, crawling
InstagramLocationScraper
No description or website provided.
Stars: ✭ 13 (-43.48%)
Mutual labels:  scraper

CrawlKit

A crawler based on PhantomJS. Allows discovery of dynamic content and supports custom scrapers. For all your ajaxy crawling & scraping needs.

  • Parallel crawling/scraping via Phantom pooling.
  • Custom-defined link discovery.
  • Custom-defined runners (scrape, test, validate, etc.)
  • Follows redirects — and because it is based on PhantomJS, JavaScript redirects are followed as well as <meta> refresh redirects.
  • Streaming
  • Resilient to PhantomJS crashes
  • Ignores page errors

Install

npm install crawlkit --save

Usage

const CrawlKit = require('crawlkit');
const anchorFinder = require('crawlkit/finders/genericAnchors');

const crawler = new CrawlKit('http://your/page');
crawler.setFinder({
    getRunnable: () => anchorFinder
});

crawler.crawl()
    .then((results) => {
        console.log(JSON.stringify(results, null, 2));
    }, (err) => console.error(err));

Also, have a look at the samples.
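Beyond finders, pages can be processed with custom runners. The following is only a sketch of a hypothetical `title` runner, assuming the runner interface described in the API docs: `getCompanionFiles()` lists extra scripts to inject into the page, and `getRunnable()` returns a function that is evaluated inside the page and reports back via `window.callPhantom(error, result)`.

```javascript
// Sketch of a hypothetical custom runner named 'title' (assumed interface;
// see the API docs for the authoritative contract).
const titleRunner = {
    // No companion scripts need to be injected for this runner.
    getCompanionFiles: () => [],
    // The returned function runs inside the PhantomJS page context.
    getRunnable: () => function extractTitle() {
        // Report the page title back to the crawler.
        window.callPhantom(null, document.title);
    },
};

// Wiring it up (assumes `crawler` is a CrawlKit instance as in the Usage example):
// crawler.addRunner('title', titleRunner);
```

The runner's result then appears in the crawl results keyed by the name it was registered under.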

API

See the API docs (published) or the docs on doclets.io (live).

Debugging

CrawlKit uses debug for logging. In short, set the environment variable DEBUG="*" before starting your app to get all the logs. A saner configuration for large pages is probably DEBUG="*:info,*:error,-crawlkit:pool*".
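The same configuration can also be applied from code rather than the shell. A minimal sketch, assuming `DEBUG` is assigned before `debug` (and therefore CrawlKit) is first required, since `debug` reads the variable at load time:

```javascript
// Enable verbose CrawlKit logging for this process only.
// Assumption: this must run before `debug`/`crawlkit` are first required,
// because `debug` reads process.env.DEBUG when it is loaded.
process.env.DEBUG = '*:info,*:error,-crawlkit:pool*';

// const CrawlKit = require('crawlkit'); // require only after DEBUG is set
```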

Contributing

Please contribute away :)

Please add tests for new functionality and adapt existing tests when behaviour changes.

The commit messages need to follow the conventional changelog format so that semantic-release picks the correct semver version. It is probably easiest to install commitizen via npm install -g commitizen and commit your changes with git cz.

Available runners

Products using CrawlKit
