
pietrovismara / scavenger

Licence: MIT
Scrape and take screenshots of dynamic and static webpages

Programming Languages

javascript
184084 projects - #8 most used programming language

Projects that are alternatives of or similar to scavenger

Ferret
Declarative web scraping
Stars: ✭ 4,837 (+34450%)
Mutual labels:  scraping, scraping-websites
document-dl
Command line program to download documents from web portals.
Stars: ✭ 14 (+0%)
Mutual labels:  scraping, scraping-websites
proxycrawl-python
ProxyCrawl Python library for scraping and crawling
Stars: ✭ 51 (+264.29%)
Mutual labels:  scraping, scraping-websites
scrapman
Retrieve real (JavaScript-executed) HTML code from a URL, ultra fast, with support for loading multiple pages in parallel
Stars: ✭ 21 (+50%)
Mutual labels:  scraping, scraping-websites
gochanges
**[ARCHIVED]** website changes tracker 🔍
Stars: ✭ 12 (-14.29%)
Mutual labels:  scraping, scraping-websites
D4n155
OWASP D4N155 - Intelligent and dynamic wordlist using OSINT
Stars: ✭ 105 (+650%)
Mutual labels:  dynamic, scraping
torchestrator
Spin up Tor containers and then proxy HTTP requests via these Tor instances
Stars: ✭ 32 (+128.57%)
Mutual labels:  scraping, scraping-websites
readability-cli
A CLI for Mozilla Readability. Get clean, uncluttered, ready-to-read HTML from any webpage!
Stars: ✭ 41 (+192.86%)
Mutual labels:  scraping, scraping-websites
reason-rust-scraper
🦀 Scraping & crawling websites using Rust, and ReasonML
Stars: ✭ 21 (+50%)
Mutual labels:  scraping, scraping-websites
Instagram-to-discord
Monitor an Instagram user account and automatically post new images to a Discord channel via a webhook. Working as of 2022!
Stars: ✭ 113 (+707.14%)
Mutual labels:  scraping, scraping-websites
ha-multiscrape
Home Assistant custom component for scraping (html, xml or json) multiple values (from a single HTTP request) with a separate sensor/attribute for each value. Support for (login) form-submit functionality.
Stars: ✭ 103 (+635.71%)
Mutual labels:  scraping
SuluFormBundle
Form Bundle for handling Dynamic and Symfony Forms in https://sulu.io
Stars: ✭ 51 (+264.29%)
Mutual labels:  dynamic
ebayMarketAnalyzer
Scrape all eBay sold listings to determine average/median pricing, plot listings over time with trend lines, and export to Excel
Stars: ✭ 116 (+728.57%)
Mutual labels:  scraping-websites
browser-automation-api
Browser automation API for repetitive web-based tasks, with a friendly user interface. You can use it to scrape content or do many other things like capture a screenshot, generate pdf, extract content or execute custom Puppeteer, Playwright functions.
Stars: ✭ 24 (+71.43%)
Mutual labels:  scraping
tenjin
📝 A template engine.
Stars: ✭ 15 (+7.14%)
Mutual labels:  dynamic
AgileStringDecryptor
A dynamic Agile.NET string decryptor that relies on invoke, by wwh1004 | Version: 6.X
Stars: ✭ 24 (+71.43%)
Mutual labels:  dynamic
dtw-python
Python port of R's Comprehensive Dynamic Time Warp algorithms package
Stars: ✭ 139 (+892.86%)
Mutual labels:  dynamic
lets-hotfix
Dynamic class reloading for Java. Hot-updates Java code, with support for local and remote reloading
Stars: ✭ 124 (+785.71%)
Mutual labels:  dynamic
anime-scraper
[partially working] Scrape and add anime episode stream URLs to uGet (Linux) or IDM (Windows) ~ Python3
Stars: ✭ 21 (+50%)
Mutual labels:  scraping
steps
Simulation Toolkit for Electrical Power Systems
Stars: ✭ 23 (+64.29%)
Mutual labels:  dynamic

Scavenger

Command line tool and Node.js package for scraping and taking screenshots of dynamic and static webpages using Nightmare.

Features

  1. Extracts data from HTML and converts it to JSON (programmatic use only).
  2. Supports dynamic (AngularJS, etc.) and static web pages.
  3. Can be piped to other programs.
  4. Can be used from the command line or programmatically.
  5. Runs on any Linux-based OS (it probably works on Windows and macOS too, but this hasn't been tested yet).

Install

As a global package:

$ npm install -g scavenger

As a local package:

$ npm install scavenger

Programmatic usage

const scavenger = require('scavenger');

Minimalistic usage:

scavenger.scrape("https://reddit.com")
.then((html) => {})

API

.scrape(url, options, mapFn)

url can be either a String or an Array; if an Array is given, scavenger scrapes every URL in sequence.

The result is a String or an Array, depending on url and mapFn.

mapFn is a function executed for every URL; it receives the HTML of the scraped page as its argument. It can be passed as the second argument if no options are given. See .extract and .createExtractor for more info.

scavenger.scrape(url)
.then((html) => {
    console.log(html);
    // '<body>....</body>'
});

// Or
scavenger.scrape(url, {
    selector: '#id', // CSS selector of an element to wait for before scraping
    minify: false, // If true, minify the HTML
    driverFn: function(){}, // A function evaluated in the Nightmare context to interact with the page
    useragent: 'Scavenger - https://www.npmjs.com/package/scavenger', // The default value
    nightmareOptions: {} // These options are passed directly to the Nightmare constructor
})
.then((html) => {});


// Multiple urls with mapFn (get length of html for each scraped page)
scavenger.scrape(urls, {/*options*/}, html => html.length)
.then((htmlLengths) => {
    console.log(htmlLengths);
    // [10040, 22351, ...]
});
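
For example, here is a minimal sketch combining the selector option with a mapFn to pull each page's title. The '#siteTable' selector and the title regex are illustrative assumptions, not part of the API:

const scavenger = require('scavenger');

// Wait for a page-specific element before scraping, then map the HTML to its <title>.
scavenger.scrape('https://reddit.com', { selector: '#siteTable' }, (html) => {
    const match = /<title>([^<]*)<\/title>/i.exec(html);
    return match ? match[1] : null;
})
.then((title) => {
    console.log(title);
});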

.screenshot(url, options)

Returns an object mapping each screenshot size to its image buffer.

scavenger.screenshot(url)
.then((buffers) => {
    console.log(buffers);
    // {
    //     "full": <Buffer>
    // }
});

// Or

scavenger.screenshot(url, {
    selector: '#id', // CSS selector of an element to wait for before taking the screenshot
    format: 'png', // Default: png. Available: jpeg, png.
    crop: [{ // Sizes to crop the full screenshot to; each produces an extra buffer
        width: 1280,
        height: 680
    }, ...],
    width: 1280, // Viewport width in pixels. By default it adapts to the page width. Height is always 100% of the page.
    useragent: 'Scavenger - https://www.npmjs.com/package/scavenger', // The default value
    nightmareOptions: {} // These options are passed directly to the Nightmare constructor
})
.then((buffers) => {
    console.log(buffers);
    // {
    //     "full": <Buffer>,
    //     "1280X680": <Buffer>
    // }
});
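
The returned buffers can be written straight to disk; a minimal sketch (the file names are arbitrary):

const fs = require('fs');
const scavenger = require('scavenger');

scavenger.screenshot('https://reddit.com', {
    format: 'png',
    crop: [{ width: 1280, height: 680 }]
})
.then((buffers) => {
    // One file per returned buffer: the full page plus each crop size.
    fs.writeFileSync('reddit_full.png', buffers.full);
    fs.writeFileSync('reddit_1280x680.png', buffers['1280X680']);
});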

.ss(url, options, mapFn)

Combines .scrape and .screenshot. If mapFn is passed, it is applied to the HTML only.

scavenger.ss(url, {    
    selector: '#id',
    minify: false,
    driverFn: function(){},
    nightmareOptions: {},
    format: 'png',
    crop: [{
        width: 1280,
        height: 680
    }, ...],
    width: 1280
})
.then((result) => {
    console.log(result);
    // {
    //    html: '',
    //    buffers: {
    //        "full": <Buffer>,
    //        "1280X680": <Buffer>
    //    }
    // }
});
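
A minimal sketch that persists both outputs (the file names are arbitrary):

const fs = require('fs');
const scavenger = require('scavenger');

scavenger.ss('https://reddit.com')
.then((result) => {
    fs.writeFileSync('reddit.html', result.html);
    fs.writeFileSync('reddit.png', result.buffers.full);
});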

.extract(html, options)

See also the examples.

Extracts text from the given HTML and returns it as JSON. Works with tables or any other element.

Generic HTML elements:

const authors = scavenger.extract(html, {
    scope: '.class', // Any css selector
    fields: { // Fields are found within the scope element if given
        author: 'h3.author', // Any css selector
        url: {
            selector: 'a.link',
            attribute: 'href' // Gets the href attribute value for the element found at selector
        },
        any: '',
        ...
    },
    groupBy: 'author' // a field name to group results by
});

// Or passing just the fields
scavenger.extract(html, {    
    author: 'h3.author',
    url: {
        selector: 'a.link',
        attribute: 'href'
    },
    any: '',
    ...    
});
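
As a concrete (hypothetical) example, given the markup below, the call should yield one object per .post element:

const scavenger = require('scavenger');

const html = `
  <div class="post">
    <h3 class="author">alice</h3>
    <a class="link" href="/posts/1">First post</a>
  </div>
  <div class="post">
    <h3 class="author">bob</h3>
    <a class="link" href="/posts/2">Second post</a>
  </div>`;

const posts = scavenger.extract(html, {
    scope: '.post',
    fields: {
        author: 'h3.author',
        url: {
            selector: 'a.link',
            attribute: 'href'
        }
    }
});
// Expected shape (an assumption, not verified against the library):
// [{ author: 'alice', url: '/posts/1' }, { author: 'bob', url: '/posts/2' }]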

.createExtractor(options, fn)

See also the examples.

Helper method. Returns an extract function which can be passed to .scrape as mapFn.

If no fn is passed, .extract will be used by default.

const extract = scavenger.createExtractor({
    scope: 'section',
    fields: {
        title: 'h1.fly-title',
        headline: 'article h2.headline',
        rubric: 'article p.rubric'
    }
});

return scavenger.scrape('http://www.economist.com', extract);

.paginateUrl(options, fn)

Helper method. Returns an array of URLs, one per page, with the pagination query parameter set for each.

const urls = scavenger.paginateUrl({
    baseUrl: 'https://www.google.com/search?',
    params: {
        q: 'scavenger scraper',
        start: 0
    },
    paginationParam: 'start',
    limit: 30,
    step: 10
});

// [
//     'https://www.google.com/search?q=scavenger%20scraper&start=0',    
//     'https://www.google.com/search?q=scavenger%20scraper&start=10'
// ]

scavenger.scrape(urls);
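
The generated URLs pair naturally with .createExtractor; a sketch using hypothetical selectors for a search results page:

const extract = scavenger.createExtractor({
    scope: '.result', // Hypothetical selector for a single search result
    fields: {
        title: 'h3',
        link: {
            selector: 'a',
            attribute: 'href'
        }
    }
});

// One result set per paginated URL.
scavenger.scrape(urls, extract)
.then((pages) => {
    console.log(pages);
});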

Command line usage

Help

$ scavenger -h

Screenshot

Save the image to a PNG file:

$ scavenger screenshot -u https://reddit.com
$ # Creates a file called https_reddit_com.png

Pipe the image to ImageMagick's display command to view it:

$ scavenger screenshot -u https://reddit.com | display

Scrape

Pipe the HTML to less:

$ scavenger scrape -u https://reddit.com | less

Save the HTML to a file:

$ scavenger scrape -u https://reddit.com > reddit.html

Or

$ scavenger scrape -u https://reddit.com
$ # Creates a file called https_reddit_com.html

Scrape + Screenshot

$ scavenger ss -u https://reddit.com

License

MIT
