
get-set-fetch / scraper

License: MIT License
Node.js web scraper. Includes a command line interface, Docker container, Terraform module and Ansible roles for distributed cloud scraping. Supported databases: SQLite, MySQL, PostgreSQL. Supported headless clients: Puppeteer, Playwright, Cheerio, JSdom.

Programming Languages

typescript

Projects that are alternatives to or similar to scraper

Instagram-to-discord
Monitor an Instagram user account and automatically post new images to a Discord channel via a webhook. Working as of 2022!
Stars: ✭ 113 (+205.41%)
Mutual labels:  scraper, scraping
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Stars: ✭ 52 (+40.54%)
Mutual labels:  scraper, scraping
proxycrawl-python
ProxyCrawl Python library for scraping and crawling
Stars: ✭ 51 (+37.84%)
Mutual labels:  scraper, scraping
diffbot-php-client
[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library
Stars: ✭ 53 (+43.24%)
Mutual labels:  scraper, scraping
scrapy facebooker
Collection of scrapy spiders which can scrape posts, images, and so on from public Facebook Pages.
Stars: ✭ 22 (-40.54%)
Mutual labels:  scraper, scraping
scrapman
Retrieve real (JavaScript-executed) HTML code from a URL, ultra fast, with support for loading multiple pages in parallel
Stars: ✭ 21 (-43.24%)
Mutual labels:  scraper, scraping
angel.co-companies-list-scraping
No description or website provided.
Stars: ✭ 54 (+45.95%)
Mutual labels:  scraper, scraping
crawler-chrome-extensions
Chrome extensions commonly used by crawler engineers
Stars: ✭ 53 (+43.24%)
Mutual labels:  scraper, scraping
Scraper-Projects
🕸 List of mini projects that involve web scraping 🕸
Stars: ✭ 25 (-32.43%)
Mutual labels:  scraper, scraping
Captcha-Tools
All-in-one Python (and now Go!) module to help solve captchas with the Capmonster, 2captcha and Anticaptcha APIs!
Stars: ✭ 23 (-37.84%)
Mutual labels:  scraper, scraping
TorScrapper
A Scraper made 100% in Python using BeautifulSoup and Tor. It can be used to scrape both normal and onion links. Happy Scraping :)
Stars: ✭ 24 (-35.14%)
Mutual labels:  scraper, scraping
Zeiver
A Scraper, Downloader, & Recorder for static open directories.
Stars: ✭ 14 (-62.16%)
Mutual labels:  scraper, scraping
scrapers
scrapers for building your own image databases
Stars: ✭ 46 (+24.32%)
Mutual labels:  scraper, scraping
ha-multiscrape
Home Assistant custom component for scraping (html, xml or json) multiple values (from a single HTTP request) with a separate sensor/attribute for each value. Support for (login) form-submit functionality.
Stars: ✭ 103 (+178.38%)
Mutual labels:  scraper, scraping
gochanges
**[ARCHIVED]** website changes tracker 🔍
Stars: ✭ 12 (-67.57%)
Mutual labels:  scraper, scraping
copycat
A PHP Scraping Class
Stars: ✭ 70 (+89.19%)
Mutual labels:  scraper, scraping
google-scraper
This class can retrieve search results from Google.
Stars: ✭ 33 (-10.81%)
Mutual labels:  scraper, scraping
Pahe.ph-Scraper
Pahe.ph [Pahe.in] Movies Website Scraper
Stars: ✭ 57 (+54.05%)
Mutual labels:  scraper, scraping
document-dl
Command line program to download documents from web portals.
Stars: ✭ 14 (-62.16%)
Mutual labels:  scraper, scraping
whatsapp-tracking
Scraping the status of WhatsApp contacts
Stars: ✭ 49 (+32.43%)
Mutual labels:  scraper, scraping


Node.js web scraper

get-set, Fetch! is a plugin-based Node.js web scraper. It scrapes, stores and exports data.
At its core, an ordered list of plugins is executed against each URL to be scraped.

Supported databases: SQLite, MySQL, PostgreSQL.
Supported browser clients: Puppeteer, Playwright.
Supported DOM-like clients: Cheerio, JSdom.

Use it in your own javascript/typescript code

import { Scraper, ScrapeEvent, Project, CsvExporter } from '@get-set-fetch/scraper';

// ScrapeConfig holds your own storage, client, project and concurrency settings,
// as detailed in the Getting Started section below
const scraper = new Scraper(ScrapeConfig.storage, ScrapeConfig.client);
scraper.on(ScrapeEvent.ProjectScraped, async (project: Project) => {
  const exporter = new CsvExporter({ filepath: 'languages.csv' });
  await exporter.export(project);
});

scraper.scrape(ScrapeConfig.project, ScrapeConfig.concurrency);

Note: the package is published as both CommonJS and ES Module.
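
Both module formats expose the same API; a quick sketch of the two loading styles:

// ES Module
import { Scraper } from '@get-set-fetch/scraper';

// CommonJS
const { Scraper } = require('@get-set-fetch/scraper');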

Use it from the command line

gsfscrape \
--config scrape-config.json \
--loglevel info --logdestination scrape.log \
--save \
--overwrite \
--export project.csv

Run it with Docker

docker run \
-v <host_dir>/scraper/docker/data:/home/gsfuser/scraper/data getsetfetch:latest \
--version \
--config data/scrape-config.json \
--save \
--overwrite \
--scrape \
--loglevel info \
--logdestination data/scrape.log \
--export data/export.csv

Note: you have to build the image manually from the './docker' directory.

Run it in the cloud

module "benchmark_1000k_1project_multiple_scrapers_csv_urls" {
  source = "../../node_modules/@get-set-fetch/scraper/cloud/terraform"

  region                 = "fra1"
  public_key_name        = "get-set-fetch"
  public_key_file        = var.public_key_file
  private_key_file       = var.private_key_file
  ansible_inventory_file = "../ansible/inventory/hosts.cfg"

  pg = {
    name                  = "pg"
    image                 = "ubuntu-20-04-x64"
    size                  = "s-4vcpu-8gb"
    ansible_playbook_file = "../ansible/pg-setup.yml"
  }

  scraper = {
    count                 = 4
    name                  = "scraper"
    image                 = "ubuntu-20-04-x64"
    size                  = "s-1vcpu-1gb"
    ansible_playbook_file = "../ansible/scraper-setup.yml"
  }
}

Note: only the DigitalOcean Terraform provider is supported at the moment.

Benchmarks

For quick, small projects under 10K URLs, storing the queue and scraped content in SQLite is fine. For anything larger use PostgreSQL. You will be able to start/stop/resume the scraping process across multiple scraper instances, each with its own IP and/or dedicated proxies.

Using a PostgreSQL database and 4 scraper instances it takes 9 minutes to scrape 1 million URLs, roughly 0.5 ms per scraped URL (540,000 ms / 1,000,000 URLs ≈ 0.54 ms). The scrapers use synthetic data; there is no external traffic, so results are not influenced by web server response times or upload/download speeds. See benchmarks for more info.

Getting Started

What follows is a brief Getting Started guide using SQLite as storage and Puppeteer as the browser client. For in-depth documentation visit getsetfetch.org. See changelog for past release notes and development for technical tidbits.

Install the scraper

$ npm install @get-set-fetch/scraper

Install peer dependencies

$ npm install knex sqlite3 puppeteer

Supported storage options and browser clients are defined as peer dependencies; manually install the ones you've selected.

Init storage

const { KnexConnection } = require('@get-set-fetch/scraper');
const connConfig = {
  client: 'sqlite3',
  useNullAsDefault: true,
  connection: {
    filename: ':memory:'
  }
}
const conn = new KnexConnection(connConfig);

See Storage for the full configuration options for SQLite, MySQL and PostgreSQL.
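
For comparison, a minimal PostgreSQL connection sketch; it assumes the pg driver is installed alongside knex, and the host and credentials below are placeholders:

const { KnexConnection } = require('@get-set-fetch/scraper');

// standard knex connection settings for PostgreSQL; replace the placeholders
// with your own server address and credentials
const pgConnConfig = {
  client: 'pg',
  connection: {
    host: '127.0.0.1',
    port: 5432,
    user: 'gsf-user',
    password: 'gsf-password',
    database: 'gsf-db'
  }
};
const pgConn = new KnexConnection(pgConnConfig);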

Init browser client

const { PuppeteerClient } = require('@get-set-fetch/scraper');
const launchOpts = {
  headless: true,
}
const client = new PuppeteerClient(launchOpts);
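
Playwright can be initialized the same way; a minimal sketch, assuming the playwright peer dependency is installed and the PlaywrightClient export mirrors PuppeteerClient:

const { PlaywrightClient } = require('@get-set-fetch/scraper');

// launch options are assumed to be forwarded to Playwright,
// mirroring the Puppeteer example above
const launchOpts = {
  headless: true,
};
const client = new PlaywrightClient(launchOpts);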

Init scraper

const { Scraper } = require('@get-set-fetch/scraper');
const scraper = new Scraper(conn, client);

Define project options

const projectOpts = {
  name: "myScrapeProject",
  pipeline: 'browser-static-content',
  pluginOpts: [
    {
      name: 'ExtractUrlsPlugin',
      maxDepth: 3,
      selectorPairs: [
        {
          urlSelector: '#searchResults ~ .pagination > a.ChoosePage:nth-child(2)',
        },
        {
          urlSelector: 'h3.booktitle a.results',
        },
        {
          urlSelector: 'a.coverLook > img.cover',
        },
      ],
    },
    {
      name: 'ExtractHtmlContentPlugin',
      selectorPairs: [
        {
          contentSelector: 'h1.work-title',
          label: 'title',
        },
        {
          contentSelector: 'h2.edition-byline a',
          label: 'author',
        },
        {
          contentSelector: 'ul.readers-stats > li.avg-ratings > span[itemProp="ratingValue"]',
          label: 'rating value',
        },
        {
          contentSelector: 'ul.readers-stats > li > span[itemProp="reviewCount"]',
          label: 'review count',
        },
      ],
    },
  ],
  resources: [
    {
      url: 'https://openlibrary.org/authors/OL34221A/Isaac_Asimov?page=1'
    }
  ]
};

You can define a project in multiple ways. The above example is the most direct one.

You define one or more starting URLs, a predefined pipeline containing a series of scrape plugins with default options, and any plugin options you want to override. See pipelines and plugins for all available options.

ExtractUrlsPlugin.maxDepth defines a maximum depth of resources to be scraped. The starting resource has depth 0. Resources discovered from it have depth 1 and so on. A value of -1 disables this check.

ExtractUrlsPlugin.selectorPairs defines CSS selectors for discovering new resources. The urlSelector property selects the links, while the optional titleSelector can be used for renaming binary resources such as images or PDFs. In order, the defined selectorPairs extract pagination URLs, book detail URLs and book cover image URLs.
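
For illustration, a selectorPair that also renames the downloaded cover images might look like the sketch below; the selectors reuse the ones from the example above, while the titleSelector pairing itself is illustrative and not part of the project options shown earlier:

{
  // link to the binary resource (book cover image)
  urlSelector: 'a.coverLook > img.cover',
  // the text matched by this selector is used as the saved image name
  titleSelector: 'h1.work-title',
}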

ExtractHtmlContentPlugin.selectorPairs scrapes content via CSS selectors. Optional labels can be used to specify column names when exporting the results as CSV.

Define concurrency options

const concurrencyOpts = {
  project: {
    delay: 1000
  },
  domain: {
    delay: 5000
  }
}

A minimum delay of 5000 ms will be enforced between scraping consecutive resources from the same domain. At the project level, across all domains, any two resources will be scraped with a minimum 1000 ms delay between requests. See concurrency options for all available options.

Start scraping

scraper.scrape(projectOpts, concurrencyOpts);

The entire process is asynchronous. Listen to the emitted scrape events to monitor progress.
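
If you need to know programmatically when scraping has finished, one option is to wrap the ProjectScraped event in a promise; a minimal sketch using the scraper instance from above:

const { ScrapeEvent } = require('@get-set-fetch/scraper');

// resolves once the project has been fully scraped
const scrapingDone = new Promise(resolve => {
  scraper.on(ScrapeEvent.ProjectScraped, resolve);
});

scrapingDone.then(() => console.log('scraping complete'));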

Export results

const { ScrapeEvent, CsvExporter, ZipExporter } = require('@get-set-fetch/scraper');

scraper.on(ScrapeEvent.ProjectScraped, async (project) => {
  const csvExporter = new CsvExporter({ filepath: 'books.csv' });
  await csvExporter.export(project);

  const zipExporter = new ZipExporter({ filepath: 'book-covers.zip' });
  await zipExporter.export(project);
})

Wait for scraping to complete by listening to the ProjectScraped event.

Export the scraped HTML content as CSV and the scraped images as a zip archive. See Export for all supported parameters.

Browser Extension

This project is based on lessons learned developing get-set-fetch-extension, a scraping browser extension for Chrome, Firefox and Edge.

Both projects share the same storage, pipeline and plugin concepts but unfortunately no code. I'm planning to fix this in the future so that code from the scraper can be reused in the extension.
