
WebParsy logo


Fast and light Node.js library and CLI to scrape and interact with websites using Puppeteer (or not) and YAML definitions

version: 1
jobs:
  main:
    steps:
      - goto: https://github.com/marketplace?category=code-quality
      - pdf:
          path: Github_Tools.pdf
          format: A4
      - many: 
          as: github_tools
          event: githubTool
          selector: main .col-lg-9.mt-1.mb-4.float-lg-right a.col-md-6.mb-4.d-flex.no-underline
          element:
            - property:
                selector: a
                type: string
                property: href
                as: url
                transform: absoluteUrl
            - text:
                selector: h3.h4
                type: string
                transform: trim
                as: name
            - text:
                selector: p
                type: string
                transform: trim
                as: description

This returns an array of GitHub's tools and creates a PDF. Example output:

{
  "github_tools": [
    {
      "url": "https://github.com/marketplace/codelingo",
      "name": "codelingo",
      "description": "Your Code, Your Rules - Automate code reviews with your own best practices"
    },
    {
      "url": "https://github.com/marketplace/codebeat",
      "name": "codebeat",
      "description": "Code review expert on demand. Automated for mobile and web"
    },
    ...
  ]
}

Don't panic. There are examples for all WebParsy features in the examples folder. These are as basic as possible to help you get started.

Contributors

Thanks goes to these wonderful people (emoji key):


Dumi-k

🐛

KilianCM

🤔 💻

This project follows the all-contributors specification. Contributions of any kind welcome!

Table of Contents
  • Overview
  • Browser config
  • Output
  • Transform
  • Types
  • Multi-jobs
  • Steps
    • setContent Sets the HTML markup to assign to the page.
    • goto Navigate to a URL
    • run Runs a group of steps by its name.
    • goBack Navigate to the previous page in history
    • screenshot Takes a screenshot of the page
    • pdf Takes a pdf of the page
    • text Gets the text for a given CSS selector
    • many Returns an array of elements given their CSS selectors
    • title Gets the title for the current page.
    • form Fill and submit forms
    • html Return HTML code for the page or a DOM element
    • click Click on an element (CSS and xPath selectors)
    • url Return the current URL
    • type Types text (key events) into a given selector
    • waitFor Wait for selectors, time, functions, etc before continuing
    • keyboardPress Simulates the press of a keyboard key
    • scrollTo Scroll to bottom, top, x, y, selector, xPath before continuing
    • scrollToEnd Scrolls to the very bottom (infinite scroll pages)

Overview

You can use WebParsy either as cli from your terminal or as a NodeJS library.

Cli

Install webparsy:

$ npm i webparsy -g
$ webparsy example/_weather.yml --customFlag "custom flag value"
Result:

{
  "title": "Madrid, España Pronóstico del tiempo y condiciones meteorológicas - The Weather Channel | Weather.com",
  "city": "Madrid, España",
  "temp": 18
}

Library

const webparsy = require('webparsy')
const parsingResult = await webparsy.init({
  file: 'jobdefinition.yml',
  flags: { ... } // optional
})

Methods

init(options)

options:

One of yaml, file or string is required.

  • yaml: A yaml npm module instance of the scraping definition.
  • string: The YAML definition, as a plain string.
  • file: The path for the YAML file containing the scraping definition.

Additionally, you can pass a flags object property to input additional values to your scraping process.
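The "exactly one of yaml, file or string" contract can be sketched as a small validator (a hypothetical helper for illustration, not part of webparsy's API):

```javascript
// Hypothetical validator for the documented contract: init() needs
// exactly one definition source among `yaml`, `file` and `string`.
function pickDefinitionSource(options) {
  const sources = ['yaml', 'file', 'string'].filter((key) => key in options);
  if (sources.length !== 1) {
    throw new Error('Provide exactly one of: yaml, file, string');
  }
  return sources[0];
}

console.log(pickDefinitionSource({ file: 'definition.yml', flags: {} })); // file
```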

Browser config

You can set up Chrome's details in the browser property within the main job.

None of the following settings are required.

jobs:
  main:
    browser:
      width: 1200
      height: 800
      scaleFactor: 1
      timeout: 60
      delay: 0
      headless: true
      executablePath: ''
      userDataDir: ''
      keepOpen: false
  • executablePath: If provided, webparsy will launch Chrome from the specified path.
  • userDataDir: If provided, webparsy will launch Chrome with the specified user's profile.

Output

In order for WebParsy to get contents, it needs some very basic details. These are:

  • as the property you want to be returned
  • selector the css selector to extract the html or text from

Other optional options are

  • parent Get the parent of the element filtered by a selector.

Example

text:
  selector: .entry-title
  as: entryLink
  parent: a

Transform

When you extract text from a web page, you might want to transform the data before returning it. example

You can use the following - transform methods:

  • uppercase transforms the result to uppercase
  • lowercase transforms the result to lowercase
  • absoluteUrl return the absolute url for a link
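For instance, absoluteUrl resolves scraped hrefs like /marketplace/codelingo against the page's URL. A sketch of the assumed behaviour using the WHATWG URL API (not webparsy's actual implementation):

```javascript
// Assumed behaviour of the absoluteUrl transform: resolve a possibly
// relative href against the URL of the page being scraped.
const absoluteUrl = (href, pageUrl) => new URL(href, pageUrl).toString();

console.log(absoluteUrl('/marketplace/codelingo', 'https://github.com'));
// https://github.com/marketplace/codelingo
console.log(absoluteUrl('https://example.com/x', 'https://github.com'));
// https://example.com/x
```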

Types

When extracting details from a page, you might want them to be returned in different formats, for example as a number when grabbing temperatures. example

You can use the following values for - type:

  • string
  • number
  • integer
  • float
  • fcd transforms to float a string number that uses commas for thousands
  • fdc transforms to float a string number that uses dots for thousands
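The fcd/fdc distinction is about locale: "1,234.56" versus "1.234,56". A sketch of the assumed conversions (illustrative, not webparsy's actual code):

```javascript
// fcd: comma is the thousands separator -> "1,234.56" => 1234.56
const fcd = (value) => parseFloat(value.replace(/,/g, ''));

// fdc: dot is the thousands separator, comma the decimal one
// -> "1.234,56" => 1234.56
const fdc = (value) =>
  parseFloat(value.replace(/\./g, '').replace(',', '.'));

console.log(fcd('1,234.56')); // 1234.56
console.log(fdc('1.234,56')); // 1234.56
```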

Multi-jobs support

You can define groups of steps (jobs) that you can reuse at any moment during a scraping process.

For example, let's say you want to sign up twice on a website. You will have a "main" job (which executes by default), but you can create an additional one called "signup" that you can reuse in the "main" one.

version: 1
jobs:
  main:
    steps:
      - goto: https://example.com/
      - run: signup
      - click: '#logout'
      - run: signup
  signup:
    steps:
      - goto: https://example.com/register
      - form:
          selector: "#signup-user"
          submit: true
          fill:
            - selector: '[name="username"]'
              value: [email protected]

Steps

Steps are the list of things the browser must do.

setContent

Sets the HTML markup to assign to the page.

Setting a string:

- setContent:
    html: Hello!

Loading the HTML from a file:

- setContent:
    file: myMarkup.html

Loading the HTML from a environment variable:

- setContent:
    env: MY_MARKUP_ENVIRONMENT_VARIABLE

Loading the HTML from a flag:

- setContent:
    flag: markup

goto

URL to navigate the page to. The URL should include the scheme, e.g. https://. example

- goto: https://example.com

You can also tell WebParsy not to use Puppeteer to browse, and instead make a direct HTTP request via got. This performs much faster, but it may not be suitable for websites that require JavaScript. simple example / extended example

Note that some methods (for example form, click and others) will not be available if you are not browsing with Puppeteer.

- goto:
    url: https://google.com
    method: got

You can also tell WebParsy which URLs it should visit via flags (available via the CLI and the library). Example:

- goto:
    flag: websiteUrl

You can then call webparsy as:

webparsy definition.yaml --websiteUrl "https://google.com"

or

webparsy.init({
  file: 'definition.yml',
  flags: { websiteUrl: 'https://google.com' }
})

example

Authentication

You can perform basic HTTP authentication by providing the user and password as in the following example:

- goto: 
    url: http://example.com
    method: got
    authentication:
      type: basic
      username: my_user
      password: my_password
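Under the hood, HTTP basic auth is simply an Authorization header carrying base64("username:password"). The sketch below illustrates the mechanism (not webparsy internals):

```javascript
// Illustration only: what a basic-auth Authorization header contains.
const basicAuthHeader = (username, password) =>
  'Basic ' + Buffer.from(`${username}:${password}`).toString('base64');

console.log(basicAuthHeader('my_user', 'my_password'));
```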

run

Runs a group of steps by its name.

- run: signupProcess

goBack

Navigate to the previous page in history. example

- goBack

screenshot

Takes a screenshot of the page. This triggers Puppeteer's page.screenshot. example

- screenshot:
    path: Github.png

If you are using WebParsy as a Node.js module, you can also get the screenshot returned as a Buffer by using the as property.

- screenshot:
    as: myScreenshotBuffer

pdf

Takes a PDF of the page. This triggers Puppeteer's page.pdf

- pdf:
    path: Github.pdf

If you are using WebParsy as a Node.js module, you can also get the PDF file returned as a Buffer by using the as property.

- pdf:
    as: pdfFileBuffer

title

Gets the title for the current page. If no output.as property is defined, the page's title will be returned as { title }. example

- title

many

Returns an array of elements given their CSS selectors. example

Example:

- many:
    as: articles
    selector: main ol.articles-list li.article-item
    element:
      - text:
          selector: .title
          as: title

When you scrape large amounts of content, you might end up consuming huge amounts of RAM; your system might become slow and the scraping process might fail.

To prevent this, WebParsy allows you to use process events so you can have the scraped contents as they are scraped, instead of storing them in memory and waiting for the whole process to finish.

To do this, simply add an event property to many, with the event's name you want to listen to. The event will contain each scraped item.

event will give you the data as it's being scraped. To prevent it from being stored in memory, set eventMethod to discard.

Example using events

form

Fill and submit forms. example

Form filling can use values from environment variables. This is useful if you want to keep users' login details secret. If this is your case, instead of specifying the value as a string, set it as the env property of value. Check the example below or refer to the banking example

Example:

- form:
    selector: "#tsf"            # form selector
    submit: true               # Submit after filling all details
    fill:                      # array of inputs to fill
      - selector: '[name="q"]' # input selector
        value: test            # input value

Using environment variables

- form:
    selector: "#login"            # form selector
    submit: true                  # Submit after filling all details
    fill:                         # array of inputs to fill
      - selector: '[name="user"]' # input selector
        value:
          env: USERNAME           # process.env.USERNAME
      - selector: '[name="pass"]' 
        value: 
          env: PASSWORD           # process.env.PASSWORD

html

Gets the HTML code. If no selector is specified, it returns the page's full HTML code. If no output.as property is defined, the result will be returned as { html }. example

Example:

- html:
    as: divHtml
    selector: div

click

Click on an element. example

Example:

Default behaviour (CSS selector)

- click: button.click-me

Same as

- click: 
    selector: button.click-me

By xPath (clicks on the first match)

- click: 
    xPath: '/html/body/div[2]/div/div/div/div/div[3]/span'

type

Sends a keydown, keypress/input, and keyup event for each character in the text.

Example:

- type:
    selector: input.user
    text: [email protected]
    options:
      delay: 4000

url

Return the current URL.

Example:

- url:
    as: currentUrl

waitFor

Wait for specified CSS or XPath selectors, a function, or a specific amount of time before continuing. example

Examples:

- waitFor:
   selector: "#search-results"
- waitFor:
   xPath: "/html/body/div[1]/header/div[1]/a/svg"
- waitFor:
   function: "console.log(Date.now())"
- waitFor:
    time: 1000 # Time in milliseconds

keyboardPress

Simulates the press of a keyboard key. extended docs

- keyboardPress: 
    key: 'Enter'

scrollTo

Scroll to specified CSS or XPath selectors, to bottom/top, or to a specified x/y position before continuing. example

Examples:

- scrollTo:
   top: true
- scrollTo:
   bottom: true
- scrollTo:
   x: 340
- scrollTo:
   y: 500
- scrollTo:
   selector: "#search-results"
- scrollTo:
   xPath: "/html/body/div[1]/header/div[1]/a/svg"

scrollToEnd

Scrolls to the very bottom (infinite scroll pages). example

This accepts three settings:

  • step: how many pixels to scroll each time. Default is 10.
  • max: the maximum number of pixels to scroll down, so you are not waiting forever on never-ending infinite scroll pages. Default is 9999999.
  • sleep: how long to wait between scrolls, in milliseconds. Default is 100.

Examples:

- scrollToEnd
- scrollToEnd:
    step: 300
    sleep: 1000
    max: 300000
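The three settings combine into a loop like the following model (illustrative only; `page.scrollBy` is a hypothetical stand-in for the browser, and webparsy's real implementation differs):

```javascript
// Model of the scrollToEnd loop: scroll `step` pixels at a time, wait
// `sleep` ms between scrolls, and stop after `max` pixels or when the
// page height stops growing (no more content is loading).
async function scrollToEnd(page, { step = 10, max = 9999999, sleep = 100 } = {}) {
  let scrolled = 0;
  let lastHeight = -1;
  while (scrolled < max) {
    const height = await page.scrollBy(step); // returns current page height
    if (height === lastHeight) break;         // height stopped growing
    lastHeight = height;
    scrolled += step;
    await new Promise((resolve) => setTimeout(resolve, sleep));
  }
  return scrolled;
}
```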

License

MIT © Jose Constela
