
microlinkhq / Browserless

Licence: mit
A browser driver on top of puppeteer, ready for production scenarios.

Programming Languages

javascript
184084 projects - #8 most used programming language

Projects that are alternatives of or similar to Browserless

Headless Chrome Crawler
Distributed crawler powered by Headless Chrome
Stars: ✭ 5,129 (+672.44%)
Mutual labels:  puppeteer, headless-chrome
Rendertron
A Headless Chrome rendering solution
Stars: ✭ 5,593 (+742.32%)
Mutual labels:  puppeteer, headless-chrome
thal
Translation: Puppeteer and Chrome Headless, from getting started to crawling
Stars: ✭ 651 (-1.96%)
Mutual labels:  headless-chrome, puppeteer
Pptraas.com
Puppeteer as a service
Stars: ✭ 433 (-34.79%)
Mutual labels:  puppeteer, headless-chrome
Md To Pdf
Hackable CLI tool for converting Markdown files to PDF using Node.js and headless Chrome.
Stars: ✭ 374 (-43.67%)
Mutual labels:  puppeteer, headless-chrome
purescript-toppokki
A binding to puppeteer to drive headless Chrome.
Stars: ✭ 48 (-92.77%)
Mutual labels:  headless-chrome, puppeteer
Apify Js
Apify SDK — The scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.
Stars: ✭ 3,154 (+375%)
Mutual labels:  puppeteer, headless-chrome
nest-puppeteer
Puppeteer (Headless Chrome) provider for Nest.js
Stars: ✭ 68 (-89.76%)
Mutual labels:  headless-chrome, puppeteer
Webster
a reliable high-level web crawling & scraping framework for Node.js.
Stars: ✭ 364 (-45.18%)
Mutual labels:  puppeteer, headless-chrome
Pyppeteer
Headless chrome/chromium automation library (unofficial port of puppeteer)
Stars: ✭ 3,480 (+424.1%)
Mutual labels:  puppeteer, headless-chrome
Puppeteer Lambda Starter Kit
Starter Kit for running Headless-Chrome by Puppeteer on AWS Lambda.
Stars: ✭ 563 (-15.21%)
Mutual labels:  puppeteer, headless-chrome
Docker Puppeteer
A minimal Docker image for Puppeteer
Stars: ✭ 656 (-1.2%)
Mutual labels:  puppeteer, headless-chrome
puppeteer-email
Email automation driven by headless chrome.
Stars: ✭ 135 (-79.67%)
Mutual labels:  headless-chrome, puppeteer
puppeteer-github
GitHub automation driven by headless chrome.
Stars: ✭ 15 (-97.74%)
Mutual labels:  headless-chrome, puppeteer
phantom-lord
Handy API for Headless Chromium
Stars: ✭ 24 (-96.39%)
Mutual labels:  headless-chrome, puppeteer
hc-pdf-server
Convert HTML to PDF Server by headless chrome with TypeScript. The new version of hcep-pdf-server.
Stars: ✭ 24 (-96.39%)
Mutual labels:  headless-chrome, puppeteer
throughout
🎪 End-to-end testing made simple (using Jest and Puppeteer)
Stars: ✭ 16 (-97.59%)
Mutual labels:  headless-chrome, puppeteer
puppeteer-instagram
Instagram automation driven by headless chrome.
Stars: ✭ 87 (-86.9%)
Mutual labels:  headless-chrome, puppeteer
Mochify.js
☕️ TDD with Browserify, Mocha, Headless Chrome and WebDriver
Stars: ✭ 338 (-49.1%)
Mutual labels:  puppeteer, headless-chrome
Try Puppeteer
Run Puppeteer code in the cloud
Stars: ✭ 642 (-3.31%)
Mutual labels:  puppeteer, headless-chrome

browserless


The Headless Chrom[e|ium] driver for Node.js.

browserless is a headless Chrome/Chromium driver built on top of puppeteer, created to handle resources efficiently and satisfy the most desired production scenarios.

Highlights

Installation

You can install it via npm:

$ npm install browserless puppeteer --save

browserless has a puppeteer-like API and it uses puppeteer under the hood.

You can use it with puppeteer, puppeteer-core or puppeteer-firefox, interchangeably.

Usage

browserless has the same API as puppeteer:

const browserless = require('browserless')()
const termImg = require('term-img')

async function main () {
  const buffer = await browserless.screenshot('http://example.com', {
    device: 'iPhone 6'
  })

  console.log(termImg(buffer))
}

main()

If you're already using puppeteer, you can switch to browserless with almost no effort.

Additionally, you can use some specific packages in your codebase, interacting with them from puppeteer.

Basic

All methods follow the same interface:

  • <url>: The target URL. It's required.
  • [options]: Specific settings for the method. It's optional.

Each method returns a Promise, or invokes a Node.js-style callback if you pass an additional function as the last parameter.
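The dual Promise/callback interface can be illustrated with a small wrapper. This is a minimal sketch, not the library's code: `promiseOrCallback` and `fakeText` are hypothetical names, and `fakeText` is a stand-in that never opens a browser.

```javascript
// Hypothetical helper: wraps an async function so that it also accepts
// a Node.js-style callback as its last argument.
function promiseOrCallback (fn) {
  return (...args) => {
    const last = args[args.length - 1]
    // No trailing function: behave like a plain Promise-returning method.
    if (typeof last !== 'function') return fn(...args)
    const cb = args.pop()
    fn(...args).then(value => cb(null, value), error => cb(error))
  }
}

// Stand-in for a browserless method (assumption: same <url>, [options] shape).
const fakeText = promiseOrCallback(async (url, opts = {}) => `text from ${url}`)

// Promise style
fakeText('https://example.com').then(text => console.log(text))

// Callback style
fakeText('https://example.com', (error, text) => console.log(error, text))
```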

.constructor(options)

It creates the browser instance using the puppeteer.launch method.

// Creating a simple instance
const browserless = require('browserless')()

or passing specific launch options:

// Creating an instance for running it at AWS Lambda
const browserless = require('browserless')({
  ignoreHTTPSErrors: true,
  args: ['--disable-gpu', '--single-process', '--no-zygote', '--no-sandbox', '--hide-scrollbars']
})

options

See puppeteer.launch#options.

Additionally, you can set up:

defaultDevice

type: string default: 'Macbook Pro 13'

Sets a consistent device viewport for each page.

lossyDeviceName

type: boolean default: false

It enables lossy detection over the device descriptor input.

const browserless = require('browserless')({ lossyDeviceName: true })

browserless.getDevice({ device: 'macbook pro 13' })
browserless.getDevice({ device: 'MACBOOK PRO 13' })
browserless.getDevice({ device: 'macbook pro' })
browserless.getDevice({ device: 'macboo pro' })

This setting is intended to find the device even if the device descriptor name does not match exactly.
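How the lossy lookup is implemented is not described here. As one possible approach, the sketch below normalizes case and picks the preset with the smallest Levenshtein edit distance. The `DEVICES` list and the `findDevice` helper are hypothetical and only serve the illustration.

```javascript
// Hypothetical preset names; the real list comes from browserless.devices.
const DEVICES = ['Macbook Pro 13', 'iPhone 6', 'iPad']

// Classic Levenshtein edit distance using a single rolling row.
function levenshtein (a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, i) => i)
  for (let i = 1; i <= a.length; i++) {
    let prev = row[0]
    row[0] = i
    for (let j = 1; j <= b.length; j++) {
      const tmp = row[j]
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      row[j] = Math.min(row[j] + 1, row[j - 1] + 1, prev + cost)
      prev = tmp
    }
  }
  return row[b.length]
}

// Returns the preset whose lowercased name is closest to the input.
function findDevice (name) {
  const input = name.toLowerCase()
  let best = DEVICES[0]
  for (const device of DEVICES) {
    if (levenshtein(input, device.toLowerCase()) < levenshtein(input, best.toLowerCase())) {
      best = device
    }
  }
  return best
}

console.log(findDevice('MACBOOK PRO 13'))
console.log(findDevice('macboo pro'))
```

A real implementation would likely also reject inputs that are too far from every preset; this sketch always returns the closest one.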

mode

type: string default: 'launch' values: 'launch' | 'connect'

It defines whether the browser should be spawned using puppeteer.launch or puppeteer.connect.

timeout

type: number default: 30000

This setting will change the default maximum navigation time.

retry

type: number default: 5

The number of retries that can be performed before considering a navigation as failed.
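The retry behaviour can be pictured as a loop around the navigation. The sketch below is an illustration only: `withRetry` and `flakyNavigation` are hypothetical names, and it assumes one initial attempt followed by up to `retry` further attempts, which this document does not spell out.

```javascript
// Hypothetical helper: runs fn, retrying on failure.
async function withRetry (fn, { retry = 5 } = {}) {
  let lastError
  // One initial attempt plus up to `retry` retries (an assumption).
  for (let attempt = 0; attempt <= retry; attempt++) {
    try {
      return await fn(attempt)
    } catch (error) {
      lastError = error
    }
  }
  // Every attempt failed: surface the last error.
  throw lastError
}

// Stand-in navigation that fails twice before succeeding.
let calls = 0
const flakyNavigation = async () => {
  calls++
  if (calls < 3) throw new Error('navigation failed')
  return 'loaded'
}

const outcome = withRetry(flakyNavigation, { retry: 5 })
outcome.then(result => console.log(result, calls))
```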

proxy

type: string default: undefined

It sets up a proxy used for communication between the browser and the target URL.

puppeteer

type: Puppeteer default: puppeteer|puppeteer-core|puppeteer-firefox

It's automatically detected from your dependencies; puppeteer, puppeteer-core and puppeteer-firefox are supported.

Alternatively, you can pass it.

incognito

type: boolean default: false

Every time a new page is created, it will be an incognito page.

An incognito page will not share cookies/cache with other browser pages.

.html(url, options)

It serializes the content from the target url into HTML.

const browserless = require('browserless')()

;(async () => {
  const url = 'https://example.com'
  const html = await browserless.html(url)
  console.log(html)
})()

options

See browserless.goto to know all the options and values supported.

.text(url, options)

It serializes the content from the target url into plain text.

const browserless = require('browserless')()

;(async () => {
  const url = 'https://example.com'
  const text = await browserless.text(url)
  console.log(text)
})()

options

See browserless.goto to know all the options and values supported.

.pdf(url, options)

It generates the PDF version of a website behind an url.

const browserless = require('browserless')()

;(async () => {
  const url = 'https://example.com'
  const buffer = await browserless.pdf(url)
  console.log('PDF generated!')
})()

options

This method uses the following options by default:

{
  margin: '0.35cm',
  printBackground: true,
  scale: 0.65
}

See browserless.goto to know all the options and values supported.

Also, any page.pdf option is supported.

Additionally, you can set up:

margin

type: string | string[] default: '0.35cm'

It sets paper margins. All possible units are:

  • px for pixel.
  • in for inches.
  • cm for centimeters.
  • mm for millimeters.

You can pass an object specifying each side of the paper:

;(async () => {
  const buffer = await browserless.pdf(url.toString(), {
    margin: {
      top: '0.35cm',
      bottom: '0.35cm',
      left: '0.35cm',
      right: '0.35cm'
    }
  })
})()

Or, if you pass a string, it will be used for all sides:

;(async () => {
  const buffer = await browserless.pdf(url.toString(), {
    margin: '0.35cm'
  })
})()
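The relationship between the two margin forms can be sketched as a small normalization step. `normalizeMargin` is a hypothetical helper written for illustration; how browserless expands the value internally is an assumption.

```javascript
const SIDES = ['top', 'right', 'bottom', 'left']

// Hypothetical helper: expands a single margin string to all four sides,
// and passes a per-side object through unchanged.
function normalizeMargin (margin = '0.35cm') {
  if (typeof margin === 'string') {
    return Object.fromEntries(SIDES.map(side => [side, margin]))
  }
  return { ...margin }
}

console.log(normalizeMargin('1cm'))
console.log(normalizeMargin({ top: '2cm', bottom: '2cm' }))
```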

.screenshot(url, options)

It takes a screenshot from the target url.

const browserless = require('browserless')()

;(async () => {
  const url = 'https://example.com'
  const buffer = await browserless.screenshot(url)
  console.log('Screenshot taken!')
})()

options

This method uses the following options by default:

{
  device: 'macbook pro 13'
}

See browserless.goto to know all the options and values supported.

Also, any page.screenshot option is supported.

Additionally, you can set up:

codeScheme

type: string default: 'atom-dark'

When this value is present and the response 'Content-Type' header is 'json', it beautifies HTML markup using Prism.

The syntax highlighting theme can be customized with either:

  • A prism-themes identifier (e.g., 'dracula').
  • A remote URL (e.g., 'https://unpkg.com/prism-theme-night-owl').

element

type: string

Capture the DOM element matching the given CSS selector. It will wait for the element to appear in the page and to be visible.

overlay

type: object

After the screenshot has been taken, this option allows you to place it into a fancy overlay.

You can configure the overlay by specifying:

  • browser: Sets the browser image overlay to use; supported values are light and dark.
  • background: Sets the background to use. You can pass:
    • A hexadecimal/rgb/rgba color code, e.g. #c1c1c1.
    • A CSS gradient, e.g. linear-gradient(225deg, #FF057C 0%, #8D0B93 50%, #321575 100%).
    • An image URL, e.g. https://source.unsplash.com/random/1920x1080.

;(async () => {
  const buffer = await browserless.screenshot(url.toString(), {
    hide: ['.crisp-client', '#cookies-policy'],
    overlay: {
      browser: 'dark',
      background:
        'linear-gradient(45deg, rgba(255,18,223,1) 0%, rgba(69,59,128,1) 66%, rgba(69,59,128,1) 100%)'
    }
  })
})()

.devices

It exposes all the available device presets, making it possible to load viewport and user agent settings from a device descriptor.

These devices are used for emulation purposes. It extends from puppeteer.devices.

.getDevice({ device, viewport, headers })

Gets the settings of a specific device descriptor by name.

The device name is case-insensitive.

const browserless = require('browserless')()

browserless.getDevice({ device: 'Macbook Pro 15' })
// {
//   userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36',
//   viewport: {
//     width: 1440,
//     height: 900,
//     deviceScaleFactor: 2,
//     isMobile: false,
//     hasTouch: false,
//     isLandscape: false
//   }
// }

Advanced

The following methods are exposed for scenarios where you need more granular control and less magic.

.browser

It returns the internal browser instance, which is used as a singleton.

const browserless = require('browserless')()

;(async () => {
  const browserInstance = await browserless.browser
})()

.evaluate(fn, gotoOpts)

It exposes an interface for creating your own evaluate function, passing you the page and response.

The fn will receive page and response as arguments:

const browserless = require('browserless')()

const getUrlInfo = browserless.evaluate((page, response) => ({
  statusCode: response.status(),
  url: response.url(),
  redirectUrls: response.request().redirectChain()
}))

;(async () => {
  const url = 'https://example.com'
  const info = await getUrlInfo(url)

  console.log(info)
  // {
  //   "statusCode": 200,
  //   "url": "https://example.com/",
  //   "redirectUrls": []
  // }
})()

Note that you don't need to close the page; it is done under the hood.
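The page lifecycle handled under the hood can be pictured as follows. This is a sketch under stated assumptions, not the library's implementation: `createFakeBrowser` is a hypothetical stand-in that only counts open pages, and `createEvaluate` is a hypothetical helper that shows the open, navigate, run and close sequence.

```javascript
// Hypothetical stand-in for a browser: it only tracks open pages.
function createFakeBrowser () {
  const state = { open: 0 }
  const newPage = async () => {
    state.open++
    return {
      goto: async url => ({ status: () => 200, url: () => url }),
      close: async () => { state.open-- }
    }
  }
  return { state, newPage }
}

// Sketch of an evaluate-style wrapper: open a page, navigate, run the
// user function, and always close the page, even if the function throws.
function createEvaluate (browser) {
  return (fn, gotoOpts = {}) => async url => {
    const page = await browser.newPage()
    try {
      const response = await page.goto(url, gotoOpts)
      return await fn(page, response)
    } finally {
      await page.close()
    }
  }
}

const browser = createFakeBrowser()
const evaluate = createEvaluate(browser)
const getStatus = evaluate((page, response) => response.status())

const check = getStatus('https://example.com').then(status => {
  console.log(status, browser.state.open)
  return status
})
```

The `finally` block is the relevant design choice: the page is released whether the user function resolves or rejects.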

Internally, the method performs a browserless.goto, so you can pass extra goto options as the second parameter:

const browserless = require('browserless')()

const getText = browserless.evaluate(page => page.evaluate(() => document.body.innerText), {
  waitUntil: 'domcontentloaded'
})

;(async () => {
  const url = 'https://example.com'
  const text = await getText(url)

  console.log(text)
})()

.goto(page, options)

It performs a smart page.goto, using a built-in adblocker by Cliqz.

const browserless = require('browserless')()

;(async () => {
  const page = await browserless.page()
  const { response, device } = await browserless.goto(page, { url: 'http://example.com' })
})()

options

Any option passed here is passed through to page.goto.

Additionally, you can set up:

adblock

type: boolean default: true

It aborts requests detected as ads.

animations

type: boolean default: false

Disables CSS animations and transitions, and sets prefers-reduced-motion accordingly.

click

type: string | string[]

Click the DOM element matching the given CSS selector.

device

type: string default: 'macbook pro 13'

It specifies the device descriptor to use in order to retrieve userAgent and viewport.

evasions

type: string[] default: require('@browserless/goto').evasions

It makes your headless browser undetectable, preventing it from being blocked.

Antibot systems use detection techniques to check whether you are a real browser and to block any kind of automated access; evasions counter those checks.

The evasion techniques implemented are:

Evasion               Description
chromeRuntime         Creates the window.chrome object associated with any Chrome browser.
consoleDebug          Ensures console.debug exists.
errorStackTrace       Prevents detecting Puppeteer via variable name.
mediaCodecs           Ensures media codecs are defined.
navigatorPermissions  Mocks Notification.permissions.
navigatorPlugins      Ensures your browser has NavigatorPlugins defined.
navigatorWebdriver    Ensures Navigator.webdriver exists.
randomizeUserAgent    Uses a different User-Agent every time.
webglVendor           Ensures WebGLRenderingContext & WebGL2RenderingContext return browser-like information.

All the evasion techniques are enabled by default. To disable some of them, pass a filtered list:

const evasions = require('@browserless/goto').evasions.filter(
  evasion => evasion !== 'randomizeUserAgent'
)

const browserless = require('browserless')({ evasions })

headers

type: object

An object containing additional HTTP headers to be sent with every request.

const browserless = require('browserless')()

;(async () => {
  const page = await browserless.page()
  await browserless.goto(page, {
    url: 'http://example.com',
    headers: {
      'user-agent': 'googlebot',
      cookie: 'foo=bar; hello=world'
    }
  })
})()

hide

type: string | string[]

Hide DOM elements matching the given CSS selectors.

;(async () => {
  const buffer = await browserless.screenshot(url.toString(), {
    hide: ['.crisp-client', '#cookies-policy']
  })
})()

This sets visibility: hidden on the matched elements.

html

type: string

If you provide HTML markup, page.setContent is used instead of fetching the content from the target URL.

javascript

type: boolean default: true

When it's false, it disables JavaScript on the current page.

mediaType

type: string default: 'screen'

Changes the CSS media type of the page using page.emulateMediaType.

modules

type: string | string[]

Injects <script type="module"> into the browser page.

It can accept:

  • Absolute URLs (e.g., 'https://cdn.jsdelivr.net/npm/@microlink/[email protected]/src/browser.js').
  • Local file (e.g., 'local-file.js').
  • Inline code (e.g., "document.body.style.backgroundColor = 'red'").

;(async () => {
  const buffer = await browserless.screenshot(url.toString(), {
    modules: [
      'https://cdn.jsdelivr.net/npm/@microlink/[email protected]/src/browser.js',
      'local-file.js',
      "document.body.style.backgroundColor = 'red'"
    ]
  })
})()

remove

type: string | string[]

Remove DOM elements matching the given CSS selectors.

;(async () => {
  const buffer = await browserless.screenshot(url.toString(), {
    remove: ['.crisp-client', '#cookies-policy']
  })
})()

This sets display: none on the matched elements, so it could potentially break the website layout.
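The difference between hide and remove comes down to the CSS rule applied to the matched selectors. The sketch below builds both rules as strings; `hideCss` and `removeCss` are hypothetical helpers, and the use of `!important` is an assumption rather than something stated here.

```javascript
// Accepts a single selector or an array of selectors.
const toArray = value => [].concat(value)

// hide: the element keeps its box, so the layout is preserved.
const hideCss = selectors =>
  `${toArray(selectors).join(', ')} { visibility: hidden !important; }`

// remove: the element is taken out of the flow, so the layout may shift.
const removeCss = selectors =>
  `${toArray(selectors).join(', ')} { display: none !important; }`

console.log(hideCss(['.crisp-client', '#cookies-policy']))
console.log(removeCss('.crisp-client'))
```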

colorScheme

type: string default: 'no-preference'

Sets prefers-color-scheme CSS media feature, used to detect if the user has requested the system use a 'light' or 'dark' color theme.

scripts

type: string | string[]

Injects <script> into the browser page.

It can accept:

  • Absolute URLs (e.g., 'https://cdn.jsdelivr.net/npm/@microlink/[email protected]/src/browser.js').
  • Local file (e.g., 'local-file.js').
  • Inline code (e.g., "document.body.style.backgroundColor = 'red'").

;(async () => {
  const buffer = await browserless.screenshot(url.toString(), {
    scripts: [
      'https://cdn.jsdelivr.net/npm/[email protected]/dist/jquery.min.js',
      'local-file.js',
      "document.body.style.backgroundColor = 'red'"
    ]
  })
})()

Prefer to use modules whenever possible.
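Since modules, scripts and styles each accept a URL, a local file or inline code, the input has to be told apart somehow. The sketch below shows one plausible heuristic; `classifySource` is a hypothetical helper, the URL is a placeholder, and the rules browserless actually applies are not documented here.

```javascript
// Hypothetical classifier for a modules/scripts/styles entry.
function classifySource (value) {
  try {
    const { protocol } = new URL(value)
    if (protocol === 'http:' || protocol === 'https:') return 'url'
  } catch (_) {
    // Not an absolute URL; fall through to the other checks.
  }
  // Assumption: a path-like string ending in .js, .mjs or .css is a local file.
  return /^[\w./-]+\.(js|mjs|css)$/.test(value) ? 'file' : 'inline'
}

console.log(classifySource('https://example.com/script.js'))
console.log(classifySource('local-file.js'))
console.log(classifySource("document.body.style.backgroundColor = 'red'"))
```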

scroll

type: string | object

Scroll to the DOM element matching the given CSS selector.

styles

type: string | string[]

Injects <style> into the browser page.

It can accept:

  • Absolute URLs (e.g., 'https://cdn.jsdelivr.net/npm/[email protected]/dist/dark.css').
  • Local file (e.g., 'local-file.css').
  • Inline code (e.g., "body { background: red; }").

;(async () => {
  const buffer = await browserless.screenshot(url.toString(), {
    styles: [
      'https://cdn.jsdelivr.net/npm/[email protected]/dist/dark.css',
      'local-file.css',
      'body { background: red; }'
    ]
  })
})()

timezone

type: string

It changes the timezone of the page.

url

type: string

The target URL.

viewport

It sets up a custom viewport using the page.setViewport method.

waitForSelector

type: string

Waits for the given CSS selector to appear using page.waitForSelector.

waitForTimeout

type: number

Waits for the given amount of time (in milliseconds) using page.waitForTimeout.

waitUntil

type: string | string[] default: 'auto' values: 'auto' | 'load' | 'domcontentloaded' | 'networkidle0' | 'networkidle2'

When to consider navigation succeeded.

If you provide an array of event strings, navigation is considered to be successful after all events have been fired.

Events can be either:

  • 'auto': A combination of 'load' and 'networkidle2' that waits only the minimum time necessary.
  • 'load': Consider navigation to be finished when the load event is fired.
  • 'domcontentloaded': Consider navigation to be finished when the DOMContentLoaded event is fired.
  • 'networkidle0': Consider navigation to be finished when there are no more than 0 network connections for at least 500 ms.
  • 'networkidle2': Consider navigation to be finished when there are no more than 2 network connections for at least 500 ms.
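Because waitUntil accepts either a single event or an array of events, a small normalization and validation step is a natural fit. `normalizeWaitUntil` below is a hypothetical helper written for illustration; it only checks the values listed above and does not reproduce how 'auto' is resolved internally.

```javascript
const EVENTS = ['auto', 'load', 'domcontentloaded', 'networkidle0', 'networkidle2']

// Hypothetical helper: always returns an array of known event names.
function normalizeWaitUntil (waitUntil = 'auto') {
  const events = [].concat(waitUntil)
  const unknown = events.filter(event => !EVENTS.includes(event))
  if (unknown.length > 0) {
    throw new TypeError(`Unknown waitUntil value: ${unknown.join(', ')}`)
  }
  return events
}

console.log(normalizeWaitUntil())
console.log(normalizeWaitUntil(['load', 'networkidle2']))
```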

.page()

It returns a new standalone browser page.

const browserless = require('browserless')()

;(async () => {
  const page = await browserless.page()
})()

Command Line Interface

You can run any browserless method from the command line by installing @browserless/cli globally:

$ npm install @browserless/cli --global

Alternatively, you can run it on demand using npx:

npx @browserless/cli --help

That's useful in CI/CD scenarios.

Pool of Instances

browserless internally uses a singleton browser instance.

If you want to keep multiple browsers open, you can use the @browserless/pool package.

const createBrowserless = require('@browserless/pool')
const onExit = require('signal-exit')

const browserlessPool = createBrowserless({
  max: 2, // max browsers to keep open
  timeout: 30000 // max time a browser is considered fresh
})

// shut down the pool gracefully on process exit.
onExit(() => browserlessPool.drain().then(() => browserlessPool.clear()))

You can still pass specific puppeteer options as the second argument:

const createBrowserless = require('@browserless/pool')

const browserlessPool = createBrowserless(
  {
    max: 2, // max browsers to keep open
    timeout: 30000 // max time a browser is considered fresh
  },
  {
    ignoreHTTPSErrors: true,
    args: ['--disable-gpu', '--single-process', '--no-zygote', '--no-sandbox', '--hide-scrollbars']
  }
)

After that, the API is the same as browserless:

browserlessPool.screenshot('http://example.com', { device: 'iPhone 6' }).then(buffer => {
  console.log('your screenshot is here!')
})

Every time you call the pool, it handles acquiring and releasing a browser instance from the pool ✨.
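The acquire and release cycle can be sketched with a tiny generic pool. This is an illustration only, not the @browserless/pool implementation: `createPool` and `use` are hypothetical names, and the "browser" is a plain object so the example runs without Chrome.

```javascript
// Hypothetical minimal pool: at most `max` resources exist at once.
function createPool ({ max, create }) {
  const idle = []
  const waiting = []
  let size = 0

  async function acquire () {
    if (idle.length > 0) return idle.pop()
    if (size < max) {
      size++
      return create()
    }
    // Pool exhausted: wait until another caller releases a resource.
    return new Promise(resolve => waiting.push(resolve))
  }

  function release (resource) {
    const next = waiting.shift()
    if (next) next(resource)
    else idle.push(resource)
  }

  // Acquire, run the job, and always release afterwards.
  async function use (fn) {
    const resource = await acquire()
    try {
      return await fn(resource)
    } finally {
      release(resource)
    }
  }

  return { use }
}

let created = 0
const pool = createPool({ max: 2, create: async () => ({ id: ++created }) })

const jobs = [1, 2, 3, 4].map(n =>
  pool.use(async browser => {
    await new Promise(resolve => setTimeout(resolve, 10))
    return `job ${n} ran on browser ${browser.id}`
  })
)

const results = Promise.all(jobs)
results.then(lines => console.log(lines.join('\n')))
```

Four jobs run, but only two stand-in browsers are ever created; the third and fourth jobs wait for a release.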

Lighthouse

browserless has a Lighthouse integration that uses Puppeteer under the hood.

const lighthouse = require('@browserless/lighthouse')

lighthouse('https://browserless.js.org').then(report => {
  console.log(JSON.stringify(report, null, 2))
})

.lighthouse(url, options)

It generates a report for the target URL, extending the lighthouse:default settings, which are the same settings used by the Google Chrome Audits reports in Developer Tools.

options

The following options are used by default:

{
  logLevel: 'error',
  output: 'json',
  device: 'desktop',
  onlyCategories: ['performance', 'best-practices', 'accessibility', 'seo']
}

See Lighthouse configuration to know all the options and values supported.
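To show how the defaults above interact with user-supplied options, here is a minimal sketch of a shallow merge. `resolveOptions` is a hypothetical helper, and whether the package merges options exactly this way is an assumption.

```javascript
// Defaults as listed in this section.
const DEFAULTS = {
  logLevel: 'error',
  output: 'json',
  device: 'desktop',
  onlyCategories: ['performance', 'best-practices', 'accessibility', 'seo']
}

// Hypothetical helper: user options override defaults key by key.
const resolveOptions = (opts = {}) => ({ ...DEFAULTS, ...opts })

console.log(resolveOptions({ device: 'mobile', onlyCategories: ['seo'] }))
```

Note that a shallow merge replaces `onlyCategories` entirely instead of concatenating it with the default list.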

Additionally, you can set up:

getBrowserless

type: function default: require('browserless')

The browserless instance to use for getting the browser.

logLevel

type: string default: 'error' values: 'silent' | 'error' | 'info' | 'verbose'

The level of logging to enable.

output

type: string | string[] default: 'json' values: 'json' | 'csv' | 'html'

The type(s) of report output to be produced.

device

type: string default: 'desktop' values: 'desktop' | 'mobile' | 'none'

How emulation (useragent, device screen metrics, touch) should be applied. 'none' indicates Lighthouse should leave the host browser as-is.

onlyCategories

type: string[] | null default: ['performance', 'best-practices', 'accessibility', 'seo'] values: 'performance' | 'best-practices' | 'accessibility' | 'pwa' | 'seo'

Includes only the specified categories in the final report.

Packages

browserless is internally divided into multiple packages, so you use only the minimum amount of code necessary for your use case.

  • browserless
  • @browserless/benchmark
  • @browserless/cli
  • @browserless/devices
  • @browserless/examples
  • @browserless/errors
  • @browserless/function
  • @browserless/goto
  • @browserless/pdf
  • @browserless/pool
  • @browserless/screenshot
  • @browserless/lighthouse

Benchmark

For testing different approaches, we included a tiny benchmark tool called @browserless/benchmark.

FAQ

Q: Why use browserless over puppeteer?

browserless does not replace puppeteer; it complements it. It's just a syntactic sugar layer over official Headless Chrome, oriented toward production scenarios.

Q: Why do you block ad scripts by default?

Headless navigation is expensive compared with just fetching the content from a website.

In order to speed up the process, we block ad scripts by default because they are so bloated.

Q: My output is different from the expected

Probably browserless was too smart and it blocked a request that you need.

You can activate debug mode using the DEBUG=browserless environment variable to see what is happening behind the scenes:

DEBUG=browserless node index.js

Consider opening an issue with the debug trace.

Q: I want to use browserless with my AWS Lambda-like project

Check chrome-aws-lambda to set up AWS Lambda with a compatible binary.

License

browserless © Microlink, Released under the MIT License.
Authored and maintained by Kiko Beats with help from contributors.

The logo has been designed by xinh studio.

microlink.io · GitHub @MicrolinkHQ · Twitter @microlinkhq
