hrbrmstr / Decapitated

Licence: other

Headless 'Chrome' Orchestration in R


NOTE: I am putting my support behind the {crrri} package and you should, too.

decapitated

Headless ‘Chrome’ Orchestration

Description

The ‘Chrome’ browser https://www.google.com/chrome/ has a headless mode which can be instrumented programmatically. Tools are provided to perform headless ‘Chrome’ instrumentation on the command-line, including retrieving the JavaScript-executed web page, PDF output or a screenshot of a URL.

IMPORTANT

You'll need to set an environment variable HEADLESS_CHROME to use this package.

If this value is not set, a location heuristic is used on package start which looks for the following depending on the operating system:

Windows(32bit): C:/Program Files/Google/Chrome/Application/chrome.exe
Windows(64bit): C:/Program Files (x86)/Google/Chrome/Application/chrome.exe
macOS: /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome
Linux: /usr/bin/google-chrome

If a verification test fails, you will be notified.
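The lookup order described above can be sketched in a few lines of base R. This is an illustration only, not the package's actual code: the function name `guess_chrome_path` is hypothetical, and the paths are the ones listed above (written without the shell-style backslash escapes, which R strings do not need).

```r
# Hypothetical helper, not part of decapitated: illustrates the lookup order.
guess_chrome_path <- function(sysname = Sys.info()[["sysname"]]) {
  # An explicitly set HEADLESS_CHROME always wins.
  env_path <- Sys.getenv("HEADLESS_CHROME", unset = "")
  if (nzchar(env_path)) return(env_path)

  # Otherwise fall back to the per-OS default locations listed above.
  switch(
    sysname,
    Windows = c(
      "C:/Program Files/Google/Chrome/Application/chrome.exe",
      "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe"
    ),
    Darwin = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
    Linux = "/usr/bin/google-chrome",
    character(0) # unknown OS: no candidates
  )
}
```

A caller would still need to check that a returned candidate exists (for example with `file.exists()`) before trying to run it.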

It is HIGHLY recommended that you use decapitated::download_chromium() to download a standalone version of Chromium for your platform and use it with this package.

It's best to use ~/.Renviron to store this value.
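As a quick illustration using only base R, the variable can be set and checked for the current session as shown here. The path is just the Linux default from the list above, so substitute the location of your own Chrome or Chromium binary.

```r
# Set HEADLESS_CHROME for the current R session only.
# The path is an example; point it at your own Chrome/Chromium binary.
Sys.setenv(HEADLESS_CHROME = "/usr/bin/google-chrome")

# Confirm what the session now sees.
Sys.getenv("HEADLESS_CHROME")

## [1] "/usr/bin/google-chrome"
```

To make the setting persistent, the equivalent entry in ~/.Renviron is a single line of the form HEADLESS_CHROME=/usr/bin/google-chrome, which R reads at startup.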

Working around headless Chrome & OS security restrictions:

Security restrictions on various operating systems and OS configurations can cause headless Chrome execution to fail. As a result, headless Chrome operations should use a special directory for decapitated package operations. You can pass this in as work_dir. If work_dir is NULL a .rdecapdata directory will be created in your home directory and used for the data, crash dumps and utility directories for Chrome operations.

tempdir() does not always meet these requirements (after testing on various macOS 10.13 systems) as Chrome does some interesting attribute setting for some of its file operations.

If you pass in a work_dir, it must be one that does not violate OS security restrictions or headless Chrome will not function.
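The default behaviour described above (fall back to a .rdecapdata directory in the home directory when work_dir is NULL) might look roughly like the sketch below. The helper name `resolve_work_dir` and its `home` argument are hypothetical and only serve the illustration; the package's own logic may differ.

```r
# Hypothetical helper, not part of decapitated: resolves the working directory.
resolve_work_dir <- function(work_dir = NULL, home = path.expand("~")) {
  # No directory supplied: use .rdecapdata under the home directory.
  if (is.null(work_dir)) work_dir <- file.path(home, ".rdecapdata")

  # Create the directory (and any missing parents) if it is not there yet.
  if (!dir.exists(work_dir)) dir.create(work_dir, recursive = TRUE)

  # Return an absolute path.
  normalizePath(work_dir)
}
```

Note that this sketch does not check whether the chosen directory satisfies the OS security restrictions mentioned above; that still has to be verified on the system in question.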

Helping it “always work”

The three core functions have a prime parameter. In testing (again, especially on macOS), I noticed that the first one or two requests to a URL often resulted in an empty <body> response. I don’t use Chrome as my primary browser anymore so I’m not sure if that has something to do with it, but requests after the first one or two do return content. The prime parameter lets you specify TRUE, FALSE or a numeric value that will issue the URL retrieval multiple times before returning a result (or generating a PDF or PNG). This is a workaround until there is more granular control over the command-line execution of headless Chrome.
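The idea behind priming can be sketched as a generic wrapper. This is not the package's implementation: `fetch_with_priming` and its `fetch` argument are hypothetical, and the mapping of TRUE to exactly one warm-up request is an assumption made for the example.

```r
# Hypothetical illustration of priming; `fetch` stands in for any function
# that retrieves a URL. This is not the package's implementation.
fetch_with_priming <- function(fetch, url, prime = TRUE) {
  # Assumption: TRUE means one warm-up request, FALSE means none,
  # and a number means that many warm-up requests.
  n_warmups <- if (is.logical(prime)) as.integer(isTRUE(prime)) else as.integer(prime)

  # Issue the warm-up requests and discard their results.
  for (i in seq_len(n_warmups)) fetch(url)

  # Only the final request's result is returned.
  fetch(url)
}
```

With the package itself, the same effect is requested through the prime argument of the core functions, for example by passing prime = 2 to chrome_read_html().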

What’s in the tin?

The following functions are implemented:

CLI-based ops

download_chromium: Download a standalone version of Chromium (recommended)
chrome_dump_pdf: "Print" to PDF
chrome_read_html: Read a URL via headless Chrome and return the raw or rendered '<body>' 'innerHTML' DOM elements
chrome_shot: Capture a screenshot
chrome_version: Get Chrome version
get_chrome_env: get an environment variable 'HEADLESS_CHROME'
set_chrome_env: set an environment variable 'HEADLESS_CHROME'

`gepetto`-based ops

Helpers to get gepetto installed:

install_gepetto: Install gepetto
start_gepetto: Start/stop gepetto
stop_gepetto: Start/stop gepetto

API interface functions:

gepetto: Create a connection to a Gepetto API server
gep_active: Test whether the gepetto server is active
gep_debug: Get "debug-level" information of a running gepetto server
gep_render_har: Render a page in a javascript context and serialize to HAR
gep_render_html: Render a page in a javascript context and serialize to HTML
gep_render_magick: Render a page in a javascript context and take a screenshot
gep_render_pdf: Render a page in a javascript context and render to PDF

More information on gepetto is forthcoming but you can take a sneak peek here.

Installation

devtools::install_github("hrbrmstr/decapitated")

Usage

library(decapitated)

# current version
packageVersion("decapitated")

## [1] '0.3.0'

chrome_version()

chrome_read_html("http://httpbin.org/")

## {xml_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta http-equiv="content-type" valu ...
## [2] <body id="manpage">\n<a href="http://github.com/kennethreitz/httpbin"><img style="position: absolute; top: 0; rig ...

chrome_dump_pdf("http://httpbin.org/")

chrome_shot("http://httpbin.org/")

##   format width height colorspace filesize
## 1    PNG  1600   1200       sRGB   215680
