
hrbrmstr / htmlunit

License: Apache-2.0
🕸🧰☕️Tools to Scrape Dynamic Web Content via the 'HtmlUnit' Java Library

Programming Languages

  • R
  • Java
  • Makefile

Projects that are alternatives to or similar to htmlunit

reapr
🕸→ℹ️ Reap Information from Websites
Stars: ✭ 14 (-64.1%)
Mutual labels:  web-scraping, r-cyber
wayback
⏪ Tools to Work with the Various Internet Archive Wayback Machine APIs
Stars: ✭ 52 (+33.33%)
Mutual labels:  web-scraping, r-cyber
Neural-Scam-Artist
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
Stars: ✭ 18 (-53.85%)
Mutual labels:  web-scraping
selectorlib
A library to read a YAML file with XPath or CSS selectors and extract data from HTML pages using them
Stars: ✭ 53 (+35.9%)
Mutual labels:  web-scraping
ioweb
Web Scraping Framework
Stars: ✭ 31 (-20.51%)
Mutual labels:  web-scraping
BookingScraper
🌎 🏨 Scrape Booking.com 🏨 🌎
Stars: ✭ 68 (+74.36%)
Mutual labels:  web-scraping
TikTokDownloader PyWebIO
🚀 "Douyin_TikTok_Download_API" is an out-of-the-box, high-performance asynchronous Douyin/TikTok data-scraping tool that supports API calls, online batch parsing, and downloading.
Stars: ✭ 919 (+2256.41%)
Mutual labels:  web-scraping
web-poet
Web scraping Page Objects core library
Stars: ✭ 67 (+71.79%)
Mutual labels:  web-scraping
Data-Wrangling-with-Python
Simplify your ETL processes with these hands-on data sanitation tips, tricks, and best practices
Stars: ✭ 90 (+130.77%)
Mutual labels:  web-scraping
saveddit
Bulk Downloader for Reddit
Stars: ✭ 130 (+233.33%)
Mutual labels:  web-scraping
mhn
🍯 Analyze and Visualize Data from Modern Honey Network Servers with R
Stars: ✭ 16 (-58.97%)
Mutual labels:  r-cyber
Stock-Market-Predictor
Stock Market Predictor with LSTM network. Web scraping and analyzing tools (ohlc, mean)
Stars: ✭ 28 (-28.21%)
Mutual labels:  web-scraping
scrapy-wayback-machine
A Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.
Stars: ✭ 92 (+135.9%)
Mutual labels:  web-scraping
Python
Covers Python from basic to advanced topics, practice questions, logical problems in Python, and web development using HTML, CSS, Bootstrap, jQuery, DOM, and Django 🚀🚀 💥 🌈
Stars: ✭ 29 (-25.64%)
Mutual labels:  web-scraping
trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Stars: ✭ 711 (+1723.08%)
Mutual labels:  web-scraping
rymscraper
Python API to extract data from rateyourmusic.com.
Stars: ✭ 63 (+61.54%)
Mutual labels:  web-scraping
codepen-puppeteer
Use Puppeteer to download pens from Codepen.io as single html pages
Stars: ✭ 22 (-43.59%)
Mutual labels:  web-scraping
gdns
Tools to work with the Google DNS over HTTPS API in R
Stars: ✭ 23 (-41.03%)
Mutual labels:  r-cyber
grailer
web scraping tool for grailed.com
Stars: ✭ 30 (-23.08%)
Mutual labels:  web-scraping
cl-torrents
Searching torrents on popular trackers - CLI, readline, GUI, web client. Tutorial and binaries (issue tracker on https://gitlab.com/vindarel/cl-torrents/)
Stars: ✭ 83 (+112.82%)
Mutual labels:  web-scraping

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

htmlunit

Tools to Scrape Dynamic Web Content via the ‘HtmlUnit’ Java Library

Description

‘HtmlUnit’ (https://htmlunit.sourceforge.net/) is a “‘GUI’-Less browser for ‘Java’ programs”. It models ‘HTML’ documents and provides an ‘API’ that lets one invoke pages, fill out forms, click links, and more, just as one does in a “normal” browser. The library has fairly good and constantly improving ‘JavaScript’ support and is able to work even with quite complex ‘AJAX’ libraries, simulating ‘Chrome’, ‘Firefox’ or ‘Internet Explorer’ depending on the configuration used. It is typically used for testing purposes or to retrieve information from web sites. This package provides tools to work with that library at a higher level than the raw ‘Java’ classes exposed by the ‘htmlunitjars’ package.
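
For a quick sense of the flow, here's a minimal sketch using hu_read_html() (demonstrated in Usage below) to render a page in a JavaScript-capable context and hand the result to {rvest}:

library(htmlunit)
library(rvest)

# render the page with JavaScript support, then parse as usual
pg <- hu_read_html("https://hrbrmstr.github.io/htmlunitjars/index.html")
html_table(pg) # works even when the table is built client-side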

What’s Inside The Tin

The following functions are implemented (a short combined sketch follows the DSL list):

DSL

  • web_client/webclient: Create a new HtmlUnit WebClient instance

  • wc_go: Visit a URL

  • wc_html_nodes: Select nodes from the web client's active page HTML content

  • wc_html_text: Extract text from webclient page HTML content

  • wc_html_attr: Extract attributes from webclient page HTML content

  • wc_html_name: Extract the tag name from webclient page HTML content

  • wc_headers: Return response headers of the last web request for current page

  • wc_browser_info: Retrieve information about the browser used to create the ‘webclient’

  • wc_content_length: Return content length of the last web request for current page

  • wc_content_type: Return content type of web request for current page

  • wc_render: Retrieve current page contents

  • wc_css: Enable/Disable CSS support

  • wc_dnt: Enable/Disable Do-Not-Track

  • wc_geo: Enable/Disable Geolocation

  • wc_img_dl: Enable/Disable Image Downloading

  • wc_load_time: Return load time of the last web request for current page

  • wc_resize: Resize the virtual browser window

  • wc_status: Return status code of web request for current page

  • wc_timeout: Change default request timeout

  • wc_title: Return page title for current page

  • wc_url: Return the URL of the current page

  • wc_use_insecure_ssl: Enable/Disable Ignoring SSL Validation Issues

  • wc_wait: Block until all background JavaScript tasks have finished executing, so HtmlUnit can complete final rendering
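
Here's a hedged sketch of how these verbs might combine. It assumes the configuration verbs return the client so calls chain, and the argument names/values are illustrative rather than confirmed signatures:

library(htmlunit)
library(magrittr) # for %>%

web_client(emulate = "chrome") %>%  # browser emulation (shown in Usage below)
  wc_timeout(30) %>%                # illustrative value; units are an assumption
  wc_dnt(TRUE) %>%                  # turn on Do-Not-Track
  wc_img_dl(FALSE) %>%              # skip image downloads for speed
  wc_go("https://usa.gov/") %>%     # visit a URL
  wc_title()                        # title of the current page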

Just the Content (pls)

  • hu_read_html: Read HTML from a URL with Browser Emulation & in a JavaScript Context

Content++

  • wc_inspect: Perform a “Developer Tools”-like Network Inspection of a URL

Installation

install.packages("htmlunit", repos = c("https://cinc.rud.is", "https://cloud.r-project.org/"))
# or
remotes::install_git("https://git.rud.is/hrbrmstr/htmlunit.git")
# or
remotes::install_git("https://git.sr.ht/~hrbrmstr/htmlunit")
# or
remotes::install_gitlab("hrbrmstr/htmlunit")
# or
remotes::install_bitbucket("hrbrmstr/htmlunit")
# or
remotes::install_github("hrbrmstr/htmlunit")

NOTE: To use the ‘remotes’ install options you will need to have the {remotes} package installed.

Usage

library(htmlunit)
library(rvest)     # for html_table(); not attached by tidyverse
library(tidyverse) # for some data ops; not req'd for pkg

# current version
packageVersion("htmlunit")
## [1] '0.5.0'

Something xml2::read_html() cannot do: read the table from https://hrbrmstr.github.io/htmlunitjars/index.html:

test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html"

pg <- xml2::read_html(test_url)

html_table(pg)
## list()

☹️

But, hu_read_html() can!

pg <- hu_read_html(test_url)

html_table(pg)
## [[1]]
##      X1   X2
## 1   One  Two
## 2 Three Four
## 3  Five  Six

All without needing a separate Selenium or Splash server instance.

Content++

We can also get a HAR-like content + metadata dump:

xdf <- wc_inspect("https://rstudio.com")

colnames(xdf)
## [1] "method"         "url"            "status_code"    "message"        "content"        "content_length"
## [7] "content_type"   "load_time"      "headers"

select(xdf, method, url, status_code, content_length, load_time)
## # A tibble: 36 x 5
##    method url                                                                       status_code content_length load_time
##    <chr>  <chr>                                                                           <int>          <dbl>     <dbl>
##  1 GET    https://rstudio.com/                                                              200          14621       495
##  2 GET    https://metadata-static-files.sfo2.cdn.digitaloceanspaces.com/pixel/lp.js         200           3576       221
##  3 GET    https://snap.licdn.com/li.lms-analytics/insight.min.js                            200           1576       162
##  4 GET    https://connect.facebook.net/en_US/fbevents.js                                    200          34269       138
##  5 GET    https://connect.facebook.net/signals/config/151855192184380?v=2.9.23&r=s…         200         134841        66
##  6 GET    https://munchkin.marketo.net/munchkin-beta.js                                     200            752       230
##  7 GET    https://munchkin.marketo.net/159/munchkin.js                                      200           4810        27
##  8 GET    https://x.clearbitjs.com/v1/pk_60c5aa2221e3c03eca10fb6876aa6df7/clearbit…         200          86568       483
##  9 GET    https://cdn.segment.com/analytics.js/v1/gO0uTGfCkO4DQpfkRim9mBsjdKrehtnu…         200          62860       243
## 10 GET    https://static.hotjar.com/c/hotjar-1446157.js?sv=6                                200           1708       212
## # … with 26 more rows

group_by(xdf, content_type) %>% 
  summarise(
    total_size = sum(content_length), 
    total_load_time = sum(load_time)/1000
  )
## # A tibble: 7 x 3
##   content_type             total_size total_load_time
##   <chr>                         <dbl>           <dbl>
## 1 application/javascript       431338           2.58 
## 2 application/json               4118           1.37 
## 3 application/x-javascript     176248           0.623
## 4 image/gif                        35           0.232
## 5 text/html                     16640           1.36 
## 6 text/javascript              254971           0.996
## 7 text/plain                       28           0.189
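
Because wc_inspect() returns one row per request with the columns shown above, ordinary dplyr verbs (attached via the tidyverse earlier) apply directly; for example, to find the slowest successful requests:

# slowest successful requests (uses only columns shown above)
xdf %>%
  filter(status_code == 200) %>%
  arrange(desc(load_time)) %>%
  select(url, content_length, load_time) %>%
  head(5)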

DSL

wc <- web_client(emulate = "chrome")

wc %>% wc_browser_info()
## < Netscape / 5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36 / en-US >

wc <- web_client()

wc %>% wc_go("https://usa.gov/")

# if you want to use purrr::map_ functions, the result of wc_html_nodes() needs to be passed to as.list() (a sketch follows the href example below)

wc %>%
  wc_html_nodes("a") %>%
  sapply(wc_html_text, trim = TRUE) %>% 
  head(10)
##  [1] "Skip to main content"               ""                                   "Español"                           
##  [4] "1-844-USA-GOV1"                     "All Topics and Services"            "About the U.S."                    
##  [7] "American Flag"                      "Branches of the U.S. Government"    "Budget of the U.S. Government"     
## [10] "Data and Statistics about the U.S."

wc %>%
  wc_html_nodes(xpath=".//a") %>%
  sapply(wc_html_text, trim = TRUE) %>% 
  head(10)
##  [1] "Skip to main content"               ""                                   "Español"                           
##  [4] "1-844-USA-GOV1"                     "All Topics and Services"            "About the U.S."                    
##  [7] "American Flag"                      "Branches of the U.S. Government"    "Budget of the U.S. Government"     
## [10] "Data and Statistics about the U.S."

wc %>%
  wc_html_nodes(xpath=".//a") %>%
  sapply(wc_html_attr, "href") %>% 
  head(10)
##  [1] "#content"                "/"                       "/espanol/"               "/phone"                 
##  [5] "/#tpcs"                  "#"                       "/flag"                   "/branches-of-government"
##  [9] "/budget"                 "/statistics"
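
Per the note above, the purrr variant just needs as.list() around the node set:

# purrr equivalent of the sapply() calls above; as.list() is required
wc %>%
  wc_html_nodes("a") %>%
  as.list() %>%
  purrr::map_chr(wc_html_text, trim = TRUE) %>%
  head(10)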

Handy function to get rendered plain text for text mining:

wc %>% 
  wc_render("text") %>% 
  substr(1, 300) %>% 
  cat()
## Official Guide to Government Information and Services | USAGov
## Skip to main content
## An official website of the United States government Here's how you know
## 
## 
## Main Navigation
## Search
## Search
## Search
## 1-844-USA-GOV1
## All Topics and Services
## Benefits, Grants, Loans
## Government Agencies and Elected Officials

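Since the rendered text is meant for text mining, here's a small base-R follow-on sketch that turns it into crude word counts:

# crude word-frequency table from the rendered text (base R only)
txt <- wc %>% wc_render("text")
words <- tolower(unlist(strsplit(txt, "[^[:alpha:]]+")))
head(sort(table(words[nchar(words) > 3]), decreasing = TRUE), 10)
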
htmlunit Metrics

Lang    # Files   (%)    LoC    (%)   Blank lines   (%)   # Lines   (%)
R            14  0.70    341   0.72           188  0.70       377  0.82
Java          3  0.15     52   0.11            23  0.09         3  0.01
Rmd           1  0.05     41   0.09            52  0.19        75  0.16
Maven         1  0.05     30   0.06             0  0.00         1  0.00
make          1  0.05     10   0.02             4  0.01         4  0.01


Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
