tidyverse / Rvest

Licence: other
Simple web scraping for R

Projects that are alternatives of or similar to Rvest

Coolqlcool
Nextjs server to query websites with GraphQL
Stars: ✭ 623 (-50.28%)
Mutual labels:  web-scraping
Actor Google Search Scraper
Apify actor that crawls Google Search result pages (SERPs) and extracts a list of organic results, ads, related queries and more. It supports selection of custom country, language and location.
Stars: ✭ 38 (-96.97%)
Mutual labels:  web-scraping
Cascadia
Go cascadia package command line CSS selector
Stars: ✭ 67 (-94.65%)
Mutual labels:  web-scraping
Spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Stars: ✭ 656 (-47.65%)
Mutual labels:  web-scraping
Snoop
Snoop: a reconnaissance tool based on open data (OSINT world)
Stars: ✭ 886 (-29.29%)
Mutual labels:  web-scraping
Scrapy Craigslist
Web Scraping Craigslist's Engineering Jobs in NY with Scrapy
Stars: ✭ 54 (-95.69%)
Mutual labels:  web-scraping
Scrapy Fake Useragent
Random User-Agent middleware based on fake-useragent
Stars: ✭ 520 (-58.5%)
Mutual labels:  web-scraping
Reader
Extract clean(er), readable text from web pages via Mercury Web Parser.
Stars: ✭ 75 (-94.01%)
Mutual labels:  web-scraping
Uc Davis Cs Exams Analysis
📈 Regression and Classification with UC Davis student quiz data and exam data
Stars: ✭ 33 (-97.37%)
Mutual labels:  web-scraping
Decapitated
Headless 'Chrome' Orchestration in R
Stars: ✭ 65 (-94.81%)
Mutual labels:  web-scraping
Youtube tutorials
Collection of scripts corresponding to LucidProgramming YouTube tutorials
Stars: ✭ 769 (-38.63%)
Mutual labels:  web-scraping
Webmiddle
Node.js framework for modular web scraping and data extraction
Stars: ✭ 13 (-98.96%)
Mutual labels:  web-scraping
Instago
Download/access photos, videos, stories, story highlights, postlives, following and followers of Instagram
Stars: ✭ 59 (-95.29%)
Mutual labels:  web-scraping
Faster Than Requests
Faster requests on Python 3
Stars: ✭ 639 (-49%)
Mutual labels:  web-scraping
Arachnid
Powerful web scraping framework for Crystal
Stars: ✭ 68 (-94.57%)
Mutual labels:  web-scraping
Pythoncode Tutorials
The Python Code Tutorials
Stars: ✭ 544 (-56.58%)
Mutual labels:  web-scraping
Project Tauro
A Router WiFi key recovery/cracking tool with a twist.
Stars: ✭ 52 (-95.85%)
Mutual labels:  web-scraping
Detect Cms
PHP Library for detecting CMS
Stars: ✭ 78 (-93.77%)
Mutual labels:  web-scraping
Ping Sm
Receive an email or Telegram message as soon as Migros Sanalmarket is available for delivery in your neighborhood.
Stars: ✭ 71 (-94.33%)
Mutual labels:  web-scraping
Social Media Profile Scrapers
Fetch user's data across social media
Stars: ✭ 60 (-95.21%)
Mutual labels:  web-scraping

rvest

Overview

rvest helps you scrape (or harvest) data from web pages. It is designed to work with magrittr so that common web scraping tasks are easy to express, and it is inspired by libraries like Beautiful Soup and RoboBrowser.

If you’re scraping multiple pages, I highly recommend using rvest in concert with polite. The polite package ensures that you’re respecting the site’s robots.txt and not hammering the site with too many requests.
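As a rough illustration of that pairing, the sketch below wraps a polite fetch in a small helper. The helper name `scrape_headings`, its `delay` default, and the choice of `<h2>` as the target are assumptions made for this example, not part of rvest or polite; it relies only on `polite::bow()` and `polite::scrape()`.

```r
library(rvest)

# Hypothetical helper (not part of rvest or polite): fetch one page
# politely and return the text of its <h2> elements.
scrape_headings <- function(url, delay = 5) {
  # bow() introduces the scraper to the host and checks its robots.txt
  session <- polite::bow(url, delay = delay)

  # scrape() fetches the page at the agreed rate; it returns NULL
  # when the path may not be scraped
  page <- polite::scrape(session)
  if (is.null(page)) {
    return(character())
  }

  page %>%
    html_elements("h2") %>%
    html_text2()
}
```

Calling `scrape_headings()` with a page URL needs network access and the polite package installed, so no output is shown here.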

Installation

# The easiest way to get rvest is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just rvest:
install.packages("rvest")

Usage

library(rvest)

# Start by reading an HTML page with read_html():
starwars <- read_html("https://rvest.tidyverse.org/articles/starwars.html")

# Then find elements that match a CSS selector or XPath expression
# using html_elements(). In this example, each <section> corresponds
# to a different film.
films <- starwars %>% html_elements("section")
films
#> {xml_nodeset (7)}
#> [1] <section><h2 data-id="1">\nThe Phantom Menace\n</h2>\n<p>\nReleased: 1999 ...
#> [2] <section><h2 data-id="2">\nAttack of the Clones\n</h2>\n<p>\nReleased: 20 ...
#> [3] <section><h2 data-id="3">\nRevenge of the Sith\n</h2>\n<p>\nReleased: 200 ...
#> [4] <section><h2 data-id="4">\nA New Hope\n</h2>\n<p>\nReleased: 1977-05-25\n ...
#> [5] <section><h2 data-id="5">\nThe Empire Strikes Back\n</h2>\n<p>\nReleased: ...
#> [6] <section><h2 data-id="6">\nReturn of the Jedi\n</h2>\n<p>\nReleased: 1983 ...
#> [7] <section><h2 data-id="7">\nThe Force Awakens\n</h2>\n<p>\nReleased: 2015- ...

# Then use html_element() to extract one element per film. Here
# the title is given by the text inside <h2>.
title <- films %>% 
  html_element("h2") %>% 
  html_text2()
title
#> [1] "The Phantom Menace"      "Attack of the Clones"   
#> [3] "Revenge of the Sith"     "A New Hope"             
#> [5] "The Empire Strikes Back" "Return of the Jedi"     
#> [7] "The Force Awakens"

# Or use html_attr() to get data out of attributes. html_attr() always
# returns a string so we convert it to an integer using a readr function
episode <- films %>% 
  html_element("h2") %>% 
  html_attr("data-id") %>% 
  readr::parse_integer()
episode
#> [1] 1 2 3 4 5 6 7
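The difference between html_elements() and html_element() is easiest to see when some parents lack the child being searched for. The following sketch runs offline against an inline page built with minimal_html(); the page content is made up for illustration, and the second section deliberately has no `<h2>`.

```r
library(rvest)

# Hypothetical inline page; the middle <section> has no <h2>
page <- minimal_html("
  <section><h2 data-id='1'>First</h2><p>Intro</p></section>
  <section><p>No heading here</p></section>
  <section><h2 data-id='3'>Third</h2></section>
")

sections <- page %>% html_elements("section")

# html_element() gives exactly one result per section, NA where missing
per_section <- sections %>%
  html_element("h2") %>%
  html_text2()

# html_elements() gives every match, so the empty section drops out
all_matches <- sections %>%
  html_elements("h2") %>%
  html_text2()

# Attributes follow the same one-per-input rule
ids <- sections %>%
  html_element("h2") %>%
  html_attr("data-id")
```

Because `per_section` and `ids` stay the same length as `sections`, they can be combined safely into columns of a data frame, whereas `all_matches` loses track of which section each heading came from.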

If the page contains tabular data, you can convert it directly to a data frame with html_table():

html <- read_html("https://en.wikipedia.org/w/index.php?title=The_Lego_Movie&oldid=998422565")

html %>% 
  html_element(".tracklist") %>% 
  html_table()
#> # A tibble: 29 x 4
#>    No.   Title                    `Performer(s)`                          Length
#>    <chr> <chr>                    <chr>                                   <chr> 
#>  1 1.    "\"Everything Is Awesom… "Tegan and Sara featuring The Lonely I… 2:43  
#>  2 2.    "\"Prologue\""           ""                                      2:28  
#>  3 3.    "\"Emmett's Morning\""   ""                                      2:00  
#>  4 4.    "\"Emmett Falls in Love… ""                                      1:11  
#>  5 5.    "\"Escape\""             ""                                      3:26  
#>  6 6.    "\"Into the Old West\""  ""                                      1:00  
#>  7 7.    "\"Wyldstyle Explains\"" ""                                      1:21  
#>  8 8.    "\"Emmett's Mind\""      ""                                      2:17  
#>  9 9.    "\"The Transformation\"" ""                                      1:46  
#> 10 10.   "\"Saloons and Wagons\"" ""                                      3:38  
#> # … with 19 more rows
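The same call can be tried without network access. This is a minimal sketch on a made-up table built with minimal_html(); the column names and rows are assumptions chosen for the example.

```r
library(rvest)

# Hypothetical inline table with a header row of <th> cells
page <- minimal_html("
  <table>
    <tr><th>no</th><th>title</th><th>length</th></tr>
    <tr><td>1</td><td>Prologue</td><td>2:28</td></tr>
    <tr><td>2</td><td>Escape</td><td>3:26</td></tr>
  </table>
")

# Selecting a single <table> element yields a single tibble
tracks <- page %>%
  html_element("table") %>%
  html_table()

# The header row becomes the column names
names(tracks)
```

By default html_table() also tries to convert column types, so `no` should come back numeric while `length` stays character because "2:28" is not a number.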

Code of Conduct

Please note that the rvest project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
