epiqueras / Getsy

Licence: MIT

A simple browser/client-side web scraper.

Programming Languages

typescript

Labels

browser scraper client-side web-scraper

Projects that are alternatives of or similar to Getsy

Linkedin-Client

Web scraper for grabbing data from LinkedIn profiles or company pages (personal project)

Stars: ✭ 42 (-82.35%)

Mutual labels: scraper, web-scraper

AzurLaneWikiScrapers

A console application that can scrape the Azur Lane wiki and export the data to Json files

Stars: ✭ 12 (-94.96%)

Mutual labels: scraper, web-scraper

yellowpages-scraper

Yellowpages.com Web Scraper written in Python and LXML to extract business details available based on a particular category and location.

Stars: ✭ 56 (-76.47%)

Mutual labels: scraper, web-scraper

Scrape Linkedin Selenium

`scrape_linkedin` is a python package that allows you to scrape personal LinkedIn profiles & company pages - turning the data into structured json.

Stars: ✭ 239 (+0.42%)

Mutual labels: scraper, web-scraper

Spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

Stars: ✭ 656 (+175.63%)

Mutual labels: scraper, web-scraper

Scrapers

A list of scrapers from around the web.

Stars: ✭ 366 (+53.78%)

Mutual labels: scraper, web-scraper

OLX Scraper

📻 An OLX Scraper using Scrapy + MongoDB. It Scrapes recent ads posted regarding requested product and dumps to NOSQL MONGODB.

Stars: ✭ 15 (-93.7%)

Mutual labels: scraper, web-scraper

Phpscraper

PHP Scraper - a highly opinionated web-interface for PHP

Stars: ✭ 148 (-37.82%)

Mutual labels: scraper, web-scraper

Navaid

A navigation aid (aka, router) for the browser in 850 bytes~!

Stars: ✭ 648 (+172.27%)

Mutual labels: client-side, browser

Awesome Crawler

A collection of awesome web crawlers and spiders in different languages

Stars: ✭ 4,793 (+1913.87%)

Mutual labels: scraper, web-scraper

Hat.sh

Encrypt and decrypt files in your browser. Fast, secure client-side file encryption and decryption using the Web Crypto API

Stars: ✭ 886 (+272.27%)

Mutual labels: client-side, browser

Goose Parser

Universal scraping tool, which allows you to extract data using multiple environments

Stars: ✭ 211 (-11.34%)

Mutual labels: scraper, browser

Vue.py

Pythonic Vue.js

Stars: ✭ 223 (-6.3%)

Mutual labels: client-side

Annie

👾 Fast and simple video download library and CLI tool written in Go

Stars: ✭ 16,369 (+6777.73%)

Mutual labels: scraper

Cargo

🚂🚋🚋 A browser with almost no UI.

Stars: ✭ 221 (-7.14%)

Mutual labels: browser

Webcompat.com

Source code for webcompat.com

Stars: ✭ 220 (-7.56%)

Mutual labels: browser

Use Ssr

☯️ React hook to determine if you are on the server, browser, or react native

Stars: ✭ 230 (-3.36%)

Mutual labels: browser

Chromely

Build HTML Desktop Apps on .NET/.NET Core/.NET 5 using native GUI, HTML5, JavaScript, CSS

Stars: ✭ 2,728 (+1046.22%)

Mutual labels: browser

Ruiji.net

crawler framework, distributed crawler extractor

Stars: ✭ 220 (-7.56%)

Mutual labels: scraper

Core

Midori Web Browser - a lightweight, fast and free web browser using WebKit and GTK+

Stars: ✭ 221 (-7.14%)

Mutual labels: browser

Getsy

A simple browser/client-side web scraper. Try it out in a REPL: http://www.getgetsy.com

TODOS:

[x] Support for websites with infinite scroll.

[ ] Support for websites with click pagination.

Installation options:

Run npm install --save getsy or yarn add getsy
Download the umd build and link it using a script tag

How to use:

This library exposes a single function: getsy(url: string, optionsObject?: options): Promise<Getsy>

parameters:

url: The url of the website you wish to scrape.
optionsObject(optional):
- corsProxy (optional string): The endpoint of the CORS proxy you wish to use. (See the CorsProxy section for more info)
- resolveURLs (optional boolean): Whether getsy should resolve all relative URLs in the resource to absolute URLs so they don't fail when loaded in another page. (defaults to true)
- iframe (optional boolean or object): A boolean, or an object with width and height properties, indicating whether getsy should start in iframe mode. Iframe mode waits for the resource to be mounted in a hidden iframe so you can extract more data through pagination or infinite scrolling. (defaults to false)
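
To illustrate what the resolveURLs option is described as doing, here is a minimal sketch of resolving a relative URL against the page it came from, using the standard URL constructor. The helper name toAbsolute is hypothetical and is not part of getsy's API; how getsy performs the resolution internally is not shown here.

```typescript
// Hypothetical helper (not part of getsy): resolve a possibly-relative
// URL against the URL of the page it was found on.
function toAbsolute(href: string, pageUrl: string): string {
  try {
    // The standard URL constructor resolves `href` relative to `pageUrl`.
    return new URL(href, pageUrl).href
  } catch {
    // Leave values that cannot be parsed untouched.
    return href
  }
}

const page = 'https://example.com/docs/intro.html'
console.log(toAbsolute('/static/logo.png', page))
console.log(toAbsolute('../img/a.png', page))
console.log(toAbsolute('https://cdn.example.org/x.js', page))
```

Without this kind of rewriting, a relative path such as /static/logo.png would be requested from whatever page the scraped markup is mounted in rather than from the original site.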

The function returns a promise that resolves to a Getsy object on success and rejects if it was unable to load the requested page.
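
Because the promise rejects when the page cannot be loaded, callers may want to handle that case explicitly. The sketch below is one possible pattern, not the library's own code: the GetsyOptions and GetsyLike types are hypothetical shapes inferred from the parameter list above, and the getsy function is passed in as an argument so the snippet runs on its own.

```typescript
// Hypothetical option shape, inferred from the parameter descriptions.
interface GetsyOptions {
  corsProxy?: string
  resolveURLs?: boolean
  iframe?: boolean | { width: number; height: number }
}

// Minimal hypothetical stand-in for the resolved Getsy object.
interface GetsyLike {
  content: string
}

type GetsyFn = (url: string, options?: GetsyOptions) => Promise<GetsyLike>

// Returns the loaded object, or null if the promise rejected.
async function loadOrNull(
  getsyFn: GetsyFn,
  url: string,
  options?: GetsyOptions
): Promise<GetsyLike | null> {
  try {
    return await getsyFn(url, options)
  } catch (err) {
    console.error(`Could not load ${url}:`, err)
    return null
  }
}
```

In an application you would pass the imported getsy function as getsyFn, for example loadOrNull(getsy, 'https://example.com', { iframe: true }).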

Getsy objects have a getMe method for scraping the resource's contents. This method is just a wrapper over the jQuery function, so you can chain other jQuery methods on it. If you need the raw data, you can access its content property. (More on Getsy below)

Example (Promises):

import getsy from 'getsy'

getsy('https://en.wikipedia.org/wiki/"Hello,_World!"_program').then(myGetsy => {
  console.log(myGetsy.getMe('#firstHeading').text())
})

Example (Async/Await):

import getsy from 'getsy'

async function testing() {
  const myGetsy = await getsy('https://en.wikipedia.org/wiki/"Hello,_World!"_program')

  console.log(myGetsy.getMe('#firstHeading').text())
}

testing()

Here's how you might use it with a website that has infinite scrolling:

async function infiniteScrape() {
  const myGetsy = await getsy('http://scrollmagic.io/examples/advanced/infinite_scrolling.html', { iframe: true })
  
  console.log(`${myGetsy.getMe('.box1').length} boxes.`)
  
  const { succesfulTimes, totalRetries } = await myGetsy.scroll(10)
  
  console.log(`New content loaded ${succesfulTimes} times with ${totalRetries} total retries.`)
  console.log(`${myGetsy.getMe('.box1').length} boxes.`) // More content!
}

infiniteScrape()

The Getsy Object:

The Getsy object has the following properties and methods:

corsProxy: The same one passed from the options object or the default value.
content: The original string data received from the request.
iframe: A reference to its iframe element if in iframe mode.
iframeDoc: A reference to its iframe's document if in iframe mode.
getMe(selector: string): JQuery: Query the resource's DOM or the iframe if in iframe mode with a jQuery selector. Returns a JQuery object.
scroll(numberOfTimes: number, element?: HTMLElement, interval?: number, retries?: number): Promise<scrollResolve>: Scrolls to the bottom of an element (defaults to body) up to numberOfTimes times to load new data. The interval (defaults to 2000) is the time in milliseconds that Getsy waits before checking whether new content has loaded. If none has, it retries as many times as specified by retries (defaults to 5). If it runs out of retries without new content, it resolves the Promise early instead of waiting for the remaining numberOfTimes. Note: the retry count resets to 0 on every successful content load. Returns a Promise that resolves to an object with .succesfulTimes (the number of times new content was loaded; note the spelling) and .totalRetries.
hideFrame(): void: Hides the iframe if applicable.
showFrame(): void: Shows the iframe if applicable.
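
The retry behaviour described for scroll can be sketched as a small loop. This is an illustration of the description above under stated assumptions, not getsy's actual implementation: scrollSketch and its scrollAndCheck callback are hypothetical, and the callback stands in for "scroll to the bottom, wait for the interval, and report whether new content appeared".

```typescript
interface ScrollResult {
  succesfulTimes: number // spelled as in the library's result object
  totalRetries: number
}

// Hypothetical sketch of the documented retry logic.
async function scrollSketch(
  numberOfTimes: number,
  scrollAndCheck: () => Promise<boolean>,
  retries = 5
): Promise<ScrollResult> {
  let succesfulTimes = 0
  let totalRetries = 0
  let retriesLeft = retries

  while (succesfulTimes < numberOfTimes) {
    if (await scrollAndCheck()) {
      succesfulTimes++
      retriesLeft = retries // the retry budget resets after each successful load
    } else if (retriesLeft > 0) {
      retriesLeft--
      totalRetries++
    } else {
      break // out of retries: resolve early
    }
  }

  return { succesfulTimes, totalRetries }
}
```

With this shape, a page that stops producing new content ends the loop after the retry budget is used up, rather than after the full numberOfTimes.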

CorsProxy:

This library uses a CORS proxy to work around the browser's restrictions on cross-origin requests. If you don't provide one, it defaults to https://crossorigin.me/.
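
As a rough illustration of how a proxy endpoint like this is typically used, the sketch below prefixes the target URL with the proxy endpoint. It assumes the proxy accepts the target URL appended to its own address; proxiedUrl is a hypothetical helper, and whether getsy builds its request exactly this way is not stated here.

```typescript
// Default endpoint named in this README.
const DEFAULT_CORS_PROXY = 'https://crossorigin.me/'

// Hypothetical helper: build the URL that would be requested through the proxy,
// assuming the proxy expects the target URL appended to its endpoint.
function proxiedUrl(targetUrl: string, corsProxy: string = DEFAULT_CORS_PROXY): string {
  const prefix = corsProxy.endsWith('/') ? corsProxy : corsProxy + '/'
  return prefix + targetUrl
}

console.log(proxiedUrl('https://example.com/page'))
console.log(proxiedUrl('https://example.com/page', 'https://my-proxy.test'))
```

Passing your own endpoint through the corsProxy option is useful when the default proxy is unavailable or rate limited.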

Some node CorsProxy servers:
