

apify / Actor Page Analyzer

Apify actor that opens a web page in headless Chrome and analyzes the HTML and JavaScript objects, looks for schema.org microdata and JSON-LD metadata, analyzes AJAX requests, etc.

Programming Languages

javascript


Labels

headless-chrome web-scraping

Projects that are alternatives of or similar to Actor Page Analyzer

Apify Js

Apify SDK — The scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.

Stars: ✭ 3,154 (+2443.55%)

Mutual labels: web-scraping, headless-chrome

codepen-puppeteer

Use Puppeteer to download pens from Codepen.io as single html pages

Stars: ✭ 22 (-82.26%)

Mutual labels: web-scraping, headless-chrome

Ayakashi

⚡️ Ayakashi.io - The next generation web scraping framework

Stars: ✭ 117 (-5.65%)

Mutual labels: web-scraping, headless-chrome

Decapitated

Headless 'Chrome' Orchestration in R

Stars: ✭ 65 (-47.58%)

Mutual labels: web-scraping, headless-chrome

Rod

A Devtools driver for web automation and scraping

Stars: ✭ 1,392 (+1022.58%)

Mutual labels: web-scraping

Rvest

Simple web scraping for R

Stars: ✭ 1,253 (+910.48%)

Mutual labels: web-scraping

Reader

Extract clean(er), readable text from web pages via Mercury Web Parser.

Stars: ✭ 75 (-39.52%)

Mutual labels: web-scraping

Ping Sm

Receive an email or Telegram message as soon as Migros Sanalmarket is available for delivery in your neighborhood.

Stars: ✭ 71 (-42.74%)

Mutual labels: web-scraping

Cri

Type safe go bindings to interact with chrome remote interface.

Stars: ✭ 119 (-4.03%)

Mutual labels: headless-chrome

Dat8

General Assembly's 2015 Data Science course in Washington, DC

Stars: ✭ 1,516 (+1122.58%)

Mutual labels: web-scraping

Sillynium

Automate the creation of Python Selenium Scripts by drawing coloured boxes on webpage elements

Stars: ✭ 100 (-19.35%)

Mutual labels: web-scraping

Daftlistings

A library that enables programmatic interaction with daft.ie. Daft.ie has nationwide coverage and contains about 80% of the total available properties in Ireland.

Stars: ✭ 86 (-30.65%)

Mutual labels: web-scraping

Scrapyd Cluster On Heroku

Set up free and scalable Scrapyd cluster for distributed web-crawling with just a few clicks.

Stars: ✭ 106 (-14.52%)

Mutual labels: web-scraping

Detect Cms

PHP Library for detecting CMS

Stars: ✭ 78 (-37.1%)

Mutual labels: web-scraping

Splashr

💦 Tools to Work with the 'Splash' JavaScript Rendering Service in R

Stars: ✭ 93 (-25%)

Mutual labels: web-scraping

Puppeteer Functions

Puppeteer Firebase Functions demo

Stars: ✭ 75 (-39.52%)

Mutual labels: headless-chrome

Save For Offline

Android app for saving webpages for offline reading.

Stars: ✭ 114 (-8.06%)

Mutual labels: web-scraping

Hockey Scraper

Python Package for scraping NHL Play-by-Play and Shift data

Stars: ✭ 93 (-25%)

Mutual labels: web-scraping

Puppeteer Dart

A Dart library to automate the Chrome browser over the DevTools Protocol. This is a port of the Puppeteer API

Stars: ✭ 92 (-25.81%)

Mutual labels: headless-chrome

Pulsar

Turn large Web sites into tables and charts using simple SQLs.

Stars: ✭ 100 (-19.35%)

Mutual labels: web-scraping


Page analyzer

This Apify actor analyzes a web page at a specific URL. You can try it out live in the Page Analyzer on Apify. The actor extracts HTML and JavaScript variables from the main response and HTML/JSON data from XHR requests, then analyzes the loaded data in the following steps:

1. Analyzes the initial HTML (the HTML loaded directly from the response):
   - Looks for Schema.org microdata and, if it finds any, saves it to the output as the schemaOrgData variable.
   - Looks for JSON-LD script tags (type application/ld+json) and parses the JSON it finds; if it finds any, it outputs it as the jsonLDData variable.
   - Looks for meta and title tags and outputs their content as the metaData variable.
2. Loads all XHR requests, discards the requests that do not contain HTML or JSON, and parses the remaining HTML and JSON into objects.
3. When all XHR requests are finished, loads the HTML from the rendered page and repeats the analysis from step 1, because JavaScript might have changed the HTML of the page.
4. Loads all window variables, discards common global variables (console, innerHeight, navigator, ...), cleans the output (removes all functions and circular paths), and outputs it as the allWindowProperties variable.
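
The clean-up of window variables described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actor's actual code: the names COMMON_GLOBALS, cleanValue and cleanWindowProperties are hypothetical, and the list of discarded globals contains only the three examples named above.

```javascript
// Hypothetical sketch; names and the globals list are assumptions,
// not taken from the actor's source code.
const COMMON_GLOBALS = new Set(['console', 'innerHeight', 'navigator']);

// Recursively copies a value, dropping functions and circular references.
function cleanValue(value, ancestors) {
  if (typeof value === 'function') return undefined;
  if (value === null || typeof value !== 'object') return value;
  if (ancestors.has(value)) return undefined; // circular path back to an ancestor
  ancestors.add(value);
  let cleaned;
  if (Array.isArray(value)) {
    cleaned = value
      .map((item) => cleanValue(item, ancestors))
      .filter((item) => item !== undefined);
  } else {
    cleaned = {};
    for (const [key, child] of Object.entries(value)) {
      const cleanedChild = cleanValue(child, ancestors);
      if (cleanedChild !== undefined) cleaned[key] = cleanedChild;
    }
  }
  ancestors.delete(value);
  return cleaned;
}

// Drops common globals, then cleans whatever is left.
function cleanWindowProperties(windowLike) {
  const result = {};
  for (const [key, value] of Object.entries(windowLike)) {
    if (COMMON_GLOBALS.has(key)) continue;
    // Seed the ancestor set with the root so references back to it are dropped too.
    const cleaned = cleanValue(value, new Set([windowLike]));
    if (cleaned !== undefined) result[key] = cleaned;
  }
  return result;
}

// Usage with a plain object standing in for `window`.
const fakeWindow = { innerHeight: 800, appState: { user: 'anna', refresh() {} } };
fakeWindow.appState.self = fakeWindow.appState; // circular reference
console.log(cleanWindowProperties(fakeWindow)); // { appState: { user: 'anna' } }
```

In a real browser the window object also holds DOM nodes and getters that may throw, so a production version would likely need extra guards that this sketch leaves out.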

When the analysis is finished, the actor checks the INPUT parameters for strings to search for. If there are any, it attempts to find them in all of the content it collected.
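
A search of this kind can be sketched as a recursive walk that records the path to every string containing one of the searched values. The function name findStrings, the path format (dots for keys, brackets for array indexes) and the case-sensitive matching are assumptions made for this illustration; the actor may do it differently.

```javascript
// Hypothetical sketch of searching nested data for strings.
// Returns one entry per matching string, with its path from the root.
function findStrings(data, searchFor, path = '', found = []) {
  if (typeof data === 'string') {
    // Case-sensitive substring match (an assumption of this sketch).
    if (searchFor.some((needle) => data.includes(needle))) {
      found.push({ path, value: data });
    }
  } else if (Array.isArray(data)) {
    data.forEach((item, index) => findStrings(item, searchFor, `${path}[${index}]`, found));
  } else if (data !== null && typeof data === 'object') {
    for (const [key, child] of Object.entries(data)) {
      findStrings(child, searchFor, path ? `${path}.${key}` : key, found);
    }
  }
  return found;
}

// Usage: an empty searchFor array yields no matches, so the search is effectively skipped.
const content = { metaData: { title: 'About us | Example' }, links: ['Home', 'About us'] };
console.log(findStrings(content, ['About us']));
```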

The actor ends when all output is parsed and searched. If the connection to the URL fails or if any part of the actor crashes, the actor ends with an error in the output and in the log.

Input to the actor is provided in the INPUT file. If the actor is run through Apify, INPUT comes from the key-value store. If you want to start the actor locally, run

npm run start-local

and provide the input as a file in the kv-store-dev directory.

INPUT

{
    // URL of the website to analyze
    "url": "http://example.com",
    // array of strings to look for on the website; if empty, the search is skipped during analysis
    "searchFor": ["About us"]
}
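
The comments in the example above are for explanation only; JSON.parse rejects comments, so an input read with it has to be plain JSON. A minimal sketch of reading and checking such input might look like the following. The function name parseInput and the validation rules are assumptions for illustration and are not part of the actor.

```javascript
// Hypothetical input check; the actor's own handling may differ.
function parseInput(rawJson) {
  const input = JSON.parse(rawJson); // throws SyntaxError on comments or invalid JSON

  if (typeof input.url !== 'string') {
    throw new Error('INPUT must contain a "url" string');
  }
  const parsedUrl = new URL(input.url); // throws TypeError if the URL is malformed
  if (parsedUrl.protocol !== 'http:' && parsedUrl.protocol !== 'https:') {
    throw new Error('"url" must use http or https');
  }

  // A missing or empty searchFor means the search step is skipped.
  const searchFor = Array.isArray(input.searchFor)
    ? input.searchFor.filter((item) => typeof item === 'string' && item.length > 0)
    : [];

  return { url: input.url, searchFor };
}

// Usage with the values from the example above.
const input = parseInput('{"url": "http://example.com", "searchFor": ["About us"]}');
console.log(input.url, input.searchFor);
```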

During the run, the actor saves its output to the OUTPUT file, which is stored in the key-value store if the actor is run through Apify, or in the kv-store-dev folder if the actor is run locally.

OUTPUT

{
  // Initial response headers
  "initialResponse": {
    "url": "https://www.flywire.com/",
    "headers": {...}
  },
  // True if window variables were parsed after XHR requests finished
  "windowPropertiesParsed": true,
  // True if meta tags were parsed from initial response
  "metaDataParsed": true,
  // True if Schema.org was loaded and parsed from initial response
  "schemaOrgDataParsed": true,
  // True if JSON-LD was loaded and parsed from initial response
  "jsonLDDataParsed": true,
  // True if HTML was loaded and parsed from initial response
  "htmlParsed": true,
  // True if HTML was loaded and parsed after XHR requests finished
  "htmlFullyParsed": true,
  // True if XHR requests were all parsed
  "xhrRequestsParsed": true,
  // Filtered window properties by search strings
  "windowProperties": {},
  // Object containing cleaned up window object properties
  "allWindowProperties": {...},
  // Array of properties which contain searched strings (at least one) with path to variable from root
  "windowPropertiesFound": [],
  // Schema.org data filtered by search strings.
  "schemaOrgData": {},
  // Array of schema org properties which contain searched strings (at least one) with path to variable from root
  "schemaOrgDataFound": [],
  // Complete output of found schema.org data
  "allSchemaOrgData": [],
  // Complete output of all found meta tags
  "metaData": {
    "viewport": "width=device-width, initial-scale=1",
    "og:title": "International Payments Solution",
    ...
  },
  // List of meta tags matching the searched strings
  "metaDataFound": [],
  // JSON-LD Data filtered by search strings.
  "jsonLDData": {},
  // Array of JSON-LD data properties which contain searched strings (at least one) with path to variable from root
  "jsonLDDataFound": [],
  // Complete output of found JSON-LD
  "allJsonLDData": [],
  // Array of selectors to HTML elements that contain the searched values
  "htmlFound": [],
  // Array of parsed XHR requests with content type of JSON or HTML
  "xhrRequests": [
    {
      "url": "https://www.flywire.com/destinations",
      "method": "GET",
      "responseStatus": 200,
      "responseHeaders": {...},
      "responseBody": {
        // Valid provides information whether JSON was parsed successfully
        "valid": true/false,
        // Data contains the parsed JSON
        "data": [...],
      }
    },
    {
      "url": "https://www.flywire.com/asdasd",
      "method": "GET",
      "responseStatus": 200,
      "responseHeaders": {...},
      // For HTML requests responseBody contains HTML as string
      "responseBody": "<html>...."
    },
  ],
  // same list as above, but filtered by search strings
  "xhrRequestsFound": [...],
  // contains error if actor failed outside of page function
  "error": null,
  // contains error if actor failed in page.evaluate
  "pageError": null,
  "outputFinished": true,

  // timestamps for debugging
  "analysisStarted": "2018-02-09T12:34:49.938Z",
  "scrappingStarted": "2018-02-09T12:34:50.050Z",
  "pageNavigated": "2018-02-09T12:34:53.495Z",
  "windowPropertiesSearched": "2018-02-09T12:34:53.810Z",
  "metadataSearched": "2018-02-09T12:34:51.624Z",
  "schemaOrgSearched": "2018-02-09T12:34:51.627Z",
  "jsonLDSearched": "2018-02-09T12:34:51.625Z",
  "htmlSearched": "2018-02-09T12:34:53.746Z",
  "xhrRequestsSearched": "2018-02-09T12:34:53.517Z",
  "analysisEnded": "2018-02-09T12:34:53.810Z",
}
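
As a rough illustration of how the OUTPUT object could be consumed, the sketch below reports whether the run finished without errors, which of the "...Parsed" flags are not true, and how long the analysis took. The function name summarizeOutput and the shape of its result are assumptions; only the field names come from the example above.

```javascript
// Hypothetical helper that condenses an OUTPUT object into a short summary.
const STAGE_FLAGS = [
  'windowPropertiesParsed',
  'metaDataParsed',
  'schemaOrgDataParsed',
  'jsonLDDataParsed',
  'htmlParsed',
  'htmlFullyParsed',
  'xhrRequestsParsed',
];

function summarizeOutput(output) {
  return {
    // The run counts as ok when it finished and neither error field is set.
    ok: !output.error && !output.pageError && output.outputFinished === true,
    // Flags that are missing or not true.
    unparsedStages: STAGE_FLAGS.filter((flag) => output[flag] !== true),
    // Elapsed time between the two ISO 8601 timestamps, in milliseconds.
    durationMs: Date.parse(output.analysisEnded) - Date.parse(output.analysisStarted),
  };
}

// Usage with the timestamps from the example above and every stage flag set to true.
const exampleOutput = {
  error: null,
  pageError: null,
  outputFinished: true,
  analysisStarted: '2018-02-09T12:34:49.938Z',
  analysisEnded: '2018-02-09T12:34:53.810Z',
};
for (const flag of STAGE_FLAGS) exampleOutput[flag] = true;
console.log(summarizeOutput(exampleOutput)); // { ok: true, unparsedStages: [], durationMs: 3872 }
```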


Stars: ✭ 124
