
apify / Actor Page Analyzer

Apify actor that opens a web page in headless Chrome and analyzes the HTML and JavaScript objects, looks for schema.org microdata and JSON-LD metadata, analyzes AJAX requests, etc.

Programming Languages

javascript
184084 projects - #8 most used programming language

Projects that are alternatives of or similar to Actor Page Analyzer

Apify Js
Apify SDK — The scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.
Stars: ✭ 3,154 (+2443.55%)
Mutual labels:  web-scraping, headless-chrome
codepen-puppeteer
Use Puppeteer to download pens from Codepen.io as single html pages
Stars: ✭ 22 (-82.26%)
Mutual labels:  web-scraping, headless-chrome
Ayakashi
⚡️ Ayakashi.io - The next generation web scraping framework
Stars: ✭ 117 (-5.65%)
Mutual labels:  web-scraping, headless-chrome
Decapitated
Headless 'Chrome' Orchestration in R
Stars: ✭ 65 (-47.58%)
Mutual labels:  web-scraping, headless-chrome
Rod
A Devtools driver for web automation and scraping
Stars: ✭ 1,392 (+1022.58%)
Mutual labels:  web-scraping
Rvest
Simple web scraping for R
Stars: ✭ 1,253 (+910.48%)
Mutual labels:  web-scraping
Reader
Extract clean(er), readable text from web pages via Mercury Web Parser.
Stars: ✭ 75 (-39.52%)
Mutual labels:  web-scraping
Ping Sm
Receive an email or Telegram message as soon as Migros Sanalmarket is available for delivery in your neighborhood.
Stars: ✭ 71 (-42.74%)
Mutual labels:  web-scraping
Cri
Type safe go bindings to interact with chrome remote interface.
Stars: ✭ 119 (-4.03%)
Mutual labels:  headless-chrome
Dat8
General Assembly's 2015 Data Science course in Washington, DC
Stars: ✭ 1,516 (+1122.58%)
Mutual labels:  web-scraping
Sillynium
Automate the creation of Python Selenium Scripts by drawing coloured boxes on webpage elements
Stars: ✭ 100 (-19.35%)
Mutual labels:  web-scraping
Daftlistings
A library that enables programmatic interaction with daft.ie. Daft.ie has nationwide coverage and contains about 80% of the total available properties in Ireland.
Stars: ✭ 86 (-30.65%)
Mutual labels:  web-scraping
Scrapyd Cluster On Heroku
Set up free and scalable Scrapyd cluster for distributed web-crawling with just a few clicks. DEMO 👉
Stars: ✭ 106 (-14.52%)
Mutual labels:  web-scraping
Detect Cms
PHP Library for detecting CMS
Stars: ✭ 78 (-37.1%)
Mutual labels:  web-scraping
Splashr
💦 Tools to Work with the 'Splash' JavaScript Rendering Service in R
Stars: ✭ 93 (-25%)
Mutual labels:  web-scraping
Puppeteer Functions
Puppeteer Firebase Functions demo
Stars: ✭ 75 (-39.52%)
Mutual labels:  headless-chrome
Save For Offline
Android app for saving webpages for offline reading.
Stars: ✭ 114 (-8.06%)
Mutual labels:  web-scraping
Hockey Scraper
Python Package for scraping NHL Play-by-Play and Shift data
Stars: ✭ 93 (-25%)
Mutual labels:  web-scraping
Puppeteer Dart
A Dart library to automate the Chrome browser over the DevTools Protocol. This is a port of the Puppeteer API
Stars: ✭ 92 (-25.81%)
Mutual labels:  headless-chrome
Pulsar
Turn large Web sites into tables and charts using simple SQLs.
Stars: ✭ 100 (-19.35%)
Mutual labels:  web-scraping

Page analyzer

This Apify actor analyzes a web page at a given URL. You can try it out live in the Page Analyzer on Apify. The actor extracts HTML and JavaScript variables from the main response and HTML/JSON data from XHR requests. It then analyzes the loaded data:

  1. It analyzes the initial HTML (the HTML loaded directly from the response):
  • Looks for Schema.org data and, if it finds any, saves it to the output as the schemaOrgData variable.
  • Looks for JSON-LD link tags and parses the JSON it finds; if it finds any, it outputs it as the jsonLDData variable.
  • Looks for meta and title tags and outputs their content as the metaData variable.
  2. It loads all XHR requests, discards requests that do not contain HTML or JSON, and parses the HTML and JSON into objects.
  3. When all XHR requests have finished, it loads the HTML from the rendered page and repeats step 1, because JavaScript might have changed the HTML of the website.
  4. It loads all window variables, discards common global variables (console, innerHeight, navigator, ...), cleans the output (removes all functions and circular paths), and outputs it as the allWindowProperties variable.
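
The cleaning of window variables in the last step can be illustrated with a minimal sketch. This is not the actor's actual code: the names COMMON_GLOBALS, cleanProperties and collectWindowProperties are hypothetical, the list of ignored globals is shortened to the three mentioned above, and a plain object stands in for the browser's window.

```javascript
// Hypothetical sketch, not the actor's implementation.
// Short stand-in for the list of common globals that get discarded.
const COMMON_GLOBALS = new Set(['console', 'innerHeight', 'navigator']);

// Recursively copies a value, dropping functions and circular references.
// `ancestors` holds only the objects on the current path, so an object that is
// merely referenced twice (without forming a cycle) is kept both times.
function cleanProperties(value, ancestors = new WeakSet()) {
  if (typeof value === 'function') return undefined;
  if (value === null || typeof value !== 'object') return value;
  if (ancestors.has(value)) return undefined; // circular path
  ancestors.add(value);
  let result;
  if (Array.isArray(value)) {
    // Items that clean to undefined (functions, cycles) are removed from the array.
    result = value
      .map((item) => cleanProperties(item, ancestors))
      .filter((item) => item !== undefined);
  } else {
    result = {};
    for (const [key, child] of Object.entries(value)) {
      const cleaned = cleanProperties(child, ancestors);
      if (cleaned !== undefined) result[key] = cleaned;
    }
  }
  ancestors.delete(value);
  return result;
}

// Skips common globals and cleans everything else.
function collectWindowProperties(windowLike) {
  const output = {};
  for (const key of Object.keys(windowLike)) {
    if (COMMON_GLOBALS.has(key)) continue;
    const cleaned = cleanProperties(windowLike[key]);
    if (cleaned !== undefined) output[key] = cleaned;
  }
  return output;
}

// Made-up stand-in for a page's window object.
const fakeWindow = {
  innerHeight: 800,
  console: {},
  appState: { user: 'anna', onChange() {} },
};
fakeWindow.appState.self = fakeWindow.appState; // circular reference

console.log(collectWindowProperties(fakeWindow));
// → { appState: { user: 'anna' } }
```

Tracking only the ancestors of the current value, rather than every object seen so far, keeps data that is shared between two properties while still breaking real cycles.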

When the analysis is finished, the actor checks whether the INPUT parameters contain any strings to search for. If they do, it attempts to find those strings in all of the content it found.
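
As a rough illustration of such a search, the sketch below walks a parsed object and records the path from the root to every string that contains one of the searched strings. The function name findPaths, the path notation and the case-insensitive matching are assumptions made for this example, not details taken from the actor.

```javascript
// Hypothetical sketch, not the actor's implementation.
// Returns [{ path, value }] for every string in `value` that contains at least
// one of the searched strings. Matching is case-insensitive (an assumption).
// Expects data without circular references.
function findPaths(value, searchFor, path = '') {
  const found = [];
  if (typeof value === 'string') {
    const lower = value.toLowerCase();
    if (searchFor.some((needle) => lower.includes(needle.toLowerCase()))) {
      found.push({ path, value });
    }
  } else if (Array.isArray(value)) {
    value.forEach((item, index) => {
      found.push(...findPaths(item, searchFor, `${path}[${index}]`));
    });
  } else if (value !== null && typeof value === 'object') {
    for (const [key, child] of Object.entries(value)) {
      found.push(...findPaths(child, searchFor, `${path}.${key}`));
    }
  }
  return found;
}

// Made-up sample data shaped like the metaData output.
const metaData = {
  viewport: 'width=device-width, initial-scale=1',
  'og:title': 'About us | Example',
};
const searchFor = ['About us'];

// An empty searchFor array means the search is skipped.
const matches = searchFor.length > 0 ? findPaths(metaData, searchFor) : [];
console.log(matches);
// → [ { path: '.og:title', value: 'About us | Example' } ]
```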

The actor ends when all output has been parsed and searched. If the connection to the URL fails or any part of the actor crashes, the actor ends with an error in the output and the log.

Input to the actor is provided from the INPUT file. If the actor is run through Apify, INPUT comes from the key-value store. To start the actor locally, call

npm run start-local

and provide the input as a file in the kv-store-dev directory.

INPUT

{
    // URL of the website to be analyzed
    "url": "http://example.com",
    // array of strings to look for on the website; if empty, the search is skipped during analysis
    "searchFor": ["About us"]
}
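
The comments in the example above are annotations only; JSON.parse rejects `//` comments, so an INPUT file read with a standard JSON parser has to be plain JSON. The sketch below shows one way such input could be parsed and checked before use. The function normalizeInput and its validation rules are hypothetical and are not taken from the actor.

```javascript
// Hypothetical sketch, not the actor's implementation.
// Parses the raw INPUT text and returns { url, searchFor } with defaults applied.
function normalizeInput(rawJson) {
  const input = JSON.parse(rawJson); // throws on invalid JSON, including comments
  if (input === null || typeof input !== 'object') {
    throw new Error('INPUT must be a JSON object');
  }
  if (typeof input.url !== 'string' || !/^https?:\/\//i.test(input.url)) {
    throw new Error('INPUT.url must be an http(s) URL string');
  }
  // A missing or non-array searchFor is treated as "nothing to search for".
  const searchFor = Array.isArray(input.searchFor)
    ? input.searchFor.filter((item) => typeof item === 'string' && item.length > 0)
    : [];
  return { url: input.url, searchFor };
}

const rawInput = '{ "url": "http://example.com", "searchFor": ["About us"] }';
console.log(normalizeInput(rawInput));
// → { url: 'http://example.com', searchFor: [ 'About us' ] }
```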

During the run, the actor saves its output to the OUTPUT file, which is stored in the key-value store if the actor is run through Apify, or in the kv-store-dev folder if the actor is run locally.

OUTPUT

{
  // Initial response headers
  "initialResponse": {
    "url": "https://www.flywire.com/",
    "headers": {...}
  },
  // True if window variables were parsed after XHR requests finished
  "windowPropertiesParsed": true,
  // True if meta tags were parsed from initial response
  "metaDataParsed": true,
  // True if Schema.org was loaded and parsed from initial response
  "schemaOrgDataParsed": true,
  // True if JSON-LD was loaded and parsed from initial response
  "jsonLDDataParsed": true,
  // True if HTML was loaded and parsed from initial response
  "htmlParsed": true,
  // True if HTML was loaded and parsed after XHR requests finished
  "htmlFullyParsed": true,
  // True if XHR requests were all parsed
  "xhrRequestsParsed": true,
  // Filtered window properties by search strings
  "windowProperties": {},
  // Object containing cleaned up window object properties
  "allWindowProperties": {...},
  // Array of properties which contain searched strings (at least one) with path to variable from root
  "windowPropertiesFound": [],
  // Schema.org data filtered by search strings.
  "schemaOrgData": {},
  // Array of schema org properties which contain searched strings (at least one) with path to variable from root
  "schemaOrgDataFound": [],
  // Complete output of found schema.org data
  "allSchemaOrgData": [],
  // Complete output of all found meta tags
  "metaData": {
    "viewport": "width=device-width, initial-scale=1",
    "og:title": "International Payments Solution",
    ...
  },
  // List of meta tags matching the searched strings
  "metaDataFound": [],
  // JSON-LD Data filtered by search strings.
  "jsonLDData": {},
  // Array of JSON-LD data properties which contain searched strings (at least one) with path to variable from root
  "jsonLDDataFound": [],
  // Complete output of found JSON-LD
  "allJsonLDData": [],
  // Array of selectors to HTML elements that contain the searched values
  "htmlFound": [],
  // Array of parsed XHR requests with content type of JSON or HTML
  "xhrRequests": [
    {
      "url": "https://www.flywire.com/destinations",
      "method": "GET",
      "responseStatus": 200,
      "responseHeaders": {...},
      "responseBody": {
        // valid indicates whether the JSON was parsed successfully
        "valid": true/false,
        // Data contains the parsed JSON
        "data": [...],
      }
    },
    {
      "url": "https://www.flywire.com/asdasd",
      "method": "GET",
      "responseStatus": 200,
      "responseHeaders": {...},
      // For HTML requests responseBody contains HTML as string
      "responseBody": "<html>...."
    },
  ],
  // same list as above, but filtered by search strings
  "xhrRequestsFound": [...],
  // contains error if actor failed outside of page function
  "error": null,
  // contains error if actor failed in page.evaluate
  "pageError": null,
  "outputFinished": true,

  // timestamps for debugging
  "analysisStarted": "2018-02-09T12:34:49.938Z",
  "scrappingStarted": "2018-02-09T12:34:50.050Z",
  "pageNavigated": "2018-02-09T12:34:53.495Z",
  "windowPropertiesSearched": "2018-02-09T12:34:53.810Z",
  "metadataSearched": "2018-02-09T12:34:51.624Z",
  "schemaOrgSearched": "2018-02-09T12:34:51.627Z",
  "jsonLDSearched": "2018-02-09T12:34:51.625Z",
  "htmlSearched": "2018-02-09T12:34:53.746Z",
  "xhrRequestsSearched": "2018-02-09T12:34:53.517Z",
  "analysisEnded": "2018-02-09T12:34:53.810Z",
}
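
When consuming this output, note that responseBody has two shapes: a string for HTML responses and an object with valid and data for JSON responses. The sketch below shows one way to tell them apart. The helper summarizeXhrRequests and the sample values are made up for illustration; only the field names come from the OUTPUT description above.

```javascript
// Hypothetical helper for reading OUTPUT; not part of the actor.
// Produces one summary entry per parsed XHR request.
function summarizeXhrRequests(output) {
  const requests = Array.isArray(output.xhrRequests) ? output.xhrRequests : [];
  return requests.map(({ url, responseStatus, responseBody }) => {
    if (typeof responseBody === 'string') {
      // HTML responses are stored as a plain string.
      return { url, responseStatus, kind: 'html', size: responseBody.length };
    }
    if (responseBody && responseBody.valid === true) {
      // JSON responses carry the parsed value in `data`.
      return { url, responseStatus, kind: 'json', data: responseBody.data };
    }
    // JSON that could not be parsed (valid: false) or a missing body.
    return { url, responseStatus, kind: 'unparsed' };
  });
}

// Made-up sample shaped like the xhrRequests array above.
const sampleOutput = {
  xhrRequests: [
    {
      url: 'https://www.flywire.com/destinations',
      method: 'GET',
      responseStatus: 200,
      responseBody: { valid: true, data: ['sample'] },
    },
    {
      url: 'https://www.flywire.com/asdasd',
      method: 'GET',
      responseStatus: 200,
      responseBody: '<html></html>',
    },
  ],
};

for (const entry of summarizeXhrRequests(sampleOutput)) {
  console.log(`${entry.kind} ${entry.responseStatus} ${entry.url}`);
}
```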