microlinkhq / Metascraper

Licence: mit

Scrape data from websites using Open Graph, HTML metadata & fallbacks.

Labels

html metadata parse

Projects that are alternatives to or similar to Metascraper

Forensic Tools

A collection of tools for forensic analysis

Stars: ✭ 204 (-83.73%)

Mutual labels: metadata, parse

icecast-parser

Node.js module for getting and parsing metadata from SHOUTcast/Icecast radio streams

Stars: ✭ 66 (-94.74%)

Mutual labels: metadata, parse

icc

JavaScript module to parse International Color Consortium (ICC) profiles

Stars: ✭ 37 (-97.05%)

Mutual labels: metadata, parse

Ipdata

🌐 An IP lookup system utilizing open datasets

Stars: ✭ 58 (-95.37%)

Mutual labels: metadata

Metaforge

An OSINT Metadata analyzing tool that filters through tags and creates reports

Stars: ✭ 63 (-94.98%)

Mutual labels: metadata

Parse Ms

Parse milliseconds into an object

Stars: ✭ 74 (-94.1%)

Mutual labels: parse

Docker Apache Atlas

This Apache Atlas is built from the latest release source tarball and patched to be run in a Docker container.

Stars: ✭ 83 (-93.38%)

Mutual labels: metadata

Hlsinjector

ID3 metadata injector for MPEG TS (HLS) written in PHP

Stars: ✭ 56 (-95.53%)

Mutual labels: metadata

Page Renderer

Clojure PWA generator. Offline-ready web apps with service workers, social meta and async stylesheets.

Stars: ✭ 76 (-93.94%)

Mutual labels: metadata

Deltafs

Transient file system service featuring highly paralleled indexing on both file data and file system metadata

Stars: ✭ 70 (-94.42%)

Mutual labels: metadata

Parse Sdk Js

The JavaScript SDK for the Parse Platform

Stars: ✭ 1,158 (-7.66%)

Mutual labels: parse

Netkan

Metadata files used by the NetKAN/CKAN indexer

Stars: ✭ 64 (-94.9%)

Mutual labels: metadata

Jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.

Stars: ✭ 9,184 (+632.38%)

Mutual labels: parse

Schema Microdata Examples

Some examples of HTML markup using Schema.org microdata

Stars: ✭ 58 (-95.37%)

Mutual labels: metadata

Parse Dashboard For Ios

A beautiful mobile client for managing your Parse apps while you are on the go! Now you can easily view and modify your data in the same way you would on the official desktop client.

Stars: ✭ 81 (-93.54%)

Mutual labels: parse

Cf Xarray

a lightweight accessor for xarray objects that interprets CF attributes

Stars: ✭ 58 (-95.37%)

Mutual labels: metadata

Sickle

Sickle: OAI-PMH for Humans

Stars: ✭ 76 (-93.94%)

Mutual labels: metadata

Sickbeard mp4 automator

Automatically convert video files to a standardized format with metadata tagging to create a beautiful and uniform media library

Stars: ✭ 1,142 (-8.93%)

Mutual labels: metadata

Python Patch

Library to parse and apply unified diffs

Stars: ✭ 65 (-94.82%)

Mutual labels: parse

Parser

Generate a JSON documentation for a SFC Vue component. Contribute: https://gitlab.com/vuedoc/parser#contribute

Stars: ✭ 74 (-94.1%)

Mutual labels: parse

A library to easily scrape metadata from an article on the web using Open Graph, JSON-LD, regular HTML metadata, and a series of fallbacks.

Table of Contents
Getting Started
Installation
Usage
Metadata
How It Works
Importing Rules
Rules Bundles
API
Benchmark
License

Getting Started

metascraper is a library to easily scrape metadata from an article on the web using Open Graph metadata, regular HTML metadata, and a series of fallbacks.

It follows a few principles:

Have a high accuracy for online articles by default.
Make it simple to add new rules or override existing ones.
Don't restrict rules to CSS selectors or text accessors.

Installation

$ npm install metascraper --save

Usage

Let's extract accurate information from an article, here the Bloomberg article referenced by targetUrl in the snippet below.

Call metascraper with the rules bundles you want to apply for extracting content:

const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-clearbit')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

const got = require('got')

const targetUrl = 'http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance'

;(async () => {
  const { body: html, url } = await got(targetUrl)
  const metadata = await metascraper({ html, url })
  console.log(metadata)
})()

The output will be something like:

{
  "author": "Ellen Huet",
  "date": "2016-05-24T18:00:03.894Z",
  "description": "The HR startups go to war.",
  "image": "https://assets.bwbx.io/images/users/iqjWHBFdfxIU/ioh_yWEn8gHo/v1/-1x-1.jpg",
  "publisher": "Bloomberg.com",
  "title": "As Zenefits Stumbles, Gusto Goes Head-On by Selling Insurance",
  "url": "http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance"
}

Metadata

Note: Other metadata can be defined using a custom rules bundle.

Here is an example of the metadata that metascraper can collect:

audio (e.g. https://cf-media.sndcdn.com/U78RIfDPV6ok.128.mp3)
An audio URL that best represents the article.
author (e.g. Noah Kulwin)
A human-readable representation of the author's name.
date (e.g. 2016-05-27T00:00:00.000Z)
An ISO 8601 representation of the date the article was published.
description (e.g. Venture capitalists are raising money at the fastest rate...)
The publisher's chosen description of the article.
video (e.g. https://assets.entrepreneur.com/content/preview.mp4)
A video URL that best represents the article.
image (e.g. https://assets.entrepreneur.com/content/3x2/1300/20160504155601-GettyImages-174457162.jpeg)
An image URL that best represents the article.
lang (e.g. en)
An ISO 639-1 representation of the language of the content at the URL.
logo (e.g. https://entrepreneur.com/favicon180x180.png)
An image URL that best represents the publisher brand.
publisher (e.g. Fast Company)
A human-readable representation of the publisher's name.
title (e.g. Meet Wall Street's New A.I. Sheriffs)
The publisher's chosen title of the article.
url (e.g. http://motherboard.vice.com/read/google-wins-trial-against-oracle-saves-9-billion)
The URL of the article.

How It Works

metascraper is built out of rules bundles.

It was designed to be easy to adapt. You can compose your own transformation pipeline using existing rules or write your own.

Rules bundles are collections of HTML selectors built around a specific property. When you load the library, it implicitly loads the core rules.

Each set of rules loads a set of selectors in order to obtain a specific value.

Rules are sorted by priority: the first rule that successfully resolves the value stops the remaining rules for that property from running. Rules are intentionally sorted from most specific to most generic.

Rules act as fallbacks for one another:

If the first rule fails, metascraper falls back to the second rule.
If the second rule fails, it falls back to the third rule.
And so on.

metascraper continues until a rule resolves the value or all the rules have been tried.
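
The fallback behaviour described above can be sketched in a few lines. This is a simplified illustration and not metascraper's actual implementation: the `resolveProperty` helper, the rule functions, and the `page` object are hypothetical names, and the rules are synchronous for brevity.

```javascript
// Hypothetical helper: try each rule in order and return the first usable value.
function resolveProperty (rules, context) {
  for (const rule of rules) {
    const value = rule(context)
    // Treat undefined, null, and empty strings as "this rule failed".
    if (value !== undefined && value !== null && value !== '') return value
  }
  // No rule resolved the property.
  return null
}

// Hypothetical title rules, ordered from most specific to most generic.
const titleRules = [
  page => page.openGraph.title,
  page => page.htmlTitle,
  page => page.firstHeading
]

// Hypothetical, already-parsed page data with no Open Graph title.
const page = {
  openGraph: {},
  htmlTitle: 'As Zenefits Stumbles, Gusto Goes Head-On by Selling Insurance',
  firstHeading: 'Gusto Goes Head-On'
}

// The first rule yields undefined, so the second rule provides the value
// and the third rule never runs.
const title = resolveProperty(titleRules, page)
console.log(title)
```

Ordering the array from specific to generic is what gives the more trustworthy source priority: a generic rule is only consulted once every rule before it has failed.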

Importing Rules

metascraper exports a constructor that needs to be initialized with the collection of rules to load:

const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-clearbit')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

Again, the order in which rules are loaded is important: only the first rule that resolves the value is applied.

Use the first parameter of each rules bundle to pass options specific to that bundle:

const metascraper = require('metascraper')([
  require('metascraper-clearbit')({
    size: 256,
    format: 'jpg'
  })
])

Rules Bundles

Note: Can't find the rules bundle that you want? Open an issue to request it.

Official

Rules bundles maintained by metascraper maintainers.

Package
`metascraper-amazon`
`metascraper-audio`
`metascraper-author`
`metascraper-clearbit`
`metascraper-date`
`metascraper-description`
`@metascraper/helpers`
`metascraper-image`
`metascraper-iframe`
`metascraper-lang`
`metascraper-logo`
`metascraper-logo-favicon`
`metascraper-media-provider`
`metascraper-publisher`
`metascraper-readability`
`metascraper-soundcloud`
`metascraper-telegram`
`metascraper-title`
`metascraper-uol`
`metascraper-url`
`metascraper-spotify`
`metascraper-video`
`metascraper-youtube`

Community

Rules bundles maintained by individual users.

metascraper-address – Get schema.org formatted address.
metascraper-shopping – Get product information from HTML markup on merchant websites.

Write Your Own Rules

See CONTRIBUTING.

API

constructor(rules)

Create a new metascraper instance, explicitly declaring the rules bundles to use.

rules

Type: Array

The collection of rules bundles to load.

metascraper(options)

Call the instance to extract content based on the rules bundles provided to the constructor.

options

url

Required
Type: String

The URL associated with the HTML markup.

It is used to resolve relative links that may be present in the HTML markup.

It can also be used as a fallback value by different rules.
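
To illustrate why the page URL is needed for relative links, here is a minimal sketch using Node's built-in WHATWG `URL` class. The `example.com` address and the relative paths are made up for the example, and the source does not say whether metascraper resolves links this way internally.

```javascript
// Base URL of the page, as it would be passed in the `url` option (hypothetical value).
const pageUrl = 'https://example.com/news/articles/some-story'

// Hypothetical relative links as they might appear in the page markup.
const relativeImage = '/images/cover.jpg'
const relativeLogo = '../favicon.png'

// The URL constructor resolves a relative reference against a base URL.
const absoluteImage = new URL(relativeImage, pageUrl).toString()
const absoluteLogo = new URL(relativeLogo, pageUrl).toString()

console.log(absoluteImage) // https://example.com/images/cover.jpg
console.log(absoluteLogo) // https://example.com/news/favicon.png
```

Without the base URL, neither relative value could be turned into a link that works outside the original page.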

html

Type: String

The HTML markup for extracting the content.

rules

Type: Array

You can pass additional rules to add at execution time.

These rules are merged with your loaded rules, at the beginning of the collection.
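
One way to picture this merge is the sketch below. It is an assumption-laden illustration, not the library's code: `mergeRules` is a hypothetical helper, and each bundle is assumed to be a plain object mapping a property name to an array of rule functions.

```javascript
// Hypothetical helper: combine bundles so that rules passed at execution time
// are placed ahead of the rules loaded when the instance was created.
function mergeRules (runtimeBundles, loadedBundles) {
  const merged = {}
  for (const bundle of [...runtimeBundles, ...loadedBundles]) {
    for (const [property, rules] of Object.entries(bundle)) {
      // Append in iteration order, so runtime rules end up first.
      merged[property] = (merged[property] || []).concat(rules)
    }
  }
  return merged
}

// Assumed bundle shape: { propertyName: [ruleFunction, ...] }.
const loadedBundles = [{ title: [() => 'title from loaded rule'] }]
const runtimeBundles = [{ title: [() => 'title from runtime rule'] }]

const mergedRules = mergeRules(runtimeBundles, loadedBundles)

// The runtime rule is first, so it is tried before the loaded rule.
console.log(mergedRules.title[0]())
```

Because the first rule that resolves a value wins, placing the execution-time rules first lets them take precedence over the rules loaded earlier.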

validateUrl

Type: Boolean
Default: true

Ensures that the provided URL is a valid WHATWG URL.
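
As a rough idea of what such a check can look like, the sketch below relies on Node's built-in WHATWG `URL` constructor, which throws when given a string it cannot parse as an absolute URL. The `isValidUrl` helper is hypothetical, and the validation metascraper actually performs may differ.

```javascript
// Hypothetical helper: true when the input parses as an absolute WHATWG URL.
function isValidUrl (input) {
  try {
    // The constructor throws a TypeError for input it cannot parse.
    new URL(input)
    return true
  } catch (error) {
    return false
  }
}

console.log(isValidUrl('http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance')) // true
console.log(isValidUrl('/news/articles/2016-05-24')) // false, relative paths have no base
console.log(isValidUrl('not a url')) // false
```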

Benchmark

To give you an idea of how accurate metascraper is, here is a comparison of similar libraries:

Library	`metascraper`	`html-metadata`	`node-metainspector`	`open-graph-scraper`	`unfluff`
Correct	95.54%	74.56%	61.16%	66.52%	70.90%
Incorrect	1.79%	1.79%	0.89%	6.70%	10.27%
Missed	2.68%	23.67%	37.95%	26.34%	8.95%

A big part of the reason for metascraper's higher accuracy is that it relies on a series of fallbacks for each piece of metadata, instead of just looking for the most commonly-used, spec-compliant pieces of metadata, like Open Graph.

metascraper's default settings are targeted specifically at parsing online articles, which is why it can be more highly tuned for that purpose than the other libraries.

If you're interested in the breakdown by individual pieces of metadata, check out the full comparison summary, or dive into the raw result data for each library.

License

metascraper © Ian Storm Taylor, Released under the MIT License.
Maintained by Kiko Beats with help from contributors.


Stars: ✭ 1,254
