postlight / Mercury Parser

Licence: other
📜 Extract meaningful content from the chaos of a web page

Mercury Parser

Mercury Parser - Extracting content from chaos

Postlight's Mercury Parser extracts the bits that humans care about from any URL you give it. That includes article content, titles, authors, published dates, excerpts, lead images, and more.

Mercury Parser powers the Mercury AMP Converter and Mercury Reader, a Chrome extension that removes ads and distractions, leaving only text and images for a beautiful reading view on any site.

Mercury Parser allows you to easily create custom parsers using simple JavaScript and CSS selectors. This allows you to proactively manage parsing and migration edge cases. There are many examples available along with documentation.

How? Like this.

Installation

# If you're using yarn
yarn add @postlight/mercury-parser

# If you're using npm
npm install @postlight/mercury-parser

Usage

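import Mercury from '@postlight/mercury-parser';

// Any article URL works; this one matches the sample result below.
const url = 'https://en.wikipedia.org/wiki/Thunder_(mascot)';

Mercury.parse(url).then(result => console.log(result));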

// NOTE: When used in the browser, you can omit the URL argument
// and simply run `Mercury.parse()` to parse the current page.

The result looks like this:

{
  "title": "Thunder (mascot)",
  "content": "... <p><b>Thunder</b> is the <a href=\"https://en.wikipedia.org/wiki/Stage_name\">stage name</a> for the...",
  "author": "Wikipedia Contributors",
  "date_published": "2016-09-16T20:56:00.000Z",
  "lead_image_url": null,
  "dek": null,
  "next_page_url": null,
  "url": "https://en.wikipedia.org/wiki/Thunder_(mascot)",
  "domain": "en.wikipedia.org",
  "excerpt": "Thunder Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos",
  "word_count": 4677,
  "direction": "ltr",
  "total_pages": 1,
  "rendered_pages": 1
}

If Mercury is unable to find a field, that field is returned as null.
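Because any field may be null, code that consumes the result usually needs fallbacks. The following is a minimal sketch of one way to do that. The helper name `describeArticle` and the fallback strings are hypothetical, not part of Mercury Parser; the sample object only mirrors a few fields from the result shown above.

```javascript
// Hypothetical helper (not part of Mercury Parser): turns a parse result
// into display-ready values, substituting fallbacks for null fields.
function describeArticle(result) {
  return {
    title: result.title ?? '(untitled)',
    author: result.author ?? 'Unknown author',
    // date_published is an ISO 8601 string when present, otherwise null.
    published: result.date_published ? new Date(result.date_published) : null,
    hasLeadImage: Boolean(result.lead_image_url),
  };
}

// A trimmed copy of the sample result from this README.
const sampleResult = {
  title: 'Thunder (mascot)',
  author: 'Wikipedia Contributors',
  date_published: '2016-09-16T20:56:00.000Z',
  lead_image_url: null,
  dek: null,
};

const article = describeArticle(sampleResult);
console.log(article.title, article.hasLeadImage);
```

In real use, the object passed to `describeArticle` would be the value that `Mercury.parse(url)` resolves to.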

parse() Options

Content Formats

By default, Mercury Parser returns the content field as HTML. You can override this by passing an options object to the parse function. The options control whether to scrape all pages of an article and which output format to use for the content field (valid values are 'html', 'markdown', and 'text'). For example:

Mercury.parse(url, { contentType: 'markdown' }).then(result =>
  console.log(result)
);

This returns the page's content as GitHub-flavored Markdown:

"content": "...**Thunder** is the [stage name](https://en.wikipedia.org/wiki/Stage_name) for the..."

Custom Request Headers

You can include custom headers in requests by passing name-value pairs to the parse function as follows:

Mercury.parse(url, {
  headers: {
    Cookie: 'name=value; name2=value2; name3=value3',
    'User-Agent':
      'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1',
  },
}).then(result => console.log(result));
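
When cookies live in an object rather than a ready-made string, the Cookie header can be assembled before calling parse. This is a small sketch; `buildCookieHeader` is a hypothetical helper, not part of Mercury Parser, and it assumes the names and values need no escaping.

```javascript
// Hypothetical helper (not part of Mercury Parser): joins name-value pairs
// into a Cookie header string. Values are used as-is, with no encoding.
function buildCookieHeader(cookies) {
  return Object.entries(cookies)
    .map(([name, value]) => `${name}=${value}`)
    .join('; ');
}

const requestOptions = {
  headers: {
    Cookie: buildCookieHeader({ name: 'value', name2: 'value2', name3: 'value3' }),
  },
};

console.log(requestOptions.headers.Cookie);
// prints "name=value; name2=value2; name3=value3"

// Intended use: Mercury.parse(url, requestOptions)
```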

Pre-fetched HTML

You can use Mercury Parser to parse custom or pre-fetched HTML by passing an HTML string to the parse function as follows:

Mercury.parse(url, {
  html:
    '<html><body><article><h1>Thunder (mascot)</h1><p>Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos</p></article></body></html>',
}).then(result => console.log(result));

Note that the URL argument is still supplied: it identifies the website so that its custom parser, if one exists, can be used. It is not used to fetch content.
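
A thin wrapper can make the "URL plus HTML" contract explicit and reject empty input early. This is only a sketch: `parseSnapshot` is a hypothetical name, and the parse function is passed in as an argument so the example runs without the package installed. In real use you would pass `Mercury.parse`, which returns a promise.

```javascript
// Hypothetical wrapper (not part of Mercury Parser). `parse` is injected;
// in real use it would be Mercury.parse.
function parseSnapshot(parse, url, html) {
  if (typeof html !== 'string' || html.trim() === '') {
    throw new TypeError('html must be a non-empty string');
  }
  // The URL is still passed so a site-specific parser can be selected;
  // the content itself comes from `html`.
  return parse(url, { html });
}

// Stand-in for Mercury.parse that just echoes what it received.
const echoParse = (url, options) => ({ url, html: options.html });

const snapshot = parseSnapshot(
  echoParse,
  'https://en.wikipedia.org/wiki/Thunder_(mascot)',
  '<html><body><article><h1>Thunder (mascot)</h1></article></body></html>'
);
console.log(snapshot.url);
```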

The command-line parser

Mercury Parser also ships with a CLI, so you can run it from your command line:

# Install Mercury globally
yarn global add @postlight/mercury-parser
#   or
npm -g install @postlight/mercury-parser

# Then
mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source

# Pass optional --format argument to set content type (html|markdown|text)
mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source --format=markdown

# Pass optional --header.name=value arguments to include custom headers in the request
mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source --header.Cookie="name=value; name2=value2; name3=value3" --header.User-Agent="Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"

# Pass optional --extend argument to add a custom type to the response
mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend credit="p:last-child em"

# Pass optional --extend-list argument to add a custom type with multiple matches
mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend-list categories=".meta__tags-list a"

# Get the value of attributes by adding a pipe to --extend or --extend-list
mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend-list links=".body a|href"

# Pass optional --add-extractor argument to add a custom extractor at runtime.
mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source --add-extractor ./src/extractors/fixtures/postlight.com/index.js
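
If you drive the CLI from a Node script, building the argument list in code helps avoid shell-quoting mistakes. The sketch below covers only the flags shown above; `buildCliArgs` and its option names (`format`, `headers`, `extend`, `extendList`) are hypothetical, chosen for this example.

```javascript
const CLI_FORMATS = ['html', 'markdown', 'text'];

// Hypothetical helper (not part of Mercury Parser): maps an options object
// to the mercury-parser CLI arguments documented above.
function buildCliArgs(url, { format, headers = {}, extend = {}, extendList = {} } = {}) {
  if (format !== undefined && !CLI_FORMATS.includes(format)) {
    throw new RangeError(`format must be one of: ${CLI_FORMATS.join(', ')}`);
  }
  const args = [url];
  if (format) args.push(`--format=${format}`);
  for (const [name, value] of Object.entries(headers)) {
    args.push(`--header.${name}=${value}`);
  }
  for (const [name, selector] of Object.entries(extend)) {
    args.push('--extend', `${name}=${selector}`);
  }
  for (const [name, selector] of Object.entries(extendList)) {
    args.push('--extend-list', `${name}=${selector}`);
  }
  return args;
}

const cliArgs = buildCliArgs(
  'https://postlight.com/trackchanges/mercury-goes-open-source',
  {
    format: 'markdown',
    extend: { credit: 'p:last-child em' },
    extendList: { links: '.body a|href' },
  }
);
console.log(cliArgs);
```

The resulting array can be handed to Node's `child_process.execFile('mercury-parser', cliArgs, callback)`, which passes each element as a separate argument without going through a shell.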

License

Licensed under either of the following, at your preference:

Apache License, Version 2.0
MIT license

Contributing

For details on how to contribute to Mercury, including how to write a custom content extractor for any site, see CONTRIBUTING.md.

Unless it is explicitly stated otherwise, any contribution intentionally submitted for inclusion in the work, as defined in the Apache-2.0 license, shall be dual licensed as above without any additional terms or conditions.


🔬 A Labs project from your friends at Postlight. Happy coding!
