fb55 / Readabilitysax

Licence: bsd-2-clause
a fast and platform independent readability port (JS)

Programming Languages

javascript
184084 projects - #8 most used programming language

Projects that are alternatives of or similar to Readabilitysax

Just Read
A customizable read mode web extension.
Stars: ✭ 874 (+304.63%)
Mutual labels:  readability
Orchestra
One language to be RegExp's Successor. Visually readable and rich, technically safe and extended, naturally scalable, advanced, and optimized
Stars: ✭ 103 (-52.31%)
Mutual labels:  readability
Newspaper
Read webpages in readability mode, inside your terminal.
Stars: ✭ 168 (-22.22%)
Mutual labels:  readability
Pdfsave
Convert websites into readable PDFs
Stars: ✭ 46 (-78.7%)
Mutual labels:  readability
Sspipe
Simple Smart Pipe: python productivity-tool for rapid data manipulation
Stars: ✭ 96 (-55.56%)
Mutual labels:  readability
Php Readability
A fork of https://bitbucket.org/fivefilters/php-readability
Stars: ✭ 127 (-41.2%)
Mutual labels:  readability
Stylebot
Change the appearance of the web instantly
Stars: ✭ 746 (+245.37%)
Mutual labels:  readability
Readability
Readability is an Elixir library for extracting and curating articles.
Stars: ✭ 188 (-12.96%)
Mutual labels:  readability
Readability2
Readability2 converts HTML to plain text.
Stars: ✭ 100 (-53.7%)
Mutual labels:  readability
Reading List Mover
A Python utility for moving bookmarks/reading lists between services
Stars: ✭ 166 (-23.15%)
Mutual labels:  readability
Readability4j
A Kotlin port of Mozilla's Readability. It extracts a website's relevant content and removes all clutter from it.
Stars: ✭ 46 (-78.7%)
Mutual labels:  readability
Reader
Extract clean(er), readable text from web pages via Mercury Web Parser.
Stars: ✭ 75 (-65.28%)
Mutual labels:  readability
Py Readability Metrics
📗 Score text readability using a number of formulas: Flesch-Kincaid Grade Level, Gunning Fog, ARI, Dale Chall, SMOG, and more
Stars: ✭ 132 (-38.89%)
Mutual labels:  readability
Opendyslexic Chrome
Official OpenDyslexic chrome extension
Stars: ✭ 36 (-83.33%)
Mutual labels:  readability
Cadmium
Natural Language Processing (NLP) library for Crystal
Stars: ✭ 172 (-20.37%)
Mutual labels:  readability
Code Review Tips
🔬 Common problems to look for in a code review
Stars: ✭ 861 (+298.61%)
Mutual labels:  readability
Mercury fulltext
📖 Enjoy full text for tt-rss.
Stars: ✭ 123 (-43.06%)
Mutual labels:  readability
Readability
📚 Turn any web page into a clean view
Stars: ✭ 2,281 (+956.02%)
Mutual labels:  readability
Article Parser
To extract main article from given URL with Node.js
Stars: ✭ 179 (-17.13%)
Mutual labels:  readability
Readability
visualise readability
Stars: ✭ 160 (-25.93%)
Mutual labels:  readability

readabilitySAX

a fast and platform independent readability port

About

This is a port of the algorithm used by the Readability bookmarklet to extract relevant pieces of information from websites, using a SAX parser.

The advantage over other ports, e.g. arrix/node-readability, is a smaller memory footprint and much faster execution. In my tests, most pages, even large ones, finished within 15 ms (on node; see below for more information). It works with Rhino, so it runs on YQL, which may have interesting uses. It also works within a browser.

The Readability extraction algorithm was completely ported, but some adjustments were made:

  • <article> and <section> tags are recognized and gain a higher value

  • If a heading is part of the page's <title>, it is removed (Readability removed any single <h2> and ignored other tags)

  • hentry and instapaper-body are classes that tell an algorithm like this one where the content is. readabilitySAX recognizes them and awards additional points

  • Every bit of code that was taken from the original algorithm was optimized, e.g. RegExps should now perform faster (they were optimized and use RegExp#test instead of String#match, so the interpreter doesn't have to build an array)

  • Some improvements made by GGReadability (an Obj-C port of Readability) were adopted

    • Images get additional scores when their height or width attributes are high - icon sized images (<= 32px) get skipped
    • Additional classes & ids are checked
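
The RegExp#test point in the list above can be illustrated with a small sketch. The pattern and the sample class name are made up for illustration and are not taken from readabilitySAX's source; the only point is that test returns a boolean, whereas match builds an array describing the match.

```javascript
// Hypothetical pattern, for illustration only (not copied from readabilitySAX).
var positivePattern = /article|content|entry/;
var className = "main-content sidebar";

// String#match allocates an array describing the match (or returns null).
var matchResult = className.match(positivePattern);

// RegExp#test only answers yes/no, so no result array has to be built.
var isPositive = positivePattern.test(className);

console.log(Array.isArray(matchResult)); // prints true
console.log(isPositive);                 // prints true
```

When only a yes/no answer is needed, as when checking a class name against a pattern, test is the cheaper of the two calls.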

How To

Install readabilitySAX

npm install readabilitySAX

CLI

A command line interface (CLI) may be installed via

npm install -g readabilitySAX

It's then available via

readability <domain> [<format>]

To get this readme, just run

readability https://github.com/FB55/readabilitySAX

The format is optional (it's either text or html, the default value is text).

Usage

Node

Just run require("readabilitySAX"). You'll get an object containing three methods:

  • Readability(settings): The readability constructor. It works as a handler for htmlparser2. Read more about it in the wiki!

  • WritableStream(settings, cb): A constructor that unites htmlparser2 and the Readability constructor. It's a writable stream, so simply .write all your data to it. Your callback will be called once .end has been called. Bonus: You can also .pipe data into it!

  • createWritableStream(settings, cb): Returns a new instance of the WritableStream. (It's a simple factory method.)
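
As a rough sketch of the stream-based usage described above: the helper below takes the object returned by require("readabilitySAX") as a parameter, creates a stream with createWritableStream, writes a string to it, and ends it. The helper name extractFromString is hypothetical, and the shape of whatever the callback receives is not specified in this readme, so the sketch passes it through unchanged.

```javascript
// Hypothetical helper (not part of readabilitySAX).
// `readability` is assumed to be the object returned by require("readabilitySAX").
function extractFromString(readability, html, settings, callback) {
  // createWritableStream(settings, cb) returns a new WritableStream instance.
  var stream = readability.createWritableStream(settings, callback);

  // Write all data to the stream ...
  stream.write(html);

  // ... and end it; the callback is called once .end() has been called.
  stream.end();

  return stream;
}
```

It might be called as extractFromString(require("readabilitySAX"), html, {}, function (result) { console.log(result); }), where passing an empty settings object is itself an assumption. Because the target is a writable stream, a readable source could equally be sent to it with .pipe instead of the explicit write and end calls.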

There are two methods available that are deprecated and will be removed in a future version:

  • get(link, [settings], callback): Gets a webpage and processes it.

  • process(data): Takes a string, runs readabilitySAX and returns the page.

Please don't use those two methods anymore. Streams are the way you should build interfaces in node, and that's what I want to encourage people to use.

Browsers

I started to implement simplified SAX-"parsers" for Rhino/YQL (using E4X) and the browser (using the DOM) to increase the overall performance on those platforms. The DOM version is inside the /browsers dir.

A demo of how to use readabilitySAX inside a browser may be found at jsFiddle. Some basic example files are inside the /browsers directory.

YQL

A table using E4X-based events is available as the community table redabilitySAX, as well as here.

Parsers (on node)

Most SAX parsers (such as sax.js) fail when a document is malformed XML, even if it's correct HTML. readabilitySAX should be used with htmlparser2, my fork of the htmlparser module (used by e.g. jsdom), which corrects most faults. It's listed as a dependency, so npm should install it with readabilitySAX.

Performance

Speed

Using a package of 724 pages from CleanEval (their website seems to be down; try searching for it), readabilitySAX processed all of them in 5768 ms, an average of 7.97 ms per page.

The benchmark was done using tests/benchmark.js on a MacBook (late 2010) and is probably far from perfect.
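
The per-page average follows directly from the totals quoted above. The timing helper in the sketch below only shows how such a measurement could be taken; it is not the code in tests/benchmark.js, and timeBatch and processPage are hypothetical names.

```javascript
// Average from the figures quoted above: 724 pages in 5768 ms.
var averageMs = 5768 / 724;
console.log(averageMs.toFixed(2)); // prints 7.97

// Hypothetical timing helper (not the project's tests/benchmark.js).
// `processPage` stands in for whatever function processes a single page.
function timeBatch(pages, processPage) {
  var start = Date.now();
  pages.forEach(function (page) {
    processPage(page);
  });
  var totalMs = Date.now() - start;
  return {
    pages: pages.length,
    totalMs: totalMs,
    averageMs: totalMs / pages.length
  };
}
```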

Performance is the main goal of this project. The current speed should be good enough to run readabilitySAX on a single-threaded web server with an average number of requests. That's an accomplishment!

Accuracy

The main goal of CleanEval is to evaluate the accuracy of an algorithm.

// TODO

Todo

  • Add documentation & examples
  • Add support for URLs containing hash-bangs (#!)
  • Allow fetching articles with more than one page
  • Don't remove all images inside <a> tags