dankito / Readability4j

Licence: apache-2.0

A Kotlin port of Mozilla's Readability. It extracts a website's relevant content and removes all clutter from it.

Labels

html readability

Projects that are alternatives to or similar to Readability4j

SmartReader

SmartReader is a library to extract the main content of a web page, based on a port of the Readability library by Mozilla

Stars: ✭ 88 (+91.3%)

Mutual labels: readability

Percollate

A command-line tool to turn web pages into beautiful, readable PDF, EPUB, or HTML docs.

Stars: ✭ 3,535 (+7584.78%)

Mutual labels: readability

Stylebot

Change the appearance of the web instantly

Stars: ✭ 746 (+1521.74%)

Mutual labels: readability

Readability.php

PHP port of Mozilla's Readability.js

Stars: ✭ 280 (+508.7%)

Mutual labels: readability

Elixir Scrape

Scrape any website, article or RSS/Atom Feed with ease!

Stars: ✭ 306 (+565.22%)

Mutual labels: readability

Midnight Lizard

Custom color schemes for all websites

Stars: ✭ 406 (+782.61%)

Mutual labels: readability

ReadabiliPy

A simple HTML content extractor in Python. Can be run as a wrapper for Mozilla's Readability.js package or in pure-python mode.

Stars: ✭ 55 (+19.57%)

Mutual labels: readability

Opendyslexic Chrome

Official OpenDyslexic Chrome extension

Stars: ✭ 36 (-21.74%)

Mutual labels: readability

Elements Of Python Style

Goes beyond PEP8 to discuss what makes Python code feel great. A Strunk & White for Python.

Stars: ✭ 3,308 (+7091.3%)

Mutual labels: readability

Textstat

📝 python package to calculate readability statistics of a text object - paragraphs, sentences, articles.

Stars: ✭ 590 (+1182.61%)

Mutual labels: readability

Graby

Graby helps you extract article content from web pages

Stars: ✭ 281 (+510.87%)

Mutual labels: readability

Csharpformarkup

Use declarative style C# instead of XAML for Xamarin Forms UI

Stars: ✭ 302 (+556.52%)

Mutual labels: readability

Clean Mark

Convert an article into a clean text

Stars: ✭ 414 (+800%)

Mutual labels: readability

Typographic Email

Responsive email template that is optimised for readability.

Stars: ✭ 268 (+482.61%)

Mutual labels: readability

Code Review Tips

🔬 Common problems to look for in a code review

Stars: ✭ 861 (+1771.74%)

Mutual labels: readability

KaryScript

KaryScript is an experimental language to test the possibilities of a more readable textual language. It compiles to ES6 and can be considered a much better ECMAScript

Stars: ✭ 19 (-58.7%)

Mutual labels: readability

Php Goose

Readability / HTML Content / Article Extractor & Web Scraping library written in PHP

Stars: ✭ 392 (+752.17%)

Mutual labels: readability

Pdfsave

Convert websites into readable PDFs

Stars: ✭ 46 (+0%)

Mutual labels: readability

Just Read

A customizable read mode web extension.

Stars: ✭ 874 (+1800%)

Mutual labels: readability

Simpread

简悦 (SimpRead): an extension that instantly puts you into an immersive reading mode

Stars: ✭ 5,352 (+11534.78%)

Mutual labels: readability

Readability4J

Readability4J is a Kotlin port of Mozilla's Readability.js, which is used for Firefox's reader view: https://github.com/mozilla/readability.

It tries to detect the relevant content of a website and removes all clutter from it such as advertisements, navigation bars, social media buttons, etc.

The extracted text can then be used, for example, to index web pages or to give users a pleasant reading experience.

Because it is compatible with Mozilla's Readability.js, it produces exactly the same output as you would see in Firefox's Reader View (only some whitespace differs due to Jsoup's different formatting, and that is not visible in the rendered output).

Setup

Gradle:

dependencies {
  compile 'net.dankito.readability4j:readability4j:1.0.3'
}

Maven:

<dependency>
   <groupId>net.dankito.readability4j</groupId>
   <artifactId>readability4j</artifactId>
   <version>1.0.3</version>
</dependency>

Usage

String url = ...;
String html = ...;

Readability4J readability4J = new Readability4J(url, html); // url is just needed to resolve relative urls
Article article = readability4J.parse();

// returns extracted content in a <div> element
String extractedContentHtml = article.getContent();
// to get content wrapped in <html> tags and encoding set to UTF-8, see chapter 'Output encoding'
String extractedContentHtmlWithUtf8Encoding = article.getContentWithUtf8Encoding();
String extractedContentPlainText = article.getTextContent();
String title = article.getTitle();
String byline = article.getByline();
String excerpt = article.getExcerpt();
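
The url argument only matters for links that are not already absolute. As a rough illustration of what resolving a relative URL against the page URL means, here is a minimal sketch that uses only java.net.URI from the JDK. The class and method names are made up for this example and are not part of Readability4J, and the library's own resolution logic may differ in detail.

```java
import java.net.URI;

class RelativeUrlDemo {

    // Resolves a possibly relative link against the URL of the page it was found on.
    // Absolute links are returned unchanged.
    static String resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String pageUrl = "https://example.com/blog/2018/post.html";

        // Path-relative link: resolved against the page's directory
        System.out.println(resolve(pageUrl, "images/cat.png"));
        // Root-relative link: resolved against the host
        System.out.println(resolve(pageUrl, "/about"));
        // Absolute link: left as it is
        System.out.println(resolve(pageUrl, "https://cdn.example.org/a.js"));
    }
}
```

If the HTML contains only absolute links, the value passed as url has no effect on them.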

Readability4J and Readability4JExtended

With the Readability4J class I wanted to stick closely to Mozilla's Readability to keep compatibility.

During development, however, I found some handy features that Readability does not support, e.g. copying the URL from the data-src attribute to <img src="" /> to display lazy-loading images, using the href value of <head><base> to resolve relative URLs, and better detection of which images to keep in the output.

I implemented these features in Readability4JExtended.

To use it, simply instantiate Readability4JExtended instead (the rest of the code stays the same):

Readability4J readability4J = new Readability4JExtended(url, html);
Article article = readability4J.parse();
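
To make two of these extras more concrete, the following sketch shows the general idea behind them using only JDK classes. It is not the code of Readability4JExtended: the class name, the method names, and the use of a plain Map as a stand-in for an <img> element's attributes are all assumptions made for illustration. The base handling follows the usual HTML rule that a <base href> value is itself resolved against the page URL.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

class ExtendedFeaturesSketch {

    // Picks the base for resolving relative links: a <base href> value wins if present
    // and is itself resolved against the page URL, since it may be relative too.
    static URI effectiveBase(String pageUrl, String baseHref) {
        URI page = URI.create(pageUrl);
        if (baseHref == null || baseHref.trim().isEmpty()) {
            return page;
        }
        return page.resolve(baseHref.trim());
    }

    // Copies data-src to src when src is missing or blank, so lazy-loaded images show up.
    // The map is a simplified stand-in for the attributes of an <img> element.
    static Map<String, String> fixLazyImage(Map<String, String> imgAttributes) {
        Map<String, String> fixed = new HashMap<>(imgAttributes);
        String src = fixed.get("src");
        String dataSrc = fixed.get("data-src");
        boolean srcMissing = src == null || src.trim().isEmpty();
        if (srcMissing && dataSrc != null && !dataSrc.trim().isEmpty()) {
            fixed.put("src", dataSrc.trim());
        }
        return fixed;
    }

    public static void main(String[] args) {
        // With <base href="/static/">, links resolve below /static/ instead of /blog/
        URI base = effectiveBase("https://example.com/blog/post.html", "/static/");
        System.out.println(base.resolve("img/a.png"));

        Map<String, String> img = new HashMap<>();
        img.put("src", "");
        img.put("data-src", "img/lazy.png");
        System.out.println(fixLazyImage(img).get("src"));
    }
}
```

An image that already has a non-empty src is left untouched in this sketch, so eagerly loaded images keep their original source.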

Output encoding

As users noted (see Issues #1 and #2), by default no encoding is applied to Readability4J's output, which results in incorrect display of non-ASCII characters.

The reason is that, like Readability.js, Readability4J returns its output in a <div> element, and the only way to set the encoding within an HTML document is a <meta charset=""> tag in <head>.

So I added these convenience methods to the Article class:

String contentHtmlWithUtf8Encoding = article.getContentWithUtf8Encoding();
// or
String contentHtmlWithCustomEncoding = article.getContentWithEncoding("ISO-8859-1");

which wrap the content in

<html>
 <head>
  <meta charset="utf-8" /> 
 </head>
 <body>
 <!-- content -->
 </body>
</html>
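
For readers who want to see the mechanics, here is a small sketch of what such a wrapper amounts to, together with a typical way garbled characters arise when UTF-8 bytes are decoded with the wrong charset. It uses only the JDK. The class and method names are hypothetical, and this is not the implementation behind getContentWithEncoding().

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingWrapperSketch {

    // Wraps an extracted <div> in a minimal HTML document that declares its encoding.
    static String wrapWithEncoding(String contentHtml, String charsetName) {
        // Charset.forName throws for unknown or malformed names, so typos fail early.
        String canonicalName = Charset.forName(charsetName).name();
        return "<html>\n"
                + " <head>\n"
                + "  <meta charset=\"" + canonicalName + "\" />\n"
                + " </head>\n"
                + " <body>\n"
                + contentHtml + "\n"
                + " </body>\n"
                + "</html>";
    }

    // Shows what a consumer ends up with when UTF-8 bytes are decoded as ISO-8859-1.
    static String decodeUtf8BytesAsLatin1(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u00e9 is the letter e with an acute accent
        String content = "<div><p>caf\u00e9</p></div>";
        System.out.println(wrapWithEncoding(content, "UTF-8"));

        // One 2-byte UTF-8 character turns into two wrong characters
        System.out.println(decodeUtf8BytesAsLatin1("caf\u00e9").length());
    }
}
```

The declared charset only helps if the bytes handed to the browser or WebView are actually written in that encoding.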

Compatibility with Mozilla's Readability.js

As mentioned before, this is almost an exact copy of Mozilla's Readability.js. However, since I didn't find the original code very readable, I extracted some parts of its 2000 lines of code into new classes:

Readability.js function	Readability4J location
_removeScripts() and _prepDocument()	Preprocessor.prepareDocument()
_grabArticle()	ArticleGrabber.grabArticle()
_postProcessContent()	Postprocessor.postProcessContent()
_getArticleMetadata()	MetadataParser.getArticleMetadata()

The following overview shows which Mozilla Readability.js commit each Readability4J version matches:

Version	Commit	Date
1.0	8da91b9	12/5/17
1.0.1	834672e	02/27/18

Extensibility

I tried to make the library as extensible as possible. All of the classes mentioned above can be overridden and passed to Readability4J's constructor.
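
The pattern behind this is plain constructor injection: subclass one processing stage, override the method you care about, and hand the instance to the constructor. The sketch below shows only that pattern with hypothetical stand-in classes. The Postprocessor and Extractor types here are defined inside the example and are not the library's real classes, and Readability4J's actual constructor parameters should be taken from its source.

```java
class ExtensibilityDemo {

    // Hypothetical stand-in for one processing stage (not the library's Postprocessor)
    static class Postprocessor {
        String postProcessContent(String content) {
            return content.trim();
        }
    }

    // Custom stage: reuses the default behavior and wraps the result in a marker <div>
    static class WrappingPostprocessor extends Postprocessor {
        @Override
        String postProcessContent(String content) {
            return "<div class=\"reader\">" + super.postProcessContent(content) + "</div>";
        }
    }

    // Hypothetical stand-in for the main class; the stage is passed in via the constructor
    static class Extractor {
        private final Postprocessor postprocessor;

        Extractor() {
            this(new Postprocessor()); // default stage
        }

        Extractor(Postprocessor postprocessor) {
            this.postprocessor = postprocessor;
        }

        String parse(String html) {
            return postprocessor.postProcessContent(html);
        }
    }

    public static void main(String[] args) {
        System.out.println(new Extractor().parse("  <p>Hi</p> "));
        System.out.println(new Extractor(new WrappingPostprocessor()).parse("  <p>Hi</p> "));
    }
}
```

Keeping a no-argument constructor with defaults means callers who do not need customization are unaffected.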

Logging

Readability4J uses slf4j as its logging facade.

So you can use any logger that supports slf4j, such as Logback or log4j, to configure and capture Readability4J's log output.

License

Copyright 2017 dankito

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Stars: ✭ 46
