Strumenta / SmartReader

Licence: Apache-2.0
SmartReader is a library to extract the main content of a web page, based on a port of the Readability library by Mozilla.

SmartReader

A library to extract the main content of a web page, removing ads, sidebars, etc.


What and Why

This library targets .NET Standard 2.0. The core algorithm is a port of the Mozilla Readability library. The original library is stable and used in production inside Firefox, so SmartReader can piggyback on Mozilla's hard and well-tested work.

SmartReader also adds some improvements to the original library, extracting more and better metadata:

  • site name
  • an author and publication date
  • the language
  • the excerpt of the article
  • the featured image
  • a list of images found (it can optionally also download them and store them as data URIs)
  • an estimate of the time needed to read the article

It also lets you perform custom operations before and after extracting the article.

Feel free to suggest new features.

Installation

Install the NuGet package from the Package Manager Console:

PM> Install-Package SmartReader

Usage

There are two main ways to use the library:

  • The first is to create a new Reader object, with the URI as the argument, and then call the GetArticle method to obtain the extracted Article.

  • The second is to call one of the static ParseArticle methods of Reader directly, which returns an Article.

Both ways are also available as async methods, called GetArticleAsync and ParseArticleAsync respectively. The advantage of using an object, instead of the static method, is that it gives you the chance to set some options.

You can also directly parse a String or Stream that you have obtained in some other way. This is available either through one of the ParseArticle methods or through the appropriate Reader constructor. In either case, you also need to provide the original URI. The library will not re-download the text, but it needs the URI to perform some checks and to fix the links present on the page. If you cannot provide the original URI, you can use a placeholder, such as https://localhost.

If the extraction fails, the returned Article object will have the field IsReadable set to false.
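
As a minimal sketch of the string-based route, the following parses HTML that is already in memory, using a placeholder URI, and checks IsReadable afterwards. It assumes that the Reader constructor for strings takes the URI first and the text second; the class name, the ExtractAsync helper and the sample HTML are hypothetical and only serve the illustration.

```csharp
using System;
using System.Threading.Tasks;
using SmartReader;

class ParseFromStringExample
{
    // Extracts an article from HTML that was obtained by other means.
    // The uri is not downloaded again; it is only used for checks and to fix links.
    public static async Task<Article> ExtractAsync(string html, string uri = "https://localhost/")
    {
        // Assumption: the (uri, text) constructor overload is the one meant for strings.
        var reader = new Reader(uri, html);
        return await reader.GetArticleAsync();
    }

    static async Task Main()
    {
        // Hypothetical page content; in practice this comes from a file, a cache, etc.
        string html = "<html><head><title>Sample</title></head><body><p>Short page.</p></body></html>";

        Article article = await ExtractAsync(html);

        // A page this short is unlikely to pass the readability checks,
        // so both outcomes are handled.
        Console.WriteLine(article.IsReadable
            ? $"Article title {article.Title}"
            : "No article could be extracted.");
    }
}
```

The synchronous GetArticle method can be used in the same way when async code is not an option.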

The content of the article is unstyled, but it is wrapped in a div with the id readability-content that you can style yourself.

The library tries to detect the correct encoding of the text, provided that the relevant tags are present in the document.

Getting Images

On the Article object you can call GetImagesAsync to obtain a Task for a list of Image objects, representing the images found in the extracted article. The method is async because it makes HEAD requests to obtain the size of the images, and it returns only the ones that are bigger than the specified size. The default size is 75 KB. This excludes things such as images used in the UI.

On the Article object you can also call ConvertImagesToDataUriAsync to inline the images found in the article using the data URI scheme. The method is async. This will insert the images into the Content property of the Article. This may significantly increase the size of Content.
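
The two image methods could be combined roughly as follows to store an offline copy of an article. This is only a sketch: the class name, the SaveOfflineCopyAsync helper and the outputPath parameter are hypothetical, both calls rely on the default size threshold, and the result of GetImagesAsync is assumed to be enumerable.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using SmartReader;

class OfflineCopyExample
{
    // Returns the number of images found, or 0 when no article was extracted.
    public static async Task<int> SaveOfflineCopyAsync(Article article, string outputPath)
    {
        if (!article.IsReadable)
        {
            return 0; // nothing to store
        }

        // Uses the default size threshold; this performs HEAD requests.
        var images = await article.GetImagesAsync();
        int count = images.Count();

        // Rewrites Content so that the qualifying images are embedded as data URIs.
        await article.ConvertImagesToDataUriAsync();

        File.WriteAllText(outputPath, article.Content);
        return count;
    }
}
```

Checking IsReadable first avoids making network requests for a page from which nothing was extracted.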

The data URI scheme is not efficient, because it uses Base64 to encode the bytes of the image. Base64-encoded data is approximately 33% larger than the original data. The purpose of this method is to provide an offline article that can be fully stored long term. This is useful in case the original article is no longer accessible. The method only converts the images that are bigger than the specified size. The default size is 75 KB. This excludes things such as images used in the UI.
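
The 33% figure follows from Base64 turning every 3 bytes into 4 characters. The sketch below works through that arithmetic with the standard library only; it does not use SmartReader, and the class name, the EncodedLength helper and the image/png media type are illustrative assumptions.

```csharp
using System;

class Base64Overhead
{
    // Length of the Base64 text for a payload of byteCount bytes:
    // every group of 3 input bytes (rounded up) becomes 4 output characters.
    public static long EncodedLength(long byteCount) => 4 * ((byteCount + 2) / 3);

    static void Main()
    {
        byte[] image = new byte[75_000]; // stand-in for a 75 KB image

        // A data URI is the media type, a marker and the Base64 payload.
        string payload = Convert.ToBase64String(image);
        string dataUri = "data:image/png;base64," + payload;

        Console.WriteLine(image.Length);                // 75000
        Console.WriteLine(payload.Length);              // 100000
        Console.WriteLine(EncodedLength(image.Length)); // 100000
        Console.WriteLine(dataUri.Length - payload.Length); // 22, the length of the prefix
    }
}
```

So a 75,000-byte image adds about 100,000 characters to Content, which is why inlining many images can make the stored article considerably larger.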

Note that this method will not store external elements that are not images, such as embedded videos.

Examples

Using the GetArticle method.

SmartReader.Reader sr = new SmartReader.Reader("https://arstechnica.com/information-technology/2017/02/humans-must-become-cyborgs-to-survive-says-elon-musk/");

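// Optional: enable debugging and send the log messages to the console
sr.Debug = true;
sr.LoggerDelegate = Console.WriteLine;

SmartReader.Article article = sr.GetArticle();
// GetImagesAsync returns a Task; await it to obtain the list of images
var images = article.GetImagesAsync();

if(article.IsReadable)
{
	// do something with it
}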

Using the ParseArticle static method.

SmartReader.Article article = SmartReader.Reader.ParseArticle("https://arstechnica.com/information-technology/2017/02/humans-must-become-cyborgs-to-survive-says-elon-musk/");

if(article.IsReadable)
{
	Console.WriteLine($"Article title {article.Title}");
}

Settings

The following settings on the Reader class can be modified.

  • int MaxElemsToParse
    Max number of nodes supported by this parser.
    Default: 0 (no limit)
  • int NTopCandidates
    The number of top candidates to consider when analyzing how tight the competition is among candidates.
    Default: 5
  • bool Debug
    Set the Debug option. If set to true, the library writes debugging data to the LoggerDelegate.
    Default: false
  • Action<string> LoggerDelegate
    A delegate that accepts a string argument; it receives the log messages.
    Default: does not do anything
  • ReportLevel Logging
    Level of information written with the LoggerDelegate. The valid values are the ones of the enum ReportLevel: Issue or Info. The first level logs only errors or issues that could prevent correctly obtaining an article. The second level logs all the information needed for debugging a problematic article.
    Default: ReportLevel.Issue
  • bool ContinueIfNotReadable
    The library tries to determine whether it will find an article before actually trying to extract it. This option decides whether to continue if that heuristic fails. This value is ignored if Debug is set to true.
    Default: true
  • int CharThreshold
    The minimum number of characters an article must have in order to return a result.
    Default: 500
  • bool KeepClasses
    Whether to preserve or clean CSS classes.
    Default: false
  • String[] ClassesToPreserve
    The CSS classes that must be preserved in the article, if we opt to not keep all of them.
    Default: ["page"]
  • bool DisableJSONLD
    The library looks first at JSON-LD to determine metadata. This setting gives you the option of disabling that.
    Default: false
  • int MinContentLengthReadearable
    The minimum node content length used to decide if the document is readerable (i.e., the library will find something useful)
    Default: 140
  • int MinScoreReaderable
    The minimum cumulative 'score' used to determine if the document is readerable
    Default: 20
  • Func<IElement, bool> IsNodeVisible
    The function used to determine if a node is visible. Used in the process of determining if the document is readerable
    Default: NodeUtility.IsProbablyVisible
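
A few of these settings could be combined as in the sketch below. The property-style assignment follows the Debug and LoggerDelegate lines in the Examples section; the values are arbitrary illustrations rather than recommendations, and the class name, the CreateReader helper and the (uri, text) constructor overload are assumptions.

```csharp
using System;
using SmartReader;

class SettingsExample
{
    // Builds a Reader for HTML that is already in memory and adjusts some settings.
    public static Reader CreateReader(string uri, string html)
    {
        var reader = new Reader(uri, html);

        // Accept shorter articles than the default of 500 characters.
        reader.CharThreshold = 200;

        // Strip CSS classes, except for the ones listed here.
        reader.KeepClasses = false;
        reader.ClassesToPreserve = new[] { "page", "caption" };

        // Send detailed log messages to the console.
        reader.Logging = ReportLevel.Info;
        reader.LoggerDelegate = Console.WriteLine;

        // Be more lenient when deciding whether the page is readerable.
        reader.MinScoreReaderable = 10;

        return reader;
    }
}
```

The settings have to be changed before calling GetArticle or GetArticleAsync, since they influence the extraction itself.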

Article Model

A brief overview of the Article model returned by the library.

  • Uri Uri
    Original Uri
  • String Title
    Title
  • String Byline
    Byline of the article, usually containing author and publication date
  • String Dir
    Direction of the text
  • String FeaturedImage
    The main image of the article
  • String Content
    Html content of the article
  • String TextContent
    The plain text of the article with basic formatting
  • String Excerpt
    A summary of the article, based on metadata or first paragraph
  • String Language
    Language string (e.g., 'en-US')
  • String Author
    Author of the article
  • String SiteName
    Name of the site that hosts the article
  • int Length
    Length of the text of the article
  • TimeSpan TimeToRead
    Average time needed to read the article
  • DateTime? PublicationDate
    Date of publication of the article
  • bool IsReadable
    Indicates whether an article was successfully found

It's important to be aware that the fields Byline, Author and PublicationDate are found independently of each other. So there might be some inconsistencies and unexpected data. For instance, Byline may be a string in the form "@Date by @Author" or "@Author, @Date" or any other combination used by the publication.
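
One possible way to cope with these inconsistencies is to prefer the separately extracted fields and fall back to the raw Byline only when they are missing. The helper below is a hypothetical sketch of that policy and not part of SmartReader; it takes the three values as plain arguments, so it can be tried without the library.

```csharp
using System;
using System.Globalization;

class CreditLine
{
    // Builds a display string from values such as Article.Author,
    // Article.PublicationDate and Article.Byline.
    public static string Build(string author, DateTime? publicationDate, string byline)
    {
        bool hasAuthor = !string.IsNullOrWhiteSpace(author);

        // Best case: both structured fields are available.
        if (hasAuthor && publicationDate.HasValue)
        {
            string date = publicationDate.Value.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
            return $"{author.Trim()}, {date}";
        }

        // Otherwise use the byline as published, whatever its format.
        if (!string.IsNullOrWhiteSpace(byline))
        {
            return byline.Trim();
        }

        return hasAuthor ? author.Trim() : "Unknown author";
    }
}
```

With an extracted article, this would be called as CreditLine.Build(article.Author, article.PublicationDate, article.Byline).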

The TimeToRead calculation is based on the research found in Standardized Assessment of Reading Performance: The New International Reading Speed Texts IReST. It should be accurate if the article is written in one of the languages covered by the research, but it is just an educated guess for other languages.

The FeaturedImage property holds the image indicated by the Open Graph or Twitter meta tags. If neither of these is present, and you called the GetImagesAsync method, it will be set to the first image found.

The TextContent property is based on the pure text content of the HTML (i.e., the concatenation of the text nodes). Then we apply some basic formatting, like removing double spaces or the newlines left by the formatting of the HTML code. We also add meaningful newlines for P and BR nodes.
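
The kind of basic formatting described above can be pictured with the following sketch. It is not SmartReader's actual implementation, only a hypothetical helper built on regular expressions that collapses repeated spaces and the blank lines left over from HTML indentation.

```csharp
using System;
using System.Text.RegularExpressions;

class TextCleanup
{
    public static string Normalize(string text)
    {
        // Collapse runs of spaces and tabs into a single space.
        string result = Regex.Replace(text, @"[ \t]{2,}", " ");

        // Collapse consecutive line breaks (and the indentation after them) into one newline.
        result = Regex.Replace(result, @"(\r?\n[ \t]*){2,}", "\n");

        return result.Trim();
    }

    static void Main()
    {
        string raw = "Hello   world\n\n\n   Second  line  ";
        Console.WriteLine(Normalize(raw));
        // Prints:
        // Hello world
        // Second line
    }
}
```

A real implementation also has to decide where newlines are meaningful, for example after P and BR nodes, which requires the HTML structure and not just the text.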

Project Structure

This project has the following directory structure.

Folder                     Description
docfx_project/             Contains the DocFx project that generates the documentation website
src/                       The main source folder
src/SmartReader            Source for the SmartReader library
src/SmartReaderTests       Source for the tests
src/SmartReaderConsole     Source for the example console project
src/SmartReader.WebDemo    Source for the demo web project

Demo

A live web demo is available, so you can test for yourself how effective the library is for your content.

There is also a Docker project for the web demo.

Documentation

This README contains the information needed to get started with the library. For more advanced options, the API reference, and so on, read the documentation on the main website.
