go-shiori / Go Readability

Licence: MIT
Go package that cleans an HTML page for better readability.

Go-Readability

Go-Readability is a Go package that finds the main readable content and the metadata of an HTML page. It works by removing clutter such as buttons, ads, background images, and scripts.

This package is based on Readability.js by Mozilla, and was written line by line to make sure it looks and works as similarly as possible. This way, hopefully every web page that can be parsed by Readability.js can be parsed by go-readability as well.
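
To make the idea of "removing clutter" concrete, here is a deliberately naive, standard-library-only sketch. It is not how go-readability works internally (the package follows Readability.js and is far more involved); the function name `stripClutter` and the choice of tags are assumptions made purely for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// clutterRe matches whole <script>, <style>, and <button> elements.
// Go's regexp package has no backreferences, so each tag is spelled out.
var clutterRe = regexp.MustCompile(`(?is)<script\b[^>]*>.*?</script>|<style\b[^>]*>.*?</style>|<button\b[^>]*>.*?</button>`)

// tagRe matches any remaining HTML tag.
var tagRe = regexp.MustCompile(`<[^>]+>`)

// stripClutter is a toy illustration only: it drops a few clutter elements,
// removes the remaining tags, and collapses runs of whitespace.
func stripClutter(html string) string {
	s := clutterRe.ReplaceAllString(html, " ")
	s = tagRe.ReplaceAllString(s, "")
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	page := `<html><head><style>p{color:red}</style></head>
<body><button>Subscribe</button><p>Hello, <b>reader</b>.</p>
<script>track()</script></body></html>`
	fmt.Println(stripClutter(page)) // prints "Hello, reader."
}
```

Regular expressions are a poor fit for real-world HTML, which is one reason to rely on a proper parser such as this package rather than on a sketch like the one above.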

Status

This package is stable enough for use and is up to date with Readability.js as of commit d5621f8.

Installation

To install this package, run go get:

go get -u -v github.com/go-shiori/go-readability

Example

To get the readable content from a URL, you can use readability.FromURL. It fetches the web page from the specified URL, checks whether it is readable, then parses the response to find the readable content:

package main

import (
	"fmt"
	"log"
	"os"
	"time"

	readability "github.com/go-shiori/go-readability"
)

var (
	urls = []string{
		// This one is an article, so it is parse-able.
		"https://www.nytimes.com/2019/02/20/climate/climate-national-security-threat.html",
		// This one is not an article, so readability will fail to parse it.
		"https://www.nytimes.com/",
	}
)

func main() {
	for i, url := range urls {
		article, err := readability.FromURL(url, 30*time.Second)
		if err != nil {
			log.Fatalf("failed to parse %s, %v\n", url, err)
		}

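		// Write each file in one call and check the error, instead of
		// ignoring os.Create errors and deferring Close inside the loop.
		txtName := fmt.Sprintf("text-%02d.txt", i+1)
		if err := os.WriteFile(txtName, []byte(article.TextContent), 0o644); err != nil {
			log.Fatalf("failed to write %s: %v\n", txtName, err)
		}

		htmlName := fmt.Sprintf("html-%02d.html", i+1)
		if err := os.WriteFile(htmlName, []byte(article.Content), 0o644); err != nil {
			log.Fatalf("failed to write %s: %v\n", htmlName, err)
		}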

		fmt.Printf("URL     : %s\n", url)
		fmt.Printf("Title   : %s\n", article.Title)
		fmt.Printf("Author  : %s\n", article.Byline)
		fmt.Printf("Length  : %d\n", article.Length)
		fmt.Printf("Excerpt : %s\n", article.Excerpt)
		fmt.Printf("SiteName: %s\n", article.SiteName)
		fmt.Printf("Image   : %s\n", article.Image)
		fmt.Printf("Favicon : %s\n", article.Favicon)
		fmt.Printf("Text content saved to \"text-%02d.txt\"\n", i+1)
		fmt.Printf("HTML content saved to \"html-%02d.html\"\n", i+1)
		fmt.Println()
	}
}

Sometimes you want to parse a URL whether or not it is an article, for example when you only need the page's metadata. To do that, download the page yourself with http.Get, then parse it with readability.FromReader:

package main

import (
	"fmt"
	"log"
	"net/http"

	readability "github.com/go-shiori/go-readability"
)

var (
	urls = []string{
		// Both will be parse-able now
		"https://www.nytimes.com/2019/02/20/climate/climate-national-security-threat.html",
		// But this one will not have any content
		"https://www.nytimes.com/",
	}
)

func main() {
	for _, url := range urls {
		resp, err := http.Get(url)
		if err != nil {
			log.Fatalf("failed to download %s: %v\n", url, err)
		}

		article, err := readability.FromReader(resp.Body, url)
		// Close the body now; a defer inside the loop would hold every
		// response open until main returns.
		resp.Body.Close()
		if err != nil {
			log.Fatalf("failed to parse %s: %v\n", url, err)
		}

		fmt.Printf("URL     : %s\n", url)
		fmt.Printf("Title   : %s\n", article.Title)
		fmt.Printf("Author  : %s\n", article.Byline)
		fmt.Printf("Length  : %d\n", article.Length)
		fmt.Printf("Excerpt : %s\n", article.Excerpt)
		fmt.Printf("SiteName: %s\n", article.SiteName)
		fmt.Printf("Image   : %s\n", article.Image)
		fmt.Printf("Favicon : %s\n", article.Favicon)
		fmt.Println()
	}
}

Command Line Usage

You can also use go-readability as a command-line app. To do that, first install the CLI:

go get -u -v github.com/go-shiori/go-readability/cmd/...

Now you can use it by running go-readability in your terminal:

$ go-readability -h

go-readability is parser to fetch the readable content of a web page.
The source can be an url or existing file in your storage.

Usage:
  go-readability [flags] source

Flags:
  -h, --help       help for go-readability
  -m, --metadata   only print the page's metadata

Licenses

Go-Readability is distributed under the MIT license, which means you can use and modify it however you want. If you make an enhancement, please consider sending a pull request. If you like this project, please consider donating to me via PayPal or Ko-Fi.
