
ArchiveBox / readability-extractor

Licence: other
JavaScript/Node wrapper around Mozilla's Readability library that lets ArchiveBox call it as a one-shot CLI command to extract each page's article text.

Programming Languages

javascript

Projects that are alternatives of or similar to readability-extractor

good-karma-kit
😇 A Docker Compose bundle to run on servers with spare CPU, RAM, disk, and bandwidth to help the world. Includes Tor, ArchiveWarrior, BOINC, and more...
Stars: ✭ 238 (+1222.22%)
Mutual labels:  internet-archiving, archivebox
Archivebox
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
Stars: ✭ 12,383 (+68694.44%)
Mutual labels:  internet-archiving, archivebox
Nodeactyl
A NodeJS API for Pterodactyl panel, this was originally designed for discord.js (Discord bots)
Stars: ✭ 107 (+494.44%)
Mutual labels:  wrapper
fireREST
Python library for interacting with Cisco Firepower Management Center REST API
Stars: ✭ 47 (+161.11%)
Mutual labels:  wrapper
coinmarketcap-api
CoinMarketCap API wrapper for node
Stars: ✭ 111 (+516.67%)
Mutual labels:  wrapper
eslint-plugin-lodash-template
ESLint plugin for John Resig-style micro template, Lodash's template, Underscore's template and EJS.
Stars: ✭ 15 (-16.67%)
Mutual labels:  readability
SharpPhysFS
Managed wrapper for the PhysFS library
Stars: ✭ 14 (-22.22%)
Mutual labels:  wrapper
pygmentize
Pygmentize is a wrapper to `pygmentize`, the command line interface provided by Pygments, a python syntax highlighter.
Stars: ✭ 25 (+38.89%)
Mutual labels:  wrapper
WireGuard-Wrapper
Simple wrapper that makes WireGuard easier to use with VPN providers.
Stars: ✭ 29 (+61.11%)
Mutual labels:  wrapper
TLightFileStream
Implements a lightweight, high-performance, non-allocating advanced-record-based wrapper around the SysUtils file handling routines as an alternative to Classes.TFileStream.
Stars: ✭ 21 (+16.67%)
Mutual labels:  wrapper
uniswap-python
🦄 The unofficial Python client for the Uniswap exchange.
Stars: ✭ 533 (+2861.11%)
Mutual labels:  wrapper
JDSP4Linux
An audio effect processor for PipeWire and PulseAudio clients
Stars: ✭ 192 (+966.67%)
Mutual labels:  wrapper
dotty dict
Dictionary wrapper for quick access to deeply nested keys.
Stars: ✭ 67 (+272.22%)
Mutual labels:  wrapper
with-wrapper
React HOC for wrapper components.
Stars: ✭ 35 (+94.44%)
Mutual labels:  wrapper
ssh2.nim
Async SSH, SCP and SFTP client for Nim, using libssh2 wrapper [WIP]
Stars: ✭ 17 (-5.56%)
Mutual labels:  wrapper
Mega-index-heroku
Mega nz heroku index, Serves mega.nz to http via heroku web. It Alters downloading speed and stability
Stars: ✭ 165 (+816.67%)
Mutual labels:  wrapper
readability
readability for golang. A tool for extracting the title and body text of web articles.
Stars: ✭ 30 (+66.67%)
Mutual labels:  readability
firebase-db-wrapper-swift
An easy-to-use wrapper for Firebase's Realtime Database
Stars: ✭ 16 (-11.11%)
Mutual labels:  wrapper
raylib-nelua
Raylib wrapper to nelua language
Stars: ✭ 27 (+50%)
Mutual labels:  wrapper
MangaDex.py
An easy to use wrapper for the MangaDexAPIv5 written in Python using Requests.
Stars: ✭ 13 (-27.78%)
Mutual labels:  wrapper

Readability-Extractor

This is a tiny JS wrapper library around Mozilla's article-text extraction tool https://github.com/mozilla/readability.

It's designed to be used as an ArchiveBox archive method.

Install

npm install -g 'git+https://github.com/pirate/readability-extractor'

# which is equivalent to this:
curl https://raw.githubusercontent.com/pirate/readability-extractor/master/readability-extractor > /usr/local/bin/readability-extractor
chmod +x /usr/local/bin/readability-extractor

Usage

readability-extractor some_article.html 'https://example.com/original/url/some/article.html' > some_article.json

# some_article.json will then contain:
{
    "title": "Title autodetected from article html",
    "byline": "Autodetected author...",
    "excerpt": "Autodetected short description",
    "dir": "ltr",
    "length": 1337,
    "content": "<div id=\"readability-page-1\" class=\"page\">abc some article body text...</div>",
    "textContent": "abc some article body text..."
}
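
Once the JSON is on disk, any Node script can consume it. The snippet below is a minimal sketch and not part of this project: `summarizeArticle` is a hypothetical helper name, and it assumes only the fields shown in the sample output above (`title`, `byline`, `length`, `textContent`).

```javascript
// Hypothetical helper (not shipped with readability-extractor):
// turn the extractor's JSON output into a short, printable summary.
function summarizeArticle(jsonText) {
  const article = JSON.parse(jsonText);

  // Collapse whitespace runs so the preview fits on a single line.
  const text = (article.textContent || '').replace(/\s+/g, ' ').trim();

  return {
    title: article.title || '(untitled)',
    byline: article.byline || null,
    // Fall back to the cleaned text length if the field is missing.
    length: article.length ?? text.length,
    preview: text.length > 80 ? text.slice(0, 77) + '...' : text,
  };
}

// Example with an inline document shaped like the sample output.
const sample = JSON.stringify({
  title: 'Title autodetected from article html',
  byline: 'Autodetected author...',
  length: 1337,
  textContent: 'abc some article body text...',
});
console.log(summarizeArticle(sample));
```

To run it against a real file, pass `require('fs').readFileSync('some_article.json', 'utf8')` to `summarizeArticle` instead of the inline sample.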

ArchiveBox Integration

# You usually don't need to run these commands.
# Readability is on by default, and ArchiveBox will automatically
# find any installed version in your $PATH.

# However, if you explicitly want to turn readability on
# and/or specify a manual path to the binary, you can do this:
archivebox config --set SAVE_READABILITY=True
archivebox config --set READABILITY_BINARY="$(which readability-extractor)"

# test archiving oneshot using only singlefile+readability
archivebox add --extract=singlefile,readability 'https://example.com'
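
The `$PATH` lookup mentioned above is the same kind of search that `which readability-extractor` performs. The sketch below illustrates the idea in Node so you can check an install from a script. It is not ArchiveBox's actual detection code (ArchiveBox is written in Python), and `findOnPath` is a hypothetical helper name.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: return the first executable file called `name`
// found in the directories of a PATH-style string, or null if none exists.
function findOnPath(name, pathEnv = process.env.PATH || '') {
  for (const dir of pathEnv.split(path.delimiter)) {
    if (!dir) continue; // skip empty PATH entries
    const candidate = path.join(dir, name);
    try {
      // accessSync throws if the file is missing or not executable.
      fs.accessSync(candidate, fs.constants.X_OK);
      if (fs.statSync(candidate).isFile()) return candidate;
    } catch {
      // Not usable from this directory; keep looking.
    }
  }
  return null;
}

const binary = findOnPath('readability-extractor');
console.log(binary ? `found: ${binary}` : 'readability-extractor is not on PATH');
```

If the lookup fails, setting READABILITY_BINARY to an explicit path, as in the commands above, sidesteps the search entirely.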