danburzo / Percollate

Licence: MIT
A command-line tool to turn web pages into beautiful, readable PDF, EPUB, or HTML docs.

Programming Languages

javascript
184084 projects - #8 most used programming language
CSS
56736 projects
HTML
75241 projects

Projects that are alternatives of or similar to Percollate

Md To Pdf
Hackable CLI tool for converting Markdown files to PDF using Node.js and headless Chrome.
Stars: ✭ 374 (-89.42%)
Mutual labels:  cli, pdf, puppeteer
Pdfsave
Convert websites into readable PDFs
Stars: ✭ 46 (-98.7%)
Mutual labels:  cli, pdf, readability
Chrome Headless Render Pdf
Stars: ✭ 164 (-95.36%)
Mutual labels:  cli, pdf
Pubs
Your bibliography on the command line
Stars: ✭ 176 (-95.02%)
Mutual labels:  cli, pdf
rePocketable
Tool to fetch articles from (getPocket|the web) and turn them into epub
Stars: ✭ 49 (-98.61%)
Mutual labels:  epub, readability
Kju
Kju — Improved waiting time for the adidas.com splash page ❯❯❯_
Stars: ✭ 68 (-98.08%)
Mutual labels:  cli, puppeteer
Site Scan
CLI for capturing website screenshots, powered by puppeteer.
Stars: ✭ 137 (-96.12%)
Mutual labels:  cli, puppeteer
Gscholar
Query Google Scholar with Python
Stars: ✭ 209 (-94.09%)
Mutual labels:  cli, pdf
Backslide
💦 CLI tool for making HTML presentations with Remark.js using Markdown
Stars: ✭ 679 (-80.79%)
Mutual labels:  cli, pdf
Starter Book
A book starter to kickstart your writing journey 🎉
Stars: ✭ 277 (-92.16%)
Mutual labels:  pdf, epub
Cloud Reports
Scans your AWS cloud resources and generates reports. Check out free hosted version:
Stars: ✭ 255 (-92.79%)
Mutual labels:  pdf, puppeteer
Redux Offline Docs
Redux documentation in PDF, ePub and MOBI formats for offline reading.
Stars: ✭ 292 (-91.74%)
Mutual labels:  pdf, epub
Page2image
📷 page2image is a npm package for taking screenshots which also provides CLI command
Stars: ✭ 66 (-98.13%)
Mutual labels:  cli, puppeteer
Go Audio
An offline solution to convert pdfs into audiobooks
Stars: ✭ 153 (-95.67%)
Mutual labels:  cli, pdf
Singlefilez
Web Extension for Firefox/Chrome/MS Edge and CLI tool to save a faithful copy of an entire web page in a self-extracting HTML/ZIP polyglot file
Stars: ✭ 882 (-75.05%)
Mutual labels:  cli, puppeteer
Equa11y
A stream-lined command line tool for developers to easily run accessibility testing locally through axe-core and puppeteer.
Stars: ✭ 201 (-94.31%)
Mutual labels:  cli, puppeteer
Ruby Hacking Guide.github.com
Ruby Hacking Guide Translation
Stars: ✭ 305 (-91.37%)
Mutual labels:  pdf, epub
Wbot
A simple Web based BOT for WhatsApp™ in NodeJS 😜. Working as of 📅 Feb 14th, 2020
Stars: ✭ 638 (-81.95%)
Mutual labels:  cli, puppeteer
Epr
CLI Epub Reader
Stars: ✭ 657 (-81.41%)
Mutual labels:  cli, epub
HtmlOrMarkdownConvertedToPdf
📚 Node.js crawler + percollate to fetch online tutorials and convert them into PDF e-books, continuously updated
Stars: ✭ 62 (-98.25%)
Mutual labels:  epub, puppeteer

percollate

Percollate is a command-line tool that turns web pages into beautifully formatted PDF, EPUB, or HTML files.

Sample Output

Sample spread from the generated PDF of a chapter in Dimensions of Colour; rendered here in black & white for a smaller image file size.

Installation

percollate is a Node.js command-line tool which you can install globally from npm:

npm install -g percollate

Percollate and its dependencies require Node.js 12.20.0 or later.
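
Before installing, it can help to confirm that the local Node.js meets this minimum. The snippet below is a minimal sketch, not part of percollate: version_at_least is a hypothetical helper name, and it assumes a sort that supports -V (version sort), as GNU coreutils does.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="12.20.0"   # minimum Node.js version stated above

if command -v node >/dev/null 2>&1; then
  # `node --version` prints a leading "v", e.g. "v18.17.0"; strip it.
  current="$(node --version | sed 's/^v//')"
  if version_at_least "$current" "$required"; then
    echo "Node.js $current is recent enough"
  else
    echo "Node.js $current is older than $required"
  fi
else
  echo "Node.js not found on PATH"
fi
```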

Community-maintained packages

There's a packaged version available on the Arch User Repository, which you can install using your local AUR helper (yay, pacaur, or similar):

yay -S nodejs-percollate

Usage

Run percollate --help for a list of available commands and options.

Percollate is invoked on one or more operands (usually URLs):

percollate <command> [options] url [url]...

The following commands are available:

  • percollate pdf produces a PDF file;
  • percollate epub produces an EPUB file;
  • percollate html produces an HTML file.

The operands can be URLs, paths to local files, or the - character, which stands for stdin (the standard input).

Available options

Unless otherwise stated, these options apply to all three commands.

-o, --output

Specify the path of the resulting bundle relative to the current folder.

percollate pdf https://example.com -o my-example.pdf

-u, --url

Using the - operand you can read the HTML content from stdin, as fetched by a separate command, such as curl. In this sort of setup, percollate does not know the URL from which the content has been fetched, and relative paths on images, anchors, et cetera won't resolve correctly.

Use the --url option to supply the source's original URL.

curl https://example.com | percollate pdf - --url=https://example.com

--individual

By default, percollate bundles all web pages in a single file. Use the --individual flag to export each source to a separate file.

percollate pdf --individual http://example.com/page1 http://example.com/page2

--template

Path to a custom HTML template. Applies to pdf and html.

--style

Path to a custom CSS stylesheet, relative to the current folder.

--css

Additional CSS styles you can pass from the command-line to override styles specified by the default/custom stylesheet.

--no-amp

Don't prefer the AMP version of the web page.

--debug

Print more detailed information.

-t, --title

Provide a title for the bundle.

percollate epub http://example.com/page-1 http://example.com/page-2 --title="Best Of Example"

-a, --author

Provide an author for the bundle.

percollate pdf --author="Ella Example" http://example.com

--cover

Generate a cover. The option is implicitly enabled when the --title option is provided, or when bundling more than one web page to a single file. Disable this implicit behavior by passing the --no-cover flag.

--toc

Generate a hyperlinked table of contents. The option is implicitly enabled when bundling more than one web page to a single file. Disable this implicit behavior by passing the --no-toc flag.

Applies to pdf and html.

--hyphenate

Hyphenation is enabled by default for pdf, and disabled for epub and html. You can opt into hyphenation with the --hyphenate flag, or disable it with the --no-hyphenate flag.

See also the Hyphenation and justification recipe.

--inline

Embed images inline with the document. Images are fetched and converted to Base64-encoded data URLs.

This option is particularly useful for html to produce self-contained HTML files.
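
As a rough illustration, the flag can be wrapped in a small shell function for saving offline copies. save_offline_html is a hypothetical helper name, and the URL in the usage line is a placeholder.

```shell
# Hypothetical helper: save one page as a single, self-contained HTML file.
save_offline_html() {
  # $1: URL or local file to convert, $2: output path
  percollate html --inline --output="$2" "$1"
}

# Usage: save_offline_html https://example.com/page1 page1.html
```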

Recipes

Basic bundling

To turn a single web page into a PDF:

percollate pdf --output=some.pdf https://example.com

To bundle several web pages into a single PDF, specify them as separate arguments to the command:

percollate pdf --output=some.pdf https://example.com/page1 https://example.com/page2

You can use common Unix commands and keep the list of URLs in a newline-delimited text file:

cat urls.txt | xargs percollate pdf --output=some.pdf
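
To check what xargs will actually run before fetching anything, one option is a dry run with echo. The sketch below assumes a urls.txt in the current folder; the example.com pages are placeholders.

```shell
# Create a newline-delimited list of URLs (placeholder pages).
cat > urls.txt <<'EOF'
https://example.com/page1
https://example.com/page2
EOF

# Dry run: prefixing the command with `echo` prints the command line
# that xargs builds instead of executing it.
xargs echo percollate pdf --output=some.pdf < urls.txt

# Once the printed command looks right, drop the `echo` to run it for real.
```

Each line of urls.txt is appended as a separate operand, so the result is the same single bundled PDF as listing the URLs by hand.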

To transform several web pages into individual PDF files at once, use the --individual flag:

percollate pdf --individual https://example.com/page1 https://example.com/page2

If you'd like to fetch the HTML with an external command, you can use - as an operand, which stands for stdin (the standard input):

curl https://example.com/page1 | percollate pdf --url=https://example.com/page1 -

Notice we're using the --url option to tell percollate where the (now-anonymous) HTML it receives on stdin came from, so that relative URLs on links and images resolve correctly.
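
Since the URL has to be typed twice in this setup, a small wrapper can keep the two in sync. This is a sketch under assumptions: fetch_to_pdf is a hypothetical helper name, and the curl flags are just one reasonable choice.

```shell
# Hypothetical helper: fetch a page with curl and pipe it to percollate,
# passing the same URL to --url so relative links and images resolve.
fetch_to_pdf() {
  # $1: page URL, $2: output path
  url="$1"
  # --location follows redirects; if the page redirects, the final URL
  # may differ from "$url", which this sketch does not account for.
  curl --silent --show-error --location "$url" |
    percollate pdf --url="$url" --output="$2" -
}

# Usage: fetch_to_pdf https://example.com/page1 page1.pdf
```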

The --css option

The --css option lets you pass a small snippet of CSS to percollate. Here are some common use-cases:

Custom page size / margins

The default page size is A5 (portrait). You can use the --css option to override it using any supported CSS size:

percollate pdf --css "@page { size: A3 landscape }" http://example.com

Similarly, you can define:

  • custom margins, e.g. @page { margin: 0 }
  • the base font size: html { font-size: 10pt }
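
These rules can presumably be combined in a single --css string, since the option accepts a snippet of CSS. The sketch below wraps one such combination in a shell function; pdf_a4 is a hypothetical helper name and the sizes are illustrative values, not recommendations.

```shell
# Hypothetical helper: A4 portrait pages, 2cm margins, 10pt base font.
pdf_a4() {
  # $1: URL or local file to convert, $2: output path
  percollate pdf \
    --css "@page { size: A4 portrait; margin: 2cm } html { font-size: 10pt }" \
    --output="$2" "$1"
}

# Usage: pdf_a4 https://example.com example-a4.pdf
```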

Changing the font stacks

The default stylesheet includes CSS variables for the fonts used in the PDF:

:root {
	--main-font: Palatino, 'Palatino Linotype', 'Times New Roman',
		'Droid Serif', Times, 'Source Serif Pro', serif, 'Apple Color Emoji',
		'Segoe UI Emoji', 'Segoe UI Symbol';
	--alt-font: 'helvetica neue', ubuntu, roboto, noto, 'segoe ui', arial,
		sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol';
	--code-font: Menlo, Consolas, monospace;
}

CSS variable    What it does
--main-font     The font stack used for body text
--alt-font      Used in headings, captions, et cetera
--code-font     Used for code snippets

To override them, use the --css option:

percollate pdf --css ":root { --main-font: 'PT Serif';  --alt-font: Roboto; }" http://example.com

💡 To work correctly, you must have the fonts installed on your machine. Custom web fonts currently require you to use a custom CSS stylesheet / HTML template.

Remove the appended hrefs from hyperlinks

The idea with percollate is to make PDFs that can be printed without losing track of where the hyperlinks point. However, for some link-heavy pages, the appended hrefs can become bothersome. You can remove them using:

percollate pdf --css "a:after { display: none }" http://example.com

Hyphenation and justification

Hyphenation is only enabled by default for PDFs, but you can opt in or out of it for any output format with a flag.

When hyphenation is enabled, paragraphs will be justified:

.article__content p {
	text-align: justify;
}

If you prefer left-aligned text:

percollate pdf --css ".article__content p { text-align: left }" http://example.com

The --style option

The --style option lets you use your own CSS stylesheet instead of the default one. Here are some common use-cases for this option:

⚠️ TODO add examples here
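
One possible use, sketched here under assumptions rather than taken from the project: write a stylesheet to a file and pass its path to --style. The file name my-style.css and the helper make_pdf_with_style are hypothetical. The only selectors used are ones that appear elsewhere in this document.

```shell
# Write a minimal custom stylesheet (hypothetical file name).
cat > my-style.css <<'EOF'
@page { size: A4; margin: 2cm }
html { font-size: 11pt }
.article__content p { text-align: left }
EOF

# Hypothetical helper: render a page with the custom stylesheet.
make_pdf_with_style() {
  # $1: URL or local file to convert, $2: output path
  percollate pdf --style ./my-style.css --output="$2" "$1"
}

# Usage: make_pdf_with_style https://example.com styled.pdf
```

Because the custom stylesheet is used instead of the default one, a real stylesheet would likely need many more rules than this; starting from a copy of the default stylesheet is probably more practical.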

The --template option

The --template option lets you use a custom HTML template for the PDF.

💡 The HTML template is parsed with nunjucks, which is a close JavaScript relative of Twig for PHP, Jinja2 for Python, and Liquid for Ruby.

Here are some common use-cases:

Customizing the page header / footer

Puppeteer can print some basic information about the page in the PDF. The following CSS class names are available for the header / footer, into which the appropriate content will be injected:

  • date — The formatted print date
  • title — The document title
  • url — document location (Note: this will print the path of the temporary html, not the original web page URL)
  • pageNumber — the current page number
  • totalPages — total pages in the document

👉 See the Chromium source code for details.

You place your header / footer template in a template element in your HTML:

<template class="header-template"> My header </template>

<template class="footer-template">
	<div class="text center">
		<span class="pageNumber"></span>
	</div>
</template>

See the default HTML for example usage.

You can add CSS styles to the header / footer with either the --css option or a separate CSS stylesheet (the --style option).

💡 The header / footer templates do not inherit their styles from the rest of the page (i.e. they are not part of the cascade), so you'll have to write the full CSS you want to apply to them.

An example from the default stylesheet:

.footer-template {
	font-size: 10pt;
	font-weight: bold;
}
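
To adjust the footer without writing a separate stylesheet, the same selector can be passed through the --css option. This is a minimal sketch: pdf_small_footer is a hypothetical helper name and the values are illustrative.

```shell
# Hypothetical helper: override the default footer styling via --css.
pdf_small_footer() {
  # $1: URL or local file to convert, $2: output path
  percollate pdf \
    --css ".footer-template { font-size: 8pt; font-weight: normal }" \
    --output="$2" "$1"
}

# Usage: pdf_small_footer https://example.com example.pdf
```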

Updating

To keep the tool up-to-date, you can run:

npm install -g percollate

Occasionally, an upgrade might not go according to plan; in this case, you can uninstall and re-install percollate:

npm uninstall -g percollate && npm install -g percollate

How it works

All export formats follow a common pipeline:

  1. Fetch the page(s) using node-fetch
  2. If an AMP version of the page exists, use that instead (disable with --no-amp flag)
  3. Enhance the DOM using jsdom
  4. Pass the DOM through mozilla/readability to strip unnecessary elements
  5. Apply the HTML template and the stylesheet to the resulting HTML

Different formats then use different tools to produce the final file.

PDFs are rendered with puppeteer.

EPUBs have external images fetched and bundled together with the HTML of each article. When the --inline option is used, images are instead converted to data URLs and embedded into the HTML.

HTML files are saved without any further changes. When the --inline option is used, images are converted to data URLs and embedded into the HTML. External images are not otherwise fetched.

Limitations

Percollate inherits the limitations of two of its main components, Readability and Puppeteer (headless Chrome).

The imperative approach Readability takes will not be perfect in each case, especially on HTML pages with atypical markup; you may occasionally notice that it either leaves in superfluous content, or that it strips out parts of the content. You can confirm the problem against Firefox's Reader View. In this case, consider filing an issue on mozilla/readability.

Using a browser to generate the PDF is a double-edged sword. On the one hand, you get excellent support for web platform features. On the other hand, print CSS as defined by W3C specifications is only partially implemented, and it seems unlikely that support will be improved any time soon. However, even with modest print support, I think Chrome is the best (free) tool for the job.

Troubleshooting

On some Linux machines you'll need to install a few more Chrome dependencies before percollate works correctly. (Thanks to @ptica for sorting it out)

The percollate pdf command supports the --no-sandbox Puppeteer flag, but make sure you're aware of the implications before disabling the sandbox.
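
One cautious pattern is to make disabling the sandbox an explicit opt-in instead of a default. In the sketch below, the variable PERCOLLATE_NO_SANDBOX and the helper pdf_maybe_unsandboxed are names invented for this example; percollate itself does not read that variable.

```shell
# Hypothetical helper: pass --no-sandbox only when explicitly requested
# through an environment variable invented for this sketch.
pdf_maybe_unsandboxed() {
  # $1: URL or local file to convert, $2: output path
  if [ "${PERCOLLATE_NO_SANDBOX:-0}" = "1" ]; then
    percollate pdf --no-sandbox --output="$2" "$1"
  else
    percollate pdf --output="$2" "$1"
  fi
}

# Usage (sandbox kept):     pdf_maybe_unsandboxed https://example.com out.pdf
# Usage (sandbox disabled): PERCOLLATE_NO_SANDBOX=1 pdf_maybe_unsandboxed https://example.com out.pdf
```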

Contributing

Contributions of all kinds are welcome! See CONTRIBUTING.md for details.

See also

Here are some other projects to check out if you're interested in building books using the browser:
