esbenp / Pdf Bot

Licence: mit
🤖 A Node queue API for generating PDFs using headless Chrome. Comes with a CLI, S3 storage and webhooks for notifying subscribers about generated PDFs

🤖 pdf-bot

Easily create a microservice for generating PDFs using headless Chrome.

pdf-bot is installed on a server and receives URLs to turn into PDFs through its API or CLI. pdf-bot manages a queue of PDF jobs. Once a PDF job has run, it notifies you using a webhook so you can fetch the PDF. pdf-bot supports storing PDFs on S3 out of the box. Failed PDF generations and webhook pings are retried on a configurable decaying schedule.

pdf-bot uses html-pdf-chrome under the hood and supports all the settings that it supports. Major thanks to @westy92 for making this possible.

How does it work?

Imagine you have an app that creates invoices. You want to save those invoices as PDFs. You install pdf-bot on a server as an API. Your app server sends the URL of the invoice to the pdf-bot server. A cron job on the pdf-bot server keeps checking for new jobs, generates a PDF using headless Chrome and sends the location back to the application server using a webhook.

Prerequisites

  • Node.js v6 or later

Installation

$ npm install -g pdf-bot
$ pdf-bot install

Make sure the node path is in your $PATH

pdf-bot install will prompt for some basic configuration and then create a storage folder where your database and PDF files will be saved.

Configuration

pdf-bot comes packaged with sensible defaults. At the very minimum you must have a config file, with a storagePath given, in the folder from which you are executing pdf-bot. In practice, you probably want to use the pdf-bot install command to generate a configuration file and then define a shell alias: alias pdf-bot="pdf-bot -c /home/pdf-bot.config.js"

pdf-bot.config.js

var htmlPdf = require('html-pdf-chrome')

module.exports = {
  api: {
    token: 'crazy-secret'
  },
  generator: {
    completionTrigger: new htmlPdf.CompletionTrigger.Timer(1000) // 1 sec timeout
  },
  storagePath: 'storage'
}

$ pdf-bot -c ./pdf-bot.config.js push https://esbenp.github.io

See a full list of the available configuration options.

Usage guide

Structure and concept

pdf-bot is meant to be a microservice that runs a server to generate PDFs for you. That usually means you will send requests from your application server to the PDF server, asking for a URL to be rendered as a PDF. pdf-bot will manage a queue and retry failed generations. Once a job has been generated successfully, a path to the PDF will be sent back to your application server.

Let us check out the flow for an app that generates PDF invoices.

1. (App server): An invoice is created ----> Send URL to invoice to pdf-bot server
2. (pdf-bot server): Put the URL in the queue
3. (pdf-bot server): PDF is generated using headless Chrome
4. (pdf-bot server): (if failed try again using 1 min, 3 min, 10 min, 30 min, 60 min delay)
5. (pdf-bot server): Upload PDF to storage (e.g. Amazon S3)
6. (pdf-bot server): Send S3 location of PDF back to the app server
7. (App server): Receive S3 location of PDF -> Check signature sum matches for security
8. (App server): Handle PDF however you see fit (move it, download it, save it etc.)

You can send meta data to the pdf-bot server that will be sent back to the application. This can help you identify which PDF you are receiving.

Setup

On your pdf-bot server start by creating a config file pdf-bot.config.js. You can see an example file here

pdf-bot.config.js

var createS3Config = require('pdf-bot/src/storage/s3')

module.exports = {
  api: {
    port: 3000,
    token: 'api-token'
  },
  storage: {
    's3': createS3Config({
      bucket: '',
      accessKeyId: '',
      region: '',
      secretAccessKey: ''
    })
  },
  webhook: {
    secret: '1234',
    url: 'http://localhost:3000/webhooks/pdf'
  }
}

As a minimum you should configure an access token for your API. This will be used to authenticate jobs sent to your pdf-bot server. You also need to add a webhook configuration to have PDF notifications sent back to your application server. You should add a secret, which is used to generate a signature so the receiver can check that the request has not been tampered with during transfer.

Start your API using

pdf-bot -c ./pdf-bot.config.js api

This will start an express server that listens for new jobs on port 3000.

Setting up Chrome

pdf-bot uses html-pdf-chrome, which in turn uses chrome-launcher to launch Chrome. You should check out those two resources on how to properly set up Chrome. However, with chrome-launcher Chrome should be started automatically. Otherwise, html-pdf-chrome has a small guide on how to have it running as a process using pm2.

You can install Chrome on Ubuntu using

sudo apt-get update && sudo apt-get install chromium-browser

If you are testing things on OSX or similar, chrome-launcher should be able to find and start Chrome for you automatically.

Setting up the receiving API

In the examples folder there is a small example on how the application API could look. Basically, you just have to define an endpoint that will receive the webhook and check that the signature matches.

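api.post('/hook', function (req, res) {
  // Signature header sent by pdf-bot (prefix is set by webhook.headerNamespace)
  var signature = req.get('X-PDF-Signature')

  // Recompute the HMAC-SHA1 of the body; the key must match webhook.secret
  var bodyCrypted = require('crypto')
    .createHmac('sha1', '1234')
    .update(JSON.stringify(req.body))
    .digest('hex')

  // Reject the request if the signatures differ
  if (bodyCrypted !== signature) {
    res.status(401).send()
    return
  }

  console.log('PDF webhook received', JSON.stringify(req.body))

  res.status(204).send()
})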
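The signature check can also be pulled into a small helper that compares in constant time. This is a minimal sketch, not part of pdf-bot: isValidSignature is a hypothetical name, and it assumes (as the direct comparison in the example above implies) that the header carries the bare hex HMAC-SHA1 digest of the JSON-serialized body. crypto.timingSafeEqual requires Node.js 6.6 or later.

```javascript
var crypto = require('crypto')

// Hypothetical helper (not part of pdf-bot): returns true when `signature`
// is the hex HMAC-SHA1 of the JSON-serialized body under `secret`.
function isValidSignature (body, signature, secret) {
  var expected = crypto
    .createHmac('sha1', secret)
    .update(JSON.stringify(body))
    .digest('hex')

  var expectedBuf = Buffer.from(expected, 'utf8')
  var actualBuf = Buffer.from(String(signature || ''), 'utf8')

  // timingSafeEqual throws on length mismatch, so compare lengths first
  return expectedBuf.length === actualBuf.length &&
    crypto.timingSafeEqual(expectedBuf, actualBuf)
}

// Usage with made-up data; the secret must match webhook.secret
var secret = '1234'
var body = { id: 'job-1', meta: { id: 1 } }
var goodSignature = crypto
  .createHmac('sha1', secret)
  .update(JSON.stringify(body))
  .digest('hex')

console.log(isValidSignature(body, goodSignature, secret)) // true
console.log(isValidSignature(body, 'tampered', secret)) // false
```

Inside the webhook handler this would be called as isValidSignature(req.body, req.get('X-PDF-Signature'), '1234').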

Setup production environment

Follow the guide under production/ to see how to set up pdf-bot using pm2 and nginx.

Setup crontab

We set up our crontab to continuously look for jobs that have not yet been completed.

* * * * * node $(npm bin -g)/pdf-bot -c ./pdf-bot.config.js shift:all >> /var/log/pdfbot.log 2>&1
* * * * * node $(npm bin -g)/pdf-bot -c ./pdf-bot.config.js ping:retry-failed >> /var/log/pdfbot.log 2>&1

Quick example using the CLI

Let us assume I want to generate a PDF for https://esbenp.github.io. I can add the job using the pdf-bot CLI.

$ pdf-bot -c ./pdf-bot.config.js push https://esbenp.github.io --meta '{"id":1}'

Next, if my crontab is not set up to run it automatically, I can run it using the shift:all command

$ pdf-bot -c ./pdf-bot.config.js shift:all

This will run all unfinished jobs in the queue. To run only the next job in the queue, use the shift command instead.

How can I generate PDFs for sites that use JavaScript?

This is a common issue with PDF generation. Luckily, html-pdf-chrome has a really awesome API for dealing with JavaScript. You can specify a timeout in milliseconds, or wait for elements or custom events. To add a wait, simply configure the generator key in your configuration. Below are a few examples.

Wait for 5 seconds

var htmlPdf = require('html-pdf-chrome')

module.exports = {
  api: {
    token: 'api-token'
  },
  // html-pdf-chrome options
  generator: {
    completionTrigger: new htmlPdf.CompletionTrigger.Timer(5000), // waits for 5 sec
  },
  webhook: {
    secret: '1234',
    url: 'http://localhost:3000/webhooks/pdf'
  }
}

Wait for event

var htmlPdf = require('html-pdf-chrome')

module.exports = {
  api: {
    token: 'api-token'
  },
  // html-pdf-chrome options
  generator: {
    completionTrigger: new htmlPdf.CompletionTrigger.Event(
      'myEvent', // name of the event to listen for
      '#myElement', // optional DOM element CSS selector to listen on, defaults to body
      5000 // optional timeout (milliseconds)
    )
  },
  webhook: {
    secret: '1234',
    url: 'http://localhost:3000/webhooks/pdf'
  }
}

In your JavaScript, trigger the event when rendering is complete:

document.getElementById('myElement').dispatchEvent(new CustomEvent('myEvent'));

Wait for variable

var htmlPdf = require('html-pdf-chrome')

module.exports = {
  api: {
    token: 'api-token'
  },
  // html-pdf-chrome options
  generator: {
    completionTrigger: new htmlPdf.CompletionTrigger.Variable(
      'myVarName', // optional, name of the variable to wait for.  Defaults to 'htmlPdfDone'
      5000 // optional, timeout (milliseconds)
    )
  },
  webhook: {
    secret: '1234',
    url: 'http://localhost:3000/webhooks/pdf'
  }
}

In your JavaScript, set the variable when rendering is complete:

window.myVarName = true;

You can find more completion triggers in html-pdf-chrome's documentation

API

Below are the endpoints exposed by pdf-bot's REST API.

Push URL to queue: POST /

key    type     required   description
url    string   yes        The URL to generate a PDF from
meta   object   no         Optional meta data object to send back to the webhook URL

Example

curl -X POST -H 'Authorization: Bearer api-token' -H 'Content-Type: application/json' http://pdf-bot.com/ -d '
  {
    "url":"https://esbenp.github.io",
    "meta":{
      "type":"invoice",
      "id":1
    }
  }'
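The same request can be sent from a Node.js application server. The following is a rough sketch using only Node's built-in http module; buildPushRequest and pushJob are hypothetical helper names, and http://localhost:3000 is an assumed address for the pdf-bot API. It relies on the global URL class available in recent Node.js versions.

```javascript
var http = require('http')

// Hypothetical helper: builds the request options and JSON body for POST /
function buildPushRequest (apiBase, token, pdfUrl, meta) {
  var target = new URL(apiBase)
  // meta is optional; JSON.stringify drops it when undefined
  var body = JSON.stringify({ url: pdfUrl, meta: meta })

  return {
    options: {
      method: 'POST',
      hostname: target.hostname,
      port: target.port || 80,
      path: '/',
      headers: {
        'Authorization': 'Bearer ' + token,
        'Content-Type': 'application/json',
        'Content-Length': Buffer.byteLength(body)
      }
    },
    body: body
  }
}

// Hypothetical helper: sends the job and calls back with the HTTP status code.
// It is defined here but not invoked, so the sketch runs without a server.
function pushJob (apiBase, token, pdfUrl, meta, callback) {
  var request = buildPushRequest(apiBase, token, pdfUrl, meta)
  var req = http.request(request.options, function (res) {
    res.resume() // discard the response body
    callback(null, res.statusCode)
  })
  req.on('error', callback)
  req.end(request.body)
}

var example = buildPushRequest(
  'http://localhost:3000',
  'api-token',
  'https://esbenp.github.io',
  { type: 'invoice', id: 1 }
)

console.log(example.options.headers.Authorization) // Bearer api-token
console.log(example.body)
```

Keeping the request construction separate from the network call makes the payload easy to inspect and test without a running pdf-bot server.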

Database

LowDB (file-database) (default)

If you have low concurrency (you run a job every now and then) you can use the default database driver, which uses LowDB.

var LowDB = require('pdf-bot/src/db/lowdb')

module.exports = {
  api: {
    token: 'api-token'
  },
  db: LowDB({
    lowDbOptions: {},
    path: '' // defaults to $storagePath/db/db.json
  }),
  webhook: {
    secret: '1234',
    url: 'http://localhost:3000/webhooks/pdf'
  }
}

PostgreSQL

var pgsql = require('pdf-bot/src/db/pgsql')

module.exports = {
  api: {
    token: 'api-token'
  },
  db: pgsql({
    database: 'pdfbot',
    username: 'pdfbot',
    password: 'pdfbot',
    port: 5432
  }),
  webhook: {
    secret: '1234',
    url: 'http://localhost:3000/webhooks/pdf'
  }
}

Optionally, you can specify a database url by specifying a connectionString.

To install the necessary database tables, run db:migrate. You can also destroy the database by running db:destroy.
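As an illustration of the connectionString option, the sketch below assembles a database URL from the same parts used in the PostgreSQL example. buildConnectionString is a hypothetical helper, the localhost host is an assumption, and it is assumed that the driver accepts a standard postgres:// URL; check the pgsql driver before relying on it.

```javascript
// Hypothetical helper: assembles a PostgreSQL connection URL from its parts.
// User name, password and database are percent-encoded so that characters
// such as "@" or "/" do not break the URL.
function buildConnectionString (opts) {
  return 'postgres://' +
    encodeURIComponent(opts.username) + ':' +
    encodeURIComponent(opts.password) + '@' +
    (opts.host || 'localhost') + ':' +
    (opts.port || 5432) + '/' +
    encodeURIComponent(opts.database)
}

var connectionString = buildConnectionString({
  database: 'pdfbot',
  username: 'pdfbot',
  password: 'p@ss/word', // made-up password containing reserved characters
  port: 5432
})

console.log(connectionString)
// postgres://pdfbot:p%40ss%2Fword@localhost:5432/pdfbot

// In pdf-bot.config.js this would then be passed as:
//   db: pgsql({ connectionString: connectionString })
```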

Storage

Currently pdf-bot comes bundled with built-in support for storing PDFs on Amazon S3.

Feel free to contribute a PR if you want to see other storage plugins in pdf-bot!

Amazon S3

To install S3 storage, add a key to the storage configuration. Note that you can add as many different locations as you want by giving them different keys.

var createS3Config = require('pdf-bot/src/storage/s3')

module.exports = {
  api: {
    token: 'api-token'
  },
  storage: {
    'my_s3': createS3Config({
      bucket: '[YOUR BUCKET NAME]',
      accessKeyId: '[YOUR ACCESS KEY ID]',
      region: '[YOUR REGION]',
      secretAccessKey: '[YOUR SECRET ACCESS KEY]'
    })
  },
  webhook: {
    secret: '1234',
    url: 'http://localhost:3000/webhooks/pdf'
  }
}

Options

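var htmlPdf = require('html-pdf-chrome')
var LowDB = require('pdf-bot/src/db/lowdb')
var createS3Config = require('pdf-bot/src/storage/s3')

var decaySchedule = [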
  1000 * 60, // 1 minute
  1000 * 60 * 3, // 3 minutes
  1000 * 60 * 10, // 10 minutes
  1000 * 60 * 30, // 30 minutes
  1000 * 60 * 60 // 1 hour
];

module.exports = {
  // The settings of the API
  api: {
    // The port your express.js instance listens to requests from. (default: 3000)
    port: 3000,
    // Spawn command when a job has been pushed to the API
    postPushCommand: ['/home/user/.npm-global/bin/pdf-bot', ['-c', './pdf-bot.config.js', 'shift:all']],
    // The token used to validate requests to your API. Not required, but 100% recommended.
    token: 'api-token'
  },
  db: LowDB(), // see other drivers under Database
  // html-pdf-chrome
  generator: {
    // Triggers that specify when the PDF should be generated
    completionTrigger: new htmlPdf.CompletionTrigger.Timer(1000), // waits for 1 sec
    // The port to listen for Chrome (default: 9222)
    port: 9222
  },
  queue: {
    // How frequent should pdf-bot retry failed generations?
    // (default: 1 min, 3 min, 10 min, 30 min, 60 min)
    generationRetryStrategy: function(job, retries) {
      return decaySchedule[retries - 1] ? decaySchedule[retries - 1] : 0
    },
    // How many times should pdf-bot try to generate a PDF?
    // (default: 5)
    generationMaxTries: 5,
    // How many generations to run at the same time when using shift:all
    parallelism: 4,
    // How frequent should pdf-bot retry failed webhook pings?
    // (default: 1 min, 3 min, 10 min, 30 min, 60 min)
    webhookRetryStrategy: function(job, retries) {
      return decaySchedule[retries - 1] ? decaySchedule[retries - 1] : 0
    },
    // How many times should pdf-bot try to ping a webhook?
    // (default: 5)
    webhookMaxTries: 5
  },
  storage: {
    's3': createS3Config({
      bucket: '',
      accessKeyId: '',
      region: '',
      secretAccessKey: ''
    })
  },
  webhook: {
    // The prefix to add to all pdf-bot headers on the webhook response.
    // I.e. X-PDF-Transaction and X-PDF-Signature. (default: X-PDF-)
    headerNamespace: 'X-PDF-',
    // Extra request options to add to the Webhook ping.
    requestOptions: {

    },
    // The secret used to generate the hmac-sha1 signature hash.
    // !Not required, but should definitely be included!
    secret: '1234',
    // The endpoint to send PDF messages to.
    url: 'http://localhost:3000/webhooks/pdf'
  }
}
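To make the retry strategy functions in the options above more concrete, the following sketch prints the delay they return for each retry count and contrasts it with a custom strategy. exponentialStrategy is a hypothetical alternative, not something pdf-bot ships, and the sketch assumes retries starts at 1, as the retries - 1 index suggests. How pdf-bot treats a returned 0 is not stated here.

```javascript
var decaySchedule = [
  1000 * 60, // 1 minute
  1000 * 60 * 3, // 3 minutes
  1000 * 60 * 10, // 10 minutes
  1000 * 60 * 30, // 30 minutes
  1000 * 60 * 60 // 1 hour
]

// Same shape as generationRetryStrategy and webhookRetryStrategy:
// returns the delay in milliseconds, or 0 once the schedule is exhausted.
function retryStrategy (job, retries) {
  return decaySchedule[retries - 1] ? decaySchedule[retries - 1] : 0
}

// Hypothetical alternative: exponential backoff, capped at one hour
function exponentialStrategy (job, retries) {
  return Math.min(1000 * 60 * Math.pow(2, retries - 1), 1000 * 60 * 60)
}

// Print the delay in minutes for retries 1 through 6
for (var retries = 1; retries <= 6; retries++) {
  console.log(
    retries,
    retryStrategy(null, retries) / 60000,
    exponentialStrategy(null, retries) / 60000
  )
}
// 1 1 1
// 2 3 2
// 3 10 4
// 4 30 8
// 5 60 16
// 6 0 32
```

Either function can be assigned to generationRetryStrategy or webhookRetryStrategy, since both take (job, retries) and return a number of milliseconds.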

CLI

pdf-bot comes with a full CLI included! Use -c to pass a configuration to pdf-bot. You can also use --help to get a list of all commands. An example is given below.

$ pdf-bot.js --config ./examples/pdf-bot.config.js --help


  Usage: pdf-bot [options] [command]


  Options:

    -V, --version        output the version number
    -c, --config <path>  Path to configuration file
    -h, --help           output usage information


  Commands:

    api                   Start the API
    db:migrate
    db:destroy
    install
    generate [jobID]      Generate PDF for job
    jobs [options]        List all completed jobs
    ping [jobID]          Attempt to ping webhook for job
    ping:retry-failed
    pings [jobId]         List pings for a job
    purge [options]       Will remove all completed jobs
    push [options] [url]  Push new job to the queue
    shift                 Run the next job in the queue
    shift:all             Run all unfinished jobs in the queue

Debug mode

pdf-bot uses debug for debug messages. You can turn on debugging by setting the environment variable DEBUG=pdf:* like so

DEBUG=pdf:* pdf-bot jobs

Tests

$ npm run test

Issues

Please report issues to the issue tracker

License

The MIT License (MIT). Please see License File for more information.