shebinleo / pdf2html

License: Apache-2.0
pdf2html is a module that converts PDF files to HTML pages using Apache Tika. It can also generate a thumbnail image of a PDF file using Apache PDFBox.

Programming Languages

javascript

Projects that are alternatives of or similar to pdf2html

pyxpdf
Fast and memory-efficient Python PDF Parser based on xpdf sources
Stars: ✭ 26 (-52.73%)
Mutual labels:  pdf-converter, pdftohtml
pdftron-android-samples
PDFTron Android Samples
Stars: ✭ 30 (-45.45%)
Mutual labels:  pdf-converter, pdftohtml
media-command
Imports files as attachments, regenerates thumbnails, or lists registered image sizes.
Stars: ✭ 40 (-27.27%)
Mutual labels:  thumbnail
chromic pdf
Convenient HTML to PDF/A rendering library for Elixir based on Chrome & Ghostscript
Stars: ✭ 196 (+256.36%)
Mutual labels:  pdf-converter
ph-pdf-layout
Java library for creating fluid page layouts with Apache PDFBox. Supporting multi-page tables, different page layouts etc.
Stars: ✭ 33 (-40%)
Mutual labels:  pdfbox
WeReadScan
A crawler that scans the books you have purchased on WeChat Read (微信读书) and downloads them as local PDFs
Stars: ✭ 273 (+396.36%)
Mutual labels:  pdf-converter
PdfToImage
Convert PDF To jpg in c# (using PdfiumViewer)
Stars: ✭ 23 (-58.18%)
Mutual labels:  pdf-converter
lamba-thumbnailer
AWS S3 Video Thumbnailer with Lambda
Stars: ✭ 21 (-61.82%)
Mutual labels:  thumbnail
tika-similarity
Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on Metadata features.
Stars: ✭ 92 (+67.27%)
Mutual labels:  tika
video-snapshot
Get snapshots from a video file in the browser 🎥 🌅
Stars: ✭ 63 (+14.55%)
Mutual labels:  thumbnail
testarea-pdfbox2
Test area for public PDFBox v2 issues on stackoverflow etc
Stars: ✭ 58 (+5.45%)
Mutual labels:  pdfbox
Office2PDF
Batch-converts Office files (Word, Excel, PPT) to PDF; well documented and works well in the author's own use
Stars: ✭ 114 (+107.27%)
Mutual labels:  pdf-converter
pdf2jpg
Utility to convert PDF into JPG files
Stars: ✭ 39 (-29.09%)
Mutual labels:  pdf-converter
GeoParser
Extract and Visualize location from any file
Stars: ✭ 48 (-12.73%)
Mutual labels:  tika
press-ready
🚀 Make your PDF press-ready PDF/X-1a.
Stars: ✭ 56 (+1.82%)
Mutual labels:  pdf-converter
node-poppler
Asynchronous node.js wrapper for the Poppler PDF rendering library
Stars: ✭ 97 (+76.36%)
Mutual labels:  pdf-converter
YouTube-Thumbnail-Downloader
A youtube videos thumbnail downloader telegram bot.
Stars: ✭ 41 (-25.45%)
Mutual labels:  thumbnail
FroshWebP
WebP Support for Shopware
Stars: ✭ 29 (-47.27%)
Mutual labels:  thumbnail
ThumbnailSharp
A simple but awesome library to create a flexible thumbnail from an image for .NET Framework 4.5+
Stars: ✭ 26 (-52.73%)
Mutual labels:  thumbnail
gotenberg-js-client
A simple JS/TS client for interacting with a Gotenberg API
Stars: ✭ 90 (+63.64%)
Mutual labels:  pdf-converter

pdf2html

pdf2html converts PDF files to HTML or text using Apache Tika. It can also generate a thumbnail image of a PDF file using Apache PDFBox.

Installation

via yarn:

yarn add pdf2html

via npm:

npm install --save pdf2html

A Java Runtime Environment (JRE) is required to run this module.
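
Because a JRE is required, it can help to confirm that a `java` executable is reachable before converting anything. The following is a minimal sketch and not part of pdf2html: `hasJava` is a hypothetical helper name, and it assumes that `java` is expected to be on the `PATH`.

```javascript
const { spawnSync } = require('child_process')

// Hypothetical helper (not part of pdf2html): returns true when a `java`
// executable can be started from the PATH and exits successfully.
function hasJava() {
    const result = spawnSync('java', ['-version'], { stdio: 'ignore' })
    // `result.error` is set when the process could not be spawned (e.g. ENOENT)
    return !result.error && result.status === 0
}

if (!hasJava()) {
    console.error('Java was not found on the PATH; install a JRE before using pdf2html')
}
```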

Usage

const pdf2html = require('pdf2html')

pdf2html.html('sample.pdf', (err, html) => {
    if (err) {
        console.error('Conversion error: ' + err)
    } else {
        console.log(html)
    }
})
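
If you prefer Promises over callbacks, the callback-style call above can be wrapped. This is a sketch under the assumption that the methods follow the `(err, result)` callback signature shown in this README; `promisifyMethod` is a hypothetical helper and not part of pdf2html.

```javascript
// Hypothetical helper (not part of pdf2html): wraps a callback-style method
// such as pdf2html.html or pdf2html.text so that it returns a Promise.
function promisifyMethod(target, methodName) {
    return (...args) =>
        new Promise((resolve, reject) => {
            // Call through `target` so the method keeps its `this` binding
            target[methodName](...args, (err, result) => {
                if (err) {
                    reject(err)
                } else {
                    resolve(result)
                }
            })
        })
}

// Usage sketch (requires pdf2html to be installed):
//
// const pdf2html = require('pdf2html')
// const toHtml = promisifyMethod(pdf2html, 'html')
// toHtml('sample.pdf')
//     .then((html) => console.log(html))
//     .catch((err) => console.error('Conversion error: ' + err))
```

Calling the method through its owning object, rather than passing the bare function around, avoids problems if the method relies on `this` internally.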

Convert to text

pdf2html.text('sample.pdf', (err, text) => {
    if (err) {
        console.error('Conversion error: ' + err)
    } else {
        console.log(text)
    }
})

Convert as pages

pdf2html.pages('sample.pdf', (err, htmlPages) => {
    if (err) {
        console.error('Conversion error: ' + err)
    } else {
        console.log(htmlPages)
    }
})

// Pass { text: true } to get the text of each page instead of HTML
const pageOptions = { text: true }
pdf2html.pages('sample.pdf', pageOptions, (err, textPages) => {
    if (err) {
        console.error('Conversion error: ' + err)
    } else {
        console.log(textPages)
    }
})
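
A common follow-up is to save each page to its own file. The sketch below assumes that the pages result is an array of strings, one entry per page, which this README does not state explicitly; `writePages` and the `page-<n>.html` naming are hypothetical.

```javascript
const fs = require('fs')
const path = require('path')

// Hypothetical helper (not part of pdf2html): writes each page to
// <outDir>/page-<n>.html and returns the list of written file paths.
// Assumes `pages` is an array of strings, one entry per page.
function writePages(pages, outDir) {
    fs.mkdirSync(outDir, { recursive: true })
    return pages.map((page, index) => {
        const filePath = path.join(outDir, `page-${index + 1}.html`)
        fs.writeFileSync(filePath, page, 'utf8')
        return filePath
    })
}

// Usage sketch inside the pages callback:
//
// pdf2html.pages('sample.pdf', (err, htmlPages) => {
//     if (!err) console.log(writePages(htmlPages, 'out'))
// })
```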

Extract metadata

pdf2html.meta('sample.pdf', (err, meta) => {
    if (err) {
        console.error('Conversion error: ' + err)
    } else {
        console.log(meta)
    }
})

Generate thumbnail

pdf2html.thumbnail('sample.pdf', (err, thumbnailPath) => {
    if (err) {
        console.error('Conversion error: ' + err)
    } else {
        console.log(thumbnailPath)
    }
})

// Optional settings: page number, image type, and thumbnail dimensions
const thumbnailOptions = { page: 1, imageType: 'png', width: 160, height: 226 }
pdf2html.thumbnail('sample.pdf', thumbnailOptions, (err, thumbnailPath) => {
    if (err) {
        console.error('Conversion error: ' + err)
    } else {
        console.log(thumbnailPath)
    }
})
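
The 160 × 226 size used in the options example is close to the 1:√2 aspect ratio of A-series paper such as A4. If you want a different width while keeping that ratio, the height can be derived from it. This is only a sketch: `thumbnailOptionsForWidth` is a hypothetical helper, and it assumes portrait pages in an A-series format.

```javascript
// Hypothetical helper (not part of pdf2html): builds a thumbnail options
// object for a given width, deriving the height from the 1:√2 ratio of
// A-series paper (portrait orientation is assumed).
function thumbnailOptionsForWidth(width, page = 1) {
    return {
        page,
        imageType: 'png',
        width,
        height: Math.round(width * Math.SQRT2)
    }
}

console.log(thumbnailOptionsForWidth(160))
// → { page: 1, imageType: 'png', width: 160, height: 226 }
```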

Manually download dependency files

Downloading the dependencies can be very slow, or can fail entirely behind an HTTP proxy. In that case, follow the steps below to download the files yourself and skip the dependency downloads.

cd node_modules/pdf2html/vendor
# These URLs come from https://github.com/shebinleo/pdf2html/blob/master/postinstall.js#L6-L7
wget https://archive.apache.org/dist/pdfbox/2.0.26/pdfbox-app-2.0.26.jar
wget https://archive.apache.org/dist/tika/2.4.0/tika-app-2.4.0.jar
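
After a manual download, a quick check that both files are in place can save some debugging. The sketch below uses the JAR file names from the `wget` commands above, which may differ for other pdf2html versions; `missingJars` is a hypothetical helper and not part of pdf2html.

```javascript
const fs = require('fs')
const path = require('path')

// JAR file names taken from the wget commands above
const REQUIRED_JARS = ['pdfbox-app-2.0.26.jar', 'tika-app-2.4.0.jar']

// Hypothetical helper (not part of pdf2html): returns the names of the
// required JAR files that are not present in `vendorDir`.
function missingJars(vendorDir) {
    return REQUIRED_JARS.filter((name) => !fs.existsSync(path.join(vendorDir, name)))
}

const missing = missingJars(path.join('node_modules', 'pdf2html', 'vendor'))
if (missing.length > 0) {
    console.error('Missing dependency files: ' + missing.join(', '))
}
```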