Belval / Pdf2image

Licence: MIT
A Python module that wraps the pdftoppm utility to convert PDFs to PIL Image objects

pdf2image

A Python (3.5+) module that wraps pdftoppm and pdftocairo to convert PDFs to PIL Image objects

How to install

pip install pdf2image

Windows

Windows users will have to build or download poppler for Windows. I recommend @oschwartz10612's version, which is the most up to date. You will then have to add the bin/ folder to PATH or pass poppler_path=r"C:\path\to\poppler-xx\bin" as an argument to convert_from_path.

Mac

Mac users will have to install poppler for Mac.

Linux

Most distros ship with pdftoppm and pdftocairo. If they are not installed, use your package manager to install poppler-utils.

Platform-independent (using conda)

  1. Install poppler: conda install -c conda-forge poppler
  2. Install pdf2image: pip install pdf2image
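Whatever the platform, it can help to confirm that the poppler command-line tools are reachable before calling pdf2image. The following is a minimal sketch using only the standard library. `find_poppler_tools` is a hypothetical helper, not part of pdf2image, and the three tool names it checks are the ones this README mentions (pdftoppm, pdftocairo, pdfinfo).

```python
import os
import shutil


def find_poppler_tools(poppler_path=None):
    """Return {tool name: resolved path or None} for the poppler CLIs.

    Hypothetical helper, not part of pdf2image. If poppler_path is given,
    only that folder is searched; otherwise the current PATH is used.
    """
    tools = ("pdftoppm", "pdftocairo", "pdfinfo")
    search_path = os.fspath(poppler_path) if poppler_path is not None else None
    return {tool: shutil.which(tool, path=search_path) for tool in tools}


if __name__ == "__main__":
    # Print where each tool was found, or flag it as missing.
    for tool, location in find_poppler_tools().items():
        print(f"{tool}: {location or 'NOT FOUND'}")
```

If a tool shows up as missing, either fix PATH or pass the folder that contains the binaries as poppler_path to convert_from_path.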

How does it work?

from pdf2image import convert_from_path, convert_from_bytes

from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)

Then simply do:

images = convert_from_path('/home/belval/example.pdf')

OR

with open('/home/belval/example.pdf', 'rb') as pdf_file:
    images = convert_from_bytes(pdf_file.read())

OR better yet

import tempfile

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/home/belval/example.pdf', output_folder=path)
    # Do something here

images will be a list of PIL Image objects, one for each page of the PDF document.
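As a rough sketch of a complete round trip, the snippet below converts a PDF, handles the three exceptions imported earlier, and writes each page to disk. `save_pages` and `convert_and_save` are hypothetical helper names chosen for illustration, and the file naming scheme is an assumption, not something pdf2image prescribes.

```python
from pathlib import Path


def save_pages(images, out_dir, stem="page"):
    """Save each image as <stem>-0001.png, <stem>-0002.png, ... in out_dir.

    Works with any objects exposing PIL's Image.save(path, format) method.
    Returns the list of written paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, image in enumerate(images, start=1):
        target = out_dir / f"{stem}-{number:04d}.png"
        image.save(target, "PNG")
        written.append(target)
    return written


def convert_and_save(pdf_path, out_dir):
    """Convert a PDF and save its pages, reporting the documented errors."""
    # Imported lazily so this module can be loaded without pdf2image installed.
    from pdf2image import convert_from_path
    from pdf2image.exceptions import (
        PDFInfoNotInstalledError,
        PDFPageCountError,
        PDFSyntaxError,
    )

    try:
        images = convert_from_path(pdf_path)
    except PDFInfoNotInstalledError:
        print("poppler does not seem to be installed or is not on PATH")
        return []
    except (PDFPageCountError, PDFSyntaxError) as error:
        print(f"could not read {pdf_path}: {error}")
        return []
    return save_pages(images, out_dir, stem=Path(pdf_path).stem)
```

Calling convert_and_save('/home/belval/example.pdf', 'pages') would then leave one PNG per page in the pages/ folder, assuming poppler is installed and the PDF is readable.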

Here are the definitions:

convert_from_path(pdf_path, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600)

convert_from_bytes(pdf_file, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600)

Need help?

Use the mattermost chat to ask questions on the helpdesk and get direct support.

What's new?

  • Add timeout parameter which raises PDFPopplerTimeoutError after the given number of seconds.
  • Add use_pdftocairo parameter which forces pdf2image to use pdftocairo. Should improve performance.
  • Fixed a bug where using pdf2image with multiple threads (but not multiple processes) would cause an exception
  • jpegopt parameter allows for tuning of the output JPEG when using fmt="jpeg" (-jpegopt in pdftoppm CLI) (Thank you @abieler)
  • Added pdfinfo_from_path and pdfinfo_from_bytes, which expose the output of the pdfinfo CLI
  • paths_only parameter will return image paths instead of Image objects, to prevent OOM when converting a big PDF
  • size parameter allows you to define the shape of the resulting images (-scale-to in pdftoppm CLI)
    • size=400 will fit the image to a 400x400 box, preserving aspect ratio
    • size=(400, None) will make the image 400 pixels wide, preserving aspect ratio
    • size=(500, 500) will resize the image to 500x500 pixels, not preserving aspect ratio
  • grayscale parameter allows you to convert images to grayscale (-gray in pdftoppm CLI)
  • single_file parameter allows you to convert the first PDF page only, without adding digits at the end of the output_file
  • Allow the user to specify poppler's installation path with poppler_path
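To make the three size cases listed above concrete, here is a small sketch that estimates the output dimensions for a page of a given rendered size. `expected_size` is a hypothetical function written for illustration; it is not pdf2image's own code, the (None, height) case is an assumed mirror of the (width, None) case, and poppler's actual rounding may differ by a pixel.

```python
def expected_size(page_width, page_height, size):
    """Estimate output (width, height) in pixels for a given `size` value.

    page_width and page_height are the page's pixel dimensions before
    scaling. Illustration only; this is NOT pdf2image's implementation.
    """
    if isinstance(size, int):
        # Fit inside a size x size box, preserving aspect ratio.
        scale = size / max(page_width, page_height)
        return round(page_width * scale), round(page_height * scale)

    width, height = size
    if width is not None and height is not None:
        # Both given: forced to exactly that shape, aspect ratio ignored.
        return width, height
    if width is not None:
        # Width given: height follows from the aspect ratio.
        return width, round(page_height * width / page_width)
    if height is not None:
        # Assumed symmetric case (not listed above): height given.
        return round(page_width * height / page_height), height
    raise ValueError("size must set at least one dimension")
```

For example, a 1000x2000 page gives (200, 400) with size=400, (400, 800) with size=(400, None), and (500, 500) with size=(500, 500).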

Performance tips

  • Using an output folder is significantly faster if you are using an SSD. Otherwise I/O usually becomes the bottleneck.
  • Using multiple threads can give you some gains, but avoid more than 4 as this will cause an I/O bottleneck (even on my NVMe SSD!).
  • If I/O is your bottleneck, using the JPEG format can lead to significant gains.
  • The PNG format is fairly slow because of its compression.
  • If you want to know the best settings (most settings will be fine anyway) you can clone the project and run python tests.py to get timings.
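If you would rather time a few settings on your own documents, a small harness like the one below may be enough. It is a sketch under assumptions: `time_call` and `compare_settings` are hypothetical helpers (not part of pdf2image or its tests.py), and the numbers will depend heavily on your disk, CPU and PDF.

```python
import time


def time_call(func, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def compare_settings(pdf_path, settings, repeats=3):
    """Time convert_from_path for each keyword-argument dict in `settings`."""
    import tempfile
    # Imported lazily so time_call stays usable without pdf2image installed.
    from pdf2image import convert_from_path

    results = {}
    for label, kwargs in settings.items():
        def run(kwargs=kwargs):
            # A fresh output folder per run keeps the timings comparable.
            with tempfile.TemporaryDirectory() as folder:
                convert_from_path(pdf_path, output_folder=folder, **kwargs)
        results[label] = time_call(run, repeats)
    return results


# Example usage (requires pdf2image, poppler and a real PDF):
# timings = compare_settings("example.pdf", {
#     "ppm, 1 thread": {"fmt": "ppm", "thread_count": 1},
#     "jpeg, 4 threads": {"fmt": "jpeg", "thread_count": 4},
#     "png, 4 threads": {"fmt": "png", "thread_count": 4},
# })
# for label, seconds in sorted(timings.items(), key=lambda item: item[1]):
#     print(f"{label}: {seconds:.2f}s")
```

Taking the best of several runs, rather than the average, reduces the influence of disk caches warming up and other background noise.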

Limitations / known issues

  • Converting a relatively big PDF will use up all your memory and cause the process to be killed (unless you use an output folder).
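One way to keep memory bounded on large documents is to combine output_folder, paths_only, first_page and last_page, converting a few pages at a time. The sketch below assumes that pdfinfo_from_path returns a dictionary with a "Pages" entry; `page_ranges` and `convert_in_chunks` are hypothetical helper names, and the default chunk size of 10 is arbitrary.

```python
def page_ranges(page_count, chunk_size):
    """Yield 1-based, inclusive (first_page, last_page) pairs covering a PDF."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for first in range(1, page_count + 1, chunk_size):
        yield first, min(first + chunk_size - 1, page_count)


def convert_in_chunks(pdf_path, output_folder, chunk_size=10, **kwargs):
    """Convert a large PDF a few pages at a time and return the image paths."""
    # Imported lazily so page_ranges can be used without pdf2image installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    # Assumption: the pdfinfo output exposes the page count under "Pages".
    page_count = int(pdfinfo_from_path(pdf_path)["Pages"])
    paths = []
    for first, last in page_ranges(page_count, chunk_size):
        paths.extend(
            convert_from_path(
                pdf_path,
                output_folder=output_folder,
                first_page=first,
                last_page=last,
                paths_only=True,  # return file paths instead of Image objects
                **kwargs,
            )
        )
    return paths
```

Because only paths are returned, each page can later be opened, processed and closed individually instead of holding every Image in memory at once.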