
pdfminer / Pdfminer.six

License: MIT
Community maintained fork of pdfminer - we fathom PDF

Programming Languages

python
139335 projects - #7 most used programming language
Makefile
30231 projects

Projects that are alternatives of or similar to Pdfminer.six

Traprange
(Java)A Method to Extract Tabular Content from PDF Files
Stars: ✭ 236 (-92.74%)
Mutual labels:  parser, pdf
Origami
Origami is a pure Ruby library to parse, modify and generate PDF documents.
Stars: ✭ 234 (-92.8%)
Mutual labels:  parser, pdf
Php Svg Lib
SVG file parsing / rendering library
Stars: ✭ 1,146 (-64.74%)
Mutual labels:  parser, pdf
Posthtml
PostHTML is a tool to transform HTML/XML with JS plugins
Stars: ✭ 2,737 (-15.78%)
Mutual labels:  parser
Subtitle.js
Stream-based library for parsing and manipulating subtitle files
Stars: ✭ 234 (-92.8%)
Mutual labels:  parser
Parse5
HTML parsing/serialization toolset for Node.js. WHATWG HTML Living Standard (aka HTML5)-compliant.
Stars: ✭ 2,778 (-14.52%)
Mutual labels:  parser
Pxi
🧚 pxi (pixie) is a small, fast, and magical command-line data processor similar to jq, mlr, and awk.
Stars: ✭ 248 (-92.37%)
Mutual labels:  parser
Leerraum.js
A PDF typesetting library with exact positioning and hyphenated line breaking
Stars: ✭ 233 (-92.83%)
Mutual labels:  pdf
Pdf Unstamper
Remove textual watermark of any font, any encoding and any language with pdf-unstamper now!
Stars: ✭ 245 (-92.46%)
Mutual labels:  pdf
Better Parse
A nice parser combinator library for Kotlin
Stars: ✭ 238 (-92.68%)
Mutual labels:  parser
Mercury Parser Api
🚀 A drop-in replacement for the Mercury Parser API.
Stars: ✭ 239 (-92.65%)
Mutual labels:  parser
Geektime2pdf
Converts Geektime (极客时间) column articles to PDF, including comments and audio
Stars: ✭ 245 (-92.46%)
Mutual labels:  pdf
Stapler
A small utility making use of the pypdf library to provide a (somewhat) lighter alternative to pdftk
Stars: ✭ 238 (-92.68%)
Mutual labels:  pdf
Open Publisher
Using Jekyll to create outputs that can be used as Pandoc inputs. In short - input markdown, output mobi, epub, pdf, and print-ready pdf. With a focus on fiction.
Stars: ✭ 242 (-92.55%)
Mutual labels:  pdf
Enmime
MIME mail encoding and decoding package for Go
Stars: ✭ 246 (-92.43%)
Mutual labels:  parser
Svg2pdf.js
A javascript-only SVG to PDF conversion utility that runs in the browser. Brought to you by yWorks - the diagramming experts
Stars: ✭ 231 (-92.89%)
Mutual labels:  pdf
Androiddocumentviewer
Android document viewer for Word, Excel, PPT, and PDF, built on mupdf and tbs
Stars: ✭ 235 (-92.77%)
Mutual labels:  pdf
Parsr
Transforms PDF, Documents and Images into Enriched Structured Data
Stars: ✭ 2,736 (-15.82%)
Mutual labels:  pdf
Droid Application Fuzz Framework
Android application fuzzing framework with fuzzers and crash monitor.
Stars: ✭ 248 (-92.37%)
Mutual labels:  pdf
Zipcelx
Turns JSON data into `.xlsx` files in the browser
Stars: ✭ 246 (-92.43%)
Mutual labels:  parser

pdfminer.six


We fathom PDF

Pdfminer.six is a community-maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents, with a focus on getting and analyzing text data. Pdfminer.six extracts the text of a page directly from the source code of the PDF. It can also be used to get the exact location, font, or color of the text.
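As a rough illustration of reading font and position data, the sketch below walks the layout objects returned by extract_pages and tallies characters per font. The helper names iter_chars and summarize_fonts are made up for this example; the pdfminer import is done inside iter_chars only so that the counting helper can be used on its own.

```python
from collections import Counter
from typing import Iterable, Iterator


def iter_chars(pdf_path: str) -> Iterator[object]:
    """Yield every LTChar found in the text lines of a PDF (hypothetical helper)."""
    # Imported here so that summarize_fonts() stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue  # skip figures, images, curves, ...
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        yield obj


def summarize_fonts(chars: Iterable[object]) -> Counter:
    """Count characters per (font name, size rounded to 0.1) pair.

    Works on any objects that expose `fontname` and `size` attributes.
    """
    return Counter((c.fontname, round(c.size, 1)) for c in chars)
```

With pdfminer.six installed, `summarize_fonts(iter_chars("samples/simple1.pdf"))` would give a per-font character count, and each yielded character also carries a `bbox` tuple (x0, y0, x1, y1) with its position on the page.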

It is built in a modular way so that each component of pdfminer.six can be replaced easily. You can implement your own interpreter or rendering device that uses the power of pdfminer.six for purposes other than text analysis.
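To make the modular structure more concrete, here is a minimal sketch of wiring the components together by hand: a resource manager, a rendering device, and a page interpreter. It uses the stock PDFPageAggregator device rather than a custom one; a device of your own would take its place in the same slot. The function names count_layout_objects and analyze are illustrative only.

```python
from collections import Counter
from collections.abc import Iterable


def count_layout_objects(node, counts=None):
    """Recursively count objects in a layout tree by class name."""
    counts = Counter() if counts is None else counts
    counts[type(node).__name__] += 1
    # Containers (pages, text boxes, text lines, figures) are iterable.
    if isinstance(node, Iterable) and not isinstance(node, (str, bytes)):
        for child in node:
            count_layout_objects(child, counts)
    return counts


def analyze(pdf_path):
    """Run every page through interpreter -> device and tally layout objects."""
    # Imported here so count_layout_objects() works without pdfminer.six.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()                             # shared fonts etc.
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())   # replaceable part
    interpreter = PDFPageInterpreter(rsrcmgr, device)          # drives the device

    totals = Counter()
    with open(pdf_path, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
            count_layout_objects(device.get_result(), totals)
    return totals
```

Calling `analyze("samples/simple1.pdf")` with pdfminer.six installed would return counts keyed by layout class name, such as LTPage, LTTextBoxHorizontal, or LTChar.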

Check out the full documentation on Read the Docs.

Features

  • Written entirely in Python.
  • Parse, analyze, and convert PDF documents.
  • PDF-1.7 specification support (well, almost).
  • CJK languages and vertical writing scripts support.
  • Various font types (Type1, TrueType, Type3, and CID) support.
  • Support for extracting images (JPG, JBIG2, Bitmaps).
  • Support for various compressions (ASCIIHexDecode, ASCII85Decode, LZWDecode, FlateDecode, RunLengthDecode, CCITTFaxDecode).
  • Support for RC4 and AES encryption.
  • Support for AcroForm interactive form extraction.
  • Table of contents extraction.
  • Tagged contents extraction.
  • Automatic layout analysis.
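
For readers unfamiliar with the stream filters named in the list above, the following sketch shows what three of them do, using only the Python standard library. This is not pdfminer.six's own implementation; it is an independent illustration based on the filter definitions in the PDF specification, and the function names are made up for this example.

```python
import binascii
import zlib


def ascii_hex_decode(data: bytes) -> bytes:
    """ASCIIHexDecode: hex digits, whitespace ignored, '>' marks end of data."""
    digits = bytearray()
    for byte in data:
        ch = bytes([byte])
        if ch == b">":
            break
        if ch.isspace():
            continue
        digits += ch
    if len(digits) % 2:
        digits += b"0"  # an odd final digit is treated as if followed by 0
    return binascii.unhexlify(bytes(digits))


def flate_decode(data: bytes) -> bytes:
    """FlateDecode: zlib/deflate-compressed data."""
    return zlib.decompress(data)


def run_length_decode(data: bytes) -> bytes:
    """RunLengthDecode: byte-oriented run-length encoding."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        if length == 128:  # end-of-data marker
            break
        if length < 128:   # copy the next length + 1 bytes literally
            out += data[i + 1 : i + 2 + length]
            i += 2 + length
        else:              # repeat the next byte 257 - length times
            out += data[i + 1 : i + 2] * (257 - length)
            i += 2
    return bytes(out)
```

For example, `ascii_hex_decode(b"48 65 6C6C 6F>")` yields `b"Hello"`.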

How to use

  • Install Python 3.6 or newer.

  • Install pdfminer.six:

    pip install pdfminer.six

  • Use the command-line interface to extract text from a PDF:

    python pdf2txt.py samples/simple1.pdf
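
The same extraction can also be done from Python code. The sketch below wraps the extract_text function from pdfminer.high_level; the wrapper name pdf_to_text and the up-front file check are additions for this example, not part of the library.

```python
from pathlib import Path


def pdf_to_text(pdf_path: str) -> str:
    """Return the text of a PDF as one string (hypothetical wrapper)."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF file: {path}")

    # Imported lazily so that importing this module does not require
    # pdfminer.six; the call itself does.
    from pdfminer.high_level import extract_text

    return extract_text(str(path))
```

Assuming the sample file from the command above is present, `print(pdf_to_text("samples/simple1.pdf"))` would print the extracted text.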

Contributing

Be sure to read the contribution guidelines.

Acknowledgement

This repository includes code from pyHanko; the original license has been included here.
