JonathanLink / Pdflayouttextstripper

Licence: apache-2.0
Converts a PDF file into a text file while keeping the layout of the original PDF. Useful, for instance, to extract the content of a table in a PDF file. This is a subclass of the PDFTextStripper class (from the Apache PDFBox library).

Programming Languages

java

PDFLayoutTextStripper

Converts a PDF file into a text file while keeping the layout of the original PDF. Useful to extract the content from a table or a form in a PDF file. PDFLayoutTextStripper is a subclass of the PDFTextStripper class (from the Apache PDFBox library).

Use cases

Example: data extraction from a table in a PDF file

Example: data extraction from a form in a PDF file
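
Because the extracted text keeps the original layout, a table row usually comes out as cells separated by runs of spaces. The sketch below shows one way to split such a line into cells. It is not part of PDFLayoutTextStripper: the class name LayoutRows, the method splitCells, the two-space threshold and the sample line are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFLayoutTextStripper.
final class LayoutRows {

    // Splits one layout-preserved line into cells wherever two or more
    // consecutive spaces occur; single spaces stay inside a cell.
    static List<String> splitCells(String line) {
        List<String> cells = new ArrayList<>();
        for (String cell : line.trim().split(" {2,}")) {
            if (!cell.isEmpty()) {
                cells.add(cell);
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // Made-up sample line standing in for one row of extracted text.
        String row = "  08:15     Central Station      Platform 3  ";
        System.out.println(splitCells(row)); // [08:15, Central Station, Platform 3]
    }
}
```

This approach ignores the horizontal position of each cell, so an empty cell in the middle of a row is lost. For tables with empty cells, slicing each line at fixed character offsets is likely a better fit.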

How to install

Maven

<dependency>
  <groupId>io.github.jonathanlink</groupId>
  <artifactId>PDFLayoutTextStripper</artifactId>
  <version>2.2.3</version>
</dependency>

Manual

  1. Install Apache PDFBox manually (v2.0.6 is the version used in the commands below), together with its two dependencies, commons-logging and fontbox.

Warning: only PDFBox versions 2.0.0 and later are compatible with this version of PDFLayoutTextStripper.java.

How to use on Linux/Mac

cd PDFLayoutTextStripper
javac -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar *.java
java -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar test

How to use on Windows

Same as for Linux/Mac (see above), but replace the classpath separator : with ;
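
The only difference between the two platforms is the classpath separator, which Java exposes as File.pathSeparator. As a small illustration, the sketch below builds the classpath string from the commands above for whichever platform it runs on. The class name ClasspathBuilder and the method join are hypothetical, and /pathto remains a placeholder.

```java
import java.io.File;

// Hypothetical helper for illustration only.
final class ClasspathBuilder {

    // Joins classpath entries with the separator of the current platform
    // (":" on Linux/Mac, ";" on Windows).
    static String join(String... entries) {
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) {
        // Jar paths copied from the commands above; /pathto is a placeholder.
        System.out.println(join(".",
                "/pathto/pdfbox-2.0.6.jar",
                "/pathto/commons-logging-1.2.jar",
                "/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar"));
    }
}
```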

Sample code

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
// Assumption: PDFLayoutTextStripper is compiled alongside this class (manual setup).
// If it lives in a package in your setup, add the matching import.

public class test {
    public static void main(String[] args) {
        String text = null;
        try {
            // Parse the sample PDF (PDFBox 2.x API).
            PDFParser pdfParser = new PDFParser(new RandomAccessFile(new File("./samples/bus.pdf"), "r"));
            pdfParser.parse();
            // try-with-resources closes the document once the text has been extracted.
            try (PDDocument pdDocument = new PDDocument(pdfParser.getDocument())) {
                PDFTextStripper pdfTextStripper = new PDFLayoutTextStripper();
                text = pdfTextStripper.getText(pdDocument);
            }
        } catch (IOException e) {
            // Also covers FileNotFoundException, which is a subclass of IOException.
            e.printStackTrace();
        }
        System.out.println(text);
    }
}
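
The sample prints the extracted text to standard output. To end up with an actual text file, the string can be written to disk with the standard library. The following is a minimal sketch: the class name TextFileWriter, the method write, the output name bus.txt and the placeholder text are assumptions, not part of PDFLayoutTextStripper.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFLayoutTextStripper.
final class TextFileWriter {

    // Writes the extracted text to the given file as UTF-8,
    // creating the file or overwriting an existing one.
    static void write(String text, Path target) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder for the string returned by getText(...) in the sample code.
        String text = "Line 1      Column B\nLine 2      Column B\n";
        write(text, Paths.get("bus.txt"));
    }
}
```

In the sample code the text variable stays null when extraction fails, so check it for null before passing it to a helper like this one.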

Contributors

Thanks to

  • Dmytro Zelinskyy for reporting an issue and providing its fix (v2.2.3)
  • Ho Ting Cheng for reporting an issue (v2.1)
  • James Sullivan for updating the code to make it work with the latest version of PDFBox (v2.0)