
rostrovsky / Pdf Table

Licence: mit
Java utility for parsing PDF tabular data using Apache PDFBox and OpenCV


= PDF-table
:toc:

== What is PDF-table?

PDF-table is a Java utility library for parsing tabular data in PDF documents. +
Core processing of PDF documents is performed with Apache PDFBox and OpenCV.

== Prerequisites

=== JDK

Java 8 is required.

=== External dependencies

pdf-table requires a compiled OpenCV 3.4.2 build to work properly:

. Download OpenCV v3.4.2 from https://github.com/opencv/opencv/releases/tag/3.4.2
. Unpack it and add it to your system PATH:
* Windows: <opencv dir>\build\java\x64
* Linux: TODO
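If the native library cannot be loaded at runtime, it can help to check whether the JVM is able to see it at all. The sketch below is not part of pdf-table; it assumes that the OpenCV 3.4.2 Java binding is named `opencv_java342` (for example `opencv_java342.dll` on Windows) and that the JVM search path is exposed through the `java.library.path` system property, which on Windows normally includes the PATH entries. The class and method names are hypothetical.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

class OpenCvLibraryCheck {

    // Assumed base name of the OpenCV 3.4.2 Java binding.
    static final String LIB_NAME = "opencv_java342";

    /** Returns the first file in searchPath that matches the platform-specific library name. */
    static Optional<Path> findLibrary(String libName, String searchPath) {
        // e.g. "opencv_java342.dll" on Windows, "libopencv_java342.so" on Linux
        String fileName = System.mapLibraryName(libName);
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            try {
                Path candidate = Paths.get(dir, fileName);
                if (Files.isRegularFile(candidate)) {
                    return Optional.of(candidate);
                }
            } catch (InvalidPathException e) {
                // skip malformed path entries
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String searchPath = System.getProperty("java.library.path", "");
        Optional<Path> found = findLibrary(LIB_NAME, searchPath);
        System.out.println(found
                .map(p -> "Found OpenCV native library: " + p)
                .orElse("OpenCV native library not found on java.library.path"));
    }
}
```

When the library is reported as missing, re-check the directory added to PATH in the step above.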

== Installation

[source, xml]
----
<dependency>
    <groupId>com.github.rostrovsky</groupId>
    <artifactId>pdf-table</artifactId>
    <version>1.0.0</version>
</dependency>
----

== Usage

=== Parsing PDFs

When a PDF document page is parsed, the following operations are performed:

. Page is converted to a grayscale image [OpenCV].
. Binary Inverted Threshold (BIT) is applied to the grayscale image [OpenCV].
. Contours are detected on the BIT image and a contour mask is created (additional Canny filtering can be turned on in this step) [OpenCV].
. Contour mask is XORed with the BIT image [OpenCV].
. Contours are detected once again on the XORed image (additional Canny filtering can be turned on in this step) [OpenCV].
. Final contours are drawn [OpenCV].
. Bounding rectangles are detected from the final contours [OpenCV].
. PDF is parsed region-by-region using the bounding rectangle coordinates [Apache PDFBox].

The above algorithm is mostly derived from http://stackoverflow.com/a/23106594.

For more information about parsed output, refer to <<Output format>>.
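To make the thresholding and XOR steps more concrete, here is a minimal plain-Java sketch on a tiny grayscale array. It is an illustration only and not the library's implementation: pdf-table uses OpenCV for these steps, whereas this sketch swaps in a simple 4-connected flood fill in place of contour detection and fills bounding boxes in place of a filled contour mask. All class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class TableCellSketch {

    /** Binary inverted threshold: dark pixels (<= thresh) become 255, light pixels become 0. */
    static int[][] invertedThreshold(int[][] gray, int thresh) {
        int[][] out = new int[gray.length][gray[0].length];
        for (int y = 0; y < gray.length; y++)
            for (int x = 0; x < gray[0].length; x++)
                out[y][x] = gray[y][x] > thresh ? 0 : 255;
        return out;
    }

    /** Bounding boxes {x, y, width, height} of 4-connected groups of non-zero pixels. */
    static List<int[]> boundingBoxes(int[][] img) {
        int h = img.length, w = img[0].length;
        boolean[][] seen = new boolean[h][w];
        List<int[]> boxes = new ArrayList<>();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (img[y][x] == 0 || seen[y][x]) continue;
                int minX = x, maxX = x, minY = y, maxY = y;
                Deque<int[]> stack = new ArrayDeque<>();
                stack.push(new int[] {x, y});
                seen[y][x] = true;
                while (!stack.isEmpty()) {
                    int[] p = stack.pop();
                    minX = Math.min(minX, p[0]); maxX = Math.max(maxX, p[0]);
                    minY = Math.min(minY, p[1]); maxY = Math.max(maxY, p[1]);
                    int[][] neighbours = {
                        {p[0] + 1, p[1]}, {p[0] - 1, p[1]}, {p[0], p[1] + 1}, {p[0], p[1] - 1}};
                    for (int[] n : neighbours) {
                        if (n[0] < 0 || n[1] < 0 || n[0] >= w || n[1] >= h) continue;
                        if (img[n[1]][n[0]] == 0 || seen[n[1]][n[0]]) continue;
                        seen[n[1]][n[0]] = true;
                        stack.push(n);
                    }
                }
                boxes.add(new int[] {minX, minY, maxX - minX + 1, maxY - minY + 1});
            }
        }
        return boxes;
    }

    /** Mask with every bounding box filled with 255 (stand-in for a filled contour mask). */
    static int[][] filledMask(int[][] img, List<int[]> boxes) {
        int[][] mask = new int[img.length][img[0].length];
        for (int[] b : boxes)
            for (int y = b[1]; y < b[1] + b[3]; y++)
                for (int x = b[0]; x < b[0] + b[2]; x++)
                    mask[y][x] = 255;
        return mask;
    }

    /** Pixel-wise XOR of two images of equal size. */
    static int[][] xor(int[][] a, int[][] b) {
        int[][] out = new int[a.length][a[0].length];
        for (int y = 0; y < a.length; y++)
            for (int x = 0; x < a[0].length; x++)
                out[y][x] = a[y][x] ^ b[y][x];
        return out;
    }

    /** Threshold, build mask, XOR, then return one bounding box per remaining region. */
    static List<int[]> detectCells(int[][] gray, int thresh) {
        int[][] bit = invertedThreshold(gray, thresh);
        int[][] mask = filledMask(bit, boundingBoxes(bit));
        return boundingBoxes(xor(mask, bit));
    }

    public static void main(String[] args) {
        // 7x5 grayscale page: 0 = black table line, 255 = white paper; one row, two cells.
        int[][] page = {
            {0,   0,   0,   0,   0,   0,   0},
            {0, 255, 255,   0, 255, 255,   0},
            {0, 255, 255,   0, 255, 255,   0},
            {0, 255, 255,   0, 255, 255,   0},
            {0,   0,   0,   0,   0,   0,   0},
        };
        for (int[] cell : detectCells(page, 127)) {
            System.out.printf("cell x=%d y=%d w=%d h=%d%n", cell[0], cell[1], cell[2], cell[3]);
        }
    }
}
```

After thresholding, the table lines are white and the cell interiors are black. XORing with the filled mask flips this inside the table area, so each cell interior becomes a separate white region whose bounding rectangle can then be used for region-by-region text extraction.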

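==== single-threaded example

[source, java]
----
import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader and ParsedTablePage are provided by pdf-table

class SingleThreadParser {
    public static void main(String[] args) throws IOException {
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();
        List<ParsedTablePage> parsed = reader.parsePdfTablePages(pdfDoc, 1, pdfDoc.getNumberOfPages());
    }
}
----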

==== multi-threaded example

[source, java]
----
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader and ParsedTablePage are provided by pdf-table

class MultiThreadParser {
    public static void main(String[] args) throws IOException {
        final int THREAD_COUNT = 8;
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();

        // parse pages simultaneously
        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        List<Future<ParsedTablePage>> futures = new ArrayList<>();
        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
            Callable<ParsedTablePage> callable = () -> reader.parsePdfTablePage(pdfDoc, pageNum);
            futures.add(executor.submit(callable));
        }

        // collect parsed pages (f.get() blocks until the page is ready)
        List<ParsedTablePage> unsortedParsedPages = new ArrayList<>(pdfDoc.getNumberOfPages());
        try {
            for (Future<ParsedTablePage> f : futures) {
                unsortedParsedPages.add(f.get());
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // release the worker threads so the JVM can exit
            executor.shutdown();
        }

        // sort pages by pageNum
        List<ParsedTablePage> sortedParsedPages = unsortedParsedPages.stream()
                .sorted((p1, p2) -> Integer.compare(p1.getPageNum(), p2.getPageNum()))
                .collect(Collectors.toList());
    }
}
----
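The multi-threaded examples in this document all follow the same submit-then-collect pattern. As a sketch of how that pattern could be factored out, the helper below runs one task per page number on a fixed thread pool and returns the results in page order. It does not depend on pdf-table; `PerPageExecutor` and `runPerPage` are hypothetical names, and the task passed in stands in for a call such as `reader.parsePdfTablePage(pdfDoc, pageNum)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

class PerPageExecutor {

    /** Runs task for pages 1..pageCount on a fixed pool and returns the results in page order. */
    static <T> List<T> runPerPage(int pageCount, int threadCount, IntFunction<T> task)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newFixedThreadPool(threadCount);
        try {
            // submit one job per page; page numbers are 1-based
            List<Future<T>> futures = new ArrayList<>(pageCount);
            for (int pageNum = 1; pageNum <= pageCount; pageNum++) {
                final int currentPage = pageNum;
                Callable<T> job = () -> task.apply(currentPage);
                futures.add(executor.submit(job));
            }
            // futures are read in submission order, so results stay sorted by page
            List<T> results = new ArrayList<>(pageCount);
            for (Future<T> f : futures) {
                results.add(f.get());
            }
            return results;
        } finally {
            // always release the worker threads, even if a task failed
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> labels = runPerPage(5, 3, pageNum -> "page " + pageNum);
        System.out.println(labels);
    }
}
```

Because the futures are read in the order they were submitted, the result list is already ordered by page number and no separate sorting step is needed.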

=== Saving PDF pages as PNG images

PDF-table provides methods for saving PDF pages as PNG images. +
Rendering DPI can be modified in PdfTableSettings (see: <<Parsing settings>>).

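==== single-threaded example

[source, java]
----
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader is provided by pdf-table

class SingleThreadPNGDump {
    public static void main(String[] args) throws IOException {
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        Path outputPath = Paths.get("C:", "some_directory");
        PdfTableReader reader = new PdfTableReader();
        reader.savePdfPagesAsPNG(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
    }
}
----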

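==== multi-threaded example

[source, java]
----
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.IntStream;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader is provided by pdf-table

class MultiThreadPNGDump {
    public static void main(String[] args) throws IOException {
        final int THREAD_COUNT = 8;
        Path outputPath = Paths.get("C:", "some_directory");
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();

        // save pages simultaneously
        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        List<Future<Boolean>> futures = new ArrayList<>();
        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
            Callable<Boolean> callable = () -> {
                reader.savePdfPageAsPNG(pdfDoc, pageNum, outputPath);
                return true;
            };
            futures.add(executor.submit(callable));
        }

        // wait for all pages to be written
        try {
            for (Future<Boolean> f : futures) {
                f.get();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // release the worker threads so the JVM can exit
            executor.shutdown();
        }
    }
}
----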

=== Saving debug PNG images

When tables in a PDF document cannot be parsed correctly with the default settings, the user can save debug images that show the page at various stages of processing. +
Using these images, the user can adjust PdfTableSettings to achieve the desired results (see: <<Parsing settings>>).

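==== single-threaded example

[source, java]
----
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader is provided by pdf-table

class SingleThreadDebugImgsDump {
    public static void main(String[] args) throws IOException {
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        Path outputPath = Paths.get("C:", "some_directory");
        PdfTableReader reader = new PdfTableReader();
        reader.savePdfTablePagesDebugImages(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
    }
}
----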

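==== multi-threaded example

[source, java]
----
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.IntStream;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader is provided by pdf-table

class MultiThreadDebugImgsDump {
    public static void main(String[] args) throws IOException {
        final int THREAD_COUNT = 8;
        Path outputPath = Paths.get("C:", "some_directory");
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();

        // save debug images for all pages simultaneously
        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        List<Future<Boolean>> futures = new ArrayList<>();
        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
            Callable<Boolean> callable = () -> {
                reader.savePdfTablePagesDebugImage(pdfDoc, pageNum, outputPath);
                return true;
            };
            futures.add(executor.submit(callable));
        }

        // wait for all images to be written
        try {
            for (Future<Boolean> f : futures) {
                f.get();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // release the worker threads so the JVM can exit
            executor.shutdown();
        }
    }
}
----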

=== Parsing settings

PDF rendering and OpenCV filtering settings are stored in a PdfTableSettings object.

A custom settings instance can be passed to the PdfTableReader constructor when non-default values are needed:

[source, java]
----
(...)

// build settings object
PdfTableSettings settings = PdfTableSettings.getBuilder()
        .setCannyFiltering(true)
        .setCannyApertureSize(5)
        .setCannyThreshold1(40)
        .setCannyThreshold2(190.5)
        .setPdfRenderingDpi(160)
        .build();

// pass settings to reader
PdfTableReader reader = new PdfTableReader(settings);
----
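When choosing a rendering DPI, it may help to estimate how large the rendered page image will be. The sketch below relies on the fact that the default PDF user space unit is 1/72 inch, so a length in points scales by `dpi / 72` when rendered. How pdf-table itself maps detected rectangles back to PDF coordinates is not documented here, so the reverse conversion is only an assumption about how such a mapping could work; `DpiScale` and its methods are hypothetical names.

```java
class DpiScale {

    // Default PDF user space: 72 points per inch.
    static final double POINTS_PER_INCH = 72.0;

    /** Approximate size in pixels of a length given in PDF points, rendered at the given DPI. */
    static int pointsToPixels(double points, int dpi) {
        return (int) Math.round(points * dpi / POINTS_PER_INCH);
    }

    /** Converts a pixel coordinate in an image rendered at the given DPI back to PDF points. */
    static double pixelsToPoints(double pixels, int dpi) {
        return pixels * POINTS_PER_INCH / dpi;
    }

    public static void main(String[] args) {
        int dpi = 160;
        // An A4 page is roughly 595 x 842 points.
        System.out.println("A4 at " + dpi + " DPI: "
                + pointsToPixels(595, dpi) + " x " + pointsToPixels(842, dpi) + " px");
    }
}
```

Higher DPI values produce larger images, which can make thin table lines easier to detect at the cost of more memory and processing time per page.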

=== Output format

Each parsed PDF page is returned as a ParsedTablePage object:

[source, java]
----
(...)

PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
PdfTableReader reader = new PdfTableReader();

// first page in document has index == 1, not 0 !
ParsedTablePage firstPage = reader.parsePdfTablePage(pdfDoc, 1);

// getting page number
assert firstPage.getPageNum() == 1;

// rows and cells are zero-indexed just like elements of the List
// getting first row
ParsedTablePage.ParsedTableRow firstRow = firstPage.getRow(0);

// getting third cell in second row
String thirdCellContent = firstPage.getRow(1).getCell(2);

// cell content usually contains leading or trailing whitespace characters,
// so it is recommended to trim it before processing
double thirdCellNumericValue = Double.parseDouble(thirdCellContent.trim());
----
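Cell content is plain text, so a numeric conversion can fail on empty or non-numeric cells. The following sketch shows one defensive way to convert a cell string; it is not part of pdf-table, and `CellValues.parseNumericCell` is a hypothetical helper that works on any `String` such as the value returned by `getCell`.

```java
import java.util.OptionalDouble;

class CellValues {

    /** Trims the cell text and parses it as a double; returns empty for blank or non-numeric cells. */
    static OptionalDouble parseNumericCell(String cellContent) {
        if (cellContent == null) {
            return OptionalDouble.empty();
        }
        String trimmed = cellContent.trim();
        if (trimmed.isEmpty()) {
            return OptionalDouble.empty();
        }
        try {
            return OptionalDouble.of(Double.parseDouble(trimmed));
        } catch (NumberFormatException e) {
            // header cells, "n/a" markers and similar text end up here
            return OptionalDouble.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseNumericCell(" 12.5\r\n"));
        System.out.println(parseNumericCell("n/a"));
    }
}
```

Returning an `OptionalDouble` keeps the decision about how to treat empty or malformed cells with the caller instead of throwing in the middle of a row.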
