WZBSocialScienceCenter / pdf2xml-viewer

Licence: Apache-2.0 License

A simple viewer and inspection tool for text boxes in PDF documents

Programming Languages

HTML

Projects that are alternatives of or similar to pdf2xml-viewer

En Data mining

Data Mining Historical Newspaper Metadata (METS/ALTO formats)

Stars: ✭ 14 (-82.93%)

Mutual labels: ocr, xml

Gimagereader

A Gtk/Qt front-end to tesseract-ocr.

Stars: ✭ 786 (+858.54%)

Mutual labels: ocr, pdf-document

Server-Help

💻 This VSTO Add-In allows the user to ping a list of servers and creates a file for Microsoft Remote Desktop Manager from an Excel table. This is used for quickly determining which servers in a list are offline. It is written in 3 different versions: as a VSTO Add-In in C# and VB.NET, and as a VBA Add-In.

Stars: ✭ 21 (-74.39%)

Mutual labels: xml

staff identity card ocr project

Staff Identity Card OCR Project

Stars: ✭ 15 (-81.71%)

Mutual labels: ocr

OCR-Reader

An Android app to extract text from camera preview directly.

Stars: ✭ 43 (-47.56%)

Mutual labels: ocr

Iron-OCR-Image-to-Text-in-CSharp

Image to Text Tutorial in C# - See https://ironsoftware.com/csharp/ocr/tutorials/how-to-read-text-from-an-image-in-csharp-net/

Stars: ✭ 65 (-20.73%)

Mutual labels: ocr

VehicleInfoOCR

Use your camera to read number plates and obtain vehicle details. Simple, ad-free and faster alternative to existing playstore apps

Stars: ✭ 35 (-57.32%)

Mutual labels: ocr

simple-square-packing

werehamster.github.io/simple-square-packing/

Stars: ✭ 12 (-85.37%)

Mutual labels: d3

tesseract-server

A small lightweight HTTP server that converts photos, images and scanned documents to text using optical character recognition by utilizing the power of Google Tesseract.

Stars: ✭ 15 (-81.71%)

Mutual labels: ocr

screenshot-actions

Dunst actions for screenshots (OCR, upload to 0x0.st, delete, rename, move to/from clipboard)

Stars: ✭ 49 (-40.24%)

Mutual labels: ocr

escpos-xml

JavaScript library that implements the thermal printer ESC / POS protocol and provides an XML interface for preparing templates for printing.

Stars: ✭ 37 (-54.88%)

Mutual labels: xml

xspec

XSpec is a unit test and behaviour-driven development (BDD) framework for XSLT, XQuery, and Schematron.

Stars: ✭ 91 (+10.98%)

Mutual labels: xml

easyocr

easy to ocr

Stars: ✭ 49 (-40.24%)

Mutual labels: ocr

ph-commons

Java 1.8+ Library with tons of utility classes required in all projects

Stars: ✭ 23 (-71.95%)

Mutual labels: xml

learn-xquery

A list of great articles, blog posts, and books for learning XQuery

Stars: ✭ 33 (-59.76%)

Mutual labels: xml

ngx-ionic-image-viewer

An Ionic 4 Angular component to view & zoom on images and photos without any additional dependencies.

Stars: ✭ 129 (+57.32%)

Mutual labels: viewer

github-contribution-graph

Add beautiful GitHub contribution/commit graph to your profile README!

Stars: ✭ 37 (-54.88%)

Mutual labels: d3

Android-Text-Scanner

Read text and numbers with android camera OCR

Stars: ✭ 27 (-67.07%)

Mutual labels: ocr

DicomViewer

Dicom images viewer, built special for medical online testing platform

Stars: ✭ 13 (-84.15%)

Mutual labels: viewer

PRLib

Pre-Recognition Library - library with algorithms for improving OCR quality.

Stars: ✭ 22 (-73.17%)

Mutual labels: ocr

pdf2xml-viewer - A simple viewer and inspection tool for text boxes in PDF documents

July 2016 / Feb. 2017, Markus Konrad [email protected] / Berlin Social Science Center

Introduction

This is a small tool for viewing and examining individual text boxes in PDF documents. It helps you analyze how text is distributed across a page, especially in OCR-processed PDFs (so-called "sandwich PDFs") from which you might want to extract structured information (see pdftabextract for this). With this viewer, you can examine such PDFs and look at the properties of individual text boxes, such as position, width, height or font specification. In combination with pdftabextract, you can view the grids that were generated for the detected columns and rows. This blog post shows an example usage.

The viewer requires you to convert your PDFs to the pdf2xml format. Afterwards you can start a local webserver, display the XML file in the viewer (as seen below) and examine the individual text boxes with your browser's developer console.

The created file in pdf2xml format can later also be used to extract structured information, which I explain in my series of blog posts about data mining PDFs.
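As a rough illustration of what such a file contains, the following Python sketch reads the text boxes from a pdf2xml document. It assumes the layout that pdftohtml typically produces (a `pdf2xml` root with `page` elements that contain `text` elements carrying `top`, `left`, `width`, `height` and `font` attributes). The sample data and the function name `read_text_boxes` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed pdf2xml layout (not taken from a real PDF).
SAMPLE = """<pdf2xml>
  <page number="1" position="absolute" top="0" left="0" height="1263" width="892">
    <fontspec id="0" size="12" family="Times" color="#000000"/>
    <text top="100" left="50" width="200" height="14" font="0">Hello <b>world</b></text>
    <text top="120" left="50" width="80" height="14" font="0">Second box</text>
  </page>
</pdf2xml>
"""


def read_text_boxes(xml_string):
    """Return one dict per <text> element with its page, geometry and content."""
    root = ET.fromstring(xml_string)
    boxes = []
    for page in root.iter("page"):
        for text in page.iter("text"):
            boxes.append({
                "page": int(page.get("number")),
                "top": int(text.get("top")),
                "left": int(text.get("left")),
                "width": int(text.get("width")),
                "height": int(text.get("height")),
                "font": text.get("font"),
                # itertext() also collects text inside nested tags such as <b>
                "content": "".join(text.itertext()),
            })
    return boxes


if __name__ == "__main__":
    for box in read_text_boxes(SAMPLE):
        print(box)
```

To read a real file, the same loop can be run on `ET.parse("output.xml").getroot()` instead of the sample string.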

How to use it

1. Convert a PDF to pdf2xml format

First, you need to convert your PDFs using poppler-utils, a package that is part of most Linux distributions and is also available for OSX via Homebrew or MacPorts. From this package we need the pdftohtml command, which creates an XML file in pdf2xml format when run in the Terminal as follows:

pdftohtml -c -hidden -xml input.pdf output.xml

The arguments input.pdf and output.xml are your input PDF file and the created XML file in pdf2xml format, respectively. It is important that you specify the -hidden parameter when you're dealing with OCR-processed ("sandwich") PDFs. You can also add the parameters -f n (first page) and -l n (last page) to convert only a range of pages.
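If you want to run the conversion from a script, for example for many files, the command above can be wrapped in Python. This is only a minimal sketch: the helper names `build_pdftohtml_command` and `convert` are invented for this example, and it assumes that `pdftohtml` is on your PATH.

```python
import subprocess


def build_pdftohtml_command(pdf_path, xml_path, first_page=None, last_page=None, hidden=True):
    """Build the pdftohtml argument list described above."""
    cmd = ["pdftohtml", "-c"]
    if hidden:
        # needed for OCR-processed ("sandwich") PDFs
        cmd.append("-hidden")
    cmd.append("-xml")
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [pdf_path, xml_path]
    return cmd


def convert(pdf_path, xml_path, **options):
    """Run pdftohtml and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_pdftohtml_command(pdf_path, xml_path, **options), check=True)
```

For example, `convert("input.pdf", "output.xml", first_page=2, last_page=5)` would convert only pages 2 to 5.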

2. Start a minimal local webserver to display the text boxes in the PDF with the viewer

Now that you have your file(s) in pdf2xml format, change to the directory where pdf2xml-viewer resides (where its index.html file is). You should also copy the generated XML files to this location. Now let's start up a minimal local webserver. This can be done very easily with Python, which is available on most Linux and Mac OSX systems. You can do so in the Terminal with Python 2.x:

python -m SimpleHTTPServer 8080

Or with Python 3:

python3 -m http.server 8080 --bind 127.0.0.1
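The same server can also be started from a Python script, which may be handy if you want to serve the viewer directory without changing into it first. The sketch below assumes Python 3.7 or newer; the function name `start_viewer_server` is made up for this example.

```python
import functools
import http.server
import threading


def start_viewer_server(directory=".", port=8080):
    """Serve `directory` on 127.0.0.1:`port` in a background thread."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Binding to 127.0.0.1 keeps the server reachable from this machine only.
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Because the server runs in a daemon thread, the calling script has to stay alive (for example by waiting on `input()`), and it can stop the server with `server.shutdown()` followed by `server.server_close()`.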

Now open your browser and go to the address http://127.0.0.1:8080. The viewer shows up and you can enter the name of the file to load (it must be relative to the directory in which pdf2xml-viewer resides). If you just want to see an example, type in example/ocr-output.pdf.xml and load this file. You can then browse through the pages of your PDF document and you'll see the text boxes with red frames. You can further examine these boxes by using your browser's inspection tools (right click on an element and select "Inspect" in Chrome or Firefox) as seen below:

3. Use the advanced features of the viewer

You can load a page grid JSON file that was generated with pdftabextract (function common.save_page_grids):

4. Extract data from your PDFs

If you want to extract structured data from the PDFs, you should have a look at the pdftabextract package.

Technical details

This viewer uses d3.js to display the pdf2xml file. I chose this approach because it is the fastest and simplest way to inspect individual elements of an (OCR-processed) PDF document without using expensive special software. Furthermore, it allows adding features such as overlays of calculated lines or grids.

License

Apache License 2.0. See LICENSE file.
