kermitt2 / Pdfalto

Licence: gpl-2.0

PDF to XML ALTO file converter

pdfalto

pdfalto is a command line executable for parsing PDF files and producing structured XML representations of the PDF content in ALTO format.

pdfalto was initially a fork of pdf2xml, developed at XRCE, with modifications for robustness, additional features and an enhanced output format in ALTO (including, in particular, space information, which is useful for instance for further machine learning processing). It is based on the Xpdf library.
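
As an illustration of how the space information in ALTO can be consumed downstream, here is a minimal Python sketch that rebuilds the text of each line from String and SP elements. The element and attribute names (TextLine, String, SP, CONTENT) come from the ALTO schema; the SAMPLE document is hand-written for this example and is not actual pdfalto output, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hand-written ALTO fragment used only for illustration (not pdfalto output).
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Hello" HPOS="10" VPOS="10" WIDTH="30" HEIGHT="10"/>
    <SP HPOS="40" VPOS="10" WIDTH="3"/>
    <String CONTENT="ALTO" HPOS="43" VPOS="10" WIDTH="28" HEIGHT="10"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Drop the namespace part of a tag, e.g. '{ns}String' becomes 'String'."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(alto_xml: str) -> List[str]:
    """Return one string per TextLine, joining String contents and SP spaces."""
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        parts = []
        for child in elem:
            name = local_name(child.tag)
            if name == "String":
                parts.append(child.get("CONTENT", ""))
            elif name == "SP":
                # An explicit SP element marks a space between two tokens.
                parts.append(" ")
        lines.append("".join(parts))
    return lines


if __name__ == "__main__":
    print(alto_lines(SAMPLE))
```

Matching on local names rather than a fixed namespace keeps the sketch independent of the ALTO version declared in the file.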

The latest stable version is 0.3. Working version (master) is 0.4.

Requirements

compilers : clang 3.6 or gcc 4.9
makefile generator : cmake 3.12.0
fetching dependencies : wget

Usage

General usage is as follows:

Usage: pdfalto [options] <PDF-file> [<xml-file>]
  -f <int>                      : first page to convert
  -l <int>                      : last page to convert
  -verbose                      : display pdf attributes
  -noImage                      : do not extract Images (Bitmap and Vectorial)
  -noImageInline                : do not include images inline in the stream
  -outline                      : create an outline file xml
  -annotation                   : create an annotations file xml
  -noLineNumbers                : do not output line numbers added in manuscript-style textual documents
  -readingOrder                 : blocks follow the reading order
  -noText                       : do not extract textual objects (might be useful, but non-valid ALTO)
  -charReadingOrderAttr         : include TYPE attribute to String elements to indicate right-to-left reading order (might be useful, but non-valid ALTO)
  -fullFontName                 : fonts names are not normalized
  -nsURI <string>               : add the specified namespace URI
  -opw <string>                 : owner password (for encrypted files)
  -upw <string>                 : user password (for encrypted files)
  -filesLimit <int>             : limit of asset files be extracted
  -q                            : don't print any messages or errors
  -v                            : print version info
  -h                            : print usage information
  -help                         : print usage information
  --help                        : print usage information
  -?                            : print usage information
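
When pdfalto is called from a script, the command line can be assembled programmatically. The following Python sketch is one possible wrapper, not part of pdfalto itself: it uses only flags listed in the usage text above, the function names are hypothetical, and it assumes the pdfalto executable is available on the PATH.

```python
import subprocess
from typing import List, Optional


def build_pdfalto_command(
    pdf_path: str,
    xml_path: Optional[str] = None,
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
    reading_order: bool = False,
    no_image: bool = False,
    annotation: bool = False,
    outline: bool = False,
    executable: str = "pdfalto",  # assumption: the binary is on the PATH
) -> List[str]:
    """Assemble an argument list using only flags from the usage text."""
    cmd = [executable]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    if reading_order:
        cmd.append("-readingOrder")
    if no_image:
        cmd.append("-noImage")
    if annotation:
        cmd.append("-annotation")
    if outline:
        cmd.append("-outline")
    # Options come first, then the PDF file, then the optional XML file.
    cmd.append(pdf_path)
    if xml_path is not None:
        cmd.append(xml_path)
    return cmd


def run_pdfalto(cmd: List[str]) -> int:
    """Run the command and return its exit code (needs a built pdfalto)."""
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    print(build_pdfalto_command("paper.pdf", "paper.xml", first_page=1, last_page=3))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.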

In addition to the ALTO file describing the PDF content, the following files are generated:

_metadata.xml file containing the metadata of the PDF file (metadata is written to a separate XML file because the ALTO schema does not support it)
_annot.xml file containing a description of the annotations in the PDF (e.g. GOTO, external http links, ...), obtained with the -annotation option
_outline.xml file containing the PDF-embedded table of contents (aka outline), if any, obtained with the -outline option
.xml_data/ subdirectory containing the vectorial (.vec) and bitmap (.png) images embedded in the PDF; it is generated by default, unless the -noImage option is given
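
To collect these companion files after a conversion, a script can scan the output directory for the suffixes listed above. The Python sketch below does only that: the suffixes are taken from the list, while the way pdfalto derives the file name prefix is deliberately not assumed, and find_sidecars is a hypothetical helper name.

```python
from pathlib import Path
from typing import Dict, List

# Suffixes taken from the list above; the prefix naming is not assumed.
SIDECAR_SUFFIXES = {
    "metadata": "_metadata.xml",
    "annotations": "_annot.xml",
    "outline": "_outline.xml",
}
ASSET_DIR_SUFFIX = ".xml_data"


def find_sidecars(output_dir: str) -> Dict[str, List[str]]:
    """Group the entries of output_dir by the kind of companion file."""
    found: Dict[str, List[str]] = {kind: [] for kind in SIDECAR_SUFFIXES}
    found["assets"] = []
    for entry in sorted(Path(output_dir).iterdir()):
        if entry.is_dir() and entry.name.endswith(ASSET_DIR_SUFFIX):
            found["assets"].append(entry.name)
        elif entry.is_file():
            for kind, suffix in SIDECAR_SUFFIXES.items():
                if entry.name.endswith(suffix):
                    found[kind].append(entry.name)
    return found


if __name__ == "__main__":
    print(find_sidecars("."))
```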

Dependencies

All dependencies are provided as static libraries corresponding to each operating system.

Dependencies can be recompiled by running this script

See compiling dependencies procedures for further details.

Known issues (issue 41) might occur while building; in this case you'll need to compile the dependencies before building pdfalto.

Build

NOTE for Windows: it's recommended to use Cygwin and install the standard libraries (for either clang or gcc).

git clone https://github.com/kermitt2/pdfalto.git && cd pdfalto

Xpdf 4.00 is shipped as a git submodule. To download it:

git submodule update --init --recursive

Build pdfalto:

cmake ./

make

The executable pdfalto is generated in the root directory. Additionally, this will create a static library for xpdf-4.00 at the path xpdf-4.00/build/xpdf/lib/libxpdf.a, along with all the other libraries in their respective subdirectories.

Future work

Text containing block element characters (https://unicode.org/charts/PDF/U2B00.pdf) can appear in the output: these characters are used as placeholders for unresolved character Unicode values, instead of what would be expected when visually inspecting the text. These values are unresolved because the actual characters are glyphs embedded in the PDF document that use the free Unicode range for embedded fonts rather than the correct Unicode code points. The only way to extract valid text for those special characters is to use OCR at glyph level. This is our main targeted future enhancement, relying on a custom Deep Learning approach.
map special characters in secondary fonts to their expected unicode
try to optimize speed and memory
see the issue tracker for further tasks
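
Until such an enhancement exists, downstream code may want to measure how much of an extracted text consists of placeholders. The Python sketch below is one rough way to do so. It assumes that placeholders fall within the Unicode block covered by the chart referenced above (U+2B00 to U+2BFF); the exact code points emitted by pdfalto are not specified here, and the function names are hypothetical.

```python
def is_placeholder(ch: str) -> bool:
    """True if ch lies in U+2B00..U+2BFF (assumed placeholder range)."""
    return 0x2B00 <= ord(ch) <= 0x2BFF


def placeholder_ratio(text: str) -> float:
    """Fraction of characters in text that look like placeholders."""
    if not text:
        return 0.0
    return sum(is_placeholder(c) for c in text) / len(text)


if __name__ == "__main__":
    # Two ordinary letters followed by two U+2B1B characters.
    print(placeholder_ratio("ab\u2b1b\u2b1b"))
```

A high ratio on a page could be used as a signal that the page relies on embedded fonts with unresolved glyphs.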

Changes

New in version 0.3 (apart from various bug fixes):

line number detection: line numbers (typically added for review in manuscripts/preprints) are specifically identified and no longer mixed with the rest of the text content; they are grouped in a separate block or, optionally, not output in the ALTO file (-noLineNumbers option)
removal of the -blocks option: block information is always returned to ensure ALTO validation (<TextBlock> element)
bug fixing on reading order

New in version 0.2 (apart from various bug fixes):

support Unicode composition of characters
generalize reading order to all blocks (it was limited to the blocks of the first page)
use subscript/superscript text font style attribute
use SVG as a format for vectorial images
propagate unsolved character Unicode value (free Unicode range for embedded fonts) as encoded special character in ALTO (so-called "placeholder" approach)
generate metadata information in a separate XML file (as ALTO schema does not support that)
use the latest version of xpdf, version 4.00
add cmake
ALTO output replaces the custom Xerox XML format
Note: this released version was used for Grobid release 0.5.6

New in version 0.1 (apart from various bug fixes):

encode URIs (using xmlURIEscape from libxml2) for the @href attribute content to avoid blocking XML well-formedness issues. In our experiments, this problem occurs on average in 2-3 scholarly PDFs out of one thousand.
output coordinate attributes for the BLOCK elements when the -block option is selected
add a -readingOrder parameter which re-orders the blocks following the reading order when the -block option is selected. By default in pdf2xml, the elements followed the PDF content stream (the so-called raw order). In xpdf, several text flow orders are available, including the raw order and the reading order. Note that, with this modification and this new option, only the blocks are re-ordered.

In our experiments, the raw order can diverge quite significantly from the order of elements according to the visual/reading layout in 2-4% of scholarly PDFs (e.g. the title element is introduced at the end of the page element while visually present at the top of the page), and minor changes can be present in up to 100% of PDFs for some scientific publishers (e.g. headnote introduced at the end of the page content). This additional mode can thus be quite useful for information/structure extraction applications exploiting pdf2xml output.
use the latest version of xpdf, version 3.04.

Contributors

Contact: Patrice Lopez

pdfalto is developed by Patrice Lopez and Achraf Azhar.

pdf2xml was originally written by Hervé Déjean, Sophie Andrieu, Jean-Yves Vion-Dury and Emmanuel Giguet (XRCE) under the GPL2 license.

Xpdf is developed by Glyph & Cog, LLC (1996-2017) and distributed under GPL2 or GPL3 license.

The Windows version was originally built by @pboumenot and ported to Windows 7 for 64 bit, then to Windows (native and Cygwin) by @lfoppiano and @flydutch.

License

Like the original pdf2xml and its main dependency Xpdf, pdfalto is distributed under the GPL2 license.

Useful links

Some tools for converting ALTO into other formats:
