UB-Mannheim / ocr-fileformat

Licence: MIT license
Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)

ocr-fileformat

Validate and transform between OCR file formats (hOCR, ALTO, PAGE, FineReader)

Installation

Docker

You can run the command line scripts and the web interface in a Docker container. The only requirement is that Docker is installed.

To start the web interface on http://localhost:8080:

docker run --rm -it -p 8080:8080 ubma/ocr-fileformat

To run the command line scripts, mount the directory containing your input files into the container's /data directory:

docker run --rm -it -v "$PWD":/data ubma/ocr-fileformat ocr-transform alto2.0 hocr somefile.alto
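
For repeated use, the Docker invocation above can be wrapped in a small shell function. This is a minimal sketch: the function name `ocr_docker` and the `DOCKER` override variable are illustrative assumptions, not part of the project. `-it` is left out so the wrapper also works in non-interactive scripts.

```shell
# ocr_docker ARGS...: run a script from the ubma/ocr-fileformat image with
# the current directory mounted at /data, as in the example above.
# DOCKER may name another executable; this override is an assumption made
# for illustration, not a feature of ocr-fileformat.
ocr_docker() {
    "${DOCKER:-docker}" run --rm -v "$PWD":/data ubma/ocr-fileformat "$@"
}
```

With this in place, `ocr_docker ocr-transform alto2.0 hocr somefile.alto` runs the same command as the example above, minus the `-it` flags.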

System-wide

To install system-wide to /usr/local:

sudo make install

To install without sudo to your home directory:

make install PREFIX=$HOME/.local

If $HOME/.local/bin is not in your PATH, add this to your shell startup file (e.g. ~/.bashrc or ~/.zshrc):

export PATH="$HOME/.local/bin:$PATH"
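
If the startup file may be sourced more than once, a guarded variant avoids duplicate PATH entries. This is a sketch only; the helper name `path_prepend` is hypothetical and not shipped with ocr-fileformat.

```shell
# path_prepend DIR: put DIR at the front of PATH unless it is already listed.
# (Hypothetical helper for illustration.)
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                      # already present, nothing to do
        *) PATH="$1${PATH:+:$PATH}" ;;    # prepend, separated by a colon
    esac
    export PATH
}

path_prepend "$HOME/.local/bin"
```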

The web application has a PHP backend. You can deploy it on any PHP-capable server by copying the web folder somewhere below the document root of your server, e.g. /var/www/html for Apache on Debian/Ubuntu:

sudo -u www-data cp -r web /var/www/html/ocr-fileformat

In this example the GUI would be available under http://localhost/ocr-fileformat/.

Usage

The project offers two functions, transformation and validation, which can be accessed via command line scripts (CLI), a web interface (GUI), or from your own tools (API).

CLI

  • ocr-transform: Transformation of OCR output between OCR formats
  • ocr-validate: Validation of OCR output against OCR format schemas

GUI

The web interface is for testing validation and transformations. You can upload a file or select an input file by URL.

API

Transformation

Transformation CLI

Usage: ocr-transform [-dl] <input-fmt> <output-fmt> [<input> [<output>]] [-- <saxon_opts>]

For example, you can transform an ALTO XML file to an hOCR file with:

ocr-transform alto hocr sample.xml sample.hocr

Or convert from ALTO XML (version 2.1) to hOCR with:

ocr-transform alto2.1 hocr sample.alto sample.hocr

You can also pass arguments directly to the Saxon CLI by passing them after a double dash (--). For example, to set the foo parameter to bar:

ocr-transform alto hocr sample.xml sample.hocr -- foo=bar
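
To convert a whole directory, the single-file call can be wrapped in a loop. The following is a minimal sketch under stated assumptions: input files end in `.xml`, `ocr-transform` returns a non-zero exit status on failure, and the names `batch_transform` and `OCR_TRANSFORM` are illustrative, not part of the project.

```shell
# batch_transform IN_FMT OUT_FMT SRC_DIR DST_DIR
# Runs ocr-transform once per *.xml file in SRC_DIR and writes each result to
# DST_DIR under the same base name, with OUT_FMT as the file extension.
# OCR_TRANSFORM may name a different executable (illustrative override).
batch_transform() {
    in_fmt=$1 out_fmt=$2 src_dir=$3 dst_dir=$4
    mkdir -p "$dst_dir" || return 1
    failed=0
    for src in "$src_dir"/*.xml; do
        [ -e "$src" ] || continue          # the glob matched nothing
        base=$(basename "$src" .xml)
        if ! "${OCR_TRANSFORM:-ocr-transform}" "$in_fmt" "$out_fmt" \
                "$src" "$dst_dir/$base.$out_fmt"; then
            echo "failed: $src" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]                    # non-zero status if any file failed
}
```

For instance, `batch_transform alto2.1 hocr ./alto ./hocr` would write one `.hocr` file per input file; the directory names are placeholders.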

Try ocr-transform -h to get an overview:

Usage: ocr-transform [-dhLv] <input-fmt> <output-fmt> [<input> [<output>]] [-- <saxon_opts>]

    Options:
        --help    -h     Show this help
        --version -v     Show version
        --debug   -d     Increase debug level by 1, can be repeated
        --list    -L     List transformations

    Transformations:
        abbyy hocr
        abbyy page
        alto2.0 alto3.0
        alto2.0 alto3.1
        alto2.0 hocr
        alto2.1 alto3.0
        alto2.1 alto3.1
        alto2.1 hocr
        alto page
        alto text
        gcv hocr
        gcv page
        hocr alto2.0
        hocr alto2.1
        hocr page
        hocr text
        page alto
        page hocr
        page page2019
        page text
        tei hocr

    Saxon options:
        Usage: see http://www.saxonica.com/documentation/index.html#!using-xsl/commandline
        Options available: -? -a -catalog -config -cr -diag -dtd -ea -expand -explain -export -ext -im -init -it -jit -l -lib -license -m -nogo -now -o -opt -or -outval -p -quit -r -relocate -repeat -s -sa -scmin -strip -t -T -target -threads -TJ -Tlevel -Tout -TP -traceout -tree -u -val -versionmsg -warnings -x -xi -xmlversion -xsd -xsdversion -xsiloc -xsl -y
        Use -XYZ:? for details of option XYZ
        Params:
          param=value           Set stylesheet string parameter
          +param=filename       Set stylesheet document parameter
          ?param=expression     Set stylesheet parameter using XPath
          !param=value          Set serialization parameter

Transformation GUI

Select the Transform menu option. Choose a URL, an input and an output format. Click Transform.

Transformation API

The stylesheets are installed in $PREFIX/share/ocr-fileformat/xslt and can be used directly in your scripts and software. You will need to use an XSLT 2.0 capable stylesheet transformer.
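
As a rough sketch of direct use, an installed stylesheet can be run through Saxon's command-line interface with the `-s`, `-xsl` and `-o` options. The helper name `apply_stylesheet`, the `JAVA` and `SAXON_JAR` variables, and the location of the Saxon jar are assumptions for illustration. The stylesheet file names are not spelled out here, so list the installed directory to find them.

```shell
# apply_stylesheet XSL INPUT OUTPUT
# Runs one stylesheet through the Saxon command line.
# SAXON_JAR must point to a Saxon jar; where that jar lives on your system
# is an assumption this sketch cannot know.
apply_stylesheet() {
    "${JAVA:-java}" -jar "${SAXON_JAR:?set SAXON_JAR to a Saxon jar}" \
        -s:"$2" -xsl:"$1" -o:"$3"
}
```

Running `ls "$PREFIX/share/ocr-fileformat/xslt"` shows the available stylesheets; pass the full path of one of them as the first argument.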

Supported Transformations

From ╲ To            hOCR  ALTO  PAGEXML
hOCR                 =     ✓     ✓
ALTO                 ✓     =     ✓
PAGEXML              ✓     ✓     =
FineReader           ✓     -     ✓
Google Cloud Vision  ✓     -     ✓
TEI                  ✓     -     -

Validation

Usage: ocr-validate [-dhL] <schema> <file> []

    Options:
        --help    -h     Show this help
        --version -v     Show version
        --debug   -d     Increase debug level by 1, can be repeated
        --list    -L     List available schemas

    Schemas:
        hocr
        alto-1-0 alto-1-1 alto-1-2 alto-1-3 alto-1-4 alto-2-0 alto-2-1 alto-2-2-draft alto-3-0 alto-3-1 alto-3-2-draft alto-4-0 alto-4-1
        abbyy-6-schema-v1 abbyy-8-schema-v2 abbyy-9-schema-v1 abbyy-10-schema-v1
        page-2009-03-16 page-2010-01-12 page-2010-03-19 page-2013-07-15 page-2016-07-15 page-2017-07-15 page-2018-07-15 page-2019-07-15

Validation CLI

For example, to validate an XML file against the ALTO 3.1 schema:

ocr-validate alto-3-1 myFile.alto
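
To check several files in one go, the call can be looped. This sketch assumes that `ocr-validate` signals an invalid file with a non-zero exit status, which the text above does not confirm; `validate_all` and `OCR_VALIDATE` are illustrative names.

```shell
# validate_all SCHEMA FILE...
# Prints one "ok" or "INVALID" line per file and returns non-zero if any
# file failed. Relies on the ASSUMPTION that ocr-validate exits non-zero
# for invalid input.
validate_all() {
    schema=$1; shift
    bad=0
    for f in "$@"; do
        if "${OCR_VALIDATE:-ocr-validate}" "$schema" "$f" >/dev/null 2>&1; then
            echo "ok       $f"
        else
            echo "INVALID  $f"
            bad=$((bad + 1))
        fi
    done
    [ "$bad" -eq 0 ]
}
```

A call such as `validate_all alto-3-1 *.alto` can then serve as a check step in a script or CI job.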

Validation GUI

Select the Validate menu option. Choose a URL and a schema. Click Validate.

Validation API

The XSD files are installed under $PREFIX/share/ocr-fileformat/xsd and can be used with any XSD-capable validator.
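
One way to use the installed schemas without the wrapper script is `xmllint` from libxml2. This is a hedged sketch: `xmllint` is an external tool, not something the project documents using, and the helper name `validate_with_xsd` and the `XMLLINT` variable are hypothetical. The XSD file name is passed in by the caller because the file names in the xsd directory are not listed here.

```shell
# validate_with_xsd XSD XML: validate an XML file against an XSD file.
# --noout suppresses the echoed document; --schema selects the XSD.
validate_with_xsd() {
    "${XMLLINT:-xmllint}" --noout --schema "$1" "$2"
}
```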

Supported Validation Formats

            hOCR  ALTO  PAGEXML  FineReader  Google Cloud Vision
Validation  ✓     ✓     ✓        ✓           -

License

This is free software. You may use it under the terms of the MIT License.

During the installation process several projects are included (in ./vendor). These projects have different licenses:
