qurator-spk / dinglehopper

License: Apache-2.0
An OCR evaluation tool

dinglehopper

dinglehopper is an OCR evaluation tool that reads ALTO, PAGE and text files. It compares a ground truth (GT) document page with an OCR result page to compute metrics and a report of word and character differences.

Goals

  • Useful
    • As a UI tool
    • For an automated evaluation
    • As a library
  • Unicode support

Installation

It's best to install with pip from a checkout of the repository, e.g.:

sudo pip install .

Usage

Usage: dinglehopper [OPTIONS] GT OCR [REPORT_PREFIX]

  Compare the PAGE/ALTO/text document GT against the document OCR.

  dinglehopper detects if GT/OCR are ALTO or PAGE XML documents to extract
  their text and falls back to plain text if no ALTO or PAGE is detected.

  The files GT and OCR are usually a ground truth document and the result of
  an OCR software, but you may use dinglehopper to compare two OCR results.
  In that case, use --no-metrics to disable the then meaningless metrics and
  also change the color scheme from green/red to blue.

  The comparison report will be written to $REPORT_PREFIX.{html,json}, where
  $REPORT_PREFIX defaults to "report". The reports include the character
  error rate (CER) and the word error rate (WER).

  By default, the text of PAGE files is extracted on 'region' level. You may
  use "--textequiv-level line" to extract from the level of TextLine tags.

Options:
  --metrics / --no-metrics  Enable/disable metrics and green/red
  --textequiv-level LEVEL   PAGE TextEquiv level to extract text from
  --progress                Show progress bar
  --help                    Show this message and exit.

For example:

dinglehopper some-document.gt.page.xml some-document.ocr.alto.xml

This generates report.html and report.json.
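For automated evaluation, the command line documented above can be assembled from a script. The following is a minimal sketch that only uses the options listed in the usage text; the helper name `build_dinglehopper_command` is hypothetical and not part of dinglehopper.

```python
import shlex


def build_dinglehopper_command(gt, ocr, report_prefix="report",
                               metrics=True, textequiv_level=None):
    """Assemble the argument list for the documented dinglehopper CLI.

    Hypothetical helper: only the options shown in the usage text are used.
    """
    cmd = ["dinglehopper"]
    if not metrics:
        # Documented flag for comparing two OCR results
        cmd.append("--no-metrics")
    if textequiv_level is not None:
        # Documented option, e.g. "line" for TextLine level
        cmd += ["--textequiv-level", textequiv_level]
    cmd += [gt, ocr, report_prefix]
    return cmd


if __name__ == "__main__":
    cmd = build_dinglehopper_command(
        "some-document.gt.page.xml", "some-document.ocr.alto.xml")
    print(shlex.join(cmd))
    # To actually run it (requires dinglehopper to be installed and on PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```

Returning a list rather than a single string avoids shell quoting issues when file names contain spaces.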

(Screenshot: dinglehopper displaying metrics and character differences)
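To make the two metrics concrete, here is a minimal sketch of how a character error rate and a word error rate can be computed from an edit distance. It is not dinglehopper's implementation: it assumes a plain Levenshtein distance over NFC-normalized code points and whitespace-separated words, and the real tool may segment and normalize text differently. All function names are illustrative.

```python
import unicodedata


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution or match
        prev = curr
    return prev[-1]


def error_rate(gt_items, ocr_items):
    """Edit distance divided by the length of the ground truth."""
    if not gt_items:
        return 0.0 if not ocr_items else float("inf")
    return levenshtein(gt_items, ocr_items) / len(gt_items)


def cer(gt, ocr):
    """Character error rate over NFC-normalized code points (assumption)."""
    gt = unicodedata.normalize("NFC", gt)
    ocr = unicodedata.normalize("NFC", ocr)
    return error_rate(gt, ocr)


def wer(gt, ocr):
    """Word error rate over whitespace-separated words (assumption)."""
    return error_rate(gt.split(), ocr.split())


if __name__ == "__main__":
    print(cer("abcd", "abxd"))
    print(wer("the quick fox", "the quick box"))
```

Because the distance is divided by the ground truth length, an error rate above 1 is possible when the OCR output contains many spurious insertions.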

dinglehopper-line-dirs

You may also want to compare a directory of GT text files (e.g. gt/line0001.gt.txt) with a directory of OCR text files (e.g. ocr/line0001.some-ocr.txt). A separate command line tool does this:

dinglehopper-line-dirs gt/ ocr/
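The example file names suggest that GT and OCR lines are matched by a shared line identifier. The sketch below illustrates one way such a pairing could work, assuming the identifier is the part of the file name before the first dot; this rule and the helper names are assumptions for illustration, not a description of how dinglehopper-line-dirs matches files.

```python
from pathlib import Path


def line_id(path):
    """Return the file name up to the first dot, e.g. 'line0001'."""
    return path.name.split(".", 1)[0]


def pair_line_files(gt_dir, ocr_dir):
    """Pair *.txt files from both directories that share a line identifier."""
    gt_files = {line_id(p): p for p in Path(gt_dir).glob("*.txt")}
    ocr_files = {line_id(p): p for p in Path(ocr_dir).glob("*.txt")}
    # Only identifiers present in both directories can be compared
    common = sorted(gt_files.keys() & ocr_files.keys())
    return [(gt_files[k], ocr_files[k]) for k in common]


if __name__ == "__main__":
    for gt_path, ocr_path in pair_line_files("gt", "ocr"):
        print(gt_path, "<->", ocr_path)
```

Lines that exist in only one directory are silently skipped here; a stricter script might report them instead.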

dinglehopper-extract

The tool dinglehopper-extract extracts the text of the given input file and writes it to stdout, for example:

dinglehopper-extract --textequiv-level line OCR-D-GT-PAGE/00000024.page.xml
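To illustrate what extracting at "line" level means, the following sketch collects the text of every TextLine element in a PAGE XML document using only the Python standard library. It is a simplified illustration and not dinglehopper's extraction code: it ignores reading order and multiple TextEquiv alternatives, and matches elements by local name so that it works with any PAGE schema version. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_line_texts(page_xml):
    """Return the text of each TextLine, in document order."""
    root = ET.fromstring(page_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Look only at direct children so Word-level TextEquiv is skipped
        for child in elem:
            if local_name(child.tag) == "TextEquiv":
                for uni in child:
                    if local_name(uni.tag) == "Unicode":
                        lines.append(uni.text or "")
                break  # use the first TextEquiv of the line only
    return lines


SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1"><TextEquiv><Unicode>Hello</Unicode></TextEquiv></Word>
        <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l2">
        <TextEquiv><Unicode>second line</Unicode></TextEquiv>
      </TextLine>
      <TextEquiv><Unicode>Hello world\nsecond line</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    print("\n".join(extract_line_texts(SAMPLE)))
```

Extracting at region level would instead read the TextEquiv that is a direct child of each TextRegion, which can differ from the concatenated line texts if the two levels were annotated separately.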

OCR-D

As an OCR-D processor:

ocrd-dinglehopper -I OCR-D-GT-PAGE,OCR-D-OCR-TESS -O OCR-D-OCR-TESS-EVAL

This generates HTML and JSON reports in the OCR-D-OCR-TESS-EVAL filegroup.

The OCR-D processor has these parameters:

Parameter                 Meaning
-P metrics false          Disable metrics and the green-red color scheme (default: enabled)
-P textequiv_level line   (PAGE) Extract text from TextLine level (default: TextRegion level)

For example:

ocrd-dinglehopper -I ABBYY-FULLTEXT,OCR-D-OCR-CALAMARI -O OCR-D-OCR-COMPARE-ABBYY-CALAMARI -P metrics false

Developer information

Please refer to README-DEV.md.
