All Projects → maxim2266 → go-ocr

maxim2266 / go-ocr

Licence: BSD-3-Clause License
A tool for extracting text from scanned documents (via OCR), with user-defined post-processing.

Programming Languages

go
31211 projects - #10 most used programming language
Makefile
30231 projects

Projects that are alternatives of or similar to go-ocr

papermerge-core
Papermerge RESTful backend structured as reusable Django app
Stars: ✭ 103 (+232.26%)
Mutual labels:  ocr, scanned-documents
proxy-scrape
scrapin' proxies with ocr
Stars: ✭ 20 (-35.48%)
Mutual labels:  ocr
TesseractStudio.Net
A free Windows graphical interface to the Tesseract 4.0 OCR engine.
Stars: ✭ 38 (+22.58%)
Mutual labels:  ocr
polling-station-app
Voting station app to redeem the suffrage on the blockchain using a machine readable travel document.
Stars: ✭ 39 (+25.81%)
Mutual labels:  ocr
labelReader
Programmatically find and read labels using Machine Learning
Stars: ✭ 44 (+41.94%)
Mutual labels:  ocr
cordova-plugin-tesseract
Cordova Plugin for OCR process using Tesseract
Stars: ✭ 70 (+125.81%)
Mutual labels:  ocr
tesseract-ocr
Node.js wrapper for Tesseract OCR CLI.
Stars: ✭ 29 (-6.45%)
Mutual labels:  ocr
doctr-tfjs-demo
Javascript demo of docTR, powered by TensorFlowJS
Stars: ✭ 21 (-32.26%)
Mutual labels:  ocr
craft-text-detector
Packaged, Pytorch-based, easy to use, cross-platform version of the CRAFT text detector
Stars: ✭ 151 (+387.1%)
Mutual labels:  ocr
pdf-scripts
📑 Scripts to repair, verify, OCR, compress, wrangle, crop (etc.) PDFs
Stars: ✭ 33 (+6.45%)
Mutual labels:  ocr
R2CNN
caffe re-implementation of R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detection
Stars: ✭ 80 (+158.06%)
Mutual labels:  ocr
google-cloud-vision-php
A simple php wrapper for the google cloud vision API
Stars: ✭ 16 (-48.39%)
Mutual labels:  ocr
lookup
🔍 Pure Go implementation of fast image search and simple OCR, focused on reading info from screenshots
Stars: ✭ 35 (+12.9%)
Mutual labels:  ocr
DAVAR-Lab-OCR
OCR toolbox from Davar-Lab
Stars: ✭ 402 (+1196.77%)
Mutual labels:  ocr
KTP-OCR
An Open Source OCR tool for Indonesian ID card (KTP).
Stars: ✭ 48 (+54.84%)
Mutual labels:  ocr
IdCardRecognition
Android id card recognition based on OCR. 安卓基于OCR的身份证识别。
Stars: ✭ 35 (+12.9%)
Mutual labels:  ocr
gazou
Japanese OCR for Linux & Windows
Stars: ✭ 32 (+3.23%)
Mutual labels:  ocr
ScreencapToTextBot
Reddit bot that takes the screencap of a conversation and converts it in reddit formatted text
Stars: ✭ 12 (-61.29%)
Mutual labels:  ocr
python-ocr-example
The code for the blogpost A Python Approach to Character Recognition
Stars: ✭ 54 (+74.19%)
Mutual labels:  ocr
ibm-cloud-functions-serverless-ocr-openchecks
Serverless bank check deposit processing with object storage and optical character recognition using Apache OpenWhisk powered by IBM Cloud Functions. See the Tech Talk replay for a demo.
Stars: ✭ 40 (+29.03%)
Mutual labels:  ocr

The project is based on older versions of tesseract and other tools, and is now superseded by another project which allows for more granular control over the text recognition process.

go-ocr

A tool for extracting plain text from scanned documents (pdf or djvu), with user-defined postprocessing.

Motivation

Once I had a task of OCR'ing a number of scanned documents in pdf format. I quickly built a pipeline of the tools to extract images from the input files and to convert them to plain text, but then I realised that modern OCR software is still less than ideal in terms of recognising text, so a good deal of postprocessing was needed in order to remove at least some of those OCR artefacts and irregularities. I ended up with a long pipeline of sed/grep filters which also had to be adjusted per each document and per each document language. What I wanted was a tool that could combine the OCR tools invocation with filters application, also giving an easy way of modifying and combining the filter definitions.

The tool

Given an input file in either pdf or djvu format, the tool performs the following steps:

  1. Images get extracted from the input file using pdfimages or ddjvu tool;
  2. The extracted images get converted to plain text using tesseract tool, in parallel;
  3. The specified filters get applied to the text.

Invocation

go-ocr [OPTION]... FILE

Command line options:

-f,--first N        first page number (optional, default: 1)
-l,--last  N        last page number (optional, default: last page of the document)
-F,--filter FILE    filter specification file name (optional, may be given multiple times)
-L,--language LANG  document language (optional, default: 'eng')
-o,--output FILE    output file name (optional, default: stdout)
-h,--help           display this help and exit
-v,--version        output version information and exit
Example

The following command processes a document some.pdf in Russian, from page 12 to page 26 (inclusive), without any postprocessing, storing the result in the file document.txt:

./go-ocr --first 12 --last 26 --language rus --output document.txt some.pdf

Filter definitions

Filter definition file is a plain text file containing rewriting rules and C-style comments. Each rewriting rule has the following format:

scope type "match" "substitution"

where

  • scope is either line or text;
  • type is either word or regex;
  • match and substitution are Go strings.

Each rule must be on one line.

Each rule of the scope line is applied to each line of the text. There is no processing done to the line by the tool itself other than trimming the trailing whitespace, which means that a line does not have a trailing newline symbol when the rule is applied. After that all the lines get combined into text with newline symbols inserted between them.

Each rule of the scope text is applied to the whole text after all the line rules. All newline symbols are visible to the rule which allows for combining multiple lines into one.

The reason for having two different scopes for the rules is that applying a rule to a line is computationally cheaper that applying to the whole text. Also, this makes the line regular expressions a bit simpler as, for example, \s regex cannot match a newline.

Rules of type word do a simple substitution replacing any match string with its corresponding substitution string.

Rules of type regex search the input for any match of the match regular expression and replace it with the substitution string. The syntax of the regular expression is that of the Go regexp engine. The substuitution string may contain references to the content of capturing groups from the corresponding match regular expression. From the Go documentation, each reference

is denoted by a substring of the form $name or ${name}, where name is a non-empty sequence of letters, digits, and underscores. A purely numeric name like $1 refers to the submatch with the corresponding index; other names refer to capturing parentheses named with the (?P<name>...) syntax. A reference to an out of range or unmatched index or a name that is not present in the regular expression is replaced with an empty slice.

In the $name form, name is taken to be as long as possible: $1x is equivalent to ${1x}, not ${1}x, and, $10 is equivalent to ${10}, not ${1}0.

To insert a literal $ in the output, use $$ in the template.

All filter definition files are always processed in the order in which they are specified on the command line. Within each file, the rules are grouped by the scope, and applied in the order of specification. This allows for each rule to rely on the outcome of all the rules before it.

Rewriting rules examples

Rule to replace ellipsis with a single utf-8 symbol:

line word	"..."  "…"

Rule to replace all whitespace sequences with a single space character:

line regex	`\s+`	" "

Rule to remove all newline characters from the middle of a sentence:

text regex	`([a-z\(\),])\n+([a-z\(\)])` "${1} ${2}"

More examples can be found in the files filter-eng and filter-rus.

In practice, it is often useful to maintain one filter definition file with rules to remove common OCR artefacts, and another file with rules specific to a particular document. In general, it is probably impossible to avoid all manual editing altogether by using this tool, but from my experience, a few hours spent on setting up the appropriate filters for a 700 pages document can dramatically reduce the amount of manual work needed afterwards.

Other tools

Internally the program relies on pdfimages and ddjvu tools for extracting images from the input file, and on tesseract program for the actual OCR'ing. The tool pdfimages is usually a part of poppler-utils package, the tool ddjvu comes from djvulibre-bin package, and tesseract is included in tesseract-ocr package. By default, tesseract comes with the English language support only, other languages should be installed separately, for example, run sudo apt install tesseract-ocr-rus to install the Russian language support. To find out what languages are currently installed type tesseract --list-langs.

Compilation

Invoke make (or make debug) from the directory of the project to compile the code with debug information included, or make release to compile without debug symbols. This creates executable file go-ocr.

Technical details

The tool first runs pdfimages or ddjvu program to extract images to a temporary directory, and then invokes tesseract on each image in parallel to produce lines of plain text. Those lines are then passed through the line filters, if any, then assembled into one text string and passed through text filters, if any. regexp filters are implemented using Regexp.ReplaceAll() function, and word filters are invocations of bytes.Replace() function.

Known issues

Older versions of pdfimages tool do not have -tiff option, resulting in an error.

Platform

Linux (tested on Linux Mint 18 64bit, based on Ubuntu 16.04), will probably work on MacOS as well.

Tools:

$ go version
go version go1.6.2 linux/amd64
$ tesseract --version
tesseract 3.04.01
...
$ pdfimages --version
pdfimages version 0.41.0
...
$ ddjvu --help
DDJVU --- DjVuLibre-3.5.27
...
Lisence: BSD
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].