deajan / pmOCR

Licence: BSD-3-Clause
A wrapper for tesseract / abbyyOCR11 ocr4linux finereader cli that can perform batch operations or monitor a directory and launch an OCR conversion on file activity

pmOCR (poor man's OCR tool)

A multicore batch & service wrapper script for Tesseract v3/v4/v5 (https://github.com/tesseract-ocr/) or ABBYY CLI OCR 11 FOR LINUX based on Finereader Engine 11 optical character recognition (www.ocr4linux.com).

Conversions support tiff/jpg/png/pdf/bmp to PDF, TXT and CSV (also DOCX and XLSX for ABBYY OCR). It can actually support any other format that your OCR engine can handle.

This wrapper can work both in batch and service mode.

In batch mode, it is used as a command-line tool that processes multiple files at once and can output one or more formats.

In service mode, it monitors directories and launches OCR conversions as soon as new files arrive in them. Since v1.8.0, it can also monitor NFS / SMB mount points using a new integrated poller that emulates inotifywait.
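
On network mounts, where inotify events are usually not delivered, new files can be found by polling instead. The following is a minimal sketch of that general idea and not pmOCR's actual poller: the function name `list_new_files` and the marker-file approach are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical polling sketch (NOT pmOCR's implementation):
# list regular files below a directory that were modified after a marker file.

list_new_files() {
    watch_dir=$1   # directory to scan recursively
    marker=$2      # file whose modification time records the previous scan
    # -newer keeps only files with an mtime later than the marker's mtime.
    find "$watch_dir" -type f -newer "$marker"
}

# Example polling loop, commented out so the sketch does not block:
# touch /tmp/pmocr.marker
# while :; do
#     touch /tmp/pmocr.marker.next          # timestamp taken before scanning
#     list_new_files /some/path /tmp/pmocr.marker
#     mv /tmp/pmocr.marker.next /tmp/pmocr.marker
#     sleep 30
# done
```

Taking the next marker's timestamp before the scan means a file written during the scan is reported on the following pass rather than lost, at the cost of occasionally reporting a file twice.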

pmOCR has the following options:

  • Include the current date in the output filename
  • Ignore already OCRed PDF files based on font detection and / or file suffix
  • Delete or move input file after successful conversion
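
To make the date and suffix options more concrete, here is a small sketch of how an output name could be derived and how a suffix check could skip files that were already handled. The helper names `build_output_name` and `already_processed`, the `_OCR` suffix and the name layout (name, date, suffix, extension) are all assumptions for illustration. pmOCR's real naming scheme is defined by its configuration file.

```shell
#!/bin/sh
# Hypothetical helpers; the naming layout is an assumption, not pmOCR's own.

# Build "<dir>/<name>_<date><suffix>.<ext>" from an input path.
build_output_name() {
    input=$1     # e.g. /scans/invoice.tiff
    suffix=$2    # e.g. _OCR
    stamp=$3     # e.g. the output of: date +%Y-%m-%d
    ext=$4       # e.g. pdf
    base=${input##*/}    # strip the directory part
    base=${base%.*}      # strip the last extension
    printf '%s/%s_%s%s.%s\n' "$(dirname "$input")" "$base" "$stamp" "$suffix" "$ext"
}

# Succeed when the file name (without extension) already ends with the suffix.
already_processed() {
    name=${1##*/}
    name=${name%.*}
    case $name in
        *"$2") return 0 ;;
        *)     return 1 ;;
    esac
}

# Usage:
# out=$(build_output_name /scans/invoice.tiff _OCR "$(date +%Y-%m-%d)" pdf)
# already_processed "$out" _OCR && echo "skipping $out"
```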

Install it

$ git clone https://github.com/deajan/pmOCR
$ cd pmOCR
$ ./install.sh

You will need the pdffonts utility (from the poppler-utils package). Optionally, you can install inotifywait (from the inotify-tools package).

If you are using Tesseract, please install tesseract-osd (sometimes called tesseract-ocr-osd) and tesseract-[your language]. You will also need ImageMagick in order to transform bitmap PDF documents into indexed PDFs.
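
A quick way to verify these prerequisites before the first run is to check that each tool is on the PATH. The sketch below is not part of pmOCR. The function name `missing_tools` is made up for this example, and the executable names (`convert` for ImageMagick, `tesseract` for Tesseract) are the usual ones but may differ on your distribution.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every given tool that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tool names are assumptions based on the packages listed above.
missing=$(missing_tools pdffonts tesseract convert inotifywait)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing >&2
fi
```

Since inotifywait is optional, its absence from the list is only a hint, not an error.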

Batch mode

Use pmocr to batch process all files in a given directory and its subdirectories.

Use --help for command line usage.

Example:

$ pmocr.sh --batch --target=pdf --skip-txt-pdf --delete-input /some/path
$ pmocr.sh --batch --target=pdf --target=csv --suffix=processed /some/path

If pmOCR isn't installed, you may run it directly with a configuration file like:

$ ./pmocr.sh --config=./default.conf --batch -p /some/path

OCR Configuration

pmOCR uses a default config stored in /etc/pmocr/default.conf. You may change its contents, or clone it and have pmOCR use an alternative configuration with:

$ pmocr.sh --config=/etc/pmocr/myConfig.conf --batch --target=csv /some/path

Service mode

Service mode monitors directories and their subdirectories and launches an OCR conversion whenever a new file is written. Keep in mind that only file creations are monitored. File moves aren't.

pmocr is written to monitor up to 5 directories, each producing a different target format (PDF, DOCX, XLSX, TXT & CSV). Comment out a folder to disable its monitoring.

There's also an option to avoid passing PDFs to the OCR engine that already contain text.
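
Font detection of this kind can be sketched with pdffonts, which prints two header lines (column names and a dashed rule) followed by one line per font. A PDF that lists no fonts is most likely a pure bitmap scan. The function name `has_fonts` is an assumption for this example, and pmOCR's own check may differ.

```shell
#!/bin/sh
# Hypothetical helper: read `pdffonts` output on stdin and succeed when at
# least one font entry follows the two header lines.
has_fonts() {
    awk 'NR > 2 && NF > 0 { found = 1 } END { exit(found ? 0 : 1) }'
}

# Usage with a real file (requires pdffonts from poppler-utils):
# if pdffonts /some/path/scan.pdf | has_fonts; then
#     echo "PDF already contains fonts, skipping OCR"
# fi
```

This is a heuristic: a PDF can embed a font for a stamp or page number and still consist mostly of unrecognized scans.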

After installation, please configure /etc/pmocr/default.conf in order to monitor the directories you need, and adjust your specific options.

Launch the service (initV style):

$ service pmocr-srv start

Launch the service (systemd style):

$ systemctl start [email protected]

Check the service state (initV style):

$ service pmocr-srv status

Check the service state (systemd style):

$ systemctl status [email protected]

Multiple service instances

In order to monitor multiple directories with different OCR settings, duplicate the /etc/pmocr/default.conf configuration file. When the pmOCR service is launched with initV, each config file creates an instance. With systemd, you have to launch one service per config file. Example for configs /etc/pmocr/default.conf and /etc/pmocr/other.conf:

$ systemctl start [email protected]
$ systemctl start [email protected]

Support for OCR engines

Has been tested so far with:

  • ABBYY FineReader OCR Engine 11 CLI for Linux releases R2 (v 11.1.6.562411), R3 (v 11.1.9.622165) and R6 (v 11.1.14.707470)
  • Tesseract-ocr 3.0.4
  • Tesseract-ocr 4.0.0 and 4.0.12
  • Tesseract-ocr 5.0.0 and 5.0.1

Tesseract mode also uses ghostscript to convert PDF files to an intermediary TIFF format in order to process them.

It should work with virtually any engine as long as you adjust the parameters.

Parameters include any arguments to pass to the OCR program depending on the target format.

Support for OCR Preprocessors

ABBYY has an integrated preprocessor to enhance recognition quality, whereas Tesseract relies on external tools. pmOCR can use a preprocessor like ImageMagick to deskew, clear noise, render a white background and remove black borders. The ImageMagick preprocessor is configured and enabled by default for use with Tesseract.

Tesseract caveats

When no OSD / language data is installed, tesseract will still process documents, but the quality may suffer. While pmocr will warn you about this, the conversion still happens. Please make sure to install all necessary addons for tesseract.

Troubleshooting

Please check /var/log/pmocr.log or ./pmocr.log file for errors.

Filenames containing special characters should work. Nevertheless, if your file doesn't get converted, try renaming it, then copy it to the monitored directory again or batch process it again.

By default, files that fail to process get the prefix '_OCR_ERR' + date added to their filename. In order to reprocess those files, the prefix has to be removed with the following command:

$ find /monitor/path -iname "*_OCR_ERR.*" -print0 | xargs -0 -I {} bash -c 'mv "$1" "${1//_OCR_ERR/}"' _ {}

If you use Tesseract to create searchable PDF files, please make sure that version 3.03 or later is installed.
