tesseract-ocr / Tesstrain

Licence: apache-2.0
Train Tesseract LSTM with make

tesstrain

Training workflow for Tesseract 4 as a Makefile for dependency tracking and building the required software from source.

Install

leptonica, tesseract

You will need a recent version (>= 4.0.0beta1) of tesseract built with the training tools and matching leptonica bindings. Build instructions and more can be found in the Tesseract project wiki.

Alternatively, you can build leptonica and tesseract within this project and install it to a subdirectory ./usr in the repo:

  make leptonica tesseract

Tesseract will be built from the git repository, which requires CMake, autotools (including autoconf-archive) and some additional libraries for the training tools. See the installation notes in the tesseract repository.

Python

You need a recent version of Python 3.x. The Python library Pillow is used for image processing. If you don't have a global installation of it, install the dependencies from the provided requirements file with pip install -r requirements.txt.

Choose model name

Choose a name for your model. By convention, Tesseract stack models including language-specific resources use (lowercase) three-letter codes defined in ISO 639 with additional information separated by underscore. E.g., chi_tra_vert for traditional Chinese with vertical typesetting. Language-independent (i.e. script-specific) models use the capitalized name of the script type as identifier. E.g., Hangul_vert for Hangul script with vertical typesetting. In the following, the model name is referenced by MODEL_NAME.
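
The naming convention above can be expressed as a pattern. The following is only a sketch: follows_convention is a hypothetical helper that is not part of tesstrain, and the regular expression is an approximation of the convention, which Tesseract itself does not enforce.

```shell
# Sketch: check a model name against the naming convention.
# follows_convention is a hypothetical helper, not part of tesstrain.
follows_convention() {
    # Either a lowercase three-letter code (e.g. chi) or a capitalized
    # script name (e.g. Hangul), followed by optional _suffix parts.
    printf '%s\n' "$1" |
        LC_ALL=C grep -Eq '^([a-z]{3}|[A-Z][A-Za-z]+)(_[A-Za-z0-9]+)*$'
}

for name in chi_tra_vert Hangul_vert english; do
    if follows_convention "$name"; then
        echo "$name: follows the convention"
    else
        echo "$name: does not follow the convention"
    fi
done
```

A name that fails the check, such as english, can still be used as MODEL_NAME; the check only flags deviations from the convention.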

Provide ground truth

Place ground truth consisting of line images and transcriptions in the folder data/MODEL_NAME-ground-truth. This list of files will be split into training and evaluation data; the ratio is defined by the RATIO_TRAIN variable.
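
As a rough illustration of how RATIO_TRAIN divides the data, the sketch below computes the split sizes for a given number of lines. The helper split_counts is hypothetical, and truncating the product to an integer is an assumption; the rounding used by the Makefile may differ.

```shell
# Sketch: number of training and evaluation lines for a given ratio.
# split_counts is a hypothetical helper, not part of tesstrain.
split_counts() {
    total=$1   # total number of ground truth lines
    ratio=$2   # value of RATIO_TRAIN, e.g. 0.90
    # Truncate total * ratio to an integer (assumed rounding mode); the
    # small epsilon absorbs floating-point error in the product.
    train=$(awk -v t="$total" -v r="$ratio" \
        'BEGIN { printf "%d", int(t * r + 1e-9) }')
    eval_count=$((total - train))
    echo "$train $eval_count"
}

split_counts 1000 0.90   # prints "900 100"
```

With the default ratio of 0.90, 1000 ground truth lines would therefore yield 900 training lines and 100 evaluation lines.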

Images must be either TIFF with the extension .tif, or PNG with the extension .png, .bin.png or .nrm.png.

Transcriptions must be single-line plain text and have the same name as the line image but with the image extension replaced by .gt.txt.
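
Before training, it can help to verify that every line image has a transcription. The sketch below lists images without a matching .gt.txt file; missing_transcriptions is a hypothetical helper that is not part of tesstrain, and it only checks file names, not the content of the transcriptions.

```shell
# Sketch: list line images that lack a matching .gt.txt transcription.
# missing_transcriptions is a hypothetical helper, not part of tesstrain.
missing_transcriptions() {
    dir=$1
    for img in "$dir"/*.tif "$dir"/*.png; do
        [ -e "$img" ] || continue   # skip glob patterns that matched nothing
        base=$img
        # Strip the image extension, longest variants first.
        case $base in
            *.bin.png) base=${base%.bin.png} ;;
            *.nrm.png) base=${base%.nrm.png} ;;
            *.png)     base=${base%.png} ;;
            *.tif)     base=${base%.tif} ;;
        esac
        [ -f "$base.gt.txt" ] || echo "$img"
    done
}

# Example call for the default model name foo:
missing_transcriptions data/foo-ground-truth
```

An empty output means that every image found in the directory has a transcription next to it.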

The repository contains a ZIP archive with sample ground truth, see ocrd-testset.zip. Extract it to ./data/foo-ground-truth and run make training.

NOTE: If you want to generate line images for transcription from a full page, see tips in issue 7 and in particular @Shreeshrii's shell script.

Train

  make training MODEL_NAME=name-of-the-resulting-model

which is basically a shortcut for

  make unicharset lists proto-model training

Run make help to see all the possible targets and variables:


  Targets

    unicharset       Create unicharset
    lists            Create lists of lstmf filenames for training and eval
    training         Start training
    traineddata      Create best and fast .traineddata files from each .checkpoint file
    proto-model      Build the proto model
    leptonica        Build leptonica
    tesseract        Build tesseract
    tesseract-langs  Download tesseract-langs
    clean            Clean all generated files

  Variables

    MODEL_NAME         Name of the model to be built. Default: foo
    START_MODEL        Name of the model to continue from. Default: ''
    PROTO_MODEL        Name of the proto model. Default: 'data/foo/foo.traineddata'
    CORES              No of cores to use for compiling leptonica/tesseract. Default: 4
    LEPTONICA_VERSION  Leptonica version. Default: 1.78.0
    TESSERACT_VERSION  Tesseract commit. Default: 4.1.1
    TESSDATA_REPO      Tesseract model repo to use. Default: _best
    TESSDATA           Path to the .traineddata directory to start finetuning from. Default: ./usr/share/tessdata
    GROUND_TRUTH_DIR   Ground truth directory. Default: data/MODEL_NAME-ground-truth
    OUTPUT_DIR         Output directory for generated files. Default: data/MODEL_NAME
    MAX_ITERATIONS     Max iterations. Default: 10000
    EPOCHS             Set max iterations based on the number of lines for training. Default: none
    LEARNING_RATE      Learning rate. Default: 0.0001 with START_MODEL, otherwise 0.002
    NET_SPEC           Network specification. Default: [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c\#\#\#]
    FINETUNE_TYPE      Finetune Training Type - Impact, Plus, Layer or blank. Default: ''
    LANG_TYPE          Language Type - Indic, RTL or blank. Default: ''
    PSM                Page segmentation mode. Default: 6
    RANDOM_SEED        Random seed for shuffling of the training data. Default: 0
    RATIO_TRAIN        Ratio of train / eval training data. Default: 0.90
    TARGET_ERROR_RATE  Stop training if the character error rate (CER in percent) gets below this value. Default: 0.01

Make model files (traineddata)

When the training is finished, it will write a traineddata file which can be used for text recognition with Tesseract. Note that this file does not include a dictionary. The tesseract executable therefore prints a warning.

It is also possible to create additional traineddata files from intermediate training results (the so-called checkpoints). This can even be done while the training is still running. Example:

# Add MODEL_NAME and OUTPUT_DIR like for the training.
make traineddata

This will create two directories tessdata_best and tessdata_fast in OUTPUT_DIR with a best (double based) and fast (int based) model for each checkpoint.

It is also possible to create models for selected checkpoints only. Examples:

# Make traineddata for the checkpoint files of the last three weeks.
make traineddata CHECKPOINT_FILES="$(find data/foo -name '*.checkpoint' -mtime -21)"

# Make traineddata for the last two checkpoint files.
make traineddata CHECKPOINT_FILES="$(ls -t data/foo/checkpoints/*.checkpoint | head -2)"

# Make traineddata for all checkpoint files with CER better than 1 %.
make traineddata CHECKPOINT_FILES="$(ls data/foo/checkpoints/*[^1-9]0.*.checkpoint)"

Add MODEL_NAME and OUTPUT_DIR and replace data/foo by the output directory if needed.
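
The glob in the last example only covers the 1 % case. For an arbitrary threshold, the CER can be compared numerically, as in the sketch below. It assumes that checkpoint file names end in _CER_ITERATION_ITERATION.checkpoint (for example foo_0.543_1234_5600.checkpoint); checkpoints_below is a hypothetical helper that is not part of tesstrain.

```shell
# Sketch: print checkpoint files whose CER (taken from the file name) is
# below a limit. Assumes names like foo_0.543_1234_5600.checkpoint.
# checkpoints_below is a hypothetical helper, not part of tesstrain.
checkpoints_below() {
    dir=$1
    limit=$2   # CER limit in percent, e.g. 1.0
    for f in "$dir"/*.checkpoint; do
        [ -e "$f" ] || continue   # skip the pattern if nothing matched
        name=${f##*/}
        name=${name%.checkpoint}
        # The CER is the third field from the end, so model names that
        # contain underscores (e.g. chi_tra_vert) are handled as well.
        printf '%s\n' "$name" |
            awk -F_ -v limit="$limit" -v file="$f" \
                'NF >= 4 && $(NF-2) + 0 < limit + 0 { print file }'
    done
}

checkpoints_below data/foo/checkpoints 1.0
```

The result can be passed on in the same way as in the examples above, for instance with make traineddata CHECKPOINT_FILES="$(checkpoints_below data/foo/checkpoints 0.5)".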

Plotting CER (experimental)

Training and evaluation CER can be plotted using matplotlib. A couple of scripts are provided in the plot subdirectory as a starting point for plotting different training scenarios. The training log is expected to be saved in plot/TESSTRAIN.LOG.

As an example, use the training data provided in ocrd-testset.zip to run a training and generate the plots. Plotting can also be done while training is still running, to show the training status up to that point.

unzip ocrd-testset.zip -d data/ocrd-ground-truth
nohup make training MODEL_NAME=ocrd START_MODEL=frk TESSDATA=~/tessdata_best MAX_ITERATIONS=10000 > plot/TESSTRAIN.LOG &
cd ./plot
./plot_cer.sh 

License

Software is provided under the terms of the Apache 2.0 license.

Sample training data provided by Deutsches Textarchiv is in the public domain.
