
tesseract-ocr / Tessdata_fast

Licence: apache-2.0

Fast integer versions of trained LSTM models


tessdata_fast – Fast integer versions of trained models

This repository contains fast integer versions of trained models for the Tesseract Open Source OCR Engine.

These models only work with the LSTM OCR engine of Tesseract 4.

These models are a speed/accuracy compromise: they use the network configuration that offered the best "value for money" in speed versus accuracy.
For some languages these are still the most accurate models available, but for most they are not.
The "best value for money" network configuration was then integerized for further speed.
Most users will want to use these traineddata files for OCR, and they will be shipped as part of Linux distributions, e.g. Ubuntu 18.04.
Fine-tuning or incremental training is NOT possible from these fast models, because they are 8-bit integer models.
When using the models in this repository, only the new LSTM-based OCR engine is supported. The legacy Tesseract engine is not supported with these files, so Tesseract's OEM modes 0 and 2 will not work with them.
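
As a rough illustration of the engine-mode restriction, the sketch below builds a command line for the tesseract program and refuses the two OEM values that depend on the legacy engine. The helper name build_tesseract_command, the file names, and the tessdata_fast directory are hypothetical; only the tesseract options -l, --oem and --tessdata-dir are taken from the real command line program.

```python
from pathlib import Path

# OEM values that rely on the legacy engine, which these files do not support:
# 0 = legacy engine only, 2 = legacy + LSTM combined.
UNSUPPORTED_OEM = {0, 2}


def build_tesseract_command(image, output_base, lang="eng", tessdata_dir=None, oem=1):
    """Return an argv list for the tesseract CLI (hypothetical helper).

    oem=1 selects the LSTM engine only, which is what tessdata_fast provides.
    """
    if oem in UNSUPPORTED_OEM:
        raise ValueError(f"--oem {oem} requires legacy data not present in tessdata_fast")
    cmd = ["tesseract", str(image), str(output_base), "-l", lang, "--oem", str(oem)]
    if tessdata_dir is not None:
        # Point tesseract at a local checkout of the traineddata files.
        cmd += ["--tessdata-dir", str(Path(tessdata_dir))]
    return cmd
```

The returned list can be passed to subprocess.run on a machine where Tesseract 4 is installed; the sketch itself only assembles the arguments and does not run OCR.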

Two types of models

The repository contains two types of models:

those for a single language, and
those for a single script supporting one or more languages.

Most of the script models include English training data in addition to the script itself. The Cyrillic model does not, because adding English would cause a major ambiguity problem.

On Debian and Ubuntu, the language-based traineddata packages are named tesseract-ocr-LANG, where LANG is the three-letter language code, e.g. tesseract-ocr-eng (English language), tesseract-ocr-hin (Hindi language), etc.

On Debian and Ubuntu, the script-based traineddata packages are named tesseract-ocr-script-SCRIPT, where SCRIPT is the four-letter script code, e.g. tesseract-ocr-script-latn (Latin script), tesseract-ocr-script-deva (Devanagari script), etc.
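
The naming scheme above can be sketched as a small lookup. The function name debian_package_name is hypothetical, and the sketch only covers the plain three-letter and four-letter codes described here; any other code form is rejected rather than guessed.

```python
def debian_package_name(code):
    """Map a language or script code to its Debian/Ubuntu package name.

    Assumes a three-letter code is a language and a four-letter code is a script,
    as in the naming scheme described above.
    """
    if not code.isalpha():
        raise ValueError(f"unsupported code: {code!r}")
    if len(code) == 3:
        return f"tesseract-ocr-{code.lower()}"
    if len(code) == 4:
        return f"tesseract-ocr-script-{code.lower()}"
    raise ValueError(f"expected a 3- or 4-letter code, got {code!r}")
```

For example, debian_package_name("hin") gives tesseract-ocr-hin, and debian_package_name("Deva") gives tesseract-ocr-script-deva.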

Data files for a particular script

An initial capital in the filename indicates the single model for all languages in that script. These models are now available under the script subdirectory.

Latin covers all Latin-based languages except Vietnamese (vie).
Vietnamese is for the Latin-based Vietnamese language.
Fraktur is basically a combination of all the Latin-based languages that have an 'old' variant.
Devanagari covers hin+san+mar+nep+eng.
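
The filename convention can be expressed as a short sketch that tells script models from language models by the initial capital and locates the file accordingly. The helper names are hypothetical. The sketch also assumes the script models are selected with a language argument of the form script/NAME, which mirrors the subdirectory layout; check this against your installed Tesseract version.

```python
from pathlib import PurePosixPath


def is_script_model(name):
    """Script models start with a capital letter (e.g. Latin); languages do not (e.g. vie)."""
    if not name:
        raise ValueError("model name must not be empty")
    return name[0].isupper()


def traineddata_path(name):
    """Return the file path relative to the repository root."""
    filename = f"{name}.traineddata"
    if is_script_model(name):
        return PurePosixPath("script") / filename
    return PurePosixPath(filename)


def lang_argument(name):
    """Return the value to pass to tesseract's -l option (assumed form for script models)."""
    return f"script/{name}" if is_script_model(name) else name
```

With these helpers, traineddata_path("Devanagari") points into the script subdirectory, while traineddata_path("vie") stays at the top level.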

LSTM training details for different languages and scripts

For Latin-based languages, the existing model data has been trained on about 400000 textlines spanning about 4500 fonts. For other scripts, far fewer fonts are available, but the models have still been trained on a similar number of textlines: for example, Latin has about 4500 fonts, Devanagari about 50, and Kannada 15.

On the theory that poor accuracy on test data and over-fitting on training data were caused by the lack of fonts, the training data has been mixed with English, so that some of the font diversity might generalize to the other script. The overall effect was slightly positive, hence the script models also include English.
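
To make the font-diversity gap concrete, the following sketch computes the approximate number of textlines per font for the three scripts mentioned above. It assumes exactly 400000 textlines for every script, which is a simplification of "a similar number"; the variable names are illustrative only.

```python
# Approximate figures quoted above; the equal textline count is an assumption.
TEXTLINES = 400_000
FONTS = {"Latin": 4500, "Devanagari": 50, "Kannada": 15}


def lines_per_font(textlines, fonts):
    """Average number of training textlines rendered in each font."""
    return textlines / fonts


ratios = {script: round(lines_per_font(TEXTLINES, n)) for script, n in FONTS.items()}
for script, value in ratios.items():
    print(f"{script}: about {value} textlines per font")
```

Under these assumptions each Latin font contributes on the order of a hundred textlines, while each Devanagari or Kannada font contributes thousands, which fits the over-fitting concern described above.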

Example - jpn and Japanese

'jpn' is trained on whatever text on the web is labelled as Japanese, using only fonts that can render Japanese.

Japanese contains all the languages that use that script (in this case just the one) PLUS English. The resulting model is trained with a mix of both training sets, with the expectation that some of the generalization from the roughly 4500 English training fonts will also apply to the other script, which has far fewer fonts.

'jpn_vert' is trained on text rendered vertically (but the image is rotated so the long edge is still horizontal).

'jpn' loads 'jpn_vert' as a secondary language so it can try it in case the text is rendered vertically. This seems to work most of the time as a reasonable solution.

See the Tesseract wiki for additional information.

All data in the repository are licensed under the Apache-2.0 License, see file LICENSE.
