All Projects → karisigurd4 → DocumentLab

karisigurd4 / DocumentLab

Licence: GPL-3.0 license
OCR using tesseract, ImageMagick, EmguCV, an advanced query language and a fluent query interface for C#

Programming Languages

C#
18002 projects
javascript
184084 projects - #8 most used programming language
HTML
75241 projects
CSS
56736 projects
powershell
5483 projects
ANTLR
299 projects

Projects that are alternatives of or similar to DocumentLab

Android Ocr
Experimental optical character recognition app
Stars: ✭ 2,177 (+3301.56%)
Mutual labels:  ocr, optical-character-recognition
Document-Scanner-and-OCR
A simple document scanner with OCR implemented using Python and OpenCV
Stars: ✭ 31 (-51.56%)
Mutual labels:  ocr, optical-character-recognition
Receipt Scanner
Receipt scanner extracts information from your PDF or image receipts - built in NodeJS
Stars: ✭ 190 (+196.88%)
Mutual labels:  ocr, optical-character-recognition
Tesseract4android
Fork of tess-two rewritten from scratch to support latest version of Tesseract OCR.
Stars: ✭ 148 (+131.25%)
Mutual labels:  ocr, optical-character-recognition
ImageToText
OCR with Google's AI technology (Cloud Vision API)
Stars: ✭ 30 (-53.12%)
Mutual labels:  ocr, optical-character-recognition
Ocr Table
Extract tables from scanned image PDFs using Optical Character Recognition.
Stars: ✭ 165 (+157.81%)
Mutual labels:  ocr, optical-character-recognition
Image2text
📋 Python wrapper to grab text from images and save as text files using Tesseract Engine
Stars: ✭ 243 (+279.69%)
Mutual labels:  ocr, optical-character-recognition
Penteract Ocr
⭐️ The native node.js bindings to the Tesseract OCR project.
Stars: ✭ 86 (+34.38%)
Mutual labels:  ocr, optical-character-recognition
blinkid-in-browser
BlinkID In-browser SDK for WebAssembly-enabled browsers.
Stars: ✭ 40 (-37.5%)
Mutual labels:  ocr, optical-character-recognition
jochre
Java Optical CHaracter Recognition
Stars: ✭ 18 (-71.87%)
Mutual labels:  ocr, optical-character-recognition
Easyocr
Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.
Stars: ✭ 13,379 (+20804.69%)
Mutual labels:  ocr, optical-character-recognition
vrpdr
Deep Learning Applied To Vehicle Registration Plate Detection and Recognition in PyTorch.
Stars: ✭ 36 (-43.75%)
Mutual labels:  ocr, optical-character-recognition
Ssocr
Seven Segment Optical Character Recognition
Stars: ✭ 133 (+107.81%)
Mutual labels:  ocr, optical-character-recognition
Swiftytesseract
A Swift wrapper around Tesseract for use in iOS, macOS, and Linux applications
Stars: ✭ 170 (+165.63%)
Mutual labels:  ocr, optical-character-recognition
Tesserocr
A Python wrapper for the tesseract-ocr API
Stars: ✭ 1,567 (+2348.44%)
Mutual labels:  ocr, optical-character-recognition
Signature extractor
A super lightweight image processing algorithm for detection and extraction of overlapped handwritten signatures on scanned documents using OpenCV and scikit-image.
Stars: ✭ 205 (+220.31%)
Mutual labels:  ocr, optical-character-recognition
Eyevis
Android based Vocal Vision for Visually Impaired. Object Detection, Voice Assistance, Optical Character Reader, Read Aloud, Face Recognition, Landmark Recognition, Image Labelling etc.
Stars: ✭ 48 (-25%)
Mutual labels:  ocr, optical-character-recognition
Swiftytesseractrte
SwiftyTesseract Real-Time Engine
Stars: ✭ 49 (-23.44%)
Mutual labels:  ocr, optical-character-recognition
doctr
docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.
Stars: ✭ 1,409 (+2101.56%)
Mutual labels:  ocr, optical-character-recognition
mirador-textoverlay
Text Overlay plugin for Mirador 3
Stars: ✭ 35 (-45.31%)
Mutual labels:  ocr, optical-character-recognition

Optical character recognition, image processing and intelligent data extraction

NuGet version (DocumentLab-x64) License) Platform

Comming soon: Realtime script editor and visualization tool


This is a solution for data extraction from images of documents. You send in a bitmap, a set of queries and get extracted data in structured json.

  • DocumentLab takes care of image analysis and processing, optical character recognition, text classification
  • It provides a query language created specifically for document data extraction
    • Defining documents by queries, patterns and tables allows DocumentLab to understand documents intelligently
    • Intelligently meaning that the solution has the capacity to extract data from documents with layouts it has never seen before
    • A C# interface to the query language is also supported
  • Extensive configuration for all parts of the process are available to facilitate your requirements


Getting started

  • Important: DocumentLab has been optimized to expect a certain size range when it comes to analysing text from images. Which means that if the picture you pass in has text that is too big or too small (pixel wise) then you will not get optimal results. You can use the fakeinvoice.png from the example above for a scale reference if you need to resize your input data.

  • Important: A drawback of using the prebuilt binary application is that on each execution DocumentLab needs to load configuration files and initialize a tesseract engine for each thread. This initialization overhead might take 3x the amount of time DocumentLab would otherwise need to scan a single page. It's in the plans to provide a better prebuilt binary application for integration.

You can download the prebuilt binary and get started immediately. Parameter usage: (Picture file path) (Query file path) ((Optional) Output file path)

The (Query file path) parameter needs to be a text file containing a set of queries that DocumentLab can perform on the picture provided alongside as an argument. This still requires investigating the DocumentLab query language. The script used in the example above was the same as was used for the topmost picture with the fake invoice.

... Or if you want to integrate with code,

Quickly

// To use the scripting interface
var documentLab = new DocumentInterpreter();
var interpretedJsonResult = documentLab.InterpretToJson(script, (Bitmap)Image.FromFile(imagePath));
// To use the FluentQuery API
using (var document = new Document((Bitmap)Image.FromFile(imagePath))) 
{
 ...

You can write scripts in the query language or use the C# API. The C# fluent interface is easier to get started quickly. The raw text scripting interface allows more versatility and configurability in a production context.

Definitions

  • Pattern: A description of how information is presented in a document as well as which data to capture
    • e.g: Text(Total amount) Right [Amount]
  • Table: A description of which table column labels to match in a document and which text types are represented in each column
    • e.g: Table 'ItemNo': [Number(Item number)] 'Description': [Text(Description)] 'Price': [Amount(...
  • Query: A named set of patterns prioritized first to last
    • e.g: IncoiceNumber: *pattern 1*; *pattern 2*; ... *pattern n*;
  • Script: A collection of queries to execute in one go. Output properties will have the query name

From here...

Quick examples

Below are a few select examples with comments on how DocumentLab can be used.

C# Fluent Query Example

using (var dl = new Document((Bitmap)Image.FromFile("pathToSomeImage.png")))
{
  // Here we ask DocumentLab to specifically find a date value for the specified possible labels
  string dueDate = dl.Query().FindValueForLabel(TextType.Date, "Due date", "Payment date");

  // Here we ask DocumentLab to specifically find a date value for the specified label in a specific direction 
  string customerNumber = dl.Query().GetValueForLabel(Direction.Right, "Customer number");

  // We can build patterns using predicates, directions and capture operations that return the value matched in the document
  // Patterns allow us to recognize and capture data by contextual information, i.e., how we'd read for example receiver information from an invoice
  string receiverName = dl
    .Query()
    .Match("PostCode") // Text classification using contextual data files can be referenced by string
    .Up()
    .Match("Town")
    .Up()
    .Match("City")
    .Up()
    .Capture(TextType.Text); // All text type operations can also use the statically defined text type enum
} 

Script example

  • Your document has a label "Customer number:" and a value to the right of it
    • Query: CustomerNumber: Text(Customer number) Right [Text];
    • Match text labels with implicit starts with and Levensthein distance 2 comparison
  • Your document has a label "Invoice date" and a date below it
    • Query: InvoiceDate: Text(Invoice date) Down [Date];
    • You can capture a variety of text types. Even if the document contains additional text at the capture you'll only get back a standardized ISO date.
  • Want to capture invoice receiver info in one query?
    • Query: Receiver: 'Name': [Text] Down 'Address': [StreetAddress] Down 'City': [Town] Down 'PostalCode': [PostalCode];
    • Json output will name properties according to the query predicate naming parameters
  • You want to capture all amounts in a document?
    • Query: AllAmounts: Any [Amount];
    • When we use any, results are returned in a json array

Documentation

The case with any OCR process is that the quality of the output depends entirely on the quality of source image you pass into it. If the original image quality is very low then you can expect very low quality OCR results.

Depending on your image source, you may want to upsample low dpi images to a range between 220 - 300. This often also becomes a question of quality vs. time in terms of execution time and should be adjusted to your requirements.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].