doxakis / form-segmentation

Licence: MIT license
Programming language: Jupyter Notebook

Form Segmentation

Let's explore how we can extract text from any form or scanned page.

Objectives

The goal is to find an algorithm that extracts as much information as possible from a given page (JPG format), so that the result can be fed to another system (business logic, a neural network, a classifier, etc.). The overall process does not have to be perfect, but it should recover enough information to identify the type of document and the identities involved.

  • Parse any form or scanned page and extract all text data (printed and handwritten), with no prior knowledge of the document's layout or structure.

  • Fully automatic extraction (no human interaction, so it can scale out).

  • Reasonably fast, or able to speed up with more machines or CPUs.
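
The scale-out objective can be illustrated with a minimal sketch that spreads pages across CPU cores using Python's standard `concurrent.futures` module. This is only an illustration under assumptions: `extract_text` is a hypothetical placeholder for whatever per-page extraction step is eventually chosen, and `process_pages` is a name introduced here, not part of the project.

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Callable, Iterable, List, Optional


def extract_text(page_path: str) -> str:
    """Hypothetical placeholder for the real per-page extraction step."""
    return f"text extracted from {page_path}"


def process_pages(
    page_paths: Iterable[str],
    worker: Callable[[str], str] = extract_text,
    executor: Optional[Executor] = None,
) -> List[str]:
    """Run `worker` on every page and return the results in input order."""
    paths = list(page_paths)
    if executor is not None:
        # Caller-supplied executor (thread pool, process pool, ...).
        return list(executor.map(worker, paths))
    # Default: a pool of worker processes, since OCR-style work is CPU-bound.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(worker, paths))
```

Because each page is processed independently, the same function could later be pointed at an executor that distributes work to several machines. When relying on the default process pool, call `process_pages` from code guarded by `if __name__ == "__main__":`.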

Challenges

There are many challenges to overcome, but the main problem is identifying which parts of the form contain text.
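
One classic way to approach this is a horizontal projection profile: count the ink pixels in each row and group consecutive non-empty rows into bands. The sketch below is not the method used by this project. It assumes the page has already been binarized (1 = ink, 0 = background) and deskewed, and the function name `find_text_bands` and its parameters are illustrative.

```python
from typing import List, Sequence, Tuple


def find_text_bands(
    binary: Sequence[Sequence[int]],
    min_ink: int = 1,
    min_height: int = 2,
) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges that contain ink; bottom is exclusive."""
    bands = []
    start = None  # first row of the band currently being tracked
    for y, row in enumerate(binary):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            # The band just ended; keep it only if it is tall enough.
            if y - start >= min_height:
                bands.append((start, y))
            start = None
    # Close a band that runs to the bottom edge of the page.
    if start is not None and len(binary) - start >= min_height:
        bands.append((start, len(binary)))
    return bands


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],  # one-row speck, dropped by min_height=2
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0],
]
print(find_text_bands(page))  # [(1, 3), (6, 8)]
```

Real forms also contain ruled lines, boxes, and skewed scans, so a row profile alone is unlikely to be enough. It is shown here only to make the "where is the text" problem concrete.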

Some other challenges:

  • Black border removal
  • ICR (Intelligent Character Recognition): recognizing hand-drawn characters and converting them into text
  • Scanned pages: detecting the edges and applying a perspective transform to obtain a top-down view of the document
  • Noise removal (blur, Otsu thresholding, adaptive thresholding with OpenCV)
  • Shape detection and extraction
  • OCR (not a real issue, since Tesseract 4 works well for printed text)
  • Handwriting recognition
  • Minimizing errors
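
To make the Otsu step in the noise-removal item concrete, here is a dependency-free sketch of what Otsu's method computes: the gray level that best separates dark ink from light paper by maximizing the between-class variance. In practice OpenCV provides this through `cv2.threshold` with the `cv2.THRESH_OTSU` flag. The function names `otsu_threshold` and `binarize` below are illustrative, and 8-bit grayscale input is assumed.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the gray level t (0-255) that maximizes between-class variance.

    Pixels with value <= t form one class, pixels > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_low = 0    # number of pixels in the "<= t" class
    sum_low = 0  # sum of gray levels in the "<= t" class
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        if w_low == 0:
            continue  # no pixels this dark yet
        w_high = total - w_low
        if w_high == 0:
            break  # every pixel is already in the low class
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        between_var = w_low * w_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(image: Sequence[Sequence[int]]) -> List[List[int]]:
    """Map a grayscale image to 1 (dark ink) / 0 (light background)."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]
```

A single global threshold like this tends to struggle on unevenly lit scans, which is presumably why adaptive thresholding appears in the same list item.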