zmr / namsel

Licence: MIT License
An OCR application focused on machine-print Tibetan text

NOTE: THIS PROJECT IS NO LONGER BEING DEVELOPED. If you are looking for a replacement, Google's Cloud Vision API provides fairly accurate OCR on u-chen texts.

Namsel OCR

An OCR application focused on machine-print Tibetan text

Tested only on Ubuntu 14.04 and higher.

An overview of the Namsel project can be found in our article in the journal Himalayan Linguistics.

Check out our library partner for already OCR'd digital text: http://tbrc.org.

Install:

$ bash ubuntu_install.sh

This will install the required packages, build the Cython modules, unpack the datasets, and start training the classifiers. Note that training (classify.py) can take an hour or more to complete.

Quickstart

To start, run preprocessing on a folder of images:

$ ./namsel.py preprocess ~/myfolder

This will save new, preprocessed images in ~/myfolder/out. Preprocessing will take a few minutes, depending on how many CPUs you have available and how many images are in the folder.

Next, run OCR. If the pages are "book"-style (rather than "pecha"-style) pages, do the following:

$ ./namsel.py recognize-volume --page_type=book --format=text ~/myfolder/out

To OCR a single page, use the recognize-page command and specify a single page file:

$ ./namsel.py recognize-page --page_type=book --format=text ~/myfolder/out/image-01.tif

OCR will run and save the results in a file called ocr_output.txt.
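
If you would rather drive the two Quickstart steps from a script, the following is a minimal sketch and not part of Namsel itself. It assumes namsel.py sits in the current directory and that preprocessing writes to an "out" subfolder as described above. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def quickstart_commands(folder, page_type="book", namsel="./namsel.py"):
    """Build the two Quickstart commands as argument lists (nothing is run)."""
    # subprocess does not expand "~", so expand it here.
    folder = Path(folder).expanduser()
    # Preprocessed images are written to an "out" subfolder, per the Quickstart.
    out_dir = folder / "out"
    preprocess = [namsel, "preprocess", str(folder)]
    recognize = [
        namsel,
        "recognize-volume",
        f"--page_type={page_type}",
        "--format=text",
        str(out_dir),
    ]
    return [preprocess, recognize]


def run_quickstart(folder, page_type="book"):
    """Run preprocessing, then OCR; stop at the first failing step."""
    for cmd in quickstart_commands(folder, page_type):
        subprocess.run(cmd, check=True)
```

Calling run_quickstart("~/myfolder") would run both steps in order. Keeping command construction separate from execution makes it possible to inspect the commands before anything is launched.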

Preprocessing

Prior to using Namsel, documents in the form of PDFs or images need to be preprocessed. Preprocessing typically involves cleaning up images and putting them in a format Namsel expects.

Scanning documents

If you are scanning the documents to be OCR'd yourself, here are some tips for improving chances of getting high quality OCR results:

  • Scan in black and white
  • Scan at a relatively high resolution (400-600 dpi)
  • Utilize the scanner's software to align and crop the pages. If your scanner software supports it, deskewing the page and removing empty borders and images can save time later on in the OCR process. (Scantailor, which is mentioned below, won't remove images, but will deskew (rotate) the page and remove empty borders).
  • Save images in TIFF format with sensible names (e.g. a sequence of numbers: 001.tif, 002.tif, etc.)
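
The naming tip can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and it assumes that sorting the existing filenames alphabetically reflects the page order, which may not hold for names like scan2.tif and scan10.tif.

```python
from pathlib import PurePath


def sequential_names(filenames, start=1, width=3):
    """Map scan filenames to zero-padded sequential TIFF names (001.tif, ...)."""
    # Keep only TIFF files and sort them case-insensitively by name.
    scans = sorted(
        (name for name in filenames
         if PurePath(name).suffix.lower() in {".tif", ".tiff"}),
        key=str.lower,
    )
    # Number the scans in sorted order, padding with zeros to `width` digits.
    return {
        old: f"{number:0{width}d}.tif"
        for number, old in enumerate(scans, start=start)
    }
```

The function only returns a plan (old name to new name). Check that no new name collides with an existing file before renaming anything on disk.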

Preparing images from PDFs

If your original document is in PDF format, you will need to convert the individual PDF pages to black and white or grayscale jpg, tif, or png images. Black and white tif images are Namsel's preferred format.

There are a variety of tools for converting a PDF to images. The "gs" command from the Ghostscript project is one.

You can invoke gs itself like so:

$ gs -r600x600 -sDEVICE=tiffg4 -sOutputFile=ocr_%04d.tif -dBATCH -dNOPAUSE mytibetanfile.pdf

This will convert all the pages in the PDF to TIFF using "Group 4" compression, which is the most compact form of TIFF compression for black and white images. If your PDF is in grayscale, replace "tiffg4" with "tiffgray." Images will be written to the directory where gs is run unless you specify otherwise.
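
For batch jobs, the same gs invocation can be assembled from Python. This is a sketch under the assumption that gs is on your PATH. The function names and the "bw"/"gray" mode labels are hypothetical; the gs flags are exactly those shown above.

```python
import subprocess

# Ghostscript output devices named in the text above.
DEVICES = {"bw": "tiffg4", "gray": "tiffgray"}


def gs_command(pdf_path, mode="bw", dpi=600, pattern="ocr_%04d.tif"):
    """Build the gs argument list for converting a PDF to per-page TIFFs."""
    if mode not in DEVICES:
        raise ValueError(f"mode must be one of {sorted(DEVICES)}")
    return [
        "gs",
        f"-r{dpi}x{dpi}",
        f"-sDEVICE={DEVICES[mode]}",
        f"-sOutputFile={pattern}",
        "-dBATCH",
        "-dNOPAUSE",
        str(pdf_path),
    ]


def convert_pdf(pdf_path, **options):
    """Run Ghostscript; raises CalledProcessError if gs fails."""
    subprocess.run(gs_command(pdf_path, **options), check=True)
```

For example, convert_pdf("mytibetanfile.pdf", mode="gray") would use the tiffgray device instead of tiffg4.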

Preparing images using scantailor

Scantailor is an open source project for cleaning up scanned documents and preparing them for OCR. Essentially, it performs 5 core operations:

  • Page splitting
  • Deskewing
  • Content isolation
  • Noise removal
  • Thresholding (making characters thinner or more bold)

Optionally, it has tools for page dewarping and manual erasing of image content. While not the only tool for image preprocessing, it typically delivers very good results and is easy to use. (A popular alternative is a project called unpaper).

Scantailor has both a graphical and command line interface. The graphical interface is straightforward to use so we won't describe it here. For faster processing on multicore computers, it is ideal to use the command line version of scantailor in order to process pages in parallel. Namsel comes with a utility for batch, multicore processing with Scantailor called "scantailor_multicore.py."

Example:

python scantailor_multicore.py <my-image-folder> [threshold (optional)]

<my-image-folder> is a path to a directory containing tif or jpg images. Threshold controls how bold or thin to make strokes on the page. A threshold higher than 0 makes text bolder or thicker. A threshold less than 0 makes it thinner. Poorly inked prints sometimes benefit from a threshold of 10 to 30. Low to medium resolution images converted to grayscale from color scans may benefit from thinning, with a threshold of -10 to -40.

For example

$ python scantailor_multicore.py my-image-folder -20

...generates a folder called "out" with images that have been cropped, cleaned, thinned, and deskewed by Scantailor.
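
The threshold convention above can be captured in a small helper. This is a sketch with hypothetical function names. It assumes the threshold is an integer and that a threshold of 0 leaves stroke weight unchanged, which the text implies but does not state.

```python
def scantailor_command(image_folder, threshold=None):
    """Build the scantailor_multicore.py call shown above (nothing is run)."""
    cmd = ["python", "scantailor_multicore.py", str(image_folder)]
    if threshold is not None:
        # Assumption: the threshold is passed as a plain integer, e.g. -20.
        cmd.append(str(int(threshold)))
    return cmd


def threshold_effect(threshold):
    """Describe what a threshold does to strokes, per the text above."""
    if threshold > 0:
        return "thicken"
    if threshold < 0:
        return "thin"
    return "unchanged"  # assumption: 0 leaves stroke weight as-is
```

Here scantailor_command("my-image-folder", -20) reproduces the example command, and threshold_effect(-20) reports that the strokes will be thinned.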

Alternatively, you can choose to run Scantailor from the Namsel command line. See the section Preprocessing options below.

Converting TIFF to Group 4 compression

By default, Scantailor saves images in tiff format using the "lzw" compression format. This format is fine for grayscale and color images, but is unnecessary for black and white images. For black and white tif, convert images to Group 4 (G4) compression if possible. Using the tiffcp utility (part of the libtiff library), convert an entire folder of tiff images like so:

$ mkdir g4
$ for t in *.tif; do tiffcp -c g4 "$t" "g4/$t"; done

While it is not necessary to use Group 4 (G4) compression, for projects processing thousands of images, G4 format can greatly decrease the amount of disk storage required for images.
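
The same conversion can be scripted in Python, which may be more convenient when the folder is processed as part of a larger pipeline. This is a sketch that assumes tiffcp is installed and on your PATH. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def g4_commands(folder):
    """Return one tiffcp argument list per .tif file in `folder`."""
    folder = Path(folder)
    out_dir = folder / "g4"
    # Sorted so files are converted in a predictable order.
    return [
        ["tiffcp", "-c", "g4", str(tif), str(out_dir / tif.name)]
        for tif in sorted(folder.glob("*.tif"))
    ]


def convert_folder(folder):
    """Create the g4 subfolder and convert every TIFF into it."""
    (Path(folder) / "g4").mkdir(exist_ok=True)
    for cmd in g4_commands(folder):
        # check=True stops at the first file tiffcp cannot convert.
        subprocess.run(cmd, check=True)
```

Group 4 is a compression scheme for black and white images, so expect the conversion to fail on grayscale or color TIFFs.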

OCR

Preprocessing options

In addition to the OCR parameters described under "Namsel command line options," you can also set parameters for the Scantailor application (used with the "preprocess" command):

--layout

The layout of pages that Scantailor can expect. Choices are "single" and "double," referring to scanned images that have up to one or two pages on them.

--threshold

The amount of thinning or thickening Scantailor will do. Good values are between -40 and 40. Negative values thin the image, positive values thicken it.

Example command:

$ ./namsel preprocess --layout=double --st_threshold=-15 /path/to/my-folder-of-tiffs

This command runs Scantailor on a folder of TIFF images, splitting double pages and thinning the characters on each page.

Namsel command line options

To run Namsel, specify the action you'd like Namsel to take and point it to the image or images you would like processed. For example, to OCR a single tif image, run:

$ ./namsel recognize-page mytibetantextimage.tif

For an entire volume:

$ ./namsel recognize-volume folder-of-tiff-images

Other options are "preprocess" and "isolate-lines." Preprocessing is discussed above. "Isolate-lines" runs the Namsel pipeline only up to the line separation stage and outputs the segmented lines as tif images in a directory called "separated-lines" that is created within the parent directory. (TO BE IMPLEMENTED)

OCR quality can vary widely depending on the runtime configuration being used. Below is a list of tunable configuration parameters that Namsel uses.

--page_type

Choices "book" or "pecha." If not specified, Namsel will attempt to determine the page type based on the length and height of the page.
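
To illustrate what a dimension-based guess might look like, here is a toy sketch. It is not Namsel's actual heuristic: the function name and the wide_ratio cutoff are hypothetical and chosen only for illustration, relying on the fact that pecha pages are much wider than they are tall.

```python
def guess_page_type(width, height, wide_ratio=2.0):
    """Guess "pecha" for wide, short pages and "book" otherwise.

    wide_ratio is a hypothetical cutoff chosen for illustration only.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Pecha pages are long horizontal strips, so their aspect ratio is large.
    return "pecha" if width / height >= wide_ratio else "book"
```

If automatic detection picks the wrong type for your scans, passing --page_type explicitly avoids the guess altogether.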

--recognizer

This is the type of recognizer that is used. The options are "hmm" and "probout." Use "hmm" most of the time. If the text you are OCR'ing contains many unusual character combinations and/or many sections requiring complex segmentation, the "probout" recognizer may yield better results.

--line_break_method

Options are "line_cut" and "line_cluster." Namsel will attempt to choose for you if this is left unspecified. Generally, "line_cut" works well for book-style pages and "line_cluster" works well for pecha and book-style pages.

--break_width

The value that controls how horizontally-connected stacks will be segmented. A high value (e.g. 4.0) will perform almost no segmentation. A low value (e.g. 0.5) may severely over-segment characters. Typical good values are 2, 2.5, 3, or 3.5. (Note that some texts evade accurate segmentation, in which case there's no "Goldilocks" break width that will manage to accurately segment wide connected stacks while also avoiding over-segmentation.)

--segmenter

The type of segmentation strategy to use. "stochastic" is the default and is almost always the best option.

--low_ink

Default: False. Attempt to compensate for poorly inked texts, particularly cases where the glyphs aren't connected together as part of a single stroke.

--line_cluster_pos

Use with the "line_cluster" line break method. Choices are "top" or "center." Clustering to the center of a line is good for cases where vowels may erroneously get clustered to the line above (on account of being closer to it distance-wise).

--postprocess

Run a post-processing step. This is usually an attempt to insert missing tsek characters into the final results of an OCR run. This is highly experimental and can severely mangle otherwise accurate OCR.

--detect_o

Detect na-ro vowels prior to segmentation and remove them temporarily. This is useful in cases where the na-ro vowels are long and adversely inflating char-width measurements (measurements that are used to determine how and when to segment horizontally touching stacks).

--clear_hr

Identify and remove a horizontal rule or line at the top of a page. Set to True when you want to get rid of title or chapter lines that appear on each page. Note: use with caution if page_type is "pecha."

--line_cut_inflation

Rarely used. The number of iterations used when dilating text in line cut. Increase this value when you need to blob nearby components together. Default is 4 iterations.
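
Because OCR quality depends heavily on these settings, it can help to build commands programmatically and try several configurations. The sketch below is not part of Namsel. It assumes the --name=value flag form and the ./namsel.py entry point used in the Quickstart, validates only the options whose choices are listed above, and passes any other option through unchecked. The function names are hypothetical.

```python
# Choices documented in the option list above.
CHOICES = {
    "page_type": {"book", "pecha"},
    "recognizer": {"hmm", "probout"},
    "line_break_method": {"line_cut", "line_cluster"},
    "line_cluster_pos": {"top", "center"},
}


def recognize_command(target, action="recognize-volume",
                      namsel="./namsel.py", **options):
    """Build a recognize-page / recognize-volume argument list."""
    if action not in {"recognize-page", "recognize-volume"}:
        raise ValueError(f"unsupported action: {action!r}")
    flags = []
    # Sort option names so the generated command is deterministic.
    for name, value in sorted(options.items()):
        allowed = CHOICES.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(
                f"--{name} must be one of {sorted(allowed)}, got {value!r}"
            )
        flags.append(f"--{name}={value}")
    return [namsel, action, *flags, str(target)]


def break_width_sweep(target, widths=(2, 2.5, 3, 3.5), **options):
    """One command per typical break_width value, for side-by-side comparison."""
    return [recognize_command(target, break_width=w, **options) for w in widths]
```

Each command can be run with subprocess.run(cmd, check=True). Since the Quickstart says results are saved to ocr_output.txt, it is probably wise to move or rename that file between runs of a sweep so that earlier results are not overwritten.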

Optional: generate data yourself from fonts

This is strictly optional. The provided datasets already include these datapoints. These commands are also run automatically if you run ubuntu_install.sh.

Install fonts:

$ cd namsel-ocr
$ sudo apt-get install python-cairo
$ mkdir -p ~/.fonts
$ cp data_generation/fonts/*ttf ~/.fonts/
$ fc-cache -f -v

Generate the font-derived datasets:

$ cd data_generation
$ python font_draw.py

About the name

Namsel is an English rendering of the Tibetan རྣམ་གསལ (transliterated: rnam gsal). Literally translated, རྣམ can be thought of as "thorough" or "in detail" and གསལ can be thought of as "clear", "illuminated." Put together, Namsel can be taken to mean "making clear the details" or "thoroughly illuminating," which are both appropriate descriptions of what the Namsel project is trying to do. Practically speaking, the name was also chosen because it is reasonably easy for non-Tibetan speakers to read and pronounce correctly.
