Program List OCR

Table of Contents

1. What is this?
2. Disclaimer
3. How to use
4. Developer information
- 4.1. License

1. What is this?

Program List OCR is a piece of OCR (Optical Character Recognition) software specialized for computer program listings published in the 1980s.

It converts scanned program listing images into plain text. You can then convert this text into an emulator’s input file, e.g. a cassette tape image.

Program List OCR is a compilation of the following open source software:

Tesseract (OCR engine)
gImageReader (GUI frontend)

It also contains special OCR language model files:

BASIC (generic BASIC language)
N6X-BASIC (BASIC for NEC PC-6001 (Japanese))
Hexadecimal machine language

The BASIC (bas) model is for generic BASIC language listings. It recognizes printable ASCII characters.

The N6X-BASIC (n6x) model is dedicated to the NEC PC-6001. It recognizes ASCII and the PC-6001’s Japanese and graphical characters.

The hexadecimal machine language (hex) model recognizes only hexadecimal numbers and a few extra characters (0-9, A-F, Sum). Because the character set is so small, it achieves better accuracy.
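Since hexadecimal listings usually carry checksums, recognized rows can be sanity-checked after OCR. The following is a minimal Python sketch, not part of Program List OCR. It assumes a hypothetical row layout of "address: data bytes :checksum", where the checksum is the low 8 bits of the sum of the data bytes. Real listings differ between magazines, so adjust the parsing to your source. The function name check_row is made up for this example.

```python
def check_row(row: str) -> bool:
    """Return True if the last hex byte equals the low 8 bits of the sum of the data bytes.

    Assumed (hypothetical) layout: "C000: 3E 01 D3 A0 :B2"
    """
    tokens = row.replace(":", " ").split()
    try:
        # tokens[0] is the address; everything after it is data plus checksum.
        values = [int(t, 16) for t in tokens[1:]]
    except ValueError:
        # A non-hex token (e.g. the letter O misread for 0) fails the check.
        return False
    if len(values) < 2:
        return False
    *data, checksum = values
    return sum(data) % 256 == checksum


if __name__ == "__main__":
    for row in ["C000: 3E 01 D3 A0 :B2", "C000: 3E O1 D3 A0 :B2"]:
        print(row, "OK" if check_row(row) else "CHECK THIS ROW")
```

Rows that fail the check are good candidates for manual comparison against the scanned image.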

2. Disclaimer

OCR accuracy depends on the quality of the printing and scanning, the printer model used for the listing, and the fonts.

3. How to use

3.1. Install

Double click ProgramListOCRSetup….exe
Follow the instructions of the installer.

3.2. Start

Launch "Program List OCR" → "gImageReader" from the Start Menu.

3.3. Operating instructions

3.3.1. Scan images and preprocessing

Scan program listings with your document scanner. (Taking pictures with a camera is not recommended.)
The preferred image format is:

600dpi
grayscale
TIFF or high quality JPEG

You should deskew and normalize your images.
Scantailor is recommended for preprocessing.

For better accuracy, you can thicken the printed characters with GIMP.
Open the image in GIMP and apply "Filters" → "Generic" → "Erode".

Figure 1. Before "Erode"

Figure 2. After "Erode"

After that, it is recommended to convert the image to 1-bit (black and white), e.g. with GIMP’s "Colors" → "Threshold".
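To illustrate what these two preprocessing steps do to the pixels, here is a minimal pure-Python sketch. It is not GIMP's implementation: it assumes an 8-bit grayscale image stored as a list of rows (0 = black, 255 = white), approximates "Erode" with a 3x3 minimum filter, and uses a fixed cutoff of 128 for the threshold. The names erode and threshold and the cutoff value are choices made for this example.

```python
from typing import List

Image = List[List[int]]  # rows of 8-bit grayscale values, 0 = black, 255 = white


def erode(img: Image) -> Image:
    """Replace each pixel with the darkest value in its 3x3 neighborhood.

    On dark text over a light background this makes the strokes thicker.
    """
    height, width = len(img), len(img[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbors = [
                img[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(min(neighbors))
        out.append(row)
    return out


def threshold(img: Image, cutoff: int = 128) -> Image:
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[0 if p < cutoff else 255 for p in row] for row in img]


if __name__ == "__main__":
    # A single dark pixel on a white 3x3 background.
    page = [[255, 255, 255], [255, 40, 255], [255, 255, 255]]
    for row in threshold(erode(page)):
        print(row)
```

In the small example, the single dark pixel spreads to its neighbors after erosion, which is the same effect that thickens thin dot-matrix strokes before thresholding.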

3.3.2. Open images

Click the folder button on the left pane to open images.

Select the image(s) to recognize in the file selection dialog.

3.3.3. Select region and recognize

Warning

Do the following steps page by page.
Note: if you change pages before recognition, the selected regions will be cleared.

Drag the mouse to select the region to recognize.
You can add another region with Ctrl + mouse drag.

When you have finished selecting your regions, click "Recognize Selection" to run recognition.
"Recognize Selection" is a pull-down button, where you can also select the language.
To recognize a BASIC program listing, choose "bas". To recognize a hexadecimal program listing, e.g. MLX format, choose "hex".
Make sure the Language data locations option in the settings is set to System-wide paths.

Recognition takes a very long time.

3.3.4. Reformat text

When recognition is finished, the recognized text appears in the right pane.

Copy and paste the text to your favorite text editor.

At this point, line wrapping is not recognized, so a program line that wraps in the printed listing comes out as several separate lines.
You have to concatenate wrapped lines manually.
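For BASIC listings, part of this concatenation can be scripted. The following Python sketch is not part of Program List OCR. It assumes that every real program line starts with a line number followed by a space, and treats any other line as the continuation of the previous one. The name join_wrapped_lines is made up for this example. A continuation that happens to begin with digits and a space will be misdetected, and any space that fell exactly on the wrap point is lost, so review the result by hand.

```python
import re

# Assumption: a new BASIC line starts with a line number followed by whitespace.
LINE_NUMBER = re.compile(r"^\s*\d+\s")


def join_wrapped_lines(text: str) -> str:
    """Append lines without a leading line number to the previous line."""
    joined = []
    for line in text.splitlines():
        if not line.strip():
            continue  # drop blank lines left over from recognition
        if LINE_NUMBER.match(line) or not joined:
            joined.append(line.rstrip())
        else:
            # Continuation: glue it directly onto the previous line.
            joined[-1] += line.strip()
    return "\n".join(joined)


if __name__ == "__main__":
    recognized = '10 PRINT "HELLO, WOR\nLD"\n20 GOTO 10\n'
    print(join_wrapped_lines(recognized))
```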

3.3.5. Finish

The reformatted text can be used to create your emulator’s input, e.g. a cassette tape image file.
Enjoy!

4. Developer information

4.1. License

The licenses of the bundled software are as follows.

Tesseract

Apache License 2.0
https://github.com/tesseract-ocr/tesseract

gImageReader

GNU General Public License v3.0
https://github.com/manisandro/gImageReader

Scripts in this repository are modified versions of those from Tesseract and are licensed under the Apache License 2.0, the same as Tesseract.
