writecrow / ocr2text

Licence: MIT license
Convert a PDF via OCR to a TXT file in UTF-8 encoding

PDF to TXT (with OCR)

Given one or more PDFs that may include text-as-image content, use OCR (Optical Character Recognition) to convert the content to TXT files (in UTF-8 encoding).

Rationale

A survey of existing PDF-to-TXT solutions found none that meets all of the following criteria:

  • is an offline tool (to keep human-subject information secure)
  • provides conversion from PDF to TXT (most existing OCR integrations assume an image as input)
  • supports batch processing of multiple files

Assumptions

  • This is (currently) a command-line tool, written in Python. Basic familiarity with executing commands in a terminal, as well as directory structure, is assumed.
  • It is assumed that you have Python version 3.x installed, as well as pip.
  • This script relies on Tesseract, a widely used open-source OCR engine whose development was long sponsored by Google. Because Tesseract is written in C++, it must be installed separately before Python can use it (instructions below). Similarly, Poppler, a PDF-to-image library, must be installed on Windows and Mac systems.
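
To make the division of labor between Poppler and Tesseract concrete, here is a minimal sketch of the PDF-to-image-to-text pipeline. It is not the code in ocr2text.py: it assumes the pdftoppm and tesseract command-line programs are on your PATH and calls them directly, and the function names (pdftoppm_command, tesseract_command, pdf_to_txt) and the 300 dpi default are made up for illustration.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import List


def pdftoppm_command(pdf: Path, prefix: Path, dpi: int = 300) -> List[str]:
    """Build the Poppler command that renders each PDF page to a PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def tesseract_command(image: Path) -> List[str]:
    """Build the Tesseract command that prints recognized text to stdout."""
    return ["tesseract", str(image), "stdout"]


def pdf_to_txt(pdf: Path, txt: Path) -> None:
    """Render the PDF to page images, OCR each page, write one UTF-8 file."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(pdftoppm_command(pdf, prefix), check=True)
        pages = []
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(
                tesseract_command(image), check=True, capture_output=True
            )
            pages.append(result.stdout.decode("utf-8"))
    txt.write_text("\n".join(pages), encoding="utf-8")
```

Calling pdf_to_txt(Path("image.pdf"), Path("image.txt")) would only work once both tools are installed, which is what the setup steps below are for.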

Setup

Windows

  1. Make a new folder on your Desktop called ocr (e.g., C:\Users\mark\Desktop\ocr)
  2. Download and install the Tesseract 4 OCR library from Tesseract at UB Mannheim
  3. The installer should indicate the directory in which Tesseract-OCR was installed. Most likely, this will be either C:\Program Files (x86)\Tesseract-OCR or C:\Program Files\Tesseract-OCR. Move this folder into your equivalent of C:\Users\mark\Desktop\ocr, so that it is now located at Desktop\ocr\Tesseract-OCR.
  4. Download poppler for Windows.
  5. You may also need to install 7-Zip to unzip the download.
  6. Place the unzipped files in Desktop\ocr\poppler-0.68.0_x86.
  7. From your start menu, navigate to Control Panel > System and Security > System > Advanced System Settings
  8. Then click Environment Variables.
  9. In the System Variables window, highlight Path, and click Edit.
  10. Click New to add an additional path.
  11. Paste the full path to the location of Tesseract (e.g., C:\Users\mark\Desktop\ocr\Tesseract-OCR) and press OK.
  12. Again, click New to add an additional path.
  13. Paste your equivalent of C:\Users\mark\Desktop\ocr\poppler-0.68.0_x86\poppler-0.68.0\bin and press OK.
  14. Press OK on any remaining control panel windows.
  15. Download OCR2Text to Desktop\ocr.
  16. Unzip the project.
  17. Open a cmd.exe terminal, and navigate to the folder via the command line (e.g., cd Desktop\ocr\ocr2text-master)
  18. Run pip install --user --requirement requirements.txt
  19. Optionally, you can check that you set up the PATH variable correctly in steps 7-14 by typing echo %PATH%. The output must include your equivalent of C:\Users\mark\Desktop\ocr\Tesseract-OCR and C:\Users\mark\Desktop\ocr\poppler-0.68.0_x86\poppler-0.68.0\bin for the script to work.
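
As an alternative to reading the echo %PATH% output by eye, the check can be done from Python. This is a minimal sketch, not part of the project: it assumes the two executables are named tesseract and pdftoppm, and the helper name missing_tools is made up for illustration. It uses shutil.which from the standard library, which searches PATH the same way the terminal does.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of the tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Tesseract and Poppler are both on PATH.")
```

If a tool is reported as not found, open a new terminal first (PATH changes do not reach terminals that were already open) and then revisit the PATH steps above.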

macOS

  1. Make a new folder on your Desktop called ocr (e.g., /Users/mark/Desktop/ocr)
  2. Install Tesseract-OCR using either MacPorts (sudo port install tesseract) or Homebrew (brew install tesseract)
  3. Install poppler for Mac.
  4. Download this GitHub project to /Users/mark/Desktop/ocr.
  5. Unzip the project.
  6. Open a terminal and navigate to the folder via the command line (e.g., cd /Users/mark/Desktop/ocr/ocr2text)
  7. Run pip install --user --requirement requirements.txt

Linux

  1. On Debian- or Ubuntu-based systems, run sudo apt-get install tesseract-ocr
  2. Most distros ship with pdftoppm and pdftocairo. If they are not installed, use your package manager to install poppler-utils.
  3. Download this GitHub project.
  4. Unzip the project.
  5. Open a terminal and navigate to the folder.
  6. Run pip install --user --requirement requirements.txt

Usage

If you have successfully completed the setup steps and are using Python version 3, usage should now be a breeze:

On the command line, navigate to the directory where you downloaded the script and run:

python ocr2text.py

You will see the following:

********************************
*** PDF to TXT file, via OCR ***
********************************

Indicate file or folder of source PDF(s) []:
(Press [Enter] for current working directory)

Enter the full path to the file or directory to convert.

Destination folder for TXT []:
(Press [Enter] for current working directory)

Enter the full path to the directory where the result file(s) should be written.
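
The two prompts above accept either a single file or a folder, and an empty answer means the current working directory. A minimal sketch of how such answers could be interpreted follows. It is an illustration under assumptions, not the script's actual logic: the helper names resolve_dir and collect_pdfs are hypothetical, and the sketch only looks at PDFs directly inside the folder (no subfolders).

```python
from pathlib import Path
from typing import List


def resolve_dir(answer: str) -> Path:
    """Treat an empty answer (just [Enter]) as the current working directory."""
    answer = answer.strip()
    return Path(answer) if answer else Path.cwd()


def collect_pdfs(source: Path) -> List[Path]:
    """Return the single PDF, or every PDF directly inside the folder."""
    if source.is_file():
        return [source] if source.suffix.lower() == ".pdf" else []
    if not source.is_dir():
        return []  # The path does not exist.
    return sorted(p for p in source.iterdir() if p.suffix.lower() == ".pdf")
```

Matching the extension case-insensitively means that files named .PDF are picked up as well.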

The script will now convert the PDF via OCR into a plaintext file.

Testing the installation

For testing purposes, a test_files directory is included. Press [Enter] at both the source and destination prompts, then verify that the image.pdf file is converted. The converted file will also be located in the test_files directory:

Converted C:\Users\mark\ocr2text\image.pdf
Percent: [##########] 100%
1 file converted
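
For readers curious how a progress line like the one above can be produced, here is a minimal sketch. It is not taken from the script: the function name progress_line, the 10-character width, and the "-" used for the unfilled part are assumptions for illustration.

```python
def progress_line(done: int, total: int, width: int = 10) -> str:
    """Format a progress line such as 'Percent: [#####-----] 50%'."""
    # With nothing to convert, report the job as complete.
    fraction = done / total if total else 1.0
    filled = int(round(width * fraction))
    bar = "#" * filled + "-" * (width - filled)
    return "Percent: [{}] {}%".format(bar, int(round(fraction * 100)))
```

With one of one files converted, progress_line(1, 1) returns the same "Percent: [##########] 100%" line shown in the sample output.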