All Projects → vrasneur → ocr

vrasneur / ocr

Licence: Unlicense License
ocr.sh: a bash script to OCR PDF files easily

Programming Languages

shell
77523 projects

ocr.sh: a bash script to OCR PDF files easily

Author

Vincent Rasneur [email protected]

Required programs

  • pdftk
  • ghostscript
  • imagemagick
  • tesseract
  • aspell (optional)

Remarks

By default, the script uses the French dictionaries of tesseract and aspell. Use the -t argument to change the tesseract dictionary. Use the -a argument to change the aspell dictionary.

By default, the script does not spell-check the output text. To do this, you must add -s (or use the -a argument).

Usage

To OCR a PDF file

ocr.sh document.pdf

To OCR a PDF file and spell-check each page

ocr.sh -s document.pdf

To OCR an english PDF and spell-check it

ocr.sh -t eng -a en document.pdf

Output files

For a PDF file named doc1.pdf, the script:

  • creates a directory named doc1
  • for each PDF page, a file named pg_<number>.txt is created inside this directory

Or, if the -c argument is used, the script:

  • creates a directory named doc1
  • creates a unique file named doc1/doc1.txt
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].