gkovacs / Pdfocr

Licence: MIT
Adds text to PDF files using the cuneiform OCR software

Programming Languages

ruby

Projects that are alternatives of or similar to Pdfocr

Mybox
Easy tools for document, image, file, network, location, color, and media.
Stars: ✭ 45 (-84.32%)
Mutual labels:  pdf, ocr
Ambar
🔍 Ambar: Document Search Engine
Stars: ✭ 1,829 (+537.28%)
Mutual labels:  pdf, ocr
Scanbot Sdk Example Android
Document scanning SDK example apps for the Scanbot SDK for Android.
Stars: ✭ 67 (-76.66%)
Mutual labels:  pdf, ocr
Papermerge
Open Source Document Management System for Digital Archives (Scanned Documents)
Stars: ✭ 1,177 (+310.1%)
Mutual labels:  pdf, ocr
Paperwork
Personal document manager (Linux/Windows) -- Moved to GNOME's GitLab
Stars: ✭ 2,392 (+733.45%)
Mutual labels:  pdf, ocr
Ocrmypdf
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
Stars: ✭ 5,549 (+1833.45%)
Mutual labels:  pdf, ocr
Remarks
Extract highlights, scribbles, and annotations from PDFs marked with the reMarkable tablet. Export to Markdown, PDF, PNG, and SVG
Stars: ✭ 94 (-67.25%)
Mutual labels:  pdf, ocr
Docspell
Assist in organizing your piles of documents, resulting from scanners, e-mails and other sources with minimal effort.
Stars: ✭ 303 (+5.57%)
Mutual labels:  pdf, ocr
Open Semantic Etl
Python-based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elasticsearch index & linked data graph database
Stars: ✭ 165 (-42.51%)
Mutual labels:  pdf, ocr
Pdftabextract
A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
Stars: ✭ 1,969 (+586.06%)
Mutual labels:  pdf, ocr
Lambda Text Extractor
AWS Lambda functions to extract text from various binary formats.
Stars: ✭ 159 (-44.6%)
Mutual labels:  pdf, ocr
Open Paperless
Scan, index, and archive all of your paper documents (acquired by Mayan EDMS)
Stars: ✭ 2,538 (+784.32%)
Mutual labels:  pdf, ocr
Mayan Edms
Free Open Source Document Management System (mirror, no pull request or issues)
Stars: ✭ 226 (-21.25%)
Mutual labels:  pdf, ocr
Parsr
Transforms PDF, Documents and Images into Enriched Structured Data
Stars: ✭ 2,736 (+853.31%)
Mutual labels:  pdf, ocr
Ocr Corrector
Uses a language model to correct OCR recognition errors
Stars: ✭ 259 (-9.76%)
Mutual labels:  ocr
Thinreports Generator
Report Generator for Ruby
Stars: ✭ 268 (-6.62%)
Mutual labels:  pdf
Pdftilecut
pdftilecut lets you sub-divide PDF pages into smaller pages so you can print them on small form printers.
Stars: ✭ 258 (-10.1%)
Mutual labels:  pdf
Cloud Reports
Scans your AWS cloud resources and generates reports. Check out free hosted version:
Stars: ✭ 255 (-11.15%)
Mutual labels:  pdf
Attention ocr.pytorch
This repository implements the encoder and decoder model with attention model for OCR
Stars: ✭ 278 (-3.14%)
Mutual labels:  ocr
Pdf
Rust library to read, manipulate and write PDF files.
Stars: ✭ 265 (-7.67%)
Mutual labels:  pdf

pdfocr

pdfocr adds an OCR text layer to scanned PDF files, allowing them to be searched. It requires Ruby 1.8.7 or later and uses ocropus, cuneiform, or tesseract to perform OCR.

Using

To use, run:

pdfocr -i input.pdf -o output.pdf

For more details, see the manpage.
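
As a rough sketch of how the command above could be scripted over a whole directory, the Ruby below builds the same `-i`/`-o` invocation for each PDF and runs it with `system`. The helper names (`pdfocr_command`, `ocr_output_path`, `ocr_directory`) and the `-ocr.pdf` output suffix are assumptions made for illustration; they are not part of pdfocr itself.

```ruby
# Build the argument list for one pdfocr run, using only the -i and -o
# options shown in the usage line above.
def pdfocr_command(input, output)
  ["pdfocr", "-i", input, "-o", output]
end

# Hypothetical naming scheme: "/scans/a.pdf" becomes "/scans/a-ocr.pdf".
def ocr_output_path(input)
  File.join(File.dirname(input), File.basename(input, ".pdf") + "-ocr.pdf")
end

# Run pdfocr on every PDF in dir, skipping files that already carry the
# "-ocr" suffix. Returns the inputs whose run failed or could not start.
def ocr_directory(dir)
  inputs = Dir.glob(File.join(dir, "*.pdf")).sort
  inputs = inputs.reject { |path| path.end_with?("-ocr.pdf") }
  # system returns true on success, so reject drops the successful runs.
  inputs.reject do |input|
    system(*pdfocr_command(input, ocr_output_path(input)))
  end
end
```

Calling `ocr_directory("scans")` would then return an empty array when every file was processed successfully. Passing the arguments to `system` as separate strings avoids shell quoting problems with file names that contain spaces.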

Dependencies

pdfocr requires tesseract and hocr2pdf. Both are available from your distribution's packages: tesseract-ocr and tesseract-ocr-eng (or the language packs you need) for tesseract, and exactimage for hocr2pdf.
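
Before running pdfocr it can help to confirm that both programs are actually on the search path. The following is a minimal sketch, assuming a POSIX-style `PATH` and plain executable names (no `.exe` handling); `find_executable`, `missing_tools`, and `REQUIRED_TOOLS` are hypothetical names introduced here, not something pdfocr provides.

```ruby
# The two programs named in the Dependencies section.
REQUIRED_TOOLS = %w[tesseract hocr2pdf]

# Return the full path of the first matching executable on search_path,
# or nil when none is found.
def find_executable(name, search_path = ENV["PATH"].to_s)
  dirs = search_path.split(File::PATH_SEPARATOR).reject { |dir| dir.empty? }
  dirs.each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# List the tools that could not be found on search_path.
def missing_tools(tools = REQUIRED_TOOLS, search_path = ENV["PATH"].to_s)
  tools.reject { |tool| find_executable(tool, search_path) }
end
```

A script could call `missing_tools` at startup and stop with a message naming the packages to install whenever the returned list is not empty.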

Credits

pdfocr was written by Geza Kovacs.

pdfocr is hosted at http://github.com/gkovacs/pdfocr

Christian Pietsch added tesseract support.