Python-based open source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (entity extraction and named entity recognition), and data enrichment (annotation) pipelines, plus an ingestor to a Solr or Elasticsearch index and a linked data graph database

Stars: ✭ 165 (-42.51%)

Mutual labels: pdf, ocr

Pdftabextract

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

Stars: ✭ 1,969 (+586.06%)

Mutual labels: pdf, ocr

Lambda Text Extractor

AWS Lambda functions to extract text from various binary formats.

Stars: ✭ 159 (-44.6%)

Mutual labels: pdf, ocr

Open Paperless

Scan, index, and archive all of your paper documents (acquired by Mayan EDMS)

Stars: ✭ 2,538 (+784.32%)

Mutual labels: pdf, ocr

Mayan Edms

Free Open Source Document Management System (mirror, no pull request or issues)

Stars: ✭ 226 (-21.25%)

Mutual labels: pdf, ocr

Parsr

Transforms PDF, Documents and Images into Enriched Structured Data

Stars: ✭ 2,736 (+853.31%)

Mutual labels: pdf, ocr

Ocr Corrector

Uses a language model to correct OCR recognition errors

Stars: ✭ 259 (-9.76%)

Mutual labels: ocr

Thinreports Generator

Report Generator for Ruby

Stars: ✭ 268 (-6.62%)

Mutual labels: pdf

Pdftilecut

pdftilecut lets you subdivide PDF pages into smaller pages so you can print them on small form printers.

Stars: ✭ 258 (-10.1%)

Mutual labels: pdf

Cloud Reports

Scans your AWS cloud resources and generates reports. Check out free hosted version:

Stars: ✭ 255 (-11.15%)

Mutual labels: pdf

Attention ocr.pytorch

This repository implements an encoder-decoder model with attention for OCR

Stars: ✭ 278 (-3.14%)

Mutual labels: ocr

Pdf

Rust library to read, manipulate and write PDF files.

Stars: ✭ 265 (-7.67%)

Mutual labels: pdf

pdfocr

pdfocr adds an OCR text layer to scanned PDF files, allowing them to be searched. It currently depends on Ruby 1.8.7 or above, and uses ocropus, cuneiform, or tesseract for performing OCR.

Using

To use, run:

pdfocr -i input.pdf -o output.pdf

For more details, see the manpage.
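To run pdfocr over several files, the invocation above can be wrapped in a small script. The following is a minimal sketch, not part of pdfocr itself: the helper names `ocr_output_name` and `pdfocr_command` and the `.ocr.pdf` output suffix are assumptions made for illustration, and only the `-i` and `-o` options shown above are used.

```ruby
# Hypothetical helper: derive an output name such as "docs/scan.ocr.pdf"
# from "docs/scan.pdf". The ".ocr.pdf" suffix is an arbitrary choice.
def ocr_output_name(input)
  dir = File.dirname(input)
  base = File.basename(input, ".*")
  File.join(dir, "#{base}.ocr.pdf")
end

# Hypothetical helper: build the argument list using only the -i and -o
# options documented above.
def pdfocr_command(input, output)
  ["pdfocr", "-i", input, "-o", output]
end

if __FILE__ == $PROGRAM_NAME
  # Each command-line argument is treated as one input PDF.
  ARGV.each do |input|
    output = ocr_output_name(input)
    # Passing an argument list to system bypasses the shell, so file names
    # containing spaces need no extra quoting.
    ok = system(*pdfocr_command(input, output))
    warn "pdfocr failed for #{input}" unless ok
  end
end
```

Saved under a name of your choice (for example `batch_ocr.rb`), it could be run as `ruby batch_ocr.rb a.pdf b.pdf`, writing `a.ocr.pdf` and `b.ocr.pdf` next to the inputs.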

Dependencies

pdfocr requires tesseract and hocr2pdf. These can be provided by installing the packages tesseract-ocr, tesseract-ocr-eng (or other languages you need), and exactimage from your distribution.
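Before running pdfocr, it can help to confirm that the two required programs are on your `PATH`. The sketch below is one way to do that in Ruby. The helper names `which` and `missing_tools` are hypothetical, and the check assumes the executables are named `tesseract` and `hocr2pdf`, as the paragraph above suggests.

```ruby
# Return the full path of an executable found in path_env, or nil.
def which(cmd, path_env = ENV.fetch("PATH", ""))
  path_env.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?
    candidate = File.join(dir, cmd)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Return the subset of tools that could not be found in path_env.
def missing_tools(tools, path_env = ENV.fetch("PATH", ""))
  tools.reject { |tool| which(tool, path_env) }
end

if __FILE__ == $PROGRAM_NAME
  # Assumed executable names, taken from the dependency list above.
  missing = missing_tools(%w[tesseract hocr2pdf])
  if missing.empty?
    puts "All required tools were found."
  else
    warn "Missing tools: #{missing.join(', ')}"
  end
end
```

This only checks that the programs exist. It does not verify their versions or that the tesseract language data you need is installed.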

Credits

pdfocr was written by Geza Kovacs.

pdfocr is hosted at http://github.com/gkovacs/pdfocr.

Christian Pietsch added tesseract support.

Stars: ✭ 287
