skylander86 / Lambda Text Extractor

Licence: apache-2.0
AWS Lambda functions to extract text from various binary formats.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Lambda Text Extractor

Ocrmypdf
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
Stars: ✭ 5,549 (+3389.94%)
Mutual labels:  pdf, ocr, tesseract
ocr
Simple app to extract text from pictures using Tesseract
Stars: ✭ 98 (-38.36%)
Mutual labels:  ocr, tesseract, text-extraction
Tesseract Macos
Objective C wrapper for the open source OCR Engine Tesseract (macOS)
Stars: ✭ 154 (-3.14%)
Mutual labels:  ocr, tesseract
Tesserocr
A Python wrapper for the tesseract-ocr API
Stars: ✭ 1,567 (+885.53%)
Mutual labels:  ocr, tesseract
Links Detector
📖 👆🏻 Links Detector makes printed links clickable via your smartphone camera. No need to type a link in, just scan and click on it.
Stars: ✭ 106 (-33.33%)
Mutual labels:  ocr, tesseract
Lambcycle
🐑🛵 A declarative lambda middleware with life cycle hooks 🐑🛵
Stars: ✭ 88 (-44.65%)
Mutual labels:  aws-lambda, lambda-functions
Remarks
Extract highlights, scribbles, and annotations from PDFs marked with the reMarkable tablet. Export to Markdown, PDF, PNG, and SVG
Stars: ✭ 94 (-40.88%)
Mutual labels:  pdf, ocr
Tesseract
This package contains an OCR engine - libtesseract and a command line program - tesseract. Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused on line recognition, but also still supports the legacy Tesseract OCR engine of Tesseract 3 which works by recognizing character patterns. Compatibility with Tesseract 3 is enabled by using the Legacy OCR Engine mode (--oem 0). It also needs traineddata files which support the legacy engine, for example those from the tessdata repository.
Stars: ✭ 43,199 (+27069.18%)
Mutual labels:  ocr, tesseract
Koreader Base
Base framework offering a Lua scriptable environment for creating document readers
Stars: ✭ 81 (-49.06%)
Mutual labels:  pdf, tesseract
Lambda Toolkit
*DO NOT USE* - This project was done during my initial python and lambda's studies. I would recommend you the `serverless framework`.
Stars: ✭ 114 (-28.3%)
Mutual labels:  aws-lambda, lambda-functions
Aadhaar Card Ocr
Extract text information from Aadhaar Card using tesseract-ocr 😎
Stars: ✭ 112 (-29.56%)
Mutual labels:  ocr, tesseract
Tesseract Ocr for windows
Visual Studio Projects for Tesseract and dependencies
Stars: ✭ 122 (-23.27%)
Mutual labels:  ocr, tesseract
Ocrtable
Recognize tables and text from scanned images that contain tables. 从包含表格的扫描图片中识别表格和文字
Stars: ✭ 155 (-2.52%)
Mutual labels:  ocr, tesseract
Aws Serverless Airline Booking
Airline Booking is a sample web application that provides Flight Search, Flight Payment, Flight Booking and Loyalty points including end-to-end testing, GraphQL and CI/CD. This web application was the theme of Build on Serverless Season 2 on AWS Twitch running from April 24th until end of August in 2019.
Stars: ✭ 1,290 (+711.32%)
Mutual labels:  aws-lambda, lambda-functions
Node Tesseract Ocr
A Node.js wrapper for the Tesseract OCR API
Stars: ✭ 92 (-42.14%)
Mutual labels:  ocr, tesseract
Penteract Ocr
⭐️ The native node.js bindings to the Tesseract OCR project.
Stars: ✭ 86 (-45.91%)
Mutual labels:  ocr, tesseract
Gosseract
Go package for OCR (Optical Character Recognition), by using Tesseract C++ library
Stars: ✭ 1,622 (+920.13%)
Mutual labels:  ocr, tesseract
Ambar
🔍 Ambar: Document Search Engine
Stars: ✭ 1,829 (+1050.31%)
Mutual labels:  pdf, ocr
Papermerge
Open Source Document Management System for Digital Archives (Scanned Documents)
Stars: ✭ 1,177 (+640.25%)
Mutual labels:  pdf, ocr
Php Apache Tika
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
Stars: ✭ 76 (-52.2%)
Mutual labels:  ocr, text-extraction

Extracting Text from Binary Document Formats using AWS Lambda

lambda-text-extractor is a Python 3.6 app that works with the AWS Lambda architecture to extract text from common binary document formats.

Features

Some of its key features are:

  • out-of-the-box support for many common binary document formats (see section on Supported Formats),
  • scalable PDF parsing that runs OCR in parallel using AWS Lambda and asyncio,
  • creation of text-searchable PDFs after OCR,
  • a serverless architecture that makes deployment quick and easy,
  • detailed instructions for preparing the libraries and dependencies needed to process binary documents, and
  • sensible Unicode handling.

Supported Formats

lambda-text-extractor supports many common and legacy document formats:

  • Portable Document Format (.pdf),
  • Microsoft Word 2, 6, 7, 97, 2000, 2002 and 2003 (.doc) using Antiword with fallback to Catdoc,
  • Microsoft Word 2007 OpenXML files (.docx) using python-docx,
  • Microsoft PowerPoint 2007 OpenXML files (.pptx) using python-pptx,
  • Microsoft Excel 5.0, 97-2003, and 2007 OpenXML files (.xls, .xlsx) using xlrd,
  • OpenDocument 1.2 (.odm, .odp, .ods, .odt, .oth, .otm, .otp, .ots, .ott) using odfpy,
  • Rich Text Format (.rtf) using UnRTF v0.21.9,
  • XML files and HTML web pages (.html, .htm, .xml) using lxml,
  • CSV files (.csv) using Python csv module,
  • Images (.tiff, .jpg, .jpeg, .png) using Tesseract, and
  • Plain text files (.txt)
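
As a rough illustration of how the formats above could be routed, here is a minimal sketch that maps a document URI to the tool named in the list. The `EXTRACTORS` table and `pick_extractor` function are hypothetical names for this example; the project's own dispatch code may be organised differently.

```python
import os
from urllib.parse import urlparse

# Hypothetical lookup table: file extension -> tool named in the list above.
EXTRACTORS = {
    ".pdf": "pdf",
    ".doc": "antiword (fallback: catdoc)",
    ".docx": "python-docx",
    ".pptx": "python-pptx",
    ".xls": "xlrd",
    ".xlsx": "xlrd",
    ".rtf": "unrtf",
    ".html": "lxml",
    ".htm": "lxml",
    ".xml": "lxml",
    ".csv": "csv",
    ".tiff": "tesseract",
    ".jpg": "tesseract",
    ".jpeg": "tesseract",
    ".png": "tesseract",
    ".txt": "plain text",
}
# All OpenDocument extensions share one handler.
for _ext in (".odm", ".odp", ".ods", ".odt", ".oth", ".otm", ".otp", ".ots", ".ott"):
    EXTRACTORS[_ext] = "odfpy"


def pick_extractor(document_uri):
    """Return the tool name for a URI, or None if the extension is unsupported."""
    path = urlparse(document_uri).path
    extension = os.path.splitext(path)[1].lower()
    return EXTRACTORS.get(extension)
```

For example, `pick_extractor("s3://bucket/report.docx")` looks up `.docx` and returns the `python-docx` entry, while an unknown extension yields `None`.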

Setup

Due to the size of code and dependencies (and AWS Lambda's 50MB package limits), the extraction system is split into two Lambda functions: simple and ocr. ocr supports extracting text from images and "image" PDFs, while simple handles text extraction from the remaining formats. The side benefit of splitting into two functions is that we can configure the memory requirements of the two functions independently.

We use apex as our development toolchain to deploy the AWS Lambda functions; the code for the two Lambda functions is in the functions directory. To deploy to AWS, run the following command (note that the -D argument enables dry-run mode):

apex -D deploy

You need to ensure your IAM role has lambda:InvokeAsync permissions, and s3:PutObject permissions on the output bucket. Generally, we advise using a dedicated bucket with auto-delete lifecycle rules for temporary storage. You can set the IAM role and other configuration options in project.json.
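
The two permissions mentioned above could be expressed as an IAM policy document along the following lines. This is only a sketch: the function ARN and bucket name are placeholders, `build_policy` is a hypothetical helper, and your role may need further permissions (for example, to read the input documents) that are not listed here.

```python
import json


def build_policy(ocr_function_arn, output_bucket):
    """Build a policy document granting the two permissions named above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "lambda:InvokeAsync",
                "Resource": ocr_function_arn,
            },
            {
                "Effect": "Allow",
                "Action": "s3:PutObject",
                # Object-level permission, so the resource covers keys in the bucket.
                "Resource": "arn:aws:s3:::{}/*".format(output_bucket),
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID, region, and bucket name.
    policy = build_policy(
        "arn:aws:lambda:us-east-1:123456789012:function:textractor_ocr",
        "my-temp-bucket",
    )
    print(json.dumps(policy, indent=2))
```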

Parsing speed depends on CPU, which AWS Lambda allocates in proportion to the memory configured for each function. For our needs, we find that 512MB for simple and 1024MB for ocr is a good balance between performance and cost.

Usage

Non OCR Text Extraction

The simple function expects an event with the following fields:

  • document_uri: A URI pointing to the document to extract text from, e.g., s3://bucket/key.pdf.
  • temp_uri_prefix (optional): A URI prefix where temporary files can be stored. Defaults to <document_uri>-temp if not set.
  • text_uri (optional): A URI where the extracted text will be stored, e.g., s3://bucket/key.txt. Defaults to <document_uri>.txt if not set.
  • disable_ocr (optional): Whether to disable the OCR fallback. Defaults to False.
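
The defaults described above can be pictured with a small sketch. `resolve_simple_event` is a hypothetical helper written for this illustration, not a function from the project; it only shows how the optional fields fall back to values derived from document_uri.

```python
def resolve_simple_event(event):
    """Fill in the optional fields of a simple event with their documented defaults."""
    document_uri = event["document_uri"]  # the only required field
    return {
        "document_uri": document_uri,
        "temp_uri_prefix": event.get("temp_uri_prefix", document_uri + "-temp"),
        "text_uri": event.get("text_uri", document_uri + ".txt"),
        "disable_ocr": bool(event.get("disable_ocr", False)),
    }
```

Given only `{"document_uri": "s3://bucket/key.pdf"}`, the resolved event stores temporary files under `s3://bucket/key.pdf-temp` and writes text to `s3://bucket/key.pdf.txt`.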

Example

aws lambda invoke --function-name textractor_simple --payload '{"document_uri": "https://mozilla.github.io/pdf.js/web/compressed.tracemonkey-pldi-09.pdf", "temp_uri_prefix": "s3://bucket/", "text_uri": "s3://bucket/tracemonkey.txt"}' -

aws s3 cp s3://bucket/tracemonkey.txt -
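
The same invocation can be made from Python. The sketch below assumes boto3 is installed and AWS credentials are configured; `build_invoke_kwargs` and `invoke_simple` are hypothetical helper names, and the raw response payload is returned as bytes because its format is not described here.

```python
import json


def build_invoke_kwargs(function_name, event):
    """Assemble the keyword arguments for a synchronous Lambda invocation."""
    return {
        "FunctionName": function_name,
        "InvocationType": "RequestResponse",
        "Payload": json.dumps(event).encode("utf-8"),
    }


def invoke_simple(event, function_name="textractor_simple"):
    """Invoke the simple function and return the raw response payload bytes."""
    import boto3  # imported lazily so build_invoke_kwargs works without boto3

    client = boto3.client("lambda")
    response = client.invoke(**build_invoke_kwargs(function_name, event))
    return response["Payload"].read()


if __name__ == "__main__":
    payload = invoke_simple({
        "document_uri": "https://mozilla.github.io/pdf.js/web/compressed.tracemonkey-pldi-09.pdf",
        "temp_uri_prefix": "s3://bucket/",
        "text_uri": "s3://bucket/tracemonkey.txt",
    })
    print(payload)
```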

It automatically falls back to the ocr function when all of the following hold:

  • file is a PDF (i.e., ends with .pdf),
  • text content is shorter than 32 characters, and
  • disable_ocr is False.
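
The three conditions above amount to a simple predicate. The following is a sketch only: `should_fall_back_to_ocr` and `MIN_TEXT_LENGTH` are hypothetical names, and the project's own check may measure the text differently.

```python
from urllib.parse import urlparse

MIN_TEXT_LENGTH = 32  # threshold quoted in the list above


def should_fall_back_to_ocr(document_uri, text, disable_ocr=False):
    """Return True only when all three fallback conditions hold."""
    is_pdf = urlparse(document_uri).path.lower().endswith(".pdf")
    too_short = len(text) < MIN_TEXT_LENGTH
    return is_pdf and too_short and not disable_ocr
```

A scanned PDF that yields an empty string triggers the fallback, while the same document with `disable_ocr=True` does not.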

OCR Text Extraction

The ocr function expects the same event as simple, with the following additional fields:

  • searchable_pdf_uri: A URI where the searchable version of the PDF file will be stored. Defaults to <document_uri>.searchable.pdf.
  • create_searchable_pdf: Whether to create searchable PDFs. Defaults to True.
  • page: The page number on which to perform PDF OCR extraction. Defaults to all pages.

Searchable PDF creation may take significantly longer than text extraction alone. Because OCR PDF extraction involves multiple steps, several additional settings (set through environment variables) configure its behavior:

  • MERGE_SEARCHABLE_PDF_DURATION: The maximum number of seconds to take for searchable PDF merging. Defaults to 90 seconds.
  • RETURN_RESULTS_DURATION: The number of seconds to reserve at the end for compiling results and returning them. Defaults to 3 seconds.
  • TEXTRACT_OUTPUT_WAIT_BUFFER_TIME: The number of seconds to reserve for the overhead in async wait of each page's OCR Lambda functions to return. Defaults to 5 seconds.
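
One way to read these settings with their documented defaults is sketched below. `load_ocr_settings` is a hypothetical helper, and parsing the values as floats is an assumption made for this example.

```python
import os

# Defaults (in seconds) as documented in the list above.
DEFAULTS = {
    "MERGE_SEARCHABLE_PDF_DURATION": 90.0,
    "RETURN_RESULTS_DURATION": 3.0,
    "TEXTRACT_OUTPUT_WAIT_BUFFER_TIME": 5.0,
}


def load_ocr_settings(environ=None):
    """Read the three durations from the environment, falling back to the defaults."""
    environ = os.environ if environ is None else environ
    return {name: float(environ.get(name, default)) for name, default in DEFAULTS.items()}
```

Passing a plain dict instead of `os.environ` makes the helper easy to exercise locally, e.g. `load_ocr_settings({"MERGE_SEARCHABLE_PDF_DURATION": "120"})`.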

For more details about how PDF OCR extraction works, see the section on PDF OCR Extraction.

Example

aws lambda invoke --function-name textractor_ocr --payload '{"document_uri": "https://mozilla.github.io/pdf.js/web/compressed.tracemonkey-pldi-09.pdf", "temp_uri_prefix": "s3://bucket/", "text_uri": "s3://bucket/tracemonkey.txt", "searchable_pdf_uri": "s3://bucket/tracemonkey.searchable.pdf"}' -

aws s3 cp s3://bucket/tracemonkey.txt -

PDF OCR Extraction

Because OCR on images is slow and AWS Lambda has a 300-second execution limit, we use a hack (i.e., another Lambda invocation) to OCR the pages of a PDF in parallel, with S3 as our temporary store.

When we determine that a PDF needs to be processed using OCR (i.e., simple text extraction yields < 512 bytes), we automatically invoke ocr once for each page of the PDF and wait for the results asynchronously (we use asyncio and aiobotocore to achieve this). The page field in the event determines which page is processed by a given function call.

Basically, the steps for OCR extraction are as follows:

  1. Determine the number of pages in the PDF using pdfinfo. We find that this subprocess call is faster (and more robust) than using a Python PDF library like PyPDF2.
  2. Invoke ocr on each page of the document by passing in the page field. We store the intermediate output (i.e., extracted text and searchable PDFs for each page) in the temp_uri_prefix folder. We wait for these Lambda function calls to complete using await.
  3. We download the intermediate outputs to the Lambda function's local filesystem.
  4. We combine the intermediate text and searchable PDF, ignoring missing pages and files. The missing information will be stored in the metadata of the final text_uri and searchable_pdf_uri as missing_text_pages and missing_searchable_pdf_pages respectively.
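
The steps above can be sketched in miniature as follows. This is an illustration under stated assumptions rather than the project's code: `ocr_page` is a stand-in coroutine for the per-page Lambda invocation (the real system uses aiobotocore and S3), page numbers are assumed to start at 1, and `parse_page_count` assumes pdfinfo prints a `Pages:` line in its text output.

```python
import asyncio
import re


def parse_page_count(pdfinfo_output):
    """Step 1: read the page count from the text printed by pdfinfo."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


async def ocr_page(page):
    """Hypothetical stand-in for invoking the ocr function on one page."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return "text of page {}".format(page)


async def ocr_all_pages(num_pages, invoke=ocr_page):
    """Steps 2 to 4: fan out one call per page, then merge what came back."""
    pages = range(1, num_pages + 1)
    # return_exceptions=True keeps one failed page from cancelling the others.
    results = await asyncio.gather(*(invoke(p) for p in pages), return_exceptions=True)
    texts, missing_text_pages = [], []
    for page, result in zip(pages, results):
        if isinstance(result, Exception):
            missing_text_pages.append(page)  # recorded instead of aborting
        else:
            texts.append(result)
    return "\n".join(texts), missing_text_pages


if __name__ == "__main__":
    text, missing = asyncio.run(ocr_all_pages(3))
    print(text, missing)
```

Collecting failures into `missing_text_pages` mirrors the idea of ignoring missing pages while still reporting them in the output metadata.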

Steps 2 and 3 run concurrently and asynchronously, and we set a timeout based on

REMAINING_TIME - MERGE_SEARCHABLE_PDF_DURATION - RETURN_RESULTS_DURATION - TEXTRACT_OUTPUT_WAIT_BUFFER_TIME

where REMAINING_TIME is the amount of time remaining after step 1.
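
As a worked example of this formula, the sketch below computes the timeout from the documented default durations. `ocr_wait_timeout` is a hypothetical helper, and clamping the result at zero is an assumption added for this example.

```python
def ocr_wait_timeout(remaining_time, merge=90.0, return_results=3.0, wait_buffer=5.0):
    """Seconds available for waiting on per-page OCR results.

    remaining_time is the time left after step 1, in seconds. Inside a Lambda
    handler it could be derived from context.get_remaining_time_in_millis().
    """
    timeout = remaining_time - merge - return_results - wait_buffer
    return max(timeout, 0.0)  # assumption: never return a negative timeout
```

With 290 seconds remaining after step 1 and the default settings, the budget is 290 - 90 - 3 - 5 = 192 seconds.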

In our experience, merging searchable PDFs takes quite a while, and the time depends on the number of pages. On average, merging 100 pages of searchable PDFs takes about 60 seconds. If this is an issue for you, you might want to modify the code to fix the paths of the intermediate outputs and combine them yourself outside the Lambda infrastructure. Currently, we use random UUIDs for the filenames of each intermediate output page. The relevant part of the code is in the _invoke_textract_ocr_tasks method.
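
If you do want predictable intermediate paths, one possible naming scheme is sketched below. `page_output_uris` and the `page-00007` pattern are purely hypothetical and are not taken from the project, which uses random UUIDs.

```python
def page_output_uris(temp_uri_prefix, page):
    """Derive deterministic per-page output URIs under the temporary prefix."""
    prefix = temp_uri_prefix.rstrip("/")
    # Zero-padding keeps the keys in page order when listed alphabetically.
    stem = "{}/page-{:05d}".format(prefix, page)
    return {
        "text_uri": stem + ".txt",
        "searchable_pdf_uri": stem + ".pdf",
    }
```

With fixed names, an external job can list the prefix and merge the pages in order without needing the UUIDs chosen at invocation time.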

For OCR extraction on individual pages, we use Ghostscript to render the page into an image with basic image processing, and then use Tesseract to extract the text. If create_searchable_pdf is enabled, Tesseract is used to create a searchable PDF directly. Afterwards, we use pdftotext for regular text extraction from the searchable PDF (instead of running Tesseract twice).
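
The per-page pipeline could be assembled from command lines such as the ones below. This is a sketch, not the project's exact invocation: the output device, the 300 DPI resolution, and the helper names are assumptions, and the image processing step mentioned above is omitted.

```python
def page_to_image_cmd(pdf_path, page, image_path, dpi=300):
    """Ghostscript command that renders a single PDF page to a PNG image."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=png16m",
        "-r{}".format(dpi),
        "-dFirstPage={}".format(page),
        "-dLastPage={}".format(page),
        "-sOutputFile={}".format(image_path),
        pdf_path,
    ]


def ocr_cmd(image_path, output_base, searchable_pdf):
    """Tesseract command; the 'pdf' config writes output_base.pdf instead of .txt."""
    cmd = ["tesseract", image_path, output_base]
    if searchable_pdf:
        cmd.append("pdf")
    return cmd


def pdf_to_text_cmd(pdf_path, text_path):
    """pdftotext command that extracts the text layer of the searchable PDF."""
    return ["pdftotext", pdf_path, text_path]
```

Each list can be passed to `subprocess.run(cmd, check=True)` in sequence, provided the gs, tesseract, and pdftotext binaries are available on the PATH.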

If anybody knows of a better pattern for processing PDFs, do feel free to submit a pull request!

Building Binaries

For more information on how we prepared the Lambda execution environment to run all this external software and these libraries, see Building Binaries.
