aeksco / aws-pdf-textract-pipeline

Licence: MIT license
🔍 Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using AWS Textract. Built with AWS CDK + TypeScript

Programming Languages

typescript
javascript

Projects that are alternatives of or similar to aws-pdf-textract-pipeline

go-localstack
Go Wrapper for using localstack
Stars: ✭ 56 (-60.28%)
Mutual labels:  cloudformation, dynamodb, s3, sns
Aws Sdk Perl
A community AWS SDK for Perl Programmers
Stars: ✭ 153 (+8.51%)
Mutual labels:  cloudformation, dynamodb, s3, sns
Awesome Aws
A curated list of awesome Amazon Web Services (AWS) libraries, open source repos, guides, blogs, and other resources. Featuring the Fiery Meter of AWSome.
Stars: ✭ 9,895 (+6917.73%)
Mutual labels:  cloudformation, dynamodb, s3
cdk-constructs
Shared constructs for AWS CDK
Stars: ✭ 34 (-75.89%)
Mutual labels:  cloudformation, cdk, aws-cdk
Hands-On-Serverless-Applications-with-Go
Hands-On Serverless Applications with Go, published by Packt.
Stars: ✭ 92 (-34.75%)
Mutual labels:  cloudformation, dynamodb, s3
Aws Sdk Js V3
Modularized AWS SDK for JavaScript.
Stars: ✭ 737 (+422.7%)
Mutual labels:  dynamodb, s3, sns
Sherlock Holmes Partying In The Jungle
Parses AWS events payloads into a plain JavaScript object
Stars: ✭ 12 (-91.49%)
Mutual labels:  dynamodb, s3, sns
Serverless
This is intended to be a repo containing all of the official AWS Serverless architecture patterns built with CDK for developers to use. All patterns come in Typescript and Python with the exported CloudFormation also included.
Stars: ✭ 1,048 (+643.26%)
Mutual labels:  cloudformation, dynamodb, sns
amazon-sns-java-extended-client-lib
This AWS SNS client library allows publishing messages to SNS that exceed the 256 KB message size limit.
Stars: ✭ 23 (-83.69%)
Mutual labels:  s3, sns
aws-cdk-project-template-for-devops
This repository provides best practices and template framework for developing AWS Cloud Development Kit(CDK)-based applications effectively, quickly and collaboratively.
Stars: ✭ 18 (-87.23%)
Mutual labels:  cloudformation, cdk
Aws Cognito Apigw Angular Auth
A simple/sample AngularV4-based web app that demonstrates different API authentication options using Amazon Cognito and API Gateway with an AWS Lambda and Amazon DynamoDB backend that stores user details in a complete end to end Serverless fashion.
Stars: ✭ 278 (+97.16%)
Mutual labels:  cloudformation, dynamodb
serverless-data-pipeline-sam
Serverless Data Pipeline powered by Kinesis Firehose, API Gateway, Lambda, S3, and Athena
Stars: ✭ 78 (-44.68%)
Mutual labels:  cloudformation, data-pipeline
serverless-dynamodb-ttl
⚑️ Serverless Plugin to set DynamoDB TTL
Stars: ✭ 16 (-88.65%)
Mutual labels:  cloudformation, dynamodb
cdkgoat
CdkGoat is Bridgecrew's "Vulnerable by Design" AWS CDK repository. CdkGoat is a learning and training project that demonstrates how common configuration errors can find their way into production cloud environments.
Stars: ✭ 27 (-80.85%)
Mutual labels:  cloudformation, aws-cdk
Aws Data Replication Hub
Seamless User Interface for replicating data into AWS.
Stars: ✭ 40 (-71.63%)
Mutual labels:  cloudformation, s3
Aws Toolkit Vscode
AWS Toolkit for Visual Studio Code, an extension for working with AWS services including AWS Lambda.
Stars: ✭ 823 (+483.69%)
Mutual labels:  cloudformation, s3
Aws Toolkit Eclipse
AWS Toolkit for Eclipse – an open-source plugin for developing, deploying, and managing AWS applications.
Stars: ✭ 252 (+78.72%)
Mutual labels:  cloudformation, dynamodb
Aws Iot Certificate Vending Machine
The CVM allows a device to apply for its own certificate and installation.
Stars: ✭ 64 (-54.61%)
Mutual labels:  cloudformation, dynamodb
Sumologic Aws Lambda
A collection of lambda functions to collect data from Cloudwatch, Kinesis, VPC Flow logs, S3, security-hub and AWS Inspector
Stars: ✭ 126 (-10.64%)
Mutual labels:  cloudformation, s3
data-transfer-hub
Seamless User Interface for replicating data into AWS.
Stars: ✭ 102 (-27.66%)
Mutual labels:  cloudformation, s3

aws-pdf-textract-pipeline

Mentioned in Awesome CDK

🔍 Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using AWS Textract. Built with AWS CDK + TypeScript.

This is an example data pipeline that illustrates one possible approach for large-scale serverless PDF processing - it should serve as a good foundation to modify for your own purposes.

Getting Started

Run the following commands to install dependencies, build the CDK stack, and deploy the CDK Stack to AWS.

yarn install
yarn build
cdk bootstrap
cdk deploy

Overview

The following is an overview of each process performed by this CDK stack.

  1. Scrape PDF download URLs from a website

    The example scrapes data from the COGCC (Colorado Oil and Gas Conservation Commission) website.

  2. Store PDF download URL in DynamoDB

  3. Download the PDF to S3

    A Lambda function fires when a new PDF download URL has been created in DynamoDB.

  4. Process the PDF with AWS Textract

    Another Lambda function fires when a PDF has been downloaded to the S3 bucket.

  5. Process the AWS Textract results

    When an SNS event from AWS Textract is detected, a Lambda function fires to process the result.

  6. Save the processed Textract result to DynamoDB

    After the full result is pruned down to the desired data structure, the data is saved in DynamoDB.
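
Steps 5 and 6 can be illustrated with a minimal sketch of the pruning stage. It assumes the Textract result contains form data, where `KEY_VALUE_SET` blocks point to their value through a `VALUE` relationship and to their words through `CHILD` relationships. The `TextractBlock` interface is a hand-written subset of the response shape, and `pruneToKeyValues`, the flat record it returns, and the sample blocks are hypothetical names and data chosen for illustration; the structure this project actually stores may differ.

```typescript
// Hand-written subset of the Textract block shape; only the fields read below.
interface TextractBlock {
  Id: string;
  BlockType: string; // e.g. "KEY_VALUE_SET" or "WORD"
  EntityTypes?: string[]; // contains "KEY" or "VALUE" on KEY_VALUE_SET blocks
  Text?: string;
  Relationships?: { Type: string; Ids: string[] }[];
}

type BlockIndex = { [id: string]: TextractBlock | undefined };

// Collect the ids a block references through one relationship type.
function relatedIds(block: TextractBlock, relType: string): string[] {
  const ids: string[] = [];
  for (const rel of block.Relationships || []) {
    if (rel.Type !== relType) continue;
    for (const id of rel.Ids) ids.push(id);
  }
  return ids;
}

// Join the text of a block's WORD children in the order they are listed.
function childText(block: TextractBlock, index: BlockIndex): string {
  const words: string[] = [];
  for (const id of relatedIds(block, "CHILD")) {
    const child = index[id];
    if (child && child.BlockType === "WORD" && child.Text) {
      words.push(child.Text);
    }
  }
  return words.join(" ");
}

// Reduce a full block list to a flat { key: value } record
// (hypothetical target shape, not necessarily what this project stores).
function pruneToKeyValues(blocks: TextractBlock[]): Record<string, string> {
  const index: BlockIndex = {};
  for (const block of blocks) index[block.Id] = block;

  const result: Record<string, string> = {};
  for (const block of blocks) {
    const isKey =
      block.BlockType === "KEY_VALUE_SET" &&
      (block.EntityTypes || []).indexOf("KEY") !== -1;
    if (!isKey) continue;

    const key = childText(block, index);
    const values: string[] = [];
    for (const id of relatedIds(block, "VALUE")) {
      const valueBlock = index[id];
      if (valueBlock) values.push(childText(valueBlock, index));
    }
    if (key) result[key] = values.join(" ");
  }
  return result;
}

// Usage with a tiny hand-made block list (not real Textract output).
const sampleBlocks: TextractBlock[] = [
  { Id: "k1", BlockType: "KEY_VALUE_SET", EntityTypes: ["KEY"],
    Relationships: [{ Type: "VALUE", Ids: ["v1"] }, { Type: "CHILD", Ids: ["w1", "w2"] }] },
  { Id: "v1", BlockType: "KEY_VALUE_SET", EntityTypes: ["VALUE"],
    Relationships: [{ Type: "CHILD", Ids: ["w3"] }] },
  { Id: "w1", BlockType: "WORD", Text: "Operator" },
  { Id: "w2", BlockType: "WORD", Text: "Name" },
  { Id: "w3", BlockType: "WORD", Text: "Acme" },
];
console.log(pruneToKeyValues(sampleBlocks));
```

Keeping the pruning step as a pure function over the block list makes it straightforward to cover with Jest before wiring it into the Lambda that writes to DynamoDB.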

Scripts

  • yarn install - installs dependencies
  • yarn build - builds the production-ready CDK Stack
  • yarn test - runs Jest
  • cdk bootstrap - bootstraps AWS CloudFormation for your CDK deploy
  • cdk deploy - deploys the CDK stack to AWS

Notes

  • Warning - the AnalyzeDocument operation from AWS Textract costs $50 per 1,000 PDF pages. Be careful when deploying this CDK stack, as an unattended pipeline can rack up an expensive AWS bill quickly.

  • If a PDF download URL has already been added to the pdfUrlsTable DynamoDB table, the pipeline will not re-execute for the PDF.

  • Includes tests with Jest.

  • Using Visual Studio Code with the Format on Save setting turned on is recommended.
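
The cost warning in the notes above can be turned into a quick estimate before deploying. The sketch below is plain arithmetic on the $50 per 1,000 pages figure quoted in this README; Textract pricing varies by feature and region, so treat the constant as an assumption and check the current price list. `estimateTextractCostUsd` and `withinBudget` are hypothetical helpers and are not part of this project.

```typescript
// Rate quoted in the note above; an assumption, not a live price.
const USD_PER_1000_PAGES = 50;

// Estimate the AnalyzeDocument charge for a given number of PDF pages.
function estimateTextractCostUsd(
  pageCount: number,
  usdPer1000Pages: number = USD_PER_1000_PAGES
): number {
  if (!isFinite(pageCount) || pageCount < 0) {
    throw new RangeError("pageCount must be a non-negative finite number");
  }
  return (pageCount * usdPer1000Pages) / 1000;
}

// Hypothetical guard: check whether one more PDF still fits an estimated budget.
function withinBudget(
  pagesSoFar: number,
  nextPdfPages: number,
  budgetUsd: number
): boolean {
  return estimateTextractCostUsd(pagesSoFar + nextPdfPages) <= budgetUsd;
}

console.log(estimateTextractCostUsd(20000)); // 1000
```

A check like `withinBudget` could run before a PDF is handed to Textract, so a crawl that discovers far more documents than expected stops at a known ceiling.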

Built with

Additional Resources

License

Open source under the MIT License.

Built with ❤️ by aeksco
