awslabs / Amazon S3 Find And Forget

License: Apache-2.0
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR).

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Amazon S3 Find And Forget

Streamx
kafka-connect-s3: Ingest data from Kafka to object stores (S3)
Stars: ✭ 96 (-16.52%)
Mutual labels:  aws, s3, big-data
Data Processing Agreements
Collection of Data Processing Agreement (DPA) and GDPR compliance resources
Stars: ✭ 110 (-4.35%)
Mutual labels:  data, gdpr, privacy
Locopy
locopy: Loading/Unloading to Redshift and Snowflake using Python.
Stars: ✭ 73 (-36.52%)
Mutual labels:  aws, s3, data
Presidio
Context aware, pluggable and customizable data protection and anonymization SDK for text and images
Stars: ✭ 1,320 (+1047.83%)
Mutual labels:  gdpr, privacy
S3 Beam
🚀 direct-to-S3 uploading using ClojureScript
Stars: ✭ 91 (-20.87%)
Mutual labels:  aws, s3
Rpcheckup
rpCheckup is an AWS resource policy security checkup tool that identifies public, external account access, intra-org account access, and private resources.
Stars: ✭ 91 (-20.87%)
Mutual labels:  aws, s3
Dropdot
☁️ Direct Upload to Amazon S3 With CORS demo. Built with Node/Express
Stars: ✭ 87 (-24.35%)
Mutual labels:  aws, s3
Data Protection Mapping Project
Open Source Data Protection/Privacy Regulatory Mapping Project
Stars: ✭ 96 (-16.52%)
Mutual labels:  gdpr, privacy
Aws Workflows On Github
Workflows for automation of AWS services setup from Github CI/CD
Stars: ✭ 95 (-17.39%)
Mutual labels:  aws, s3
Just Dashboard
📊 📋 Dashboards using YAML or JSON files
Stars: ✭ 1,511 (+1213.91%)
Mutual labels:  big-data, data
Foundatio
Pluggable foundation blocks for building distributed apps.
Stars: ✭ 1,365 (+1086.96%)
Mutual labels:  aws, s3
Awesome Aws
A curated list of awesome Amazon Web Services (AWS) libraries, open source repos, guides, blogs, and other resources. Featuring the Fiery Meter of AWSome.
Stars: ✭ 9,895 (+8504.35%)
Mutual labels:  aws, s3
Udacity Data Engineering
Udacity Data Engineering Nano Degree (DEND)
Stars: ✭ 89 (-22.61%)
Mutual labels:  aws, s3
S3scanner
Scan for open AWS S3 buckets and dump the contents
Stars: ✭ 1,319 (+1046.96%)
Mutual labels:  aws, s3
Parquet Mr
Apache Parquet
Stars: ✭ 1,278 (+1011.3%)
Mutual labels:  big-data, parquet
Diamondb
[WIP] DiamonDB: Rebuild of time series database on AWS.
Stars: ✭ 98 (-14.78%)
Mutual labels:  aws, s3
S3transfer
Amazon S3 Transfer Manager for Python
Stars: ✭ 108 (-6.09%)
Mutual labels:  aws, s3
Historical
A serverless, event-driven AWS configuration collection service with configuration versioning.
Stars: ✭ 85 (-26.09%)
Mutual labels:  aws, s3
Dataengineeringproject
Example end-to-end data engineering project.
Stars: ✭ 82 (-28.7%)
Mutual labels:  s3, big-data
Awstaghelper
AWS bulk tagging tool
Stars: ✭ 98 (-14.78%)
Mutual labels:  aws, s3

Amazon S3 Find and Forget

Warning: Consult the Production Readiness guidelines prior to using the solution with production data.

Amazon S3 Find and Forget is a solution for selectively erasing records from data lakes stored on Amazon Simple Storage Service (Amazon S3). It can assist data lake operators in handling data erasure requests, for example, pursuant to the European General Data Protection Regulation (GDPR).

The solution can be used with Parquet- and JSON-format data stored in Amazon S3 buckets. You connect your data lake to the solution via AWS Glue tables, specifying which columns in those tables are used to identify the data to be erased.
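For illustration, a data mapper configuration for such a connection might look like the following JSON. The field names and values here are assumptions based on the description above, not the solution's actual schema; consult the API specification for the real format:

```json
{
  "DataMapperId": "customers",
  "Format": "parquet",
  "QueryExecutorParameters": {
    "Database": "my_data_lake",
    "Table": "customers"
  },
  "Columns": ["customer_id"]
}
```

Here the Glue table `customers` in database `my_data_lake` is mapped, and the `customer_id` column identifies the records to erase.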

Once configured, you can queue the record identifiers whose corresponding data you want erased, then run a deletion job to remove that data from the objects in the data lake. When the job completes, a report log lists all the S3 objects that were modified.
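As a sketch of how this flow could be scripted, the hypothetical helper below builds a request payload for queuing Match IDs against named data mappers. The endpoint path and payload fields are illustrative assumptions, not the solution's documented interface; see the API specification for the actual shape:

```python
import json

# Hypothetical API path -- check the solution's API specification for
# the real endpoint; this is an illustrative assumption.
QUEUE_PATH = "/v1/queue/matches"

def build_queue_payload(match_ids, data_mappers):
    """Build a request body that queues Match IDs for deletion.

    Each Match ID is a record identifier (e.g. a customer ID) whose
    data should be erased from the tables mapped by `data_mappers`.
    """
    return {
        "Matches": [
            {"MatchId": match_id, "DataMappers": data_mappers}
            for match_id in match_ids
        ]
    }

payload = build_queue_payload(["12345", "67890"], ["customers"])
print(json.dumps(payload, indent=2))
```

A deletion job would then be started with a separate API call, and the report log retrieved once it completes.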

Installation

The solution is available as an AWS CloudFormation template and should take about 20 to 40 minutes to deploy. See the deployment guide for one-click deployment instructions, and the cost overview guide to learn about costs.
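As a sketch, stack creation could also be scripted with boto3's CloudFormation client. The stack name and template URL below are placeholders, not the solution's actual distribution details; use the values from the deployment guide:

```python
def build_stack_request(stack_name, template_url):
    """Assemble keyword arguments for CloudFormation's create_stack call.

    The IAM capabilities are required because the solution's template
    creates IAM roles for its components.
    """
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        "Capabilities": ["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
    }

# Usage (requires boto3 and AWS credentials; shown as a comment so the
# sketch stays self-contained):
#   import boto3
#   cfn = boto3.client("cloudformation")
#   cfn.create_stack(**build_stack_request(
#       "S3F2", "https://<bucket>.s3.amazonaws.com/template.yaml"))
```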

Usage

The solution provides a web user interface and a REST API that allows you to integrate it into your own applications.

See the user guide to learn how to use the solution and the API specification to integrate the solution with your own applications.

Architecture

The goal of the solution is to provide a secure, reliable, performant and cost-effective tool for finding and removing individual records within objects stored in S3 buckets. To achieve this goal, the solution adopts the following design principles:

  1. Secure by design:
    • Every component is implemented with least privilege access
    • Encryption is performed at all layers at rest and in transit
    • Authentication is provided out of the box
    • Expiration of logs is configurable
    • Record identifiers (known as Match IDs) are automatically obfuscated or irreversibly deleted as soon as possible when persisting state
  2. Built to scale: The system is designed and tested to work with petabyte-scale data lakes containing thousands of partitions and hundreds of thousands of objects
  3. Cost optimised:
    • Perform work in batches: Removing a single record from an object costs practically the same as removing many, and data owners commonly need to remove data within a given timeframe, so the solution allows the operator to "queue" multiple matches to be removed in a single job.
    • Fail fast: A deletion job takes place in two distinct phases: Find and Forget. The Find phase queries the objects in your S3 data lakes to find any objects containing records where a specified column holds at least one of the Match IDs in the deletion queue. If any query fails, the job is abandoned as soon as possible and the Forget phase does not take place. The Forget phase takes the list of objects returned by the Find phase and deletes only the relevant rows in those objects.
    • Optimised for Parquet: The two-phase approach optimises scanning for dense columnar formats such as Parquet. The Find phase retrieves and processes only the relevant columns when determining which S3 objects need to be processed in the Forget phase. This approach can yield significant cost savings when operating on large data lakes with sparse matches.
    • Serverless: Where possible, the solution uses only serverless components to avoid paying for idle resources. The Web UI, API and Deletion Jobs are all serverless (for more information, consult the Cost Overview guide).
  4. Robust monitoring and logging: When deletion jobs run, progress information is surfaced in real time for visibility. After a job completes, detailed reports document every action performed on individual S3 objects, with detailed error traces in case of failure to facilitate troubleshooting and identify remediation actions. For more information, consult the Troubleshooting guide.
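The two-phase Find and Forget flow described above can be illustrated with an in-memory sketch. Plain Python lists stand in for S3 objects here; the real solution operates on Parquet and JSON objects via the AWS services described above:

```python
# In-memory stand-in for a data lake: object key -> list of records.
data_lake = {
    "data/part-0001": [
        {"customer_id": "12345", "order": "A-1"},
        {"customer_id": "99999", "order": "A-2"},
    ],
    "data/part-0002": [
        {"customer_id": "55555", "order": "B-1"},
    ],
}

def find_phase(lake, column, match_ids):
    """Find phase: scan only the identifier column of each object and
    return the keys of objects containing at least one Match ID."""
    return [
        key
        for key, records in lake.items()
        if any(record[column] in match_ids for record in records)
    ]

def forget_phase(lake, keys, column, match_ids):
    """Forget phase: rewrite only the flagged objects, dropping the
    rows whose identifier column matches a queued Match ID."""
    for key in keys:
        lake[key] = [r for r in lake[key] if r[column] not in match_ids]

match_ids = {"12345"}
flagged = find_phase(data_lake, "customer_id", match_ids)
forget_phase(data_lake, flagged, "customer_id", match_ids)
print(flagged)                      # ['data/part-0001']
print(data_lake["data/part-0001"])  # only the "99999" row remains
```

Note that only `data/part-0001` is rewritten; `data/part-0002` contains no matches, so the Forget phase never touches it. This is the source of the cost savings on sparse matches.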

High-level overview diagram

Architecture Diagram

See the Architecture guide to learn more about the architecture.

Documentation

Contributing

Contributions are more than welcome. Please read the code of conduct and the contributing guidelines.

License Summary

This project is licensed under the Apache-2.0 License.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].