awslabs / Amazon S3 Find And Forget

License: Apache-2.0
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR).

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Amazon S3 Find And Forget

Streamx
kafka-connect-s3: Ingest data from Kafka to object stores (S3)
Stars: ✭ 96 (-16.52%)
Mutual labels:  aws, s3, big-data
Data Processing Agreements
Collection of Data Processing Agreement (DPA) and GDPR compliance resources
Stars: ✭ 110 (-4.35%)
Mutual labels:  data, gdpr, privacy
Locopy
locopy: Loading/Unloading to Redshift and Snowflake using Python.
Stars: ✭ 73 (-36.52%)
Mutual labels:  aws, s3, data
Presidio
Context aware, pluggable and customizable data protection and anonymization SDK for text and images
Stars: ✭ 1,320 (+1047.83%)
Mutual labels:  gdpr, privacy
S3 Beam
🚀 direct-to-S3 uploading using ClojureScript
Stars: ✭ 91 (-20.87%)
Mutual labels:  aws, s3
Rpcheckup
rpCheckup is an AWS resource policy security checkup tool that identifies public, external account access, intra-org account access, and private resources.
Stars: ✭ 91 (-20.87%)
Mutual labels:  aws, s3
Dropdot
☁️ Direct Upload to Amazon S3 With CORS demo. Built with Node/Express
Stars: ✭ 87 (-24.35%)
Mutual labels:  aws, s3
Data Protection Mapping Project
Open Source Data Protection/Privacy Regulatory Mapping Project
Stars: ✭ 96 (-16.52%)
Mutual labels:  gdpr, privacy
Aws Workflows On Github
Workflows for automation of AWS services setup from Github CI/CD
Stars: ✭ 95 (-17.39%)
Mutual labels:  aws, s3
Just Dashboard
📊 📋 Dashboards using YAML or JSON files
Stars: ✭ 1,511 (+1213.91%)
Mutual labels:  big-data, data
Foundatio
Pluggable foundation blocks for building distributed apps.
Stars: ✭ 1,365 (+1086.96%)
Mutual labels:  aws, s3
Awesome Aws
A curated list of awesome Amazon Web Services (AWS) libraries, open source repos, guides, blogs, and other resources. Featuring the Fiery Meter of AWSome.
Stars: ✭ 9,895 (+8504.35%)
Mutual labels:  aws, s3
Udacity Data Engineering
Udacity Data Engineering Nano Degree (DEND)
Stars: ✭ 89 (-22.61%)
Mutual labels:  aws, s3
S3scanner
Scan for open AWS S3 buckets and dump the contents
Stars: ✭ 1,319 (+1046.96%)
Mutual labels:  aws, s3
Parquet Mr
Apache Parquet
Stars: ✭ 1,278 (+1011.3%)
Mutual labels:  big-data, parquet
Diamondb
[WIP] DiamonDB: Rebuild of time series database on AWS.
Stars: ✭ 98 (-14.78%)
Mutual labels:  aws, s3
S3transfer
Amazon S3 Transfer Manager for Python
Stars: ✭ 108 (-6.09%)
Mutual labels:  aws, s3
Historical
A serverless, event-driven AWS configuration collection service with configuration versioning.
Stars: ✭ 85 (-26.09%)
Mutual labels:  aws, s3
Dataengineeringproject
Example end-to-end data engineering project.
Stars: ✭ 82 (-28.7%)
Mutual labels:  s3, big-data
Awstaghelper
AWS bulk tagging tool
Stars: ✭ 98 (-14.78%)
Mutual labels:  aws, s3

Amazon S3 Find and Forget

Warning: Consult the Production Readiness guidelines prior to using the solution with production data.

Amazon S3 Find and Forget is a solution for selectively erasing records from data lakes stored on Amazon Simple Storage Service (Amazon S3). It can assist data lake operators in handling data erasure requests, for example, pursuant to the European General Data Protection Regulation (GDPR).

The solution can be used with Parquet- and JSON-format data stored in Amazon S3 buckets. You connect your data lake to the solution via AWS Glue tables, specifying which columns in those tables are used to identify the data to be erased.
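For illustration, a data mapper configuration for such a connection might look like the following JSON. The field names and values here are assumptions based on the description above, not the solution's actual schema; consult the API specification for the real format:

```json
{
  "DataMapperId": "customers",
  "Format": "parquet",
  "QueryExecutorParameters": {
    "Database": "my_data_lake",
    "Table": "customers"
  },
  "Columns": ["customer_id"]
}
```

Here the Glue table `customers` in database `my_data_lake` is mapped, and the `customer_id` column identifies the records to erase.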

Once configured, you can queue the record identifiers whose corresponding data you want erased, then run a deletion job to remove that data from the objects in the data lake. When the job completes, a report log lists all the S3 objects that were modified.
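As a sketch of how this flow could be scripted, the hypothetical helper below builds a request payload for queuing Match IDs against named data mappers. The endpoint path and payload fields are illustrative assumptions, not the solution's documented interface; see the API specification for the actual shape:

```python
import json

# Hypothetical API path -- check the solution's API specification for
# the real endpoint; this is an illustrative assumption.
QUEUE_PATH = "/v1/queue/matches"

def build_queue_payload(match_ids, data_mappers):
    """Build a request body that queues Match IDs for deletion.

    Each Match ID is a record identifier (e.g. a customer ID) whose
    data should be erased from the tables mapped by `data_mappers`.
    """
    return {
        "Matches": [
            {"MatchId": match_id, "DataMappers": data_mappers}
            for match_id in match_ids
        ]
    }

payload = build_queue_payload(["12345", "67890"], ["customers"])
print(json.dumps(payload, indent=2))
```

A deletion job would then be started with a separate API call, and the report log retrieved once it completes.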

Installation

The solution is available as an AWS CloudFormation template and should take about 20 to 40 minutes to deploy. See the deployment guide for one-click deployment instructions, and the cost overview guide to learn about costs.
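As a sketch, stack creation could also be scripted with boto3's CloudFormation client. The stack name and template URL below are placeholders, not the solution's actual distribution details; use the values from the deployment guide:

```python
def build_stack_request(stack_name, template_url):
    """Assemble keyword arguments for CloudFormation's create_stack call.

    The IAM capabilities are required because the solution's template
    creates IAM roles for its components.
    """
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        "Capabilities": ["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
    }

# Usage (requires boto3 and AWS credentials; shown as a comment so the
# sketch stays self-contained):
#   import boto3
#   cfn = boto3.client("cloudformation")
#   cfn.create_stack(**build_stack_request(
#       "S3F2", "https://<bucket>.s3.amazonaws.com/template.yaml"))
```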

Usage

The solution provides a web user interface and a REST API that allows you to integrate it into your own applications.

See the user guide to learn how to use the solution and the API specification to integrate the solution with your own applications.

Architecture

The goal of the solution is to provide a secure, reliable, performant and cost-effective tool for finding and removing individual records within objects stored in S3 buckets. To achieve this goal, the solution adopts the following design principles:

  1. Secure by design:
    • Every component is implemented with least privilege access
    • Encryption is performed at all layers at rest and in transit
    • Authentication is provided out of the box
    • Expiration of logs is configurable
    • Record identifiers (known as Match IDs) are automatically obfuscated or irreversibly deleted as soon as possible when persisting state
  2. Built to scale: The system is designed and tested to work with petabyte-scale data lakes containing thousands of partitions and hundreds of thousands of objects
  3. Cost optimised:
    • Perform work in batches: Removing a single record from an object costs practically the same as removing many, and data owners commonly need to remove data within a given timeframe, so the solution allows the operator to "queue" multiple matches to be removed in a single job.
    • Fail fast: A deletion job takes place in two distinct phases: Find and Forget. The Find phase queries the objects in your S3 data lakes to find any objects containing records where a specified column holds at least one of the Match IDs in the deletion queue. If any query fails, the job is abandoned as soon as possible and the Forget phase does not take place. The Forget phase takes the list of objects returned by the Find phase and deletes only the relevant rows in those objects.
    • Optimised for Parquet: The two-phase approach optimises scanning for dense columnar formats such as Parquet. The Find phase retrieves and processes only the relevant columns when determining which S3 objects need to be processed in the Forget phase. This approach can yield significant cost savings when operating on large data lakes with sparse matches.
    • Serverless: Where possible, the solution uses only serverless components to avoid paying for idle resources. The Web UI, API and Deletion Jobs are all serverless (for more information, consult the Cost Overview guide).
  4. Robust monitoring and logging: When deletion jobs run, progress information is surfaced in real time for visibility. After a job completes, detailed reports document every action performed on individual S3 objects, with detailed error traces in case of failure to facilitate troubleshooting and identify remediation actions. For more information, consult the Troubleshooting guide.
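The two-phase Find and Forget flow described above can be illustrated with an in-memory sketch. Plain Python lists stand in for S3 objects here; the real solution operates on Parquet and JSON objects via the AWS services described above:

```python
# In-memory stand-in for a data lake: object key -> list of records.
data_lake = {
    "data/part-0001": [
        {"customer_id": "12345", "order": "A-1"},
        {"customer_id": "99999", "order": "A-2"},
    ],
    "data/part-0002": [
        {"customer_id": "55555", "order": "B-1"},
    ],
}

def find_phase(lake, column, match_ids):
    """Find phase: scan only the identifier column of each object and
    return the keys of objects containing at least one Match ID."""
    return [
        key
        for key, records in lake.items()
        if any(record[column] in match_ids for record in records)
    ]

def forget_phase(lake, keys, column, match_ids):
    """Forget phase: rewrite only the flagged objects, dropping the
    rows whose identifier column matches a queued Match ID."""
    for key in keys:
        lake[key] = [r for r in lake[key] if r[column] not in match_ids]

match_ids = {"12345"}
flagged = find_phase(data_lake, "customer_id", match_ids)
forget_phase(data_lake, flagged, "customer_id", match_ids)
print(flagged)                      # ['data/part-0001']
print(data_lake["data/part-0001"])  # only the "99999" row remains
```

Note that only `data/part-0001` is rewritten; `data/part-0002` contains no matches, so the Forget phase never touches it. This is the source of the cost savings on sparse matches.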

High-level overview diagram

Architecture Diagram

See the Architecture guide to learn more about the architecture.

Documentation

Contributing

Contributions are more than welcome. Please read the code of conduct and the contributing guidelines.

License Summary

This project is licensed under the Apache-2.0 License.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].