amazon-archives / s3bundler

Licence: other
ARCHIVED - see https://aws.amazon.com/about-aws/whats-new/2019/04/Amazon-S3-Introduces-S3-Batch-Operations-for-Object-Management/

Amazon S3 Bundler

Amazon S3 Bundler downloads billions of small S3 objects, bundles them into archives, and uploads them back into S3.

s3grouper takes a manifest.json from S3 inventory as input, splits it into manifests by number of objects and total size, and writes them to S3.
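
The splitting step can be pictured as a greedy pass over the inventory rows. This is a minimal sketch under stated assumptions, not s3grouper's actual code: the function name `split_manifest` and the limits `max_objects` and `max_bytes` are hypothetical, and rows are assumed to be already parsed into (key, size) pairs.

```python
from typing import Iterable, Iterator, List, Tuple

InventoryRow = Tuple[str, int]  # (object key, size in bytes)


def split_manifest(
    rows: Iterable[InventoryRow],
    max_objects: int,
    max_bytes: int,
) -> Iterator[List[InventoryRow]]:
    """Greedily group rows so each group stays within both limits.

    An object larger than max_bytes still gets a group of its own.
    """
    group: List[InventoryRow] = []
    group_bytes = 0
    for key, size in rows:
        too_many = len(group) >= max_objects
        too_big = bool(group) and group_bytes + size > max_bytes
        if too_many or too_big:
            # Close the current group before it would exceed a limit.
            yield group
            group, group_bytes = [], 0
        group.append((key, size))
        group_bytes += size
    if group:
        yield group
```

Each yielded group would correspond to one smaller manifest; how those manifests are named and serialized is not covered by this sketch.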

s3bundler takes an s3grouper manifest as input, either from SQS or as a CLI argument. It copies the objects into a tar archive, or directly to S3 if they are too big to bundle. It writes an index with metadata, tags, etc. alongside the archive. If an object was uploaded multipart, s3bundler calculates its md5sum; otherwise it uses the ETag. For a variety of S3 errors, the affected objects are written to a DLQ index to be reviewed later.
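
The ETag-versus-md5sum decision can be sketched as follows. This is an illustration only, not s3bundler's code: the helper names are hypothetical, and it relies on the convention that multipart ETags end in `-<part count>`. It also assumes the object is not encrypted with SSE-KMS or SSE-C, since for those objects the ETag is not an MD5 digest even for single-part uploads.

```python
import hashlib
import re

# Multipart ETags look like "<32 hex chars>-<part count>".
_MULTIPART_ETAG = re.compile(r"^[0-9a-f]{32}-\d+$")


def is_multipart_etag(etag: str) -> bool:
    """Return True if the ETag has the multipart "<hex>-<parts>" shape."""
    return bool(_MULTIPART_ETAG.match(etag.strip('"').lower()))


def object_md5(etag: str, body: bytes) -> str:
    """Return a hex MD5 for the object.

    Single-part uploads reuse the ETag; multipart uploads hash the body.
    """
    cleaned = etag.strip('"').lower()
    if is_multipart_etag(cleaned):
        return hashlib.md5(body).hexdigest()
    return cleaned
```

Reusing the ETag avoids hashing bytes that S3 has already digested, which matters when processing very large numbers of small objects.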

If SQS is used, a separate thread updates the message's visibility timeout while the manifest is being processed. The process handles 2 messages from SQS and then exits.
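
A visibility-timeout heartbeat of this kind might look like the sketch below. It is not taken from s3bundler: the class name `VisibilityHeartbeat` and its parameters are hypothetical. The only AWS call used is the boto3 SQS client method `change_message_visibility`, and the client is passed in so that any object with that method works.

```python
import threading


class VisibilityHeartbeat:
    """Extend an SQS message's visibility timeout until the block exits."""

    def __init__(self, sqs_client, queue_url, receipt_handle,
                 timeout_seconds=300, interval_seconds=60.0):
        self._sqs = sqs_client
        self._queue_url = queue_url
        self._receipt_handle = receipt_handle
        self._timeout = timeout_seconds
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout and True once stop is set.
        while not self._stop.wait(self._interval):
            self._sqs.change_message_visibility(
                QueueUrl=self._queue_url,
                ReceiptHandle=self._receipt_handle,
                VisibilityTimeout=self._timeout,
            )

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._stop.set()
        self._thread.join()
        return False  # never swallow exceptions from the with-block
```

Wrapping the manifest processing in `with VisibilityHeartbeat(...):` keeps the message hidden from other consumers for as long as the work runs. The interval should be comfortably shorter than the timeout.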

s3bundler and s3grouper can be run in ECS. Tasks are submitted with the manifest.json URI as input for s3grouper. s3grouper writes its manifests to an S3 bucket configured with an event notification that sends newly written objects ending in .index to SQS.
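
For illustration, an S3 event notification that forwards new `.index` objects to a queue has roughly the shape below. This is a sketch assuming the notification is configured through the S3 API; the project's CloudFormation templates may set it up differently. The function name and the optional `prefix` argument are hypothetical.

```python
def index_notification_config(queue_arn: str, prefix: str = "") -> dict:
    """Build a notification config for objects whose key ends in .index."""
    rules = [{"Name": "suffix", "Value": ".index"}]
    if prefix:
        # Optionally restrict the event to one key prefix.
        rules.append({"Name": "prefix", "Value": prefix})
    return {
        "QueueConfigurations": [
            {
                "QueueArn": queue_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {"Key": {"FilterRules": rules}},
            }
        ]
    }
```

The resulting dictionary is in the form accepted by the boto3 S3 client's `put_bucket_notification_configuration(Bucket=..., NotificationConfiguration=...)`. The queue's access policy must also allow S3 to send messages to it.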

s3bundler can run as an ECS service. It can be manually scaled to match the number of messages in SQS. 6 containers per vCPU is a reasonable load for small objects. As the average object size increases, network throughput may become a bottleneck. The container instances can be run by Spot Fleet across a variety of instance types with instance storage and reasonable networking. ECS should use the instance storage for the containers. If s3bundler can't handle a manifest, SQS will send it to a DLQ for later review.
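
The sizing rule above reduces to simple arithmetic. The helper below is a hypothetical sketch: the function name is invented, and the default of 6 containers per vCPU is the small-object figure quoted above, which should be lowered as objects get larger.

```python
def desired_task_count(visible_messages: int, total_vcpus: int,
                       containers_per_vcpu: int = 6) -> int:
    """Cap the task count at cluster capacity and at the queue depth."""
    capacity = total_vcpus * containers_per_vcpu
    return max(0, min(visible_messages, capacity))
```

For example, a cluster with 16 vCPUs and 1,000 queued manifests gives 96 tasks, while the same cluster with 10 queued manifests gives 10. The result could be applied with the boto3 ECS client's `update_service(cluster=..., service=..., desiredCount=...)`.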

Setting up S3 Storage Inventory

Edit the destination inventory bucket in the command below, and list all of the buckets to be archived in a file, one bucket name per line (/tmp/bucketlist in this example).

while read -r bucket; do
    aws s3api put-bucket-inventory-configuration --bucket "$bucket" --id export --inventory-configuration '{
        "IncludedObjectVersions": "Current",
        "OptionalFields": [
            "Size",
            "LastModifiedDate",
            "StorageClass"
        ],
        "Schedule": {
            "Frequency": "Weekly"
        },
        "Id": "export",
        "Destination": {
            "S3BucketDestination": {
                "Format": "CSV",
                "Bucket": "arn:aws:s3:::your-inventory-bucket"
            }
        },
        "IsEnabled": true
    }'
done < /tmp/bucketlist
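
The same inventory configuration can be applied from Python. This is a sketch rather than part of the project: the function names are hypothetical, and the client is expected to be a boto3 S3 client, whose `put_bucket_inventory_configuration` method takes the same fields as the CLI call.

```python
def inventory_configuration(destination_bucket_arn: str,
                            config_id: str = "export") -> dict:
    """Build the weekly CSV inventory configuration used in this README."""
    return {
        "Id": config_id,
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "OptionalFields": ["Size", "LastModifiedDate", "StorageClass"],
        "Schedule": {"Frequency": "Weekly"},
        "Destination": {
            "S3BucketDestination": {
                "Format": "CSV",
                "Bucket": destination_bucket_arn,
            }
        },
    }


def enable_inventory(s3_client, bucket_names, destination_bucket_arn):
    """Apply the configuration to every source bucket in bucket_names."""
    config = inventory_configuration(destination_bucket_arn)
    for bucket in bucket_names:
        s3_client.put_bucket_inventory_configuration(
            Bucket=bucket,
            Id=config["Id"],
            InventoryConfiguration=config,
        )
```

With boto3 installed, `enable_inventory(boto3.client("s3"), buckets, "arn:aws:s3:::your-inventory-bucket")` would mirror the shell loop, where `buckets` is the list read from the bucket file.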

Getting Started

Pre-requisites:

  • awscli (running commands)
  • boto3 (driver submitting jobs)
  • docker (building docker images)
  1. Set up two ECR repositories:
aws ecr create-repository --repository-name myproj/s3grouper
aws ecr create-repository --repository-name myproj/s3bundler
  2. Build both Docker images from their dockerfiles, then tag and push them to ECR:
pushd s3bundler
docker build -f s3bundler.dockerfile -t 's3bundler:latest' .
popd
pushd s3grouper
docker build -f s3grouper.dockerfile -t 's3grouper:latest' .
popd
docker tag s3bundler:latest <ecr_repository:s3bundler>
docker tag s3grouper:latest <ecr_repository:s3grouper>
$(aws ecr get-login)
docker push <ecr_repository:s3grouper>
docker push <ecr_repository:s3bundler>
  3. Once you have verified that the images are in ECR, deploy the CloudFormation stacks as follows.

s3bundler-spotfleet.cfn.py generates different CloudFormation templates depending on the region and instance family specified.

pip install troposphere
python s3bundler.cfn.py > s3bundler.cfn.json
python s3bundler-spotfleet.cfn.py -f i3 --region us-east-1 > s3bundler-spotfleet.cfn.json
  1. First create a stack with s3bundler.cfn.json. Note the value of ECSCluster in the Outputs; you'll need it as input for the next step.

  2. Create a stack with s3bundler-spotfleet.cfn.json.

You should now have an ECS cluster running on Spot Fleet in your environment, and you are ready to submit jobs using submitmanifest.py.

Monitoring

The following CloudWatch metrics are useful for watching progress.

  • S3Bundler/Errors - A few of these here and there should not be an issue. If there are a lot, check the logs for throttling or permissions issues. If S3 is throttling you, scale in your ECS cluster. If you see permissions issues, make sure the bucket policy allows the newly created IAM role to access objects. In either case, you will want to manually add the manifest back to the queue.
  • SQS/ApproximateNumberOfMessagesVisible - Shows the manifests waiting to start processing.
  • SQS/ApproximateNumberOfMessagesNotVisible - Shows the number of manifests currently being processed.
  • SQS/NumberOfMessagesDeleted - Shows manifest completion over time.
  • EC2Spot/CPUUtilization - It may be useful to show CPU utilization. If it is too high, it may be necessary to tune the s3bundler task to use less.
  • EC2Spot/PendingCapacity - This will increase if there is contention in the spot market. The overall job may take longer.
  • EC2Spot/FulfilledCapacity
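
The queue-depth metric can also be read programmatically. The sketch below is not part of the project and its function names are hypothetical. It builds a request for the boto3 CloudWatch client's `get_metric_statistics`, using `AWS/SQS` as the namespace under which SQS publishes its metrics, and takes the queue name as an input because this README does not state it.

```python
from datetime import datetime, timedelta, timezone


def queue_depth_query(queue_name, minutes=60, now=None):
    """Build get_metric_statistics arguments for waiting manifests."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 300,  # seconds per datapoint
        "Statistics": ["Average"],
    }


def latest_queue_depth(cloudwatch_client, queue_name, minutes=60):
    """Return the most recent average queue depth, or None if no data."""
    response = cloudwatch_client.get_metric_statistics(
        **queue_depth_query(queue_name, minutes)
    )
    # Datapoints are not guaranteed to arrive in time order.
    datapoints = sorted(response.get("Datapoints", []),
                        key=lambda point: point["Timestamp"])
    return datapoints[-1]["Average"] if datapoints else None
```

A value that stays flat while tasks are running suggests that manifests are failing and returning to the queue, which is worth cross-checking against S3Bundler/Errors.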

Submitting Jobs

echo s3://inventorybucket/sourcebucket/inventory/2017-03-30T18-06Z/manifest.json | python submitmanifest.py --region us-east-1 -c s3bundler-ECSCluster-YYYYYYYYYYYY -t arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task-definition/s3grouper:7 -- -b s3bundler-archivebucket-yyyyyyyyyyyy -p manifests

Common Errors

Access Denied

If s3bundler can't read objects in your source buckets, you may need to add the TaskRole created by the s3bundler CloudFormation stack to the whitelists in the source buckets' bucket policies.
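
A bucket policy statement granting the task role read access might look like the sketch below. The function names, the `Sid`, and the list of actions are assumptions chosen for illustration (reading objects and their tags); the permissions s3bundler actually needs should be checked against the IAM role in the CloudFormation template.

```python
def task_role_read_statement(task_role_arn: str, source_bucket: str) -> dict:
    """Build a statement letting the task role read objects and tags."""
    return {
        "Sid": "AllowS3BundlerRead",  # hypothetical statement id
        "Effect": "Allow",
        "Principal": {"AWS": task_role_arn},
        "Action": ["s3:GetObject", "s3:GetObjectTagging"],  # assumed set
        "Resource": f"arn:aws:s3:::{source_bucket}/*",
    }


def policy_document(statements) -> dict:
    """Wrap statements in a standard IAM policy document."""
    return {"Version": "2012-10-17", "Statement": list(statements)}
```

When a bucket already has a policy, the new statement should be merged into the existing `Statement` list rather than replacing the whole document.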

Throttling

If writes are frequently throttled, you will either need to reduce the concurrency or request that the S3 team prepare the bucket for the throughput needed.
