dacort / athena-sqlite

License: Apache-2.0
A SQLite driver for S3 and Amazon Athena 😳


Athena SQLite Driver

Using Athena's new Query Federation functionality, read SQLite databases from S3.

Install it from the Serverless Application Repository: AthenaSQLiteConnector.

Why?

I occasionally like to put together fun side projects over Thanksgiving and Christmas holidays.

I'd always joked it would be a crazy idea to be able to read SQLite using Athena, so...here we are!

How?

The PyArrow library unfortunately weighs in at over 250MB, so we have to use a custom compilation step to build a Lambda Layer.

What?

Drop SQLite databases in a single prefix in S3, and Athena will list each file as a database and automatically detect tables and schemas.
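
For example, with a hypothetical prefix of data/sqlite/, each file under the prefix becomes its own database (metrics.sqlite here is purely illustrative; only sample_data.sqlite ships with this repo):

s3://<TARGET_BUCKET>/data/sqlite/sample_data.sqlite  ->  database "sample_data"
s3://<TARGET_BUCKET>/data/sqlite/metrics.sqlite      ->  database "metrics"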

Currently, all data types are strings. I'll fix this eventually. All good things in time.

Status

This project is under active development and very much in its infancy.

Many things are hard-coded or broken into various pieces as I experiment and figure out how everything works.

Building

The documentation for this is a work in progress. The process currently sits somewhere between me creating the resources manually and building the assets for the AWS SAR, and most of these docs will eventually be automated away.

Requirements

  • Docker
  • Python 3.7

Lambda layer

First you need to build the Lambda layer. There are two Dockerfiles and build scripts in the lambda-layer/ directory.

We'll execute each of the build scripts and copy the results to the target directory. This is referenced by the SAR template, athena-sqlite.yaml.

cd lambda-layer
./build.sh
./build-pyarrow.sh
cp -R layer/ ../target/
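
If you want to test the Lambda function by hand before packaging everything with SAR, you can also publish the layer yourself. This is a sketch, not part of the official build: the layer name is made up, and you may need to adjust the paths so that the python/ directory ends up at the root of the zip, which is what Lambda layers expect.

cd ../target/layer
zip -r ../layer.zip .
aws lambda publish-layer-version \
    --layer-name athena-sqlite-deps \
    --zip-file fileb://../layer.zip \
    --compatible-runtimes python3.7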

Upload sample data

For the purpose of this test, we just have a sample sqlite database you can upload.

aws s3 cp sample-data/sample_data.sqlite s3://<TARGET_BUCKET>/<TARGET_PREFIX>/

Feel free to upload your own SQLite databases as well!
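
If you want to generate your own test database instead, here's a minimal sketch using Python's built-in sqlite3 module. The records table name matches the sample queries further down; the columns and values are made up for illustration.

import sqlite3

# Create a small SQLite database with a single "records" table.
conn = sqlite3.connect("my_data.sqlite")
conn.execute("CREATE TABLE records (id INTEGER, name TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [(1, "alpha", "100"), (2, "beta", "200")],
)
conn.commit()
conn.close()

Then upload it next to the sample database:

aws s3 cp my_data.sqlite s3://<TARGET_BUCKET>/<TARGET_PREFIX>/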

Lambda function

There are three components to the Lambda code:

  • vfs.py - A SQLite Virtual File System implementation for S3 (a simplified sketch of the core idea follows this list)
  • s3qlite.py - The actual Lambda function that handles Athena metadata/data requests
  • sqlite_db.py - Helper functions for accessing SQLite databases on S3
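
To give a rough feel for what the VFS layer is doing, here is a simplified sketch of serving SQLite's reads with S3 ranged GETs via boto3. It is not the actual vfs.py, which has to satisfy the full SQLite VFS contract, and the helper name is made up.

import boto3

s3 = boto3.client("s3")

def ranged_read(bucket, key, offset, length):
    # SQLite asks its VFS for specific byte ranges (pages) of the database
    # file, which maps neatly onto S3's Range header.
    end = offset + length - 1
    resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={offset}-{end}")
    return resp["Body"].read()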

Create a function with the code in lambda-function/s3qlite.py that uses the previously created layer. The handler will be s3qlite.lambda_handler. Also include the vfs.py and sqlite_db.py files in your Lambda function.

Configure two environment variables for your Lambda function:

  • TARGET_BUCKET - The name of your S3 bucket where SQLite files live
  • TARGET_PREFIX - The prefix (e.g. data/sqlite) that you uploaded the sample sqlite database to

Note that the IAM role you associate the function with will also need s3:GetObject and s3:ListBucket access to wherever your lovely SQLite databases are stored.
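
If you're wiring this up by hand from the CLI rather than through the SAR template, the function creation might look roughly like this. The role, layer ARN, and bucket/prefix values are placeholders; the zip is assumed to be built from the three files in lambda-function/.

cd lambda-function
zip function.zip s3qlite.py vfs.py sqlite_db.py
aws lambda create-function \
    --function-name s3qlite \
    --runtime python3.7 \
    --handler s3qlite.lambda_handler \
    --role arn:aws:iam::<ACCOUNT_ID>:role/<LAMBDA_ROLE> \
    --layers <LAYER_ARN> \
    --zip-file fileb://function.zip \
    --environment 'Variables={TARGET_BUCKET=<TARGET_BUCKET>,TARGET_PREFIX=<TARGET_PREFIX>}'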

Configure Athena

Follow the Athena documentation for Connecting to a data source. The primary thing to note here is that you need to create a workgroup named AmazonAthenaPreviewFunctionality and use that for your testing. Some functionality will work in the primary workgroup, but you'll get weird errors when you try to query data.
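
If you don't have that workgroup yet, it can be created from the CLI; the query results location here is an assumption, so point it at whichever bucket you normally use for Athena results.

aws athena create-work-group \
    --name AmazonAthenaPreviewFunctionality \
    --configuration 'ResultConfiguration={OutputLocation=s3://<QUERY_RESULTS_BUCKET>/}'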

I named my function s3qlite :)

Run queries!

Here are a couple of basic queries that should work:

SELECT * FROM "s3qlite"."sample_data"."records" limit 10;

SELECT COUNT(*) FROM "s3qlite"."sample_data"."records";

If you deploy the SAR app, the data catalog isn't registered automatically, but you can still run queries by using the special lambda: schema:

SELECT * FROM "lambda:s3qlite".sample_data.records LIMIT 10;

Where s3qlite is the value you provided for the AthenaCatalogName parameter.
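
If you prefer to run the queries programmatically rather than from the console, here is a boto3 sketch. The results bucket is a placeholder, and it assumes the preview workgroup described above.

import time
import boto3

athena = boto3.client("athena")

# Start the query in the preview workgroup.
qid = athena.start_query_execution(
    QueryString='SELECT * FROM "s3qlite"."sample_data"."records" LIMIT 10',
    WorkGroup="AmazonAthenaPreviewFunctionality",
    ResultConfiguration={"OutputLocation": "s3://<QUERY_RESULTS_BUCKET>/"},
)["QueryExecutionId"]

# Poll until the query finishes, then print the rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])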

TODO

  • Move these into issues :)
  • Move vfs.py into its own module
    • Maybe add write support to it someday 😱
  • Publish to SAR
  • Add tests...always tests
  • struct types, probably
  • Don't read the entire file every time :)
  • Escape column names with invalid characters
  • Implement recursive listing

Serverless App Repo

These are mostly notes I made while figuring out how to get SAR working.

You need to grant SAR access to the bucket:

aws s3api put-bucket-policy --bucket <BUCKET> --region us-east-1 --policy '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service":  "serverlessrepo.amazonaws.com"
      },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::<BUCKET>/*"
    }
  ]
}'

For publishing to the SAR, we just execute two commands:

sam package --template-file athena-sqlite.yaml --s3-bucket <BUCKET> --output-template-file target/out.yaml
sam publish --template target/out.yaml --region us-east-1

If you want to deploy using CloudFormation, use this command:

sam deploy --template-file ./target/out.yaml --stack-name athena-sqlite --capabilities CAPABILITY_IAM --parameter-overrides 'DataBucket=<BUCKET> DataPrefix=tmp/sqlite' --region us-east-1