ExpediaGroup / apiary-data-lake

License: Apache-2.0
Terraform scripts for deploying Apiary Data Lake

Overview

This repo contains a Terraform module to deploy the Apiary data lake component. The module deploys various stateful components in a typical Hadoop-compatible data lake in AWS.

For more information please refer to the main Apiary project page.

Architecture

Datalake architecture diagram

Key Features

  • Highly available (HA) metastore service - packaged as a Docker container and running on an ECS Fargate cluster.
  • PrivateLinks - Network load balancers and VPC endpoints to enable federated access to read-only and read/write metastores.
  • Managed schemas - An integrated way of managing Hive schemas, S3 buckets and bucket policies.
  • SNS Listener - A Hive metastore event listener that publishes all metadata updates to an SNS topic. See ApiarySNSListener for more details.
  • Gluesync - A metastore event listener to replay Hive metadata events in a Glue catalog.
  • Metastore authorization - A metastore pre-event listener to handle authorization using Ranger.
  • Grafana dashboard - If deployed in EKS, a Grafana dashboard will be created that shows S3 bucket sizes for each Apiary bucket.

Variables

Please refer to VARIABLES.md.

Usage

NB: This module must currently be run from a machine with the bash, aws, mysql, and jq CLI tools installed.

Example module invocation:

module "apiary" {
  source                   = "git::https://github.com/ExpediaGroup/apiary-data-lake.git"
  aws_region               = "us-west-2"
  instance_name            = "test"
  apiary_tags              = "${var.tags}"
  private_subnets          = ["subnet1", "subnet2", "subnet3"]
  vpc_id                   = "vpc-123456"
  hms_docker_image         = "${aws_account}.dkr.ecr.${aws_region}.amazonaws.com/apiary-metastore"
  hms_docker_version       = "1.0.0"
  hms_ro_cpu               = "2048"
  hms_rw_cpu               = "2048"
  hms_ro_heapsize          = "8192"
  hms_rw_heapsize          = "8192"
  apiary_log_bucket        = "s3-logs-bucket"
  db_instance_class        = "db.t2.medium"
  db_backup_retention      = "7"
  apiary_managed_schemas   = [
    {
        schema_name = "db1",
        s3_lifecycle_policy_transition_period = "30"
    },
    {
        schema_name = "db_2",
        s3_storage_class = "INTELLIGENT_TIERING"
    },
    {
        schema_name = "secure_db",
        encryption   = "aws:kms" //supported values for encryption are AES256,aws:kms
        admin_roles = "role1_arn,role2_arn" //kms key management will be restricted to these roles.
        client_roles = "role3_arn,role4_arn" //s3 bucket read/write and kms key usage will be restricted to these roles.
        customer_accounts = "account_id1,account_id2" //this will override module level apiary_customer_accounts
    }
  ]
  apiary_customer_accounts = ["aws_account_no_1", "aws_account_no_2"]
  # A single policy with multiple conditions combines them with the AND operator:
  # https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_multi-value-conditions.html
  # A ";" creates a separate policy for each condition, which effectively enables the OR operator.
  apiary_customer_condition = <<EOF
    "ForAnyValue:StringEquals": {"s3:ExistingObjectTag/security": [ "public"] };
    "StringLike": {"s3:ExistingObjectTag/type": "image*" }
  EOF
  ingress_cidr             = ["10.0.0.0/8"]
  apiary_assume_roles      = [
    {
        name = "client_name"
        principals = [ "arn:aws:iam::account_number:role/cross-account-role" ]
        schema_names = [ "dm","lz","test_1" ]
        max_role_session_duration_seconds = "7200",
        allow_cross_region_access = true 
    }
  ]
}
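
The comments on apiary_customer_condition in the example above describe two ways of combining conditions. The fragment below is a hedged sketch of how the two forms might look side by side. The local names condition_and and condition_or are hypothetical, and the comma-separated AND form is an assumption inferred from those comments rather than something confirmed here, so check it against the module source before relying on it.

```hcl
locals {
  # Assumed AND form: no ";" separator, so both condition operators
  # end up in the same policy and must both match.
  condition_and = <<EOF
    "ForAnyValue:StringEquals": {"s3:ExistingObjectTag/security": ["public"]},
    "StringLike": {"s3:ExistingObjectTag/type": "image*"}
  EOF

  # OR form, as in the example invocation: ";" splits the string into
  # one policy per condition, so matching either condition is enough.
  condition_or = <<EOF
    "ForAnyValue:StringEquals": {"s3:ExistingObjectTag/security": ["public"]};
    "StringLike": {"s3:ExistingObjectTag/type": "image*"}
  EOF
}
```

Either local could then be passed to the module as apiary_customer_condition = local.condition_or.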

Notes

The Apiary metastore Docker image is not yet published to a public repository. You can build it from this repo and then publish it to your own ECR.

In k8s deployment mode, IAM roles can be attached to metastore pods using either IRSA or KIAM. The module uses IRSA when the oidc_provider variable is configured, and KIAM when the kiam_arn variable is configured.
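
As a rough illustration of that choice, the two fragments below set one variable or the other. Only oidc_provider and kiam_arn are named in this document. The hms_instance_type setting, the module labels, and all values are assumptions and placeholders for illustration; VARIABLES.md is the place to confirm the actual names and formats.

```hcl
# Option A: IRSA. Setting oidc_provider selects IRSA.
module "apiary_irsa" {
  source            = "git::https://github.com/ExpediaGroup/apiary-data-lake.git"
  hms_instance_type = "k8s"                                         # assumed switch for k8s mode
  oidc_provider     = "oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE" # placeholder value
  # ... remaining arguments as in the example invocation under Usage
}

# Option B: KIAM. Setting kiam_arn selects KIAM.
module "apiary_kiam" {
  source            = "git::https://github.com/ExpediaGroup/apiary-data-lake.git"
  hms_instance_type = "k8s"                                         # assumed switch for k8s mode
  kiam_arn          = "arn:aws:iam::123456789012:role/kiam-server"  # placeholder ARN
  # ... remaining arguments as in the example invocation under Usage
}
```

In practice only one of the two blocks would be used in a given deployment.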

Contact

Mailing List

If you would like to ask any questions about or discuss Apiary please join our mailing list at

https://groups.google.com/forum/#!forum/apiary-user

Legal

This project is available under the Apache 2.0 License.

Copyright 2018-2019 Expedia, Inc.
