datarootsio / terraform-module-azure-datalake

License: MIT
Terraform module for an Azure Data Lake

Programming Languages

  • HCL
  • Go
  • Shell
  • Makefile

Projects that are alternatives of or similar to terraform-module-azure-datalake

Azure-Certification-DP-200
Road to Azure Data Engineer Part-I: DP-200 - Implementing an Azure Data Solution
Stars: ✭ 54 (+92.86%)
Mutual labels:  data-lake
analyzing-reddit-sentiment-with-aws
Learn how to use Kinesis Firehose, AWS Glue, S3, and Amazon Athena by streaming and analyzing reddit comments in realtime. 100-200 level tutorial.
Stars: ✭ 40 (+42.86%)
Mutual labels:  data-lake
smart-data-lake
Smart Automation Tool for building modern Data Lakes and Data Pipelines
Stars: ✭ 79 (+182.14%)
Mutual labels:  data-lake
zeeqs
Query API for aggregated Zeebe data
Stars: ✭ 37 (+32.14%)
Mutual labels:  data-lake
SparkProgrammingInScala
Apache Spark Course Material
Stars: ✭ 57 (+103.57%)
Mutual labels:  data-lake
Data-Engineering-Projects
Personal Data Engineering Projects
Stars: ✭ 167 (+496.43%)
Mutual labels:  data-lake
jobAnalytics and search
JobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
Stars: ✭ 25 (-10.71%)
Mutual labels:  data-lake
hiveberg
Demonstration of a Hive Input Format for Iceberg
Stars: ✭ 22 (-21.43%)
Mutual labels:  data-lake
herd-mdl
Herd-MDL, a turnkey managed data lake in the cloud. See https://finraos.github.io/herd-mdl/ for more information.
Stars: ✭ 11 (-60.71%)
Mutual labels:  data-lake

Terraform module Azure Data Lake

This is a module for Terraform that deploys a complete and opinionated data lake network on Microsoft Azure.

Badges: maintained by dataroots · Terraform 0.13 · Terraform Registry · tests · Go Report Card

Components

  • Azure Data Factory for data ingestion from various sources
  • Azure Data Lake Storage Gen2 containers to store data for the data lake layers
  • Azure Databricks to clean and transform the data
  • Azure Synapse Analytics to store presentation data
  • Azure Cosmos DB to store metadata
  • Credentials and access management configured and ready to go
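
The storage part of the component list above can be sketched with plain azurerm resources. This is a hypothetical illustration of what such a module provisions for the lake layers — the resource names, layer names, and settings are assumptions, not the module's actual internals:

```hcl
# Hypothetical sketch: one ADLS Gen2 account with a container per lake layer.
resource "azurerm_storage_account" "lake" {
  name                     = "exampledatalake" # assumption: lowercase alphanumeric, globally unique
  resource_group_name      = azurerm_resource_group.rg.name
  location                 = "eastus2"
  account_tier             = "Standard"
  account_replication_type = "ZRS"
  account_kind             = "StorageV2"
  is_hns_enabled           = true # hierarchical namespace enables Data Lake Storage Gen2
}

resource "azurerm_storage_data_lake_gen2_filesystem" "layers" {
  for_each           = toset(["raw", "clean", "curated"]) # assumed layer names
  name               = each.key
  storage_account_id = azurerm_storage_account.lake.id
}
```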

This design is based on one of Microsoft's architecture patterns for an advanced analytics solution.

Microsoft Advanced Analytics pattern

It includes some additional changes that dataroots recommends.

  • Multiple storage containers to store every version of the data
  • Cosmos DB stores the metadata of the data, acting as a data catalog
  • Azure Analysis Services is left out for now, as some services might be replaced once Azure Synapse Analytics Workspace becomes generally available

Usage

module "azuredatalake" {
  source  = "datarootsio/azure-datalake/module"
  version = "~> 0.1"

  data_lake_name = "exampledatalake"
  region         = "eastus2"

  storage_replication        = "ZRS"
  service_principal_end_date = "2030-01-01T00:00:00Z"
  cosmosdb_consistency_level = "Session"
}
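
After declaring the module, a plain `terraform init` followed by `terraform apply` deploys the data lake. Values the module exposes can then be passed on to other configuration, for example as outputs. The output name below is illustrative — check the module's documented outputs on the Terraform Registry for the real names:

```hcl
# Hypothetical: re-export a value from the module for use elsewhere.
# "storage_account_name" is an assumed output name, not necessarily
# one the module actually defines.
output "data_lake_storage_account" {
  value = module.azuredatalake.storage_account_name
}
```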

Requirements

Supported environments

This module works on macOS and Linux. Windows is not supported as the module uses some Bash scripts to get around Terraform limitations.

Azure provider configuration

The Azure providers used by this module have to be configured. You can either log in through the Azure CLI, or set the authentication environment variables documented for each provider.

Azure CLI

The module uses some workarounds for features that are not yet available in the Azure providers. Therefore, you need to be logged in to the Azure CLI as well. You can use either a user account or service principal authentication.
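
The kind of workaround referred to here is typically a null_resource that shells out to the Azure CLI for a setting the provider cannot manage yet. This is a hypothetical sketch of the pattern, not the module's actual code — the resource, account, and group names are made up:

```hcl
# Hypothetical pattern: call the Azure CLI for a feature the provider lacks.
# Requires an authenticated `az` session (user or service principal).
resource "null_resource" "storage_https_only" {
  provisioner "local-exec" {
    command = "az storage account update --name exampledatalake --resource-group example-rg --https-only true"
  }
}
```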

jq

The module uses jq to extract Databricks parameters during the deployment. Therefore, you need to have jq installed.
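
The pattern in question is extracting a single field from a JSON response, as returned by the Azure CLI. The example below is a hypothetical illustration of that pattern — the response shape is made up, not what the module's scripts actually parse:

```shell
# Hypothetical: pull one field out of a JSON response with jq.
response='{"properties": {"workspaceUrl": "adb-123.azuredatabricks.net"}}'
workspace_url=$(printf '%s' "$response" | jq -r '.properties.workspaceUrl')
echo "$workspace_url"   # prints adb-123.azuredatabricks.net
```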

Contributing

Contributions to this repository are very welcome! Found a bug, or do you have a suggestion? Please open an issue. Do you know how to fix it? Pull requests are welcome as well! To get you started faster, a Makefile is provided.

Make sure to install Terraform, the Azure CLI, Go (for automated testing) and Make (optional, if you want to use the Makefile) on your computer. Install tflint to be able to run the linter.

  • Setup tools & dependencies: make tools
  • Format your code: make fmt
  • Linting: make lint
  • Run tests: make test (or go test -timeout 2h ./... without Make)

To run the automated tests, the environment variable ARM_SUBSCRIPTION_ID has to be set to your Azure subscription ID.

License

MIT license. Please see LICENSE for details.
