
dbt-labs / spark-utils

License: Apache-2.0
Utility functions for dbt projects running on Spark

Programming Languages

  • Python
  • Makefile

Labels

  • dbt
Projects that are alternatives of or similar to spark-utils

airflow-dbt
Apache Airflow integration for dbt
Stars: ✭ 233 (+1126.32%)
Mutual labels:  dbt
dbt-databricks
A dbt adapter for Databricks.
Stars: ✭ 115 (+505.26%)
Mutual labels:  dbt
dbt-airflow-docker-compose
Execution of DBT models using Apache Airflow through Docker Compose
Stars: ✭ 76 (+300%)
Mutual labels:  dbt
snowflake-starter
A _simple_ starter template for Snowflake Cloud Data Platform
Stars: ✭ 31 (+63.16%)
Mutual labels:  dbt
airflow-dbt-python
A collection of Airflow operators, hooks, and utilities to elevate dbt to a first-class citizen of Airflow.
Stars: ✭ 111 (+484.21%)
Mutual labels:  dbt
dbt-spotify-analytics
Containerized end-to-end analytics of Spotify data using Python, dbt, Postgres, and Metabase
Stars: ✭ 92 (+384.21%)
Mutual labels:  dbt
kuwala
Kuwala is the no-code data platform for BI analysts and engineers enabling you to build powerful analytics workflows. We are set out to bring state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations together in one intuitive interface built with React Flow. In addition we provide third-party data into data sc…
Stars: ✭ 474 (+2394.74%)
Mutual labels:  dbt
dbt-formatter
Formatting for dbt jinja-flavored sql
Stars: ✭ 37 (+94.74%)
Mutual labels:  dbt
awesome-dbt
A curated list of awesome dbt resources
Stars: ✭ 520 (+2636.84%)
Mutual labels:  dbt
dbt ml
Package for dbt that allows users to train, audit and use BigQuery ML models.
Stars: ✭ 41 (+115.79%)
Mutual labels:  dbt
pre-commit-dbt
🎣 List of `pre-commit` hooks to ensure the quality of your `dbt` projects.
Stars: ✭ 149 (+684.21%)
Mutual labels:  dbt
dbt-on-airflow
No description or website provided.
Stars: ✭ 30 (+57.89%)
Mutual labels:  dbt
metriql
The metrics layer for your data. Join us at https://metriql.com/slack
Stars: ✭ 227 (+1094.74%)
Mutual labels:  dbt
re-data
re_data - fix data issues before your users & CEO would discover them 😊
Stars: ✭ 955 (+4926.32%)
Mutual labels:  dbt
dbt2looker
Generate lookml for views from dbt models
Stars: ✭ 119 (+526.32%)
Mutual labels:  dbt
ria-jit
Lightweight and performant dynamic binary translation for RISC–V code on x86–64
Stars: ✭ 38 (+100%)
Mutual labels:  dbt
PyRasgo
Helper code to interact with Rasgo via our SDK, PyRasgo
Stars: ✭ 39 (+105.26%)
Mutual labels:  dbt
fal
do more with dbt. fal helps you run Python alongside dbt, so you can send Slack alerts, detect anomalies and build machine learning models.
Stars: ✭ 567 (+2884.21%)
Mutual labels:  dbt
dbt-sugar
dbt-sugar is a CLI tool that allows users of dbt to have fun and ease performing actions around dbt models
Stars: ✭ 139 (+631.58%)
Mutual labels:  dbt
dbt ad reporting
Fivetran's ad reporting dbt package. Combine your Facebook, Google, Pinterest, Linkedin, Twitter, Snapchat and Microsoft advertising spend using this package.
Stars: ✭ 68 (+257.89%)
Mutual labels:  dbt

This dbt package contains macros that:

  • can be (re)used across dbt projects running on Spark
  • define Spark-specific implementations of dispatched macros from other packages

Installation Instructions

Check dbt Hub for the latest installation instructions, or read the docs for more information on installing packages.
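A typical packages.yml entry for this package might look like the following; the version range is illustrative only, so check dbt Hub for the current release:

```yaml
packages:
  - package: dbt-labs/spark_utils
    version: [">=0.3.0", "<0.4.0"]  # illustrative range; see dbt Hub for the latest
```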


Compatibility

This package provides "shims" for:

  • dbt_utils, except for:
    • dbt_utils.get_relations_by_pattern
    • dbt_utils.groupby
    • dbt_utils.recency
    • dbt_utils.any_value
    • dbt_utils.listagg
    • dbt_utils.pivot with apostrophe(s) in the values
  • snowplow (tested on Databricks only)

To use these "shims," set a dispatch config in your root project (requires dbt v0.20.0 or newer). For example, with the following setting, dbt will first search for macro implementations inside the spark_utils package when resolving macros from the dbt_utils namespace:

dispatch:
  - macro_namespace: dbt_utils
    search_order: ['spark_utils', 'dbt_utils']
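Conceptually, search_order is a first-match lookup across package namespaces. A minimal Python sketch of that resolution follows; all names are hypothetical, and this is not dbt's actual implementation:

```python
def resolve_macro(macro_name, search_order, packages):
    """Return the first implementation of macro_name found while
    walking the packages in search_order (first match wins)."""
    for package in search_order:
        impl = packages.get(package, {}).get(macro_name)
        if impl is not None:
            return impl
    raise LookupError(f"no implementation of {macro_name!r} found")

# With the dispatch config above, a spark_utils shim takes precedence,
# and macros without a shim fall back to the dbt_utils default.
packages = {
    "spark_utils": {"concat": "spark_utils.concat"},
    "dbt_utils": {"concat": "dbt_utils.concat", "pivot": "dbt_utils.pivot"},
}
search_order = ["spark_utils", "dbt_utils"]
```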

Note to maintainers of other packages

The spark-utils package may be able to provide compatibility for your package, especially if your package leverages dbt-utils macros for cross-database compatibility. This package does not need to be specified as a dependency of your package in packages.yml. Instead, you should encourage anyone using your package on Apache Spark / Databricks to:

  • Install spark_utils alongside your package
  • Add a dispatch config in their root project, like the one above
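For instance, a user running the snowplow package on Databricks might add an entry like this alongside the dbt_utils one above (assuming the package's macros dispatch through the snowplow namespace):

```yaml
dispatch:
  - macro_namespace: snowplow
    search_order: ['spark_utils', 'snowplow']
```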

Useful macros: maintenance

Caveat: These are not tested in CI, or guaranteed to work on all platforms.

Each of these macros accepts a regex pattern, finds tables whose names match it, and loops over those tables to perform a maintenance operation:

  • spark_optimize_delta_tables: Runs OPTIMIZE on all matched Delta tables
  • spark_vacuum_delta_tables: Runs VACUUM on all matched Delta tables
  • spark_analyze_tables: Computes statistics for all matched tables
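The table-matching step these macros perform can be sketched in plain Python; the function name and logic here are illustrative, not the macros' actual Jinja source:

```python
import re

def statements_for_matching_tables(tables, pattern, operation="OPTIMIZE"):
    """Build one maintenance statement per table whose fully
    qualified name matches the given regex pattern."""
    regex = re.compile(pattern)
    return [f"{operation} {table}" for table in tables if regex.match(table)]

# Only tables in the `sales` schema are matched:
stmts = statements_for_matching_tables(
    ["sales.orders", "sales.items", "tmp.scratch"], r"sales\..*"
)
```

In practice the macros are invoked from the command line, e.g. `dbt run-operation spark_optimize_delta_tables`.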

Contributing

We welcome contributions to this repo! To contribute a new feature or a fix, please open a Pull Request with 1) your changes and 2) corresponding updates to the README.md documentation.

Testing

The macros are tested with pytest and pytest-dbt-core. For example, the get_tables macro is tested by:

  1. Create a test table (test setup):
    spark_session.sql(f"CREATE TABLE {table_name} (id int) USING parquet")
  2. Call the macro generator:
    tables = macro_generator()
  3. Assert test condition:
    assert simple_table in tables
  4. Delete the test table (test cleanup):
    spark_session.sql(f"DROP TABLE IF EXISTS {table_name}")

A macro is fetched using the macro_generator fixture, with the macro name provided through indirect parameterization:

@pytest.mark.parametrize(
    "macro_generator", ["macro.spark_utils.get_tables"], indirect=True
)
def test_create_table(
    macro_generator: MacroGenerator, simple_table: str
) -> None:
    tables = macro_generator()
    assert simple_table in tables

Getting started with dbt + Spark

Code of Conduct

Everyone interacting in the dbt project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the PyPA Code of Conduct.
