kristeligt-dagblad / dbt_ml

Licence: Apache-2.0
Package for dbt that allows users to train, audit and use BigQuery ML models.

BigQuery ML models in dbt

The package implements a model materialization that trains a BigQuery ML model from a select statement and a set of parameters. It also includes a set of helper macros that assist with model audit and prediction.

Installation

To install the package, add it to the packages.yml file in your dbt project, then run dbt deps to download it.

To use the model audit post-hook, set the following variables in your dbt_project.yml file:

Variable              Description
dbt_ml:audit_schema   Schema of the audit table.
dbt_ml:audit_table    Name of the audit table.

You will also need to specify the post-hook in your dbt_project.yml file[1] as {{ dbt_ml.model_audit() }}. Optionally, you can use the dbt_ml.create_model_audit_table() macro to create the audit table automatically if it does not exist - for example in an on-run-start hook.

Example config for dbt_project.yml below:

vars:
  "dbt_ml:audit_schema": "audit"
  "dbt_ml:audit_table": "ml_models"
on-run-start:
  - '{% do adapter.create_schema(api.Relation.create(target.project, "audit")) %}'
  - "{{ dbt_ml.create_model_audit_table() }}"
models:
  <project>:
    ml:
      enabled: true
      schema: ml
      materialized: model
      post-hook: "{{ dbt_ml.model_audit() }}"

Usage

To use the model materialization, create a .sql file with a select statement and set the materialization to model. Additionally, specify any BigQuery ML options in the ml_config key of the config dictionary.

# model.sql

{{
    config(
        materialized='model',
        ml_config={
            'model_type': 'logistic_reg',
            'early_stop': true,
            'ls_init_learn_rate': 0.1,
            ...
        }
    )
}}

select * from your_input

Note that the materialization should not be prefixed with dbt_ml, since dbt does not support namespaced materializations.
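
For orientation, the model materialization can be thought of as wrapping the select statement in a BigQuery ML CREATE MODEL statement, with the ml_config entries passed as OPTIONS. The statement below is a hand-written sketch of what the model.sql example above might compile to, not output captured from the package. The project and dataset names (my_project, ml) are hypothetical, and the options elided with ... in the example are left out.

```sql
-- Hypothetical compiled form of model.sql; the model path is a placeholder.
create or replace model `my_project.ml.model`
options (
    model_type = 'logistic_reg',
    early_stop = true,
    ls_init_learn_rate = 0.1
)
as
-- BigQuery ML expects the label column to be named "label"
-- unless input_label_cols is set in the options.
select * from your_input;
```

Each key in ml_config becomes one name = value pair in the OPTIONS clause, so any option that BigQuery ML accepts for the chosen model_type can be set from the config.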

After training your model you can reference it in downstream dbt models using the included predict macro.

# downstream_model.sql

{{
    config(
        materialized='table'
    )
}}

with eval_data as (
    ...
)

select * from {{ dbt_ml.predict(ref('model'), 'eval_data') }}
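
The predict macro presumably renders to a call to BigQuery's ML.PREDICT table function. As an illustration only, the downstream model above would then be roughly equivalent to the native query below. The model path and the body of the eval_data CTE are hypothetical placeholders, and the exact SQL the macro emits may differ.

```sql
-- Sketch of an equivalent native BigQuery ML query (all names are placeholders).
with eval_data as (
    -- stands in for the elided CTE body in the example above
    select * from `my_project.ml.eval_input`
)

select *
from ml.predict(
    model `my_project.ml.model`,
    (select * from eval_data)
);
```

ML.PREDICT returns the input columns together with the model's prediction columns, which is why the downstream model can simply select * from the macro call.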

If you're using a BQML matrix_factorization model, you can use the recommend macro in the same way.

# downstream_model.sql

with predict_features AS (
    ...
)

select * from {{ dbt_ml.recommend(ref('model'), 'predict_features') }}

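BigQuery ML's ML.DETECT_ANOMALIES function provides anomaly detection. The detect_anomalies macro wraps it and takes a threshold as its third argument; in the example below, threshold is a placeholder for the value you supply.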

# detect_anomalies_model.sql

{{
    config(
        materialized='table'
    )
}}

with eval_data as (
    ...
)

select * from {{ dbt_ml.detect_anomalies(ref('model'), 'eval_data', threshold) }}
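
For reference, a direct call to ML.DETECT_ANOMALIES looks roughly like the query below. This is a sketch under assumptions: the model path, the eval_data body and the value 0.02 are hypothetical, and it assumes the macro passes its threshold as the contamination option, which applies to k-means, PCA and autoencoder models (ARIMA_PLUS models take anomaly_prob_threshold instead). Check the macro source to confirm which option it sets.

```sql
-- Sketch of a native anomaly-detection query (all names and values are placeholders).
with eval_data as (
    select * from `my_project.ml.eval_input`
)

select *
from ml.detect_anomalies(
    model `my_project.ml.model`,
    struct(0.02 as contamination),  -- assumed mapping of the macro's threshold argument
    (select * from eval_data)
);
```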

Tuning hyperparameters

BigQuery ML supports tuning model hyperparameters[2], as does dbt_ml. To specify which hyperparameters to tune and which parameter space to search, use the dbt_ml.hparam_candidates and dbt_ml.hparam_range macros, which map to the corresponding BigQuery ML methods.

The following example takes advantage of hyperparameter tuning:

{{
    config(
        materialized='model',
        ml_config={
            'model_type': 'dnn_classifier',
            'auto_class_weights': true,
            'learn_rate': dbt_ml.hparam_range(0.01, 0.1),
            'early_stop': false,
            'max_iterations': 50,
            'num_trials': 4,
            'optimizer': dbt_ml.hparam_candidates(['adam', 'sgd'])
        }
    )
}}

Note that num_trials must be set to a positive integer; otherwise BigQuery will return an error.
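
In BigQuery ML itself, the two macros correspond to the HPARAM_RANGE and HPARAM_CANDIDATES keywords in the OPTIONS clause. The statement below is a hand-written sketch of what the config above might produce. The model path and the training query are hypothetical placeholders, and the exact text the materialization emits may differ.

```sql
-- Hypothetical compiled form of the tuning example (names are placeholders).
create or replace model `my_project.ml.tuned_model`
options (
    model_type = 'dnn_classifier',
    auto_class_weights = true,
    learn_rate = hparam_range(0.01, 0.1),           -- continuous search range
    early_stop = false,
    max_iterations = 50,
    num_trials = 4,                                 -- required when tuning
    optimizer = hparam_candidates(['adam', 'sgd'])  -- discrete candidate set
)
as
select * from your_input;

-- After training, list the trials with their hyperparameters and metrics.
select * from ml.trial_info(model `my_project.ml.tuned_model`);
```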

Overriding the package

To override or shim macros in this package, define a dispatch config in dbt_project.yml rather than a var named dbt_ml_dispatch_list, for instance:

dispatch:
  - macro_namespace: dbt_ml
    search_order: ['my_project', 'dbt_ml']  # enable override

Reservations

Some BigQuery ML models, e.g. Matrix Factorization, cannot be run using the on-demand pricing model. In order to train such models, please set up a flex or regular reservation[3] prior to running the model.

Footnotes

[1] The post-hook has to be specified in dbt_project.yml instead of the actual model file because the relation is not available during parsing, so variables like {{ this }} are not templated properly.

[2] https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-hyperparameter-tuning

[3] https://cloud.google.com/bigquery/docs/reservations-tasks
