AlphaSQL


AlphaSQL provides automatic parallelization for sets of SQL files and an integrated type/schema checker to eliminate syntax, type, and schema errors from your data warehouse.

Features

You can quickly introduce AlphaSQL by following the CI Example below.

Docker Image

You can run the commands below with Docker:

docker run --rm -v `pwd`:/home matts966/alphasql:latest [command]

for example:

docker run --rm -v `pwd`:/home matts966/alphasql:latest alphacheck ./samples/sample/dag.dot

The commands are installed on the image's PATH.

Fast Binaries

# To install on macOS
temp=$(mktemp -d)
wget -P $temp https://github.com/Matts966/alphasql/releases/latest/download/alphasql_darwin_x86_64.tar.gz \
    && sudo tar -zxvf $temp/alphasql_darwin_x86_64.tar.gz -C /usr/local/bin
# To install for Linux
temp=$(mktemp -d)
wget -P $temp https://github.com/Matts966/alphasql/releases/latest/download/alphasql_linux_x86_64.tar.gz \
    && sudo tar -zxvf $temp/alphasql_linux_x86_64.tar.gz -C /usr/local/bin --strip=1

Extract DAG from SQL set

alphadag finds dependencies between table references and CREATE TABLE statements, as well as between function calls and CREATE FUNCTION statements.

# To extract DAG from your SQL set
$ alphadag --output_path ./samples/sample/dag.dot ./samples/sample/

# Or you can check the output in stdout by
$ alphadag [paths]

# with graphviz
$ dot -Tpng samples/sample/dag.dot -o samples/sample/dag.png

Note that the output sometimes contains cycles; in that case you need to refactor the SQL files or manually edit the dot file (see this issue).

If there are cycles, a warning is emitted, the type checker reports an error, and bq_jobrunner raises an error before execution. See ./samples/sample-cycle for an example.

If you want to execute some statements serially, you can write a SQL script that contains multiple statements. See samples/sample/create_interim1.sql for an example.
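When a cycle does slip into the extracted graph, it can help to detect it before any job runs. Below is a minimal, stdlib-only sketch of such a pre-flight check; it is not part of AlphaSQL, and parsing the .dot file into an edge list is left out, so the edge list here is illustrative.

```python
# Sketch: DFS-based cycle check for an alphadag-style dependency graph,
# given as (source, dependent) pairs. Stdlib only; not part of AlphaSQL.

def has_cycle(edges):
    """Return True if the directed graph described by `edges` has a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on the DFS stack / done
    color = {node: WHITE for node in graph}

    def dfs(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:  # back edge -> cycle
                return True
            if color[nxt] == WHITE and dfs(nxt):
                return True
        color[node] = BLACK
        return False

    return any(dfs(n) for n in graph if color[n] == WHITE)

# A cyclic SQL dependency like the one in ./samples/sample-cycle:
print(has_cycle([("a.sql", "b.sql"), ("b.sql", "a.sql")]))  # True
```

If this returns True, refactor the SQL files (or edit the dot file) until the dependency graph is acyclic.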

Sample DAG output

The image below is extracted from the SQL set in ./samples/sample. You can write tests for created tables and run them in parallel simply by placing each test in a separate SQL file.

[DAG image: samples/sample/dag.dot rendered with Graphviz]

Side effect first

You can resolve side effects such as INSERT and UPDATE statements before simple references with the --side_effect_first option.

[DAG image with --side_effect_first]

With tables and functions

You can extract dependencies that include tables and functions with the --with_tables and --with_functions options.

With tables

[DAG image with --with_tables]

With functions

[DAG image with --with_functions]

Parallel Execution

For BigQuery, the output DAG can be executed in parallel using the tools below.

bq-airflow-dag-generator

[Airflow DAG screenshot]

The bq-airflow-dag-generator Python package can generate an Airflow DAG from a simple Python script:

from bq_airflow_dag_generator import generate_airflow_dag_by_dot_path  # import path assumed from the package name

dagpath = "/path/to/dag.dot"
dag = generate_airflow_dag_by_dot_path(dagpath)

See the usage section of its README for more details.

bq-jobrunner

from bq_jobrunner.bq_jobrunner import BQJobrunner

FILE_PATH = "./path/to/dag.dot"
PROJECT_ID = "your-project-id"
REGION = "asia-northeast1" # your region

runner = BQJobrunner(
    PROJECT_ID,
    location=REGION,
)
runner.compose_query_by_dot_path(FILE_PATH)
runner.execute()

Note that you should run bq_jobrunner from the same path where you extracted the DAG.
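The scheduling idea behind both runners is to group the DAG into topological levels: every file whose dependencies have all completed can run concurrently. A stdlib-only sketch of that grouping (an illustration, not the actual bq-jobrunner implementation; the file names are made up):

```python
# Sketch: split a DAG, given as (dependency, dependent) pairs, into
# batches where every file in a batch can be executed in parallel.
from collections import defaultdict

def parallel_batches(edges):
    indeg = defaultdict(int)
    out = defaultdict(list)
    nodes = set()
    for dep, dependent in edges:
        out[dep].append(dependent)
        indeg[dependent] += 1
        nodes.update((dep, dependent))
    ready = sorted(n for n in nodes if indeg[n] == 0)  # no unmet dependencies
    batches = []
    while ready:
        batches.append(ready)
        nxt = []
        for n in ready:
            for m in out[n]:
                indeg[m] -= 1
                if indeg[m] == 0:
                    nxt.append(m)
        ready = sorted(nxt)
    return batches

edges = [("create_interim1.sql", "create_mart.sql"),
         ("create_interim2.sql", "create_mart.sql"),
         ("create_mart.sql", "test_mart1.sql")]
print(parallel_batches(edges))
# [['create_interim1.sql', 'create_interim2.sql'], ['create_mart.sql'], ['test_mart1.sql']]
```

Here the two interim tables are built concurrently, then the mart, then its test.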

Pipeline-level Type Check for SQL sets

Note that you should run alphacheck from the same path where you extracted the DAG.

# To check types and schemata of the SQL set
$ alphacheck ./samples/sample/dag.dot
Analyzing "./samples/sample/create_datawarehouse3.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_datawarehouse2.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_interim2.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/update_interim2.sql"
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_datawarehouse1.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_interim3.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_interim1.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/update_interium1.sql"
SUCCESS: analysis finished!
Analyzing "./samples/sample/insert_into_interim1.sql"
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_mart.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/test_mart1.sql"
SUCCESS: analysis finished!
Analyzing "./samples/sample/test_mart2.sql"
SUCCESS: analysis finished!
Analyzing "./samples/sample/test_mart3.sql"
SUCCESS: analysis finished!
Successfully finished type check!

If you change column x's type in ./samples/sample/create_datawarehouse3.sql to STRING, the type checker reports an error:

$ alphacheck ./samples/sample/dag.dot
Analyzing "./samples/sample/create_datawarehouse3.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_datawarehouse2.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_interim2.sql"
ERROR: INVALID_ARGUMENT: Column 1 in UNION ALL has incompatible types: INT64, STRING [at ./samples/sample/create_interim2.sql:7:1]
catalog:
        datawarehouse3
        datawarehouse2

Schema specification by JSON

You can specify external schemata (tables not created by queries in the SQL set) by passing a path to a JSON schema file.

# with external schema
$ alphacheck --json_schema_path ./samples/sample-schema.json ./samples/sample/dag.dot

You can extract the required external tables with:

$ alphadag --external_required_tables_output_path ./required_tables.txt {./path/to/sqls}
# and get schemata using bq command
$ cat ./required_tables.txt | while read line
do
    bq show \
        --schema \
        --format=prettyjson \
        $line
done

The JSON schema file should contain a single top-level map whose string keys specify table names and whose values specify the schema for each table. The name and type elements of each schema entry are recognized as in the official BigQuery API.

{
        "tablename1": [
                {"mode": "NULLABLE", "name": "column1", "type": "STRING", "description": null}
        ],
        "tablename2": [
                {"mode": "NULLABLE", "name": "column1", "type": "STRING", "description": null},
                {"mode": "NULLABLE", "name": "column2", "type": "INT64", "description": null}
        ]
}
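As a sanity check before handing such a file to alphacheck, the expected shape can be validated with a few lines of Python. This validator is a sketch and not part of AlphaSQL; it only enforces the map-of-column-lists structure shown above.

```python
# Sketch: validate the shape of an external-schema JSON file
# (top-level map of table name -> list of column specs).
import json

def validate_schema(doc):
    if not isinstance(doc, dict):
        raise ValueError("top level must be a map keyed by table name")
    for table, columns in doc.items():
        if not isinstance(columns, list):
            raise ValueError(f"{table}: schema must be a list of columns")
        for col in columns:
            if not isinstance(col, dict) or "name" not in col or "type" not in col:
                raise ValueError(f"{table}: each column needs name and type")
    return True

doc = json.loads('{"tablename1": [{"mode": "NULLABLE", "name": "column1",'
                 ' "type": "STRING", "description": null}]}')
print(validate_schema(doc))  # True
```

Running this before alphacheck turns a malformed schema file into an immediate, readable error rather than a confusing analysis failure.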

CI Example

The pipeline-level type check above is also useful in a CI context. The sample in ./samples/sample-ci contains an example of extracting a DAG, retrieving schemata, and checking the schemata and types of a SQL set that queries a BigQuery public dataset. You can introduce the CI into your environment simply by copying cloudbuild_ci_sample.yaml and python_entrypoint.py into your project.

You can try the example CI with the gcloud command:

(cd ./samples/sample-ci && gcloud builds submit --config=cloudbuild_ci_sample.yaml .)

This example

  • supports the _TABLE_SUFFIX feature
  • does not execute real BigQuery queries, so it is very cheap

License

Apache License 2.0

Sponsors

The development of this project is sponsored by Japan Data Science Consortium and Cybozu Lab.
