
datayoga-io / lineage

License: Apache-2.0
Generate beautiful documentation for your data pipelines in markdown format

Programming Languages

typescript
32286 projects
javascript
184084 projects - #8 most used programming language

Projects that are alternatives of or similar to lineage

basin
Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser
Stars: ✭ 25 (+56.25%)
Mutual labels:  pipeline, etl, pyspark
sparklanes
A lightweight data processing framework for Apache Spark
Stars: ✭ 17 (+6.25%)
Mutual labels:  pipeline, etl, pyspark
Stetl
Stetl, Streaming ETL, is a lightweight geospatial processing and ETL framework written in Python.
Stars: ✭ 64 (+300%)
Mutual labels:  pipeline, etl
Setl
A simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (+393.75%)
Mutual labels:  pipeline, etl
Airbyte
Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.
Stars: ✭ 4,919 (+30643.75%)
Mutual labels:  pipeline, etl
Serving
A flexible, high-performance carrier for machine learning models (a model serving and deployment framework for PaddlePaddle 『飞桨』)
Stars: ✭ 403 (+2418.75%)
Mutual labels:  pipeline, dag
Go Streams
A lightweight stream processing library for Go
Stars: ✭ 615 (+3743.75%)
Mutual labels:  pipeline, etl
Metl
mito ETL tool
Stars: ✭ 153 (+856.25%)
Mutual labels:  pipeline, etl
etl
M-Lab ingestion pipeline
Stars: ✭ 15 (-6.25%)
Mutual labels:  pipeline, etl
naas
⚙️ Schedule notebooks, run them like APIs, expose securely your assets: Jupyter as a viable ⚡️ Production environment
Stars: ✭ 219 (+1268.75%)
Mutual labels:  pipeline, etl
Morphl Community Edition
MorphL Community Edition uses big data and machine learning to predict user behaviors in digital products and services with the end goal of increasing KPIs (click-through rates, conversion rates, etc.) through personalization
Stars: ✭ 253 (+1481.25%)
Mutual labels:  pipeline, pyspark
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (+143.75%)
Mutual labels:  etl, pyspark
Datavec
ETL Library for Machine Learning - data pipelines, data munging and wrangling
Stars: ✭ 272 (+1600%)
Mutual labels:  pipeline, etl
Phila Airflow
Stars: ✭ 16 (+0%)
Mutual labels:  pipeline, etl
hamilton
A scalable general purpose micro-framework for defining dataflows. You can use it to create dataframes, numpy matrices, python objects, ML models, etc.
Stars: ✭ 612 (+3725%)
Mutual labels:  etl, dag
Mara Pipelines
A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
Stars: ✭ 1,841 (+11406.25%)
Mutual labels:  pipeline, etl
Butterfree
A tool for building feature stores.
Stars: ✭ 126 (+687.5%)
Mutual labels:  etl, pyspark
Example Airflow Dags
Example DAGs using hooks and operators from Airflow Plugins
Stars: ✭ 243 (+1418.75%)
Mutual labels:  etl, dag
Bulk Writer
Provides guidance for fast ETL jobs, an IDataReader implementation for SqlBulkCopy (or the MySql or Oracle equivalents) that wraps an IEnumerable, and libraries for mapping entities to table columns.
Stars: ✭ 210 (+1212.5%)
Mutual labels:  pipeline, etl
python mozetl
ETL jobs for Firefox Telemetry
Stars: ✭ 25 (+56.25%)
Mutual labels:  etl, pyspark

DataYoga Lineage

This package generates documentation for data pipelines along with data lineage charts. It is source agnostic and uses a predefined JSON/YAML format to represent the dependencies and business logic. The resulting markdown files can be used standalone or as part of a documentation site built with tools like MkDocs or VuePress.


Installation

npm install -g @datayoga-io/lineage

Quick Start

To quickly get started with Lineage, scaffold a new project. This will create the folder structure along with sample files.

dy-lineage scaffold ./my-project

To generate the documentation for the new project:

dy-lineage build ./my-project --dest ./docs

Data Entities

Lineage models the data ecosystem using the following entities (a minimal sketch of how they fit together follows the list):

Datastore - A datastore represents a source or target of data that can hold data at rest or data in motion. Datastores include entities such as a table in a database, a file, or a stream in Kafka. A Datastore can act either as a source or a target of a pipeline.

File - A file is a type of Datastore that represents information stored in files. Files contain metadata about their structure and schema.

Dimension - A dimension table / file is typically used for lookup and constant information that is managed as part of the application code. This often includes lookup values such as country codes.

Runner - A runner is a processing engine capable of running data operations. Every Runner supports one or more programming languages. Some Runners, like a database engine, only support SQL, while others like Spark may support Python, Scala, and Java.

Consumer - A consumer consumes data and presents it to a user. Consumers include reports, dashboards, and interactive applications.

Pipeline - A pipeline represents a series of Jobs that operate on a single Runner.

Job - A job is composed of a series of Steps that fetch information from one or more Datastores, transform it, and store the result in a target Datastore, or that perform actions such as sending alerts or making HTTP calls.

Job Step - Every step in a job performs a single action. Each step has a type that represents the action it performs: an SQL statement, a Python statement, or a callout to a library. Steps can be chained to create a Directed Acyclic Graph (DAG).
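
To make the model concrete, here is a minimal sketch of how these entities reference each other by ID. It reuses the IDs and formats from the examples later in this document (the definition of datastore:raw_orders is abbreviated and shown here only for illustration), and the file locations in the comments follow the input folder structure described under Getting started:

# entities/datastores/*.yaml - datastore definitions (abbreviated sketch)
- id: datastore:raw_orders
  source: table
- id: datastore:orders
  source: table

# entities/pipelines/*.yaml - pipeline definition (abbreviated sketch)
id: pipeline:order_mgmt.load_orders
title: Load order from SAP

# relations/*.yaml - lineage: raw_orders -> load_orders -> orders
- source: datastore:raw_orders
  target: pipeline:order_mgmt.load_orders
- source: pipeline:order_mgmt.load_orders
  target: datastore:orders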

Getting started

Usage

dy-lineage <input folder> --dest <output folder>

Lineage scans the file(s) in the input folder and attempts to load information about the catalog. In addition, a folder containing YAML files describing each pipeline's business logic can be provided.

See the Example folder for sample input files and generated output files.

Structure of input folder

.
├── .dyrc
├── entities
│   ├── files
│   ├── pipelines
│   └── datastores
└── relations

• .dyrc: Used to store global configuration.
• datastores: Catalog file(s) with information about datastore entities and their metadata
• files: Catalog file(s) with information about file datastore entities and their metadata
• pipelines: Optional files containing information about the pipelines and business logic flow
• relations: Information about the relations between the data entities

Structure of entities files

The entities folder contains the entity definition and metadata for each of the entities. Each folder can contain one or more files. For convenience, multiple entities can be located in the same yaml file. All *.yaml files are scanned.

Whenever an ID is used, the ID naming convention is: <entity type>:<optional module name>.<entity name>. Module name can be nested: e.g. order_mgmt.weekdays.inbound.load_orders
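
For illustration, here are a few IDs that follow this convention. The first three appear in the examples below; the last combines the nested module example above with a pipeline entity type and is hypothetical:

pipeline:order_mgmt.load_orders                    # entity type "pipeline", module "order_mgmt", name "load_orders"
file:exported_orders                               # module name omitted
datastore:orders                                   # module name omitted
pipeline:order_mgmt.weekdays.inbound.load_orders   # nested module name (hypothetical)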

Pipelines

Example:

id: pipeline:order_mgmt.load_orders
title: Load order from SAP
description: |
  Loads orders from raw order system. Performs the following calculation on order value:
    ```order_value = order_qty * order_line_item_value```
classification: PII
metadata:
  - triggers: file_arrival

id: unique identifier of this pipeline

title: title to use for document

description: description of the pipeline. Can use markdown format

classification: data classification

metadata: any key/value pairs under the metadata section will be presented as tags in the resulting document

Files

Here is an example of a single YAML file holding more than one source definition.

Example:

- id: file:exported_orders
  filename: orders_[0-9]{6}.csv
  format: delimited
  delimiter: ","
  header: true
  schema:
    fields:
      - name: order_id
        type: int
        doc: order number
      - name: order_date
        type: date
        dateformat: MM/dd/YYYY
        description: order date
- id: file:customers
  filename: customer_full.json
  doc: example of a json file with nested schema
  format: json
  schema:
    type: record
    fields:
      - name: customer_id
        type: int
        doc: customer identifier
      - name: address
        type: record
        fields:
          - name: street
            type: string
          - name: city
            type: string
          - name: state
            type: string
          - name: zip
            type: string

Datastore

Example:

- id: datastore:orders
  source: table
  doc: orders table
  schema:
    fields:
      - name: order_id
        type: int
        doc: order number
        primary_key: True
      - name: customer_id
        type: double
        doc: customer number
        primary_key: False
      - name: name
        type: string
        doc: customer name

Structure of relations file

The relations file holds the relationships between the entities.

Example:

- source: datastore:raw_orders
  target: pipeline:order_mgmt.load_orders
- source: pipeline:order_mgmt.load_orders
  target: datastore:orders
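
Taken together, these two relations describe the lineage chain datastore:raw_orders → pipeline:order_mgmt.load_orders → datastore:orders, which is the chain that would appear in the generated lineage chart for this pipeline.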