
Cigna / ibis

License: Apache-2.0
IBIS is a workflow creation-engine that abstracts the Hadoop internals of ingesting RDBMS data.

Programming Languages

  • python: 139,335 projects (#7 most used programming language)
  • shell: 77,523 projects
  • Mako: 254 projects

Projects that are alternatives of or similar to ibis

cloud
Cloud computing: environment setup and configuration files for Hadoop, Hive, Hue, Oozie, Sqoop, HBase, and ZooKeeper
Stars: ✭ 48 (+0%)
Mutual labels:  hadoop, oozie, sqoop
Bigdata Notes
A beginner's guide to big data ⭐
Stars: ✭ 10,991 (+22797.92%)
Mutual labels:  hadoop, sqoop
Dataspherestudio
DataSphereStudio is a one-stop data application development & management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.
Stars: ✭ 1,195 (+2389.58%)
Mutual labels:  workflow, hadoop
Dagster
An orchestration platform for the development, production, and observation of data assets.
Stars: ✭ 4,099 (+8439.58%)
Mutual labels:  workflow, workflow-automation
hadoopoffice
HadoopOffice - Analyze Office documents using the Hadoop ecosystem (Spark/Flink/Hive)
Stars: ✭ 56 (+16.67%)
Mutual labels:  hadoop, hadoop-ecosystem
N8n
Free and open fair-code licensed node based Workflow Automation Tool. Easily automate tasks across different services.
Stars: ✭ 19,252 (+40008.33%)
Mutual labels:  workflow, workflow-automation
Ibis
A pandas-like deferred expression system, with first-class SQL support
Stars: ✭ 1,630 (+3295.83%)
Mutual labels:  hadoop, ibis
BigInsights-on-Apache-Hadoop
Example projects for 'BigInsights for Apache Hadoop' on IBM Bluemix
Stars: ✭ 21 (-56.25%)
Mutual labels:  hadoop, oozie
web-click-flow
Offline log analysis of website clickstream data
Stars: ✭ 14 (-70.83%)
Mutual labels:  hadoop, sqoop
monopacker
A tool for managing builds of monorepo frontend projects with eg. npm- or yarn workspaces, lerna or similar tools into a standalone application - no other tools needed.
Stars: ✭ 17 (-64.58%)
Mutual labels:  workflow, workflow-automation
leaflet heatmap
A simple visualization of Huzhou call data. Assuming the data volume is too large to render a heatmap directly in the browser, the heatmap rendering step is moved offline for computation and analysis. Apache Spark computes the data in parallel and then renders the heatmap, and leafletjs loads an OpenStreetMap layer plus the heatmap layer for good interactivity. Rendering with Apache Spark turned out slower than single-machine computation, possibly because Spark is not well suited to this kind of work or because the algorithm wasn't well designed. The Spark heatmap and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git .
Stars: ✭ 13 (-72.92%)
Mutual labels:  hadoop
BigData-News
A real-time big data news system project based on Spark 2.2
Stars: ✭ 36 (-25%)
Mutual labels:  hadoop
denoflow
Configuration as Code, use YAML to write automated workflows that run on Deno, with any Deno modules, Typescript/Javascript codes
Stars: ✭ 143 (+197.92%)
Mutual labels:  workflow
jenkins-scriptlets
Useful groovy scripts that can be used while using Jenkins-CI for workflow automation
Stars: ✭ 16 (-66.67%)
Mutual labels:  workflow-automation
AlfredWorkflows
My workflow creations for Alfred on macOS.
Stars: ✭ 55 (+14.58%)
Mutual labels:  workflow
user guide
The CWL v1.0 user guide
Stars: ✭ 20 (-58.33%)
Mutual labels:  workflow
bistro
A library to build and execute typed scientific workflows
Stars: ✭ 43 (-10.42%)
Mutual labels:  workflow
alfred-fanfou
Alfred workflow for Fanfou
Stars: ✭ 15 (-68.75%)
Mutual labels:  workflow
bitnami-docker-airflow-scheduler
Bitnami Docker Image for Apache Airflow Scheduler
Stars: ✭ 19 (-60.42%)
Mutual labels:  workflow
action-sync-node-meta
GitHub Action that syncs package.json with the repository metadata.
Stars: ✭ 25 (-47.92%)
Mutual labels:  workflow

IBIS

Get your data in the Lake!

IBIS is a workflow creation-engine that abstracts the Hadoop internals of ingesting RDBMS data.
Ibis Logo

All you need to do is create a request file containing information about the table you want to ingest, and IBIS will:

  1. manage the information required to create the workflow;
  2. generate Oozie workflows, following current ingestion processes and standards to make the ingestion as performant as possible;
  3. create files that simplify scheduling the job through Automation; and
  4. allow easy execution of the workflow to populate the Data Lake from the RDBMS source.

The IBIS framework wraps the Hadoop stack technology Oozie, an XML-based workflow scheduler system used to manage Apache Hadoop jobs (Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions).

We support creating "config-based" workflows from any of the standard Oozie actions - such as Hive, Shell and Impala. The main use of IBIS is to use SQOOP for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases (including Oracle, Teradata, SQL Server, and DB2).
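For orientation, the Sqoop action inside a generated workflow ultimately boils down to a plain sqoop import. The sketch below is only an illustration of that mapping, assuming the standard Sqoop command-line options; the helper function and its values are hypothetical, not IBIS code.

    # Illustrative sketch only: roughly the "sqoop import" arguments that a
    # generated Sqoop action ends up with. Not IBIS's actual code.
    def build_sqoop_import_args(jdbc_url, username, password_file, table,
                                split_by, mappers, fetch_size, target_dir):
        return [
            "sqoop", "import",
            "--connect", jdbc_url,             # e.g. jdbc:teradata://fake.teradata/DATABASE=fake_database
            "--username", username,
            "--password-file", password_file,  # HDFS password file, as in the request file
            "--table", table,
            "--split-by", split_by,            # column that lets the import run in parallel
            "--num-mappers", str(mappers),
            "--fetch-size", str(fetch_size),
            "--as-avrodatafile",               # land as Avro; conversion to Parquet happens downstream
            "--target-dir", target_dir,
        ]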

Refer here for how to set up IBIS.

Purpose

Creating XMLs is a painful process, especially when the process includes many other steps, such as converting Avro to Parquet, validating data sets, writing to a checks & balances table for auditing, and making views accessible to users. The XMLs also contain code that creates a backup of the previous load's data, in case something goes wrong.

IBIS takes care of all of this for you; the only input needed from the user is a single configuration file, also called the "request file".

What does IBIS provide?

What | Function | How is this useful?
Split by | Provides an automated split-by for Teradata, SQL Server, and DB2 | Without the automated split-by you would have to find, per table, a column that enables parallel execution of the ingestion
Auto-generating Oozie workflows | Automatically creates the XML workflows | No manual XML creation
Generate non-ingestion workflows through "building blocks" | Given Hive and shell scripts, IBIS generates the Oozie workflow | Automates running any type of script in the Data Lake
Group tables based on schedule | Groups workflows into subworkflows based on schedule | Tables with a similar schedule can be kicked off through Automation by triggering a single workflow
Use Parquet | Stores data in Parquet | Efficient storage and fast queries!
Follows Lambda Architecture | Stores data in the base layer as immutable data |
Allows data export to any RDBMS | Exports data from Hive to Oracle, DB2, SQL Server, MySQL, and Teradata | For generating reports based on the data exported to the RDBMS
Creates automated incremental workflows | Lets you ingest data incrementally | Based on a check column, generates a WHERE clause so data is ingested into partitions automatically
Data validation | Validates that the data was ingested correctly | Row counts, DDL match, and data sampling, with the results stored in a QA log table
Map data types to valid Hadoop types | Matches external data types to Hadoop data types as closely as possible | Oracle has data types such as NUMBER which don't exist in Hadoop; IBIS maps them to the closest valid Hadoop type (e.g. DECIMAL) using metadata tables (see the sketch after this table). Other tools map everything directly to string, which doesn't work for SAS!
Isolated env (custom ENV) | Gives every team a separate team space for all source tables | Workflows can be generated and scheduled without affecting any other team
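To make the data-type mapping row more concrete, here is a minimal, hypothetical sketch of a source-to-Hadoop type mapping. IBIS derives the real mapping from metadata tables; the dictionary below is only an illustration.

    # Hypothetical illustration of Oracle-to-Hive type mapping; IBIS reads the
    # real mapping from metadata tables rather than a static dictionary.
    ORACLE_TO_HIVE = {
        "NUMBER":    "DECIMAL(38,0)",  # Hive has no NUMBER type; DECIMAL is the closest fit
        "VARCHAR2":  "STRING",
        "DATE":      "TIMESTAMP",
        "TIMESTAMP": "TIMESTAMP",
        "CLOB":      "STRING",
    }

    def to_hive_type(oracle_type):
        # Fall back to STRING only when nothing better is known, instead of
        # mapping every column to STRING up front.
        return ORACLE_TO_HIVE.get(oracle_type.upper(), "STRING")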

Functionalities available in IBIS

Run ./ibis-shell --help to list the available IBIS functionality.

Architecture

(Architecture diagram)

Behind the scenes

Under the covers, IBIS manages the information required to pull data sources into HDFS, including usernames, passwords, and JDBC connection info, and also keeps track of Automation IDs, which are used for scheduling jobs in production.

IBIS also has a shell that lets you run a workflow once it has been created. The goal is to make everything as automated as possible, so many items (workflows, request files) all revolve around git.

Using the IBIS shell

Command line interface

Example file

[Request]
split_by:MBR_KEY
mappers:10
jdbcurl:jdbc:teradata://fake.teradata/DATABASE=fake_database
db_username:fake_username
password_file:jceks://hdfs/user/dev/fake.passwords.jceks#fake.password.alias
fetch_size:50000
source_database_name:fake_database
source_table_name:fake_tablename
views:fake_view_open
weight:medium
refresh_frequency:weekly
check_column:TRANS_TIME
db_env:int
Parameter | Description
Split by | The column used to allow the job to run in parallel; the best choice is a primary key or indexed column
Mappers | The number of threads used to connect concurrently to the source system. Always check with the DBA on the maximum number that can be used!
JDBC URL | The JDBC connection string for the database. See the IBIS Confluence page for specifics on different source systems
DB Username | The username used to connect to the database
Password File | The encrypted password file alias and the location of the JCEKS in HDFS; contact the Big Data team if you're unsure what this is, or use the Data Lake 360 Dashboard
Fetch Size | The number of rows fetched per batch; the default is 50,000. In case of OOM errors, reduce this in 10,000-row increments
Source Database Name | The name of the database you are connecting to
Source Table Name | The name of the table you want to pull in
Views | Where you will have access to query and manipulate the dataset; if you need it in more than one place, separate the values with the pipe character. NOTE: this column is append-only. To remove all views, update with null and then add your views.
Weight | The size of the table (use your best judgement): 'light', 'medium' or 'heavy', based on the number of columns and rows. This affects how the workflow is generated and how many tables can run in parallel. In the backend the values are translated as 100 = light, 010 = medium, 001 = heavy
Refresh Frequency | Only needed for production: the frequency of the ingestion. This is used for reporting in checks & balances and other dashboards; all scheduling is physically done through Automation. Possible values are none, daily, weekly, monthly, and quarterly
Check Column | If you need this table ingested incrementally, provide a check column that is used to split up the load. Note that this currently works only for some sources; check with the team first
DB ENV | If you need to run multiple QA cycles, specify the QA cycle name (DEV/CUSTOM-ENV/PROD); the workflows will be created with the corresponding filename
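Because the request file is INI-like (a [Request] section with key:value pairs), it can be read with Python's standard configparser. The snippet below is a hypothetical illustration of the format, not IBIS's actual parser.

    # Hypothetical sketch: reading a request file with the standard library.
    # Not IBIS's actual parser, just an illustration of the file format.
    from configparser import ConfigParser

    def read_request(path):
        parser = ConfigParser()
        with open(path) as handle:
            parser.read_file(handle)
        request = dict(parser["Request"])
        # e.g. request["source_table_name"] == "fake_tablename"
        #      int(request["mappers"])      == 10
        return request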

Submit table(s) request to generate ingestion workflow

ibis-shell --submit-request <path to requestfile.txt> --env prod

Sample Content of requestfile.txt

        [Request]
        source_database_name:fake_database           <---- db/owner name of table (mandatory)
        source_table_name:FAKE_SVC_TABLENAME                  <---- name of table (mandatory)
        db_env:sys                                     <---- env specific (optional)
        jdbcurl:jdbc:ora..                             <---- JDBC Url (optional)
        db_username:fake_username                              <---- Database username (optional)
        password_file:/user/dev/fake.password.file    <---- Database password file (optional)
        mappers:5                                      <---- Number of sqoop mappers (optional)
        refresh_frequency:weekly                       <---- if table needs to be on a schedule (optional)
                                                                   [none, daily, weekly, biweekly, fortnightly, monthly, quarterly, yearly]
        weight:heavy                                   <---- load of table (optional)
                                                                   [small, medium, heavy]
        views:fake_view_im|fake_view_open                              <---- Views (optional)
        check_column:TRANS_TIME                        <---- Column for incremental (optional)
        automation_group:magic                                <---- used for grouping tables in Automation (optional)
        fetch_size:50000                               <---- sqoop rows fetch size (optional)
        hold:1                                         <---- workflow won't be generated (optional)
        split_by:fake_nd_tablename_NUM                               <---- Used for sqoop import (recommended)
        source_schema_name:dbo                         <---- Sql-server specific schema name (optional)
        sql_query:S_NO > 50 AND ID = "ABC"             <---- Sqoop import based on custom where clause (optional)
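As a rough illustration of how check_column and sql_query could end up in the generated import, the sketch below builds the kind of incremental WHERE clause described above; it is hypothetical and not taken from IBIS's code generation.

    # Hypothetical sketch of combining check_column and sql_query into a WHERE
    # clause for an incremental import; illustrative only.
    def build_where_clause(check_column, last_value, sql_query=None):
        clauses = ["{0} > '{1}'".format(check_column, last_value)]
        if sql_query:                       # e.g. S_NO > 50 AND ID = "ABC"
            clauses.append("({0})".format(sql_query))
        return " AND ".join(clauses)

    # build_where_clause("TRANS_TIME", "2021-01-01", 'S_NO > 50 AND ID = "ABC"')
    # -> TRANS_TIME > '2021-01-01' AND (S_NO > 50 AND ID = "ABC")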

Test Sqoop auth credentials

ibis-shell --auth-test --env prod --source-db fake_database --source-table fake_dim_tablename
                --user-name fake_username --password-file /user/dev/fake.password.file
                --jdbc-url jdbc:oracle:thin:@//fake.oracle:1521/fake_servicename
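A credential check like this is conceptually the same as running a trivial query through sqoop eval. The sketch below is a guess at an equivalent, assuming the standard Sqoop CLI; it is not the actual --auth-test implementation.

    # Hypothetical sketch: verify JDBC credentials by running a trivial query
    # through "sqoop eval". Not the actual --auth-test implementation.
    import subprocess

    def sqoop_auth_test(jdbc_url, username, password_file):
        return subprocess.call([
            "sqoop", "eval",
            "--connect", jdbc_url,
            "--username", username,
            "--password-file", password_file,
            "--query", "SELECT 1 FROM DUAL",   # Oracle-style probe query
        ]) == 0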

Export request

ibis-shell --export-request-prod <path to requestfile.txt> --env prod

Sample Content of requestfile.txt

        [Request]
        source_database_name:test_oracle_export                    <---- db/owner name of table (mandatory)
        source_table_name:test_oracle                              <---- name of table (mandatory)
        source_dir:/user/data/mdm/test_oracle/test_oracle_export   <---- source dir (optional)
        jdbcurl:jdbc:ora...                                        <---- JDBC Url (mandatory)
        target_schema:fake_schema                                  <---- Target schema (mandatory)
        target_table:test                                          <---- Target table (mandatory)
        db_username:fake_username                                  <---- Database username (mandatory)
        password_file:/user/dev/fake.password.file                 <---- Database password file (mandatory)
        staging_database:fake_staging_database					   <---- Staging database for TD (optional)

Run the Oozie job for a generated table_workflow.xml. Note: pass just the file name, without the extension.

ibis-shell --run-job table_workflow --env prod
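For context, submitting a generated workflow by hand would look roughly like the standard Oozie CLI call sketched below; the host and properties file are placeholders, and the actual mechanics behind --run-job are IBIS's own.

    # Hypothetical sketch: submitting a workflow with the standard Oozie CLI.
    # The Oozie URL and properties file are placeholders.
    import subprocess

    def run_oozie_workflow(properties_path, oozie_url="http://fake-oozie-host:11000/oozie"):
        # job.properties points oozie.wf.application.path at the workflow XML in HDFS
        return subprocess.call([
            "oozie", "job",
            "-oozie", oozie_url,
            "-config", properties_path,
            "-run",
        ])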

Add table(s) to IT table

ibis-shell --update-it-table <path to tables.txt> --env prod

Sample Content of tables.txt

    [Request]
    split_by:CLIENT_STRUC_KEY
    mappers:6
    jdbcurl:jdbc:teradata://fake.teradata/DATABASE=fake_database  # DATABASE= required for teradata
    db_username:fake_username
    password_file:/user/dev/fake.password.file
    refresh_frequency:weekly                        <---- if table needs to be on a schedule (optional)
                                                                [none, daily, weekly, biweekly, fortnightly, monthly, quarterly]
    weight:heavy                                    <---- load of table (optional)
                                                                [small, medium, heavy]
    fetch_size:                                     <---- No value given; the field is simply ignored
    hold:0
    automation_appl_id:null                                <---- set null to empty the column value
    views:fake_view_im|fake_view_open               <---- Pipe(|) separated values
    automation_group:magic_table
    check_column:fake_nd_tablename_NUM              <---- Sqoop incremental column
    source_database_name:fake_database                    (mandatory)
    source_table_name:fake_client_tablename               (mandatory)

Creating a view

ibis-shell --view --view-name <view_name> --db  <name> --table <name>  # Minimum
        # OPTIONAL PARAMETERS: --select {cols} , --where {statement}
        eg. ibis-shell --view --view-name myview --db  mydb --table mytable --select col1 col2 col3 --where 'Country=USA'
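The view options map naturally onto a Hive CREATE VIEW statement. The sketch below shows that correspondence under the assumption that the view is created in the same database as the table; it is illustrative, not the code IBIS runs.

    # Hypothetical sketch of the HiveQL a --view request corresponds to,
    # assuming the view lives in the same database as the table.
    def create_view_sql(view_name, db, table, select_cols=None, where=None):
        cols = ", ".join(select_cols) if select_cols else "*"
        sql = "CREATE VIEW {0}.{1} AS SELECT {2} FROM {0}.{3}".format(db, view_name, cols, table)
        if where:
            sql += " WHERE {0}".format(where)
        return sql

    # create_view_sql("myview", "mydb", "mytable", ["col1", "col2", "col3"], "Country='USA'")
    # -> CREATE VIEW mydb.myview AS SELECT col1, col2, col3 FROM mydb.mytable WHERE Country='USA'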

Create a workflow based on schedule/frequency

ibis-shell --gen-automation-workflows <frequency>  # frequency choices: ['weekly', 'monthly', 'quarterly', 'biweekly']

Create workflows with subworkflows based on one or more filters
ibis-shell --gen-automation-workflow schedule=None database=None jdbc_source=None
        # schedule choices - none, daily, biweekly, weekly, fortnightly, monthly, quarterly
        # jdbc_source choices - oracle, db2, teradata, sqlserver

Save all records or that of specific source in it-table to a file
ibis-shell --save-it-table --source-type sqlserver

Running unit tests

To ensure that IBIS can be deployed automatically and remains consistent and reliable, unit tests are a must. The development team follows TDD practices. To run the unit tests, run the following:

python ibis_test_suite.py

Contribute

Contributors

  • Matthew Wood
  • Bhaskar Teja Yerneni
  • Mohammad Quraishi
  • IBIS logo artwork by Sam Menza

License

The project is licensed under the Apache 2.0 license.

 ▄█  ▀█████████▄   ▄█     ▄████████
███    ███    ███ ███    ███    ███
███▌   ███    ███ ███▌   ███    █▀
███▌  ▄███▄▄▄██▀  ███▌   ███
███▌ ▀▀███▀▀▀██▄  ███▌ ▀███████████
███    ███    ██▄ ███           ███
███    ███    ███ ███     ▄█    ███
█▀   ▄█████████▀  █▀    ▄████████▀
