
dsaidgovsg / Airflow Pipeline

Licence: apache-2.0
An Airflow docker image preconfigured to work well with Spark and Hadoop/EMR

Programming Languages

python
139,335 projects - #7 most used programming language

Projects that are alternatives of or similar to Airflow Pipeline

Dataspherestudio
DataSphereStudio is a one-stop data application development & management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.
Stars: ✭ 1,195 (+833.59%)
Mutual labels:  airflow, spark, hadoop
Apache Spark Hands On
Educational notes and hands-on problems with solutions for the Hadoop ecosystem
Stars: ✭ 74 (-42.19%)
Mutual labels:  spark, hadoop
Waimak
Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
Stars: ✭ 60 (-53.12%)
Mutual labels:  spark, hadoop
Hadoop cookbook
Cookbook to install Hadoop 2.0+ using Chef
Stars: ✭ 82 (-35.94%)
Mutual labels:  spark, hadoop
Hadoopcryptoledger
Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive
Stars: ✭ 126 (-1.56%)
Mutual labels:  spark, hadoop
Docker Hadoop
A Docker container with a full Hadoop cluster setup with Spark and Zeppelin
Stars: ✭ 54 (-57.81%)
Mutual labels:  spark, hadoop
Docker Spark
🚢 Docker image for Apache Spark
Stars: ✭ 78 (-39.06%)
Mutual labels:  spark, hadoop
Interview Questions Collection
Interview questions organized by knowledge domain, including C++, Java, Hadoop, machine learning, and more
Stars: ✭ 21 (-83.59%)
Mutual labels:  spark, hadoop
Repository
A personal learning knowledge base covering data warehouse modeling, real-time computing, big data, Java, algorithms, and more.
Stars: ✭ 92 (-28.12%)
Mutual labels:  spark, hadoop
Bigdata Notes
A beginner's guide to big data ⭐
Stars: ✭ 10,991 (+8486.72%)
Mutual labels:  spark, hadoop
Bigdata Notebook
Stars: ✭ 100 (-21.87%)
Mutual labels:  spark, hadoop
Weblogsanalysissystem
A big data platform for analyzing web access logs
Stars: ✭ 37 (-71.09%)
Mutual labels:  spark, hadoop
Learning Spark
Learning Spark from scratch; big data study materials
Stars: ✭ 37 (-71.09%)
Mutual labels:  spark, hadoop
Docker Spark Cluster
A Spark cluster setup running on Docker containers
Stars: ✭ 57 (-55.47%)
Mutual labels:  spark, hadoop
Data Algorithms Book
MapReduce, Spark, Java, and Scala for Data Algorithms Book
Stars: ✭ 949 (+641.41%)
Mutual labels:  spark, hadoop
Xlearning Xdml
extremely distributed machine learning
Stars: ✭ 113 (-11.72%)
Mutual labels:  spark, hadoop
Dockerfiles
50+ DockerHub public images for Docker & Kubernetes - Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak, TeamCity and DevOps tools built on the major Linux distros: Alpine, CentOS, Debian, Fedora, Ubuntu
Stars: ✭ 847 (+561.72%)
Mutual labels:  spark, hadoop
Bigdata Interview
🎯 🌟 [Big data interview questions] A collection of big-data-related interview questions gathered from around the web, together with the author's own answer summaries. Currently covers interview questions for the Hadoop/Hive/Spark/Flink/HBase/Kafka/ZooKeeper frameworks.
Stars: ✭ 857 (+569.53%)
Mutual labels:  spark, hadoop
Udacity Data Engineering
Udacity Data Engineering Nano Degree (DEND)
Stars: ✭ 89 (-30.47%)
Mutual labels:  airflow, spark
Waterdrop
Production-ready data integration product, documentation:
Stars: ✭ 1,856 (+1350%)
Mutual labels:  spark, hadoop

Airflow Pipeline Docker Image Set-up

CI Status

This repo is a GitHub Actions build-matrix set-up that generates Docker images of Airflow together with other major applications, as listed below:

  • Airflow
  • Spark
  • Hadoop integration with Spark
  • Python
  • SQLAlchemy

Note that this repo is actually a fork of https://github.com/dsaidgovsg/airflow-pipeline, but it has been heavily revamped to use a build matrix that generates Docker images with varying application versions.

Additionally, poetry is used to perform all Python-related installations in a predefined global project directory, so that it is easy to add new packages without conflicting dependency versions, which raw pip cannot achieve. See https://github.com/dsaidgovsg/spark-k8s-addons#how-to-properly-manage-pip-packages for more information.
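
For illustration only, a minimal sketch of how a derived image might add an extra Python package through poetry; the base image reference and the package pin are placeholders, and it assumes poetry is on PATH and picks up the predefined global project directory automatically:

# Hypothetical derived Dockerfile; base tag and package are placeholders
FROM <this-image>:<tag>
RUN poetry add "requests==2.*"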

Also, for convenience, the default entrypoint in the current version runs both the webserver and the scheduler together in the same instance, with the webserver in the background and the scheduler in the foreground. All the convenience environment variables only work when the entrypoint is used without any extra command.

If you prefer to run the various Airflow CLI services separately, you can simply pass the full command to the Docker container, but doing so will no longer trigger any of the convenience environment variables / functionalities.
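
For example, a hedged sketch of running just one service by passing the full command; the image reference is a placeholder, while airflow webserver and airflow scheduler are standard Airflow 1.10 subcommands:

# Run only the webserver; the convenience entrypoint logic is bypassed
docker run --rm -p 8080:8080 <your-image> airflow webserver

# Run only the scheduler in a separate container
docker run --rm <your-image> airflow scheduler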

The above convenience functionalities include:

  1. Waiting for the database (SQLite or Postgres) to become ready
  2. Automatically running airflow initdb
  3. Easy creation of an Airflow Web UI admin user via simple environment variables (see the sketch after this list)
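
The exact environment variable names for item 3 are defined by the image's entrypoint and are not reproduced here; their effect is roughly equivalent to running Airflow 1.10's RBAC user-creation command manually, sketched below with placeholder values (username and password taken from the demo credentials further down):

# Roughly what the entrypoint automates for item 3; all values are placeholders
airflow create_user \
  --role Admin \
  --username admin \
  --password Password123 \
  --firstname Air \
  --lastname Flow \
  --email admin@example.com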

Also note that any command that is run will be executed as the airflow user/group, unless the host overrides the user/group when running the Docker container.
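
For instance, a hedged sketch of overriding the user/group from the host using the standard docker run flag (the image reference is a placeholder):

# Run the container as the host's current user/group instead of airflow:airflow
docker run --rm --user "$(id -u):$(id -g)" <your-image>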

Commands to demo

You will need the docker-compose and docker commands installed.

Default Combined Airflow Webserver and Scheduler

docker-compose up --build

Navigate to http://localhost:8080/, and log in using the following RBAC credentials to try out the DAGs:

  • Username: admin
  • Password: Password123

Note that the webserver logs are suppressed by default.

CTRL-C to gracefully terminate the services.

Separate Airflow Webserver and Scheduler

docker-compose -f docker-compose.split.yml up --build

Navigate to http://localhost:8080/ to try out the DAGs.

Both webserver and scheduler logs are shown separately.

CTRL-C to gracefully terminate the services.
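
For reference, a minimal sketch of what a split set-up along the lines of docker-compose.split.yml typically contains; the image reference, Postgres credentials, and service names below are placeholders rather than copied from this repo:

version: "3"
services:
  postgres:
    image: postgres:11
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
  webserver:
    image: <your-image>
    # Passing a full command bypasses the convenience entrypoint logic, per the notes above
    command: ["airflow", "webserver"]
    ports:
      - "8080:8080"
    depends_on:
      - postgres
  scheduler:
    image: <your-image>
    command: ["airflow", "scheduler"]
    depends_on:
      - postgres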

Versioning

Starting from Docker tags that carry the self-version v1, any breaking change related to Docker image usage will generate a new self-version, so as to minimize the impact on users trying to use the most up-to-date image.

These are considered breaking changes:

  • Change of Linux distro, e.g. Alpine <-> Debian. This automatically leads to a difference in the package management tool used, such as apk vs apt. Note, however, that this does not include upgrades within a Linux distro that may affect package management, e.g. alpine:3.9 vs alpine:3.10.
  • Removal of advertized installed CLI tools that are not listed within the Docker tag. E.g. Spark and Hadoop are part of the Docker tag, so they are not counted among the advertized CLI tools.
  • Removal of advertized environment variables
  • Change of any environment variable value

In the case where a CLI tool is known to have undergone a major version upgrade, this set-up will also try to release a new self-version number. Note, however, that this is on a best-effort basis only, because most of the tools are inherited from upstream, or it is simply not possible / desirable to pin the version to install.

Changelogs

All self-versioned change logs are listed in CHANGELOG.md.

The advertized CLI tools and env vars are also listed in the detailed change logs.

How to Manually Build Docker Image

Example build command:

AIRFLOW_VERSION=1.10
SPARK_VERSION=3.0.0
HADOOP_VERSION=3.2.0
SCALA_VERSION=2.12
PYTHON_VERSION=3.6
SQLALCHEMY_VERSION=1.3
docker build -t airflow-pipeline \
  --build-arg "AIRFLOW_VERSION=${AIRFLOW_VERSION}" \
  --build-arg "SPARK_VERSION=${SPARK_VERSION}" \
  --build-arg "HADOOP_VERSION=${HADOOP_VERSION}" \
  --build-arg "SCALA_VERSION=${SCALA_VERSION}" \
  --build-arg "PYTHON_VERSION=${PYTHON_VERSION}" \
  --build-arg "SQLALCHEMY_VERSION=${SQLALCHEMY_VERSION}" \
  .

You may refer to vars.yml to get a sense of all the possible build arguments to combine.
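
After building, a quick sanity check of the installed versions can be done by passing commands through the image; this assumes the default entrypoint passes arbitrary commands through and that spark-submit is on PATH, as is typical for such images:

# Print the installed Airflow, Spark, and Python versions inside the image
docker run --rm airflow-pipeline airflow version
docker run --rm airflow-pipeline spark-submit --version
docker run --rm airflow-pipeline python --version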

Entrypoint

Additional Useful Perks

There is already an AWS S3 log configuration file in this set-up.

If you wish to save the Airflow logs into an S3 bucket instead, provide the following environment variables when launching the Docker container:

AIRFLOW__CORE__TASK_LOG_READER: s3.task
AIRFLOW__CORE__LOGGING_CONFIG_CLASS: s3_log_config.LOGGING_CONFIG
S3_LOG_FOLDER: s3://yourbucket/path/to/your/dir
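
For example, a hedged sketch of passing these variables at launch; the image reference and bucket path are placeholders, and any AWS credentials / connection set-up required for S3 access is not shown here:

docker run --rm \
  -e AIRFLOW__CORE__TASK_LOG_READER=s3.task \
  -e AIRFLOW__CORE__LOGGING_CONFIG_CLASS=s3_log_config.LOGGING_CONFIG \
  -e S3_LOG_FOLDER=s3://yourbucket/path/to/your/dir \
  <your-image>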

Caveat

Because this image is based on the Spark with Kubernetes compatible image, which always produces Debian-based Docker images, the images generated from this repository are likely to stay Debian-based as well. Note, however, that there is no guarantee this will always be true, but any such change will be marked with a Docker image release tag.

Also, the current default entrypoint logic (when no extra command is given) assumes that a Postgres server will always be used (the default sqlite can work as an alternative). As such, when using the image in this mode, an external Postgres server has to be made available for the Airflow services to access.
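
A hedged sketch of pointing the container at an external Postgres server using Airflow's standard configuration environment variable; the host, credentials, and image reference are placeholders, and the entrypoint may also accept its own Postgres-related variables, which are not reproduced here:

docker run --rm -p 8080:8080 \
  -e AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@your-postgres-host:5432/airflow \
  <your-image>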
