mozilla / Telemetry Airflow

Licence: mpl-2.0
Airflow configuration for Telemetry

Projects that are alternatives of or similar to Telemetry Airflow

Docker Airflow
Repo for building docker based airflow image. Containers support multiple features like writing logs to local or S3 folder and Initializing GCP while container booting. https://abhioncbr.github.io/docker-airflow/
Stars: ✭ 29 (-76.8%)
Mutual labels:  airflow
Discreetly
ETLy is an add-on dashboard service on top of Apache Airflow.
Stars: ✭ 60 (-52%)
Mutual labels:  airflow
Bitnami Docker Airflow
Bitnami Docker Image for Apache Airflow
Stars: ✭ 89 (-28.8%)
Mutual labels:  airflow
Airflow On Kubernetes
Bare minimal Airflow on Kubernetes (Local, EKS, AKS)
Stars: ✭ 38 (-69.6%)
Mutual labels:  airflow
Xene
A distributed workflow runner focusing on performance and simplicity.
Stars: ✭ 56 (-55.2%)
Mutual labels:  airflow
Dataspherestudio
DataSphereStudio is a one-stop data application development & management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.
Stars: ✭ 1,195 (+856%)
Mutual labels:  airflow
Elyra
Elyra extends JupyterLab Notebooks with an AI centric approach.
Stars: ✭ 839 (+571.2%)
Mutual labels:  airflow
Whirl
Fast iterative local development and testing of Apache Airflow workflows
Stars: ✭ 111 (-11.2%)
Mutual labels:  airflow
Airflow Cookbook
Airflow workflow management platform chef cookbook.
Stars: ✭ 58 (-53.6%)
Mutual labels:  airflow
Udacity Data Engineering
Udacity Data Engineering Nano Degree (DEND)
Stars: ✭ 89 (-28.8%)
Mutual labels:  airflow
Data Pipelines With Apache Airflow
Developed a data pipeline to automate data warehouse ETL by building custom airflow operators that handle the extraction, transformation, validation and loading of data from S3 -> Redshift -> S3
Stars: ✭ 50 (-60%)
Mutual labels:  airflow
Airflow Toolkit
Any Airflow project day 1, you can spin up a local desktop Kubernetes Airflow environment AND one in Google Cloud Composer with tested data pipelines(DAGs) 🖥 >> [ 🚀, 🚢 ]
Stars: ✭ 51 (-59.2%)
Mutual labels:  airflow
Airflow Training
Airflow training for the crunch conf
Stars: ✭ 83 (-33.6%)
Mutual labels:  airflow
Objinsync
Continuously synchronize directories from remote object store to local filesystem
Stars: ✭ 29 (-76.8%)
Mutual labels:  airflow
Aws Ecs Airflow
Run Airflow in AWS ECS(Elastic Container Service) using Fargate tasks
Stars: ✭ 107 (-14.4%)
Mutual labels:  airflow
Airflow Maintenance Dags
A series of DAGs/Workflows to help maintain the operation of Airflow
Stars: ✭ 914 (+631.2%)
Mutual labels:  airflow
Terraform Aws Airflow
Terraform module to deploy an Apache Airflow cluster on AWS, backed by RDS PostgreSQL for metadata, S3 for logs and SQS as message broker with CeleryExecutor
Stars: ✭ 69 (-44.8%)
Mutual labels:  airflow
Afctl
afctl helps to manage and deploy Apache Airflow projects faster and smoother.
Stars: ✭ 116 (-7.2%)
Mutual labels:  airflow
Airflow in docker compose
Apache Airflow in Docker Compose (for both versions 1.10.* and 2.*)
Stars: ✭ 109 (-12.8%)
Mutual labels:  airflow
Dataengineeringproject
Example end to end data engineering project.
Stars: ✭ 82 (-34.4%)
Mutual labels:  airflow

Telemetry-Airflow

Airflow is a platform to programmatically author, schedule and monitor workflows.

When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.

Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
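
As a rough illustration of what "following the specified dependencies" means, the sketch below orders three hypothetical tasks (extract, transform, load) with the standard-library graphlib module (Python 3.9+). It does not use Airflow's API, and the task names are assumptions made for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks; each key maps to the set of tasks it depends on.
dependencies = {
    "transform": {"extract"},
    "load": {"transform"},
}


def execution_order(deps):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

A real scheduler also runs independent tasks in parallel; this sketch only shows the ordering constraint.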

Prerequisites

This app is built and deployed with docker and docker-compose.

Updating Python dependencies

Add new Python dependencies into requirements.in. Run the following commands with the same Python version specified by the Dockerfile.

# As of time of writing, python3.7
pip install pip-tools
pip-compile
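
Before running pip-compile, it may help to confirm that the local interpreter matches the container's. The sketch below assumes the Dockerfile's base image line looks like "FROM python:3.7-slim"; that image name and the helper names are illustrative assumptions, not details taken from this repository.

```python
import re
import sys


def dockerfile_python_version(dockerfile_text):
    """Return (major, minor) from the first 'FROM python:X.Y' line, or None."""
    match = re.search(r"^FROM\s+python:(\d+)\.(\d+)", dockerfile_text, re.MULTILINE)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def matches_local_interpreter(dockerfile_text):
    """True if the running interpreter has the same major.minor version."""
    return dockerfile_python_version(dockerfile_text) == tuple(sys.version_info[:2])


sample = "FROM python:3.7-slim\nWORKDIR /app\n"  # assumed Dockerfile content
print(dockerfile_python_version(sample))
```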

Build Container

An Airflow container can be built with

make build

Migrate Database

In the dev environment, database migration is no longer a separate step: the web container runs it on first start if necessary. You should therefore start the web container (and the database container) and wait for the migrations to complete before running the individual test commands below. The easiest way to do this is to run make up and let it run until the migrations complete.

Testing

A single task, e.g. spark, of an Airflow dag, e.g. example, can be run with an execution date, e.g. 2018-01-01, in the dev environment with:

make run COMMAND="test example spark 20180101"
docker logs -f telemetryairflow_scheduler_1
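
The command above takes the DAG id, the task id, and the execution date in compact YYYYMMDD form. A minimal sketch of assembling it from those three parts follows; the helper name make_test_command is hypothetical.

```python
from datetime import date


def make_test_command(dag_id, task_id, execution_date):
    """Build the `make run` invocation for a single task and execution date."""
    compact_date = execution_date.strftime("%Y%m%d")  # 2018-01-01 -> 20180101
    return f'make run COMMAND="test {dag_id} {task_id} {compact_date}"'


command = make_test_command("example", "spark", date(2018, 1, 1))
print(command)
```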

Adding dummy credentials

Tasks often require credentials to access external services. For example, one may choose to store API keys in an Airflow connection or variable. These variables are sure to exist in production but are often not mirrored locally for logistical reasons. Providing a dummy variable is the preferred way to keep the local development environment up to date.

In bin/run, please update the init_connections and init_variables with appropriate strings to prevent broken workflows. To test this, run bin/test-parse to check for errors. You may manually test this by restarting the orchestrated containers and checking for error messages within the main administration UI at localhost:8000.
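
One way to spot variables that still need a dummy value is to scan DAG source for Variable.get calls and compare the names against what is defined locally. The sketch below is a regex heuristic, not part of bin/run or bin/test-parse; the variable names and the DAG snippet are hypothetical.

```python
import re

# Matches Variable.get("name") or Variable.get('name') in DAG source code.
VARIABLE_GET = re.compile(r"""Variable\.get\(\s*["']([^"']+)["']""")


def missing_variables(dag_source, defined_locally):
    """Return variable names read by the DAG source but not defined locally."""
    referenced = set(VARIABLE_GET.findall(dag_source))
    return sorted(referenced - set(defined_locally))


# Hypothetical DAG snippet and local variable names, for illustration only.
dag_source = '''
api_key = Variable.get("example_api_key")
bucket = Variable.get('example_bucket')
'''
missing = missing_variables(dag_source, defined_locally={"example_bucket"})
print(missing)
```

Names built dynamically (for instance from an f-string) would be missed by this pattern, so it complements rather than replaces checking the UI for errors.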

Local Deployment

Assuming you're using macOS and Docker for macOS, start the docker service, click the docker icon in the menu bar, click on preferences and change the available memory to 4GB.

To deploy the Airflow container on the docker engine, with its required dependencies, run:

make up

You can now connect to your local Airflow web console at http://localhost:8000/.

All DAGs are paused by default for local instances and our staging instance of Airflow. In order to submit a DAG via the UI, you'll need to toggle the DAG from "Off" to "On". You'll likely want to toggle the DAG back to "Off" as soon as your desired task starts running.

Workaround for permission issues

Users on Linux distributions may encounter permission issues with docker-compose. This is because the local application folder is mounted as a volume into the running container, where the Airflow user and group ids are set to 10001.

To work around this, replace all instances of 10001 in Dockerfile.dev with the host user id.

sed -i "s/10001/$(id -u)/g" Dockerfile.dev

Testing GKE Jobs (including BigQuery-etl changes)

For now, follow the steps outlined here to create a service account: https://bugzilla.mozilla.org/show_bug.cgi?id=1553559#c1.

Enable that service account in Airflow with the following:

make build && make up
./bin/add_gcp_creds $GOOGLE_APPLICATION_CREDENTIALS

From there, connect to Airflow and enable your job.
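
Before passing the path to bin/add_gcp_creds, it can be worth checking that the file is a readable JSON service account key. The sketch below relies on the "type": "service_account" field that Google service account key files carry; the helper name and the sample project id are hypothetical.

```python
import json
import os
import tempfile


def is_service_account_key(path):
    """Return True if `path` is a JSON file that looks like a service account key."""
    try:
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
    except (OSError, ValueError):
        return False
    return isinstance(data, dict) and data.get("type") == "service_account"


# Demonstration with a throwaway file; real keys contain more fields.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump({"type": "service_account", "project_id": "my-sandbox"}, tmp)
result = is_service_account_key(tmp.name)
os.remove(tmp.name)
print(result)
```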

Testing Dataproc Jobs

Dataproc jobs run on a self-contained Dataproc cluster, created by Airflow.

To test these jobs, you'll need a sandbox account and corresponding service account. For information on creating that, see "Testing GKE Jobs". Your service account will need Dataproc and GCS permissions (and BigQuery, if you're connecting to it). Note: Dataproc requires "Dataproc/Dataproc Worker" as well as Compute Admin permissions. You'll need to ensure that the Dataproc API is enabled in your sandbox project.

Ensure that your dataproc job has a configurable project to write to. Set the project in the DAG entry to be configured based on development environment; see the ltv.py job for an example of that.
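
A minimal sketch of selecting the project by environment follows. It assumes the DEPLOY_ENVIRONMENT variable is what distinguishes production from everything else; the helper name and both project names are hypothetical, and the ltv.py job may do this differently.

```python
import os


def select_project(environ, prod_project, dev_project):
    """Pick the GCP project to write to based on DEPLOY_ENVIRONMENT."""
    # Assumption: only "prod" writes to the production project.
    if environ.get("DEPLOY_ENVIRONMENT") == "prod":
        return prod_project
    return dev_project


# Hypothetical project names; substitute your own sandbox project.
project = select_project(os.environ, "example-prod-project", "example-sandbox-project")
print(project)
```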

From there, run the following:

make build && make up
./bin/add_gcp_creds $GOOGLE_APPLICATION_CREDENTIALS google_cloud_airflow_dataproc

You can then connect to Airflow locally. Enable your DAG and see that it runs correctly.

Production Setup

Note: the canonical reference for production environment variables lives in a private repository.

When deploying to production make sure to set up the following environment variables:

  • AWS_ACCESS_KEY_ID -- The AWS access key ID to spin up the Spark clusters
  • AWS_SECRET_ACCESS_KEY -- The AWS secret access key
  • AIRFLOW_DATABASE_URL -- The connection URI for the Airflow database, e.g. mysql://username:password@hostname:port/database
  • AIRFLOW_BROKER_URL -- The connection URI for the Airflow worker queue, e.g. redis://hostname:6379/0
  • AIRFLOW_RESULT_URL -- The connection URI for the Airflow result backend, e.g. redis://hostname:6379/1
  • AIRFLOW_GOOGLE_CLIENT_ID -- The Google Auth client id used for authentication.
  • AIRFLOW_GOOGLE_CLIENT_SECRET -- The Google Auth client secret used for authentication.
  • AIRFLOW_GOOGLE_APPS_DOMAIN -- The domain(s) to restrict Google Auth login to e.g. mozilla.com
  • AIRFLOW_SMTP_HOST -- The SMTP server to use to send emails e.g. email-smtp.us-west-2.amazonaws.com
  • AIRFLOW_SMTP_USER -- The SMTP user name
  • AIRFLOW_SMTP_PASSWORD -- The SMTP password
  • AIRFLOW_SMTP_FROM -- The email address to send emails from e.g. [email protected]
  • URL -- The base URL of the website e.g. https://workflow.telemetry.mozilla.org
  • DEPLOY_ENVIRONMENT -- The environment currently running, e.g. stage or prod
  • DEPLOY_TAG -- The tag or branch to retrieve the JAR from, e.g. master or tags. You can specify the tag or travis build exactly as well, e.g. master/42.1 or tags/v2.2.1. Not specifying the exact tag or build will use the latest from that branch, or the latest tag.
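
A deployment script could fail fast when a required setting is absent. The sketch below checks a subset of the names listed above; the helper name missing_settings and the choice of subset are assumptions for illustration.

```python
import os

# A subset of the variable names listed above.
REQUIRED = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AIRFLOW_DATABASE_URL",
    "AIRFLOW_BROKER_URL",
    "DEPLOY_ENVIRONMENT",
]


def missing_settings(environ, required=REQUIRED):
    """Return the required names that are unset or empty in `environ`."""
    return [name for name in required if not environ.get(name)]


print(missing_settings(os.environ))
```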

Also, please set

  • AIRFLOW_SECRET_KEY -- A secret key for Airflow's Flask based webserver
  • AIRFLOW__CORE__FERNET_KEY -- A secret key used to encrypt connection passwords saved in the DB

Both values should be set by using the cryptography module's fernet tool that we've wrapped in a docker-compose call:

make secret

Run this once for each of the two key variables, and don't use the same value for both!
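
For reference, a Fernet key is 32 random bytes encoded as URL-safe base64. The sketch below produces values in that format using only the standard library. It illustrates the key format and is not necessarily what make secret runs; the helper name is hypothetical.

```python
import base64
import os


def generate_fernet_style_key():
    """Return 32 random bytes, URL-safe base64-encoded (the Fernet key format)."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")


secret_key = generate_fernet_style_key()
fernet_key = generate_fernet_style_key()  # generated separately, so it differs
print(len(fernet_key))
```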

Debugging

Some useful docker tricks for development and debugging:

# Stop all docker containers:
docker stop $(docker ps -aq)

# Remove any leftover docker volumes:
docker volume rm $(docker volume ls -qf dangling=true)

# Purge docker volumes (helps with mysql container failing to start)
# Careful as this will purge all local volumes not used by at least one container.
docker volume prune

Failing CircleCI 'test-environment' check:

# These commands are from the bin/test-parse script (get_errors_in_listing)
# If --detach is unavailable, make sure you are running the latest version of docker-compose
docker-compose up --detach

docker-compose logs --follow --tail 0 | sed -n '/\[testing_stage_0\]/q'

# Don't pipe to grep to see the full output including your errors
docker-compose exec web airflow list_dags
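
The sed -n '/\[testing_stage_0\]/q' step above reads the log stream silently and exits at the first line containing [testing_stage_0]. A minimal sketch of the same idea in Python follows; the function name and the sample log lines are hypothetical.

```python
def wait_for_marker(lines, marker="[testing_stage_0]"):
    """Consume `lines` until one contains `marker`; return whether it was seen."""
    for line in lines:
        if marker in line:
            return True
    return False


# Hypothetical log lines; in practice `lines` could be a subprocess's stdout.
sample_logs = ["web_1 | starting", "web_1 | [testing_stage_0]", "web_1 | later"]
print(wait_for_marker(sample_logs))
```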

Triggering a task to re-run within the Airflow UI

  • Check if the task / run you want to re-run is visible in the DAG's Tree View UI
  • If the dag run is not showing in the Dag Tree View UI (maybe deleted)
    • Browse -> Dag Runs
    • Create (you can look at another dag run of the same dag for example values too)
      • Dag Id: the name of the dag, for example, main_summary
      • Execution Date: The date the dag should have run, for example, 2018-05-14 00:00:00
      • Start Date: Some date between the execution date and "now", for example, 2018-05-20 00:00:05
      • End Date: Leave it blank
      • State: success
      • Run Id: scheduled__2018-05-14T00:00:00
      • External Trigger: unchecked
    • Click Save
    • From the main DAGs view, click the name of the DAG, then open its Graph view
    • Select the "Run Id" you just entered from the drop-down list
    • Click "Go"
    • Click each element of the DAG and "Mark Success"
    • The tasks should now show in the Tree View UI
  • If the dag run is showing in the DAG's Tree View UI
    • Click on the small square for the task you want to re-run
    • Uncheck the "Downstream" toggle
    • Click the "Clear" button
    • Confirm that you want to clear it
    • The task should be scheduled to run again straight away.
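
The Run Id in the example above is the prefix scheduled__ followed by the execution date as an ISO 8601 timestamp. A small sketch of deriving it, with a hypothetical helper name:

```python
from datetime import datetime


def scheduled_run_id(execution_date):
    """Format a run id like the example above: scheduled__<ISO timestamp>."""
    return "scheduled__" + execution_date.isoformat()


run_id = scheduled_run_id(datetime(2018, 5, 14))
print(run_id)
```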

Triggering backfill tasks using the CLI

  • SSH into the ECS container instance
  • List docker containers using docker ps
  • Log in to one of the docker containers using docker exec -it <container_id> bash. The web server instance is a good choice.
  • Run the desired backfill command, something like $ airflow backfill main_summary -s 2018-05-20 -e 2018-05-26
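
When backfills are scripted, building the argument list programmatically avoids date typos. The sketch below only assembles the command shown above and rejects a reversed date range; the helper name backfill_command is hypothetical.

```python
from datetime import date


def backfill_command(dag_id, start, end):
    """Return the argv list for `airflow backfill` between two dates."""
    if start > end:
        raise ValueError("start date must not be after end date")
    return ["airflow", "backfill", dag_id, "-s", start.isoformat(), "-e", end.isoformat()]


argv = backfill_command("main_summary", date(2018, 5, 20), date(2018, 5, 26))
print(" ".join(argv))
```

The resulting list can be passed to subprocess.run inside the container.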

CircleCI

  • Commits to forked repo PRs will trigger CircleCI builds that build the docker container and test python dag compilation. This should pass prior to merging.
  • Every commit to master, and every tag, will trigger a CircleCI build that builds the container and pushes it to Docker Hub.