
opentrials / opentrials-airflow

License: MPL-2.0
Configuration and definitions of Airflow for OpenTrials

Programming Languages

python
139335 projects - #7 most used programming language
shell
77523 projects
Makefile
30231 projects

Projects that are alternatives to, or similar to, opentrials-airflow

AirflowETL
Blog post on ETL pipelines with Airflow
Stars: ✭ 20 (+11.11%)
Mutual labels:  airflow, data-pipeline
jobAnalytics and search
JobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
Stars: ✭ 25 (+38.89%)
Mutual labels:  airflow, data-pipeline
airflow-site
Apache Airflow Website
Stars: ✭ 95 (+427.78%)
Mutual labels:  airflow
airflow-client-python
Apache Airflow - OpenApi Client for Python
Stars: ✭ 172 (+855.56%)
Mutual labels:  airflow
apache-airflow-cloudera-parcel
Parcel for Apache Airflow
Stars: ✭ 16 (-11.11%)
Mutual labels:  airflow
saisoku
Saisoku is a Python module that helps you build complex pipelines of batch file/directory transfer/sync jobs.
Stars: ✭ 40 (+122.22%)
Mutual labels:  data-pipeline
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake: SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.
Stars: ✭ 39 (+116.67%)
Mutual labels:  data-pipeline
kedro-airflow
Kedro-Airflow makes it easy to deploy Kedro projects to Airflow.
Stars: ✭ 121 (+572.22%)
Mutual labels:  airflow
FastETL
Airflow plugins for implementing data pipelines
Stars: ✭ 31 (+72.22%)
Mutual labels:  airflow
k3ai
A lightweight tool to get an AI infrastructure stack up in minutes, not days. K3ai will take care of setting up K8s for you, deploy the AI tool of your choice, and even run your code on it.
Stars: ✭ 105 (+483.33%)
Mutual labels:  airflow
qunomon
Testbed of AI Systems Quality Management
Stars: ✭ 15 (-16.67%)
Mutual labels:  airflow
machine-learning-data-pipeline
Pipeline module for parallel real-time data processing for machine learning models development and production purposes.
Stars: ✭ 22 (+22.22%)
Mutual labels:  data-pipeline
T-Watch
Real Time Twitter Sentiment Analysis Product
Stars: ✭ 20 (+11.11%)
Mutual labels:  airflow
polygon-etl
ETL (extract, transform and load) tools for ingesting Polygon blockchain data to Google BigQuery and Pub/Sub
Stars: ✭ 53 (+194.44%)
Mutual labels:  airflow
incremental training
Repo that relates to the Medium blog post 'Keeping your ML model in shape with Kafka, Airflow and MLFlow'
Stars: ✭ 110 (+511.11%)
Mutual labels:  airflow
dbt-airflow-docker-compose
Execution of DBT models using Apache Airflow through Docker Compose
Stars: ✭ 76 (+322.22%)
Mutual labels:  airflow
Insight-GDELT-Feed
A way for home buyers to know about factors affecting a state
Stars: ✭ 43 (+138.89%)
Mutual labels:  airflow
ob bulkstash
Bulk Stash is a docker rclone service to sync, or copy, files between different storage services. For example, you can copy files either to or from a remote storage services like Amazon S3 to Google Cloud Storage, or locally from your laptop to a remote storage.
Stars: ✭ 113 (+527.78%)
Mutual labels:  data-pipeline
Data-pipeline-project
Data pipeline project
Stars: ✭ 18 (+0%)
Mutual labels:  data-pipeline
torchx
TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
Stars: ✭ 165 (+816.67%)
Mutual labels:  airflow

opentrials-airflow

Airflow is a platform to programmatically author, schedule and monitor workflows.

When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.

Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
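
Once the container is running (see Local Deployment below), you can inspect the DAGs and their task structure from the Airflow CLI; a minimal sketch, assuming the Airflow 1.x command names:

# List all DAGs known to this Airflow instance:
airflow list_dags

# Show the task tree of one DAG (the example DAG is referenced in the Testing section below):
airflow list_tasks example --tree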

Build Container

An Airflow container can be built with

docker build -t opentrials/opentrials-airflow .

and pushed to Docker hub with

docker push opentrials/opentrials-airflow
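
If you need traceable images rather than a floating latest tag, one option (an assumption, not a convention of this repository) is to tag the image with the current git commit:

# Build and push an image tagged with the short git SHA (hypothetical tagging scheme):
docker build -t opentrials/opentrials-airflow:$(git rev-parse --short HEAD) .
docker push opentrials/opentrials-airflow:$(git rev-parse --short HEAD)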

Testing

A single task (e.g. spark) of an Airflow DAG (e.g. example) can be run for a given execution date (e.g. 2016-01-01) in the dev environment with:

ansible-playbook ansible/deploy_local.yml -e '@ansible/envs/dev.yml' -e "command='test example spark 20160101'"

The container will run the desired task to completion (or failure). Note that if the container is stopped during the execution of a task, the task will be aborted; in this example's case, the Spark job would be terminated.
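
The command string passed to Ansible maps onto the Airflow CLI inside the container, so if you already have a shell in the container (see the CLI section below for docker exec), the equivalent direct invocation would be something like this sketch, assuming the Airflow 1.x test subcommand:

# Run a single task instance for one execution date, ignoring dependencies:
airflow test example spark 2016-01-01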

The logs of the task can be inspected in real-time with:

docker logs -f files_scheduler_1

Local Deployment

Assuming you are on OS X, first create a Docker machine with a sufficient amount of memory, e.g.:

docker-machine create -d virtualbox --virtualbox-memory 4096 default

To deploy the Airflow container on the docker engine, with its required dependencies, run:

ansible-playbook ansible/deploy_local.yml -e '@ansible/envs/dev.yml'
echo "Airflow web console should now be running locally at http://$(docker-machine ip default):8080"

Note that this will start running all the DAGs with a start date in the past! To avoid that, do not pass the AWS credentials.

If you get a message saying "Couldn't connect to Docker daemon - you might need to run docker-machine start default", try the following:

docker-machine start default
eval "$(docker-machine env default)"

You can now connect to your local Airflow web console with a URL like http://192.168.99.100:8080 (see above for how to identify the exact IP address).

Production Deployment

To deploy to our Docker Cloud, run:

make deploy

This requires the Vault password file to be located at ./.vault_pass so that Ansible can decrypt the production variables.
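
A minimal sketch of preparing that file (the password value is a placeholder; obtain the real one from wherever your team stores secrets):

# Write the Ansible Vault password to the expected location and restrict its permissions:
echo 'REPLACE_WITH_REAL_VAULT_PASSWORD' > ./.vault_pass
chmod 600 ./.vault_pass
make deploy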

Configuration

Our Airflow instance needs a few configuration variables and connections to be able to run the supplied DAGs. Please refer to the CONFIGURATION.md file to learn what those variables mean and how to use them.
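
For reference, variables and connections of this kind can also be set from the CLI inside a running container; a sketch, assuming Airflow 1.x syntax and hypothetical names (the real keys are documented in CONFIGURATION.md):

# Set an Airflow Variable (hypothetical key and value):
airflow variables --set ENV production

# Register an Airflow Connection (hypothetical connection id and URI):
airflow connections --add --conn_id warehouse --conn_uri postgres://user:pass@host:5432/db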

Debugging

Some useful docker tricks for development and debugging:

# Stop all docker containers:
docker stop $(docker ps -aq)

# Remove any leftover docker volumes:
docker volume rm $(docker volume ls -qf dangling=true)
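
A related cleanup step that pairs with the above (an assumption, not part of the original list) is removing the stopped containers themselves:

# Remove all stopped containers:
docker rm $(docker ps -aq)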

Triggering a task to re-run within the Airflow UI

  • Check if the task / run you want to re-run is visible in the DAG's Tree View UI
  • If the dag run is not showing in the Dag Tree View UI (maybe deleted)
    • Browse -> Dag Runs
    • Create (you can look at another dag run of the same dag for example values too)
      • Dag Id: the name of the dag, for example main_summary or crash_aggregates
      • Execution Date: The date the dag should have run, for example 2016-07-14 00:00:00
      • Start Date: Some date between the execution date and "now", for example 2016-07-20 00:00:05
      • End Date: Leave it blank
      • State: success
      • Run Id: scheduled__2016-07-14T00:00:00
      • External Trigger: unchecked
    • Click Save
    • Open the Graph view for the DAG in question (from the main DAGs view, click the name of the DAG)
    • Select the "Run Id" you just entered from the drop-down list
    • Click "Go"
    • Click each element of the DAG and "Mark Success"
    • The tasks should now show in the Tree View UI
  • If the dag run is showing in the DAG's Tree View UI
    • Click on the small square for the task you want to re-run
    • Uncheck the "Downstream" toggle
    • Click the "Clear" button
    • Confirm that you want to clear it
    • The task should be scheduled to run again straight away.
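
As an alternative to the UI steps above, the same effect can be achieved from the CLI by clearing the task instance, which makes the scheduler re-run it; a sketch, assuming the Airflow 1.x clear subcommand and the example DAG and spark task from the Testing section:

# Clear one task instance so the scheduler re-runs it; -t takes a regex over task ids,
# -s/-e bound the execution dates, and the command asks for confirmation:
airflow clear example -t '^spark$' -s 2016-01-01 -e 2016-01-01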

Triggering backfill tasks using the CLI

  • SSH in to the server
  • List docker containers using docker ps
  • Log in to one of the docker containers using docker exec -it <container_id> bash. The webserver instance is a good choice.
  • Run the desired backfill command, something like airflow backfill main_summary -s 2016-06-20 -e 2016-06-26
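
Putting those steps together, a minimal sketch of the full flow (the container id is a placeholder):

# On the server, find the webserver container and open a shell in it:
docker ps
docker exec -it <container_id> bash

# Inside the container, run the backfill:
airflow backfill main_summary -s 2016-06-20 -e 2016-06-26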

Credits

This repository is heavily based on https://github.com/mozilla/telemetry-airflow.
