CompEpigen / workflUX

Licence: Apache-2.0
An open-source, cloud-ready web application for simplified deployment of big data workflows.

Programming Languages

  • JavaScript
  • Python
  • HTML
  • CSS
  • Dockerfile

Projects that are alternatives of or similar to workflUX

Bigdata practice
Practical exercises in big data analysis and visualization
Stars: ✭ 166 (+538.46%)
Mutual labels:  bigdata
Node Hbase
Asynchronous HBase client for Node.js using REST
Stars: ✭ 226 (+769.23%)
Mutual labels:  bigdata
Every Single Day I Tldr
A daily digest of the articles or videos I've found interesting, that I want to share with you.
Stars: ✭ 249 (+857.69%)
Mutual labels:  bigdata
Kotlin Spark Api
This project provides Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x.
Stars: ✭ 183 (+603.85%)
Mutual labels:  bigdata
Flink Boot
Lazy Squirrel (懒松鼠) Flink-Boot is a scaffold that lets Flink fully embrace the Spring ecosystem, so developers can build distributed stream-processing programs in the familiar Java web-development style, making it easier to cross over into stream processing. It aims to let developers write business code quickly, without having to understand distributed-computing theory or the internals of the Flink framework. To further speed up development of large projects, the scaffold integrates Spring for bean management by default, along with frameworks commonly used in microservice and web development, such as the MyBatis ORM, the Hibernate Validator validation framework, and the Spring Retry framework; see the scaffold features below for details.
Stars: ✭ 209 (+703.85%)
Mutual labels:  bigdata
Hadoop Attack Library
A collection of pentest tools and resources targeting Hadoop environments
Stars: ✭ 228 (+776.92%)
Mutual labels:  bigdata
Nmflibrary
MATLAB library for non-negative matrix factorization (NMF): Version 1.8.1
Stars: ✭ 153 (+488.46%)
Mutual labels:  bigdata
attr-gather
Hit a million different APIs and combine the results in one simple hash (without pulling your hair out). A simple workflow system to gather aggregate attributes for something.
Stars: ✭ 30 (+15.38%)
Mutual labels:  workflows
Sparkrdma
RDMA-accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Stars: ✭ 215 (+726.92%)
Mutual labels:  bigdata
Aws Etl Orchestrator
A serverless architecture for orchestrating ETL jobs in arbitrarily-complex workflows using AWS Step Functions and AWS Lambda.
Stars: ✭ 245 (+842.31%)
Mutual labels:  bigdata
Awesome Learning
Practice source code repository: https://github.com/jast90/bigdata. Search for "Jast" on WeChat and follow the official account for the latest technical content 😯.
Stars: ✭ 197 (+657.69%)
Mutual labels:  bigdata
Shifu
An end-to-end machine learning and data mining framework on Hadoop
Stars: ✭ 207 (+696.15%)
Mutual labels:  bigdata
Simple It English
Simple-IT-English: a smart wordbook from the community, for the community
Stars: ✭ 233 (+796.15%)
Mutual labels:  bigdata
Flinkx
Based on Apache Flink; supports data synchronization/integration and streaming SQL computation.
Stars: ✭ 2,651 (+10096.15%)
Mutual labels:  bigdata
UofT-Timetable-Generator
A web application that generates timetables for university students at the University of Toronto
Stars: ✭ 34 (+30.77%)
Mutual labels:  web-application
Java Notes
☕️ Java fundamentals 👫 object-oriented thinking ✏️ algorithms 📝 operating systems ☁️ networking 💾 databases 🙊 Spring 💡 system architecture 🐘 big data
Stars: ✭ 160 (+515.38%)
Mutual labels:  bigdata
Tdengine
An open-source big data platform designed and optimized for the Internet of Things (IoT).
Stars: ✭ 17,434 (+66953.85%)
Mutual labels:  bigdata
geoform-template-js
GeoForm is a configurable template for form-based data editing of a Feature Service.
Stars: ✭ 66 (+153.85%)
Mutual labels:  web-application
bigdatatutorial
bigdatatutorial
Stars: ✭ 34 (+30.77%)
Mutual labels:  bigdata
Dpark
A Python clone of Spark; a MapReduce-like framework written in Python
Stars: ✭ 2,668 (+10161.54%)
Mutual labels:  bigdata

workflUX - The Workflow User eXperience

(Formerly known as CWLab.)

An open-source, cloud-ready web application for simplified deployment of big data workflows.

CI/CD: Build Status

Packaging: PyPI (status, version, supported Python versions), Docker Cloud (automated build)

Citation & Contribution: DOI, All Contributors

Installation and Quick Start:

Attention: workflUX is in a beta state, and breaking changes might be introduced in the future. However, if you would like to test it or even run it in production, we will support you.

Installation can be done using pip:
python3 -m pip install workflux
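
If you prefer an isolated setup, here is a minimal sketch using a virtual environment (the environment name is arbitrary):

python3 -m venv workflux-env          # create an isolated environment
source workflux-env/bin/activate      # activate it (Linux/macOS)
python3 -m pip install workflux       # install workflUX into it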

Please see the section "Configuration" for a discussion of available options.

Start the webserver with your custom configuration (or leave out the --config flag to use the default one):
workflux up --config config.yaml
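
With the default configuration, the web interface should then be reachable at http://localhost:5000 (see WEB_SERVER_HOST and WEB_SERVER_PORT below).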

If you would like to make use of containers for dependency management, you need to install Docker or a Docker-compatible containerization solution like Singularity or udocker. To run on Windows or macOS, please install the dedicated Docker versions: Docker for Windows, Docker for Mac.

The usage of the web interface should be self-explanatory thanks to built-in instructions. The following section gives an overview of the basic usage scenario.

Supported Systems:

workflUX is written in platform-agnostic Python and can therefore be executed on:

  • Linux
  • macOS
  • Windows*

Any CWL runner that has a command-line interface, such as cwltool, Toil, or Cromwell, can be integrated into workflUX in order to execute CWL workflows or tool-wrappers.

Therefore, workflUX can be used on any infrastructure supported by these CWL runners, including:

  • single workstations
  • HPC clusters (PBS, LSF, slurm, ...)
  • clouds (AWS, GCP, Azure, OpenStack)

*Please note:
Execution on Windows is only supported by cwltool, which talks to Docker for Windows. Therefore, CWL-wrapped tools and workflows which were originally designed for Linux/macOS can be executed on Windows with a graphical interface provided by workflUX.

Usage:

Please see our tutorial, which walks you through a simple yet meaningful example of how workflUX can be used to compare the spike protein sequences of Covid-19 in two patient cohorts.

Here are some appetizers:

(Screenshot: welcome screen)

(Screenshot: creating a job)

Configuration:

workflUX is a highly versatile package and makes almost no assumptions about the hardware and software environment used for the execution of CWL. To adapt it to your system and use case, a set of configuration options is available:

  • General configs, including:
    • webserver (hosting IP address and port, remotely or locally available, login protected or not)
    • paths of working directories
  • Execution profiles:
    This flexible API allows you to adapt workflUX to your local software environment and to integrate a CWL runner of your choice (such as cwltool, Toil, or Cromwell).

All configuration options can be specified in a single YAML file which is provided to workflUX upon start:
workflux up --config my_config.yaml

To get an example config file, run the following command (or see the examples below):
workflux print_config > config.yaml
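
A typical round trip therefore looks like this:

workflux print_config > config.yaml   # dump the default configuration
# edit config.yaml to match your environment, then:
workflux up --config config.yaml      # start the webserver with it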

General Configs:

  • WEB_SERVER_HOST:
    Specify the host or IP address on which the webserver shall run. Use localhost for local usage on your machine only. Use 0.0.0.0 to allow remote accessibility by other machines in the same network.
    Default: localhost

  • WEB_SERVER_PORT:
    Specify the port used by the webserver.
    Default: 5000

  • TEMP_DIR:
    Directory for temporary files.
    Default: a subfolder "workflux/temp" in the home directory

  • WORKFLOW_DIR:
    Directory for saving CWL documents.
    Default: a subfolder "workflux/workflows" in the home directory

  • EXEC_DIR:
    Directory for saving execution data including output files.
    Default: a subfolder "workflux/temp" in the home directory

  • DEFAULT_INPUT_DIR:
    Default directory where users can search for input files. You may specify additional input directories using the "ADD_INPUT_DIRS" parameter.
    Default: a subfolder "workflux/input" in the home directory

  • DB_DIR:
    Directory for databases.
    Default: a subfolder "workflux/db" in the home directory

  • ADD_INPUT_DIRS:
    In addition to "DEFAULT_INPUT_DIR", these directories can be searched by the user for input files.
    Please specify them in the format "name: path" as shown in this example:

    ADD_INPUT_DIRS:
        GENOMES_DIR: '/ngs_share/genomes'
        PUBLIC_GEO_DATA: '/datasets/public/geo'
    

    Default: no additional input dirs.

  • ADD_INPUT_AND_UPLOAD_DIRS:
    Users can search these directories for input files (in addition to "DEFAULT_INPUT_DIR"), and they may also upload their own files.
    Please specify them in the format "name: path" as shown in this example:

    ADD_INPUT_AND_UPLOAD_DIRS:
        UPLOAD_SCRATCH: '/scratch/upload'
        PERMANENT_UPLOAD_STORE: '/datasets/upload'
    

    Default: no additional input dirs.

  • DEBUG:
    If set to True, the debugging mode is turned on. Do not use on production systems.
    Default: False
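
Taken together, a minimal set of general configs might look like this (a sketch; the port and all paths are placeholders for your own setup):

WEB_SERVER_HOST: 0.0.0.0   # reachable by other machines in the network; use localhost for local-only access
WEB_SERVER_PORT: 8080      # any free port works; 5000 is the default
DEBUG: False
TEMP_DIR: '/data/workflux/temp'
WORKFLOW_DIR: '/data/workflux/workflows'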

Exec Profiles:

This is where you configure how CWL jobs are executed on your system. A profile consists of four steps: prepare, exec, eval, and finalize (only exec is required; the rest are optional). For each step, you can specify commands that are executed in a bash or powershell session.

You can define multiple execution profiles as shown in the config example below. This allows frontend users to choose between different execution options (e.g. using different CWL runners, different dependency-management systems, or even choosing between multiple available batch-execution infrastructures like LSF, PBS, ...). For each execution profile, the following configuration parameters are available (only type and exec are required):

  • type:
    Specify which shell/interpreter to use. For Linux or macOS, use bash. For Windows, use powershell.
    Required.

  • max_retries: Specify how many times the execution (all steps) is retried before marking a run as failed.

  • timeout:
    For each step in the execution profile, you can set a timeout limit (in seconds).
    Default:

    prepare: 120
    exec: 86400
    eval: 120
    finalize: 120
  • prepare*:
    Commands that are executed before the actual CWL execution, for instance to load required Python/conda environments.
    Optional.

  • exec*:
    Commands to start the CWL execution. Usually, this is only the command line that invokes the CWL runner. The stdout and stderr of the CWL runner should be redirected to the predefined log file.
    Required.

  • eval*:
    The exit status at the end of the exec step is automatically checked. Here you can specify commands to additionally evaluate the content of the execution log to determine if the execution succeeded. To communicate failure to workflUX, set the SUCCESS variable to False.
    Optional.

  • finalize*: Commands that are executed after exec and eval. For instance, this can be used to clean up temporary files.
    Optional.

* Additional notes regarding execution profile steps:

  • In each step, the following predefined variables are available:
    • JOB_ID
    • RUN_ID (please note: it is only unique within a job)
    • WORKFLOW (the path to the used CWL document)
    • RUN_INPUT (the path to the YAML file containing input parameters)
    • OUTPUT_DIR (the path of the run-specific output directory)
    • LOG_FILE (the path of the log file that should receive the stdout and stderr of CWL runner)
    • SUCCESS (if set to False the run will be marked as failed and terminated)
    • PYTHON_PATH (the path to the python interpreter used to run workflUX)
  • The steps will be executed in the order: prepare, exec, eval, finalize.
  • You may define your own variables in one step and access them in the subsequent steps.
  • At the end of each step, the exit code is checked. If it is non-zero, the run will be marked as failed. Please note: if a step consists of multiple commands and an intermediate command fails, this will not be recognized by workflUX as long as the final command of the step succeeds. To manually communicate failure to workflUX, please set the SUCCESS variable to False.
  • The steps are executed using pexpect (https://pexpect.readthedocs.io/en/stable/overview.html); this also allows you to connect to a remote infrastructure via SSH (using an SSH key is recommended). Please be aware that the paths of files or directories specified in the input parameter YAML will not be adapted to the new host. We are working on solutions to achieve automated path correction and/or upload functionality when the execution host is not the workflUX server host.
  • On Windows, please be aware that each code block (contained in {...}) has to be on one line.
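
Putting these pieces together, here is a sketch of an execution profile that uses all four steps and several of the predefined variables. The conda environment name and the cleanup path are assumptions for illustration, not part of workflUX:

EXEC_PROFILES:
    cwltool_conda:
        type: bash
        max_retries: 1
        prepare: |
            # assumption: a conda environment "cwl" providing cwltool exists on this host
            source activate cwl
        exec: |
            cwltool --outdir "${OUTPUT_DIR}" "${WORKFLOW}" "${RUN_INPUT}" \
                >> "${LOG_FILE}" 2>&1
        eval: |
            # mark the run as failed if cwltool did not report success in the log
            if ! grep -q "Final process status is success" "${LOG_FILE}"
            then
                SUCCESS=False
            fi
        finalize: |
            # hypothetical cleanup of scratch data left behind by the run
            rm -rf "${OUTPUT_DIR}/tmp"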

Example configuration files:

Below, you can find example configurations for the local execution of CWL workflows or tools with cwltool.

Linux / macOS:

WEB_SERVER_HOST: localhost 
WEB_SERVER_PORT: 5000

DEBUG: False  

TEMP_DIR: '/home/workflux_user/workflux/temp'
WORKFLOW_DIR: '/home/workflux_user/workflux/workflows'
EXEC_DIR: '/datasets/processing_out/'
DEFAULT_INPUT_DIR: '/home/workflux_user/workflux/input'
DB_DIR: '/home/workflux_user/workflux/db'

ADD_INPUT_DIRS:
    GENOMES_DIR: '/ngs_share/genomes'
    PUBLIC_GEO_DATA: '/datasets/public/geo'

ADD_INPUT_AND_UPLOAD_DIRS:
    UPLOAD_SCRATCH: '/scratch/upload'
    PERMANENT_UPLOAD_STORE: '/datasets/upload'

EXEC_PROFILES:
    cwltool_local:
        type: bash
        max_retries: 2
        timeout:
            prepare: 120
            exec: 86400
            eval: 120
            finalize: 120
        exec: |
            cwltool --outdir "${OUTPUT_DIR}" "${WORKFLOW}" "${RUN_INPUT}" \
                >> "${LOG_FILE}" 2>&1
        eval: | 
            LAST_LINE=$(tail -n 1 "${LOG_FILE}")
            if [[ "${LAST_LINE}" == *"Final process status is success"* ]]
            then
                SUCCESS=True
            else
                SUCCESS=False
                ERR_MESSAGE="cwltool failed - ${LAST_LINE}"
            fi
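
Other CWL runners plug in the same way. Below is a sketch of a profile for toil-cwl-runner that relies on the automatic exit-status check instead of a log-based eval step; the exact command line may differ between Toil versions:

EXEC_PROFILES:
    toil_local:
        type: bash
        exec: |
            toil-cwl-runner --outdir "${OUTPUT_DIR}" \
                "${WORKFLOW}" "${RUN_INPUT}" >> "${LOG_FILE}" 2>&1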

Windows:

WEB_SERVER_HOST: localhost
WEB_SERVER_PORT: 5000

DEBUG: False  

TEMP_DIR: 'C:\Users\workflux_user\workflux\temp'
WORKFLOW_DIR: 'C:\Users\workflux_user\workflux\workflows'
EXEC_DIR: 'D:\processing_out\'
DEFAULT_INPUT_DIR: 'C:\Users\workflux_user\workflux\input'
DB_DIR: 'C:\Users\workflux_user\workflux\db'

ADD_INPUT_DIRS:
    GENOMES_DIR: 'E:\genomes'
    PUBLIC_GEO_DATA: 'D:\public\geo'
    
ADD_INPUT_AND_UPLOAD_DIRS:
    UPLOAD_SCRATCH: 'E:\upload'
    PERMANENT_UPLOAD_STORE: 'D:\upload'

EXEC_PROFILES:
    cwltool_windows:
        type: powershell
        max_retries: 2
        timeout:
            prepare: 120
            exec: 86400
            eval: 120
            finalize: 120
        exec: |
            . "${PYTHON_PATH}" -m cwltool --debug --default-container ubuntu:16.04 --outdir "${OUTPUT_DIR}" "${CWL}" "${RUN_INPUT}" > "${LOG_FILE}" 2>&1

        eval: |
            $LAST_LINES = (Get-Content -Tail 2 "${LOG_FILE}")

            if ($LAST_LINES -match "Final process status is success"){$SUCCESS="True"}
            else {$SUCCESS="False"; $ERR_MESSAGE = "cwltool failed - ${LAST_LINES}"}

Licence:

This package is free to use and modify under the Apache 2.0 Licence.

Contributors

Thanks goes to these wonderful people (emoji key):


Kersten Breuer

💻 🎨

Pavlo Lutsik

💻 🤔 💵

Sven Twardziok

💻

Marius

💻 🚇

Lukas Jelonek

💻

Michael Franklin

💻

Alex Kanitz

💻

Yoann PAGEAUD

💻

Yassen Assenov

🤔

YuYu Lin

💻 🔌

This project follows the all-contributors specification. Contributions of any kind welcome!
