
scipipe / SciPipe

License: MIT
Robust, flexible and resource-efficient pipelines using Go and the commandline

Programming Languages

Go
31,211 projects - #10 most used programming language

Projects that are alternatives to, or similar to, SciPipe

Galaxy
Data intensive science for everyone.
Stars: ✭ 812 (-1.69%)
Mutual labels:  workflow-engine, bioinformatics, pipeline, workflow
Nextflow
A DSL for data-driven computational pipelines
Stars: ✭ 1,337 (+61.86%)
Mutual labels:  workflow-engine, dataflow, bioinformatics, pipeline
Ugene
UGENE is free open-source cross-platform bioinformatics software
Stars: ✭ 112 (-86.44%)
Mutual labels:  bioinformatics, pipeline, workflow
Flowr
Robust and efficient workflows using a simple language agnostic approach
Stars: ✭ 73 (-91.16%)
Mutual labels:  bioinformatics, pipeline, workflow
Rnaseq Workflow
A repository for setting up a RNAseq workflow
Stars: ✭ 170 (-79.42%)
Mutual labels:  bioinformatics, pipeline, workflow
Cookiecutter
DEPRECATED! Please use nf-core/tools instead
Stars: ✭ 18 (-97.82%)
Mutual labels:  bioinformatics, pipeline, workflow
Sarek
Detect germline or somatic variants from normal or tumour/normal whole-genome or targeted sequencing
Stars: ✭ 124 (-84.99%)
Mutual labels:  bioinformatics, pipeline, workflow
Machine
Machine is a workflow/pipeline library for processing data
Stars: ✭ 78 (-90.56%)
Mutual labels:  workflow-engine, pipeline, workflow
Batchflow
BatchFlow helps you conveniently work with random or sequential batches of your data and define data processing and machine learning workflows even for datasets that do not fit into memory.
Stars: ✭ 156 (-81.11%)
Mutual labels:  workflow-engine, pipeline, workflow
Cuneiform
Cuneiform distributed programming language
Stars: ✭ 175 (-78.81%)
Mutual labels:  workflow-engine, bioinformatics, workflow
bistro
A library to build and execute typed scientific workflows
Stars: ✭ 43 (-94.79%)
Mutual labels:  workflow, bioinformatics, pipeline
Arvados
An open source platform for managing and analyzing biomedical big data
Stars: ✭ 274 (-66.83%)
Mutual labels:  workflow-engine, bioinformatics, workflow
Jug
Parallel programming with Python
Stars: ✭ 337 (-59.2%)
Mutual labels:  workflow-engine, workflow
Tactic
Open source remote collaboration platform used for configuring and deploying enterprise Workflow solutions.
Stars: ✭ 301 (-63.56%)
Mutual labels:  workflow-engine, workflow
Pvm
Build workflows, activities, BPMN like processes, or state machines with PVM.
Stars: ✭ 348 (-57.87%)
Mutual labels:  workflow-engine, workflow
Utask
µTask is an automation engine that models and executes business processes declared in yaml. ✏️📋
Stars: ✭ 374 (-54.72%)
Mutual labels:  workflow-engine, workflow
Rnaseq
RNA sequencing analysis pipeline using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.
Stars: ✭ 305 (-63.08%)
Mutual labels:  pipeline, workflow
Piper
piper - a distributed workflow engine
Stars: ✭ 374 (-54.72%)
Mutual labels:  workflow-engine, pipeline
Rush
A cross-platform command-line tool for executing jobs in parallel
Stars: ✭ 421 (-49.03%)
Mutual labels:  bioinformatics, pipeline
Pipeline
Pipeline is a package to build multi-staged concurrent workflows with a centralized logging output.
Stars: ✭ 433 (-47.58%)
Mutual labels:  pipeline, workflow

SciPipe

Robust, flexible and resource-efficient pipelines using Go and the commandline


Project links: Documentation & Main Website | Issue Tracker | Chat

Why SciPipe?

  • Intuitive: SciPipe works by flowing data through a network of channels and processes
  • Flexible: Wrapped command-line programs can be combined with processes in Go
  • Convenient: Full control over how your files are named
  • Efficient: Workflows are compiled to binary code that runs fast
  • Parallel: Pipeline parallelism between processes as well as task parallelism for multiple inputs, making efficient use of multiple CPU cores (see the sketch after this list)
  • Supports streaming: Stream data between programs to avoid wasting disk space
  • Easy to debug: Use available Go debugging tools or just println()
  • Portable: Distribute workflows as Go code or as self-contained executable files
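
To make the parallelism bullet concrete, here is a minimal sketch of a workflow with two independent branches. The process names and commands are invented for illustration, and only API calls that appear in the examples further down are used; since each process runs independently, the two branches execute concurrently, bounded by the max-tasks limit passed to NewWorkflow:

package main

import (
    // Import SciPipe, aliased to sp
    sp "github.com/scipipe/scipipe"
)

func main() {
    // Allow up to 4 tasks to run concurrently
    wf := sp.NewWorkflow("parallel_branches", 4)

    // Branch 1: write a file, then upper-case its contents
    fooWriter := wf.NewProc("foo_writer", "echo foo > {o:out|.txt}")
    fooUpper := wf.NewProc("foo_upper", "tr a-z A-Z < {i:in} > {o:out|.txt}")
    fooUpper.In("in").From(fooWriter.Out("out"))

    // Branch 2: independent of branch 1, so it can run in parallel with it
    barWriter := wf.NewProc("bar_writer", "echo bar > {o:out|.txt}")
    barUpper := wf.NewProc("bar_upper", "tr a-z A-Z < {i:in} > {o:out|.txt}")
    barUpper.In("in").From(barWriter.Out("out"))

    wf.Run()
}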

Project updates

Introduction

SciPipe is a library for writing Scientific Workflows, sometimes also called "pipelines", in the Go programming language.

When you need to run many commandline programs that depend on each other in complex ways, SciPipe helps by making the process of running these programs flexible, robust and reproducible. SciPipe also lets you restart an interrupted run without overwriting already produced output, and produces an audit report of what was run, among many other things.

SciPipe is built on the proven principles of Flow-Based Programming (FBP) to achieve maximum flexibility, productivity and agility when designing workflows. Compared to plain dataflow, FBP provides the benefit that processes are fully self-contained, so that a library of re-usable components can be created and plugged into new workflows ad hoc, as the sketch below illustrates.
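
As a concrete illustration of such a re-usable component, the sketch below wraps a process definition in a plain Go function that can be dropped into any workflow. The component name and command are invented for this example, and only the NewProc and port API used in the examples further down are assumed:

package main

import (
    // Import SciPipe, aliased to sp
    sp "github.com/scipipe/scipipe"
)

// NewGzipper returns a process that gzips whatever file arrives on its
// "in" port. Because the definition is fully self-contained, the same
// component can be re-used across workflows. (Illustrative sketch; this
// component is not part of SciPipe itself.)
func NewGzipper(wf *sp.Workflow, name string) *sp.Process {
    return wf.NewProc(name, "gzip -c {i:in} > {o:out|.gz}")
}

func main() {
    wf := sp.NewWorkflow("component_demo", 4)

    writer := wf.NewProc("writer", "echo 'some data' > {o:out|.txt}")

    // Plug the re-usable component into this workflow ad hoc
    gzipper := NewGzipper(wf, "gzipper")
    gzipper.In("in").From(writer.Out("out"))

    wf.Run()
}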

Similar to other FBP systems, SciPipe workflows can be likened to a network of assembly lines in a factory, where items (files) flow through a network of conveyor belts, stopping at different, independently running stations (processes) for processing.

SciPipe was initially created for problems in bioinformatics and cheminformatics, but works equally well for any problem involving pipelines of commandline applications.

Project status: SciPipe is pretty stable now, and only very minor API changes might still occur. We have successfully used SciPipe in a handful of both real and experimental projects, and it has seen occasional use outside the research group as well.

Known limitations

Hello World example

Let's look at an example workflow to get a feel for what writing workflows in SciPipe looks like:

package main

import (
    // Import SciPipe, aliased to sp
    sp "github.com/scipipe/scipipe"
)

func main() {
    // Init workflow and max concurrent tasks
    wf := sp.NewWorkflow("hello_world", 4)

    // Initialize processes, and file extensions
    hello := wf.NewProc("hello", "echo 'Hello ' > {o:out|.txt}")
    world := wf.NewProc("world", "echo $(cat {i:in}) World > {o:out|.txt}")

    // Define data flow
    world.In("in").From(hello.Out("out"))

    // Run workflow
    wf.Run()
}

Running the example

The {i:in} and {o:out|.txt} snippets in the command patterns above are placeholders, which SciPipe replaces with concrete input and output file paths when the command is executed; the .txt after the pipe character sets the file extension used by SciPipe's automatic file naming.

Let's put the code in a file named hello_world.go and run it:

$ go run hello_world.go
AUDIT   2018/07/17 21:42:26 | workflow:hello_world             | Starting workflow (Writing log to log/scipipe-20180717-214226-hello_world.log)
AUDIT   2018/07/17 21:42:26 | hello                            | Executing: echo 'Hello ' > hello.out.txt
AUDIT   2018/07/17 21:42:26 | hello                            | Finished: echo 'Hello ' > hello.out.txt
AUDIT   2018/07/17 21:42:26 | world                            | Executing: echo $(cat ../hello.out.txt) World > hello.out.txt.world.out.txt
AUDIT   2018/07/17 21:42:26 | world                            | Finished: echo $(cat ../hello.out.txt) World > hello.out.txt.world.out.txt
AUDIT   2018/07/17 21:42:26 | workflow:hello_world             | Finished workflow (Log written to log/scipipe-20180717-214226-hello_world.log)

Let's check what files SciPipe has generated:

$ ls -1 hello*
hello.out.txt
hello.out.txt.audit.json
hello.out.txt.world.out.txt
hello.out.txt.world.out.txt.audit.json

As you can see, it has created the files hello.out.txt and hello.out.txt.world.out.txt, together with an accompanying .audit.json file for each of them.

Now, let's check the contents of the final resulting file:

$ cat hello.out.txt.world.out.txt
Hello World

Now we can rejoice that it contains the text "Hello World", exactly as a proper Hello World example should :)

Now, those were rather long and cumbersome file names, weren't they? SciPipe gives you very good control over how to name your files, if you don't want to rely on the automatic file naming. For example, we could set the first file name to a static one, and then use the first name as a basis for the file name produced by the second process, like so:

package main

import (
    // Import the SciPipe package, aliased to 'sp'
    sp "github.com/scipipe/scipipe"
)

func main() {
    // Init workflow with a name, and max concurrent tasks
    wf := sp.NewWorkflow("hello_world", 4)

    // Initialize processes and set output file paths
    hello := wf.NewProc("hello", "echo 'Hello ' > {o:out}")
    hello.SetOut("out", "hello.txt")

    world := wf.NewProc("world", "echo $(cat {i:in}) World >> {o:out}")
    // The modifier 's/.txt//' will replace '.txt' in the input path with ''
    world.SetOut("out", "{i:in|s/.txt//}_world.txt")

    // Connect network
    world.In("in").From(hello.Out("out"))

    // Run workflow
    wf.Run()
}

Now, if we run this, the file names get a little cleaner:

$ ls -1 hello*
hello.txt
hello.txt.audit.json
hello.txt.world.txt
hello.txt.world.txt.audit.json

The audit logs

Finally, let's have a look at one of the audit files created:

$ cat hello.txt.world.txt.audit.json
{
    "ID": "99i5vxhtd41pmaewc8pr",
    "ProcessName": "world",
    "Command": "echo $(cat hello.txt) World \u003e\u003e hello.txt.world.txt.tmp/hello.txt.world.txt",
    "Params": {},
    "Tags": {},
    "StartTime": "2018-06-15T19:10:37.955602979+02:00",
    "FinishTime": "2018-06-15T19:10:37.959410102+02:00",
    "ExecTimeNS": 3000000,
    "Upstream": {
        "hello.txt": {
            "ID": "w4oeiii9h5j7sckq7aqq",
            "ProcessName": "hello",
            "Command": "echo 'Hello ' \u003e hello.txt.tmp/hello.txt",
            "Params": {},
            "Tags": {},
            "StartTime": "2018-06-15T19:10:37.950032676+02:00",
            "FinishTime": "2018-06-15T19:10:37.95468214+02:00",
            "ExecTimeNS": 4000000,
            "Upstream": {}
        }
    }
}

Each such audit file contains a hierarchical JSON representation of the full workflow path that was executed in order to produce the file. On the first level is the command that directly produced the corresponding file, and then, indexed by their file names under "Upstream", there are similar chunks describing how each of its input files was generated. This repeats recursively for larger workflows, so that for each file generated by the workflow there is always a full, hierarchical history of all the commands run, with their associated metadata, to produce that file.
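
Since the audit files are plain JSON with the recursive structure shown above, they are straightforward to post-process with standard tooling. The following sketch is not part of the SciPipe API; it is a hypothetical helper whose struct fields simply mirror the JSON keys in the example above, and it prints the full chain of commands behind a given file:

package main

import (
    "encoding/json"
    "fmt"
    "os"
)

// AuditInfo mirrors a subset of the JSON keys seen in the audit file above.
// (Hypothetical helper struct, not part of the SciPipe API.)
type AuditInfo struct {
    ID          string
    ProcessName string
    Command     string
    Upstream    map[string]AuditInfo
}

// printCommands recursively prints the command that produced a file,
// followed by the commands for all of its upstream input files.
func printCommands(indent string, ai AuditInfo) {
    fmt.Printf("%s%s: %s\n", indent, ai.ProcessName, ai.Command)
    for _, up := range ai.Upstream {
        printCommands(indent+"  ", up)
    }
}

func main() {
    data, err := os.ReadFile("hello.txt.world.txt.audit.json")
    if err != nil {
        panic(err)
    }
    var ai AuditInfo
    if err := json.Unmarshal(data, &ai); err != nil {
        panic(err)
    }
    printCommands("", ai)
}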

You can find many more examples in the examples folder in the GitHub repo.

For more information about how to write workflows using SciPipe, and much more, see the SciPipe website (scipipe.org)!

More material on SciPipe

Citing SciPipe

If you use SciPipe in academic or scholarly work, please cite the following paper as source:

Lampa S, Dahlö M, Alvarsson J, Spjuth O. SciPipe: A workflow library for agile development of complex and dynamic bioinformatics pipelines. GigaScience 8(5), 2019. DOI: 10.1093/gigascience/giz044

Acknowledgements

Related tools

Below are a few tools that are more or less similar to SciPipe, and worth checking out before deciding which tool fits you best (in approximate order of similarity to SciPipe):
