
mschubert / Clustermq

Licence: apache-2.0
R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or any of these via SSH

Programming Languages

r
7636 projects

Projects that are alternatives of or similar to Clustermq

Drake
An R-focused pipeline toolkit for reproducibility and high-performance computing
Stars: ✭ 1,301 (+1127.36%)
Mutual labels:  r-package, high-performance-computing
Targets
Function-oriented Make-like declarative workflows for R
Stars: ✭ 293 (+176.42%)
Mutual labels:  r-package, high-performance-computing
Ssh
Native SSH client in R based on libssh
Stars: ✭ 111 (+4.72%)
Mutual labels:  r-package, ssh
Teleport
Certificate authority and access plane for SSH, Kubernetes, web apps, databases and desktops
Stars: ✭ 10,602 (+9901.89%)
Mutual labels:  ssh, cluster
Doazureparallel
A R package that allows users to submit parallel workloads in Azure
Stars: ✭ 102 (-3.77%)
Mutual labels:  cluster
Groovy Ssh
SSH automation tool based on Groovy DSL
Stars: ✭ 100 (-5.66%)
Mutual labels:  ssh
Alfred Tty
Alfred Workflow to quickly switch between or close iTerm windows, tabs and panes based on title and tty
Stars: ✭ 99 (-6.6%)
Mutual labels:  ssh
Ftpbucket
FTPbucket is a PHP script that enables you to sync your BitBucket or GitHub repository with any web-server
Stars: ✭ 99 (-6.6%)
Mutual labels:  ssh
Scrapyd Cluster On Heroku
Set up free and scalable Scrapyd cluster for distributed web-crawling with just a few clicks. DEMO 👉
Stars: ✭ 106 (+0%)
Mutual labels:  cluster
Lastpass Ssh
SSH key management with LastPass
Stars: ✭ 105 (-0.94%)
Mutual labels:  ssh
Rorcid
A programmatic interface the Orcid.org API
Stars: ✭ 101 (-4.72%)
Mutual labels:  r-package
Forecastml
An R package with Python support for multi-step-ahead forecasting with machine learning and deep learning algorithms
Stars: ✭ 101 (-4.72%)
Mutual labels:  r-package
Ssh keyscanner
ssh public host key scanner using shodan
Stars: ✭ 102 (-3.77%)
Mutual labels:  ssh
Mercury
Simple Android app that sends pre-configured commands to remote servers via SSH.
Stars: ✭ 100 (-5.66%)
Mutual labels:  ssh
Btrfs Sxbackup
Incremental btrfs snapshot backups with push/pull support via SSH
Stars: ✭ 105 (-0.94%)
Mutual labels:  ssh
Gratia
ggplot-based graphics and useful functions for GAMs fitted using the mgcv package
Stars: ✭ 102 (-3.77%)
Mutual labels:  r-package
Gitzone
git-based zone management tool for static and dynamic domains
Stars: ✭ 100 (-5.66%)
Mutual labels:  ssh
Redis Tools
my tools working with redis
Stars: ✭ 104 (-1.89%)
Mutual labels:  cluster
Jupiter
Jupiter是一款性能非常不错的, 轻量级的分布式服务框架
Stars: ✭ 1,372 (+1194.34%)
Mutual labels:  cluster
Kubernetes Pfsense Controller
Integrate Kubernetes and pfSense
Stars: ✭ 100 (-5.66%)
Mutual labels:  cluster

ClusterMQ: send R function calls as cluster jobs


This package will allow you to send function calls as jobs on a computing cluster with a minimal interface provided by the Q function:

# load the library and create a simple function
library(clustermq)
fx = function(x) x * 2

# queue the function call on your scheduler
Q(fx, x=1:3, n_jobs=1)
# list(2,4,6)

Computations are done entirely over the network and without any temporary files on network-mounted storage, so there is no strain on the file system apart from starting up R once per job. All calculations are load-balanced, i.e. workers that get their jobs done faster will also receive more function calls to work on. This is especially useful if not all calls return after the same time, or if one worker has a high load.
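As an illustration of this load balancing (the sleep times and worker count below are made up for demonstration), calls that finish early free their worker to pick up the next call instead of waiting for the slowest one:

```r
# illustrative only: each call sleeps for a different time, so with two
# workers the short calls are not stuck waiting behind the long one
library(clustermq)
slow_fx = function(x) { Sys.sleep(x); x * 2 }
Q(slow_fx, x=c(5, 1, 1, 1), n_jobs=2)
# list(10, 2, 2, 2) -- results are returned in input order
```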

Browse the vignettes (User Guide, Technical Documentation) for details.

Installation

First, we need the ZeroMQ system library. This is probably already installed on your system. If not, your package manager will provide it:

# You can skip this step on Windows and macOS, the package binary has it
# On a computing cluster, we recommend using Conda or Linuxbrew
brew install zeromq # Linuxbrew, Homebrew on macOS
conda install zeromq # Conda, Miniconda
sudo apt-get install libzmq3-dev # Ubuntu
sudo yum install zeromq-devel # Fedora
pacman -S zeromq # Arch Linux

Then install the clustermq package in R from CRAN:

install.packages('clustermq')

Alternatively, you can use the remotes package to install directly from GitHub:

# install.packages('remotes')
remotes::install_github('mschubert/clustermq')
# remotes::install_github('mschubert/clustermq', ref="develop") # dev version

Schedulers

An HPC cluster's scheduler ensures that computing jobs are distributed to available worker nodes. Hence, this is what clustermq interfaces with in order to do computations.

We currently support the following schedulers (either locally or via SSH):

  • Multiprocess - test your calls and parallelize on cores using options(clustermq.scheduler="multiprocess")
  • LSF - should work without setup
  • SGE - should work without setup
  • SLURM - should work without setup
  • PBS/Torque - needs options(clustermq.scheduler="PBS"/"Torque")
  • via SSH - needs options(clustermq.scheduler="ssh", clustermq.ssh.host=<yourhost>)

Default submission templates are provided and can be customized, e.g. to activate compute environments or containers.
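For instance, the scheduler and template can be selected via options, typically set in your ~/.Rprofile (the template path below is a hypothetical example; a custom template is only needed if the defaults must be adjusted):

```r
# select the scheduler; a custom template file lets you adjust e.g. the
# queue or activate a conda environment before the worker starts
options(
    clustermq.scheduler = "slurm",
    clustermq.template = "~/slurm.tmpl"  # hypothetical path, optional
)
```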

Usage

The most common arguments for Q are:

  • fun - The function to call. This needs to be self-sufficient (because it will not have access to the master environment)
  • ... - All iterated arguments passed to the function. If there is more than one, all of them need to be named
  • const - A named list of non-iterated arguments passed to fun
  • export - A named list of objects to export to the worker environment

The documentation for other arguments can be accessed by typing ?Q. Examples of using const and export would be:

# adding a constant argument
fx = function(x, y) x * 2 + y
Q(fx, x=1:3, const=list(y=10), n_jobs=1)
# exporting an object to workers
fx = function(x) x * 2 + y
Q(fx, x=1:3, export=list(y=10), n_jobs=1)

clustermq can also be used as a parallel backend for foreach. As this is also used by BiocParallel, we can run those packages on the cluster as well:

library(foreach)
register_dopar_cmq(n_jobs=2, memory=1024) # see `?workers` for arguments
foreach(i=1:3) %dopar% sqrt(i) # this will be executed as jobs
library(BiocParallel)
register(DoparParam()) # after register_dopar_cmq(...)
bplapply(1:3, sqrt)

More examples are available in the user guide.

Comparison to other packages

There are some packages that provide high-level parallelization of R function calls on a computing cluster. We compared clustermq to BatchJobs and batchtools for processing many short-running jobs, and found it to have approximately 1000x less overhead cost.

Overhead comparison

In short, use clustermq if you want:

  • a one-line solution to run cluster jobs with minimal setup
  • access to cluster functions from your local RStudio session via SSH
  • fast processing of many function calls without network storage I/O

Use batchtools if you:

  • want to use a mature and well-tested package
  • don't mind that arguments to every call are written to/read from disk
  • don't mind that there is no load balancing at run time

Use Snakemake or drake if:

  • you want to design and run a workflow on HPC

Don't use batch (last updated 2013) or BatchJobs (issues with SQLite on network-mounted storage).

Contributing

We use GitHub's Issue Tracker to coordinate development of clustermq. Contributions are welcome and come in many different forms, shapes, and sizes. These include, but are not limited to:

  • Questions: You are welcome to ask questions if something is not clear in the User guide.
  • Bug reports: Let us know if something does not work as expected. Be sure to include a self-contained Minimal Reproducible Example and to set log_worker=TRUE.
  • Code contributions: Have a look at the good first issue tag. Please discuss anything more complicated before putting a lot of work in, I'm happy to help you get started.
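For bug reports, a minimal reproducible example with worker logging enabled might look like the following (the failing function is of course only a stand-in):

```r
# stand-in failing call; log_worker=TRUE makes each worker write a log
# file that can be attached to the issue for debugging
library(clustermq)
fx = function(x) stop("something went wrong")
Q(fx, x=1:3, n_jobs=1, log_worker=TRUE)
```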

Citation

This project is part of my academic work, for which I will be evaluated on citations. If you would like me to be able to continue working on research support tools like clustermq, please cite the article when using it for publications:

M Schubert. clustermq enables efficient parallelisation of genomic analyses. Bioinformatics (2019). doi:10.1093/bioinformatics/btz284
