It4innovations / hyperqueue

Licence: MIT license
Scheduler for sub-node tasks for HPC systems with batch scheduling

Programming languages: Rust, Python

Projects that are alternatives of or similar to hyperqueue

Easylambda
distributed dataflows with functional list operations for data processing with C++14
Stars: ✭ 475 (+889.58%)
Mutual labels:  hpc, distributed-computing
future.batchtools
🚀 R package future.batchtools: A Future API for Parallel and Distributed Processing using batchtools
Stars: ✭ 77 (+60.42%)
Mutual labels:  hpc, distributed-computing
Charm4py
Parallel Programming with Python and Charm++
Stars: ✭ 259 (+439.58%)
Mutual labels:  hpc, distributed-computing
Future
🚀 R package: future: Unified Parallel and Distributed Processing in R for Everyone
Stars: ✭ 735 (+1431.25%)
Mutual labels:  hpc, distributed-computing
Future.apply
🚀 R package: future.apply - Apply Function to Elements in Parallel using Futures
Stars: ✭ 159 (+231.25%)
Mutual labels:  hpc, distributed-computing
wrench
WRENCH: Cyberinfrastructure Simulation Workbench
Stars: ✭ 25 (-47.92%)
Mutual labels:  hpc, distributed-computing
dislib
The Distributed Computing library for python implemented using PyCOMPSs programming model for HPC.
Stars: ✭ 39 (-18.75%)
Mutual labels:  hpc, distributed-computing
ParallelUtilities.jl
Fast and easy parallel mapreduce on HPC clusters
Stars: ✭ 28 (-41.67%)
Mutual labels:  hpc, distributed-computing
Federated-Learning-and-Split-Learning-with-raspberry-pi
SRDS 2020: End-to-End Evaluation of Federated Learning and Split Learning for Internet of Things
Stars: ✭ 54 (+12.5%)
Mutual labels:  distributed-computing
pat-helland-and-me
Materials related to my talk "Pat Helland and Me"
Stars: ✭ 14 (-70.83%)
Mutual labels:  distributed-computing
good-karma-kit
😇 A Docker Compose bundle to run on servers with spare CPU, RAM, disk, and bandwidth to help the world. Includes Tor, ArchiveWarrior, BOINC, and more...
Stars: ✭ 238 (+395.83%)
Mutual labels:  distributed-computing
easybuild-easyblocks
Collection of easyblocks that implement support for building and installing software with EasyBuild.
Stars: ✭ 83 (+72.92%)
Mutual labels:  hpc
reframe
A powerful Python framework for writing and running portable regression tests and benchmarks for HPC systems.
Stars: ✭ 154 (+220.83%)
Mutual labels:  hpc
lazycluster
🎛 Distributed machine learning made simple.
Stars: ✭ 43 (-10.42%)
Mutual labels:  distributed-computing
asyncoro
Python framework for asynchronous, concurrent, distributed, network programming with coroutines
Stars: ✭ 50 (+4.17%)
Mutual labels:  distributed-computing
float
Single precision (float) matrices for R.
Stars: ✭ 41 (-14.58%)
Mutual labels:  hpc
jessica
Jessica - Jessie (secure distributed Javascript) Compiler Architecture
Stars: ✭ 27 (-43.75%)
Mutual labels:  distributed-computing
nautilus
Nautilus Aerokernel
Stars: ✭ 30 (-37.5%)
Mutual labels:  hpc
JOLI.jl
Julia Operators LIbrary
Stars: ✭ 14 (-70.83%)
Mutual labels:  distributed-computing
tacc stats
TACC Stats is an automated resource-usage monitoring and analysis package.
Stars: ✭ 36 (-25%)
Mutual labels:  hpc

HyperQueue (HQ) lets you build a computation plan consisting of a large number of tasks and then execute it transparently over a system like SLURM/PBS. It dynamically groups tasks into SLURM/PBS jobs and distributes them to fully utilize the allocated nodes. You thus do not have to manually aggregate your tasks into SLURM/PBS jobs.

Documentation

If you find a bug or a problem with HyperQueue, please create an issue. For more general discussion or feature requests, please use our discussion forum. If you want to chat with the HyperQueue developers, you can use our Zulip server.

Features

  • Performance

    • The inner scheduler can scale to hundreds of nodes
    • The overhead per task is below 0.1 ms
    • HQ allows streaming outputs from tasks to avoid creating many small files on a distributed filesystem
  • Easy deployment

    • HQ is provided as a single, statically linked binary without any dependencies
    • No admin access to a cluster is needed
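To put the per-task overhead from the Performance list into perspective, here is a small back-of-the-envelope calculation. It assumes the overhead adds up linearly with the number of tasks, which is a simplification and not a measured result. The function name `total_overhead_seconds` is made up for this illustration.

```python
# Upper bound on the per-task overhead quoted in the Performance list (0.1 ms).
OVERHEAD_PER_TASK_MS = 0.1


def total_overhead_seconds(num_tasks, overhead_ms=OVERHEAD_PER_TASK_MS):
    """Aggregate overhead in seconds, ASSUMING it grows linearly with task count."""
    return num_tasks * overhead_ms / 1000.0


if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        print(f"{n:>9,} tasks: at most ~{total_overhead_seconds(n):.0f} s of overhead")
```

Under this assumption, 100,000 tasks would cost at most about 10 seconds of overhead in total.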

Getting started

Installation

  • Download the latest binary distribution from this link.

  • Unpack the downloaded archive:

    $ tar -xvzf hq-<version>-linux-x64.tar.gz

If you want to try the newest features, you can also download a nightly build.

Submitting a simple task

  • Start a server (e.g. on a login node or in a cluster partition)

    $ hq server start &
  • Submit a job (command echo 'Hello world' in this case)

    $ hq submit echo 'Hello world'
  • Ask for computing resources

    • Start a worker manually

      $ hq worker start &
    • Automatic submission of workers into PBS/SLURM

      • PBS:

        $ hq alloc add pbs --time-limit 1h -- -q <queue>
      • Slurm:

        $ hq alloc add slurm --time-limit 1h -- -p <partition>
    • Manual request in PBS

      • Start a worker on the first node of a PBS job

        $ qsub <your-params-of-qsub> -- hq worker start
      • Start a worker on all nodes of a PBS job

        $ qsub <your-params-of-qsub> -- `which pbsdsh` hq worker start
    • Manual request in SLURM

      • Start a worker on the first node of a Slurm job

        $ sbatch <your-params-of-sbatch> --wrap "hq worker start"
      • Start a worker on all nodes of a Slurm job

        $ sbatch <your-params-of-sbatch> --wrap "srun hq worker start"
  • Monitor the state of jobs

    $ hq job list --all
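The manual variant of the steps above (server, submit, one local worker, job list) can be collected into a small driver script. The following is only a sketch: the helper names `quickstart_plan` and `run_plan` are hypothetical, the `hq` commands are copied verbatim from the list above, and the script defaults to a dry run that prints the commands instead of executing them. A real script would probably also have to wait for the server to come up before submitting, which is not handled here.

```python
import shlex
import subprocess


def quickstart_plan():
    """Return the quickstart steps as (argv, runs_in_background) pairs, in order."""
    return [
        (["hq", "server", "start"], True),               # start the server
        (["hq", "submit", "echo", "Hello world"], False),  # submit a job
        (["hq", "worker", "start"], True),               # start a worker manually
        (["hq", "job", "list", "--all"], False),         # monitor the state of jobs
    ]


def run_plan(plan, dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for argv, background in plan:
        print("$", shlex.join(argv) + (" &" if background else ""))
        if dry_run:
            continue
        if background:
            # Long-running processes (server, worker) are left running.
            subprocess.Popen(argv)
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run only: nothing is executed unless dry_run=False is passed.
    run_plan(quickstart_plan())
```

Keeping the commands as argument lists avoids shell quoting problems; `shlex.join` is used only to display them in the same form as in the steps above.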

What's next?

Check out the documentation.

FAQ

  • How does HQ work?

    You start an HQ server somewhere (e.g. login node, cloud partition of a cluster). Then you can submit your jobs to the server. You may have hundreds of thousands of jobs; they may have various CPU and other resource requirements.

    Then you can connect any number of HQ workers to the server (either manually or via SLURM/PBS). The server will then immediately start to assign jobs to them.

    Workers are fully and dynamically controlled by the server; you do not need to specify what jobs are executed on a particular worker or preconfigure it in any way.

    HQ provides a command line tool for submitting and controlling jobs.

  • What is a task in HQ?

    A task is a unit of computation. Currently, it is either the execution of an arbitrary external program (specified via CLI) or the execution of a single Python function (specified via our Python API).

  • What is a job in HQ?

    A job is a collection of tasks (a task graph). You can display and manage jobs using the CLI.

  • Do I need to use SLURM or PBS to run HQ?

    No. Even though HQ is designed to smoothly work on systems using SLURM/PBS, they are not required for HQ to work.

  • Is HQ a replacement for SLURM or PBS?

    Definitely not. Multi-tenancy is out of the scope of HQ, i.e. HQ does not provide user isolation. HQ is light-weight and easy to deploy; on an HPC system each user (or a group of users that trust each other) may run her own instance of HQ.

  • Do I need an HPC cluster to run HQ?

    No. None of the functionality is bound to any HPC technology. Communication between all components is performed using TCP/IP. You can also run HQ locally.

  • Is it safe to run HQ on a login node shared by other users?

    Yes. All communication is secured and encrypted. The server generates a secret file, and only those users that have access to this file may submit jobs and connect workers. Users without access to the secret file will only see that the service is running.

  • How many jobs/tasks may I submit into HQ?

    Our preliminary benchmarks show that the overhead of HQ is around 0.1 ms per task. It should thus be possible to submit a job with tens or hundreds of thousands of tasks into HQ.

    Note that HQ is designed for a large number of tasks, not jobs. If you want to perform a lot of computations, use task arrays, i.e. create a job with many tasks, not many jobs each with a single task.

    HQ also supports streaming of task outputs into a single file. This avoids creating many small files for each task on a distributed file system, which improves scaling.

  • Does HQ support multi-CPU jobs?

    Yes. You can define an arbitrary number of cores for each task. HQ is also NUMA aware and you can select the allocation strategy.

  • Does HQ support job arrays?

    Yes, see task arrays.

  • Does HQ support jobs with dependencies?

    Yes, but only using the (currently experimental and undocumented) Python API. It is currently not possible to specify dependencies using the CLI.

  • How is HQ implemented?

    HQ is implemented in Rust using the Tokio async ecosystem. The scheduler is a work-stealing scheduler implemented in our project Tako, which is derived from our previous work RSDS. Integration tests are written in Python, but HQ itself does not depend on Python.
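For readers unfamiliar with the term, the general idea of work stealing can be shown with a toy simulation. This is a generic textbook-style sketch and not Tako's actual algorithm: the function `run_work_stealing`, the round-based execution, and the "steal from the tail of the most loaded queue" policy are all assumptions made for illustration.

```python
from collections import deque


def run_work_stealing(queues):
    """Toy round-based simulation of work stealing (NOT Tako's algorithm).

    `queues` maps a worker name to the list of task ids initially assigned
    to it. In every round each worker executes one task from the head of
    its own queue; a worker with an empty queue first tries to steal one
    task from the tail of the currently most loaded queue.
    Returns a dict mapping each worker to the tasks it executed, in order.
    """
    pending = {worker: deque(tasks) for worker, tasks in queues.items()}
    executed = {worker: [] for worker in queues}

    while any(pending.values()):
        for worker, own in pending.items():
            if not own:
                # Idle worker: pick the most loaded queue as the victim.
                victim = max(pending, key=lambda w: len(pending[w]))
                # Steal only if the victim keeps at least one task for itself.
                if len(pending[victim]) > 1:
                    own.append(pending[victim].pop())
            if own:
                executed[worker].append(own.popleft())
    return executed


if __name__ == "__main__":
    # All six tasks start on w0; w1 and w2 begin idle and steal work.
    result = run_work_stealing({"w0": list(range(6)), "w1": [], "w2": []})
    for worker, tasks in result.items():
        print(worker, "executed", tasks)
```

The point of the sketch is only that initially unbalanced work ends up spread across otherwise idle workers without the user assigning tasks to workers by hand.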

You can find more frequently asked questions here.

HyperQueue team

We are a group of researchers working at IT4Innovations, the Czech National Supercomputing Center. We welcome any outside contributions.

Acknowledgement

  • This work was supported by the LIGATE project. This project has received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 956137. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and Italy, Sweden, Austria, the Czech Republic, Switzerland.

  • This work was supported by the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:90140).