numforge / Laser

Licence: apache-2.0

The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers

Programming Languages

nim

assembler

Labels

deep-learning parallel simd jit high-performance-computing tensor convolution openmp blas

Projects that are alternatives of or similar to Laser

Libxsmm

Library for specialized dense and sparse matrix operations, and deep learning primitives.

Stars: ✭ 518 (+171.2%)

Mutual labels: jit, simd, tensor, convolution, blas

analisis-numerico-computo-cientifico

Numerical analysis and scientific computing

Stars: ✭ 42 (-78.01%)

Mutual labels: openmp, blas, tensor

shortcut-comparison

Performance comparison of parallel Rust and C++

Stars: ✭ 74 (-61.26%)

Mutual labels: parallel, openmp, simd

Edge

Extreme-scale Discontinuous Galerkin Environment (EDGE)

Stars: ✭ 18 (-90.58%)

Mutual labels: openmp, simd, high-performance-computing

Guided Missile Simulation

Guided Missile, Radar and Infrared EOS Simulation Framework written in Fortran.

Stars: ✭ 33 (-82.72%)

Mutual labels: openmp, simd, high-performance-computing

tbslas

A parallel, fast solver for the scalar advection-diffusion and the incompressible Navier-Stokes equations based on semi-Lagrangian/Volume-Integral method.

Stars: ✭ 21 (-89.01%)

Mutual labels: parallel, openmp, simd

Arraymancer

A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU and embedded devices via OpenMP, Cuda and OpenCL backends

Stars: ✭ 793 (+315.18%)

Mutual labels: openmp, tensor, high-performance-computing

John

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs

Stars: ✭ 5,656 (+2861.26%)

Mutual labels: openmp, simd

Ems

Extended Memory Semantics - Persistent shared object memory and parallelism for Node.js and Python

Stars: ✭ 552 (+189.01%)

Mutual labels: parallel, openmp

Taskflow

A General-purpose Parallel and Heterogeneous Task Programming System

Stars: ✭ 6,128 (+3108.38%)

Mutual labels: high-performance-computing, parallel

monolish

monolish: MONOlithic LInear equation Solvers for Highly-parallel architecture

Stars: ✭ 166 (-13.09%)

Mutual labels: openmp, blas

Vectorious

Linear algebra in TypeScript.

Stars: ✭ 616 (+222.51%)

Mutual labels: blas, high-performance-computing

Blis

BLAS-like Library Instantiation Software Framework

Stars: ✭ 859 (+349.74%)

Mutual labels: blas, high-performance-computing

Awesome Tensor Compilers

A list of awesome compiler projects and papers for tensor computation and deep learning.

Stars: ✭ 490 (+156.54%)

Mutual labels: tensor, high-performance-computing

Armadillo Code

Armadillo: fast C++ library for linear algebra & scientific computing - http://arma.sourceforge.net

Stars: ✭ 388 (+103.14%)

Mutual labels: openmp, blas

Ilgpu

ILGPU JIT Compiler for high-performance .Net GPU programs

Stars: ✭ 374 (+95.81%)

Mutual labels: jit, parallel

Multi-dimensional arrays (tensors) and numerical definitions for Elixir

Stars: ✭ 1,133 (+493.19%)

Mutual labels: jit, tensor

Openmp Examples

openmp examples

Stars: ✭ 64 (-66.49%)

Mutual labels: parallel, openmp

Mpm

CB-Geo High-Performance Material Point Method

Stars: ✭ 115 (-39.79%)

Mutual labels: parallel, high-performance-computing

blas-benchmarks

Timing results for BLAS (Basic Linear Algebra Subprograms) libraries in R

Stars: ✭ 24 (-87.43%)

Mutual labels: high-performance-computing, blas

Laser - Primitives for high performance computing

Carefully-tuned primitives for running tensor and image-processing code on CPU, GPUs and accelerators.

The library is in heavy development. For now the CPU backend is being optimised.

Library content

SIMD intrinsics for x86 and x86-64

import laser/simd

Laser includes a wrapper for x86 and x86-64 SIMD intrinsics to operate on 128-bit (SSE) and 256-bit (AVX) vectors of floats and integers. Intrinsics are added on an as-needed basis, driven by Laser's own optimisation needs.

OpenMP templates

import laser/openmp

Laser includes several OpenMP templates to ease data-parallel programming in Nim:

Simple omp parallel for loops
Splitting into chunks with a per-thread ptr+len pair, to parallelize an algorithm that takes a ptr+len
omp parallel, omp critical, omp master, omp barrier and omp flush for fine-grained control over parallelism
attachGC and detachGC if you need to use Nim GC-ed types in a non-master thread
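
As a rough illustration of the simplest case, the sketch below uses Nim's built-in `||` iterator from the system module rather than Laser's own templates, whose exact names are not shown in this section. The proc name scaleInPlace is hypothetical. The loop body works on a raw ptr+len pair and avoids GC-managed types and bounds checks, which is the style the list above hints at.

```nim
# Sketch only: relies on Nim's built-in `||` iterator, not on Laser's templates.
# Compile with: nim c --passC:-fopenmp --passL:-fopenmp example.nim
# Without those flags the pragma is ignored and the loop runs serially.

proc scaleInPlace(data: ptr UncheckedArray[float64], n: int, factor: float64) =
  ## Multiplies each of the `n` elements by `factor` (hypothetical helper).
  ## Each iteration touches a distinct element, so there is no data race.
  for i in 0 || (n - 1):
    data[i] = data[i] * factor

var values = @[1.0, 2.0, 3.0, 4.0]
scaleInPlace(cast[ptr UncheckedArray[float64]](addr values[0]), values.len, 2.0)
echo values
```

Passing a pointer and a length keeps the parallel region free of Nim runtime calls, which is presumably why the chunking template above also hands each thread a ptr+len pair.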

`cpuinfo` for runtime CPU feature detection for x86, x86-64 and ARM

import laser/cpuinfo

Laser includes a wrapper for cpuinfo by Facebook's PyTorch team. It lets you query runtime information about the CPU's SIMD capabilities and its L1, L2, L3 and L4 cache sizes to optimize your compute-bound algorithms.

Example: ex01_cpuinfo.nim

JIT Assembler

import laser/photon_jit

Laser offers its own JIT assembler, with features added on an as-needed basis. It is very lightweight and easy to extend. Currently it only supports x86-64 with the following opcodes.

Loop-fusion and strided iterators for matrix and tensors

import laser/strided_iteration/foreach
import laser/strided_iteration/foreach_staged

Usage - forEach:

forEach x in a, y in b, z in c:
  x += y * z

Laser includes optimised macros to iterate over contiguous and strided tensors. The iterators use normal Nim syntax and are parallelized via OpenMP when it makes sense.

Any tensor type works as long as it exposes the following interface:

rank: the number of dimensions
size: the number of elements in the tensor
shape, strides: a container that supports [] indexing
unsafe_raw_data: a routine that returns a ptr UncheckedArray[T] or any type with [] indexing implemented, including mutable indexing.
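
To make the role of shape and strides concrete, here is a minimal sketch of a two-dimensional type exposing the interface listed above. StridedView and sumAll are hypothetical names for illustration only; this type has not been checked against Laser's forEach macros. It shows how a strided view (every second row) reuses the same buffer with different metadata.

```nim
type
  StridedView = object                 # hypothetical type, not Laser's Tensor
    shape, strides: array[2, int]      # strides are counted in elements
    buf: ptr UncheckedArray[float64]   # borrowed buffer, not owned by the view

proc rank(v: StridedView): int = 2
proc size(v: StridedView): int = v.shape[0] * v.shape[1]
proc unsafe_raw_data(v: StridedView): ptr UncheckedArray[float64] = v.buf

proc sumAll(v: StridedView): float64 =
  ## Visits every logical element; the memory offset comes from the strides.
  let data = v.unsafe_raw_data()
  for i in 0 ..< v.shape[0]:
    for j in 0 ..< v.shape[1]:
      result += data[i * v.strides[0] + j * v.strides[1]]

# A 4x3 row-major matrix holding 0.0, 1.0, ..., 11.0
var storage = newSeq[float64](12)
for i in 0 ..< storage.len:
  storage[i] = float64(i)
let raw = cast[ptr UncheckedArray[float64]](addr storage[0])

let full = StridedView(shape: [4, 3], strides: [3, 1], buf: raw)
# Rows 0 and 2 only: same buffer, row stride doubled, no copy.
let evenRows = StridedView(shape: [2, 3], strides: [6, 1], buf: raw)

echo sumAll(full)      # 66.0
echo sumAll(evenRows)  # 24.0
```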

An advanced iterator, forEachStaged, provides the flexibility needed for advanced use cases, for example parallel reduction:

proc reduction_localsum_critical[T](x, y: Tensor[T]): T =
  forEachStaged xi in x, yi in y:
    openmp_config:
      use_openmp: true
      use_simd: false
      nowait: true
      omp_grain_size: OMP_MEMORY_BOUND_GRAIN_SIZE
    iteration_kind:
      {contiguous, strided} # Default, "contiguous", "strided" are also possible
    before_loop:
      var local_sum = 0.T
    in_loop:
      local_sum += xi + yi
    after_loop:
      omp_critical:
        result += local_sum

Examples:

ex04 - TODO
ex05_tensor_parallel_reduction

Raw tensor type

import laser/tensor/[datatypes, allocator, initialization] # WIP

Laser includes a low-level tensor type with only the low-level allocation and initialization needed:

Aligned allocator
Parallel zero-ing and copy (deep copy, copy from a seq)
Metadata initialisation
Raw data access via pointers, with the Nim compiler as a safeguard: immutable objects return a RawImmutablePtr and mutable objects return a RawMutablePtr, to prevent you from accidentally modifying an immutable object when accessing raw memory.

An example of how to use this to build a higher-level newTensor or randomTensor, transpose and [] is given in iter_bench in the previous section.

Optimised floating point parallel reduction for sum, min and max

import laser/primitives/reductions

Floating-point reductions are not optimised by compilers by default: because of floating-point rounding, they cannot assume that result = (a + b) + c is equivalent to result = a + (b + c). This forces serial evaluation of reductions unless the -ffast-math flag is passed to the compiler.

The primitives work around that by keeping several independent accumulators, so that an addition does not have to wait for the previous one to complete. This allows those kernels to make full use of the memory bandwidth of your computer.
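
The multiple-accumulator idea can be sketched in plain Nim as follows. This is a scalar illustration with four accumulators and hypothetical proc names (sumSerial, sumUnrolled); Laser's actual kernels are not reproduced here and presumably use SIMD registers as accumulators.

```nim
proc sumSerial(x: openArray[float32]): float32 =
  ## Baseline: every addition depends on the result of the previous one.
  for v in x:
    result += v

proc sumUnrolled(x: openArray[float32]): float32 =
  ## Four independent dependency chains that the CPU can overlap.
  var acc0, acc1, acc2, acc3: float32
  let unrolledEnd = x.len - (x.len mod 4)
  var i = 0
  while i < unrolledEnd:
    acc0 += x[i]
    acc1 += x[i + 1]
    acc2 += x[i + 2]
    acc3 += x[i + 3]
    i += 4
  # Combine the partial sums, then handle the leftover elements.
  result = (acc0 + acc1) + (acc2 + acc3)
  for j in unrolledEnd ..< x.len:
    result += x[j]

var data = newSeq[float32](10)
for i in 0 ..< data.len:
  data[i] = float32(i + 1)

echo sumSerial(data)    # 55.0
echo sumUnrolled(data)  # 55.0
```

Because the additions are reassociated, the two procs can differ in the last bits on general inputs; they agree exactly here only because small integers are represented exactly in float32.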

Benchmarks:

reduction_packed_sse

Optimised logarithmic, exponential, tanh, sigmoid, softmax ...

In heavy development.

Unfortunately, the default logarithm and exponential functions in the C and C++ standard <math.h> library are extremely slow.

Benchmarks show that a 10x speed improvement is possible while keeping excellent accuracy.

Optimised transpose, batched transpose and NCHW <=> NHWC format conversion

import laser/primitives/swapaxes

While logical transpose (just swapping the shape and strides metadata of the tensor/matrix) is often enough, we sometimes might need to transpose data physically in-memory.
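
The difference between the two kinds of transpose can be sketched with a small hypothetical Matrix type (not Laser's tensor type). The logical version only swaps metadata, while the physical version rewrites the elements into a new row-major buffer.

```nim
type
  Matrix = object                  # hypothetical minimal type for illustration
    shape, strides: array[2, int]  # strides are counted in elements
    data: seq[float64]

proc `[]`(m: Matrix, i, j: int): float64 =
  m.data[i * m.strides[0] + j * m.strides[1]]

proc logicalTranspose(m: var Matrix) =
  ## Swaps shape and strides in place; the element buffer is untouched.
  swap(m.shape[0], m.shape[1])
  swap(m.strides[0], m.strides[1])

proc physicalTranspose(m: Matrix): Matrix =
  ## Copies the elements so that the transposed matrix is contiguous in memory.
  let (rows, cols) = (m.shape[0], m.shape[1])
  result = Matrix(shape: [cols, rows], strides: [rows, 1],
                  data: newSeq[float64](rows * cols))
  for i in 0 ..< rows:
    for j in 0 ..< cols:
      result.data[j * rows + i] = m[i, j]

var m = Matrix(shape: [2, 3], strides: [3, 1],
               data: @[1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
let packed = physicalTranspose(m)
logicalTranspose(m)

echo m.data       # @[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
echo packed.data  # @[1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
echo m[2, 1] == packed[2, 1]  # true
```

Both results index identically, but only the physical copy is laid out contiguously, which matters when a later kernel expects unit-stride rows.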

Laser provides optimised routines for physical transpose, batched transpose (N matrices) and conversion of images between NCHW and NHWC, i.e. [Image id, Color, Height, Width] and [Image id, Height, Width, Color].

90% of ML libraries, including Nvidia's cuDNN, prefer to work in NCHW, while images are often decoded in HWC.
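
As a reference for what the layout conversion computes, here is a naive, unoptimised sketch of NHWC to NCHW conversion. The proc name nhwcToNchw is hypothetical and this is not Laser's implementation, which is tuned for cache behaviour; the sketch only shows the index mapping.

```nim
proc nhwcToNchw(src: openArray[float32], n, h, w, c: int): seq[float32] =
  ## Reorders [image, height, width, channel] data into [image, channel, height, width].
  doAssert src.len == n * h * w * c
  result = newSeq[float32](src.len)
  for img in 0 ..< n:
    for y in 0 ..< h:
      for x in 0 ..< w:
        for ch in 0 ..< c:
          let srcIdx = ((img * h + y) * w + x) * c + ch
          let dstIdx = ((img * c + ch) * h + y) * w + x
          result[dstIdx] = src[srcIdx]

# One image, 1 pixel high, 2 pixels wide, 3 channels: (R0 G0 B0) (R1 G1 B1)
let interleaved = [1'f32, 2'f32, 3'f32, 4'f32, 5'f32, 6'f32]
let planar = nhwcToNchw(interleaved, 1, 1, 2, 3)
echo planar  # @[1.0, 4.0, 2.0, 5.0, 3.0, 6.0], i.e. R0 R1 G0 G1 B0 B1
```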

Benchmarks:

transpose_bench

Optimised strided Matrix-Multiplication for integers and floats

import laser/primitives/matrix_multiplication/gemm

Matrix multiplication is at the base of machine learning and numerical computing.

The Dense/Linear/Affine layer of a neural network is just a matrix multiplication, and convolutions are often reframed as matrix multiplications to benefit from the 20 years of optimisation research that have gone into BLAS libraries.

Laser implements its own multithreaded BLAS with the following details:

It reaches 98% of OpenBLAS speed on float64 when multithreaded and 102% when single-threaded
It reaches 97% of OpenBLAS speed on float32 when multithreaded and 99% when single-threaded
It supports strided matrices, for example resulting from slicing every 2 rows or every 2 columns: myTensor[0::2, :]. This is very useful when doing cross-validation as you don't need an extra copy before matrix-multiplication.
Contrary to 99% of the BLAS out there, it supports integers: int32 and int64 using SSE2 or AVX2 instructions

Extending support to new SIMD instruction sets, including ARM Neon and AVX512, is very easy, and so is adding a software fallback. For example, this is how to add AVX2 int32 support with a fallback for fused multiply-add:

template int32x8_muladd_unfused_avx2(a, b, c: m256i): m256i =
  mm256_add_epi32(mm256_mullo_epi32(a, b), c)

ukernel_generator(
      x86_AVX2,
      typ = int32,
      vectype = m256i,
      nb_scalars = 8,
      simd_setZero = mm256_setzero_si256,
      simd_broadcast_value = mm256_set1_epi32,
      simd_load_aligned = mm256_load_si256,
      simd_load_unaligned = mm256_loadu_si256,
      simd_store_unaligned = mm256_storeu_si256,
      simd_mul = mm256_mullo_epi32,
      simd_add = mm256_add_epi32,
      simd_fma = int32x8_muladd_unfused_avx2
    )

In the future

Operation fusion

The BLAS will allow easily fusing unary operations (like max/relu, tanh or sigmoid) and binary operations (like adding a bias) at the end of the matrix multiplication kernels.

Those operations are memory-bound rather than compute-bound, and at the end of a matrix multiplication we already have all the data in memory (in the unary case) or half of it (in the binary case), so fusing them saves a full extra pass over the matrix.

Similarly, you will be able to fuse operations before the matrix multiplication kernel, during the prepacking step when data is re-ordered for high-performance processing. This will be useful for backward propagation, where the derivatives of relu, tanh and sigmoid must be applied before each matrix multiplication.
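
The idea of fusing operations at the end of the kernel can be illustrated with a naive triple-loop matrix multiplication. This is only a sketch with a hypothetical proc name (matmulBiasRelu), not the planned Laser API: the bias addition and the relu are applied to each output element while it is still in a register, instead of in separate passes over the result.

```nim
proc matmulBiasRelu(a, b, bias: seq[float32], m, k, n: int): seq[float32] =
  ## Computes relu(A * B + bias) for row-major A (m x k) and B (k x n).
  ## `bias` holds one value per output column.
  doAssert a.len == m * k and b.len == k * n and bias.len == n
  result = newSeq[float32](m * n)
  for i in 0 ..< m:
    for j in 0 ..< n:
      var acc = 0'f32
      for p in 0 ..< k:
        acc += a[i * k + p] * b[p * n + j]
      # Fused epilogue: bias (binary op) and relu (unary op) applied before the store.
      result[i * n + j] = max(acc + bias[j], 0'f32)

let lhs = @[1'f32, -2'f32, 3'f32, 4'f32]  # 2x2
let identity = @[1'f32, 0'f32, 0'f32, 1'f32]
let offsets = @[0.5'f32, 0.5'f32]
echo matmulBiasRelu(lhs, identity, offsets, 2, 2, 2)  # @[1.5, 0.0, 3.5, 4.5]
```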

Pre-packing

Pre-packing matrices and working on pre-packed matrices is also being added. This is useful for matrices that are used repeatedly, for example in batched matrix multiplication.

An im2col prepacker that fuses the convolution-to-matrix-multiplication (im2col) step with the matrix multiplication packing is also planned, to get very efficient convolutions.
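
For readers unfamiliar with im2col, the sketch below shows the unfused baseline under simplifying assumptions: a single-channel image, stride 1, no padding, and the cross-correlation convention used in deep learning (no kernel flip). The proc names are hypothetical. The planned prepacker would avoid materialising the intermediate matrix that this sketch builds explicitly.

```nim
proc im2col(img: seq[float32], h, w, kh, kw: int): seq[float32] =
  ## Builds a (kh*kw) x (outH*outW) row-major matrix in which each column
  ## holds the pixels covered by the kernel at one output position.
  let outH = h - kh + 1
  let outW = w - kw + 1
  let nOut = outH * outW
  result = newSeq[float32](kh * kw * nOut)
  for ky in 0 ..< kh:
    for kx in 0 ..< kw:
      let row = ky * kw + kx
      for oy in 0 ..< outH:
        for ox in 0 ..< outW:
          result[row * nOut + oy * outW + ox] = img[(oy + ky) * w + (ox + kx)]

proc conv2dViaIm2col(img, kernel: seq[float32], h, w, kh, kw: int): seq[float32] =
  ## The convolution becomes a (1 x kh*kw) by (kh*kw x nOut) matrix product.
  let cols = im2col(img, h, w, kh, kw)
  let nOut = (h - kh + 1) * (w - kw + 1)
  result = newSeq[float32](nOut)
  for r in 0 ..< kh * kw:
    for c in 0 ..< nOut:
      result[c] += kernel[r] * cols[r * nOut + c]

# 3x3 image holding 1..9 and a 2x2 kernel of ones: each output is a window sum.
var image = newSeq[float32](9)
for i in 0 ..< image.len:
  image[i] = float32(i + 1)
let ones = @[1'f32, 1'f32, 1'f32, 1'f32]
echo conv2dViaIm2col(image, ones, 3, 3, 2, 2)  # @[12.0, 16.0, 24.0, 28.0]
```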

Batched matrix multiplication

We often need batched matrix multiplication, for example N tensors A multiplied by a single tensor B, or N tensors A multiplied by N tensors B. This is planned.

Small matrix multiplication

In many cases we don't deal with 1000x1000 matrices. For example the traditional image size is 224x224 and the overhead to re-pack matrices in an efficient format is not justified.

When reframing convolutions in terms of matrix multiplication this is even worse as the main convolution kernels are 1x1, 3x3, 5x5.

Optimised small matrix-multiplication is planned.

Optimised convolutions

In heavy development.

Benchmarks:

conv2D_bench

State-of-the-art random distributions and weighted random sampling

In heavy development.

Benchmarks of multinomial sampling for Natural Language Processing and Reinforcement Learning: bench_multinomial_samplers

Usage & Installation

The library is split into relatively independent modules that can be used without the others.

For example, to use only the SIMD and CPU-detection portions:

import laser/simd
import laser/cpuinfo

To use only OpenMP:

import laser/openmp

The library is unstable and will be published on nimble when it is more mature. In practice, it will be published once it is ready to be the CPU backend of Arraymancer; it will then automatically benefit from the dozens of tests and edge cases handled in the Arraymancer test suite.

License

Laser is licensed under the Apache License version 2
Facebook's cpuinfo is licensed under Simplified BSD (BSD 2 clauses)

Stars: ✭ 191
