raygon-renderer / Thermite

Licence: other
Thermite SIMD: Melt your CPU

Projects that are alternatives of or similar to Thermite

SIMDCompressionAndIntersection
A C++ library to compress and intersect sorted lists of integers using SIMD instructions
Stars: ✭ 289 (+104.96%)
Mutual labels:  algorithms, simd
Algodeck
An Open-Source Collection of 200+ Algorithmic Flash Cards to Help You Prepare for Your Algorithm & Data Structure Interview 💯
Stars: ✭ 4,441 (+3049.65%)
Mutual labels:  algorithms, math
Baekjoon
Problem sets for coding-test preparation (Baekjoon Online Judge)
Stars: ✭ 295 (+109.22%)
Mutual labels:  algorithms, math
zig-gamedev
Building game development ecosystem for @ziglang!
Stars: ✭ 1,059 (+651.06%)
Mutual labels:  math, simd
Phobos
The standard library of the D programming language
Stars: ✭ 1,038 (+636.17%)
Mutual labels:  algorithms, math
hlml
vectorized high-level math library
Stars: ✭ 42 (-70.21%)
Mutual labels:  math, simd
Rtm
Realtime Math
Stars: ✭ 373 (+164.54%)
Mutual labels:  simd, math
Ugm
Ubpa Graphics Mathematics
Stars: ✭ 178 (+26.24%)
Mutual labels:  simd, math
Blog
About math, programming and procedural generation
Stars: ✭ 37 (-73.76%)
Mutual labels:  algorithms, math
Mlcourse.ai
Open Machine Learning Course
Stars: ✭ 7,963 (+5547.52%)
Mutual labels:  algorithms, math
SCNMathExtensions
Math extensions for SCNVector3, SCNQuaternion, SCNMatrix4
Stars: ✭ 32 (-77.3%)
Mutual labels:  math, simd
Project Euler Solutions
Runnable code for solving Project Euler problems in Java, Python, Mathematica, Haskell.
Stars: ✭ 1,374 (+874.47%)
Mutual labels:  algorithms, math
HLML
Auto-generated maths library for C and C++ based on HLSL/Cg
Stars: ✭ 23 (-83.69%)
Mutual labels:  math, simd
Leetcode Go
✅ Solutions to LeetCode in Go, 100% test coverage, runtime beats 100% / LeetCode solutions
Stars: ✭ 22,440 (+15814.89%)
Mutual labels:  algorithms, math
Stats
A well tested and comprehensive Golang statistics library package with no dependencies.
Stars: ✭ 2,196 (+1457.45%)
Mutual labels:  algorithms, math
Mango
mango fun framework
Stars: ✭ 343 (+143.26%)
Mutual labels:  simd, math
Cglm
📽 Highly Optimized Graphics Math (glm) for C
Stars: ✭ 887 (+529.08%)
Mutual labels:  simd, math
Libmaths
A Python library created to assist programmers with complex mathematical functions
Stars: ✭ 72 (-48.94%)
Mutual labels:  algorithms, math
Sage
Mirror of the Sage source tree -- please do not submit PRs here -- everything must be submitted via https://trac.sagemath.org/
Stars: ✭ 1,656 (+1074.47%)
Mutual labels:  algorithms, math
Conduit
High Performance Streams Based on Coroutine TS ⚡
Stars: ✭ 135 (-4.26%)
Mutual labels:  algorithms

Thermite SIMD: Melt your CPU

NOTE: This crate is not yet on crates.io, but I do own the name and will publish it there when ready

Thermite is a WIP SIMD library focused on providing portable SIMD acceleration of SoA (Structure of Arrays) algorithms, using consistent-length1 SIMD vectors for lockstep iteration and computation.

Thermite provides highly optimized feature-rich backends for SSE2, SSE4.2, AVX and AVX2, with planned support for AVX512, ARM/Aarch64 NEON, and WASM SIMD extensions.

In addition to that, Thermite includes a highly optimized vectorized math library with many special math functions and algorithms, specialized for both single and double precision.

1 Within a given instruction set, all of Thermite's vector types have the same number of lanes, regardless of element size.
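The SoA-with-lockstep-iteration idea can be sketched in plain Rust (no Thermite; the `Points` type and `lengths_sq` function here are hypothetical illustrations): each field lives in its own array and is processed in fixed-size chunks, which is the shape a SIMD backend maps directly onto vector registers.

```rust
// Hypothetical SoA layout: each field in its own array, so a SIMD backend can
// load N lanes of `x` and N lanes of `y` at once and iterate them in lockstep.
struct Points {
    x: Vec<f32>,
    y: Vec<f32>,
}

/// Squared lengths computed chunk-by-chunk; with a real SIMD backend each
/// 4-element chunk would map to a single 128-bit vector operation.
fn lengths_sq(p: &Points) -> Vec<f32> {
    p.x.chunks(4)
        .zip(p.y.chunks(4))
        .flat_map(|(xs, ys)| xs.iter().zip(ys).map(|(x, y)| x * x + y * y))
        .collect()
}

fn main() {
    let p = Points { x: vec![3.0, 1.0], y: vec![4.0, 0.0] };
    assert_eq!(lengths_sq(&p), vec![25.0, 1.0]);
    println!("ok");
}
```

Contrast this with an AoS `Vec<Point>` layout, where loading four `x` values into one register requires gathers or shuffles.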

Current Status

Refer to issue #1

Motivation and Goals

Thermite was conceived while working on the Raygon renderer, when we decided we needed a state-of-the-art, high-performance SIMD vector library focused on facilitating SoA algorithms. Using SIMD for AoS values was a nightmare of constant vector shuffling and unnecessary horizontal operations. We also could not take full advantage of AVX2, because 3D vectors use only 3 or 4 lanes of a regular 128-bit register.

Using SIMDeez, faster, or redesigning packed_simd were all considered, but each has its flaws. SIMDeez is rather limited in functionality, and its handling of target_feature leaves much to be desired. faster fits well into the SoA paradigm, but its iterator-based API is rather unwieldy, and it lacks many features. packed_simd isn't bad, but it is also missing many features and relies on the nightly-only "platform-intrinsic"s, which can produce suboptimal code in some cases.

Therefore, the only solution was to write my own, and thus Thermite was born.

The primary goal of Thermite is to provide optimal codegen for every backend instruction set, and to provide a consistent set of features on top of all of them, in such a way as to encourage chunked SoA or AoSoA algorithms regardless of what data types you need. Furthermore, with the #[dispatch] macro, multiple instruction sets can be easily targeted within a single binary.
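The monomorphization-plus-runtime-selection pattern that #[dispatch] automates can be hand-rolled in plain Rust. The sketch below is not Thermite's actual API (the `Backend` trait, `Scalar`/`Avx2` types, and `kernel` function are hypothetical); it only illustrates the mechanism: the kernel is compiled once per backend, and the backend is chosen once at the entry point.

```rust
// Hypothetical backend trait; Thermite's real trait is richer than this.
trait Backend {
    fn add(a: f32, b: f32) -> f32;
}

struct Scalar;
struct Avx2;

impl Backend for Scalar {
    fn add(a: f32, b: f32) -> f32 { a + b }
}

impl Backend for Avx2 {
    // A real AVX2 backend would use intrinsics; this sketch keeps it scalar.
    fn add(a: f32, b: f32) -> f32 { a + b }
}

// Monomorphized per backend: no per-call dispatch overhead inside the kernel.
fn kernel<B: Backend>(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0, |acc, &x| B::add(acc, x))
}

fn run(xs: &[f32]) -> f32 {
    // Runtime selection happens once, at the outermost entry point.
    #[cfg(target_arch = "x86_64")]
    if std::is_x86_feature_detected!("avx2") {
        return kernel::<Avx2>(xs);
    }
    kernel::<Scalar>(xs)
}

fn main() {
    assert_eq!(run(&[1.0, 2.0, 3.0]), 6.0);
    println!("ok");
}
```

The #[dispatch] macro's job is to generate this boilerplate for you while ensuring each monomorphized copy is compiled with the matching target features enabled.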

Features

  • SSE2, SSE4.2, AVX, AVX2 backends, with planned support for scalar, AVX512, WASM SIMD and ARM NEON backends.
  • Extensive built-in vectorized math library.
  • Compile-time policies to emphasize precision, performance or code size (useful for WASM)
  • Compile-time monomorphisation with runtime selection
    • Aided by a #[dispatch] procedural macro to ensure optimal codegen.
  • Zero runtime overhead.
  • Operator overloading on vector types.
  • Abstracts over vector length, giving the same length to all vectors of an instruction set.
  • Provides fast polyfills where necessary to provide the same API across all instruction sets.
  • Highly optimized value cast routines between vector types where possible.
  • Dedicated mask wrapper type with low-cost bitwise vector conversions built-in.

Optimized Project Setup

For optimal performance, ensure your Cargo.toml profiles look something like this:

[profile.dev]
opt-level = 2       # Required to inline SIMD intrinsics internally

[profile.release]
opt-level = 3       # Should be at least 2; level 1 will not use SIMD intrinsics
lto = 'thin'        # 'fat' LTO may also improve things, but will increase compile time
codegen-units = 1   # Required for optimal inlining and optimizations

# optional release options depending on your project and preference
incremental = false # Release builds will take longer to compile, but inter-crate optimizations may work better
panic = 'abort'     # Very few functions in Thermite panic, but aborting will avoid the unwind mechanism overhead

Misc. Usage Notes

  • Vectors with 64-bit elements are approximately 2-4x slower than 32-bit vectors.
  • Integer vectors are 2x slower on SSE2/AVX1, but run at full speed on SSE4.1 and AVX2. This compounds the first point.
  • Casting floats to signed integers is faster than to unsigned integers.
  • Equal-sized signed and unsigned integer vectors can be cast between each other at zero cost.
  • Operations mixing float and integer types can incur a 1-cycle penalty on most modern CPUs.
  • Integer division currently can only be done with a scalar fallback, so it's not recommended.
  • Dividing integer vectors by constant uniform divisors should use SimdIntVector::div_const.
  • When reusing masks for all/any/none queries, consider using the bitmask directly to avoid recomputing.
  • Avoid casting between differently-sized types in hot loops.
  • Avoid extracting and replacing elements.
  • LLVM will inline many math functions and const-eval as much as possible, but only if they are called in the same instruction-set context.

Cargo --features

alloc (enabled by default)

The alloc feature enables aligned allocation of buffers suitable for SIMD reads and writes.
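What "aligned allocation" buys you can be shown with plain std (this is not Thermite's alloc API, just an illustration): memory whose address is a multiple of the SIMD register width, so aligned vector loads and stores are valid on it.

```rust
use std::alloc::{alloc, dealloc, Layout};

fn main() {
    // 1024 f32s aligned to 32 bytes (the width of one AVX register).
    let layout = Layout::from_size_align(1024 * 4, 32).unwrap();
    unsafe {
        let ptr = alloc(layout);
        assert!(!ptr.is_null());
        // The address is a multiple of 32, so aligned SIMD loads are legal.
        assert_eq!(ptr as usize % 32, 0);
        dealloc(ptr, layout);
    }
    println!("ok");
}
```

A plain `Vec<f32>` only guarantees 4-byte alignment, which forces unaligned loads or a manual fixup loop at the start of the buffer.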

nightly

The nightly feature enables nightly-only optimizations such as accelerated half-precision encoding/decoding.

math (enabled by default)

Enables the vectorized math modules

rng

Enables the vectorized random number modules

emulate_fma

Real fused multiply-add instructions are only enabled on AVX2 platforms. However, as FMA is used not only for performance but also for its extended precision, falling back to a split multiply and addition incurs two rounding errors, which may be unacceptable for some applications. Therefore, the emulate_fma Cargo feature will enable a slower but more accurate implementation on older platforms.

For single-precision floats, this is most easily done by casting to double precision, performing separate multiply and add operations, then casting back. For double precision, an infinite-precision implementation based on libm is used.
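The single-precision strategy can be sketched directly (the `fma_emulated` function is a hypothetical name, not Thermite's): a 24-bit by 24-bit product is exact in f64's 53-bit significand, and since 53 ≥ 2·24 + 2, the subsequent double rounding back to f32 is harmless, so the result matches a true fused multiply-add.

```rust
// Emulated single-precision FMA: widen to f64 (where the f32 product is
// exact), add, and round back to f32. This matches a hardware fma for all
// inputs because f64 carries at least 2*24 + 2 significand bits.
fn fma_emulated(a: f32, b: f32, c: f32) -> f32 {
    (a as f64 * b as f64 + c as f64) as f32
}

fn main() {
    let cases = [(1.5f32, 2.25, -3.0), (1.0000001, 1.0000001, -1.0), (3.5, -2.0, 0.25)];
    for (a, b, c) in cases {
        // f32::mul_add is a correctly rounded fused multiply-add.
        assert_eq!(fma_emulated(a, b, c), a.mul_add(b, c));
    }
    println!("ok");
}
```

No such widening shortcut exists for f64, which is why the double-precision path needs a genuine extended-precision implementation.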

On SSE2 platforms, double-precision emulation may fall back to scalar operations, as the effort needed to make it branchless may cost more than it saves. As of this writing, it has not been implemented, so benchmarks will reveal what is needed later.
