raygon-renderer / Thermite

Licence: other
Thermite SIMD: Melt your CPU

Projects that are alternatives of or similar to Thermite

SIMDCompressionAndIntersection
A C++ library to compress and intersect sorted lists of integers using SIMD instructions
Stars: ✭ 289 (+104.96%)
Mutual labels:  algorithms, simd
Algodeck
An Open-Source Collection of 200+ Algorithmic Flash Cards to Help You Prepare for Your Algorithm & Data Structure Interview 💯
Stars: ✭ 4,441 (+3049.65%)
Mutual labels:  algorithms, math
Baekjoon
Problem sets for coding-test preparation (Baekjoon Online Judge)
Stars: ✭ 295 (+109.22%)
Mutual labels:  algorithms, math
zig-gamedev
Building game development ecosystem for @ziglang!
Stars: ✭ 1,059 (+651.06%)
Mutual labels:  math, simd
Phobos
The standard library of the D programming language
Stars: ✭ 1,038 (+636.17%)
Mutual labels:  algorithms, math
hlml
vectorized high-level math library
Stars: ✭ 42 (-70.21%)
Mutual labels:  math, simd
Rtm
Realtime Math
Stars: ✭ 373 (+164.54%)
Mutual labels:  simd, math
Ugm
Ubpa Graphics Mathematics
Stars: ✭ 178 (+26.24%)
Mutual labels:  simd, math
Blog
About math, programming and procedural generation
Stars: ✭ 37 (-73.76%)
Mutual labels:  algorithms, math
Mlcourse.ai
Open Machine Learning Course
Stars: ✭ 7,963 (+5547.52%)
Mutual labels:  algorithms, math
SCNMathExtensions
Math extensions for SCNVector3, SCNQuaternion, SCNMatrix4
Stars: ✭ 32 (-77.3%)
Mutual labels:  math, simd
Project Euler Solutions
Runnable code for solving Project Euler problems in Java, Python, Mathematica, Haskell.
Stars: ✭ 1,374 (+874.47%)
Mutual labels:  algorithms, math
HLML
Auto-generated maths library for C and C++ based on HLSL/Cg
Stars: ✭ 23 (-83.69%)
Mutual labels:  math, simd
Leetcode Go
✅ Solutions to LeetCode in Go, 100% test coverage, runtime beats 100% / LeetCode solutions
Stars: ✭ 22,440 (+15814.89%)
Mutual labels:  algorithms, math
Stats
A well tested and comprehensive Golang statistics library package with no dependencies.
Stars: ✭ 2,196 (+1457.45%)
Mutual labels:  algorithms, math
Mango
mango fun framework
Stars: ✭ 343 (+143.26%)
Mutual labels:  simd, math
Cglm
📽 Highly Optimized Graphics Math (glm) for C
Stars: ✭ 887 (+529.08%)
Mutual labels:  simd, math
Libmaths
A Python library created to assist programmers with complex mathematical functions
Stars: ✭ 72 (-48.94%)
Mutual labels:  algorithms, math
Sage
Mirror of the Sage source tree -- please do not submit PRs here -- everything must be submitted via https://trac.sagemath.org/
Stars: ✭ 1,656 (+1074.47%)
Mutual labels:  algorithms, math
Conduit
High Performance Streams Based on Coroutine TS ⚡
Stars: ✭ 135 (-4.26%)
Mutual labels:  algorithms

Thermite SIMD: Melt your CPU

NOTE: This crate is not yet on crates.io, but I do own the name and will publish it there when ready

Thermite is a WIP SIMD library focused on providing portable SIMD acceleration of SoA (Structure of Arrays) algorithms, using consistent-length1 SIMD vectors for lockstep iteration and computation.

Thermite provides highly optimized feature-rich backends for SSE2, SSE4.2, AVX and AVX2, with planned support for AVX512, ARM/Aarch64 NEON, and WASM SIMD extensions.

In addition to that, Thermite includes a highly optimized vectorized math library with many special math functions and algorithms, specialized for both single and double precision.

1 Within a given instruction set, all of Thermite's vector types have the same number of lanes, regardless of element size.
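The SoA-with-lockstep-iteration idea can be sketched in plain Rust (no Thermite; the `Points` type and `lengths_sq` function here are hypothetical illustrations): each field lives in its own array and is processed in fixed-size chunks, which is the shape a SIMD backend maps directly onto vector registers.

```rust
// Hypothetical SoA layout: each field in its own array, so a SIMD backend can
// load N lanes of `x` and N lanes of `y` at once and iterate them in lockstep.
struct Points {
    x: Vec<f32>,
    y: Vec<f32>,
}

/// Squared lengths computed chunk-by-chunk; with a real SIMD backend each
/// 4-element chunk would map to a single 128-bit vector operation.
fn lengths_sq(p: &Points) -> Vec<f32> {
    p.x.chunks(4)
        .zip(p.y.chunks(4))
        .flat_map(|(xs, ys)| xs.iter().zip(ys).map(|(x, y)| x * x + y * y))
        .collect()
}

fn main() {
    let p = Points { x: vec![3.0, 1.0], y: vec![4.0, 0.0] };
    assert_eq!(lengths_sq(&p), vec![25.0, 1.0]);
    println!("ok");
}
```

Contrast this with an AoS `Vec<Point>` layout, where loading four `x` values into one register requires gathers or shuffles.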

Current Status

Refer to issue #1

Motivation and Goals

Thermite was conceived while working on the Raygon renderer, when we decided we needed a state-of-the-art, high-performance SIMD vector library focused on facilitating SoA algorithms. Using SIMD for AoS values was a nightmare of constant vector shuffling and unnecessary horizontal operations. We also could not take full advantage of AVX2, because 3D vectors use only 3 or 4 lanes of a regular 128-bit register.

Using SIMDeez, faster, or redesigning packed_simd were all considered, but each has its flaws. SIMDeez is rather limited in functionality, and its handling of target_feature leaves much to be desired. faster fits well into the SoA paradigm, but its iterator-based API is rather unwieldy, and it lacks many features. packed_simd isn't bad, but it is also missing many features and relies on the nightly-only "platform-intrinsic"s, which can produce suboptimal code in some cases.

Therefore, the only solution was to write my own, and thus Thermite was born.

The primary goal of Thermite is to provide optimal codegen for every backend instruction set, and to provide a consistent set of features on top of all of them, in such a way as to encourage chunked SoA or AoSoA algorithms regardless of what data types you need. Furthermore, with the #[dispatch] macro, multiple instruction sets can be easily targeted within a single binary.
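The monomorphization-plus-runtime-selection pattern that #[dispatch] automates can be hand-rolled in plain Rust. The sketch below is not Thermite's actual API (the `Backend` trait, `Scalar`/`Avx2` types, and `kernel` function are hypothetical); it only illustrates the mechanism: the kernel is compiled once per backend, and the backend is chosen once at the entry point.

```rust
// Hypothetical backend trait; Thermite's real trait is richer than this.
trait Backend {
    fn add(a: f32, b: f32) -> f32;
}

struct Scalar;
struct Avx2;

impl Backend for Scalar {
    fn add(a: f32, b: f32) -> f32 { a + b }
}

impl Backend for Avx2 {
    // A real AVX2 backend would use intrinsics; this sketch keeps it scalar.
    fn add(a: f32, b: f32) -> f32 { a + b }
}

// Monomorphized per backend: no per-call dispatch overhead inside the kernel.
fn kernel<B: Backend>(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0, |acc, &x| B::add(acc, x))
}

fn run(xs: &[f32]) -> f32 {
    // Runtime selection happens once, at the outermost entry point.
    #[cfg(target_arch = "x86_64")]
    if std::is_x86_feature_detected!("avx2") {
        return kernel::<Avx2>(xs);
    }
    kernel::<Scalar>(xs)
}

fn main() {
    assert_eq!(run(&[1.0, 2.0, 3.0]), 6.0);
    println!("ok");
}
```

The #[dispatch] macro's job is to generate this boilerplate for you while ensuring each monomorphized copy is compiled with the matching target features enabled.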

Features

  • SSE2, SSE4.2, AVX, AVX2 backends, with planned support for scalar, AVX512, WASM SIMD and ARM NEON backends.
  • Extensive built-in vectorized math library.
  • Compile-time policies to emphasize precision, performance or code size (useful for WASM)
  • Compile-time monomorphisation with runtime selection
    • Aided by a #[dispatch] procedural macro to ensure optimal codegen.
  • Zero runtime overhead.
  • Operator overloading on vector types.
  • Abstracts over vector length, giving the same length to all vectors of an instruction set.
  • Provides fast polyfills where necessary to provide the same API across all instruction sets.
  • Highly optimized value cast routines between vector types where possible.
  • Dedicated mask wrapper type with low-cost bitwise vector conversions built-in.

Optimized Project Setup

For optimal performance, ensure your Cargo.toml profiles look something like this:

[profile.dev]
opt-level = 2       # Required to inline SIMD intrinsics internally

[profile.release]
opt-level = 3       # Should be at least 2; level 1 will not use SIMD intrinsics
lto = 'thin'        # 'fat' LTO may also improve things, but will increase compile time
codegen-units = 1   # Required for optimal inlining and optimizations

# optional release options depending on your project and preference
incremental = false # Release builds will take longer to compile, but inter-crate optimizations may work better
panic = 'abort'     # Very few functions in Thermite panic, but aborting will avoid the unwind mechanism overhead

Misc. Usage Notes

  • Vectors with 64-bit elements are approximately 2-4x slower than 32-bit vectors.
  • Integer vectors are 2x slower on SSE2/AVX1, but run at full speed on SSE4.1 and AVX2. This compounds the first point.
  • Casting floats to signed integers is faster than to unsigned integers.
  • Equal-sized signed and unsigned integer vectors can be cast between each other at zero cost.
  • Operations mixing float and integer types can incur a 1-cycle penalty on most modern CPUs.
  • Integer division currently can only be done with a scalar fallback, so it's not recommended.
  • Dividing integer vectors by constant uniform divisors should use SimdIntVector::div_const.
  • When reusing masks for all/any/none queries, consider using the bitmask directly to avoid recomputing.
  • Avoid casting between differently-sized types in hot loops.
  • Avoid extracting and replacing elements.
  • LLVM will inline many math functions and const-eval as much as possible, but only if they are called in the same instruction-set context.

Cargo --features

alloc (enabled by default)

The alloc feature enables aligned allocation of buffers suitable for SIMD reads and writes.
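What "aligned allocation" buys you can be shown with plain std (this is not Thermite's alloc API, just an illustration): memory whose address is a multiple of the SIMD register width, so aligned vector loads and stores are valid on it.

```rust
use std::alloc::{alloc, dealloc, Layout};

fn main() {
    // 1024 f32s aligned to 32 bytes (the width of one AVX register).
    let layout = Layout::from_size_align(1024 * 4, 32).unwrap();
    unsafe {
        let ptr = alloc(layout);
        assert!(!ptr.is_null());
        // The address is a multiple of 32, so aligned SIMD loads are legal.
        assert_eq!(ptr as usize % 32, 0);
        dealloc(ptr, layout);
    }
    println!("ok");
}
```

A plain `Vec<f32>` only guarantees 4-byte alignment, which forces unaligned loads or a manual fixup loop at the start of the buffer.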

nightly

The nightly feature enables nightly-only optimizations such as accelerated half-precision encoding/decoding.

math (enabled by default)

Enables the vectorized math modules

rng

Enables the vectorized random number modules

emulate_fma

Real fused multiply-add instructions are only enabled on AVX2 platforms. However, as FMA is used not only for performance but also for its extended precision, falling back to a split multiply and addition incurs two rounding errors, which may be unacceptable for some applications. Therefore, the emulate_fma Cargo feature will enable a slower but more accurate implementation on older platforms.

For single-precision floats, this is most easily done by casting to double precision, performing separate multiply and add operations, then casting back. For double precision, an infinite-precision implementation based on libm is used.
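The single-precision strategy can be sketched directly (the `fma_emulated` function is a hypothetical name, not Thermite's): a 24-bit by 24-bit product is exact in f64's 53-bit significand, and since 53 ≥ 2·24 + 2, the subsequent double rounding back to f32 is harmless, so the result matches a true fused multiply-add.

```rust
// Emulated single-precision FMA: widen to f64 (where the f32 product is
// exact), add, and round back to f32. This matches a hardware fma for all
// inputs because f64 carries at least 2*24 + 2 significand bits.
fn fma_emulated(a: f32, b: f32, c: f32) -> f32 {
    (a as f64 * b as f64 + c as f64) as f32
}

fn main() {
    let cases = [(1.5f32, 2.25, -3.0), (1.0000001, 1.0000001, -1.0), (3.5, -2.0, 0.25)];
    for (a, b, c) in cases {
        // f32::mul_add is a correctly rounded fused multiply-add.
        assert_eq!(fma_emulated(a, b, c), a.mul_add(b, c));
    }
    println!("ok");
}
```

No such widening shortcut exists for f64, which is why the double-precision path needs a genuine extended-precision implementation.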

On SSE2 platforms, double-precision emulation may fall back to scalar operations, as the effort needed to make it branchless may cost more than it saves. As of this writing, it has not been implemented, so benchmarks will reveal what is needed later.
