
AdamNiederer / Faster

Licence: mpl-2.0
SIMD for humans

* faster

** SIMD for Humans
Easy, powerful, portable, absurdly fast numerical calculations. Includes static dispatch with inlining based on your platform and vector types, zero-allocation iteration, vectorized loading/storing, and support for uneven collections.

It looks something like this:
#+BEGIN_SRC rust
use faster::*;

let lots_of_3s = (&[-123.456f32; 128][..]).simd_iter()
    .simd_map(f32s(0.0), |v| {
        f32s(9.0) * v.abs().sqrt().rsqrt().ceil().sqrt() - f32s(4.0) - f32s(2.0)
    })
    .scalar_collect();
#+END_SRC

Which is analogous to this scalar code:
#+BEGIN_SRC rust
let lots_of_3s = (&[-123.456f32; 128][..]).iter()
    .map(|v| {
        9.0 * v.abs().sqrt().sqrt().recip().ceil().sqrt() - 4.0 - 2.0
    })
    .collect::<Vec<f32>>();
#+END_SRC
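As a quick check of the arithmetic, the scalar pipeline can be run as a standalone program using only the standard library. This is a minimal sketch: the =lots_of_3s= function wrapper and the assertion are additions for illustration and do not come from the crate.

```rust
/// Scalar version of the pipeline, wrapped in a function so it can be checked.
/// The wrapper itself is an illustrative addition.
fn lots_of_3s() -> Vec<f32> {
    (&[-123.456f32; 128][..]).iter()
        // abs -> fourth root -> reciprocal -> ceil gives 1.0, so 9 * 1 - 4 - 2 = 3
        .map(|v| 9.0 * v.abs().sqrt().sqrt().recip().ceil().sqrt() - 4.0 - 2.0)
        .collect()
}

fn main() {
    let out = lots_of_3s();
    // Every element should come out as exactly 3.0.
    assert!(out.iter().all(|&x| x == 3.0));
    println!("{} elements, first = {}", out.len(), out[0]);
}
```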

The vector size is entirely determined by the machine you're compiling for - it attempts to use the largest vector size supported by your machine, and works on any platform or architecture (see below for details).
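To illustrate what "determined by the machine you're compiling for" means in practice, here is a rough sketch of compile-time width selection using the standard =cfg!= macro. The =f32_lanes= helper is hypothetical and is not part of faster's API; it only knows about x86 feature names, so every other target reports a single lane.

```rust
/// Hypothetical helper (not part of faster's API): number of f32 lanes in the
/// widest x86 vector register enabled for the current compilation target.
fn f32_lanes() -> usize {
    if cfg!(target_feature = "avx512f") {
        16 // 512-bit registers
    } else if cfg!(target_feature = "avx") {
        8 // 256-bit registers
    } else if cfg!(target_feature = "sse") {
        4 // 128-bit registers
    } else {
        1 // no known vector extension enabled: fall back to scalar width
    }
}

fn main() {
    println!("f32 lanes on this target: {}", f32_lanes());
}
```

Because =cfg!= is resolved at compile time, the branch is chosen when the binary is built, so changing the target CPU through =RUSTFLAGS= changes the answer without any runtime check.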

Compare this to traditional explicit SIMD:
#+BEGIN_SRC rust
use std::mem::transmute;
use stdsimd::{f32x4, f32x8};

let lots_of_3s = &mut [-123.456f32; 128][..];

if cfg!(all(not(target_feature = "avx"), target_feature = "sse")) {
    for ch in lots_of_3s.chunks_mut(4) {
        let mut v = f32x4::load(ch, 0);
        let scalar_abs_mask = unsafe { transmute::<u32, f32>(0x7fffffff) };
        let abs_mask = f32x4::splat(scalar_abs_mask);
        // There isn't actually an absolute value intrinsic for floats - you
        // have to look at the IEEE 754 spec and do some bit flipping
        v = unsafe { _mm_and_ps(v, abs_mask) };
        v = unsafe { _mm_sqrt_ps(v) };
        v = unsafe { _mm_rsqrt_ps(v) };
        v = unsafe { _mm_ceil_ps(v) }; // note: this one needs SSE4.1, not just SSE
        v = unsafe { _mm_sqrt_ps(v) };
        v = unsafe { _mm_mul_ps(v, f32x4::splat(9.0)) };
        v = unsafe { _mm_sub_ps(v, f32x4::splat(4.0)) };
        v = unsafe { _mm_sub_ps(v, f32x4::splat(2.0)) };
        v.store(ch, 0);
    }
} else if cfg!(all(not(target_feature = "avx512f"), target_feature = "avx")) {
    for ch in lots_of_3s.chunks_mut(8) {
        let mut v = f32x8::load(ch, 0);
        let scalar_abs_mask = unsafe { transmute::<u32, f32>(0x7fffffff) };
        let abs_mask = f32x8::splat(scalar_abs_mask);
        v = unsafe { _mm256_and_ps(v, abs_mask) };
        v = unsafe { _mm256_sqrt_ps(v) };
        v = unsafe { _mm256_rsqrt_ps(v) };
        v = unsafe { _mm256_ceil_ps(v) };
        v = unsafe { _mm256_sqrt_ps(v) };
        v = unsafe { _mm256_mul_ps(v, f32x8::splat(9.0)) };
        v = unsafe { _mm256_sub_ps(v, f32x8::splat(4.0)) };
        v = unsafe { _mm256_sub_ps(v, f32x8::splat(2.0)) };
        v.store(ch, 0);
    }
}
#+END_SRC

Even with all of that boilerplate, this still only supports x86-64 machines with SSE or AVX - and you have to look up each intrinsic to ensure it's usable for your compilation target.

** Upcoming Features
A rewrite of the iterator API is upcoming, as well as internal changes to better match the direction Rust is taking with explicit SIMD.

** Compatibility
Faster currently supports any architecture with floating point support, although hardware acceleration is only enabled on machines with x86's vector extensions.

** Performance
Here are some extremely unscientific benchmarks which, at least, suggest that this is generally no worse than scalar iterators. Even on ancient CPUs, a lot of performance can be extracted out of SIMD.

#+BEGIN_SRC shell
$ RUSTFLAGS="-C target-cpu=ivybridge" cargo bench # host is ivybridge; target has AVX
test tests::base100_enc_scalar  ... bench:       1,307 ns/iter (+/- 45)
test tests::base100_enc_simd    ... bench:         332 ns/iter (+/- 10)
test tests::determinant2_scalar ... bench:         486 ns/iter (+/- 8)
test tests::determinant2_simd   ... bench:         215 ns/iter (+/- 3)
test tests::determinant3_scalar ... bench:         389 ns/iter (+/- 6)
test tests::determinant3_simd   ... bench:         209 ns/iter (+/- 3)
test tests::map_fill_simd       ... bench:         835 ns/iter (+/- 12)
test tests::map_scalar          ... bench:       6,963 ns/iter (+/- 117)
test tests::map_simd            ... bench:         879 ns/iter (+/- 18)
test tests::map_uneven_simd     ... bench:         884 ns/iter (+/- 10)
test tests::nop_scalar          ... bench:          49 ns/iter (+/- 0)
test tests::nop_simd            ... bench:          34 ns/iter (+/- 0)
test tests::reduce_scalar       ... bench:       6,905 ns/iter (+/- 107)
test tests::reduce_simd         ... bench:         839 ns/iter (+/- 13)
test tests::reduce_uneven_simd  ... bench:         838 ns/iter (+/- 11)
test tests::zip_nop_scalar      ... bench:         824 ns/iter (+/- 18)
test tests::zip_nop_simd        ... bench:         231 ns/iter (+/- 5)
test tests::zip_scalar          ... bench:         901 ns/iter (+/- 29)
test tests::zip_simd            ... bench:       1,128 ns/iter (+/- 12)

$ RUSTFLAGS="-C target-cpu=x86-64" cargo bench # host is ivybridge; target has SSE2
test tests::base100_enc_scalar  ... bench:         760 ns/iter (+/- 11)
test tests::base100_enc_simd    ... bench:         492 ns/iter (+/- 2)
test tests::determinant2_scalar ... bench:         477 ns/iter (+/- 3)
test tests::determinant2_simd   ... bench:         277 ns/iter (+/- 1)
test tests::determinant3_scalar ... bench:         380 ns/iter (+/- 3)
test tests::determinant3_simd   ... bench:         285 ns/iter (+/- 2)
test tests::map_fill_simd       ... bench:       1,797 ns/iter (+/- 8)
test tests::map_scalar          ... bench:       7,237 ns/iter (+/- 51)
test tests::map_simd            ... bench:       1,879 ns/iter (+/- 12)
test tests::map_uneven_simd     ... bench:       1,878 ns/iter (+/- 9)
test tests::nop_scalar          ... bench:          47 ns/iter (+/- 0)
test tests::nop_simd            ... bench:          34 ns/iter (+/- 0)
test tests::reduce_scalar       ... bench:       7,021 ns/iter (+/- 39)
test tests::reduce_simd         ... bench:       1,801 ns/iter (+/- 8)
test tests::reduce_uneven_simd  ... bench:       1,734 ns/iter (+/- 9)
test tests::zip_nop_scalar      ... bench:         803 ns/iter (+/- 9)
test tests::zip_nop_simd        ... bench:         257 ns/iter (+/- 1)
test tests::zip_scalar          ... bench:         988 ns/iter (+/- 6)
test tests::zip_simd            ... bench:         629 ns/iter (+/- 5)

$ RUSTFLAGS="-C target-cpu=pentium" cargo bench # host is ivybridge; this only runs the polyfills!
test tests::bench_determinant2_scalar  ... bench:         427 ns/iter (+/- 2)
test tests::bench_determinant2_simd    ... bench:         402 ns/iter (+/- 1)
test tests::bench_determinant3_scalar  ... bench:         354 ns/iter (+/- 1)
test tests::bench_determinant3_simd    ... bench:         593 ns/iter (+/- 1)
test tests::bench_map_scalar           ... bench:       7,195 ns/iter (+/- 28)
test tests::bench_map_simd             ... bench:       6,271 ns/iter (+/- 22)
test tests::bench_map_uneven_simd      ... bench:       6,288 ns/iter (+/- 22)
test tests::bench_nop_scalar           ... bench:          38 ns/iter (+/- 0)
test tests::bench_nop_simd             ... bench:          69 ns/iter (+/- 0)
test tests::bench_reduce_scalar        ... bench:       7,004 ns/iter (+/- 17)
test tests::bench_reduce_simd          ... bench:       6,063 ns/iter (+/- 17)
test tests::bench_reduce_uneven_simd   ... bench:       6,107 ns/iter (+/- 11)
test tests::bench_zip_nop_scalar       ... bench:         623 ns/iter (+/- 2)
test tests::bench_zip_nop_simd         ... bench:         289 ns/iter (+/- 1)
test tests::bench_zip_scalar           ... bench:         972 ns/iter (+/- 3)
test tests::bench_zip_simd             ... bench:         621 ns/iter (+/- 3)
#+END_SRC
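
The raw ns/iter figures are easier to compare as ratios. The sketch below computes scalar-to-SIMD speedups for a few rows copied from the ivybridge (AVX) run above; the =speedup= helper is purely illustrative and is not part of the crate or its benchmark suite.

```rust
/// Ratio of scalar time to SIMD time; values above 1.0 mean SIMD was faster.
/// Illustrative helper, not part of faster.
fn speedup(scalar_ns: f64, simd_ns: f64) -> f64 {
    scalar_ns / simd_ns
}

fn main() {
    // (benchmark, scalar ns/iter, simd ns/iter) from the ivybridge (AVX) run
    let avx = [
        ("base100_enc", 1307.0, 332.0),
        ("map", 6963.0, 879.0),
        ("reduce", 6905.0, 839.0),
        ("zip", 901.0, 1128.0),
    ];
    for (name, scalar, simd) in avx.iter() {
        println!("{:<12} {:.2}x", name, speedup(*scalar, *simd));
    }
}
```

In that run =map= and =reduce= come out at roughly 8x, while =zip= lands below 1.0x, so the SIMD path is not a win in every benchmark.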
