sliceslice-rs
A fast implementation of single-pattern substring search using SIMD acceleration.
Stars: ✭ 66 (-52.17%)
penguinV
Simple and fast C++ image processing library with focus on heterogeneous systems
Stars: ✭ 110 (-20.29%)
Chromium Clang
Chromium browser compiled with the Clang/LLVM compiler.
Stars: ✭ 77 (-44.2%)
dbcsr
DBCSR: Distributed Block Compressed Sparse Row matrix library
Stars: ✭ 65 (-52.9%)
HiSpatialCluster
Clustering spatial points with the Fast Search algorithm; high-performance computing implementations in CUDA or parallel on CPU, runnable as standalone Python or in ArcGIS.
Stars: ✭ 31 (-77.54%)
Occa
JIT Compilation for Multiple Architectures: C++, OpenMP, CUDA, HIP, OpenCL, Metal
Stars: ✭ 230 (+66.67%)
lsp-dsp-lib
DSP library for signal processing
Stars: ✭ 37 (-73.19%)
SoftLight
A shader-based Software Renderer Using The LightSky Framework.
Stars: ✭ 2 (-98.55%)
peakperf
Achieve peak performance on x86 CPUs and NVIDIA GPUs
Stars: ✭ 33 (-76.09%)
gpubootcamp
This repository consists of GPU bootcamp material for HPC and AI
Stars: ✭ 227 (+64.49%)
block-aligner
SIMD-accelerated library for computing global and X-drop affine gap penalty sequence-to-sequence or sequence-to-profile alignments using an adaptive block-based algorithm.
Stars: ✭ 58 (-57.97%)
monolish
monolish: MONOlithic LInear equation Solvers for Highly-parallel architecture
Stars: ✭ 166 (+20.29%)
MatX
An efficient C++17 GPU numerical computing library with Python-like syntax
Stars: ✭ 418 (+202.9%)
Futhark
💥💻💥 A data-parallel functional programming language
Stars: ✭ 1,641 (+1089.13%)
Fastor
A lightweight high performance tensor algebra framework for modern C++
Stars: ✭ 280 (+102.9%)
Fastbase64
SIMD-accelerated base64 codecs
Stars: ✭ 309 (+123.91%)
Packettracer
SIMD-accelerated ray tracing in C#, powered by the Intel hardware intrinsics of .NET Core.
Stars: ✭ 109 (-21.01%)
Visionaray
A C++-based, cross platform ray tracing library
Stars: ✭ 342 (+147.83%)
awesome-simd
A curated list of awesome SIMD frameworks, libraries and software
Stars: ✭ 39 (-71.74%)
Hiop
HPC solver for nonlinear optimization problems
Stars: ✭ 75 (-45.65%)
Arrayfire Python
Python bindings for ArrayFire: A general purpose GPU library.
Stars: ✭ 358 (+159.42%)
Asm Dude
Visual Studio extension for assembly syntax highlighting and code completion in assembly files and the disassembly window
Stars: ✭ 3,898 (+2724.64%)
Turbo-Transpose
Transpose: SIMD Integer+Floating Point Compression Filter
Stars: ✭ 50 (-63.77%)
Sha256 Simd
Accelerate SHA256 computations in pure Go using AVX512, SHA Extensions for x86 and ARM64 for ARM. On AVX512 it provides an up to 8x improvement (over 3 GB/s per core). SHA Extensions give a performance boost of close to 4x over native.
Stars: ✭ 657 (+376.09%)
Onemkl
oneAPI Math Kernel Library (oneMKL) Interfaces
Stars: ✭ 122 (-11.59%)
Parenchyma
An extensible HPC framework for CUDA, OpenCL and native CPU.
Stars: ✭ 71 (-48.55%)
allgebra
Base container for developing C++ and Fortran HPC applications
Stars: ✭ 14 (-89.86%)
simdjson-rs
Rust version of lemire's SimdJson
Stars: ✭ 18 (-86.96%)
Simdjsonsharp
C# bindings for lemire/simdjson (and full C# port)
Stars: ✭ 506 (+266.67%)
Highwayhash
Native Go version of HighwayHash with optimized assembly implementations on Intel and ARM. Able to process over 10 GB/sec on a single core on Intel CPUs - https://en.wikipedia.org/wiki/HighwayHash
Stars: ✭ 670 (+385.51%)
Ktt
Kernel Tuning Toolkit
Stars: ✭ 33 (-76.09%)
Sixtyfour
How fast can we brute force a 64-bit comparison?
Stars: ✭ 41 (-70.29%)
Deepnet
Deep.Net machine learning framework for F#
Stars: ✭ 99 (-28.26%)
Knn cuda
Fast K-Nearest Neighbor search with GPU
Stars: ✭ 119 (-13.77%)
Dpp
Detail-Preserving Pooling in Deep Networks (CVPR 2018)
Stars: ✭ 99 (-28.26%)
Singularity Cri
The Singularity implementation of the Kubernetes Container Runtime Interface
Stars: ✭ 97 (-29.71%)
Fastapprox
Approximate and vectorized versions of common mathematical functions
Stars: ✭ 128 (-7.25%)
Docker Homebridge
Homebridge Docker. HomeKit support for the impatient using Docker on x86_64, Raspberry Pi (armhf) and ARM64. Includes ffmpeg + libfdk-aac.
Stars: ✭ 1,847 (+1238.41%)
Extending Jax
Extending JAX with custom C++ and CUDA code
Stars: ✭ 98 (-28.99%)
Qreverse
A small study in hardware accelerated AoS reversal
Stars: ✭ 97 (-29.71%)
Nnpack
Acceleration package for neural networks on multi-core CPUs
Stars: ✭ 1,538 (+1014.49%)
Supra
SUPRA: Software Defined Ultrasound Processing for Real-Time Applications - An Open Source 2D and 3D Pipeline from Beamforming to B-Mode
Stars: ✭ 96 (-30.43%)
Jevois
JeVois smart machine vision framework
Stars: ✭ 128 (-7.25%)
Sketch
C++ Implementations of sketch data structures with SIMD Parallelism, including Python bindings
Stars: ✭ 96 (-30.43%)
Nextflow
A DSL for data-driven computational pipelines
Stars: ✭ 1,337 (+868.84%)
Charm
The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Stars: ✭ 96 (-30.43%)
Thorin
The Higher-Order Intermediate Representation
Stars: ✭ 116 (-15.94%)
Pynvvl
A Python wrapper of NVIDIA Video Loader (NVVL) with CuPy for fast video loading with Python
Stars: ✭ 95 (-31.16%)
Region Conv
Not All Pixels Are Equal: Difficulty-Aware Semantic Segmentation via Deep Layer Cascade
Stars: ✭ 95 (-31.16%)
Spoc
Stream Processing with OCaml
Stars: ✭ 115 (-16.67%)
Fbtt Embedding
This is a Tensor Train based compression library for the sparse embedding tables used in large-scale machine learning models such as recommendation and natural language processing. We showed this library can reduce the total model size by up to 100x in Facebook's open-sourced DLRM model while achieving the same model quality. Our implementation is faster than the state-of-the-art implementations. Existing state-of-the-art libraries also decompress the whole embedding tables on the fly, so they do not reduce memory use during training. Our library decompresses only the requested rows and can therefore provide a 10,000x memory footprint reduction per embedding table. The library also includes a software cache that stores a portion of the table entries in decompressed format for faster lookup and processing.
Stars: ✭ 92 (-33.33%)
Off
OFF, Open source Finite volume Fluid dynamics code
Stars: ✭ 93 (-32.61%)
Amplifier.net
Amplifier allows .NET developers to easily run complex applications with intensive mathematical computation on Intel CPU/GPU, NVIDIA, AMD without writing any additional C kernel code. Write your function in .NET and Amplifier will take care of running it on your favorite hardware.
Stars: ✭ 92 (-33.33%)
Blt
A streamlined CMake build system foundation for developing HPC software
Stars: ✭ 135 (-2.17%)
Jpeg Quantsmooth
JPEG artifacts removal based on quantization coefficients.
Stars: ✭ 134 (-2.9%)