DTolm / Vkfft

Licence: mpl-2.0

Vulkan Fast Fourier Transform library

Labels

vulkan hpc fft convolution

VkFFT - Vulkan Fast Fourier Transform library

VkFFT is an efficient GPU-accelerated multidimensional Fast Fourier Transform library for Vulkan/CUDA/HIP projects. VkFFT aims to provide the community with an open-source alternative to Nvidia's cuFFT library, while achieving better performance. VkFFT is written in C and supports Vulkan, CUDA and HIP as backends.

I am looking for a PhD position or a job where my set of skills would be of interest. Contact me by email: [email protected] | [email protected]

Windows executables for the benchmark have been added in two versions: with the CUDA benchmark (requires CUDA 9.0) and without it (requires only graphics drivers). Both require the FFTW DLL to be placed in the same location as the executable. The benchmark uses ~3.5GB of VRAM.

Benchmark results of VkFFT can be found here: https://openbenchmarking.org/test/pts/vkfft

Currently supported features:

- 1D/2D/3D systems
- Forward and inverse directions of FFT
- Support for big FFT dimension sizes. Current limits in single and half precision: C2C - (2^32, 2^32, 2^32), C2R/R2C - (2^12, 2^32, 2^32) (will be increased later). Current limits in double precision: C2C - (2^32, 2^32, 2^32), C2R/R2C - (2^11, 2^32, 2^32), with no register overutilization.
- Radix-2/3/4/5/7/8/11/13 FFT. Sequences using radix 3, 5, 7, 11 and 13 have performance comparable to that of powers of 2.
- Single, double and half precision support. Double precision uses CPU-generated lookup tables (LUT). Half precision still does all computations in single precision and only uses half precision to store data.
- All transformations are performed in-place with no performance loss. Out-of-place transforms are supported by selecting different input/output buffers.
- No additional transposition uploads. Note: data can be reshuffled after the four-step FFT algorithm using an additional buffer (for big sequences). This does not matter for convolutions, which return to the input ordering (saves memory).
- Complex to complex (C2C), real to complex (R2C) and complex to real (C2R) transformations. R2C and C2R are optimized to run up to 2x faster than C2C (2D and 3D cases only).
- 1x1, 2x2, 3x3 convolutions with symmetric or nonsymmetric kernel (no register overutilization)
- Native zero padding to model open systems (up to 2x faster than simply padding the input array with zeros). The range of sequences filled with zeros and the direction in which zero padding is applied (read or write stage) can be specified.
- WHDCN layout - data is stored in the following order (sorted by increasing stride): the width, the height, the depth, the coordinate (the number of feature maps), the batch number
- Multiple feature/batch convolutions - one input, multiple kernels
- Multiple input/output/temporary buffer split. Allows using data split between different memory allocations and mitigates the 4GB single allocation limit.
- Works on Nvidia, AMD and Intel GPUs (tested on Nvidia RTX 3080, GTX 1660 Ti, AMD Radeon VII and Intel UHD 620)
- VkFFT supports Vulkan, CUDA and HIP as backends to cover a wide range of APIs
- Header-only library with a Vulkan interface, which allows appending VkFFT directly to the user's command buffer. Shaders are compiled once during the plan creation stage.
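
The WHDCN ordering in the list above can be read as a stride rule: the width index varies fastest and the batch index slowest. Below is a minimal sketch in C of the resulting flat element index, assuming a densely packed buffer with no padding between rows. The struct and function names are illustrative and are not part of vkFFT.h; for complex data, one "element" here means one complex value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type, not part of vkFFT.h:
   extents of a densely packed WHDCN buffer. */
typedef struct {
    size_t width, height, depth, coordinates, batches;
} whdcn_dims;

/* Flat element index of (w, h, d, c, n).
   Width has stride 1, batch has the largest stride. */
static size_t whdcn_index(const whdcn_dims *dims,
                          size_t w, size_t h, size_t d, size_t c, size_t n)
{
    assert(w < dims->width && h < dims->height && d < dims->depth);
    assert(c < dims->coordinates && n < dims->batches);
    return w + dims->width *
           (h + dims->height *
            (d + dims->depth *
             (c + dims->coordinates * n)));
}
```

For a 4 x 3 x 2 system with 2 coordinates and 2 batches, moving one step along the height skips 4 elements, one step along the depth skips 12, one coordinate skips 24 and one batch skips 48.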

Future release plan

Planned
- Publication based on implemented optimizations
- Mobile GPU support
Ambitious
- Multiple GPU job splitting

Installation

Vulkan version: Include the vkFFT.h file and the glslang compiler. Provide the library with the correct VKFFT_BACKEND definition (VKFFT_BACKEND=0 for Vulkan). The sample CMakeLists.txt file configures a project based on the Vulkan_FFT.cpp file, which contains examples of how to use VkFFT to perform FFT, iFFT and convolution calculations, and how to use zero padding, multiple feature/batch convolutions, C2C FFTs of big systems, R2C/C2R transforms, double precision FFTs and half precision FFTs.
For single and double precision, Vulkan 1.0 is required. For half precision, Vulkan 1.1 is required.

CUDA/HIP: Include the vkFFT.h file and make sure your system has NVRTC/HIPRTC built. Provide the library with the correct VKFFT_BACKEND definition. Only single and double precision are supported for now.
To build the CUDA/HIP version of the benchmark, replace VKFFT_BACKEND in CMakeLists.txt (line 5) with the correct value and optionally enable FFTW. VKFFT_BACKEND=1 for CUDA, VKFFT_BACKEND=2 for HIP.

Command-line interface

VkFFT has a command-line interface with the following set of commands:
-h: print help
-devices: print the list of available GPU devices
-d X: select GPU device (default 0)
-o NAME: specify output file path
-vkfft X: launch VkFFT sample X (0-15, 1000-1003) (if FFTW is enabled in CMakeLists.txt)
-cufft X: launch cuFFT sample X (0-4, 1000-1003) (if enabled in CMakeLists.txt)
-rocfft X: launch rocFFT sample X (0-4, 1000-1003) (if enabled in CMakeLists.txt)
-test: (or no other keys) launch all VkFFT and cuFFT benchmarks
For example, the command to launch the single precision benchmark of VkFFT and cuFFT on device 0 and save the log to the output.txt file looks like this on Windows:
.\Vulkan_FFT.exe -d 0 -o output.txt -vkfft 0 -cufft 0
For double precision benchmark, replace -vkfft 0 -cufft 0 with -vkfft 1 -cufft 1. For half precision benchmark, replace -vkfft 0 -cufft 0 with -vkfft 2 -cufft 2.

How to use VkFFT

vkFFT.h is a library which can append FFT, iFFT or convolution calculations to a user-defined command buffer. It operates on storage buffers allocated by the user and doesn't require any additional memory by itself (except for LUT tables, if they are enabled). All computations are fully based on Vulkan compute shaders with no CPU usage except for FFT planning. VkFFT creates and optimizes the memory layout by itself and performs the FFT with the best chosen parameters. For an example application, see the Vulkan_FFT.cpp file, which has comments explaining the VkFFT configuration process.
VkFFT achieves striding by grouping nearby FFTs instead of performing transpositions.

Benchmark results in comparison to cuFFT

To measure how the Vulkan FFT implementation performs in comparison to cuFFT, we perform a number of 1D, 2D and 3D tests, ranging from small systems to big ones. Each test consists of performing a C2C FFT and an inverse C2C FFT consecutively multiple times to calculate the average time required. The results are obtained on Nvidia RTX 3080, AMD Radeon VII and AMD Radeon 6800XT graphics cards with no other GPU load. Launching the -test key from Vulkan_FFT.cpp performs the VkFFT/cuFFT benchmark. The overall benchmark score is calculated as an averaged performance score over the presented set of systems (the bigger, the better): sum(system_size/average_iteration_time)/num_benchmark_samples
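
The score formula above can be sketched in C as follows. This is only an illustration of the arithmetic, not the code used in Vulkan_FFT.cpp; the function name is hypothetical and the units of system_size and average_iteration_time are whatever the caller supplies, since the text does not specify them.

```c
#include <assert.h>
#include <stddef.h>

/* Averaged performance score over a set of benchmarked systems:
   sum(system_size / average_iteration_time) / num_benchmark_samples.
   A bigger score is better. */
static double benchmark_score(const double *system_size,
                              const double *average_iteration_time,
                              size_t num_benchmark_samples)
{
    assert(num_benchmark_samples > 0);
    double sum = 0.0;
    for (size_t i = 0; i < num_benchmark_samples; ++i) {
        /* A non-positive time would make the ratio meaningless. */
        assert(average_iteration_time[i] > 0.0);
        sum += system_size[i] / average_iteration_time[i];
    }
    return sum / (double)num_benchmark_samples;
}
```

Because each system contributes size divided by time, a single very fast small system and a single slow large system are weighted by their throughput rather than by their raw run time.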

The stable flat lines present for small sequence lengths indicate that time scales linearly with the system size, so the bigger the bandwidth, the better the result will be. The stepwise drops occur once the amount of transfers increases to 2x and then to 3x, when the compute unit can't hold the full sequence and splits it into a combination of smaller ones. Radeon VII is faster than RTX 3080 below 2^18 (=2MB, the page file size on AMD) due to its HBM2 memory with higher bandwidth; however, this GPU apparently has TLB miss problems on large buffer sizes.
On RTX 3080, VkFFT is faster than cuFFT in single precision batched 1D FFTs on the range from 2^3 to 2^27.
In double precision, Radeon VII is able to gain an advantage due to its high double precision core count. Radeon RX 6800XT can store the LUT in L3 cache and has a higher double precision core count as well.
In half precision mode, VkFFT only uses half precision for data storage; all computations are performed in single precision. This still proves to be enough to get a stable 2x performance gain on RTX 3080.
Multidimensional systems are optimized as well. The benchmark shows that Radeon RX 6800XT can store systems up to 128MB in L3 cache for big performance gains. Native support for zero padding allows transferring less data and gives up to a 3x performance boost in multidimensional FFTs.
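
The 2^18 = 2MB equivalence quoted for Radeon VII follows if 2^18 counts single precision complex elements, which is an assumption made here rather than something the text states. A small C sketch of that arithmetic, with a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes occupied by a C2C sequence: every complex element stores
   a real and an imaginary part of bytes_per_real bytes each. */
static size_t c2c_buffer_bytes(size_t num_elements, size_t bytes_per_real)
{
    /* 2 = half, 4 = single, 8 = double precision storage. */
    assert(bytes_per_real == 2 || bytes_per_real == 4 || bytes_per_real == 8);
    return num_elements * 2 * bytes_per_real;
}
```

Under the same assumption, c2c_buffer_bytes((size_t)1 << 18, 4) gives 2MB, and the 128MB L3 cache figure corresponds to 2^24 single precision complex elements.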

Precision comparison of cuFFT/VkFFT/FFTW

To measure how VkFFT (single/double/half precision) results compare to cuFFT/rocFFT (single/double/half precision) and FFTW (double precision), a set of ~60 systems covering the full FFT range was filled with random complex data on the scale of [-1,1] and one C2C transform was performed on each system. Samples 11 (single), 12 (double) and 13 (half) calculate the following for each value of the transformed system:

- Max difference between cuFFT/rocFFT/VkFFT result and FFTW result
- Average difference between cuFFT/rocFFT/VkFFT result and FFTW result
- Max ratio of the difference between cuFFT/rocFFT/VkFFT result and FFTW result to the FFTW result
- Average ratio of the difference between cuFFT/rocFFT/VkFFT result and FFTW result to the FFTW result
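
The four metrics above can be sketched in C as shown below. This is not the code of samples 11-13: the names are hypothetical, the data is treated as a flat array of real values (interleaved real and imaginary parts), and values whose reference is exactly zero are left out of the ratio to avoid division by zero. Both of these choices are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for the four metrics. */
typedef struct {
    double max_difference, average_difference;
    double max_ratio, average_ratio;
} precision_stats;

static double abs_double(double x) { return x < 0.0 ? -x : x; }

/* Compare a GPU result against a reference (e.g. FFTW in double precision),
   value by value. */
static precision_stats compare_to_reference(const double *result,
                                            const double *reference,
                                            size_t count)
{
    assert(count > 0);
    precision_stats s = {0.0, 0.0, 0.0, 0.0};
    for (size_t i = 0; i < count; ++i) {
        double diff = abs_double(result[i] - reference[i]);
        double ref = abs_double(reference[i]);
        /* Assumption: a zero reference value contributes a ratio of 0. */
        double ratio = ref > 0.0 ? diff / ref : 0.0;
        if (diff > s.max_difference) s.max_difference = diff;
        if (ratio > s.max_ratio) s.max_ratio = ratio;
        s.average_difference += diff;
        s.average_ratio += ratio;
    }
    s.average_difference /= (double)count;
    s.average_ratio /= (double)count;
    return s;
}
```

The max ratio is the most sensitive of the four, because a small absolute error on a reference value close to zero produces a large relative error.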

FFTW is required to launch these samples (specify its include and library directories in CMakeLists.txt). If cuFFT is disabled, only FFTW/VkFFT results are calculated.
The precision_cuFFT_VkFFT_FFTW.txt file contains the single precision results for Nvidia's 1660Ti GPU and AMD Ryzen 2700 CPU. On average, the results fluctuate for both cuFFT and VkFFT with no clear winner in single precision. Max ratio stays in the range of 2% for both cuFFT and VkFFT, while average ratio stays below 1e-6.
The precision_cuFFT_VkFFT_FFTW_double.txt file contains the double precision results for Nvidia's 1660Ti GPU and AMD Ryzen 2700 CPU. On average, VkFFT is more precise than cuFFT in double precision (see the max_difference and max_eps columns); however, it is also ~20% slower (vkfft_benchmark_double.png). Note that double precision is still in testing and these results may change in the future. Max ratio stays in the range of 5e-10% for both cuFFT and VkFFT, while average ratio stays below 1e-15. Overall, double precision is ~7 times slower than single precision on Nvidia's 1660Ti GPU.
The precision_cuFFT_VkFFT_FFTW_half.txt file contains the half precision results for Nvidia's 1660Ti GPU and AMD Ryzen 2700 CPU. On average, VkFFT is at least two times more precise than cuFFT in half precision (see the max_difference and max_eps columns), while being faster on average (vkfft_benchmark_half.png). Note that half precision is still in testing and is only used to store data in VkFFT. The cuFFT script can probably also be improved. Average ratio stays in the range of 0.2% for both cuFFT and VkFFT. Overall, half precision of VkFFT is ~50%-100% faster than single precision on Nvidia's 1660Ti GPU.

Contact information

The initial version of VkFFT was developed by Tolmachev Dmitrii
Formerly Peter Grünberg Institute and Institute for Advanced Simulation, Forschungszentrum Jülich, D-52425 Jülich, Germany
E-mail 1: [email protected]
E-mail 2: [email protected]
