
MMichel / Cudajacobi

CUDA implementation of the Jacobi method

Projects that are alternatives of or similar to Cudajacobi

Deep Painterly Harmonization
Code and data for paper "Deep Painterly Harmonization": https://arxiv.org/abs/1804.03189
Stars: ✭ 6,027 (+31621.05%)
Mutual labels:  cuda
Arraymancer
A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU and embedded devices via OpenMP, Cuda and OpenCL backends
Stars: ✭ 793 (+4073.68%)
Mutual labels:  cuda
Ddsh Tip2018
source code for paper "Deep Discrete Supervised Hashing"
Stars: ✭ 16 (-15.79%)
Mutual labels:  cuda
Juice
The Hacker's Machine Learning Engine
Stars: ✭ 743 (+3810.53%)
Mutual labels:  cuda
Numba
NumPy aware dynamic Python compiler using LLVM
Stars: ✭ 7,090 (+37215.79%)
Mutual labels:  cuda
Scikit Cuda
Python interface to GPU-powered libraries
Stars: ✭ 803 (+4126.32%)
Mutual labels:  cuda
Cuda Convnet2
Automatically exported from code.google.com/p/cuda-convnet2
Stars: ✭ 690 (+3531.58%)
Mutual labels:  cuda
Neuralsuperresolution
Real-time video quality improvement for applications such as video-chat using Perceptual Losses
Stars: ✭ 18 (-5.26%)
Mutual labels:  cuda
Pyopencl
OpenCL integration for Python, plus shiny features
Stars: ✭ 790 (+4057.89%)
Mutual labels:  cuda
Cudadbclustering
Clustering via Graphics Processor, using the NVIDIA CUDA SDK to perform database clustering on the massively parallel graphics card processor
Stars: ✭ 6 (-68.42%)
Mutual labels:  cuda
Accelerate
Embedded language for high-performance array computations
Stars: ✭ 751 (+3852.63%)
Mutual labels:  cuda
Marian
Fast Neural Machine Translation in C++
Stars: ✭ 777 (+3989.47%)
Mutual labels:  cuda
Pytorch Loss
label-smooth, amsoftmax, focal-loss, triplet-loss, lovasz-softmax. Maybe useful
Stars: ✭ 812 (+4173.68%)
Mutual labels:  cuda
Kintinuous
Real-time large scale dense visual SLAM system
Stars: ✭ 740 (+3794.74%)
Mutual labels:  cuda
Gmatrix
R package for unleashing the power of NVIDIA GPUs
Stars: ✭ 16 (-15.79%)
Mutual labels:  cuda
Gunrock
High-Performance Graph Primitives on GPUs
Stars: ✭ 718 (+3678.95%)
Mutual labels:  cuda
Blocksparse
Efficient GPU kernels for block-sparse matrix multiplication and convolution
Stars: ✭ 797 (+4094.74%)
Mutual labels:  cuda
Libomptarget
Stars: ✭ 18 (-5.26%)
Mutual labels:  cuda
Wheels
Performance-optimized wheels for TensorFlow (SSE, AVX, FMA, XLA, MPI)
Stars: ✭ 891 (+4589.47%)
Mutual labels:  cuda
Libcudarange
An interval arithmetic and affine arithmetic library for NVIDIA CUDA
Stars: ✭ 5 (-73.68%)
Mutual labels:  cuda

CUDA implementation of the Jacobi method

This project was created as part of "Introduction to GPU and accelerator programming for scientific computing", a course organized by SeSE and PDC at KTH.

Problem

The Jacobi method is used to find approximate numerical solutions to systems of linear equations of the form Ax = b. The algorithm starts from an initial estimate of x and iteratively updates it until convergence. The Jacobi method is guaranteed to converge if the matrix A is strictly diagonally dominant.
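
Concretely, with A = (a_ij), each iteration updates every component of x from the previous iterate (the standard Jacobi update, written here in LaTeX notation):

    x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j \neq i} a_{ij} x_j^{(k)} \right), \qquad i = 1, \ldots, n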

Two problem sizes were considered in the scope of this assignment. The small problem consists of a randomly generated 512x512 coefficient matrix A and a 512x1 right-hand-side (rhs) vector b. The large problem consists of a randomly generated 2048x2048 coefficient matrix A and a 2048x1 rhs vector b.

Implementation

The given task was to implement the Jacobi method in three versions: a serial CPU function, an unoptimized CUDA kernel, and an optimized version of the CUDA kernel. The Jacobi method was implemented in C following the pseudocode on Wikipedia. The unoptimized kernel runs the inner loop in each thread, with all threads belonging to a single block. Two adjustments were then made to optimize this kernel; a sketch of both kernel variants is given below. First, the index is precomputed and stored in a local register, which saves one multiplication per loop iteration in each thread. Second, the threads are tiled into multiple blocks, which enables scaling to larger problem sizes. Prefetching and loop unrolling were also tested, but it was not possible to reproduce correct results with them. These two approaches therefore remain as comments in the code and are not used.
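
The following is a minimal sketch of the two kernel variants, not the exact code in jacobi.cu; the kernel names, the use of float, and the double buffering between x_old and x_new are assumptions:

__global__ void jacobi_basic(const float *A, const float *b,
                             const float *x_old, float *x_new,
                             int Ni, int Nj)
{
    /* Unoptimized: one thread per row, all threads in a single block. */
    int i = threadIdx.x;
    if (i < Ni) {
        float sigma = 0.0f;
        for (int j = 0; j < Nj; j++)
            if (j != i)
                sigma += A[i * Nj + j] * x_old[j];  /* i * Nj recomputed each time */
        x_new[i] = (b[i] - sigma) / A[i * Nj + i];
    }
}

__global__ void jacobi_tiled(const float *A, const float *b,
                             const float *x_old, float *x_new,
                             int Ni, int Nj)
{
    /* Optimized: rows tiled over many blocks of `tilesize` threads each,
       row offset precomputed once and held in a local register. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < Ni) {
        int row = i * Nj;
        float sigma = 0.0f;
        for (int j = 0; j < Nj; j++)
            if (j != i)
                sigma += A[row + j] * x_old[j];
        x_new[i] = (b[i] - sigma) / A[row + i];
    }
}

Each Jacobi iteration then corresponds to one kernel launch, e.g. jacobi_tiled<<<(Ni + tilesize - 1) / tilesize, tilesize>>>(d_A, d_b, d_xold, d_xnew, Ni, Nj), with the x_old and x_new buffers swapped between launches.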

The file jacobi.cu contains the CPU implementation of the Jacobi method as well as both CUDA kernels. To compile the program on zorn, run ./install.sh; this loads the required CUDA module cuda/5.5 and executes the Makefile. The program can then be run with ./jacobi <arguments>. The following command line arguments are available:

-h  --help              Display usage information.
-f  --file filename     File containing the coefficient matrix and right-hand side (required).
-i  --Ni int            Number of elements in the Y direction (optional, default=512).
-j  --Nj int            Number of elements in the X direction (optional, default=512).
-n  --iterations int    Number of iterations (optional, default=10000).
-k  --kernel int        1: unoptimized kernel, 2: optimized kernel (optional, default=2).
-t  --tilesize int      Size of each thread block in kernel 2 (optional, default=4).

The command qsub job_512.sh submits an example run on the small problem with default parameters to the queue of zorn. The large problem can be run with qsub job_2048.sh.

Input matrices of a given size can be generated with python gen_diag_dominant_matrix.py <size> <output_filename>. A matrix and its corresponding rhs are created and stored in a text file, where each line contains one element of the row-major matrix; the rhs is appended at the end of the file. The small and large example problems are given by test_512 and test_2048, respectively. A sketch of a reader for this format is shown below.
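
For illustration, a reader for this file format might look as follows (a sketch under the format assumptions above; read_problem is a hypothetical helper, not a function from jacobi.cu):

#include <stdio.h>

/* Reads Ni*Nj row-major matrix entries followed by Ni rhs entries,
   one value per line, into preallocated arrays A and b. */
static int read_problem(const char *filename, int Ni, int Nj,
                        float *A, float *b)
{
    FILE *fp = fopen(filename, "r");
    if (!fp) return -1;
    for (int k = 0; k < Ni * Nj; k++)
        if (fscanf(fp, "%f", &A[k]) != 1) { fclose(fp); return -1; }
    for (int i = 0; i < Ni; i++)
        if (fscanf(fp, "%f", &b[i]) != 1) { fclose(fp); return -1; }
    fclose(fp);
    return 0;
}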

Benchmark

All implementations were tested on the small and the large problem size. Timings were recorded for 10,000 iterations of the Jacobi method; a sketch of a possible timing harness follows.
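
One plausible harness for such measurements uses CUDA events around the iteration loop (a sketch only; d_A, d_b, d_xold, d_xnew, nblocks, and tilesize are illustrative names, and the actual measurement code in jacobi.cu may differ):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
for (int it = 0; it < iterations; it++) {
    jacobi_tiled<<<nblocks, tilesize>>>(d_A, d_b, d_xold, d_xnew, Ni, Nj);
    float *tmp = d_xold; d_xold = d_xnew; d_xnew = tmp;  /* swap buffers */
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  /* elapsed time in milliseconds */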

Host implementation: The runtime for the serial CPU implementation is 17.25s for the small and 279.46s for the large problem.

GPU implementation basic: The unoptimized kernel needed 14.46s on the small problem. It was not possible to solve the large problem with this kernel: since all threads run in a single block, the 2048 rows of the large problem exceed the maximum number of threads per block on the GPU.

GPU implementation optimized: Storing the index in a register variable reduced the runtime on the small problem to 9.46s. Tiling was then used both to make the large problem computable and to optimize the runtime further. This resulted in a best runtime of 1.65s on the small problem with a tile size of 3. The large problem took 16.17s with an optimal tile size of 4. Table 1 shows the runtimes for all tested tile sizes and both problems.

Table 1: Runtime (s) of the optimized kernel for both problem sizes.

Tile size    Small (512)    Large (2048)
2            2.09           27.36
3            1.65           20.91
4            2.29           16.17
5            1.91           17.45
8            2.61           18.62
16           2.43           20.57
32           4.36           20.94
64           4.38           20.33
128          4.39           16.43
256          7.85           29.41
512          15.88          58.11