
MMichel / Cudajacobi

CUDA implementation of the Jacobi method

Projects that are alternatives of or similar to Cudajacobi

Deep Painterly Harmonization
Code and data for paper "Deep Painterly Harmonization": https://arxiv.org/abs/1804.03189
Stars: ✭ 6,027 (+31621.05%)
Mutual labels:  cuda
Arraymancer
A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU and embedded devices via OpenMP, Cuda and OpenCL backends
Stars: ✭ 793 (+4073.68%)
Mutual labels:  cuda
Ddsh Tip2018
source code for paper "Deep Discrete Supervised Hashing"
Stars: ✭ 16 (-15.79%)
Mutual labels:  cuda
Juice
The Hacker's Machine Learning Engine
Stars: ✭ 743 (+3810.53%)
Mutual labels:  cuda
Numba
NumPy aware dynamic Python compiler using LLVM
Stars: ✭ 7,090 (+37215.79%)
Mutual labels:  cuda
Scikit Cuda
Python interface to GPU-powered libraries
Stars: ✭ 803 (+4126.32%)
Mutual labels:  cuda
Cuda Convnet2
Automatically exported from code.google.com/p/cuda-convnet2
Stars: ✭ 690 (+3531.58%)
Mutual labels:  cuda
Neuralsuperresolution
Real-time video quality improvement for applications such as video-chat using Perceptual Losses
Stars: ✭ 18 (-5.26%)
Mutual labels:  cuda
Pyopencl
OpenCL integration for Python, plus shiny features
Stars: ✭ 790 (+4057.89%)
Mutual labels:  cuda
Cudadbclustering
Clustering via Graphics Processor, using the NVIDIA CUDA SDK to perform database clustering on the massively parallel graphics card processor
Stars: ✭ 6 (-68.42%)
Mutual labels:  cuda
Accelerate
Embedded language for high-performance array computations
Stars: ✭ 751 (+3852.63%)
Mutual labels:  cuda
Marian
Fast Neural Machine Translation in C++
Stars: ✭ 777 (+3989.47%)
Mutual labels:  cuda
Pytorch Loss
label-smooth, amsoftmax, focal-loss, triplet-loss, lovasz-softmax. Maybe useful
Stars: ✭ 812 (+4173.68%)
Mutual labels:  cuda
Kintinuous
Real-time large scale dense visual SLAM system
Stars: ✭ 740 (+3794.74%)
Mutual labels:  cuda
Gmatrix
R package for unleashing the power of NVIDIA GPUs
Stars: ✭ 16 (-15.79%)
Mutual labels:  cuda
Gunrock
High-Performance Graph Primitives on GPUs
Stars: ✭ 718 (+3678.95%)
Mutual labels:  cuda
Blocksparse
Efficient GPU kernels for block-sparse matrix multiplication and convolution
Stars: ✭ 797 (+4094.74%)
Mutual labels:  cuda
Libomptarget
Stars: ✭ 18 (-5.26%)
Mutual labels:  cuda
Wheels
Performance-optimized wheels for TensorFlow (SSE, AVX, FMA, XLA, MPI)
Stars: ✭ 891 (+4589.47%)
Mutual labels:  cuda
Libcudarange
An interval arithmetic and affine arithmetic library for NVIDIA CUDA
Stars: ✭ 5 (-73.68%)
Mutual labels:  cuda

CUDA implementation of the Jacobi method

This project was created as part of "Introduction to GPU and accelerator programming for scientific computing", a course organized by SeSE and PDC at KTH.

Problem

The Jacobi method is used to find approximate numerical solutions to systems of linear equations of the form Ax = b. The algorithm starts from an initial estimate of x and iteratively updates it until convergence. The Jacobi method is guaranteed to converge if the matrix A is strictly diagonally dominant.
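
Concretely, with A = (a_ij), each iteration updates every component of x from the previous iterate (the standard Jacobi update, written here in LaTeX notation):

    x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j \neq i} a_{ij} x_j^{(k)} \right), \qquad i = 1, \ldots, n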

Two problem sizes were considered in the scope of this assignment. The small problem consists of a randomly generated 512x512 coefficient matrix A and a 512x1 right-hand-side (rhs) vector b. The large problem consists of a randomly generated 2048x2048 coefficient matrix A and a 2048x1 rhs vector b.

Implementation

The given task was to implement the Jacobi method in three versions: a serial CPU function, an unoptimized CUDA kernel, and an optimized version of the CUDA kernel. The Jacobi method was implemented in C following the pseudocode on Wikipedia. The unoptimized kernel runs the inner loop in each thread, with all threads belonging to a single block. Two adjustments were then made to optimize this kernel; a sketch of both kernel variants is given below. First, the index is precomputed and stored in a local register, which saves one multiplication per loop iteration in each thread. Second, the threads are tiled into multiple blocks, which enables scaling to larger problem sizes. Prefetching and loop unrolling were also tested, but it was not possible to reproduce correct results with them. These two approaches therefore remain as comments in the code and are not used.
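
The following is a minimal sketch of the two kernel variants, not the exact code in jacobi.cu; the kernel names, the use of float, and the double buffering between x_old and x_new are assumptions:

__global__ void jacobi_basic(const float *A, const float *b,
                             const float *x_old, float *x_new,
                             int Ni, int Nj)
{
    /* Unoptimized: one thread per row, all threads in a single block. */
    int i = threadIdx.x;
    if (i < Ni) {
        float sigma = 0.0f;
        for (int j = 0; j < Nj; j++)
            if (j != i)
                sigma += A[i * Nj + j] * x_old[j];  /* i * Nj recomputed each time */
        x_new[i] = (b[i] - sigma) / A[i * Nj + i];
    }
}

__global__ void jacobi_tiled(const float *A, const float *b,
                             const float *x_old, float *x_new,
                             int Ni, int Nj)
{
    /* Optimized: rows tiled over many blocks of `tilesize` threads each,
       row offset precomputed once and held in a local register. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < Ni) {
        int row = i * Nj;
        float sigma = 0.0f;
        for (int j = 0; j < Nj; j++)
            if (j != i)
                sigma += A[row + j] * x_old[j];
        x_new[i] = (b[i] - sigma) / A[row + i];
    }
}

Each Jacobi iteration then corresponds to one kernel launch, e.g. jacobi_tiled<<<(Ni + tilesize - 1) / tilesize, tilesize>>>(d_A, d_b, d_xold, d_xnew, Ni, Nj), with the x_old and x_new buffers swapped between launches.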

The file jacobi.cu contains the CPU implementation of the Jacobi method as well as both CUDA kernels. To compile the program on zorn, run ./install.sh; this loads the required CUDA module cuda/5.5 and executes the Makefile. The program can then be run with ./jacobi <arguments>. The following command line arguments are available:

-h  --help              Display usage information.
-f  --file filename     File containing the coefficient matrix and right-hand side (required).
-i  --Ni int            Number of elements in the Y direction (optional, default=512).
-j  --Nj int            Number of elements in the X direction (optional, default=512).
-n  --iterations int    Number of iterations (optional, default=10000).
-k  --kernel int        1: unoptimized kernel, 2: optimized kernel (optional, default=2).
-t  --tilesize int      Size of each thread block in kernel 2 (optional, default=4).

The command qsub job_512.sh submits an example run on the small problem with default parameters to the queue of zorn. The large problem can be run with qsub job_2048.sh.

Input matrices of a given size can be generated with python gen_diag_dominant_matrix.py <size> <output_filename>. A matrix and its corresponding rhs are created and stored in a text file, where each line contains one element of the row-major matrix; the rhs is appended at the end of the file. The small and large example problems are given by test_512 and test_2048, respectively. A sketch of a reader for this format is shown below.
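
For illustration, a reader for this file format might look as follows (a sketch under the format assumptions above; read_problem is a hypothetical helper, not a function from jacobi.cu):

#include <stdio.h>

/* Reads Ni*Nj row-major matrix entries followed by Ni rhs entries,
   one value per line, into preallocated arrays A and b. */
static int read_problem(const char *filename, int Ni, int Nj,
                        float *A, float *b)
{
    FILE *fp = fopen(filename, "r");
    if (!fp) return -1;
    for (int k = 0; k < Ni * Nj; k++)
        if (fscanf(fp, "%f", &A[k]) != 1) { fclose(fp); return -1; }
    for (int i = 0; i < Ni; i++)
        if (fscanf(fp, "%f", &b[i]) != 1) { fclose(fp); return -1; }
    fclose(fp);
    return 0;
}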

Benchmark

All implementations were tested on the small and the large problem size. Timings were recorded for 10,000 iterations of the Jacobi method; a sketch of a possible timing harness follows.
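
One plausible harness for such measurements uses CUDA events around the iteration loop (a sketch only; d_A, d_b, d_xold, d_xnew, nblocks, and tilesize are illustrative names, and the actual measurement code in jacobi.cu may differ):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
for (int it = 0; it < iterations; it++) {
    jacobi_tiled<<<nblocks, tilesize>>>(d_A, d_b, d_xold, d_xnew, Ni, Nj);
    float *tmp = d_xold; d_xold = d_xnew; d_xnew = tmp;  /* swap buffers */
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  /* elapsed time in milliseconds */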

Host implementation: The runtime for the serial CPU implementation is 17.25s for the small and 279.46s for the large problem.

GPU implementation basic: The unoptimized kernel needed 14.46s on the small problem. It was not possible to solve the large problem with this kernel: since all threads run in a single block, the 2048 rows of the large problem exceed the maximum number of threads per block on the GPU.

GPU implementation optimized: Storing the index in a register variable reduced the runtime on the small problem to 9.46s. Tiling was then used both to make the large problem computable and to optimize the runtime further. This resulted in a best runtime of 1.65s on the small problem with a tile size of 3. The large problem took 16.17s with an optimal tile size of 4. Table 1 shows the runtimes for all tested tile sizes and both problems.

Table 1: Runtime (s) of the optimized kernel for both problem sizes.

Tile size    Small (512)    Large (2048)
2            2.09           27.36
3            1.65           20.91
4            2.29           16.17
5            1.91           17.45
8            2.61           18.62
16           2.43           20.57
32           4.36           20.94
64           4.38           20.33
128          4.39           16.43
256          7.85           29.41
512          15.88          58.11