
hma02 / cublasHgemm-P100

License: MIT
Code for testing the native float16 matrix multiplication performance of Tesla P100 and V100 GPUs, based on cublasHgemm

Programming Languages

Cuda
1817 projects
C++
36643 projects - #6 most used programming language
Makefile
30231 projects
Shell
77523 projects

Projects that are alternatives of or similar to cublasHgemm-P100

hashcat-benchmark-comparison
Hashcat Benchmark Comparison
Stars: ✭ 22 (-37.14%)
Mutual labels:  p100, v100
cuda-swift
Parallel Computing Library for Linux and macOS & NVIDIA CUDA Wrapper
Stars: ✭ 79 (+125.71%)
Mutual labels:  cublas
Cupy
NumPy & SciPy for GPU
Stars: ✭ 5,625 (+15971.43%)
Mutual labels:  cublas
learn-gpgpu
Algorithms implemented in CUDA + resources about GPGPU
Stars: ✭ 37 (+5.71%)
Mutual labels:  cublas
Decimal
Arbitrary-precision fixed-point decimal numbers in go
Stars: ✭ 3,588 (+10151.43%)
Mutual labels:  precision
js-big-decimal
Work with large numbers on the client side with high precision.
Stars: ✭ 41 (+17.14%)
Mutual labels:  precision
PreciseDecimal
A Decimal type that plays nicely with literals and Decodable ✨
Stars: ✭ 18 (-48.57%)
Mutual labels:  precision
DoubleFloats.jl
math with more good bits
Stars: ✭ 102 (+191.43%)
Mutual labels:  precision
mixed-precision-pytorch
Training with FP16 weights in PyTorch
Stars: ✭ 72 (+105.71%)
Mutual labels:  precision
verificarlo
A tool for debugging and assessing floating point precision and reproducibility.
Stars: ✭ 51 (+45.71%)
Mutual labels:  precision
m2.Price
Magento2. Rounding Price to Prettier Value for Multi-Currency Stores.
Stars: ✭ 60 (+71.43%)
Mutual labels:  precision
slibs
Single file libraries for C/C++
Stars: ✭ 80 (+128.57%)
Mutual labels:  gemm
dbcsr
DBCSR: Distributed Block Compressed Sparse Row matrix library
Stars: ✭ 65 (+85.71%)
Mutual labels:  gemm
Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.
Stars: ✭ 78 (+122.86%)
Mutual labels:  gemm
openai-gemm.pytorch
PyTorch bindings for openai-gemm
Stars: ✭ 20 (-42.86%)
Mutual labels:  gemm
caffe-android-opencl-fp16
Optimised Caffe with OpenCL support for less powerful devices such as mobile phones
Stars: ✭ 17 (-51.43%)
Mutual labels:  half-precision
chop
Round matrix elements to lower precision in MATLAB
Stars: ✭ 21 (-40%)
Mutual labels:  half-precision
pytorch-model-parallel
A memory balanced and communication efficient FullyConnected layer with CrossEntropyLoss model parallel implementation in PyTorch
Stars: ✭ 74 (+111.43%)
Mutual labels:  half-precision
watt
iOS App for TP-Link Devices (Kasa Smart & Tapo) | Support page for issues related to Watt iOS App
Stars: ✭ 24 (-31.43%)
Mutual labels:  p100

fp16-cublasHgemm-test

A simple benchmark of half-precision (float16) matrix multiplication performance on a Tesla P100 (sm_60) or V100 (sm_70) GPU, based on cublasHgemm.

Build and Run

The code computes C = alpha*A*B + beta*C on the GPU for square matrices A, B, and C of increasing size, where A has shape (m, k), B has shape (k, n), and C has shape (m, n).
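
In cuBLAS this operation maps onto a single cublasHgemm call. The sketch below is a minimal, self-contained example of that call and is not the repository's hgemm.cu; the matrix size and fill values are placeholders, and it assumes a CUDA toolkit recent enough to provide host-side __half conversions.

// Minimal sketch of one C = alpha*A*B + beta*C in half precision (not the repo's code).
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cstdio>
#include <vector>

int main() {
    const int m = 1024, k = 1024, n = 1024;

    // Host buffers: A and B filled with 1.0, C with 0.0, all in float16.
    std::vector<__half> hA(m * k, __float2half(1.0f));
    std::vector<__half> hB(k * n, __float2half(1.0f));
    std::vector<__half> hC(m * n, __float2half(0.0f));

    // Device buffers (error checking omitted for brevity).
    __half *dA, *dB, *dC;
    cudaMalloc((void**)&dA, hA.size() * sizeof(__half));
    cudaMalloc((void**)&dB, hB.size() * sizeof(__half));
    cudaMalloc((void**)&dC, hC.size() * sizeof(__half));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(__half), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(__half), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), hC.size() * sizeof(__half), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const __half alpha = __float2half(1.0f);
    const __half beta  = __float2half(0.0f);

    // cuBLAS is column-major: A is m x k (lda = m), B is k x n (ldb = k), C is m x n (ldc = m).
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, dA, m, dB, k, &beta, dC, m);

    cudaMemcpy(hC.data(), dC, hC.size() * sizeof(__half), cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", __half2float(hC[0]));   // all-ones inputs: expect k = 1024

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

It builds with the same kind of command used for hgemm.cu below, e.g. nvcc example.cu -lcublas --std=c++11 -arch=sm_60 -o example.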

To test float16 matrix multiplication:

$ make
$ ./hgemm
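
The program sweeps square sizes from 2 up to 32768 and, for each size, prints the average time over 10 repeats (see the logs below). A common way to obtain such averages is CUDA event timing; the sketch below illustrates that pattern only and is not the repository's exact timing code, with run_gemm_once() standing in for the cublasHgemm/cublasSgemm call being measured.

// Event-based timing sketch: average GPU time per call over a number of repeats.
#include <cuda_runtime.h>
#include <cstdio>

void run_gemm_once() {
    // Placeholder: enqueue one GEMM (e.g. cublasHgemm) on the default stream here.
}

float average_seconds(int repeats) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int r = 0; r < repeats; ++r)
        run_gemm_once();                        // enqueue the operation `repeats` times
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                 // wait until all repeats have finished on the GPU

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);     // elapsed GPU time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / 1000.0f / repeats;              // average seconds per call
}

int main() {
    printf("average: %g s\n", average_seconds(10));   // 10 repeats, matching the logs below
    return 0;
}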

Comment out line 11 in hgemm.cu to test float32 matrix multiplication instead; a sketch of what such a toggle typically looks like follows.
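
The file itself is not reproduced here, but such a toggle is typically a single preprocessor definition near the top of the source. The macro name FP16MM in the snippet below is an assumption for illustration, not necessarily the name used in hgemm.cu.

// Illustrative precision toggle; commenting out the define falls back to float32 / cublasSgemm.
#include <cuda_fp16.h>
#include <cstdio>

#define FP16MM   // the line to comment out for the float32 benchmark

#ifdef FP16MM
typedef __half mat_t;        // matrices stored as float16, multiplied with cublasHgemm
#else
typedef float  mat_t;        // matrices stored as float32, multiplied with cublasSgemm
#endif

int main() {
    printf("element size: %zu bytes\n", sizeof(mat_t));   // 2 for float16, 4 for float32
    return 0;
}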

Tesla P100 Example Test Results

nvcc hgemm.cu -lcublas --std=c++11 -arch=sm_60  -o hgemm

running cublasHgemm test

running with min_m_k_n: 2 max_m_k_n: 32768 repeats: 10
allocating device variables
float16; size 2 average: 7.69632e-05 s 
float16; size 4 average: 1.34304e-05 s 
float16; size 8 average: 3.49152e-05 s 
float16; size 16 average: 1.6272e-05 s 
float16; size 32 average: 1.91808e-05 s 
float16; size 64 average: 2.52672e-05 s 
float16; size 128 average: 2.48512e-05 s 
float16; size 256 average: 6.52992e-05 s 
float16; size 512 average: 0.000111104 s 
float16; size 1024 average: 0.000275123 s 
float16; size 2048 average: 0.00155046 s 
float16; size 4096 average: 0.00934949 s 
float16; size 8192 average: 0.0659167 s 
float16; size 16384 average: 0.508014 s 
float16; size 32768 average: 4.01786 s 

nvcc hgemm.cu -lcublas --std=c++11 -arch=sm_60  -o hgemm

running cublasSgemm test

running with min_m_k_n: 2 max_m_k_n: 32768 repeats: 10
allocating device variables
float32; size 2 average: 5.21152e-05 s 
float32; size 4 average: 2.06112e-05 s 
float32; size 8 average: 7.1616e-06 s 
float32; size 16 average: 5.3248e-06 s 
float32; size 32 average: 4.624e-06 s 
float32; size 64 average: 1.128e-05 s 
float32; size 128 average: 2.37504e-05 s 
float32; size 256 average: 4.83776e-05 s 
float32; size 512 average: 0.000117616 s 
float32; size 1024 average: 0.000599805 s 
float32; size 2048 average: 0.0026987 s 
float32; size 4096 average: 0.0180615 s 
float32; size 8192 average: 0.128823 s 
float32; size 16384 average: 1.00408 s 
float32; size 32768 average: 8.07247 s 

Tesla V100 Example Test Results

nvcc hgemm.cu -lcublas --std=c++11 -arch=sm_70  -o hgemm

running cublasHgemm test

running with min_m_k_n: 2 max_m_k_n: 32768 repeats: 10
allocating device variables
float16; size 2 average: 0.000115712 s
float16; size 4 average: 6.76864e-05 s
float16; size 8 average: 7.03488e-05 s
float16; size 16 average: 7.08608e-05 s
float16; size 32 average: 7.8336e-05 s
float16; size 64 average: 8.16128e-05 s
float16; size 128 average: 8.7552e-05 s
float16; size 256 average: 0.000126157 s
float16; size 512 average: 0.000196301 s
float16; size 1024 average: 0.000361267 s
float16; size 2048 average: 0.00156385 s
float16; size 4096 average: 0.00853637 s
float16; size 8192 average: 0.0443268 s
float16; size 16384 average: 0.307294 s
float16; size 32768 average: 2.30823 s

nvcc hgemm.cu -lcublas --std=c++11 -arch=sm_70  -o hgemm

running cublasSgemm test

running with min_m_k_n: 2 max_m_k_n: 32768 repeats: 10
allocating device variables
float32; size 2 average: 6.7584e-05 s 
float32; size 4 average: 6.53312e-05 s 
float32; size 8 average: 6.47168e-05 s 
float32; size 16 average: 6.44096e-05 s 
float32; size 32 average: 7.29088e-05 s 
float32; size 64 average: 7.4752e-05 s 
float32; size 128 average: 8.06912e-05 s 
float32; size 256 average: 0.000160768 s 
float32; size 512 average: 0.000111923 s 
float32; size 1024 average: 0.000254464 s 
float32; size 2048 average: 0.00134257 s 
float32; size 4096 average: 0.00944916 s 
float32; size 8192 average: 0.0721418 s 
float32; size 16384 average: 0.573173 s 
float32; size 32768 average: 4.6143 s
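
For context, these timings can be converted into effective throughput: a square GEMM of size n performs roughly 2*n^3 floating-point operations. The short program below applies that formula to the size-32768 averages reported above (it uses only numbers already printed in the logs); it yields roughly 17.5 and 8.7 TFLOP/s for float16 and float32 on the P100, and roughly 30.5 and 15.3 TFLOP/s on the V100.

// Convert measured GEMM times into effective TFLOP/s (2*n^3 operations per square GEMM).
#include <cstdio>

int main() {
    const double n = 32768.0;
    const double flop = 2.0 * n * n * n;    // ~7.04e13 floating-point operations per GEMM

    struct Run { const char* label; double seconds; };   // size-32768 averages from the logs above
    const Run runs[] = {
        {"P100 float16", 4.01786},
        {"P100 float32", 8.07247},
        {"V100 float16", 2.30823},
        {"V100 float32", 4.61430},
    };
    for (const Run& r : runs)
        printf("%s: %.1f TFLOP/s\n", r.label, flop / r.seconds / 1e12);
    return 0;
}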
