
NVIDIA / NCCL Tests

License: other

Projects that are alternatives to or similar to NCCL Tests

Optical Flow Filter
A real-time optical flow algorithm implemented on GPU
Stars: ✭ 146 (-12.05%)
Mutual labels:  cuda
Cumf als
CUDA Matrix Factorization Library with Alternating Least Square (ALS)
Stars: ✭ 154 (-7.23%)
Mutual labels:  cuda
Cx db8
A contextual, biasable, word-or-sentence-or-paragraph extractive summarizer powered by the latest in text embeddings (BERT, Universal Sentence Encoder, Flair)
Stars: ✭ 164 (-1.2%)
Mutual labels:  cuda
Cuda Cnn
A CNN accelerated by CUDA, tested on MNIST and finally reaching 99.76% accuracy
Stars: ✭ 148 (-10.84%)
Mutual labels:  cuda
Compactcnncascade
A binary library for very fast face detection using compact CNNs.
Stars: ✭ 152 (-8.43%)
Mutual labels:  cuda
3dunderworld Sls Gpu cpu
A structured light scanner
Stars: ✭ 157 (-5.42%)
Mutual labels:  cuda
Gpurir
Python library for Room Impulse Response (RIR) simulation with GPU acceleration
Stars: ✭ 145 (-12.65%)
Mutual labels:  cuda
Opencuda
Stars: ✭ 164 (-1.2%)
Mutual labels:  cuda
Dsmnet
Domain-invariant Stereo Matching Networks
Stars: ✭ 153 (-7.83%)
Mutual labels:  cuda
Khiva
An open-source library of algorithms to analyse time series on GPU and CPU.
Stars: ✭ 161 (-3.01%)
Mutual labels:  cuda
Ginkgo
Numerical linear algebra software package
Stars: ✭ 149 (-10.24%)
Mutual labels:  cuda
Jetson
Helmut Hoffer von Ankershoffen experimenting with arm64 based NVIDIA Jetson (Nano and AGX Xavier) edge devices running Kubernetes (K8s) for machine learning (ML) including Jupyter Notebooks, TensorFlow Training and TensorFlow Serving using CUDA for smart IoT.
Stars: ✭ 151 (-9.04%)
Mutual labels:  cuda
Xmrminer
🐜 A CUDA-based miner for Monero
Stars: ✭ 158 (-4.82%)
Mutual labels:  cuda
Sketchgraphs
A dataset of 15 million CAD sketches with geometric constraint graphs.
Stars: ✭ 148 (-10.84%)
Mutual labels:  cuda
Primitiv
A Neural Network Toolkit.
Stars: ✭ 164 (-1.2%)
Mutual labels:  cuda
Volumetric Path Tracer
☁️ Volumetric path tracer using CUDA
Stars: ✭ 145 (-12.65%)
Mutual labels:  cuda
Rmm
RAPIDS Memory Manager
Stars: ✭ 154 (-7.23%)
Mutual labels:  cuda
Jcuda
JCuda - Java bindings for CUDA
Stars: ✭ 165 (-0.6%)
Mutual labels:  cuda
Multi Gpu Programming Models
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
Stars: ✭ 165 (-0.6%)
Mutual labels:  cuda
Clojurecuda
Clojure library for CUDA development
Stars: ✭ 158 (-4.82%)
Mutual labels:  cuda

NCCL Tests

These tests check both the performance and the correctness of NCCL operations.

Build

To build the tests, just type make:
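
$ make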

If CUDA is not installed in /usr/local/cuda, you may specify CUDA_HOME. Similarly, if NCCL is not installed in /usr, you may specify NCCL_HOME.

$ make CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl

NCCL tests rely on MPI to run across multiple processes and, hence, multiple nodes. To compile the tests with MPI support, set MPI=1 and point MPI_HOME to the path where MPI is installed.

$ make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl

Usage

NCCL tests can run on multiple processes, multiple threads, and multiple CUDA devices per thread. The number of processes is managed by MPI and is therefore not passed to the tests as an argument. The total number of ranks (= CUDA devices) equals (number of processes) * (number of threads) * (number of GPUs per thread); see the examples below.

Quick examples

Run on 8 GPUs (-g 8), scanning from 8 bytes to 128 MB (doubling the size at each step with -f 2):

$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8

Run with MPI on 40 processes (potentially on multiple nodes) with 4 GPUs each, for a total of 160 ranks:

$ mpirun -np 40 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
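
As a further illustration of the rank arithmetic above, a hypothetical layout with 2 MPI processes, 2 threads per process (-t 2) and 2 GPUs per thread (-g 2) gives 2*2*2 = 8 ranks (adjust the numbers to your machine):

$ mpirun -np 2 ./build/all_reduce_perf -t 2 -g 2 -b 8 -e 128M -f 2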

Performance

See the Performance page for an explanation of the reported numbers, in particular the "busbw" column.
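
As a rough sketch of how those numbers are derived (following the definitions on that page; n denotes the number of ranks): the algorithm bandwidth divides the data size by the elapsed time, and the bus bandwidth rescales it by a collective-specific factor so that results are comparable across collectives. For AllReduce:

  algbw = size / time
  busbw = algbw * 2 * (n - 1) / n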

Arguments

All tests support the same set of arguments (a combined example follows the list):

  • Number of GPUs
    • -t,--nthreads <num threads> number of threads per process. Default: 1.
    • -g,--ngpus <GPUs per thread> number of GPUs per thread. Default: 1.
  • Sizes to scan
    • -b,--minbytes <min size in bytes> minimum size to start with. Default: 32M.
    • -e,--maxbytes <max size in bytes> maximum size to end at. Default: 32M.
    • Increments can be either a fixed step or a multiplication factor; only one of the two should be used.
      • -i,--stepbytes <increment size> fixed increment between sizes. Default: (max-min)/10.
      • -f,--stepfactor <increment factor> multiplication factor between sizes. Default: disabled.
  • NCCL operations arguments
    • -o,--op <sum/prod/min/max/all> reduction operation to perform. Only relevant for reduction operations like AllReduce, Reduce or ReduceScatter. Default: Sum.
    • -d,--datatype <nccltype/all> datatype to use. Default: Float.
    • -r,--root <root/all> root to use. Only for operations with a root, like broadcast or reduce. Default: 0.
  • Performance
    • -n,--iters <iteration count> number of iterations. Default: 20.
    • -w,--warmup_iters <warmup iteration count> number of warmup iterations (not timed). Default: 5.
    • -m,--agg_iters <aggregation count> number of operations to aggregate together in each iteration. Default: 1.
  • Test operation
    • -p,--parallel_init <0/1> use threads to initialize NCCL in parallel. Default: 0.
    • -c,--check <0/1> check correctness of results. This can be quite slow on large numbers of GPUs. Default: 1.
    • -z,--blocking <0/1> make NCCL collectives blocking, i.e. have the CPU wait and sync after each collective. Default: 0.
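
For illustration, a hypothetical invocation combining several of the flags above (the values are arbitrary, not a recommended configuration):

$ ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 4 -o sum -d float -n 50 -w 10 -c 1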

Copyright

NCCL tests are provided under the BSD license. All source code and accompanying documentation is copyright (c) 2016-2019, NVIDIA CORPORATION. All rights reserved.
