baidu-research / Baidu Allreduce

Licence: apache-2.0

Projects that are alternatives of or similar to Baidu Allreduce

Music Translation
A UNIVERSAL MUSIC TRANSLATION NETWORK - a method for translating music across musical instruments and styles.
Stars: ✭ 385 (-10.47%)
Mutual labels:  cuda
Pytorch Pwc
a reimplementation of PWC-Net in PyTorch that matches the official Caffe version
Stars: ✭ 402 (-6.51%)
Mutual labels:  cuda
H2o4gpu
H2Oai GPU Edition
Stars: ✭ 416 (-3.26%)
Mutual labels:  cuda
Amgcl
C++ library for solving large sparse linear systems with algebraic multigrid method
Stars: ✭ 390 (-9.3%)
Mutual labels:  cuda
Integral Human Pose
Integral Human Pose Regression
Stars: ✭ 395 (-8.14%)
Mutual labels:  cuda
Warp Ctc
Fast parallel CTC.
Stars: ✭ 3,954 (+819.53%)
Mutual labels:  cuda
Ilgpu
ILGPU JIT Compiler for high-performance .Net GPU programs
Stars: ✭ 374 (-13.02%)
Mutual labels:  cuda
Tensorflow Cmake
TensorFlow examples in C, C++, Go and Python without bazel but with cmake and FindTensorFlow.cmake
Stars: ✭ 418 (-2.79%)
Mutual labels:  cuda
Cubert
Fast implementation of BERT inference directly on NVIDIA (CUDA, CUBLAS) and Intel MKL
Stars: ✭ 395 (-8.14%)
Mutual labels:  cuda
Deformable Convolution Pytorch
PyTorch implementation of Deformable Convolution
Stars: ✭ 410 (-4.65%)
Mutual labels:  cuda
Neuralnetwork.net
A TensorFlow-inspired neural network library built from scratch in C# 7.3 for .NET Standard 2.0, with GPU support through cuDNN
Stars: ✭ 392 (-8.84%)
Mutual labels:  cuda
Cudanative.jl
Julia support for native CUDA programming
Stars: ✭ 393 (-8.6%)
Mutual labels:  cuda
Ai Lab
All-in-one AI container for rapid prototyping
Stars: ✭ 406 (-5.58%)
Mutual labels:  cuda
Cudf
cuDF - GPU DataFrame Library
Stars: ✭ 4,370 (+916.28%)
Mutual labels:  cuda
Icpcuda
Super fast implementation of ICP in CUDA for compute capable devices 3.5 or higher
Stars: ✭ 416 (-3.26%)
Mutual labels:  cuda
Hipsycl
Implementation of SYCL for CPUs, AMD GPUs, NVIDIA GPUs
Stars: ✭ 377 (-12.33%)
Mutual labels:  cuda
Gocv
Go package for computer vision using OpenCV 4 and beyond.
Stars: ✭ 4,511 (+949.07%)
Mutual labels:  cuda
Tsdf Fusion
Fuse multiple depth frames into a TSDF voxel volume.
Stars: ✭ 426 (-0.93%)
Mutual labels:  cuda
Accel
(Mirror of GitLab) GPGPU Framework for Rust
Stars: ✭ 420 (-2.33%)
Mutual labels:  cuda
Tensorrt tutorial
Stars: ✭ 407 (-5.35%)
Mutual labels:  cuda

baidu-allreduce

baidu-allreduce is a small C++ library demonstrating the ring allreduce and ring allgather techniques. The goal is to provide a template for deep learning framework authors to use when implementing these communication algorithms within their respective frameworks.

A description of the ring allreduce with its application to deep learning is available on the Baidu SVAIL blog.
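For orientation, the following is a minimal CPU-only sketch of the ring allreduce communication pattern written with plain MPI. It is not the library's implementation: the function name is hypothetical, it assumes the buffer length is evenly divisible by the number of processes, and it omits all error handling. Each process exchanges one chunk per step with its ring neighbors, first in N-1 scatter-reduce steps and then in N-1 allgather steps.

#include <mpi.h>
#include <cstdio>
#include <vector>

// Sketch only (hypothetical helper, not the library's code). Assumes `length`
// is divisible by the number of processes.
void ring_allreduce_sketch(float* data, size_t length, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    const size_t chunk = length / size;
    const int right = (rank + 1) % size;
    const int left  = (rank - 1 + size) % size;
    std::vector<float> recv(chunk);

    // Scatter-reduce: after size-1 steps, each rank owns one fully reduced chunk.
    for (int step = 0; step < size - 1; ++step) {
        const int send_idx = (rank - step + size) % size;
        const int recv_idx = (rank - step - 1 + size) % size;
        MPI_Sendrecv(data + send_idx * chunk, (int)chunk, MPI_FLOAT, right, 0,
                     recv.data(), (int)chunk, MPI_FLOAT, left, 0,
                     comm, MPI_STATUS_IGNORE);
        for (size_t i = 0; i < chunk; ++i)
            data[recv_idx * chunk + i] += recv[i];
    }

    // Allgather: circulate the reduced chunks so every rank ends up with all of them.
    for (int step = 0; step < size - 1; ++step) {
        const int send_idx = (rank - step + 1 + size) % size;
        const int recv_idx = (rank - step + size) % size;
        MPI_Sendrecv(data + send_idx * chunk, (int)chunk, MPI_FLOAT, right, 0,
                     data + recv_idx * chunk, (int)chunk, MPI_FLOAT, left, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    std::vector<float> data(12, 1.0f);   // each rank contributes a vector of ones
    ring_allreduce_sketch(data.data(), data.size(), MPI_COMM_WORLD);
    std::printf("data[0] = %f\n", data[0]);  // equals the number of ranks
    MPI_Finalize();
    return 0;
}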

Installation

Prerequisites: Before compiling baidu-allreduce, make sure you have installed CUDA (7.5 or greater) and an MPI implementation.

baidu-allreduce has been tested with OpenMPI, but should work with any CUDA-aware MPI implementation, such as MVAPICH.
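If you are unsure whether your Open MPI build is CUDA-aware, one way to check (specific to Open MPI) is to query ompi_info; a value of true indicates CUDA support was compiled in. Other MPI implementations document their own checks.

ompi_info --parsable --all | grep mpi_built_with_cuda_support:value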

To compile baidu-allreduce, run

# Modify MPI_ROOT to point to your installation of MPI.
# You should see $MPI_ROOT/include/mpi.h and $MPI_ROOT/lib/libmpi.so.
# Modify CUDA_ROOT to point to your installation of CUDA.
make MPI_ROOT=/usr/lib/openmpi CUDA_ROOT=/path/to/cuda/lib64

You may need to modify your LD_LIBRARY_PATH environment variable to point to your MPI implementation as well as your CUDA libraries.
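For example, assuming the same MPI and CUDA locations used in the make invocation above (the paths are illustrative; substitute the directories that actually contain libmpi.so and libcudart.so on your system):

export LD_LIBRARY_PATH=/usr/lib/openmpi/lib:/path/to/cuda/lib64:$LD_LIBRARY_PATH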

To run the baidu-allreduce tests after compiling it, run

# On CPU.
mpirun --np 3 allreduce-test cpu

# On GPU. Requires a CUDA-aware MPI implementation.
mpirun --np 3 allreduce-test gpu

Interface

The baidu-allreduce library provides the following C++ functions:

// Initialize the library, including MPI and if necessary the CUDA device.
// If device == NO_DEVICE, no GPU is used; otherwise, the device specifies which CUDA
// device should be used. All data passed to other functions must be on that device.
#define NO_DEVICE -1
void InitCollectives(int device);

// The ring allreduce. The lengths of the data chunks passed to this function
// must be the same across all MPI processes. The output memory will be
// allocated and written into `output`.
void RingAllreduce(float* data, size_t length, float** output);

// The ring allgather. The lengths of the data chunks passed to this function
// may differ across different devices. The output memory will be allocated and
// written into `output`.
void RingAllgather(float* data, size_t length, float** output);

The interface is deliberately simple and inflexible; it is meant as a demonstration. The code is fairly straightforward, and the same technique can be integrated into existing codebases in a variety of ways.
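As a hypothetical usage sketch (not part of the repository), a caller might initialize the library and allreduce a small buffer as shown below. The header name and the ownership of the output buffer are assumptions, so check the source before copying this verbatim.

#include <cstdio>
#include <vector>

#include "collectives.h"   // assumed header name for the declarations above

int main(int argc, char** argv) {
    // CPU-only run; pass a CUDA device index instead of NO_DEVICE for GPU buffers.
    InitCollectives(NO_DEVICE);

    std::vector<float> data(8, 1.0f);  // every process contributes a vector of ones
    float* output = nullptr;
    RingAllreduce(data.data(), data.size(), &output);

    // After the allreduce, each element equals the number of participating processes.
    std::printf("output[0] = %f\n", output[0]);

    delete[] output;  // assumption: the library allocates the output with new[]
    return 0;
}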
