1ytic / Warp Rnnt

License: MIT
CUDA-Warp RNN-Transducer

Programming Languages

python

Projects that are alternatives to or similar to Warp Rnnt

Pygraphistry
PyGraphistry is a Python library to quickly load, shape, embed, and explore big graphs with the GPU-accelerated Graphistry visual graph analyzer
Stars: ✭ 1,365 (+1018.85%)
Mutual labels:  cuda
Adacof Pytorch
Official source code for our paper "AdaCoF: Adaptive Collaboration of Flows for Video Frame Interpolation" (CVPR 2020)
Stars: ✭ 110 (-9.84%)
Mutual labels:  cuda
Spoc
Stream Processing with OCaml
Stars: ✭ 115 (-5.74%)
Mutual labels:  cuda
Chamferdistancepytorch
Chamfer Distance in Pytorch with f-score
Stars: ✭ 105 (-13.93%)
Mutual labels:  cuda
Cuhe
CUDA Homomorphic Encryption Library
Stars: ✭ 109 (-10.66%)
Mutual labels:  cuda
Tensorflow Object Detection Tutorial
The purpose of this tutorial is to learn how to install and prepare TensorFlow framework to train your own convolutional neural network object detection classifier for multiple objects, starting from scratch
Stars: ✭ 113 (-7.38%)
Mutual labels:  cuda
Dpp
Detail-Preserving Pooling in Deep Networks (CVPR 2018)
Stars: ✭ 99 (-18.85%)
Mutual labels:  cuda
Babelstream
STREAM, for lots of devices written in many programming models
Stars: ✭ 121 (-0.82%)
Mutual labels:  cuda
Futhark
💥💻💥 A data-parallel functional programming language
Stars: ✭ 1,641 (+1245.08%)
Mutual labels:  cuda
Mtensor
A C++ Cuda Tensor Lazy Computing Library
Stars: ✭ 115 (-5.74%)
Mutual labels:  cuda
Dace
DaCe - Data Centric Parallel Programming
Stars: ✭ 106 (-13.11%)
Mutual labels:  cuda
Torch Mesh Isect
Stars: ✭ 107 (-12.3%)
Mutual labels:  cuda
Pytorch spn
Extension package for spatial propagation network in pytorch.
Stars: ✭ 114 (-6.56%)
Mutual labels:  cuda
Cuda Winograd
Fast CUDA Kernels for ResNet Inference.
Stars: ✭ 104 (-14.75%)
Mutual labels:  cuda
Tensorflow Optimized Wheels
TensorFlow wheels built for latest CUDA/CuDNN and enabled performance flags: SSE, AVX, FMA; XLA
Stars: ✭ 118 (-3.28%)
Mutual labels:  cuda
Deepnet
Deep.Net machine learning framework for F#
Stars: ✭ 99 (-18.85%)
Mutual labels:  cuda
Pytorch Unflow
a reimplementation of UnFlow in PyTorch that matches the official TensorFlow version
Stars: ✭ 113 (-7.38%)
Mutual labels:  cuda
Onemkl
oneAPI Math Kernel Library (oneMKL) Interfaces
Stars: ✭ 122 (+0%)
Mutual labels:  cuda
Knn cuda
Fast K-Nearest Neighbor search with GPU
Stars: ✭ 119 (-2.46%)
Mutual labels:  cuda
Cltune
CLTune: An automatic OpenCL & CUDA kernel tuner
Stars: ✭ 114 (-6.56%)
Mutual labels:  cuda

PyPI Downloads

CUDA-Warp RNN-Transducer

A GPU implementation of the RNN Transducer loss (Graves 2012, 2013). The code is ported from the reference implementation by Awni Hannun and fully utilizes the CUDA warp mechanism.

The main bottleneck in the loss is the forward/backward pass, which is based on a dynamic programming algorithm. In particular, a nested loop populates a lattice of shape (T, U), and each value in this lattice depends on the two previous cells, one from each dimension (in the forward pass, for example).
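
To make the recurrence concrete, here is a minimal single-sequence sketch of the forward pass in plain NumPy, following the standard RNN-T recurrence from Graves (2012). The function and variable names are mine, not the package's:

    import numpy as np

    def rnnt_forward(log_probs, labels, blank=0):
        # log_probs: (T, U, V) log_softmax outputs of the joint network
        # for one sequence, with U = len(labels) + 1.
        T, U, _ = log_probs.shape
        alpha = np.full((T, U), -np.inf)
        alpha[0, 0] = 0.0
        for t in range(T):
            for u in range(U):
                if t > 0:  # arrive from (t-1, u) by emitting blank
                    alpha[t, u] = np.logaddexp(
                        alpha[t, u],
                        alpha[t - 1, u] + log_probs[t - 1, u, blank])
                if u > 0:  # arrive from (t, u-1) by emitting label u-1
                    alpha[t, u] = np.logaddexp(
                        alpha[t, u],
                        alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]])
        # the final blank transition terminates every alignment
        return -(alpha[T - 1, U - 1] + log_probs[T - 1, U - 1, blank])

The anti-diagonal structure of these dependencies is what the warp-based kernel exploits below.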

CUDA executes threads in groups of 32 parallel threads called warps. Full efficiency is realized when all 32 threads of a warp agree on their execution path. This is exactly what is exploited to optimize the RNN Transducer: the lattice is split into warps along the T dimension. Within a warp, threads exchange values using fast warp-level operations. As soon as the current warp fills its last value, the next two warps, (t+32, u) and (t, u+1), start running. A schematic of the forward pass is shown in the figure below, where T is the number of frames, U the number of labels, and W the warp size. A similar procedure for the backward pass runs in parallel.
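
The resulting wavefront can be modeled in a few lines of Python (a toy sketch of the scheduling only, not the CUDA code): tile (i, u) covers frames [i*W, i*W + W) at label row u, and it becomes ready once tiles (i-1, u) and (i, u-1) have finished, so all tiles on the same anti-diagonal can run concurrently.

    import math

    W = 32  # CUDA warp size

    def warp_schedule(T, U):
        # step[i][u]: earliest parallel step at which the warp covering
        # frames [i*W, i*W + W) at label row u can start.
        n_tiles = math.ceil(T / W)
        step = [[0] * U for _ in range(n_tiles)]
        for i in range(n_tiles):
            for u in range(U):
                deps = []
                if i > 0:
                    deps.append(step[i - 1][u])  # warp (t - W, u) must finish
                if u > 0:
                    deps.append(step[i][u - 1])  # warp (t, u - 1) must finish
                step[i][u] = max(deps) + 1 if deps else 0
        return step

    # Tiles with the same i + u share a step and run concurrently:
    print(warp_schedule(128, 4))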

Performance

Benchmarked on a GeForce GTX 1080 Ti GPU and an Intel i7-8700 CPU @ 3.20 GHz; N is the batch size.

          warp_rnnt        warprnnt_pytorch   transducer
T=150, U=40, V=28
  N=1     0.07 ms          0.68 ms            1.28 ms
  N=16    0.33 ms          1.80 ms            6.15 ms
  N=32    0.35 ms          3.39 ms            12.72 ms
  N=64    0.56 ms          6.11 ms            23.73 ms
  N=128   0.60 ms          9.22 ms            47.93 ms
T=150, U=20, V=5000
  N=1     0.46 ms          2.14 ms            21.18 ms
  N=16    1.42 ms          21.24 ms           240.11 ms
  N=32    2.51 ms          38.26 ms           490.66 ms
  N=64    out-of-memory    75.54 ms           944.73 ms
  N=128   out-of-memory    out-of-memory      1894.93 ms
T=1500, U=300, V=50
  N=1     0.60 ms          10.77 ms           121.82 ms
  N=16    2.25 ms          97.69 ms           732.50 ms
  N=32    3.97 ms          184.73 ms          1448.54 ms
  N=64    out-of-memory    out-of-memory      2767.59 ms

TODO

  • Fix the original benchmarking methodology, as discussed in issue #9.

Note

  • This implementation assumes that the input has already been normalized with log_softmax (see the usage sketch after this list).

  • In addition to the alphas/betas arrays, a counts array of shape (N, U * 2) is allocated and used as a scheduling mechanism.

  • core_gather.cu is a slightly more memory-efficient version that expects log_probs of shape (N, T, U, 2), containing only the blank and label values.

  • Do not expect this implementation to greatly reduce the training time of an RNN Transducer model. The main bottleneck will likely be the trainable joint network, with its (N, T, U, V) output.

  • There is also a restricted version, called Recurrent Neural Aligner, which assumes that the input sequence is at least as long as the target sequence.
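
A hedged usage sketch tying these notes together. The import path, the rnnt_loss entry point, and its argument names are assumptions about the PyTorch binding, not verified against the package:

    import torch
    import torch.nn.functional as F
    from warp_rnnt import rnnt_loss  # assumed entry point of the PyTorch binding

    N, T, U, V = 4, 150, 41, 28  # batch, frames, len(labels) + 1, vocabulary

    logits = torch.randn(N, T, U, V, device="cuda", requires_grad=True)
    log_probs = F.log_softmax(logits, dim=-1)  # the loss expects log_softmax input
    labels = torch.randint(1, V, (N, U - 1), dtype=torch.int, device="cuda")
    frames_lengths = torch.full((N,), T, dtype=torch.int, device="cuda")
    labels_lengths = torch.full((N,), U - 1, dtype=torch.int, device="cuda")

    loss = rnnt_loss(log_probs, labels, frames_lengths, labels_lengths)
    loss.mean().backward()

For the core_gather.cu variant, the (N, T, U, V) tensor would first be reduced to shape (N, T, U, 2) by gathering, for every lattice cell, the blank column and that row's label column along the last dimension.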

Install

There are two bindings for the core algorithm:
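
Judging by the PyPI downloads badge above, the PyTorch binding is presumably published as a pip package (the package name below is assumed from the project name, not verified):

    pip install warp_rnnt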

Reference

  • Transducer reference implementation by Awni Hannun.
  • Graves, A. (2012). Sequence Transduction with Recurrent Neural Networks. arXiv:1211.3711.
  • Graves, A., Mohamed, A. and Hinton, G. (2013). Speech Recognition with Deep Recurrent Neural Networks. arXiv:1303.5778.