1ytic / Warp Rnnt

License: MIT
CUDA-Warp RNN-Transducer

Programming Languages

python

Projects that are alternatives to or similar to Warp Rnnt

Pygraphistry
PyGraphistry is a Python library to quickly load, shape, embed, and explore big graphs with the GPU-accelerated Graphistry visual graph analyzer
Stars: ✭ 1,365 (+1018.85%)
Mutual labels:  cuda
Adacof Pytorch
Official source code for our paper "AdaCoF: Adaptive Collaboration of Flows for Video Frame Interpolation" (CVPR 2020)
Stars: ✭ 110 (-9.84%)
Mutual labels:  cuda
Spoc
Stream Processing with OCaml
Stars: ✭ 115 (-5.74%)
Mutual labels:  cuda
Chamferdistancepytorch
Chamfer Distance in Pytorch with f-score
Stars: ✭ 105 (-13.93%)
Mutual labels:  cuda
Cuhe
CUDA Homomorphic Encryption Library
Stars: ✭ 109 (-10.66%)
Mutual labels:  cuda
Tensorflow Object Detection Tutorial
The purpose of this tutorial is to learn how to install and prepare TensorFlow framework to train your own convolutional neural network object detection classifier for multiple objects, starting from scratch
Stars: ✭ 113 (-7.38%)
Mutual labels:  cuda
Dpp
Detail-Preserving Pooling in Deep Networks (CVPR 2018)
Stars: ✭ 99 (-18.85%)
Mutual labels:  cuda
Babelstream
STREAM, for lots of devices written in many programming models
Stars: ✭ 121 (-0.82%)
Mutual labels:  cuda
Futhark
💥💻💥 A data-parallel functional programming language
Stars: ✭ 1,641 (+1245.08%)
Mutual labels:  cuda
Mtensor
A C++ Cuda Tensor Lazy Computing Library
Stars: ✭ 115 (-5.74%)
Mutual labels:  cuda
Dace
DaCe - Data Centric Parallel Programming
Stars: ✭ 106 (-13.11%)
Mutual labels:  cuda
Torch Mesh Isect
Stars: ✭ 107 (-12.3%)
Mutual labels:  cuda
Pytorch spn
Extension package for spatial propagation network in pytorch.
Stars: ✭ 114 (-6.56%)
Mutual labels:  cuda
Cuda Winograd
Fast CUDA Kernels for ResNet Inference.
Stars: ✭ 104 (-14.75%)
Mutual labels:  cuda
Tensorflow Optimized Wheels
TensorFlow wheels built for latest CUDA/CuDNN and enabled performance flags: SSE, AVX, FMA; XLA
Stars: ✭ 118 (-3.28%)
Mutual labels:  cuda
Deepnet
Deep.Net machine learning framework for F#
Stars: ✭ 99 (-18.85%)
Mutual labels:  cuda
Pytorch Unflow
a reimplementation of UnFlow in PyTorch that matches the official TensorFlow version
Stars: ✭ 113 (-7.38%)
Mutual labels:  cuda
Onemkl
oneAPI Math Kernel Library (oneMKL) Interfaces
Stars: ✭ 122 (+0%)
Mutual labels:  cuda
Knn cuda
Fast K-Nearest Neighbor search with GPU
Stars: ✭ 119 (-2.46%)
Mutual labels:  cuda
Cltune
CLTune: An automatic OpenCL & CUDA kernel tuner
Stars: ✭ 114 (-6.56%)
Mutual labels:  cuda

PyPI Downloads

CUDA-Warp RNN-Transducer

A GPU implementation of the RNN Transducer loss (Graves 2012, 2013). The code is ported from the reference implementation by Awni Hannun and fully utilizes the CUDA warp mechanism.

The main bottleneck in the loss is the forward/backward pass, which is based on a dynamic programming algorithm. In particular, a nested loop populates a lattice of shape (T, U), and each value in this lattice depends on the two previous cells, one from each dimension (in the forward pass, for example).
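
To make the recurrence concrete, here is a minimal single-sequence sketch of the forward pass in plain NumPy, following the standard RNN-T recurrence from Graves (2012). The function and variable names are mine, not the package's:

    import numpy as np

    def rnnt_forward(log_probs, labels, blank=0):
        # log_probs: (T, U, V) log_softmax outputs of the joint network
        # for one sequence, with U = len(labels) + 1.
        T, U, _ = log_probs.shape
        alpha = np.full((T, U), -np.inf)
        alpha[0, 0] = 0.0
        for t in range(T):
            for u in range(U):
                if t > 0:  # arrive from (t-1, u) by emitting blank
                    alpha[t, u] = np.logaddexp(
                        alpha[t, u],
                        alpha[t - 1, u] + log_probs[t - 1, u, blank])
                if u > 0:  # arrive from (t, u-1) by emitting label u-1
                    alpha[t, u] = np.logaddexp(
                        alpha[t, u],
                        alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]])
        # the final blank transition terminates every alignment
        return -(alpha[T - 1, U - 1] + log_probs[T - 1, U - 1, blank])

The anti-diagonal structure of these dependencies is what the warp-based kernel exploits below.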

CUDA executes threads in groups of 32 parallel threads called warps. Full efficiency is realized when all 32 threads of a warp agree on their execution path. This is exactly what is exploited to optimize the RNN Transducer: the lattice is split into warps along the T dimension. Within a warp, threads exchange values using fast warp-level operations. As soon as the current warp fills its last value, the next two warps, (t+32, u) and (t, u+1), start running. A schematic of the forward pass is shown in the figure below, where T is the number of frames, U the number of labels, and W the warp size. A similar procedure for the backward pass runs in parallel.
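
The resulting wavefront can be modeled in a few lines of Python (a toy sketch of the scheduling only, not the CUDA code): tile (i, u) covers frames [i*W, i*W + W) at label row u, and it becomes ready once tiles (i-1, u) and (i, u-1) have finished, so all tiles on the same anti-diagonal can run concurrently.

    import math

    W = 32  # CUDA warp size

    def warp_schedule(T, U):
        # step[i][u]: earliest parallel step at which the warp covering
        # frames [i*W, i*W + W) at label row u can start.
        n_tiles = math.ceil(T / W)
        step = [[0] * U for _ in range(n_tiles)]
        for i in range(n_tiles):
            for u in range(U):
                deps = []
                if i > 0:
                    deps.append(step[i - 1][u])  # warp (t - W, u) must finish
                if u > 0:
                    deps.append(step[i][u - 1])  # warp (t, u - 1) must finish
                step[i][u] = max(deps) + 1 if deps else 0
        return step

    # Tiles with the same i + u share a step and run concurrently:
    print(warp_schedule(128, 4))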

Performance

Benchmarked on a GeForce GTX 1080 Ti GPU and an Intel i7-8700 CPU @ 3.20 GHz; N is the batch size.

          warp_rnnt        warprnnt_pytorch   transducer
T=150, U=40, V=28
  N=1     0.07 ms          0.68 ms            1.28 ms
  N=16    0.33 ms          1.80 ms            6.15 ms
  N=32    0.35 ms          3.39 ms            12.72 ms
  N=64    0.56 ms          6.11 ms            23.73 ms
  N=128   0.60 ms          9.22 ms            47.93 ms
T=150, U=20, V=5000
  N=1     0.46 ms          2.14 ms            21.18 ms
  N=16    1.42 ms          21.24 ms           240.11 ms
  N=32    2.51 ms          38.26 ms           490.66 ms
  N=64    out-of-memory    75.54 ms           944.73 ms
  N=128   out-of-memory    out-of-memory      1894.93 ms
T=1500, U=300, V=50
  N=1     0.60 ms          10.77 ms           121.82 ms
  N=16    2.25 ms          97.69 ms           732.50 ms
  N=32    3.97 ms          184.73 ms          1448.54 ms
  N=64    out-of-memory    out-of-memory      2767.59 ms

TODO

  • Fix the original benchmarking methodology, as discussed in issue #9.

Note

  • This implementation assumes that the input has already been normalized with log_softmax (see the usage sketch after this list).

  • In addition to the alphas/betas arrays, a counts array of shape (N, U * 2) is allocated and used as a scheduling mechanism.

  • core_gather.cu is a slightly more memory-efficient version that expects log_probs of shape (N, T, U, 2), containing only the blank and label values.

  • Do not expect this implementation to greatly reduce the training time of an RNN Transducer model. The main bottleneck will likely be the trainable joint network, with its (N, T, U, V) output.

  • There is also a restricted version, called Recurrent Neural Aligner, which assumes that the input sequence is at least as long as the target sequence.
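
A hedged usage sketch tying these notes together. The import path, the rnnt_loss entry point, and its argument names are assumptions about the PyTorch binding, not verified against the package:

    import torch
    import torch.nn.functional as F
    from warp_rnnt import rnnt_loss  # assumed entry point of the PyTorch binding

    N, T, U, V = 4, 150, 41, 28  # batch, frames, len(labels) + 1, vocabulary

    logits = torch.randn(N, T, U, V, device="cuda", requires_grad=True)
    log_probs = F.log_softmax(logits, dim=-1)  # the loss expects log_softmax input
    labels = torch.randint(1, V, (N, U - 1), dtype=torch.int, device="cuda")
    frames_lengths = torch.full((N,), T, dtype=torch.int, device="cuda")
    labels_lengths = torch.full((N,), U - 1, dtype=torch.int, device="cuda")

    loss = rnnt_loss(log_probs, labels, frames_lengths, labels_lengths)
    loss.mean().backward()

For the core_gather.cu variant, the (N, T, U, V) tensor would first be reduced to shape (N, T, U, 2) by gathering, for every lattice cell, the blank column and that row's label column along the last dimension.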

Install

There are two bindings for the core algorithm:
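
Judging by the PyPI downloads badge above, the PyTorch binding is presumably published as a pip package (the package name below is assumed from the project name, not verified):

    pip install warp_rnnt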

Reference

  • Transducer reference implementation by Awni Hannun.
  • Graves, A. (2012). Sequence Transduction with Recurrent Neural Networks. arXiv:1211.3711.
  • Graves, A., Mohamed, A. and Hinton, G. (2013). Speech Recognition with Deep Recurrent Neural Networks. arXiv:1303.5778.