szagoruyko / Pyinn

License: MIT
CuPy fused PyTorch neural networks ops

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Pyinn

Speedtorch
Library for faster pinned CPU <-> GPU transfer in Pytorch
Stars: ✭ 615 (+132.08%)
Mutual labels:  cupy
Adacof Pytorch
Official source code for our paper "AdaCoF: Adaptive Collaboration of Flows for Video Frame Interpolation" (CVPR 2020)
Stars: ✭ 110 (-58.49%)
Mutual labels:  cupy
Einops
Deep learning operations reinvented (for pytorch, tensorflow, jax and others)
Stars: ✭ 4,022 (+1417.74%)
Mutual labels:  cupy
Sepconv Slomo
an implementation of Video Frame Interpolation via Adaptive Separable Convolution using PyTorch
Stars: ✭ 918 (+246.42%)
Mutual labels:  cupy
Chainercv
ChainerCV: a Library for Deep Learning in Computer Vision
Stars: ✭ 1,463 (+452.08%)
Mutual labels:  cupy
Pyhpc Benchmarks
A suite of benchmarks to test the sequential CPU and GPU performance of most popular high-performance libraries for Python.
Stars: ✭ 119 (-55.09%)
Mutual labels:  cupy
Pytorch Pwc
a reimplementation of PWC-Net in PyTorch that matches the official Caffe version
Stars: ✭ 402 (+51.7%)
Mutual labels:  cupy
revisiting-sepconv
an implementation of Revisiting Adaptive Convolutions for Video Frame Interpolation using PyTorch
Stars: ✭ 43 (-83.77%)
Mutual labels:  cupy
Nvidia Gpu Tensor Core Accelerator Pytorch Opencv
A complete machine vision container that includes Jupyter notebooks with built-in code hinting, Anaconda, CUDA-X, TensorRT inference accelerator for Tensor cores, CuPy (GPU drop in replacement for Numpy), PyTorch, TF2, Tensorboard, and OpenCV for accelerated workloads on NVIDIA Tensor cores and GPUs.
Stars: ✭ 110 (-58.49%)
Mutual labels:  cupy
Softmax Splatting
an implementation of softmax splatting for differentiable forward warping using PyTorch
Stars: ✭ 218 (-17.74%)
Mutual labels:  cupy
Tensorly
TensorLy: Tensor Learning in Python.
Stars: ✭ 977 (+268.68%)
Mutual labels:  cupy
Pynvvl
A Python wrapper of NVIDIA Video Loader (NVVL) with CuPy for fast video loading with Python
Stars: ✭ 95 (-64.15%)
Mutual labels:  cupy
Spanet
Spatial Attentive Single-Image Deraining with a High Quality Real Rain Dataset (CVPR'19)
Stars: ✭ 136 (-48.68%)
Mutual labels:  cupy
Chainer
A flexible framework of neural networks for deep learning
Stars: ✭ 5,656 (+2034.34%)
Mutual labels:  cupy
Phase-based-Frame-Interpolation
Frame interpolation
Stars: ✭ 28 (-89.43%)
Mutual labels:  cupy
Cupy
NumPy & SciPy for GPU
Stars: ✭ 5,625 (+2022.64%)
Mutual labels:  cupy
Pytorch Unflow
a reimplementation of UnFlow in PyTorch that matches the official TensorFlow version
Stars: ✭ 113 (-57.36%)
Mutual labels:  cupy
waifu2x-chainer
Chainer implementation of waifu2x
Stars: ✭ 137 (-48.3%)
Mutual labels:  cupy
roi pooling
ROIPooling for pytorch
Stars: ✭ 50 (-81.13%)
Mutual labels:  cupy
Mobulaop
A Simple & Flexible Cross Framework Operators Toolkit
Stars: ✭ 161 (-39.25%)
Mutual labels:  cupy

PyINN

CuPy implementations of fused PyTorch ops.

PyTorch version of imagine-nn

The purpose of this package is to contain CUDA ops written in Python with CuPy, which is not a PyTorch dependency.

An alternative to CuPy would be https://github.com/pytorch/extension-ffi, but it requires a lot of wrapping code like https://github.com/sniklaus/pytorch-extension, so it doesn't really suit quick prototyping.

Another advantage of CuPy over C code is that the dimensions of each op are known at JIT time, so the compiled kernels can potentially be faster. Also, the first version of the package used PyCUDA, but PyCUDA cannot work with PyTorch multi-GPU.
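As an illustration of that point, here is a minimal sketch (not pyinn's actual kernels; the helper name is made up and CuPy's RawKernel interface is used for brevity) of how a size can be baked into the kernel source before compilation, so the JIT compiler sees it as a constant:

import cupy as cp

def make_scale_kernel(n):
    # The length n is substituted into the source, so it is a
    # compile-time constant for the JIT compiler rather than a
    # runtime argument.
    source = '''
    extern "C" __global__ void scale(float* x, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < %d) x[i] *= a;
    }
    ''' % n
    return cp.RawKernel(source, 'scale')

x = cp.arange(1024, dtype=cp.float32)
kernel = make_scale_kernel(x.size)
kernel(((x.size + 255) // 256,), (256,), (x, cp.float32(2.0)))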

On a Maxwell Titan X, MobileNets with pyinn.conv2d_depthwise were originally 2.6x faster than with F.conv2d (see benchmark.py).

This is no longer the case: with its new kernels, PyTorch 0.3.0 is now ~20% faster than pyinn.

Installation

pip install git+https://github.com/szagoruyko/[email protected]

Example

import torch
from torch.autograd import Variable
import pyinn as P
x = Variable(torch.randn(1,4,5,5).cuda())
w = Variable(torch.randn(4,1,3,3).cuda())
y = P.conv2d_depthwise(x, w, padding=1)

or with modules interface:

from pyinn.modules import Conv2dDepthwise
module = Conv2dDepthwise(channels=4, kernel_size=3, padding=1).cuda()
y = module(x)

Documentation

conv2d_depthwise

Implements depthwise convolution as in https://arxiv.org/abs/1704.04861 MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

CUDA kernels from https://github.com/BVLC/caffe/pull/5665

CPU side is done by F.conv2d.

Equivalent to:

F.conv2d(input, weight, groups=input.size(1))

Inputs and arguments are the same as for F.conv2d.
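For a quick sanity check of that equivalence (a hedged sketch reusing the example tensors from above; requires a CUDA device):

import torch
import torch.nn.functional as F
from torch.autograd import Variable
import pyinn as P

x = Variable(torch.randn(1, 4, 5, 5).cuda())
w = Variable(torch.randn(4, 1, 3, 3).cuda())
y_pyinn = P.conv2d_depthwise(x, w, padding=1)         # fused CuPy kernel
y_ref = F.conv2d(x, w, padding=1, groups=x.size(1))   # reference grouped convolution
print((y_pyinn - y_ref).data.abs().max())             # should be ~0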

dgmm

Multiplication with a diagonal matrix.

Uses the CUDA dgmm function, which is sometimes faster than expand.

In torch functions this corresponds to input.mm(x.diag()). Both left and right multiplications are supported.

Args:
    input: 2D tensor
    x: 1D tensor
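A hedged usage sketch, assuming the positional signature implied by the arguments above:

import torch
from torch.autograd import Variable
import pyinn as P

input = Variable(torch.randn(16, 8).cuda())   # 2D tensor
x = Variable(torch.randn(8).cuda())           # 1D tensor holding the diagonal
y = P.dgmm(input, x)                          # right multiplication by diag(x)
y_ref = input.mm(x.diag())                    # reference result
print((y - y_ref).data.abs().max())           # should be ~0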

cdgmm

Complex multiplication with a diagonal matrix.

Does input.mm(x.diag()) where input and x are complex.

Args:
    input: 3D tensor with last dimension of size 2
    x: 2D tensor with last dimension of size 2
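A hedged usage sketch, assuming the positional signature implied by the arguments above (complex values are stored as pairs in the last dimension):

import torch
from torch.autograd import Variable
import pyinn as P

# a 16x8 complex matrix and 8 complex diagonal entries,
# each stored as (real, imag) pairs in the last dimension
input = Variable(torch.randn(16, 8, 2).cuda())
x = Variable(torch.randn(8, 2).cuda())
y = P.cdgmm(input, x)   # complex input.mm(x.diag()), result of shape 16x8x2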

NCReLU

Applies NCReLU (negative concatenated ReLU) nonlinearity.

Does torch.cat([x.clamp(min=0), x.clamp(max=0)], dim=1) in a single fused op.

Used in https://arxiv.org/abs/1706.00388 DiracNets: Training Very Deep Neural Networks Without Skip-Connections

Args:
    input: 4D tensor
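A hedged sketch of the equivalence, assuming the op is exposed as P.ncrelu:

import torch
from torch.autograd import Variable
import pyinn as P

x = Variable(torch.randn(2, 4, 5, 5).cuda())
y = P.ncrelu(x)                                             # fused op, shape 2x8x5x5
y_ref = torch.cat([x.clamp(min=0), x.clamp(max=0)], dim=1)  # reference
print((y - y_ref).data.abs().max())                         # should be 0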

im2col and col2im

Rearrange image blocks into columns.

The representation is used to perform GEMM-based convolution.

Output is 5D (or 6D in case of minibatch) tensor.

The minibatch implementation is inefficient and could be done in a single CUDA kernel.
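A hedged usage sketch; the exact argument format (scalars vs. per-dimension pairs for kernel size, stride, and padding) should be checked against the source:

import torch
from torch.autograd import Variable
import pyinn as P

x = Variable(torch.randn(3, 8, 8).cuda())     # single image: C x H x W
cols = P.im2col(x, [3, 3], [1, 1], [1, 1])    # 5D result: C x kH x kW x oH x oW
y = P.col2im(cols, [3, 3], [1, 1], [1, 1])    # sums overlapping patches back to C x H x W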
