gcambara / cape

License: MIT license
Continuous Augmented Positional Embeddings (CAPE) implementation for PyTorch

Programming Languages

python

Projects that are alternatives of or similar to cape

Visual-Transformer-Paper-Summary
Summary of Transformer applications for computer vision tasks.
Stars: ✭ 51 (+75.86%)
Mutual labels:  transformer, vit, visual-transformer
LaTeX-OCR
pix2tex: Using a ViT to convert images of equations into LaTeX code.
Stars: ✭ 1,566 (+5300%)
Mutual labels:  transformer, vit
towhee
Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.
Stars: ✭ 821 (+2731.03%)
Mutual labels:  transformer, vit
AdaSpeech
AdaSpeech: Adaptive Text to Speech for Custom Voice
Stars: ✭ 108 (+272.41%)
Mutual labels:  speech, transformer
End2end Asr Pytorch
End-to-End Automatic Speech Recognition on PyTorch
Stars: ✭ 175 (+503.45%)
Mutual labels:  speech, transformer
speech-transformer
Transformer implementation specialized in speech recognition tasks using PyTorch.
Stars: ✭ 40 (+37.93%)
Mutual labels:  speech, transformer
Zero-Shot-TTS
Unofficial Implementation of Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration
Stars: ✭ 33 (+13.79%)
Mutual labels:  speech, transformer
Nlp Experiments In Pytorch
PyTorch repository for text categorization and NER experiments in Turkish and English.
Stars: ✭ 35 (+20.69%)
Mutual labels:  text, transformer
Openasr
A pytorch based end2end speech recognition system.
Stars: ✭ 69 (+137.93%)
Mutual labels:  speech, transformer
Neural sp
End-to-end ASR/LM implementation with PyTorch
Stars: ✭ 408 (+1306.9%)
Mutual labels:  speech, transformer
FNet-pytorch
Unofficial implementation of Google's FNet: Mixing Tokens with Fourier Transforms
Stars: ✭ 204 (+603.45%)
Mutual labels:  text, transformer
Aeneas
aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment)
Stars: ✭ 1,942 (+6596.55%)
Mutual labels:  text, speech
keras-vision-transformer
The Tensorflow, Keras implementation of Swin-Transformer and Swin-UNET
Stars: ✭ 91 (+213.79%)
Mutual labels:  transformer
php-serializer
Serialize PHP variables, including objects, in any format. Supports unserializing them too.
Stars: ✭ 47 (+62.07%)
Mutual labels:  transformer
Awesome-low-level-vision-resources
A curated list of resources for Low-level Vision Tasks
Stars: ✭ 35 (+20.69%)
Mutual labels:  transformer
ventib
📈 Ventib records your voice, transcribes it in realtime, and performs speech pattern analysis to give you objective statistics about how you speak.
Stars: ✭ 43 (+48.28%)
Mutual labels:  speech
svelte-slate
slate svelte view layer
Stars: ✭ 43 (+48.28%)
Mutual labels:  text
clifs
Contrastive Language-Image Forensic Search allows free text searching through videos using OpenAI's machine learning model CLIP
Stars: ✭ 271 (+834.48%)
Mutual labels:  text
pytorch-pcen
PyTorch reimplementation of per-channel energy normalization for audio.
Stars: ✭ 80 (+175.86%)
Mutual labels:  speech
bytekit
A Java utility library for byte operations (not a bytecode library)
Stars: ✭ 40 (+37.93%)
Mutual labels:  transformer

CAPE 🌴

PyTorch implementation of Continuous Augmented Positional Embeddings (CAPE), by Likhomanenko et al. Enhance your Transformer positional embeddings with easy-to-use augmentations!

Setup 🔧

Minimum requirements:

einops>=0.3.2
torch>=1.10.0

Install from source:

git clone https://github.com/gcambara/cape.git
cd cape
pip install --editable ./
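
To verify the install, the following should print the module without errors (a quick sanity check; CAPE1d is the class used throughout this README):

from cape import CAPE1d
print(CAPE1d(d_model=512))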

Usage 📖

CAPE works out of the box with PyTorch's official Transformer implementation. With the default initialization, it behaves identically to sinusoidal positional embeddings, which are summed to your content embeddings:

import torch
from torch import nn
from cape import CAPE1d

pos_emb = CAPE1d(d_model=512)
transformer = nn.Transformer(d_model=512)

x = torch.randn(10, 32, 512) # seq_len, batch_size, n_feats
x = pos_emb(x) # forward sums the positional embedding by default
x = transformer(x, x) # nn.Transformer expects both source and target sequences
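
If you only need an encoder, as in many speech and vision setups, the same pattern works with nn.TransformerEncoder. This is a minimal sketch with arbitrary layer sizes, not part of the original recipe:

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

x = torch.randn(10, 32, 512) # seq_len, batch_size, n_feats
x = encoder(pos_emb(x))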

Alternatively, you can compute the positional embeddings separately and add them to your content embeddings yourself:

x = torch.randn(10, 32, 512) # seq_len, batch_size, n_feats
pos = pos_emb.compute_pos_emb(x) # positional embeddings only, nothing is added to x

scale = 512**0.5 # scale content embeddings by sqrt(d_model) before adding positions
x = (scale * x) + pos
x = transformer(x, x)

Let's see a few examples of CAPE initialization for different modalities, inspired by the original paper experiments.

CAPE for text 🔤

CAPE1d is ready to be applied to text. Keep max_local_shift between 0 and 0.5 to shift local positions without disordering them.

import torch
from cape import CAPE1d

pos_emb = CAPE1d(d_model=512, max_global_shift=5.0, 
                 max_local_shift=0.5, max_global_scaling=1.03, 
                 normalize=False)

x = torch.randn(10, 32, 512) # seq_len, batch_size, n_feats
x = pos_emb(x)

Padding is supported by passing the length of each sample to the forward method with the x_lengths argument. For example, here the original length of the samples is 7, although they have been padded to a sequence length of 10.

x = torch.randn(10, 32, 512) # seq_len, batch_size, n_feats
x_lengths = torch.ones(32)*7
x = pos_emb(x, x_lengths=x_lengths)
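
Samples in a batch can also have different original lengths; x_lengths simply holds one length per sample. A small sketch (the lengths below are arbitrary):

x = torch.randn(10, 32, 512) # seq_len, batch_size, n_feats
x_lengths = torch.randint(low=5, high=11, size=(32,)).float() # one original length per sample
x = pos_emb(x, x_lengths=x_lengths)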

CAPE for audio 🎙️

CAPE1d for audio is applied similarly to text. Use the positions_delta argument to set the separation in seconds between time steps, and x_lengths to indicate sample durations (also in seconds) when there is padding.

For instance, let's consider no padding and the same hop size (30 ms) for every sample in the batch:

# Max global shift is 60 s.
# Max local shift is set to 0.5 to maintain positional order.
# Max global scaling is 1.1, according to WSJ recipe.
# Freq scale is 30 to ensure that 30 ms queries are possible with long audios
import torch
from cape import CAPE1d

pos_emb = CAPE1d(d_model=512, max_global_shift=60.0, 
                 max_local_shift=0.5, max_global_scaling=1.1, 
                 normalize=True, freq_scale=30.0)

x = torch.randn(100, 32, 512) # seq_len, batch_size, n_feats
positions_delta = 0.03 # 30 ms of stride
x = pos_emb(x, positions_delta=positions_delta)

Now, let's imagine that the original duration of all samples is 2.5 s, although they have been padded to 3.0 s. Hop size is 30 ms for every sample in the batch.

x = torch.randn(100, 32, 512) # seq_len, batch_size, n_feats

duration = 2.5
positions_delta = 0.03
x_lengths = torch.ones(32)*duration
x = pos_emb(x, x_lengths=x_lengths, positions_delta=positions_delta)

What if the hop size is different for every sample in the batch? E.g. the first half of the samples has a stride of 30 ms, and the second half a stride of 50 ms.

positions_delta = 0.03
positions_delta = torch.ones(32)*positions_delta
positions_delta[16:] = 0.05
x = pos_emb(x, positions_delta=positions_delta)
print(positions_delta)
# tensor([0.0300, 0.0300, 0.0300, 0.0300, 0.0300, 0.0300, 0.0300, 0.0300, 0.0300,
#         0.0300, 0.0300, 0.0300, 0.0300, 0.0300, 0.0300, 0.0300, 0.0500, 0.0500,
#         0.0500, 0.0500, 0.0500, 0.0500, 0.0500, 0.0500, 0.0500, 0.0500, 0.0500,
#         0.0500, 0.0500, 0.0500, 0.0500, 0.0500])

Lastly, let's consider a rarer case, where the hop size differs across samples in the batch and is not even constant within some samples. E.g. a stride of 30 ms for the first half of the samples and 50 ms for the second half, with the hop size of the very first sample increasing linearly at each time step.

from einops import repeat
positions_delta = 0.03
positions_delta = torch.ones(32)*positions_delta
positions_delta[16:] = 0.05
positions_delta = repeat(positions_delta, 'b -> b new_axis', new_axis=100)
positions_delta[0, :] *= torch.arange(1, 101)
x = pos_emb(x, positions_delta=positions_delta)
print(positions_delta)
# tensor([[0.0300, 0.0600, 0.0900,  ..., 2.9400, 2.9700, 3.0000],
#         [0.0300, 0.0300, 0.0300,  ..., 0.0300, 0.0300, 0.0300],
#         [0.0300, 0.0300, 0.0300,  ..., 0.0300, 0.0300, 0.0300],
#         ...,
#         [0.0500, 0.0500, 0.0500,  ..., 0.0500, 0.0500, 0.0500],
#         [0.0500, 0.0500, 0.0500,  ..., 0.0500, 0.0500, 0.0500],
#         [0.0500, 0.0500, 0.0500,  ..., 0.0500, 0.0500, 0.0500]])

CAPE for ViT 🖼️

CAPE2d embeds positions in image patches. Scaling of positions to [-1, 1] is done within the module, whether patches are square or non-square. Thus, set max_local_shift between 0 and 0.5, and the scale of local shifts will be adjusted according to the height and width of patches. Beyond values of 0.5 the order of positions might be altered; do this at your own risk!

import torch
from cape import CAPE2d

pos_emb = CAPE2d(d_model=512, max_global_shift=0.5, 
                 max_local_shift=0.5, max_global_scaling=1.4)

# Case 1: square patches
x = torch.randn(16, 16, 32, 512) # height, width, batch_size, n_feats
x = pos_emb(x)

# Case 2: non-square patches
x = torch.randn(24, 16, 32, 512) # height, width, batch_size, n_feats
x = pos_emb(x)
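
To feed the result into a standard Transformer encoder, the patch grid can be flattened into a sequence. A minimal sketch, assuming you keep the (height, width, batch_size, n_feats) layout shown above (layer sizes are arbitrary):

from torch import nn
from einops import rearrange

x = torch.randn(16, 16, 32, 512) # height, width, batch_size, n_feats
x = pos_emb(x)
x = rearrange(x, 'h w b f -> (h w) b f') # flatten the patch grid into a sequence
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
x = encoder(x) # (seq_len, batch_size, n_feats)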

Citation ✍️

I wrote this PyTorch implementation following the paper's Python code and the Flashlight recipe in C++. All the credit goes to the original authors; please cite them if you use this in your research:

@inproceedings{likhomanenko2021cape,
  title={{CAPE}: Encoding Relative Positions with Continuous Augmented Positional Embeddings},
  author={Tatiana Likhomanenko and Qiantong Xu and Gabriel Synnaeve and Ronan Collobert and Alex Rogozhnikov},
  booktitle={Thirty-Fifth Conference on Neural Information Processing Systems},
  year={2021},
  url={https://openreview.net/forum?id=n-FqqWXnWW}
}

Acknowledgments 🙏

Many thanks to the paper's authors for reviewing the code and clarifying doubts about the paper and the implementation. :)
