
Separius / Awesome Fast Attention

License: GPL-3.0
list of efficient attention modules

Programming Languages

Python
139,335 projects - #7 most used programming language

Projects that are alternatives to or similar to Awesome Fast Attention

transformer
A PyTorch Implementation of "Attention Is All You Need"
Stars: ✭ 28 (-95.53%)
Mutual labels:  transformer, attention, attention-is-all-you-need
Pytorch Original Transformer
My implementation of the original transformer model (Vaswani et al.). I've additionally included the playground.py file for visualizing otherwise seemingly hard concepts. Pretrained IWSLT models are currently included.
Stars: ✭ 411 (-34.45%)
Mutual labels:  attention, attention-is-all-you-need, transformer
Speech Transformer
A PyTorch implementation of Speech Transformer, an End-to-End ASR with Transformer network on Mandarin Chinese.
Stars: ✭ 565 (-9.89%)
Mutual labels:  attention, attention-is-all-you-need, transformer
Visual-Transformer-Paper-Summary
Summary of Transformer applications for computer vision tasks.
Stars: ✭ 51 (-91.87%)
Mutual labels:  transformer, attention
visualization
a collection of visualization functions
Stars: ✭ 189 (-69.86%)
Mutual labels:  transformer, attention
transformer
Neutron: A pytorch based implementation of Transformer and its variants.
Stars: ✭ 60 (-90.43%)
Mutual labels:  transformer, attention-is-all-you-need
Keras Transformer
Transformer implemented in Keras
Stars: ✭ 273 (-56.46%)
Mutual labels:  attention, transformer
attention-is-all-you-need-paper
Implementation of Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
Stars: ✭ 97 (-84.53%)
Mutual labels:  transformer, attention-is-all-you-need
Dab
Data Augmentation by Backtranslation (DAB) ヽ( •_-)ᕗ
Stars: ✭ 294 (-53.11%)
Mutual labels:  attention-is-all-you-need, transformer
Text Classification Models Pytorch
Implementation of State-of-the-art Text Classification Models in Pytorch
Stars: ✭ 379 (-39.55%)
Mutual labels:  attention, transformer
Transformer
A TensorFlow Implementation of the Transformer: Attention Is All You Need
Stars: ✭ 3,646 (+481.5%)
Mutual labels:  attention-is-all-you-need, transformer
Attention Is All You Need Pytorch
A PyTorch implementation of the Transformer model in "Attention is All You Need".
Stars: ✭ 6,070 (+868.1%)
Mutual labels:  attention, attention-is-all-you-need
Relation-Extraction-Transformer
NLP: Relation extraction with position-aware self-attention transformer
Stars: ✭ 63 (-89.95%)
Mutual labels:  transformer, attention
Nmt Keras
Neural Machine Translation with Keras
Stars: ✭ 501 (-20.1%)
Mutual labels:  attention-is-all-you-need, transformer
speech-transformer
Transformer implementation specialized in speech recognition tasks, using PyTorch.
Stars: ✭ 40 (-93.62%)
Mutual labels:  transformer, attention-is-all-you-need
ai challenger 2018 sentiment analysis
Fine-grained Sentiment Analysis of User Reviews --- AI CHALLENGER 2018
Stars: ✭ 16 (-97.45%)
Mutual labels:  transformer, attention
learningspoons
nlp lecture-notes and source code
Stars: ✭ 29 (-95.37%)
Mutual labels:  transformer, attention
CrabNet
Predict materials properties using only the composition information!
Stars: ✭ 57 (-90.91%)
Mutual labels:  transformer, attention
Transformer Tensorflow
TensorFlow implementation of 'Attention Is All You Need (2017. 6)'
Stars: ✭ 319 (-49.12%)
Mutual labels:  attention, transformer
Nlp Tutorials
Simple implementations of NLP models. Tutorials are written in Chinese on my website https://mofanpy.com
Stars: ✭ 394 (-37.16%)
Mutual labels:  attention, transformer

awesome-fast-attention

A curated list of efficient attention modules (last update: Wed, 10 Mar 2021 23:52:22 +0000)

Table of Contents

Efficient Attention
Articles/Surveys/Benchmarks

Efficient Attention

Each entry lists the paper (with its citation count), the implementation, and a ✔️ if the method can be used autoregressively; its main idea follows on the second line.

Generating Wikipedia by Summarizing Long Sequences (282) | memory-compressed-attention | ✔️
Main idea: compresses keys and values + blocked attention (sketched after this list)

CBAM: Convolutional Block Attention Module (999+) | attention-module
Main idea: combines SE attention with a per-pixel (local) weight

Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks (16) | set_transformer
Main idea: uses K relay nodes

CCNet: Criss-Cross Attention for Semantic Segmentation (296) | CCNet
Main idea: each pixel attends to its row and column simultaneously

Efficient Attention: Attention with Linear Complexities (16) | efficient-attention
Main idea: Softmax(Q)*(Softmax(K^T)*V) (sketched after this list)

Star-Transformer (40) | fastNLP
Main idea: uses a relay (global) node and attends to/from that node

GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond (199) | GCNet
Main idea: squeeze-and-excitation with attention pooling (instead of a GAP)

Generating Long Sequences with Sparse Transformers (257) | DeepSpeed | ✔️
Main idea: sparse block-based attention

SCRAM: Spatially Coherent Randomized Attention Maps (1) | - | ✔️
Main idea: uses PatchMatch to find close keys

Interlaced Sparse Self-Attention for Semantic Segmentation (24) | IN_PAPER | ✔️
Main idea: combination of short-range and then long-range (dilated) attention

Permutohedral Attention Module for Efficient Non-Local Neural Networks (3) | Permutohedral_attention_module
Main idea: uses a permutohedral-lattice approximation algorithm to approximate the attention output

Large Memory Layers with Product Keys (43) | XLM | ✔️
Main idea: searches for nearest-neighbor keys

Expectation-Maximization Attention Networks for Semantic Segmentation (79) | EMANet
Main idea: applies expectation maximization to cluster keys into k clusters

BP-Transformer: Modelling Long-Range Context via Binary Partitioning (15) | BPT | ✔️
Main idea: attends to distant tokens coarsely and to close tokens in a more fine-grained manner

Compressive Transformers for Long-Range Sequence Modelling (48) | compressive-transformer-pytorch | ✔️
Main idea: compresses distant tokens instead of just stop_grad()-ing them; a more efficient version of Transformer-XL

Axial Attention in Multidimensional Transformers (36) | axial-attention | ✔️
Main idea: applies attention on each axis separately (sketched after this list)

Reformer: The Efficient Transformer (216) | trax | ✔️
Main idea: uses LSH to find close keys

Sparse Sinkhorn Attention (16) | sinkhorn-transformer | ✔️
Main idea: uses a cost matrix to limit attention between buckets

Transformer on a Diet (2) | transformer-on-diet | ✔️
Main idea: dilated transformer, similar to WaveNet

Time-aware Large Kernel Convolutions (9) | TaLKConvolutions | ✔️
Main idea: calculates the mean over a dynamic subsequence around each token with the help of a summed-area table

SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection (2) | - | ✔️
Main idea: learns the q, k connections, i.e. dynamically creates a sparse attention matrix

Efficient Content-Based Sparse Attention with Routing Transformers (38) | routing-transformer | ✔️
Main idea: computes attention with same-cluster tokens (computed by online k-means)

Neural Architecture Search for Lightweight Non-Local Networks (11) | AutoNL
Main idea: computes Q(KV) and also downsamples q, k, v in both the spatial and channel dimensions

Longformer: The Long-Document Transformer (159) | longformer | ✔️
Main idea: global + blocked attention (the mask pattern is sketched after this list)

ETC: Encoding Long and Structured Inputs in Transformers (16) | -
Main idea: combines global attention (Star-Transformer with multiple global tokens) with local attention

Multi-scale Transformer Language Models (2) | IN_PAPER | ✔️
Main idea: UNet-like + retina attention; something close to BP-Transformer

Synthesizer: Rethinking Self-Attention in Transformer Models (26) | Synthesizer-Rethinking-Self-Attention-Transformer-Models | ✔️
Main idea: does not compute pairwise interactions

Jukebox: A Generative Model for Music (45) | jukebox | ✔️
Main idea: better attention patterns from Sparse Transformer

Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers (0) | - | ✔️
Main idea: does not compute pairwise interactions and uses fixed mask patterns

GMAT: Global Memory Augmentation for Transformers (2) | gmat
Main idea: adds global tokens

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (45) | fast-transformers | ✔️
Main idea: uses phi(q)(phi(k)v) and also improves the sequential sampling step (sketched after this list)

Linformer: Self-Attention with Linear Complexity (47) | linformer-pytorch
Main idea: projects the (n × d) key and value matrices down to (k × d) (sketched after this list)

Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers (8) | google-research | ✔️
Main idea: calculates an unbiased stochastic approximation of the attention matrix

Kronecker Attention Networks (1) | kronecker-attention-pytorch
Main idea: uses horizontal and lateral average matrices

Real-time Semantic Segmentation with Fast Attention (5) | -
Main idea: l2_norm(q)*(l2_norm(k)*v)

Fast Transformers with Clustered Attention (6) | fast-transformers
Main idea: groups queries together with LSH

Big Bird: Transformers for Longer Sequences (60) | DeepSpeed
Main idea: ETC with random connections

Tensor Low-Rank Reconstruction for Semantic Segmentation (3) | -
Main idea: decomposes the full attention tensor into rank-one tensors (CP decomposition)

Looking for change? Roll the Dice and demand Attention (0) | IN_PAPER
Main idea: uses the fractal Tanimoto similarity to compare queries with keys inside the attention module

Rethinking Attention with Performers (30) | google-research | ✔️
Main idea: unbiased approximation of the attention matrix with a softmax kernel

Memformer: The Memory-Augmented Transformer (0) | memformer | ✔️
Main idea: attends to memory slots + Memory-Replay BackPropagation

SMYRF: Efficient Attention using Asymmetric Clustering (1) | smyrf
Main idea: LSH with balanced clusters

Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting (0) | Informer2020 | ✔️
Main idea: sparse attention + funnel-like encoder

Sub-Linear Memory: How to Make Performers SLiM (0) | google-research | ✔️
Main idea: Performer, but with sub-linear memory usage

Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention (0) | Nystromformer
Main idea: uses the Nyström method to approximate the attention matrix

Linear Transformers Are Secretly Fast Weight Memory Systems (0) | fast-weight-transformers | ✔️
Main idea: shows that linear transformers are basically fast-weight networks and proposes a new kernel function to linearise attention, balancing simplicity and effectiveness

LambdaNetworks: Modeling Long-Range Interactions Without Attention (6) | lambda-networks | ✔️
Main idea: generates a linear layer based on context + decouples position/context

Random Feature Attention (2) | - | ✔️
Main idea: kernel approximation, plus the transformers-are-RNNs view

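The sketches below illustrate a few of the main ideas from the list. All are simplified, single-head PyTorch versions written for this summary, not the papers' reference implementations. First, the key/value compression from Generating Wikipedia by Summarizing Long Sequences: a strided 1-D convolution (the stride of 4 here is an arbitrary choice) shortens K and V before ordinary attention; the blocked-attention part and the causal-masking details are omitted.

```python
import torch

def memory_compressed_attention(q, k, v, compress):
    """Attention in which K and V are first shortened along the sequence axis.

    q, k, v: (batch, seq_len, dim).  `compress` maps (batch, dim, seq_len)
    to (batch, dim, seq_len // stride), e.g. a strided Conv1d.
    """
    k_c = compress(k.transpose(1, 2)).transpose(1, 2)   # (batch, seq//stride, dim)
    v_c = compress(v.transpose(1, 2)).transpose(1, 2)
    scores = q @ k_c.transpose(1, 2) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v_c          # (batch, seq_len, dim)

# Hypothetical sizes: compress a 512-token memory by a factor of 4.
compress = torch.nn.Conv1d(64, 64, kernel_size=4, stride=4)
q, k, v = (torch.randn(2, 512, 64) for _ in range(3))
out = memory_compressed_attention(q, k, v, compress)    # (2, 512, 64)
```
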
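A minimal sketch of the factorization behind Efficient Attention: Attention with Linear Complexities: softmax is applied to Q over the feature axis and to K over the sequence axis, and K^T V is formed first, so no n × n matrix is ever materialized. Real-time Semantic Segmentation with Fast Attention follows the same pattern with L2 normalization in place of the softmaxes.

```python
import torch

def efficient_attention(q, k, v):
    """softmax(Q) @ (softmax(K)^T @ V), evaluated right to left.

    q, k: (batch, n, d_k), v: (batch, n, d_v).  Cost is O(n * d_k * d_v)
    rather than the O(n^2) of ordinary dot-product attention.
    """
    q = torch.softmax(q, dim=-1)        # normalize each query over the feature axis
    k = torch.softmax(k, dim=1)         # normalize keys over the sequence axis
    context = k.transpose(1, 2) @ v     # (batch, d_k, d_v): independent of n
    return q @ context                  # (batch, n, d_v)

q, k, v = (torch.randn(2, 4096, 64) for _ in range(3))
out = efficient_attention(q, k, v)      # (2, 4096, 64), no 4096 x 4096 matrix formed
```
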
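A minimal sketch of axial attention on a 2-D feature map, as in Axial Attention in Multidimensional Transformers: plain self-attention is run along the width axis and then along the height axis, so each step is quadratic only in one side length. Q/K/V projections, multiple heads, and causal masking are left out.

```python
import torch

def attend(x):
    """Plain softmax self-attention over the second-to-last axis of x (..., length, dim)."""
    scores = x @ x.transpose(-1, -2) / x.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ x

def axial_attention_2d(x):
    """x: (batch, height, width, dim) -- attend along width, then along height."""
    x = attend(x)                                   # row-wise: each (width, dim) slice
    x = attend(x.transpose(1, 2)).transpose(1, 2)   # column-wise: swap H/W, attend, swap back
    return x

x = torch.randn(2, 32, 32, 64)
out = axial_attention_2d(x)   # (2, 32, 32, 64); never builds a (32*32) x (32*32) matrix
```
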
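A sketch of the "global + blocked attention" pattern used by Longformer, shown as a dense boolean mask for clarity; a real implementation computes only the banded and global entries, which is where the savings come from. The window size and the choice of token 0 as the single global token are arbitrary assumptions here.

```python
import torch

def longformer_style_mask(n, window, global_idx):
    """Boolean (n, n) mask: True where attention is allowed.

    Allows a sliding window of +/- `window` positions around each token, plus
    full attention to and from the tokens listed in `global_idx`.
    """
    pos = torch.arange(n)
    mask = (pos[:, None] - pos[None, :]).abs() <= window
    mask[global_idx, :] = True          # global tokens attend everywhere
    mask[:, global_idx] = True          # every token attends to the global tokens
    return mask

def masked_attention(q, k, v, mask):
    scores = q @ k.transpose(1, 2) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

mask = longformer_style_mask(n=512, window=32, global_idx=[0])   # token 0 acts like [CLS]
q, k, v = (torch.randn(2, 512, 64) for _ in range(3))
out = masked_attention(q, k, v, mask)                            # (2, 512, 64)
```
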
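A minimal sketch of the non-causal case of Transformers are RNNs: with the feature map phi(x) = elu(x) + 1, attention becomes phi(Q) (phi(K)^T V) divided by a normalizer, which is linear in sequence length. The paper's causal (RNN-style) formulation and its CUDA kernels are not shown.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """phi(Q) @ (phi(K)^T @ V) with phi(x) = elu(x) + 1, plus a normalizer.

    q, k: (batch, n, d_k), v: (batch, n, d_v); time and memory are linear in n.
    """
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = k.transpose(1, 2) @ v                            # (batch, d_k, d_v)
    z = q @ k.sum(dim=1, keepdim=True).transpose(1, 2)    # (batch, n, 1) normalizer
    return (q @ kv) / (z + eps)

q, k, v = (torch.randn(2, 8192, 64) for _ in range(3))
out = linear_attention(q, k, v)   # (2, 8192, 64)
```
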
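A minimal sketch of the Linformer projection: matrices mapping the length-n key and value sequences down to length k make the score matrix n × k instead of n × n. The random tensors below stand in for the learned E and F projections of the paper.

```python
import torch

def linformer_attention(q, k, v, e_proj, f_proj):
    """Attention with K and V projected from length n down to length k_small.

    q, k, v: (batch, n, d); e_proj, f_proj: (k_small, n) projection matrices.
    """
    k_small = e_proj @ k                                         # (batch, k_small, d)
    v_small = f_proj @ v                                         # (batch, k_small, d)
    scores = q @ k_small.transpose(1, 2) / q.shape[-1] ** 0.5    # (batch, n, k_small)
    return torch.softmax(scores, dim=-1) @ v_small               # (batch, n, d)

n, d, k_small = 4096, 64, 256
e_proj = torch.randn(k_small, n) / n ** 0.5   # stand-ins for the learned E, F projections
f_proj = torch.randn(k_small, n) / n ** 0.5
q, k, v = (torch.randn(2, n, d) for _ in range(3))
out = linformer_attention(q, k, v, e_proj, f_proj)   # (2, 4096, 64)
```
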
Articles/Surveys/Benchmarks
