
lucidrains / charformer-pytorch

License: MIT
Implementation of the GBST block from the Charformer paper, in Pytorch

Programming Languages

python

Projects that are alternatives to or similar to charformer-pytorch

zero-administration-inference-with-aws-lambda-for-hugging-face
Zero administration inference with AWS Lambda for 🤗
Stars: ✭ 19 (-74.32%)
Mutual labels:  transformer
pynmt
a simple and complete pytorch implementation of neural machine translation system
Stars: ✭ 13 (-82.43%)
Mutual labels:  transformer
laravel-mutate
Mutate Laravel attributes
Stars: ✭ 13 (-82.43%)
Mutual labels:  transformer
SOLQ
"SOLQ: Segmenting Objects by Learning Queries", SOLQ is an end-to-end instance segmentation framework with Transformer.
Stars: ✭ 159 (+114.86%)
Mutual labels:  transformer
laravel5-jsonapi-dingo
Laravel5 JSONAPI and Dingo together to build APIs fast
Stars: ✭ 29 (-60.81%)
Mutual labels:  transformer
uax29
A tokenizer based on Unicode text segmentation (UAX 29), for Go
Stars: ✭ 26 (-64.86%)
Mutual labels:  tokenization
amrlib
A python library that makes AMR parsing, generation and visualization simple.
Stars: ✭ 107 (+44.59%)
Mutual labels:  transformer
vgs-collect-ios
VGS Collect iOS SDK
Stars: ✭ 17 (-77.03%)
Mutual labels:  tokenization
semantic-document-relations
Implementation, trained models and result data for the paper "Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles"
Stars: ✭ 21 (-71.62%)
Mutual labels:  transformer
PAML
Personalizing Dialogue Agents via Meta-Learning
Stars: ✭ 114 (+54.05%)
Mutual labels:  transformer
kosr
Korean speech recognition based on transformer (트랜스포머 기반 한국어 음성 인식)
Stars: ✭ 25 (-66.22%)
Mutual labels:  transformer
are-16-heads-really-better-than-1
Code for the paper "Are Sixteen Heads Really Better than One?"
Stars: ✭ 128 (+72.97%)
Mutual labels:  transformer
attention-is-all-you-need-paper
Implementation of Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
Stars: ✭ 97 (+31.08%)
Mutual labels:  transformer
youtokentome-ruby
High performance unsupervised text tokenization for Ruby
Stars: ✭ 17 (-77.03%)
Mutual labels:  tokenization
linformer
Implementation of Linformer for Pytorch
Stars: ✭ 119 (+60.81%)
Mutual labels:  transformer
auto-data-tokenize
Identify and tokenize sensitive data automatically using Cloud DLP and Dataflow
Stars: ✭ 21 (-71.62%)
Mutual labels:  tokenization
trapper
State-of-the-art NLP through transformer models in a modular design and consistent APIs.
Stars: ✭ 28 (-62.16%)
Mutual labels:  transformer
saint
The official PyTorch implementation of recent paper - SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training
Stars: ✭ 209 (+182.43%)
Mutual labels:  transformer
text-generation-transformer
text generation based on transformer
Stars: ✭ 36 (-51.35%)
Mutual labels:  transformer
Filipino-Text-Benchmarks
Open-source benchmark datasets and pretrained transformer models in the Filipino language.
Stars: ✭ 22 (-70.27%)
Mutual labels:  transformer

Charformer - Pytorch

Implementation of the GBST (gradient-based subword tokenization) module from the Charformer paper, in Pytorch. The paper proposes a module that automatically learns subword representations, obviating the need for tokenizers in the encoder setting.

AI Coffee Break with Letitia video

Install

$ pip install charformer-pytorch

Usage

import torch
from charformer_pytorch import GBST

tokenizer = GBST(
    num_tokens = 257,             # number of tokens, should be 256 for byte encoding (+ 1 special token for padding in this example)
    dim = 512,                    # dimension of token and intra-block positional embedding
    max_block_size = 4,           # maximum block size
    downsample_factor = 4,        # the final downsample factor by which the sequence length will decrease
    score_consensus_attn = True   # whether to do the cheap score consensus (aka attention) as in eq. 5 in the paper
)

tokens = torch.randint(0, 257, (1, 1023)) # odd sequence length (1023), not a multiple of the downsample factor
mask   = torch.ones(1, 1023).bool()

# both tokens and mask will be appropriately downsampled

tokens, mask = tokenizer(tokens, mask = mask) # (1, 256, 512), (1, 256)

# now pass this on to your transformer
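
As a minimal sketch of that last step (not part of charformer-pytorch): the downsampled embeddings can go straight into any encoder, here an off-the-shelf torch.nn.TransformerEncoder with placeholder hyperparameters (nhead, num_layers). Note that src_key_padding_mask treats True as positions to ignore, which is the inverse of the keep-mask returned by GBST above, hence the ~mask below.

import torch
from torch import nn
from charformer_pytorch import GBST

tokenizer = GBST(
    num_tokens = 257,
    dim = 512,
    max_block_size = 4,
    downsample_factor = 4,
    score_consensus_attn = True
)

# a generic stand-in encoder; swap in whatever transformer you actually use
encoder_layer = nn.TransformerEncoderLayer(d_model = 512, nhead = 8, batch_first = True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers = 6)

tokens = torch.randint(0, 257, (1, 1023))
mask   = torch.ones(1, 1023).bool()

embeds, mask = tokenizer(tokens, mask = mask)        # (1, 256, 512), (1, 256)

# src_key_padding_mask expects True = pad/ignore, so invert the keep-mask from GBST
out = encoder(embeds, src_key_padding_mask = ~mask)  # (1, 256, 512)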

Deviating from the paper, you can also specify block sizes with different offsets. This is to cover a potential use case in genomics pre-training, where the tokenizer should be able to learn the correct reading frame. Simply omit max_block_size and pass in blocks as a sequence of (block size, offset) tuples, as in the example below. Each offset must be less than its block size.

import torch
from charformer_pytorch import GBST

tokenizer = GBST(
    num_tokens = 4 + 1,                 # 4 bases, plus 1 special token for padding
    dim = 512,                          # dimension of token and intra-block positional embedding
    blocks = ((3, 0), (3, 1), (3, 2)),  # block size of 3, with offsets of 0, 1, 2
    downsample_factor = 3,              # final downsample factor by which the sequence length will decrease
    score_consensus_attn = True         # cheap score consensus (attention), as in eq. 5 of the paper
).cuda()

basepairs = torch.randint(0, 4, (1, 1023)).cuda()
mask      = torch.ones(1, 1023).bool().cuda()

# both basepairs and mask will be appropriately downsampled

basepairs, mask = tokenizer(basepairs, mask = mask)
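
As a quick sanity check on this example (an expectation given the settings above, not verified output): the input length of 1023 divides evenly by the downsample factor of 3, so the downsampled sequence should have 1023 / 3 = 341 positions.

# continuing from the example above
assert basepairs.shape == (1, 341, 512)  # (batch, downsampled length, dim)
assert mask.shape      == (1, 341)       # (batch, downsampled length)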

Citations

@misc{tay2021charformer,
    title   = {Charformer: Fast Character Transformers via Gradient-based Subword Tokenization}, 
    author  = {Yi Tay and Vinh Q. Tran and Sebastian Ruder and Jai Gupta and Hyung Won Chung and Dara Bahri and Zhen Qin and Simon Baumgartner and Cong Yu and Donald Metzler},
    year    = {2021},
    eprint  = {2106.12672},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}