
lucidrains / charformer-pytorch

License: MIT
Implementation of the GBST block from the Charformer paper, in Pytorch

Programming Languages

python

Projects that are alternatives to or similar to charformer-pytorch

zero-administration-inference-with-aws-lambda-for-hugging-face
Zero administration inference with AWS Lambda for 🤗
Stars: ✭ 19 (-74.32%)
Mutual labels:  transformer
pynmt
a simple and complete pytorch implementation of neural machine translation system
Stars: ✭ 13 (-82.43%)
Mutual labels:  transformer
laravel-mutate
Mutate Laravel attributes
Stars: ✭ 13 (-82.43%)
Mutual labels:  transformer
SOLQ
"SOLQ: Segmenting Objects by Learning Queries", SOLQ is an end-to-end instance segmentation framework with Transformer.
Stars: ✭ 159 (+114.86%)
Mutual labels:  transformer
laravel5-jsonapi-dingo
Laravel5 JSONAPI and Dingo together to build APIs fast
Stars: ✭ 29 (-60.81%)
Mutual labels:  transformer
uax29
A tokenizer based on Unicode text segmentation (UAX 29), for Go
Stars: ✭ 26 (-64.86%)
Mutual labels:  tokenization
amrlib
A python library that makes AMR parsing, generation and visualization simple.
Stars: ✭ 107 (+44.59%)
Mutual labels:  transformer
vgs-collect-ios
VGS Collect iOS SDK
Stars: ✭ 17 (-77.03%)
Mutual labels:  tokenization
semantic-document-relations
Implementation, trained models and result data for the paper "Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles"
Stars: ✭ 21 (-71.62%)
Mutual labels:  transformer
PAML
Personalizing Dialogue Agents via Meta-Learning
Stars: ✭ 114 (+54.05%)
Mutual labels:  transformer
kosr
Korean speech recognition based on transformer (트랜스포머 기반 한국어 음성 인식)
Stars: ✭ 25 (-66.22%)
Mutual labels:  transformer
are-16-heads-really-better-than-1
Code for the paper "Are Sixteen Heads Really Better than One?"
Stars: ✭ 128 (+72.97%)
Mutual labels:  transformer
attention-is-all-you-need-paper
Implementation of Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
Stars: ✭ 97 (+31.08%)
Mutual labels:  transformer
youtokentome-ruby
High performance unsupervised text tokenization for Ruby
Stars: ✭ 17 (-77.03%)
Mutual labels:  tokenization
linformer
Implementation of Linformer for Pytorch
Stars: ✭ 119 (+60.81%)
Mutual labels:  transformer
auto-data-tokenize
Identify and tokenize sensitive data automatically using Cloud DLP and Dataflow
Stars: ✭ 21 (-71.62%)
Mutual labels:  tokenization
trapper
State-of-the-art NLP through transformer models in a modular design and consistent APIs.
Stars: ✭ 28 (-62.16%)
Mutual labels:  transformer
saint
The official PyTorch implementation of recent paper - SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training
Stars: ✭ 209 (+182.43%)
Mutual labels:  transformer
text-generation-transformer
text generation based on transformer
Stars: ✭ 36 (-51.35%)
Mutual labels:  transformer
Filipino-Text-Benchmarks
Open-source benchmark datasets and pretrained transformer models in the Filipino language.
Stars: ✭ 22 (-70.27%)
Mutual labels:  transformer

Charformer - Pytorch

Implementation of the GBST (gradient-based subword tokenization) module from the Charformer paper, in Pytorch. The paper proposes a module that automatically learns subword representations, obviating the need for tokenizers in the encoder setting.

AI Coffee Break with Letitia video

Install

$ pip install charformer-pytorch

Usage

import torch
from charformer_pytorch import GBST

tokenizer = GBST(
    num_tokens = 257,             # number of tokens, should be 256 for byte encoding (+ 1 special token for padding in this example)
    dim = 512,                    # dimension of token and intra-block positional embedding
    max_block_size = 4,           # maximum block size
    downsample_factor = 4,        # the final downsample factor by which the sequence length will decrease
    score_consensus_attn = True   # whether to do the cheap score consensus (aka attention) as in eq. 5 in the paper
)

tokens = torch.randint(0, 257, (1, 1023)) # odd sequence length (1023), not a multiple of the downsample factor
mask   = torch.ones(1, 1023).bool()

# both tokens and mask will be appropriately downsampled

tokens, mask = tokenizer(tokens, mask = mask) # (1, 256, 512), (1, 256)

# now pass this on to your transformer
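
As a minimal sketch of that last step (not part of charformer-pytorch): the downsampled embeddings can go straight into any encoder, here an off-the-shelf torch.nn.TransformerEncoder with placeholder hyperparameters (nhead, num_layers). Note that src_key_padding_mask treats True as positions to ignore, which is the inverse of the keep-mask returned by GBST above, hence the ~mask below.

import torch
from torch import nn
from charformer_pytorch import GBST

tokenizer = GBST(
    num_tokens = 257,
    dim = 512,
    max_block_size = 4,
    downsample_factor = 4,
    score_consensus_attn = True
)

# a generic stand-in encoder; swap in whatever transformer you actually use
encoder_layer = nn.TransformerEncoderLayer(d_model = 512, nhead = 8, batch_first = True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers = 6)

tokens = torch.randint(0, 257, (1, 1023))
mask   = torch.ones(1, 1023).bool()

embeds, mask = tokenizer(tokens, mask = mask)        # (1, 256, 512), (1, 256)

# src_key_padding_mask expects True = pad/ignore, so invert the keep-mask from GBST
out = encoder(embeds, src_key_padding_mask = ~mask)  # (1, 256, 512)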

Deviating from the paper, you can also specify block sizes with different offsets. This is to cover a potential use case in genomics pre-training, where the tokenizer should be able to learn the correct reading frame. Simply omit max_block_size and pass in blocks as a sequence of (block size, offset) tuples, as in the example below. Each offset must be less than its block size.

import torch
from charformer_pytorch import GBST

tokenizer = GBST(
    num_tokens = 4 + 1,                 # 4 bases, plus 1 special token for padding
    dim = 512,                          # dimension of token and intra-block positional embedding
    blocks = ((3, 0), (3, 1), (3, 2)),  # block size of 3, with offsets of 0, 1, 2
    downsample_factor = 3,              # final downsample factor by which the sequence length will decrease
    score_consensus_attn = True         # cheap score consensus (attention), as in eq. 5 of the paper
).cuda()

basepairs = torch.randint(0, 4, (1, 1023)).cuda()
mask      = torch.ones(1, 1023).bool().cuda()

# both basepairs and mask will be appropriately downsampled

basepairs, mask = tokenizer(basepairs, mask = mask)
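
As a quick sanity check on this example (an expectation given the settings above, not verified output): the input length of 1023 divides evenly by the downsample factor of 3, so the downsampled sequence should have 1023 / 3 = 341 positions.

# continuing from the example above
assert basepairs.shape == (1, 341, 512)  # (batch, downsampled length, dim)
assert mask.shape      == (1, 341)       # (batch, downsampled length)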

Citations

@misc{tay2021charformer,
    title   = {Charformer: Fast Character Transformers via Gradient-based Subword Tokenization}, 
    author  = {Yi Tay and Vinh Q. Tran and Sebastian Ruder and Jai Gupta and Hyung Won Chung and Dara Bahri and Zhen Qin and Simon Baumgartner and Cong Yu and Donald Metzler},
    year    = {2021},
    eprint  = {2106.12672},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}