
bytedance / Effective_transformer

License: Apache-2.0
Running BERT without Padding

Projects that are alternatives of or similar to Effective transformer

Lightseq
LightSeq: A High Performance Inference Library for Sequence Processing and Generation
Stars: ✭ 501 (+196.45%)
Mutual labels:  inference, transformer
fastT5
⚡ boost inference speed of T5 models by 5x & reduce the model size by 3x.
Stars: ✭ 421 (+149.11%)
Mutual labels:  inference, transformer
Turbotransformers
a fast and user-friendly runtime for transformer inference (Bert, Albert, GPT2, Decoders, etc) on CPU and GPU.
Stars: ✭ 826 (+388.76%)
Mutual labels:  inference, transformer
Cubert
Fast implementation of BERT inference directly on NVIDIA (CUDA, CUBLAS) and Intel MKL
Stars: ✭ 395 (+133.73%)
Mutual labels:  inference, transformer
Onnxt5
Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.
Stars: ✭ 143 (-15.38%)
Mutual labels:  inference, transformer
Mediapipe
Cross-platform, customizable ML solutions for live and streaming media.
Stars: ✭ 15,338 (+8975.74%)
Mutual labels:  inference
Emlearn
Machine Learning inference engine for Microcontrollers and Embedded devices
Stars: ✭ 154 (-8.88%)
Mutual labels:  inference
Bmw Labeltool Lite
This repository provides an easy-to-use labeling tool for state-of-the-art deep learning training purposes.
Stars: ✭ 145 (-14.2%)
Mutual labels:  inference
Tupe
Transformer with Untied Positional Encoding (TUPE). Code of paper "Rethinking Positional Encoding in Language Pre-training". Improve existing models like BERT.
Stars: ✭ 143 (-15.38%)
Mutual labels:  transformer
Server
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
Stars: ✭ 2,994 (+1671.6%)
Mutual labels:  inference
Medical Transformer
Pytorch Code for "Medical Transformer: Gated Axial-Attention for Medical Image Segmentation"
Stars: ✭ 153 (-9.47%)
Mutual labels:  transformer
Embedding As Service
One-Stop Solution to encode sentence to fixed length vectors from various embedding techniques
Stars: ✭ 151 (-10.65%)
Mutual labels:  transformer
The Story Of Heads
This is a repository with the code for the ACL 2019 paper "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned" and the paper "Analyzing Source and Target Contributions to NMT Predictions".
Stars: ✭ 146 (-13.61%)
Mutual labels:  transformer
Bert Ner Tf
Named Entity Recognition with BERT using TensorFlow 2.0
Stars: ✭ 155 (-8.28%)
Mutual labels:  inference
Ncnn
ncnn is a high-performance neural network inference framework optimized for the mobile platform
Stars: ✭ 13,376 (+7814.79%)
Mutual labels:  inference
Hey Jetson
Deep Learning based Automatic Speech Recognition with attention for the Nvidia Jetson.
Stars: ✭ 161 (-4.73%)
Mutual labels:  inference
Tensorflow template application
TensorFlow template application for deep learning
Stars: ✭ 1,851 (+995.27%)
Mutual labels:  inference
Routing Transformer
Fully featured implementation of Routing Transformer
Stars: ✭ 149 (-11.83%)
Mutual labels:  transformer
Hrnet Semantic Segmentation
The OCR approach is rephrased as Segmentation Transformer: https://arxiv.org/abs/1909.11065. This is an official implementation of semantic segmentation for HRNet. https://arxiv.org/abs/1908.07919
Stars: ✭ 2,369 (+1301.78%)
Mutual labels:  transformer
Bigbird
Transformers for Longer Sequences
Stars: ✭ 146 (-13.61%)
Mutual labels:  transformer

Effective Transformer

Effective Transformer is built on top of NVIDIA's open-source FasterTransformer project, with many additional optimizations. Our experiments show that Effective Transformer can significantly reduce execution time and memory consumption, especially for large batch sizes.

Running BERT without Padding

When using BERT to encode a batch of input sequences, we usually treat the batch as a matrix whose number of columns equals the maximum sequence length in the batch. NVIDIA FasterTransformer handles batches in which all sequences have roughly the same length very efficiently. However, when the sequence lengths within a batch vary widely, padding every sequence to the same length wastes both memory and computation.

Consider the following case:

bert_input = [["Hi"], ["Picking"], ["The", "seed", "of", "Job's", "tears"]]
bert_tokens = [[1], [2], [3,4,5,6,7]]
bert_tokens_padded = [[1, 0, 0, 0, 0], [2, 0, 0, 0, 0], [3, 4, 5, 6, 7]]
bert_tokens_mask = [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 1, 1, 1, 1]]

This input contains 3 sequences, and the maximum length is 5. If we simply treat it as a 3x5 matrix, only 7 of the 15 values are meaningful.

In Effective Transformer, we still take the input batch as a padded matrix, but padding values are dynamically removed and restored at different stages of the calculation.

By calculating the prefix sum of the input mask matrix, we can access the real inputs of each sequence in a matrix with no padding values. The following figure illustrates how valid inputs are accessed and how padding values are dynamically removed and restored during the calculation; valid inputs are colored green and padding values gray.
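As a rough illustration (a minimal NumPy sketch of the idea, not the actual CUDA kernels), the prefix sum of the flattened mask tells each valid token where it lands in a compact, padding-free buffer, and the same indices are later used to scatter the results back into the padded layout:

import numpy as np

# Mask and hidden states for the 3x5 example above; a hidden size of 2 is arbitrary.
mask = np.array([[1, 0, 0, 0, 0],
                 [1, 0, 0, 0, 0],
                 [1, 1, 1, 1, 1]], dtype=np.int32)
hidden = np.random.rand(3, 5, 2).astype(np.float32)    # [batch, seq, hidden]

flat_mask = mask.reshape(-1)
prefix = np.cumsum(flat_mask)                  # prefix sum of the mask
valid_idx = np.flatnonzero(flat_mask)          # flat positions of the 7 valid tokens
compact_pos = prefix[valid_idx] - 1            # destination row in the compact buffer

# "Remove padding": gather the valid tokens into a dense [7, hidden] matrix.
compact = np.empty((flat_mask.sum(), hidden.shape[-1]), dtype=hidden.dtype)
compact[compact_pos] = hidden.reshape(-1, hidden.shape[-1])[valid_idx]

# ... run the transformer layers on `compact` instead of the padded batch ...

# "Restore padding": scatter the rows back to their padded positions.
restored = np.zeros_like(hidden).reshape(-1, hidden.shape[-1])
restored[valid_idx] = compact[compact_pos]
restored = restored.reshape(hidden.shape)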

Environment requirements

  • CMake >= 3.12
  • gcc >= 6
  • CUDA 10.0
  • Python >= 3.5
  • Tensorflow 1.15.x

Features

  • dynamic batch size
  • inference with float32 and float16

Performance

BERT-Base: layers = 12, head_num = 12, size_per_head = 64

Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz

Sequence lengths are generated by:

import numpy as np

# Per-sample sequence lengths, drawn uniformly from
# [2 * avg_seq_len - max_seq_len, max_seq_len], so their mean is avg_seq_len.
seq_len = np.random.randint(
    low = 2 * avg_seq_len - max_seq_len,
    high = max_seq_len + 1,
    size = (batch_size,),
    dtype = np.int32)
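For example (a quick check, not part of the benchmark script), with max_seq_len = 32 and avg_seq_len = 20 the lengths are drawn uniformly from [8, 32], so their mean is about 20, matching the "average sequence length ≈ 20" tables below:

import numpy as np

# Hypothetical values matching the first V100 table below.
max_seq_len, avg_seq_len, batch_size = 32, 20, 100000
seq_len = np.random.randint(low=2 * avg_seq_len - max_seq_len,
                            high=max_seq_len + 1,
                            size=(batch_size,), dtype=np.int32)
print(seq_len.min(), seq_len.max(), round(float(seq_len.mean()), 1))   # 8 32 ~20.0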

Tesla V100, float16, maximum sequence length = 32, average sequence length ≈ 20

| batch_size | XLA (ms) | FasterTransformer (ms) | Speedup over XLA | EffectiveTransformer (ms) | Speedup over XLA |
|---|---|---|---|---|---|
| 100 | 15.08 | 10.39 | 1.45 | 8.75 | 1.72 |
| 200 | 28.08 | 19.64 | 1.43 | 15.32 | 1.83 |
| 300 | 41.37 | 29.65 | 1.40 | 22.18 | 1.86 |
| 400 | 53.65 | 38.52 | 1.39 | 28.31 | 1.89 |
| 500 | 66.86 | 48.13 | 1.39 | 33.08 | 2.02 |
| 1000 | 131.46 | 95.01 | 1.38 | 64.34 | 2.04 |

Tesla V100, float16, maximum sequence length = 64, average sequence length ≈ 40

| batch_size | XLA (ms) | FasterTransformer (ms) | Speedup over XLA | EffectiveTransformer (ms) | Speedup over XLA |
|---|---|---|---|---|---|
| 100 | 28.31 | 20.27 | 1.40 | 16.03 | 1.77 |
| 200 | 54.47 | 40.08 | 1.36 | 30.15 | 1.81 |
| 300 | 80.53 | 59.11 | 1.36 | 41.27 | 1.95 |
| 400 | 106.5 | 78.38 | 1.36 | 54.12 | 1.97 |
| 500 | 132.35 | 98.03 | 1.37 | 65.92 | 2.01 |
| 1000 | 261.18 | 190.91 | 1.38 | 133.61 | 1.95 |

Tesla V100, float32, maximum sequence length = 64, average sequence length ≈ 40

| batch_size | XLA (ms) | FasterTransformer (ms) | Speedup over XLA | EffectiveTransformer (ms) | Speedup over XLA |
|---|---|---|---|---|---|
| 100 | 103.13 | 98.52 | 1.05 | 67.45 | 1.53 |
| 200 | 207.40 | 198.86 | 1.04 | 125.44 | 1.65 |
| 300 | 304.99 | 290.55 | 1.05 | 197.07 | 1.55 |
| 400 | 405.98 | 386.04 | 1.05 | 247.39 | 1.64 |
| 500 | 516.88 | 496.90 | 1.04 | 325.37 | 1.59 |

Tesla T4, float16, maximum sequence length = 32, average sequence length ≈ 20

| batch_size | XLA (ms) | FasterTransformer (ms) | Speedup over XLA | EffectiveTransformer (ms) | Speedup over XLA |
|---|---|---|---|---|---|
| 100 | 44.94 | 35.07 | 1.28 | 28.63 | 1.57 |
| 200 | 90.09 | 67.08 | 1.34 | 53.84 | 1.67 |
| 300 | 136.88 | 100.96 | 1.35 | 82.74 | 1.65 |
| 400 | 184.80 | 133.13 | 1.39 | 109.09 | 1.69 |
| 500 | 242.79 | 166.54 | 1.46 | 136.66 | 1.78 |

Tesla T4, float16, maximum sequence length = 64, average sequence length ≈ 40

| batch_size | XLA (ms) | FasterTransformer (ms) | Speedup over XLA | EffectiveTransformer (ms) | Speedup over XLA |
|---|---|---|---|---|---|
| 100 | 87.23 | 65.86 | 1.30 | 52.01 | 1.68 |
| 200 | 176.91 | 138.53 | 1.34 | 108.33 | 1.63 |
| 300 | 261.25 | 204.99 | 1.36 | 157.84 | 1.65 |
| 400 | 355.34 | 272.96 | 1.33 | 202.61 | 1.75 |
| 500 | 452.62 | 343.89 | 1.33 | 250.78 | 1.80 |

Run demo

Using the prebuilt Python package requires Python 3.5+, TensorFlow 1.15.x, and CUDA 10.0; it has been tested on Debian 9.

$ cd effective_transformer
$ pip install -e python

$ python benchmark.py --help
usage: benchmark.py [-h] [-c CONFIG] [-p {fp32,fp16}] [-b BATCH_SIZE]
                    [-m MAX_SEQ_LENGTH] [-a AVG_SEQ_LENGTH]

Bert performance measuring sample.

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        Bert config file.
  -p {fp32,fp16}, --precision {fp32,fp16}
                        Weight precision.
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        Batch size.
  -m MAX_SEQ_LENGTH, --max_seq_length MAX_SEQ_LENGTH
                        Max sequence length.
  -a AVG_SEQ_LENGTH, --avg_seq_length AVG_SEQ_LENGTH
                        Average sequence length.
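
For example, to benchmark fp16 inference at batch size 200 with a maximum sequence length of 32 and an average length of 20 (here bert_config.json is a placeholder path to a BERT-Base config file):

$ python benchmark.py -c bert_config.json -p fp16 -b 200 -m 32 -a 20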

Build from source

TF_PATH: path to the directory containing libtensorflow_framework.so

$ mkdir build && cd build
$ cmake -DTF_PATH=/your/path/to/pythonx.x/site-packages/tensorflow_core/ ..
$ make
$ cp lib/libtf_effectivetransformer.so ../python/effective_transformer/libtf_effectivetransformer.so.1.15
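
If you are unsure where libtensorflow_framework.so lives, the following one-liner (assuming TensorFlow 1.15 is installed in the active environment) prints the library directory to pass as TF_PATH:

$ python -c "import tensorflow as tf; print(tf.sysconfig.get_lib())"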