
zhihu / Cubert

License: MIT
Fast implementation of BERT inference directly on NVIDIA (CUDA, CUBLAS) and Intel MKL

Projects that are alternatives of or similar to Cubert

Lightseq
LightSeq: A High Performance Inference Library for Sequence Processing and Generation
Stars: ✭ 501 (+26.84%)
Mutual labels:  inference, cuda, transformer
Turbotransformers
a fast and user-friendly runtime for transformer inference (Bert, Albert, GPT2, Decoders, etc) on CPU and GPU.
Stars: ✭ 826 (+109.11%)
Mutual labels:  inference, transformer
Tensorflow Cmake
TensorFlow examples in C, C++, Go and Python without bazel but with cmake and FindTensorFlow.cmake
Stars: ✭ 418 (+5.82%)
Mutual labels:  inference, cuda
Forward
A library for high performance deep learning inference on NVIDIA GPUs.
Stars: ✭ 136 (-65.57%)
Mutual labels:  inference, cuda
Onnxt5
Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.
Stars: ✭ 143 (-63.8%)
Mutual labels:  inference, transformer
Tensorrt Laboratory
Explore the Capabilities of the TensorRT Platform
Stars: ✭ 236 (-40.25%)
Mutual labels:  inference, cuda
Effective transformer
Running BERT without Padding
Stars: ✭ 169 (-57.22%)
Mutual labels:  inference, transformer
fastT5
⚡ boost inference speed of T5 models by 5x & reduce the model size by 3x.
Stars: ✭ 421 (+6.58%)
Mutual labels:  inference, transformer
Gfocal
Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection, NeurIPS2020
Stars: ✭ 376 (-4.81%)
Mutual labels:  inference
Cudf
cuDF - GPU DataFrame Library
Stars: ✭ 4,370 (+1006.33%)
Mutual labels:  cuda
Nvpipe
NVIDIA-accelerated zero latency video compression library for interactive remoting applications
Stars: ✭ 376 (-4.81%)
Mutual labels:  cuda
Cuda.jl
CUDA programming in Julia.
Stars: ✭ 370 (-6.33%)
Mutual labels:  cuda
Amgcl
C++ library for solving large sparse linear systems with algebraic multigrid method
Stars: ✭ 390 (-1.27%)
Mutual labels:  cuda
Flow Forecast
Deep learning PyTorch library for time series forecasting, classification, and anomaly detection (originally for flood forecasting).
Stars: ✭ 368 (-6.84%)
Mutual labels:  transformer
Cudanative.jl
Julia support for native CUDA programming
Stars: ✭ 393 (-0.51%)
Mutual labels:  cuda
Vuda
VUDA is a header-only library based on Vulkan that provides a CUDA Runtime API interface for writing GPU-accelerated applications.
Stars: ✭ 373 (-5.57%)
Mutual labels:  cuda
Mini Caffe
Minimal runtime core of Caffe, Forward only, GPU support and Memory efficiency.
Stars: ✭ 373 (-5.57%)
Mutual labels:  cuda
Nlp Tutorials
Simple implementations of NLP models. Tutorials are written in Chinese on my website https://mofanpy.com
Stars: ✭ 394 (-0.25%)
Mutual labels:  transformer
Ganet
GA-Net: Guided Aggregation Net for End-to-end Stereo Matching
Stars: ✭ 393 (-0.51%)
Mutual labels:  cuda
Music Translation
A UNIVERSAL MUSIC TRANSLATION NETWORK - a method for translating music across musical instruments and styles.
Stars: ✭ 385 (-2.53%)
Mutual labels:  cuda

Fast implementation of BERT inference directly on NVIDIA (CUDA, CUBLAS) and Intel MKL


Highly customized and optimized BERT inference directly on NVIDIA (CUDA, CUBLAS) or Intel MKL, without TensorFlow and its framework overhead.

ONLY BERT (Transformer) is supported.

Benchmark

Environment

  • Tesla P4
  • 28 * Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
  • Debian GNU/Linux 8 (jessie)
  • gcc (Debian 4.9.2-10+deb8u1) 4.9.2
  • CUDA: release 9.0, V9.0.176
  • MKL: 2019.0.1.20181227
  • tensorflow: 1.12.0
  • BERT: seq_length = 32

GPU (cuBERT)

batch size    128 (ms)    32 (ms)
tensorflow    255.2       70.0
cuBERT        184.6       54.5

CPU (mklBERT)

batch size    128 (ms)    1 (ms)
tensorflow    1504.0      69.9
mklBERT       984.9       24.0

Note: MKL should be run with OMP_NUM_THREADS set to control its number of threads. Other relevant environment variables and their possible values include (see the example after the list):

  • KMP_BLOCKTIME=0
  • KMP_AFFINITY=granularity=fine,verbose,compact,1,0
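
For example, a minimal sketch (in Python, with illustrative values) of setting these variables before the library is loaded:

import os

# Illustrative values; tune OMP_NUM_THREADS to your core count.
os.environ["OMP_NUM_THREADS"] = "8"
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

# Import the wrapper only after the threading knobs are in place,
# so MKL/OpenMP pick them up at initialization.
import libcubert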

Mixed Precision

cuBERT can be accelerated by Tensor Cores and mixed precision on NVIDIA Volta and Turing GPUs. We support mixed precision by storing variables in fp16 while performing the computation in fp32. The typical accuracy error is less than 1% compared with single-precision inference, while the speed-up is more than 2x.
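
Conceptually (an illustration of the scheme only, not cuBERT's actual kernels), weights are stored in half precision and promoted to single precision for the computation:

import numpy as np

# Store the weight matrix in fp16 to halve memory footprint and bandwidth...
w_fp16 = np.random.randn(768, 768).astype(np.float16)
x = np.random.randn(32, 768).astype(np.float32)

# ...then up-cast and accumulate the matmul in fp32, keeping the result
# close to the single-precision reference.
y = x @ w_fp16.astype(np.float32)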

API

See the .h header for the full API.

Pooler

We support the following two pooling methods:

  • The standard BERT pooler, which is defined as:
with tf.variable_scope("pooler"):
  # We "pool" the model by simply taking the hidden state corresponding
  # to the first token. We assume that this has been pre-trained
  first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
  self.pooled_output = tf.layers.dense(
    first_token_tensor,
    config.hidden_size,
    activation=tf.tanh,
    kernel_initializer=create_initializer(config.initializer_range))
  • Simple average pooler:
self.pooled_output = tf.reduce_mean(self.sequence_output, axis=1)

Output

The following outputs are supported (a short sketch of the first two follows the table):

cuBERT_OutputType        python code
cuBERT_LOGITS            model.get_pooled_output() * output_weights + output_bias
cuBERT_PROBS             probs = tf.nn.softmax(logits, axis=-1)
cuBERT_POOLED_OUTPUT     model.get_pooled_output()
cuBERT_SEQUENCE_OUTPUT   model.get_sequence_output()
cuBERT_EMBEDDING_OUTPUT  model.get_embedding_output()
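
To make the first two rows concrete, here is a small TensorFlow 1.x sketch in the same style as the table (output_weights and output_bias stand for the task-specific classification parameters, and the * in the table denotes a matrix product; shapes are illustrative):

import tensorflow as tf

hidden_size, num_labels = 768, 2   # illustrative shapes
pooled_output = tf.placeholder(tf.float32, [None, hidden_size])   # model.get_pooled_output()
output_weights = tf.get_variable("output_weights", [hidden_size, num_labels])
output_bias = tf.get_variable("output_bias", [num_labels])

logits = tf.matmul(pooled_output, output_weights) + output_bias   # cuBERT_LOGITS
probs = tf.nn.softmax(logits, axis=-1)                            # cuBERT_PROBS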

Build from Source

mkdir build && cd build
# if build with CUDA
cmake -DCMAKE_BUILD_TYPE=Release -DcuBERT_ENABLE_GPU=ON -DCUDA_ARCH_NAME=Common ..
# or build with MKL
cmake -DCMAKE_BUILD_TYPE=Release -DcuBERT_ENABLE_MKL_SUPPORT=ON ..
make -j4

# install to /usr/local
# it will also install MKL if -DcuBERT_ENABLE_MKL_SUPPORT=ON
sudo make install

If you would like to run tfBERT_benchmark for performance comparison, please first install the TensorFlow C API from https://www.tensorflow.org/install/lang_c.

Run Unit Test

Download the BERT test model bert_frozen_seq32.pb and vocab.txt from Dropbox, and put them under the build directory before running make test or ./cuBERT_test.

Python

We provide a simple Python wrapper built with Cython; it can be built and installed after the C++ build as follows:

cd python
python setup.py bdist_wheel

# install
pip install dist/cuBERT-xxx.whl

# test
python cuBERT_test.py

Please check the Python API usage and examples at cuBERT_test.py for more details.
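
As a minimal sketch of what the inputs look like (the token ids below are illustrative; the actual wrapper call is shown in cuBERT_test.py), you prepare fixed-length id/mask/segment arrays and feed them to the model:

import numpy as np

# One batch of BERT inputs for seq_length = 32, matching the test model.
# Ids come from the WordPiece vocab.txt; 0 is the padding id.
batch_size, seq_length = 2, 32
input_ids = np.zeros((batch_size, seq_length), dtype=np.int32)
input_mask = np.zeros((batch_size, seq_length), dtype=np.int32)
segment_ids = np.zeros((batch_size, seq_length), dtype=np.int32)

# Hypothetical 5-token example in the first row: [CLS] ... [SEP]
input_ids[0, :5] = [101, 7592, 2088, 999, 102]
input_mask[0, :5] = 1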

Java

The Java wrapper is implemented through JNA. After installing Maven and completing the C++ build, it can be built as follows:

cd java
mvn clean package # -DskipTests

When using the Java JAR, you need to set jna.library.path to the location of libcuBERT.so if it is not installed to a system path, and jna.encoding should be set to UTF8 via -Djna.encoding=UTF8 in the JVM start-up script.

Please check the Java API usage and example at ModelTest.java for more details.

Install

The pre-built Python binary package (currently only with MKL on Linux) can be installed as follows:

  • Download and install MKL to system path.

  • Download the wheel package and pip install cuBERT-xxx-linux_x86_64.whl

  • Run python -c 'import libcubert' to verify your installation.

Dependency

Protobuf

cuBERT is built with protobuf-c to avoid version and code conflicts with the protobuf used by TensorFlow.

CUDA

Libraries compiled with different CUDA versions are not compatible.

MKL

MKL is dynamically linked. We install both cuBERT and MKL with sudo make install.

Threading

We assume the typical use case of cuBERT is online serving, where concurrent requests with different batch_size should be served as fast as possible. Thus, throughput and latency have to be balanced, especially in a pure CPU environment.

As the vanilla class Bert is not thread-safe because of its internal buffers for computation, a wrapper class BertM is provided to hold locks on different Bert instances for thread safety. BertM chooses an underlying Bert instance in a round-robin manner, so consecutive requests hitting the same Bert instance might be queued by its corresponding lock.
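
A conceptual sketch of this dispatching scheme (illustration only; BertM itself is implemented in C++, and the compute interface here is hypothetical):

import itertools
import threading

class RoundRobinPool:
    """Serve requests over several model instances, one lock per instance."""

    def __init__(self, models):
        self._models = models
        self._locks = [threading.Lock() for _ in models]
        self._next = itertools.cycle(range(len(models)))
        self._pick = threading.Lock()

    def compute(self, *inputs):
        with self._pick:
            i = next(self._next)      # round-robin choice of the underlying instance
        with self._locks[i]:          # requests on the same instance queue on its lock
            return self._models[i].compute(*inputs)   # hypothetical per-model API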

GPU

One Bert is placed on one GPU card. The maximum number of concurrent requests is the number of usable GPU cards on the machine, which can be controlled by CUDA_VISIBLE_DEVICES if it is specified.

CPU

A pure CPU environment is more complicated than GPU. There are two levels of parallelism:

  1. Request level. Concurrent requests will compete for CPU resources if the online server itself is multi-threaded. If the server is single-threaded (for example, some server implementations in Python), things are much easier.

  2. Operation level. The matrix operations are parallelized by OpenMP and MKL. The maximum parallelism is controlled by OMP_NUM_THREADS, MKL_NUM_THREADS, and many other environment variables. We refer users to first read Using Threaded Intel® MKL in Multi-Thread Application and Recommended settings for calling Intel MKL routines from multi-threaded applications.

Thus, we introduce CUBERT_NUM_CPU_MODELS for better control of request-level parallelism. This variable specifies the number of Bert instances created on CPU/memory, and it acts the same way as CUDA_VISIBLE_DEVICES does for GPU.

  • If you have a limited number of CPU cores (old or desktop CPUs, or in Docker), it is not necessary to use CUBERT_NUM_CPU_MODELS. For example, with 4 CPU cores, a request-level parallelism of 1 and an operation-level parallelism of 4 should work quite well.

  • But if you have many CPU cores, say 40, it might be better to try a request-level parallelism of 5 and an operation-level parallelism of 8.

In summary, OMP_NUM_THREADS or MKL_NUM_THREADS defines how many threads one model may use, and CUBERT_NUM_CPU_MODELS defines how many models run in total (see the example below).
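
For instance, the 40-core example above (5 models with 8 threads each, an assumed split to be tuned for your workload) could be configured before the server process loads the library:

import os

# 5 Bert instances, each allowed up to 8 MKL/OpenMP threads (5 * 8 = 40 cores).
os.environ["CUBERT_NUM_CPU_MODELS"] = "5"
os.environ["OMP_NUM_THREADS"] = "8"
os.environ["MKL_NUM_THREADS"] = "8"

import libcubert  # load the library only after the knobs are set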

Again, per-request latency and overall throughput should be balanced, and the best configuration depends on the model seq_length, batch_size, your CPU cores, your server QPS, and many other things. Run plenty of benchmarks to find the best trade-off. Good luck!

Authors

  • fanliwen
  • wangruixin
  • fangkuan
  • sunxian