
TF2-Engine / TF2

License: Apache-2.0
An Open Source Deep Learning Inference Engine Based on FPGA

Programming Languages

python

Projects that are alternatives to or similar to TF2

Opencv Mtcnn
An implementation of MTCNN Face detector using OpenCV's DNN module
Stars: ✭ 59 (-47.79%)
Mutual labels:  opencl, inference, dnn
Lq Nets
LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks
Stars: ✭ 195 (+72.57%)
Mutual labels:  cnn, quantization, dnn
Paddleslim
PaddleSlim is an open-source library for deep model compression and architecture search.
Stars: ✭ 677 (+499.12%)
Mutual labels:  quantization, model-compression
Awesome Automl And Lightweight Models
A list of high-quality (newest) AutoML works and lightweight models including 1.) Neural Architecture Search, 2.) Lightweight Structures, 3.) Model Compression, Quantization and Acceleration, 4.) Hyperparameter Optimization, 5.) Automated Feature Engineering.
Stars: ✭ 691 (+511.5%)
Mutual labels:  quantization, model-compression
Model Optimization
A toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization and pruning.
Stars: ✭ 992 (+777.88%)
Mutual labels:  quantization, model-compression
Hawq
Quantization library for PyTorch. Support low-precision and mixed-precision quantization, with hardware implementation through TVM.
Stars: ✭ 108 (-4.42%)
Mutual labels:  quantization, model-compression
John
John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
Stars: ✭ 5,656 (+4905.31%)
Mutual labels:  fpga, opencl
Sai
SDK for TEE AI Stick (includes model training script, inference library, examples)
Stars: ✭ 28 (-75.22%)
Mutual labels:  cnn, quantization
Brevitas
Brevitas: quantization-aware training in PyTorch
Stars: ✭ 343 (+203.54%)
Mutual labels:  fpga, quantization
Jacinto Ai Devkit
Training & Quantization of embedded friendly Deep Learning / Machine Learning / Computer Vision models
Stars: ✭ 49 (-56.64%)
Mutual labels:  cnn, quantization
Haddoc2
Caffe to VHDL
Stars: ✭ 57 (-49.56%)
Mutual labels:  fpga, cnn
Tornadovm
TornadoVM: A practical and efficient heterogeneous programming framework for managed languages
Stars: ✭ 479 (+323.89%)
Mutual labels:  fpga, opencl
Rmdl
RMDL: Random Multimodel Deep Learning for Classification
Stars: ✭ 375 (+231.86%)
Mutual labels:  cnn, dnn
Awesome Emdl
Embedded and mobile deep learning research resources
Stars: ✭ 554 (+390.27%)
Mutual labels:  inference, quantization
Trisycl
Generic system-wide modern C++ for heterogeneous platforms with SYCL from Khronos Group
Stars: ✭ 354 (+213.27%)
Mutual labels:  fpga, opencl
Pipecnn
An OpenCL-based FPGA Accelerator for Convolutional Neural Networks
Stars: ✭ 775 (+585.84%)
Mutual labels:  fpga, opencl
Neuronblocks
NLP DNN Toolkit - Building Your NLP DNN Models Like Playing Lego
Stars: ✭ 1,356 (+1100%)
Mutual labels:  model-compression, dnn
Sdaccel examples
SDAccel Examples
Stars: ✭ 325 (+187.61%)
Mutual labels:  fpga, opencl
Numpy neural network
A neural network implemented from scratch using only numpy, including the derivation of the backpropagation formulas; fully connected, convolutional, pooling, and Flatten layers built with numpy; plus image classification and network fine-tuning examples. Continuously updated...
Stars: ✭ 339 (+200%)
Mutual labels:  cnn, dnn
Dialectid e2e
End to End Dialect Identification using Convolutional Neural Network
Stars: ✭ 40 (-64.6%)
Mutual labels:  cnn, dnn

Inspur Deep Learning Inference Accelerator TF2

TF2 Community

(Image: partners.png — TF2 community partners)

Reconfigurable AI Computing Program (可重构AI计算发展计划): program page entry

TF2

TF2 is a deep learning inference accelerator built on an FPGA computing platform and developed by Inspur AI & HPC. It supports a wide range of general-purpose deep neural networks. Models from popular deep learning frameworks such as PyTorch, TensorFlow, and Caffe can be loaded into TF2 easily with the toolkits we supply, and a pretrained model can be compiled for the FPGA without any code-level FPGA development work, making TF2 an agile solution for AI inference applications on FPGA. See https://1drv.ms/b/s!Am9Mk04MA_K1bpXjzmHS8U04PSI?e=LaSgjb for our paper: A Deep Learning Inference Accelerator Based on Model Compression on FPGA.

The TF2 accelerator is composed of two parts: Transform Kit and Runtime Engine.


Transform Kit

The Transform Kit is a tool for model optimization and conversion, with modules for model compression, pruning, and quantization. It aims to reduce the model data size and simplify the mathematical calculations. In addition, computational node fusion can be performed in the Transform Kit to relax the data-access bandwidth limitation on computing performance by merging multiple computing nodes into one. The Runtime Engine can then automatically compile the optimized model file into an FPGA target file. The compression and pruning steps are optional.
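As a concrete illustration of node fusion, the sketch below folds a batch-normalization node into the preceding convolution so the two nodes become a single convolution. This is a generic NumPy example, not TF2's own code; the function name and the (out_ch, in_ch, kh, kw) weight layout are assumptions.

```python
import numpy as np

def fuse_conv_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm node into the preceding convolution (illustrative only).

    weight: (out_ch, in_ch, kh, kw) convolution weights
    bias:   (out_ch,) convolution bias (pass zeros if the conv has no bias)
    gamma, beta, mean, var: (out_ch,) BatchNorm parameters
    Returns (fused_weight, fused_bias) such that
    conv(x, fused_weight, fused_bias) == batchnorm(conv(x, weight, bias)).
    """
    scale = gamma / np.sqrt(var + eps)                  # per-output-channel scale
    fused_weight = weight * scale[:, None, None, None]  # scale each output channel
    fused_bias = (bias - mean) * scale + beta
    return fused_weight, fused_bias
```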

Compression

Model compression is based on Inspur's optimized Incremental Network Quantization (INQ) method. Deep neural network model data trained with PyTorch and other frameworks can be used as input. It compresses 32-bit floating-point model data into 4-bit values, making the actual model data size 1/8 of the original while the original data structure is maintained. Each compressed value represents either 0 or a power of two encoded in 4 bits, and four 4-bit values are packed into one 16-bit short. The accuracy of typical CNNs with and without compression is shown in the following table.

NetWork      Top1     Top5     Top1 (Compressed)   Top5 (Compressed)
AlexNet      0.5676   0.7990   0.5687              0.8000
VGG16        0.6828   0.8827   0.7055              0.8994
GoogLeNet    0.6889   0.8898   0.6857              0.8887
ResNet50     0.7276   0.9101   0.7465              0.9248
SqueezeNet   0.5750   0.8030   0.5900              0.8040

NetWork      mAP      mAP (Compressed)
SSD          0.7773   0.7757
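A minimal sketch of the idea behind INQ-style compression, assuming NumPy weight arrays: each value is snapped to 0 or a signed power of two, and four 4-bit codes are packed into one 16-bit short. The exponent range, rounding rule, and code layout here are illustrative assumptions; TF2's actual Transform Kit may differ.

```python
import numpy as np

def quantize_to_pow2(weights, max_exp=0, n_exps=7):
    """Snap float weights to {0} U {±2^e} for the n_exps exponents ending at max_exp.

    With 7 exponents plus zero and a sign, a value fits a 4-bit code
    (this encoding is an assumption; TF2's actual code assignment may differ).
    """
    exps = max_exp - np.arange(n_exps)                    # e.g. 0, -1, ..., -6
    out = np.zeros_like(weights)
    nonzero = np.abs(weights) >= 2.0 ** (exps.min() - 1)  # smaller values snap to 0
    e = np.clip(np.round(np.log2(np.abs(weights[nonzero]))), exps.min(), exps.max())
    out[nonzero] = np.sign(weights[nonzero]) * 2.0 ** e
    return out

def pack_nibbles(codes):
    """Pack four 4-bit codes (values 0..15) into one 16-bit short each."""
    codes = np.asarray(codes, dtype=np.uint16).reshape(-1, 4)
    return (codes[:, 0]
            | (codes[:, 1] << 4)
            | (codes[:, 2] << 8)
            | (codes[:, 3] << 12)).astype(np.uint16)
```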

Pruning

The TF2 pruning unit includes random pruning and channel pruning modules (code will be published later). Random pruning achieves a high pruning rate, but the pruned model is sparse. Channel pruning is a structured, dynamic pruning method that directly removes channels to lower the computational cost. Its advantages are: 1) the pruned model can be retrained back to the original accuracy with a limited number of training iterations, and 2) the pruned model can be loaded directly into the TF2 Runtime Engine. The pruning rate and the accuracy of ResNet50 with and without pruning are shown in the table below; model pruning achieves about a 1.6x speedup on FPGA.

Pruned Ratio   Top1     Top5     Top1-gap   Top5-gap
0%             0.7277   0.9109   -          -
50%            0.7289   0.9118   0.13% ↑    0.17% ↑
60%            0.7183   0.9079   0.93% ↓    0.22% ↓
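A minimal channel-pruning sketch for one convolution layer, assuming an L1-norm ranking criterion; the criterion and names are assumptions, since TF2's channel-pruning code has not been published yet.

```python
import numpy as np

def prune_channels(weight, prune_ratio=0.5):
    """Structured channel pruning of one conv layer (criterion assumed, illustrative only).

    weight: (out_ch, in_ch, kh, kw) convolution weights.
    Output channels are ranked by the L1 norm of their filters; the weakest ones are
    removed, so the pruned layer stays dense and smaller rather than sparse.
    Returns the pruned weights and the kept-channel indices (needed to slice the
    input channels of the following layer).
    """
    out_ch = weight.shape[0]
    n_keep = max(1, int(round(out_ch * (1.0 - prune_ratio))))
    scores = np.abs(weight).reshape(out_ch, -1).sum(axis=1)  # L1 norm per output channel
    keep = np.sort(np.argsort(scores)[-n_keep:])             # strongest channels, in order
    return weight[keep], keep
```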

Quantization

Since the model data is already 4-bit after compression, TF2 quantization applies only to feature map data. The TF2 quantization tool quantizes normalized 32-bit single-precision floating-point feature map data to 8-bit integers, i.e. to the range -128 to 127. The main advantages of feature map quantization are that it reduces the on-chip storage required for feature data to a quarter of the original and reduces the logic needed for data processing, greatly improving the effective computing power of the FPGA. The algorithm works as follows:

  1. Compute the maximum absolute value fmax of the feature map data for each channel of each convolution layer of the network.

  2. Find Q from the equation 128 = 2^Q * fmax.

  3. Quantize each value as V = 2^Q * fv, where fv is the original single-precision floating-point value.
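A minimal sketch of the three steps above, assuming per-channel (channels, height, width) feature maps in NumPy; the rounding and clipping details are assumptions, since the text only fixes the scaling rule.

```python
import numpy as np

def quantize_feature_map(fm):
    """Quantize one layer's feature maps to int8 following the steps above.

    fm: normalized float32 array of shape (channels, height, width).
    Returns the int8 data and the per-channel exponent Q (the scale is 2**Q).
    """
    q_data = np.empty(fm.shape, dtype=np.int8)
    q_exp = np.empty(fm.shape[0], dtype=np.int32)
    for c in range(fm.shape[0]):
        fmax = np.abs(fm[c]).max()
        # Step 2: 128 = 2^Q * fmax  ->  Q = log2(128 / fmax), kept integral by flooring.
        q = int(np.floor(np.log2(128.0 / fmax))) if fmax > 0 else 0
        # Step 3: V = 2^Q * fv, then round and clip into the int8 range [-128, 127].
        q_data[c] = np.clip(np.round(fm[c] * 2.0 ** q), -128, 127).astype(np.int8)
        q_exp[c] = q
    return q_data, q_exp
```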

The accuracy of SqueezeNet and ResNet50 with and without quantization is shown in the following table.

NetWork      Top1     Top5     Top1 (Quantized)   Top5 (Quantized)
SqueezeNet   0.5900   0.8040   0.5900             0.8010
ResNet50     0.7145   0.9010   0.7120             0.9043

Runtime Engine

The TF2 Runtime Engine is an intelligent runtime FPGA accelerator that automatically generates FPGA executable files. It first parses the network structure file and generates the network configuration file required by the Runtime Engine, then recompiles the FPGA code to produce the FPGA executable file automatically.

With the compressed model data (4-bit powers of two) and the 8-bit integer feature map data in the Runtime Engine, the multiplication of model data and feature map data can be converted into shift operations. This removes the dependency on the FPGA's DSP floating-point resources, greatly improves the performance of deep neural network inference on FPGA, and effectively reduces its power consumption.
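A small numeric illustration of why this works: when every weight has the form sign * 2^exp, each multiply in a dot product reduces to a bit shift. This is plain Python for illustration only, not the FPGA implementation.

```python
def shift_accumulate(features, exps, signs):
    """Shift-Accumulate (SAC): with weights of the form sign * 2^exp, every multiply
    in the dot product becomes a bit shift. Exponents are taken as non-negative here
    for simplicity; on the FPGA, negative exponents shift in the other direction."""
    acc = 0
    for x, e, s in zip(features, exps, signs):
        acc += s * (x << e)   # equals x * (s * 2**e), but needs no multiplier
    return acc

# Example: weights +2^1, -2^0, +2^3 applied to small integer features
assert shift_accumulate([3, 5, -2], [1, 0, 3], [1, -1, 1]) == 3*2 - 5*1 + (-2)*8
```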

The top-level computing architecture of the TF2 Runtime Engine is shown below. Multiple convolutional layers are executed serially on the FPGA. To reduce the limitation of storage access on computing performance, intermediate feature map data is kept on chip as much as possible. Model data is read from the external DDR into the FPGA in real time during the calculation, but reading can be performed simultaneously with computation, i.e. the read time is "hidden" under the calculation time. The core computing architecture of the TF2 Runtime Engine is shown below.

(Image: block.png — core computing architecture of the TF2 Runtime Engine)

The Filter Loader reads model data from the DDR to the chip. The Feature Loader reads the input image and feature map data from the DDR to the on-chip cache. The Controller generates control signals for the Scheduler. The Scheduler reads the feature data according to the control signals and sends the feature data, filter data, timing signals, and control signals to the PEs for calculation. The PE Array is the core computation unit of the entire architecture, performing Shift-Accumulate (SAC) or Multiply-Accumulate (MAC) calculations; the current version uses SAC, and a MAC version will follow. Each PE contains a Filter Cache that stores the filter data for the output channel currently being computed. The Adder adds the partial results of the MAC/SAC calculations to produce the final convolution results. The number of PEs in a PE Array can be configured according to the structure of the neural network and the amount of FPGA resources. MAC/SAC can be calculated in 1D, 2D, or 3D, configurable according to the target application; the current version supports 2D and 3D calculations. The vector length of each MAC/SAC dimension can also be configured according to the specific application and the available FPGA computing resources. Computing performance for several networks is listed in the following table.

NetWork                       Throughput (fps)
SqueezeNet                    1485
GoogLeNet                     306
FaceNet (MTCNN+SqueezeNet)    1020
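The "hidden" read time mentioned in the top-level architecture description above amounts to double buffering: while one layer computes, the next layer's filters are prefetched from the DDR. The sketch below illustrates the idea in Python with assumed callables and layer descriptors; the real Runtime Engine does this in hardware.

```python
import threading

def run_network(layers, load_filters, compute_layer):
    """Double-buffering sketch: prefetch the next layer's filters from DDR while the
    current layer computes, so the read time is hidden under the calculation time.
    layers, load_filters, and compute_layer are illustrative assumptions, not TF2 API.
    """
    filters = load_filters(layers[0])              # preload filters for the first layer
    for i, layer in enumerate(layers):
        prefetch = [None]
        worker = None
        if i + 1 < len(layers):
            def fetch_next(dst=prefetch, nxt=layers[i + 1]):
                dst[0] = load_filters(nxt)         # background read from DDR
            worker = threading.Thread(target=fetch_next)
            worker.start()
        compute_layer(layer, filters)              # compute overlaps with the prefetch
        if worker is not None:
            worker.join()                          # ideally already finished during compute
            filters = prefetch[0]
```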

Releases and Contributing

We appreciate all contributions. If you are planning to contribute bug fixes or add new compression or pruning algorithms, please do so without any further discussion.

If you plan to update computing architecture of FPGA, please first open an issue and discuss the feature with us. Sending a PR without discussion might end up resulting in a rejected PR, because we might be taking the architecture in a different direction than you might be aware of.

Reference

  1. Utku Aydonat, Shane O'Connell, Davor Capalija, Andrew C. Ling, and Gordon R. Chiu. 2017. An OpenCL Deep Learning Accelerator on Arria 10. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17). ACM, 55-64.
  2. Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. 2017. Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights. arXiv preprint arXiv:1702.03044.

License

Apache License 2.0
