isarsoft / yolov4-triton-tensorrt

YOLOv4 on Triton Inference Server with TensorRT

License: MIT

This repository shows how to deploy YOLOv4 as an optimized TensorRT engine to Triton Inference Server.

Triton Inference Server takes care of model deployment with many out-of-the-box benefits, such as gRPC and HTTP interfaces, automatic scheduling on multiple GPUs, shared memory (even on GPU), health metrics, and memory resource management.

TensorRT will automatically optimize throughput and latency of our model by fusing layers and choosing the fastest layer implementations for our specific hardware. We will use the TensorRT API to generate the network from scratch and add all unsupported layers as plugins.

Build TensorRT engine

The only dependency needed to run this code is a working Docker environment with GPU support. We will run all compilation inside the TensorRT NGC container to avoid having to install TensorRT natively.

Run the following to get a running TensorRT container with our repo code:

cd yourworkingdirectoryhere
git clone git@github.com:isarsoft/yolov4-triton-tensorrt.git
docker run --gpus all -it --rm -v $(pwd)/yolov4-triton-tensorrt:/yolov4-triton-tensorrt nvcr.io/nvidia/tensorrt:21.10-py3

Docker will download the TensorRT container. Select the container version (in this case 21.10) to match the version of Triton that you want to use later, so that both use the same TensorRT version. NGC images with matching version tags ship the same TensorRT version.
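To keep the tags in sync, the image names can be derived from a single release string. The following is a minimal sketch; the helper name ngc_images is hypothetical, and the image names are simply the ones used in this guide.

```python
def ngc_images(release: str) -> dict:
    """Return the NGC image names used in this guide for one release tag.

    Using one release string for all images keeps the TensorRT versions
    of the build container and the Triton container aligned.
    """
    return {
        "tensorrt": f"nvcr.io/nvidia/tensorrt:{release}-py3",
        "triton": f"nvcr.io/nvidia/tritonserver:{release}-py3",
        "triton_sdk": f"nvcr.io/nvidia/tritonserver:{release}-py3-sdk",
    }


if __name__ == "__main__":
    # Print the image names for the release used throughout this guide.
    for role, image in ngc_images("21.10").items():
        print(f"{role}: {image}")
```

If you switch to another release, change the single release string and rebuild the engine inside the matching TensorRT container.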

Inside the container run the following to compile our code:

cd /yolov4-triton-tensorrt
mkdir build
cd build
cmake ..
make

This will generate two files: liblayerplugin.so and main. The library contains the plugin implementations for all layers that TensorRT does not support natively, and the executable builds the optimized engine in the next step.

Download the weights for this network from Google Drive. Instructions on how to generate this weight file from the original darknet config and weights can be found here. Place the weight file in the same folder as the executable main. Then run the following to generate a serialized TensorRT engine optimized for your GPU:

./main

This will generate a file called yolov4.engine, which is our serialized TensorRT engine. Together with liblayerplugin.so we can now deploy to Triton Inference Server.

Before we do this we can test the engine with standalone TensorRT by running:

cd /workspace/tensorrt/bin
./trtexec --loadEngine=/yolov4-triton-tensorrt/build/yolov4.engine --plugins=/yolov4-triton-tensorrt/build/liblayerplugin.so
(...)
[I] Starting inference threads
[I] Warmup completed 1 queries over 200 ms
[I] Timing trace has 204 queries over 3.00185 s
[I] Trace averages of 10 runs:
[I] Average on 10 runs - GPU latency: 7.8773 ms - Host latency: 9.45764 ms (end to end 9.48074 ms, enqueue 1.98274 ms)
[I] Average on 10 runs - GPU latency: 7.73803 ms - Host latency: 9.3154 ms (end to end 9.33945 ms, enqueue 2.02845 ms)
(...)
[I] GPU Compute
[I] min: 7.01465 ms
[I] max: 9.11838 ms
[I] mean: 7.79672 ms
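The mean GPU compute time can be turned into a rough queries-per-second figure. The sketch below parses the excerpt shown above; it assumes the log layout of that excerpt (trtexec output differs between versions), and the function names are hypothetical.

```python
import re

# Excerpt of the trtexec output shown above.
SAMPLE_LOG = """\
[I] GPU Compute
[I] min: 7.01465 ms
[I] max: 9.11838 ms
[I] mean: 7.79672 ms
"""


def mean_gpu_latency_ms(log_text: str):
    """Return the mean listed under 'GPU Compute', or None if not found."""
    in_gpu_section = False
    for line in log_text.splitlines():
        if "GPU Compute" in line:
            in_gpu_section = True
            continue
        if in_gpu_section:
            match = re.search(r"mean:\s*([0-9.]+)\s*ms", line)
            if match:
                return float(match.group(1))
    return None


def queries_per_second(latency_ms: float) -> float:
    """Queries per second for a single stream running back to back."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    mean_ms = mean_gpu_latency_ms(SAMPLE_LOG)
    print(f"mean GPU latency: {mean_ms} ms")
    print(f"approx. queries/s: {queries_per_second(mean_ms):.1f}")
```

This is only an upper bound on GPU compute for one stream; it ignores host-side transfers, which the host latency lines in the log include.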

Deploy to Triton Inference Server

We need to create our model repository file structure first:

# Create model repository
cd yourworkingdirectoryhere
mkdir -p triton-deploy/models/yolov4/1/
mkdir triton-deploy/plugins

# Copy engine and plugins
cp yolov4-triton-tensorrt/build/yolov4.engine triton-deploy/models/yolov4/1/model.plan
cp yolov4-triton-tensorrt/build/liblayerplugin.so triton-deploy/plugins/
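The same layout can be created from Python, for example as part of a deployment script. This is a minimal sketch that mirrors the shell commands above; the function name build_model_repository is hypothetical, and the demo uses placeholder files rather than a real engine.

```python
import shutil
import tempfile
from pathlib import Path


def build_model_repository(build_dir, deploy_dir, model_name="yolov4", version="1"):
    """Copy the engine and plugin into the directory layout used above.

    Result:
        <deploy_dir>/models/<model_name>/<version>/model.plan
        <deploy_dir>/plugins/liblayerplugin.so
    """
    build_dir, deploy_dir = Path(build_dir), Path(deploy_dir)
    version_dir = deploy_dir / "models" / model_name / version
    plugin_dir = deploy_dir / "plugins"
    version_dir.mkdir(parents=True, exist_ok=True)
    plugin_dir.mkdir(parents=True, exist_ok=True)

    # The serialized engine is renamed to model.plan, as in the cp command.
    plan_path = version_dir / "model.plan"
    shutil.copy2(build_dir / "yolov4.engine", plan_path)
    shutil.copy2(build_dir / "liblayerplugin.so", plugin_dir / "liblayerplugin.so")
    return plan_path


if __name__ == "__main__":
    # Demo with placeholder files in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        build = Path(tmp) / "yolov4-triton-tensorrt" / "build"
        build.mkdir(parents=True)
        (build / "yolov4.engine").write_bytes(b"placeholder engine")
        (build / "liblayerplugin.so").write_bytes(b"placeholder plugin")
        plan = build_model_repository(build, Path(tmp) / "triton-deploy")
        print(plan.relative_to(tmp))
```

For a real deployment, pass the actual build directory and the triton-deploy directory instead of the temporary paths.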

Now we can start Triton with this model repository:

docker run --gpus all --rm --shm-size=1g --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -p8000:8000 -p8001:8001 -p8002:8002 \
  -v$(pwd)/triton-deploy/models:/models \
  -v$(pwd)/triton-deploy/plugins:/plugins \
  --env LD_PRELOAD=/plugins/liblayerplugin.so \
  nvcr.io/nvidia/tritonserver:21.10-py3 \
  tritonserver --model-repository=/models --strict-model-config=false \
  --grpc-infer-allocation-pool-size=16 --log-verbose 1

This should give us a running Triton instance with our yolov4 model loaded. You can check out what to do next in the Triton Documentation.
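A quick way to confirm that the server and the model are up is to query Triton's HTTP health endpoints on port 8000. The sketch below uses only the Python standard library; the helper names are hypothetical, and it assumes Triton is reachable on 127.0.0.1 with the port mapping from the command above.

```python
import urllib.error
import urllib.request


def server_ready_url(host: str = "127.0.0.1", http_port: int = 8000) -> str:
    """URL of the server readiness endpoint."""
    return f"http://{host}:{http_port}/v2/health/ready"


def model_ready_url(model: str = "yolov4", host: str = "127.0.0.1",
                    http_port: int = 8000) -> str:
    """URL of the readiness endpoint for one model."""
    return f"http://{host}:{http_port}/v2/models/{model}/ready"


def is_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status.
        return False


if __name__ == "__main__":
    print("server ready:", is_ready(server_ready_url()))
    print("yolov4 ready:", is_ready(model_ready_url("yolov4")))
```

Both checks print False while the container is still starting or if the plugin library failed to load, so they are useful in startup scripts before sending inference requests.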

How to run model in your code

This repo contains a python client. More information here.

python client.py -o data/dog_result.jpg image data/dog.jpg

[Image: example output result]

Benchmark

To benchmark the performance of the model, we can run Triton's Performance Client.

To run the perf_client, install the Triton Python SDK (tritonclient), which ships with perf_client as a preinstalled binary.

sudo apt update
sudo apt install libb64-dev

pip install nvidia-pyindex
pip install tritonclient[all]

# Example
perf_client -m yolov4 -u 127.0.0.1:8001 -i grpc --shared-memory system --concurrency-range 4

Alternatively, you can use the Triton Client SDK Docker container.

docker run -it --ipc=host --net=host nvcr.io/nvidia/tritonserver:21.10-py3-sdk /bin/bash
cd install/bin
./perf_client (...argumentshere)
# Example
./perf_client -m yolov4 -u 127.0.0.1:8001 -i grpc --shared-memory system --concurrency-range 4

The following benchmarks were taken on a system with 2 x NVIDIA 2080 Ti GPUs and an AMD Ryzen 9 3950X 16 Core CPU.

Concurrency is the number of concurrent clients invoking inference on the Triton server via gRPC. B is the batch size. Each cell shows the total frames per second (FPS) of all clients combined and the average latency in milliseconds per request as seen by a single client.

2x NVIDIA GeForce RTX 2080 Ti

| concurrency | FP32, B=1 | FP32, B=4 | FP32, B=8 | FP16, B=1 | FP16, B=4 | FP16, B=8 |
|---|---|---|---|---|---|---|
| 1 | 62.8 FPS, 15.9 ms | 73.6 FPS, 54.1 ms | 78.4 FPS, 103 ms | 138.4 FPS, 7.22 ms | 219.2 FPS, 18.2 ms | 235.2 FPS, 33.9 ms |
| 2 | 118.8 FPS, 16.8 ms | 143.2 FPS, 55.9 ms | 152.0 FPS, 104 ms | 286.6 FPS, 6.98 ms | 438.4 FPS, 18.2 ms | 484.8 FPS, 33.0 ms |
| 4 | 127.4 FPS, 31.4 ms | 146.4 FPS, 109 ms | 158.4 FPS, 202 ms | 323.6 FPS, 12.3 ms | 479.2 FPS, 33.3 ms | 536.0 FPS, 59.6 ms |
| 8 | 127.6 FPS, 62.7 ms | 144.8 FPS, 220 ms | 156.8 FPS, 405 ms | 323.2 FPS, 24.7 ms | 475.2 FPS, 67.3 ms | 540.8 FPS, 118 ms |

1x NVIDIA GeForce RTX 2080 Ti (by setting --gpus 1)

| concurrency | FP32, B=1 | FP32, B=4 | FP32, B=8 | FP16, B=1 | FP16, B=4 | FP16, B=8 |
|---|---|---|---|---|---|---|
| 1 | 57.6 FPS, 17.3 ms | 68.0 FPS, 58.5 ms | 72.0 FPS, 111 ms | 125.4 FPS, 7.96 ms | 189.6 FPS, 21.0 ms | 208.0 FPS, 38.3 ms |
| 2 | 59.2 FPS, 33.7 ms | 69.6 FPS, 114 ms | 73.6 FPS, 217 ms | 137.6 FPS, 14.5 ms | 207.2 FPS, 38.5 ms | 228.8 FPS, 70.3 ms |
| 4 | 58.6 FPS, 68.1 ms | 69.6 FPS, 229 ms | 72.0 FPS, 436 ms | 137.0 FPS, 29.2 ms | 206.4 FPS, 77.3 ms | 227.2 FPS, 141 ms |
| 8 | 58.4 FPS, 136 ms | 68.8 FPS, 460 ms | 72.0 FPS, 874 ms | 136.8 FPS, 58.4 ms | 206.4 FPS, 154 ms | 227.2 FPS, 282 ms |
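The FPS and latency figures are tied together: with every client keeping one request in flight, total throughput is roughly concurrency times batch size divided by latency. The sketch below checks this against a few rows of the benchmark above; the function name estimated_fps is hypothetical, and the relation is an approximation rather than the way perf_client computes its numbers.

```python
def estimated_fps(concurrency: int, batch_size: int, latency_ms: float) -> float:
    """Approximate total FPS when each client always has one request in flight."""
    return concurrency * batch_size * 1000.0 / latency_ms


# (concurrency, batch size, latency in ms, reported FPS), taken from the benchmark.
ROWS = [
    (1, 1, 15.9, 62.8),    # 2 GPUs, FP32, B=1
    (2, 4, 55.9, 143.2),   # 2 GPUs, FP32, B=4
    (4, 8, 59.6, 536.0),   # 2 GPUs, FP16, B=8
    (8, 1, 136.0, 58.4),   # 1 GPU,  FP32, B=1
]

if __name__ == "__main__":
    for concurrency, batch, latency_ms, reported in ROWS:
        estimate = estimated_fps(concurrency, batch, latency_ms)
        print(f"c={concurrency} B={batch}: estimated {estimate:.1f} FPS, "
              f"reported {reported} FPS")
```

This also explains the saturation pattern: once the GPUs are fully busy, adding clients no longer raises FPS, so latency grows in proportion to concurrency.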

Contributions

  • olibartfast with a C++ client example
  • t-wata with shared memory support for the python client

Tasks in this repo

  • Layer plugin working with trtexec and Triton
  • FP16 optimization
  • Remove MISH plugin and replace by standard activation layers (see 3b in this blog for the idea)
  • INT8 optimization
  • General optimizations (using this darknet->onnx->tensorrt export with the --best flag gives 572 FPS at batch size 8 and 392 FPS at batch size 1 without full INT8 calibration)
  • YOLOv4 tiny (example is here)
  • YOLOv5
  • Add Triton client code in python
  • Add image pre and postprocessing code
  • Add mAP benchmark
  • Add BatchedNMS to move NMS to GPU
  • Add dynamic batch size support

Acknowledgments

The initial codebase is from Wang Xinyu's TensorRTx repo. He had the idea to implement YOLO using only the TensorRT API, and it's very nice that he shares this code. The YOLO layer plugin has been continuously improved by jkjung-avt in his repo tensorrt_demos. The purpose of this repo is to deploy this engine and plugin to Triton and to add further performance improvements to the TensorRT engine.
