triton-inference-server / onnxruntime_backend

Licence: BSD-3-Clause License
The Triton backend for the ONNX Runtime.

ONNX Runtime Backend

The Triton backend for the ONNX Runtime. You can learn more about Triton backends in the backend repo. Ask questions or report problems on the issues page.

Use a recent version of CMake to build and install in a local directory. Typically you will want to build an appropriate ONNX Runtime implementation as part of the build. You do this by specifying the ONNX Runtime version and the Triton container version that you want to use with the backend. You can find the combination of versions used in a particular Triton release in the TRITON_VERSION_MAP at the top of build.py in the branch matching the Triton release you are interested in. For example, to build the ONNX Runtime backend for Triton 21.05, use the versions from TRITON_VERSION_MAP in the r21.05 branch of build.py.

$ mkdir build
$ cd build
$ cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install -DTRITON_BUILD_ONNXRUNTIME_VERSION=1.9.0 -DTRITON_BUILD_CONTAINER_VERSION=21.08 ..
$ make install

The resulting install/backends/onnxruntime directory can be added to a Triton installation as /opt/tritonserver/backends/onnxruntime.

The following required Triton repositories will be pulled and used in the build. By default the "main" branch/tag will be used for each repo but the listed CMake argument can be used to override.

  • triton-inference-server/backend: -DTRITON_BACKEND_REPO_TAG=[tag]
  • triton-inference-server/core: -DTRITON_CORE_REPO_TAG=[tag]
  • triton-inference-server/common: -DTRITON_COMMON_REPO_TAG=[tag]

You can add TensorRT support to the ONNX Runtime backend by using -DTRITON_ENABLE_ONNXRUNTIME_TENSORRT=ON. You can add OpenVINO support by using -DTRITON_ENABLE_ONNXRUNTIME_OPENVINO=ON -DTRITON_BUILD_ONNXRUNTIME_OPENVINO_VERSION=<version>, where <version> is the OpenVINO version to use and should match the TRITON_VERSION_MAP entry as described above. So, to build with both TensorRT and OpenVINO support:

$ mkdir build
$ cd build
$ cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install -DTRITON_BUILD_ONNXRUNTIME_VERSION=1.9.0 -DTRITON_BUILD_CONTAINER_VERSION=21.08 -DTRITON_ENABLE_ONNXRUNTIME_TENSORRT=ON -DTRITON_ENABLE_ONNXRUNTIME_OPENVINO=ON -DTRITON_BUILD_ONNXRUNTIME_OPENVINO_VERSION=2021.2.200 ..
$ make install
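The configure step above can also be assembled programmatically, for example from a build script. The following is a minimal Python sketch, not part of the backend: the function name cmake_configure_args and its arguments are hypothetical, and only the -D options already shown above are used.

```python
import shlex


def cmake_configure_args(ort_version, container_version, install_prefix,
                         tensorrt=False, openvino_version=None,
                         source_dir=".."):
    """Return the cmake configure command as an argument list.

    Hypothetical helper: it only strings together the -D options
    documented in the text above.
    """
    args = [
        "cmake",
        f"-DCMAKE_INSTALL_PREFIX:PATH={install_prefix}",
        f"-DTRITON_BUILD_ONNXRUNTIME_VERSION={ort_version}",
        f"-DTRITON_BUILD_CONTAINER_VERSION={container_version}",
    ]
    if tensorrt:
        args.append("-DTRITON_ENABLE_ONNXRUNTIME_TENSORRT=ON")
    if openvino_version is not None:
        # OpenVINO support needs both the switch and the version.
        args.append("-DTRITON_ENABLE_ONNXRUNTIME_OPENVINO=ON")
        args.append(
            f"-DTRITON_BUILD_ONNXRUNTIME_OPENVINO_VERSION={openvino_version}")
    args.append(source_dir)
    return args


# Example: the TensorRT + OpenVINO build from the text.
# The install prefix is a placeholder path.
example = cmake_configure_args("1.9.0", "21.08", "/tmp/build/install",
                               tensorrt=True, openvino_version="2021.2.200")
print(shlex.join(example))
```

The list form can be passed directly to subprocess.run from inside the build directory; shlex.join is used here only to print a copy-pasteable command line.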

ONNX Runtime with TensorRT optimization

TensorRT can be used in conjunction with an ONNX model to further optimize performance. To enable TensorRT optimization you must set the model configuration appropriately. Several optimizations are available for TensorRT, such as selection of the compute precision and workspace size. The optimization parameters and their descriptions are as follows.

  • precision_mode: The precision used for optimization. Allowed values are "FP32", "FP16" and "INT8". Default value is "FP32".
  • max_workspace_size_bytes: The maximum GPU memory the model can use temporarily during execution. Default value is 1GB.
  • int8_calibration_table_name: Specify the INT8 calibration table name. Applicable when precision_mode=="INT8" and the model does not contain Q/DQ nodes. If a calibration table is provided for a model with Q/DQ nodes, ORT session creation will fail.
  • int8_use_native_calibration_table: Calibration table to use. Allowed values are 1 (use native TensorRT generated calibration table) and 0 (use ORT generated calibration table). Default is 0. Note: The latest calibration table file needs to be copied to trt_engine_cache_path before inference. A calibration table is specific to a model and a calibration data set. Whenever a new calibration table is generated, the old file in that path should be removed or replaced.
  • trt_engine_cache_enable: Enable engine caching.
  • trt_engine_cache_path: Specify engine cache path.

The section of model config file specifying these parameters will look like:

.
.
.
optimization { execution_accelerators {
  gpu_execution_accelerator : [ {
    name : "tensorrt"
    parameters { key: "precision_mode" value: "FP16" }
    parameters { key: "max_workspace_size_bytes" value: "1073741824" }}
  ]
}}
.
.
.
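As an illustration of the parameter rules above, the following Python sketch renders the accelerator block from a dictionary and rejects a precision_mode outside the allowed values. The helper name render_tensorrt_accelerator is hypothetical, and rejecting a calibration table name for non-INT8 precision is a choice made for this sketch, not documented backend behavior.

```python
ALLOWED_PRECISION_MODES = ("FP32", "FP16", "INT8")


def render_tensorrt_accelerator(params):
    """Render the TensorRT accelerator section of a model config.

    Hypothetical helper; params maps parameter names to values,
    and every value is written out as a quoted string.
    """
    precision = str(params.get("precision_mode", "FP32"))
    if precision not in ALLOWED_PRECISION_MODES:
        raise ValueError(
            f"precision_mode must be one of {ALLOWED_PRECISION_MODES}, "
            f"got {precision!r}")
    # Sketch-level check: the text says the table name applies to INT8 only.
    if "int8_calibration_table_name" in params and precision != "INT8":
        raise ValueError(
            "int8_calibration_table_name applies only to precision_mode INT8")

    lines = [
        "optimization { execution_accelerators {",
        "  gpu_execution_accelerator : [ {",
        '    name : "tensorrt"',
    ]
    for key, value in params.items():
        lines.append(f'    parameters {{ key: "{key}" value: "{value}" }}')
    lines += ["  } ]", "}}"]
    return "\n".join(lines)


# 1 GB workspace written in bytes: 1 << 30 == 1073741824.
print(render_tensorrt_accelerator({
    "precision_mode": "FP16",
    "max_workspace_size_bytes": 1 << 30,
}))
```

The printed block is equivalent to the fragment shown above, with the closing brace of the accelerator entry placed on its own line.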

ONNX Runtime with CUDA Execution Provider optimization

When GPU is enabled for ORT, the CUDA execution provider (EP) is enabled. If TensorRT is also enabled, the CUDA EP is treated as a fallback option (it only comes into play for nodes that TensorRT cannot execute). If TensorRT is not enabled, the CUDA EP is the primary EP that executes the model. ORT exposes configuration options for the CUDA EP so that it can be tuned for the specific model and usage scenario. To enable CUDA EP optimization you must set the model configuration appropriately. Several optimizations are available, such as the memory limit and the cuDNN convolution algorithm search. The optimization parameters and their descriptions are as follows.

  • cudnn_conv_algo_search: CUDA convolution algorithm search configuration. Available options are: 0 = EXHAUSTIVE (expensive exhaustive benchmarking using cudnnFindConvolutionForwardAlgorithmEx), 1 = HEURISTIC (lightweight heuristic-based search using cudnnGetConvolutionForwardAlgorithm_v7), 2 = DEFAULT (default algorithm using CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM). Defaults to 0.

  • gpu_mem_limit: CUDA memory limit in bytes. To use all available memory, pass in the maximum size_t value. Defaults to SIZE_MAX.

  • arena_extend_strategy: Strategy used to grow the memory arena. Available options are: 0 = kNextPowerOfTwo, 1 = kSameAsRequested. Defaults to 0.

  • do_copy_in_default_stream: Flag indicating if copying needs to take place on the same stream as the compute stream in the CUDA EP. Available options are: 0 = Use separate streams for copying and compute, 1 = Use the same stream for copying and compute. Defaults to 1.

The section of model config file specifying these parameters will look like:

.
.
.
parameters { key: "cudnn_conv_algo_search" value: { string_value: "0" } }
parameters { key: "gpu_mem_limit" value: { string_value: "4294967200" } }
.
.
.
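The allowed values listed above can be checked before they are written into a model config. The sketch below is a hypothetical Python helper (render_cuda_ep_parameters is not part of the backend) that validates each option and emits parameters lines in the same form as the fragment above. It assumes a 64-bit size_t for the gpu_mem_limit upper bound.

```python
SIZE_MAX_64 = 2**64 - 1  # assumption: size_t is 64 bits wide

# Allowed integer values per option, as listed in the text.
CUDA_EP_CHOICES = {
    "cudnn_conv_algo_search": {0: "EXHAUSTIVE", 1: "HEURISTIC", 2: "DEFAULT"},
    "arena_extend_strategy": {0: "kNextPowerOfTwo", 1: "kSameAsRequested"},
    "do_copy_in_default_stream": {0: "separate streams", 1: "same stream"},
}


def render_cuda_ep_parameters(options):
    """Validate CUDA EP options and render them as parameters lines."""
    lines = []
    for key, raw in options.items():
        value = int(raw)
        if key == "gpu_mem_limit":
            if not 0 <= value <= SIZE_MAX_64:
                raise ValueError(f"gpu_mem_limit out of range: {value}")
        elif key in CUDA_EP_CHOICES:
            if value not in CUDA_EP_CHOICES[key]:
                allowed = sorted(CUDA_EP_CHOICES[key])
                raise ValueError(f"{key} must be one of {allowed}, got {value}")
        else:
            raise KeyError(f"unknown CUDA EP option: {key}")
        lines.append(
            f'parameters {{ key: "{key}" value: {{ string_value: "{value}" }} }}')
    return "\n".join(lines)


# Example: heuristic algorithm search and a 4 GiB memory limit.
print(render_cuda_ep_parameters({
    "cudnn_conv_algo_search": 1,
    "gpu_mem_limit": 4 * 1024**3,
}))
```

Values are passed as integers and written out as strings, since the model config stores each parameter as a string_value.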

ONNX Runtime with OpenVINO optimization

OpenVINO can be used in conjunction with an ONNX model to further optimize performance. To enable OpenVINO optimization you must set the model configuration as shown below.

.
.
.
optimization { execution_accelerators {
  cpu_execution_accelerator : [ {
    name : "openvino"
  } ]
}}
.
.
.

Other Optimization Options with ONNX Runtime

Details regarding when to use these options and what to expect from them can be found here

Model Config Options

  • intra_op_thread_count: Sets the number of threads used to parallelize the execution within nodes. A value of 0 means ORT will pick a default which is number of cores.
  • inter_op_thread_count: Sets the number of threads used to parallelize the execution of the graph (across nodes). If sequential execution is enabled this value is ignored. A value of 0 means ORT will pick a default which is number of cores.
  • execution_mode: Controls whether operators in the graph are executed sequentially or in parallel. When the model has many branches, setting this option to 1 (i.e. "parallel") will usually give better performance. Default is 0, which is "sequential execution".
  • level: Refers to the graph optimization level. By default all optimizations are enabled. Allowed values are -1 and 1. -1 refers to BASIC optimizations and 1 refers to basic plus extended optimizations like fusions. Please find the details here

optimization {
  graph : {
    level : 1
}}

parameters { key: "intra_op_thread_count" value: { string_value: "0" } }
parameters { key: "execution_mode" value: { string_value: "0" } }
parameters { key: "inter_op_thread_count" value: { string_value: "0" } }

  • enable_mem_arena: Use 1 to enable the arena and 0 to disable. See this for more information.
  • enable_mem_pattern: Use 1 to enable memory pattern and 0 to disable. See this for more information.
  • memory.enable_memory_arena_shrinkage: See this for more information.

Command line options

When the intra and inter op thread counts are set to 0 or a value higher than 1, ORT by default creates a thread pool per session. This may not be ideal in every scenario, so ORT also supports global thread pools. When global thread pools are enabled, ORT creates one global thread pool that is shared by every session. Use the backend config to enable the global thread pool. When it is enabled, the intra and inter op thread counts should also be provided via the backend config; values provided in the model config are ignored.

--backend-config=onnxruntime,enable-global-threadpool=<0,1>
--backend-config=onnxruntime,intra_op_thread_count=<int>
--backend-config=onnxruntime,inter_op_thread_count=<int>
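
Because the thread counts must travel with the global thread pool switch, it can help to generate the three options together. The following Python sketch does that; the helper name global_threadpool_flags, the non-negative check, and the /models repository path are assumptions made for illustration.

```python
def global_threadpool_flags(intra_op_thread_count, inter_op_thread_count):
    """Return the backend-config options that enable the global thread pool.

    Hypothetical helper. The non-negative check is an assumption of this
    sketch; the text only states that 0 lets ORT pick a default.
    """
    counts = {
        "intra_op_thread_count": intra_op_thread_count,
        "inter_op_thread_count": inter_op_thread_count,
    }
    for name, value in counts.items():
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer")
    flags = ["--backend-config=onnxruntime,enable-global-threadpool=1"]
    for name, value in counts.items():
        flags.append(f"--backend-config=onnxruntime,{name}={value}")
    return flags


# Example server command line; /models is a placeholder repository path.
command = ["tritonserver", "--model-repository=/models",
           *global_threadpool_flags(4, 2)]
print(" ".join(command))
```

Keeping the three options in one helper makes it harder to enable the global thread pool while leaving the thread counts only in the model config, where they would be ignored.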