
autodeployai / ai-serving

License: Apache-2.0
Serving AI/ML models in the open standard formats PMML and ONNX with both HTTP (REST API) and gRPC endpoints

Programming Language: Scala

Projects that are alternatives to or similar to ai-serving

optimum
🏎️ Accelerate training and inference of 🤗 Transformers with easy to use hardware optimization tools
Stars: ✭ 567 (+364.75%)
Mutual labels:  inference, onnx
Yolov5 Rt Stack
Yet another yolov5, with its runtime stack for libtorch, onnx, tvm and specialized accelerators. You like torchvision's retinanet? You like yolov5? You love yolort!
Stars: ✭ 107 (-12.3%)
Mutual labels:  inference, onnx
serving-runtime
Exposes a serialized machine learning model through a HTTP API.
Stars: ✭ 15 (-87.7%)
Mutual labels:  inference, onnx
ai-deployment
Focused on AI model deployment and putting models into production
Stars: ✭ 149 (+22.13%)
Mutual labels:  pmml, onnx
Volksdep
volksdep is an open-source toolbox for deploying and accelerating PyTorch, ONNX and TensorFlow models with TensorRT.
Stars: ✭ 195 (+59.84%)
Mutual labels:  inference, onnx
onnxruntime-rs
Rust wrapper for Microsoft's ONNX Runtime (version 1.8)
Stars: ✭ 149 (+22.13%)
Mutual labels:  inference, onnx
Mivisionx
MIVisionX toolkit is a set of comprehensive computer vision and machine intelligence libraries, utilities, and applications bundled into a single toolkit. AMD MIVisionX also delivers a highly optimized open-source implementation of the Khronos OpenVX™ and OpenVX™ Extensions.
Stars: ✭ 100 (-18.03%)
Mutual labels:  inference, onnx
nn-Meter
A DNN inference latency prediction toolkit for accurately modeling and predicting the latency on diverse edge devices.
Stars: ✭ 211 (+72.95%)
Mutual labels:  inference, onnx-models
Ncnn
ncnn is a high-performance neural network inference framework optimized for the mobile platform
Stars: ✭ 13,376 (+10863.93%)
Mutual labels:  inference, onnx
Onnxt5
Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.
Stars: ✭ 143 (+17.21%)
Mutual labels:  inference, onnx
pytorch-to-javascript-with-onnx-js
Run PyTorch models in the browser using ONNX.js
Stars: ✭ 162 (+32.79%)
Mutual labels:  onnx, onnx-models
mediapipe plus
The purpose of this project is to apply mediapipe to more AI chips.
Stars: ✭ 38 (-68.85%)
Mutual labels:  inference, onnx
Multi Model Server
Multi Model Server is a tool for serving neural net models for inference
Stars: ✭ 770 (+531.15%)
Mutual labels:  inference, onnx
Ml Model Ci
MLModelCI is a complete MLOps platform for managing, converting, profiling, and deploying MLaaS (Machine Learning-as-a-Service), bridging the gap between current ML training and serving systems.
Stars: ✭ 122 (+0%)
Mutual labels:  inference, onnx
Libonnx
A lightweight, portable pure C99 onnx inference engine for embedded devices with hardware acceleration support.
Stars: ✭ 217 (+77.87%)
Mutual labels:  inference, onnx
fastT5
⚡ boost inference speed of T5 models by 5x & reduce the model size by 3x.
Stars: ✭ 421 (+245.08%)
Mutual labels:  inference, onnx
IGoR
IGoR is a C++ software designed to infer V(D)J recombination related processes from sequencing data. Find full documentation at:
Stars: ✭ 42 (-65.57%)
Mutual labels:  inference
barracuda-style-transfer
Companion code for the Unity Style Transfer blog post, showcasing realtime style transfer using Barracuda.
Stars: ✭ 126 (+3.28%)
Mutual labels:  inference
Molecules Dataset Collection
Collection of data sets of molecules for a validation of properties inference
Stars: ✭ 69 (-43.44%)
Mutual labels:  inference
ims
📚 Introduction to Modern Statistics - A college-level open-source textbook with a modern approach highlighting multivariable relationships and simulation-based inference.
Stars: ✭ 509 (+317.21%)
Mutual labels:  inference

AI-Serving

Serving AI/ML models in the open standard formats PMML and ONNX with both HTTP (REST API) and gRPC endpoints.

Table of Contents

  • Features
  • Prerequisites
  • Installation
  • Start Server
  • Server Configurations
  • PMML
  • ONNX
  • Advanced ONNX Runtime Configuration
  • REST APIs
  • gRPC APIs
  • Examples
  • Deploy and Manage AI/ML models at scale
  • Support
  • License

Features

AI-Serving is a flexible, high-performance inferencing system for machine learning and deep learning models, designed for production environments. It provides out-of-the-box integration with PMML and ONNX models and can be easily extended to serve other model formats; the next candidate format could be PFA.

Prerequisites

  • Java >= 1.8

Installation

Install using Docker

The easiest and most straightforward way to use AI-Serving is with the Docker images.

Install from Source

Install SBT

The sbt build system is required. After sbt is installed, clone this repository and change into its root directory:

cd REPO_ROOT

Build Assembly

AI-Serving depends on ONNX Runtime to support ONNX models, and the default CPU accelerator (OpenMP) is used for ONNX Runtime:

sbt clean assembly

Set the property -Dgpu=true to use the GPU accelerator (CUDA) for ONNX Runtime:

sbt -Dgpu=true clean assembly

Run set test in assembly := {} to skip unit tests when generating the assembly jar:

sbt -Dgpu=true 'set test in assembly := {}' clean assembly

An assembly jar will be generated:

$REPO_ROOT/target/scala-2.13/ai-serving-assembly-<version>.jar

Start Server

Simply run with the default CPU backend for ONNX models:

java -jar ai-serving-assembly-<version>.jar

Run with the GPU backend (CUDA) for ONNX models:

java -Donnxruntime.backend=cuda -jar ai-serving-assembly-<version>.jar

Several other execution backends are available: TensorRT, DirectML, Dnnl, and so on. See Advanced ONNX Runtime Configuration for details.

Server Configurations

By default, the HTTP endpoint listens on http://0.0.0.0:9090/ and the gRPC port is 9091. You can customize the options defined in application.conf. There are several ways to override the defaults; one is to create a new config file based on the default one, then:

java -Dconfig.file=/path/to/config-file -jar ai-serving-assembly-<version>.jar

Another is to override individual options with Java system properties, for example:

java -Dservice.http.port=9000 -Dservice.grpc.port=9001 -Dservice.home="/path/to/writable-directory" -jar ai-serving-assembly-<version>.jar

AI-Serving is designed to be persistent and recoverable, so it needs a place to save all served models. That location is specified by the property service.home, which defaults to /opt/ai-serving; the directory must be writable.

PMML

AI-Serving integrates PMML4S to score PMML models. PMML4S is a lightweight, clean, and efficient implementation of the PMML specification from 2.0 through the latest 4.4.1.

PMML4S is written in pure Scala and runs on the JVM, so AI-Serving needs no special configuration to support PMML models.

ONNX

AI-Serving leverages ONNX Runtime to make predictions for ONNX models. ONNX Runtime is a performance-focused inference engine for ONNX models.

ONNX Runtime is implemented in C/C++, and AI-Serving calls the ONNX Runtime Java API to support ONNX models. ONNX Runtime supports various architectures and multiple hardware accelerators; refer to the table on aka.ms/onnxruntime for details.

Since both CPU and GPU binaries for x64 are distributed to Maven Central, AI-Serving depends on them directly. The CPU build is used by default, and you can switch to GPU via the property -Dgpu=true described above.

If you need a different OS/architecture or hardware accelerator, refer to the next topic, Advanced ONNX Runtime Configuration; otherwise, you can ignore it.

Advanced ONNX Runtime Configuration

Build ONNX Runtime

You need to build both native libraries, the JNI shared library and the onnxruntime shared library, for your OS/architecture:

Please refer to the ONNX Runtime build instructions; the --build_java option must always be specified.

Load ONNX Runtime

See Build Output for a list of all generated outputs, then explicitly specify the paths to the shared libraries when starting AI-Serving:

java -Donnxruntime.backend=execution_backend -Donnxruntime.native.onnxruntime4j_jni.path=/path/to/onnxruntime4j_jni -Donnxruntime.native.onnxruntime.path=/path/to/onnxruntime -jar ai-serving-assembly-<version>.jar

REST APIs

When an error occurs, all APIs will return a JSON object as follows:

{
  "error": <an error message string>
}

1. Validate API

URL:

PUT http://host:port/v1/validate

Request:

The request body is the model, with a Content-Type header that tells the server which format to handle:

  • Content-Type: application/xml, or Content-Type: text/xml, the input is treated as a PMML model.
  • Content-Type: application/octet-stream, application/vnd.google.protobuf or application/x-protobuf, the input is processed as an ONNX model.

If no Content-Type is specified, the server will try to probe the content type from the input entity, but probing could fail.

Response:

Model metadata includes the model type, predictor list, output list, and so on.

{
  "type": <model_type>
  "predictors": [
    {
      "name": <predictor_name1>,
      "type": <field_type>,
      ...
    },
    {
      "name": <predictor_name2>,
      "type": <field_type>,
      ...
    },
    ...
  ],
  "outputs": [
    {
      "name": <output_name1>,
      "type": <field_type>,
      ...
    },
    {
      "name": <output_name2>,
      "type": <field_type>,
      ...
    },
    ...
  ],
  ...
}
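
For instance, here is a minimal sketch of calling the Validate API with the Python requests library; the library choice, the local server address, and the file name single_iris_dectree.xml (taken from the examples later in this document) are assumptions:

# Minimal sketch: validate a PMML model over the REST API.
# Assumes AI-Serving listens on localhost:9090 and that single_iris_dectree.xml
# (from the examples below) is in the current directory.
import requests

with open("single_iris_dectree.xml", "rb") as f:
    resp = requests.put(
        "http://localhost:9090/v1/validate",
        data=f.read(),
        headers={"Content-Type": "application/xml"},  # tells the server this is a PMML model
    )

resp.raise_for_status()
metadata = resp.json()
print(metadata["type"], [p["name"] for p in metadata["predictors"]])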

2. Deploy API

Deployment URL:

PUT http://host:port/v1/models/${MODEL_NAME}

Request:

The request body is the model with a Content-Type header; see the validation request above for details.

Response:

{
  // The specified servable name
  "name": <model_name>
  
  // The deployed version starts from 1
  "version": <model_version>
}

Undeployment URL:

DELETE http://host:port/v1/models/${MODEL_NAME}

Response:

204 No Content
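
The same operations can be scripted; below is a minimal sketch using the Python requests library, where the model name iris, the file single_iris_dectree.xml, and the local server address are assumptions taken from the curl examples later in this document:

# Minimal sketch: deploy a PMML model under the servable name "iris", then undeploy it.
# Assumes AI-Serving listens on localhost:9090.
import requests

MODEL_URL = "http://localhost:9090/v1/models/iris"

with open("single_iris_dectree.xml", "rb") as f:
    resp = requests.put(MODEL_URL, data=f.read(), headers={"Content-Type": "application/xml"})
resp.raise_for_status()
print(resp.json())        # e.g. {"name": "iris", "version": 1}

resp = requests.delete(MODEL_URL)
print(resp.status_code)   # 204 on successful undeployment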

3. Model Metadata API

URL:

GET http://host:port/v1/models[/${MODEL_NAME}[/versions/${MODEL_VERSION}]]
  • If /${MODEL_NAME}/versions/${MODEL_VERSION} is missing, all models are returned.
  • If /versions/${MODEL_VERSION} is missing, all versions of the specified model are returned.
  • Otherwise, only the specified version of the model is returned.

Response:

// All models are returned from GET http://host:port/v1/models
[
  {
    "name": <model_name1>,
    "versions": [
      {
        "version": 1,
        ...
      },
      {
        "version": 2,
        ...
      },
      ...
    ]
  },
  {
    "name": <model_name2>,
    "versions": [
      {
        "version": 1,
        ...
      },
      {
        "version": 2,
        ...
      },
      ...
    ]
  },
  ...
]
// All versions of the specified model are returned from GET http://host:port/v1/models/${MODEL_NAME}
{
  "name": <model_name>,
  "versions": [
    {
      "version": 1,
      ...
    },
    {
      "version": 2,
      ...
    },
    ...
  ]
}
// The specified version of the model is returned from GET http://host:port/v1/models/${MODEL_NAME}/versions/${MODEL_VERSION}
{
  "name": <model_name>,
  "versions": [
    {
      "version": <model_version>,
      ...
    }
  ]
}
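
As an illustration, a short sketch that queries the three metadata endpoints with the Python requests library; the model name iris and the version number are assumptions:

# Minimal sketch: query model metadata at the three levels described above.
# Assumes AI-Serving listens on localhost:9090 and a model named "iris" is deployed.
import requests

BASE = "http://localhost:9090/v1/models"

all_models = requests.get(BASE).json()                    # all models
iris_versions = requests.get(BASE + "/iris").json()       # all versions of "iris"
iris_v1 = requests.get(BASE + "/iris/versions/1").json()  # version 1 only

print([m["name"] for m in all_models])
print(len(iris_versions["versions"]), iris_v1["versions"][0]["version"])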

4. Predict API

URL:

POST http://host:port/v1/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]

/versions/${MODEL_VERSION} is optional. If omitted, the latest version is used.

Request:

The request body can be in one of two formats, JSON or binary. The HTTP header Content-Type tells the server which format to handle, so it is required for all requests.

  • Content-Type: application/json. The request body must be a JSON object formatted as follows:

    {
      "X": {
        "records": [
          {
            "predictor_name1": <value>|<(nested)list>,
            "predictor_name2": <value>|<(nested)list>,
            ...
          },
          {
            "predictor_name1": <value>|<(nested)list>,
            "predictor_name2": <value>|<(nested)list>,
            ...
          },
          ...
        ],
        "columns": [ "predictor_name1", "predictor_name2", ... ],
        "data": [ 
          [ <value>|<(nested)list>, <value>|<(nested)list>, ... ], 
          [ <value>|<(nested)list>, <value>|<(nested)list>, ... ], 
          ... 
        ]
      },
      // Output filters to specify which output fields need to be returned.
      // If the list is empty, all outputs will be included.
      "filter": <list>
    }
    

    The X can contain more than one record. As shown above, two formats are supported; you can use either one, but the split format is usually more compact for multiple records (a sketch comparing the two follows this list).

    • records : list like [{column -> value}, … , {column -> value}]
    • split : dict like {columns -> [columns], data -> [values]}
  • Content-Type: application/octet-stream, application/vnd.google.protobuf or application/x-protobuf.

    The request body must be the protobuf message PredictRequest of the gRPC API. Besides common scalar values, it can carry the standard onnx.TensorProto value directly.

  • Otherwise, an error will be returned.
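
To make the two JSON layouts concrete, here is a minimal sketch that sends the same two Iris records in the records format and in the split format; the Python requests library, the field names, the model name iris, and the local server address are assumptions taken from the PMML examples later in this document:

# Minimal sketch: the same prediction request in "records" and "split" formats.
# Assumes the Iris PMML model from the examples below is deployed as "iris"
# and AI-Serving listens on localhost:9090.
import requests

URL = "http://localhost:9090/v1/models/iris"

records_payload = {
    "X": [
        {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2},
        {"sepal_length": 7.0, "sepal_width": 3.2, "petal_length": 4.7, "petal_width": 1.4},
    ]
}

split_payload = {
    "X": {
        "columns": ["sepal_length", "sepal_width", "petal_length", "petal_width"],
        "data": [[5.1, 3.5, 1.4, 0.2], [7.0, 3.2, 4.7, 1.4]],
    },
    "filter": ["predicted_class"],  # only return this output field
}

for payload in (records_payload, split_payload):
    resp = requests.post(URL, json=payload)  # requests sets Content-Type: application/json
    print(resp.json())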

Response:

The server always returns the same content type as the request.

  • For the JSON format, the response body is a JSON object formatted as follows:
{
  "result": {
    "records": [
      {
        "output_name1": <value>|<(nested)list>,
        "output_name2": <value>|<(nested)list>,
        ...
      },
      {
        "output_name1": <value>|<(nested)list>,
        "output_name2": <value>|<(nested)list>,
        ...
      },
      ...
    ],
    "columns": [ "output_name1", "output_name2", ... ],
    "data": [ 
      [ <value>|<(nested)list>, <value>|<(nested)list>, ... ], 
      [ <value>|<(nested)list>, <value>|<(nested)list>, ... ], 
      ... 
    ]
  }
}
  • For the binary format, the response body is the protobuf message PredictResponse of the gRPC API.

Generally speaking, the binary payload has lower latency, especially for large tensor values in ONNX models, while the JSON format is easier for humans to read.

gRPC APIs

Please refer to the protobuf file ai-serving.proto for details. You can generate a client in your favorite language and make gRPC calls to the server. To learn how to generate client code and call the server, please refer to the gRPC tutorials.
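
For a Python client, a minimal sketch of generating the stubs with grpcio-tools and opening a channel to the gRPC port is shown below; the package names and output paths are assumptions, and the generated module names depend on the proto package, so they are not shown:

# Minimal sketch: generate Python client code from ai-serving.proto and connect to the server.
# Assumes pip install grpcio grpcio-tools, and ai-serving.proto (plus any protos it imports)
# in the current directory.
import grpc
from grpc_tools import protoc

protoc.main([
    "protoc",
    "-I.",                 # directory containing ai-serving.proto
    "--python_out=.",
    "--grpc_python_out=.",
    "ai-serving.proto",
])

# 9091 is the default gRPC port (see Server Configurations above).
channel = grpc.insecure_channel("localhost:9091")
# Create a stub from the generated *_pb2_grpc module and call it as described in the gRPC tutorials.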

Examples

Python
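
Below is a minimal end-to-end sketch in Python using the requests library, assuming the server started by the Docker command in the next section and the test file single_iris_dectree.xml from $REPO_ROOT/src/test/resources:

# Minimal end-to-end sketch: deploy the Iris PMML model, then request a prediction.
# Assumes AI-Serving is running locally (see the Docker command below) and that
# single_iris_dectree.xml is in the current directory.
import requests

BASE = "http://localhost:9090/v1"

# Deploy the PMML model under the servable name "iris".
with open("single_iris_dectree.xml", "rb") as f:
    deployment = requests.put(
        BASE + "/models/iris",
        data=f.read(),
        headers={"Content-Type": "application/xml"},
    ).json()
print(deployment)  # e.g. {"name": "iris", "version": 1}

# Predict with a single record; the latest version is used when no version is given.
payload = {"X": [{"sepal_length": 5.1, "sepal_width": 3.5,
                  "petal_length": 1.4, "petal_width": 0.2}]}
result = requests.post(BASE + "/models/iris", json=payload).json()
print(result["result"])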

Curl

Start AI-Serving

We will use Docker to run AI-Serving:

docker pull autodeployai/ai-serving:latest
docker run --rm -it -v /opt/ai-serving:/opt/ai-serving -p 9090:9090 -p 9091:9091 autodeployai/ai-serving:latest

16:06:47.722 INFO  AI-Serving-akka.actor.default-dispatcher-5 akka.event.slf4j.Slf4jLogger             Slf4jLogger started
16:06:47.833 INFO  main            ai.autodeploy.serving.AIServer$          Predicting thread pool size: 16
16:06:48.305 INFO  main            a.autodeploy.serving.protobuf.GrpcServer AI-Serving grpc server started, listening on 9091
16:06:49.433 INFO  main            ai.autodeploy.serving.AIServer$          AI-Serving http server started, listening on http://0.0.0.0:9090/

Make REST API calls to AI-Serving

In a different terminal, run cd $REPO_ROOT/src/test/resources, then use the curl tool to make REST API calls.

Scoring PMML models

We will use the Iris decision tree model single_iris_dectree.xml to see REST APIs in action.

  • Validate the PMML model
curl -X PUT --data-binary @single_iris_dectree.xml -H "Content-Type: application/xml"  http://localhost:9090/v1/validate
{
  "algorithm": "TreeModel",
  "app": "KNIME",
  "appVersion": "2.8.0",
  "copyright": "KNIME",
  "formatVersion": "4.1",
  "functionName": "classification",
  "outputs": [
    {
      "name": "predicted_class",
      "optype": "nominal",
      "type": "string"
    },
    {
      "name": "probability",
      "optype": "continuous",
      "type": "real"
    },
    {
      "name": "probability_Iris-setosa",
      "optype": "continuous",
      "type": "real"
    },
    {
      "name": "probability_Iris-versicolor",
      "optype": "continuous",
      "type": "real"
    },
    {
      "name": "probability_Iris-virginica",
      "optype": "continuous",
      "type": "real"
    },
    {
      "name": "node_id",
      "optype": "nominal",
      "type": "string"
    }
  ],
  "predictors": [
    {
      "name": "sepal_length",
      "optype": "continuous",
      "type": "double",
      "values": "[4.3,7.9]"
    },
    {
      "name": "sepal_width",
      "optype": "continuous",
      "type": "double",
      "values": "[2.0,4.4]"
    },
    {
      "name": "petal_length",
      "optype": "continuous",
      "type": "double",
      "values": "[1.0,6.9]"
    },
    {
      "name": "petal_width",
      "optype": "continuous",
      "type": "double",
      "values": "[0.1,2.5]"
    }
  ],
  "runtime": "PMML4S",
  "serialization": "pmml",
  "targets": [
    {
      "name": "class",
      "optype": "nominal",
      "type": "string",
      "values": "Iris-setosa,Iris-versicolor,Iris-virginica"
    }
  ],
  "type": "PMML"
}
  • Deploy the PMML model
curl -X PUT --data-binary @single_iris_dectree.xml -H "Content-Type: application/xml"  http://localhost:9090/v1/models/iris
{
  "name": "iris",
  "version": 1
}
  • Get metadata of the PMML model
curl -X GET http://localhost:9090/v1/models/iris
{
  "createdAt": "2020-04-23T06:29:32",
  "id": "56fa4917-e904-4364-9c15-4c87d84ec2c4",
  "latestVersion": 1,
  "name": "iris",
  "updateAt": "2020-04-23T06:29:32",
  "versions": [
    {
      "algorithm": "TreeModel",
      "app": "KNIME",
      "appVersion": "2.8.0",
      "copyright": "KNIME",
      "createdAt": "2020-04-23T06:29:32",
      "formatVersion": "4.1",
      "functionName": "classification",
      "hash": "fc44c33123836be368d3f24829360020",
      "outputs": [
        {
          "name": "predicted_class",
          "optype": "nominal",
          "type": "string"
        },
        {
          "name": "probability",
          "optype": "continuous",
          "type": "real"
        },
        {
          "name": "probability_Iris-setosa",
          "optype": "continuous",
          "type": "real"
        },
        {
          "name": "probability_Iris-versicolor",
          "optype": "continuous",
          "type": "real"
        },
        {
          "name": "probability_Iris-virginica",
          "optype": "continuous",
          "type": "real"
        },
        {
          "name": "node_id",
          "optype": "nominal",
          "type": "string"
        }
      ],
      "predictors": [
        {
          "name": "sepal_length",
          "optype": "continuous",
          "type": "double",
          "values": "[4.3,7.9]"
        },
        {
          "name": "sepal_width",
          "optype": "continuous",
          "type": "double",
          "values": "[2.0,4.4]"
        },
        {
          "name": "petal_length",
          "optype": "continuous",
          "type": "double",
          "values": "[1.0,6.9]"
        },
        {
          "name": "petal_width",
          "optype": "continuous",
          "type": "double",
          "values": "[0.1,2.5]"
        }
      ],
      "runtime": "PMML4S",
      "serialization": "pmml",
      "size": 3497,
      "targets": [
        {
          "name": "class",
          "optype": "nominal",
          "type": "string",
          "values": "Iris-setosa,Iris-versicolor,Iris-virginica"
        }
      ],
      "type": "PMML",
      "version": 1
    }
  ]
}
  • Predict the PMML model using the JSON payload in records
curl -X POST -d '{"X": [{"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2}]}' -H "Content-Type: application/json"  http://localhost:9090/v1/models/iris
{
  "result": [
    {
      "node_id": "1",
      "probability_Iris-setosa": 1.0,
      "predicted_class": "Iris-setosa",
      "probability_Iris-virginica": 0.0,
      "probability_Iris-versicolor": 0.0,
      "probability": 1.0
    }
  ]
}
  • Predict the PMML model using the JSON payload in split with filters
curl -X POST -d '{"X": {"columns": ["sepal_length", "sepal_width", "petal_length", "petal_width"],"data":[[5.1, 3.5, 1.4, 0.2], [7, 3.2, 4.7, 1.4]]}, "filter": ["predicted_class"]}' -H "Content-Type: application/json"  http://localhost:9090/v1/models/iris
{
  "result": {
    "columns": [
      "predicted_class"
    ],
    "data": [
      [
        "Iris-setosa"
      ],
      [
        "Iris-versicolor"
      ]
    ]
  }
}

Scoring ONNX models

We will use the pre-trained MNIST Handwritten Digit Recognition ONNX Model to see REST APIs in action.

  • Validate the ONNX model
curl -X PUT --data-binary @mnist.onnx -H "Content-Type: application/octet-stream"  http://localhost:9090/v1/validate
{
  "outputs": [
    {
      "name": "Plus214_Output_0",
      "shape": [
        1,
        10
      ],
      "type": "tensor(float)"
    }
  ],
  "predictors": [
    {
      "name": "Input3",
      "shape": [
        1,
        1,
        28,
        28
      ],
      "type": "tensor(float)"
    }
  ],
  "runtime": "ONNX Runtime",
  "serialization": "onnx",
  "type": "ONNX"
}
  • Deploy the ONNX model
curl -X PUT --data-binary @mnist.onnx -H "Content-Type: application/octet-stream"  http://localhost:9090/v1/models/mnist
{
  "name": "mnist",
  "version": 1
}
  • Get metadata of the ONNX model
curl -X GET http://localhost:9090/v1/models/mnist
{
  "createdAt": "2020-04-16T15:18:18",
  "id": "850bf345-5c4c-4312-96c8-6ee715113961",
  "latestVersion": 1,
  "name": "mnist",
  "updateAt": "2020-04-16T15:18:18",
  "versions": [
    {
      "createdAt": "2020-04-16T15:18:18",
      "hash": "104617a683b4e62469478e07e1518aaa",
      "outputs": [
        {
          "name": "Plus214_Output_0",
          "shape": [
            1,
            10
          ],
          "type": "tensor(float)"
        }
      ],
      "predictors": [
        {
          "name": "Input3",
          "shape": [
            1,
            1,
            28,
            28
          ],
          "type": "tensor(float)"
        }
      ],
      "runtime": "ONNX Runtime",
      "serialization": "onnx",
      "size": 26454,
      "type": "ONNX",
      "version": 1
    }
  ]
}
  • Predict the ONNX model using the JSON payload in records
curl -X POST -d @mnist_request_0.json -H "Content-Type: application/json" http://localhost:9090/v1/models/mnist
{
  "result": [
    {
      "Plus214_Output_0": [
        [
          975.6703491210938,
          -618.7241821289062,
          6574.5654296875,
          668.0283203125,
          -917.2710571289062,
          -1671.6361083984375,
          -1952.7598876953125,
          -61.54957580566406,
          -777.1764526367188,
          -1439.5316162109375
        ]
      ]
    }
  ]
}
  • Predict the ONNX model using the binary payload
curl -X POST --data-binary @mnist_request_0.pb -o response_0.pb -H "Content-Type: application/octet-stream" http://localhost:9090/v1/models/mnist

The binary response is saved to response_0.pb, which is in the protobuf format (an instance of the PredictResponse message); you can use a client generated from ai-serving.proto to read it.

Note that the content type of a predict request must be specified explicitly and must be one of the four supported types above. An incorrect request URL or body returns an HTTP error status. For example:

curl -i -X POST -d @mnist_request_0.json  http://localhost:9090/v1/models/mnist
HTTP/1.1 400 Bad Request
Server: akka-http/10.1.11
Date: Sun, 19 Apr 2020 06:25:25 GMT
Connection: close
Content-Type: application/json
Content-Length: 92

{"error":"Prediction request takes unknown content type: application/x-www-form-urlencoded"}

Deploy and Manage AI/ML models at scale

See the DaaS system that deploys AI/ML models in production at scale on Kubernetes.

Support

If you have any questions about the AI-Serving library, please open an issue on this repository.

Feedback and contributions to the project, no matter what kind, are always very welcome.

License

AI-Serving is licensed under the Apache License 2.0.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].