nikolaydubina / go-ml-benchmarks

Licence: other
⏱ Benchmarks of machine learning inference for Go

Go Machine Learning Benchmarks

Given raw data in a Go service, how quickly can we get machine learning inference on it?

Typically, a Go service deals with structured, single-sample data. Thus, we focus on tabular machine learning models only, such as the popular XGBoost. It is common to run a Go service as a backend process on Linux, so we do not consider other deployment options. In the work below, we compare typical implementations of this inference task.

diagram

host: AWS EC2 t2.xlarge shared
os: Ubuntu 20.04 LTS 
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
BenchmarkXGB_Go_GoFeatureProcessing_GoLeaves_noalloc                              491 ns/op
BenchmarkXGB_Go_GoFeatureProcessing_GoLeaves                                      575 ns/op
BenchmarkXGB_Go_GoFeatureProcessing_UDS_RawBytes_Python_XGB                    243056 ns/op
BenchmarkXGB_CGo_GoFeatureProcessing_XGB                                       244941 ns/op
BenchmarkXGB_Go_GoFeatureProcessing_UDS_gRPC_CPP_XGB                           367433 ns/op
BenchmarkXGB_Go_GoFeatureProcessing_UDS_gRPC_Python_XGB                        785147 ns/op
BenchmarkXGB_Go_UDS_gRPC_Python_sklearn_XGB                                  21699830 ns/op
BenchmarkXGB_Go_HTTP_JSON_Python_Gunicorn_Flask_sklearn_XGB                  21935237 ns/op

Abbreviations and Frameworks

Dataset and Model

We use the classic Titanic dataset. It contains both numerical and categorical features, which makes it representative of a typical case. The data, as well as the notebooks used to train the model and the preprocessor, are available in /data and /notebooks.

Some numbers for reference

How fast do you need to be?

                   200ps - 4.6GHz single cycle time
                1ns      - L1 cache latency
               10ns      - L2/L3 cache SRAM latency
               20ns      - DDR4 CAS, first byte from memory latency
               20ns      - C++ raw hardcoded structs access
               80ns      - C++ FlatBuffers decode/traverse/dealloc
              150ns      - PCIe bus latency
              171ns      - cgo call boundary, 2015
              200ns      - HFT FPGA
              475ns      - 2020 MLPerf winner recommendation inference time per sample
 ---------->  500ns      - go-featureprocessing + leaves
              800ns      - Go Protocol Buffers Marshal
              837ns      - Go json-iterator/go json unmarshal
           1µs           - Go protocol buffers unmarshal
           3µs           - Go JSON Marshal
           7µs           - Go JSON Unmarshal
          10µs           - PCIe/NVLink startup time
          17µs           - Python JSON encode/decode times
          30µs           - UNIX domain socket; eventfd; fifo pipes
         100µs           - Redis intrinsic latency; KDB+; HFT direct market access
         200µs           - 1GB/s network air latency; Go garbage collector pauses interval 2018
         230µs           - San Francisco to San Jose at speed of light
         500µs           - NGINX/Kong added latency
     10ms                - AWS DynamoDB; WIFI6 "air" latency
      15ms                - AWS Sagemaker latency; "Flash Boys" 300 million USD HFT drama
     30ms                - 5G "air" latency
     36ms                - San Francisco to Hong-Kong at speed of light
    100ms                - typical roundtrip from mobile to backend
    200ms                - AWS RDS MySQL/PostgreSQL; AWS Aurora
 10s                     - AWS Cloudfront 1MB transfer time

Profiling and Analysis

[491ns/575ns] Leaves — we see that most of the time is spent in the Leaves random forest code. The Leaves code does not malloc. In-place preprocessing does not malloc either; in the non-in-place version a malloc happens and takes about half of the preprocessing time.

leaves

[243µs] UDS Raw bytes Python — we see that Python takes much longer than the preprocessing in Go; still, Go is at least visible on the chart. We also note that Python spends most of its time in a libgomp.so call; this library is GNU OpenMP, written in C, which performs parallel operations.

uds

[244µs] CGo version — similarly, we see that a call to libgomp.so is made. It is a much smaller share relative to the rest of the CGo code than in the Python version above. Why are the overall results not better, then? Likely this is due to the performance penalty of crossing the Go/CGo boundary. We also note that a malloc is done.

cgo

[367µs] gRPC over UDS to C++ — we see that the Go code takes around 50% of the time of the C++ version. In C++, 50% of the time is spent in gRPC code. Lastly, C++ also uses libgomp.so. We cannot tell from this chart, but the Go code likely also spends considerable time in gRPC code.

cgo

[785µs] gRPC over UDS to Python without sklearn — we see that the Go code is visible in the chart. Python spends only a portion of its time in libgomp.so.

cgo

[21ms] gRPC over UDS to Python with sklearn — we see that the Go code (main.test) is no longer visible in the chart. Python spends only a small fraction of its time in libgomp.so.

cgo

[22ms] REST service version with sklearn — similarly, we see that the Go code (main.test) is no longer visible in the chart. Python spends more time in libgomp.so than in the Python + gRPC + sklearn version; however, it is not clear why the results are worse.

cgo

Future work

  • go-featureprocessing - gRPC + FlatBuffers - C++ - XGB
  • batch mode
  • UDS - gRPC - C++ - ONNX (sklearn + XGBoost)
  • UDS - gRPC - Python - ONNX (sklearn + XGBoost)
  • cgo ONNX (sklearn + XGBoost) (examples: 1)
  • native Go ONNX (sklearn + XGBoost) — no official support, https://github.com/owulveryck/onnx-go is not complete
  • text
  • images
  • videos
