Inference optimization is a complex subject, and the right approach depends on your model and use case. This page collects advice on several areas.
Custom Python Code Optimization¶
When using the Seldon Python wrapper, there are several optimization areas to consider.
Seldon Protocol Payload Types with REST and gRPC¶
Whether you use REST or gRPC, and which payload type you use to send tensor data, determines the serialization and deserialization cost incurred in the Python wrapper. This is investigated in a python serialization notebook.
The conclusions are:
gRPC is faster than REST
tftensor is best for large batch size
ndarray with gRPC is bad for large batch size
simpler tensor/ndarray is better for small batch size
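To make the payload types concrete, the sketch below builds the same batch as an `ndarray` payload and as a `tensor` payload in the Seldon protocol's JSON form. The helper names `to_ndarray_payload` and `to_tensor_payload` are hypothetical and used only for illustration; the sketch assumes a rectangular 2-D batch and does not measure anything itself.

```python
import json


def to_ndarray_payload(rows):
    """Hypothetical helper: wrap nested lists as a Seldon 'ndarray' payload."""
    return {"data": {"ndarray": rows}}


def to_tensor_payload(rows):
    """Hypothetical helper: flatten a 2-D batch into a Seldon 'tensor' payload.

    Assumes every row has the same length.
    """
    n_cols = len(rows[0]) if rows else 0
    values = [value for row in rows for value in row]
    return {"data": {"tensor": {"shape": [len(rows), n_cols], "values": values}}}


# A small batch of two rows with three features each.
batch = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

ndarray_body = json.dumps(to_ndarray_payload(batch))
tensor_body = json.dumps(to_tensor_payload(batch))

print(ndarray_body)
print(tensor_body)
```

The `ndarray` form keeps the nested list structure, while the `tensor` form carries a flat list of values plus an explicit shape. Which one is cheaper to serialize for your batch size is best checked by timing both against your own model.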
If you are running inference on Intel CPUs with compatible libraries, setting the KMP and OMP environment variables correctly can be useful. Most advice on this subject discusses a single inference request and how to optimize it for low latency. Be careful with KMP_AFFINITY when you expect to handle parallel inference requests, as they may block in unexpected ways if CPU affinity is being used. We provide an example benchmarking notebook.
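As a rough illustration of the trade-off described above, the following sketch derives thread settings from the number of CPUs and the number of parallel workers. The helper `intel_thread_env` is hypothetical, and the specific values (an even split of cores, `KMP_BLOCKTIME=1`, and the `KMP_AFFINITY` string) are assumptions to use as a starting point for benchmarking, not recommendations from Seldon or Intel. It deliberately leaves `KMP_AFFINITY` unset when more than one worker is expected.

```python
import os


def intel_thread_env(num_cpus, num_workers):
    """Hypothetical helper: pick OpenMP settings for a given worker count.

    Splits the available cores evenly between workers so that parallel
    requests do not oversubscribe the CPU.
    """
    threads = max(1, num_cpus // max(1, num_workers))
    env = {
        "OMP_NUM_THREADS": str(threads),
        # Assumed starting value; tune it with your own benchmarks.
        "KMP_BLOCKTIME": "1",
    }
    if num_workers == 1:
        # Only pin threads in the single-worker, low-latency case.
        # With parallel requests, pinning may cause unexpected blocking.
        env["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
    return env


# Apply the settings before importing the ML framework, without
# overriding anything already set in the container environment.
settings = intel_thread_env(os.cpu_count() or 1, num_workers=4)
for name, value in settings.items():
    os.environ.setdefault(name, value)
```

These variables are typically read when the OpenMP runtime initializes, so they need to be set before the framework is imported, or directly in the container spec.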
There are many resources for looking deeper into your own model's case. Some we have found are:
Maximize TensorFlow Performance on CPU: Considerations and Recommendations for Inference Workloads
Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon® Processor Based HPC Infrastructures
Optimizing BERT model for Intel CPU Cores using ONNX runtime default execution provider
Consider adjusting OMP_NUM_THREADS environment variable for containerized deployments
General Best Practices for Intel® Optimization for TensorFlow
From the 1.10.0 release of Seldon Core, the Python wrapper gRPC server also respects GUNICORN_NUM_WORKERS and can handle parallel gRPC requests.
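One way to set this is through the environment of the model container. The sketch below builds such a container entry as a plain Python dictionary; the helper `wrapper_container`, the container name, and the image are hypothetical, and the worker count of 4 is only an example value.

```python
import json


def wrapper_container(name, image, num_workers):
    """Hypothetical helper: build a container entry with the worker count set."""
    return {
        "name": name,
        "image": image,
        "env": [
            # Kubernetes expects environment variable values as strings.
            {"name": "GUNICORN_NUM_WORKERS", "value": str(num_workers)},
        ],
    }


# Placeholder name and image, for illustration only.
container = wrapper_container("classifier", "example/my-model:0.1", 4)
print(json.dumps(container, indent=2))
```

The resulting entry would sit in the containers list of the pod spec for your model. When raising the worker count, keep the per-worker thread settings in mind so that the workers do not compete for the same cores.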
We provide links to various benchmarking notebooks.