Inference Optimization

Inference optimization is a complex subject that depends on your model and use case. This page provides advice on several common areas.

Custom Python Code Optimization

When using the Seldon Python wrapper, there are several areas to consider for optimization.

Seldon Protocol Payload Types with REST and gRPC

Whether you use REST or gRPC, and which tensor payload type you send, determines the deserialization/serialization cost incurred in the Python wrapper. This is investigated in a Python serialization notebook.

The conclusions are:

  • gRPC is faster than REST

  • tftensor is best for large batch sizes

  • ndarray with gRPC performs poorly at large batch sizes

  • the simpler tensor/ndarray payloads are better for small batch sizes
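
As an illustration, the sketch below uses the seldon_core Python client to exercise these combinations. The deployment name, namespace, and gateway endpoint are placeholders, and the exact client arguments may differ between Seldon Core versions.

    import numpy as np
    from seldon_core.seldon_client import SeldonClient

    # Placeholder deployment details -- substitute your own.
    sc = SeldonClient(
        deployment_name="mymodel",
        namespace="seldon",
        gateway="ambassador",
        gateway_endpoint="localhost:8003",
    )

    large_batch = np.random.rand(1000, 784)

    # gRPC with a tftensor payload: the fastest combination for large batches.
    res = sc.predict(data=large_batch, transport="grpc", payload_type="tftensor")

    # REST with an ndarray payload: fine for small batches.
    res = sc.predict(data=large_batch[:1], transport="rest", payload_type="ndarray")

    print(res.success)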

KMP_AFFINITY

If you are running inference on Intel CPUs with compatible libraries, correct use of the KMP and OMP environment variables can help. Most advice on this subject focuses on a single inference request and how to optimize it for low latency. Be careful with KMP_AFFINITY if you expect to handle parallel inference requests: with CPU affinity in use, requests may block each other in unexpected ways. We provide an example benchmarking notebook.
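
As a hedged sketch, the snippet below shows one way to set these variables from Python before an MKL/OpenMP-backed library is imported. The specific values are illustrative rather than recommendations; in a Seldon deployment you would more typically set them as environment variables on the model container.

    import os

    # Set threading variables *before* importing the MKL/OpenMP-backed library,
    # as they are read once at library initialization. Values are illustrative.
    os.environ["OMP_NUM_THREADS"] = "4"    # threads used per inference call
    os.environ["KMP_BLOCKTIME"] = "1"      # ms a thread spins before sleeping
    # Pinning threads to cores (e.g. "granularity=fine,compact,1,0") can make
    # parallel requests block each other; consider leaving affinity disabled
    # when serving concurrent traffic.
    os.environ["KMP_AFFINITY"] = "disabled"

    import numpy as np  # now picks up the settings above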

There are many resources to dig deeper for your model's particular case.

gRPC multi-processing

As of the 1.10.0 release of Seldon Core, the Python wrapper's gRPC server also respects GUNICORN_NUM_WORKERS and can handle parallel gRPC requests.
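
For example, with GUNICORN_NUM_WORKERS set on the model container (the worker count and deployment details below are placeholders), a client sketch like the following can issue concurrent gRPC requests against the wrapper:

    from concurrent.futures import ThreadPoolExecutor

    import numpy as np
    from seldon_core.seldon_client import SeldonClient

    # Assumes a deployment whose Python wrapper container sets, for example,
    # GUNICORN_NUM_WORKERS=4, so the gRPC server runs several workers.
    sc = SeldonClient(
        deployment_name="mymodel",
        namespace="seldon",
        gateway_endpoint="localhost:8003",
    )

    def one_request(_):
        return sc.predict(data=np.random.rand(1, 10),
                          transport="grpc", payload_type="tensor")

    # Fire 32 requests across 8 client threads; with multiple gRPC workers on
    # the server side these should be handled in parallel rather than queued.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(one_request, range(32)))

    print(sum(r.success for r in results), "of", len(results), "succeeded")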

Benchmarks

We provide links to various benchmarking notebooks.