Seldon Python Server Configuration

To serve your component, Seldon's Python wrapper uses Gunicorn under the hood by default. Gunicorn is a high-performance WSGI HTTP server for Unix that makes it easy to scale your model across multiple worker processes and threads.

Note

Gunicorn only handles the horizontal scaling of your model within a single pod and container. To learn how to scale your model across multiple pod replicas, see this section of the docs.

Workers

By default, Seldon uses a single worker process. However, you can increase this number through the GUNICORN_WORKERS environment variable for REST and the GRPC_WORKERS environment variable for gRPC. Both variables can be set directly through the SeldonDeployment CRD.

For example, to run your model under 8 processes (4 REST and 4 gRPC), you could do:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: gunicorn
spec:
  name: worker
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: seldonio/mock_classifier:1.0
          name: classifier
          env:
          - name: GUNICORN_WORKERS
            value: '4'
          - name: GRPC_WORKERS
            value: '4'
        terminationGracePeriodSeconds: 1
    graph:
      children: []
      endpoint:
        type: REST
      name: classifier
      type: MODEL
    labels:
      version: v1
    name: example
    replicas: 1

Running only the REST server by disabling the gRPC server

By default, Seldon models run a REST and a gRPC server with a single worker process each. Because every worker process loads its own copy of the machine learning model, this can add significant memory overhead when the model artifacts are large. In that case you can disable the gRPC server by setting GRPC_WORKERS to 0, which prevents the gRPC server from starting at all. Note that the gRPC endpoint will still be exposed by the service orchestrator, but gRPC requests will fail because no server is listening. An example would be as follows:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: gunicorn
spec:
  name: worker
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: seldonio/mock_classifier:1.0
          name: classifier
          env:
          - name: GRPC_WORKERS
            value: '0'
        terminationGracePeriodSeconds: 1
    graph:
      children: []
      endpoint:
        type: REST
      name: classifier
      type: MODEL
    labels:
      version: v1
    name: example
    replicas: 1
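
To see why the per-process model copies matter, here is a rough back-of-the-envelope estimate (a hypothetical helper with illustrative numbers, not part of Seldon's API):

```python
def estimated_model_memory_mb(model_size_mb, rest_workers, grpc_workers):
    """Rough estimate: each REST and gRPC worker loads its own copy of the model."""
    return model_size_mb * (rest_workers + grpc_workers)

# A 2 GB model artifact with 4 REST and 4 gRPC workers:
print(estimated_model_memory_mb(2048, 4, 4))  # 16384 MB across workers
# Disabling the gRPC server (GRPC_WORKERS=0) halves that:
print(estimated_model_memory_mb(2048, 4, 0))  # 8192 MB
```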

Threads

By default, Seldon will process your model’s incoming requests using 1 thread per worker process. You can increase this number through the GUNICORN_THREADS environment variable. This variable can be controlled directly through the SeldonDeployment CRD.

For example, to run your model with 5 threads per worker, you could do:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: gunicorn
spec:
  name: worker
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: seldonio/mock_classifier:1.0
          name: classifier
          env:
          - name: GUNICORN_THREADS
            value: '5'
        terminationGracePeriodSeconds: 1
    graph:
      children: []
      endpoint:
        type: REST
      name: classifier
      type: MODEL
    labels:
      version: v1
    name: example
    replicas: 1

Disable multithreading

In some cases, you may want to disable multithreading entirely. To serve your model within a single thread, set the FLASK_SINGLE_THREADED environment variable to 1. This is not the optimal setup for most models, but it can be useful when your model cannot be made thread-safe, such as many GPU-based models that deadlock when accessed from multiple threads.

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: flaskexample
spec:
  name: worker
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: seldonio/mock_classifier:1.0
          name: classifier
          env:
          - name: FLASK_SINGLE_THREADED
            value: '1'
        terminationGracePeriodSeconds: 1
    graph:
      children: []
      endpoint:
        type: REST
      name: classifier
      type: MODEL
    labels:
      version: v1
    name: example
    replicas: 1
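
If you would rather keep multiple threads but serialize access to the non-thread-safe part of your model, one alternative is to guard the prediction call with a lock. The sketch below assumes the standard Seldon Python wrapper contract of a predict(X, features_names) method; _unsafe_inference is a hypothetical stand-in for the real non-thread-safe operation:

```python
import threading


class Model:
    """Sketch of a Seldon Python model whose predict call is serialized with a lock."""

    def __init__(self):
        self._lock = threading.Lock()

    def _unsafe_inference(self, X):
        # Placeholder for a real, non-thread-safe operation (e.g. a GPU call).
        return X

    def predict(self, X, features_names=None):
        # Only one thread at a time may touch the underlying model.
        with self._lock:
            return self._unsafe_inference(X)
```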

Development server

While Gunicorn is recommended for production workloads, it's also possible to use Flask's built-in development server for local testing. To enable it, set the SELDON_DEBUG environment variable to 1.

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: flask-development-server
spec:
  name: worker
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: seldonio/mock_classifier:1.0
          name: classifier
          env:
          - name: SELDON_DEBUG
            value: '1'
        terminationGracePeriodSeconds: 1
    graph:
      children: []
      endpoint:
        type: REST
      name: classifier
      type: MODEL
    labels:
      version: v1
    name: example
    replicas: 1

Configuration

The Python server can be configured using environment variables or command-line flags.

| CLI Flag | Environment Variable | Default | Notes |
|---|---|---|---|
| `interface_name` | N/A | N/A | First positional argument. Required. If it contains a `.`, the first part is interpreted as the module name. |
| `--http-port` | `PREDICTIVE_UNIT_HTTP_SERVICE_PORT` | 9000 | HTTP port of the Seldon service. In Kubernetes this is controlled by the Seldon Core Operator. |
| `--grpc-port` | `PREDICTIVE_UNIT_GRPC_SERVICE_PORT` | 5000 | gRPC port of the Seldon service. In Kubernetes this is controlled by the Seldon Core Operator. |
| `--metrics-port` | `PREDICTIVE_UNIT_METRICS_SERVICE_PORT` | 6000 | Metrics port of the Seldon service. In Kubernetes this is controlled by the Seldon Core Operator. |
| `--service-type` | N/A | MODEL | Service type of the model. Can be MODEL, ROUTER, TRANSFORMER, COMBINER or OUTLIER_DETECTOR. |
| `--parameters` | N/A | [] | List of parameters to be passed to the Model class. |
| `--log-level` | `LOG_LEVEL_ENV` | INFO | Python log level. Can be DEBUG, INFO, WARNING or ERROR. |
| `--debug` | `SELDON_DEBUG` | false | Enables debug mode: runs the Flask development server and sets logging to DEBUG. Values 1, true or t (case insensitive) are interpreted as True. |
| `--tracing` | `TRACING` | 0 | Enable tracing. Can be 0 or 1. |
| `--workers` | `GUNICORN_WORKERS` | 1 | Number of Gunicorn workers for handling requests. |
| `--threads` | `GUNICORN_THREADS` | 1 | Number of threads to run per Gunicorn worker. |
| `--max-requests` | `GUNICORN_MAX_REQUESTS` | 0 | Maximum number of requests a Gunicorn worker will process before restarting. |
| `--max-requests-jitter` | `GUNICORN_MAX_REQUESTS_JITTER` | 0 | Maximum random jitter to add to max-requests. |
| `--keepalive` | `GUNICORN_KEEPALIVE` | 2 | Number of seconds to wait for requests on a Keep-Alive connection. |
| `--access-log` | `GUNICORN_ACCESS_LOG` | false | Enable the Gunicorn access log. |
| `--pidfile` | N/A | None | File path to use for the Gunicorn PID file. |
| `--single-threaded` | `FLASK_SINGLE_THREADED` | 0 | Force the Flask app to run single-threaded. Also applies to Gunicorn. Can be 0 or 1. |
| N/A | `FILTER_METRICS_ACCESS_LOGS` | not debug | Filter out logs related to Prometheus accessing the metrics port. By default enabled in production and disabled in debug mode. |
| N/A | `PREDICTIVE_UNIT_METRICS_ENDPOINT` | /metrics | Endpoint name for Prometheus metrics. In Kubernetes deployments the default is /prometheus. |
| N/A | `PAYLOAD_PASSTHROUGH` | false | Skip decoding of payloads. |
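
As an illustration of how a few of these environment variables resolve against their documented defaults, including the truthy-value parsing described for --debug, here is a simplified sketch (not Seldon's actual parsing code):

```python
import os


def parse_bool(value):
    # Per the --debug notes: "1", "true" or "t" (case insensitive) count as True.
    return value.strip().lower() in ("1", "true", "t")


def server_settings(environ=os.environ):
    """Resolve a few documented settings, falling back to the defaults above."""
    return {
        "workers": int(environ.get("GUNICORN_WORKERS", "1")),
        "threads": int(environ.get("GUNICORN_THREADS", "1")),
        "keepalive": int(environ.get("GUNICORN_KEEPALIVE", "2")),
        "debug": parse_bool(environ.get("SELDON_DEBUG", "false")),
    }


print(server_settings({}))  # all defaults
print(server_settings({"GUNICORN_WORKERS": "4", "SELDON_DEBUG": "t"}))
```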

Running Processes

The total set of processes started for the Python server is as follows:

  • REST server. A Gunicorn server which will have 1 master process and N worker processes where N is defined by the environment variable GUNICORN_WORKERS.

  • gRPC server. A master process and N worker processes where N is defined by the environment variable GRPC_WORKERS.

  • Metrics server. A Gunicorn server with 1 master and 1 worker.

  • Metrics collector. A single process that collects the metrics across workers.
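
The breakdown above can be sketched as a small helper (hypothetical, purely to make the arithmetic concrete):

```python
def total_processes(gunicorn_workers=1, grpc_workers=1):
    """Total processes per the breakdown above."""
    rest = 1 + gunicorn_workers  # Gunicorn master + REST workers
    # GRPC_WORKERS=0 disables the gRPC server entirely (no master process).
    grpc = (1 + grpc_workers) if grpc_workers > 0 else 0
    metrics_server = 2           # Gunicorn master + 1 worker
    metrics_collector = 1        # single collector process
    return rest + grpc + metrics_server + metrics_collector


print(total_processes())       # defaults: 2 + 2 + 2 + 1 = 7
print(total_processes(4, 4))   # the 8-worker example: 5 + 5 + 2 + 1 = 13
print(total_processes(4, 0))   # REST-only: 5 + 0 + 2 + 1 = 8
```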