This page was generated from examples/triton_gpt2/README.ipynb.

Pretrained GPT2 Model Deployment Example

In this notebook, we will run a text-generation example using a GPT2 model exported from HuggingFace and deployed with Seldon’s Triton pre-packaged server. The example also covers converting the model to ONNX format. Next-token prediction is implemented below with the greedy approach.
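Before touching the real model, the greedy idea itself can be shown with a minimal sketch: a toy vocabulary and a hypothetical scoring function stand in for GPT2, but the decoding loop does the same argmax-over-logits selection that the Triton-backed loop later in this notebook performs.

```python
# Toy illustration of greedy next-token decoding. The vocabulary and the
# scoring function are made up; the real notebook gets logits from GPT2.
vocab = ["I", "enjoy", "working", "in", "Seldon", "every", "day", "<eos>"]

def fake_logits(tokens):
    # Hypothetical "model": always scores the token after the last one highest.
    scores = [0.0] * len(vocab)
    scores[(tokens[-1] + 1) % len(vocab)] = 1.0
    return scores

tokens = [0, 1]              # "I enjoy"
for _ in range(3):           # greedily generate three more tokens
    logits = fake_logits(tokens)
    next_token = max(range(len(logits)), key=logits.__getitem__)  # argmax
    tokens.append(next_token)

print(" ".join(vocab[t] for t in tokens))  # → I enjoy working in Seldon
```

Greedy decoding always commits to the single highest-scoring token, which is cheap but cannot revise earlier choices; beam search or sampling are common alternatives.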

After the model is deployed to Kubernetes, we will run a simple load test to evaluate its inference performance.


  1. Download pretrained GPT2 model from hugging face

  2. Convert the model to ONNX

  3. Store it in a MinIO bucket

  4. Set up Seldon Core in your Kubernetes cluster

  5. Deploy the ONNX model with Seldon’s prepackaged Triton server.

  6. Interact with the model, run a greedy algorithm example (generate sentence completion)

  7. Run load test using vegeta

  8. Clean-up

Basic requirements

  • Helm v3.0.0+

  • A Kubernetes cluster running v1.13 or above (minikube / docker-for-windows work well if given enough RAM)

  • kubectl v1.14+

  • Python 3.6+

%%writefile requirements.txt
Writing requirements.txt
[ ]:
!pip install -r requirements.txt

Export HuggingFace TFGPT2LMHeadModel pre-trained model and save it locally

[ ]:
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained(
    "gpt2", from_pt=True, pad_token_id=tokenizer.eos_token_id
)
model.save_pretrained("./tfgpt2model", saved_model=True)

Convert the TensorFlow saved model to ONNX

[ ]:
!python -m tf2onnx.convert --saved-model ./tfgpt2model/saved_model/1 --opset 11  --output model.onnx

Copy your model to a local MinIO

Set up MinIO

Use the provided notebook to install MinIO in your cluster and configure the mc CLI tool. Instructions are also available online.

– Note: You can use your preferred remote storage server (Google Cloud / AWS etc.)

Create a Bucket and store your model

!mc mb minio/language-models/onnx-gpt2/1 -p
!mc cp ./model.onnx minio/language-models/onnx-gpt2/1/
Bucket created successfully `minio/language-models/onnx-gpt2/1`.
...odel.onnx:  622.29 MiB / 622.29 MiB ┃▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓┃ 136.59 MiB/s 4s

Run Seldon in your Kubernetes cluster

Follow the Seldon-Core Setup notebook to set up a cluster with Ambassador Ingress or Istio and install Seldon Core.

Deploy your model with Seldon pre-packaged Triton server

%%writefile secret.yaml

apiVersion: v1
kind: Secret
metadata:
  name: seldon-init-container-secret
type: Opaque
stringData:
  RCLONE_CONFIG_S3_ENDPOINT: http://minio.minio-system.svc.cluster.local:9000

Writing secret.yaml
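For reference, the `RCLONE_CONFIG_S3_*` environment variables in the secret follow rclone’s documented `RCLONE_CONFIG_<REMOTE>_<KEY>` naming convention, so they are equivalent to a remote named `s3` in an `rclone.conf` file. A sketch of the equivalent config (the `type` and `provider` keys below are assumptions for a MinIO backend; only the endpoint appears in the secret above):

```
[s3]
type = s3
provider = Minio
endpoint = http://minio.minio-system.svc.cluster.local:9000
```

Seldon’s init container uses these settings to pull the model from the `s3://` URI in the deployment below.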
%%writefile gpt2-deploy.yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: gpt2
spec:
  predictors:
  - graph:
      implementation: TRITON_SERVER
      logger:
        mode: all
      modelUri: s3://language-models
      envSecretRefName: seldon-init-container-secret
      name: onnx-gpt2
      type: MODEL
    name: default
    replicas: 1
  protocol: kfserving
Writing gpt2-deploy.yaml
!kubectl apply -f secret.yaml -n default
!kubectl apply -f gpt2-deploy.yaml -n default
secret/seldon-init-container-secret configured
seldondeployment.machinelearning.seldon.io/gpt2 unchanged
!kubectl rollout status deploy/$(kubectl get deploy -l seldon-deployment-id=gpt2 -o jsonpath='{.items[0].metadata.name}')
deployment "gpt2-default-0-onnx-gpt2" successfully rolled out

Interact with the model: get model metadata (a “test” request to make sure our model is available and loaded correctly)

!curl -s http://localhost:80/seldon/default/gpt2/v2/models/onnx-gpt2

Run a prediction test: generate a sentence completion using the GPT2 model - greedy approach

import json

import numpy as np
import requests
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
input_text = "I enjoy working in Seldon"
count = 0
max_gen_len = 10
gen_sentence = input_text
while count < max_gen_len:
    input_ids = tokenizer.encode(gen_sentence, return_tensors="tf")
    shape = input_ids.shape.as_list()
    payload = {
        "inputs": [
            {
                "name": "input_ids",
                "datatype": "INT32",
                "shape": shape,
                "data": input_ids.numpy().tolist(),
            },
            {
                "name": "attention_mask",
                "datatype": "INT32",
                "shape": shape,
                "data": np.ones(shape, dtype=np.int32).tolist(),
            },
        ]
    }

    ret = requests.post(
        "http://localhost:80/seldon/default/gpt2/v2/models/onnx-gpt2/infer",
        json=payload,
    )
    res = ret.json()

    # extract logits
    logits = np.array(res["outputs"][1]["data"])
    logits = logits.reshape(res["outputs"][1]["shape"])

    # take the highest-probability token at the last input position (greedy approach)
    next_token = logits.argmax(axis=2)[0]
    next_token_str = tokenizer.decode(
        next_token[-1:], skip_special_tokens=True, clean_up_tokenization_spaces=True
    )
    gen_sentence += " " + next_token_str
    count += 1

print(f"Input: {input_text}\nOutput: {gen_sentence}")
Input: I enjoy working in Seldon
Output: I enjoy working in Seldon 's office , and I 'm glad to see that
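To make the logits post-processing in the loop above easy to follow in isolation, here is the same reshape-and-argmax extraction run on a hand-made response in KFServing V2 format (the shapes, values, and output names are made up for illustration; the real response carries GPT2 logits over a ~50k-token vocabulary):

```python
import numpy as np

# Hypothetical V2 response: 1 batch, 2 token positions, vocabulary of 4.
res = {
    "outputs": [
        {"name": "other_output", "shape": [0], "data": []},  # ignored, as above
        {"name": "logits", "shape": [1, 2, 4],
         "data": [0.1, 0.9, 0.0, 0.0,    # logits at position 1
                  0.0, 0.2, 0.7, 0.1]},  # logits at position 2
    ]
}

# Same extraction as in the loop: flat data -> [batch, seq, vocab] -> argmax.
logits = np.array(res["outputs"][1]["data"]).reshape(res["outputs"][1]["shape"])
next_token = logits.argmax(axis=2)[0]  # best token id at each position
print(int(next_token[-1]))             # greedy choice for the last position → 2
```

The V2 protocol flattens tensor data into a single list, which is why the `shape` field is needed to reshape before taking the argmax.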

Run a Load / Performance Test using vegeta

Install vegeta; for more details take a look at vegeta’s official documentation

[ ]:
!tar -zxvf vegeta-12.8.3-linux-amd64.tar.gz
!chmod +x vegeta

Generate a vegeta target file containing a POST request with the payload in the required structure

import base64
import json
from subprocess import PIPE, Popen, run

import numpy as np
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
input_text = "I enjoy working in Seldon"
input_ids = tokenizer.encode(input_text, return_tensors="tf")
shape = input_ids.shape.as_list()
payload = {
    "inputs": [
        {
            "name": "input_ids",
            "datatype": "INT32",
            "shape": shape,
            "data": input_ids.numpy().tolist(),
        },
        {
            "name": "attention_mask",
            "datatype": "INT32",
            "shape": shape,
            "data": np.ones(shape, dtype=np.int32).tolist(),
        },
    ]
}

cmd = {
    "method": "POST",
    "header": {"Content-Type": ["application/json"]},
    "url": "http://localhost:80/seldon/default/gpt2/v2/models/onnx-gpt2/infer",
    "body": base64.b64encode(bytes(json.dumps(payload), "utf-8")).decode("utf-8"),
}

with open("vegeta_target.json", mode="w") as file:
    json.dump(cmd, file)
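vegeta’s JSON target format expects the request body as a base64-encoded string, which is why the payload is encoded before being written. A quick stdlib round-trip check shows the encoding is reversible (the minimal payload here is a stand-in for the real V2 payload built above):

```python
import base64
import json

# Stand-in for the real V2 inference payload (hypothetical minimal body).
payload = {"inputs": []}

# Same encoding as used for the vegeta target's "body" field.
body = base64.b64encode(bytes(json.dumps(payload), "utf-8")).decode("utf-8")
print(body)

# Decoding recovers the original JSON payload exactly.
roundtrip = json.loads(base64.b64decode(body))
```

vegeta decodes the `body` field itself at attack time, so the target file stays valid JSON even for binary or deeply nested request bodies.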
[ ]:
!vegeta attack -targets=vegeta_target.json -rate=1 -duration=60s -format=json | vegeta report -type=text
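Besides the text report, vegeta can emit the same statistics as JSON (`vegeta report -type=json`), which is convenient for post-processing results in Python. A sketch with made-up numbers (field names follow vegeta’s JSON report; latencies are reported in nanoseconds):

```python
import json

# Hypothetical vegeta JSON report; values are illustrative only.
report = json.loads("""
{"latencies": {"mean": 91000000, "95th": 120000000, "max": 230000000},
 "requests": 60, "rate": 1.0, "success": 1.0}
""")

# Convert nanoseconds to milliseconds for readability.
mean_ms = report["latencies"]["mean"] / 1e6
print(f"mean latency: {mean_ms:.1f} ms, success rate: {report['success']:.0%}")
```

Tracking the 95th/99th percentile latencies rather than just the mean gives a better picture of tail behavior under load.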


!kubectl delete -f gpt2-deploy.yaml -n default
seldondeployment.machinelearning.seldon.io "gpt2" deleted