This page was generated from examples/triton_gpt2/README.ipynb.
Pretrained GPT2 Model Deployment Example¶
In this notebook, we will run an example of text generation using a GPT2 model exported from HuggingFace and deployed with Seldon’s pre-packaged Triton server. The example also covers converting the model to ONNX format. The implementation below uses the greedy approach to next-token prediction. More info: https://huggingface.co/transformers/model_doc/gpt2.html?highlight=gpt2
After we have the model deployed to Kubernetes, we will run a simple load test to evaluate the model’s inference performance.
Steps:¶
Download the pretrained GPT2 model from HuggingFace
Convert the model to ONNX
Store it in a MinIO bucket
Set up Seldon Core in your Kubernetes cluster
Deploy the ONNX model with Seldon’s pre-packaged Triton server
Interact with the model: run a greedy algorithm example (generate a sentence completion)
Run a load test using vegeta
Clean-up
Basic requirements¶
Helm v3.0.0+
A Kubernetes cluster running v1.13 or above (Minikube / Docker for Windows work well if given enough RAM)
kubectl v1.14+
Python 3.6+
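You can quickly confirm the tool versions before starting (a convenience check; exact flags may vary slightly between versions):
[ ]:
!helm version --short
!kubectl version --short
!python --version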
[1]:
%%writefile requirements.txt
transformers==4.5.1
torch==1.8.1
tokenizers<0.11,>=0.10.1
tensorflow==2.4.1
tf2onnx
Writing requirements.txt
[ ]:
!pip install --trusted-host=pypi.python.org --trusted-host=pypi.org --trusted-host=files.pythonhosted.org -r requirements.txt
Export the pre-trained HuggingFace TFGPT2LMHeadModel and save it locally¶
[ ]:
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

# Load the GPT2 tokenizer and the language-model head, converting the
# PyTorch weights to TensorFlow (from_pt=True)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained(
    "gpt2", from_pt=True, pad_token_id=tokenizer.eos_token_id
)
# Save in TensorFlow SavedModel format so tf2onnx can convert it
model.save_pretrained("./tfgpt2model", saved_model=True)
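Optionally, sanity-check the model before exporting it. A minimal sketch, reusing the model and tokenizer from the cell above, that generates a short greedy completion locally:
[ ]:
# Optional sanity check: greedy generation with the in-memory TF model
input_ids = tokenizer.encode("I enjoy working in Seldon", return_tensors="tf")
output = model.generate(input_ids, max_length=16, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))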
Convert the TensorFlow saved model to ONNX¶
[ ]:
!python -m tf2onnx.convert --saved-model ./tfgpt2model/saved_model/1 --opset 11 --output model.onnx
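Before uploading the model anywhere, you can verify the exported graph locally with onnxruntime. This is a sketch under the assumption that onnxruntime is installed (it is not in the requirements.txt above); the input and output names match the model metadata returned later in this notebook.
[ ]:
import numpy as np
import onnxruntime as ort
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# the exported model expects INT32 input_ids and attention_mask
input_ids = tokenizer.encode("I enjoy working in Seldon", return_tensors="np").astype(np.int32)
sess = ort.InferenceSession("model.onnx")
logits = sess.run(["logits"], {"input_ids": input_ids, "attention_mask": np.ones_like(input_ids)})[0]
print(logits.shape)  # expected: (1, sequence_length, 50257)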
Copy your model to a local MinIO¶
Setup MinIO¶
Use the provided notebook to install MinIO in your cluster and to configure the mc CLI tool. Instructions are also available online.
Note: you can use your preferred remote storage server instead (Google Cloud Storage / AWS S3, etc.).
Create a Bucket and store your model¶
[23]:
!mc mb minio/language-models/onnx-gpt2/1 -p
!mc cp ./model.onnx minio/language-models/onnx-gpt2/1/
Bucket created successfully `minio/language-models/onnx-gpt2/1`.
...odel.onnx: 622.29 MiB / 622.29 MiB ┃▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓┃ 136.59 MiB/s 4s
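You can confirm the upload by listing the bucket contents:
[ ]:
!mc ls minio/language-models/onnx-gpt2/1/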
Run Seldon in your Kubernetes cluster¶
Follow the Seldon-Core Setup notebook to set up a cluster with Ambassador Ingress or Istio and install Seldon Core
Deploy your model with Seldon’s pre-packaged Triton server¶
[6]:
%%writefile secret.yaml
apiVersion: v1
kind: Secret
metadata:
name: seldon-init-container-secret
type: Opaque
stringData:
RCLONE_CONFIG_S3_TYPE: s3
RCLONE_CONFIG_S3_PROVIDER: minio
RCLONE_CONFIG_S3_ENV_AUTH: "false"
RCLONE_CONFIG_S3_ACCESS_KEY_ID: minioadmin
RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: minioadmin
RCLONE_CONFIG_S3_ENDPOINT: http://minio.minio-system.svc.cluster.local:9000
Writing secret.yaml
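The RCLONE_CONFIG_S3_* entries define an rclone remote named s3, which Seldon’s storage initializer uses to download the model referenced by the s3:// modelUri in the deployment below. Adjust the credentials and endpoint if your MinIO installation differs from the one in the MinIO setup notebook.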
[4]:
%%writefile gpt2-deploy.yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
name: gpt2
spec:
predictors:
- graph:
implementation: TRITON_SERVER
logger:
mode: all
modelUri: s3://language-models
envSecretRefName: seldon-init-container-secret
name: onnx-gpt2
type: MODEL
name: default
replicas: 1
protocol: kfserving
Writing gpt2-deploy.yaml
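Triton treats the modelUri as a model repository root: each model lives in a directory named after it, with numeric version subdirectories. The bucket layout created earlier already follows this convention:
language-models/
└── onnx-gpt2/
    └── 1/
        └── model.onnx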
[7]:
!kubectl apply -f secret.yaml -n default
!kubectl apply -f gpt2-deploy.yaml -n default
secret/seldon-init-container-secret configured
seldondeployment.machinelearning.seldon.io/gpt2 unchanged
[8]:
!kubectl rollout status deploy/$(kubectl get deploy -l seldon-deployment-id=gpt2 -o jsonpath='{.items[0].metadata.name}')
deployment "gpt2-default-0-onnx-gpt2" successfully rolled out
Interact with the model: get model metadata (a “test” request to make sure our model is available and loaded correctly)¶
[9]:
!curl -s http://localhost:80/seldon/default/gpt2/v2/models/onnx-gpt2
{"name":"onnx-gpt2","versions":["1"],"platform":"onnxruntime_onnx","inputs":[{"name":"input_ids","datatype":"INT32","shape":[-1,-1]},{"name":"attention_mask","datatype":"INT32","shape":[-1,-1]}],"outputs":[{"name":"past_key_values","datatype":"FP32","shape":[12,2,-1,12,-1,64]},{"name":"logits","datatype":"FP32","shape":[-1,-1,50257]}]}
Run prediction test: generate a sentence completion using GPT2 model - Greedy approach¶
[33]:
import json
import numpy as np
import requests
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
input_text = "I enjoy working in Seldon"
count = 0
max_gen_len = 10
gen_sentence = input_text
while count < max_gen_len:
input_ids = tokenizer.encode(gen_sentence, return_tensors="tf")
shape = input_ids.shape.as_list()
payload = {
"inputs": [
{
"name": "input_ids",
"datatype": "INT32",
"shape": shape,
"data": input_ids.numpy().tolist(),
},
{
"name": "attention_mask",
"datatype": "INT32",
"shape": shape,
"data": np.ones(shape, dtype=np.int32).tolist(),
},
]
}
ret = requests.post(
"http://localhost:80/seldon/default/gpt2/v2/models/onnx-gpt2/infer",
json=payload,
)
    try:
        res = ret.json()
    except ValueError:
        # non-JSON response: the request failed, so stop instead of looping forever
        break
# extract logits
logits = np.array(res["outputs"][1]["data"])
logits = logits.reshape(res["outputs"][1]["shape"])
    # greedy approach: pick the highest-probability next token at the last position of the input
next_token = logits.argmax(axis=2)[0]
next_token_str = tokenizer.decode(
next_token[-1:], skip_special_tokens=True, clean_up_tokenization_spaces=True
).strip()
gen_sentence += " " + next_token_str
count += 1
print(f"Input: {input_text}\nOutput: {gen_sentence}")
Input: I enjoy working in Seldon
Output: I enjoy working in Seldon 's office , and I 'm glad to see that
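The extra spaces in the output (e.g. “'s office ,”) appear because each generated token is decoded in isolation and joined with a space. A cleaner variant (a sketch, reusing tokenizer, input_text, max_gen_len, np and requests from the cell above) accumulates token ids and decodes the whole sequence once:
[ ]:
# Sketch: accumulate generated token ids and decode once at the end,
# so GPT2's byte-pair tokens are re-joined correctly.
ids = tokenizer.encode(input_text)
for _ in range(max_gen_len):
    payload = {
        "inputs": [
            {"name": "input_ids", "datatype": "INT32", "shape": [1, len(ids)], "data": [ids]},
            {"name": "attention_mask", "datatype": "INT32", "shape": [1, len(ids)], "data": [[1] * len(ids)]},
        ]
    }
    res = requests.post(
        "http://localhost:80/seldon/default/gpt2/v2/models/onnx-gpt2/infer", json=payload
    ).json()
    logits = np.array(res["outputs"][1]["data"]).reshape(res["outputs"][1]["shape"])
    ids.append(int(logits.argmax(axis=2)[0, -1]))
print(tokenizer.decode(ids, skip_special_tokens=True))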
Run Load Test / Performance Test using vegeta¶
Install vegeta; for more details, take a look at the official vegeta documentation¶
[ ]:
!wget https://github.com/tsenart/vegeta/releases/download/v12.8.3/vegeta-12.8.3-linux-amd64.tar.gz
!tar -zxvf vegeta-12.8.3-linux-amd64.tar.gz
!chmod +x vegeta
Generate a vegeta target file containing a POST request with the payload in the required structure (vegeta’s JSON format expects the request body base64-encoded)¶
[35]:
import base64
import json
import numpy as np
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
input_text = "I enjoy working in Seldon"
input_ids = tokenizer.encode(input_text, return_tensors="tf")
shape = input_ids.shape.as_list()
payload = {
"inputs": [
{
"name": "input_ids",
"datatype": "INT32",
"shape": shape,
"data": input_ids.numpy().tolist(),
},
{
"name": "attention_mask",
"datatype": "INT32",
"shape": shape,
"data": np.ones(shape, dtype=np.int32).tolist(),
},
]
}
cmd = {
"method": "POST",
"header": {"Content-Type": ["application/json"]},
"url": "http://localhost:80/seldon/default/gpt2/v2/models/gpt2/infer",
"body": base64.b64encode(bytes(json.dumps(payload), "utf-8")).decode("utf-8"),
}
with open("vegeta_target.json", mode="w") as file:
json.dump(cmd, file)
file.write("\n\n")
[ ]:
!vegeta attack -targets=vegeta_target.json -rate=1 -duration=60s -format=json | vegeta report -type=text
Clean-up¶
[11]:
!kubectl delete -f gpt2-deploy.yaml -n default
seldondeployment.machinelearning.seldon.io "gpt2" deleted
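If you also created the MinIO secret above, you may want to remove it as well:
[ ]:
!kubectl delete -f secret.yaml -n default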