Tensor Fusion Deployment for Kubernetes
Prerequisites
- A Kubernetes cluster with a GPU node pool and the NVIDIA GPU Operator enabled
- The cluster can reach Docker Hub to pull public images
- Create the tensor-fusion-test namespace for evaluation:
kubectl create ns tensor-fusion-test
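Before filling in the nodeSelector in Step 1, you can list the nodes and check which ones expose allocatable GPUs through the GPU Operator; a quick sketch using standard kubectl output:
# Print each node name with its allocatable nvidia.com/gpu count; pick a GPU node for the nodeSelector below
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'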
Step 1. Run server side on GPU node
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensor-fusion-gpu-server
  namespace: tensor-fusion-test
spec:
  replicas: 1
  selector:
    matchLabels:
      workload: test
  template:
    metadata:
      labels:
        workload: test
    spec:
      # Recommended: use a fixed node during testing and evaluation of TensorFusion
      nodeSelector:
        kubernetes.io/hostname: replace-me-with-kubernetes-node-name
      hostNetwork: true
      containers:
        - name: server
          image: tensor-fusion/tensor-fusion-worker:v1.0.1-beta
          command:
            - sh
            - -c
            # when the driver version is 535.183.*, -k is 0x298; when it is 550.*, -k is 0x268
            - "vcuda -n native -s 9997 -r 9998 -p 9999 -a 0x1129 -k 0x298"
          resources:
            limits:
              nvidia.com/gpu: '1' # obtain one GPU for testing; could be multiple
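A minimal sketch of choosing the -k value and bringing the server up, assuming the manifest above is saved as server.yaml (an arbitrary file name) and that nvidia-smi is available on the GPU node:
# On the GPU node: check the driver version to pick the -k value (0x298 for 535.183.*, 0x268 for 550.*)
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Apply the manifest and wait until the server pod is Running
kubectl apply -f server.yaml
kubectl -n tensor-fusion-test get pods -o wide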
Step 2. Deploy client side test app
After the server side is running successfully, copy the node IP and substitute it into the vcuda client startup configuration below.
REPLACE_ME => Server Node IP
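Since the server pod uses hostNetwork: true, its host IP is the node IP you need. A quick way to fetch it (a sketch; assumes the server pod is the only workload=test pod deployed so far):
kubectl -n tensor-fusion-test get pod -l workload=test -o jsonpath='{.items[0].status.hostIP}'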
Before patching your existing workload to move it off the GPU node and schedule it onto a CPU node, you can run the following on a CPU node to test the functionality.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensor-fusion-test-cpu-client
  namespace: tensor-fusion-test
spec:
  replicas: 1
  selector:
    matchLabels:
      workload: test
  template:
    metadata:
      labels:
        workload: test
    spec:
      volumes:
        - name: vcuda-libs
          emptyDir: {}
      initContainers:
        - name: init-hook
          image: tensor-fusion/tensor-fusion-client:v1.0.1-beta
          command:
            - sh
            - -c
            - cp /lib/vcuda/*.so /target/lib/vcuda/ && cp /lib/vcuda/official.libcuda.so.1 /target/lib/vcuda/libcuda.so.1
          volumeMounts:
            - mountPath: /target/lib/vcuda
              name: vcuda-libs
      containers:
        - name: app
          image: pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime
          lifecycle:
            postStart:
              exec:
                command: ["/bin/sh", "-c", "mkdir -p /usr/local/nvidia/lib/ && cp -r /lib/vcuda/libcuda.so.1 /usr/local/nvidia/lib/ && pip3 install transformers sentencepiece"]
          env:
            - name: DISABLE_ADDMM_CUDA_LT
              value: "1"
            - name: VCUDA_NODE_INDICE_LIST
              value: "0"
            - name: VCUDA_NODE_HOST_LIST
              value: "REPLACE_ME"
            - name: VCUDA_NODE_PROTOCOL_LIST
              value: "native"
            - name: VCUDA_NODE_SEND_PORT_LIST
              value: "9998"
            - name: VCUDA_NODE_RECV_PORT_LIST
              value: "9997"
            - name: VCUDA_NODE_PORT_LIST
              value: "9999"
            - name: VCUDA_GPU_INDICE_LIST
              value: "0"
            - name: LD_PRELOAD
              value: "/lib/vcuda/libutilities.so:/lib/vcuda/libvcuda.so"
          command:
            - sleep
            - infinity
          volumeMounts:
            - mountPath: /lib/vcuda
              name: vcuda-libs
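A minimal sketch of deploying the client and opening a shell in its "app" container, assuming the manifest above is saved as client.yaml (an arbitrary file name) with REPLACE_ME already substituted:
kubectl apply -f client.yaml
kubectl -n tensor-fusion-test rollout status deploy/tensor-fusion-test-cpu-client
kubectl -n tensor-fusion-test exec -it deploy/tensor-fusion-test-cpu-client -c app -- bash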
Then run "kubectl exec" into the "app" container, run this command inside the shell to start python REPL console.
python3
Finally, test a simple Google T5 model inference in the CPU pod. Initialization takes 20 seconds to 2 minutes depending on intranet latency; afterwards, it should translate the English "Hello" to the German "Hallo" within seconds.
from transformers import pipeline
pipe = pipeline("translation_en_to_de", model="google-t5/t5-base", device="cuda:0")
pipe("Hello")
Step 3. Patch your existing service deployment
If you've installed Kyverno, apply the following YAML to Kubernetes. It automatically injects the changes below, but only when the "tensor-fusion.ai/enabled: true" annotation is present on the Deployment pod template:
- Adds a 'vcuda-libs' emptyDir volume to the pod
- Injects an init container that copies the LD_PRELOAD libraries into the application container
- Injects the configuration environment variables; note that the value of VCUDA_NODE_HOST_LIST must be replaced with the server node IP
If you don't have Kyverno installed, perform the actions above manually.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: inject-tensor-fusion
  annotations:
    policies.kyverno.io/title: Inject Tensor Fusion runtime
    policies.kyverno.io/subject: Deployment,Volume
    policies.kyverno.io/minversion: 1.6.0
    policies.kyverno.io/description: >-
      Inject the Tensor Fusion runtime: an init container that provides the CUDA stub libraries and hooks PyTorch via LD_PRELOAD
spec:
  admission: true
  background: true
  rules:
    - name: inject-tensor-fusion-sidecar
      match:
        any:
          - resources:
              annotations:
                tensor-fusion.ai/enabled: 'true'
              kinds:
                - Pod
      mutate:
        patchStrategicMerge:
          spec:
            containers:
              - name: app # the container name must be app
                env:
                  - name: DISABLE_ADDMM_CUDA_LT
                    value: "1"
                  - name: VCUDA_NODE_INDICE_LIST
                    value: "0"
                  - name: VCUDA_NODE_HOST_LIST
                    value: "REPLACE_ME"
                  - name: VCUDA_NODE_PROTOCOL_LIST
                    value: "native"
                  - name: VCUDA_NODE_SEND_PORT_LIST
                    value: "9998"
                  - name: VCUDA_NODE_RECV_PORT_LIST
                    value: "9997"
                  - name: VCUDA_NODE_PORT_LIST
                    value: "9999"
                  - name: VCUDA_GPU_INDICE_LIST
                    value: "0"
                  - name: LD_LIBRARY_PATH
                    value: /lib/vcuda/official
                  - name: LD_PRELOAD
                    value: /lib/vcuda/libutilities.so:/lib/vcuda/libvcuda.so
                volumeMounts:
                  - name: vcuda-libs
                    mountPath: /lib/vcuda
            initContainers:
              - command:
                  - sh
                  - '-c'
                  - >-
                    mkdir -p /target/lib/vcuda/official && mv /lib/vcuda/official.libcuda.so.1 /target/lib/vcuda/official/libcuda.so.1 && cp /lib/vcuda/*.so /target/lib/vcuda/
                image: tensor-fusion/tensor-fusion-client:v1.0.1-beta
                imagePullPolicy: IfNotPresent
                name: copy-runtime-libs
                volumeMounts:
                  - mountPath: /target/lib/vcuda
                    name: vcuda-libs
            volumes:
              - emptyDir: {}
                name: vcuda-libs
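A minimal sketch of applying the policy and confirming Kyverno has admitted it, assuming the policy above is saved as inject-tensor-fusion.yaml (an arbitrary file name):
kubectl apply -f inject-tensor-fusion.yaml
# The policy should report itself as ready
kubectl get clusterpolicy inject-tensor-fusion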
Then you can apply the following YAML to create a simple PyTorch workload and test the injection.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensor-fusion-test-on-cpu-node-kyverno
  namespace: tensor-fusion-test
spec:
  replicas: 1
  selector:
    matchLabels:
      workload: test
  template:
    metadata:
      labels:
        workload: test
      annotations:
        tensor-fusion.ai/enabled: "true"
    spec:
      containers:
        - name: app
          image: pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime
          command:
            - sleep
            - infinity
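To verify the injection, apply the manifest and inspect the resulting pod; the copy-runtime-libs init container, the vcuda-libs volume, and the VCUDA_* environment variables should all be present. kyverno-test.yaml is an assumed file name, and the pod name placeholder must be taken from the get pods output:
kubectl apply -f kyverno-test.yaml
kubectl -n tensor-fusion-test get pods
kubectl -n tensor-fusion-test describe pod <tensor-fusion-test-on-cpu-node-kyverno-pod-name>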
After the pod has started, exec into the "app" container (as in Step 2) and run the commands below. If everything works, you can modify your existing Deployment pod template to trigger the Kyverno injection and migrate to Tensor Fusion.
pip3 install transformers sentencepiece
LD_PRELOAD="" cat <<EOT >> t5.test.py
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import TextStreamer
model_id = "google-t5/t5-base"
tokenizer = T5Tokenizer.from_pretrained(model_id)
streamer = TextStreamer(tokenizer)
model = T5ForConditionalGeneration.from_pretrained(model_id)
model = model.to("cuda:0")
input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(input_ids, streamer=streamer)
EOT
python3 t5.test.py
# The output should end with: <pad>Wie alt bist du?</s>