# Workload Configuration

This doc explains how to allocate vGPU resources for your AI workloads using annotations and WorkloadProfile custom resources.

## Add Pod Annotations

Add the following annotations to your Pod metadata to configure GPU workload requirements.

### Annotation Reference

#### Basic Annotations

| Annotation | Description | Example Value |
|---|---|---|
| `tensor-fusion.ai/tflops-request` | Requested TFlops (FP16) per vGPU worker per GPU device | `'10'` |
| `tensor-fusion.ai/vram-request` | Requested VRAM (video memory / frame buffer) per vGPU worker per GPU device | `4Gi` |
| `tensor-fusion.ai/tflops-limit` | Maximum TFlops (FP16) allowed per vGPU worker per GPU device | `'20'` |
| `tensor-fusion.ai/vram-limit` | Maximum VRAM (video memory / frame buffer) allowed per vGPU worker per GPU device | `4Gi` |
| `tensor-fusion.ai/inject-container` | Container to inject GPU resources into; comma-separated for multiple containers | `python` |
| `tensor-fusion.ai/qos` | Quality-of-service level | `low`, `medium`, `high`, `critical` |
| `tensor-fusion.ai/is-local-gpu` | Schedule the workload to the same GPU server that runs its vGPU worker for best performance; defaults to `false` | `'true'` |
| `tensor-fusion.ai/workload` | TensorFusionWorkload name; if it already exists, Pods share its vGPU workers | `pytorch-example` |
| `tensor-fusion.ai/generate-workload` | Enables workload generation; if set to `false`, no new TensorFusionWorkload is created | `'true'` |
| `tensor-fusion.ai/workload-profile` | Reference to a WorkloadProfile to reuse predefined parameters | `default-profile` |
| `tensor-fusion.ai/replicas` | Number of vGPU worker replicas to create; each vGPU worker is allocated the requested computing resources. Should match the Deployment's `replicas` | `'2'` |
| `tensor-fusion.ai/gpupool` | Specifies the target GPU pool | `default-pool` |

#### Advanced Annotations

| Annotation | Description | Example Value |
|---|---|---|
| `tensor-fusion.ai/gpu-count` | Requested GPU device count. Each vGPU worker maps to N physical GPU devices, and VRAM/TFlops consumption is scaled by this factor; defaults to `1`. Your AI workload sees the mapped devices starting at `cuda:0` | `'4'` |
| `tensor-fusion.ai/gpu-model` | Specifies the GPU/NPU model | `A100`, `H100`, `L4`, `L40s` |
| `tensor-fusion.ai/auto-requests` | Automatically sets VRAM and/or TFlops requests based on the workload's historical metrics; for detailed settings, use the WorkloadProfile custom resource | `'true'` |
| `tensor-fusion.ai/auto-limits` | Automatically sets VRAM and/or TFlops limits based on the workload's historical metrics; for detailed settings, use the WorkloadProfile custom resource | `'true'` |
| `tensor-fusion.ai/auto-replicas` | Automatically sets vGPU worker replicas based on the workload's historical metrics; for detailed settings, use the WorkloadProfile custom resource | `'true'` |
| `tensor-fusion.ai/no-standalone-worker-mode` | Only available when `is-local-gpu` is `true`. In this mode, TensorFusion also injects the vGPU worker as an init container to achieve best performance; the trade-off is that users might bypass the vGPU worker and use the physical GPU directly | `'true'` |
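
For instance, a Pod template that requests four A100 devices and lets TensorFusion derive requests, limits, and replicas from historical metrics could combine these annotations as in the sketch below (the values shown are illustrative assumptions, not recommendations):

```yaml
metadata:
  labels:
    tensor-fusion.ai/enabled: "true"
  annotations:
    tensor-fusion.ai/gpupool: default-pool
    tensor-fusion.ai/inject-container: python
    tensor-fusion.ai/gpu-count: '4'        # each vGPU worker maps to 4 physical GPU devices
    tensor-fusion.ai/gpu-model: A100       # constrain scheduling to A100 devices
    tensor-fusion.ai/auto-requests: 'true' # derive tflops/vram requests from historical metrics
    tensor-fusion.ai/auto-limits: 'true'   # derive tflops/vram limits from historical metrics
    tensor-fusion.ai/auto-replicas: 'true' # derive vGPU worker replicas from historical metrics
```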

## Example Config

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-example
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pytorch-example
  template:
    metadata:
      labels:
        app: pytorch-example
        tensor-fusion.ai/enabled: "true"
      annotations:
        tensor-fusion.ai/gpupool: default-pool
        tensor-fusion.ai/inject-container: python # comma-separated if multiple containers use the GPU
        tensor-fusion.ai/replicas: '1' # vGPU worker replicas, same as the Deployment's replicas in most cases
        tensor-fusion.ai/tflops-limit: '20'
        tensor-fusion.ai/tflops-request: '10'
        tensor-fusion.ai/vram-limit: 4Gi
        tensor-fusion.ai/vram-request: 4Gi
        tensor-fusion.ai/qos: medium
        tensor-fusion.ai/workload: pytorch-example
        tensor-fusion.ai/generate-workload: 'true' # if 'false', reuses the workload named by tensor-fusion.ai/workload instead of starting new vGPU workers
        tensor-fusion.ai/workload-profile: default-profile # explicit annotations override WorkloadProfile values
        tensor-fusion.ai/is-local-gpu: 'true'
        tensor-fusion.ai/gpu-count: '1' # GPU devices per vGPU worker
    spec:
      containers:
        - name: python # must match tensor-fusion.ai/inject-container
          image: pytorch/pytorch:latest # example image; replace with your workload's image
```

## Configure WorkloadProfile Custom Resource

For advanced features like auto-scaling, create a WorkloadProfile custom resource and reference it in your Pod annotations.

```yaml
apiVersion: tensor-fusion.ai/v1
kind: WorkloadProfile
metadata:
  name: example-workload-profile
  namespace: same-namespace-as-your-workload
spec:
  # Specify AI computing resources needed
  resources:
    requests:
      tflops: "5"
      vram: "3Gi"
    limits:
      tflops: "15"
      vram: "3Gi"
  # Specify the number of vGPU workers, usually the same as Deployment replicas
  replicas: 1
  
  # Schedule the workload to the same GPU server that runs the vGPU worker for best performance
  isLocalGPU: true

  # Specify pool name (optional)
  poolName: default-pool

  # Specify QoS level (defaults to medium)
  qos: medium
  
  # Specify the number of GPU devices per vGPU worker (optional, defaults to 1)
  gpuCount: 1
  
  # Specify the GPU/NPU model (optional)
  gpuModel: A100
  
  # Auto-scaling configuration options (optional)
  autoScalingConfig: {}
```

Then reference this profile in your Pod annotation:

```yaml
tensor-fusion.ai/workload-profile: example-workload-profile
```
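
Putting it together, a minimal Deployment Pod template that relies on the profile for its resource settings might look like the following sketch (the container and profile names reuse the examples above):

```yaml
spec:
  template:
    metadata:
      labels:
        tensor-fusion.ai/enabled: "true"
      annotations:
        tensor-fusion.ai/workload-profile: example-workload-profile
        tensor-fusion.ai/inject-container: python # container that receives the injected vGPU resources
```

Annotations set directly on the Pod still take precedence over values from the referenced profile.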

For more details on the WorkloadProfile schema, see the WorkloadProfile Schema Reference.