Configure AutoScaling

Configure autoscaling strategies for AI workloads, including automatic scaling of vGPU resource requests and limits.

Step 1. Enable AutoScaling

Add Pod AutoScaling Annotations

Use the following annotations in conjunction with the workload annotations described in Create Workload:

  # Enable vertical scaling
  tensor-fusion.ai/auto-resources: 'true'
  # Target resource to scale, options: all|tflops|vram; if empty, only recommendations are generated
  tensor-fusion.ai/auto-scale-target-resource: all
  # Enable horizontal scaling
  tensor-fusion.ai/auto-replicas: 'true'
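
For example, these annotations can be set on a Deployment's Pod template. The following is a minimal sketch; the Deployment name, labels, and image are placeholders, and the workload annotations from Create Workload are omitted:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-ai-app                 # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-ai-app
  template:
    metadata:
      labels:
        app: my-ai-app
      annotations:
        # Enable vertical scaling of vGPU resources
        tensor-fusion.ai/auto-resources: 'true'
        # Apply recommendations to both tflops and vram
        tensor-fusion.ai/auto-scale-target-resource: all
        # Enable horizontal scaling
        tensor-fusion.ai/auto-replicas: 'true'
    spec:
      containers:
        - name: app
          image: registry.example.com/my-ai-app:latest   # placeholder image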

Detailed Configuration Using the Workload Configuration File

  • Vertical Scaling: Based on historical GPU resource usage data, the community VPA histogram algorithm is used. The estimates it generates consist of Target, LowerBound, and UpperBound, which correspond by default to the P90, P50, and P95 usage levels. If current resource usage falls outside the [LowerBound, UpperBound] range, a new recommended value is generated (see the worked example after the configuration block below).

[!NOTE] If enable is not set to true, or if targetResource is empty, only resource recommendations are generated; the recommended values are not applied in practice.

  • Cron Scaling: Based on standard cron expressions, scaling takes effect when enable is true and the current time falls within the start/end range. Outside this range, resources revert to the values specified when the workload was added. See the Cron Expression Reference.
autoScalingConfig:
    # Vertical scaling configuration
    autoSetResources:
      # Enable/disable
      enable: true
      # Target resource
      targetResource: all
      # Tflops usage percentile that will be used as a base for tflops target recommendation. Default: 0.9
      targetTflopsPercentile: 0.9
      # Tflops usage percentile that will be used for the lower bound on tflops recommendation. Default: 0.5
      lowerBoundTflopsPercentile: 0.5
      # Tflops usage percentile that will be used for the upper bound on tflops recommendation. Default: 0.95
      upperBoundTflopsPercentile: 0.95
      # Vram usage percentile that will be used as a base for vram target recommendation. Default: 0.9
      targetVramPercentile: 0.9
      # Vram usage percentile that will be used for the lower bound on vram recommendation. Default: 0.5
      lowerBoundVramPercentile: 0.5
      # Vram usage percentile that will be used for the upper bound on vram recommendation. Default: 0.95
      upperBoundVramPercentile: 0.95
      # Fraction of usage added as the safety margin to the recommended request. Default: 0.15
      requestMarginFraction: 0.15
      # The time interval used for computing the confidence multiplier for the lower and upper bound. Default: 24h
      confidenceInterval: 24h
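    # Horizontal scaling configuration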
    autoSetReplicas: {}
    # Cron-based scaling configuration
    cronScalingRules:
      # Enable/disable this rule
      - enable: true
        # Rule name
        name: "test"
        # Rule start time (cron expression): 00:00 every Thursday
        start: "0 0 * * Thu"
        # Rule end time (cron expression): 23:59 every Thursday
        end: "59 23 * * Thu"
        # Desired GPU resource
        desiredResources:
          limits:
            tflops: "99"
            vram: 10Gi
          requests:
            tflops: "44"
            vram: 5Gi
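
To make the percentile-based estimation concrete, here is a worked example following the margin description above, with illustrative numbers that are assumed for this sketch rather than taken from a real workload:

  # Suppose the TFLOPS usage histogram over the 24h confidence window yields:
  #   P50 = 40, P90 = 80, P95 = 95 (TFLOPS)
  # With the default percentiles, the estimator produces:
  #   LowerBound = 40, Target = 80, UpperBound = 95
  # The safety margin is then added to the target:
  #   recommended request = Target * (1 + requestMarginFraction)
  #                       = 80 * 1.15 = 92 TFLOPS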

Step 2. Monitor Scaling Status

Each workload generates a corresponding TensorFusionWorkload resource object, and its status fields reflect the current scaling state in real time.
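
You can inspect this status with kubectl. The following is a sketch; the lowercase plural resource name tensorfusionworkloads and the workload name are assumptions/placeholders:

  kubectl get tensorfusionworkloads <workload-name> -o yaml

A typical status looks like this: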

status:
  conditions:
    # Reason for GPU resource recommendation
    - lastTransitionTime: '2025-10-09T09:16:46Z'
      message: TFLOPS scaled up due to (1) below lower bound (2)
      reason: OutOfEstimatedBound
      status: 'True'
      type: RecommendationProvided
  # Current GPU resource recommendations
  recommendation:
    limits:
      tflops: '13'
      vram: 1Gi
    requests:
      tflops: '13'
      vram: 1Gi
  # Number of replicas with applied GPU resource recommendations
  appliedRecommendedReplicas: 3
  # Currently active cron scaling rule
  activeCronScalingRule: <...>
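
To follow the recommendation as it changes over time, a field-level watch can be used (again a sketch, with the same assumed resource name):

  kubectl get tensorfusionworkloads <workload-name> -o jsonpath='{.status.recommendation}' --watch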
