
SchedulingConfigTemplate

SchedulingConfigTemplate is the Schema for the schedulingconfigtemplates API.

Kubernetes Resource Information

| Field | Value |
| --- | --- |
| API Version | tensor-fusion.ai/v1 |
| Kind | SchedulingConfigTemplate |
| Scope | Cluster |


Spec

Places the workload on the right nodes and scales it smartly.

| Property | Type | Constraints | Description |
| --- | --- | --- | --- |
| autoScaling | object | | Scales the workload based on usage and traffic. |
| hypervisor | object | | Single-GPU-device multi-process queuing and fair scheduling with QoS constraints. |
| placement | object | | Places the client or worker on the best-matched nodes. |
| reBalancer | object | | Avoids hot GPU devices and continuously rebalances the workload. Implemented by triggering a simulated scheduling run and advising better GPU nodes to the scheduler. |
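
For orientation, here is a minimal manifest sketch. The apiVersion, kind, and spec block names come from this page; the template name and all values are illustrative assumptions, not documented defaults:

```yaml
apiVersion: tensor-fusion.ai/v1
kind: SchedulingConfigTemplate
metadata:
  name: example-template    # cluster-scoped, so no namespace
spec:
  autoScaling: {}           # usage- and traffic-based scaling (see below)
  hypervisor: {}            # per-GPU queuing and QoS (see below)
  placement:
    mode: CompactFirst      # documented default
  reBalancer:
    enable: true
```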

autoScaling

Scales the workload based on usage and traffic.

Properties

| Property | Type | Constraints | Description |
| --- | --- | --- | --- |
| autoSetLimits | object | | Layer 1, vertical auto-scaling: turbo-bursts onto existing GPU cards quickly. VPA-like; aggregates metrics over windows shorter than 1 minute. |
| autoSetReplicas | object | | Layer 2, horizontal auto-scaling: scales out to more GPU cards when the max-limits threshold is hit. HPA-like; aggregates metrics over 1 minute to 1 hour. When a tf-worker scales up, the client pod's owner (Deployment, etc.) should also increase its replicas; check whether KNative works for this. |
| autoSetRequests | object | | Layer 3 adjustment: matches baseline requests to actual usage over the long run. Only for N:M remote vGPU mode; not implemented yet. Adjusts requests over longer periods, such as 1 day to 2 weeks. |

autoSetLimits

Layer 1, vertical auto-scaling: turbo-bursts onto existing GPU cards quickly. VPA-like; aggregates metrics over windows shorter than 1 minute.

Properties

| Property | Type | Constraints | Description |
| --- | --- | --- | --- |
| enable | boolean | | |
| evaluationPeriod | string | | |
| extraTFlopsBufferRatio | string | | |
| ignoredDeltaRange | string | | |
| maxRatioToRequests | string | | Multiplier of requests that caps how high limits can be set, e.g. 5.0. |
| prediction | object | | |
| scaleUpStep | string | | |
| targetResource | string | | Target resource to scale limits: "tflops", "vram", or "all" (default). |
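
A hedged sketch of an autoSetLimits block under spec.autoScaling. The field names come from the table above, but every value, the duration format, and the prediction model name are assumptions:

```yaml
spec:
  autoScaling:
    autoSetLimits:
      enable: true
      evaluationPeriod: 30s        # assumed Go-style duration string
      extraTFlopsBufferRatio: "0.2"
      ignoredDeltaRange: "0.05"
      maxRatioToRequests: "5.0"    # cap limits at 5x requests
      scaleUpStep: "0.1"
      targetResource: all          # "tflops", "vram", or "all" (default)
      prediction:
        enable: true
        historyDataPeriod: 24h
        model: example-model       # hypothetical model name
        predictionPeriod: 5m
```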

prediction

Properties

| Property | Type | Constraints | Description |
| --- | --- | --- | --- |
| enable | boolean | | |
| historyDataPeriod | string | | |
| model | string | | |
| predictionPeriod | string | | |

autoSetReplicas

Layer 2, horizontal auto-scaling: scales out to more GPU cards when the max-limits threshold is hit. HPA-like; aggregates metrics over 1 minute to 1 hour. When a tf-worker scales up, the client pod's owner (Deployment, etc.) should also increase its replicas; check whether KNative works for this.

Properties

| Property | Type | Constraints | Description |
| --- | --- | --- | --- |
| enable | boolean | | |
| evaluationPeriod | string | | |
| scaleDownCoolDownTime | string | | |
| scaleDownStep | string | | |
| scaleUpCoolDownTime | string | | |
| scaleUpStep | string | | |
| targetTFlopsOfLimits | string | | |
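
A sketch of autoSetReplicas with assumed values; in particular, reading targetTFlopsOfLimits as a utilization threshold relative to limits is a guess from its name, not documented behavior:

```yaml
spec:
  autoScaling:
    autoSetReplicas:
      enable: true
      evaluationPeriod: 5m
      targetTFlopsOfLimits: "0.8"   # assumed: scale out when TFLOPS usage nears 80% of limits
      scaleUpStep: "1"
      scaleUpCoolDownTime: 2m
      scaleDownStep: "1"
      scaleDownCoolDownTime: 10m
```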

autoSetRequests

Layer 3 adjustment: matches baseline requests to actual usage over the long run. Only for N:M remote vGPU mode; not implemented yet. Adjusts requests over longer periods, such as 1 day to 2 weeks.

Properties

| Property | Type | Constraints | Description |
| --- | --- | --- | --- |
| aggregationPeriod | string | | |
| enable | boolean | | |
| evaluationPeriod | string | | |
| extraBufferRatio | string | | Request buffer ratio; for example, if actual usage is 1.0, a 10% buffer yields 1.1 as the final preferred requests. |
| percentileForAutoRequests | string | | |
| prediction | object | | |
| targetResource | string | | Target resource to scale requests: "tflops", "vram", or "all" (default). |
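
A sketch of autoSetRequests (not implemented yet, per the table above). The comment shows the buffer arithmetic from the extraBufferRatio description; all other values are assumptions:

```yaml
spec:
  autoScaling:
    autoSetRequests:
      enable: true
      aggregationPeriod: 1h
      evaluationPeriod: 24h            # long-run window, e.g. 1 day to 2 weeks
      percentileForAutoRequests: "95"  # assumed: percentile of observed usage
      extraBufferRatio: "0.1"          # usage 1.0 -> requests 1.0 * (1 + 0.1) = 1.1
      targetResource: all
      prediction:
        enable: false
```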

prediction

Properties

| Property | Type | Constraints | Description |
| --- | --- | --- | --- |
| enable | boolean | | |
| historyDataPeriod | string | | |
| model | string | | |
| predictionPeriod | string | | |

hypervisor

Single-GPU-device multi-process queuing and fair scheduling with QoS constraints.

Properties

| Property | Type | Constraints | Description |
| --- | --- | --- | --- |
| autoFreezeAndResume | object | | Additional layer to save VRAM: automatically freezes GPU memory and cools it down to RAM and disk. The hypervisor monitors inactive workers and triggers their freeze; the Operator should mark them as scaled-to-zero and release their GPU pool resources. The CPU client part is not scaled down, so it can keep serving traffic or be scaled down by other auto-scaling solutions such as KEDA/KNative. |
| multiProcessQueuing | object | | The hypervisor moves low-priority jobs to a pending queue when the GPU is full. This config adjusts the hypervisor's queuing behavior to balance co-scheduled CUDA calls. |

autoFreezeAndResume

Additional layer to save VRAM: automatically freezes GPU memory and cools it down to RAM and disk. The hypervisor monitors inactive workers and triggers their freeze; the Operator should mark them as scaled-to-zero and release their GPU pool resources. The CPU client part is not scaled down, so it can keep serving traffic or be scaled down by other auto-scaling solutions such as KEDA/KNative.

Properties

| Property | Type | Constraints | Description |
| --- | --- | --- | --- |
| autoFreeze | array | | |
| intelligenceWarmup | object | | |

autoFreeze (items)

Properties

| Property | Type | Constraints | Description |
| --- | --- | --- | --- |
| enable | boolean | | |
| freezeToDiskTTL | string | | |
| freezeToMemTTL | string | | |
| qos | string | Enum: low, medium, high, critical | |

intelligenceWarmup

Properties

| Property | Type | Constraints | Description |
| --- | --- | --- | --- |
| enable | boolean | | |
| historyDataPeriod | string | | |
| model | string | | |
| predictionPeriod | string | | |
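
A sketch of autoFreezeAndResume covering both sub-blocks. The qos values are the documented enum; the TTL semantics (idle time before each freeze tier), all durations, and the model name are assumptions:

```yaml
spec:
  hypervisor:
    autoFreezeAndResume:
      autoFreeze:
        - qos: low
          enable: true
          freezeToMemTTL: 5m     # assumed: idle time before freezing VRAM to RAM
          freezeToDiskTTL: 30m   # assumed: further idle time before spilling to disk
        - qos: critical
          enable: false          # never freeze critical workloads
      intelligenceWarmup:
        enable: true
        historyDataPeriod: 7d
        model: example-model     # hypothetical model name
        predictionPeriod: 1h
```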

multiProcessQueuing

The hypervisor moves low-priority jobs to a pending queue when the GPU is full. This config adjusts the hypervisor's queuing behavior to balance co-scheduled CUDA calls.

Properties

| Property | Type | Constraints | Description |
| --- | --- | --- | --- |
| enable | boolean | | |
| interval | string | | |
| queueLevelTimeSlices | array | | |
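
A sketch of multiProcessQueuing. The schema only says queueLevelTimeSlices is an array, so treating its entries as one time slice per priority level is an assumption:

```yaml
spec:
  hypervisor:
    multiProcessQueuing:
      enable: true
      interval: 100ms           # assumed queue-evaluation cadence
      queueLevelTimeSlices:     # assumed: one time slice per priority level
        - 10ms
        - 50ms
        - 200ms
```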

placement

Places the client or worker on the best-matched nodes.

Properties

| Property | Type | Constraints | Description |
| --- | --- | --- | --- |
| allowUsingLocalGPU | boolean | Default: true | |
| gpuFilters | array | | |
| mode | string | Enum: CompactFirst, LowLoadFirst. Default: CompactFirst | |

gpuFilters (items)

Properties

| Property | Type | Constraints | Description |
| --- | --- | --- | --- |
| params | object | | |
| type | string | | |
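
A sketch of placement. The mode and allowUsingLocalGPU values are documented; the filter type and params are hypothetical, since the schema leaves them as an open object:

```yaml
spec:
  placement:
    mode: LowLoadFirst        # or CompactFirst (default)
    allowUsingLocalGPU: true  # default: true
    gpuFilters:
      - type: example-filter  # hypothetical filter type
        params:               # free-form parameters for the filter
          key: value
```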

reBalancer

Avoids hot GPU devices and continuously rebalances the workload. Implemented by triggering a simulated scheduling run and advising better GPU nodes to the scheduler.

Properties

| Property | Type | Constraints | Description |
| --- | --- | --- | --- |
| enable | boolean | | |
| interval | string | | |
| reBalanceCoolDownTime | string | | |
| threshold | object | | |

threshold

Properties

| Property | Type | Constraints | Description |
| --- | --- | --- | --- |
| matchAny | object | | |
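
A sketch of reBalancer. Since matchAny is an open object in the schema, the threshold key shown is hypothetical:

```yaml
spec:
  reBalancer:
    enable: true
    interval: 5m
    reBalanceCoolDownTime: 30m
    threshold:
      matchAny:
        gpuUtilizationAbove: "0.9"   # hypothetical threshold key
```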

Status

SchedulingConfigTemplateStatus defines the observed state of SchedulingConfigTemplate.