TensorFusionCluster

TensorFusionCluster is the Schema for the tensorfusionclusters API.

Kubernetes Resource Information

Field	Value
API Version	tensor-fusion.ai/v1
Kind	TensorFusionCluster
Scope	Cluster
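
As a rough orientation, the values in the table above map onto a manifest header like the following. This is a minimal sketch: the `metadata.name` and the pool name are hypothetical, and because this page does not list required fields, a real manifest may need more than is shown here.

```yaml
# Minimal skeleton; names are placeholders, not values from this reference.
apiVersion: tensor-fusion.ai/v1
kind: TensorFusionCluster
metadata:
  name: example-cluster        # hypothetical; no namespace, since Scope is Cluster
spec:
  gpuPools:
    - name: default-pool       # hypothetical pool name
      isDefault: true
```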

API Information
Spec
Status

Spec

TensorFusionClusterSpec defines the desired state of TensorFusionCluster.

Property	Type	Constraints	Description
computingVendor ↓	object		ComputingVendorConfig defines the cloud vendor connection, such as AWS, GCP, or Azure.
gpuPools ↓	array

computingVendor

ComputingVendorConfig defines the cloud vendor connection, such as AWS, GCP, or Azure.

Properties

Property	Type	Constraints	Description
authType	string	accessKey serviceAccountRole
enable	boolean		Default: `true`
name	string
params ↓	object
type	string	aws lambda-labs gcp azure oracle-oci ibm openshift vultr together-ai alibaba nvidia tencent runpod karpenter mock	Supports popular cloud providers.

params

Properties

Property	Type	Description
accessKeyPath	string	The secret holding the access key and secret key, or the config file, must be mounted as a file path.
configFile	string
defaultRegion	string
extraParams	object	Extra cloud vendor parameters set by the user. For example, on Alibaba Cloud: spotPriceLimit, spotDuration, spotInterruptionBehavior, systemDiskCategory, systemDiskSize, dataDiskPerformanceLevel. On AWS: TODO
iamRole	string	IAM role; preferred because it is more secure.
secretKeyPath	string
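
The fields above could be combined as in the sketch below, which uses role-based authentication. The vendor `name`, the region, and the role value are all hypothetical placeholders; in particular, whether `iamRole` expects an ARN or a short role name is an assumption not confirmed by this page.

```yaml
# Fragment of a TensorFusionCluster spec; all values are illustrative.
spec:
  computingVendor:
    name: example-aws                  # hypothetical connection name
    type: aws                          # one of the listed vendor types
    enable: true                       # matches the documented default
    authType: serviceAccountRole       # the alternative is accessKey
    params:
      defaultRegion: us-east-1         # hypothetical region
      iamRole: arn:aws:iam::123456789012:role/example-role   # hypothetical; format is an assumption
```

With `authType: accessKey`, the `accessKeyPath` and `secretKeyPath` fields would presumably point at files mounted from a secret instead.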

gpuPools (items)

Properties

Property	Type	Description
isDefault	boolean
name	string
specTemplate ↓	object	GPUPoolSpec defines the desired state of GPUPool.

specTemplate

GPUPoolSpec defines the desired state of GPUPool.

Properties

Property	Type	Description
capacityConfig ↓	object
componentConfig ↓	object	Customize system components for seamless onboarding.
nodeManagerConfig ↓	object
qosConfig ↓	object	Define different QoS and their price.
schedulingConfigTemplate	string
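
A pool entry might be laid out as follows. This is only a structural sketch: the pool name and the scheduling template name are hypothetical, and the nested objects are left as comments because their fields are described in their own sections.

```yaml
# Fragment of spec.gpuPools; names are placeholders.
spec:
  gpuPools:
    - name: default-pool                              # hypothetical
      isDefault: true
      specTemplate:
        schedulingConfigTemplate: example-scheduling  # hypothetical template name
        # capacityConfig:    see capacityConfig
        # componentConfig:   see componentConfig
        # nodeManagerConfig: see nodeManagerConfig
        # qosConfig:         see qosConfig
```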

capacityConfig

Properties

Property	Type	Constraints	Description
maxResources ↓	object
minResources ↓	object
oversubscription ↓	object
warmResources ↓	object

maxResources

Properties

Property	Type	Constraints	Description
cpu	any	pattern: Regex	CPU and memory are only available when the cloud vendor connection is enabled.
memory	any	pattern: Regex
tflops	any	pattern: Regex
vram	any	pattern: Regex

minResources

Properties

Property	Type	Constraints	Description
cpu	any	pattern: Regex	CPU and memory are only available when the cloud vendor connection is enabled.
memory	any	pattern: Regex
tflops	any	pattern: Regex
vram	any	pattern: Regex

oversubscription

Properties

Property	Type	Constraints	Description
tflopsOversellRatio	integer<int32>	min: 100 max: 100000	TFlops oversell ratio, as a percentage. The default of 500% means 5x overselling. Default: `500`
vramExpandToHostDisk	integer<int32>	min: 0 max: 100	Percentage of host disk appended to GPU VRAM. Default: `70`
vramExpandToHostMem	integer<int32>	min: 0 max: 100	Percentage of host RAM appended to GPU VRAM. Default: `50`

warmResources

Properties

Property	Type	Constraints	Description
cpu	any	pattern: Regex	CPU and memory are only available when the cloud vendor connection is enabled.
memory	any	pattern: Regex
tflops	any	pattern: Regex
vram	any	pattern: Regex
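
The four capacityConfig sub-objects could be filled in as sketched below. The resource values assume Kubernetes-style quantity strings, which is a guess: the validation pattern is not reproduced on this page. The amounts themselves are arbitrary examples, while the oversubscription values are the documented defaults.

```yaml
# Fragment of spec.gpuPools[].specTemplate; quantities are assumed formats.
capacityConfig:
  minResources:
    tflops: "100"            # hypothetical amount
    vram: 160Gi              # hypothetical amount
  maxResources:
    tflops: "10000"
    vram: 16Ti
  warmResources:
    tflops: "200"
    vram: 320Gi
  oversubscription:
    tflopsOversellRatio: 500     # percent; 500 means 5x oversell
    vramExpandToHostMem: 50      # percent of host RAM
    vramExpandToHostDisk: 70     # percent of host disk
```

The `cpu` and `memory` fields are omitted here because, per the tables above, they only apply when the cloud vendor connection is enabled.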

componentConfig

Customize system components for seamless onboarding.

Properties

Property	Type	Constraints	Description
client ↓	object
hypervisor ↓	object
nodeDiscovery ↓	object
worker ↓	object

client

Properties

Property	Type	Constraints	Description
embeddedModeImage	string
operatorEndpoint	string
patchEmbeddedWorkerToPod	object
patchToContainer	object
patchToEmbeddedWorkerContainer	object
patchToPod	object
remoteModeImage	string

hypervisor

Properties

Property	Type	Constraints	Description
enableVector	boolean
image	string
podTemplate	object
portNumber	integer<int32>	min: 0 max: 65535	Default: `8000`
vectorImage	string

nodeDiscovery

Properties

Property	Type	Constraints	Description
image	string
podTemplate	object

worker

Properties

Property	Type	Constraints	Description
image	string
podTemplate	object
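
A componentConfig override might look like the sketch below. Every image reference and the endpoint URL are hypothetical placeholders, not real TensorFusion artifacts; only the field names and the port default come from the tables above.

```yaml
# Fragment of spec.gpuPools[].specTemplate; images and URL are placeholders.
componentConfig:
  hypervisor:
    image: registry.example.com/tensor-fusion/hypervisor:example-tag
    portNumber: 8000             # documented default
    enableVector: false
  worker:
    image: registry.example.com/tensor-fusion/worker:example-tag
  nodeDiscovery:
    image: registry.example.com/tensor-fusion/node-discovery:example-tag
  client:
    operatorEndpoint: http://operator.example.internal:8080   # hypothetical
    remoteModeImage: registry.example.com/tensor-fusion/client:example-tag
```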

nodeManagerConfig

Properties

Property	Type	Constraints	Description
nodeCompaction ↓	object
nodePoolRollingUpdatePolicy ↓	object
nodeProvisioner ↓	object		NodeProvisioner and NodeSelector are mutually exclusive. NodeSelector is for existing GPU nodes; NodeProvisioner is for Karpenter-like automatic management.
nodeSelector ↓	object		A node selector represents the union of the results of one or more label queries over a set of nodes; that is, it represents the OR of the selectors represented by the node selector terms.
provisioningMode	string	Provisioned AutoSelect Karpenter	Default: `AutoSelect`

nodeCompaction

Properties

Property	Type	Constraints	Description
period	string		Default: `5m`

nodePoolRollingUpdatePolicy

Properties

Property	Type	Constraints	Description
autoUpdate	boolean		Default: `true`
batchInterval	string		Default: `10m`
batchPercentage	integer<int32>	min: 0 max: 100	Default: `100`
maintenanceWindow ↓	object
maxDuration	string		Default: `10m`

maintenanceWindow

Properties

Property	Type	Constraints	Description
includes	array		Entries use crontab syntax.
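
The compaction and rolling update settings above could be combined as in this sketch. It assumes `includes` is a list of strings, each holding one crontab expression; how the window length is derived from an entry is not stated on this page.

```yaml
# Fragment of spec.gpuPools[].specTemplate; values are illustrative.
nodeManagerConfig:
  provisioningMode: AutoSelect     # documented default
  nodeCompaction:
    period: 5m                     # documented default
  nodePoolRollingUpdatePolicy:
    autoUpdate: true
    batchPercentage: 25            # hypothetical; the default is 100
    batchInterval: 10m
    maxDuration: 10m
    maintenanceWindow:
      includes:
        - "0 2 * * 6"              # 02:00 every Saturday in standard cron notation
```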

nodeProvisioner

NodeProvisioner and NodeSelector are mutually exclusive.
NodeSelector is for existing GPU nodes; NodeProvisioner is for Karpenter-like automatic management.

Properties

Property	Type	Description
budget ↓	object	NodeProvisioner runs virtual billing based on public or customized pricing. If VM costs exceed any budget constraint, no new VM is created and alerts are generated.
cpuNodeLabels	object
cpuRequirements ↓	array
cpuTaints ↓	array
gpuNodeAnnotations	object
gpuNodeLabels	object
gpuRequirements ↓	array
gpuTaints ↓	array
karpenterNodeClassRef ↓	object	Karpenter NodeClass name
nodeClass	string	TensorFusion GPUNodeClass name

budget

NodeProvisioner runs virtual billing based on public or customized pricing. If VM costs exceed any budget constraint, no new VM is created and alerts are generated.

Properties

Property	Type	Constraints	Description
budgetExceedStrategy	string	AlertOnly AlertAndTerminateVM	Default: `AlertOnly`
budgetPerDay	string		Default: `100`
budgetPerMonth	string		Default: `1000`
budgetPerQuarter	string		Default: `3000`

cpuRequirements (items)

Properties

Property	Type	Constraints	Description
key	string	node.kubernetes.io/instance-type kubernetes.io/arch kubernetes.io/os topology.kubernetes.io/region topology.kubernetes.io/zone karpenter.sh/capacity-type tensor-fusion.ai/gpu-vendor tensor-fusion.ai/gpu-instance-family tensor-fusion.ai/gpu-instance-size
operator	string	In Exists DoesNotExist Gt Lt	A node selector operator is the set of operators that can be used in a node selector requirement. Default: `In`
values	array

cpuTaints (items)

Properties

Property	Type	Constraints	Description
effect	string	NoSchedule NoExecute PreferNoSchedule	Default: `NoSchedule`
key	string
value	string

gpuRequirements (items)

Properties

Property	Type	Constraints	Description
key	string	node.kubernetes.io/instance-type kubernetes.io/arch kubernetes.io/os topology.kubernetes.io/region topology.kubernetes.io/zone karpenter.sh/capacity-type tensor-fusion.ai/gpu-vendor tensor-fusion.ai/gpu-instance-family tensor-fusion.ai/gpu-instance-size
operator	string	In Exists DoesNotExist Gt Lt	A node selector operator is the set of operators that can be used in a node selector requirement. Default: `In`
values	array

gpuTaints (items)

Properties

Property	Type	Constraints	Description
effect	string	NoSchedule NoExecute PreferNoSchedule	Default: `NoSchedule`
key	string
value	string

karpenterNodeClassRef

Karpenter NodeClass name

Properties

Property	Type	Constraints	Description
group	string
kind	string
name	string
version	string
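
Putting the nodeProvisioner pieces together, a configuration might resemble the sketch below. The requirement keys are taken from the allowed list above; the node class name, label, taint key, and requirement values are hypothetical. The budget figures are the documented defaults, written as strings to match the field type.

```yaml
# Fragment of spec.gpuPools[].specTemplate.nodeManagerConfig; values are illustrative.
nodeProvisioner:
  nodeClass: example-gpu-node-class          # hypothetical GPUNodeClass name
  gpuRequirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    - key: tensor-fusion.ai/gpu-vendor
      operator: In
      values: ["nvidia"]                     # hypothetical value
  gpuTaints:
    - key: example.com/gpu-only              # hypothetical taint key
      value: "true"
      effect: NoSchedule                     # documented default
  gpuNodeLabels:
    example.com/pool: default-pool           # hypothetical label
  budget:
    budgetPerDay: "100"
    budgetPerMonth: "1000"
    budgetPerQuarter: "3000"
    budgetExceedStrategy: AlertOnly          # or AlertAndTerminateVM
```

Because nodeProvisioner and nodeSelector are exclusive, a pool configured this way would not also set nodeSelector.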

nodeSelector

A node selector represents the union of the results of one or more label queries
over a set of nodes; that is, it represents the OR of the selectors represented
by the node selector terms.

Properties

Property	Type	Constraints	Description
nodeSelectorTerms ↓	array		Required. A list of node selector terms. The terms are ORed.

nodeSelectorTerms (items)

Required. A list of node selector terms. The terms are ORed.

Properties

Property	Type	Constraints	Description
matchExpressions ↓	array		A list of node selector requirements by node's labels.
matchFields ↓	array		A list of node selector requirements by node's fields.

matchExpressions (items)

A list of node selector requirements by node's labels.

Properties

Property	Type	Description
key	string	The label key that the selector applies to.
operator	string	Represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt.
values	array	An array of string values. If the operator is In or NotIn, the values array must be non-empty. If the operator is Exists or DoesNotExist, the values array must be empty. If the operator is Gt or Lt, the values array must have a single element, which will be interpreted as an integer. This array is replaced during a strategic merge patch.

matchFields (items)

A list of node selector requirements by node's fields.

Properties

Property	Type	Description
key	string	The label key that the selector applies to.
operator	string	Represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt.
values	array	An array of string values. If the operator is In or NotIn, the values array must be non-empty. If the operator is Exists or DoesNotExist, the values array must be empty. If the operator is Gt or Lt, the values array must have a single element, which will be interpreted as an integer. This array is replaced during a strategic merge patch.
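
For pools built from existing nodes, the selector structure above might be used as follows. The sketch shows two terms, which are ORed: a node matches if it satisfies either one. Both label keys are assumptions about how a cluster labels its GPU nodes and are not defined by this reference.

```yaml
# Fragment of spec.gpuPools[].specTemplate.nodeManagerConfig; labels are assumed.
nodeSelector:
  nodeSelectorTerms:
    - matchExpressions:
        - key: nvidia.com/gpu.present        # assumes nodes carry this label
          operator: In
          values: ["true"]                   # In requires a non-empty list
    - matchExpressions:
        - key: example.com/gpu-node          # hypothetical label
          operator: Exists                   # Exists requires values to be empty
```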

qosConfig

Define different QoS and their price.

Properties

Property	Type	Constraints
defaultQoS	string	low medium high critical
definitions ↓	array
pricing ↓	array

definitions (items)

Properties

Property	Type	Constraints
description	string
name	string	low medium high critical
priority	integer

pricing (items)

Properties

Property	Type	Constraints	Description
limitsOverRequests	string		By default, requests and limits are priced the same, which represents normal on-demand serverless GPU usage. For hands-on lab or low QoS cases, limitsOverRequests should be lower so that users can get burstable GPU resources at very low cost. Default: `1`
qos	string	low medium high critical
requests ↓	object		Default pricing is based on the per-second pricing from https://modal.com/pricing, with a Tensor/CUDA Core : HBM ratio of 2:1.

requests

Default pricing is based on the per-second pricing from https://modal.com/pricing,
with a Tensor/CUDA Core : HBM ratio of 2:1.

Properties

Property	Type	Constraints	Description
perFP16TFlopsPerHour	string		Default: `$0.0069228`
perGBOfVRAMPerHour	string		Default: `$0.01548`
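
A qosConfig block might be assembled as in the sketch below. The descriptions, the priority numbers (including which direction counts as more important), and the lower `limitsOverRequests` value for the low tier are all hypothetical. The medium tier uses the documented default prices.

```yaml
# Fragment of spec.gpuPools[].specTemplate; tiers and numbers are illustrative.
qosConfig:
  defaultQoS: medium
  definitions:
    - name: low
      description: Hands-on labs and other burstable workloads   # hypothetical
      priority: 1                                                # hypothetical ordering
    - name: medium
      description: Normal on-demand usage                        # hypothetical
      priority: 2
  pricing:
    - qos: medium
      limitsOverRequests: "1"                  # documented default
      requests:
        perFP16TFlopsPerHour: "$0.0069228"     # documented default
        perGBOfVRAMPerHour: "$0.01548"         # documented default
    - qos: low
      limitsOverRequests: "0.5"                # hypothetical lower ratio
      requests:
        perFP16TFlopsPerHour: "$0.0069228"
        perGBOfVRAMPerHour: "$0.01548"
```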

Status

TensorFusionClusterStatus defines the observed state of TensorFusionCluster.

Property	Type	Constraints	Description
allocatedTFlopsPercent	string
allocatedVRAMPercent	string
availableTFlops	any	pattern: Regex
availableVRAM	any	pattern: Regex
cloudVendorConfigHash	string
conditions ↓	array
notReadyGPUPools	array
phase	string	Pending Running Updating Destroying Unknown	TensorFusionClusterPhase represents the phase of the TensorFusionCluster resource. Default: `Pending`
potentialSavingsPerMonth	string
readyGPUPools	array
retryCount	integer<int64>		Default: `0`
savedCostsPerMonth	string
totalGPUs	integer<int32>
totalNodes	integer<int32>
totalPools	integer<int32>
totalTFlops	any	pattern: Regex
totalVRAM	any	pattern: Regex
utilizedTFlopsPercent	string
utilizedVRAMPercent	string
virtualAvailableTFlops	any	pattern: Regex
virtualAvailableVRAM	any	pattern: Regex
virtualTFlops	any	pattern: Regex
virtualVRAM	any	pattern: Regex

conditions (items)

Properties

Property	Type	Constraints	Description
lastTransitionTime	string<date-time>		lastTransitionTime is the last time the condition transitioned from one status to another. This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable.
message	string	maxLength: 32768	message is a human readable message indicating details about the transition. This may be an empty string.
observedGeneration	integer<int64>	min: 0	observedGeneration represents the .metadata.generation that the condition was set based upon. For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date with respect to the current state of the instance.
reason	string	minLength: 1 maxLength: 1024 pattern: Regex	reason contains a programmatic identifier indicating the reason for the condition's last transition. Producers of specific condition types may define expected values and meanings for this field, and whether the values are considered a guaranteed API. The value should be a CamelCase string. This field may not be empty.
status	string	True False Unknown	status of the condition, one of True, False, Unknown.
type	string	maxLength: 316 pattern: Regex	type of condition in CamelCase or in foo.example.com/CamelCase.
