TensorFusionCluster
TensorFusionCluster is the Schema for the tensorfusionclusters API.
Kubernetes Resource Information
Field | Value |
---|---|
API Version | tensor-fusion.ai/v1 |
Kind | TensorFusionCluster |
Scope | Cluster |
Spec
TensorFusionClusterSpec defines the desired state of TensorFusionCluster.
Property | Type | Constraints | Description |
---|---|---|---|
computingVendor | object | | ComputingVendorConfig defines the cloud vendor connection, such as AWS, GCP, or Azure. |
gpuPools | array | ||
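As a rough illustration of how the top-level fields fit together, the sketch below builds a minimal manifest as a Python dict and prints it as JSON. The names `example-cluster` and `example-pool` are placeholders, and the empty `specTemplate` is an assumption made for brevity; the API server may require more fields than this.

```python
import json


def build_cluster_manifest(cluster_name, pool_name):
    """Return a minimal TensorFusionCluster manifest as a plain dict."""
    return {
        "apiVersion": "tensor-fusion.ai/v1",
        "kind": "TensorFusionCluster",
        # The resource is cluster-scoped, so metadata carries no namespace.
        "metadata": {"name": cluster_name},
        "spec": {
            "gpuPools": [
                {
                    "name": pool_name,
                    "isDefault": True,
                    # GPUPoolSpec fields (capacityConfig, qosConfig, ...) go here.
                    "specTemplate": {},
                }
            ]
        },
    }


# Placeholder names, chosen only for this example.
manifest = build_cluster_manifest("example-cluster", "example-pool")
print(json.dumps(manifest, indent=2))
```

Because JSON is also valid YAML, the printed output can be saved to a file and passed to `kubectl apply -f`.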
computingVendor
ComputingVendorConfig defines the cloud vendor connection, such as AWS, GCP, or Azure.
Properties
Property | Type | Constraints | Description |
---|---|---|---|
authType | string | accessKey, serviceAccountRole | |
enable | boolean | | Default: true |
name | string | ||
params | object | ||
type | string | aws, lambda-labs, gcp, azure, oracle-oci, ibm, openshift, vultr, together-ai, alibaba, nvidia, tencent, runpod, karpenter, mock | Supports popular cloud providers. |
params
Properties
Property | Type | Constraints | Description |
---|---|---|---|
accessKeyPath | string | | The secret containing the access key and secret key, or a config file; it must be mounted as a file path. |
configFile | string | ||
defaultRegion | string | ||
extraParams | object | | Extra cloud vendor parameters set by the user, e.g. for Alibaba Cloud: spotPriceLimit, spotDuration, spotInterruptionBehavior, systemDiskCategory, systemDiskSize, dataDiskPerformanceLevel. For AWS: TODO. |
iamRole | string | | An IAM role is preferred because it is more secure. |
secretKeyPath | string | ||
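The enumerated values above can be checked on the client side before a manifest is submitted. The following is a minimal sketch, not the operator's own validation: `check_computing_vendor` is a hypothetical helper, the sample values (`example-vendor`, `us-east-1`, `example-role`) are placeholders, and fields are only checked when present because this page does not say which ones are required.

```python
# Allowed values copied from the computingVendor table.
VENDOR_TYPES = {
    "aws", "lambda-labs", "gcp", "azure", "oracle-oci", "ibm", "openshift",
    "vultr", "together-ai", "alibaba", "nvidia", "tencent", "runpod",
    "karpenter", "mock",
}
AUTH_TYPES = {"accessKey", "serviceAccountRole"}


def check_computing_vendor(cfg):
    """Return a list of problems found in a computingVendor dict."""
    problems = []
    vendor_type = cfg.get("type")
    if vendor_type is not None and vendor_type not in VENDOR_TYPES:
        problems.append(f"unsupported type: {vendor_type!r}")
    auth_type = cfg.get("authType")
    if auth_type is not None and auth_type not in AUTH_TYPES:
        problems.append(f"unsupported authType: {auth_type!r}")
    return problems


# Placeholder configuration using an IAM role instead of key files.
vendor = {
    "name": "example-vendor",
    "type": "aws",
    "authType": "serviceAccountRole",
    "enable": True,
    "params": {"defaultRegion": "us-east-1", "iamRole": "example-role"},
}
print(check_computing_vendor(vendor))
print(check_computing_vendor({"type": "digitalocean"}))
```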
gpuPools (items)
Properties
Property | Type | Constraints | Description |
---|---|---|---|
isDefault | boolean | ||
name | string | ||
specTemplate | object | | GPUPoolSpec defines the desired state of GPUPool. |
specTemplate
GPUPoolSpec defines the desired state of GPUPool.
Properties
Property | Type | Constraints | Description |
---|---|---|---|
capacityConfig | object | ||
componentConfig | object | | Customize system components for seamless onboarding. |
nodeManagerConfig | object | ||
qosConfig | object | | Define the QoS levels and their prices. |
schedulingConfigTemplate | string | ||
capacityConfig
Properties
Property | Type | Constraints | Description |
---|---|---|---|
maxResources | object | ||
minResources | object | ||
oversubscription | object | ||
warmResources | object | ||
maxResources
Properties
Property | Type | Constraints | Description |
---|---|---|---|
cpu | any | pattern: Regex | CPU and memory are only available when the CloudVendor connection is enabled. |
memory | any | pattern: Regex | |
tflops | any | pattern: Regex | |
vram | any | pattern: Regex | |
minResources
Properties
Property | Type | Constraints | Description |
---|---|---|---|
cpu | any | pattern: Regex | CPU and memory are only available when the CloudVendor connection is enabled. |
memory | any | pattern: Regex | |
tflops | any | pattern: Regex | |
vram | any | pattern: Regex | |
oversubscription
Properties
Property | Type | Constraints | Description |
---|---|---|---|
tflopsOversellRatio | integer<int32> | min: 100, max: 100000 | The TFLOPs oversell multiplier, as a percentage. The default of 500% means overselling 5 times. Default: 500 |
vramExpandToHostDisk | integer<int32> | min: 0, max: 100 | The percentage of host disk appended to GPU VRAM. Default: 70 |
vramExpandToHostMem | integer<int32> | min: 0, max: 100 | The percentage of host RAM appended to GPU VRAM. Default: 50 |
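The arithmetic implied by these three percentages can be sketched as follows. This is one reading of the field descriptions, not the scheduler's confirmed formula: it assumes virtual TFLOPs are physical TFLOPs scaled by the ratio, and that the host RAM and host disk shares are simply added to physical VRAM. The function name, parameters, and sample node sizes are all hypothetical.

```python
def virtual_capacity(tflops, vram_gib, host_mem_gib, host_disk_gib,
                     tflops_oversell_ratio=500,
                     vram_expand_to_host_mem=50,
                     vram_expand_to_host_disk=70):
    """Estimate virtual TFLOPs and VRAM under the assumed formula."""
    # Enforce the ranges listed in the Constraints column.
    if not 100 <= tflops_oversell_ratio <= 100000:
        raise ValueError("tflopsOversellRatio must be within 100..100000")
    for pct in (vram_expand_to_host_mem, vram_expand_to_host_disk):
        if not 0 <= pct <= 100:
            raise ValueError("VRAM expansion percentages must be within 0..100")

    # 500 means 500%, i.e. 5 times the physical TFLOPs.
    virtual_tflops = tflops * tflops_oversell_ratio / 100
    # Assumed: the host RAM and disk shares are appended to physical VRAM.
    virtual_vram = (vram_gib
                    + host_mem_gib * vram_expand_to_host_mem / 100
                    + host_disk_gib * vram_expand_to_host_disk / 100)
    return virtual_tflops, virtual_vram


# Hypothetical node: 100 TFLOPs, 80 GiB VRAM, 256 GiB RAM, 1000 GiB disk.
print(virtual_capacity(100, 80, 256, 1000))
```

With the defaults, this hypothetical node would expose 500 virtual TFLOPs and 80 + 128 + 700 = 908 GiB of virtual VRAM.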
componentConfig
Customize system components for seamless onboarding.
Properties
Property | Type | Constraints | Description |
---|---|---|---|
client | object | ||
hypervisor | object | ||
nodeDiscovery | object | ||
worker | object | ||
client
Properties
Property | Type | Constraints | Description |
---|---|---|---|
embeddedModeImage | string | ||
operatorEndpoint | string | ||
patchEmbeddedWorkerToPod | object | ||
patchToContainer | object | ||
patchToEmbeddedWorkerContainer | object | ||
patchToPod | object | ||
remoteModeImage | string | ||
nodeManagerConfig
Properties
Property | Type | Constraints | Description |
---|---|---|---|
nodeCompaction | object | ||
nodePoolRollingUpdatePolicy | object | ||
nodeProvisioner | object | | NodeProvisioner and NodeSelector are mutually exclusive. NodeSelector is for existing GPUs; NodeProvisioner is for Karpenter-like automatic management. |
nodeSelector | object | | A node selector represents the union of the results of one or more label queries over a set of nodes; that is, it represents the OR of the selectors represented by the node selector terms. |
provisioningMode | string | Provisioned, AutoSelect, Karpenter | Default: AutoSelect |
nodePoolRollingUpdatePolicy
Properties
Property | Type | Constraints | Description |
---|---|---|---|
autoUpdate | boolean | | Default: true |
batchInterval | string | | Default: 10m |
batchPercentage | integer<int32> | min: 0, max: 100 | Default: 100 |
maintenanceWindow | object | ||
maxDuration | string | | Default: 10m |
nodeProvisioner
NodeProvisioner and NodeSelector are mutually exclusive.
NodeSelector is for existing GPUs; NodeProvisioner is for Karpenter-like automatic management.
Properties
Property | Type | Constraints | Description |
---|---|---|---|
budget | object | | NodeProvisioner starts virtual billing based on public or customized pricing. If the VM costs exceed any budget constraint, no new VM is created and alerts are generated. |
cpuNodeLabels | object | ||
cpuRequirements | array | ||
cpuTaints | array | ||
gpuNodeAnnotations | object | ||
gpuNodeLabels | object | ||
gpuRequirements | array | ||
gpuTaints | array | ||
karpenterNodeClassRef | object | | Karpenter NodeClass name |
nodeClass | string | | TensorFusion GPUNodeClass name |
budget
NodeProvisioner starts virtual billing based on public or customized pricing. If the VM costs exceed any budget constraint, no new VM is created and alerts are generated.
Properties
Property | Type | Constraints | Description |
---|---|---|---|
budgetExceedStrategy | string | AlertOnly, AlertAndTerminateVM | Default: AlertOnly |
budgetPerDay | string | | Default: 100 |
budgetPerMonth | string | | Default: 1000 |
budgetPerQuarter | string | | Default: 3000 |
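The budget rule described above (any exceeded limit blocks new VMs) can be sketched in a few lines. This is an illustration only: `exceeded_budgets` is a hypothetical helper, the spend figures are made up, and the code assumes budget strings are plain decimal numbers like the documented defaults. How the operator actually tracks spend is not described on this page.

```python
from decimal import Decimal

# Defaults from the budget table; values are strings in the CRD.
DEFAULT_BUDGET = {
    "budgetPerDay": "100",
    "budgetPerMonth": "1000",
    "budgetPerQuarter": "3000",
}


def exceeded_budgets(spend, budget=DEFAULT_BUDGET):
    """Return the budget fields whose limit is exceeded by the given spend."""
    return [
        field
        for field, limit in budget.items()
        if Decimal(spend.get(field, "0")) > Decimal(limit)
    ]


# Hypothetical accumulated spend per period, keyed by the matching field.
spend = {
    "budgetPerDay": "42.50",
    "budgetPerMonth": "1010",
    "budgetPerQuarter": "1500",
}
over = exceeded_budgets(spend)
# Per the description, exceeding any limit blocks creation of a new VM.
can_create_vm = not over
print(over, can_create_vm)
```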
cpuRequirements (items)
Properties
Property | Type | Constraints | Description |
---|---|---|---|
key | string | node.kubernetes.io/instance-type, kubernetes.io/arch, kubernetes.io/os, topology.kubernetes.io/region, topology.kubernetes.io/zone, karpenter.sh/capacity-type, tensor-fusion.ai/gpu-vendor, tensor-fusion.ai/gpu-instance-family, tensor-fusion.ai/gpu-instance-size | |
operator | string | In, Exists, DoesNotExist, Gt, Lt | A node selector operator is the set of operators that can be used in a node selector requirement. Default: In |
values | array | ||
cpuTaints (items)
Properties
Property | Type | Constraints | Description |
---|---|---|---|
effect | string | NoSchedule, NoExecute, PreferNoSchedule | Default: NoSchedule |
key | string | ||
value | string | ||
gpuRequirements (items)
Properties
Property | Type | Constraints | Description |
---|---|---|---|
key | string | node.kubernetes.io/instance-type, kubernetes.io/arch, kubernetes.io/os, topology.kubernetes.io/region, topology.kubernetes.io/zone, karpenter.sh/capacity-type, tensor-fusion.ai/gpu-vendor, tensor-fusion.ai/gpu-instance-family, tensor-fusion.ai/gpu-instance-size | |
operator | string | In, Exists, DoesNotExist, Gt, Lt | A node selector operator is the set of operators that can be used in a node selector requirement. Default: In |
values | array | ||
nodeSelector
A node selector represents the union of the results of one or more label queries
over a set of nodes; that is, it represents the OR of the selectors represented
by the node selector terms.
Properties
Property | Type | Constraints | Description |
---|---|---|---|
nodeSelectorTerms | array | | Required. A list of node selector terms. The terms are ORed. |
nodeSelectorTerms (items)
Required. A list of node selector terms. The terms are ORed.
Properties
Property | Type | Constraints | Description |
---|---|---|---|
matchExpressions | array | | A list of node selector requirements by node's labels. |
matchFields | array | | A list of node selector requirements by node's fields. |
matchExpressions (items)
A list of node selector requirements by node's labels.
Properties
Property | Type | Constraints | Description |
---|---|---|---|
key | string | | The label key that the selector applies to. |
operator | string | | Represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt. |
values | array | | An array of string values. If the operator is In or NotIn, the values array must be non-empty. If the operator is Exists or DoesNotExist, the values array must be empty. If the operator is Gt or Lt, the values array must have a single element, which will be interpreted as an integer. This array is replaced during a strategic merge patch. |
matchFields (items)
A list of node selector requirements by node's fields.
Properties
Property | Type | Constraints | Description |
---|---|---|---|
key | string | | The label key that the selector applies to. |
operator | string | | Represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt. |
values | array | | An array of string values. If the operator is In or NotIn, the values array must be non-empty. If the operator is Exists or DoesNotExist, the values array must be empty. If the operator is Gt or Lt, the values array must have a single element, which will be interpreted as an integer. This array is replaced during a strategic merge patch. |
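The OR-of-terms behavior and the six operators can be illustrated with a small evaluator. This is a sketch of the standard Kubernetes node selector semantics as described above, not the scheduler's code: it handles only `matchExpressions` (label-based requirements) and ignores `matchFields`, and the label `example.com/gpu-count` is a made-up key.

```python
def requirement_matches(labels, req):
    """Evaluate one matchExpressions requirement against node labels."""
    key, op = req["key"], req["operator"]
    values = req.get("values", [])
    present = key in labels
    if op == "In":
        return present and labels[key] in values
    if op == "NotIn":
        return not present or labels[key] not in values
    if op == "Exists":
        return present
    if op == "DoesNotExist":
        return not present
    if op in ("Gt", "Lt"):
        # The single value and the label are both interpreted as integers.
        if not present:
            return False
        try:
            actual, bound = int(labels[key]), int(values[0])
        except (ValueError, IndexError):
            return False
        return actual > bound if op == "Gt" else actual < bound
    raise ValueError(f"unknown operator: {op}")


def node_matches(labels, node_selector):
    """Terms are ORed; requirements inside one term are ANDed."""
    for term in node_selector["nodeSelectorTerms"]:
        exprs = term.get("matchExpressions", [])
        # A term without requirements matches nothing.
        if exprs and all(requirement_matches(labels, r) for r in exprs):
            return True
    return False


selector = {
    "nodeSelectorTerms": [
        {"matchExpressions": [
            {"key": "kubernetes.io/arch", "operator": "In", "values": ["arm64"]},
        ]},
        {"matchExpressions": [
            {"key": "kubernetes.io/arch", "operator": "In", "values": ["amd64"]},
            {"key": "example.com/gpu-count", "operator": "Gt", "values": ["4"]},
        ]},
    ]
}
node_labels = {"kubernetes.io/arch": "amd64", "example.com/gpu-count": "8"}
print(node_matches(node_labels, selector))
```

The sample node fails the first term but satisfies both requirements of the second, so the selector as a whole matches.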
qosConfig
Define the QoS levels and their prices.
Properties
Property | Type | Constraints | Description |
---|---|---|---|
defaultQoS | string | low, medium, high, critical | |
definitions | array | ||
pricing | array | ||
definitions (items)
Properties
Property | Type | Constraints | Description |
---|---|---|---|
description | string | ||
name | string | low, medium, high, critical | |
priority | integer | ||
pricing (items)
Properties
Property | Type | Constraints | Description |
---|---|---|---|
limitsOverRequests | string | | By default, requests and limitsOverRequests are the same, which indicates normal on-demand serverless GPU usage. In the hands-on lab low-QoS case, limitsOverRequests should be lower, so that users can get burstable GPU resources at very low cost. Default: 1 |
qos | string | low, medium, high, critical | |
requests | object | | The default pricing is based on second-level pricing from https://modal.com/pricing with Tensor/CUDA Core : HBM = 2:1. |
requests
The default pricing is based on second-level pricing from https://modal.com/pricing
with Tensor/CUDA Core : HBM = 2:1.
Properties
Property | Type | Constraints | Description |
---|---|---|---|
perFP16TFlopsPerHour | string | | Default: $0.0069228 |
perGBOfVRAMPerHour | string | | Default: $0.01548 |
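To see what the two default rates imply, the sketch below computes an hourly price for a request. It assumes a simple linear model (TFLOPs times the TFLOPs rate, plus GB of VRAM times the VRAM rate), which is inferred from the field names rather than confirmed by this page. `hourly_cost` and `parse_price` are hypothetical helpers, and `limitsOverRequests` is deliberately left out.

```python
from decimal import Decimal

# Defaults from the requests table; prices are strings with a "$" prefix.
DEFAULT_REQUESTS_PRICING = {
    "perFP16TFlopsPerHour": "$0.0069228",
    "perGBOfVRAMPerHour": "$0.01548",
}


def parse_price(text):
    """Convert a price string such as "$0.01548" to a Decimal."""
    return Decimal(text.lstrip("$"))


def hourly_cost(tflops, vram_gb, pricing=DEFAULT_REQUESTS_PRICING):
    """Assumed linear model: TFLOPs and VRAM are each billed per hour."""
    tflops_rate = parse_price(pricing["perFP16TFlopsPerHour"])
    vram_rate = parse_price(pricing["perGBOfVRAMPerHour"])
    return Decimal(str(tflops)) * tflops_rate + Decimal(str(vram_gb)) * vram_rate


# Hypothetical request: 100 FP16 TFLOPs and 40 GB of VRAM for one hour.
cost = hourly_cost(100, 40)
print(f"${cost:.5f} per hour")
```

Under this assumed model the request costs 100 x 0.0069228 + 40 x 0.01548 = 1.31148 dollars per hour.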
Status
TensorFusionClusterStatus defines the observed state of TensorFusionCluster.
Property | Type | Constraints | Description |
---|---|---|---|
allocatedTFlopsPercent | string | ||
allocatedVRAMPercent | string | ||
availableTFlops | any | pattern: Regex | |
availableVRAM | any | pattern: Regex | |
cloudVendorConfigHash | string | ||
conditions | array | ||
notReadyGPUPools | array | ||
phase | string | Pending, Running, Updating, Destroying, Unknown | TensorFusionClusterPhase represents the phase of the TensorFusionCluster resource. Default: Pending |
potentialSavingsPerMonth | string | ||
readyGPUPools | array | ||
retryCount | integer<int64> | | Default: 0 |
savedCostsPerMonth | string | ||
totalGPUs | integer<int32> | ||
totalNodes | integer<int32> | ||
totalPools | integer<int32> | ||
totalTFlops | any | pattern: Regex | |
totalVRAM | any | pattern: Regex | |
utilizedTFlopsPercent | string | ||
utilizedVRAMPercent | string | ||
virtualAvailableTFlops | any | pattern: Regex | |
virtualAvailableVRAM | any | pattern: Regex | |
virtualTFlops | any | pattern: Regex | |
virtualVRAM | any | pattern: Regex | |
conditions (items)
Properties
Property | Type | Constraints | Description |
---|---|---|---|
lastTransitionTime | string<date-time> | | lastTransitionTime is the last time the condition transitioned from one status to another. This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable. |
message | string | maxLength: 32768 | message is a human readable message indicating details about the transition. This may be an empty string. |
observedGeneration | integer<int64> | min: 0 | observedGeneration represents the .metadata.generation that the condition was set based upon. For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date with respect to the current state of the instance. |
reason | string | minLength: 1, maxLength: 1024, pattern: Regex | reason contains a programmatic identifier indicating the reason for the condition's last transition. Producers of specific condition types may define expected values and meanings for this field, and whether the values are considered a guaranteed API. The value should be a CamelCase string. This field may not be empty. |
status | string | True, False, Unknown | status of the condition, one of True, False, Unknown. |
type | string | maxLength: 316, pattern: Regex | type of condition in CamelCase or in foo.example.com/CamelCase. |
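The `observedGeneration` rule can be turned into a small staleness check. The sketch below operates on a resource already loaded as a dict (for example from `kubectl get -o json`); the condition types `Ready` and `Synced` are hypothetical names, and `stale_conditions` is an illustrative helper, not part of any client library.

```python
def stale_conditions(resource):
    """Return the types of conditions set against an older generation."""
    generation = resource["metadata"]["generation"]
    conditions = resource.get("status", {}).get("conditions", [])
    return [
        cond["type"]
        for cond in conditions
        # A condition without observedGeneration is not flagged here.
        if cond.get("observedGeneration", generation) < generation
    ]


# Mirrors the example in the table: generation 12, condition observed at 9.
cluster = {
    "metadata": {"name": "example-cluster", "generation": 12},
    "status": {
        "conditions": [
            {"type": "Ready", "status": "True", "observedGeneration": 9},
            {"type": "Synced", "status": "True", "observedGeneration": 12},
        ]
    },
}
print(stale_conditions(cluster))
```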