TensorFusion Metrics Reference

TensorFusion collects comprehensive metrics for monitoring GPU infrastructure, workloads, and system performance. All metrics are stored in GreptimeDB with time-series indexing.
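Because the metrics live in GreptimeDB, they can be read back with SQL. The following is a minimal sketch of a query builder, not an official client: it assumes each measurement name in this reference is also the table name, and that `ts` is the time index column. The helper `recent_rows_sql` is hypothetical, and the interval syntax has not been checked against a specific GreptimeDB version.

```python
from datetime import timedelta

# Measurement names listed in this reference. Assumption: each one maps
# one-to-one onto a GreptimeDB table of the same name.
MEASUREMENTS = {
    "tf_system_metrics",
    "tf_worker_resources",
    "tf_node_resources",
    "tf_worker_usage",
    "tf_gpu_usage",
}


def recent_rows_sql(measurement: str, window: timedelta, limit: int = 100) -> str:
    """Build a SQL statement selecting the newest rows within `window`.

    Hypothetical helper: `ts` is assumed to be the time index column.
    """
    if measurement not in MEASUREMENTS:
        raise ValueError(f"unknown measurement: {measurement}")
    seconds = int(window.total_seconds())
    if seconds <= 0 or limit <= 0:
        raise ValueError("window and limit must be positive")
    return (
        f"SELECT * FROM {measurement} "
        f"WHERE ts >= now() - INTERVAL '{seconds} seconds' "
        f"ORDER BY ts DESC LIMIT {limit}"
    )


# Example: the last five minutes of per-GPU hardware metrics.
print(recent_rows_sql("tf_gpu_usage", timedelta(minutes=5)))
```

Restricting the table name to a fixed set keeps the string formatting safe, since the measurement name cannot be passed as a bound parameter.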

System Metrics

Measurement: tf_system_metrics

Cluster-wide system statistics and operational counters.

Tags

| Tag | Description |
| --- | --- |
| pool | GPU pool identifier |

Fields

| Field | Type | Description |
| --- | --- | --- |
| total_workers_cnt | int64 | Total active workers |
| total_nodes_cnt | int64 | Total nodes in cluster |
| total_allocation_fail_cnt | int64 | Cumulative allocation failures |
| total_allocation_success_cnt | int64 | Cumulative successful allocations |
| total_scale_up_cnt | int64 | Cumulative scale-up events |
| total_scale_down_cnt | int64 | Cumulative scale-down events |
| ts | timestamp | Record timestamp |
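Since the allocation counters are cumulative, a rate for a given period comes from the difference between two samples rather than from a single record. A minimal sketch, assuming the two samples come from the same pool and that a counter moving backwards means a reset (the source does not say how resets behave); `allocation_success_rate` is a hypothetical helper:

```python
from typing import Mapping, Optional


def allocation_success_rate(
    prev: Mapping[str, int], curr: Mapping[str, int]
) -> Optional[float]:
    """Fraction of allocation attempts that succeeded between two samples.

    Returns None when no attempts happened in the interval, or when a
    counter went backwards (treated here as a reset, which is an assumption).
    """
    d_ok = curr["total_allocation_success_cnt"] - prev["total_allocation_success_cnt"]
    d_fail = curr["total_allocation_fail_cnt"] - prev["total_allocation_fail_cnt"]
    if d_ok < 0 or d_fail < 0:
        return None
    attempts = d_ok + d_fail
    if attempts == 0:
        return None
    return d_ok / attempts


# Illustrative samples, not real data.
earlier = {"total_allocation_success_cnt": 90, "total_allocation_fail_cnt": 10}
later = {"total_allocation_success_cnt": 180, "total_allocation_fail_cnt": 20}
print(allocation_success_rate(earlier, later))
```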

Worker Resource Metrics

Measurement: tf_worker_resources

Resource allocation and usage per worker.

Tags

| Tag | Description |
| --- | --- |
| worker | Worker identifier |
| workload | Associated workload |
| pool | GPU pool identifier |
| namespace | Kubernetes namespace |
| qos | Quality of Service class |

Fields

| Field | Type | Description |
| --- | --- | --- |
| tflops_request | float64 | Requested TFLOPS |
| tflops_limit | float64 | TFLOPS limit |
| vram_bytes_request | float64 | Requested VRAM in bytes |
| vram_bytes_limit | float64 | VRAM limit in bytes |
| gpu_count | int | Number of GPUs allocated |
| raw_cost | float64 | Raw compute cost |
| ready | bool | Worker readiness status |
| ts | timestamp | Record timestamp |
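The request and limit fields can be compared to see how much a worker may grow beyond what it asked for. The sketch below assumes Kubernetes-style semantics, where the request is the baseline and the limit is the ceiling; this reference does not spell that out. The `WorkerResources` class is hypothetical and carries only the fields it needs.

```python
from dataclasses import dataclass

GIB = 2**30  # bytes per GiB


@dataclass(frozen=True)
class WorkerResources:
    """Subset of a tf_worker_resources record (hypothetical container)."""

    worker: str
    tflops_request: float
    tflops_limit: float
    vram_bytes_request: float
    vram_bytes_limit: float

    def tflops_headroom(self) -> float:
        # Gap between limit and request, clamped at zero.
        return max(self.tflops_limit - self.tflops_request, 0.0)

    def vram_headroom_gib(self) -> float:
        # Same gap for VRAM, converted from bytes to GiB.
        return max(self.vram_bytes_limit - self.vram_bytes_request, 0.0) / GIB


# Illustrative values, not real data.
w = WorkerResources(
    worker="worker-a",
    tflops_request=10.0,
    tflops_limit=20.0,
    vram_bytes_request=8.0 * GIB,
    vram_bytes_limit=16.0 * GIB,
)
print(w.tflops_headroom(), w.vram_headroom_gib())
```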

Node Resource Metrics

Measurement: tf_node_resources

Resource allocation and utilization per node.

Tags

| Tag | Description |
| --- | --- |
| node | Node identifier |
| pool | GPU pool identifier |
| phase | Node phase/status |

Fields

| Field | Type | Description |
| --- | --- | --- |
| allocated_tflops | float64 | Allocated TFLOPS |
| allocated_tflops_percent | float64 | TFLOPS utilization percentage |
| allocated_vram_bytes | float64 | Allocated VRAM in bytes |
| allocated_vram_percent | float64 | VRAM utilization percentage |
| allocated_tflops_percent_virtual | float64 | TFLOPS vs virtual capacity |
| allocated_vram_percent_virtual | float64 | VRAM vs virtual capacity |
| raw_cost | float64 | Node compute cost |
| gpu_count | int | Number of GPUs on node |
| ts | timestamp | Record timestamp |

Worker Usage Metrics

Measurement: tf_worker_usage

Real-time worker resource usage reported by the hypervisor.

Tags

| Tag | Description |
| --- | --- |
| workload | Associated workload |
| worker_name | Worker identifier |
| namespace | Kubernetes namespace |
| pool_name | GPU pool identifier |
| node_name | Host node name |
| uuid | GPU UUID |

Fields

| Field | Type | Description |
| --- | --- | --- |
| compute_percentage | float64 | GPU compute utilization |
| compute_tflops | float64 | Actual TFLOPS usage |
| memory_percentage | float64 | VRAM utilization percentage |
| memory_bytes | uint64 | VRAM usage in bytes |
| ts | timestamp | Record timestamp |
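Because every worker usage record carries the `uuid` of the GPU it ran on, records can be grouped by that tag to see how much of a shared GPU its workers consume together. A minimal sketch, assuming the rows were all sampled at the same timestamp and arrive as plain dictionaries; `usage_per_gpu` is a hypothetical helper and the sample values are made up:

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def usage_per_gpu(rows: Iterable[Mapping[str, object]]) -> Dict[str, dict]:
    """Sum compute_tflops and memory_bytes of worker rows, keyed by GPU uuid."""
    totals: Dict[str, dict] = defaultdict(
        lambda: {"compute_tflops": 0.0, "memory_bytes": 0}
    )
    for row in rows:
        gpu = totals[str(row["uuid"])]
        gpu["compute_tflops"] += float(row["compute_tflops"])
        gpu["memory_bytes"] += int(row["memory_bytes"])
    return dict(totals)


# Illustrative rows: two workers share GPU-a, one runs alone on GPU-b.
sample = [
    {"worker_name": "w1", "uuid": "GPU-a", "compute_tflops": 3.5, "memory_bytes": 2 * 2**30},
    {"worker_name": "w2", "uuid": "GPU-a", "compute_tflops": 1.5, "memory_bytes": 1 * 2**30},
    {"worker_name": "w3", "uuid": "GPU-b", "compute_tflops": 7.0, "memory_bytes": 4 * 2**30},
]
for uuid, total in usage_per_gpu(sample).items():
    print(uuid, total)
```

The per-GPU totals can then be compared with the matching `tf_gpu_usage` record for the same `uuid`.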

GPU Usage Metrics

Measurement: tf_gpu_usage

Detailed GPU hardware metrics reported by the hypervisor.

Tags

| Tag | Description |
| --- | --- |
| node | Host node name |
| pool | GPU pool identifier |
| uuid | GPU UUID |

Fields

| Field | Type | Description |
| --- | --- | --- |
| compute_percentage | float64 | GPU compute utilization |
| memory_percentage | float64 | VRAM utilization percentage |
| memory_bytes | uint64 | VRAM usage in bytes |
| compute_tflops | float64 | Actual TFLOPS usage |
| rx | float64 | PCIe receive throughput (KB/s) |
| tx | float64 | PCIe transmit throughput (KB/s) |
| temperature | float64 | GPU temperature (°C) |
| graphics_clock_mhz | float64 | Graphics clock frequency (MHz) |
| sm_clock_mhz | float64 | SM clock frequency (MHz) |
| memory_clock_mhz | float64 | Memory clock frequency (MHz) |
| video_clock_mhz | float64 | Video clock frequency (MHz) |
| power_usage | float64 | Power consumption (W) |
| nvlink_rx | float64 | NVLink receive throughput |
| nvlink_tx | float64 | NVLink transmit throughput |
| ts | timestamp | Record timestamp |