
TensorFusion System Metrics

TensorFusion exposes a comprehensive set of metrics covering the performance, utilization, and health of your GPU infrastructure. This reference documents every metric available for monitoring and troubleshooting a TensorFusion deployment.

GPU Metrics

These metrics provide insights into the performance and utilization of individual GPUs inside a TensorFusion cluster.

Computing & VRAM

| Metric Name | Description | Fields | Use Case |
| --- | --- | --- | --- |
| tf_gpu_metrics_avg_compute_percentage | Average GPU compute utilization percentage over the collection interval | uuid | Monitor GPU computational load and identify potential bottlenecks |
| tf_gpu_metrics_avg_memory_bytes | Average GPU memory (VRAM) usage in bytes | uuid | Track memory consumption patterns and detect memory leaks or inefficient usage |

Common Fields:

  • uuid: Unique identifier of the GPU
  • greptime_timestamp: Time when the metric was collected
  • greptime_value: The metric value (percentage for compute, bytes for memory)
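
The greptime_timestamp and greptime_value fields follow the schema GreptimeDB uses when it stores metrics as tables, so these metrics can typically be queried with plain SQL. The sketch below is a minimal example, assuming a GreptimeDB-compatible HTTP SQL endpoint reachable at localhost:4000 and the metrics written to the default public database (both deployment-specific assumptions):

```python
import requests

# Assumption: GreptimeDB's HTTP SQL endpoint is reachable at this address and
# the TensorFusion metrics are written to the default "public" database.
GREPTIME_SQL_URL = "http://localhost:4000/v1/sql"

# Average compute utilization per GPU over the last 5 minutes.
query = """
SELECT uuid, avg(greptime_value) AS avg_compute_pct
FROM tf_gpu_metrics_avg_compute_percentage
WHERE greptime_timestamp > now() - INTERVAL '5 minutes'
GROUP BY uuid
"""

resp = requests.post(GREPTIME_SQL_URL, params={"db": "public"}, data={"sql": query})
resp.raise_for_status()
print(resp.json())
```

The same pattern applies to tf_gpu_metrics_avg_memory_bytes and the other metric tables in this reference.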

Network

| Metric Name | Description | Fields | Use Case |
| --- | --- | --- | --- |
| tf_gpu_metrics_avg_rx | Average GPU receive network throughput in bytes per second | uuid | Monitor data transfer to the GPU for distributed workloads |
| tf_gpu_metrics_avg_tx | Average GPU transmit network throughput in bytes per second | uuid | Monitor data transfer from the GPU for distributed workloads |

Common Fields:

  • uuid: Unique identifier of the GPU
  • greptime_timestamp: Time when the metric was collected
  • greptime_value: Network throughput in bytes/second
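
Since both metrics are reported in bytes per second, it is often convenient to combine them into a single link-utilization figure. A small self-contained sketch (the sample values below are made up for illustration):

```python
def total_throughput_gbit(rx_bytes_per_sec: float, tx_bytes_per_sec: float) -> float:
    """Combine receive and transmit throughput and convert bytes/s to Gbit/s."""
    total_bytes = rx_bytes_per_sec + tx_bytes_per_sec
    return total_bytes * 8 / 1e9  # 8 bits per byte, 1e9 bits per gigabit

# Example: hypothetical samples of tf_gpu_metrics_avg_rx / tf_gpu_metrics_avg_tx
rx, tx = 1.2e9, 0.4e9  # bytes per second
print(f"Total GPU network throughput: {total_throughput_gbit(rx, tx):.2f} Gbit/s")
```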

Temperature

| Metric Name | Description | Fields | Use Case |
| --- | --- | --- | --- |
| tf_gpu_metrics_avg_temperature | Average GPU temperature in degrees Celsius | uuid | Monitor thermal conditions to prevent overheating and ensure optimal performance |

Common Fields:

  • uuid: Unique identifier of the GPU
  • greptime_timestamp: Time when the metric was collected
  • greptime_value: Temperature in degrees Celsius
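
A common use of this metric is simple threshold alerting. The sketch below uses hypothetical warning and critical thresholds; appropriate values depend on your GPU model and cooling setup:

```python
# Hypothetical thresholds; adjust for your hardware.
WARN_CELSIUS = 80.0
CRIT_CELSIUS = 90.0

def classify_temperature(uuid: str, celsius: float) -> str:
    """Map a tf_gpu_metrics_avg_temperature sample to an alert severity."""
    if celsius >= CRIT_CELSIUS:
        return f"CRITICAL: GPU {uuid} at {celsius:.1f} °C"
    if celsius >= WARN_CELSIUS:
        return f"WARNING: GPU {uuid} at {celsius:.1f} °C"
    return f"OK: GPU {uuid} at {celsius:.1f} °C"

print(classify_temperature("GPU-1a2b3c", 84.5))
```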

GPU Scheduler Metrics

These metrics provide insights into GPU resource allocation and scheduling decisions.

| Metric Name | Description | Fields | Use Case |
| --- | --- | --- | --- |
| tf_gpu_tflops_limit | Maximum TFLOPS (trillion floating point operations per second) capacity of the GPU | namespace, pool, worker | Understand the theoretical computational limits of available GPUs |
| tf_gpu_tflops_request | Requested TFLOPS for the GPU by workloads | namespace, pool, worker | Track computational resource requests and allocation efficiency |
| tf_vram_bytes_limit | Maximum VRAM capacity in bytes available on the GPU | namespace, pool, worker | Understand memory constraints for workload planning |
| tf_vram_bytes_request | Requested VRAM in bytes by workloads | namespace, pool, worker | Track memory resource requests and allocation efficiency |

Common Fields:

  • namespace: Kubernetes namespace
  • pool: GPU pool identifier
  • worker: Worker node identifier
  • greptime_timestamp: Time when the metric was collected
  • greptime_value: The metric value (TFLOPS or bytes depending on the metric)
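
A useful derived value is the allocation ratio: the requested capacity divided by the corresponding limit for the same namespace, pool, and worker. A minimal sketch with made-up sample values:

```python
def allocation_ratio(request: float, limit: float) -> float:
    """Fraction of capacity requested, e.g. tf_gpu_tflops_request / tf_gpu_tflops_limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return request / limit

# Hypothetical samples for one (namespace, pool, worker) combination.
tflops_ratio = allocation_ratio(request=60.0, limit=82.6)             # TFLOPS
vram_ratio = allocation_ratio(request=12 * 2**30, limit=24 * 2**30)   # bytes
print(f"TFLOPS allocated: {tflops_ratio:.0%}, VRAM allocated: {vram_ratio:.0%}")
```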

GPU Worker Metrics

These metrics provide insights into performance and utilization at the level of each TensorFusion GPU worker.

| Metric Name | Description | Fields | Use Case |
| --- | --- | --- | --- |
| tf_worker_metrics_avg_compute_percentage | Average worker compute utilization percentage across all GPUs on the node | worker, uuid | Monitor overall worker node computational load and balance |
| tf_worker_metrics_avg_memory_bytes | Average worker memory usage in bytes across all GPUs on the node | worker, uuid | Track overall worker node memory consumption patterns |

Common Fields:

  • worker: Worker node identifier
  • uuid: Unique identifier of the GPU
  • greptime_timestamp: Time when the metric was collected
  • greptime_value: The metric value (percentage for compute, bytes for memory)
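
To spot load imbalance across workers, readings can be grouped by the worker field and averaged over the GPUs (uuid values) each worker serves. A minimal sketch with made-up sample values:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (worker, uuid, greptime_value) readings of
# tf_worker_metrics_avg_compute_percentage.
samples = [
    ("worker-a", "GPU-111", 72.0),
    ("worker-a", "GPU-222", 55.0),
    ("worker-b", "GPU-333", 18.0),
]

per_worker = defaultdict(list)
for worker, _uuid, value in samples:
    per_worker[worker].append(value)

# Average compute utilization per worker, useful for spotting load imbalance.
for worker, values in sorted(per_worker.items()):
    print(f"{worker}: {mean(values):.1f}% average compute across {len(values)} GPU(s)")
```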