Best Practices

Best practices for resource configuration, performance, and stability using TensorFusion annotations.

This guide focuses on common workload-annotation patterns to balance performance, cost, and stability. For the full annotation list, see Workload Configuration. For terminology, see Terminology.

Basic Usage

1. Use requests + limits to precisely control compute/VRAM

Set both tflops and vram requests/limits to avoid jitter or over-commit:

metadata:
  annotations:
    tensor-fusion.ai/tflops-request: "10"
    tensor-fusion.ai/tflops-limit: "20"
    tensor-fusion.ai/vram-request: "4Gi"
    tensor-fusion.ai/vram-limit: "4Gi"

If you need percentage-based compute, use compute-percent-*, but do not mix it with tflops-*.
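As a sketch, percentage-based compute could look like the following. Only the `compute-percent-*` prefix appears in this guide; the exact `-request`/`-limit` suffixes are assumed here by analogy with the `tflops-*` annotations:

```yaml
metadata:
  annotations:
    # Assumed names following the compute-percent-* pattern from this guide:
    tensor-fusion.ai/compute-percent-request: "30"
    tensor-fusion.ai/compute-percent-limit: "50"
    # Do not combine compute-percent-* with tensor-fusion.ai/tflops-* annotations.
```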

You can also set GPU requests/limits in Pod resources. The system reads nvidia.com/gpu (or other vendor resource names) from limits and converts it to compute-percent (default 100%). If tflops-* is not set, requests inherit from limits. This path may not yield a concrete TFLOPs value unless the GPU model is known:

spec:
  containers:
    - name: trainer
      resources:
        requests:
          nvidia.com/gpu: "1"
        limits:
          nvidia.com/gpu: "1"

2. Prefer local GPU mode to reduce latency

For latency-sensitive workloads, enable local GPU. Sidecar Worker further reduces communication overhead:

metadata:
  annotations:
    tensor-fusion.ai/is-local-gpu: "true"
    tensor-fusion.ai/sidecar-worker: "true"

3. Reuse configuration via WorkloadProfile

When multiple workloads share a template, use WorkloadProfile and override only the differences:

metadata:
  annotations:
    tensor-fusion.ai/workload-profile: "default-profile"
    tensor-fusion.ai/tflops-request: "12" # override profile default

4. Explicitly declare multi-GPU usage

For multi-GPU tasks set gpu-count:

metadata:
  annotations:
    tensor-fusion.ai/gpu-count: "2"

5. Multi-container Pods should specify per-container GPU requirements

If multiple containers in a Pod need GPUs, specify the GPU count per container to avoid scheduling ambiguity:

metadata:
  annotations:
    tensor-fusion.ai/container-gpu-count: '{"trainer":1,"sidecar":1}'

If per-container GPU counts are not provided, containers share the same GPUs by default; in multi-GPU scenarios this may be misinterpreted as a single-GPU workload.
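Putting sections 4 and 5 together, a two-container Pod that needs two GPUs total, one per container, might look like this (the container names and images are illustrative):

```yaml
metadata:
  annotations:
    # Total GPUs for the Pod, plus an explicit per-container split
    tensor-fusion.ai/gpu-count: "2"
    tensor-fusion.ai/container-gpu-count: '{"trainer":1,"sidecar":1}'
spec:
  containers:
    - name: trainer   # receives 1 GPU
      image: registry.example.com/trainer:latest   # illustrative image
    - name: sidecar   # receives 1 GPU
      image: registry.example.com/sidecar:latest   # illustrative image
```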

Advanced Usage

6. Pick the right QoS for your workload priority

Use higher QoS for critical inference and lower QoS for batch/offline jobs:

metadata:
  annotations:
    tensor-fusion.ai/qos: "high"
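Conversely, a batch or offline job would declare a lower tier. Only `"high"` appears in this guide; `"low"` is an assumed tier name, so check Config QoS and Billing for the actual values:

```yaml
metadata:
  annotations:
    # "low" is an assumed tier name; see Config QoS and Billing for valid values
    tensor-fusion.ai/qos: "low"
```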

7. Pin the GPU model or use a dedicated GPU for stable performance

If your model is sensitive to GPU type, or you need stable performance:

metadata:
  annotations:
    tensor-fusion.ai/gpu-model: "A100"
    tensor-fusion.ai/dedicated-gpu: "true"

8. Automation tip: enable Autoscale, then tune via WorkloadProfile

Use annotations to turn on autoscaling quickly, then move fine-grained settings into a WorkloadProfile:

metadata:
  annotations:
    tensor-fusion.ai/autoscale: "true"
    tensor-fusion.ai/autoscale-target: "all"
    tensor-fusion.ai/workload-profile: "autoscale-default"

The autoscale-target annotation accepts:

  • compute: only auto-adjust compute (TFLOPs/compute-percent)
  • vram: only auto-adjust VRAM
  • all: auto-adjust both compute and VRAM

Example WorkloadProfile (auto-set resources from usage history):

apiVersion: tensor-fusion.ai/v1
kind: WorkloadProfile
metadata:
  name: autoscale-default
spec:
  autoScalingConfig:
    autoSetResources:
      enable: true
      targetResource: all # compute | vram | all
      historyDataPeriod: 2h
      marginFraction: "0.15"

9. Gradually enable TensorFusion via canary

For large clusters, use enabled-replicas for a gradual rollout:

metadata:
  annotations:
    tensor-fusion.ai/enabled-replicas: "1"
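As a sketch, in a Deployment with four replicas, enabled-replicas: "1" routes only one replica through TensorFusion at first; the Deployment fields below are standard Kubernetes, and only the annotation comes from this guide:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-canary   # illustrative name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
      annotations:
        # Only 1 of the 4 replicas is enabled for TensorFusion initially;
        # raise this value step by step as the canary proves stable
        tensor-fusion.ai/enabled-replicas: "1"
    spec:
      containers:
        - name: server
          image: registry.example.com/inference:latest   # illustrative image
```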

10. Choose isolation mode by risk level

  • soft: default, suitable for most shared training/inference
  • hard: for multi-tenant or higher-risk scenarios
  • partitioned: when hardware partitioning is required

metadata:
  annotations:
    tensor-fusion.ai/isolation: "hard"

Common model / chip sizing reference

Common model sizing reference

Use this table to estimate typical TFLOPs and VRAM requirements per model and workload type; adjust GPU count based on GPU SKU and cluster constraints.

| Model / Task | Scenario (Train/FT/Infer) | Precision | Target TFLOPs | VRAM Requirement (Approx.) | Notes |
| --- | --- | --- | --- | --- | --- |
| LLaMA 7B | Train (full) | FP16 | 200–400 | ~50–60 GB | Small-scale pretraining |
| LLaMA 7B | Inference | BF16/INT8 | 1–3 | ~12–14 GB | Low-latency inference |
| GPT-2 1.5B | Train (full) | FP16 | 60–120 | ~10–20 GB | Small model training |
| DeepSeek-7B | Fine-tune | FP16 | 60–100 | ~14–18 GB | LoRA / instruction tuning |
| DeepSeek-7B | Inference | BF16/INT8 | 1–3 | ~14–18 GB | Single-GPU online serving |
| DeepSeek-33B | Fine-tune | FP16 | 180–300 | ~60–80 GB | Enterprise fine-tuning |
| DeepSeek-33B | Inference | BF16/INT8 | 6–12 | ~60–80 GB | High-quality dialogue |
| DeepSeek-67B | Fine-tune | FP16 | 350–600 | ~120–140 GB | Private large model |
| DeepSeek-67B | Inference | BF16/INT8 | 12–25 | ~120–140 GB | High concurrency, multi-GPU |
| Kimi-Base (~30B) | Inference | BF16 | 20–40 | ~60–70 GB | Long-context driven |
| Kimi-Base | Inference | BF16 | 80–150 | ~80–150 GB | KV cache heavy |
| Kimi-Base | Fine-tune | FP16 | 250–400 | ~250–400 GB | Long-text training |
| Kimi-MoE (est.) | Inference | BF16 | 15–30 | ~60–70 GB | Sparse MoE activation |
| Qwen-7B | Fine-tune | FP16 | 60–100 | ~24–40 GB | / |
| Qwen-14B | Fine-tune | FP16 | 120–200 | ~48–60 GB | / |
| Baichuan-13B | Inference | BF16 | 4–8 | ~24–26 GB | / |

GPU peak TFLOPs reference

Use this table to align model sizing with chip capability.

Data from multiple vendors:

| GPU Model | Vendor | FP16/BF16 Peak TFLOPs |
| --- | --- | --- |
| A100 | NVIDIA | 312 |
| H100 | NVIDIA | 800 |
| H200 | NVIDIA | 989 |
| MI250X | AMD | 383 |
| MI300X | AMD | 1300+ |
| Ascend 910 | Huawei | 320 |
| Ascend 910B | Huawei | 400+ |
| Ascend 310P | Huawei | 16 |
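As a worked example combining the two tables: LLaMA 7B low-latency inference needs roughly 1–3 TFLOPs and ~12–14 GB VRAM, a small slice of a single A100 (312 peak TFLOPs). A hedged annotation sketch, with limits padded above the sizing range:

```yaml
metadata:
  annotations:
    # Sized from the tables above: 1–3 TFLOPs, ~12–14 GB VRAM for LLaMA 7B inference
    tensor-fusion.ai/tflops-request: "3"
    tensor-fusion.ai/tflops-limit: "6"      # headroom above the sizing range
    tensor-fusion.ai/vram-request: "14Gi"
    tensor-fusion.ai/vram-limit: "16Gi"
    tensor-fusion.ai/gpu-model: "A100"      # optional pin, from section 7
```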
