Best Practices
Best practices for resource configuration, performance, and stability using TensorFusion annotations.
This guide focuses on common workload-annotation patterns to balance performance, cost, and stability. For the full annotation list, see Workload Configuration. For terminology, see Terminology.
Basic Usage
1. Use requests + limits to precisely control compute/VRAM
Set both tflops and vram requests/limits to avoid jitter or over-commit:
```yaml
metadata:
  annotations:
    tensor-fusion.ai/tflops-request: "10"
    tensor-fusion.ai/tflops-limit: "20"
    tensor-fusion.ai/vram-request: "4Gi"
    tensor-fusion.ai/vram-limit: "4Gi"
```

If you need percentage-based compute, use `compute-percent-*` annotations instead, but do not mix them with `tflops-*`.
You can also set GPU requests/limits in Pod resources. The system reads nvidia.com/gpu (or other vendor resource names) from limits and converts it to compute-percent (default 100%). If tflops-* is not set, requests inherit from limits. This path may not yield a concrete TFLOPs value unless the GPU model is known:
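As an illustration, this conversion path can be sketched as follows (a simplified sketch, not the actual controller code; the function name and returned annotation shape are hypothetical):

```python
def to_compute_percent(limits: dict, resource_name: str = "nvidia.com/gpu") -> dict:
    """Sketch: derive compute-percent annotations from a vendor GPU limit.

    Hypothetical helper; the real controller logic may differ.
    """
    gpu_limit = int(limits.get(resource_name, "0"))
    if gpu_limit == 0:
        return {}
    # Each whole vendor GPU maps to 100% compute by default.
    limit_percent = gpu_limit * 100
    # With no explicit tflops-* set, the request inherits from the limit.
    return {
        "tensor-fusion.ai/compute-percent-request": str(limit_percent),
        "tensor-fusion.ai/compute-percent-limit": str(limit_percent),
    }
```

For example, a container with `nvidia.com/gpu: "1"` in limits would map to 100% compute for both request and limit.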
```yaml
spec:
  containers:
    - name: trainer
      resources:
        requests:
          nvidia.com/gpu: "1"
        limits:
          nvidia.com/gpu: "1"
```

2. Prefer local GPU mode to reduce latency
For latency-sensitive workloads, enable local GPU. Sidecar Worker further reduces communication overhead:
```yaml
metadata:
  annotations:
    tensor-fusion.ai/is-local-gpu: "true"
    tensor-fusion.ai/sidecar-worker: "true"
```

3. Reuse configuration via WorkloadProfile
When multiple workloads share a template, use WorkloadProfile and override only the differences:
```yaml
metadata:
  annotations:
    tensor-fusion.ai/workload-profile: "default-profile"
    tensor-fusion.ai/tflops-request: "12" # override profile default
```

4. Explicitly declare multi-GPU usage
For multi-GPU tasks, set `gpu-count`:

```yaml
metadata:
  annotations:
    tensor-fusion.ai/gpu-count: "2"
```

5. Multi-container Pods should specify per-container GPU requirements
If multiple containers in a Pod need GPUs, specify the GPU count per container to avoid scheduling ambiguity:
```yaml
metadata:
  annotations:
    tensor-fusion.ai/container-gpu-count: '{"trainer":1,"sidecar":1}'
```

If per-container GPU counts are not provided, containers share the same GPUs by default; in multi-GPU scenarios this may be misinterpreted as a single-GPU workload.
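For reference, the annotation value is plain JSON; a minimal sketch of parsing and validating it (the helper name is hypothetical):

```python
import json

def parse_container_gpu_count(annotation: str) -> dict:
    """Parse the container-gpu-count annotation into {container: gpu_count}."""
    counts = json.loads(annotation)
    for name, n in counts.items():
        # Reject non-integer or non-positive counts early.
        if not isinstance(n, int) or n < 1:
            raise ValueError(f"invalid GPU count for container {name!r}: {n}")
    return counts

counts = parse_container_gpu_count('{"trainer":1,"sidecar":1}')
total_gpus = sum(counts.values())  # the Pod needs 2 GPUs in total
```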
Advanced Usage
6. Pick the right QoS for your workload priority
Use higher QoS for critical inference and lower QoS for batch/offline jobs:
```yaml
metadata:
  annotations:
    tensor-fusion.ai/qos: "high"
```

7. Pin model or use dedicated GPU for stable performance
If your model is sensitive to GPU type or you need stable performance:
```yaml
metadata:
  annotations:
    tensor-fusion.ai/gpu-model: "A100"
    tensor-fusion.ai/dedicated-gpu: "true"
```

8. Automation tip: enable Autoscale, then tune via WorkloadProfile
Use annotations to turn on autoscaling quickly, then move fine-grained settings into a WorkloadProfile:
```yaml
metadata:
  annotations:
    tensor-fusion.ai/autoscale: "true"
    tensor-fusion.ai/autoscale-target: "all"
    tensor-fusion.ai/workload-profile: "autoscale-default"
```

Values for `autoscale-target`:
- compute: only auto-adjust compute (TFLOPs/compute-percent)
- vram: only auto-adjust VRAM
- all: auto-adjust both compute + VRAM
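The exact autoscaling algorithm is not described here, but the intent of a margin fraction can be sketched as adding headroom above observed usage (illustrative only; the function and its inputs are hypothetical):

```python
def recommend(history_peaks: list[float], margin_fraction: float) -> float:
    """Sketch: recommended value = observed peak plus a safety margin."""
    peak = max(history_peaks)
    return peak * (1.0 + margin_fraction)

# TFLOPs peaks observed over the history window, with a 0.15 margin:
recommended_tflops = recommend([8.0, 9.5, 10.0], 0.15)  # 10.0 * 1.15 = 11.5
```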
Example WorkloadProfile (auto-set resources from usage history):
```yaml
apiVersion: tensor-fusion.ai/v1
kind: WorkloadProfile
metadata:
  name: autoscale-default
spec:
  autoScalingConfig:
    autoSetResources:
      enable: true
      targetResource: all # compute | vram | all
      historyDataPeriod: 2h
      marginFraction: "0.15"
```

9. Gradually enable TensorFusion via canary
For large clusters, use enabled-replicas for a gradual rollout:
```yaml
metadata:
  annotations:
    tensor-fusion.ai/enabled-replicas: "1"
```

10. Choose isolation mode by risk level
- soft: default, suitable for most shared training/inference
- hard: for multi-tenant or higher-risk scenarios
- partitioned: when hardware partitioning is required
```yaml
metadata:
  annotations:
    tensor-fusion.ai/isolation: "hard"
```

Common model / chip sizing reference
Common model sizing reference
The table below lists typical TFLOPs and VRAM requirements per model and workload type; adjust GPU count based on GPU SKU and cluster constraints.
| Model / Task | Scenario (Train/FT/Infer) | Precision | Target TFLOPs | VRAM Requirement (Approx.) | Notes |
|---|---|---|---|---|---|
| LLaMA 7B | Train (full) | FP16 | 200–400 TFLOPs | ~50–60 GB | Small-scale pretraining |
| LLaMA 7B | Inference | BF16/INT8 | 1–3 TFLOPs | ~12–14 GB | Low-latency inference |
| GPT-2 1.5B | Train (full) | FP16 | 60–120 TFLOPs | ~10–20 GB | Small model training |
| DeepSeek-7B | Fine-tune | FP16 | 60–100 TFLOPs | ~14–18 GB | LoRA / instruction tuning |
| DeepSeek-7B | Inference | BF16 / INT8 | 1–3 TFLOPs | ~14–18 GB | Single-GPU online serving |
| DeepSeek-33B | Fine-tune | FP16 | 180–300 TFLOPs | ~60–80 GB | Enterprise fine-tuning |
| DeepSeek-33B | Inference | BF16 / INT8 | 6–12 TFLOPs | ~60–80 GB | High-quality dialogue |
| DeepSeek-67B | Fine-tune | FP16 | 350–600 TFLOPs | ~120–140 GB | Private large model |
| DeepSeek-67B | Inference | BF16 / INT8 | 12–25 TFLOPs | ~120–140 GB | High concurrency, multi-GPU |
| Kimi-Base (~30B) | Inference | BF16 | 20–40 TFLOPs | ~60–70 GB | Long-context driven |
| Kimi-Base | Inference | BF16 | 80–150 TFLOPs | ~80–150 GB | KV cache heavy |
| Kimi-Base | Fine-tune | FP16 | 250–400 TFLOPs | ~250–400 GB | Long-text training |
| Kimi-MoE (est.) | Inference | BF16 | 15–30 TFLOPs | ~60–70 GB | Sparse MoE activation |
| Qwen-7B | Fine-tune | FP16 | 60–100 TFLOPs | ~24–40 GB | / |
| Qwen-14B | Fine-tune | FP16 | 120–200 TFLOPs | ~48–60 GB | / |
| Baichuan-13B | Inference | BF16 | 4–8 TFLOPs | ~24–26 GB | / |
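As a worked example of applying the table (the requirement numbers come from the rows above; the helper is illustrative, and the 80 GB A100 variant is an assumption):

```python
import math

def gpus_needed(target_tflops: float, vram_gb: float,
                per_gpu_tflops: float, per_gpu_vram_gb: float) -> int:
    """GPU count that satisfies both the compute and the VRAM target."""
    by_compute = math.ceil(target_tflops / per_gpu_tflops)
    by_vram = math.ceil(vram_gb / per_gpu_vram_gb)
    return max(by_compute, by_vram)

# DeepSeek-67B fine-tune (upper bounds: 600 TFLOPs, 140 GB)
# on A100 (312 peak TFLOPs; 80 GB VRAM assumed):
n = gpus_needed(600, 140, 312, 80)  # max(ceil(600/312), ceil(140/80)) = 2
```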
GPU peak TFLOPs reference
Use this table to align model sizing with chip capability.
Data from multiple vendors:
| GPU Model | Vendor | FP16/BF16 Peak TFLOPs |
|---|---|---|
| A100 | NVIDIA | 312 |
| H100 | NVIDIA | 800 |
| H200 | NVIDIA | 989 |
| MI250X | AMD | 383 |
| MI300X | AMD | 1300+ |
| Ascend 910 | Huawei | 320 |
| Ascend 910B | Huawei | 400+ |
| Ascend 310P | Huawei | 16 |
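One way to combine the two tables: express a `tflops-request` as a percentage of a single GPU's peak, which helps sanity-check `compute-percent-*` values (the lookup dictionary and function are illustrative):

```python
# Peak FP16/BF16 TFLOPs per GPU, taken from the table above.
PEAK_TFLOPS = {"A100": 312, "MI250X": 383}

def tflops_to_percent(tflops_request: float, gpu_model: str) -> float:
    """Express a TFLOPs request as a percentage of one GPU's peak."""
    return round(100.0 * tflops_request / PEAK_TFLOPS[gpu_model], 2)

pct = tflops_to_percent(10, "A100")  # 10 of 312 TFLOPs, about 3.21%
```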