
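VM/Bare-Metal Deployment

The TensorFusion GPU resource pool runs on Kubernetes. You only need to pick one or more servers to host the Kubernetes control plane and join your GPU servers to the cluster as nodes. This deployment does not affect your existing VM/bare-metal environment or non-containerized services.

Once the cluster is up, you can migrate existing services to the local or remote GPU workers created by TensorFusion.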

Prerequisites

  • At least one VM or bare-metal server running Linux with a GPU installed
  • Access to Docker Hub
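
The prerequisites above can be checked before any installation starts. The following is a minimal sketch, not part of TensorFusion: the function names `have_cmd` and `preflight` are hypothetical, and using `nvidia-smi -L` as the GPU probe assumes the NVIDIA driver is already installed on the host.

```shell
# Preflight sketch for a prospective GPU node. Function names are illustrative.

# Return 0 if the given command is available on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Print one line per prerequisite; return non-zero if any check fails.
preflight() {
  local failed=0

  if [ "$(uname -s)" = "Linux" ]; then
    echo "ok: running Linux"
  else
    echo "FAIL: not running Linux"
    failed=1
  fi

  # Assumption: nvidia-smi (shipped with the NVIDIA driver) is the GPU probe.
  if have_cmd nvidia-smi && nvidia-smi -L >/dev/null 2>&1; then
    echo "ok: NVIDIA GPU visible to the driver"
  else
    echo "FAIL: nvidia-smi did not report a GPU"
    failed=1
  fi

  # Any HTTP response from the registry endpoint (even 401) counts as reachable,
  # because curl without -f only fails on connection-level errors.
  if have_cmd curl && curl -s -o /dev/null --max-time 10 https://registry-1.docker.io/v2/; then
    echo "ok: Docker Hub registry reachable"
  else
    echo "FAIL: cannot reach the Docker Hub registry"
    failed=1
  fi

  return "$failed"
}
```

After sourcing the snippet, run `preflight` on each server you plan to join; a non-zero exit status means at least one prerequisite is missing.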

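Step 1: Install the K3S control node

Choose a VM or bare-metal server on which to install K3S, which provides a simple Kubernetes environment. You can also bootstrap Kubernetes some other way.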

bash
curl -sfL https://get.k3s.io | sh -s - server --tls-san $(curl -s https://ifconfig.me)

Then retrieve the token, which is needed later to add the GPU nodes.

bash
cat /var/lib/rancher/k3s/server/node-token

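Step 2: Configure the GPU nodes

Because TensorFusion runs in a containerized environment, you need to configure the NVIDIA Container Toolkit on each GPU node before installing the K3S agent.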

bash
# Run this only once on each GPU server (the heredoc below appends, so re-running it duplicates the config)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update
apt-get install -y nvidia-container-toolkit

mkdir -p /var/lib/rancher/k3s/agent/etc/containerd/
cat << EOF >> /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/k3s/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  device_ownership_from_security_context = false
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/var/lib/rancher/k3s/data/cni"
  conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
  BinaryName = "/usr/bin/nvidia-container-runtime"

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/var/lib/rancher/k3s/agent/etc/containerd/certs.d"
EOF
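
Because the heredoc above appends to the template, it can help to confirm afterwards that the runtime entry is present exactly once. The following is a sketch under that assumption; `check_nvidia_runtime` is a hypothetical helper name, and the default path is the one used in the step above.

```shell
# Return 0 if the containerd template routes runc through nvidia-container-runtime
# exactly once. Usage: check_nvidia_runtime [path-to-config.toml.tmpl]
check_nvidia_runtime() {
  local tmpl count
  tmpl="${1:-/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl}"

  if [ ! -f "$tmpl" ]; then
    echo "FAIL: missing $tmpl" >&2
    return 1
  fi

  # grep -c prints the number of matching lines (0 when there are none).
  count=$(grep -c 'BinaryName = "/usr/bin/nvidia-container-runtime"' "$tmpl" || true)

  case "$count" in
    1) echo "ok: nvidia-container-runtime configured in $tmpl" ;;
    0) echo "FAIL: BinaryName entry not found in $tmpl" >&2; return 1 ;;
    *) echo "FAIL: BinaryName appears $count times in $tmpl (appended twice?)" >&2; return 1 ;;
  esac
}
```

Calling `check_nvidia_runtime` with no argument inspects the default template path; pass a different path to check a copy of the file.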

Step 3: Add the GPU servers as K3S nodes

bash
# Replace MASTER_IP and K3S_TOKEN, then run on each GPU server
export MASTER_IP=<master-private-ip-from-step-1-vm>
export K3S_TOKEN=<k3s-token-from-step-1-cat-command-result>

curl -sfL https://get.k3s.io | K3S_URL=https://$MASTER_IP:6443 K3S_TOKEN=$K3S_TOKEN INSTALL_K3S_EXEC="--node-label nvidia.com/gpu.present=true --node-label feature.node.kubernetes.io/cpu-model.vendor_id=NVIDIA --node-label feature.node.kubernetes.io/pci-10de.present=true" sh -s -
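
The install command above depends on both variables being filled in. As a small safeguard, a guard like the following can be run first; it is only a sketch, and `require_var` is a hypothetical helper rather than anything provided by K3S or TensorFusion. It rejects values that are empty or that still contain angle-bracket placeholder text (for instance, if a placeholder was pasted inside quotes).

```shell
# Fail early if a required variable is empty or still holds a <placeholder>.
# Usage: require_var VARIABLE_NAME
require_var() {
  local name="$1" value
  # Indirect lookup of the variable whose name was passed in.
  eval "value=\${$name:-}"
  case "$value" in
    "")
      echo "error: $name is not set" >&2
      return 1
      ;;
    *"<"*|*">"*)
      echo "error: $name still contains placeholder text" >&2
      return 1
      ;;
  esac
  return 0
}
```

For example, running `require_var MASTER_IP && require_var K3S_TOKEN` before the `curl` line stops the join attempt when either value is missing.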

Step 4: Verify that the GPU nodes were added

bash
# Run on the control-plane VM/bare-metal server (over SSH)
kubectl get nodes --show-labels | grep nvidia.com/gpu.present=true

Expected output:

bash
gpu-node-name   Ready   <none>   42h   v1.32.1 beta.kubernetes.io/arch=amd64,...,kubernetes.io/os=linux,nvidia.com/gpu.present=true
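
When several GPU servers are joined, counting the nodes that report `Ready` can be quicker than reading the listing by eye. The sketch below assumes the column layout shown in the expected output (node name first, status second); `count_ready_nodes` is a hypothetical helper name.

```shell
# Count nodes whose STATUS column is exactly "Ready".
# Reads `kubectl get nodes` output from stdin; a header line, if present,
# is ignored because its second column is "STATUS".
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}
```

A possible invocation is `kubectl get nodes -l nvidia.com/gpu.present=true --no-headers | count_ready_nodes`. Nodes with a compound status such as `Ready,SchedulingDisabled` are deliberately not counted.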

Step 5: Install TensorFusion

You can install TensorFusion by following the Kubernetes deployment guide.

After installation, you can use TensorFusion in the newly created Kubernetes cluster.

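Optional: Connect to a TensorFusion vGPU from a VM/bare-metal host

If your workload runs on a VM or bare-metal host, you can allocate resources in the TensorFusion cluster and connect to the vGPU from that host.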

bash
# TODO: Linux or Windows, local or remote vGPU
# Download TensorFusion Libs, Add LD_PRELOAD / LD_LIBRARY_PATH env var