MitseaBlog/content/post/d29bb28b14984443b232263348b946ba/index.zh-cn.md
2024-07-31 16:52:27 +08:00

195 lines
7.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

+++
author = "FlintyLemming"
title = "Proxmox VE 8.1 vGPU 配置 A6000"
slug = "d29bb28b14984443b232263348b946ba"
date = "2023-12-13"
description = "新到的大玩具"
categories = ["Consumer", "Linux"]
tags = ["pve", "Nvidia"]
image = "https://img.mitsea.com/blog/posts/2023/12/Proxmox%20VE%208.1%20vGPU%20%E9%85%8D%E7%BD%AE%20%EF%BC%88A6000%EF%BC%89/jigar-panchal-TVyPnkS5k5w-unsplash.avif"
+++
## 操作环境
Dell R750xa 配置如下
![](https://img.mitsea.com/blog/posts/2023/12/Proxmox%20VE%208.1%20vGPU%20%E9%85%8D%E7%BD%AE%20%EF%BC%88A6000%EF%BC%89/Untitled.avif)
## 设备配置
确保开启虚拟化和 SR-IOV
![](https://img.mitsea.com/blog/posts/2023/12/Proxmox%20VE%208.1%20vGPU%20%E9%85%8D%E7%BD%AE%20%EF%BC%88A6000%EF%BC%89/Untitled%201.avif)
## Proxmox VM host 环境配置
### 配置软件源
1. 删除企业源和 Ceph 源
```bash
rm /etc/apt/sources.list.d/pve-enterprise.list
rm /etc/apt/sources.list.d/ceph.list
```
2. 修改软件源为国内源
```bash
nano /etc/apt/sources.list
# 内容修改为如下内容
deb https://mirrors.aliyun.com/debian/ bookworm main contrib non-free
deb-src https://mirrors.aliyun.com/debian/ bookworm main contrib non-free
deb https://mirrors.aliyun.com/debian/ bookworm-updates main contrib non-free
deb-src https://mirrors.aliyun.com/debian/ bookworm-updates main contrib non-free
deb https://mirrors.aliyun.com/debian/ bookworm-backports main contrib non-free
deb-src https://mirrors.aliyun.com/debian/ bookworm-backports main contrib non-free
deb https://mirrors.ustc.edu.cn/debian-security/ stable-security main contrib non-free
deb-src https://mirrors.ustc.edu.cn/debian-security/ stable-security main contrib non-free
```
### 其他系统配置
1. 开启 iommu
```bash
nano /etc/default/grub
# 找到
GRUB_CMDLINE_LINUX_DEFAULT="quiet"
# 改为:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
# 更新 grub
update-grub
```
2. 加载 vfio 模块
```bash
echo vfio >> /etc/modules
echo vfio_iommu_type1 >> /etc/modules
echo vfio_pci >> /etc/modules
echo vfio_virqfd >> /etc/modules
```
3. 屏蔽现有开源驱动,然后重启
```bash
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nvidia" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nvidiafb" >> /etc/modprobe.d/blacklist.conf
# 更新内核参数
update-initramfs -k all -u
```
### 修改显卡模式
1. 如果 GPU 带显示接口,需要修改显卡模式。使用下面的命令检查,如果结果中显示为 VGA compatible controller 就需要修改。
```bash
lspci | grep NVIDIA
# 执行结果
17:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
17:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
65:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
65:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
ca:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
ca:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
e3:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
e3:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
```
2. 下载 NVIDIA Display Mode Selector Utility可以从[这里](https://index.mitsea.com/%E8%BD%AF%E4%BB%B6/%E5%BA%94%E7%94%A8%E7%A8%8B%E5%BA%8F/Display_Mode-1.61.0.zip)下但是不保证链接有效性
3. 检查当前显卡,获得序号
```bash
chmod +x displaymodeselector
./displaymodeselector --list
```
4. 修改显卡模式
```bash
./displaymodeselector --gpumode physical_display_disabled -i 0
./displaymodeselector --gpumode physical_display_disabled -i 1
./displaymodeselector --gpumode physical_display_disabled -i 2
./displaymodeselector --gpumode physical_display_disabled -i 3
```
![](https://img.mitsea.com/blog/posts/2023/12/Proxmox%20VE%208.1%20vGPU%20%E9%85%8D%E7%BD%AE%20%EF%BC%88A6000%EF%BC%89/Untitled%202.avif)
5. 重启服务器,重启后应该显示为 3D Controller
```bash
17:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
65:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
ca:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
e3:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
```
### 安装驱动
1. 安装 NVIDIA Driver 安装时需要的依赖
```bash
apt update
apt install build-essential dkms mdevctl pve-headers-$(uname -r)
```
2. 安装驱动,下载的驱动包有好几个驱动,安装 host 驱动。驱动可以从[这里](https://index.mitsea.com/%E8%BD%AF%E4%BB%B6/%E9%A9%B1%E5%8A%A8%E5%92%8C%E5%85%B6%E4%BB%96%E9%95%9C%E5%83%8F/NVIDIA-GRID-Linux-KVM-535.104.06-535.104.05-537.13.zip)下,但是不保证链接有效性。把驱动传到服务器上后,设置执行权限后运行。
```bash
chmod +x NVIDIA-Linux-x86_64-535.104.06-vgpu-kvm.run
./NVIDIA-Linux-x86_64-535.104.06-vgpu-kvm.run --dkms
```
3. 执行 `nvidia-smi` 后无误即可
![](https://img.mitsea.com/blog/posts/2023/12/Proxmox%20VE%208.1%20vGPU%20%E9%85%8D%E7%BD%AE%20%EF%BC%88A6000%EF%BC%89/Untitled%203.avif)
## 搭建 vGPU 授权服务器
[Oscar Krause / FastAPI-DLS · GitLab](https://git.collinwebdesigns.de/oscar.krause/fastapi-dls)
按照仓库 Readme 搭建就行了,主要就是强制 https本地的话需要生成一个自签名证书。法外狂徒挂公网可以无视nginx 证书配好就行。对于挂在公网上有几个注意点:
1. docker 命令中的 `DLS_URL=`hostname -i`` 填你反代时要使用的域名例如`DLS_URL=`xxx.xxx.com``
2. `DLS_PORT=443` 不要动,只改 port 映射出去的端口,比如改成 `-p 4433:443` 这样反代那边就反代容器 IP:4433
## 虚拟机添加设备
开机后需要启用 SR-IOV 设备,每次开机都要执行,可以写成一个服务开机自动执行一次
```jsx
/usr/lib/nvidia/sriov-manage -e ALL
```
Raw Device 选择一个不是 .0 的设备后MDev Type 就可以选 vGPU Profile 了。如果想要用整张显卡,也不要通 .0 的设备,据说会容易导致 pve 爆炸失联,建议还是选择一个用完所有显存的 Profile。
![](https://img.mitsea.com/blog/posts/2023/12/Proxmox%20VE%208.1%20vGPU%20%E9%85%8D%E7%BD%AE%20%EF%BC%88A6000%EF%BC%89/Untitled%204.avif)
![](https://img.mitsea.com/blog/posts/2023/12/Proxmox%20VE%208.1%20vGPU%20%E9%85%8D%E7%BD%AE%20%EF%BC%88A6000%EF%BC%89/Untitled%205.avif)
## 激活 vGPU 授权
参考激活服务器 Readme 中 Setup Client 一节
[Oscar Krause / FastAPI-DLS · GitLab](https://git.collinwebdesigns.de/oscar.krause/fastapi-dls#setup-client)
### Windows
1. 进入 Windows 后先安装之前那个驱动包里的 host 驱动
2. 从 https://<你的dls服务器>/-/client-token 上下载配置文件,然后放到 C:\Program Files\NVIDIA Corporation\vGPU Licensing\ClientConfigToken 下
3. 重启电脑,然后就能看到正在获取许可证并激活成功
![](https://img.mitsea.com/blog/posts/2023/12/Proxmox%20VE%208.1%20vGPU%20%E9%85%8D%E7%BD%AE%20%EF%BC%88A6000%EF%BC%89/CleanShot_2023-12-13_at_22.17.132x.avif)
### Linux
执行下面的命令
```bash
curl --insecure -L -X GET https://<dls-hostname-or-ip>/-/client-token -o /etc/nvidia/ClientConfigToken/client_configuration_token_$(date '+%d-%m-%Y-%H-%M-%S').tok
service nvidia-gridd restart
```
> Photo by [Jigar Panchal](https://unsplash.com/@brave4_heart?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash) on [Unsplash](https://unsplash.com/photos/a-very-colorful-abstract-background-with-a-lot-of-blocks-TVyPnkS5k5w?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash)