+++
author = "FlintyLemming"
title = "Configuring vGPU on Proxmox VE 8.1 with an A6000"
slug = "d29bb28b14984443b232263348b946ba"
date = "2023-12-13"
description = ""
categories = ["Consumer", "Linux"]
tags = ["pve", "Nvidia"]
image = "https://img.mitsea.com/blog/posts/2023/12/Proxmox%20VE%208.1%20vGPU%20%E9%85%8D%E7%BD%AE%20%EF%BC%88A6000%EF%BC%89/jigar-panchal-TVyPnkS5k5w-unsplash.jpg?x-oss-process=style/ImageCompress"
+++
## Environment
The host is a Dell R750xa with the following configuration:
![](https://img.mitsea.com/blog/posts/2023/12/Proxmox%20VE%208.1%20vGPU%20%E9%85%8D%E7%BD%AE%20%EF%BC%88A6000%EF%BC%89/Untitled.png?x-oss-process=style/ImageCompress)
## Hardware configuration
Make sure virtualization and SR-IOV are enabled in the server's BIOS.
![](https://img.mitsea.com/blog/posts/2023/12/Proxmox%20VE%208.1%20vGPU%20%E9%85%8D%E7%BD%AE%20%EF%BC%88A6000%EF%BC%89/Untitled%201.png?x-oss-process=style/ImageCompress)
## Proxmox VE host configuration
### Configure package sources
1. Remove the enterprise and Ceph repositories
```bash
rm /etc/apt/sources.list.d/pve-enterprise.list
rm /etc/apt/sources.list.d/ceph.list
```
2. Switch the package sources to a mirror in mainland China
```bash
nano /etc/apt/sources.list
# Replace the contents with the following
deb https://mirrors.aliyun.com/debian/ bookworm main contrib non-free
deb-src https://mirrors.aliyun.com/debian/ bookworm main contrib non-free
deb https://mirrors.aliyun.com/debian/ bookworm-updates main contrib non-free
deb-src https://mirrors.aliyun.com/debian/ bookworm-updates main contrib non-free
deb https://mirrors.aliyun.com/debian/ bookworm-backports main contrib non-free
deb-src https://mirrors.aliyun.com/debian/ bookworm-backports main contrib non-free
deb https://mirrors.ustc.edu.cn/debian-security/ bookworm-security main contrib non-free
deb-src https://mirrors.ustc.edu.cn/debian-security/ bookworm-security main contrib non-free
```
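Removing the enterprise repository leaves the host without any Proxmox package source, and the steps above do not show a replacement. A minimal sketch, assuming you want the public no-subscription repository for PVE 8 on Debian bookworm; the helper name and the file name `pve-no-subscription.list` are my own choices, not part of the original setup:

```bash
# Print the apt source line for the Proxmox VE no-subscription repository.
# $1: Debian codename (assumed to be "bookworm" for PVE 8.x).
pve_no_subscription_line() {
    local codename="${1:-bookworm}"
    printf 'deb http://download.proxmox.com/debian/pve %s pve-no-subscription\n' "$codename"
}

# Usage (as root on the host):
#   pve_no_subscription_line bookworm > /etc/apt/sources.list.d/pve-no-subscription.list
#   apt update
```

Writing the line to its own file under `/etc/apt/sources.list.d/` keeps it separate from the Debian mirrors configured in `/etc/apt/sources.list`.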
### Other system settings
1. Enable the IOMMU
```bash
nano /etc/default/grub
# Find this line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet"
# Change it to:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
# Regenerate the GRUB configuration
update-grub
```
2. Load the vfio modules
```bash
echo vfio >> /etc/modules
echo vfio_iommu_type1 >> /etc/modules
echo vfio_pci >> /etc/modules
# vfio_virqfd was merged into the vfio module in Linux 6.2, so the PVE 8.1 kernel does not need a separate entry for it
```
3. Blacklist the existing open-source drivers, then reboot
```bash
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nvidia" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nvidiafb" >> /etc/modprobe.d/blacklist.conf
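# Rebuild the initramfs for all installed kernels so the blacklist applies at boot
update-initramfs -k all -u
# Reboot so the GRUB, module, and blacklist changes take effect
reboot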
```
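After the reboot it is worth confirming that the settings above actually took effect. The following is a small sketch of two checks, not part of the original walkthrough: one lists the IOMMU groups from sysfs, the other tests whether a kernel module is loaded. Both function names are hypothetical, and the optional path arguments exist only so the helpers can be pointed at a different directory or file.

```bash
# List every IOMMU group and the PCI devices in it.
# No output means the kernel did not enable the IOMMU.
# $1: sysfs directory to scan (defaults to the kernel's real location).
list_iommu_groups() {
    local root="${1:-/sys/kernel/iommu_groups}"
    local dev group
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue      # glob did not match anything
        group="${dev%/devices/*}"      # strip "/devices/<address>"
        printf 'group %s: %s\n' "${group##*/}" "${dev##*/}"
    done
}

# Succeed if the named kernel module is currently loaded.
# $1: module name, $2: module table to read (defaults to /proc/modules).
module_loaded() {
    local name="$1" table="${2:-/proc/modules}"
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$table"
}

# Usage after the reboot:
#   list_iommu_groups
#   module_loaded nouveau && echo "nouveau is still loaded"
#   module_loaded vfio_pci || echo "vfio_pci is not loaded"
```

If `list_iommu_groups` prints nothing, recheck the BIOS settings and the `GRUB_CMDLINE_LINUX_DEFAULT` line before continuing.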
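### Change the GPU display mode
1. If the GPU has display outputs, its display mode must be changed. Check with the command below; any GPU listed as `VGA compatible controller` needs the change.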
```bash
lspci | grep NVIDIA
# Output
17:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
17:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
65:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
65:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
ca:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
ca:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
e3:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
e3:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
```
2. Download the NVIDIA Display Mode Selector Utility. It is available [here](https://index.mitsea.com/%E8%BD%AF%E4%BB%B6/%E5%BA%94%E7%94%A8%E7%A8%8B%E5%BA%8F/Display_Mode-1.61.0.zip), but the link is not guaranteed to stay valid.
3. List the installed GPUs to get their indexes
```bash
chmod +x displaymodeselector
./displaymodeselector --list
```
4. Change the display mode of each GPU
```bash
./displaymodeselector --gpumode physical_display_disabled -i 0
./displaymodeselector --gpumode physical_display_disabled -i 1
./displaymodeselector --gpumode physical_display_disabled -i 2
./displaymodeselector --gpumode physical_display_disabled -i 3
```
![](https://img.mitsea.com/blog/posts/2023/12/Proxmox%20VE%208.1%20vGPU%20%E9%85%8D%E7%BD%AE%20%EF%BC%88A6000%EF%BC%89/Untitled%202.png?x-oss-process=style/ImageCompress)
5. Reboot the server. After the reboot, `lspci | grep NVIDIA` should list the GPUs as `3D controller`.
```bash
17:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
65:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
ca:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
e3:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
```
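With several cards in one chassis, the check from step 1 and the confirmation from step 5 can be scripted instead of read by eye. A minimal sketch, assuming the `lspci` line format shown above; the function name is hypothetical:

```bash
# Read `lspci` output on stdin and print the PCI address of every NVIDIA
# function that still reports itself as a VGA compatible controller.
# Empty output means no GPU needs the mode switch (or all were switched).
gpus_in_display_mode() {
    awk '/VGA compatible controller: NVIDIA/ { print $1 }'
}

# Usage:
#   lspci | gpus_in_display_mode
```

Before the mode change this should print one address per card; after the reboot in step 5 it should print nothing.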
### Install the driver
1. Install the dependencies the NVIDIA driver installer needs
```bash
apt update
apt install build-essential dkms mdevctl pve-headers-$(uname -r)
```
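2. Install the driver. The downloaded package contains several drivers; install the host driver. The package is available [here](https://index.mitsea.com/%E8%BD%AF%E4%BB%B6/%E9%A9%B1%E5%8A%A8%E5%92%8C%E5%85%B6%E4%BB%96%E9%95%9C%E5%83%8F/NVIDIA-GRID-Linux-KVM-535.104.06-535.104.05-537.13.zip), but the link is not guaranteed to stay valid. Upload the host driver to the server, make it executable, and run it.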
```bash
chmod +x NVIDIA-Linux-x86_64-535.104.06-vgpu-kvm.run
./NVIDIA-Linux-x86_64-535.104.06-vgpu-kvm.run --dkms
```
3. Run `nvidia-smi`; if it lists the GPUs without errors, the driver is installed correctly.
![](https://img.mitsea.com/blog/posts/2023/12/Proxmox%20VE%208.1%20vGPU%20%E9%85%8D%E7%BD%AE%20%EF%BC%88A6000%EF%BC%89/Untitled%203.png?x-oss-process=style/ImageCompress)
## Set up a vGPU license server
[Oscar Krause / FastAPI-DLS · GitLab](https://git.collinwebdesigns.de/oscar.krause/fastapi-dls)
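Follow the repository README. The main catch is that HTTPS is mandatory, so a local deployment needs a self-signed certificate. If you are bold enough to expose the server to the public internet behind nginx, you can skip that step as long as nginx has a properly configured certificate. A few points to note for a public deployment:
1. In the docker command, replace ``DLS_URL=`hostname -i` `` with the domain your reverse proxy will serve, for example `DLS_URL=xxx.xxx.com`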
2. Leave `DLS_PORT=443` unchanged and change only the published port, for example `-p 4433:443`; the reverse proxy then forwards to port 4433 on the Docker host
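
For the local case, the self-signed certificate can be generated with `openssl`. This is only a sketch: the function name is hypothetical, and the file names `webserver.crt` and `webserver.key` are an assumption about what the FastAPI-DLS container expects, so confirm them against the README of the version you deploy.

```bash
# Create a self-signed TLS certificate and private key for a local DLS instance.
# $1: output directory, $2: hostname or IP address the clients will connect to.
# ASSUMPTION: webserver.crt / webserver.key are the names the container reads.
make_dls_webserver_cert() {
    local dir="$1" cn="$2"
    mkdir -p "$dir"
    openssl req -x509 -nodes -days 3650 -newkey rsa:2048 \
        -subj "/CN=$cn" \
        -keyout "$dir/webserver.key" \
        -out "$dir/webserver.crt" 2>/dev/null
}

# Usage (directory and address are placeholders):
#   make_dls_webserver_cert /opt/fastapi-dls/cert 192.0.2.10
```

Because the certificate is self-signed, clients have to skip verification when fetching the token, which is why the Linux client command later in this post uses `curl --insecure`.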
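## Add the device to a VM
After the host boots, the SR-IOV virtual functions must be enabled. This has to be repeated on every boot, so it is worth wrapping in a service that runs once at startup.
```bash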
/usr/lib/nvidia/sriov-manage -e ALL
```
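One way to turn that command into a boot-time service is a oneshot systemd unit. The sketch below writes such a unit; the unit name `nvidia-sriov.service`, its description, and the ordering against `nvidia-vgpud.service`, `nvidia-vgpu-mgr.service`, and `pve-guests.service` are my assumptions rather than something the original setup specifies, so adjust them to your host.

```bash
# Write a oneshot systemd unit that enables the NVIDIA SR-IOV virtual
# functions at boot.
# $1: destination path (the default file name is a hypothetical choice).
write_sriov_unit() {
    local unit="${1:-/etc/systemd/system/nvidia-sriov.service}"
    cat > "$unit" <<'EOF'
[Unit]
Description=Enable NVIDIA SR-IOV virtual functions
After=nvidia-vgpud.service nvidia-vgpu-mgr.service
Before=pve-guests.service

[Service]
Type=oneshot
ExecStart=/usr/lib/nvidia/sriov-manage -e ALL

[Install]
WantedBy=multi-user.target
EOF
}

# Usage (as root on the host):
#   write_sriov_unit
#   systemctl daemon-reload
#   systemctl enable --now nvidia-sriov.service
```

Ordering the unit before `pve-guests.service` is meant to ensure the virtual functions exist before Proxmox starts any VM that is configured to start at boot.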
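For Raw Device, select a function other than `.0`; MDev Type then offers the available vGPU profiles. Even if you want to give a VM the whole card, do not pass through the `.0` function: reportedly that tends to make PVE hang and become unreachable. Pick a profile that uses all of the video memory instead.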
![](https://img.mitsea.com/blog/posts/2023/12/Proxmox%20VE%208.1%20vGPU%20%E9%85%8D%E7%BD%AE%20%EF%BC%88A6000%EF%BC%89/Untitled%204.png?x-oss-process=style/ImageCompress)
![](https://img.mitsea.com/blog/posts/2023/12/Proxmox%20VE%208.1%20vGPU%20%E9%85%8D%E7%BD%AE%20%EF%BC%88A6000%EF%BC%89/Untitled%205.png?x-oss-process=style/ImageCompress)
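
The same selection can be made from the host shell. The sketch below reads the profiles a virtual function offers from the kernel's mediated-device sysfs layout; the function name is hypothetical, and the PCI address, VM ID, and profile ID in the usage comment are placeholders, not values from this setup.

```bash
# Print the vGPU profiles (mdev types) that a virtual function offers,
# one per line: <type id> <TAB> <profile name> <TAB> available=<instances>.
# $1: PCI device directory in sysfs.
list_vgpu_profiles() {
    local dev="$1" type
    for type in "$dev"/mdev_supported_types/*; do
        [ -r "$type/name" ] || continue   # no mdev types exposed here
        printf '%s\t%s\tavailable=%s\n' \
            "${type##*/}" \
            "$(cat "$type/name")" \
            "$(cat "$type/available_instances" 2>/dev/null)"
    done
}

# Usage on the host (address, VM ID 100, and nvidia-NNN are placeholders):
#   list_vgpu_profiles /sys/bus/pci/devices/0000:17:00.4
#   qm set 100 -hostpci0 0000:17:00.4,mdev=nvidia-NNN
```

If the function prints nothing for a virtual function, check that `sriov-manage -e ALL` has been run since the last boot.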
## Activate the vGPU license
See the Setup Client section of the license server README:
[Oscar Krause / FastAPI-DLS · GitLab](https://git.collinwebdesigns.de/oscar.krause/fastapi-dls#setup-client)
### Windows
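1. In the Windows VM, first install the Windows guest driver from the driver package used earlier (not the Linux host driver)
2. Download the client token from `https://<your DLS server>/-/client-token` and place it in `C:\Program Files\NVIDIA Corporation\vGPU Licensing\ClientConfigToken`
3. Reboot the VM; the driver should then request a license and report that activation succeeded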
![CleanShot 2023-12-13 at 22.17.13@2x.png](Proxmox%20VE%208%201%20vGPU%20%E9%85%8D%E7%BD%AE%20%EF%BC%88A6000%EF%BC%89%20d29bb28b14984443b232263348b946ba/CleanShot_2023-12-13_at_22.17.132x.png?x-oss-process=style/ImageCompress)
### Linux
Run the following commands in the guest:
```bash
curl --insecure -L -X GET https://<dls-hostname-or-ip>/-/client-token -o /etc/nvidia/ClientConfigToken/client_configuration_token_$(date '+%d-%m-%Y-%H-%M-%S').tok
service nvidia-gridd restart
```
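To confirm that the guest actually obtained a license, the status can be pulled out of `nvidia-smi -q`. This is a sketch that assumes the report contains a `License Status : ...` line; the function name is hypothetical and the exact wording of the value depends on the driver version.

```bash
# Read `nvidia-smi -q` output on stdin and print the value of every
# "License Status" field. Prints nothing if the field is absent.
license_status() {
    sed -n 's/^[[:space:]]*License Status[[:space:]]*:[[:space:]]*//p'
}

# Usage in the guest:
#   nvidia-smi -q | license_status
```

An empty result right after restarting `nvidia-gridd` may simply mean the license request has not completed yet, so retry after a short wait before troubleshooting.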
> Photo by [Jigar Panchal](https://unsplash.com/@brave4_heart?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash) on [Unsplash](https://unsplash.com/photos/a-very-colorful-abstract-background-with-a-lot-of-blocks-TVyPnkS5k5w?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash)