GPU Passthrough #4

Open
opened 2025-12-22 17:02:53 +00:00 by alvis · 0 comments
Owner

Unsuccessful attempt (error -22 + “loading vfio” grub stuck)

dmesg is the command that prints the kernel ring buffer: kernel messages about boot, driver loading, PCI devices, etc.

Enabling IOMMU (direct PCI passthrough):

  1. Enable in BIOS
  2. Check if enabled in the host kernel (should see DMAR: IOMMU enabled)
sudo dmesg | grep -e DMAR -e IOMMU
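A second way to confirm that the IOMMU is really active is to look at the groups the kernel exposes under /sys/kernel/iommu_groups. This is a minimal sketch, not something from the tutorial: the function name and the optional directory argument are my own additions (the argument only exists so the function can be tried on a test directory).

```bash
# List every IOMMU group and the PCI devices in it.
# An empty or missing groups directory means the IOMMU is not active.
# The optional first argument (illustrative) overrides the sysfs location.
list_iommu_groups() {
    local root="${1:-/sys/kernel/iommu_groups}"
    local group dev found=0
    for group in "$root"/*; do
        [ -d "$group/devices" ] || continue
        for dev in "$group"/devices/*; do
            [ -e "$dev" ] || continue
            echo "group ${group##*/}: ${dev##*/}"
            found=1
        done
    done
    if [ "$found" -eq 0 ]; then
        echo "no IOMMU groups found under $root" >&2
        return 1
    fi
}
```

Calling `list_iommu_groups` with no argument reads the real sysfs tree. If it reports no groups, the grep above will most likely be empty as well.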

If there is no output, enable it on the kernel command line:

sudo nano /etc/default/grub

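Add intel_iommu=on to GRUB_CMDLINE_LINUX_DEFAULT, keeping the options that are already there, for example GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=on"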

sudo update-grub
sudo reboot
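Editing the line by hand makes it easy to overwrite options that were added earlier. A small helper can append a parameter without touching the rest. This is only a sketch: `add_kernel_param` is a made-up name, and it assumes the value sits on one line in double quotes, as in Ubuntu's default /etc/default/grub.

```bash
# add_kernel_param FILE PARAM
# Append PARAM to GRUB_CMDLINE_LINUX_DEFAULT in FILE unless it is already there.
# Assumes a single-line, double-quoted value (illustrative helper).
add_kernel_param() {
    local file="$1" param="$2" current new
    # Extract the current value between the double quotes (empty if absent).
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file")
    case " $current " in
        *" $param "*) return 0 ;;   # already present, nothing to do
    esac
    new="${current:+$current }$param"
    if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file"; then
        # Rewrite the existing line and leave every other line untouched.
        sed "s|^GRUB_CMDLINE_LINUX_DEFAULT=.*|GRUB_CMDLINE_LINUX_DEFAULT=\"$new\"|" \
            "$file" > "$file.new" && mv "$file.new" "$file"
    else
        echo "GRUB_CMDLINE_LINUX_DEFAULT=\"$new\"" >> "$file"
    fi
}
```

From a root shell this would be `add_kernel_param /etc/default/grub intel_iommu=on`, followed by `update-grub` and a reboot. After the reboot, `cat /proc/cmdline` shows whether the parameter actually reached the kernel.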

Now getting this error:

Error: Failed to start device "nvidia-gpu": Failed to override IOMMU group driver: Device took too long to activate at "/sys/bus/pci/drivers/vfio-pci/0000:01:00.0"

Following the Perplexity tutorial, the current driver needs to be unbound and prevented from loading again (verify the vendor and device IDs used in the echo against the lspci output first).

lspci -n -s 0000:01:00.0

echo "10de 1b81" | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id

echo -n "0000:01:00.0" | sudo tee /sys/bus/pci/devices/0000:01:00.0/driver/unbind

echo -n "0000:01:00.0" | sudo tee /sys/bus/pci/drivers/vfio-pci/bind
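An alternative to new_id that might be worth trying is the per-device driver_override file in sysfs, which only affects the one device instead of every card with the same ID. The sketch below is not from the tutorial and I have not confirmed that it gets past the error here. It assumes vfio-pci is already loaded (`sudo modprobe vfio-pci`); the function name and the optional sysfs root argument are illustrative.

```bash
# bind_to_vfio ADDR [SYSFS_ROOT]
# Rebind one PCI device to vfio-pci using driver_override.
# Must run as root on a real system. SYSFS_ROOT defaults to /sys; the
# parameter is an illustrative addition that allows a dry run on a fake tree.
bind_to_vfio() {
    local addr="$1" sysfs="${2:-/sys}"
    local dev="$sysfs/bus/pci/devices/$addr"
    if [ ! -d "$dev" ]; then
        echo "no such PCI device: $addr" >&2
        return 1
    fi
    # Detach the current driver only if one is actually bound.
    if [ -e "$dev/driver" ]; then
        echo "$addr" > "$dev/driver/unbind" || return 1
    fi
    # Tell the PCI core that only vfio-pci may claim this device...
    echo vfio-pci > "$dev/driver_override" || return 1
    # ...and ask it to probe the device again.
    echo "$addr" > "$sysfs/bus/pci/drivers_probe"
}
```

From a root shell: `bind_to_vfio 0000:01:00.0`. Because the unbind step is skipped when no driver is bound, this also avoids the "no such file" error from writing to a missing driver/unbind.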

Binding gives the following error:

0000:01:00.0tee: /sys/bus/pci/drivers/vfio-pci/bind: Invalid argument

AI suggests that the driver may still be in use. However, if I try to unbind, there is a “no such dir” error, which means it is already unbound. Still, I’ve added the existing nvidia drivers to the blacklist:

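sudo nano /etc/modprobe.d/blacklist-nvidia.conf

File contents:

blacklist nouveau
options nouveau modeset=0
blacklist nvidia
blacklist nvidiafb
blacklist rivafb
blacklist rivatv

Rebuild the initramfs and edit the GRUB config:

sudo update-initramfs -u
sudo nano /etc/default/grub

Set the kernel command line. Note that the line as written here no longer contains intel_iommu=on from the earlier step; if it was needed there, it has to stay on this line too:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash modprobe.blacklist=nouveau,nvidiafb,nvidia"

Apply and reboot, then check for loaded drivers:

sudo update-grub
sudo reboot
lsmod | grep -e nouveau -e nvidia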

Should show nothing.
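Besides blacklisting, a commonly used approach is to let vfio-pci claim the card by ID at module load time and make the GPU drivers wait for it. The helper below only prints such a modprobe.d fragment. It is a sketch under assumptions: the function name is made up, and I have not verified that this resolves the bind error on this machine.

```bash
# vfio_modprobe_conf IDS DRIVER...
# Print a modprobe.d fragment that hands the given vendor:device IDs to
# vfio-pci and makes each listed driver load only after vfio-pci.
# IDS is a comma-separated list such as 10de:1b81 (illustrative helper).
vfio_modprobe_conf() {
    local ids="$1" drv
    shift
    echo "options vfio-pci ids=$ids"
    for drv in "$@"; do
        echo "softdep $drv pre: vfio-pci"
    done
}
```

For example, `vfio_modprobe_conf 10de:1b81 nouveau nvidia | sudo tee /etc/modprobe.d/vfio.conf` (the file name is my choice), then `sudo update-initramfs -u` and reboot. If the card has an audio function, its ID from lspci would need to go into the same list.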

After this, binding still fails, and I’m stuck.
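One known cause of “Invalid argument” (error -22) when binding to vfio-pci is that the device has no IOMMU group at all, which happens when the IOMMU is not active on the running kernel. The sketch below checks for that and lists the other devices in the same group together with their drivers. The function name and the optional sysfs root argument are my own; this is a diagnostic aid, not a confirmed fix.

```bash
# group_drivers ADDR [SYSFS_ROOT]
# Print every device in ADDR's IOMMU group with the driver bound to it.
# Fails if the device has no IOMMU group. SYSFS_ROOT defaults to /sys
# (illustrative parameter so the function can run against a fake tree).
group_drivers() {
    local addr="$1" sysfs="${2:-/sys}"
    local group="$sysfs/bus/pci/devices/$addr/iommu_group"
    local d drv
    if [ ! -e "$group" ]; then
        echo "$addr has no IOMMU group (is the IOMMU enabled?)" >&2
        return 1
    fi
    for d in "$group"/devices/*; do
        [ -e "$d" ] || continue
        if [ -L "$d/driver" ]; then
            # The driver entry is a symlink whose last component is the driver name.
            drv=$(basename "$(readlink "$d/driver")")
        else
            drv="(none)"
        fi
        echo "$(basename "$d") $drv"
    done
}
```

Running `group_drivers 0000:01:00.0` prints one line per device in the group. A GPU usually shares its group with its own audio function, and every device in the group has to be handed to vfio-pci (or left without a driver) before the group can be passed through.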

Need to try:

# Load vfio-pci with the disable_vga option
sudo modprobe vfio-pci disable_vga=1

This didn’t work. Claude suggested checking whether the card is the boot VGA device:

alvis@agaphub:~$ cat /sys/bus/pci/devices/0000:01:00.0/boot_vga
1

A value of 1 means the firmware initialized this GPU as the primary display.
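With more than one GPU installed it helps to see the boot_vga flag for every card at once, to tell which one the firmware picked as primary. A minimal sketch; the function name and the optional sysfs root argument are illustrative.

```bash
# list_boot_vga [SYSFS_ROOT]
# Print the boot_vga flag of every PCI device that has one
# (only VGA-class devices expose this file). SYSFS_ROOT defaults to /sys.
list_boot_vga() {
    local sysfs="${1:-/sys}" f dev
    for f in "$sysfs"/bus/pci/devices/*/boot_vga; do
        [ -r "$f" ] || continue
        dev=${f%/boot_vga}
        echo "${dev##*/} boot_vga=$(cat "$f")"
    done
}
```

The card that shows boot_vga=1 is the one the host console is using. Passing through a card that shows 0 tends to be simpler, so moving the monitor or changing the primary display setting in the BIOS may be an option.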

Now fixing.

After a lot of trial and error, I figured out that adding a new device to a PCI slot (such as a second GPU) apparently renames the Ethernet interface from enp4s5 to enp5s5 or similar, because the name is derived from the PCI bus number. That makes the netplan configuration invalid and the PC unreachable over the network. Fixed by listing both interface names in netplan at the same time.
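Another way around the renaming problem is to match the interface by MAC address in netplan and give it a fixed name, so the PCI bus number no longer matters. The helper below just prints such a config. It is a sketch: the function name, the interface name lan0, and `dhcp4: true` are assumptions, not my actual network settings.

```bash
# netplan_match_by_mac NAME MAC
# Print a netplan config that matches an Ethernet interface by MAC address
# and renames it to NAME. dhcp4 is an assumed setting; adjust as needed.
netplan_match_by_mac() {
    local name="$1" mac="$2"
    cat <<EOF
network:
  version: 2
  ethernets:
    $name:
      match:
        macaddress: "$mac"
      set-name: $name
      dhcp4: true
EOF
}
```

The MAC can be read with `cat /sys/class/net/enp4s5/address`. The output would go into a file under /etc/netplan/, and `sudo netplan try` applies it with an automatic rollback if the connection is lost.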

Check success (for the GTX 1070); vfio-pci should appear under “Kernel driver in use”:

lspci -nnk -d 10de:1b81

Debug by checking recent kernel errors:

sudo dmesg | tail -20

The vfio modules need to be loaded at boot.
Create /etc/initramfs-tools/modules (or edit it if it exists) and add:

vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

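Then rebuild the initramfs. On kernels 6.2 and newer, vfio_virqfd is built into the vfio module, so that line can be dropped there: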

sudo update-initramfs -u -k all
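Re-running this step by hand can leave duplicate module lines in the file. A small helper that only appends missing names avoids that. A sketch; the function name is made up, and it assumes the file ends with a newline.

```bash
# add_initramfs_modules FILE MODULE...
# Append each MODULE to FILE unless an identical line already exists.
# Creates FILE if it is missing (illustrative helper).
add_initramfs_modules() {
    local file="$1" mod
    shift
    touch "$file" || return 1
    for mod in "$@"; do
        # -x matches the whole line, -F treats the name literally.
        if ! grep -qxF "$mod" "$file"; then
            echo "$mod" >> "$file"
        fi
    done
}
```

From a root shell: `add_initramfs_modules /etc/initramfs-tools/modules vfio vfio_iommu_type1 vfio_pci`, then rebuild. `lsinitramfs /boot/initrd.img-$(uname -r) | grep vfio` shows whether the modules actually made it into the image.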

After restart:

lsmod | grep vfio

Check for errors:

sudo dmesg | grep -i vfio

After switching the GPUs back, the system wouldn’t boot anymore.

Adding CUDA to a container, official tutorial: https://ubuntu.com/tutorials/gpu-data-processing-inside-lxd#8-test-cuda-within-lxd
alvis added the Kind/Feature and Priority/Medium labels 2025-12-22 17:02:53 +00:00