diff --git a/vm-management/troubleshooting.md b/vm-management/troubleshooting.md
index 1b259bd..d890eda 100644
--- a/vm-management/troubleshooting.md
+++ b/vm-management/troubleshooting.md
@@ -678,9 +678,12 @@ This troubleshooting guide covers comprehensive recovery procedures for VM manag
 - nvme0 (2TB data): 0 media errors, 2% used, 1,571 hours — clean
 - Most likely cause: Nvidia driver panic or PCIe timeout on NVMe controller
 
+**Remediation:** Upgraded Nvidia driver 570.211.01 → 580.126.09. The 570 packages were in a broken state (partial upgrade left metapackage pinned at `.24.04.1` while deps moved to `.24.04.2`), requiring explicit removal of `nvidia-driver-570 nvidia-dkms-570 nvidia-kernel-source-570 nvidia-kernel-common-570 libnvidia-common-570 libnvidia-gl-570` with `--allow-change-held-packages` before 580 could install cleanly. Note: 590 drivers reported unstable — avoid.
+
 **Lesson:**
 - Always use `blkid` in initramfs to confirm the actual root partition before running fsck — NVMe device ordering is not stable across boots
 - Use `/usr/sbin/fsck.ext4 -y` directly rather than the busybox `fsck` wrapper, which may not invoke the correct backend
 - Docker containers with restart policies recovered without intervention — validates that approach
 - Install `smartmontools` on bare-metal servers proactively — wasn't available during initial investigation
-- Monitor nvme1 media integrity error count; if it increments, plan replacement
\ No newline at end of file
+- Monitor nvme1 media integrity error count; if it increments, plan replacement
+- When upgrading Nvidia driver major versions on Ubuntu, apt often can't resolve conflicts automatically — explicitly remove the old driver packages with `--allow-change-held-packages` first
\ No newline at end of file
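
The remediation added in this hunk could be sketched as the script below. It is a dry run that only prints the `apt-get` commands. The six package names and the `--allow-change-held-packages` flag come from the note itself. Assumptions not confirmed by the source: that `apt-get remove` (rather than `purge`) was used, and that the replacement metapackage is named `nvidia-driver-580`. The helper names `old_packages`, `remove_cmd`, and `install_cmd` are hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.
OLD_SERIES=570
NEW_SERIES=580   # assumption: the new metapackage is nvidia-driver-580

# Build the space-separated list of old driver packages named in the note.
old_packages() {
    pkgs=""
    for name in nvidia-driver nvidia-dkms nvidia-kernel-source \
                nvidia-kernel-common libnvidia-common libnvidia-gl; do
        pkgs="${pkgs:+$pkgs }${name}-${OLD_SERIES}"
    done
    printf '%s\n' "$pkgs"
}

# Removal step. --allow-change-held-packages lets apt-get modify held packages.
# Assumption: "remove" rather than "purge".
remove_cmd() {
    printf 'apt-get remove --allow-change-held-packages %s\n' "$(old_packages)"
}

# Install step for the new driver series (package name is assumed).
install_cmd() {
    printf 'apt-get install nvidia-driver-%s\n' "$NEW_SERIES"
}

remove_cmd
install_cmd
```

Review the printed commands before running them with root privileges. Adding `-s` to either `apt-get` call asks apt to simulate the transaction, which shows how the dependency conflict would be resolved without changing the system.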