docs: sync KB — troubleshooting.md

2026-03-25 00:00:43 -05:00 · 646991e1a9
commit 646991e1a9
parent 08a9dcd6eb
1 changed file with 4 additions and 1 deletion
--- a/vm-management/troubleshooting.md
+++ b/vm-management/troubleshooting.md
@@ -678,9 +678,12 @@ This troubleshooting guide covers comprehensive recovery procedures for VM manag
 - nvme0 (2TB data): 0 media errors, 2% used, 1,571 hours — clean
 - Most likely cause: Nvidia driver panic or PCIe timeout on NVMe controller

+**Remediation:** Upgraded Nvidia driver 570.211.01 → 580.126.09. The 570 packages were in a broken state (partial upgrade left metapackage pinned at `.24.04.1` while deps moved to `.24.04.2`), requiring explicit removal of `nvidia-driver-570 nvidia-dkms-570 nvidia-kernel-source-570 nvidia-kernel-common-570 libnvidia-common-570 libnvidia-gl-570` with `--allow-change-held-packages` before 580 could install cleanly. Note: 590 drivers reported unstable — avoid.
+
 **Lesson:**
 - Always use `blkid` in initramfs to confirm the actual root partition before running fsck — NVMe device ordering is not stable across boots
 - Use `/usr/sbin/fsck.ext4 -y` directly rather than the busybox `fsck` wrapper, which may not invoke the correct backend
 - Docker containers with restart policies recovered without intervention — validates that approach
 - Install `smartmontools` on bare-metal servers proactively — wasn't available during initial investigation
- Monitor nvme1 media integrity error count; if it increments, plan replacement
+- Monitor nvme1 media integrity error count; if it increments, plan replacement
+- When upgrading Nvidia driver major versions on Ubuntu, apt often can't resolve conflicts automatically — explicitly remove the old driver packages with `--allow-change-held-packages` first
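
The driver swap described in the remediation note can be sketched as two explicit steps: remove the old 570 packages with held-package changes allowed, then install the new series. This is a minimal sketch, not the exact commands that were run. The 570 package names come from the note itself; `nvidia-driver-580` is an assumed metapackage name, and `print_driver_swap_cmds` is a hypothetical helper that only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch of the 570 -> 580 driver swap from the remediation note.
# The 570 package list is taken from the note.
OLD_PKGS="nvidia-driver-570 nvidia-dkms-570 nvidia-kernel-source-570 nvidia-kernel-common-570 libnvidia-common-570 libnvidia-gl-570"

# Assumption: the note gives the 580 version but not the package name.
NEW_PKG="nvidia-driver-580"

# Hypothetical helper: prints the commands in order and executes nothing.
print_driver_swap_cmds() {
  # Step 1: explicitly remove the old series, allowing held packages to change.
  echo "apt-get remove --allow-change-held-packages ${OLD_PKGS}"
  # Step 2: install the new series once the old packages are gone.
  echo "apt-get install ${NEW_PKG}"
}
```

Calling `print_driver_swap_cmds` lists the two commands for review. Adding `-s` (simulate) to each `apt-get` call is a cautious way to check what apt would do before running them as root.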