From 646991e1a94b1f7f3432814a3040952c4f35e086 Mon Sep 17 00:00:00 2001
From: Cal Corum
Date: Wed, 25 Mar 2026 00:00:43 -0500
Subject: [PATCH] =?UTF-8?q?docs:=20sync=20KB=20=E2=80=94=20troubleshooting?= =?UTF-8?q?.md?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 vm-management/troubleshooting.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/vm-management/troubleshooting.md b/vm-management/troubleshooting.md
index 1b259bd..d890eda 100644
--- a/vm-management/troubleshooting.md
+++ b/vm-management/troubleshooting.md
@@ -678,9 +678,12 @@ This troubleshooting guide covers comprehensive recovery procedures for VM manag
 - nvme0 (2TB data): 0 media errors, 2% used, 1,571 hours — clean
 - Most likely cause: Nvidia driver panic or PCIe timeout on NVMe controller
 
+**Remediation:** Upgraded Nvidia driver 570.211.01 → 580.126.09. The 570 packages were in a broken state (partial upgrade left metapackage pinned at `.24.04.1` while deps moved to `.24.04.2`), requiring explicit removal of `nvidia-driver-570 nvidia-dkms-570 nvidia-kernel-source-570 nvidia-kernel-common-570 libnvidia-common-570 libnvidia-gl-570` with `--allow-change-held-packages` before 580 could install cleanly. Note: 590 drivers reported unstable — avoid.
+
 **Lesson:**
 - Always use `blkid` in initramfs to confirm the actual root partition before running fsck — NVMe device ordering is not stable across boots
 - Use `/usr/sbin/fsck.ext4 -y` directly rather than the busybox `fsck` wrapper, which may not invoke the correct backend
 - Docker containers with restart policies recovered without intervention — validates that approach
 - Install `smartmontools` on bare-metal servers proactively — wasn't available during initial investigation
-- Monitor nvme1 media integrity error count; if it increments, plan replacement
\ No newline at end of file
+- Monitor nvme1 media integrity error count; if it increments, plan replacement
+- When upgrading Nvidia driver major versions on Ubuntu, apt often can't resolve conflicts automatically — explicitly remove the old driver packages with `--allow-change-held-packages` first
\ No newline at end of file
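
Reviewer note: the removal step described in the Remediation entry could look roughly like the sketch below. The package list is copied from that entry. The `DRY_RUN` switch, the choice of `apt-get remove` rather than `purge`, and the `nvidia-driver-580` package name are assumptions for illustration, not a record of the commands that were actually run.

```shell
#!/bin/sh
# Sketch only: clear a half-upgraded Nvidia 570 stack before installing 580.
# Package names come from the Remediation entry; everything else is assumed.

OLD_PKGS="nvidia-driver-570 nvidia-dkms-570 nvidia-kernel-source-570 nvidia-kernel-common-570 libnvidia-common-570 libnvidia-gl-570"

# Build the removal command as a string so it can be reviewed before use.
removal_cmd() {
    echo "apt-get remove --allow-change-held-packages $OLD_PKGS"
}

# DRY_RUN defaults to 1, so nothing is changed unless it is set to 0.
if [ "${DRY_RUN:-1}" = "1" ]; then
    removal_cmd
else
    # Left unquoted on purpose so the command string splits into words.
    sudo $(removal_cmd)
    # Assumed metapackage name for the 580 series; verify before running.
    sudo apt-get install nvidia-driver-580
fi
```

Run as-is, the script only prints the removal command. Setting `DRY_RUN=0` would execute it, so checking the printed package list against `dpkg -l` output first seems prudent.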