docs: sync KB — troubleshooting.md

Cal Corum 2026-03-25 00:00:43 -05:00
parent 08a9dcd6eb
commit 646991e1a9


@@ -678,9 +678,12 @@ This troubleshooting guide covers comprehensive recovery procedures for VM manag
- nvme0 (2TB data): 0 media errors, 2% used, 1,571 hours — clean
- Most likely cause: Nvidia driver panic or PCIe timeout on NVMe controller
**Remediation:** Upgraded Nvidia driver 570.211.01 → 580.126.09. The 570 packages were in a broken state (partial upgrade left metapackage pinned at `.24.04.1` while deps moved to `.24.04.2`), requiring explicit removal of `nvidia-driver-570 nvidia-dkms-570 nvidia-kernel-source-570 nvidia-kernel-common-570 libnvidia-common-570 libnvidia-gl-570` with `--allow-change-held-packages` before 580 could install cleanly. Note: 590 drivers reported unstable — avoid.
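The removal step described above could look roughly like the sketch below. It assumes `apt-get` was the front end used and that plain `remove` (rather than `purge`) is sufficient. The helper name `print_nvidia_570_removal` is hypothetical, and it only prints the command so it can be reviewed before anything is run.

```shell
#!/bin/sh
# Old 570-series packages, copied from the remediation note above.
OLD_NVIDIA_PKGS="nvidia-driver-570 nvidia-dkms-570 nvidia-kernel-source-570 nvidia-kernel-common-570 libnvidia-common-570 libnvidia-gl-570"

# Hypothetical helper: print the removal command instead of executing it.
# Assumption: apt-get with --allow-change-held-packages, as named in the note.
print_nvidia_570_removal() {
    echo "sudo apt-get remove --allow-change-held-packages $OLD_NVIDIA_PKGS"
}
```

Calling `print_nvidia_570_removal` prints a single command line to inspect and then run by hand. The 580 driver would be installed afterwards; its exact package name is not given here, so check it with `apt-cache policy` first.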
**Lesson:**
- Always use `blkid` in initramfs to confirm the actual root partition before running fsck — NVMe device ordering is not stable across boots
- Use `/usr/sbin/fsck.ext4 -y` directly rather than the busybox `fsck` wrapper, which may not invoke the correct backend
- Docker containers with restart policies recovered without intervention — validates that approach
- Install `smartmontools` on bare-metal servers proactively; it wasn't available during the initial investigation
- Monitor nvme1 media integrity error count; if it increments, plan replacement
- When upgrading Nvidia driver major versions on Ubuntu, apt often can't resolve conflicts automatically — explicitly remove the old driver packages with `--allow-change-held-packages` first
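
The nvme1 monitoring lesson above can be partly automated. This is a minimal sketch, assuming `smartctl -a` prints a `Media and Data Integrity Errors:` line for the NVMe device. The function names and the baseline argument are illustrative, not part of any existing tooling on this host.

```shell
#!/bin/sh
# Read `smartctl -a` output on stdin and print the media error count.
# Assumption: the line is labelled "Media and Data Integrity Errors:".
media_error_count() {
    awk -F: '/Media and Data Integrity Errors/ { gsub(/[ \t,]/, "", $2); print $2; exit }'
}

# Exit 0 only if the count read from stdin is greater than the baseline ($1).
# Exits non-zero when the count is unchanged, lower, or could not be parsed.
media_errors_increased() {
    baseline="$1"
    current="$(media_error_count)"
    [ -n "$current" ] && [ "$current" -gt "$baseline" ]
}
```

A possible use, with `BASELINE` set to the count recorded during the investigation: `sudo smartctl -a /dev/nvme1 | media_errors_increased "$BASELINE" && echo "nvme1 media errors increased; plan replacement"`. Because NVMe device ordering is not stable across boots, confirm which device node is the affected drive (for example by serial number) before trusting the `/dev/nvme1` path.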