docs: sync KB — troubleshooting.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
Parent 08a9dcd6eb, commit 646991e1a9
@@ -678,9 +678,12 @@ This troubleshooting guide covers comprehensive recovery procedures for VM manag
- nvme0 (2TB data): 0 media errors, 2% used, 1,571 hours — clean
- Most likely cause: Nvidia driver panic or PCIe timeout on NVMe controller
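
Figures such as the media error count, percentage used, and power-on hours can be read from the NVMe SMART log once `smartmontools` is installed. The following is a minimal sketch rather than the procedure used during this incident. It assumes the text layout that `smartctl -a` prints for NVMe devices (a `Media and Data Integrity Errors:` line); the function names and the baseline argument are hypothetical.

```shell
#!/bin/sh
# Sketch: pull the NVMe media error count out of `smartctl -a` text output
# and compare it with a previously recorded baseline.
# Assumption: smartctl prints a line of the form
#   Media and Data Integrity Errors:    <count>

# Reads smartctl output on stdin and prints the media error count (digits only,
# thousands separators removed).
media_errors() {
    awk -F: '/^Media and Data Integrity Errors/ { gsub(/[^0-9]/, "", $2); print $2; exit }'
}

# check_media_errors BASELINE
# Reads smartctl output on stdin. Returns 1 if the current count is higher
# than BASELINE, 2 if the count could not be read, 0 otherwise.
check_media_errors() {
    baseline=$1
    current=$(media_errors)
    if [ -z "$current" ]; then
        echo "could not read media error count" >&2
        return 2
    fi
    if [ "$current" -gt "$baseline" ]; then
        echo "media errors rose from $baseline to $current: plan replacement"
        return 1
    fi
    echo "media errors at $current (baseline $baseline)"
}
```

A typical call would pipe live output into the check, for example `smartctl -a /dev/nvme1 | check_media_errors "$BASELINE"`, where `BASELINE` is whatever count was recorded at the last inspection. Keeping the parser on stdin means it can be exercised against saved output without touching the drive.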
**Remediation:** Upgraded the Nvidia driver from 570.211.01 to 580.126.09. The 570 packages were in a broken state: a partial upgrade had left the metapackage pinned at `.24.04.1` while its dependencies moved to `.24.04.2`. The 580 driver installed cleanly only after the old packages (`nvidia-driver-570`, `nvidia-dkms-570`, `nvidia-kernel-source-570`, `nvidia-kernel-common-570`, `libnvidia-common-570`, `libnvidia-gl-570`) were removed explicitly with `--allow-change-held-packages`. Note: the 590 drivers are reported to be unstable; avoid them.
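
The removal-then-install sequence described above can be sketched as a small shell function. This is an illustration under assumptions, not a record of the exact commands that were run: the `nvidia-driver-580` package name, the `-y` flags, and the `APT` and `replace_nvidia_driver` names are assumptions. By default the function only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: remove the half-upgraded 570 driver stack, then install 580.
# The 570 package names are the ones listed in the remediation note;
# "nvidia-driver-580" is an assumed package name for the new driver.
OLD_PKGS="nvidia-driver-570 nvidia-dkms-570 nvidia-kernel-source-570 nvidia-kernel-common-570 libnvidia-common-570 libnvidia-gl-570"

# Dry run by default: the commands are echoed, not executed.
# Set APT="sudo apt-get" in the environment to run them for real.
APT=${APT:-echo apt-get}

replace_nvidia_driver() {
    # Explicit removal; --allow-change-held-packages lets apt-get
    # touch packages that are held or pinned.
    $APT remove -y --allow-change-held-packages $OLD_PKGS || return 1
    # Install the new driver only once the old stack is gone.
    $APT install -y nvidia-driver-580
}
```

Calling `replace_nvidia_driver` with `APT` unset prints the two `apt-get` invocations. After a real run and a reboot, `nvidia-smi` should report the new driver version.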
**Lesson:**
- Always use `blkid` in initramfs to confirm the actual root partition before running fsck — NVMe device ordering is not stable across boots
- Use `/usr/sbin/fsck.ext4 -y` directly rather than the busybox `fsck` wrapper, which may not invoke the correct backend
- Docker containers with restart policies recovered without intervention — validates that approach
- Install `smartmontools` on bare-metal servers proactively — wasn't available during initial investigation
- Monitor nvme1 media integrity error count; if it increments, plan replacement
- When upgrading Nvidia driver major versions on Ubuntu, apt often can't resolve conflicts automatically — explicitly remove the old driver packages with `--allow-change-held-packages` first
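
Because NVMe device names can change between boots, one way to apply the first two lessons is to resolve the root partition by UUID before calling `fsck.ext4`. The sketch below assumes the kernel command line carries `root=UUID=...` and that `blkid` prints one `/dev/...: ... UUID="..."` line per partition. All three function names are hypothetical, and setups that use LVM or a `root=/dev/...` path would need a different lookup.

```shell
#!/bin/sh
# Sketch: find the real root partition from the initramfs shell, then fsck it.

# Reads a kernel command line on stdin and prints the value of root=.
root_spec() {
    tr ' ' '\n' | sed -n 's/^root=//p' | head -n 1
}

# device_for_uuid UUID
# Reads `blkid` output on stdin and prints the device node with that
# filesystem UUID. The leading space keeps PARTUUID="..." from matching.
device_for_uuid() {
    grep -F " UUID=\"$1\"" | cut -d: -f1 | head -n 1
}

# Ties the two together. Defined here but not called automatically.
fsck_root() {
    spec=$(root_spec < /proc/cmdline)
    case $spec in
        UUID=*) uuid=${spec#UUID=} ;;
        *) echo "root= is not a UUID: $spec" >&2; return 1 ;;
    esac
    dev=$(blkid | device_for_uuid "$uuid")
    if [ -z "$dev" ]; then
        echo "no partition with UUID $uuid" >&2
        return 1
    fi
    echo "root partition is $dev"
    # Call the ext4 backend directly instead of the busybox fsck wrapper.
    /usr/sbin/fsck.ext4 -y "$dev"
}
```

Running `fsck_root` from the initramfs prompt prints the resolved device before repairing it, which leaves a chance to compare it against the raw `blkid` listing first.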