docs: sync KB — troubleshooting.md

2026-03-24 22:00:43 -05:00 · 2026-03-24 22:00:43 -05:00 · 08a9dcd6eb
commit 08a9dcd6eb
parent cedb056bce
1 changed files with 27 additions and 1 deletions
--- a/vm-management/troubleshooting.md
+++ b/vm-management/troubleshooting.md
@@ -657,4 +657,30 @@ docker system info && docker system df
 - **Proxmox Console**: Direct VM access when SSH fails
 - **Emergency Contact**: Use Discord notifications for critical issues

-This troubleshooting guide covers comprehensive recovery procedures for VM management issues in home lab environments.
+This troubleshooting guide covers comprehensive recovery procedures for VM management issues in home lab environments.
+
+## ubuntu-manticore crash recovery — initramfs fsck on wrong device (2026-03-24)
+
+**Severity:** High — server unbootable, all services down (pihole DNS, jellyfin, tdarr, kb-rag)
+
+**Problem:** After a physical server crash, ubuntu-manticore dropped to an initramfs shell. The boot-time fsck targeted `/dev/nvme0n1p2` (an NTFS data partition labeled "2TB 970") instead of the actual ext4 root on `/dev/nvme1n1p2`. The generic busybox `fsck -y` wrapper didn't invoke the ext4 backend.
+
+**Root Cause:** Two issues compounded: (1) The crash corrupted the ext4 root filesystem (block/inode count mismatches across ~15 groups). (2) The initramfs resolved the root device UUID to the wrong NVMe drive: `nvme0n1p2` instead of `nvme1n1p2`. NVMe device enumeration order can shift between boots. fstab correctly references the root by UUID, but the initramfs still selected the wrong device on this boot.
+
+**Fix:** Ran `/usr/sbin/fsck.ext4 -y /dev/nvme1n1p2` directly from initramfs (identified correct partition via `blkid`). After `exit`, boot completed normally and all 9 Docker containers came up automatically via restart policies.
+
+**Crash cause investigation:**
+- Kernel panic: `BUG: unable to handle page fault for address: fffffb2320041d50` — supervisor write to not-present page
+- PCIe AER correctable errors (Data Link Layer Timeout) on port `0000:00:01.2` (AMD X470/B450 root port) logged on Mar 19
+- Nvidia proprietary driver loaded, kernel tainted — common source of page faults
+- AMD Zen1 DIV0 bug flagged at boot (Ryzen 5 2600)
+- SMART data: both Samsung 970s healthy (PASSED), but nvme1 (250GB root drive, 22k hours) has **1 Media and Data Integrity Error** — monitor for growth
+- nvme0 (2TB data): 0 media errors, 2% used, 1,571 hours — clean
+- Most likely cause: Nvidia driver panic or PCIe timeout on NVMe controller
+
+**Lesson:**
+- Always use `blkid` in initramfs to confirm the actual root partition before running fsck — NVMe device ordering is not stable across boots
+- Use `/usr/sbin/fsck.ext4 -y` directly rather than the busybox `fsck` wrapper, which may not invoke the correct backend
+- Docker containers with restart policies recovered without intervention — validates that approach
+- Install `smartmontools` on bare-metal servers proactively — wasn't available during initial investigation
+- Monitor nvme1 media integrity error count; if it increments, plan replacement
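
Note on the first two lessons above: the following is a minimal sketch of confirming the root partition before running fsck, not the procedure actually used during this recovery. It assumes `blkid` prints one `device: KEY="value" ...` line per partition and that the root filesystem UUID is already known (for example from `root=UUID=...` in `/proc/cmdline`, which is an assumption about this host's boot configuration). The function name `find_ext4_by_uuid` is hypothetical, and the sketch sticks to POSIX sh built-ins because the set of tools available in an initramfs shell is limited.

```shell
#!/bin/sh
# Hypothetical helper: read blkid-style lines on stdin and print the device
# whose filesystem UUID matches $1 AND whose TYPE is ext4.
# Returns 1 (and prints nothing) when no such partition exists.
find_ext4_by_uuid() {
    want_uuid=$1
    while IFS= read -r line; do
        # The leading space keeps PARTUUID="..." from matching as UUID="...".
        case $line in
            *" UUID=\"$want_uuid\""*) ;;
            *) continue ;;
        esac
        # Require ext4 so an NTFS or other partition is never handed to fsck.ext4.
        case $line in
            *' TYPE="ext4"'*) ;;
            *) continue ;;
        esac
        # The device path is everything before the first colon.
        printf '%s\n' "${line%%:*}"
        return 0
    done
    return 1
}

# Intended use from the initramfs shell (left commented out; ROOT_UUID is a
# placeholder that must be filled in by hand):
#   root_dev=$(blkid | find_ext4_by_uuid "$ROOT_UUID") &&
#       /usr/sbin/fsck.ext4 -y "$root_dev"
```

Because the helper prints nothing and returns non-zero when no ext4 partition carries the expected UUID, the chained `fsck.ext4` call is skipped rather than being pointed at whichever device happens to enumerate first.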