diff --git a/vm-management/troubleshooting.md b/vm-management/troubleshooting.md
index b513f32..1b259bd 100644
--- a/vm-management/troubleshooting.md
+++ b/vm-management/troubleshooting.md
@@ -657,4 +657,30 @@ docker system info && docker system df
 - **Proxmox Console**: Direct VM access when SSH fails
 - **Emergency Contact**: Use Discord notifications for critical issues
 
-This troubleshooting guide covers comprehensive recovery procedures for VM management issues in home lab environments.
\ No newline at end of file
+This troubleshooting guide covers comprehensive recovery procedures for VM management issues in home lab environments.
+
+## ubuntu-manticore crash recovery: initramfs fsck on wrong device (2026-03-24)
+
+**Severity:** High. Server unbootable, all services down (pihole DNS, jellyfin, tdarr, kb-rag)
+
+**Problem:** After a physical server crash, ubuntu-manticore dropped to an initramfs shell. Boot fsck targeted `/dev/nvme0n1p2` (an NTFS data partition labeled "2TB 970") instead of the actual ext4 root on `/dev/nvme1n1p2`. The generic busybox `fsck -y` wrapper didn't invoke the ext4 backend.
+
+**Root Cause:** Two issues compounded: (1) The crash corrupted the ext4 root filesystem (block/inode count mismatches across ~15 groups). (2) The initramfs resolved the root device UUID to the wrong NVMe drive (`nvme0n1p2` instead of `nvme1n1p2`). NVMe device enumeration order can shift between boots; fstab references the root by UUID correctly, but the initramfs still selected the wrong device on this boot.
+
+**Fix:** Identified the correct partition with `blkid`, then ran `/usr/sbin/fsck.ext4 -y /dev/nvme1n1p2` directly from the initramfs shell. After `exit`, boot completed normally and all 9 Docker containers came up automatically via restart policies.
+
+**Crash cause investigation:**
+- Kernel panic: `BUG: unable to handle page fault for address: fffffb2320041d50` (supervisor write to not-present page)
+- PCIe AER correctable errors (Data Link Layer Timeout) on port `0000:00:01.2` (AMD X470/B450 root port) logged on Mar 19
+- Nvidia proprietary driver loaded, kernel tainted; a common source of page faults
+- AMD Zen1 DIV0 bug flagged at boot (Ryzen 5 2600)
+- SMART data: both Samsung 970s healthy (PASSED), but nvme1 (250GB root drive, 22k hours) has **1 Media and Data Integrity Error**; monitor for growth
+- nvme0 (2TB data): 0 media errors, 2% used, 1,571 hours; clean
+- Most likely cause: Nvidia driver panic or PCIe timeout on NVMe controller
+
+**Lesson:**
+- Always use `blkid` in initramfs to confirm the actual root partition before running fsck; NVMe device ordering is not stable across boots
+- Use `/usr/sbin/fsck.ext4 -y` directly rather than the busybox `fsck` wrapper, which may not invoke the correct backend
+- Docker containers with restart policies recovered without intervention, which validates that approach
+- Install `smartmontools` on bare-metal servers proactively; it wasn't available during the initial investigation
+- Monitor the nvme1 media integrity error count; if it increments, plan replacement
\ No newline at end of file
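
Review note: the last lesson (watch the nvme1 media integrity error count) could be automated once `smartmontools` is installed. The following is a minimal sketch, not part of this change. It assumes `smartctl -A` prints a `Media and Data Integrity Errors:` line for NVMe devices; the function names `media_errors` and `check_media_errors` are hypothetical, and the baseline of 1 is taken from the SMART data recorded above.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -A` output on stdin and print the
# value of the "Media and Data Integrity Errors" field (assumed label).
media_errors() {
    awk -F: '/Media and Data Integrity Errors/ {
        gsub(/[ \t,]/, "", $2)   # drop padding and thousands separators
        print $2
    }'
}

# Hypothetical check: compare the current count against a recorded baseline.
# Returns 0 if unchanged or lower, 1 if the count grew, 2 if no value was found.
check_media_errors() {
    baseline=$1
    current=$(media_errors)
    if [ -z "$current" ]; then
        echo "UNKNOWN: no media error line found"
        return 2
    fi
    if [ "$current" -gt "$baseline" ]; then
        echo "WARN: media errors rose from $baseline to $current"
        return 1
    fi
    echo "OK: media errors at $current (baseline $baseline)"
}
```

A possible invocation is `smartctl -A /dev/nvme1 | check_media_errors 1`, run from cron or a systemd timer. Because NVMe enumeration order is not stable across boots, the device path here is an assumption; confirm the model or serial with `smartctl -i` before trusting the result.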