docs: sync KB — troubleshooting.md

Cal Corum 2026-03-24 22:00:43 -05:00
parent cedb056bce
commit 08a9dcd6eb

@@ -657,4 +657,30 @@ docker system info && docker system df
- **Proxmox Console**: Direct VM access when SSH fails
- **Emergency Contact**: Use Discord notifications for critical issues
This troubleshooting guide covers comprehensive recovery procedures for VM management issues in home lab environments.
## ubuntu-manticore crash recovery — initramfs fsck on wrong device (2026-03-24)
**Severity:** High — server unbootable, all services down (pihole DNS, jellyfin, tdarr, kb-rag)
**Problem:** After a physical server crash, ubuntu-manticore dropped to an initramfs shell. The boot-time fsck targeted `/dev/nvme0n1p2` (an NTFS data partition labeled "2TB 970") instead of the actual ext4 root on `/dev/nvme1n1p2`. The generic busybox `fsck -y` wrapper also did not invoke the ext4 backend.
**Root Cause:** Two issues compounded: (1) The crash corrupted the ext4 root filesystem (block/inode count mismatches across ~15 groups). (2) The initramfs resolved the root device UUID to the wrong NVMe drive: `nvme0n1p2` instead of `nvme1n1p2`. NVMe device enumeration order can shift between boots. fstab references the root by UUID, which is correct, but the initramfs still picked the wrong device on this boot.
**Fix:** Ran `/usr/sbin/fsck.ext4 -y /dev/nvme1n1p2` directly from initramfs (identified correct partition via `blkid`). After `exit`, boot completed normally and all 9 Docker containers came up automatically via restart policies.
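
The identification step can be sketched as a small filter over `blkid` output. This is only an illustration: `list_ext4_devices` is a hypothetical helper name, not something that was run during the incident, and it assumes `blkid` prints one `device: KEY="value" ...` line per partition. On a machine with more than one ext4 partition it prints all of them, so the root still has to be picked by hand.

```sh
# Hypothetical helper: read blkid output on stdin and print the device
# name of every partition whose filesystem type is ext4.
list_ext4_devices() {
    grep 'TYPE="ext4"' | cut -d: -f1
}

# Usage from the initramfs prompt (device names will differ per machine):
#   blkid | list_ext4_devices
#   /usr/sbin/fsck.ext4 -y /dev/nvme1n1p2   # the ext4 root in this incident
#   exit                                    # resume the normal boot
```

The helper uses only `grep` and `cut`, which are normally available as busybox applets, so it should be usable before the root filesystem is mounted.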
**Crash cause investigation:**
- Kernel panic: `BUG: unable to handle page fault for address: fffffb2320041d50` — supervisor write to not-present page
- PCIe AER correctable errors (Data Link Layer Timeout) on port `0000:00:01.2` (AMD X470/B450 root port) logged on Mar 19
- Nvidia proprietary driver loaded, kernel tainted — common source of page faults
- AMD Zen1 DIV0 bug flagged at boot (Ryzen 5 2600)
- SMART data: both Samsung 970s healthy (PASSED), but nvme1 (250GB root drive, 22k hours) has **1 Media and Data Integrity Error** — monitor for growth
- nvme0 (2TB data): 0 media errors, 2% used, 1,571 hours — clean
- Most likely cause: Nvidia driver panic or PCIe timeout on NVMe controller
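
For revisiting the PCIe evidence later, a filter like the following could pull the AER lines for one port out of kernel log text. It is a rough sketch: `aer_lines_for_port` is a hypothetical name, and the usage line assumes a persistent systemd journal that still holds the Mar 19 entries.

```sh
# Hypothetical helper: keep only PCIe AER lines that mention one PCI address.
# $1 is the address, e.g. 0000:00:01.2 (the root port named above).
# Kernel log text is read from stdin.
aer_lines_for_port() {
    grep -F "$1" | grep -E 'AER|PCIe Bus Error'
}

# Usage (assumes the journal is persistent and covers that day):
#   journalctl -k --since "2026-03-19" --until "2026-03-20" | aer_lines_for_port 0000:00:01.2
```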
**Lesson:**
- Always use `blkid` in initramfs to confirm the actual root partition before running fsck — NVMe device ordering is not stable across boots
- Use `/usr/sbin/fsck.ext4 -y` directly rather than the busybox `fsck` wrapper, which may not invoke the correct backend
- Docker containers with restart policies recovered without intervention — validates that approach
- Install `smartmontools` on bare-metal servers proactively; it wasn't available on this server during the initial investigation
- Monitor nvme1 media integrity error count; if it increments, plan replacement
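
The media-error monitoring above could be scripted roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the baseline of 1 is the count recorded in this incident, and the parsing relies on `smartctl -a` printing a `Media and Data Integrity Errors:` line for NVMe drives. Because NVMe ordering is not stable across boots, confirm which device is the root drive before trusting the `/dev/nvme1` name in the usage line.

```sh
# Hypothetical helper: extract the "Media and Data Integrity Errors" count
# from `smartctl -a` output read on stdin.
media_error_count() {
    awk -F: '/Media and Data Integrity Errors/ { gsub(/[ ,]/, "", $2); print $2 }'
}

# Hypothetical helper: exit 0 when the count on stdin exceeds the baseline in $1.
media_errors_grew() {
    count=$(media_error_count)
    [ -n "$count" ] && [ "$count" -gt "$1" ]
}

# Usage, run as root (baseline 1 is the count seen during this incident):
#   smartctl -a /dev/nvme1 | media_errors_grew 1 && echo "nvme1 media errors increased"
```

Run from cron or a systemd timer, the usage line could feed whatever notification path is already in place.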