docs: sync KB — troubleshooting.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 7s
parent cedb056bce
commit 08a9dcd6eb
@ -657,4 +657,30 @@ docker system info && docker system df
- **Proxmox Console**: Direct VM access when SSH fails
- **Emergency Contact**: Use Discord notifications for critical issues

This troubleshooting guide covers comprehensive recovery procedures for VM management issues in home lab environments.

## ubuntu-manticore crash recovery — initramfs fsck on wrong device (2026-03-24)

**Severity:** High. Server unbootable, all services down (pihole DNS, jellyfin, tdarr, kb-rag).

**Problem:** After a physical server crash, ubuntu-manticore dropped to an initramfs shell. The boot-time fsck targeted `/dev/nvme0n1p2` (an NTFS data partition labeled "2TB 970") instead of the actual ext4 root on `/dev/nvme1n1p2`. The generic busybox `fsck -y` wrapper did not invoke the ext4 backend.

**Root Cause:** Two issues compounded: (1) The crash corrupted the ext4 root filesystem (block/inode count mismatches across ~15 groups). (2) The initramfs resolved the root device UUID to a partition on the wrong NVMe drive: `nvme0n1p2` instead of `nvme1n1p2`. NVMe device enumeration order can shift between boots. `fstab` references the root filesystem by UUID, yet the initramfs still selected the wrong device on this boot.
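
One way to see which partition the kernel was asked to mount is to compare the `root=` parameter on the kernel command line with what `blkid` reports. The helpers below are a minimal sketch, assuming the root is given as `root=UUID=...`, that `blkid` prints its usual `DEVICE: KEY="value" ...` lines, and that the `tr`, `sed`, and `grep` applets exist in the initramfs. The function names `root_uuid` and `device_for_uuid` are made up for illustration.

```sh
#!/bin/sh
# Hypothetical helpers: both read text on stdin, so they can also be tried
# on saved output from another machine.

# Print the UUID from a "root=UUID=<uuid>" kernel command-line parameter, if present.
root_uuid() {
    tr ' ' '\n' | sed -n 's/^root=UUID=//p'
}

# Print the device whose blkid line carries exactly the given filesystem UUID.
# The leading space in the pattern keeps PARTUUID="..." fields from matching.
device_for_uuid() {
    grep " UUID=\"$1\"" | sed 's/:.*//'
}

# Usage from the initramfs shell:
#   uuid=$(root_uuid < /proc/cmdline)
#   blkid | device_for_uuid "$uuid"
```

If the device printed here differs from the one the boot-time fsck complained about, the fsck target is wrong, not the UUID.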

**Fix:** Identified the correct partition with `blkid`, then ran `/usr/sbin/fsck.ext4 -y /dev/nvme1n1p2` directly from the initramfs shell. After `exit`, boot completed normally and all 9 Docker containers came up automatically via their restart policies.
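
The fix can be sketched as the following initramfs session. This is a hedged outline, not a transcript of the incident: `ext4_candidates` is a hypothetical helper name, and the device path in the comments is the one from this incident, which may differ on another boot.

```sh
#!/bin/sh
# Hypothetical helper: read blkid output on stdin and print only the devices
# that blkid identifies as ext4, so an NTFS data partition is never picked.
ext4_candidates() {
    grep ' TYPE="ext4"' | sed 's/:.*//'
}

# Recovery sequence from the (initramfs) prompt, shown as comments because it
# should only be run by hand against a confirmed, unmounted root partition:
#   blkid | ext4_candidates                  # confirm which partition is the ext4 root
#   /usr/sbin/fsck.ext4 -y /dev/nvme1n1p2    # repair it with the ext4 checker directly
#   exit                                     # leave the initramfs shell and continue booting
```

The `-y` flag answers yes to every repair prompt. That suits a root filesystem that has to come back quickly, at the cost of not reviewing individual fixes.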

**Crash cause investigation:**

- Kernel panic: `BUG: unable to handle page fault for address: fffffb2320041d50` (supervisor write to not-present page)
- PCIe AER correctable errors (Data Link Layer Timeout) on port `0000:00:01.2` (AMD X470/B450 root port), logged on Mar 19
- Nvidia proprietary driver loaded, kernel tainted; a common source of page faults
- AMD Zen1 DIV0 bug flagged at boot (Ryzen 5 2600)
- SMART data: both Samsung 970s healthy (PASSED), but nvme1 (250GB root drive, 22k hours) has **1 Media and Data Integrity Error**; monitor for growth
- nvme0 (2TB data): 0 media errors, 2% used, 1,571 hours; clean
- Most likely cause: Nvidia driver panic or PCIe timeout on NVMe controller
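
A similar review can be repeated after a future crash by filtering the previous boot's kernel log for the same kinds of events. This is a minimal sketch, assuming journald keeps logs across reboots. `crash_hints` is a hypothetical helper name, and the pattern list is an illustrative starting point, not a complete set.

```sh
#!/bin/sh
# Hypothetical helper: keep only kernel log lines that mention the kinds of
# events listed above (kernel BUG reports, panics, page faults, PCIe AER, NVMe).
crash_hints() {
    grep -iE 'BUG:|kernel panic|page fault|AER:|nvme'
}

# Usage on the recovered system:
#   journalctl -k -b -1 | crash_hints    # kernel messages from the previous boot
```

A panic that halts the machine may never reach the on-disk journal, so an empty result does not rule one out.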

**Lesson:**

- Always use `blkid` in initramfs to confirm the actual root partition before running fsck; NVMe device ordering is not stable across boots
- Use `/usr/sbin/fsck.ext4 -y` directly rather than the busybox `fsck` wrapper, which may not invoke the correct backend
- Docker containers with restart policies recovered without intervention, which validates that approach
- Install `smartmontools` on bare-metal servers proactively; it was not available during the initial investigation
- Monitor the nvme1 media integrity error count; if it increments, plan replacement
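
The last item can be turned into a periodic check. The sketch below assumes `smartmontools` is installed and that `smartctl -A` prints a `Media and Data Integrity Errors:` line for NVMe devices. The helper names `media_errors` and `within_baseline` are hypothetical, and the baseline of 1 is the count recorded in this incident.

```sh
#!/bin/sh
# Hypothetical helper: read `smartctl -A` output for an NVMe device on stdin
# and print the "Media and Data Integrity Errors" counter as a bare number.
media_errors() {
    awk -F: '/Media and Data Integrity Errors/ { gsub(/[^0-9]/, "", $2); print $2 }'
}

# Hypothetical check: succeed only while the counter is at or below a baseline.
# $1 = current count, $2 = baseline.
within_baseline() {
    [ "$1" -le "$2" ]
}

# Usage (run as root; the controller name can change between boots, so
# confirm the drive by model or serial number first):
#   count=$(smartctl -A /dev/nvme1 | media_errors)
#   within_baseline "$count" 1 || echo "nvme1 media errors grew to $count: plan replacement"
```

The `echo` could be swapped for whatever alerting is already in place, such as the Discord notifications mentioned earlier in this guide.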