feat: weekly Proxmox backup verification → Discord (#27) #48

Claude · 2026-04-04T11:08:18Z

Claude commented

2026-04-04 11:08:18 +00:00

Closes #27

Summary

proxmox-backup-check.sh — SSHes to Proxmox, queries pvesh task history, classifies each running VM/CT as 🟢/🟡/🔴 by backup recency (7-day default), and posts a Discord embed. Designed for weekly cron on CT 302 (Monday 08:00 UTC).
ct302-self-health.sh — Checks disk usage on CT 302 itself; exits silently when healthy, posts a Discord alert when any filesystem hits ≥80%. Closes the blind spot where the monitoring system can't monitor itself externally. Runs daily (07:00 UTC).
monitoring/scripts/CONTEXT.md updated with full operational docs, install instructions, and cron entry examples for both scripts.
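
The recency classification described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `classify_backup` is hypothetical, and the rule that 🟡 means "a backup exists but is older than the window" is an assumption. Only the 7-day default and the three status colours come from this PR.

```bash
# Hypothetical helper: map a guest's last successful backup time to a status.
# Args: last backup epoch (0 = none found), current epoch, max age in days.
classify_backup() {
  local last_ts="$1" now="$2" max_age_days="${3:-7}"
  local max_age_secs=$(( max_age_days * 86400 ))

  if [ "$last_ts" -eq 0 ]; then
    echo "🔴"   # assumption: no successful backup on record
  elif [ $(( now - last_ts )) -le "$max_age_secs" ]; then
    echo "🟢"   # backed up within the window
  else
    echo "🟡"   # assumption: a backup exists but is stale
  fi
}

# Example: a guest whose last backup finished one hour ago.
now=$(date +%s)
classify_backup "$(( now - 3600 ))" "$now"
```

Passing the current time in as an argument keeps the function easy to test with fixed timestamps.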

Deploy checklist

# On CT 302 (ssh claude-runner)
cp monitoring/scripts/proxmox-backup-check.sh /root/scripts/
cp monitoring/scripts/ct302-self-health.sh /root/scripts/
chmod +x /root/scripts/proxmox-backup-check.sh /root/scripts/ct302-self-health.sh

# Verify dry-run works
PROXMOX_SSH=proxmox /root/scripts/proxmox-backup-check.sh --dry-run
/root/scripts/ct302-self-health.sh --dry-run

# Add to root crontab (crontab -e)
0 8 * * 1 DISCORD_WEBHOOK="<homelab-alerts-webhook>" /root/scripts/proxmox-backup-check.sh >> /var/log/proxmox-backup-check.log 2>&1
0 7 * * * DISCORD_WEBHOOK="<homelab-alerts-webhook>" /root/scripts/ct302-self-health.sh >> /var/log/ct302-self-health.log 2>&1

Design decisions

Chose option 2 (cron script on CT 302) over n8n: simpler, no n8n dependency, and backup verification should work even when n8n is down.
proxmox-backup-check.sh always posts to Discord (weekly status confirmation), not only on failure.
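ct302-self-health.sh posts only on threshold breach to avoid noise (pass --always-post if a regular status post is wanted; in the daily cron entry above that means a daily post, so a weekly summary would need its own cron entry).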
Both scripts require jq and curl on CT 302; pvesh is invoked remotely via SSH.
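
Because both scripts depend on external tools, a preflight check along these lines could fail fast with a clear log line rather than a mid-run error. `require_cmds` is a hypothetical helper name; nothing in this PR confirms that either script contains such a check.

```bash
# Hypothetical helper: report every missing command, return non-zero if any.
require_cmds() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing required command: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: check the tools named in this PR before doing any work.
if ! require_cmds jq curl; then
  echo "install the missing tools before enabling the cron entries" >&2
fi
```

Checking all commands before returning means one run lists everything that needs installing.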

Claude added 1 commit 2026-04-04 11:08:19 +00:00

feat: add weekly Proxmox backup verification and CT 302 self-health check (#27)

Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s

95bae33309

Closes #27

- proxmox-backup-check.sh: SSHes to Proxmox, queries pvesh task history,
  classifies each running VM/CT as green/yellow/red by backup recency,
  posts a Discord embed summary. Designed for weekly cron on CT 302.

- ct302-self-health.sh: Checks disk usage on CT 302 itself, silently
  exits when healthy, posts a Discord alert when any filesystem exceeds
  80% threshold. Closes the blind spot where the monitoring system
  cannot monitor itself externally.

- Updated monitoring/scripts/CONTEXT.md with full operational docs,
  install instructions, and cron schedules for both new scripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Claude referenced this pull request

2026-04-04 11:08:29 +00:00

Set up weekly Proxmox backup verification → Discord #27

Claude added the

ai-reviewing

label 2026-04-04 11:15:26 +00:00

Claude removed the

ai-reviewing

label 2026-04-04 11:17:25 +00:00

Claude added the

ai-reviewing

label 2026-04-04 11:45:26 +00:00

Claude removed the

ai-reviewing

label 2026-04-04 11:47:28 +00:00

Claude added the

ai-reviewing

label 2026-04-04 12:15:26 +00:00

Claude reviewed 2026-04-04 12:17:25 +00:00

Claude left a comment

AI Code Review

Files Reviewed

monitoring/scripts/proxmox-backup-check.sh (added, 230 lines)
monitoring/scripts/ct302-self-health.sh (added, 158 lines)
monitoring/scripts/CONTEXT.md (modified)

Findings

Correctness

⚠️ proxmox-backup-check.sh — Task field name needs verification (line 119)

The task-matching filter uses .vmid to correlate vzdump tasks to guests:

select(
  (.vmid | tostring) == $g.vmid
  and .status == "OK"
)

The Proxmox GET /nodes/{node}/tasks API returns items where the resource identifier is the id field (a string containing the VMID), not a top-level vmid field. If .vmid is null on every task object, every guest will have $last_ts = 0 and show as 🔴 (no backup found), even when backups exist.

Verify during dry-run: After deploying, run --dry-run and confirm at least some guests appear as 🟢. If all guests show 🔴 unexpectedly, change .vmid to .id in the select() filter.
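
Until the field name is confirmed against a real Proxmox node, one option is to make the filter tolerant of either name by falling back from `.vmid` to `.id` with jq's alternative operator (`//`). The sketch below is illustrative only: `last_ok_backup` is a hypothetical function name, and it assumes the task list is a JSON array whose items carry `status` and `endtime` fields plus either `vmid` or `id`.

```bash
# Hypothetical helper: print the newest endtime of a successful task for a
# VMID, or 0 when none matches. Reads the task list (JSON array) on stdin.
last_ok_backup() {
  local vmid="$1"
  jq -r --arg vmid "$vmid" '
    [ .[]
      | select(((.vmid // .id) | tostring) == $vmid and .status == "OK")
      | .endtime ]
    | max // 0
  '
}

# Example with made-up task data: guest 101 succeeded, guest 102 failed.
tasks='[{"id":"101","status":"OK","endtime":1712000000},
        {"id":"102","status":"error","endtime":1712000500}]'
printf '%s' "$tasks" | last_ok_backup 101
```

The trailing `max // 0` turns "no matching task" into `0`, which matches the review's description of `$last_ts = 0` meaning no backup found.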

ℹ️ proxmox-backup-check.sh — Task limit may be insufficient for frequent backups (line 88)

--limit 200 fetches up to 200 vzdump tasks. For a homelab with ~20 guests on daily backup schedules, and assuming one vzdump task per guest per day, 200 tasks ≈ 10 days of history, which comfortably covers the 7-day window. For larger setups or more frequent backups, consider bumping to --limit 500 or making it configurable.
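
The sizing arithmetic behind that estimate, together with a configurable limit, might look like the sketch below. `TASK_LIMIT` and `required_task_limit` are hypothetical names, and the calculation assumes each guest produces one vzdump task per backup run.

```bash
# Hypothetical: tasks needed to cover the window = guests * runs/day * days.
required_task_limit() {
  local guests="$1" runs_per_day="$2" window_days="$3"
  echo $(( guests * runs_per_day * window_days ))
}

# Assumed env override; 200 is the limit discussed in this review.
TASK_LIMIT="${TASK_LIMIT:-200}"
needed=$(required_task_limit 20 1 7)

if [ "$needed" -gt "$TASK_LIMIT" ]; then
  echo "warning: limit $TASK_LIMIT may miss tasks (need about $needed)" >&2
else
  echo "limit $TASK_LIMIT covers about $needed tasks"
fi
```

With 20 guests, one run per day, and a 7-day window the requirement is 140 tasks, so the current limit leaves some headroom.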

ℹ️ Discord embed field value limit (both scripts)

Discord enforces a 1024-character limit on embed field values. If a large number of guests or filesystems are listed in a single field, the POST will return HTTP 400 and the webhook call will fail. Both scripts use exit 1 on non-2xx, so the failure will be logged. For the current homelab scale this is not an issue, but worth noting if guest count grows.
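
If the guest count does grow, one defensive option is to trim the field to whole lines before building the payload. The helper below is a sketch under stated assumptions: `fit_embed_field` is a hypothetical name, the 32-character reserve for the trailing note is arbitrary, and `${#line}` may count bytes rather than characters depending on shell and locale, which only makes the trim more conservative.

```bash
# Hypothetical helper: copy whole lines from stdin while the running total
# stays within the limit, then note how many lines were left out.
fit_embed_field() {
  local limit="${1:-1024}" reserve=32 used=0 dropped=0 line
  while IFS= read -r line || [ -n "$line" ]; do
    if [ "$dropped" -eq 0 ] && [ $(( used + ${#line} + 1 )) -le $(( limit - reserve )) ]; then
      printf '%s\n' "$line"
      used=$(( used + ${#line} + 1 ))
    else
      dropped=$(( dropped + 1 ))
    fi
  done
  if [ "$dropped" -gt 0 ]; then
    printf '(+%d more not shown)\n' "$dropped"
  fi
  return 0
}

# Example: a short list fits and passes through unchanged.
printf 'vm-101 ok\nvm-102 ok\n' | fit_embed_field 1024
```

Cutting on line boundaries avoids splitting a guest entry or a multi-byte emoji in half.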

Security

No hardcoded credentials. Webhook URL is sourced from env or --discord-webhook CLI arg only. ✅
All JSON payloads are constructed via jq -n with typed --arg/--argjson inputs — no string interpolation into JSON. ✅
SSH_OPTS includes BatchMode=yes (no interactive prompts) and StrictHostKeyChecking=accept-new (safe for internal networks — won't auto-accept changed host keys). ✅
$PROXMOX_NODE is interpolated into the remote SSH command string. In a cron context with a trusted default value this is fine; would be a concern only if the env var were user-controlled. Acceptable for this deployment context.
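
Should the node name ever come from a less trusted source, an allowlist check before it is interpolated into the remote command would close that gap. This is a sketch only: `validate_node_name` is a hypothetical helper, the hostname-style character set is an assumption about what node names look like, and `pve` is used as an illustrative default rather than the script's confirmed one.

```bash
# Hypothetical helper: accept only letters, digits, and inner hyphens.
validate_node_name() {
  case "$1" in
    ''|*[!A-Za-z0-9-]*) return 1 ;;  # empty, or has a character outside the allowlist
    -*|*-)              return 1 ;;  # must not start or end with a hyphen
    *)                  return 0 ;;
  esac
}

PROXMOX_NODE="${PROXMOX_NODE:-pve}"  # "pve" is an assumed default for illustration
if validate_node_name "$PROXMOX_NODE"; then
  echo "node name accepted: $PROXMOX_NODE"
else
  echo "refusing unsafe PROXMOX_NODE value" >&2
fi
```

A real script would likely `exit 1` in the failure branch instead of only logging.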

Style & Conventions

Both scripts follow the project's existing bash patterns: set -uo pipefail, log() timestamp function, --dry-run mode, env var overrides. ✅
Argument parsing with explicit validation ($# -lt 2 checks) is solid.
ct302-self-health.sh: The df --output=source,size,used,avail,pcent,target → awk '{print $1,$5,$6}' pipeline correctly produces source pct% mount triples. The downstream $2 / $3 parsing is correct. ✅
CONTEXT.md updates are thorough: schedule, usage examples, install steps, and dependency callouts all documented. ✅
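
For readers who want to see the disk check in isolation, the df and awk pipeline mentioned above can be approximated as follows. This is a sketch, not the script's code: `over_threshold` is a hypothetical name, the default of 80 mirrors the ≥80% threshold in the PR summary, and it assumes GNU df (for `--output`) and mount points without spaces.

```bash
# Hypothetical helper: read "source pct% mount" triples on stdin and print
# the ones at or above the threshold.
over_threshold() {
  local threshold="${1:-80}"
  awk -v t="$threshold" '{
    pct = $2
    sub(/%$/, "", pct)   # "83%" -> "83"
    if (pct + 0 >= t + 0) print $3 " is at " pct "% (" $1 ")"
  }'
}

# Skip the df header row, keep source/pcent/target, then filter.
df --output=source,size,used,avail,pcent,target 2>/dev/null \
  | awk 'NR > 1 {print $1, $5, $6}' \
  | over_threshold 80
```

Empty output corresponds to the "exit silently when healthy" behaviour described in the summary.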

Suggestions

The format_guest and format_guest_with_age functions in proxmox-backup-check.sh are nearly identical. Could be merged into a single function with an optional age flag, but the current duplication is minimal and clear.
For ct302-self-health.sh, consider adding a brief --help case to the argument parser — makes manual invocation more discoverable.
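
A `--help` case could be as small as the sketch below. The flags shown (`--dry-run`, `--always-post`, `--discord-webhook`) are the ones mentioned in this PR and review, and it is assumed here that all three apply to this script; the `usage` and `parse_args` names, the variable names, and the option descriptions are hypothetical.

```bash
usage() {
  cat <<'EOF'
Usage: ct302-self-health.sh [--dry-run] [--always-post] [--discord-webhook URL]

  --dry-run              print the alert instead of posting it
  --always-post          post even when all filesystems are healthy
  --discord-webhook URL  override the DISCORD_WEBHOOK environment variable
  -h, --help             show this message
EOF
}

parse_args() {
  DRY_RUN=0
  ALWAYS_POST=0
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --dry-run)     DRY_RUN=1 ;;
      --always-post) ALWAYS_POST=1 ;;
      --discord-webhook)
        [ "$#" -ge 2 ] || { echo "--discord-webhook needs a value" >&2; exit 2; }
        DISCORD_WEBHOOK="$2"
        shift ;;
      -h|--help)     usage; exit 0 ;;
      *)             echo "unknown option: $1" >&2; usage >&2; exit 2 ;;
    esac
    shift
  done
}

# Example: run in a subshell so the exit does not end the calling shell.
( parse_args --help )
```

Printing usage on unknown options as well gives the same discoverability when a flag is mistyped.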

Verdict: COMMENT

Both scripts are well-structured and production-ready for a homelab context. The .vmid vs .id task field name is the one item that requires a real Proxmox test to confirm — the --dry-run verification step in the deploy checklist will catch it immediately. Everything else (security, argument handling, JSON construction, logging) is solid. Safe to merge; just run the dry-run on CT 302 before enabling the cron.

Automated review by Claude PR Reviewer

Claude added

ai-reviewed

and removed

ai-reviewing

labels 2026-04-04 12:17:33 +00:00
