feat: weekly Proxmox backup verification → Discord (#27) #48

Open
Claude wants to merge 1 commit from issue/27-set-up-weekly-proxmox-backup-verification-discord into main
Collaborator

Closes #27

Summary

  • proxmox-backup-check.sh — SSHes to Proxmox, queries pvesh task history, classifies each running VM/CT as 🟢/🟡/🔴 by backup recency (7-day default), and posts a Discord embed. Designed for weekly cron on CT 302 (Monday 08:00 UTC).

  • ct302-self-health.sh — Checks disk usage on CT 302 itself; exits silently when healthy, posts a Discord alert when any filesystem hits ≥80%. Closes the blind spot where the monitoring system can't monitor itself externally. Runs daily (07:00 UTC).

  • monitoring/scripts/CONTEXT.md updated with full operational docs, install instructions, and cron entry examples for both scripts.
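The recency classification described above can be sketched roughly as follows. This is a minimal illustration, not the code in proxmox-backup-check.sh: the function name `classify_backup` and the `MAX_AGE_DAYS` variable are hypothetical, and the meaning of 🟡 (a backup exists but is older than the window) is an assumption. Only 🔴 for "no backup found" and the 7-day default come from this PR.

```shell
# Sketch only: names and the yellow rule are assumptions, not the PR's code.
MAX_AGE_DAYS="${MAX_AGE_DAYS:-7}"   # mirrors the "7-day default" in the summary

# classify_backup LAST_TS NOW_TS
# LAST_TS is the epoch time of the newest successful backup, or 0 if none was found.
classify_backup() {
  local last_ts="$1" now_ts="$2"
  local max_age=$(( MAX_AGE_DAYS * 86400 ))
  if (( last_ts == 0 )); then
    echo "red"       # no successful backup found
  elif (( now_ts - last_ts <= max_age )); then
    echo "green"     # backed up within the window
  else
    echo "yellow"    # assumed meaning: backup exists but is older than the window
  fi
}

now=1775000000   # fixed, hypothetical epoch so the example is reproducible
classify_backup 0 "$now"                        # never backed up
classify_backup $(( now - 2 * 86400 )) "$now"   # backed up 2 days ago
classify_backup $(( now - 10 * 86400 )) "$now"  # backed up 10 days ago
```

Keeping the comparison in one small function makes it easy to test the thresholds without SSH access to the Proxmox host.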

Deploy checklist

# On CT 302 (ssh claude-runner)
cp monitoring/scripts/proxmox-backup-check.sh /root/scripts/
cp monitoring/scripts/ct302-self-health.sh /root/scripts/
chmod +x /root/scripts/proxmox-backup-check.sh /root/scripts/ct302-self-health.sh

# Verify dry-run works
PROXMOX_SSH=proxmox /root/scripts/proxmox-backup-check.sh --dry-run
/root/scripts/ct302-self-health.sh --dry-run

# Add to root crontab (crontab -e)
0 8 * * 1 DISCORD_WEBHOOK="<homelab-alerts-webhook>" /root/scripts/proxmox-backup-check.sh >> /var/log/proxmox-backup-check.log 2>&1
0 7 * * * DISCORD_WEBHOOK="<homelab-alerts-webhook>" /root/scripts/ct302-self-health.sh >> /var/log/ct302-self-health.log 2>&1

Design decisions

  • Chose option 2 (cron script on CT 302) over n8n: simpler, no n8n dependency, and backup verification should work even when n8n is down.
  • proxmox-backup-check.sh always posts to Discord (weekly status confirmation), not only on failure.
  • ct302-self-health.sh posts only on threshold breach to avoid noise (use --always-post for weekly summaries if desired).
  • Both scripts require jq and curl on CT 302; pvesh is invoked remotely via SSH.
Claude added 1 commit 2026-04-04 11:08:19 +00:00
feat: add weekly Proxmox backup verification and CT 302 self-health check (#27)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
95bae33309
Closes #27

- proxmox-backup-check.sh: SSHes to Proxmox, queries pvesh task history,
  classifies each running VM/CT as green/yellow/red by backup recency,
  posts a Discord embed summary. Designed for weekly cron on CT 302.

- ct302-self-health.sh: Checks disk usage on CT 302 itself, silently
  exits when healthy, posts a Discord alert when any filesystem exceeds
  80% threshold. Closes the blind spot where the monitoring system
  cannot monitor itself externally.

- Updated monitoring/scripts/CONTEXT.md with full operational docs,
  install instructions, and cron schedules for both new scripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude added the
ai-reviewing
label 2026-04-04 11:15:26 +00:00
Claude removed the
ai-reviewing
label 2026-04-04 11:17:25 +00:00
Claude added the
ai-reviewing
label 2026-04-04 11:45:26 +00:00
Claude removed the
ai-reviewing
label 2026-04-04 11:47:28 +00:00
Claude added the
ai-reviewing
label 2026-04-04 12:15:26 +00:00
Claude reviewed 2026-04-04 12:17:25 +00:00
Claude left a comment
Author
Collaborator

AI Code Review

Files Reviewed

  • monitoring/scripts/proxmox-backup-check.sh (added, 230 lines)
  • monitoring/scripts/ct302-self-health.sh (added, 158 lines)
  • monitoring/scripts/CONTEXT.md (modified)

Findings

Correctness

⚠️ proxmox-backup-check.sh — Task field name needs verification (line 119)

The task-matching filter uses .vmid to correlate vzdump tasks to guests:

select(
  (.vmid | tostring) == $g.vmid
  and .status == "OK"
)

The Proxmox GET /nodes/{node}/tasks API returns items where the resource identifier is the id field (a string containing the VMID), not a top-level vmid field. If .vmid is null on every task object, every guest will have $last_ts = 0 and show as 🔴 (no backup found), even when backups exist.

Verify during dry-run: After deploying, run --dry-run and confirm at least some guests appear as 🟢. If all guests show 🔴 unexpectedly, change .vmid to .id in the select() filter.


ℹ️ proxmox-backup-check.sh — Task limit may be insufficient for frequent backups (line 88)

--limit 200 fetches up to 200 vzdump tasks. For a homelab with ~20 guests on daily backup schedules, and assuming one vzdump task per guest per day, 200 tasks ≈ 10 days of history, which comfortably covers the 7-day window. For larger setups, more frequent backups, or longer windows, consider bumping to --limit 500 or making it configurable.
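The estimate above is plain integer arithmetic, sketched here so other guest counts can be checked quickly. The helper name `days_covered` is hypothetical, and the model assumes one vzdump task per guest per backup run, which may not match how a given Proxmox backup job records its tasks.

```shell
# days_covered LIMIT GUESTS BACKUPS_PER_DAY
# Rough number of whole days of history that LIMIT tasks represent,
# assuming one task per guest per backup run (an assumption, see above).
days_covered() {
  local limit="$1" guests="$2" per_day="$3"
  echo $(( limit / (guests * per_day) ))
}

days_covered 200 20 1   # the review's example: 10 days
days_covered 200 50 1   # 50 guests: only 4 days, short of a 7-day window
days_covered 500 50 1   # raising the limit to 500 restores 10 days
```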


ℹ️ Discord embed field value limit (both scripts)

Discord enforces a 1024-character limit on embed field values. If a large number of guests or filesystems are listed in a single field, the POST will return HTTP 400 and the webhook call will fail. Both scripts use exit 1 on non-2xx, so the failure will be logged. For the current homelab scale this is not an issue, but worth noting if guest count grows.
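One way to avoid the HTTP 400 is to clamp each field value before building the payload. The sketch below is an illustration under assumptions: `clamp_field` and `DISCORD_FIELD_MAX` are hypothetical names, not part of either script. Note that `${#value}` counts characters only under a UTF-8 locale, and how Discord counts multi-byte characters such as emoji has not been verified here.

```shell
DISCORD_FIELD_MAX=1024   # embed field value limit cited in the review

# clamp_field VALUE
# Prints VALUE unchanged if it fits, otherwise cuts it and appends a marker
# so the result is exactly DISCORD_FIELD_MAX characters long.
clamp_field() {
  local value="$1" suffix="... (truncated)"
  if (( ${#value} <= DISCORD_FIELD_MAX )); then
    printf '%s' "$value"
  else
    printf '%s%s' "${value:0:DISCORD_FIELD_MAX - ${#suffix}}" "$suffix"
  fi
}

clamp_field "CT 302: last backup 2 days ago"
echo
```

Truncating loses detail for the guests at the end of the list, so splitting the list across several fields may be preferable if the guest count grows.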


Security

  • No hardcoded credentials. Webhook URL is sourced from env or --discord-webhook CLI arg only.
  • All JSON payloads are constructed via jq -n with typed --arg/--argjson inputs — no string interpolation into JSON.
  • SSH_OPTS includes BatchMode=yes (no interactive prompts) and StrictHostKeyChecking=accept-new (safe for internal networks — won't auto-accept changed host keys).
  • $PROXMOX_NODE is interpolated into the remote SSH command string. In a cron context with a trusted default value this is fine; it would be a concern only if the env var were user-controlled. Acceptable for this deployment context.
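If the interpolation of $PROXMOX_NODE ever needs hardening, a small validation step before the SSH call would be enough. This is a sketch only: the helper `is_safe_node_name` and the default value `pve` are hypothetical, and the pattern simply restricts the value to letters, digits, and inner hyphens.

```shell
# is_safe_node_name NAME
# Succeeds only if NAME is a plain hostname-style label (letters, digits,
# inner hyphens), so it cannot carry shell metacharacters into the remote command.
is_safe_node_name() {
  [[ "$1" =~ ^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?$ ]]
}

PROXMOX_NODE="${PROXMOX_NODE:-pve}"   # "pve" is a hypothetical default

if is_safe_node_name "$PROXMOX_NODE"; then
  echo "node name ok: $PROXMOX_NODE"
else
  echo "refusing to use suspicious node name" >&2
fi
```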

Style & Conventions

  • Both scripts follow the project's existing bash patterns: set -uo pipefail, log() timestamp function, --dry-run mode, env var overrides.
  • Argument parsing with explicit validation ($# -lt 2 checks) is solid.
  • ct302-self-health.sh: The df --output=source,size,used,avail,pcent,target → awk '{print $1,$5,$6}' pipeline correctly produces source pct% mount triples. The downstream $2 / $3 parsing is correct.
  • CONTEXT.md updates are thorough: schedule, usage examples, install steps, and dependency callouts all documented.
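The df and awk pipeline noted above can be exercised against canned input. This sketch is not the script's code: `over_threshold`, `sample_df`, and the sample rows are made up for illustration, and the 80% default follows the PR summary. Mount points containing spaces would break this simple field split.

```shell
THRESHOLD="${THRESHOLD:-80}"   # percent; matches the >=80% rule in the summary

# over_threshold
# Reads "source pct% mount" triples on stdin and prints those at or above THRESHOLD.
over_threshold() {
  local source pct mount
  while read -r source pct mount; do
    pct="${pct%\%}"                        # strip the trailing percent sign
    [[ "$pct" =~ ^[0-9]+$ ]] || continue   # skips the header row
    if (( 10#$pct >= THRESHOLD )); then
      echo "$mount ${pct}% ($source)"
    fi
  done
  return 0
}

# Hypothetical output in the shape of df --output=source,size,used,avail,pcent,target
sample_df() {
  cat <<'EOF'
Filesystem      1K-blocks     Used    Avail Use% Mounted on
/dev/sda1        20511312 17434616  2012788  90% /
tmpfs              102400     1024   101376   1% /run
/dev/sdb1       103081248 82464998 15357046  85% /var/lib/data
EOF
}

sample_df | awk '{print $1,$5,$6}' | over_threshold
```

With the sample rows, only / and /var/lib/data are reported; the tmpfs row stays below the threshold and the header row is ignored.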

Suggestions

  • The format_guest and format_guest_with_age functions in proxmox-backup-check.sh are nearly identical. Could be merged into a single function with an optional age flag, but the current duplication is minimal and clear.
  • For ct302-self-health.sh, consider adding a brief --help case to the argument parser — makes manual invocation more discoverable.

Verdict: COMMENT

Both scripts are well-structured and production-ready for a homelab context. The .vmid vs .id task field name is the one item that requires a real Proxmox test to confirm — the --dry-run verification step in the deploy checklist will catch it immediately. Everything else (security, argument handling, JSON construction, logging) is solid. Safe to merge; just run the dry-run on CT 302 before enabling the cron.


Automated review by Claude PR Reviewer

Claude added
ai-reviewed
and removed
ai-reviewing
labels 2026-04-04 12:17:33 +00:00