CLAUDE: Migrate to technology-first documentation architecture
Complete restructure from patterns/examples/reference to technology-focused directories:

• Created technology-specific directories with comprehensive documentation:
  - /tdarr/ - Transcoding automation with gaming-aware scheduling
  - /docker/ - Container management with GPU acceleration patterns
  - /vm-management/ - Virtual machine automation and cloud-init
  - /networking/ - SSH infrastructure, reverse proxy, and security
  - /monitoring/ - System health checks and Discord notifications
  - /databases/ - Database patterns and troubleshooting
  - /development/ - Programming language patterns (bash, nodejs, python, vuejs)

• Enhanced CLAUDE.md with intelligent context loading:
  - Technology-first loading rules for automatic context provision
  - Troubleshooting keyword triggers for emergency scenarios
  - Documentation maintenance protocols with automated reminders
  - Context window management for optimal documentation updates

• Preserved valuable content from .claude/tmp/:
  - SSH security improvements and server inventory
  - Tdarr CIFS troubleshooting and Docker iptables solutions
  - Operational scripts with proper technology classification

• Benefits achieved:
  - Self-contained technology directories with complete context
  - Automatic loading of relevant documentation based on keywords
  - Emergency-ready troubleshooting with comprehensive guides
  - Scalable structure for future technology additions
  - Eliminated context bloat through targeted loading

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
parent 7edb4a3a9c
commit 10c9e0d854
.claude/settings.json (new file, 7 additions)
@@ -0,0 +1,7 @@
+{
+    "notifications_disabled": true,
+    "allowed_working_directories": [
+        "/mnt/NV2/Development/claude-home",
+        "/mnt/media"
+    ]
+}
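The two keys added above control notifications and the directories the agent may operate in. A small illustrative loader (hypothetical, not part of the commit) showing how such a settings file could be read and a path checked against the allow-list:

```python
import json

# Inline copy of the committed .claude/settings.json for illustration.
SETTINGS = """\
{
    "notifications_disabled": true,
    "allowed_working_directories": [
        "/mnt/NV2/Development/claude-home",
        "/mnt/media"
    ]
}
"""

def is_allowed(path: str, settings: dict) -> bool:
    """True if `path` is one of the allowed working directories or inside one."""
    return any(
        path == d or path.startswith(d.rstrip("/") + "/")
        for d in settings["allowed_working_directories"]
    )

settings = json.loads(SETTINGS)
print(is_allowed("/mnt/media/movies", settings))  # → True
print(is_allowed("/etc", settings))               # → False
```

The prefix check appends a trailing slash before matching so that a sibling path like `/mnt/media-backup` is not accidentally allowed.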
.gitignore (vendored, 1 addition)
@@ -1,2 +1,3 @@
 .claude/tmp/
 tmp/
+__pycache__
CLAUDE.md (315 lines changed)
@@ -8,130 +8,98 @@
 - If creating a temporary file will help achieve your goal, please create the file in the .claude/tmp/ directory and clean up when you're done.
 - Prefer editing an existing file to creating a new one.
 - Following a complex task or series of tasks, prompt the user to save any key learnings from the session.
+- **Documentation Maintenance Reminder**: At the end of coding sessions, proactively ask: "Should I update our documentation to reflect the changes we made today?" Focus on CONTEXT.md files, troubleshooting guides, and any new patterns discovered.
+- **Context Window Management**: When approaching 25% context window remaining, prioritize documentation updates before auto-summarization occurs. Ask: "We're approaching context limits - should I update our documentation now to capture today's work before we lose context?"

 ## Automatic Context Loading Rules

-### File Extension Triggers
-When working with files, automatically load relevant documentation:
+### Technology-First Loading Rules
+When working with specific technologies, automatically load their dedicated context:

-**Python (.py, .pyx, .pyi)**
-- Load: `patterns/python/`
-- Load: `reference/python/`
-- If Django/Flask detected: Load `examples/python/web-frameworks.md`
-- If requests/httpx detected: Load `examples/python/api-clients.md`
+**Tdarr Keywords**
+- "tdarr", "transcode", "ffmpeg", "gpu transcoding", "nvenc", "scheduler", "api"
+- Load: `tdarr/CONTEXT.md` (technology overview and patterns)
+- Load: `tdarr/troubleshooting.md` (error handling and debugging)
+- If working in `/tdarr/scripts/`: Load `tdarr/scripts/CONTEXT.md` (script-specific documentation)
+- Note: Gaming-aware scheduling system with configurable time windows available
+- Note: Comprehensive API monitoring available via `tdarr_monitor.py` with dataclass-based status tracking

-**JavaScript/Node.js (.js, .mjs, .ts)**
-- Load: `patterns/nodejs/`
-- Load: `reference/nodejs/`
-- If package.json exists: Load `examples/nodejs/package-management.md`
+**Docker Keywords**
+- "docker", "container", "image", "compose", "kubernetes", "k8s", "dockerfile", "podman"
+- Load: `docker/CONTEXT.md` (technology overview and patterns)
+- Load: `docker/troubleshooting.md` (error handling and debugging)
+- If working in `/docker/scripts/`: Load `docker/scripts/CONTEXT.md` (script-specific documentation)

-**Vue.js (.vue, vite.config.*, nuxt.config.*)**
-- Load: `patterns/vuejs/`
-- Load: `reference/vuejs/`
-- Load: `examples/vuejs/component-patterns.md`
+**VM Management Keywords**
+- "virtual machine", "vm", "proxmox", "kvm", "hypervisor", "guest", "virtualization"
+- Load: `vm-management/CONTEXT.md` (technology overview and patterns)
+- Load: `vm-management/troubleshooting.md` (error handling and debugging)
+- If working in `/vm-management/scripts/`: Load `vm-management/scripts/CONTEXT.md` (script-specific documentation)

-**Shell Scripts (.sh, .bash, .zsh)**
-- Load: `patterns/bash/`
-- Load: `reference/bash/`
-- If systemd mentioned: Load `examples/bash/service-management.md`
+**Networking Keywords**
+- "network", "nginx", "proxy", "load balancer", "dns", "port", "firewall", "ssh", "ssl", "tls"
+- Load: `networking/CONTEXT.md` (technology overview and patterns)
+- Load: `networking/troubleshooting.md` (error handling and debugging)
+- If working in `/networking/scripts/`: Load `networking/scripts/CONTEXT.md` (script-specific documentation)

-**Docker (Dockerfile, docker-compose.yml, .dockerignore)**
-- Load: `patterns/docker/`
-- Load: `reference/docker/`
-- Load: `examples/docker/multi-stage-builds.md`
+**Monitoring Keywords**
+- "monitoring", "alert", "notification", "discord", "health check", "status", "uptime", "windows reboot", "system monitor"
+- Load: `monitoring/CONTEXT.md` (technology overview and patterns)
+- Load: `monitoring/troubleshooting.md` (error handling and debugging)
+- If working in `/monitoring/scripts/`: Load `monitoring/scripts/CONTEXT.md` (script-specific documentation)
+- Note: Windows desktop monitoring with Discord notifications available
+- Note: Comprehensive Tdarr API monitoring with dataclass-based status tracking

 ### Directory Context Triggers
 When working in specific directories:

-**Docker-related directories (/docker/, /containers/, /compose/)**
-- Load: `patterns/docker/`
-- Load: `examples/docker/`
-- Load: `reference/docker/troubleshooting.md`
+**Technology directories (/tdarr/, /docker/, /vm-management/, /networking/, /monitoring/)**
+- Load: `{technology}/CONTEXT.md` (technology overview)
+- Load: `{technology}/troubleshooting.md` (debugging info)

-**Database directories (/db/, /database/, /mysql/, /postgres/, /mongo/)**
-- Load: `patterns/databases/`
-- Load: `examples/databases/`
-- Load: `reference/databases/`
-
-**Network directories (/network/, /networking/, /nginx/, /traefik/)**
-- Load: `patterns/networking/`
-- Load: `examples/networking/`
-- Load: `reference/networking/troubleshooting.md`
-
-**VM directories (/vm/, /virtual/, /proxmox/, /kvm/)**
-- Load: `patterns/vm-management/`
-- Load: `examples/vm-management/`
-
-**Scripts directory (/scripts/, /scripts/*/)**
-- Load: `patterns/` (relevant to script type)
-- Load: `reference/` (relevant troubleshooting guides)
-- Load: `scripts/*/README.md` (subsystem-specific documentation)
+**Script subdirectories (/tdarr/scripts/, /docker/scripts/, etc.)**
+- Load: `{technology}/CONTEXT.md` (parent technology context)
+- Load: `{technology}/scripts/CONTEXT.md` (script-specific context)
+- Load: `{technology}/troubleshooting.md` (debugging info)
 - Context: Active operational scripts - treat as production code
-- Note: Windows desktop monitoring system available in `scripts/monitoring/windows-desktop/`

-### Keyword Triggers
-When user mentions specific terms, automatically load relevant docs:
+**Legacy directories (for backward compatibility)**
+- `/scripts/tdarr/` → Load Tdarr context files
+- `/scripts/monitoring/` → Load Monitoring context files
+- `/patterns/`, `/examples/`, `/reference/` → Load as before until migration complete

-**Troubleshooting Keywords**
-- "debug", "error", "fail", "broken", "not working", "issue"
-- Load: `reference/{relevant-tech}/troubleshooting.md`
-- Load: `examples/{relevant-tech}/debugging.md`
+### File Extension Triggers
+For programming languages, load general development context:

-**Configuration Keywords**
-- "config", "configure", "setup", "install", "deploy"
-- Load: `patterns/{relevant-tech}/`
-- Load: `examples/{relevant-tech}/configuration.md`
+**Python (.py, .pyx, .pyi)**
+- Load: `development/python-CONTEXT.md` (Python patterns and best practices)
+- If Django/Flask detected: Load `development/web-frameworks-CONTEXT.md`
+- If requests/httpx detected: Load `development/api-clients-CONTEXT.md`

-**Performance Keywords**
-- "slow", "performance", "optimize", "memory", "cpu"
-- Load: `reference/{relevant-tech}/performance.md`
-- Load: `examples/{relevant-tech}/optimization.md`
+**JavaScript/Node.js (.js, .mjs, .ts)**
+- Load: `development/nodejs-CONTEXT.md` (Node.js patterns and best practices)
+- If package.json exists: Load `development/package-management-CONTEXT.md`

-**Security Keywords**
-- "secure", "ssl", "tls", "certificate", "auth", "firewall"
-- Load: `patterns/networking/security.md`
-- Load: `reference/networking/security.md`
+**Shell Scripts (.sh, .bash, .zsh)**
+- Load: `development/bash-CONTEXT.md` (Bash scripting patterns)
+- If systemd mentioned: Load `development/service-management-CONTEXT.md`

-**Database Keywords**
-- "database", "db", "sql", "mysql", "postgres", "mongo", "redis"
-- Load: `patterns/databases/`
-- Load: `examples/databases/`
+### Troubleshooting Keywords
+For troubleshooting scenarios, always load both context and troubleshooting files:

-**Container Keywords**
-- "docker", "container", "image", "compose", "kubernetes", "k8s"
-- Load: `patterns/docker/`
-- Load: `examples/docker/`
+**General Troubleshooting Keywords**
+- "shutdown", "stop", "emergency", "reset", "recovery", "crash", "broken", "not working", "error", "issue", "problem", "debug", "troubleshoot", "fix"
+- If Tdarr context detected: Load `tdarr/CONTEXT.md` AND `tdarr/troubleshooting.md`
+- If Docker context detected: Load `docker/CONTEXT.md` AND `docker/troubleshooting.md`
+- If VM context detected: Load `vm-management/CONTEXT.md` AND `vm-management/troubleshooting.md`
+- If Network context detected: Load `networking/CONTEXT.md` AND `networking/troubleshooting.md`
+- If Monitoring context detected: Load `monitoring/CONTEXT.md` AND `monitoring/troubleshooting.md`

-**Network Keywords**
-- "network", "nginx", "proxy", "load balancer", "dns", "port", "firewall"
-- Load: `patterns/networking/`
-- Load: `examples/networking/`
-
-**SSH Keywords**
-- "ssh", "key", "authentication", "authorized_keys", "ssh-copy-id"
-- Load: `patterns/networking/ssh-key-management.md`
-- Load: `examples/networking/ssh-homelab-setup.md`
-- Load: `reference/networking/ssh-troubleshooting.md`
-
-**VM Keywords**
-- "virtual machine", "vm", "proxmox", "kvm", "hypervisor", "guest"
-- Load: `patterns/vm-management/`
-- Load: `examples/vm-management/`
-
-**Tdarr Keywords**
-- "tdarr", "transcode", "ffmpeg", "gpu transcoding", "nvenc", "forEach error", "gaming detection", "scheduler", "monitoring", "api"
-- Load: `reference/docker/tdarr-troubleshooting.md`
-- Load: `patterns/docker/distributed-transcoding.md`
-- Load: `scripts/tdarr/README.md` (for automation and scheduling)
-- Load: `scripts/monitoring/README.md` (for monitoring and health checks)
-- Note: Gaming-aware scheduling system with configurable time windows available
-- Note: Comprehensive API monitoring available via `tdarr_monitor.py` with dataclass-based status tracking
-
-**Windows Monitoring Keywords**
-- "windows reboot", "discord notification", "system monitor", "windows desktop", "power outage", "windows update"
-- Load: `scripts/monitoring/windows-desktop/README.md`
-- Note: Complete Windows desktop monitoring with Discord notifications for reboots and system events
+**Specific Tdarr Troubleshooting Keywords**
+- "forEach error", "staging timeout", "gaming detection", "plugin error", "container stop", "node disconnect", "cache cleanup", "shutdown tdarr", "stop tdarr", "emergency tdarr", "reset tdarr"
+- Load: `tdarr/CONTEXT.md` (technology overview)
+- Load: `tdarr/troubleshooting.md` (specific solutions including Emergency Recovery section)
+- If working in `/tdarr/scripts/`: Load `tdarr/scripts/CONTEXT.md`

 ### Priority Rules
 1. **File extension triggers** take highest priority
@@ -141,33 +109,132 @@ When user mentions specific terms, automatically load relevant docs:
 5. Always prefer specific over general (e.g., `vuejs/` over `nodejs/`)

 ### Context Loading Behavior
-- Load pattern files first for overview
-- Load relevant examples for implementation details
-- Load reference files for troubleshooting and edge cases
-- Maximum of 3 documentation files per trigger to maintain efficiency
-- If context becomes too large, prioritize most recent/specific files
+- **Technology context first**: Load CONTEXT.md for overview and patterns
+- **Troubleshooting context**: ALWAYS load troubleshooting.md for error scenarios and emergency procedures
+- **Script-specific context**: Load scripts/CONTEXT.md when working in script directories
+- **Examples last**: Load examples for implementation details
+- **Critical rule**: For any troubleshooting scenario, load BOTH context and troubleshooting files to ensure complete information
+- Maximum of 3-4 documentation files per trigger to maintain efficiency while ensuring comprehensive coverage

 ## Documentation Structure

 ```
-/patterns/              # Technology overviews and best practices
-/examples/              # Complete working implementations
-/reference/             # Troubleshooting, cheat sheets, fallback info
-/scripts/               # Active scripts and utilities for home lab operations
-├── tdarr/              # Tdarr automation with gaming-aware scheduling
-├── monitoring/         # System monitoring and alerting
-│   ├── tdarr_monitor.py # Comprehensive Tdarr API monitoring with dataclasses
-│   └── windows-desktop/ # Windows reboot monitoring with Discord notifications
-└── <future>/           # Other organized automation subsystems
-```
-
-Each pattern file should reference relevant examples and reference materials.
+/tdarr/                  # Tdarr transcoding automation
+├── CONTEXT.md           # Technology overview, patterns, best practices
+├── troubleshooting.md   # Error handling and debugging
+├── examples/            # Working configurations and templates
+└── scripts/             # Active automation scripts
+    ├── CONTEXT.md       # Script-specific documentation
+    ├── monitoring.py    # Comprehensive API monitoring with dataclasses
+    └── scheduler.py     # Gaming-aware scheduling system
+
+/docker/                 # Container orchestration and management
+├── CONTEXT.md           # Technology overview, patterns, best practices
+├── troubleshooting.md   # Error handling and debugging
+├── examples/            # Working configurations and templates
+└── scripts/             # Active automation scripts
+    └── CONTEXT.md       # Script-specific documentation
+
+/vm-management/          # Virtual machine operations
+├── CONTEXT.md           # Technology overview, patterns, best practices
+├── troubleshooting.md   # Error handling and debugging
+├── examples/            # Working configurations and templates
+└── scripts/             # Active automation scripts
+    └── CONTEXT.md       # Script-specific documentation
+
+/networking/             # Network configuration and SSH management
+├── CONTEXT.md           # Technology overview, patterns, best practices
+├── troubleshooting.md   # Error handling and debugging
+├── examples/            # Working configurations and templates
+└── scripts/             # Active automation scripts
+    └── CONTEXT.md       # Script-specific documentation
+
+/monitoring/             # System monitoring and alerting
+├── CONTEXT.md           # Technology overview, patterns, best practices
+├── troubleshooting.md   # Error handling and debugging
+├── examples/            # Working configurations and templates
+└── scripts/             # Active automation scripts
+    ├── CONTEXT.md       # Script-specific documentation
+    └── windows-desktop/ # Windows reboot monitoring with Discord notifications
+
+/development/            # Programming language patterns and tools
+├── python-CONTEXT.md    # Python development patterns
+├── nodejs-CONTEXT.md    # Node.js development patterns
+└── bash-CONTEXT.md      # Shell scripting patterns
+
+/legacy/                 # Backward compatibility during migration
+├── patterns/            # Old patterns structure (temporary)
+├── examples/            # Old examples structure (temporary)
+└── reference/           # Old reference structure (temporary)
+```

 ### Directory Usage Guidelines

-- `/scripts/` - Contains actively used scripts for home lab management and operations
-  - Organized by subsystem (e.g., `tdarr/`, `networking/`, `vm-management/`)
-  - Each subsystem includes its own README.md with complete documentation
-- `/examples/` - Contains example configurations and template scripts for reference
-- `/patterns/` - Best practices and architectural guidance
-- `/reference/` - Troubleshooting guides and technical references
+- Each technology directory is self-contained with its own context, troubleshooting, examples, and scripts
+- `CONTEXT.md` files provide technology overview, patterns, and best practices for Claude
+- `troubleshooting.md` files contain error handling and debugging information
+- `/scripts/` subdirectories contain active operational code with their own `CONTEXT.md`
+- `/examples/` subdirectories contain template configurations and reference implementations
+- `/development/` contains general programming language patterns that apply across technologies
+- `/legacy/` provides backward compatibility during the migration from the old structure
+
+## Documentation Maintenance Protocol
+
+### Automated Maintenance Triggers
+Claude Code should automatically prompt for documentation updates when:
+
+1. **New Technology Integration**: When working with a technology that doesn't have a dedicated directory
+   - Prompt: "I notice we're working with [technology] but don't have a dedicated `/[technology]/` directory. Should I create the technology-first structure with CONTEXT.md and troubleshooting.md files?"
+
+2. **New Error Patterns Discovered**: When encountering and solving new issues
+   - Prompt: "We just resolved a [technology] issue that isn't documented. Should I add this solution to `[technology]/troubleshooting.md`?"
+
+3. **New Scripts or Operational Procedures**: When creating new automation or workflows
+   - Prompt: "I created new scripts/procedures for [technology]. Should I update `[technology]/scripts/CONTEXT.md` and add any new operational patterns?"
+
+4. **Session End with Significant Changes**: When completing complex tasks
+   - Prompt: "We made significant changes to [technology] systems. Should I update our documentation to reflect the new patterns, configurations, or troubleshooting procedures we discovered?"
+
+### Documentation Update Checklist
+When "update our documentation" is requested, systematically check:
+
+**Technology-Specific Updates**:
+- [ ] Update `[technology]/CONTEXT.md` with new patterns or architectural changes
+- [ ] Add new troubleshooting scenarios to `[technology]/troubleshooting.md`
+- [ ] Update `[technology]/scripts/CONTEXT.md` for new operational procedures
+- [ ] Add working examples to `[technology]/examples/` if new configurations were created
+
+**Cross-Technology Updates**:
+- [ ] Update main CLAUDE.md loading rules if new keywords or triggers are needed
+- [ ] Add new technology directories to the Documentation Structure section
+- [ ] Update Directory Usage Guidelines if new organizational patterns emerge
+
+**Legacy Cleanup**:
+- [ ] Check if any old patterns/examples/reference files can be migrated to technology directories
+- [ ] Update or remove outdated information that conflicts with new approaches
+
+### Self-Maintenance Features
+
+**Loading Rule Validation**: Periodically verify that:
+- All technology directories have corresponding keyword triggers
+- Troubleshooting keywords include all common error scenarios
+- File paths in loading rules match actual directory structure
+
+**Documentation Completeness Check**: Each technology directory should have:
+- `CONTEXT.md` (overview, patterns, best practices)
+- `troubleshooting.md` (error scenarios, emergency procedures)
+- `examples/` (working configurations)
+- `scripts/CONTEXT.md` (if operational scripts exist)
+
+**Keyword Coverage Analysis**: Ensure loading rules cover:
+- Technology names and common aliases
+- Error types and troubleshooting scenarios
+- Operational keywords (start, stop, configure, monitor)
+- Emergency keywords (shutdown, reset, recovery)
+
+### Warning Triggers
+Claude Code should warn when:
+- Working extensively with a technology that lacks dedicated documentation structure
+- Solving problems that aren't covered in existing troubleshooting guides
+- Creating scripts or procedures without corresponding CONTEXT.md documentation
+- Encountering loading rules that reference non-existent files
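The keyword-trigger rules added to CLAUDE.md above amount to a lookup from prompt keywords to a capped, deduplicated list of context files. A hypothetical sketch of that resolution (the trigger table is a small illustrative subset of the real rules, not part of the commit):

```python
# Map trigger keywords to the documentation files the rules say to load.
TRIGGERS = {
    "tdarr": ["tdarr/CONTEXT.md", "tdarr/troubleshooting.md"],
    "docker": ["docker/CONTEXT.md", "docker/troubleshooting.md"],
    "proxmox": ["vm-management/CONTEXT.md", "vm-management/troubleshooting.md"],
    "nginx": ["networking/CONTEXT.md", "networking/troubleshooting.md"],
    "discord": ["monitoring/CONTEXT.md", "monitoring/troubleshooting.md"],
}

MAX_FILES = 4  # the rules cap each trigger at 3-4 documentation files


def resolve_context(prompt: str) -> list[str]:
    """Return documentation files to load for a prompt, deduplicated and capped."""
    files: list[str] = []
    for keyword, paths in TRIGGERS.items():
        if keyword in prompt.lower():
            files.extend(p for p in paths if p not in files)
    return files[:MAX_FILES]


print(resolve_context("debug Tdarr NVENC errors inside Docker"))
# → ['tdarr/CONTEXT.md', 'tdarr/troubleshooting.md', 'docker/CONTEXT.md', 'docker/troubleshooting.md']
```

Because both the Tdarr and Docker triggers fire, CONTEXT.md and troubleshooting.md are loaded for each, matching the "load BOTH context and troubleshooting files" rule for troubleshooting scenarios.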
databases/troubleshooting.md (new file, 316 additions)
@@ -0,0 +1,316 @@
+# Database Troubleshooting Guide
+
+## Connection Issues
+
+### Cannot Connect to Database
+**Symptoms**: Connection refused, timeout errors, authentication failures
+**Diagnosis**:
+```bash
+# Test basic connectivity
+telnet db-server 3306   # MySQL
+telnet db-server 5432   # PostgreSQL
+nc -zv db-server 6379   # Redis
+
+# Check database service status
+systemctl status mysql
+systemctl status postgresql
+systemctl status redis-server
+```
+
+**Solutions**:
+```bash
+# Restart database services
+sudo systemctl restart mysql
+sudo systemctl restart postgresql
+
+# Check configuration files
+sudo nano /etc/mysql/mysql.conf.d/mysqld.cnf
+sudo nano /etc/postgresql/*/main/postgresql.conf
+
+# Verify port bindings
+ss -tlnp | grep :3306   # MySQL
+ss -tlnp | grep :5432   # PostgreSQL
+```
+
+## Performance Issues
+
+### Slow Query Performance
+**Symptoms**: Long-running queries, high CPU usage, timeouts
+**Diagnosis**:
+```sql
+-- MySQL
+SHOW PROCESSLIST;
+SHOW ENGINE INNODB STATUS;
+EXPLAIN SELECT * FROM table WHERE condition;
+
+-- PostgreSQL
+SELECT * FROM pg_stat_activity;
+EXPLAIN ANALYZE SELECT * FROM table WHERE condition;
+```
+
+**Solutions**:
+```sql
+-- Add missing indexes
+CREATE INDEX idx_column ON table(column);
+
+-- Analyze table statistics
+ANALYZE TABLE table_name;   -- MySQL
+ANALYZE table_name;         -- PostgreSQL
+
+-- Optimize queries
+-- Use LIMIT for large result sets
+-- Add WHERE clauses to filter results
+-- Use appropriate JOIN types
+```
+
+### Memory and Resource Issues
+**Symptoms**: Out of memory errors, swap usage, slow performance
+**Diagnosis**:
+```bash
+# Check memory usage
+free -h
+ps aux | grep mysql
+ps aux | grep postgres
+
+# Database-specific memory usage
+mysqladmin -u root -p status
+sudo -u postgres psql -c "SELECT * FROM pg_stat_database;"
+```
+
+**Solutions**:
+```bash
+# Adjust database memory settings
+# MySQL - /etc/mysql/mysql.conf.d/mysqld.cnf
+innodb_buffer_pool_size = 2G
+key_buffer_size = 256M
+
+# PostgreSQL - /etc/postgresql/*/main/postgresql.conf
+shared_buffers = 256MB
+effective_cache_size = 2GB
+work_mem = 4MB
+```
+
+## Data Integrity Issues
+
+### Corruption Detection and Recovery
+**Symptoms**: Table corruption errors, data inconsistencies
+**Diagnosis**:
+```sql
+-- MySQL
+CHECK TABLE table_name;
+mysqlcheck -u root -p --all-databases
+
+-- PostgreSQL
+-- Check for corruption in logs
+tail -f /var/log/postgresql/postgresql-*.log
+```
+
+**Solutions**:
+```sql
+-- MySQL table repair
+REPAIR TABLE table_name;
+mysqlcheck -u root -p --auto-repair database_name
+
+-- PostgreSQL consistency check
+-- Run VACUUM and REINDEX
+VACUUM FULL table_name;
+REINDEX TABLE table_name;
+```
+
+## Backup and Recovery Issues
+
+### Backup Failures
+**Symptoms**: Backup scripts failing, incomplete backups
+**Diagnosis**:
+```bash
+# Check backup script logs
+tail -f /var/log/backup.log
+
+# Test backup commands manually
+mysqldump -u root -p database_name > test_backup.sql
+pg_dump -U postgres database_name > test_backup.sql
+
+# Check disk space
+df -h /backup/location/
+```
+
+**Solutions**:
+```bash
+# Fix backup script permissions
+chmod +x /path/to/backup-script.sh
+chown backup-user:backup-group /backup/location/
+
+# Automated backup script example
+#!/bin/bash
+BACKUP_DIR="/backups/mysql"
+DATE=$(date +%Y%m%d_%H%M%S)
+
+mysqldump -u root -p$MYSQL_PASSWORD --all-databases > \
+  "$BACKUP_DIR/full_backup_$DATE.sql"
+
+# Compress and rotate backups
+gzip "$BACKUP_DIR/full_backup_$DATE.sql"
+find "$BACKUP_DIR" -name "*.gz" -mtime +7 -delete
+```
+
+## Authentication and Security Issues
+
+### Access Denied Errors
+**Symptoms**: Authentication failures, permission errors
+**Diagnosis**:
+```sql
+-- MySQL
+SELECT user, host FROM mysql.user;
+SHOW GRANTS FOR 'username'@'host';
+
+-- PostgreSQL
+\du   -- List users
+\l    -- List databases
+```
+
+**Solutions**:
+```sql
+-- MySQL user management
+CREATE USER 'newuser'@'localhost' IDENTIFIED BY 'password';
+GRANT ALL PRIVILEGES ON database.* TO 'newuser'@'localhost';
+FLUSH PRIVILEGES;
+
+-- PostgreSQL user management
+CREATE USER newuser WITH PASSWORD 'password';
+GRANT ALL PRIVILEGES ON DATABASE database_name TO newuser;
+```
+
+## Replication Issues
+
+### Master-Slave Replication Problems
+**Symptoms**: Replication lag, sync errors, slave disconnection
+**Diagnosis**:
+```sql
+-- MySQL Master
+SHOW MASTER STATUS;
+
+-- MySQL Slave
+SHOW SLAVE STATUS\G
+
+-- Check replication lag: read the Seconds_Behind_Master field
+-- in the SHOW SLAVE STATUS\G output (it is not a selectable column)
+```
+
+**Solutions**:
+```sql
+-- Reset replication
+STOP SLAVE;
+RESET SLAVE;
+CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4;
+START SLAVE;
+
+-- Fix replication errors
+SET GLOBAL sql_slave_skip_counter = 1;
+START SLAVE;
+```
+
|
## Storage and Disk Issues
|
||||||
|
|
||||||
|
### Disk Space Problems
|
||||||
|
**Symptoms**: Out of disk space errors, database growth
|
||||||
|
**Diagnosis**:
|
||||||
|
```bash
|
||||||
|
# Check database sizes
|
||||||
|
du -sh /var/lib/mysql/*
|
||||||
|
du -sh /var/lib/postgresql/*/main/*
|
||||||
|
|
||||||
|
# Find large tables
|
||||||
|
SELECT table_schema, table_name,
|
||||||
|
ROUND((data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)'
|
||||||
|
FROM information_schema.tables
|
||||||
|
ORDER BY (data_length + index_length) DESC;
|
||||||
|
```
|
||||||
|
|
||||||
|
**Solutions**:
|
||||||
|
```sql
|
||||||
|
-- Clean up large tables
|
||||||
|
DELETE FROM log_table WHERE created_date < DATE_SUB(NOW(), INTERVAL 30 DAY);
|
||||||
|
OPTIMIZE TABLE log_table;
|
||||||
|
|
||||||
|
-- Enable log rotation
|
||||||
|
-- For MySQL binary logs
|
||||||
|
SET GLOBAL expire_logs_days = 7;
|
||||||
|
PURGE BINARY LOGS BEFORE DATE(NOW() - INTERVAL 7 DAY);
|
||||||
|
```
|
||||||
|
|
||||||
|
## Emergency Recovery
|
||||||
|
|
||||||
|
### Database Won't Start
|
||||||
|
**Recovery Steps**:
|
||||||
|
```bash
|
||||||
|
# Check error logs
|
||||||
|
tail -f /var/log/mysql/error.log
|
||||||
|
tail -f /var/log/postgresql/postgresql-*.log
|
||||||
|
|
||||||
|
# Try safe mode start
|
||||||
|
sudo mysqld_safe --skip-grant-tables &
|
||||||
|
|
||||||
|
# Recovery from backup
|
||||||
|
mysql -u root -p < backup_file.sql
|
||||||
|
psql -U postgres database_name < backup_file.sql
|
||||||
|
```
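If InnoDB corruption is what keeps mysqld from starting, MySQL can also be brought up in forced-recovery mode long enough to dump the data. A minimal sketch for `my.cnf` (start at level 1 and raise it only if startup still fails; levels above 4 risk data loss):

```ini
[mysqld]
# Temporary: lets a corrupted InnoDB instance start so data can be dumped
innodb_force_recovery = 1
```

Dump everything with `mysqldump --all-databases`, then remove the setting and restore from the dump.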

### Complete Data Loss Recovery
**Recovery Procedure**:
```bash
# Stop database service
sudo systemctl stop mysql

# Restore from backup (destructive - verify the backup file exists first)
cd /var/lib/mysql
sudo rm -rf ./*
sudo tar -xzf /backups/mysql_full_backup.tar.gz

# Fix permissions
sudo chown -R mysql:mysql /var/lib/mysql
sudo chmod 755 /var/lib/mysql

# Start database
sudo systemctl start mysql
```

## Monitoring and Prevention

### Database Health Monitoring
```bash
#!/bin/bash
# db-health-check.sh

# Check if database is responding
if ! mysqladmin -u root -p"$MYSQL_PASSWORD" ping >/dev/null 2>&1; then
    echo "ALERT: MySQL not responding" | send_alert
fi

# Check disk space
DISK_USAGE=$(df /var/lib/mysql | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 80 ]; then
    echo "ALERT: Database disk usage at ${DISK_USAGE}%" | send_alert
fi

# Check for queries running longer than 100 seconds
LONG_QUERIES=$(mysql -u root -p"$MYSQL_PASSWORD" -e "SHOW PROCESSLIST" | awk '$5 == "Query" && $6 > 100' | wc -l)
if [ "$LONG_QUERIES" -gt 5 ]; then
    echo "ALERT: $LONG_QUERIES long-running queries detected" | send_alert
fi
```

### Automated Maintenance
```bash
#!/bin/bash
# Daily maintenance script

# Optimize and auto-repair tables
mysqlcheck -u root -p"$MYSQL_PASSWORD" --auto-repair --optimize --all-databases

# Update table statistics
mysql -u root -p"$MYSQL_PASSWORD" -e "FLUSH TABLES; ANALYZE TABLE table_name;"

# Backup rotation: delete dumps older than 30 days
find /backups -name "*.sql.gz" -mtime +30 -delete
```

This troubleshooting guide provides systematic approaches to resolving common database issues in home lab environments.

`docker/CONTEXT.md` (new file, 331 lines)

# Docker Container Technology - Technology Context

## Overview
Docker containerization for home lab environments with a focus on performance optimization, GPU acceleration, and distributed workloads. This context covers container architecture patterns, security practices, and production deployment strategies.

## Architecture Patterns

### Container Design Principles
1. **Single Responsibility**: One service per container
2. **Immutable Infrastructure**: Treat containers as replaceable units
3. **Resource Isolation**: Use container limits and cgroups
4. **Security First**: Run as non-root, minimal attack surface
5. **Configuration Management**: Environment variables and external configs
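A compose sketch that applies all five principles in one place (the service name, image tag, and limits are illustrative assumptions):

```yaml
# docker-compose.yml - one service, non-root, resource-limited, externally configured
services:
  api:
    image: myapp:1.4.2          # pinned tag, treated as immutable
    user: "1000:1000"           # non-root
    env_file: .env              # configuration kept outside the image
    read_only: true             # minimal writable attack surface
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: "1"
```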

### Multi-Stage Build Pattern
**Purpose**: Minimize production image size and attack surface
```dockerfile
# Build stage
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production

# Production stage
FROM node:18-alpine AS production
WORKDIR /app
COPY --from=builder /app/node_modules ./node_modules
COPY . .
USER 1000
EXPOSE 3000
CMD ["node", "server.js"]
```

### Distributed Application Architecture
**Pattern**: Server-Node separation with specialized workloads

```
┌─────────────────┐    ┌──────────────────────────────────┐
│  Control Plane  │    │          Worker Nodes            │
│                 │    │  ┌─────────┐  ┌─────────┐        │
│ - Web Interface │◄──►│  │ Node 1  │  │ Node 2  │  ...   │
│ - Job Queue     │    │  │ GPU+CPU │  │ GPU+CPU │        │
│ - Coordination  │    │  │Local SSD│  │Local SSD│        │
│                 │    │  └─────────┘  └─────────┘        │
└─────────────────┘    └──────────────────────────────────┘
         │                            │
         └──────── Shared Storage ────┘
              (NAS/SAN for persistence)
```

## Container Runtime Platforms

### Docker vs Podman Comparison
**Docker**: Traditional daemon-based approach
- Requires Docker daemon running as root
- Centralized container management
- Established ecosystem and tooling

**Podman** (recommended for GPU workloads):
- Daemonless architecture
- Better GPU integration with NVIDIA
- Rootless containers for enhanced security
- Direct systemd integration

### GPU Acceleration Support
**NVIDIA Container Toolkit Integration**:
```bash
# Podman GPU configuration (recommended)
podman run -d --name gpu-workload \
  --device nvidia.com/gpu=all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  myapp:latest

# Docker GPU configuration
docker run -d --name gpu-workload \
  --gpus all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  myapp:latest
```

## Performance Optimization Patterns

### Hybrid Storage Strategy
**Pattern**: Balance performance and persistence for different data types

```yaml
volumes:
  # Local storage (SSD/NVMe) - high performance
  - ./app/data:/app/data       # Database - frequent I/O
  - ./app/configs:/app/configs # Config - startup performance
  - ./app/logs:/app/logs       # Logs - continuous writing
  - ./cache:/cache             # Work directories - temp processing

  # Network storage (NAS) - persistence and backup
  - /mnt/nas/backups:/app/backups # Backups - infrequent access
  - /mnt/nas/media:/media:ro      # Source data - read-only
```

**Benefits**:
- **Local Operations**: Far faster database I/O than network storage
- **Network Reliability**: Critical data protected on redundant storage
- **Cost Optimization**: Expensive fast storage only where needed

### Cache Optimization Hierarchy
```bash
# Performance tiers for different workload types
/dev/shm/cache/    # RAM disk - fastest, volatile, limited size
/mnt/nvme/cache/   # NVMe SSD - 3-7GB/s, persistent, recommended
/mnt/ssd/cache/    # SATA SSD - ~500MB/s, good balance
/mnt/nas/cache/    # Network - ~100MB/s, legacy compatibility
```

### Resource Management
**Container Limits** (prevent resource exhaustion):
```yaml
deploy:
  resources:
    limits:
      memory: 8G
      cpus: '6'
    reservations:
      memory: 4G
      cpus: '2'
```

**Networking Optimization**:
```yaml
# Host networking for performance-critical applications
network_mode: host

# Bridge networking with port mapping (default)
network_mode: bridge
ports:
  - "8080:8080"
```

## Security Patterns

### Container Hardening
```dockerfile
# Use minimal base images
FROM alpine:3.18

# Run as non-root user
RUN addgroup -g 1000 appuser && \
    adduser -u 1000 -G appuser -s /bin/sh -D appuser
USER 1000

# Set secure permissions
COPY --chown=appuser:appuser . /app
```

### Environment Security
```bash
# Secrets management (avoid environment variables for secrets)
podman secret create db_password password.txt
podman run --secret db_password myapp:latest

# Network isolation
podman network create --driver bridge isolated-net
podman run --network isolated-net myapp:latest
```

### Image Security
1. **Vulnerability Scanning**: Regular image scans with tools like Trivy
2. **Version Pinning**: Use specific tags, avoid `latest`
3. **Minimal Images**: Distroless or Alpine base images
4. **Layer Optimization**: Minimize layers, combine RUN commands
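Point 2 can be enforced mechanically; a minimal sketch that flags unpinned or `latest` base images in a Dockerfile (a deliberately simple grep, not a full Dockerfile parser - stage references are not resolved):

```shell
#!/bin/bash
# List FROM lines that use :latest or carry no tag at all
check_pins() {
  grep -E '^FROM ' "$1" | grep -E ':latest|^FROM [^:]+$' || true
}

printf 'FROM node:18-alpine\nFROM ubuntu\n' > /tmp/Dockerfile.demo
check_pins /tmp/Dockerfile.demo   # flags the unpinned "FROM ubuntu"
```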

## Development Workflows

### Local Development Pattern
```yaml
# docker-compose.dev.yml
version: "3.8"
services:
  app:
    build: .
    volumes:
      - .:/app             # Code hot-reload
      - /app/node_modules  # Preserve dependencies
    environment:
      - NODE_ENV=development
    ports:
      - "3000:3000"
```

### Production Deployment Pattern
```bash
# Production container with health checks
podman run -d --name production-app \
  --restart unless-stopped \
  --health-cmd="curl -f http://localhost:3000/health || exit 1" \
  --health-interval=30s \
  --health-timeout=10s \
  --health-retries=3 \
  -p 3000:3000 \
  myapp:v1.2.3
```

## Monitoring and Observability

### Health Check Implementation
```dockerfile
# Application health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1
```
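The resulting health state can be read back from the runtime's inspect output; a minimal sketch (the JSON shape matches what `podman inspect`/`docker inspect` emit when a health check is configured):

```shell
#!/bin/bash
# Pull .State.Health.Status out of inspect JSON
health_status() {
  python3 -c 'import json,sys; print(json.load(sys.stdin)[0]["State"]["Health"]["Status"])'
}

# Real use: podman inspect production-app | health_status
echo '[{"State":{"Health":{"Status":"healthy"}}}]' | health_status   # prints "healthy"
```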

### Log Management
```bash
# Structured logging with log rotation
podman run -d --name app \
  --log-driver journald \
  --log-opt max-size=10m \
  --log-opt max-file=3 \
  myapp:latest

# Centralized logging
podman logs -f app | logger -t myapp
```

### Resource Monitoring
```bash
# Real-time container metrics
podman stats --no-stream app

# Point-in-time memory usage (cgroup v1 path; cgroup v2 uses /sys/fs/cgroup/memory.current)
podman exec app cat /sys/fs/cgroup/memory/memory.usage_in_bytes
```
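Which path exists depends on the cgroup version the host runs; a small sketch that works on either hierarchy (run inside the container; both paths are the standard kernel locations):

```shell
#!/bin/bash
# Current memory usage in bytes: cgroup v2 first, then v1
mem_usage() {
  if [ -f /sys/fs/cgroup/memory.current ]; then
    cat /sys/fs/cgroup/memory.current
  elif [ -f /sys/fs/cgroup/memory/memory.usage_in_bytes ]; then
    cat /sys/fs/cgroup/memory/memory.usage_in_bytes
  else
    echo "cgroup memory stats not found" >&2
    return 1
  fi
}
mem_usage || true
```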

## Common Implementation Patterns

### Database Containers
```yaml
# Persistent database with backup strategy
services:
  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: myapp
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    volumes:
      - postgres_data:/var/lib/postgresql/data  # Persistent data
      - ./backups:/backups                      # Backup mount
    secrets:
      - db_password
```

### Web Application Containers
```yaml
# Multi-tier web application
services:
  frontend:
    image: nginx:alpine
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    ports:
      - "80:80"
      - "443:443"
    depends_on:
      - backend

  backend:
    build: ./api
    environment:
      - DATABASE_URL=postgresql://appuser@postgres/myapp
    depends_on:
      - postgres
```

### GPU-Accelerated Workloads
```bash
# GPU transcoding/processing container
podman run -d --name gpu-processor \
  --device nvidia.com/gpu=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,video \
  -v "/fast-storage:/cache" \
  -v "/media:/input:ro" \
  -v "/output:/output" \
  gpu-app:latest
```

## Best Practices

### Production Deployment
1. **Use specific image tags**: Never use `latest` in production
2. **Implement health checks**: Application and infrastructure monitoring
3. **Resource limits**: Prevent resource exhaustion
4. **Backup strategy**: Regular backups of persistent data
5. **Security scanning**: Regular vulnerability assessments

### Development Guidelines
1. **Multi-stage builds**: Separate build and runtime environments
2. **Environment parity**: Keep dev/staging/prod similar
3. **Configuration externalization**: Use environment variables and secrets
4. **Dependency management**: Pin versions, use lock files
5. **Testing strategy**: Unit, integration, and container tests

### Operational Excellence
1. **Log aggregation**: Centralized logging strategy
2. **Metrics collection**: Application and infrastructure metrics
3. **Alerting**: Proactive monitoring and alerting
4. **Documentation**: Container documentation and runbooks
5. **Disaster recovery**: Backup and recovery procedures

## Migration Patterns

### Legacy Application Containerization
1. **Assessment**: Identify dependencies and requirements
2. **Dockerfile creation**: Start with an appropriate base image
3. **Configuration externalization**: Move configs to environment variables
4. **Data persistence**: Identify and volume-mount data directories
5. **Testing**: Validate functionality in the containerized environment
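A starting-point Dockerfile for steps 2-4, using a hypothetical legacy Python service (the base image, paths, and port are placeholder assumptions, not a prescription):

```dockerfile
# Hypothetical legacy service being containerized
FROM python:3.11-slim
WORKDIR /app

# Step 3: configuration comes from the environment, not baked-in files
ENV APP_CONFIG=/etc/legacy-app/config.ini

COPY . .
RUN pip install --no-cache-dir -r requirements.txt

# Step 4: the data directory becomes a volume mount point
VOLUME /var/lib/legacy-app

USER 1000
EXPOSE 8080
CMD ["python", "app.py"]
```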

### Platform Migration (Docker to Podman)
```bash
# Export Docker container configuration
docker inspect mycontainer > container-config.json

# Recreate with an equivalent Podman run command
podman run -d --name mycontainer \
  --memory 4g \
  --cpus 2 \
  -v /host/path:/container/path \
  myimage:tag
```

This technology context provides comprehensive guidance for implementing Docker containerization strategies in home lab and production environments.

`docker/examples/docker-iptables-troubleshooting-session.md` (new file, 262 lines)


# Docker iptables/nftables Backend Troubleshooting Session

## Session Context
- **Date**: August 8, 2025
- **System**: Nobara PC (Fedora-based gaming distro)
- **User**: cal
- **Working Directory**: `/mnt/NV2/Development/claude-home`
- **Goal**: Get Docker working to run a Tdarr Node container

## System Information
```bash
# OS Details
uname -a
# Linux nobara-pc 6.15.5-200.nobara.fc42.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Jul 6 11:56:20 UTC 2025 x86_64 GNU/Linux

# Hardware
# AMD Ryzen 7 7800X3D 8-Core Processor
# 62GB RAM
# NVIDIA GeForce RTX 4080 SUPER

# Distribution
# Nobara (Fedora 42-based)
```

## Problem Summary
Docker daemon fails to start with a persistent error:
```
failed to start daemon: Error initializing network controller: error obtaining controller instance: failed to register "bridge" driver: failed to create NAT chain DOCKER: COMMAND_FAILED: INVALID_IPV: 'ipv4' is not a valid backend or is unavailable
```

## Root Cause Analysis

### Initial Discovery
1. **Missing iptables**: Docker couldn't find the `iptables` command in PATH
2. **Backend conflict**: System uses nftables but Docker expects iptables-legacy
3. **Package inconsistency**: `iptables-nft` package installed but the binary was missing initially

### Key Findings
- `dnf list installed | grep -i iptables` initially returned nothing
- `firewalld` and `nftables` services were both inactive
- `iptables-nft` package was installed but `/usr/bin/iptables` didn't exist
- After reinstall, iptables worked but used the nftables backend
- NAT table incompatible: `iptables v1.8.11 (nf_tables): table 'nat' is incompatible, use 'nft' tool.`

## Troubleshooting Steps Performed

### Step 1: Package Investigation
```bash
# Check installed iptables packages
dnf list installed | grep -i iptables
# Result: No matching packages (surprising!)

# Check service status
systemctl status nftables         # inactive (dead)
firewall-cmd --get-backend-type   # firewalld not running

# Check if iptables binary exists
which iptables                    # not found
/usr/bin/iptables --version       # No such file or directory
```

### Step 2: Package Reinstallation
```bash
# Reinstall iptables-nft package
sudo dnf reinstall -y iptables-nft

# Verify installation
rpm -ql iptables-nft | grep bin
# Shows /usr/bin/iptables should exist

# Test after reinstall
iptables --version
# Result: iptables v1.8.11 (nf_tables) - SUCCESS!
```

### Step 3: Backend Compatibility Testing
```bash
# Test NAT table access
sudo iptables -t nat -L
# Error: iptables v1.8.11 (nf_tables): table `nat' is incompatible, use 'nft' tool.
```

### Step 4: Legacy Backend Installation
```bash
# Install iptables-legacy
sudo dnf install -y iptables-legacy iptables-legacy-libs

# Set up alternatives system
sudo alternatives --install /usr/bin/iptables iptables /usr/bin/iptables-legacy 10
sudo alternatives --install /usr/bin/ip6tables ip6tables /usr/bin/ip6tables-legacy 10

# Test NAT table with legacy backend
sudo iptables -t nat -L
# SUCCESS: Shows empty NAT chains
```

### Step 5: Docker Restart Attempts
```bash
# Remove NVIDIA daemon.json config (potential conflict)
sudo rm -f /etc/docker/daemon.json

# Load NAT kernel module explicitly
sudo modprobe iptable_nat

# Try starting firewalld (in case Docker needs it)
sudo systemctl enable --now firewalld

# Multiple restart attempts
sudo systemctl start docker
# ALL FAILED with same NAT chain error
```

## Current State
- ✅ iptables-legacy installed and configured
- ✅ NAT table accessible via `iptables -t nat -L`
- ✅ All required kernel modules should be available
- ❌ Docker still fails with NAT chain creation error
- ❌ Same error persists despite backend switch

## Analysis of Persistent Issue

### Potential Causes
1. **Kernel State Contamination**: nftables rules/chains may still be active in kernel memory
2. **Module Loading Order**: iptables vs nftables modules loaded in conflicting order
3. **Docker Caching**: Docker may be caching the old backend detection
4. **Firewall Integration**: Docker + firewalld interaction on Fedora/Nobara
5. **System-Level Backend Selection**: Some system-wide iptables backend lock

### Evidence Supporting Kernel State Theory
- Error message is identical across all restart attempts
- The iptables command works fine manually
- NAT table lists properly but Docker can't create chains
- Issue persists despite configuration changes
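One quick check worth running between attempts: confirm which backend the `iptables` binary actually reports. A minimal sketch (the version-string format is what iptables 1.8.x prints):

```shell
#!/bin/bash
# Extract "(nf_tables)" or "(legacy)" from `iptables --version` output
parse_backend() {
  grep -oE '\((nf_tables|legacy)\)' | tr -d '()'
}

# Real use: iptables --version | parse_backend
echo 'iptables v1.8.11 (nf_tables)' | parse_backend   # prints "nf_tables"
```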

## Next Session Action Plan

### Immediate Steps After System Reboot
1. **Verify Backend Status**:
```bash
iptables --version        # Should show legacy
sudo iptables -t nat -L   # Should show clean NAT table
```

2. **Check Kernel Modules**:
```bash
lsmod | grep -E "(iptable|nf_|ip_tables)"
# `modprobe -l` was removed from modern kmod; list available modules instead:
find /lib/modules/$(uname -r) -name '*iptable*' -o -name '*nf_table*'
```

3. **Test Docker Start**:
```bash
sudo systemctl start docker
docker --version
```

### If Issue Persists After Reboot

#### Alternative Approach 1: Docker Configuration Override
```bash
# Create daemon.json to disable iptables management
sudo mkdir -p /etc/docker
cat <<EOF | sudo tee /etc/docker/daemon.json
{
  "iptables": false,
  "bridge": "none"
}
EOF

sudo systemctl start docker
```

#### Alternative Approach 2: Podman as Docker Alternative
```bash
# Install podman as a Docker drop-in replacement
sudo dnf install -y podman podman-docker

# Test with the Tdarr container
podman run --rm ghcr.io/haveagitgat/tdarr_node:latest --help
```

#### Alternative Approach 3: Docker Desktop
```bash
# Consider Docker Desktop for Linux (handles networking differently)
# May bypass system iptables issues entirely
```

#### Alternative Approach 4: Deep System Cleanup
```bash
# Nuclear option: remove all networking packages and reinstall
sudo dnf remove -y iptables* nftables firewalld
sudo dnf install -y iptables-legacy iptables-nft firewalld
sudo dnf reinstall -y docker-ce
```

### Diagnostic Commands for Next Session
```bash
# Full network state capture
ip addr show
ip route show
sudo iptables-save > /tmp/iptables-state.txt
sudo nft list ruleset > /tmp/nft-state.txt

# Docker troubleshooting
sudo dockerd --debug --log-level=debug > /tmp/docker-debug.log 2>&1 &
# Kill after 30 seconds and examine the log

# System journal deep dive
journalctl -u docker.service --since="1 hour ago" -o verbose > /tmp/docker-journal.log
```

## Known Working Configuration Target

### Expected Working State
- **iptables**: Legacy backend active
- **Docker**: Running with NAT chain creation successful
- **Network**: Docker bridge network functional
- **Containers**: Can start and access the network

### Tdarr Node Test Command
```bash
cd ~/docker/tdarr-node
# Update IP in compose file first:
# serverIP=<TDARR_SERVER_IP>
docker-compose -f tdarr-node-basic.yml up -d
```

## Related Documentation Created
- `/patterns/docker/gpu-acceleration.md` - GPU troubleshooting patterns
- `/reference/docker/nvidia-troubleshooting.md` - NVIDIA container toolkit
- `/examples/docker/tdarr-node-local/` - Working configurations

## System Context Notes
- This is a gaming-focused Nobara distribution
- May have different default networking than standard Fedora
- NVIDIA drivers already working (nvidia-smi functional)
- System has run other Docker containers successfully in the past
- Recent NVIDIA container toolkit installation may have triggered the issue

## Success Criteria for Next Session
1. Docker service starts without errors
2. `docker ps` command works
3. Simple container can run: `docker run --rm hello-world`
4. Tdarr node container can start (even if it can't connect to the server yet)
5. Network connectivity from containers works

## Escalation Options
If standard troubleshooting fails:
1. **Nobara Community**: Check Nobara Discord/forums for similar issues
2. **Docker Desktop**: Use a different Docker implementation
3. **Podman Migration**: Switch to podman as a Docker replacement
4. **System Reinstall**: Fresh OS install (nuclear option)
5. **Container Alternatives**: LXC/systemd containers instead of Docker

## Files to Check Next Session
- `/etc/docker/daemon.json` - Docker configuration
- `/var/log/docker.log` - Docker service logs
- `~/.docker/config.json` - User Docker config
- `/proc/sys/net/ipv4/ip_forward` - IP forwarding enabled
- `/etc/systemd/system/docker.service.d/` - Service overrides

---
*End of troubleshooting session log*

`docker/troubleshooting.md` (new file, 466 lines)


# Docker Container Troubleshooting Guide

## Container Startup Issues

### Container Won't Start
**Check container logs first**:
```bash
# Docker
docker logs <container_name>
docker logs --tail 50 -f <container_name>

# Podman
podman logs <container_name>
podman logs --tail 50 -f <container_name>
```

### Common Startup Failures

#### Port Conflicts
**Symptoms**: `bind: address already in use` error
**Solution**:
```bash
# Find the conflicting process (ss is the modern replacement for netstat)
sudo ss -tulpn | grep <port>
sudo netstat -tulpn | grep <port>
docker ps | grep <port>

# Change port mapping
docker run -p 8081:8080 myapp   # Use different host port
```

#### Permission Errors
**Symptoms**: `permission denied` when accessing files/volumes
**Solutions**:
```bash
# Check file ownership
ls -la /host/volume/path

# Fix ownership (match container user)
sudo chown -R 1000:1000 /host/volume/path

# Use correct UID/GID in container
docker run -e PUID=1000 -e PGID=1000 myapp
```

#### Missing Environment Variables
**Symptoms**: Application fails with configuration errors
**Diagnostic**:
```bash
# Check container environment
docker exec -it <container> env
docker exec -it <container> printenv

# Verify required variables are set
docker inspect <container> | grep -A 20 "Env"
```

#### Resource Constraints
**Symptoms**: Container killed or OOM errors
**Solutions**:
```bash
# Check resource usage
docker stats <container>

# Increase memory limit
docker run -m 4g myapp

# Check system resources
free -h
df -h
```

### Debug Running Containers
```bash
# Access container shell
docker exec -it <container> /bin/bash
docker exec -it <container> /bin/sh   # if bash not available

# Check container processes
docker exec <container> ps aux

# Check container filesystem
docker exec <container> ls -la /app
```

## Build Issues

### Build Failures
**Clear build cache when encountering issues**:
```bash
# Docker
docker system prune -a
docker builder prune

# Podman
podman system prune -a
podman image prune -a
```

### Verbose Build Output
```bash
# Docker
docker build --progress=plain --no-cache .

# Podman (global --log-level flag; --layers=false also disables layer caching)
podman --log-level=debug build --layers=false .
```

### Common Build Problems

#### COPY/ADD Errors
**Issue**: Files not found during build
**Solutions**:
```dockerfile
# Check the .dockerignore file first
# Verify file paths are relative to the build context
COPY ./src /app/src        # ✅ Correct
COPY /absolute/path /app   # ❌ Wrong - no absolute host paths
```

#### Package Installation Failures
**Issue**: apt/yum/dnf package installation fails
**Solutions**:
```dockerfile
# Update package lists first
RUN apt-get update && apt-get install -y package-name

# Combine RUN commands to reduce layers
RUN apt-get update && \
    apt-get install -y package1 package2 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
```

#### Network Issues During Build
**Issue**: Cannot reach package repositories
**Solutions**:
```bash
# Use the host network during the build
docker build --network host .

# Custom DNS: `docker build` has no --dns flag; set it daemon-wide instead
# in /etc/docker/daemon.json: { "dns": ["8.8.8.8"] }
sudo systemctl restart docker
```
|
||||||
|
|
||||||
|
## GPU Container Issues
|
||||||
|
|
||||||
|
### NVIDIA GPU Support Problems
|
||||||
|
|
||||||
|
#### Docker Desktop vs Podman on Fedora/Nobara
|
||||||
|
**Issue**: Docker Desktop has GPU compatibility issues on Fedora-based systems
|
||||||
|
**Symptoms**:
|
||||||
|
- `CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected`
|
||||||
|
- `unknown or invalid runtime name: nvidia`
|
||||||
|
- Device nodes exist but CUDA fails to initialize
|
||||||
|
|
||||||
|
**Solution**: Use Podman instead of Docker on Fedora systems
|
||||||
|
```bash
|
||||||
|
# Verify host GPU works
|
||||||
|
nvidia-smi
|
||||||
|
|
||||||
|
# Test with Podman (recommended)
|
||||||
|
podman run --rm --device nvidia.com/gpu=all ubuntu:20.04 nvidia-smi
|
||||||
|
|
||||||
|
# Test with Docker (may fail on Fedora)
|
||||||
|
docker run --rm --gpus all ubuntu:20.04 nvidia-smi
|
||||||
|
```
|
||||||
|
|
||||||
|
#### GPU Container Configuration
**Working Podman GPU template**:
```bash
podman run -d --name gpu-container \
  --device nvidia.com/gpu=all \
  --restart unless-stopped \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  myapp:latest
```

**Working Docker GPU template**:
```bash
docker run -d --name gpu-container \
  --gpus all \
  --restart unless-stopped \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  myapp:latest
```
#### GPU Troubleshooting Steps
1. **Verify Host GPU Access**:
   ```bash
   nvidia-smi            # Should show GPU info
   lsmod | grep nvidia   # Should show nvidia modules
   ls -la /dev/nvidia*   # Should show device files
   ```

2. **Check NVIDIA Container Toolkit**:
   ```bash
   rpm -qa | grep nvidia-container-toolkit   # Fedora/RHEL
   dpkg -l | grep nvidia-container-toolkit   # Ubuntu/Debian
   nvidia-ctk --version
   ```

3. **Test GPU in Container**:
   ```bash
   # Should show GPU information
   podman exec gpu-container nvidia-smi

   # Confirm the GPU is enumerable inside the container
   podman exec gpu-container nvidia-smi --query-gpu=name --format=csv
   ```
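The three layers above can be chained into a small preflight helper that stops at the first failing layer, so it is immediately clear where the host-to-container chain breaks. This is a sketch with hypothetical function names (`check_layer`, `gpu_preflight`), not part of any toolkit:

```bash
#!/usr/bin/env bash
# Sketch: run each GPU verification layer in order and stop at the first
# failure. check_layer prints PASS/FAIL for the named check and returns
# the wrapped command's success status.
check_layer() {
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $name"
  else
    echo "FAIL: $name"
    return 1
  fi
}

gpu_preflight() {
  check_layer "host driver (nvidia-smi)" nvidia-smi || return 1
  check_layer "kernel modules loaded" sh -c 'lsmod | grep -q nvidia' || return 1
  check_layer "container toolkit (nvidia-ctk)" nvidia-ctk --version || return 1
  echo "Host looks ready - try the container tests next."
}
```

If `gpu_preflight` fails at the first layer, the problem is the driver install, not the container runtime.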

#### Platform-Specific GPU Notes
**Fedora/Nobara/RHEL**:
- ✅ Podman: Works out-of-the-box with GPU support
- ❌ Docker Desktop: Known GPU integration issues
- Solution: Use Podman for GPU workloads

**Ubuntu/Debian**:
- ✅ Docker: Generally works well with proper NVIDIA toolkit setup
- ✅ Podman: Also works well
- Solution: Either runtime typically works

## Performance Issues

### Resource Monitoring
**Real-time resource usage**:
```bash
# Overall container stats
docker stats
podman stats

# Inside container analysis
docker exec <container> top
docker exec <container> free -h
docker exec <container> df -h

# Network usage
docker exec <container> netstat -i
```

### Image Size Optimization
**Analyze image layers**:
```bash
# Check image sizes
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}"

# Analyze layer history
docker history <image>

# Find large files in container
docker exec <container> du -sh /* | sort -hr
```

**Optimization strategies**:
```dockerfile
# Use multi-stage builds
FROM node:18 AS builder
# ... build steps ...

FROM node:18-alpine AS production
COPY --from=builder /app/dist /app
# Smaller final image

# Combine RUN commands
RUN apt-get update && \
    apt-get install -y package && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
```

Keep the build context lean with a `.dockerignore` file:
```
node_modules
.git
*.log
```

### Storage Performance Issues
**Slow volume performance**:
```bash
# Test volume I/O performance
docker exec <container> dd if=/dev/zero of=/volume/test bs=1M count=1000

# Check volume mount options
docker inspect <container> | grep -A 10 "Mounts"

# Consider using tmpfs for temporary data
docker run --tmpfs /tmp myapp
```

## Network Debugging

### Network Connectivity Issues
**Inspect network configuration**:
```bash
# List networks
docker network ls
podman network ls

# Inspect specific network
docker network inspect <network_name>

# Check container networking
docker exec <container> ip addr show
docker exec <container> ip route show
```

### Service Discovery Problems
**Test connectivity between containers**:
```bash
# Test by container name (name resolution works on user-defined networks)
docker exec container1 ping container2

# Test by IP address
docker exec container1 ping 172.17.0.3

# Check DNS resolution
docker exec container1 nslookup container2
```

### Port Binding Issues
**Verify port mappings**:
```bash
# Check exposed ports
docker port <container>

# Test external connectivity
curl localhost:8080

# Check if port is bound to all interfaces
netstat -tulpn | grep :8080
```

## Emergency Recovery

### Complete Container Reset
**Remove all containers and start fresh**:
```bash
# Stop all containers
docker stop $(docker ps -q)
podman stop --all

# Remove all stopped containers
docker container prune -f
podman container prune -f

# Remove all unused images
docker image prune -a -f
podman image prune -a -f

# Remove all unused volumes (CAUTION: data loss)
docker volume prune -f
podman volume prune -f

# Complete system cleanup
docker system prune -a --volumes -f
podman system prune -a --volumes -f
```

### Container Recovery
**Recover from a corrupted container**:
```bash
# Create backup of container data
docker cp <container>:/important/data ./backup/

# Export container filesystem
docker export <container> > container-backup.tar

# Import and restart
docker import container-backup.tar new-image:latest
docker run -d --name new-container new-image:latest
```

### Data Recovery
**Recover data from volumes**:
```bash
# List volumes
docker volume ls

# Inspect volume location
docker volume inspect <volume_name>

# Access volume data directly
sudo ls -la /var/lib/docker/volumes/<volume_name>/_data

# Mount volume to temporary container
docker run --rm -v <volume_name>:/data alpine ls -la /data
```

## Health Check Issues

### Container Health Checks
**Implement health checks**:
```dockerfile
# Dockerfile health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:3000/health || exit 1
```

**Debug health check failures**:
```bash
# Check health status
docker inspect <container> | grep -A 10 Health

# Manual health check test
docker exec <container> curl -f http://localhost:3000/health

# Watch health_status events for the container
docker events --filter container=<container>
```

## Log Analysis

### Log Management
**View and manage container logs**:
```bash
# View recent logs
docker logs --tail 100 <container>

# Follow logs in real-time
docker logs -f <container>

# Logs with timestamps
docker logs -t <container>

# Search logs for errors
docker logs <container> 2>&1 | grep ERROR
```

### Log Rotation Issues
**Configure log rotation to prevent disk filling**:
```bash
# Run with log size limits
docker run --log-opt max-size=10m --log-opt max-file=3 myapp

# Check log file sizes
sudo du -sh /var/lib/docker/containers/*/
```
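The same limits can be made the default for every new container by setting them daemon-wide; this is Docker's standard `/etc/docker/daemon.json` mechanism (restart the daemon after editing):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```

Note this applies only to containers created after the change; existing containers keep their original log configuration.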

## Platform-Specific Issues

### Fedora/Nobara/RHEL Systems
- **GPU Support**: Use Podman instead of Docker Desktop
- **SELinux**: Volume mounts may need the `:z`/`:Z` relabeling suffixes
- **Firewall**: Configure firewalld for container networking

### Ubuntu/Debian Systems
- **AppArmor**: May restrict container operations
- **Snap Docker**: May have permission issues vs native package

### General Linux Issues
- **cgroups v2**: Some older containers need cgroups v1
- **User namespaces**: May cause UID/GID mapping issues
- **systemd**: Integration differences between Docker/Podman

## Prevention Best Practices

1. **Resource Limits**: Always set memory and CPU limits
2. **Health Checks**: Implement application health monitoring
3. **Log Rotation**: Configure to prevent disk space issues
4. **Security Scanning**: Regular vulnerability scans
5. **Backup Strategy**: Regular data and configuration backups
6. **Testing**: Test containers in staging before production
7. **Documentation**: Document container configurations and dependencies
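Practices 1-3 can be captured in one place with Compose; the service name and specific values below are illustrative, not recommended limits:

```yaml
services:
  myapp:
    image: myapp:latest
    mem_limit: 512m        # practice 1: memory cap
    cpus: 0.5              # practice 1: CPU cap
    healthcheck:           # practice 2: app health monitoring
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 3s
      retries: 3
    logging:               # practice 3: bounded log growth
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"
```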

This troubleshooting guide covers the most common Docker and Podman container issues encountered in home lab and production environments.

172 legacy/old-scripts-README.md Normal file
@@ -0,0 +1,172 @@
# Scripts Directory

This directory contains operational scripts and utilities for home lab management and automation.

## Directory Structure

```
scripts/
├── README.md          # This documentation
├── tdarr_monitor.py   # Enhanced Tdarr monitoring with Discord alerts
├── tdarr/             # Tdarr automation and scheduling
├── monitoring/        # System monitoring and alerting
└── <future>/          # Other organized automation subsystems
```

## Scripts Overview

### `tdarr_monitor.py` - Enhanced Tdarr Monitoring

**Description**: Comprehensive Tdarr monitoring script with stuck job detection and Discord notifications.

**Features**:
- 📊 Complete Tdarr system monitoring (server, nodes, queue, libraries)
- 🧠 Short-term memory for stuck job detection
- 🚨 Discord notifications with rich embeds
- 💾 Persistent state management
- ⚙️ Configurable thresholds and alerts

**Quick Start**:
```bash
# Basic monitoring
python3 scripts/tdarr_monitor.py --server http://10.10.0.43:8265 --check all

# Enable stuck job detection with a 15-minute threshold
python3 scripts/tdarr_monitor.py --server http://10.10.0.43:8265 \
    --check nodes --detect-stuck --stuck-threshold 15

# Full monitoring with Discord alerts (uses default webhook)
python3 scripts/tdarr_monitor.py --server http://10.10.0.43:8265 \
    --check all --detect-stuck --discord-alerts

# Test Discord integration (uses default webhook)
python3 scripts/tdarr_monitor.py --server http://10.10.0.43:8265 --discord-test
```

**CLI Options**:
```
--server            Tdarr server URL (required)
--check             Type of check: all, status, queue, nodes, libraries, stats, health
--timeout           Request timeout in seconds (default: 30)
--output            Output format: json, pretty (default: pretty)
--verbose           Enable verbose logging
--detect-stuck      Enable stuck job detection
--stuck-threshold   Minutes before a job is considered stuck (default: 30)
--memory-file       Path to memory state file (default: .claude/tmp/tdarr_memory.pkl)
--clear-memory      Clear memory state and exit
--discord-webhook   Discord webhook URL for notifications (default: configured)
--discord-alerts    Enable Discord alerts for stuck jobs
--discord-test      Send test Discord message and exit
```

**Memory Management**:
- **Persistent State**: Worker snapshots saved to `.claude/tmp/tdarr_memory.pkl`
- **Automatic Cleanup**: Removes tracking for disappeared workers
- **Error Recovery**: Graceful handling of corrupted memory files
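The snapshot-compare idea behind the stuck detection can be sketched in a few lines of shell: persist `worker:percent:epoch` records and flag a worker whose percentage has not moved within the threshold. All names and the state format here are illustrative; the real script does this in Python against the Tdarr API.

```bash
#!/usr/bin/env bash
# Sketch of stuck-job detection via snapshots (illustrative names).
STATE_FILE="${STATE_FILE:-/tmp/tdarr_state}"
THRESHOLD_MIN="${THRESHOLD_MIN:-30}"

check_stuck() { # args: worker current_percent now_epoch
  local worker="$1" pct="$2" now="$3" prev prev_pct prev_ts
  prev=$(grep "^$worker:" "$STATE_FILE" 2>/dev/null | tail -n 1)
  if [ -n "$prev" ]; then
    prev_pct=$(echo "$prev" | cut -d: -f2)
    prev_ts=$(echo "$prev" | cut -d: -f3)
    # Same percentage for longer than the threshold -> stuck
    if [ "$prev_pct" = "$pct" ] && [ $(( (now - prev_ts) / 60 )) -ge "$THRESHOLD_MIN" ]; then
      echo "STUCK: $worker at ${pct}%"
      return 0
    fi
  fi
  # Record the new snapshot for the next run
  echo "$worker:$pct:$now" >> "$STATE_FILE"
  echo "OK: $worker"
}
```

Each run compares against the last recorded snapshot, which is the same role `tdarr_memory.pkl` plays for the Python script.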

**Discord Features**:
- **Two Message Types**: Simple content messages and rich embeds
- **Stuck Job Alerts**: Detailed embed notifications with file info, progress, duration
- **System Status**: Health summaries with node details and color-coded status
- **Customizable**: Colors, fields, titles, descriptions fully configurable
- **Error Handling**: Graceful failures without breaking monitoring

**Integration Examples**:

*Cron Job for Regular Monitoring* (crontab entries must be a single line; `\` is not a continuation character in cron):
```bash
# Check every 15 minutes, alert on stuck jobs over 30 minutes
*/15 * * * * cd /path/to/claude-home && python3 scripts/tdarr_monitor.py --server http://10.10.0.43:8265 --check nodes --detect-stuck --discord-alerts
```

*Systemd Service and Timer* (a `[Timer]` section belongs in its own `.timer` unit, not in the service file; unit file names are illustrative):
```ini
# tdarr-monitor.service
[Unit]
Description=Tdarr Monitor
After=network.target

[Service]
Type=oneshot
WorkingDirectory=/path/to/claude-home
User=your-user
ExecStart=/usr/bin/python3 /path/to/claude-home/scripts/tdarr_monitor.py \
    --server http://10.10.0.43:8265 --check all --detect-stuck --discord-alerts

# tdarr-monitor.timer
[Unit]
Description=Run Tdarr Monitor every 15 minutes

[Timer]
OnCalendar=*:0/15
Persistent=true

[Install]
WantedBy=timers.target
```
**API Data Classes**:
The script uses strongly-typed dataclasses for all API responses:
- `ServerStatus` - Server health and version info
- `NodeStatus` - Node details with stuck job tracking
- `QueueStatus` - Transcoding queue statistics
- `LibraryStatus` - Library scan progress
- `StatisticsStatus` - Overall system statistics
- `HealthStatus` - Comprehensive health check results

**Error Handling**:
- Network timeouts and connection errors
- API endpoint failures
- JSON parsing errors
- Discord webhook failures
- Memory state corruption
- Missing dependencies

**Dependencies**:
- `requests` - HTTP client for API calls
- `pickle` - State serialization (standard library)
- No external requirements beyond `requests`

---

## Development Guidelines

### Adding New Scripts

1. **Location**: Place scripts in appropriate subdirectories by function
2. **Documentation**: Include comprehensive docstrings and usage examples
3. **Error Handling**: Implement robust error handling and logging
4. **Configuration**: Use CLI arguments and/or config files for flexibility
5. **Testing**: Include test functionality where applicable

### Naming Conventions

- Use descriptive names: `tdarr_monitor.py`, not `monitor.py`
- Use underscores for Python scripts: `system_health.py`
- Use hyphens for shell scripts: `backup-system.sh`

### Directory Organization

Create subdirectories for related functionality:
```
scripts/
├── monitoring/    # System monitoring scripts
├── backup/        # Backup and restore utilities
├── network/       # Network management tools
├── containers/    # Docker/Podman management
└── maintenance/   # System maintenance tasks
```
---

## Future Enhancements

### Planned Features
- **Email Notifications**: SMTP integration for email alerts
- **Prometheus Metrics**: Export metrics for Grafana dashboards
- **Webhook Actions**: Trigger external actions on stuck jobs
- **Multi-Server Support**: Monitor multiple Tdarr instances
- **Configuration Files**: YAML/JSON config file support

### Contributing
1. Follow existing code style and patterns
2. Add comprehensive documentation
3. Include error handling and logging
4. Test thoroughly before committing
5. Update this README with new scripts

142 monitoring/CONTEXT.md Normal file
@@ -0,0 +1,142 @@
# System Monitoring and Alerting - Technology Context

## Overview
Comprehensive monitoring and alerting system for home lab infrastructure with a focus on automated health checks, Discord notifications, and proactive system maintenance.

## Architecture Patterns

### Distributed Monitoring Strategy
**Pattern**: Service-specific monitoring with centralized alerting
- **Tdarr Monitoring**: API-based transcoding health checks
- **Windows Desktop Monitoring**: Reboot detection and system events
- **Network Monitoring**: Connectivity and service availability
- **Container Monitoring**: Docker/Podman health and resource usage

### Alert Management
**Pattern**: Structured notifications with actionable information
```bash
# Discord webhook integration
curl -X POST "$DISCORD_WEBHOOK" \
    -H "Content-Type: application/json" \
    -d '{
        "content": "**System Alert**\n```\nService: Tdarr\nIssue: Staging timeout\nAction: Automatic cleanup performed\n```\n<@user_id>"
    }'
```

## Core Monitoring Components

### Tdarr System Monitoring
**Purpose**: Monitor transcoding pipeline health and performance
**Location**: `scripts/tdarr_monitor.py`

**Key Features**:
- API-based status monitoring with dataclass structures
- Staging section timeout detection and cleanup
- Discord notifications with professional formatting
- Log rotation and retention management

### Windows Desktop Monitoring
**Purpose**: Track Windows system reboots and power events
**Location**: `scripts/windows-desktop/`

**Components**:
- PowerShell monitoring script
- Scheduled task automation
- Discord notification integration
- System event correlation

### Network and Service Monitoring
**Purpose**: Monitor critical infrastructure availability
**Implementation**:
```bash
# Service health check pattern
SERVICES="https://homelab.local http://nas.homelab.local"
for service in $SERVICES; do
    if curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
        echo "✅ $service: Available"
    else
        echo "❌ $service: Failed" | send_alert
    fi
done
```
## Automation Patterns

### Cron-Based Scheduling
**Pattern**: Regular health checks with intelligent alerting
```bash
# Monitoring schedule examples
*/20 * * * * /path/to/tdarr-timeout-monitor.sh   # Every 20 minutes
0 */6 * * * /path/to/cleanup-temp-dirs.sh        # Every 6 hours
0 2 * * * /path/to/backup-monitor.sh             # Daily at 2 AM
```

### Event-Driven Monitoring
**Pattern**: Reactive monitoring for critical events
- **System Startup**: Windows boot detection
- **Service Failures**: Container restart alerts
- **Resource Exhaustion**: Disk space warnings
- **Security Events**: Failed login attempts

## Data Collection and Analysis

### Log Management
**Pattern**: Centralized logging with rotation
```bash
# Log rotation configuration
LOG_FILE="/var/log/homelab-monitor.log"
MAX_SIZE_BYTES=$((10*1024*1024))   # 10M
RETENTION_DAYS=30

# Rotate logs when size exceeded
if [ "$(stat -c%s "$LOG_FILE")" -gt "$MAX_SIZE_BYTES" ]; then
    mv "$LOG_FILE" "$LOG_FILE.$(date +%Y%m%d)"
    touch "$LOG_FILE"
fi
```

### Metrics Collection
**Pattern**: Time-series data for trend analysis
- **System Metrics**: CPU, memory, disk usage
- **Service Metrics**: Response times, error rates
- **Application Metrics**: Transcoding progress, queue sizes
- **Network Metrics**: Bandwidth usage, latency
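A minimal version of the system-metrics pattern needs no agent at all: emit one CSV row per run and let cron provide the time series. The function name, field choice, and output path are assumptions for illustration:

```bash
#!/usr/bin/env bash
# Sketch: one CSV sample per invocation - epoch,load1,mem_used_mb,disk_used_pct
metrics_row() {
  local load mem disk
  # 1-minute load average
  load=$(cut -d' ' -f1 /proc/loadavg)
  # Used memory in MB, derived from /proc/meminfo (kB values)
  mem=$(awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2} END {print int((t-a)/1024)}' /proc/meminfo)
  # Root filesystem usage percentage, "%" stripped
  disk=$(df -P / | awk 'NR==2 {gsub("%","",$5); print $5}')
  printf '%s,%s,%s,%s\n' "$(date +%s)" "$load" "$mem" "$disk"
}
```

Scheduled as e.g. `*/5 * * * * ... metrics_row >> /var/log/homelab-metrics.csv`, this yields a file that imports directly into a spreadsheet or Grafana CSV source.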

## Alert Integration

### Discord Notification System
**Pattern**: Rich, actionable notifications
```markdown
# Professional alert format
**🔧 System Maintenance**
Service: Tdarr Transcoding
Issue: 3 files timed out in staging
Resolution: Automatic cleanup completed
Status: System operational

Manual review recommended <@user_id>
```

### Alert Escalation
**Pattern**: Tiered alerting based on severity
1. **Info**: Routine maintenance completed
2. **Warning**: Service degradation detected
3. **Critical**: Service failure requiring immediate attention
4. **Emergency**: System-wide failure requiring manual intervention
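The four tiers map naturally onto Discord embed colors, and a small helper keeps that mapping in one place. `severity_color` and `send_alert` are hypothetical names; the color values are ordinary decimal RGB, and `$DISCORD_WEBHOOK` is assumed from the webhook pattern above:

```bash
#!/usr/bin/env bash
# Map a severity tier to a Discord embed color (decimal RGB).
severity_color() {
  case "$1" in
    info)      echo 3066993  ;;  # green  (0x2ECC71)
    warning)   echo 16776960 ;;  # yellow (0xFFFF00)
    critical)  echo 15158332 ;;  # red    (0xE74C3C)
    emergency) echo 10038562 ;;  # dark red (0x992D22)
    *)         echo 9807270  ;;  # grey fallback (0x95A5A6)
  esac
}

send_alert() { # args: severity title description
  local payload
  payload=$(printf '{"embeds":[{"title":"%s","description":"%s","color":%s}]}' \
    "$2" "$3" "$(severity_color "$1")")
  curl -sS -X POST -H "Content-Type: application/json" \
    -d "$payload" "$DISCORD_WEBHOOK"
}
```

Keeping the color lookup separate from the delivery call means the escalation table can grow without touching the webhook code.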

## Best Practices Implementation

### Monitoring Strategy
1. **Proactive**: Monitor trends to predict issues
2. **Reactive**: Alert on current failures
3. **Preventive**: Automated cleanup and maintenance
4. **Comprehensive**: Cover all critical services
5. **Actionable**: Provide clear resolution paths

### Performance Optimization
1. **Efficient Polling**: Balance monitoring frequency with resource usage
2. **Smart Alerting**: Avoid alert fatigue with intelligent filtering
3. **Resource Management**: Monitor the monitoring system itself
4. **Scalable Architecture**: Design for growth and additional services

This technology context provides the foundation for implementing comprehensive monitoring and alerting in home lab environments.

326 monitoring/examples/cron-job-management.md Normal file
@@ -0,0 +1,326 @@
# Cron Job Management Patterns

This document outlines the cron job patterns and management strategies used in the home lab environment.

## Current Cron Schedule

### Overview
```bash
# Monthly maintenance
0 2 1 * * /home/cal/bin/ssh_key_maintenance.sh

# Tdarr monitoring and management
*/10 * * * * python3 /mnt/NV2/Development/claude-home/scripts/tdarr_monitor.py --server http://10.10.0.43:8265 --check nodes --detect-stuck --discord-alerts >/dev/null 2>&1
0 */6 * * * find "/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/temp/" -name "tdarr-workDir2-*" -type d -mmin +360 -exec rm -rf {} \; 2>/dev/null || true
0 3 * * * find "/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/media" \( -name "*.temp" -o -name "*.tdarr" \) -mtime +1 -delete 2>/dev/null || true

# Disabled/legacy jobs
#*/20 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
```

Note the parentheses around the `-o` expression in the daily job: without them, `-mtime +1 -delete` binds only to the `*.tdarr` branch and `*.temp` files are never removed.
## Job Categories

### 1. System Maintenance
**SSH Key Maintenance**
- **Schedule**: `0 2 1 * *` (monthly, 1st at 2 AM)
- **Purpose**: Maintain SSH key security and rotation
- **Location**: `/home/cal/bin/ssh_key_maintenance.sh`
- **Priority**: High (security-critical)

### 2. Monitoring & Alerting
**Tdarr System Monitoring**
- **Schedule**: `*/10 * * * *` (every 10 minutes)
- **Purpose**: Monitor Tdarr nodes, detect stuck jobs, send Discord alerts
- **Features**:
  - Stuck job detection (30-minute threshold)
  - Discord notifications with rich embeds
  - Persistent memory state tracking
- **Script**: `/mnt/NV2/Development/claude-home/scripts/tdarr_monitor.py`
- **Output**: Silent (`>/dev/null 2>&1`)

### 3. Cleanup & Housekeeping
**Tdarr Work Directory Cleanup**
- **Schedule**: `0 */6 * * *` (every 6 hours)
- **Purpose**: Remove stale Tdarr work directories
- **Target**: `/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/temp/`
- **Pattern**: `tdarr-workDir2-*` directories
- **Age threshold**: 6 hours (`-mmin +360`)

**Failed Tdarr Job Cleanup**
- **Schedule**: `0 3 * * *` (daily at 3 AM)
- **Purpose**: Remove failed transcode artifacts
- **Target**: `/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/media/`
- **Patterns**: `*.temp` and `*.tdarr` files
- **Age threshold**: 24 hours (`-mtime +1`)

## Design Patterns

### 1. Absolute Paths
**Always use absolute paths in cron jobs**
```bash
# Good
*/10 * * * * python3 /full/path/to/script.py

# Bad - relative paths don't work in cron
*/10 * * * * python3 scripts/script.py
```

### 2. Error Handling
**Standard error suppression pattern**
```bash
command 2>/dev/null || true
```
- Suppresses stderr to prevent cron emails
- `|| true` ensures the job always exits successfully

### 3. Time-based Cleanup
**Safe age thresholds for different content types**
- **Work directories**: 6 hours (short-lived, safe for active jobs)
- **Temp files**: 24 hours (allows for long transcodes)
- **Log files**: 7-30 days (depending on importance)
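These thresholds translate directly into `find` predicates: `-mmin` for minute-scale ages, `-mtime` for day-scale ones. A dry-run helper that only prints matches (illustrative name) makes a rule easy to verify before swapping `-print` for `-delete`:

```bash
#!/usr/bin/env bash
# Sketch: list files under dir matching pattern older than N minutes.
# Swap -print for -delete only after checking the dry-run output.
list_older_than_minutes() { # args: dir pattern minutes
  find "$1" -name "$2" -type f -mmin "+$3" -print
}
```

For example, `list_older_than_minutes /mnt/NV2/tdarr-cache '*.temp' 1440` previews the daily 24-hour rule.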

### 4. Resource-aware Scheduling
**Avoid resource conflicts**
```bash
# System maintenance at low-usage times
0 2 1 * * maintenance_script.sh

# Cleanup during off-peak hours
0 3 * * * cleanup_script.sh

# Monitoring with high frequency during active hours
*/10 * * * * monitor_script.py
```

## Management Workflow

### Adding New Cron Jobs

1. **Backup current crontab**
   ```bash
   crontab -l > /tmp/crontab_backup_$(date +%Y%m%d)
   ```

2. **Edit safely**
   ```bash
   crontab -l > /tmp/new_crontab
   echo "# New job description" >> /tmp/new_crontab
   echo "schedule command" >> /tmp/new_crontab
   crontab /tmp/new_crontab
   ```

3. **Verify installation**
   ```bash
   crontab -l
   ```
### Proper HERE Document (EOF) Usage
|
||||||
|
|
||||||
|
**When building cron files with HERE documents, use proper EOF formatting:**
|
||||||
|
|
||||||
|
#### ✅ **Correct Format**
|
||||||
|
```bash
|
||||||
|
cat > /tmp/new_crontab << 'EOF'
|
||||||
|
0 2 1 * * /home/cal/bin/ssh_key_maintenance.sh
|
||||||
|
# Tdarr monitoring every 10 minutes
|
||||||
|
*/10 * * * * python3 /path/to/script.py --args
|
||||||
|
EOF
|
||||||
|
```
|
||||||
|
#### ❌ **Common Mistakes**

```bash
# BAD - Causes "EOF not found" errors
cat >> /tmp/crontab << 'EOF'
new_cron_job
EOF

# Results in malformed file with literal "EOF < /dev/null" lines
```

#### **Key Rules for EOF in Cron Files**

1. **Use `cat >` not `cat >>`** for building complete files
```bash
# Good - overwrites file cleanly
cat > /tmp/crontab << 'EOF'

# Bad - appends and can create malformed files
cat >> /tmp/crontab << 'EOF'
```

2. **Quote the EOF delimiter** to prevent variable expansion
```bash
# Good - literal content
cat > file << 'EOF'

# Can cause issues with special characters
cat > file << EOF
```

3. **Clean up malformed files** before installing
```bash
# Drop a stray trailing line (e.g. a leftover EOF marker)
head -n -1 /tmp/crontab > /tmp/clean_crontab

# Or use grep to remove EOF lines
grep -v "^EOF" /tmp/crontab > /tmp/clean_crontab
```

4. **Alternative approach - direct echo method**
```bash
crontab -l > /tmp/current_crontab
echo "# New job comment" >> /tmp/current_crontab
echo "*/10 * * * * /path/to/command" >> /tmp/current_crontab
crontab /tmp/current_crontab
```

#### **Debugging EOF Issues**

```bash
# Check for EOF artifacts in crontab file
cat -n /tmp/crontab | grep EOF

# Validate crontab syntax before installing
crontab -T /tmp/crontab  # Some systems support this

# Manual cleanup if needed
sed '/^EOF/d' /tmp/crontab > /tmp/clean_crontab
```

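The direct echo method can also be made idempotent, so re-running a setup script never duplicates a job. A minimal sketch (the `append_if_absent` helper and the staging-file approach are illustrative, not part of the original guide):

```bash
#!/bin/bash
# Append a crontab line to a staging file only if it is not already present.
append_if_absent() {
    local file="$1" job="$2"
    grep -qF -- "$job" "$file" 2>/dev/null || echo "$job" >> "$file"
}

staging=$(mktemp)
crontab -l > "$staging" 2>/dev/null || true   # start from the current crontab, if any
append_if_absent "$staging" "*/10 * * * * /path/to/command"
append_if_absent "$staging" "*/10 * * * * /path/to/command"   # second call is a no-op
cat "$staging"
# crontab "$staging"   # install after verifying the contents
```

Because the helper checks before appending, the same line can be run from provisioning scripts repeatedly without growing the crontab.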
### Testing Cron Jobs

**Test command syntax first**
```bash
# Test the actual command before scheduling
python3 /full/path/to/script.py --test

# Check file permissions
ls -la /path/to/script

# Verify paths exist
ls -la /target/directory/
```

**Test with minimal frequency**
```bash
# Start with 5-minute intervals for testing
*/5 * * * * /path/to/new/script.sh

# Monitor logs
tail -f /var/log/syslog | grep CRON
```

### Monitoring Cron Jobs

**Check cron logs**
```bash
# System cron logs
sudo journalctl -u cron -f

# User cron logs
grep CRON /var/log/syslog | grep $(whoami)
```

**Verify job execution**
```bash
# Check if cleanup actually ran
ls -la /target/cleanup/directory/

# Monitor script logs
tail -f /path/to/script/logs/*.log
```

## Security Considerations

### 1. Path Security
- Use absolute paths to prevent PATH manipulation
- Ensure scripts are owned by correct user
- Set appropriate permissions (750 for scripts)

### 2. Command Injection Prevention
```bash
# Good - quoted paths
find "/path/with spaces/" -name "pattern"

# Bad - unquoted paths vulnerable to injection
find /path/with spaces/ -name pattern
```

### 3. Resource Limits
- Prevent runaway processes with `timeout`
- Use `ionice` for I/O intensive cleanup jobs
- Consider `nice` for CPU-intensive tasks

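These limits combine naturally into a single wrapper. A small demonstration of the `timeout` behavior (the two-second limit is only for illustration; a real cleanup job would use something like 1800 seconds):

```bash
#!/bin/bash
# nice -n 19 -> lowest CPU priority; timeout -> hard wall-clock cap.
# The wrapped command deliberately overruns so timeout has to kill it.
timeout 2 nice -n 19 sh -c 'sleep 10'
status=$?
echo "exit status: $status"   # timeout reports 124 when the limit was hit
```

In a crontab entry the same wrapper applies directly, e.g. `0 3 * * * /usr/bin/timeout 1800 /usr/bin/ionice -c3 /usr/bin/nice -n19 /path/to/cleanup.sh` (paths and schedule illustrative).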
## Troubleshooting

### Common Issues

**Job not running**
1. Check cron service: `sudo systemctl status cron`
2. Inspect the installed crontab for syntax problems: `crontab -l`
3. Check file permissions and paths
4. Review cron logs for errors

**Environment differences**
- Cron runs with minimal environment
- Set PATH explicitly if needed
- Use absolute paths for all commands

**Silent failures**
- Remove `2>/dev/null` temporarily for debugging
- Add logging to scripts
- Check script exit codes

### Debugging Commands
```bash
# Test cron environment
* * * * * env > /tmp/cron_env.txt

# Test script in cron-like environment
env -i /bin/bash -c 'your_command_here'

# Monitor real-time execution
sudo tail -f /var/log/syslog | grep CRON
```

## Best Practices

### 1. Documentation
- Comment all cron jobs with purpose and schedule
- Document in this patterns file
- Include contact info for complex jobs

### 2. Maintenance
- Regular review of active jobs (quarterly)
- Remove obsolete jobs promptly
- Update absolute paths when moving scripts

### 3. Monitoring
- Implement health checks for critical jobs
- Use Discord/email notifications for failures
- Monitor disk space usage from cleanup jobs

### 4. Backup Strategy
- Backup crontab before changes
- Version control cron configurations
- Document restoration procedures

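The backup step can be a one-liner run before every edit. A sketch (the backup directory is an illustrative choice; point `CRONTAB_BACKUP_DIR` at a version-controlled path to cover the second bullet too):

```bash
#!/bin/bash
# Take a timestamped snapshot of the current crontab before editing it.
backup_dir="${CRONTAB_BACKUP_DIR:-$(mktemp -d)}"
stamp=$(date +%Y%m%d-%H%M%S)
crontab -l > "$backup_dir/crontab.$stamp" 2>/dev/null || true   # empty file if no crontab yet
ls "$backup_dir"
# Restore later with: crontab "$backup_dir/crontab.<stamp>"
```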
## Future Enhancements

### Planned Additions
- **Log rotation**: Automated cleanup of application logs
- **Health checks**: System resource monitoring
- **Backup verification**: Automated backup integrity checks
- **Certificate renewal**: SSL/TLS certificate automation

### Migration Considerations
- **Systemd timers**: Consider migration for complex scheduling
- **Configuration management**: Ansible or similar for multi-host
- **Centralized logging**: Aggregated cron job monitoring

---

## Related Documentation
- [Tdarr Monitoring Script](../scripts/README.md#tdarr_monitorpy---enhanced-tdarr-monitoring)
- [System Maintenance](../reference/system-maintenance.md)
- [Discord Integration](../examples/discord-notifications.md)
1234	monitoring/scripts/tdarr_monitor.py	(Executable file; diff suppressed because it is too large)

414	monitoring/troubleshooting.md	(Normal file)
@@ -0,0 +1,414 @@
# Monitoring System Troubleshooting Guide

## Discord Notification Issues

### Webhook Not Working
**Symptoms**: No Discord messages received, connection errors
**Diagnosis**:
```bash
# Test webhook manually
curl -X POST "$DISCORD_WEBHOOK_URL" \
    -H "Content-Type: application/json" \
    -d '{"content": "Test message"}'

# Check webhook URL format
echo $DISCORD_WEBHOOK_URL | grep -E "https://discord.com/api/webhooks/[0-9]+/.+"
```

**Solutions**:
```bash
# Verify webhook URL is correct
# Format: https://discord.com/api/webhooks/ID/TOKEN

# Test with minimal payload
curl -X POST "$DISCORD_WEBHOOK_URL" \
    -H "Content-Type: application/json" \
    -d '{"content": "✅ Webhook working"}'

# Check for JSON formatting issues
echo '{"content": "test"}' | jq .  # Validate JSON
```

### Message Formatting Problems
**Symptoms**: Malformed messages, broken markdown, missing user pings
**Common Issues**:
```bash
# ❌ Broken JSON escaping
{"content": "Error: "quotes" break JSON"}

# ✅ Proper JSON escaping
{"content": "Error: \"quotes\" properly escaped"}

# ❌ User ping inside code block (doesn't work)
{"content": "```\nIssue occurred <@user_id>\n```"}

# ✅ User ping outside code block
{"content": "```\nIssue occurred\n```\nManual intervention needed <@user_id>"}
```

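When the message comes from a variable, hand-rolled escaping breaks easily; generating the payload with `jq` sidesteps the quoting problems above. A sketch (the `send_discord` wrapper is illustrative; it prints the payload and leaves the actual `curl` call commented out):

```bash
#!/bin/bash
# Build the JSON payload with jq so quotes and newlines are escaped for us.
send_discord() {
    local message="$1"
    local payload
    payload=$(jq -n --arg content "$message" '{content: $content}')
    echo "$payload"
    # curl -X POST "$DISCORD_WEBHOOK_URL" \
    #     -H "Content-Type: application/json" \
    #     -d "$payload"
}

send_discord 'Error: "quotes" and special characters handled safely'
```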
## Tdarr Monitoring Issues

### Script Not Running
**Symptoms**: No monitoring alerts, script execution failures
**Diagnosis**:
```bash
# Check cron job status
crontab -l | grep tdarr-timeout-monitor
systemctl status cron

# Run script manually for debugging
bash -x /path/to/tdarr-timeout-monitor.sh

# Check script permissions
ls -la /path/to/tdarr-timeout-monitor.sh
```

**Solutions**:
```bash
# Fix script permissions
chmod +x /path/to/tdarr-timeout-monitor.sh

# Reinstall cron job
crontab -e
# Add: */20 * * * * /full/path/to/tdarr-timeout-monitor.sh

# Check script environment
# Ensure PATH and variables are set correctly in script
```

### API Connection Failures
**Symptoms**: Cannot connect to Tdarr server, timeout errors
**Diagnosis**:
```bash
# Test Tdarr API manually
curl -f "http://tdarr-server:8266/api/v2/status"

# Check network connectivity
ping tdarr-server
nc -zv tdarr-server 8266

# Verify SSH access to server
ssh tdarr "docker ps | grep tdarr"
```

**Solutions**:
```bash
# Update server connection in script
# Verify server IP and port are correct

# Test API endpoints
curl "http://10.10.0.43:8265/api/v2/status"  # Web port
curl "http://10.10.0.43:8266/api/v2/status"  # Server port

# Check Tdarr server logs
ssh tdarr "docker logs tdarr | tail -20"
```

## Windows Desktop Monitoring Issues

### PowerShell Script Not Running
**Symptoms**: No reboot notifications from Windows systems
**Diagnosis**:
```powershell
# Check scheduled task status
Get-ScheduledTask -TaskName "Reboot*" | Get-ScheduledTaskInfo

# Test script execution manually
PowerShell -ExecutionPolicy Bypass -File "C:\path\to\windows-reboot-monitor.ps1"

# Check PowerShell execution policy
Get-ExecutionPolicy
```

**Solutions**:
```powershell
# Set execution policy
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

# Recreate scheduled tasks
schtasks /Create /XML "C:\path\to\task.xml" /TN "RebootMonitor"

# Check task trigger configuration
Get-ScheduledTask -TaskName "RebootMonitor" | Get-ScheduledTaskTrigger
```

### Network Access from Windows
**Symptoms**: PowerShell cannot reach Discord webhook
**Diagnosis**:
```powershell
# Test network connectivity
Test-NetConnection discord.com -Port 443

# Test webhook manually
Invoke-RestMethod -Uri $webhookUrl -Method Post -Body '{"content":"test"}' -ContentType "application/json"

# Check Windows firewall
Get-NetFirewallRule | Where-Object {$_.DisplayName -like "*PowerShell*"}
```

**Solutions**:
```powershell
# Allow PowerShell through firewall
New-NetFirewallRule -DisplayName "PowerShell Outbound" -Direction Outbound -Program "C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe" -Action Allow

# Test with simplified request
$body = @{content="Test from Windows"} | ConvertTo-Json
Invoke-RestMethod -Uri $webhookUrl -Method Post -Body $body -ContentType "application/json"
```

## Log Management Issues

### Log Files Growing Too Large
**Symptoms**: Disk space filling up, slow log access
**Diagnosis**:
```bash
# Check log file sizes
du -sh /var/log/homelab-*
du -sh /tmp/*monitor*.log

# Check available disk space
df -h /var/log
df -h /tmp
```

**Solutions**:
```bash
# Implement log rotation
cat > /etc/logrotate.d/homelab-monitoring << 'EOF'
/var/log/homelab-*.log {
    daily
    missingok
    rotate 7
    compress
    notifempty
    create 644 root root
}
EOF

# Manual log cleanup
find /tmp -name "*monitor*.log" -size +10M -delete
truncate -s 0 /tmp/large-log-file.log
```

### Log Rotation Not Working
**Symptoms**: Old logs not being cleaned up
**Diagnosis**:
```bash
# Check logrotate status
systemctl status logrotate
cat /var/lib/logrotate/status

# Test logrotate configuration
logrotate -d /etc/logrotate.d/homelab-monitoring
```

**Solutions**:
```bash
# Force log rotation
logrotate -f /etc/logrotate.d/homelab-monitoring

# Fix logrotate configuration
sudo nano /etc/logrotate.d/homelab-monitoring
# Verify syntax and permissions
```

## Cron Job Issues

### Scheduled Tasks Not Running
**Symptoms**: Scripts not executing at scheduled times
**Diagnosis**:
```bash
# Check cron service
systemctl status cron
systemctl status crond  # RHEL/CentOS

# View cron logs
grep CRON /var/log/syslog
journalctl -u cron

# List all cron jobs
crontab -l
sudo crontab -l  # System crontab
```

**Solutions**:
```bash
# Restart cron service
sudo systemctl restart cron

# Fix cron job syntax
# Ensure absolute paths are used
# Example: */20 * * * * /full/path/to/script.sh

# Check script permissions and execution
ls -la /path/to/script.sh
/path/to/script.sh  # Test manual execution
```

### Environment Variables in Cron
**Symptoms**: Scripts work manually but fail in cron
**Diagnosis**:
```bash
# Create test cron job to check environment
* * * * * env > /tmp/cron-env.txt

# Compare with shell environment
env > /tmp/shell-env.txt
diff /tmp/shell-env.txt /tmp/cron-env.txt
```

**Solutions**:
```bash
# Set PATH in crontab
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

# Or set PATH in script
#!/bin/bash
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

# Source environment if needed
source /etc/environment
```

## Network Monitoring Issues

### False Positives
**Symptoms**: Alerts for services that are actually working
**Diagnosis**:
```bash
# Test monitoring checks manually
curl -sSf --max-time 10 "https://service.homelab.local"
ping -c1 -W5 10.10.0.100

# Check for intermittent network issues
for i in {1..10}; do ping -c1 host || echo "Fail $i"; done
```

**Solutions**:
```bash
# Adjust timeout values
curl --max-time 30 "$service"  # Increase timeout

# Add retry logic
for retry in {1..3}; do
    if curl -sSf "$service" >/dev/null 2>&1; then
        break
    elif [ $retry -eq 3 ]; then
        send_alert "Service $service failed after 3 retries"
    fi
    sleep 5
done
```

### Missing Alerts
**Symptoms**: Real failures not triggering notifications
**Diagnosis**:
```bash
# Verify monitoring script logic
bash -x monitoring-script.sh

# Check if services are actually down
systemctl status service-name
curl -v service-url
```

**Solutions**:
```bash
# Lower detection thresholds
# Increase monitoring frequency
# Add redundant monitoring methods

# Test alert mechanism
echo "Test alert" | send_alert_function
```

## System Resource Issues

### Monitoring Overhead
**Symptoms**: High CPU/memory usage from monitoring scripts
**Diagnosis**:
```bash
# Monitor the monitoring scripts
top -p $(pgrep -f monitor)
ps aux | grep monitor

# Check monitoring frequency
crontab -l | grep monitor
```

**Solutions**:
```bash
# Reduce monitoring frequency
# Change from */1 to */5 minutes

# Optimize scripts
# Remove unnecessary commands
# Use efficient tools (prefer curl over wget, etc.)

# Add resource limits
timeout 30 monitoring-script.sh
```

## Emergency Recovery

### Complete Monitoring Failure
**Recovery Steps**:
```bash
# Restart all monitoring services
sudo systemctl restart cron
sudo systemctl restart rsyslog

# Reinstall monitoring scripts
cd /path/to/scripts
./install-monitoring.sh

# Test all components
./test-monitoring.sh
```

### Discord Integration Lost
**Quick Recovery**:
```bash
# Test webhook
curl -X POST "$BACKUP_WEBHOOK_URL" -H "Content-Type: application/json" -d '{"content": "Monitoring restored"}'

# Switch to backup webhook if needed
export DISCORD_WEBHOOK_URL="$BACKUP_WEBHOOK_URL"
```

## Prevention and Best Practices

### Monitoring Health Checks
```bash
#!/bin/bash
# monitor-the-monitors.sh
MONITORING_SCRIPTS="/path/to/tdarr-monitor.sh /path/to/network-monitor.sh"

for script in $MONITORING_SCRIPTS; do
    if [ ! -x "$script" ]; then
        send_alert "ALERT: $script not executable"
    fi

    # Check if script has run recently
    if [ $(($(date +%s) - $(stat -c %Y "$script.last_run" 2>/dev/null || echo 0))) -gt 3600 ]; then
        send_alert "ALERT: $script hasn't run in over an hour"
    fi
done
```

### Backup Alerting Channels
```bash
# Multiple notification methods
send_alert() {
    local message="$1"

    # Primary: Discord
    curl -fsS -X POST "$DISCORD_WEBHOOK" \
        -H "Content-Type: application/json" \
        -d "{\"content\":\"$message\"}" ||
    # Backup: Email
    echo "$message" | mail -s "Homelab Alert" admin@domain.com ||
    # Last resort: Local log
    echo "$(date): $message" >> /var/log/critical-alerts.log
}
```

This troubleshooting guide covers the most common monitoring system issues and provides systematic recovery procedures.
309	networking/CONTEXT.md	(Normal file)
@@ -0,0 +1,309 @@
# Networking Infrastructure - Technology Context

## Overview
Home lab networking infrastructure with a focus on reverse proxy configuration, SSL/TLS management, SSH key management, and network security. This context covers service discovery, load balancing, and performance optimization patterns.

## Architecture Patterns

### Reverse Proxy and Load Balancing
**Pattern**: Centralized traffic management with SSL termination
```nginx
# Nginx reverse proxy pattern
upstream backend {
    server 10.10.0.100:3000;
    server 10.10.0.101:3000;
    keepalive 32;
}

server {
    listen 443 ssl http2;
    server_name myapp.homelab.local;

    ssl_certificate /etc/ssl/certs/homelab.crt;
    ssl_certificate_key /etc/ssl/private/homelab.key;

    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```

### Network Segmentation Strategy
**Pattern**: VLAN-based isolation with controlled inter-VLAN routing
```
Management VLAN: 10.10.0.x/24   # VM management, SSH access
Services VLAN:   10.10.1.x/24   # Application services
Storage VLAN:    10.10.2.x/24   # NAS, backup traffic
DMZ VLAN:        10.10.10.x/24  # External-facing services
```

## SSH Key Management

### Centralized Key Distribution
**Pattern**: Automated SSH key deployment with emergency backup
```bash
# Primary access key
~/.ssh/homelab_rsa            # Daily operations key

# Emergency access key
~/.ssh/emergency_homelab_rsa  # Backup recovery key

# Automated deployment
for host in $(cat hosts.txt); do
    ssh-copy-id -i ~/.ssh/homelab_rsa.pub user@$host
    ssh-copy-id -i ~/.ssh/emergency_homelab_rsa.pub user@$host
done
```

### Key Lifecycle Management
**Pattern**: Regular rotation with zero-downtime deployment
1. **Generation**: Create new key pairs annually
2. **Distribution**: Deploy to all managed systems
3. **Verification**: Test connectivity with new keys
4. **Rotation**: Remove old keys after verification
5. **Backup**: Store keys in secure, recoverable location

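The rotation step (removing retired keys) can be scripted per host. A sketch that filters `authorized_keys` by the retiring key's comment (the file contents and key comments here are stand-ins, and a temp file stands in for `~/.ssh/authorized_keys`):

```bash
#!/bin/bash
# Rotation step: drop the retired key (matched by its comment) from authorized_keys.
auth=$(mktemp)   # stand-in for ~/.ssh/authorized_keys
cat > "$auth" <<'KEYS'
ssh-rsa AAAAB3...old homelab-2023
ssh-rsa AAAAB3...new homelab-2024
KEYS

grep -v 'homelab-2023' "$auth" > "$auth.tmp" && mv "$auth.tmp" "$auth"
cat "$auth"   # only the 2024 key remains
```

Run only after verifying logins with the new key, since removing the wrong line locks you out of that host.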
## Service Discovery and DNS

### Local DNS Resolution
**Pattern**: Internal DNS for service discovery
```bind
; Home lab DNS zones
homelab.local.          IN A 10.10.0.16  ; DNS server
proxmox.homelab.local.  IN A 10.10.0.10  ; Hypervisor
nas.homelab.local.      IN A 10.10.0.20  ; Storage
tdarr.homelab.local.    IN A 10.10.0.43  ; Media server
```

### Container Service Discovery
**Pattern**: Docker network-based service resolution
```yaml
# Docker Compose service discovery
version: "3.8"
services:
  web:
    networks:
      - frontend
      - backend
  api:
    networks:
      - backend
      - database
  db:
    networks:
      - database

networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge
  database:
    driver: bridge
    internal: true  # No external access
```

## Security Patterns

### SSH Security Hardening
**Configuration**: Secure SSH server setup
```sshd_config
# /etc/ssh/sshd_config.d/99-homelab-security.conf
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
Protocol 2
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
X11Forwarding no
```

### Network Access Control
**Pattern**: Firewall-based service protection
```bash
# ufw firewall rules
ufw default deny incoming
ufw default allow outgoing
ufw allow ssh
ufw allow from 10.10.0.0/24 to any port 22
ufw allow from 10.10.0.0/24 to any port 80
ufw allow from 10.10.0.0/24 to any port 443
```

### SSL/TLS Certificate Management
**Pattern**: Automated certificate lifecycle
```bash
# Let's Encrypt automation
# Note: the wildcard entry requires the DNS-01 challenge (a certbot DNS
# plugin) rather than the nginx plugin, plus a publicly resolvable domain
certbot certonly --nginx \
    --email admin@homelab.local \
    --agree-tos \
    --domains homelab.local,*.homelab.local

# Certificate renewal automation
0 2 * * * certbot renew --quiet && systemctl reload nginx
```

## Performance Optimization

### Connection Management
**Pattern**: Optimized connection handling
```nginx
# Nginx performance tuning (top-level nginx.conf)
worker_processes auto;

events {
    worker_connections 1024;
}

# Inside the http {} block:
keepalive_timeout 65;
keepalive_requests 1000;

gzip on;
gzip_vary on;
gzip_types text/plain text/css application/json application/javascript;

# Connection pooling
upstream backend {
    server 10.10.0.100:3000 max_fails=3 fail_timeout=30s;
    keepalive 32;
}
```

### Caching Strategies
**Pattern**: Multi-level caching architecture
```nginx
# Static content caching
location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
    expires 1y;
    add_header Cache-Control "public, immutable";
}

# Proxy caching
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=app_cache:10m;
proxy_cache app_cache;
proxy_cache_valid 200 302 10m;
```

## Network Storage Integration

### CIFS/SMB Mount Resilience
**Pattern**: Robust network filesystem mounting
```fstab
# Wrapped here for readability - the actual /etc/fstab entry must be a single line
//nas.homelab.local/media /mnt/media cifs \
    credentials=/etc/cifs/credentials,\
    uid=1000,gid=1000,\
    file_mode=0644,dir_mode=0755,\
    iocharset=utf8,\
    cache=strict,\
    actimeo=30,\
    _netdev,\
    reconnect,\
    soft,\
    rsize=1048576,\
    wsize=1048576 0 0
```

## Monitoring and Observability

### Network Health Monitoring
**Pattern**: Automated connectivity verification
```bash
#!/bin/bash
# network-health-check.sh
HOSTS="10.10.0.10 10.10.0.20 10.10.0.43"
DNS_SERVERS="10.10.0.16 8.8.8.8"

for host in $HOSTS; do
    if ping -c1 -W5 $host >/dev/null 2>&1; then
        echo "✅ $host: Reachable"
    else
        echo "❌ $host: Unreachable"
    fi
done

for dns in $DNS_SERVERS; do
    if nslookup google.com $dns >/dev/null 2>&1; then
        echo "✅ DNS $dns: Working"
    else
        echo "❌ DNS $dns: Failed"
    fi
done
```

### Service Availability Monitoring
**Pattern**: HTTP/HTTPS endpoint monitoring
```bash
# Service health check
SERVICES="https://homelab.local http://proxmox.homelab.local:8006"

for service in $SERVICES; do
    if curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
        echo "✅ $service: Available"
    else
        echo "❌ $service: Unavailable"
    fi
done
```

## Common Integration Patterns

### Reverse Proxy with Docker
**Pattern**: Container service exposure
```nginx
# Dynamic service discovery with Docker
location /api/ {
    proxy_pass http://api-container:3000/;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}

location /web/ {
    proxy_pass http://web-container:8080/;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";  # WebSocket support
}
```

### VPN Integration
**Pattern**: Secure remote access
```openvpn
# OpenVPN server configuration
port 1194
proto udp
dev tun
ca ca.crt
cert server.crt
key server.key
dh dh.pem
server 10.8.0.0 255.255.255.0
push "route 10.10.0.0 255.255.0.0"  # Home lab networks
keepalive 10 120
```

## Best Practices

### Security Implementation
1. **SSH Keys Only**: Disable password authentication everywhere
2. **Network Segmentation**: Use VLANs for isolation
3. **Certificate Management**: Automate SSL/TLS certificate lifecycle
4. **Access Control**: Implement least-privilege networking
5. **Monitoring**: Continuous network and service monitoring

### Performance Optimization
1. **Connection Pooling**: Reuse connections for efficiency
2. **Caching**: Implement multi-level caching strategies
3. **Compression**: Enable gzip for reduced bandwidth
4. **Keep-Alives**: Optimize connection persistence
5. **CDN Strategy**: Cache static content effectively

### Operational Excellence
1. **Documentation**: Maintain network topology documentation
2. **Automation**: Script routine network operations
3. **Backup**: Regular configuration backups
4. **Testing**: Regular connectivity and performance testing
5. **Change Management**: Controlled network configuration changes

This technology context provides comprehensive guidance for implementing robust networking infrastructure in home lab environments.
99	networking/examples/security_improvements.md	(Normal file)
@@ -0,0 +1,99 @@
# Home Lab Security Improvements

## Current Security Issues

### Critical Issues Found:
- **Password Authentication**: All servers using password-based SSH authentication
- **Credential Reuse**: Same password used across 7 home network servers
- **Insecure Storage**: Passwords stored in FileZilla (base64 encoded, not encrypted)
- **Root Access**: Cloud servers using root user accounts

### Risk Assessment:
- **High**: Password-based authentication vulnerable to brute force attacks
- **High**: Shared passwords create single point of failure
- **Medium**: FileZilla credentials accessible to anyone with file system access
- **Medium**: Root access increases attack surface

## Implemented Solutions

### 1. SSH Key-Based Authentication
- **Generated separate key pairs** for home lab vs cloud servers
- **4096-bit RSA keys** for strong encryption
- **Descriptive key comments** for identification

### 2. SSH Configuration Management
- **Centralized config** in `~/.ssh/config`
- **Host aliases** for easy server access
- **Port forwarding** pre-configured for common services
- **Security defaults** (ServerAliveInterval, StrictHostKeyChecking)

### 3. Network Segmentation
- **Home network** (10.10.0.0/24) uses dedicated key
- **Cloud servers** use separate key pair
- **Service-specific aliases** for different server roles

## Additional Security Recommendations

### Immediate Actions:
1. **Deploy SSH keys** using the provided script
2. **Test key-based authentication** on all servers
3. **Disable password authentication** once keys work
4. **Remove FileZilla passwords** after migration

### Server Hardening:
```bash
# On each server, edit /etc/ssh/sshd_config
# (sshd_config does not allow trailing comments on directive lines):
PasswordAuthentication no
PubkeyAuthentication yes
# Create a non-root user on cloud servers first
PermitRootLogin no
# Change default SSH port
Port 2222
# Restrict SSH access
AllowUsers cal
```

### Monitoring:
- **SSH login monitoring** with fail2ban
- **Key rotation schedule** (annually)
- **Access logging** review

### Future Enhancements:
|
||||||
|
- **Certificate-based authentication** (SSH CA)
|
||||||
|
- **Multi-factor authentication** (TOTP)
|
||||||
|
- **VPN access** for home network
|
||||||
|
- **Bastion host** for cloud servers
|
||||||
|
|
||||||
|
## Migration Plan
|
||||||
|
|
||||||
|
### Phase 1: Key Deployment ✅
|
||||||
|
- [x] Generate SSH key pairs
|
||||||
|
- [x] Create SSH configuration
|
||||||
|
- [x] Document server inventory
|
||||||
|
|
||||||
|
### Phase 2: Authentication Migration
|
||||||
|
- [ ] Deploy public keys to all servers
|
||||||
|
- [ ] Test SSH connections with keys
|
||||||
|
- [ ] Verify all services accessible
|
||||||
|
|
||||||
|
### Phase 3: Security Lockdown
|
||||||
|
- [ ] Disable password authentication
|
||||||
|
- [ ] Change default SSH ports
|
||||||
|
- [ ] Configure fail2ban
|
||||||
|
- [ ] Remove FileZilla credentials
|
||||||
|
|
||||||
|
### Phase 4: Monitoring & Maintenance
|
||||||
|
- [ ] Set up access logging
|
||||||
|
- [ ] Schedule key rotation
|
||||||
|
- [ ] Document incident response
|
||||||
|
|
||||||
|
## Connection Examples
|
||||||
|
|
||||||
|
After setup, you'll connect using simple aliases:
|
||||||
|
```bash
|
||||||
|
# Instead of: ssh cal@10.10.0.42
|
||||||
|
ssh database-apis
|
||||||
|
|
||||||
|
# Instead of: ssh root@172.237.147.99
|
||||||
|
ssh akamai
|
||||||
|
|
||||||
|
# With automatic port forwarding:
|
||||||
|
ssh pihole # Forwards port 8080 → localhost:80
|
||||||
|
```
|
||||||
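The aliases above are defined as `Host` blocks in `~/.ssh/config`; a minimal sketch of two such entries (IPs and key paths taken from the inventory in this commit, other values are assumptions to adjust for your environment):

```
# ~/.ssh/config (sketch)
Host database-apis
    HostName 10.10.0.42
    User cal
    IdentityFile ~/.ssh/homelab_rsa
    ServerAliveInterval 60

Host pihole
    HostName 10.10.0.16
    User cal
    IdentityFile ~/.ssh/homelab_rsa
    LocalForward 8080 localhost:80
```

With this in place, `ssh pihole` both authenticates with the dedicated home lab key and sets up the port forward automatically.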
70 networking/examples/server_inventory.yaml Normal file
@@ -0,0 +1,70 @@
---
# Home Lab Server Inventory
# Generated from FileZilla configuration

home_network:
  subnet: "10.10.0.0/24"
  servers:
    database_apis:
      hostname: "10.10.0.42"
      port: 22
      user: "cal"
      services: ["database", "api"]
      description: "Database and API services"

    discord_bots:
      hostname: "10.10.0.33"
      port: 22
      user: "cal"
      services: ["discord", "bots"]
      description: "Discord bot hosting"

    home_docker:
      hostname: "10.10.0.124"
      port: 22
      user: "cal"
      services: ["docker", "containers"]
      description: "Main Docker container host"

    pihole:
      hostname: "10.10.0.16"
      port: 22
      user: "cal"
      services: ["dns", "adblock"]
      description: "Pi-hole DNS and ad blocking"

    sba_pd_bots:
      hostname: "10.10.0.88"
      port: 22
      user: "cal"
      services: ["bots", "automation"]
      description: "SBa and PD bot services"

    tdarr:
      hostname: "10.10.0.43"
      port: 22
      user: "cal"
      services: ["media", "transcoding"]
      description: "Tdarr media transcoding"

    vpn_docker:
      hostname: "10.10.0.121"
      port: 22
      user: "cal"
      services: ["vpn", "docker"]
      description: "VPN and Docker services"

remote_servers:
  akamai_nano:
    hostname: "172.237.147.99"
    port: 22
    user: "root"
    provider: "akamai"
    description: "Akamai cloud nano instance"

  vultr_host:
    hostname: "45.76.25.231"
    port: 22
    user: "root"
    provider: "vultr"
    description: "Vultr cloud host"
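For quick scripting against this inventory, the hostnames can be pulled out with plain grep/sed rather than a YAML parser; a sketch (the inventory path is an assumption, and the snippet falls back to a tiny embedded sample so it runs anywhere):

```shell
#!/bin/sh
# Extract all hostname values from the server inventory.
# INVENTORY path is an assumption; a small embedded sample is used
# as a fallback so the snippet is self-contained.
INVENTORY="${INVENTORY:-networking/examples/server_inventory.yaml}"
if [ ! -f "$INVENTORY" ]; then
  INVENTORY=$(mktemp)
  printf '    pihole:\n      hostname: "10.10.0.16"\n' > "$INVENTORY"
fi
grep 'hostname:' "$INVENTORY" | sed 's/.*hostname: *"\(.*\)"/\1/'
```

This is fragile by design (it assumes the quoted one-line style above); for anything more elaborate, a real YAML parser is the safer route.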
114 networking/scripts/ssh_key_maintenance.sh Executable file
@@ -0,0 +1,114 @@
#!/bin/bash
# SSH Key Maintenance and Backup Script
# Run this periodically to maintain key security

echo "🔧 SSH Key Maintenance and Backup"

# Check if NAS is mounted
if [ ! -d "/mnt/NV2" ]; then
    echo "❌ ERROR: NAS not mounted at /mnt/NV2"
    exit 1
fi

# Create timestamp
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
BACKUP_ROOT="/mnt/NV2/ssh-keys"
BACKUP_DIR="$BACKUP_ROOT/maintenance-$TIMESTAMP"

# Ensure backup directory structure
mkdir -p "$BACKUP_DIR"
chmod 700 "$BACKUP_DIR"

echo "📁 Creating maintenance backup in: $BACKUP_DIR"

# Back up current keys and config
cp ~/.ssh/*_rsa* "$BACKUP_DIR/" 2>/dev/null || true
cp ~/.ssh/config "$BACKUP_DIR/" 2>/dev/null || true
cp ~/.ssh/known_hosts "$BACKUP_DIR/" 2>/dev/null || true

# Check key ages and recommend rotation
echo ""
echo "🔍 Key Age Analysis:"
for key in ~/.ssh/*_rsa; do
    if [ -f "$key" ]; then
        age_days=$(( ($(date +%s) - $(stat -c %Y "$key")) / 86400 ))
        basename_key=$(basename "$key")

        if [ "$age_days" -gt 365 ]; then
            echo "⚠️  $basename_key: $age_days days old - ROTATION RECOMMENDED"
        elif [ "$age_days" -gt 180 ]; then
            echo "⚡ $basename_key: $age_days days old - consider rotation"
        else
            echo "✅ $basename_key: $age_days days old - OK"
        fi
    fi
done

# Test key accessibility
echo ""
echo "🔐 Testing Key Access:"
for key in ~/.ssh/*_rsa; do
    if [ -f "$key" ]; then
        basename_key=$(basename "$key")
        if ssh-keygen -l -f "$key" >/dev/null 2>&1; then
            echo "✅ $basename_key: Valid and readable"
        else
            echo "❌ $basename_key: CORRUPTED or unreadable"
        fi
    fi
done

# Clean up old backups (keep the last 10)
echo ""
echo "🧹 Cleaning old backups (keeping last 10):"
cd "$BACKUP_ROOT" || exit 1
ls -dt backup-* maintenance-* 2>/dev/null | tail -n +11 | while read -r old_backup; do
    if [ -d "$old_backup" ]; then
        echo "🗑️  Removing old backup: $old_backup"
        rm -rf "$old_backup"
    fi
done

# Generate maintenance report
cat > "$BACKUP_DIR/MAINTENANCE_REPORT.md" << EOF
# SSH Key Maintenance Report
Generated: $(date)
Host: $(hostname)
User: $(whoami)

## Backup Location
$BACKUP_DIR

## Key Inventory
$(ls -la ~/.ssh/*_rsa* 2>/dev/null || echo "No SSH keys found")

## SSH Config Status
$(if [ -f ~/.ssh/config ]; then echo "SSH config exists: ~/.ssh/config"; else echo "No SSH config found"; fi)

## Server Connection Tests
Run these commands to verify connectivity:

### Primary Keys:
ssh -o ConnectTimeout=5 database-apis 'echo "DB APIs: OK"'
ssh -o ConnectTimeout=5 pihole 'echo "PiHole: OK"'
ssh -o ConnectTimeout=5 akamai 'echo "Akamai: OK"'

### Emergency Keys (if deployed):
ssh -i ~/.ssh/emergency_homelab_rsa -o ConnectTimeout=5 cal@10.10.0.16 'echo "Emergency Home: OK"'
ssh -i ~/.ssh/emergency_cloud_rsa -o ConnectTimeout=5 root@172.237.147.99 'echo "Emergency Cloud: OK"'

## Next Maintenance Due
$(date -d '+3 months')

## Key Rotation Schedule
- Home lab keys: Annual (generated $(date -r ~/.ssh/homelab_rsa 2>/dev/null || echo "Not found"))
- Cloud keys: Annual (generated $(date -r ~/.ssh/cloud_servers_rsa 2>/dev/null || echo "Not found"))
- Emergency keys: Bi-annual
EOF

echo "✅ Maintenance backup completed"
echo "📄 Report saved: $BACKUP_DIR/MAINTENANCE_REPORT.md"
echo ""
echo "💡 Schedule this script to run monthly via cron:"
echo "   0 2 1 * * /path/to/ssh_key_maintenance.sh"
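The key-age check in the script boils down to integer arithmetic over the file's modification time; a self-contained sketch of just that calculation (using a temp file as a stand-in for a key):

```shell
#!/bin/sh
# Compute a file's age in whole days from its modification time,
# the same arithmetic the maintenance script applies to each key.
f=$(mktemp)   # stand-in for an SSH key file
age_days=$(( ($(date +%s) - $(stat -c %Y "$f")) / 86400 ))
echo "$age_days"   # a freshly created file is 0 days old
rm -f "$f"
```

Note that `stat -c %Y` is the GNU coreutils form; on BSD/macOS the equivalent is `stat -f %m`.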
496 networking/troubleshooting.md Normal file
@@ -0,0 +1,496 @@
# Networking Infrastructure Troubleshooting Guide

## SSH Connection Issues

### SSH Authentication Failures
**Symptoms**: Permission denied, connection refused, timeout

**Diagnosis**:
```bash
# Verbose SSH debugging
ssh -vvv user@host

# Test different authentication methods
ssh -o PasswordAuthentication=no user@host
ssh -o PubkeyAuthentication=yes user@host

# Check local key files
ls -la ~/.ssh/
ssh-keygen -lf ~/.ssh/homelab_rsa.pub
```

**Solutions**:
```bash
# Re-deploy SSH keys
ssh-copy-id -i ~/.ssh/homelab_rsa.pub user@host
ssh-copy-id -i ~/.ssh/emergency_homelab_rsa.pub user@host

# Fix key permissions
chmod 600 ~/.ssh/homelab_rsa
chmod 644 ~/.ssh/homelab_rsa.pub
chmod 700 ~/.ssh

# Fix remote authorized_keys permissions
ssh user@host 'chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'
```

### SSH Service Issues
**Symptoms**: Connection refused, service not running

**Diagnosis**:
```bash
# Check SSH service status
systemctl status sshd
ss -tlnp | grep :22

# Test port connectivity
nc -zv host 22
nmap -p 22 host
```

**Solutions**:
```bash
# Restart SSH service
sudo systemctl restart sshd
sudo systemctl enable sshd

# Check firewall
sudo ufw status
sudo ufw allow ssh

# Verify effective SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"
```

## Network Connectivity Problems

### Basic Network Troubleshooting
**Symptoms**: Cannot reach hosts, timeouts, routing issues

**Diagnosis**:
```bash
# Basic connectivity tests
ping host
traceroute host
mtr host

# Check local network configuration
ip addr show
ip route show
cat /etc/resolv.conf
```

**Solutions**:
```bash
# Restart networking
sudo systemctl restart networking
sudo netplan apply  # Ubuntu

# Reset network interface
sudo ip link set eth0 down
sudo ip link set eth0 up

# Restore the default gateway
sudo ip route add default via 10.10.0.1
```

### DNS Resolution Issues
**Symptoms**: Cannot resolve hostnames, slow resolution

**Diagnosis**:
```bash
# Test DNS resolution
nslookup google.com
dig google.com
host google.com

# Check DNS servers
systemd-resolve --status
cat /etc/resolv.conf
```

**Solutions**:
```bash
# Temporary DNS fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

# Restart DNS services
sudo systemctl restart systemd-resolved

# Flush DNS cache
sudo systemd-resolve --flush-caches
```

## Reverse Proxy and Load Balancer Issues

### Nginx Configuration Problems
**Symptoms**: 502 Bad Gateway, 503 Service Unavailable, SSL errors

**Diagnosis**:
```bash
# Check Nginx status and logs
systemctl status nginx
sudo tail -f /var/log/nginx/error.log
sudo tail -f /var/log/nginx/access.log

# Test Nginx configuration
sudo nginx -t
sudo nginx -T  # Show full configuration
```

**Solutions**:
```bash
# Reload Nginx configuration
sudo nginx -s reload

# Check upstream servers
curl -I http://backend-server:port
telnet backend-server port

# Fix common configuration issues
sudo nano /etc/nginx/sites-available/default
# Check proxy_pass URLs and upstream definitions
```

### SSL/TLS Certificate Issues
**Symptoms**: Certificate warnings, expired certificates, connection errors

**Diagnosis**:
```bash
# Check certificate validity
openssl s_client -connect host:443 -servername host
openssl x509 -in /etc/ssl/certs/cert.pem -text -noout

# Check certificate expiry
openssl x509 -in /etc/ssl/certs/cert.pem -noout -dates
```

**Solutions**:
```bash
# Renew Let's Encrypt certificates
sudo certbot renew --dry-run
sudo certbot renew --force-renewal

# Generate a self-signed certificate
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout /etc/ssl/private/selfsigned.key \
  -out /etc/ssl/certs/selfsigned.crt
```

## Network Storage Issues

### CIFS/SMB Mount Problems
**Symptoms**: Mount failures, connection timeouts, permission errors

**Diagnosis**:
```bash
# Test SMB connectivity
smbclient -L //nas-server -U username
testparm  # Test Samba configuration

# Check mount status
mount | grep cifs
df -h | grep cifs
```

**Solutions**:
```bash
# Remount, pinning the SMB protocol version explicitly
sudo mount -t cifs //server/share /mnt/point -o username=user,password=pass,vers=3.0

# Fix mount options in /etc/fstab
//server/share /mnt/point cifs credentials=/etc/cifs/credentials,uid=1000,gid=1000,iocharset=utf8,file_mode=0644,dir_mode=0755,cache=strict,_netdev 0 0

# Test credentials
sudo cat /etc/cifs/credentials
# Should contain: username=, password=, domain=
```
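The credentials file referenced above is just three `key=value` lines with owner-only permissions; a sketch that writes one to a user-level path so it can be tried without root (`CRED_DIR`, the username, and the domain are assumptions — on a real server, use `/etc/cifs` with sudo):

```shell
#!/bin/sh
# Write a CIFS credentials file with owner-only permissions.
# CRED_DIR is an assumption; on a real server use /etc/cifs with sudo.
CRED_DIR="${CRED_DIR:-$HOME/.config/cifs}"
mkdir -p "$CRED_DIR"
umask 077   # ensure the file is created without group/other access
printf 'username=%s\npassword=%s\ndomain=%s\n' cal CHANGE_ME WORKGROUP \
    > "$CRED_DIR/credentials"
chmod 600 "$CRED_DIR/credentials"
```

Keeping credentials out of `/etc/fstab` itself means the mount options stay readable without exposing the password.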
### NFS Mount Issues
**Symptoms**: Stale file handles, mount hangs, permission denied

**Diagnosis**:
```bash
# Check NFS services
systemctl status nfs-client.target
showmount -e nfs-server

# Test NFS connectivity
rpcinfo -p nfs-server
```

**Solutions**:
```bash
# Restart NFS services
sudo systemctl restart nfs-client.target

# Remount NFS shares
sudo umount /mnt/nfs-share
sudo mount -t nfs server:/path /mnt/nfs-share

# Fix stale file handles
sudo umount -f /mnt/nfs-share
sudo mount /mnt/nfs-share
```

## Firewall and Security Issues

### Port Access Problems
**Symptoms**: Connection refused, filtered ports, blocked services

**Diagnosis**:
```bash
# Check firewall status
sudo ufw status verbose
sudo iptables -L -n -v

# Test port accessibility
nc -zv host port
nmap -p port host
```

**Solutions**:
```bash
# Open required ports
sudo ufw allow ssh
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw allow from 10.10.0.0/24

# Reset the firewall if needed
sudo ufw --force reset
sudo ufw enable
```

### Network Security Issues
**Symptoms**: Unauthorized access, suspicious traffic, security alerts

**Diagnosis**:
```bash
# Check active connections
ss -tuln
netstat -tuln

# Review logs for security events
sudo tail -f /var/log/auth.log
sudo tail -f /var/log/syslog | grep -i security
```

**Solutions**:
```bash
# Block suspicious IPs
sudo ufw deny from suspicious-ip

# Update SSH security
sudo nano /etc/ssh/sshd_config
# Set: PasswordAuthentication no, PermitRootLogin no
sudo systemctl restart sshd
```
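Blocking brute-force attempts by hand does not scale; the fail2ban setup mentioned elsewhere in this repo automates it. A minimal `jail.local` sketch (the retry count and ban times are assumptions, not values from this repo):

```
# /etc/fail2ban/jail.local (sketch)
[sshd]
enabled  = true
port     = ssh
maxretry = 5
findtime = 10m
bantime  = 1h
```

Apply with `sudo systemctl restart fail2ban` and inspect current bans with `sudo fail2ban-client status sshd`.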
## Service Discovery and DNS Issues

### Local DNS Problems
**Symptoms**: Services unreachable by hostname, DNS timeouts

**Diagnosis**:
```bash
# Test local DNS resolution
nslookup service.homelab.local
dig @10.10.0.16 service.homelab.local

# Check DNS server status
systemctl status bind9  # or named
```

**Solutions**:
```bash
# Add to /etc/hosts as a temporary fix
echo "10.10.0.100 service.homelab.local" | sudo tee -a /etc/hosts

# Restart DNS services
sudo systemctl restart bind9
sudo systemctl restart systemd-resolved
```

### Container Networking Issues
**Symptoms**: Containers cannot communicate, service discovery fails

**Diagnosis**:
```bash
# Check Docker networks
docker network ls
docker network inspect bridge

# Test container connectivity
docker exec container1 ping container2
docker exec container1 nslookup container2
```

**Solutions**:
```bash
# Create a custom network (containers on it resolve each other by name)
docker network create --driver bridge app-network
docker run --network app-network container

# Fix DNS in containers
docker run --dns 8.8.8.8 container
```

## Performance Issues

### Network Latency Problems
**Symptoms**: Slow response times, timeouts, poor performance

**Diagnosis**:
```bash
# Measure network latency
ping -c 100 host
mtr --report host

# Check network interface stats
ip -s link show
cat /proc/net/dev
```

**Solutions**:
```bash
# Raise socket buffer limits
echo 'net.core.rmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.core.wmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Check for network congestion
iftop
nethogs
```

### Bandwidth Issues
**Symptoms**: Slow transfers, network congestion, dropped packets

**Diagnosis**:
```bash
# Test bandwidth
iperf3 -s            # Server
iperf3 -c server-ip  # Client

# Check interface utilization
vnstat -i eth0
```

**Solutions**:
```bash
# Implement QoS if needed
sudo tc qdisc add dev eth0 root fq_codel

# Increase NIC ring buffer sizes
sudo ethtool -G eth0 rx 4096 tx 4096
```

## Emergency Recovery Procedures

### Network Emergency Recovery
**Complete network failure recovery**:
```bash
# Reset all network configuration
sudo systemctl stop networking
sudo ip addr flush eth0
sudo ip route flush table main
sudo systemctl start networking

# Manual network configuration
sudo ip addr add 10.10.0.100/24 dev eth0
sudo ip route add default via 10.10.0.1
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
```

### SSH Emergency Access
**When locked out of systems**:
```bash
# Use the emergency SSH key
ssh -i ~/.ssh/emergency_homelab_rsa user@host

# Via console access (if available)
# Use hypervisor console or physical access

# Temporarily re-enable password auth
sudo sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
sudo systemctl restart sshd
```

### Service Recovery
**Critical service restoration**:
```bash
# Restart all network services
sudo systemctl restart networking
sudo systemctl restart nginx
sudo systemctl restart sshd

# Emergency firewall disable
sudo ufw disable  # CAUTION: Only for troubleshooting

# Service-specific recovery
sudo systemctl restart docker
sudo systemctl restart systemd-resolved
```

## Monitoring and Prevention

### Network Health Monitoring
```bash
#!/bin/bash
# network-monitor.sh
CRITICAL_HOSTS="10.10.0.1 10.10.0.16 nas.homelab.local"
CRITICAL_SERVICES="https://homelab.local http://proxmox.homelab.local:8006"

for host in $CRITICAL_HOSTS; do
    if ! ping -c1 -W5 "$host" >/dev/null 2>&1; then
        echo "ALERT: $host unreachable" | logger -t network-monitor
    fi
done

for service in $CRITICAL_SERVICES; do
    if ! curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
        echo "ALERT: $service unavailable" | logger -t network-monitor
    fi
done
```

### Automated Recovery Scripts
```bash
#!/bin/bash
# network-recovery.sh
if ! ping -c1 8.8.8.8 >/dev/null 2>&1; then
    echo "Network down, attempting recovery..."
    sudo systemctl restart networking
    sleep 10
    if ping -c1 8.8.8.8 >/dev/null 2>&1; then
        echo "Network recovered"
    else
        echo "Manual intervention required"
    fi
fi
```
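Monitoring scripts like the ones above only help if they actually run on a schedule; a cron fragment is the simplest way (the install path is an assumption):

```
# /etc/cron.d/network-monitor (sketch — adjust the script path)
*/5 * * * * root /usr/local/bin/network-monitor.sh
```

Alerts land in syslog via `logger -t network-monitor`, so `journalctl -t network-monitor` shows the history.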
## Quick Reference Commands

### Network Diagnostics
```bash
# Connectivity tests
ping host
traceroute host
mtr host
nc -zv host port

# Service checks
systemctl status networking
systemctl status nginx
systemctl status sshd

# Network configuration
ip addr show
ip route show
ss -tuln
```

### Emergency Commands
```bash
# Network restart
sudo systemctl restart networking

# SSH emergency access
ssh -i ~/.ssh/emergency_homelab_rsa user@host

# Firewall quick disable (emergency only)
sudo ufw disable

# DNS quick fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
```

This troubleshooting guide provides comprehensive solutions for common networking issues in home lab environments.
@@ -1,26 +0,0 @@
# Docker Patterns

## Container Best Practices
- Use multi-stage builds for production images
- Minimize layer count and image size
- Run containers as non-root users
- Use specific version tags, avoid `latest`
- Implement health checks

## Common Patterns
- **Multi-service applications**: Use docker-compose for local development
- **Production deployments**: Single container per service with orchestration
- **Development environments**: Volume mounts for code changes
- **CI/CD integration**: Build, test, and push in pipeline stages

## Security Considerations
- Scan images for vulnerabilities
- Use distroless or minimal base images
- Implement resource limits
- Isolate networks between services

## Related Documentation
- Examples: `/examples/docker/multi-stage-builds.md`
- Examples: `/examples/docker/compose-patterns.md`
- Reference: `/reference/docker/troubleshooting.md`
- Reference: `/reference/docker/security-checklist.md`
@@ -1,32 +0,0 @@
# Networking Patterns

## Infrastructure Setup
- **Reverse proxy** configuration (Nginx/Traefik)
- **Load balancing** strategies and health checks
- **SSL/TLS termination** and certificate management
- **Network segmentation** and VLANs

## Service Discovery
- **DNS-based** service resolution
- **Container networking** with Docker networks
- **Service mesh** patterns for microservices
- **API gateway** implementation

## Security Patterns
- **Firewall rules** and port management
- **VPN setup** for remote access
- **Zero-trust networking** principles
- **Network monitoring** and intrusion detection

## Performance Optimization
- **CDN integration** for static assets
- **Connection pooling** and keep-alives
- **Bandwidth management** and QoS
- **Caching strategies** at the network level

## Related Documentation
- Examples: `/examples/networking/nginx-config.md`
- Examples: `/examples/networking/vpn-setup.md`
- Examples: `/examples/networking/load-balancing.md`
- Reference: `/reference/networking/troubleshooting.md`
- Reference: `/reference/networking/security.md`
@@ -1,66 +0,0 @@
# Virtual Machine Management Patterns

## Automated Provisioning
- **Cloud-init deployment** - Fully automated VM provisioning from first boot
- **Post-install scripts** - Standardized configuration for existing VMs
- **SSH key management** - Automated key deployment with emergency backup
- **Security hardening** - Password auth disabled, firewall configured

## VM Provisioning Strategies

### Template-Based Deployment
- **Ubuntu Server templates** optimized for home lab environments
- **Resource allocation** sizing and planning
- **Network configuration** and VLAN assignment (10.10.0.x networks)
- **Storage provisioning** and disk management

### Infrastructure as Code
- **Cloud-init templates** for repeatable VM creation
- **Bash provisioning scripts** for existing infrastructure
- **SSH key integration** with existing homelab key management
- **Docker environment** setup with user permissions

## Lifecycle Management
- **Automated provisioning** with infrastructure as code
- **Configuration management** with standardized scripts
- **Snapshot management** and rollback strategies
- **Scaling policies** for resource optimization

## Monitoring & Maintenance
- **Resource monitoring** (CPU, memory, disk, network)
- **Health checks** and alerting systems
- **Patch management** and update strategies
- **Performance tuning** and optimization

## Backup & Recovery
- **VM-level backups** vs **application-level backups**
- **Disaster recovery** planning and testing
- **High availability** configurations
- **Migration strategies** between hosts

## Implementation Workflows

### New VM Creation (Recommended)
1. **Create VM in Proxmox** with cloud-init support
2. **Apply cloud-init template** (`scripts/vm-management/cloud-init-user-data.yaml`)
3. **Start VM** - fully automated provisioning
4. **Verify setup** via SSH key authentication

### Existing VM Configuration
1. **Run post-install script** (`scripts/vm-management/vm-post-install.sh <ip> <user>`)
2. **Automated provisioning** handles updates, SSH keys, and Docker
3. **Security hardening** applied automatically
4. **Test connectivity** and verify the Docker installation

## Security Architecture
- **SSH key-based authentication** only (passwords disabled)
- **Emergency key backup** for failover access
- **User privilege separation** (sudo required, docker group)
- **Automatic security updates** configured
- **Network isolation** ready (10.10.0.x internal network)

## Related Documentation
- **Implementation**: `scripts/vm-management/README.md` - Complete setup guides
- **SSH Keys**: `patterns/networking/ssh-key-management.md` - Key lifecycle management
- **Examples**: `examples/networking/ssh-homelab-setup.md` - SSH integration patterns
- **Reference**: `reference/vm-management/troubleshooting.md` - Common issues and solutions
@@ -1,498 +0,0 @@
#!/usr/bin/env python3
"""
Tdarr API Monitoring Script

Monitors Tdarr server via its web API endpoints:
- Server status and health
- Queue status and statistics
- Node status and performance
- Library scan progress
- Worker activity

Usage:
    python3 tdarr_monitor.py --server http://10.10.0.43:8265 --check all
    python3 tdarr_monitor.py --server http://10.10.0.43:8265 --check queue
    python3 tdarr_monitor.py --server http://10.10.0.43:8265 --check nodes
"""

import argparse
import json
import logging
import sys
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Dict, List, Optional, Any
from urllib.parse import urljoin

import requests


@dataclass
class ServerStatus:
    timestamp: str
    server_url: str
    status: str
    error: Optional[str] = None
    version: Optional[str] = None
    server_id: Optional[str] = None
    uptime: Optional[str] = None
    system_info: Optional[Dict[str, Any]] = None


@dataclass
class QueueStats:
    total_files: int
    queued: int
    processing: int
    completed: int
    queue_items: List[Dict[str, Any]]


@dataclass
class QueueStatus:
    timestamp: str
    queue_stats: Optional[QueueStats] = None
    error: Optional[str] = None


@dataclass
class NodeInfo:
    id: Optional[str]
    nodeName: Optional[str]
    status: str
    lastSeen: Optional[int]
    version: Optional[str]
    platform: Optional[str]
    workers: Dict[str, int]
    processing: List[Dict[str, Any]]


@dataclass
class NodeSummary:
    total_nodes: int
    online_nodes: int
    offline_nodes: int
    online_details: List[NodeInfo]
    offline_details: List[NodeInfo]


@dataclass
class NodeStatus:
    timestamp: str
    nodes: List[Dict[str, Any]]
|
|
||||||
node_summary: Optional[NodeSummary] = None
|
|
||||||
error: Optional[str] = None
|
|
||||||
|
|
||||||
|
|
||||||
@dataclass
|
|
||||||
class LibraryInfo:
|
|
||||||
name: Optional[str]
|
|
||||||
path: Optional[str]
|
|
||||||
file_count: int
|
|
||||||
scan_progress: int
|
|
||||||
last_scan: Optional[str]
|
|
||||||
is_scanning: bool
|
|
||||||
|
|
||||||
|
|
||||||
@dataclass
|
|
||||||
class ScanStatus:
|
|
||||||
total_libraries: int
|
|
||||||
total_files: int
|
|
||||||
scanning_libraries: int
|
|
||||||
|
|
||||||
|
|
||||||
@dataclass
|
|
||||||
class LibraryStatus:
|
|
||||||
timestamp: str
|
|
||||||
libraries: List[LibraryInfo]
|
|
||||||
scan_status: Optional[ScanStatus] = None
|
|
||||||
error: Optional[str] = None
|
|
||||||
|
|
||||||
|
|
||||||
@dataclass
|
|
||||||
class Statistics:
|
|
||||||
total_transcodes: int
|
|
||||||
space_saved: int
|
|
||||||
total_files_processed: int
|
|
||||||
failed_transcodes: int
|
|
||||||
processing_speed: int
|
|
||||||
eta: Optional[str]
|
|
||||||
|
|
||||||
|
|
||||||
@dataclass
|
|
||||||
class StatisticsStatus:
|
|
||||||
timestamp: str
|
|
||||||
statistics: Optional[Statistics] = None
|
|
||||||
error: Optional[str] = None
|
|
||||||
|
|
||||||
|
|
||||||
@dataclass
|
|
||||||
class HealthCheck:
|
|
||||||
status: str
|
|
||||||
healthy: bool
|
|
||||||
online_count: Optional[int] = None
|
|
||||||
total_count: Optional[int] = None
|
|
||||||
accessible: Optional[bool] = None
|
|
||||||
total_items: Optional[int] = None
|
|
||||||
|
|
||||||
|
|
||||||
@dataclass
|
|
||||||
class HealthStatus:
|
|
||||||
timestamp: str
|
|
||||||
overall_status: str
|
|
||||||
checks: Dict[str, HealthCheck]
|
|
||||||
|
|
||||||
|
|
||||||
class TdarrMonitor:
|
|
||||||
def __init__(self, server_url: str, timeout: int = 30):
|
|
||||||
"""Initialize Tdarr monitor with server URL."""
|
|
||||||
self.server_url = server_url.rstrip('/')
|
|
||||||
self.timeout = timeout
|
|
||||||
self.session = requests.Session()
|
|
||||||
|
|
||||||
# Configure logging
|
|
||||||
logging.basicConfig(
|
|
||||||
level=logging.INFO,
|
|
||||||
format='%(asctime)s - %(levelname)s - %(message)s'
|
|
||||||
)
|
|
||||||
self.logger = logging.getLogger(__name__)
|
|
||||||
|
|
||||||
def _make_request(self, endpoint: str) -> Optional[Dict[str, Any]]:
|
|
||||||
"""Make HTTP request to Tdarr API endpoint."""
|
|
||||||
url = urljoin(self.server_url, endpoint)
|
|
||||||
|
|
||||||
try:
|
|
||||||
response = self.session.get(url, timeout=self.timeout)
|
|
||||||
response.raise_for_status()
|
|
||||||
return response.json()
|
|
||||||
|
|
||||||
except requests.exceptions.RequestException as e:
|
|
||||||
self.logger.error(f"Request failed for {url}: {e}")
|
|
||||||
return None
|
|
||||||
except json.JSONDecodeError as e:
|
|
||||||
self.logger.error(f"JSON decode failed for {url}: {e}")
|
|
||||||
return None
|
|
||||||
|
|
||||||
def get_server_status(self) -> ServerStatus:
|
|
||||||
"""Get overall server status and configuration."""
|
|
||||||
timestamp = datetime.now().isoformat()
|
|
||||||
|
|
||||||
# Try to get server info from API
|
|
||||||
data = self._make_request('/api/v2/get-server-info')
|
|
||||||
if data:
|
|
||||||
return ServerStatus(
|
|
||||||
timestamp=timestamp,
|
|
||||||
server_url=self.server_url,
|
|
||||||
status='online',
|
|
||||||
version=data.get('version'),
|
|
||||||
server_id=data.get('serverId'),
|
|
||||||
uptime=data.get('uptime'),
|
|
||||||
system_info=data.get('systemInfo', {})
|
|
||||||
)
|
|
||||||
else:
|
|
||||||
return ServerStatus(
|
|
||||||
timestamp=timestamp,
|
|
||||||
server_url=self.server_url,
|
|
||||||
status='offline',
|
|
||||||
error='Unable to connect to Tdarr server'
|
|
||||||
)
|
|
||||||
|
|
||||||
def get_queue_status(self) -> QueueStatus:
|
|
||||||
"""Get transcoding queue status and statistics."""
|
|
||||||
timestamp = datetime.now().isoformat()
|
|
||||||
|
|
||||||
# Get queue information
|
|
||||||
data = self._make_request('/api/v2/get-queue')
|
|
||||||
if data:
|
|
||||||
queue_data = data.get('queue', [])
|
|
||||||
|
|
||||||
# Calculate queue statistics
|
|
||||||
total_files = len(queue_data)
|
|
||||||
queued_files = len([f for f in queue_data if f.get('status') == 'Queued'])
|
|
||||||
processing_files = len([f for f in queue_data if f.get('status') == 'Processing'])
|
|
||||||
completed_files = len([f for f in queue_data if f.get('status') == 'Completed'])
|
|
||||||
|
|
||||||
queue_stats = QueueStats(
|
|
||||||
total_files=total_files,
|
|
||||||
queued=queued_files,
|
|
||||||
processing=processing_files,
|
|
||||||
completed=completed_files,
|
|
||||||
queue_items=queue_data[:10] # First 10 items for details
|
|
||||||
)
|
|
||||||
|
|
||||||
return QueueStatus(
|
|
||||||
timestamp=timestamp,
|
|
||||||
queue_stats=queue_stats
|
|
||||||
)
|
|
||||||
else:
|
|
||||||
return QueueStatus(
|
|
||||||
timestamp=timestamp,
|
|
||||||
error='Unable to fetch queue data'
|
|
||||||
)
|
|
||||||
|
|
||||||
def get_node_status(self) -> NodeStatus:
|
|
||||||
"""Get status of all connected nodes."""
|
|
||||||
timestamp = datetime.now().isoformat()
|
|
||||||
|
|
||||||
# Get nodes information
|
|
||||||
data = self._make_request('/api/v2/get-nodes')
|
|
||||||
if data:
|
|
||||||
nodes = data.get('nodes', [])
|
|
||||||
|
|
||||||
# Process node information
|
|
||||||
online_nodes = []
|
|
||||||
offline_nodes = []
|
|
||||||
|
|
||||||
for node in nodes:
|
|
||||||
node_info = NodeInfo(
|
|
||||||
id=node.get('_id'),
|
|
||||||
nodeName=node.get('nodeName'),
|
|
||||||
status='online' if node.get('lastSeen', 0) > 0 else 'offline',
|
|
||||||
lastSeen=node.get('lastSeen'),
|
|
||||||
version=node.get('version'),
|
|
||||||
platform=node.get('platform'),
|
|
||||||
workers={
|
|
||||||
'cpu': node.get('workers', {}).get('CPU', 0),
|
|
||||||
'gpu': node.get('workers', {}).get('GPU', 0)
|
|
||||||
},
|
|
||||||
processing=node.get('currentJobs', [])
|
|
||||||
)
|
|
||||||
|
|
||||||
if node_info.status == 'online':
|
|
||||||
online_nodes.append(node_info)
|
|
||||||
else:
|
|
||||||
offline_nodes.append(node_info)
|
|
||||||
|
|
||||||
node_summary = NodeSummary(
|
|
||||||
total_nodes=len(nodes),
|
|
||||||
online_nodes=len(online_nodes),
|
|
||||||
offline_nodes=len(offline_nodes),
|
|
||||||
online_details=online_nodes,
|
|
||||||
offline_details=offline_nodes
|
|
||||||
)
|
|
||||||
|
|
||||||
return NodeStatus(
|
|
||||||
timestamp=timestamp,
|
|
||||||
nodes=nodes,
|
|
||||||
node_summary=node_summary
|
|
||||||
)
|
|
||||||
else:
|
|
||||||
return NodeStatus(
|
|
||||||
timestamp=timestamp,
|
|
||||||
nodes=[],
|
|
||||||
error='Unable to fetch node data'
|
|
||||||
)
|
|
||||||
|
|
||||||
def get_library_status(self) -> LibraryStatus:
|
|
||||||
"""Get library scan status and file statistics."""
|
|
||||||
timestamp = datetime.now().isoformat()
|
|
||||||
|
|
||||||
# Get library information
|
|
||||||
data = self._make_request('/api/v2/get-libraries')
|
|
||||||
if data:
|
|
||||||
libraries = data.get('libraries', [])
|
|
||||||
|
|
||||||
library_stats = []
|
|
||||||
total_files = 0
|
|
||||||
|
|
||||||
for lib in libraries:
|
|
||||||
lib_info = LibraryInfo(
|
|
||||||
name=lib.get('name'),
|
|
||||||
path=lib.get('path'),
|
|
||||||
file_count=lib.get('totalFiles', 0),
|
|
||||||
scan_progress=lib.get('scanProgress', 0),
|
|
||||||
last_scan=lib.get('lastScan'),
|
|
||||||
is_scanning=lib.get('isScanning', False)
|
|
||||||
)
|
|
||||||
library_stats.append(lib_info)
|
|
||||||
total_files += lib_info.file_count
|
|
||||||
|
|
||||||
scan_status = ScanStatus(
|
|
||||||
total_libraries=len(libraries),
|
|
||||||
total_files=total_files,
|
|
||||||
scanning_libraries=len([l for l in library_stats if l.is_scanning])
|
|
||||||
)
|
|
||||||
|
|
||||||
return LibraryStatus(
|
|
||||||
timestamp=timestamp,
|
|
||||||
libraries=library_stats,
|
|
||||||
scan_status=scan_status
|
|
||||||
)
|
|
||||||
else:
|
|
||||||
return LibraryStatus(
|
|
||||||
timestamp=timestamp,
|
|
||||||
libraries=[],
|
|
||||||
error='Unable to fetch library data'
|
|
||||||
)
|
|
||||||
|
|
||||||
def get_statistics(self) -> StatisticsStatus:
|
|
||||||
"""Get overall Tdarr statistics and health metrics."""
|
|
||||||
timestamp = datetime.now().isoformat()
|
|
||||||
|
|
||||||
# Get statistics
|
|
||||||
data = self._make_request('/api/v2/get-stats')
|
|
||||||
if data:
|
|
||||||
stats = data.get('stats', {})
|
|
||||||
statistics = Statistics(
|
|
||||||
total_transcodes=stats.get('totalTranscodes', 0),
|
|
||||||
space_saved=stats.get('spaceSaved', 0),
|
|
||||||
total_files_processed=stats.get('totalFilesProcessed', 0),
|
|
||||||
failed_transcodes=stats.get('failedTranscodes', 0),
|
|
||||||
processing_speed=stats.get('processingSpeed', 0),
|
|
||||||
eta=stats.get('eta')
|
|
||||||
)
|
|
||||||
|
|
||||||
return StatisticsStatus(
|
|
||||||
timestamp=timestamp,
|
|
||||||
statistics=statistics
|
|
||||||
)
|
|
||||||
else:
|
|
||||||
return StatisticsStatus(
|
|
||||||
timestamp=timestamp,
|
|
||||||
error='Unable to fetch statistics'
|
|
||||||
)
|
|
||||||
|
|
||||||
def health_check(self) -> HealthStatus:
|
|
||||||
"""Perform comprehensive health check."""
|
|
||||||
timestamp = datetime.now().isoformat()
|
|
||||||
|
|
||||||
# Server connectivity
|
|
||||||
server_status = self.get_server_status()
|
|
||||||
server_check = HealthCheck(
|
|
||||||
status=server_status.status,
|
|
||||||
healthy=server_status.status == 'online'
|
|
||||||
)
|
|
||||||
|
|
||||||
# Node connectivity
|
|
||||||
node_status = self.get_node_status()
|
|
||||||
nodes_healthy = (
|
|
||||||
node_status.node_summary.online_nodes > 0 if node_status.node_summary else False
|
|
||||||
) and not node_status.error
|
|
||||||
|
|
||||||
nodes_check = HealthCheck(
|
|
||||||
status='online' if nodes_healthy else 'offline',
|
|
||||||
healthy=nodes_healthy,
|
|
||||||
online_count=node_status.node_summary.online_nodes if node_status.node_summary else 0,
|
|
||||||
total_count=node_status.node_summary.total_nodes if node_status.node_summary else 0
|
|
||||||
)
|
|
||||||
|
|
||||||
# Queue status
|
|
||||||
queue_status = self.get_queue_status()
|
|
||||||
queue_healthy = not queue_status.error
|
|
||||||
queue_check = HealthCheck(
|
|
||||||
status='accessible' if queue_healthy else 'error',
|
|
||||||
healthy=queue_healthy,
|
|
||||||
accessible=queue_healthy,
|
|
||||||
total_items=queue_status.queue_stats.total_files if queue_status.queue_stats else 0
|
|
||||||
)
|
|
||||||
|
|
||||||
checks = {
|
|
||||||
'server': server_check,
|
|
||||||
'nodes': nodes_check,
|
|
||||||
'queue': queue_check
|
|
||||||
}
|
|
||||||
|
|
||||||
# Determine overall health
|
|
||||||
all_checks_healthy = all(check.healthy for check in checks.values())
|
|
||||||
overall_status = 'healthy' if all_checks_healthy else 'unhealthy'
|
|
||||||
|
|
||||||
return HealthStatus(
|
|
||||||
timestamp=timestamp,
|
|
||||||
overall_status=overall_status,
|
|
||||||
checks=checks
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
def main():
|
|
||||||
parser = argparse.ArgumentParser(description='Monitor Tdarr server via API')
|
|
||||||
parser.add_argument('--server', required=True, help='Tdarr server URL (e.g., http://10.10.0.43:8265)')
|
|
||||||
parser.add_argument('--check', choices=['all', 'status', 'queue', 'nodes', 'libraries', 'stats', 'health'],
|
|
||||||
default='health', help='Type of check to perform')
|
|
||||||
parser.add_argument('--timeout', type=int, default=30, help='Request timeout in seconds')
|
|
||||||
parser.add_argument('--output', choices=['json', 'pretty'], default='pretty', help='Output format')
|
|
||||||
parser.add_argument('--verbose', action='store_true', help='Enable verbose logging')
|
|
||||||
|
|
||||||
args = parser.parse_args()
|
|
||||||
|
|
||||||
if args.verbose:
|
|
||||||
logging.getLogger().setLevel(logging.DEBUG)
|
|
||||||
|
|
||||||
# Initialize monitor
|
|
||||||
monitor = TdarrMonitor(args.server, args.timeout)
|
|
||||||
|
|
||||||
# Perform requested check
|
|
||||||
result = None
|
|
||||||
if args.check == 'all':
|
|
||||||
result = {
|
|
||||||
'server_status': monitor.get_server_status(),
|
|
||||||
'queue_status': monitor.get_queue_status(),
|
|
||||||
'node_status': monitor.get_node_status(),
|
|
||||||
'library_status': monitor.get_library_status(),
|
|
||||||
'statistics': monitor.get_statistics()
|
|
||||||
}
|
|
||||||
elif args.check == 'status':
|
|
||||||
result = monitor.get_server_status()
|
|
||||||
elif args.check == 'queue':
|
|
||||||
result = monitor.get_queue_status()
|
|
||||||
elif args.check == 'nodes':
|
|
||||||
result = monitor.get_node_status()
|
|
||||||
elif args.check == 'libraries':
|
|
||||||
result = monitor.get_library_status()
|
|
||||||
elif args.check == 'stats':
|
|
||||||
result = monitor.get_statistics()
|
|
||||||
elif args.check == 'health':
|
|
||||||
result = monitor.health_check()
|
|
||||||
|
|
||||||
# Output results
|
|
||||||
if args.output == 'json':
|
|
||||||
# Convert dataclasses to dictionaries for JSON serialization
|
|
||||||
if args.check == 'all':
|
|
||||||
json_result = {}
|
|
||||||
for key, value in result.items():
|
|
||||||
json_result[key] = asdict(value)
|
|
||||||
print(json.dumps(json_result, indent=2))
|
|
||||||
else:
|
|
||||||
print(json.dumps(asdict(result), indent=2))
|
|
||||||
else:
|
|
||||||
# Pretty print format
|
|
||||||
print(f"=== Tdarr Monitor Results - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')} ===")
|
|
||||||
|
|
||||||
if args.check == 'health' or (hasattr(result, 'overall_status') and result.overall_status):
|
|
||||||
health = result if hasattr(result, 'overall_status') else None
|
|
||||||
if health:
|
|
||||||
status = health.overall_status
|
|
||||||
print(f"Overall Status: {status.upper()}")
|
|
||||||
|
|
||||||
if health.checks:
|
|
||||||
print("\nHealth Checks:")
|
|
||||||
for check_name, check_data in health.checks.items():
|
|
||||||
status_icon = "✓" if check_data.healthy else "✗"
|
|
||||||
print(f" {status_icon} {check_name.title()}: {asdict(check_data)}")
|
|
||||||
|
|
||||||
if args.check == 'all':
|
|
||||||
for section, data in result.items():
|
|
||||||
print(f"\n=== {section.replace('_', ' ').title()} ===")
|
|
||||||
print(json.dumps(asdict(data), indent=2))
|
|
||||||
elif args.check != 'health':
|
|
||||||
print(json.dumps(asdict(result), indent=2))
|
|
||||||
|
|
||||||
# Exit with appropriate code
|
|
||||||
if result:
|
|
||||||
# Check for unhealthy status in health check
|
|
||||||
if isinstance(result, HealthStatus) and result.overall_status == 'unhealthy':
|
|
||||||
sys.exit(1)
|
|
||||||
# Check for errors in individual status objects (all status classes except HealthStatus have error attribute)
|
|
||||||
elif (isinstance(result, (ServerStatus, QueueStatus, NodeStatus, LibraryStatus, StatisticsStatus))
|
|
||||||
and result.error):
|
|
||||||
sys.exit(1)
|
|
||||||
# Check for errors in 'all' results
|
|
||||||
elif isinstance(result, dict):
|
|
||||||
for status_obj in result.values():
|
|
||||||
if (isinstance(status_obj, (ServerStatus, QueueStatus, NodeStatus, LibraryStatus, StatisticsStatus))
|
|
||||||
and status_obj.error):
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
sys.exit(0)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
|
||||||
main()
|
|
||||||
@ -1,6 +0,0 @@
#!/bin/bash
# Tdarr Manager - Quick access to Tdarr scheduler controls
# This is a convenience script that forwards to the main manager

SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
exec "${SCRIPT_DIR}/tdarr/tdarr-schedule-manager.sh" "$@"
152
tdarr/CONTEXT.md
Normal file
@ -0,0 +1,152 @@
# Tdarr Transcoding System - Technology Context

## Overview
Tdarr is a distributed transcoding system that converts media files to optimized formats. This implementation uses an intelligent gaming-aware scheduler with unmapped node architecture for optimal performance and system stability.

## Architecture Patterns

### Distributed Unmapped Node Architecture (Recommended)
**Pattern**: Server-Node separation with local high-speed cache
- **Server**: Tdarr Server manages queue, web interface, and coordination
- **Node**: Unmapped nodes with local NVMe cache for processing
- **Benefits**: 3-5x performance improvement, network I/O reduction, linear scaling

**When to Use**:
- Multiple transcoding nodes across the network
- High-performance requirements (10GB+ files)
- Network bandwidth limitations
- Gaming systems requiring GPU priority management

### Configuration Principles
1. **Cache Optimization**: Use local NVMe storage for work directories
2. **Gaming Detection**: Automatic pause during GPU-intensive activities
3. **Resource Isolation**: Container limits prevent kernel-level crashes
4. **Monitoring Integration**: Automated cleanup and Discord notifications

## Core Components

### Gaming-Aware Scheduler
**Purpose**: Automatically manages the Tdarr node to avoid conflicts with gaming
**Location**: `scripts/tdarr-schedule-manager.sh`

**Key Features**:
- Detects gaming processes (Steam, Lutris, Wine, etc.)
- GPU usage monitoring (>15% threshold)
- Configurable time windows
- Automated temporary directory cleanup

**Schedule Format**: `"HOUR_START-HOUR_END:DAYS"`
- `"22-07:daily"` - Overnight transcoding
- `"09-17:1-5"` - Business hours weekdays only
- `"14-16:6,7"` - Weekend afternoon window
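The schedule strings above can be evaluated with a small window test. This is a hypothetical sketch of the logic, not the actual implementation in `tdarr-schedule-manager.sh`; the function name `in_window` and its argument shape are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Sketch of evaluating a "HOUR_START-HOUR_END:DAYS" schedule string.
# in_window SCHEDULE HOUR DOW -> exit 0 if HOUR (0-23) on DOW (1=Mon..7=Sun)
# falls inside the window; handles "daily", day ranges, day lists, and
# windows that wrap past midnight (e.g. 22-07).
in_window() {
  local schedule="$1" hour="$2" dow="$3"
  local hours="${schedule%%:*}" days="${schedule##*:}"
  # Force base-10 so hours like "09" are not parsed as octal
  local start=$((10#${hours%%-*})) end=$((10#${hours##*-}))

  # Day-of-week match: "daily", a range like "1-5", or a list like "6,7"
  case "$days" in
    daily) ;;
    *-*)   local d1="${days%%-*}" d2="${days##*-}"
           (( dow >= d1 && dow <= d2 )) || return 1 ;;
    *)     [[ ",$days," == *",$dow,"* ]] || return 1 ;;
  esac

  # Hour match: a window with start > end wraps past midnight
  if (( start <= end )); then
    (( hour >= start && hour < end ))
  else
    (( hour >= start || hour < end ))
  fi
}

# Example: is 23:00 on a Wednesday inside the overnight window?
in_window "22-07:daily" 23 3 && echo "transcoding allowed"
```

In the real scheduler the current hour and day would come from `date +%H` and `date +%u`; the point of the sketch is the wrap-around hour comparison.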
### Monitoring System
**Purpose**: Prevents staging section timeouts and system instability
**Location**: `scripts/monitoring/tdarr-timeout-monitor.sh`

**Capabilities**:
- Staging timeout detection (300-second hardcoded limit)
- Automatic work directory cleanup
- Discord notifications with user pings
- Log rotation and retention management

### Container Architecture
**Server Configuration**:
```yaml
# Hybrid storage with resource limits
services:
  tdarr:
    image: ghcr.io/haveagitgat/tdarr:latest
    ports: ["8265:8266"]
    volumes:
      - "./tdarr-data:/app/configs"
      - "/mnt/media:/media"
```

**Node Configuration**:
```bash
# Unmapped node with local cache
podman run -d \
  --name tdarr-node-gpu \
  -e nodeType=unmapped \
  -v "/mnt/NV2/tdarr-cache:/cache" \
  --device nvidia.com/gpu=all \
  ghcr.io/haveagitgat/tdarr_node:latest
```

## Implementation Patterns

### Performance Optimization
1. **Local Cache Strategy**: Download → Process → Upload (vs. streaming)
2. **Resource Limits**: Prevent memory exhaustion and kernel crashes
3. **Network Resilience**: CIFS mount options for stability
4. **Automated Cleanup**: Prevent accumulation of stuck directories

### Error Prevention
1. **Plugin Safety**: Null-safe forEach operations `(streams || []).forEach()`
2. **Clean Installation**: Avoid custom plugin mounts causing version conflicts
3. **Container Isolation**: Resource limits prevent system-level crashes
4. **Network Stability**: Unmapped architecture reduces CIFS dependency
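The null-safe forEach pattern from the plugin safety point can be seen in a minimal plugin-style helper. This is an illustrative sketch, not an actual Tdarr plugin: the function name and the ffprobe-style stream shape are assumptions.

```javascript
// Guarded iteration: streams may be undefined or null when ffprobe data
// is missing, so fall back to an empty array instead of crashing.
function countVideoStreams(streams) {
  let count = 0;
  (streams || []).forEach((stream) => {
    if (stream.codec_type === 'video') count += 1;
  });
  return count;
}

// Safe on missing data and on real stream lists alike:
console.log(countVideoStreams(undefined));                 // 0
console.log(countVideoStreams([{ codec_type: 'video' }])); // 1
```

The same `(x || [])` guard applies to any array-valued field a plugin reads from probe data.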

### Gaming Integration
1. **Process Detection**: Monitor for gaming applications and utilities
2. **GPU Threshold**: Stop transcoding when GPU usage >15%
3. **Time Windows**: Respect user-defined allowed transcoding hours
4. **Manual Override**: Direct start/stop commands bypass scheduler
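A sketch of the >15% GPU threshold check, with the metric source decoupled so the comparison is testable. The `nvidia-smi` query flags shown in the comment are the standard ones; the function name is illustrative, not the scheduler's actual API.

```shell
#!/usr/bin/env bash
# Decide whether transcoding should pause: reads a GPU utilization
# percentage on stdin and compares it to a threshold.
gpu_busy() {
  local threshold="$1" util
  read -r util
  (( util > threshold ))
}

# In the real check the utilization would come from, for example:
#   nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits
if echo 42 | gpu_busy 15; then
  echo "GPU busy - pausing transcoding"
fi
```

Separating the reading from the decision makes the threshold logic easy to exercise without a GPU present.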

## Common Workflows

### Initial Setup
1. Start server with "Allow unmapped Nodes" enabled
2. Configure node as unmapped with local cache
3. Install gaming-aware scheduler via cron
4. Set up monitoring system for automated cleanup
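Step 3's cron installation could look like the following crontab fragment. This is a hypothetical entry: the five-minute interval, the script path, the `check` subcommand, and the log path are illustrative assumptions, not documented values.

```
# Re-evaluate gaming/schedule state every 5 minutes (hypothetical entry)
*/5 * * * * /path/to/scripts/tdarr-schedule-manager.sh check >> /var/log/tdarr-scheduler.log 2>&1
```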

### Troubleshooting Patterns
1. **forEach Errors**: Clean plugin installation, avoid custom mounts
2. **Staging Timeouts**: Monitor system handles automatic cleanup
3. **System Crashes**: Convert to unmapped node architecture
4. **Network Issues**: Implement CIFS resilience options

### Performance Tuning
1. **Cache Size**: 100-500GB NVMe for concurrent jobs
2. **Bandwidth**: Unmapped nodes reduce streaming requirements
3. **Scaling**: Linear scaling with additional unmapped nodes
4. **GPU Priority**: Gaming detection ensures responsive system

## Best Practices

### Production Deployment
- Use unmapped node architecture for stability
- Implement comprehensive monitoring
- Configure gaming-aware scheduling for desktop systems
- Set appropriate container resource limits

### Development Guidelines
- Test with internal Tdarr test files first
- Implement null-safety checks in custom plugins
- Use structured logging for troubleshooting
- Separate concerns: scheduling, monitoring, processing

### Security Considerations
- Container isolation prevents system-level failures
- Resource limits protect against memory exhaustion
- Network mount resilience prevents kernel crashes
- Automated cleanup prevents disk space issues

## Migration Patterns

### From Mapped to Unmapped Nodes
1. Enable "Allow unmapped Nodes" in server options
2. Update node configuration (add nodeType=unmapped)
3. Change cache volume to local storage
4. Remove media volume mapping
5. Test workflow and monitor performance

### Plugin System Cleanup
1. Remove all custom plugin mounts
2. Force server restart to regenerate plugin ZIP
3. Restart nodes to download fresh plugins
4. Verify forEach fixes in downloaded plugins

This technology context provides the foundation for implementing, troubleshooting, and optimizing Tdarr transcoding systems in home lab environments.
143
tdarr/examples/tdarr-cifs-troubleshooting-2025-08-11.md
Normal file
@ -0,0 +1,143 @@
# Tdarr CIFS Troubleshooting Session - 2025-08-11

## Problem Statement
The Tdarr unmapped node was experiencing persistent download timeouts around 9:08 PM with large files (31GB+ remuxes), producing "Cancelling" messages and stuck downloads. Downloads would hang for 33+ minutes before timing out, even though the container kept running.

## Initial Hypothesis: Mapped vs Unmapped Node Issue
**Status**: ❌ **DISPROVEN**
- Suspected unmapped node timeout configuration differences
- A Windows PC running a mapped Tdarr node works fine (slow but stable)
- Both mapped and unmapped Linux nodes exhibited identical timeout issues
- **Conclusion**: Architecture type was not the root cause

## Key Insight: Windows vs Linux Performance Difference
**Observation**: The Windows Tdarr node (mapped mode) works without timeouts, while Linux nodes (both mapped and unmapped) fail
**Implication**: Platform-specific issue, likely in the network stack or CIFS implementation

## Root Cause Discovery Process

### Phase 1: Linux Client CIFS Analysis
**Method**: Direct CIFS mount testing on the Tdarr node machine (nobara-pc)

**Initial CIFS Mount Configuration** (problematic):
```bash
//10.10.0.35/media on /mnt/media type cifs (rw,relatime,vers=3.1.1,cache=strict,upcall_target=app,username=root,uid=1000,forceuid,gid=1000,forcegid,addr=10.10.0.35,file_mode=0755,dir_mode=0755,soft,nounix,serverino,mapposix,noperm,reparse=nfs,nativesocket,symlink=native,rsize=4194304,wsize=4194304,bsize=1048576,retrans=1,echo_interval=60,actimeo=30,closetimeo=1,_netdev,x-systemd.automount,x-systemd.device-timeout=10,x-systemd.mount-timeout=30)
```

**Critical Issues Identified**:
- `soft` - Mount fails on timeout instead of retrying indefinitely
- `retrans=1` - Only 1 retry attempt (NFS option, invalid for CIFS)
- `closetimeo=1` - Very short close timeout (1 second)
- `cache=strict` - No local caching, poor performance for large files
- `x-systemd.mount-timeout=30` - 30-second mount timeout

**Optimization Applied**:
```bash
//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,hard,rsize=16777216,wsize=16777216,cache=loose,actimeo=60,echo_interval=30,_netdev,x-systemd.automount,x-systemd.device-timeout=60,x-systemd.mount-timeout=120,noperm 0 0
```

**Performance Testing Results**:
- **Local SSD**: `dd` 800MB in 0.217s (4.0 GB/s) - baseline
- **CIFS 1MB blocks**: 42.7 MB/s - fast, no issues
- **CIFS 4MB blocks**: 205 MB/s - fast, no issues
- **CIFS 8MB blocks**: 83.1 MB/s - **3-minute terminal freeze**

**Critical Discovery**: Block-size dependency causing I/O blocking with large transfers
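The block-size sensitivity above can be reproduced with a small `dd` wrapper. This is a sketch under assumptions: the target path would be a file on the CIFS mount (e.g. under `/mnt/media`), and `conv=fsync` forces the flush that exposed the freeze; the function name is illustrative.

```shell
#!/usr/bin/env bash
# Write COUNT blocks of size BS to PATH, report dd's summary line, and
# clean up. Run against a file on the CIFS mount to compare 1M/4M/8M
# behaviour the way the session above did.
write_test() {
  local path="$1" bs="$2" count="$3"
  dd if=/dev/zero of="$path" bs="$bs" count="$count" conv=fsync 2>&1 | tail -n 1
  rm -f "$path"
}

# Example against local tmp storage (substitute a CIFS path to test the mount):
write_test /tmp/blocktest 1M 8
```

A run that prints its throughput line quickly at 1M/4M but stalls for minutes at 8M reproduces the observed blocking.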

### Phase 2: Tdarr Server-Side Analysis
**Method**: Test Tdarr API download path directly

**API Test Command**:
```bash
curl -X POST "http://10.10.0.43:8265/api/v2/file/download" \
  -H "Content-Type: application/json" \
  -d '{"filePath": "/media/Movies/Jumanji (1995)/Jumanji (1995) Remux-1080p Proper.mkv"}' \
  -o /tmp/tdarr-api-test.mkv
```

**Results**:
- **Performance**: 55.7-58.6 MB/s sustained
- **Progress**: Downloaded 15.3GB of 23GB (66%)
- **Failure**: **Download hung at 66% completion**
- **Timing**: Hung after ~5 minutes (consistent with previous timeout patterns)

### Phase 3: Tdarr Server CIFS Configuration Analysis
**Method**: Examine server-side storage mount

**Server CIFS Mount** (problematic):
```bash
//10.10.0.35/media /mnt/truenas-share cifs credentials=/root/.truenascreds,vers=3.1.1,rsize=4194304,wsize=4194304,cache=strict,actimeo=30,echo_interval=60,noperm 0 0
```

**Server Issues Identified**:
- **Missing `hard`** - Defaults to `soft` mount behavior
- `cache=strict` - No local caching (same issue as client)
- **No retry/timeout extensions** - Uses unreliable kernel defaults
- **No systemd timeout protection**

## Root Cause Confirmed
**Primary Issue**: Tdarr server's CIFS mount to TrueNAS using suboptimal configuration
**Impact**: Large-file streaming via the Tdarr API hangs when the server's CIFS mount hits I/O blocking
**Evidence**: API download hung in the exact same pattern as node timeouts (66% through a large file)

## Solution Strategy
**Fix Tdarr Server CIFS Mount Configuration**:
```bash
//10.10.0.35/media /mnt/truenas-share cifs credentials=/root/.truenascreds,vers=3.1.1,hard,rsize=4194304,wsize=4194304,cache=loose,actimeo=60,echo_interval=30,_netdev,x-systemd.device-timeout=60,x-systemd.mount-timeout=120,noperm 0 0
```

**Key Optimizations**:
- `hard` - Retry indefinitely instead of timing out
- `cache=loose` - Enable local caching for large-file performance
- `actimeo=60` - Longer attribute caching
- `echo_interval=30` - More frequent keep-alives
- Extended systemd timeouts for reliability
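Whether the new options are actually in effect can be confirmed from `/proc/mounts`, which shows the option set the kernel applied rather than what fstab requested. The helper name here is illustrative, not part of the documented tooling.

```shell
#!/usr/bin/env bash
# Check whether a mountpoint's effective options include a given flag.
# Field 2 of /proc/mounts is the mountpoint, field 4 the applied options.
verify_mount() {
  local mountpoint="$1" option="$2"
  awk -v mp="$mountpoint" '$2 == mp { print $4 }' /proc/mounts | grep -qw "$option"
}

verify_mount /mnt/truenas-share hard && echo "hard mount active"
```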
|
## Implementation Steps
|
||||||
|
1. **Update server `/etc/fstab`** with optimized CIFS configuration
|
||||||
|
2. **Remount server storage**:
|
||||||
|
```bash
|
||||||
|
ssh tdarr "sudo umount /mnt/truenas-share"
|
||||||
|
ssh tdarr "sudo systemctl daemon-reload"
|
||||||
|
ssh tdarr "sudo mount /mnt/truenas-share"
|
||||||
|
```
|
||||||
|
3. **Test large file API download** to verify fix
|
||||||
|
4. **Resume Tdarr transcoding** with confidence in large file handling
## Technical Insights

### CIFS vs SMB Protocol Differences

- **Windows nodes**: Use the native SMB implementation (stable)
- **Linux nodes**: Use the kernel CIFS module (prone to I/O blocking when poorly configured)
- **Block size sensitivity**: Large block transfers require careful timeout/retry configuration

### Tdarr Architecture Impact

- **Unmapped nodes**: Download entire files via the API before processing (high bandwidth, vulnerable to server CIFS issues)
- **Mapped nodes**: Stream files during processing (lower bandwidth, still vulnerable to server CIFS issues)
- **The root cause affects both architectures**, since server-side storage access is the bottleneck

### Performance Expectations Post-Fix

- **Consistent 50-100 MB/s** for large file downloads
- **No timeout failures** with properly configured hard mounts
- **Reliable processing** of 31GB+ remux files

## Files Modified

- **Client**: `/etc/fstab` on nobara-pc (CIFS optimization applied)
- **Server**: `/etc/fstab` on tdarr server (pending optimization)

## Monitoring and Validation

- **Success criteria**: Tdarr API download of a 23GB+ file completes without hanging
- **Performance target**: Sustained 50+ MB/s throughout the entire transfer
- **Reliability target**: No timeouts during large file processing
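These criteria could be checked automatically; the sketch below measures average download speed with curl and compares it against the 50 MB/s target. The URL is a placeholder — the exact API download path depends on your Tdarr version, so substitute a large file actually served by the server.

```shell
# Hedged validation sketch: compare curl's average download speed to a
# target in MB/s. The url argument is a placeholder for a large file
# exposed by the Tdarr server.
check_download_speed() {
  local url="$1" target_mb_s="${2:-50}" bps
  bps=$(curl -s -o /dev/null -w '%{speed_download}' "$url") || return 2
  bps=${bps%.*}  # curl reports bytes/sec, possibly with a decimal part
  echo "average: $((bps / 1000000)) MB/s (target: ${target_mb_s})"
  [ "$((bps / 1000000))" -ge "$target_mb_s" ]
}
```

Usage: `check_download_speed "http://10.10.0.43:8266/<large-file-path>" 50` — a non-zero exit signals the target was missed.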
## Session Outcome

**Status**: ✅ **ROOT CAUSE IDENTIFIED AND SOLUTION READY**

- Eliminated client-side variables through systematic testing
- Confirmed server-side CIFS configuration as the bottleneck
- Validated the fix strategy through client-side optimization success
- Ready to implement the server-side solution

---

*Session Date: 2025-08-11*
*Duration: ~3 hours*
*Methods: Direct testing, API analysis, mount configuration review*
183 tdarr/examples/tdarr-node-configurations.md Normal file
@ -0,0 +1,183 @@
# Tdarr Node Container Configurations

## Overview

Complete examples for running Tdarr transcoding nodes in containers, covering both CPU-only and GPU-accelerated setups.

## CPU-Only Configuration (Docker Compose)

For systems without a GPU, or when the GPU isn't needed:

```yaml
version: "3.4"
services:
  tdarr-node:
    container_name: tdarr-node-cpu
    image: ghcr.io/haveagitgat/tdarr_node:latest
    restart: unless-stopped
    environment:
      - TZ=America/Chicago
      - UMASK_SET=002
      - nodeName=local-workstation-cpu
      - serverIP=YOUR_TDARR_SERVER_IP # Replace with your Tdarr server IP
      - serverPort=8266
      - inContainer=true
      - ffmpegVersion=6
    volumes:
      # Mount your media from the same NAS share as the server
      - /path/to/your/media:/media # Replace with your local media mount
      # Temp directory for transcoding cache
      - ./temp:/temp
```

**Use case**:

- CPU-only transcoding
- Testing Tdarr functionality
- Systems without a dedicated GPU
- When GPU drivers aren't available

## GPU-Accelerated Configuration (Podman)

**Recommended for Fedora/RHEL/CentOS/Nobara systems:**

### Mapped Node (Direct Media Access)

```bash
podman run -d --name tdarr-node-gpu-mapped \
  --gpus all \
  --restart unless-stopped \
  -e TZ=America/Chicago \
  -e UMASK_SET=002 \
  -e nodeName=local-workstation-gpu-mapped \
  -e serverIP=10.10.0.43 \
  -e serverPort=8266 \
  -e inContainer=true \
  -e ffmpegVersion=6 \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -v /mnt/NV2/tdarr-cache:/cache \
  -v /mnt/media/TV:/media/TV \
  -v /mnt/media/Movies:/media/Movies \
  -v /mnt/media/tdarr/tdarr-cache-clean:/temp \
  ghcr.io/haveagitgat/tdarr_node:latest
```

### Unmapped Node (Downloads Files)

```bash
podman run -d --name tdarr-node-gpu-unmapped \
  --gpus all \
  --restart unless-stopped \
  -e TZ=America/Chicago \
  -e UMASK_SET=002 \
  -e nodeName=local-workstation-gpu-unmapped \
  -e serverIP=10.10.0.43 \
  -e serverPort=8266 \
  -e inContainer=true \
  -e ffmpegVersion=6 \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -v /mnt/NV2/tdarr-cache:/cache \
  -v /mnt/media:/media \
  -v /mnt/media/tdarr/tdarr-cache-clean:/temp \
  ghcr.io/haveagitgat/tdarr_node:latest
```

**Use cases**:

- **Mapped**: Direct media access, faster processing, no file downloads
- **Unmapped**: Works when network shares aren't available locally
- Hardware video encoding/decoding (NVENC/NVDEC)
- High-performance transcoding with NVMe cache
- Multiple concurrent streams
- Fedora-based systems where Podman works better than Docker

## GPU-Accelerated Configuration (Docker)

**For Ubuntu/Debian systems where Docker GPU support works:**

```yaml
version: "3.4"
services:
  tdarr-node:
    container_name: tdarr-node-gpu
    image: ghcr.io/haveagitgat/tdarr_node:latest
    restart: unless-stopped
    environment:
      - TZ=America/Chicago
      - UMASK_SET=002
      - nodeName=local-workstation-gpu
      - serverIP=YOUR_TDARR_SERVER_IP
      - serverPort=8266
      - inContainer=true
      - ffmpegVersion=6
      - NVIDIA_DRIVER_CAPABILITIES=all
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - /path/to/your/media:/media
      - ./temp:/temp
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

## Configuration Parameters

### Required Environment Variables

- `TZ`: Timezone (e.g., `America/Chicago`)
- `nodeName`: Unique identifier for this node
- `serverIP`: IP address of the Tdarr server
- `serverPort`: Tdarr server port (typically 8266)
- `inContainer`: Set to `true` for containerized deployments
- `ffmpegVersion`: FFmpeg version to use (6 recommended)

### GPU-Specific Variables

- `NVIDIA_DRIVER_CAPABILITIES`: Set to `all` for full GPU access
- `NVIDIA_VISIBLE_DEVICES`: `all` for all GPUs, or specific GPU IDs

### Volume Mounts

- `/media`: Mount point for media files (must match the server configuration)
- `/temp`: Temporary directory for transcoding cache

## Platform-Specific Recommendations

### Fedora/RHEL/CentOS/Nobara

- **GPU**: Use Podman (Docker Desktop has GPU issues)
- **CPU**: Docker and Podman both work fine

### Ubuntu/Debian

- **GPU**: Use Docker with nvidia-container-toolkit
- **CPU**: Docker recommended

### Testing GPU Functionality

Verify GPU access inside the container:

```bash
# For Podman
podman exec tdarr-node-gpu nvidia-smi

# For Docker
docker exec tdarr-node-gpu nvidia-smi
```

Test NVENC encoding:

```bash
# For Podman
podman exec tdarr-node-gpu /usr/local/bin/tdarr-ffmpeg -f lavfi -i testsrc2=duration=5:size=1920x1080:rate=30 -c:v h264_nvenc -t 5 /tmp/test.mp4

# For Docker
docker exec tdarr-node-gpu /usr/local/bin/tdarr-ffmpeg -f lavfi -i testsrc2=duration=5:size=1920x1080:rate=30 -c:v h264_nvenc -t 5 /tmp/test.mp4
```

## Troubleshooting

- **GPU not detected**: See `reference/docker/nvidia-gpu-troubleshooting.md`
- **Permission issues**: Ensure proper UMASK_SET and volume permissions
- **Connection issues**: Verify serverIP and firewall settings
- **Performance issues**: Monitor CPU/GPU utilization during transcoding

## Related Documentation

- `patterns/docker/gpu-acceleration.md` - GPU acceleration patterns
- `reference/docker/nvidia-gpu-troubleshooting.md` - Detailed GPU troubleshooting
- `start-tdarr-gpu-podman.sh` - Ready-to-use Podman startup script
28 tdarr/examples/tdarr-node-local/docker-compose-cpu.yml Normal file
@ -0,0 +1,28 @@
version: "3.4"
services:
  tdarr-node:
    container_name: tdarr-node-local-cpu
    image: ghcr.io/haveagitgat/tdarr_node:latest
    restart: unless-stopped
    environment:
      - TZ=America/Chicago
      - UMASK_SET=002
      - nodeName=local-workstation-cpu
      - serverIP=192.168.1.100 # Replace with your Tdarr server IP
      - serverPort=8266
      - inContainer=true
      - ffmpegVersion=6
    volumes:
      # Media access (same as server)
      - /mnt/media:/media # Replace with your media path
      # Local transcoding cache
      - ./temp:/temp
    # Resource limits for CPU transcoding
    deploy:
      resources:
        limits:
          cpus: '14' # Leave some cores for the system (16-core host -> use 14)
          memory: 32G # Generous for 4K transcoding
        reservations:
          cpus: '8' # Minimum guaranteed cores
          memory: 16G
45 tdarr/examples/tdarr-node-local/docker-compose-gpu.yml Normal file
@ -0,0 +1,45 @@
version: "3.4"
services:
  tdarr-node:
    container_name: tdarr-node-local-gpu
    image: ghcr.io/haveagitgat/tdarr_node:latest
    restart: unless-stopped
    environment:
      - TZ=America/Chicago
      - UMASK_SET=002
      - nodeName=local-workstation-gpu
      - serverIP=192.168.1.100 # Replace with your Tdarr server IP
      - serverPort=8266
      - inContainer=true
      - ffmpegVersion=6
      # NVIDIA environment variables
      - NVIDIA_DRIVER_CAPABILITIES=all
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      # Media access (same as server)
      - /mnt/media:/media # Replace with your media path
      # Local transcoding cache
      - ./temp:/temp
    devices:
      - /dev/dri:/dev/dri # Intel/AMD GPU fallback

    # GPU configuration - choose ONE method:

    # Method 1: Deploy syntax (recommended)
    deploy:
      resources:
        limits:
          memory: 16G # GPU transcoding uses less RAM
        reservations:
          memory: 8G
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

    # Method 2: Runtime (alternative)
    # runtime: nvidia

    # Method 3: CDI (future)
    # devices:
    #   - nvidia.com/gpu=all
83 tdarr/examples/tdarr-node-local/start-tdarr-mapped-node.sh Executable file
@ -0,0 +1,83 @@
#!/bin/bash
# Tdarr Mapped Node with GPU Support - Example Script
# This script starts a MAPPED Tdarr node container with NVIDIA GPU acceleration using Podman
#
# MAPPED NODES: Direct access to media files via volume mounts
# Use this approach when you want the node to directly access your media library
# for local processing, without server coordination for file transfers
#
# Configure these variables for your setup:

set -e

CONTAINER_NAME="tdarr-node-gpu-mapped"
SERVER_IP="YOUR_SERVER_IP"       # e.g., "10.10.0.43" or "192.168.1.100"
SERVER_PORT="8266"               # Default Tdarr server port
NODE_NAME="YOUR_NODE_NAME"       # e.g., "workstation-gpu" or "local-gpu-node"
MEDIA_PATH="/path/to/your/media" # e.g., "/mnt/media" or "/home/user/Videos"
CACHE_PATH="/path/to/cache"      # e.g., "/mnt/ssd/tdarr-cache"

echo "🚀 Starting MAPPED Tdarr Node with GPU support using Podman..."
echo "   Media Path: ${MEDIA_PATH}"
echo "   Cache Path: ${CACHE_PATH}"

# Stop and remove existing container if it exists
if podman ps -a --format "{{.Names}}" | grep -q "^${CONTAINER_NAME}$"; then
    echo "🛑 Stopping existing container: ${CONTAINER_NAME}"
    podman stop "${CONTAINER_NAME}" 2>/dev/null || true
    podman rm "${CONTAINER_NAME}" 2>/dev/null || true
fi

# Start Tdarr node with GPU support
echo "🎬 Starting Tdarr Node container..."
podman run -d --name "${CONTAINER_NAME}" \
    --gpus all \
    --restart unless-stopped \
    -e TZ=America/Chicago \
    -e UMASK_SET=002 \
    -e nodeName="${NODE_NAME}" \
    -e serverIP="${SERVER_IP}" \
    -e serverPort="${SERVER_PORT}" \
    -e inContainer=true \
    -e ffmpegVersion=6 \
    -e logLevel=DEBUG \
    -e NVIDIA_DRIVER_CAPABILITIES=all \
    -e NVIDIA_VISIBLE_DEVICES=all \
    -v "${MEDIA_PATH}:/media" \
    -v "${CACHE_PATH}:/temp" \
    ghcr.io/haveagitgat/tdarr_node:latest

echo "⏳ Waiting for container to initialize..."
sleep 5

# Check container status
if podman ps --format "{{.Names}}" | grep -q "^${CONTAINER_NAME}$"; then
    echo "✅ Mapped Tdarr Node is running successfully!"
    echo ""
    echo "📊 Container Status:"
    podman ps --filter "name=${CONTAINER_NAME}" --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
    echo ""
    echo "🔍 Testing GPU Access:"
    if podman exec "${CONTAINER_NAME}" nvidia-smi --query-gpu=name --format=csv,noheader,nounits 2>/dev/null; then
        echo "🎉 GPU is accessible in container!"
    else
        echo "⚠️  GPU test failed, but container is running"
    fi
    echo ""
    echo "🌐 Connection Details:"
    echo "   Server: ${SERVER_IP}:${SERVER_PORT}"
    echo "   Node Name: ${NODE_NAME}"
    echo ""
    echo "🧪 Test NVENC encoding:"
    echo "   podman exec ${CONTAINER_NAME} /usr/local/bin/tdarr-ffmpeg -f lavfi -i testsrc2=duration=5:size=1920x1080:rate=30 -c:v h264_nvenc -preset fast -t 5 /tmp/test.mp4"
    echo ""
    echo "📋 Container Management:"
    echo "   View logs: podman logs ${CONTAINER_NAME}"
    echo "   Stop:      podman stop ${CONTAINER_NAME}"
    echo "   Remove:    podman rm ${CONTAINER_NAME}"
else
    echo "❌ Failed to start container"
    echo "📋 Checking logs..."
    podman logs "${CONTAINER_NAME}" --tail 10
    exit 1
fi
69 tdarr/examples/tdarr-server-setup/README.md Normal file
@ -0,0 +1,69 @@
# Tdarr Server Setup Example

## Directory Structure

```
~/container-data/tdarr/
├── docker-compose.yml
├── stonefish-tdarr-plugins/   # Custom plugins
├── tdarr/
│   ├── server/                # Local storage
│   ├── configs/
│   └── logs/
└── temp/                      # Local temp if needed
```

## Storage Strategy

### Local Storage (Fast Access)

- **Database**: SQLite requires a local filesystem for WAL mode
- **Configs**: Frequently accessed during startup
- **Logs**: Regular writes during operation

### Network Storage (Capacity)

- **Backups**: Infrequent access, large files
- **Media**: Read-only during transcoding
- **Cache**: Temporary transcoding files

## Upgrade Process

### Major Version Upgrades

1. **Backup current state**

   ```bash
   docker-compose down
   cp docker-compose.yml docker-compose.yml.backup
   ```

2. **For a clean start** (recommended for major versions):

   ```bash
   # Remove old database
   sudo rm -rf ./tdarr/server
   mkdir -p ./tdarr/server

   # Pull latest image
   docker-compose pull

   # Start fresh
   docker-compose up -d
   ```

3. **Monitor initialization**

   ```bash
   docker-compose logs -f
   ```

## Common Issues

### Disk Space

- Monitor local database growth
- Regularly clean up old backups
- Use network storage for large static data

### Permissions

- The container runs as PUID/PGID (usually 0/0)
- Ensure proper ownership of mounted directories
- Use `sudo rm -rf` for root-owned container files

### Network Filesystem Issues

- SQLite is incompatible with NFS/SMB for the database
- Keep the database local; put only backups on network storage
- Monitor transcoding cache disk usage
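The SQLite constraint above lends itself to a quick pre-flight check; this sketch (assuming the directory layout used in this example, and GNU `df`) warns if the database directory sits on a network filesystem.

```shell
# Sketch: confirm the Tdarr database directory sits on a local filesystem,
# not NFS/CIFS, before starting the server (SQLite WAL mode needs this).
DB_DIR="${1:-./tdarr/server}"
mkdir -p "$DB_DIR"
FSTYPE=$(df --output=fstype "$DB_DIR" | tail -1 | tr -d ' ')
case "$FSTYPE" in
  nfs*|cifs|smb*) echo "WARNING: $DB_DIR is on $FSTYPE; keep the database local" ;;
  *)              echo "OK: $DB_DIR is on $FSTYPE" ;;
esac
```

Run it from `~/container-data/tdarr/` before `docker-compose up -d`; a WARNING here predicts database corruption later.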
37 tdarr/examples/tdarr-server-setup/docker-compose.yml Normal file
@ -0,0 +1,37 @@
version: "3.4"
services:
  tdarr:
    container_name: tdarr
    image: ghcr.io/haveagitgat/tdarr:latest
    restart: unless-stopped
    network_mode: bridge
    ports:
      - 8265:8265 # webUI port
      - 8266:8266 # server port
    environment:
      - TZ=America/Chicago
      - PUID=0
      - PGID=0
      - UMASK_SET=002
      - serverIP=0.0.0.0
      - serverPort=8266
      - webUIPort=8265
      - internalNode=false # Disable for distributed setup
      - inContainer=true
      - ffmpegVersion=6
      - nodeName=docker-server
    volumes:
      # Plugin mounts (stonefish example)
      - ./stonefish-tdarr-plugins/FlowPlugins/:/app/server/Tdarr/Plugins/FlowPlugins/
      - ./stonefish-tdarr-plugins/FlowPluginsTs/:/app/server/Tdarr/Plugins/FlowPluginsTs/
      - ./stonefish-tdarr-plugins/Community/:/app/server/Tdarr/Plugins/Community/

      # Hybrid storage strategy
      - ./tdarr/server:/app/server # Local: Database, configs, logs
      - ./tdarr/configs:/app/configs
      - ./tdarr/logs:/app/logs
      - /mnt/truenas-share/tdarr/tdarr-server/Backups:/app/server/Tdarr/Backups # Network: Backups

      # Media and cache
      - /mnt/truenas-share:/media
      - /mnt/truenas-share/tdarr/tdarr-cache:/temp
212 tdarr/scripts/CONTEXT.md Normal file
@ -0,0 +1,212 @@
# Tdarr Scripts - Operational Context

## Script Overview

This directory contains the active operational scripts for Tdarr transcoding automation, gaming-aware scheduling, and system management.

## Core Scripts

### Gaming-Aware Scheduler

**Primary Script**: `tdarr-schedule-manager.sh`
**Purpose**: Comprehensive management interface for gaming-aware Tdarr scheduling

**Key Functions**:
- **Preset Management**: Quick schedule templates (night-only, work-safe, weekend-heavy, gaming-only)
- **Installation**: Automated cron job setup and configuration
- **Status Monitoring**: Real-time status and logging
- **Configuration**: Interactive schedule editing and validation

**Usage Patterns**:
```bash
# Quick setup
./tdarr-schedule-manager.sh preset work-safe
./tdarr-schedule-manager.sh install

# Monitoring
./tdarr-schedule-manager.sh status
./tdarr-schedule-manager.sh logs

# Testing
./tdarr-schedule-manager.sh test
```

### Container Management

**Start Script**: `start-tdarr-gpu-podman-clean.sh`
**Purpose**: Launch an unmapped Tdarr node with an optimized configuration

**Key Features**:
- **Unmapped Node Configuration**: Local cache for optimal performance
- **GPU Support**: Full NVIDIA device passthrough
- **Resource Optimization**: Direct NVMe cache mapping
- **Clean Architecture**: No media volume dependencies

**Stop Script**: `stop-tdarr-gpu-podman.sh`
**Purpose**: Graceful container shutdown with cleanup

### Scheduling Engine

**Core Engine**: `tdarr-cron-check-configurable.sh`
**Purpose**: Minute-by-minute decision engine for Tdarr state management

**Decision Logic**:
1. **Gaming Detection**: Check for active gaming processes
2. **GPU Monitoring**: Verify GPU usage is below the threshold (15%)
3. **Time Window Validation**: Ensure the current time falls within the allowed schedule
4. **State Management**: Start/stop Tdarr based on these conditions

**Gaming Process Detection**:
- Steam, Lutris, Heroic Games Launcher
- Wine, Bottles (Windows compatibility layers)
- GameMode, MangoHUD (gaming utilities)
- GPU usage monitoring via nvidia-smi
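The combined process-plus-GPU check above can be sketched as a single function. This is illustrative only, not the shipped `tdarr-cron-check-configurable.sh`; in production the two inputs would come from `pgrep` and `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`.

```shell
# Illustrative sketch: gaming counts as "active" if any watched process
# name appears in the running-process list OR GPU utilization exceeds
# the configured threshold.
GAMING_PROCESSES="steam lutris heroic wine bottles gamemode mangohud"

gaming_active() {
  local running="$1" gpu_util="$2" threshold="${3:-15}" p
  for p in $GAMING_PROCESSES; do
    case " $running " in *" $p "*) return 0 ;; esac
  done
  [ "$gpu_util" -gt "$threshold" ]
}
```

For example, `gaming_active "firefox steam" 5` succeeds on the process match alone, while `gaming_active "firefox" 40` succeeds on GPU load even with no gaming process running.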
### Configuration Management

**Config File**: `tdarr-schedule.conf`
**Purpose**: Centralized configuration for scheduler behavior

**Configuration Structure**:
```bash
# Time blocks: "HOUR_START-HOUR_END:DAYS"
SCHEDULE_BLOCKS="22-07:daily 09-17:1-5"

# Gaming detection settings
GPU_THRESHOLD=15
GAMING_PROCESSES="steam lutris heroic wine bottles gamemode mangohud"

# Operational settings
LOG_FILE="/tmp/tdarr-scheduler.log"
CONTAINER_NAME="tdarr-node-gpu"
```

## Operational Patterns

### Automated Maintenance

**Cron Integration**: Two automated systems run simultaneously:
1. **Scheduler** (every minute): `tdarr-cron-check-configurable.sh`
2. **Cleanup** (every 6 hours): Temporary directory maintenance

**Cleanup Automation**:
```bash
# Removes abandoned transcoding directories
0 */6 * * * find /tmp -name "tdarr-workDir2-*" -type d -mmin +360 -exec rm -rf {} \; 2>/dev/null || true
```

### Logging Strategy

**Log Location**: `/tmp/tdarr-scheduler.log`
**Log Format**: Timestamped entries with decision reasoning
**Log Rotation**: Manual cleanup, focused on recent activity

**Log Examples**:
```
[2025-08-13 14:30:01] Gaming detected (steam), stopping Tdarr
[2025-08-13 14:35:01] Gaming ended, but outside allowed hours (14:35 not in 22-07:daily)
[2025-08-13 22:00:01] Starting Tdarr (no gaming, within schedule)
```

### System Integration

**Gaming Detection**: Real-time process monitoring
**GPU Monitoring**: nvidia-smi integration for usage thresholds
**Container Management**: Podman-based lifecycle management
**Cron Integration**: Standard system scheduler for automation

## Configuration Presets

### Preset Profiles

**night-only**: `"22-07:daily"` - Overnight transcoding only
**work-safe**: `"22-07:daily 09-17:1-5"` - Nights + work hours
**weekend-heavy**: `"22-07:daily 09-17:1-5 08-20:6-7"` - Maximum time
**gaming-only**: No time limits, gaming detection only

### Schedule Format Specification

**Format**: `"HOUR_START-HOUR_END:DAYS"`
**Examples**:
- `"22-07:daily"` - 10PM to 7AM every day (overnight)
- `"09-17:1-5"` - 9AM to 5PM, Monday-Friday
- `"14-16:6,7"` - 2PM to 4PM, Saturday and Sunday
- `"08-20:6-7"` - 8AM to 8PM, weekends only
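The format above can be evaluated with a small helper. This is a minimal sketch, not the shipped implementation; it assumes days are numbered 1=Monday through 7=Sunday, and shows how an overnight window such as `22-07` wraps past midnight.

```shell
# Sketch: does (hour, day-of-week) fall inside one "HH_START-HH_END:DAYS" block?
in_schedule_block() {
  local block="$1" hour="$2" dow="$3"
  local hours="${block%%:*}" days="${block##*:}"
  local start="${hours%%-*}" end="${hours##*-}"

  # Day check: "daily", a range like 1-5, or a list like 6,7
  case "$days" in
    daily) : ;;
    *-*)   [ "$dow" -ge "${days%%-*}" ] && [ "$dow" -le "${days##*-}" ] || return 1 ;;
    *,*)   case ",$days," in *",$dow,"*) : ;; *) return 1 ;; esac ;;
    *)     [ "$dow" -eq "$days" ] || return 1 ;;
  esac

  # Hour check: start > end means an overnight window wrapping midnight
  if [ "$start" -le "$end" ]; then
    [ "$hour" -ge "$start" ] && [ "$hour" -lt "$end" ]
  else
    [ "$hour" -ge "$start" ] || [ "$hour" -lt "$end" ]
  fi
}
```

So `in_schedule_block "22-07:daily" 23 3` and `in_schedule_block "22-07:daily" 3 6` both succeed, while noon on a weekday (`12 3`) does not.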
## Container Architecture

### Unmapped Node Configuration

**Architecture Choice**: Local cache with API-based transfers
**Benefits**: 3-5x performance improvement, reduced network dependency

**Container Environment**:
```bash
-e nodeType=unmapped
-e unmappedNodeCache=/cache
-e enableGpu=true
-e TZ=America/New_York
```

**Volume Configuration**:
```bash
# Local high-speed cache (NVMe)
-v "/mnt/NV2/tdarr-cache:/cache"

# Configuration persistence
-v "/mnt/NV2/tdarr-cache-clean:/temp"

# No media volumes (unmapped mode uses the API)
```

### Resource Management

**GPU Access**: Full NVIDIA device passthrough
**Memory**: Controlled by container limits
**CPU**: Shared with the host system
**Storage**: Local NVMe for optimal I/O performance

## Troubleshooting Context

### Common Issues

1. **Gaming Not Detected**: Check the process names in the configuration
2. **Time Window Issues**: Verify the schedule block format
3. **Container Start Failures**: Check GPU device access
4. **Log File Growth**: Manually clean up scheduler logs

### Diagnostic Commands

```bash
# Test current conditions
./tdarr-schedule-manager.sh test

# View real-time logs
./tdarr-schedule-manager.sh logs

# Check container status
podman ps | grep tdarr

# Verify GPU access
podman exec tdarr-node-gpu nvidia-smi
```

### Recovery Procedures

```bash
# Reset to defaults
./tdarr-schedule-manager.sh preset work-safe

# Reinstall scheduler
./tdarr-schedule-manager.sh install

# Manual container restart
./stop-tdarr-gpu-podman.sh
./start-tdarr-gpu-podman-clean.sh
```

## Integration Points

### External Dependencies

- **Podman**: Container runtime for node management
- **nvidia-smi**: GPU monitoring and device access
- **cron**: System scheduler for automation
- **SSH**: Remote server access (monitoring scripts)

### File System Dependencies

- **Cache Directory**: `/mnt/NV2/tdarr-cache` (local NVMe)
- **Temp Directory**: `/mnt/NV2/tdarr-cache-clean` (processing space)
- **Log Files**: `/tmp/tdarr-scheduler.log` (operational logs)
- **Configuration**: Local `tdarr-schedule.conf` file

### Network Dependencies

- **Tdarr Server**: API communication for unmapped node operation
- **Discord Webhooks**: Optional notification integration (via monitoring)
- **NAS Access**: Final file storage (post-processing only)

This operational context provides comprehensive guidance for managing the active Tdarr automation scripts in production environments.
272 tdarr/troubleshooting.md Normal file
@ -0,0 +1,272 @@
# Tdarr Troubleshooting Guide
|
||||||
|
|
||||||
|
## forEach Error Resolution
|
||||||
|
|
||||||
|
### Problem: TypeError: Cannot read properties of undefined (reading 'forEach')
|
||||||
|
**Symptoms**: Scanning phase fails at "Tagging video res" step, preventing all transcodes
|
||||||
|
**Root Cause**: Custom plugin mounts override community plugins with incompatible versions
|
||||||
|
|
||||||
|
### Solution: Clean Plugin Installation
|
||||||
|
1. **Remove custom plugin mounts** from docker-compose.yml
|
||||||
|
2. **Force plugin regeneration**:
|
||||||
|
```bash
|
||||||
|
ssh tdarr "docker restart tdarr"
|
||||||
|
podman restart tdarr-node-gpu
|
||||||
|
```
|
||||||
|
3. **Verify clean plugins**: Check for null-safety fixes `(streams || []).forEach()`
|
||||||
|
|
||||||
|
### Plugin Safety Patterns
|
||||||
|
```javascript
|
||||||
|
// ❌ Unsafe - causes forEach errors
|
||||||
|
args.variables.ffmpegCommand.streams.forEach()
|
||||||
|
|
||||||
|
// ✅ Safe - null-safe forEach
|
||||||
|
(args.variables.ffmpegCommand.streams || []).forEach()
|
||||||
|
```
|
||||||
|
|
||||||
|
## Staging Section Timeout Issues
|
||||||
|
|
||||||
|
### Problem: Files removed from staging after 300 seconds
|
||||||
|
**Symptoms**:
|
||||||
|
- `.tmp` files stuck in work directories
|
||||||
|
- ENOTEMPTY errors during cleanup
|
||||||
|
- Subsequent jobs blocked
|
||||||
|
|
||||||
|
### Solution: Automated Monitoring System
|
||||||
|
**Monitor Script**: `/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh`
|
||||||
|
|
||||||
|
**Automatic Actions**:
|
||||||
|
- Detects staging timeouts every 20 minutes
|
||||||
|
- Removes stuck work directories
|
||||||
|
- Sends Discord notifications
|
||||||
|
- Logs all cleanup activities
|
||||||
|
|
||||||
|
### Manual Cleanup Commands
|
||||||
|
```bash
|
||||||
|
# Check staging section
|
||||||
|
ssh tdarr "docker logs tdarr | tail -50"
|
||||||
|
|
||||||
|
# Find stuck work directories
|
||||||
|
find /mnt/NV2/tdarr-cache -name "tdarr-workDir*" -type d
|
||||||
|
|
||||||
|
# Force cleanup stuck directory
|
||||||
|
rm -rf /mnt/NV2/tdarr-cache/tdarr-workDir-[ID]
|
||||||
|
```
## System Stability Issues

### Problem: Kernel crashes during intensive transcoding
**Root Cause**: CIFS network issues during large file streaming (mapped nodes)

### Solution: Convert to Unmapped Node Architecture
1. **Enable unmapped nodes** in server Options
2. **Update node configuration**:
```bash
# Add to container environment
-e nodeType=unmapped
-e unmappedNodeCache=/cache

# Use local cache volume
-v "/mnt/NV2/tdarr-cache:/cache"

# Remove media volume (no longer needed)
```
3. **Benefits**: Eliminates CIFS streaming, prevents kernel crashes

### Container Resource Limits
```yaml
# Prevent memory exhaustion
deploy:
  resources:
    limits:
      memory: 8G
      cpus: '6'
```

## Gaming Detection Issues

### Problem: Tdarr doesn't stop during gaming
**Check gaming detection**:
```bash
# Test current gaming detection
./tdarr-schedule-manager.sh test

# View scheduler logs
tail -f /tmp/tdarr-scheduler.log

# Verify GPU usage detection
nvidia-smi
```

### Gaming Process Detection
**Monitored Processes**:
- Steam, Lutris, Heroic Games Launcher
- Wine, Bottles (Windows compatibility)
- GameMode, MangoHUD (utilities)
- **GPU usage >15%** (configurable threshold)
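The detection above reduces to a simple decision: gaming is assumed if any monitored process is running, or if GPU utilization exceeds the threshold. A hedged sketch of that logic as a pure function (the real scheduler reads `pgrep` and `nvidia-smi` output itself; here the inputs are passed as arguments so the logic is testable):

```shell
# Hypothetical helper: takes a space-separated process list and a GPU
# utilization percentage, echoes "gaming" or "idle".
GPU_THRESHOLD=15
is_gaming() {
  local procs="$1" gpu_util="$2" p
  for p in steam lutris heroic wine bottles gamemoded; do
    case " $procs " in
      *" $p "*) echo "gaming"; return 0 ;;
    esac
  done
  if [ "$gpu_util" -gt "$GPU_THRESHOLD" ]; then
    echo "gaming"; return 0
  fi
  echo "idle"; return 1
}
```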
### Configuration Adjustments
```bash
# Edit gaming detection threshold
./tdarr-schedule-manager.sh edit

# Apply preset configurations
./tdarr-schedule-manager.sh preset gaming-only  # No time limits
./tdarr-schedule-manager.sh preset night-only   # 10PM-7AM only
```

## Network and Access Issues

### Server Connection Problems
**Server Access Commands**:
```bash
# SSH to Tdarr server
ssh tdarr

# Check server status
ssh tdarr "docker ps | grep tdarr"

# View server logs
ssh tdarr "docker logs tdarr"

# Access server container
ssh tdarr "docker exec -it tdarr /bin/bash"
```

### Node Registration Issues
```bash
# Check node logs
podman logs tdarr-node-gpu

# Verify node registration
# Look for "Node registered" in server logs
ssh tdarr "docker logs tdarr | grep -i node"

# Test node connectivity
curl http://10.10.0.43:8265/api/v2/status
```

## Performance Issues

### Slow Transcoding Performance
**Diagnosis**:
1. **Check cache location**: Should be local NVMe, not network storage
2. **Verify unmapped mode**: `nodeType=unmapped` in the container
3. **Monitor I/O**: Run `iotop` during transcoding

**Expected Performance**:
- **Mapped nodes**: Constant SMB streaming (~100MB/s)
- **Unmapped nodes**: Download once → Process locally → Upload once

### GPU Utilization Problems
```bash
# Monitor GPU usage during transcoding
watch nvidia-smi

# Check GPU device access in container
podman exec tdarr-node-gpu nvidia-smi

# Verify NVENC encoder availability
podman exec tdarr-node-gpu ffmpeg -encoders | grep nvenc
```

## Plugin System Issues

### Plugin Loading Failures
**Troubleshooting Steps**:
1. **Check plugin directory**: Ensure no custom mounts override community plugins
2. **Verify dependencies**: FlowHelper files (`metadataUtils.js`, `letterboxUtils.js`)
3. **Test plugin syntax**:
```bash
# Load the plugin in Node.js to surface syntax errors
node -e "require('./path/to/plugin.js')"
```

### Custom Plugin Integration
**Safe Integration Pattern**:
1. **Selective mounting**: Mount only the specific plugins required
2. **Dependency verification**: Include all FlowHelper dependencies
3. **Version compatibility**: Ensure plugins match the Tdarr version
4. **Null-safety checks**: Add `|| []` to forEach operations

## Monitoring and Logging

### Log Locations
```bash
# Scheduler logs
tail -f /tmp/tdarr-scheduler.log

# Monitor logs
tail -f /tmp/tdarr-monitor/monitor.log

# Server logs
ssh tdarr "docker logs tdarr"

# Node logs
podman logs tdarr-node-gpu
```

### Discord Notification Issues
**Check webhook configuration**:
```bash
# Test Discord webhook
curl -X POST [WEBHOOK_URL] \
  -H "Content-Type: application/json" \
  -d '{"content": "Test message"}'
```

**Common Issues**:
- JSON escaping in message content
- Markdown formatting in Discord
- User ping placement (outside code blocks)
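The first pitfall, JSON escaping, can be handled with a small helper before building the payload. A sketch, assuming GNU sed/awk; the webhook URL stays a placeholder as in the curl example above:

```shell
# Hypothetical helper: escape a string for a JSON "content" field.
# Backslashes and double quotes are escaped; newlines become literal \n.
json_escape() {
  printf '%s' "$1" \
    | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
    | awk 'NR > 1 { printf "\\n" } { printf "%s", $0 }'
}

# Build the payload safely, then POST it (URL elided as above):
# curl -X POST [WEBHOOK_URL] -H "Content-Type: application/json" \
#   -d "{\"content\": \"$(json_escape "$message")\"}"
```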
## Emergency Recovery

### Complete System Reset
```bash
# Stop all containers
podman stop tdarr-node-gpu
ssh tdarr "docker stop tdarr"

# Clean cache directories
rm -rf /mnt/NV2/tdarr-cache/tdarr-workDir*

# Remove scheduler
crontab -e  # Delete tdarr lines

# Restart with clean configuration
./start-tdarr-gpu-podman-clean.sh
./tdarr-schedule-manager.sh preset work-safe
./tdarr-schedule-manager.sh install
```

### Data Recovery
**Important**: Tdarr processes files in place; original files remain untouched.
- **Queue data**: Stored in the server configuration (`/app/configs`)
- **Progress data**: Lost on container restart (unmapped nodes)
- **Cache files**: Safe to delete; they will re-download

## Common Error Patterns

### "Copy failed" in Staging Section
**Cause**: Network timeout during file transfer to an unmapped node
**Solution**: The monitoring system retries automatically

### "ENOTEMPTY" Directory Cleanup Errors
**Cause**: Partial downloads leave files in work directories
**Solution**: Force-remove the directories; the monitoring system handles this automatically

### Node Disconnection During Processing
**Cause**: Gaming detection or a manual stop during an active job
**Result**: The file returns to the queue automatically; safe to restart

## Prevention Best Practices

1. **Use unmapped node architecture** for stability
2. **Implement the monitoring system** for automatic cleanup
3. **Configure gaming-aware scheduling** for desktop systems
4. **Set container resource limits** to prevent crashes
5. **Use clean plugin installation** to avoid forEach errors
6. **Monitor system resources** during intensive operations

This troubleshooting guide covers the most common issues and their resolutions for production Tdarr deployments.

--- vm-management/CONTEXT.md (new file, 296 lines) ---
# Virtual Machine Management - Technology Context

## Overview
Virtual machine management for home lab environments, focused on automated provisioning, infrastructure as code, and security-first configuration. This context covers VM lifecycle management, Proxmox integration, and standardized deployment patterns.

## Architecture Patterns

### Infrastructure as Code (IaC) Approach
**Pattern**: Declarative VM configuration with repeatable deployments
```yaml
# Cloud-init template pattern
#cloud-config
users:
  - name: cal
    groups: [sudo, docker]
    ssh_authorized_keys:
      - ssh-rsa AAAAB3... primary-key
      - ssh-rsa AAAAB3... emergency-key
packages:
  - docker.io
  - docker-compose
runcmd:
  - systemctl enable docker
  - usermod -aG docker cal
```

### Template-Based Deployment Strategy
**Pattern**: Standardized VM templates with cloud-init automation
- **Base Templates**: Ubuntu Server with cloud-init support
- **Resource Allocation**: Standardized sizing (2CPU/4GB/20GB baseline)
- **Network Configuration**: Predefined VLAN assignments (10.10.0.x internal)
- **Security Hardening**: SSH keys only, password auth disabled

## Provisioning Strategies

### Cloud-Init Deployment (Recommended for New VMs)
**Purpose**: Fully automated VM provisioning from first boot
**Implementation**:
1. Create the VM in Proxmox with cloud-init support
2. Apply the standardized cloud-init template
3. The VM configures itself automatically on first boot
4. No manual intervention required

**Benefits**:
- Zero-touch deployment
- Consistent configuration
- Security hardening from first boot
- Immediate productivity

### Post-Install Scripting (Existing VMs)
**Purpose**: Standardize existing VM configurations
**Implementation**:
```bash
./vm-post-install.sh <vm-ip> [username]
# Automated: updates, SSH keys, Docker, hardening
```

**Use Cases**:
- Legacy VM standardization
- Imported VM configuration
- Recovery and remediation
- Incremental improvements

## Security Architecture

### SSH Key-Based Authentication
**Pattern**: Dual key deployment for security and redundancy

```bash
# Primary access key
~/.ssh/homelab_rsa            # Daily operations

# Emergency access key
~/.ssh/emergency_homelab_rsa  # Backup/recovery access
```

**Security Controls**:
- Password authentication completely disabled
- Root login prohibited
- SSH keys managed centrally
- Automatic key deployment

### User Privilege Management
**Pattern**: Least privilege with sudo elevation
```bash
# User configuration
username: cal
groups: [sudo, docker]        # Minimal required groups
shell: /bin/bash
sudo: ALL=(ALL) NOPASSWD:ALL  # Operational convenience
```

**Access Controls**:
- Non-root user accounts only
- Sudo required for administrative tasks
- Docker group for container management
- SSH key authentication mandatory

### Network Security
**Pattern**: Network segmentation and access control
- **Internal Network**: 10.10.0.x/24 for VM communication
- **Management Access**: SSH (port 22) only
- **Service Isolation**: Application-specific port exposure
- **Firewall Ready**: iptables/ufw configuration prepared

## Lifecycle Management Patterns

### VM Creation Workflow
1. **Template Selection**: Choose an appropriate base image
2. **Resource Allocation**: Size based on workload requirements
3. **Network Assignment**: VLAN and IP address planning
4. **Cloud-Init Configuration**: Apply the standardized template
5. **Automated Provisioning**: Zero-touch deployment
6. **Verification**: Automated connectivity and configuration tests

### Configuration Management
**Pattern**: Standardized system configuration
```bash
# Essential packages
packages: [
  "curl", "wget", "git", "vim", "htop", "unzip",
  "docker.io", "docker-compose-plugin"
]

# System services
runcmd:
  - systemctl enable docker
  - systemctl enable ssh
  - systemctl enable unattended-upgrades
```

### Maintenance Automation
**Pattern**: Automated updates and maintenance
- **Security Updates**: Automatic installation enabled
- **Package Management**: Standardized package selection
- **Service Management**: Consistent service configuration
- **Log Management**: Centralized logging ready

## Resource Management

### Sizing Standards
**Pattern**: Standardized VM resource allocation

```yaml
# Basic workload (web services, small databases)
vcpus: 2
memory: 4096   # 4GB
disk: 20       # 20GB

# Medium workload (application servers, medium databases)
vcpus: 4
memory: 8192   # 8GB
disk: 40       # 40GB

# Heavy workload (transcoding, large databases)
vcpus: 6
memory: 16384  # 16GB
disk: 100      # 100GB
```
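Provisioning scripts can capture the three tiers in a tiny lookup helper. This is a hypothetical convenience reflecting the table above, not an existing script:

```shell
# Hypothetical helper: map a workload tier to "vcpus memory_mb disk_gb".
vm_profile() {
  case "$1" in
    basic)  echo "2 4096 20" ;;
    medium) echo "4 8192 40" ;;
    heavy)  echo "6 16384 100" ;;
    *)      echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Example: split the profile into variables for a create call
read -r cpus mem disk <<< "$(vm_profile medium)"
echo "cores=$cpus memory=$mem disk=${disk}G"  # prints cores=4 memory=8192 disk=40G
```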
### Storage Strategy
**Pattern**: Application-appropriate storage allocation
- **System Disk**: OS and applications (20-40GB)
- **Data Volumes**: Application data (variable)
- **Backup Storage**: Network-attached for persistence
- **Cache Storage**: Local fast storage for performance

### Network Planning
**Pattern**: Structured network addressing
```yaml
# Network segments
management: 10.10.0.x/24   # VM management and SSH access
services: 10.10.1.x/24     # Application services
storage: 10.10.2.x/24      # Storage and backup traffic
dmz: 10.10.10.x/24         # External-facing services
```

## Monitoring and Operations

### Health Monitoring
**Pattern**: Automated system health checks
```bash
# Resource monitoring
cpu_usage: <80%
memory_usage: <90%
disk_usage: <85%
network_connectivity: verified

# Service monitoring
ssh_service: active
docker_service: active
unattended_upgrades: active
```
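Those thresholds translate directly into a check function that a cron job could run. A sketch, with the limits taken from the block above and metric collection left to the caller:

```shell
# Hypothetical helper: compare a usage percentage against its limit,
# print an OK/ALERT line, and return nonzero on breach.
check_metric() {
  local name="$1" value="$2" limit="$3"
  if [ "$value" -ge "$limit" ]; then
    echo "ALERT: $name at ${value}% (limit ${limit}%)"
    return 1
  fi
  echo "OK: $name at ${value}%"
}

# Example wiring with the documented limits (collection functions are
# placeholders, e.g. parsing df/free output):
# check_metric cpu    "$(get_cpu_pct)"  80
# check_metric memory "$(get_mem_pct)"  90
# check_metric disk   "$(get_disk_pct)" 85
```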
### Backup Strategies
**Pattern**: Multi-tier backup approach
- **VM Snapshots**: Point-in-time recovery (Proxmox)
- **Application Data**: Application-specific backup procedures
- **Configuration Backup**: Cloud-init templates and scripts
- **SSH Keys**: Centralized key management backup

### Performance Tuning
**Pattern**: Workload-optimized configuration
```yaml
# CPU optimization
cpu_type: host        # Performance over compatibility
numa: enabled         # NUMA awareness for multi-socket hosts

# Memory optimization
ballooning: enabled   # Dynamic memory allocation
hugepages: disabled   # Unless specifically needed

# Storage optimization
cache: writethrough   # Balance performance and safety
io_thread: enabled    # Improve I/O performance
```

## Integration Patterns

### Container Platform Integration
**Pattern**: Docker-ready VM deployment
```bash
# Automated Docker setup
- docker.io installation
- docker-compose plugin
- User added to docker group
- Service auto-start enabled
- Container runtime verified
```

### SSH Infrastructure Integration
**Pattern**: Centralized SSH key management
```bash
# Key deployment automation
primary_key: ~/.ssh/homelab_rsa.pub
emergency_key: ~/.ssh/emergency_homelab_rsa.pub
backup_system: automated
rotation_policy: annual
```

### Network Services Integration
**Pattern**: Ready for service deployment
- **Reverse Proxy**: Nginx/Traefik-ready configuration
- **DNS**: Local DNS registration prepared
- **Certificates**: Let's Encrypt integration ready
- **Monitoring**: Prometheus/Grafana agent ready

## Common Implementation Workflows

### New VM Deployment
1. **Create the VM** in Proxmox with cloud-init support
2. **Configure resources** based on workload requirements
3. **Apply the cloud-init template** with standardized configuration
4. **Start the VM** and wait for automated provisioning
5. **Verify the deployment** via SSH key authentication
6. **Deploy applications** using container or package management

### Existing VM Standardization
1. **Assess the current configuration** and identify gaps
2. **Run the post-install script** for automated updates
3. **Verify SSH key deployment** and that password auth is disabled
4. **Test the Docker installation** and user permissions
5. **Update documentation** with the new configuration
6. **Schedule regular maintenance** and monitoring

### VM Migration and Recovery
1. **Create a VM snapshot** before changes
2. **Export the VM configuration** and cloud-init template
3. **Test the recovery procedure** in a staging environment
4. **Document recovery steps** and verification procedures
5. **Implement backup automation** for critical VMs

## Best Practices

### Security Hardening
1. **SSH Keys Only**: Disable password authentication completely
2. **Emergency Access**: Deploy backup SSH keys for recovery
3. **User Separation**: Non-root users with sudo privileges
4. **Automatic Updates**: Enable security update automation
5. **Network Isolation**: Use VLANs and firewall rules

### Operational Excellence
1. **Infrastructure as Code**: Use cloud-init for reproducible deployments
2. **Standardization**: Consistent VM sizing and configuration
3. **Automation**: Minimize manual configuration steps
4. **Documentation**: Maintain deployment templates and procedures
5. **Testing**: Verify deployments before production use

### Performance Optimization
1. **Resource Right-Sizing**: Match resources to workload requirements
2. **Storage Strategy**: Use appropriate storage tiers
3. **Network Optimization**: Plan network topology for performance
4. **Monitoring**: Implement resource usage monitoring
5. **Capacity Planning**: Plan for growth and scaling

This technology context provides comprehensive guidance for implementing virtual machine management in home lab and production environments using modern IaC principles and security best practices.

--- vm-management/troubleshooting.md (new file, 652 lines) ---

# Virtual Machine Management Troubleshooting Guide

## VM Provisioning Issues

### Cloud-Init Configuration Problems

#### Cloud-Init Not Executing
**Symptoms**:
- VM starts but user accounts are not created
- SSH keys not deployed
- Packages not installed
- Configuration not applied

**Diagnosis**:
```bash
# Check cloud-init status and logs
ssh root@<vm-ip> 'cloud-init status --long'
ssh root@<vm-ip> 'cat /var/log/cloud-init.log'
ssh root@<vm-ip> 'cat /var/log/cloud-init-output.log'

# Verify cloud-init configuration
ssh root@<vm-ip> 'cloud-init query userdata'

# Check for YAML syntax errors
ssh root@<vm-ip> 'cloud-init devel schema --config-file /var/lib/cloud/instance/user-data.txt'
```

**Solutions**:
```bash
# Re-run cloud-init (CAUTION: may overwrite changes)
ssh root@<vm-ip> 'cloud-init clean --logs'
ssh root@<vm-ip> 'cloud-init init --local'
ssh root@<vm-ip> 'cloud-init init'
ssh root@<vm-ip> 'cloud-init modules --mode=config'
ssh root@<vm-ip> 'cloud-init modules --mode=final'

# Manual user creation if cloud-init fails
ssh root@<vm-ip> 'useradd -m -s /bin/bash -G sudo,docker cal'
ssh root@<vm-ip> 'mkdir -p /home/cal/.ssh'
ssh root@<vm-ip> 'chown cal:cal /home/cal/.ssh'
ssh root@<vm-ip> 'chmod 700 /home/cal/.ssh'
```

#### Invalid Cloud-Init YAML
**Symptoms**:
- Cloud-init fails with syntax errors
- Parser errors in cloud-init logs
- Partial configuration application

**Common YAML Issues**:
```yaml
# ❌ Incorrect indentation
users:
  - name: cal
  groups: [sudo, docker]      # Wrong indentation (not under the list item)

# ✅ Correct indentation
users:
  - name: cal
    groups: [sudo, docker]    # Proper indentation

# ❌ Missing quotes for special characters
ssh_authorized_keys:
  - ssh-rsa AAAAB3NzaC1... user@host    # May fail with special chars

# ✅ Quoted strings
ssh_authorized_keys:
  - "ssh-rsa AAAAB3NzaC1... user@host"
```

### VM Boot and Startup Issues

#### VM Won't Start
**Symptoms**:
- VM fails to boot from Proxmox
- Kernel panic messages
- Boot loop or hanging

**Diagnosis**:
```bash
# Check VM configuration
pvesh get /nodes/pve/qemu/<vmid>/config

# Check resource allocation
pvesh get /nodes/pve/qemu/<vmid>/status/current

# Review VM logs via the Proxmox console
# Use the Proxmox web interface -> VM -> Console

# Check Proxmox host resources
pvesh get /nodes/pve/status
```

**Solutions**:
```bash
# Increase memory allocation
pvesh set /nodes/pve/qemu/<vmid>/config -memory 4096

# Reset CPU configuration
pvesh set /nodes/pve/qemu/<vmid>/config -cpu host -cores 2

# Check and repair the disk
# Stop the VM, then:
pvesh get /nodes/pve/qemu/<vmid>/config | grep scsi0
# Use fsck on the disk image if needed
```

#### Resource Constraints
**Symptoms**:
- Extremely slow VM performance
- Out-of-memory kills
- Disk I/O bottlenecks

**Diagnosis**:
```bash
# Inside the VM
free -h
df -h
iostat 1 5
vmstat 1 5

# On the Proxmox host
pvesh get /nodes/pve/status
cat /proc/meminfo
df -h /var/lib/vz
```

**Solutions**:
```bash
# Increase VM resources via Proxmox
pvesh set /nodes/pve/qemu/<vmid>/config -memory 8192
pvesh set /nodes/pve/qemu/<vmid>/config -cores 4

# Resize the VM disk
# Proxmox GUI: Hardware -> Hard Disk -> Resize
# Then extend the filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```

## SSH Access Issues

### SSH Connection Failures

#### Cannot Connect to VM
**Symptoms**:
- Connection timeout
- Connection refused
- Host unreachable

**Diagnosis**:
```bash
# Network connectivity tests
ping <vm-ip>
traceroute <vm-ip>

# SSH service tests
nc -zv <vm-ip> 22
nmap -p 22 <vm-ip>

# From the Proxmox console, check the SSH service
systemctl status sshd
ss -tlnp | grep :22
```

**Solutions**:
```bash
# Via the Proxmox console - restart SSH
systemctl start sshd
systemctl enable sshd

# Check and configure the firewall
ufw status
# If blocking SSH:
ufw allow ssh
ufw allow 22/tcp

# Network configuration reset
ip addr show
dhclient                      # For DHCP
systemctl restart networking
```

#### SSH Key Authentication Failures
**Symptoms**:
- Password prompts despite key installation
- "Permission denied (publickey)"
- "No more authentication methods"

**Diagnosis**:
```bash
# Verbose SSH debugging
ssh -vvv cal@<vm-ip>

# Check key files locally
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*

# Via console or password auth, check the VM
ls -la ~/.ssh/
cat ~/.ssh/authorized_keys
```

**Solutions**:
```bash
# Fix SSH directory permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
chown -R cal:cal ~/.ssh

# Re-deploy SSH keys
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC...   # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD...   # emergency key
EOF

# Verify SSH server configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"
```
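The permission fixes above can be verified mechanically before closing the ticket. A sketch of such a check (hypothetical helper; assumes GNU `stat`):

```shell
# Verify that an SSH directory and its authorized_keys file carry the
# expected modes (700 and 600); print findings and return nonzero on a mismatch.
check_ssh_perms() {
  local ssh_dir="$1" bad=0
  [ "$(stat -c '%a' "$ssh_dir")" = "700" ] || { echo "BAD: $ssh_dir mode"; bad=1; }
  if [ -f "$ssh_dir/authorized_keys" ]; then
    [ "$(stat -c '%a' "$ssh_dir/authorized_keys")" = "600" ] \
      || { echo "BAD: authorized_keys mode"; bad=1; }
  fi
  [ "$bad" -eq 0 ] && echo "OK"
  return "$bad"
}
```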
#### SSH Security Configuration Issues
**Symptoms**:
- Password authentication still enabled
- Root login allowed
- Insecure SSH settings

**Diagnosis**:
```bash
# Check the effective SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot|allowusers)"

# Review SSH config files
cat /etc/ssh/sshd_config
ls /etc/ssh/sshd_config.d/
```

**Solutions**:
```bash
# Apply security hardening
sudo tee /etc/ssh/sshd_config.d/99-homelab-security.conf << 'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
Protocol 2
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
X11Forwarding no
EOF

sudo systemctl restart sshd
```
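After the restart, the hardening can be audited by parsing the effective configuration. Sketched as a function over captured text so the logic is testable; on a real host you would feed it `"$(sudo sshd -T)"` (the patterns mirror the diagnosis grep above):

```shell
# Hypothetical helper: audit sshd settings passed in as text.
# sshd -T prints lowercase "keyword value" lines, matched here whole-line.
audit_sshd() {
  local cfg="$1" fail=0
  echo "$cfg" | grep -qix 'passwordauthentication no' || { echo "FAIL: password auth enabled"; fail=1; }
  echo "$cfg" | grep -qix 'permitrootlogin no'        || { echo "FAIL: root login allowed"; fail=1; }
  echo "$cfg" | grep -qix 'pubkeyauthentication yes'  || { echo "FAIL: pubkey auth disabled"; fail=1; }
  [ "$fail" -eq 0 ] && echo "PASS"
  return "$fail"
}
```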
## Docker Installation and Configuration Issues

### Docker Installation Failures

#### Package Installation Fails
**Symptoms**:
- Docker packages not found
- GPG key verification errors
- Repository access failures

**Diagnosis**:
```bash
# Test internet connectivity
ping google.com
curl -I https://download.docker.com

# Check repository configuration
cat /etc/apt/sources.list.d/docker.list
apt-cache policy docker-ce

# Check for package conflicts
dpkg -l | grep docker
```

**Solutions**:
```bash
# Remove conflicting packages
sudo apt remove -y docker docker-engine docker.io containerd runc

# Reinstall the Docker repository
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg

echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list

# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```

#### Docker Service Issues
**Symptoms**:
- Docker daemon won't start
- Socket connection errors
- Service failure on boot

**Diagnosis**:
```bash
# Check service status
systemctl status docker
journalctl -u docker.service -f

# Check system resources
df -h
free -h

# Test the daemon manually
sudo dockerd --debug
```

**Solutions**:
```bash
# Restart the Docker service
sudo systemctl stop docker
sudo systemctl start docker
sudo systemctl enable docker

# Clear corrupted Docker data
sudo systemctl stop docker
sudo rm -rf /var/lib/docker/tmp/*
sudo systemctl start docker

# Reset the Docker configuration
sudo mv /etc/docker/daemon.json /etc/docker/daemon.json.bak 2>/dev/null || true
sudo systemctl restart docker
```

### Docker Permission and Access Issues

#### Permission Denied Errors
**Symptoms**:
- Must use sudo for Docker commands
- "Permission denied" when accessing the Docker socket
- User not in the docker group

**Diagnosis**:
```bash
# Check user groups
groups
groups cal
getent group docker

# Check Docker socket permissions
ls -la /var/run/docker.sock

# Verify the Docker service is running
systemctl status docker
```

**Solutions**:
```bash
# Add the user to the docker group
sudo usermod -aG docker cal

# Create the docker group if missing
sudo groupadd docker 2>/dev/null || true
sudo usermod -aG docker cal

# Apply group membership (requires logout/login, or):
newgrp docker

# Fix socket permissions
sudo chown root:docker /var/run/docker.sock
sudo chmod 664 /var/run/docker.sock
```
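Whether the group fix took effect can be checked without root by inspecting the current group list. A sketch; on a real system you would call it as `in_docker_group "$(id -nG)"`:

```shell
# Hypothetical helper: check a space-separated group list for docker
# membership; echoes "yes"/"no" and sets the exit status accordingly.
in_docker_group() {
  case " $1 " in
    *" docker "*) echo "yes"; return 0 ;;
    *)            echo "no";  return 1 ;;
  esac
}
```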

## Network Configuration Problems

### IP Address and Connectivity Issues

#### Incorrect IP Configuration
**Symptoms**:
- VM has wrong IP address
- No network connectivity
- Cannot reach default gateway

**Diagnosis**:
```bash
# Check network configuration
ip addr show
ip route show
cat /etc/netplan/*.yaml

# Test connectivity
ping $(ip route | grep default | awk '{print $3}')  # Gateway
ping 8.8.8.8  # External connectivity
```

**Solutions**:
```bash
# Fix netplan configuration
sudo tee /etc/netplan/00-installer-config.yaml << 'EOF'
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses: [10.10.0.200/24]
      gateway4: 10.10.0.1  # deprecated on newer netplan; use "routes: [{to: default, via: 10.10.0.1}]"
      nameservers:
        addresses: [10.10.0.16, 8.8.8.8]
EOF

# Apply network configuration ("netplan try" auto-reverts if connectivity drops)
sudo netplan try
sudo netplan apply
```
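Flattened indentation is the most common way a netplan file breaks, so a quick key check before applying can save a round trip to the console. A hedged sketch: it only greps for the keys the example above sets, and the temp file stands in for the real config:

```bash
# Minimal pre-apply sanity check for a netplan file (key list is an assumption)
cfg=$(mktemp)
printf 'network:\n  version: 2\n  ethernets:\n    ens18:\n      dhcp4: false\n' > "$cfg"

ok=1
for key in 'network:' 'version: 2' 'ethernets:'; do
  grep -q "$key" "$cfg" || ok=0
done
[ "$ok" -eq 1 ] && echo "basic netplan keys present"
rm -f "$cfg"
```

In practice `sudo netplan try` is the stronger gate, since it reverts automatically if you lose connectivity.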

#### DNS Resolution Problems
**Symptoms**:
- Cannot resolve domain names
- Package downloads fail
- Host lookup failures

**Diagnosis**:
```bash
# Check DNS configuration
cat /etc/resolv.conf
resolvectl status  # "systemd-resolve --status" on older releases

# Test DNS resolution
nslookup google.com
dig google.com @8.8.8.8
```

**Solutions**:
```bash
# Fix DNS in netplan (see above example)
sudo netplan apply

# Temporary DNS fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

# Restart DNS services
sudo systemctl restart systemd-resolved
sudo systemctl restart networking  # ifupdown systems only; not present on netplan-based Ubuntu
```
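For scripted checks, `getent` exercises the same NSS lookup path that ordinary programs use, unlike `dig`, which queries DNS directly. A small sketch; the helper name is an assumption:

```bash
# resolves: succeed if a name can be looked up through the system resolver
resolves() {
  getent hosts "$1" >/dev/null 2>&1
}

if resolves localhost; then
  echo "system resolver path OK"
else
  echo "resolution broken - check /etc/resolv.conf"
fi
```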

## System Maintenance Issues

### Package Management Problems

#### Update Failures
**Symptoms**:
- apt update fails
- Repository signature errors
- Dependency conflicts

**Diagnosis**:
```bash
# Check repository status
sudo apt update
apt-cache policy

# Check disk space
df -h /
df -h /var

# Check for held packages
apt-mark showhold
```

**Solutions**:
```bash
# Fix broken packages
sudo apt --fix-broken install
sudo dpkg --configure -a

# Clean package cache
sudo apt clean
sudo apt autoclean
sudo apt autoremove

# Reset problematic repositories (apt-key is deprecated; prefer keys in /etc/apt/keyrings/)
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys <KEYID>
sudo apt update
```
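Transient mirror failures often clear on a second attempt, so a retry wrapper keeps the fix scriptable. A hedged sketch (helper name, attempt count, and delay are assumptions):

```bash
# retry: run a command up to N times, pausing between attempts
# e.g. retry 3 sudo apt update
retry() {
  local attempts="$1"; shift
  local i
  for i in $(seq 1 "$attempts"); do
    "$@" && return 0
    sleep 1
  done
  return 1
}

retry 3 true && echo "retry helper OK"
```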

### Storage and Disk Space Issues

#### Disk Space Exhaustion
**Symptoms**:
- Cannot install packages
- Docker operations fail
- System becomes unresponsive

**Diagnosis**:
```bash
# Check disk usage
df -h
du -sh /home/* /var/* /opt/* 2>/dev/null

# Find large files
find / -type f -size +100M 2>/dev/null | head -20
```

**Solutions**:
```bash
# Clean system files
sudo apt clean
sudo apt autoremove
sudo journalctl --vacuum-time=7d

# Clean Docker data
docker system prune -a -f
docker volume prune -f

# Extend disk (Proxmox GUI: Hardware -> Resize)
# Then extend the filesystem inside the VM:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```
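The `df` check above can be turned into a threshold alert for cron or a monitoring script. A sketch; the 90% threshold and helper name are assumptions:

```bash
# disk_usage_pct: print a filesystem's use% as a bare number (POSIX df output)
disk_usage_pct() {
  df -P "$1" | awk 'NR==2 {gsub(/%/, "", $5); print $5}'
}

pct=$(disk_usage_pct /)
if [ "$pct" -ge 90 ]; then
  echo "WARNING: / is ${pct}% full"
else
  echo "/ usage: ${pct}%"
fi
```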

## Emergency Recovery Procedures

### SSH Access Recovery

#### Complete SSH Lockout
**Recovery Steps**:
1. **Use Proxmox console** for direct VM access
2. **Reset SSH configuration**:
   ```bash
   # Via console
   sudo cp /etc/ssh/sshd_config.backup /etc/ssh/sshd_config 2>/dev/null || true
   sudo sshd -t && sudo systemctl restart sshd
   ```
3. **Re-enable emergency access**:
   ```bash
   # Temporary password access for recovery
   sudo passwd cal
   sudo sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
   sudo systemctl restart sshd
   ```
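Hand-editing `sshd_config` during a lockout is exactly when typos happen, and the `sed` one-liner misses commented-out directives. An idempotent setter plus `sshd -t` validation reduces the risk of locking yourself out again. A hedged sketch, exercised on a temp file rather than the real config:

```bash
# set_sshd_option: set (or uncomment and replace) a key in an sshd_config-style file
set_sshd_option() {
  local file="$1" key="$2" value="$3"
  if grep -qE "^#?${key}\b" "$file"; then
    sed -i -E "s|^#?${key}\b.*|${key} ${value}|" "$file"
  else
    printf '%s %s\n' "$key" "$value" >> "$file"
  fi
}

cfg=$(mktemp)
printf '#PasswordAuthentication yes\nPort 22\n' > "$cfg"
set_sshd_option "$cfg" PasswordAuthentication yes
grep -qx 'PasswordAuthentication yes' "$cfg" && echo "option set"
rm -f "$cfg"
# On the real file, validate before restarting: sudo sshd -t && sudo systemctl restart sshd
```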

#### Emergency SSH Key Deployment
**If primary keys fail**:
```bash
# Use emergency key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# Or deploy keys via console
mkdir -p ~/.ssh
chmod 700 ~/.ssh
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key
EOF
chmod 600 ~/.ssh/authorized_keys
```
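The console deployment above overwrites `authorized_keys` wholesale; appending a key only when it is missing makes the procedure safe to repeat. A sketch with a placeholder key (not a real one):

```bash
# add_key_once: append a public key to authorized_keys only if not already present
add_key_once() {
  local keyfile="$1" pubkey="$2"
  grep -qxF "$pubkey" "$keyfile" 2>/dev/null || printf '%s\n' "$pubkey" >> "$keyfile"
}

auth=$(mktemp)
key='ssh-ed25519 AAAAC3PlaceholderKey cal@workstation'  # placeholder, not a real key
add_key_once "$auth" "$key"
add_key_once "$auth" "$key"
[ "$(wc -l < "$auth")" -eq 1 ] && echo "key present exactly once"
rm -f "$auth"
```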

### VM Recovery and Rebuild

#### Corrupt VM Recovery
**Steps**:
1. **Create snapshot** before attempting recovery
2. **Export VM data**:
   ```bash
   # Backup important data
   rsync -av cal@<vm-ip>:/home/cal/ ./vm-backup/
   ```
3. **Restore from template**:
   ```bash
   # Delete corrupt VM
   pvesh delete /nodes/pve/qemu/<vmid>

   # Clone from template
   pvesh create /nodes/pve/qemu/<template-id>/clone -newid <vmid> -name <vm-name>
   ```
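The delete in step 3 is destructive, so it is worth encoding the "backup first, then delete" ordering in a guard. A hedged sketch: `BACKUP_CMD` and `DELETE_CMD` are stand-ins so the ordering can be dry-run without touching Proxmox; in real use they would wrap the rsync and pvesh commands above.

```bash
# recover_vm: refuse to delete unless the backup step succeeded first
recover_vm() {
  local vmid="$1"
  $BACKUP_CMD "$vmid" || { echo "backup failed - aborting delete"; return 1; }
  $DELETE_CMD "$vmid"
}

# Dry run with echo stubs:
BACKUP_CMD='echo backing-up'
DELETE_CMD='echo deleting'
recover_vm 200
```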

#### Post-Install Script Recovery
**If automation fails**:
```bash
# Run in debug mode
bash -x ./scripts/vm-management/vm-post-install.sh <vm-ip> <user>

# Manual step execution
ssh cal@<vm-ip> 'sudo apt update && sudo apt upgrade -y'
ssh cal@<vm-ip> 'curl -fsSL https://get.docker.com | sh'
ssh cal@<vm-ip> 'sudo usermod -aG docker cal'
```
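When re-running steps manually, a tiny wrapper that prints a pass/fail marker per step makes it obvious where the automation died. A sketch; `run_step` and the `true` stand-in command are assumptions:

```bash
# run_step: run one recovery step and report the result
run_step() {
  local name="$1"; shift
  if "$@"; then
    echo "OK   $name"
  else
    echo "FAIL $name"
    return 1
  fi
}

run_step "refresh apt cache" true   # stand-in for: ssh cal@<vm-ip> 'sudo apt update'
```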

## Prevention and Monitoring

### Pre-Deployment Validation
```bash
# Verify prerequisites
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*
ping 10.10.0.1

# Test cloud-init YAML
python3 -c "import yaml; yaml.safe_load(open('cloud-init-user-data.yaml'))"
```

### Health Monitoring Script
```bash
#!/bin/bash
# vm-health-check.sh
VM_IPS="10.10.0.200 10.10.0.201 10.10.0.202"

for ip in $VM_IPS; do
  if ssh -o ConnectTimeout=5 -o BatchMode=yes cal@$ip 'uptime' >/dev/null 2>&1; then
    echo "✅ $ip: SSH OK"
    # Check Docker
    if ssh cal@$ip 'docker info >/dev/null 2>&1'; then
      echo "✅ $ip: Docker OK"
    else
      echo "❌ $ip: Docker FAILED"
    fi
  else
    echo "❌ $ip: SSH FAILED"
  fi
done
```

### Automated Backup
```bash
#!/bin/bash
# vm-backup.sh (schedule in crontab: 0 2 * * * /path/to/vm-backup.sh)
for vm_ip in 10.10.0.{200..210}; do
  if ping -c1 $vm_ip >/dev/null 2>&1; then
    rsync -av --exclude='.cache' cal@$vm_ip:/home/cal/ ./backups/$vm_ip/
  fi
done
```

## Quick Reference Commands

### Essential VM Management
```bash
# VM control via Proxmox
pvesh get /nodes/pve/qemu/<vmid>/status/current
pvesh create /nodes/pve/qemu/<vmid>/status/start
pvesh create /nodes/pve/qemu/<vmid>/status/stop

# SSH with alternative keys
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# System health checks
free -h && df -h && systemctl status docker
docker system info && docker system df
```

### Recovery Resources
- **SSH Keys Backup**: `/mnt/NV2/ssh-keys/backup-*/`
- **Proxmox Console**: Direct VM access when SSH fails
- **Emergency Contact**: Use Discord notifications for critical issues

This troubleshooting guide covers recovery procedures for the VM management issues most commonly encountered in home lab environments.
|
||||||
Loading…
Reference in New Issue
Block a user