CLAUDE: Migrate to technology-first documentation architecture

Complete restructure from patterns/examples/reference to technology-focused directories:

• Created technology-specific directories with comprehensive documentation:
  - /tdarr/ - Transcoding automation with gaming-aware scheduling
  - /docker/ - Container management with GPU acceleration patterns
  - /vm-management/ - Virtual machine automation and cloud-init
  - /networking/ - SSH infrastructure, reverse proxy, and security
  - /monitoring/ - System health checks and Discord notifications
  - /databases/ - Database patterns and troubleshooting
  - /development/ - Programming language patterns (bash, nodejs, python, vuejs)

• Enhanced CLAUDE.md with intelligent context loading:
  - Technology-first loading rules for automatic context provision
  - Troubleshooting keyword triggers for emergency scenarios
  - Documentation maintenance protocols with automated reminders
  - Context window management for optimal documentation updates

• Preserved valuable content from .claude/tmp/:
  - SSH security improvements and server inventory
  - Tdarr CIFS troubleshooting and Docker iptables solutions
  - Operational scripts with proper technology classification

• Benefits achieved:
  - Self-contained technology directories with complete context
  - Automatic loading of relevant documentation based on keywords
  - Emergency-ready troubleshooting with comprehensive guides
  - Scalable structure for future technology additions
  - Eliminated context bloat through targeted loading

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Cal Corum committed 2025-08-12 23:20:15 -05:00
parent 7edb4a3a9c
commit 10c9e0d854
86 changed files with 7123 additions and 753 deletions

.claude/settings.json (new file, 7 lines)

@@ -0,0 +1,7 @@
{
"notifications_disabled": true,
"allowed_working_directories": [
"/mnt/NV2/Development/claude-home",
"/mnt/media"
]
}

.gitignore (vendored)

@@ -1,2 +1,3 @@
.claude/tmp/
tmp/
__pycache__

CLAUDE.md (315 lines changed)

@@ -8,130 +8,98 @@
- If creating a temporary file will help achieve your goal, please create the file in the .claude/tmp/ directory and clean up when you're done.
- Prefer editing an existing file to creating a new one.
- Following a complex task or series of tasks, prompt the user to save any key learnings from the session.
- **Documentation Maintenance Reminder**: At the end of coding sessions, proactively ask: "Should I update our documentation to reflect the changes we made today?" Focus on CONTEXT.md files, troubleshooting guides, and any new patterns discovered.
- **Context Window Management**: When approaching 25% context window remaining, prioritize documentation updates before auto-summarization occurs. Ask: "We're approaching context limits - should I update our documentation now to capture today's work before we lose context?"
## Automatic Context Loading Rules
### Technology-First Loading Rules
When working with specific technologies, automatically load their dedicated context:
**Tdarr Keywords**
- "tdarr", "transcode", "ffmpeg", "gpu transcoding", "nvenc", "scheduler", "api"
- Load: `tdarr/CONTEXT.md` (technology overview and patterns)
- Load: `tdarr/troubleshooting.md` (error handling and debugging)
- If working in `/tdarr/scripts/`: Load `tdarr/scripts/CONTEXT.md` (script-specific documentation)
- Note: Gaming-aware scheduling system with configurable time windows available
- Note: Comprehensive API monitoring available via `tdarr_monitor.py` with dataclass-based status tracking
**Docker Keywords**
- "docker", "container", "image", "compose", "kubernetes", "k8s", "dockerfile", "podman"
- Load: `docker/CONTEXT.md` (technology overview and patterns)
- Load: `docker/troubleshooting.md` (error handling and debugging)
- If working in `/docker/scripts/`: Load `docker/scripts/CONTEXT.md` (script-specific documentation)
**VM Management Keywords**
- "virtual machine", "vm", "proxmox", "kvm", "hypervisor", "guest", "virtualization"
- Load: `vm-management/CONTEXT.md` (technology overview and patterns)
- Load: `vm-management/troubleshooting.md` (error handling and debugging)
- If working in `/vm-management/scripts/`: Load `vm-management/scripts/CONTEXT.md` (script-specific documentation)
**Networking Keywords**
- "network", "nginx", "proxy", "load balancer", "dns", "port", "firewall", "ssh", "ssl", "tls"
- Load: `networking/CONTEXT.md` (technology overview and patterns)
- Load: `networking/troubleshooting.md` (error handling and debugging)
- If working in `/networking/scripts/`: Load `networking/scripts/CONTEXT.md` (script-specific documentation)
**Monitoring Keywords**
- "monitoring", "alert", "notification", "discord", "health check", "status", "uptime", "windows reboot", "system monitor"
- Load: `monitoring/CONTEXT.md` (technology overview and patterns)
- Load: `monitoring/troubleshooting.md` (error handling and debugging)
- If working in `/monitoring/scripts/`: Load `monitoring/scripts/CONTEXT.md` (script-specific documentation)
- Note: Windows desktop monitoring with Discord notifications available
- Note: Comprehensive Tdarr API monitoring with dataclass-based status tracking
### Directory Context Triggers
When working in specific directories:
**Technology directories (/tdarr/, /docker/, /vm-management/, /networking/, /monitoring/)**
- Load: `{technology}/CONTEXT.md` (technology overview)
- Load: `{technology}/troubleshooting.md` (debugging info)
**Script subdirectories (/tdarr/scripts/, /docker/scripts/, etc.)**
- Load: `{technology}/CONTEXT.md` (parent technology context)
- Load: `{technology}/scripts/CONTEXT.md` (script-specific context)
- Load: `{technology}/troubleshooting.md` (debugging info)
- Context: Active operational scripts - treat as production code
**Legacy directories (for backward compatibility)**
- `/scripts/tdarr/` → Load Tdarr context files
- `/scripts/monitoring/` → Load Monitoring context files
- `/patterns/`, `/examples/`, `/reference/` → Load as before until migration complete
### File Extension Triggers
For programming languages, load general development context:
**Python (.py, .pyx, .pyi)**
- Load: `development/python-CONTEXT.md` (Python patterns and best practices)
- If Django/Flask detected: Load `development/web-frameworks-CONTEXT.md`
- If requests/httpx detected: Load `development/api-clients-CONTEXT.md`
**JavaScript/Node.js (.js, .mjs, .ts)**
- Load: `development/nodejs-CONTEXT.md` (Node.js patterns and best practices)
- If package.json exists: Load `development/package-management-CONTEXT.md`
**Shell Scripts (.sh, .bash, .zsh)**
- Load: `development/bash-CONTEXT.md` (Bash scripting patterns)
- If systemd mentioned: Load `development/service-management-CONTEXT.md`
### Troubleshooting Keywords
For troubleshooting scenarios, always load both context and troubleshooting files:
**General Troubleshooting Keywords**
- "shutdown", "stop", "emergency", "reset", "recovery", "crash", "broken", "not working", "error", "issue", "problem", "debug", "troubleshoot", "fix"
- If Tdarr context detected: Load `tdarr/CONTEXT.md` AND `tdarr/troubleshooting.md`
- If Docker context detected: Load `docker/CONTEXT.md` AND `docker/troubleshooting.md`
- If VM context detected: Load `vm-management/CONTEXT.md` AND `vm-management/troubleshooting.md`
- If Network context detected: Load `networking/CONTEXT.md` AND `networking/troubleshooting.md`
- If Monitoring context detected: Load `monitoring/CONTEXT.md` AND `monitoring/troubleshooting.md`
**Specific Tdarr Troubleshooting Keywords**
- "forEach error", "staging timeout", "gaming detection", "plugin error", "container stop", "node disconnect", "cache cleanup", "shutdown tdarr", "stop tdarr", "emergency tdarr", "reset tdarr"
- Load: `tdarr/CONTEXT.md` (technology overview)
- Load: `tdarr/troubleshooting.md` (specific solutions including Emergency Recovery section)
- If working in `/tdarr/scripts/`: Load `tdarr/scripts/CONTEXT.md`
### Priority Rules
1. **File extension triggers** take highest priority
@@ -141,33 +109,132 @@ When user mentions specific terms, automatically load relevant docs:
5. Always prefer specific over general (e.g., `vuejs/` over `nodejs/`)
### Context Loading Behavior
- **Technology context first**: Load CONTEXT.md for overview and patterns
- **Troubleshooting context**: ALWAYS load troubleshooting.md for error scenarios and emergency procedures
- **Script-specific context**: Load scripts/CONTEXT.md when working in script directories
- **Examples last**: Load examples for implementation details
- **Critical rule**: For any troubleshooting scenario, load BOTH context and troubleshooting files to ensure complete information
- Maximum of 3-4 documentation files per trigger to maintain efficiency while ensuring comprehensive coverage
## Documentation Structure
```
/tdarr/                      # Tdarr transcoding automation
├── CONTEXT.md               # Technology overview, patterns, best practices
├── troubleshooting.md       # Error handling and debugging
├── examples/                # Working configurations and templates
└── scripts/                 # Active automation scripts
    ├── CONTEXT.md           # Script-specific documentation
    ├── monitoring.py        # Comprehensive API monitoring with dataclasses
    └── scheduler.py         # Gaming-aware scheduling system

/docker/                     # Container orchestration and management
├── CONTEXT.md               # Technology overview, patterns, best practices
├── troubleshooting.md       # Error handling and debugging
├── examples/                # Working configurations and templates
└── scripts/                 # Active automation scripts
    └── CONTEXT.md           # Script-specific documentation

/vm-management/              # Virtual machine operations
├── CONTEXT.md               # Technology overview, patterns, best practices
├── troubleshooting.md       # Error handling and debugging
├── examples/                # Working configurations and templates
└── scripts/                 # Active automation scripts
    └── CONTEXT.md           # Script-specific documentation

/networking/                 # Network configuration and SSH management
├── CONTEXT.md               # Technology overview, patterns, best practices
├── troubleshooting.md       # Error handling and debugging
├── examples/                # Working configurations and templates
└── scripts/                 # Active automation scripts
    └── CONTEXT.md           # Script-specific documentation

/monitoring/                 # System monitoring and alerting
├── CONTEXT.md               # Technology overview, patterns, best practices
├── troubleshooting.md       # Error handling and debugging
├── examples/                # Working configurations and templates
└── scripts/                 # Active automation scripts
    ├── CONTEXT.md           # Script-specific documentation
    └── windows-desktop/     # Windows reboot monitoring with Discord notifications

/development/                # Programming language patterns and tools
├── python-CONTEXT.md        # Python development patterns
├── nodejs-CONTEXT.md        # Node.js development patterns
└── bash-CONTEXT.md          # Shell scripting patterns

/legacy/                     # Backward compatibility during migration
├── patterns/                # Old patterns structure (temporary)
├── examples/                # Old examples structure (temporary)
└── reference/               # Old reference structure (temporary)
```
### Directory Usage Guidelines
- Each technology directory is self-contained with its own context, troubleshooting, examples, and scripts
- `CONTEXT.md` files provide technology overview, patterns, and best practices for Claude
- `troubleshooting.md` files contain error handling and debugging information
- `/scripts/` subdirectories contain active operational code with their own `CONTEXT.md`
- `/examples/` subdirectories contain template configurations and reference implementations
- `/development/` contains general programming language patterns that apply across technologies
- `/legacy/` provides backward compatibility during the migration from the old structure
## Documentation Maintenance Protocol
### Automated Maintenance Triggers
Claude Code should automatically prompt for documentation updates when:
1. **New Technology Integration**: When working with a technology that doesn't have a dedicated directory
- Prompt: "I notice we're working with [technology] but don't have a dedicated `/[technology]/` directory. Should I create the technology-first structure with CONTEXT.md and troubleshooting.md files?"
2. **New Error Patterns Discovered**: When encountering and solving new issues
- Prompt: "We just resolved a [technology] issue that isn't documented. Should I add this solution to `[technology]/troubleshooting.md`?"
3. **New Scripts or Operational Procedures**: When creating new automation or workflows
- Prompt: "I created new scripts/procedures for [technology]. Should I update `[technology]/scripts/CONTEXT.md` and add any new operational patterns?"
4. **Session End with Significant Changes**: When completing complex tasks
- Prompt: "We made significant changes to [technology] systems. Should I update our documentation to reflect the new patterns, configurations, or troubleshooting procedures we discovered?"
### Documentation Update Checklist
When "update our documentation" is requested, systematically check:
**Technology-Specific Updates**:
- [ ] Update `[technology]/CONTEXT.md` with new patterns or architectural changes
- [ ] Add new troubleshooting scenarios to `[technology]/troubleshooting.md`
- [ ] Update `[technology]/scripts/CONTEXT.md` for new operational procedures
- [ ] Add working examples to `[technology]/examples/` if new configurations were created
**Cross-Technology Updates**:
- [ ] Update main CLAUDE.md loading rules if new keywords or triggers are needed
- [ ] Add new technology directories to the Documentation Structure section
- [ ] Update Directory Usage Guidelines if new organizational patterns emerge
**Legacy Cleanup**:
- [ ] Check if any old patterns/examples/reference files can be migrated to technology directories
- [ ] Update or remove outdated information that conflicts with new approaches
### Self-Maintenance Features
**Loading Rule Validation**: Periodically verify that:
- All technology directories have corresponding keyword triggers
- Troubleshooting keywords include all common error scenarios
- File paths in loading rules match actual directory structure
**Documentation Completeness Check**: Each technology directory should have:
- `CONTEXT.md` (overview, patterns, best practices)
- `troubleshooting.md` (error scenarios, emergency procedures)
- `examples/` (working configurations)
- `scripts/CONTEXT.md` (if operational scripts exist)
**Keyword Coverage Analysis**: Ensure loading rules cover:
- Technology names and common aliases
- Error types and troubleshooting scenarios
- Operational keywords (start, stop, configure, monitor)
- Emergency keywords (shutdown, reset, recovery)
### Warning Triggers
Claude Code should warn when:
- Working extensively with a technology that lacks dedicated documentation structure
- Solving problems that aren't covered in existing troubleshooting guides
- Creating scripts or procedures without corresponding CONTEXT.md documentation
- Encountering loading rules that reference non-existent files

Database troubleshooting guide (new file, 316 lines)

@@ -0,0 +1,316 @@
# Database Troubleshooting Guide
## Connection Issues
### Cannot Connect to Database
**Symptoms**: Connection refused, timeout errors, authentication failures
**Diagnosis**:
```bash
# Test basic connectivity
telnet db-server 3306 # MySQL
telnet db-server 5432 # PostgreSQL
nc -zv db-server 6379 # Redis
# Check database service status
systemctl status mysql
systemctl status postgresql
systemctl status redis-server
```
**Solutions**:
```bash
# Restart database services
sudo systemctl restart mysql
sudo systemctl restart postgresql
# Check configuration files
sudo nano /etc/mysql/mysql.conf.d/mysqld.cnf
sudo nano /etc/postgresql/*/main/postgresql.conf
# Verify port bindings
ss -tlnp | grep :3306 # MySQL
ss -tlnp | grep :5432 # PostgreSQL
```
## Performance Issues
### Slow Query Performance
**Symptoms**: Long-running queries, high CPU usage, timeouts
**Diagnosis**:
```sql
-- MySQL
SHOW PROCESSLIST;
SHOW ENGINE INNODB STATUS;
EXPLAIN SELECT * FROM table WHERE condition;
-- PostgreSQL
SELECT * FROM pg_stat_activity;
EXPLAIN ANALYZE SELECT * FROM table WHERE condition;
```
**Solutions**:
```sql
-- Add missing indexes
CREATE INDEX idx_column ON table(column);
-- Analyze table statistics
ANALYZE TABLE table_name; -- MySQL
ANALYZE table_name; -- PostgreSQL
-- Optimize queries
-- Use LIMIT for large result sets
-- Add WHERE clauses to filter results
-- Use appropriate JOIN types
```
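As a concrete illustration of these guidelines, a hypothetical slow lookup (table and column names are placeholders, not from a real schema) can usually be reworked to use an index, a selective WHERE clause, and a LIMIT:
```sql
-- Before: function on the filter column forces a full scan and returns every row
SELECT * FROM orders WHERE YEAR(created_at) = 2025;

-- After: index the filter column and keep the result set small
CREATE INDEX idx_orders_created_at ON orders(created_at);
SELECT id, customer_id, total
FROM orders
WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01'
ORDER BY created_at DESC
LIMIT 100;
```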
### Memory and Resource Issues
**Symptoms**: Out of memory errors, swap usage, slow performance
**Diagnosis**:
```bash
# Check memory usage
free -h
ps aux | grep mysql
ps aux | grep postgres
# Database-specific memory usage
mysqladmin -u root -p status
sudo -u postgres psql -c "SELECT * FROM pg_stat_database;"
```
**Solutions**:
```bash
# Adjust database memory settings
# MySQL - /etc/mysql/mysql.conf.d/mysqld.cnf
innodb_buffer_pool_size = 2G
key_buffer_size = 256M
# PostgreSQL - /etc/postgresql/*/main/postgresql.conf
shared_buffers = 256MB
effective_cache_size = 2GB
work_mem = 4MB
```
## Data Integrity Issues
### Corruption Detection and Recovery
**Symptoms**: Table corruption errors, data inconsistencies
**Diagnosis**:
```sql
-- MySQL
CHECK TABLE table_name;
mysqlcheck -u root -p --all-databases
-- PostgreSQL
-- Check for corruption in logs
tail -f /var/log/postgresql/postgresql-*.log
```
**Solutions**:
```sql
-- MySQL table repair
REPAIR TABLE table_name;
mysqlcheck -u root -p --auto-repair database_name
-- PostgreSQL consistency check
-- Run VACUUM and REINDEX
VACUUM FULL table_name;
REINDEX TABLE table_name;
```
## Backup and Recovery Issues
### Backup Failures
**Symptoms**: Backup scripts failing, incomplete backups
**Diagnosis**:
```bash
# Check backup script logs
tail -f /var/log/backup.log
# Test backup commands manually
mysqldump -u root -p database_name > test_backup.sql
pg_dump -U postgres database_name > test_backup.sql
# Check disk space
df -h /backup/location/
```
**Solutions**:
```bash
# Fix backup script permissions
chmod +x /path/to/backup-script.sh
chown backup-user:backup-group /backup/location/
# Automated backup script example
#!/bin/bash
BACKUP_DIR="/backups/mysql"
DATE=$(date +%Y%m%d_%H%M%S)
mysqldump -u root -p$MYSQL_PASSWORD --all-databases > \
"$BACKUP_DIR/full_backup_$DATE.sql"
# Compress and rotate backups
gzip "$BACKUP_DIR/full_backup_$DATE.sql"
find "$BACKUP_DIR" -name "*.gz" -mtime +7 -delete
```
## Authentication and Security Issues
### Access Denied Errors
**Symptoms**: Authentication failures, permission errors
**Diagnosis**:
```sql
-- MySQL
SELECT user, host FROM mysql.user;
SHOW GRANTS FOR 'username'@'host';
-- PostgreSQL
\du -- List users
\l -- List databases
```
**Solutions**:
```sql
-- MySQL user management
CREATE USER 'newuser'@'localhost' IDENTIFIED BY 'password';
GRANT ALL PRIVILEGES ON database.* TO 'newuser'@'localhost';
FLUSH PRIVILEGES;
-- PostgreSQL user management
CREATE USER newuser WITH PASSWORD 'password';
GRANT ALL PRIVILEGES ON DATABASE database_name TO newuser;
```
## Replication Issues
### Master-Slave Replication Problems
**Symptoms**: Replication lag, sync errors, slave disconnection
**Diagnosis**:
```sql
-- MySQL Master
SHOW MASTER STATUS;
-- MySQL Slave
SHOW SLAVE STATUS\G
-- Check replication lag: inspect the Seconds_Behind_Master field
-- in the SHOW SLAVE STATUS\G output above
```
**Solutions**:
```sql
-- Reset replication
STOP SLAVE;
RESET SLAVE;
CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4;
START SLAVE;
-- Fix replication errors
SET GLOBAL sql_slave_skip_counter = 1;
START SLAVE;
```
## Storage and Disk Issues
### Disk Space Problems
**Symptoms**: Out of disk space errors, database growth
**Diagnosis**:
```bash
# Check database sizes
du -sh /var/lib/mysql/*
du -sh /var/lib/postgresql/*/main/*
# Find large tables (run the following query inside the MySQL client)
SELECT table_schema, table_name,
ROUND((data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)'
FROM information_schema.tables
ORDER BY (data_length + index_length) DESC;
```
**Solutions**:
```sql
-- Clean up large tables
DELETE FROM log_table WHERE created_date < DATE_SUB(NOW(), INTERVAL 30 DAY);
OPTIMIZE TABLE log_table;
-- Enable log rotation
-- For MySQL binary logs
SET GLOBAL expire_logs_days = 7;
PURGE BINARY LOGS BEFORE DATE(NOW() - INTERVAL 7 DAY);
```
## Emergency Recovery
### Database Won't Start
**Recovery Steps**:
```bash
# Check error logs
tail -f /var/log/mysql/error.log
tail -f /var/log/postgresql/postgresql-*.log
# Try safe mode start
sudo mysqld_safe --skip-grant-tables &
# Recovery from backup
mysql -u root -p < backup_file.sql
psql -U postgres database_name < backup_file.sql
```
### Complete Data Loss Recovery
**Recovery Procedure**:
```bash
# Stop database service
sudo systemctl stop mysql
# Restore from backup
cd /var/lib/mysql
sudo rm -rf *
sudo tar -xzf /backups/mysql_full_backup.tar.gz
# Fix permissions
sudo chown -R mysql:mysql /var/lib/mysql
sudo chmod 755 /var/lib/mysql
# Start database
sudo systemctl start mysql
```
## Monitoring and Prevention
### Database Health Monitoring
```bash
#!/bin/bash
# db-health-check.sh
# Check if database is responding
if ! mysqladmin -u root -p$MYSQL_PASSWORD ping >/dev/null 2>&1; then
echo "ALERT: MySQL not responding" | send_alert
fi
# Check disk space
DISK_USAGE=$(df /var/lib/mysql | awk 'NR==2 {print $5}' | sed 's/%//')
if [ $DISK_USAGE -gt 80 ]; then
echo "ALERT: Database disk usage at ${DISK_USAGE}%" | send_alert
fi
# Check for long-running queries
LONG_QUERIES=$(mysql -u root -p$MYSQL_PASSWORD -e "SHOW PROCESSLIST" | grep -c "Query.*[0-9][0-9][0-9]")
if [ $LONG_QUERIES -gt 5 ]; then
echo "ALERT: $LONG_QUERIES long-running queries detected" | send_alert
fi
```
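The health-check script above pipes its messages into a `send_alert` helper that is not defined in this guide; a minimal sketch, assuming alerts go to a Discord webhook whose URL is stored in the `DISCORD_WEBHOOK_URL` environment variable, could look like this:
```bash
#!/bin/bash
# Minimal send_alert helper: reads the alert text from stdin and posts it
# to a Discord webhook. DISCORD_WEBHOOK_URL is an assumed environment variable.
send_alert() {
    local message
    message=$(cat)  # consume the piped alert text
    curl -s -X POST -H "Content-Type: application/json" \
        -d "{\"content\": \"$(echo "$message" | sed 's/"/\\"/g')\"}" \
        "$DISCORD_WEBHOOK_URL" >/dev/null
}
```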
### Automated Maintenance
```bash
# Daily maintenance script
#!/bin/bash
# Optimize tables
mysqlcheck -u root -p$MYSQL_PASSWORD --auto-repair --optimize --all-databases
# Update table statistics
mysql -u root -p$MYSQL_PASSWORD -e "FLUSH TABLES; ANALYZE TABLE table_name;"
# Backup rotation
find /backups -name "*.sql.gz" -mtime +30 -delete
```
This troubleshooting guide provides systematic approaches to resolving common database issues in home lab environments.

docker/CONTEXT.md (new file, 331 lines)

@@ -0,0 +1,331 @@
# Docker Container Technology - Technology Context
## Overview
Docker containerization for home lab environments with focus on performance optimization, GPU acceleration, and distributed workloads. This context covers container architecture patterns, security practices, and production deployment strategies.
## Architecture Patterns
### Container Design Principles
1. **Single Responsibility**: One service per container
2. **Immutable Infrastructure**: Treat containers as replaceable units
3. **Resource Isolation**: Use container limits and cgroups
4. **Security First**: Run as non-root, minimal attack surface
5. **Configuration Management**: Environment variables and external configs
### Multi-Stage Build Pattern
**Purpose**: Minimize production image size and attack surface
```dockerfile
# Build stage
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
# Production stage
FROM node:18-alpine AS production
WORKDIR /app
COPY --from=builder /app/node_modules ./node_modules
COPY . .
USER 1000
EXPOSE 3000
CMD ["node", "server.js"]
```
### Distributed Application Architecture
**Pattern**: Server-Node separation with specialized workloads
```
┌─────────────────┐ ┌──────────────────────────────────┐
│ Control Plane │ │ Worker Nodes │
│ │ │ ┌─────────┐ ┌─────────┐ │
│ - Web Interface│◄──►│ │ Node 1 │ │ Node 2 │ ... │
│ - Job Queue │ │ │ GPU+CPU │ │ GPU+CPU │ │
│ - Coordination │ │ │Local SSD│ │Local SSD│ │
│ │ │ └─────────┘ └─────────┘ │
└─────────────────┘ └──────────────────────────────────┘
│ │
└──────── Shared Storage ──────┘
(NAS/SAN for persistence)
```
## Container Runtime Platforms
### Docker vs Podman Comparison
**Docker**: Traditional daemon-based approach
- Requires Docker daemon running as root
- Centralized container management
- Established ecosystem and tooling
**Podman** (Recommended for GPU workloads):
- Daemonless architecture
- Better GPU integration with NVIDIA
- Rootless containers for enhanced security
- Direct systemd integration (see the example below)
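As an example of the systemd integration mentioned above, Podman can emit unit files for an existing container; a sketch, with `myapp` as a placeholder container name:
```bash
# Generate a systemd unit for an existing container and run it as a user service
podman generate systemd --new --files --name myapp
mkdir -p ~/.config/systemd/user
mv container-myapp.service ~/.config/systemd/user/
systemctl --user daemon-reload
systemctl --user enable --now container-myapp.service
```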
### GPU Acceleration Support
**NVIDIA Container Toolkit Integration**:
```bash
# Podman GPU configuration (recommended)
podman run -d --name gpu-workload \
--device nvidia.com/gpu=all \
-e NVIDIA_DRIVER_CAPABILITIES=all \
-e NVIDIA_VISIBLE_DEVICES=all \
myapp:latest
# Docker GPU configuration
docker run -d --name gpu-workload \
--gpus all \
-e NVIDIA_DRIVER_CAPABILITIES=all \
myapp:latest
```
## Performance Optimization Patterns
### Hybrid Storage Strategy
**Pattern**: Balance performance and persistence for different data types
```yaml
volumes:
# Local storage (SSD/NVMe) - High Performance
- ./app/data:/app/data # Database - frequent I/O
- ./app/configs:/app/configs # Config - startup performance
- ./app/logs:/app/logs # Logs - continuous writing
- ./cache:/cache # Work directories - temp processing
# Network storage (NAS) - Persistence & Backup
- /mnt/nas/backups:/app/backups # Backups - infrequent access
- /mnt/nas/media:/media:ro # Source data - read-only
```
**Benefits**:
- **Local Operations**: 100x faster database performance vs network
- **Network Reliability**: Critical data protected on redundant storage
- **Cost Optimization**: Expensive fast storage only where needed
### Cache Optimization Hierarchy
```bash
# Performance tiers for different workload types
/dev/shm/cache/ # RAM disk - fastest, volatile, limited size
/mnt/nvme/cache/ # NVMe SSD - 3-7GB/s, persistent, recommended
/mnt/ssd/cache/ # SATA SSD - 500MB/s, good balance
/mnt/nas/cache/ # Network - 100MB/s, legacy compatibility
```
### Resource Management
**Container Limits** (prevent resource exhaustion):
```yaml
deploy:
resources:
limits:
memory: 8G
cpus: '6'
reservations:
memory: 4G
cpus: '2'
```
**Networking Optimization**:
```yaml
# Host networking for performance-critical applications
network_mode: host
# Bridge networking with port mapping (default)
network_mode: bridge
ports:
- "8080:8080"
```
## Security Patterns
### Container Hardening
```dockerfile
# Use minimal base images
FROM alpine:3.18
# Run as non-root user
RUN addgroup -g 1000 appuser && \
adduser -u 1000 -G appuser -s /bin/sh -D appuser
USER 1000
# Set secure permissions
COPY --chown=appuser:appuser . /app
```
### Environment Security
```bash
# Secrets management (avoid environment variables for secrets)
podman secret create db_password password.txt
podman run --secret db_password myapp:latest
# Network isolation
podman network create --driver bridge isolated-net
podman run --network isolated-net myapp:latest
```
### Image Security
1. **Vulnerability Scanning**: Regular image scans with tools like Trivy (see the sketch below)
2. **Version Pinning**: Use specific tags, avoid `latest`
3. **Minimal Images**: Distroless or Alpine base images
4. **Layer Optimization**: Minimize layers, combine RUN commands
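A hedged example of the first two practices, assuming the Trivy CLI is installed and `myapp` is a placeholder image name:
```bash
# Scan a specific, pinned tag rather than a moving "latest" tag
trivy image --severity HIGH,CRITICAL myapp:1.4.2

# Pin base images in the Dockerfile the same way, e.g.
#   FROM node:18.19-alpine   (instead of FROM node:latest)
```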
## Development Workflows
### Local Development Pattern
```yaml
# docker-compose.dev.yml
version: "3.8"
services:
app:
build: .
volumes:
- .:/app # Code hot-reload
- /app/node_modules # Preserve dependencies
environment:
- NODE_ENV=development
ports:
- "3000:3000"
```
### Production Deployment Pattern
```bash
# Production container with health checks
podman run -d --name production-app \
--restart unless-stopped \
--health-cmd="curl -f http://localhost:3000/health || exit 1" \
--health-interval=30s \
--health-timeout=10s \
--health-retries=3 \
-p 3000:3000 \
myapp:v1.2.3
```
## Monitoring and Observability
### Health Check Implementation
```dockerfile
# Application health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD curl -f http://localhost:3000/health || exit 1
```
### Log Management
```bash
# Structured logging with log rotation
podman run -d --name app \
--log-driver journald \
--log-opt max-size=10m \
--log-opt max-file=3 \
myapp:latest
# Centralized logging
podman logs -f app | logger -t myapp
```
### Resource Monitoring
```bash
# Real-time container metrics
podman stats --no-stream app
# Historical resource usage (cgroup v1 path; on cgroup v2 hosts read /sys/fs/cgroup/memory.current instead)
podman exec app cat /sys/fs/cgroup/memory/memory.usage_in_bytes
```
## Common Implementation Patterns
### Database Containers
```yaml
# Persistent database with backup strategy
services:
postgres:
image: postgres:15-alpine
environment:
POSTGRES_DB: myapp
POSTGRES_USER: appuser
POSTGRES_PASSWORD_FILE: /run/secrets/db_password
volumes:
- postgres_data:/var/lib/postgresql/data # Persistent data
- ./backups:/backups # Backup mount
secrets:
- db_password
```
### Web Application Containers
```yaml
# Multi-tier web application
services:
frontend:
image: nginx:alpine
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf:ro
ports:
- "80:80"
- "443:443"
depends_on:
- backend
backend:
build: ./api
environment:
- DATABASE_URL=postgresql://appuser@postgres/myapp
depends_on:
- postgres
```
### GPU-Accelerated Workloads
```bash
# GPU transcoding/processing container
podman run -d --name gpu-processor \
--device nvidia.com/gpu=all \
-e NVIDIA_DRIVER_CAPABILITIES=compute,video \
-v "/fast-storage:/cache" \
-v "/media:/input:ro" \
-v "/output:/output" \
gpu-app:latest
```
## Best Practices
### Production Deployment
1. **Use specific image tags**: Never use `latest` in production
2. **Implement health checks**: Application and infrastructure monitoring
3. **Resource limits**: Prevent resource exhaustion
4. **Backup strategy**: Regular backups of persistent data
5. **Security scanning**: Regular vulnerability assessments
### Development Guidelines
1. **Multi-stage builds**: Separate build and runtime environments
2. **Environment parity**: Keep dev/staging/prod similar
3. **Configuration externalization**: Use environment variables and secrets
4. **Dependency management**: Pin versions, use lock files
5. **Testing strategy**: Unit, integration, and container tests
### Operational Excellence
1. **Log aggregation**: Centralized logging strategy
2. **Metrics collection**: Application and infrastructure metrics
3. **Alerting**: Proactive monitoring and alerting
4. **Documentation**: Container documentation and runbooks
5. **Disaster recovery**: Backup and recovery procedures
## Migration Patterns
### Legacy Application Containerization
1. **Assessment**: Identify dependencies and requirements
2. **Dockerfile creation**: Start with appropriate base image (see the sketch after this list)
3. **Configuration externalization**: Move configs to environment variables
4. **Data persistence**: Identify and volume mount data directories
5. **Testing**: Validate functionality in containerized environment
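A minimal Dockerfile sketch of these steps; the base image, port, and paths are illustrative assumptions rather than a real service:
```dockerfile
# Step 2: start from a base image that matches the legacy runtime
FROM python:3.11-slim
WORKDIR /app
# Step 1: install the identified dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Step 3: configuration comes from environment variables, not baked-in files
ENV APP_CONFIG=/app/config/settings.ini \
    APP_PORT=8000
# Step 4: keep mutable data on a volume instead of an image layer
VOLUME ["/app/data"]
EXPOSE 8000
CMD ["python", "main.py"]
```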
### Platform Migration (Docker to Podman)
```bash
# Export Docker container configuration
docker inspect mycontainer > container-config.json
# Convert to Podman run command
podman run -d --name mycontainer \
--memory 4g \
--cpus 2 \
-v /host/path:/container/path \
myimage:tag
```
This technology context provides comprehensive guidance for implementing Docker containerization strategies in home lab and production environments.

Docker iptables troubleshooting session notes (new file, 262 lines)

@@ -0,0 +1,262 @@
# Docker iptables/nftables Backend Troubleshooting Session
## Session Context
- **Date**: August 8, 2025
- **System**: Nobara PC (Fedora-based gaming distro)
- **User**: cal
- **Working Directory**: `/mnt/NV2/Development/claude-home`
- **Goal**: Get Docker working to run Tdarr Node container
## System Information
```bash
# OS Details
uname -a
# Linux nobara-pc 6.15.5-200.nobara.fc42.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Jul 6 11:56:20 UTC 2025 x86_64 GNU/Linux
# Hardware
# AMD Ryzen 7 7800X3D 8-Core Processor
# 62GB RAM
# NVIDIA GeForce RTX 4080 SUPER
# Distribution
# Nobara (Fedora 42-based)
```
## Problem Summary
Docker daemon fails to start with persistent error:
```
failed to start daemon: Error initializing network controller: error obtaining controller instance: failed to register "bridge" driver: failed to create NAT chain DOCKER: COMMAND_FAILED: INVALID_IPV: 'ipv4' is not a valid backend or is unavailable
```
## Root Cause Analysis
### Initial Discovery
1. **Missing iptables**: Docker couldn't find `iptables` command in PATH
2. **Backend conflict**: System using nftables but Docker expects iptables-legacy
3. **Package inconsistency**: `iptables-nft` package installed but binary missing initially
### Key Findings
- `dnf list installed | grep -i iptables` initially returned nothing
- `firewalld` and `nftables` services were both inactive
- `iptables-nft` package was installed but `/usr/bin/iptables` didn't exist
- After reinstall, iptables worked but used nftables backend
- NAT table incompatible: `iptables v1.8.11 (nf_tables): table 'nat' is incompatible, use 'nft' tool.`
## Troubleshooting Steps Performed
### Step 1: Package Investigation
```bash
# Check installed iptables packages
dnf list installed | grep -i iptables
# Result: No matching packages (surprising!)
# Check service status
systemctl status nftables # inactive (dead)
firewall-cmd --get-backend-type # firewalld not running
# Check if iptables binary exists
which iptables # not found
/usr/bin/iptables --version # No such file or directory
```
### Step 2: Package Reinstallation
```bash
# Reinstall iptables-nft package
sudo dnf reinstall -y iptables-nft
# Verify installation
rpm -ql iptables-nft | grep bin
# Shows /usr/bin/iptables should exist
# Test after reinstall
iptables --version
# Result: iptables v1.8.11 (nf_tables) - SUCCESS!
```
### Step 3: Backend Compatibility Testing
```bash
# Test NAT table access
sudo iptables -t nat -L
# Error: iptables v1.8.11 (nf_tables): table `nat' is incompatible, use 'nft' tool.
```
### Step 4: Legacy Backend Installation
```bash
# Install iptables-legacy
sudo dnf install -y iptables-legacy iptables-legacy-libs
# Set up alternatives system
sudo alternatives --install /usr/bin/iptables iptables /usr/bin/iptables-legacy 10
sudo alternatives --install /usr/bin/ip6tables ip6tables /usr/bin/ip6tables-legacy 10
# Test NAT table with legacy backend
sudo iptables -t nat -L
# SUCCESS: Shows empty NAT chains
```
### Step 5: Docker Restart Attempts
```bash
# Remove NVIDIA daemon.json config (potential conflict)
sudo rm -f /etc/docker/daemon.json
# Load NAT kernel module explicitly
sudo modprobe iptable_nat
# Try starting firewalld (in case Docker needs it)
sudo systemctl enable --now firewalld
# Multiple restart attempts
sudo systemctl start docker
# ALL FAILED with same NAT chain error
```
## Current State
- ✅ iptables-legacy installed and configured
- ✅ NAT table accessible via `iptables -t nat -L`
- ✅ All required kernel modules should be available
- ❌ Docker still fails with NAT chain creation error
- ❌ Same error persists despite backend switch
## Analysis of Persistent Issue
### Potential Causes
1. **Kernel State Contamination**: nftables rules/chains may still be active in kernel memory (see the check below)
2. **Module Loading Order**: iptables vs nftables modules loaded in conflicting order
3. **Docker Caching**: Docker may be caching the old backend detection
4. **Firewall Integration**: Docker + firewalld interaction on Fedora/Nobara
5. **System-Level Backend Selection**: Some system-wide iptables backend lock
### Evidence Supporting Kernel State Theory
- Error message is identical across all restart attempts
- iptables command works fine manually
- NAT table shows properly but Docker can't create chains
- Issue persists despite configuration changes
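One way to test the kernel-state theory (cause 1 above) is to inspect and, if necessary, clear any leftover nftables ruleset before the next Docker start. Flushing removes every loaded firewall rule, so this is only reasonable on a machine where the firewall will be rebuilt afterwards:
```bash
# Show any nftables chains/rules still loaded in the kernel
sudo nft list ruleset

# CAUTION: removes every nftables rule currently loaded
sudo nft flush ruleset

# Then retry the Docker daemon
sudo systemctl restart docker
```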
## Next Session Action Plan
### Immediate Steps After System Reboot
1. **Verify Backend Status**:
```bash
iptables --version # Should show legacy
sudo iptables -t nat -L # Should show clean NAT table
```
2. **Check Kernel Modules**:
```bash
lsmod | grep -E "(iptable|nf_|ip_tables)"
modinfo iptable_nat nf_tables | grep -E "^(name|filename)"  # modprobe -l no longer exists on modern kernels
```
3. **Test Docker Start**:
```bash
sudo systemctl start docker
docker --version
```
### If Issue Persists After Reboot
#### Alternative Approach 1: Docker Configuration Override
```bash
# Create daemon.json to disable iptables management
sudo mkdir -p /etc/docker
cat <<EOF | sudo tee /etc/docker/daemon.json
{
"iptables": false,
"bridge": "none"
}
EOF
sudo systemctl start docker
```
#### Alternative Approach 2: Podman as Docker Alternative
```bash
# Install podman as Docker drop-in replacement
sudo dnf install -y podman podman-docker
# Test with Tdarr container
podman run --rm ghcr.io/haveagitgat/tdarr_node:latest --help
```
#### Alternative Approach 3: Docker Desktop
```bash
# Consider Docker Desktop for Linux (handles networking differently)
# May bypass system iptables issues entirely
```
#### Alternative Approach 4: Deep System Cleanup
```bash
# Nuclear option: Remove all networking packages and reinstall
sudo dnf remove -y iptables* nftables firewalld
sudo dnf install -y iptables-legacy iptables-nft firewalld
sudo dnf reinstall -y docker-ce
```
### Diagnostic Commands for Next Session
```bash
# Full network state capture
ip addr show
ip route show
sudo iptables-save > /tmp/iptables-state.txt
sudo nft list ruleset > /tmp/nft-state.txt
# Docker troubleshooting
sudo dockerd --debug --log-level=debug > /tmp/docker-debug.log 2>&1 &
# Kill after 30 seconds and examine log
# System journal deep dive
journalctl -u docker.service --since="1 hour ago" -o verbose > /tmp/docker-journal.log
```
## Known Working Configuration Target
### Expected Working State
- **iptables**: Legacy backend active
- **Docker**: Running with NAT chain creation successful
- **Network**: Docker bridge network functional
- **Containers**: Can start and access network
### Tdarr Node Test Command
```bash
cd ~/docker/tdarr-node
# Update IP in compose file first:
# serverIP=<TDARR_SERVER_IP>
docker-compose -f tdarr-node-basic.yml up -d
```
## Related Documentation Created
- `/patterns/docker/gpu-acceleration.md` - GPU troubleshooting patterns
- `/reference/docker/nvidia-troubleshooting.md` - NVIDIA container toolkit
- `/examples/docker/tdarr-node-local/` - Working configurations
## System Context Notes
- This is a gaming-focused Nobara distribution
- May have different default networking than standard Fedora
- NVIDIA drivers already working (nvidia-smi functional)
- System has been used for other Docker containers successfully in past
- Recent NVIDIA container toolkit installation may have triggered the issue
## Success Criteria for Next Session
1. ✅ Docker service starts without errors
2. ✅ `docker ps` command works
3. ✅ Simple container can run: `docker run --rm hello-world`
4. ✅ Tdarr node container can start (even if can't connect to server yet)
5. ✅ Network connectivity from containers works
## Escalation Options
If standard troubleshooting fails:
1. **Nobara Community**: Check Nobara Discord/forums for similar issues
2. **Docker Desktop**: Use different Docker implementation
3. **Podman Migration**: Switch to podman as Docker replacement
4. **System Reinstall**: Fresh OS install (nuclear option)
5. **Container Alternatives**: LXC/systemd containers instead of Docker
## Files to Check Next Session
- `/etc/docker/daemon.json` - Docker configuration
- `/var/log/docker.log` - Docker service logs
- `~/.docker/config.json` - User Docker config
- `/proc/sys/net/ipv4/ip_forward` - IP forwarding enabled
- `/etc/systemd/system/docker.service.d/` - Service overrides
---
*End of troubleshooting session log*

docker/troubleshooting.md (new file, 466 lines)

@@ -0,0 +1,466 @@
# Docker Container Troubleshooting Guide
## Container Startup Issues
### Container Won't Start
**Check container logs first**:
```bash
# Docker
docker logs <container_name>
docker logs --tail 50 -f <container_name>
# Podman
podman logs <container_name>
podman logs --tail 50 -f <container_name>
```
### Common Startup Failures
#### Port Conflicts
**Symptoms**: `bind: address already in use` error
**Solution**:
```bash
# Find conflicting process
sudo netstat -tulpn | grep <port>
docker ps | grep <port>
# Change port mapping
docker run -p 8081:8080 myapp # Use different host port
```
#### Permission Errors
**Symptoms**: `permission denied` when accessing files/volumes
**Solutions**:
```bash
# Check file ownership
ls -la /host/volume/path
# Fix ownership (match container user)
sudo chown -R 1000:1000 /host/volume/path
# Use correct UID/GID in container
docker run -e PUID=1000 -e PGID=1000 myapp
```
#### Missing Environment Variables
**Symptoms**: Application fails with configuration errors
**Diagnostic**:
```bash
# Check container environment
docker exec -it <container> env
docker exec -it <container> printenv
# Verify required variables are set
docker inspect <container> | grep -A 20 "Env"
```
#### Resource Constraints
**Symptoms**: Container killed or OOM errors
**Solutions**:
```bash
# Check resource usage
docker stats <container>
# Increase memory limit
docker run -m 4g myapp
# Check system resources
free -h
df -h
```
### Debug Running Containers
```bash
# Access container shell
docker exec -it <container> /bin/bash
docker exec -it <container> /bin/sh # if bash not available
# Check container processes
docker exec <container> ps aux
# Check container filesystem
docker exec <container> ls -la /app
```
## Build Issues
### Build Failures
**Clear build cache when encountering issues**:
```bash
# Docker
docker system prune -a
docker builder prune
# Podman
podman system prune -a
podman image prune -a
```
### Verbose Build Output
```bash
# Docker
docker build --progress=plain --no-cache .
# Podman
podman build --layers=false .
```
### Common Build Problems
#### COPY/ADD Errors
**Issue**: Files not found during build
**Solutions**:
```dockerfile
# Check .dockerignore file
# Verify file paths relative to build context
COPY ./src /app/src # ✅ Correct
COPY /absolute/path /app # ❌ Wrong - no absolute paths
```
#### Package Installation Failures
**Issue**: apt/yum/dnf package installation fails
**Solutions**:
```dockerfile
# Update package lists first
RUN apt-get update && apt-get install -y package-name
# Combine RUN commands to reduce layers
RUN apt-get update && \
apt-get install -y package1 package2 && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
```
#### Network Issues During Build
**Issue**: Cannot reach package repositories
**Solutions**:
```bash
# Check DNS resolution
docker build --network host .
# Use custom DNS
docker build --dns 8.8.8.8 .
```
## GPU Container Issues
### NVIDIA GPU Support Problems
#### Docker Desktop vs Podman on Fedora/Nobara
**Issue**: Docker Desktop has GPU compatibility issues on Fedora-based systems
**Symptoms**:
- `CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected`
- `unknown or invalid runtime name: nvidia`
- Device nodes exist but CUDA fails to initialize
**Solution**: Use Podman instead of Docker on Fedora systems
```bash
# Verify host GPU works
nvidia-smi
# Test with Podman (recommended)
podman run --rm --device nvidia.com/gpu=all ubuntu:20.04 nvidia-smi
# Test with Docker (may fail on Fedora)
docker run --rm --gpus all ubuntu:20.04 nvidia-smi
```
#### GPU Container Configuration
**Working Podman GPU template**:
```bash
podman run -d --name gpu-container \
--device nvidia.com/gpu=all \
--restart unless-stopped \
-e NVIDIA_DRIVER_CAPABILITIES=all \
-e NVIDIA_VISIBLE_DEVICES=all \
myapp:latest
```
**Working Docker GPU template**:
```bash
docker run -d --name gpu-container \
--gpus all \
--restart unless-stopped \
-e NVIDIA_DRIVER_CAPABILITIES=all \
-e NVIDIA_VISIBLE_DEVICES=all \
myapp:latest
```
#### GPU Troubleshooting Steps
1. **Verify Host GPU Access**:
```bash
nvidia-smi # Should show GPU info
lsmod | grep nvidia # Should show nvidia modules
ls -la /dev/nvidia* # Should show device files
```
2. **Check NVIDIA Container Toolkit**:
```bash
rpm -qa | grep nvidia-container-toolkit # Fedora/RHEL
dpkg -l | grep nvidia-container-toolkit # Ubuntu/Debian
nvidia-ctk --version
```
3. **Test GPU in Container**:
```bash
# Should show GPU information
podman exec gpu-container nvidia-smi
# Query the driver/GPU details from inside the container
podman exec gpu-container nvidia-smi --query-gpu=name,driver_version --format=csv
```
#### Platform-Specific GPU Notes
**Fedora/Nobara/RHEL**:
- ✅ Podman: Works out-of-the-box with GPU support
- ❌ Docker Desktop: Known GPU integration issues
- Solution: Use Podman for GPU workloads
**Ubuntu/Debian**:
- ✅ Docker: Generally works well with proper NVIDIA toolkit setup
- ✅ Podman: Also works well
- Solution: Either runtime typically works
## Performance Issues
### Resource Monitoring
**Real-time resource usage**:
```bash
# Overall container stats
docker stats
podman stats
# Inside container analysis
docker exec <container> top
docker exec <container> free -h
docker exec <container> df -h
# Network usage
docker exec <container> netstat -i
```
### Image Size Optimization
**Analyze image layers**:
```bash
# Check image sizes
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}"
# Analyze layer history
docker history <image>
# Find large files in container
docker exec <container> du -sh /* | sort -hr
```
**Optimization strategies**:
```dockerfile
# Use multi-stage builds
FROM node:18 AS builder
# ... build steps ...
FROM node:18-alpine AS production
COPY --from=builder /app/dist /app
# Smaller final image
# Combine RUN commands
RUN apt-get update && \
apt-get install -y package && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Use .dockerignore
# .dockerignore
node_modules
.git
*.log
```
### Storage Performance Issues
**Slow volume performance**:
```bash
# Test volume I/O performance
docker exec <container> dd if=/dev/zero of=/volume/test bs=1M count=1000
# Check volume mount options
docker inspect <container> | grep -A 10 "Mounts"
# Consider using tmpfs for temporary data
docker run --tmpfs /tmp myapp
```
## Network Debugging
### Network Connectivity Issues
**Inspect network configuration**:
```bash
# List networks
docker network ls
podman network ls
# Inspect specific network
docker network inspect <network_name>
# Check container networking
docker exec <container> ip addr show
docker exec <container> ip route show
```
### Service Discovery Problems
**Test connectivity between containers**:
```bash
# Test by container name (same network)
docker exec container1 ping container2
# Test by IP address
docker exec container1 ping 172.17.0.3
# Check DNS resolution
docker exec container1 nslookup container2
```
### Port Binding Issues
**Verify port mappings**:
```bash
# Check exposed ports
docker port <container>
# Test external connectivity
curl localhost:8080
# Check if port is bound to all interfaces
netstat -tulpn | grep :8080
```
## Emergency Recovery
### Complete Container Reset
**Remove all containers and start fresh**:
```bash
# Stop all containers
docker stop $(docker ps -q)
podman stop --all
# Remove all containers
docker container prune -f
podman container prune -f
# Remove all images
docker image prune -a -f
podman image prune -a -f
# Remove all volumes (CAUTION: data loss)
docker volume prune -f
podman volume prune -f
# Complete system cleanup
docker system prune -a --volumes -f
podman system prune -a --volumes -f
```
### Container Recovery
**Recover from corrupted container**:
```bash
# Create backup of container data
docker cp <container>:/important/data ./backup/
# Export container filesystem
docker export <container> > container-backup.tar
# Import and restart
docker import container-backup.tar new-image:latest
docker run -d --name new-container new-image:latest
```
### Data Recovery
**Recover data from volumes**:
```bash
# List volumes
docker volume ls
# Inspect volume location
docker volume inspect <volume_name>
# Access volume data directly
sudo ls -la /var/lib/docker/volumes/<volume_name>/_data
# Mount volume to temporary container
docker run --rm -v <volume_name>:/data alpine ls -la /data
```
## Health Check Issues
### Container Health Checks
**Implement health checks**:
```dockerfile
# Dockerfile health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD curl -f http://localhost:3000/health || exit 1
```
**Debug health check failures**:
```bash
# Check health status
docker inspect <container> | grep -A 10 Health
# Manual health check test
docker exec <container> curl -f http://localhost:3000/health
# Check health check logs
docker events --filter container=<container>
```
## Log Analysis
### Log Management
**View and manage container logs**:
```bash
# View recent logs
docker logs --tail 100 <container>
# Follow logs in real-time
docker logs -f <container>
# Logs with timestamps
docker logs -t <container>
# Search logs for errors
docker logs <container> 2>&1 | grep ERROR
```
### Log Rotation Issues
**Configure log rotation to prevent disk filling**:
```bash
# Run with log size limits
docker run --log-opt max-size=10m --log-opt max-file=3 myapp
# Check log file sizes
sudo du -sh /var/lib/docker/containers/*/
```
## Platform-Specific Issues
### Fedora/Nobara/RHEL Systems
- **GPU Support**: Use Podman instead of Docker Desktop
- **SELinux**: May require container contexts (`-Z` flag; see the example below)
- **Firewall**: Configure firewalld for container networking
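For the SELinux case above, the usual fix is to let the container runtime relabel the volume content; a sketch with a placeholder path:
```bash
# :Z relabels the content for exclusive use by this container (use :z for volumes shared between containers)
podman run -d --name myapp -v /host/data:/data:Z myapp:latest

# For debugging only: disable labeling on a one-off container
podman run --rm --security-opt label=disable -v /host/data:/data alpine ls /data
```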
### Ubuntu/Debian Systems
- **AppArmor**: May restrict container operations
- **Snap Docker**: May have permission issues vs native package
### General Linux Issues
- **cgroups v2**: Some older containers need cgroups v1
- **User namespaces**: May cause UID/GID mapping issues
- **systemd**: Integration differences between Docker/Podman
## Prevention Best Practices
1. **Resource Limits**: Always set memory and CPU limits (combined with health checks and log rotation in the sketch below)
2. **Health Checks**: Implement application health monitoring
3. **Log Rotation**: Configure to prevent disk space issues
4. **Security Scanning**: Regular vulnerability scans
5. **Backup Strategy**: Regular data and configuration backups
6. **Testing**: Test containers in staging before production
7. **Documentation**: Document container configurations and dependencies
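Several of these practices can be combined on a single container; a minimal sketch using placeholder names and limits:
```bash
# Resource limits + health check + log rotation in one run command
docker run -d --name myapp \
  --memory 2g --cpus 1.5 \
  --restart unless-stopped \
  --health-cmd "curl -f http://localhost:8080/health || exit 1" \
  --health-interval 30s --health-retries 3 \
  --log-opt max-size=10m --log-opt max-file=3 \
  myapp:1.0.0
```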
This troubleshooting guide covers the most common Docker and Podman container issues encountered in home lab and production environments.

Scripts directory README (new file, 172 lines)

@@ -0,0 +1,172 @@
# Scripts Directory
This directory contains operational scripts and utilities for home lab management and automation.
## Directory Structure
```
scripts/
├── README.md # This documentation
├── tdarr_monitor.py # Enhanced Tdarr monitoring with Discord alerts
├── tdarr/ # Tdarr automation and scheduling
├── monitoring/ # System monitoring and alerting
└── <future>/ # Other organized automation subsystems
```
## Scripts Overview
### `tdarr_monitor.py` - Enhanced Tdarr Monitoring
**Description**: Comprehensive Tdarr monitoring script with stuck job detection and Discord notifications.
**Features**:
- 📊 Complete Tdarr system monitoring (server, nodes, queue, libraries)
- 🧠 Short-term memory for stuck job detection
- 🚨 Discord notifications with rich embeds
- 💾 Persistent state management
- ⚙️ Configurable thresholds and alerts
**Quick Start**:
```bash
# Basic monitoring
python3 scripts/tdarr_monitor.py --server http://10.10.0.43:8265 --check all
# Enable stuck job detection with 15-minute threshold
python3 scripts/tdarr_monitor.py --server http://10.10.0.43:8265 \
--check nodes --detect-stuck --stuck-threshold 15
# Full monitoring with Discord alerts (uses default webhook)
python3 scripts/tdarr_monitor.py --server http://10.10.0.43:8265 \
--check all --detect-stuck --discord-alerts
# Test Discord integration (uses default webhook)
python3 scripts/tdarr_monitor.py --server http://10.10.0.43:8265 --discord-test
```
**CLI Options**:
```
--server Tdarr server URL (required)
--check Type of check: all, status, queue, nodes, libraries, stats, health
--timeout Request timeout in seconds (default: 30)
--output Output format: json, pretty (default: pretty)
--verbose Enable verbose logging
--detect-stuck Enable stuck job detection
--stuck-threshold Minutes before job considered stuck (default: 30)
--memory-file Path to memory state file (default: .claude/tmp/tdarr_memory.pkl)
--clear-memory Clear memory state and exit
--discord-webhook Discord webhook URL for notifications (default: configured)
--discord-alerts Enable Discord alerts for stuck jobs
--discord-test Send test Discord message and exit
```
**Memory Management**:
- **Persistent State**: Worker snapshots saved to `.claude/tmp/tdarr_memory.pkl`
- **Automatic Cleanup**: Removes tracking for disappeared workers
- **Error Recovery**: Graceful handling of corrupted memory files
**Discord Features**:
- **Two Message Types**: Simple content messages and rich embeds
- **Stuck Job Alerts**: Detailed embed notifications with file info, progress, duration
- **System Status**: Health summaries with node details and color-coded status
- **Customizable**: Colors, fields, titles, descriptions fully configurable
- **Error Handling**: Graceful failures without breaking monitoring
**Integration Examples**:
*Cron Job for Regular Monitoring*:
```bash
# Check every 15 minutes, alert on stuck jobs over 30 minutes (crontab entries must be a single line)
*/15 * * * * cd /path/to/claude-home && python3 scripts/tdarr_monitor.py --server http://10.10.0.43:8265 --check nodes --detect-stuck --discord-alerts
```
*Systemd Service and Timer* (the `[Timer]` section must live in a separate `.timer` unit):
```ini
# /etc/systemd/system/tdarr-monitor.service
[Unit]
Description=Tdarr Monitor
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/bin/python3 /path/to/claude-home/scripts/tdarr_monitor.py \
    --server http://10.10.0.43:8265 --check all --detect-stuck --discord-alerts
WorkingDirectory=/path/to/claude-home
User=your-user

# /etc/systemd/system/tdarr-monitor.timer
[Unit]
Description=Run Tdarr Monitor every 15 minutes

[Timer]
OnCalendar=*:0/15
Persistent=true

[Install]
WantedBy=timers.target
```
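To activate the schedule, enable the timer rather than the service (unit names follow the sketch above; adjust if you name the files differently):
```bash
# Reload units, then enable and start the timer
sudo systemctl daemon-reload
sudo systemctl enable --now tdarr-monitor.timer

# Confirm the schedule and the last/next run
systemctl list-timers tdarr-monitor.timer
```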
**API Data Classes**:
The script uses strongly-typed dataclasses for all API responses:
- `ServerStatus` - Server health and version info
- `NodeStatus` - Node details with stuck job tracking
- `QueueStatus` - Transcoding queue statistics
- `LibraryStatus` - Library scan progress
- `StatisticsStatus` - Overall system statistics
- `HealthStatus` - Comprehensive health check results
**Error Handling**:
- Network timeouts and connection errors
- API endpoint failures
- JSON parsing errors
- Discord webhook failures
- Memory state corruption
- Missing dependencies
**Dependencies**:
- `requests` - HTTP client for API calls
- `pickle` - State serialization
- Everything else is from the standard library (`requests` is the only external dependency)
---
## Development Guidelines
### Adding New Scripts
1. **Location**: Place scripts in appropriate subdirectories by function
2. **Documentation**: Include comprehensive docstrings and usage examples
3. **Error Handling**: Implement robust error handling and logging
4. **Configuration**: Use CLI arguments and/or config files for flexibility
5. **Testing**: Include test functionality where applicable
### Naming Conventions
- Use descriptive names: `tdarr_monitor.py` not `monitor.py`
- Use underscores for Python scripts: `system_health.py`
- Use hyphens for shell scripts: `backup-system.sh`
### Directory Organization
Create subdirectories for related functionality:
```
scripts/
├── monitoring/ # System monitoring scripts
├── backup/ # Backup and restore utilities
├── network/ # Network management tools
├── containers/ # Docker/Podman management
└── maintenance/ # System maintenance tasks
```
---
## Future Enhancements
### Planned Features
- **Email Notifications**: SMTP integration for email alerts
- **Prometheus Metrics**: Export metrics for Grafana dashboards
- **Webhook Actions**: Trigger external actions on stuck jobs
- **Multi-Server Support**: Monitor multiple Tdarr instances
- **Configuration Files**: YAML/JSON config file support
### Contributing
1. Follow existing code style and patterns
2. Add comprehensive documentation
3. Include error handling and logging
4. Test thoroughly before committing
5. Update this README with new scripts

142
monitoring/CONTEXT.md Normal file
View File

@ -0,0 +1,142 @@
# System Monitoring and Alerting - Technology Context
## Overview
Comprehensive monitoring and alerting system for home lab infrastructure with focus on automated health checks, Discord notifications, and proactive system maintenance.
## Architecture Patterns
### Distributed Monitoring Strategy
**Pattern**: Service-specific monitoring with centralized alerting
- **Tdarr Monitoring**: API-based transcoding health checks
- **Windows Desktop Monitoring**: Reboot detection and system events
- **Network Monitoring**: Connectivity and service availability
- **Container Monitoring**: Docker/Podman health and resource usage
### Alert Management
**Pattern**: Structured notifications with actionable information
```bash
# Discord webhook integration
curl -X POST "$DISCORD_WEBHOOK" \
-H "Content-Type: application/json" \
-d '{
"content": "**System Alert**\n```\nService: Tdarr\nIssue: Staging timeout\nAction: Automatic cleanup performed\n```\n<@user_id>"
}'
```
## Core Monitoring Components
### Tdarr System Monitoring
**Purpose**: Monitor transcoding pipeline health and performance
**Location**: `scripts/tdarr_monitor.py`
**Key Features**:
- API-based status monitoring with dataclass structures
- Staging section timeout detection and cleanup
- Discord notifications with professional formatting
- Log rotation and retention management
### Windows Desktop Monitoring
**Purpose**: Track Windows system reboots and power events
**Location**: `scripts/windows-desktop/`
**Components**:
- PowerShell monitoring script
- Scheduled task automation
- Discord notification integration
- System event correlation
### Network and Service Monitoring
**Purpose**: Monitor critical infrastructure availability
**Implementation**:
```bash
# Service health check pattern
SERVICES="https://homelab.local http://nas.homelab.local"
for service in $SERVICES; do
if curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
echo "✅ $service: Available"
else
echo "❌ $service: Failed" | send_alert
fi
done
```
## Automation Patterns
### Cron-Based Scheduling
**Pattern**: Regular health checks with intelligent alerting
```bash
# Monitoring schedule examples
*/20 * * * * /path/to/tdarr-timeout-monitor.sh # Every 20 minutes
0 */6 * * * /path/to/cleanup-temp-dirs.sh # Every 6 hours
0 2 * * * /path/to/backup-monitor.sh # Daily at 2 AM
```
### Event-Driven Monitoring
**Pattern**: Reactive monitoring for critical events
- **System Startup**: Windows boot detection
- **Service Failures**: Container restart alerts
- **Resource Exhaustion**: Disk space warnings
- **Security Events**: Failed login attempts
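A hedged sketch of the reactive style, watching the journal for failed SSH logins and forwarding them as they happen (assumes `jq` is installed and a `DISCORD_WEBHOOK_URL` environment variable is set):
```bash
#!/bin/bash
# Follow the auth journal and forward failed SSH logins in real time
journalctl -u ssh -u sshd -f -o cat | grep --line-buffered "Failed password" | \
while read -r line; do
    # Build the JSON payload with jq so quoting in the log line cannot break it
    payload=$(jq -n --arg msg "🚨 SSH failure: $line" '{content: $msg}')
    curl -sS -X POST "$DISCORD_WEBHOOK_URL" \
        -H "Content-Type: application/json" \
        -d "$payload" >/dev/null
done
```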
## Data Collection and Analysis
### Log Management
**Pattern**: Centralized logging with rotation
```bash
# Log rotation configuration
LOG_FILE="/var/log/homelab-monitor.log"
MAX_SIZE_BYTES=$((10*1024*1024))   # 10M
RETENTION_DAYS=30
# Rotate the log when the size limit is exceeded
if [ "$(stat -c%s "$LOG_FILE")" -gt "$MAX_SIZE_BYTES" ]; then
    mv "$LOG_FILE" "$LOG_FILE.$(date +%Y%m%d)"
    touch "$LOG_FILE"
fi
# Drop rotated logs past the retention window
find "$(dirname "$LOG_FILE")" -name "$(basename "$LOG_FILE").*" -mtime +"$RETENTION_DAYS" -delete
```
### Metrics Collection
**Pattern**: Time-series data for trend analysis
- **System Metrics**: CPU, memory, disk usage
- **Service Metrics**: Response times, error rates
- **Application Metrics**: Transcoding progress, queue sizes
- **Network Metrics**: Bandwidth usage, latency
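A minimal sketch of flat-file time-series collection for trend analysis (the CSV path and fields are illustrative; heavier setups would push to Prometheus or InfluxDB instead):
```bash
#!/bin/bash
# Append one CSV row of basic system metrics per run (schedule via cron)
METRICS_FILE="/var/log/homelab-metrics.csv"

timestamp=$(date -Iseconds)
load=$(cut -d' ' -f1 /proc/loadavg)
mem_used_pct=$(free | awk '/Mem:/ {printf "%.1f", $3/$2*100}')
disk_used_pct=$(df --output=pcent / | tail -1 | tr -dc '0-9')

echo "$timestamp,$load,$mem_used_pct,$disk_used_pct" >> "$METRICS_FILE"
```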
## Alert Integration
### Discord Notification System
**Pattern**: Rich, actionable notifications
```markdown
# Professional alert format
**🔧 System Maintenance**
Service: Tdarr Transcoding
Issue: 3 files timed out in staging
Resolution: Automatic cleanup completed
Status: System operational
Manual review recommended <@user_id>
```
### Alert Escalation
**Pattern**: Tiered alerting based on severity
1. **Info**: Routine maintenance completed
2. **Warning**: Service degradation detected
3. **Critical**: Service failure requiring immediate attention
4. **Emergency**: System-wide failure requiring manual intervention
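A bash sketch of the tiers above (the webhook variable, `jq` dependency, and the mapping of severity to behaviour are assumptions; adapt them to your own escalation policy):
```bash
#!/bin/bash
# send_tiered_alert SEVERITY MESSAGE   — severity: info|warning|critical|emergency
send_tiered_alert() {
    local severity="$1" message="$2" mention=""
    local sev_upper
    sev_upper=$(echo "$severity" | tr '[:lower:]' '[:upper:]')

    # Only ping a human for the top two tiers to avoid alert fatigue
    case "$severity" in
        critical|emergency) mention=" <@user_id>" ;;
    esac

    local payload
    payload=$(jq -n --arg msg "**${sev_upper}**: ${message}${mention}" '{content: $msg}')
    curl -sS -X POST "$DISCORD_WEBHOOK_URL" \
        -H "Content-Type: application/json" -d "$payload" >/dev/null

    # Emergencies also land in a local log in case Discord is unreachable
    [ "$severity" = "emergency" ] && echo "$(date -Iseconds) $message" >> /var/log/critical-alerts.log
}

# Example: send_tiered_alert warning "Tdarr queue growing faster than it drains"
```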
## Best Practices Implementation
### Monitoring Strategy
1. **Proactive**: Monitor trends to predict issues
2. **Reactive**: Alert on current failures
3. **Preventive**: Automated cleanup and maintenance
4. **Comprehensive**: Cover all critical services
5. **Actionable**: Provide clear resolution paths
### Performance Optimization
1. **Efficient Polling**: Balance monitoring frequency with resource usage
2. **Smart Alerting**: Avoid alert fatigue with intelligent filtering
3. **Resource Management**: Monitor the monitoring system itself
4. **Scalable Architecture**: Design for growth and additional services
This technology context provides the foundation for implementing comprehensive monitoring and alerting in home lab environments.

View File

@ -0,0 +1,326 @@
# Cron Job Management Patterns
This document outlines the cron job patterns and management strategies used in the home lab environment.
## Current Cron Schedule
### Overview
```bash
# Monthly maintenance
0 2 1 * * /home/cal/bin/ssh_key_maintenance.sh
# Tdarr monitoring and management
*/10 * * * * python3 /mnt/NV2/Development/claude-home/scripts/tdarr_monitor.py --server http://10.10.0.43:8265 --check nodes --detect-stuck --discord-alerts >/dev/null 2>&1
0 */6 * * * find "/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/temp/" -name "tdarr-workDir2-*" -type d -mmin +360 -exec rm -rf {} \; 2>/dev/null || true
0 3 * * * find "/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/media" \( -name "*.temp" -o -name "*.tdarr" \) -mtime +1 -delete 2>/dev/null || true
# Disabled/legacy jobs
#*/20 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
```
## Job Categories
### 1. System Maintenance
**SSH Key Maintenance**
- **Schedule**: `0 2 1 * *` (Monthly, 1st at 2 AM)
- **Purpose**: Maintain SSH key security and rotation
- **Location**: `/home/cal/bin/ssh_key_maintenance.sh`
- **Priority**: High (security-critical)
### 2. Monitoring & Alerting
**Tdarr System Monitoring**
- **Schedule**: `*/10 * * * *` (Every 10 minutes)
- **Purpose**: Monitor Tdarr nodes, detect stuck jobs, send Discord alerts
- **Features**:
- Stuck job detection (30-minute threshold)
- Discord notifications with rich embeds
- Persistent memory state tracking
- **Script**: `/mnt/NV2/Development/claude-home/scripts/tdarr_monitor.py`
- **Output**: Silent (`>/dev/null 2>&1`)
### 3. Cleanup & Housekeeping
**Tdarr Work Directory Cleanup**
- **Schedule**: `0 */6 * * *` (Every 6 hours)
- **Purpose**: Remove stale Tdarr work directories
- **Target**: `/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/temp/`
- **Pattern**: `tdarr-workDir2-*` directories
- **Age threshold**: 6 hours (`-mmin +360`)
**Failed Tdarr Job Cleanup**
- **Schedule**: `0 3 * * *` (Daily at 3 AM)
- **Purpose**: Remove failed transcode artifacts
- **Target**: `/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/media/`
- **Patterns**: `*.temp` and `*.tdarr` files
- **Age threshold**: 24 hours (`-mtime +1`)
## Design Patterns
### 1. Absolute Paths
**Always use absolute paths in cron jobs**
```bash
# Good
*/10 * * * * python3 /full/path/to/script.py
# Bad - relative paths don't work in cron
*/10 * * * * python3 scripts/script.py
```
### 2. Error Handling
**Standard error suppression pattern**
```bash
command 2>/dev/null || true
```
- Suppresses stderr to prevent cron emails
- `|| true` ensures job always exits successfully
### 3. Time-based Cleanup
**Safe age thresholds for different content types**
- **Work directories**: 6 hours (short-lived, safe for active jobs)
- **Temp files**: 24 hours (allows for long transcodes)
- **Log files**: 7-30 days (depending on importance)
### 4. Resource-aware Scheduling
**Avoid resource conflicts**
```bash
# System maintenance at low-usage times
0 2 1 * * maintenance_script.sh
# Cleanup during off-peak hours
0 3 * * * cleanup_script.sh
# Monitoring with high frequency during active hours
*/10 * * * * monitor_script.py
```
## Management Workflow
### Adding New Cron Jobs
1. **Backup current crontab**
```bash
crontab -l > /tmp/crontab_backup_$(date +%Y%m%d)
```
2. **Edit safely**
```bash
crontab -l > /tmp/new_crontab
echo "# New job description" >> /tmp/new_crontab
echo "schedule command" >> /tmp/new_crontab
crontab /tmp/new_crontab
```
3. **Verify installation**
```bash
crontab -l
```
### Proper HERE Document (EOF) Usage
**When building cron files with HERE documents, use proper EOF formatting:**
#### ✅ **Correct Format**
```bash
cat > /tmp/new_crontab << 'EOF'
0 2 1 * * /home/cal/bin/ssh_key_maintenance.sh
# Tdarr monitoring every 10 minutes
*/10 * * * * python3 /path/to/script.py --args
EOF
```
#### ❌ **Common Mistakes**
```bash
# BAD - Causes "EOF not found" errors
cat >> /tmp/crontab << 'EOF'
new_cron_job
EOF
# Results in malformed file with literal "EOF < /dev/null" lines
```
#### **Key Rules for EOF in Cron Files**
1. **Use `cat >` not `cat >>`** for building complete files
```bash
# Good - overwrites file cleanly
cat > /tmp/crontab << 'EOF'
# Bad - appends and can create malformed files
cat >> /tmp/crontab << 'EOF'
```
2. **Quote the EOF delimiter** to prevent variable expansion
```bash
# Good - literal content
cat > file << 'EOF'
# Can cause issues with special characters
cat > file << EOF
```
3. **Clean up malformed files** before installing
```bash
# Remove EOF artifacts and empty lines
head -n -1 /tmp/crontab > /tmp/clean_crontab
# Or use grep to remove EOF lines
grep -v "^EOF" /tmp/crontab > /tmp/clean_crontab
```
4. **Alternative approach - direct echo method**
```bash
crontab -l > /tmp/current_crontab
echo "# New job comment" >> /tmp/current_crontab
echo "*/10 * * * * /path/to/command" >> /tmp/current_crontab
crontab /tmp/current_crontab
```
#### **Debugging EOF Issues**
```bash
# Check for EOF artifacts in crontab file
cat -n /tmp/crontab | grep EOF
# Validate crontab syntax before installing
crontab -T /tmp/crontab # Some systems support this
# Manual cleanup if needed
sed '/^EOF/d' /tmp/crontab > /tmp/clean_crontab
```
### Testing Cron Jobs
**Test command syntax first**
```bash
# Test the actual command before scheduling
python3 /full/path/to/script.py --test
# Check file permissions
ls -la /path/to/script
# Verify paths exist
ls -la /target/directory/
```
**Test with minimal frequency**
```bash
# Start with 5-minute intervals for testing
*/5 * * * * /path/to/new/script.sh
# Monitor logs
tail -f /var/log/syslog | grep CRON
```
### Monitoring Cron Jobs
**Check cron logs**
```bash
# System cron logs
sudo journalctl -u cron -f
# User cron logs
grep CRON /var/log/syslog | grep $(whoami)
```
**Verify job execution**
```bash
# Check if cleanup actually ran
ls -la /target/cleanup/directory/
# Monitor script logs
tail -f /path/to/script/logs/
```
## Security Considerations
### 1. Path Security
- Use absolute paths to prevent PATH manipulation
- Ensure scripts are owned by correct user
- Set appropriate permissions (750 for scripts)
### 2. Command Injection Prevention
```bash
# Good - quoted paths
find "/path/with spaces/" -name "pattern"
# Bad - unquoted paths break on spaces and allow unintended glob expansion
find /path/with spaces/ -name pattern
```
### 3. Resource Limits
- Prevent runaway processes with `timeout`
- Use `ionice` for I/O intensive cleanup jobs
- Consider `nice` for CPU-intensive tasks
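A hedged example of a single crontab entry that combines these safeguards (the paths and the 30-minute cap are illustrative):
```bash
# Low CPU/IO priority plus a hard 30-minute cap on a nightly cleanup job
0 3 * * * nice -n 19 ionice -c3 timeout 30m /full/path/to/cleanup_script.sh 2>/dev/null || true
```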
## Troubleshooting
### Common Issues
**Job not running**
1. Check cron service: `sudo systemctl status cron`
2. Verify crontab syntax: `crontab -l`
3. Check file permissions and paths
4. Review cron logs for errors
**Environment differences**
- Cron runs with minimal environment
- Set PATH explicitly if needed
- Use absolute paths for all commands
**Silent failures**
- Remove `2>/dev/null` temporarily for debugging
- Add logging to scripts
- Check script exit codes
### Debugging Commands
```bash
# Test cron environment
* * * * * env > /tmp/cron_env.txt
# Test script in cron-like environment
env -i /bin/bash -c 'your_command_here'
# Monitor real-time execution
sudo tail -f /var/log/syslog | grep CRON
```
## Best Practices
### 1. Documentation
- Comment all cron jobs with purpose and schedule
- Document in this patterns file
- Include contact info for complex jobs
### 2. Maintenance
- Regular review of active jobs (quarterly)
- Remove obsolete jobs promptly
- Update absolute paths when moving scripts
### 3. Monitoring
- Implement health checks for critical jobs
- Use Discord/email notifications for failures
- Monitor disk space usage from cleanup jobs
### 4. Backup Strategy
- Backup crontab before changes
- Version control cron configurations
- Document restoration procedures
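One way to version the crontab, assuming this repository is where it lives (the path mirrors the cron jobs above; adjust as needed):
```bash
# Snapshot the current crontab into the repo and commit only if it changed
cd /mnt/NV2/Development/claude-home
mkdir -p cron
crontab -l > "cron/$(hostname).crontab"
git add cron/ && git diff --cached --quiet || git commit -m "Update $(hostname) crontab"
```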
## Future Enhancements
### Planned Additions
- **Log rotation**: Automated cleanup of application logs
- **Health checks**: System resource monitoring
- **Backup verification**: Automated backup integrity checks
- **Certificate renewal**: SSL/TLS certificate automation
### Migration Considerations
- **Systemd timers**: Consider migration for complex scheduling
- **Configuration management**: Ansible or similar for multi-host
- **Centralized logging**: Aggregated cron job monitoring
---
## Related Documentation
- [Tdarr Monitoring Script](../scripts/README.md#tdarr_monitorpy---enhanced-tdarr-monitoring)
- [System Maintenance](../reference/system-maintenance.md)
- [Discord Integration](../examples/discord-notifications.md)

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,414 @@
# Monitoring System Troubleshooting Guide
## Discord Notification Issues
### Webhook Not Working
**Symptoms**: No Discord messages received, connection errors
**Diagnosis**:
```bash
# Test webhook manually
curl -X POST "$DISCORD_WEBHOOK_URL" \
-H "Content-Type: application/json" \
-d '{"content": "Test message"}'
# Check webhook URL format
echo $DISCORD_WEBHOOK_URL | grep -E "https://discord.com/api/webhooks/[0-9]+/.+"
```
**Solutions**:
```bash
# Verify webhook URL is correct
# Format: https://discord.com/api/webhooks/ID/TOKEN
# Test with minimal payload
curl -X POST "$DISCORD_WEBHOOK_URL" \
-H "Content-Type: application/json" \
-d '{"content": "✅ Webhook working"}'
# Check for JSON formatting issues
echo '{"content": "test"}' | jq . # Validate JSON
```
### Message Formatting Problems
**Symptoms**: Malformed messages, broken markdown, missing user pings
**Common Issues**:
```bash
# ❌ Broken JSON escaping
{"content": "Error: "quotes" break JSON"}
# ✅ Proper JSON escaping
{"content": "Error: \"quotes\" properly escaped"}
# ❌ User ping inside code block (doesn't work)
{"content": "```\nIssue occurred <@user_id>\n```"}
# ✅ User ping outside code block
{"content": "```\nIssue occurred\n```\nManual intervention needed <@user_id>"}
```
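To sidestep escaping problems entirely, it can help to let `jq` build the payload rather than hand-writing JSON (assumes `jq` is installed; the webhook variable follows the convention used above):
```bash
# Safely embed arbitrary text (quotes, newlines) in a Discord payload
error_text='Error: "quotes" and
multi-line output are fine here'

jq -n --arg msg "$error_text" '{content: ("```\n" + $msg + "\n```")}' | \
curl -sS -X POST "$DISCORD_WEBHOOK_URL" \
    -H "Content-Type: application/json" \
    --data @- >/dev/null
```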
## Tdarr Monitoring Issues
### Script Not Running
**Symptoms**: No monitoring alerts, script execution failures
**Diagnosis**:
```bash
# Check cron job status
crontab -l | grep tdarr-timeout-monitor
systemctl status cron
# Run script manually for debugging
bash -x /path/to/tdarr-timeout-monitor.sh
# Check script permissions
ls -la /path/to/tdarr-timeout-monitor.sh
```
**Solutions**:
```bash
# Fix script permissions
chmod +x /path/to/tdarr-timeout-monitor.sh
# Reinstall cron job
crontab -e
# Add: */20 * * * * /full/path/to/tdarr-timeout-monitor.sh
# Check script environment
# Ensure PATH and variables are set correctly in script
```
### API Connection Failures
**Symptoms**: Cannot connect to Tdarr server, timeout errors
**Diagnosis**:
```bash
# Test Tdarr API manually
curl -f "http://tdarr-server:8266/api/v2/status"
# Check network connectivity
ping tdarr-server
nc -zv tdarr-server 8266
# Verify SSH access to server
ssh tdarr "docker ps | grep tdarr"
```
**Solutions**:
```bash
# Update server connection in script
# Verify server IP and port are correct
# Test API endpoints
curl "http://10.10.0.43:8265/api/v2/status" # Web port
curl "http://10.10.0.43:8266/api/v2/status" # Server port
# Check Tdarr server logs
ssh tdarr "docker logs tdarr | tail -20"
```
## Windows Desktop Monitoring Issues
### PowerShell Script Not Running
**Symptoms**: No reboot notifications from Windows systems
**Diagnosis**:
```powershell
# Check scheduled task status
Get-ScheduledTask -TaskName "Reboot*" | Get-ScheduledTaskInfo
# Test script execution manually
PowerShell -ExecutionPolicy Bypass -File "C:\path\to\windows-reboot-monitor.ps1"
# Check PowerShell execution policy
Get-ExecutionPolicy
```
**Solutions**:
```powershell
# Set execution policy
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
# Recreate scheduled tasks
schtasks /Create /XML "C:\path\to\task.xml" /TN "RebootMonitor"
# Check task trigger configuration
Get-ScheduledTask -TaskName "RebootMonitor" | Get-ScheduledTaskTrigger
```
### Network Access from Windows
**Symptoms**: PowerShell cannot reach Discord webhook
**Diagnosis**:
```powershell
# Test network connectivity
Test-NetConnection discord.com -Port 443
# Test webhook manually
Invoke-RestMethod -Uri $webhookUrl -Method Post -Body '{"content":"test"}' -ContentType "application/json"
# Check Windows firewall
Get-NetFirewallRule | Where-Object {$_.DisplayName -like "*PowerShell*"}
```
**Solutions**:
```powershell
# Allow PowerShell through firewall
New-NetFirewallRule -DisplayName "PowerShell Outbound" -Direction Outbound -Program "C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe" -Action Allow
# Test with simplified request
$body = @{content="Test from Windows"} | ConvertTo-Json
Invoke-RestMethod -Uri $webhookUrl -Method Post -Body $body -ContentType "application/json"
```
## Log Management Issues
### Log Files Growing Too Large
**Symptoms**: Disk space filling up, slow log access
**Diagnosis**:
```bash
# Check log file sizes
du -sh /var/log/homelab-*
du -sh /tmp/*monitor*.log
# Check available disk space
df -h /var/log
df -h /tmp
```
**Solutions**:
```bash
# Implement log rotation
cat > /etc/logrotate.d/homelab-monitoring << 'EOF'
/var/log/homelab-*.log {
daily
missingok
rotate 7
compress
notifempty
create 644 root root
}
EOF
# Manual log cleanup
find /tmp -name "*monitor*.log" -size +10M -delete
truncate -s 0 /tmp/large-log-file.log
```
### Log Rotation Not Working
**Symptoms**: Old logs not being cleaned up
**Diagnosis**:
```bash
# Check logrotate status
systemctl status logrotate
cat /var/lib/logrotate/status
# Test logrotate configuration
logrotate -d /etc/logrotate.d/homelab-monitoring
```
**Solutions**:
```bash
# Force log rotation
logrotate -f /etc/logrotate.d/homelab-monitoring
# Fix logrotate configuration
sudo nano /etc/logrotate.d/homelab-monitoring
# Verify syntax and permissions
```
## Cron Job Issues
### Scheduled Tasks Not Running
**Symptoms**: Scripts not executing at scheduled times
**Diagnosis**:
```bash
# Check cron service
systemctl status cron
systemctl status crond # RHEL/CentOS
# View cron logs
grep CRON /var/log/syslog
journalctl -u cron
# List all cron jobs
crontab -l
sudo crontab -l # System crontab
```
**Solutions**:
```bash
# Restart cron service
sudo systemctl restart cron
# Fix cron job syntax
# Ensure absolute paths are used
# Example: */20 * * * * /full/path/to/script.sh
# Check script permissions and execution
ls -la /path/to/script.sh
/path/to/script.sh # Test manual execution
```
### Environment Variables in Cron
**Symptoms**: Scripts work manually but fail in cron
**Diagnosis**:
```bash
# Create test cron job to check environment
* * * * * env > /tmp/cron-env.txt
# Compare with shell environment
env > /tmp/shell-env.txt
diff /tmp/shell-env.txt /tmp/cron-env.txt
```
**Solutions**:
```bash
# Set PATH in crontab
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# Or set PATH in script
#!/bin/bash
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
# Source environment if needed
source /etc/environment
```
## Network Monitoring Issues
### False Positives
**Symptoms**: Alerts for services that are actually working
**Diagnosis**:
```bash
# Test monitoring checks manually
curl -sSf --max-time 10 "https://service.homelab.local"
ping -c1 -W5 10.10.0.100
# Check for intermittent network issues
for i in {1..10}; do ping -c1 host || echo "Fail $i"; done
```
**Solutions**:
```bash
# Adjust timeout values
curl --max-time 30 "$service" # Increase timeout
# Add retry logic
for retry in {1..3}; do
if curl -sSf "$service" >/dev/null 2>&1; then
break
elif [ $retry -eq 3 ]; then
send_alert "Service $service failed after 3 retries"
fi
sleep 5
done
```
### Missing Alerts
**Symptoms**: Real failures not triggering notifications
**Diagnosis**:
```bash
# Verify monitoring script logic
bash -x monitoring-script.sh
# Check if services are actually down
systemctl status service-name
curl -v service-url
```
**Solutions**:
```bash
# Lower detection thresholds
# Increase monitoring frequency
# Add redundant monitoring methods
# Test alert mechanism
echo "Test alert" | send_alert_function
```
## System Resource Issues
### Monitoring Overhead
**Symptoms**: High CPU/memory usage from monitoring scripts
**Diagnosis**:
```bash
# Monitor the monitoring scripts
top -p $(pgrep -f monitor)
ps aux | grep monitor
# Check monitoring frequency
crontab -l | grep monitor
```
**Solutions**:
```bash
# Reduce monitoring frequency
# Change from */1 to */5 minutes
# Optimize scripts
# Remove unnecessary commands
# Use efficient tools (prefer curl over wget, etc.)
# Add resource limits
timeout 30 monitoring-script.sh
```
## Emergency Recovery
### Complete Monitoring Failure
**Recovery Steps**:
```bash
# Restart all monitoring services
sudo systemctl restart cron
sudo systemctl restart rsyslog
# Reinstall monitoring scripts
cd /path/to/scripts
./install-monitoring.sh
# Test all components
./test-monitoring.sh
```
### Discord Integration Lost
**Quick Recovery**:
```bash
# Test webhook
curl -X POST "$BACKUP_WEBHOOK_URL" -H "Content-Type: application/json" -d '{"content": "Monitoring restored"}'
# Switch to backup webhook if needed
export DISCORD_WEBHOOK_URL="$BACKUP_WEBHOOK_URL"
```
## Prevention and Best Practices
### Monitoring Health Checks
```bash
#!/bin/bash
# monitor-the-monitors.sh
MONITORING_SCRIPTS="/path/to/tdarr-monitor.sh /path/to/network-monitor.sh"
for script in $MONITORING_SCRIPTS; do
if [ ! -x "$script" ]; then
echo "ALERT: $script not executable" | send_alert
fi
# Check if script has run recently
if [ $(($(date +%s) - $(stat -c %Y "$script.last_run" 2>/dev/null || echo 0))) -gt 3600 ]; then
echo "ALERT: $script hasn't run in over an hour" | send_alert
fi
done
```
### Backup Alerting Channels
```bash
# Multiple notification methods
send_alert() {
local message="$1"
# Primary: Discord
curl -X POST "$DISCORD_WEBHOOK" -d "{\"content\":\"$message\"}" || \
# Backup: Email
echo "$message" | mail -s "Homelab Alert" admin@domain.com || \
# Last resort: Local log
echo "$(date): $message" >> /var/log/critical-alerts.log
}
```
This troubleshooting guide covers the most common monitoring system issues and provides systematic recovery procedures.

309
networking/CONTEXT.md Normal file
View File

@ -0,0 +1,309 @@
# Networking Infrastructure - Technology Context
## Overview
Home lab networking infrastructure with focus on reverse proxy configuration, SSL/TLS management, SSH key management, and network security. This context covers service discovery, load balancing, and performance optimization patterns.
## Architecture Patterns
### Reverse Proxy and Load Balancing
**Pattern**: Centralized traffic management with SSL termination
```nginx
# Nginx reverse proxy pattern
upstream backend {
server 10.10.0.100:3000;
server 10.10.0.101:3000;
keepalive 32;
}
server {
listen 443 ssl http2;
server_name myapp.homelab.local;
ssl_certificate /etc/ssl/certs/homelab.crt;
ssl_certificate_key /etc/ssl/private/homelab.key;
location / {
proxy_pass http://backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
```
### Network Segmentation Strategy
**Pattern**: VLAN-based isolation with controlled inter-VLAN routing
```
Management VLAN: 10.10.0.x/24 # VM management, SSH access
Services VLAN: 10.10.1.x/24 # Application services
Storage VLAN: 10.10.2.x/24 # NAS, backup traffic
DMZ VLAN: 10.10.10.x/24 # External-facing services
```
## SSH Key Management
### Centralized Key Distribution
**Pattern**: Automated SSH key deployment with emergency backup
```bash
# Primary access key
~/.ssh/homelab_rsa # Daily operations key
# Emergency access key
~/.ssh/emergency_homelab_rsa # Backup recovery key
# Automated deployment
for host in $(cat hosts.txt); do
ssh-copy-id -i ~/.ssh/homelab_rsa.pub user@$host
ssh-copy-id -i ~/.ssh/emergency_homelab_rsa.pub user@$host
done
```
### Key Lifecycle Management
**Pattern**: Regular rotation with zero-downtime deployment
1. **Generation**: Create new key pairs annually
2. **Distribution**: Deploy to all managed systems
3. **Verification**: Test connectivity with new keys
4. **Rotation**: Remove old keys after verification
5. **Backup**: Store keys in secure, recoverable location
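A condensed sketch of one rotation cycle (key names and the `hosts.txt` list mirror the conventions above; verify step 3 on every host before running step 4):
```bash
# 1. Generate the replacement key
ssh-keygen -t rsa -b 4096 -C "homelab-$(date +%Y)" -f ~/.ssh/homelab_rsa_new

# 2. Distribute it alongside the existing key
for host in $(cat hosts.txt); do
    ssh-copy-id -i ~/.ssh/homelab_rsa_new.pub "user@$host"
done

# 3. Verify the new key works on every host before touching the old one
for host in $(cat hosts.txt); do
    ssh -i ~/.ssh/homelab_rsa_new -o ConnectTimeout=5 "user@$host" "echo $host: new key OK"
done

# 4. Only then: remove the old public key from each host's authorized_keys,
#    retire ~/.ssh/homelab_rsa locally, and rename the new pair into its place
```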
## Service Discovery and DNS
### Local DNS Resolution
**Pattern**: Internal DNS for service discovery
```bind
; Home lab DNS zones
homelab.local.           IN A 10.10.0.16   ; DNS server
proxmox.homelab.local.   IN A 10.10.0.10   ; Hypervisor
nas.homelab.local.       IN A 10.10.0.20   ; Storage
tdarr.homelab.local.     IN A 10.10.0.43   ; Media server
```
### Container Service Discovery
**Pattern**: Docker network-based service resolution
```yaml
# Docker Compose service discovery
version: "3.8"
services:
web:
networks:
- frontend
- backend
api:
networks:
- backend
- database
db:
networks:
- database
networks:
frontend:
driver: bridge
backend:
driver: bridge
database:
driver: bridge
internal: true # No external access
```
## Security Patterns
### SSH Security Hardening
**Configuration**: Secure SSH server setup
```sshd_config
# /etc/ssh/sshd_config.d/99-homelab-security.conf
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
Protocol 2
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
X11Forwarding no
```
### Network Access Control
**Pattern**: Firewall-based service protection
```bash
# ufw firewall rules
ufw default deny incoming
ufw default allow outgoing
ufw allow ssh
ufw allow from 10.10.0.0/24 to any port 22
ufw allow from 10.10.0.0/24 to any port 80
ufw allow from 10.10.0.0/24 to any port 443
```
### SSL/TLS Certificate Management
**Pattern**: Automated certificate lifecycle
```bash
# Let's Encrypt automation (the --nginx HTTP challenge covers single names;
# wildcard names such as *.homelab.local require a DNS-01 challenge plugin)
certbot certonly --nginx \
    --email admin@homelab.local \
    --agree-tos \
    --domains homelab.local
# Certificate renewal automation
0 2 * * * certbot renew --quiet && systemctl reload nginx
```
## Performance Optimization
### Connection Management
**Pattern**: Optimized connection handling
```nginx
# Nginx performance tuning
worker_processes auto;
worker_connections 1024;
keepalive_timeout 65;
keepalive_requests 1000;
gzip on;
gzip_vary on;
gzip_types text/plain text/css application/json application/javascript;
# Connection pooling
upstream backend {
server 10.10.0.100:3000 max_fails=3 fail_timeout=30s;
keepalive 32;
}
```
### Caching Strategies
**Pattern**: Multi-level caching architecture
```nginx
# Static content caching
location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
expires 1y;
add_header Cache-Control "public, immutable";
}
# Proxy caching
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=app_cache:10m;
proxy_cache app_cache;
proxy_cache_valid 200 302 10m;
```
## Network Storage Integration
### CIFS/SMB Mount Resilience
**Pattern**: Robust network filesystem mounting
```fstab
# One line in /etc/fstab (fstab does not support backslash line continuation)
//nas.homelab.local/media /mnt/media cifs credentials=/etc/cifs/credentials,uid=1000,gid=1000,file_mode=0644,dir_mode=0755,iocharset=utf8,cache=strict,actimeo=30,_netdev,reconnect,soft,rsize=1048576,wsize=1048576 0 0
```
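After editing `/etc/fstab`, a quick way to confirm the entry parses and mounts with the expected options:
```bash
# Mount everything from fstab and inspect the attached share
sudo mount -a
findmnt /mnt/media
```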
## Monitoring and Observability
### Network Health Monitoring
**Pattern**: Automated connectivity verification
```bash
#!/bin/bash
# network-health-check.sh
HOSTS="10.10.0.10 10.10.0.20 10.10.0.43"
DNS_SERVERS="10.10.0.16 8.8.8.8"
for host in $HOSTS; do
if ping -c1 -W5 $host >/dev/null 2>&1; then
echo "✅ $host: Reachable"
else
echo "❌ $host: Unreachable"
fi
done
for dns in $DNS_SERVERS; do
if nslookup google.com $dns >/dev/null 2>&1; then
echo "✅ DNS $dns: Working"
else
echo "❌ DNS $dns: Failed"
fi
done
```
### Service Availability Monitoring
**Pattern**: HTTP/HTTPS endpoint monitoring
```bash
# Service health check
SERVICES="https://homelab.local http://proxmox.homelab.local:8006"
for service in $SERVICES; do
if curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
echo "✅ $service: Available"
else
echo "❌ $service: Unavailable"
fi
done
```
## Common Integration Patterns
### Reverse Proxy with Docker
**Pattern**: Container service exposure
```nginx
# Dynamic service discovery with Docker
location /api/ {
proxy_pass http://api-container:3000/;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
location /web/ {
    proxy_pass http://web-container:8080/;
    proxy_http_version 1.1;                      # Required for WebSocket upgrades
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";       # WebSocket support
}
```
### VPN Integration
**Pattern**: Secure remote access
```openvpn
# OpenVPN server configuration
port 1194
proto udp
dev tun
ca ca.crt
cert server.crt
key server.key
dh dh.pem
server 10.8.0.0 255.255.255.0
push "route 10.10.0.0 255.255.0.0" # Home lab networks
keepalive 10 120
```
## Best Practices
### Security Implementation
1. **SSH Keys Only**: Disable password authentication everywhere
2. **Network Segmentation**: Use VLANs for isolation
3. **Certificate Management**: Automate SSL/TLS certificate lifecycle
4. **Access Control**: Implement least-privilege networking
5. **Monitoring**: Continuous network and service monitoring
### Performance Optimization
1. **Connection Pooling**: Reuse connections for efficiency
2. **Caching**: Implement multi-level caching strategies
3. **Compression**: Enable gzip for reduced bandwidth
4. **Keep-Alives**: Optimize connection persistence
5. **CDN Strategy**: Cache static content effectively
### Operational Excellence
1. **Documentation**: Maintain network topology documentation
2. **Automation**: Script routine network operations
3. **Backup**: Regular configuration backups
4. **Testing**: Regular connectivity and performance testing
5. **Change Management**: Controlled network configuration changes
This technology context provides comprehensive guidance for implementing robust networking infrastructure in home lab environments.

View File

@ -0,0 +1,99 @@
# Home Lab Security Improvements
## Current Security Issues
### Critical Issues Found:
- **Password Authentication**: All servers using password-based SSH authentication
- **Credential Reuse**: Same password used across 7 home network servers
- **Insecure Storage**: Passwords stored in FileZilla (base64 encoded, not encrypted)
- **Root Access**: Cloud servers using root user accounts
### Risk Assessment:
- **High**: Password-based authentication vulnerable to brute force attacks
- **High**: Shared passwords create single point of failure
- **Medium**: FileZilla credentials accessible to anyone with file system access
- **Medium**: Root access increases attack surface
## Implemented Solutions
### 1. SSH Key-Based Authentication
- **Generated separate key pairs** for home lab vs cloud servers
- **4096-bit RSA keys** for strong encryption
- **Descriptive key comments** for identification
### 2. SSH Configuration Management
- **Centralized config** in `~/.ssh/config`
- **Host aliases** for easy server access
- **Port forwarding** pre-configured for common services
- **Security defaults** (ServerAliveInterval, StrictHostKeyChecking)
### 3. Network Segmentation
- **Home network** (10.10.0.0/24) uses dedicated key
- **Cloud servers** use separate key pair
- **Service-specific aliases** for different server roles
## Additional Security Recommendations
### Immediate Actions:
1. **Deploy SSH keys** using the provided script
2. **Test key-based authentication** on all servers
3. **Disable password authentication** once keys work
4. **Remove FileZilla passwords** after migration
### Server Hardening:
```bash
# On each server, edit /etc/ssh/sshd_config:
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no # (create non-root user on cloud servers first)
Port 2222 # Change default SSH port
AllowUsers cal # Restrict SSH access
```
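When applying these changes, it is safer to validate the config and keep an open session while restarting; a rough sketch (the server IP is a placeholder, and the service name may be `ssh` on Debian-based systems):
```bash
# Validate syntax before restarting (a typo here can lock you out)
sudo sshd -t

# Restart while keeping your current session open as a fallback
sudo systemctl restart sshd

# From a second terminal, confirm key-only auth before closing the first
ssh -o PasswordAuthentication=no -o PubkeyAuthentication=yes cal@<server-ip> 'echo key auth OK'
```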
### Monitoring:
- **SSH login monitoring** with fail2ban
- **Key rotation schedule** (annually)
- **Access logging** review
### Future Enhancements:
- **Certificate-based authentication** (SSH CA)
- **Multi-factor authentication** (TOTP)
- **VPN access** for home network
- **Bastion host** for cloud servers
## Migration Plan
### Phase 1: Key Deployment ✅
- [x] Generate SSH key pairs
- [x] Create SSH configuration
- [x] Document server inventory
### Phase 2: Authentication Migration
- [ ] Deploy public keys to all servers
- [ ] Test SSH connections with keys
- [ ] Verify all services accessible
### Phase 3: Security Lockdown
- [ ] Disable password authentication
- [ ] Change default SSH ports
- [ ] Configure fail2ban
- [ ] Remove FileZilla credentials
### Phase 4: Monitoring & Maintenance
- [ ] Set up access logging
- [ ] Schedule key rotation
- [ ] Document incident response
## Connection Examples
After setup, you'll connect using simple aliases:
```bash
# Instead of: ssh cal@10.10.0.42
ssh database-apis
# Instead of: ssh root@172.237.147.99
ssh akamai
# With automatic port forwarding:
ssh pihole # Forwards port 8080 → localhost:80
```

View File

@ -0,0 +1,70 @@
---
# Home Lab Server Inventory
# Generated from FileZilla configuration
home_network:
subnet: "10.10.0.0/24"
servers:
database_apis:
hostname: "10.10.0.42"
port: 22
user: "cal"
services: ["database", "api"]
description: "Database and API services"
discord_bots:
hostname: "10.10.0.33"
port: 22
user: "cal"
services: ["discord", "bots"]
description: "Discord bot hosting"
home_docker:
hostname: "10.10.0.124"
port: 22
user: "cal"
services: ["docker", "containers"]
description: "Main Docker container host"
pihole:
hostname: "10.10.0.16"
port: 22
user: "cal"
services: ["dns", "adblock"]
description: "Pi-hole DNS and ad blocking"
sba_pd_bots:
hostname: "10.10.0.88"
port: 22
user: "cal"
services: ["bots", "automation"]
description: "SBa and PD bot services"
tdarr:
hostname: "10.10.0.43"
port: 22
user: "cal"
services: ["media", "transcoding"]
description: "Tdarr media transcoding"
vpn_docker:
hostname: "10.10.0.121"
port: 22
user: "cal"
services: ["vpn", "docker"]
description: "VPN and Docker services"
remote_servers:
akamai_nano:
hostname: "172.237.147.99"
port: 22
user: "root"
provider: "akamai"
description: "Akamai cloud nano instance"
vultr_host:
hostname: "45.76.25.231"
port: 22
user: "root"
provider: "vultr"
description: "Vultr cloud host"

View File

@ -0,0 +1,114 @@
#!/bin/bash
# SSH Key Maintenance and Backup Script
# Run this periodically to maintain key security
echo "🔧 SSH Key Maintenance and Backup"
# Check if NAS is mounted
if [ ! -d "/mnt/NV2" ]; then
echo "❌ ERROR: NAS not mounted at /mnt/NV2"
exit 1
fi
# Create timestamp
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
BACKUP_ROOT="/mnt/NV2/ssh-keys"
BACKUP_DIR="$BACKUP_ROOT/maintenance-$TIMESTAMP"
# Ensure backup directory structure
mkdir -p "$BACKUP_DIR"
chmod 700 "$BACKUP_DIR"
echo "📁 Creating maintenance backup in: $BACKUP_DIR"
# Backup current keys and config
cp ~/.ssh/*_rsa* "$BACKUP_DIR/" 2>/dev/null || true
cp ~/.ssh/config "$BACKUP_DIR/" 2>/dev/null || true
cp ~/.ssh/known_hosts "$BACKUP_DIR/" 2>/dev/null || true
# Check key ages and recommend rotation
echo ""
echo "🔍 Key Age Analysis:"
for key in ~/.ssh/*_rsa; do
if [ -f "$key" ]; then
age_days=$(( ($(date +%s) - $(stat -c %Y "$key")) / 86400 ))
basename_key=$(basename "$key")
if [ $age_days -gt 365 ]; then
echo "⚠️ $basename_key: $age_days days old - ROTATION RECOMMENDED"
elif [ $age_days -gt 180 ]; then
echo "$basename_key: $age_days days old - consider rotation"
else
echo "$basename_key: $age_days days old - OK"
fi
fi
done
# Test key accessibility
echo ""
echo "🔐 Testing Key Access:"
for key in ~/.ssh/*_rsa; do
if [ -f "$key" ]; then
basename_key=$(basename "$key")
if ssh-keygen -l -f "$key" >/dev/null 2>&1; then
echo "$basename_key: Valid and readable"
else
echo "$basename_key: CORRUPTED or unreadable"
fi
fi
done
# Clean up old backups (keep last 10)
echo ""
echo "🧹 Cleaning old backups (keeping last 10):"
cd "$BACKUP_ROOT"
ls -dt backup-* maintenance-* 2>/dev/null | tail -n +11 | while read old_backup; do
if [ -d "$old_backup" ]; then
echo "🗑️ Removing old backup: $old_backup"
rm -rf "$old_backup"
fi
done
# Generate maintenance report
cat > "$BACKUP_DIR/MAINTENANCE_REPORT.md" << EOF
# SSH Key Maintenance Report
Generated: $(date)
Host: $(hostname)
User: $(whoami)
## Backup Location
$BACKUP_DIR
## Key Inventory
$(ls -la ~/.ssh/*_rsa* 2>/dev/null || echo "No SSH keys found")
## SSH Config Status
$(if [ -f ~/.ssh/config ]; then echo "SSH config exists: ~/.ssh/config"; else echo "No SSH config found"; fi)
## Server Connection Tests
Run these commands to verify connectivity:
### Primary Keys:
ssh -o ConnectTimeout=5 database-apis 'echo "DB APIs: OK"'
ssh -o ConnectTimeout=5 pihole 'echo "PiHole: OK"'
ssh -o ConnectTimeout=5 akamai 'echo "Akamai: OK"'
### Emergency Keys (if deployed):
ssh -i ~/.ssh/emergency_homelab_rsa -o ConnectTimeout=5 cal@10.10.0.16 'echo "Emergency Home: OK"'
ssh -i ~/.ssh/emergency_cloud_rsa -o ConnectTimeout=5 root@172.237.147.99 'echo "Emergency Cloud: OK"'
## Next Maintenance Due
$(date -d '+3 months')
## Key Rotation Schedule
- Home lab keys: Annual (generated $(date -r ~/.ssh/homelab_rsa 2>/dev/null || echo "Not found"))
- Cloud keys: Annual (generated $(date -r ~/.ssh/cloud_servers_rsa 2>/dev/null || echo "Not found"))
- Emergency keys: Bi-annual
EOF
echo "✅ Maintenance backup completed"
echo "📄 Report saved: $BACKUP_DIR/MAINTENANCE_REPORT.md"
echo ""
echo "💡 Schedule this script to run monthly via cron:"
echo " 0 2 1 * * /path/to/ssh_key_maintenance.sh"

View File

@ -0,0 +1,496 @@
# Networking Infrastructure Troubleshooting Guide
## SSH Connection Issues
### SSH Authentication Failures
**Symptoms**: Permission denied, connection refused, timeout
**Diagnosis**:
```bash
# Verbose SSH debugging
ssh -vvv user@host
# Test different authentication methods
ssh -o PasswordAuthentication=no user@host
ssh -o PubkeyAuthentication=yes user@host
# Check local key files
ls -la ~/.ssh/
ssh-keygen -lf ~/.ssh/homelab_rsa.pub
```
**Solutions**:
```bash
# Re-deploy SSH keys
ssh-copy-id -i ~/.ssh/homelab_rsa.pub user@host
ssh-copy-id -i ~/.ssh/emergency_homelab_rsa.pub user@host
# Fix key permissions
chmod 600 ~/.ssh/homelab_rsa
chmod 644 ~/.ssh/homelab_rsa.pub
chmod 700 ~/.ssh
# Verify remote authorized_keys
ssh user@host 'chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'
```
### SSH Service Issues
**Symptoms**: Connection refused, service not running
**Diagnosis**:
```bash
# Check SSH service status
systemctl status sshd
ss -tlnp | grep :22
# Test port connectivity
nc -zv host 22
nmap -p 22 host
```
**Solutions**:
```bash
# Restart SSH service
sudo systemctl restart sshd
sudo systemctl enable sshd
# Check firewall
sudo ufw status
sudo ufw allow ssh
# Verify SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"
```
## Network Connectivity Problems
### Basic Network Troubleshooting
**Symptoms**: Cannot reach hosts, timeouts, routing issues
**Diagnosis**:
```bash
# Basic connectivity tests
ping host
traceroute host
mtr host
# Check local network configuration
ip addr show
ip route show
cat /etc/resolv.conf
```
**Solutions**:
```bash
# Restart networking
sudo systemctl restart networking
sudo netplan apply # Ubuntu
# Reset network interface
sudo ip link set eth0 down
sudo ip link set eth0 up
# Check default gateway
sudo ip route add default via 10.10.0.1
```
### DNS Resolution Issues
**Symptoms**: Cannot resolve hostnames, slow resolution
**Diagnosis**:
```bash
# Test DNS resolution
nslookup google.com
dig google.com
host google.com
# Check DNS servers
systemd-resolve --status
cat /etc/resolv.conf
```
**Solutions**:
```bash
# Temporary DNS fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
# Restart DNS services
sudo systemctl restart systemd-resolved
# Flush DNS cache
sudo systemd-resolve --flush-caches
```
## Reverse Proxy and Load Balancer Issues
### Nginx Configuration Problems
**Symptoms**: 502 Bad Gateway, 503 Service Unavailable, SSL errors
**Diagnosis**:
```bash
# Check Nginx status and logs
systemctl status nginx
sudo tail -f /var/log/nginx/error.log
sudo tail -f /var/log/nginx/access.log
# Test Nginx configuration
sudo nginx -t
sudo nginx -T # Show full configuration
```
**Solutions**:
```bash
# Reload Nginx configuration
sudo nginx -s reload
# Check upstream servers
curl -I http://backend-server:port
telnet backend-server port
# Fix common configuration issues
sudo nano /etc/nginx/sites-available/default
# Check proxy_pass URLs, upstream definitions
```
### SSL/TLS Certificate Issues
**Symptoms**: Certificate warnings, expired certificates, connection errors
**Diagnosis**:
```bash
# Check certificate validity
openssl s_client -connect host:443 -servername host
openssl x509 -in /etc/ssl/certs/cert.pem -text -noout
# Check certificate expiry
openssl x509 -in /etc/ssl/certs/cert.pem -noout -dates
```
**Solutions**:
```bash
# Renew Let's Encrypt certificates
sudo certbot renew --dry-run
sudo certbot renew --force-renewal
# Generate self-signed certificate
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
-keyout /etc/ssl/private/selfsigned.key \
-out /etc/ssl/certs/selfsigned.crt
```
## Network Storage Issues
### CIFS/SMB Mount Problems
**Symptoms**: Mount failures, connection timeouts, permission errors
**Diagnosis**:
```bash
# Test SMB connectivity
smbclient -L //nas-server -U username
testparm # Test Samba configuration
# Check mount status
mount | grep cifs
df -h | grep cifs
```
**Solutions**:
```bash
# Remount with verbose logging
sudo mount -t cifs //server/share /mnt/point -o username=user,password=pass,vers=3.0
# Fix mount options in /etc/fstab
//server/share /mnt/point cifs credentials=/etc/cifs/credentials,uid=1000,gid=1000,iocharset=utf8,file_mode=0644,dir_mode=0755,cache=strict,_netdev 0 0
# Test credentials
sudo cat /etc/cifs/credentials
# Should contain: username=, password=, domain=
```
### NFS Mount Issues
**Symptoms**: Stale file handles, mount hangs, permission denied
**Diagnosis**:
```bash
# Check NFS services
systemctl status nfs-client.target
showmount -e nfs-server
# Test NFS connectivity
rpcinfo -p nfs-server
```
**Solutions**:
```bash
# Restart NFS services
sudo systemctl restart nfs-client.target
# Remount NFS shares
sudo umount /mnt/nfs-share
sudo mount -t nfs server:/path /mnt/nfs-share
# Fix stale file handles
sudo umount -f /mnt/nfs-share
sudo mount /mnt/nfs-share
```
## Firewall and Security Issues
### Port Access Problems
**Symptoms**: Connection refused, filtered ports, blocked services
**Diagnosis**:
```bash
# Check firewall status
sudo ufw status verbose
sudo iptables -L -n -v
# Test port accessibility
nc -zv host port
nmap -p port host
```
**Solutions**:
```bash
# Open required ports
sudo ufw allow ssh
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw allow from 10.10.0.0/24
# Reset firewall if needed
sudo ufw --force reset
sudo ufw enable
```
### Network Security Issues
**Symptoms**: Unauthorized access, suspicious traffic, security alerts
**Diagnosis**:
```bash
# Check active connections
ss -tuln
netstat -tuln
# Review logs for security events
sudo tail -f /var/log/auth.log
sudo tail -f /var/log/syslog | grep -i security
```
**Solutions**:
```bash
# Block suspicious IPs
sudo ufw deny from suspicious-ip
# Update SSH security
sudo nano /etc/ssh/sshd_config
# Set: PasswordAuthentication no, PermitRootLogin no
sudo systemctl restart sshd
```
## Service Discovery and DNS Issues
### Local DNS Problems
**Symptoms**: Services unreachable by hostname, DNS timeouts
**Diagnosis**:
```bash
# Test local DNS resolution
nslookup service.homelab.local
dig @10.10.0.16 service.homelab.local
# Check DNS server status
systemctl status bind9 # or named
```
**Solutions**:
```bash
# Add to /etc/hosts as temporary fix
echo "10.10.0.100 service.homelab.local" | sudo tee -a /etc/hosts
# Restart DNS services
sudo systemctl restart bind9
sudo systemctl restart systemd-resolved
```
### Container Networking Issues
**Symptoms**: Containers cannot communicate, service discovery fails
**Diagnosis**:
```bash
# Check Docker networks
docker network ls
docker network inspect bridge
# Test container connectivity
docker exec container1 ping container2
docker exec container1 nslookup container2
```
**Solutions**:
```bash
# Create custom network
docker network create --driver bridge app-network
docker run --network app-network container
# Fix DNS in containers
docker run --dns 8.8.8.8 container
```
## Performance Issues
### Network Latency Problems
**Symptoms**: Slow response times, timeouts, poor performance
**Diagnosis**:
```bash
# Measure network latency
ping -c 100 host
mtr --report host
# Check network interface stats
ip -s link show
cat /proc/net/dev
```
**Solutions**:
```bash
# Optimize network settings
echo 'net.core.rmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.core.wmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
# Check for network congestion
iftop
nethogs
```
### Bandwidth Issues
**Symptoms**: Slow transfers, network congestion, dropped packets
**Diagnosis**:
```bash
# Test bandwidth
iperf3 -s # Server
iperf3 -c server-ip # Client
# Check interface utilization
vnstat -i eth0
```
**Solutions**:
```bash
# Implement QoS if needed
sudo tc qdisc add dev eth0 root fq_codel
# Optimize buffer sizes
sudo ethtool -G eth0 rx 4096 tx 4096
```
## Emergency Recovery Procedures
### Network Emergency Recovery
**Complete network failure recovery**:
```bash
# Reset all network configuration
sudo systemctl stop networking
sudo ip addr flush dev eth0
sudo ip route flush table main
sudo systemctl start networking
# Manual network configuration
sudo ip addr add 10.10.0.100/24 dev eth0
sudo ip route add default via 10.10.0.1
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
```
### SSH Emergency Access
**When locked out of systems**:
```bash
# Use emergency SSH key
ssh -i ~/.ssh/emergency_homelab_rsa user@host
# Via console access (if available)
# Use hypervisor console or physical access
# Reset SSH to allow password auth temporarily
sudo sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
sudo systemctl restart sshd
```
### Service Recovery
**Critical service restoration**:
```bash
# Restart all network services
sudo systemctl restart networking
sudo systemctl restart nginx
sudo systemctl restart sshd
# Emergency firewall disable
sudo ufw disable # CAUTION: Only for troubleshooting
# Service-specific recovery
sudo systemctl restart docker
sudo systemctl restart systemd-resolved
```
## Monitoring and Prevention
### Network Health Monitoring
```bash
#!/bin/bash
# network-monitor.sh
CRITICAL_HOSTS="10.10.0.1 10.10.0.16 nas.homelab.local"
CRITICAL_SERVICES="https://homelab.local http://proxmox.homelab.local:8006"
for host in $CRITICAL_HOSTS; do
if ! ping -c1 -W5 $host >/dev/null 2>&1; then
echo "ALERT: $host unreachable" | logger -t network-monitor
fi
done
for service in $CRITICAL_SERVICES; do
if ! curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
echo "ALERT: $service unavailable" | logger -t network-monitor
fi
done
```
### Automated Recovery Scripts
```bash
#!/bin/bash
# network-recovery.sh
if ! ping -c1 8.8.8.8 >/dev/null 2>&1; then
echo "Network down, attempting recovery..."
sudo systemctl restart networking
sleep 10
if ping -c1 8.8.8.8 >/dev/null 2>&1; then
echo "Network recovered"
else
echo "Manual intervention required"
fi
fi
```
## Quick Reference Commands
### Network Diagnostics
```bash
# Connectivity tests
ping host
traceroute host
mtr host
nc -zv host port
# Service checks
systemctl status networking
systemctl status nginx
systemctl status sshd
# Network configuration
ip addr show
ip route show
ss -tuln
```
### Emergency Commands
```bash
# Network restart
sudo systemctl restart networking
# SSH emergency access
ssh -i ~/.ssh/emergency_homelab_rsa user@host
# Firewall quick disable (emergency only)
sudo ufw disable
# DNS quick fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
```
This troubleshooting guide provides comprehensive solutions for common networking issues in home lab environments.

View File

@ -1,26 +0,0 @@
# Docker Patterns
## Container Best Practices
- Use multi-stage builds for production images
- Minimize layer count and image size
- Run containers as non-root users
- Use specific version tags, avoid `latest`
- Implement health checks
## Common Patterns
- **Multi-service applications**: Use docker-compose for local development
- **Production deployments**: Single-container per service with orchestration
- **Development environments**: Volume mounts for code changes
- **CI/CD integration**: Build, test, and push in pipeline stages
## Security Considerations
- Scan images for vulnerabilities
- Use distroless or minimal base images
- Implement resource limits
- Network isolation between services
## Related Documentation
- Examples: `/examples/docker/multi-stage-builds.md`
- Examples: `/examples/docker/compose-patterns.md`
- Reference: `/reference/docker/troubleshooting.md`
- Reference: `/reference/docker/security-checklist.md`

View File

@ -1,32 +0,0 @@
# Networking Patterns
## Infrastructure Setup
- **Reverse proxy** configuration (Nginx/Traefik)
- **Load balancing** strategies and health checks
- **SSL/TLS termination** and certificate management
- **Network segmentation** and VLANs
## Service Discovery
- **DNS-based** service resolution
- **Container networking** with Docker networks
- **Service mesh** patterns for microservices
- **API gateway** implementation
## Security Patterns
- **Firewall rules** and port management
- **VPN setup** for remote access
- **Zero-trust networking** principles
- **Network monitoring** and intrusion detection
## Performance Optimization
- **CDN integration** for static assets
- **Connection pooling** and keep-alives
- **Bandwidth management** and QoS
- **Caching strategies** at network level
## Related Documentation
- Examples: `/examples/networking/nginx-config.md`
- Examples: `/examples/networking/vpn-setup.md`
- Examples: `/examples/networking/load-balancing.md`
- Reference: `/reference/networking/troubleshooting.md`
- Reference: `/reference/networking/security.md`

View File

@ -1,66 +0,0 @@
# Virtual Machine Management Patterns
## Automated Provisioning
- **Cloud-init deployment** - Fully automated VM provisioning from first boot
- **Post-install scripts** - Standardized configuration for existing VMs
- **SSH key management** - Automated key deployment with emergency backup
- **Security hardening** - Password auth disabled, firewall configured
## VM Provisioning Strategies
### Template-Based Deployment
- **Ubuntu Server templates** optimized for home lab environments
- **Resource allocation** sizing and planning
- **Network configuration** and VLAN assignment (10.10.0.x networks)
- **Storage provisioning** and disk management
### Infrastructure as Code
- **Cloud-init templates** for repeatable VM creation
- **Bash provisioning scripts** for existing infrastructure
- **SSH key integration** with existing homelab key management
- **Docker environment** setup with user permissions
## Lifecycle Management
- **Automated provisioning** with infrastructure as code
- **Configuration management** with standardized scripts
- **Snapshot management** and rollback strategies
- **Scaling policies** for resource optimization
## Monitoring & Maintenance
- **Resource monitoring** (CPU, memory, disk, network)
- **Health checks** and alerting systems
- **Patch management** and update strategies
- **Performance tuning** and optimization
## Backup & Recovery
- **VM-level backups** vs **application-level backups**
- **Disaster recovery** planning and testing
- **High availability** configurations
- **Migration strategies** between hosts
## Implementation Workflows
### New VM Creation (Recommended)
1. **Create VM in Proxmox** with cloud-init support
2. **Apply cloud-init template** (`scripts/vm-management/cloud-init-user-data.yaml`)
3. **Start VM** - fully automated provisioning
4. **Verify setup** via SSH key authentication
### Existing VM Configuration
1. **Run post-install script** (`scripts/vm-management/vm-post-install.sh <ip> <user>`)
2. **Automated provisioning** handles updates, SSH keys, Docker
3. **Security hardening** applied automatically
4. **Test connectivity** and verify Docker installation
## Security Architecture
- **SSH key-based authentication** only (passwords disabled)
- **Emergency key backup** for failover access
- **User privilege separation** (sudo required, docker group)
- **Automatic security updates** configured
- **Network isolation** ready (10.10.0.x internal network)
## Related Documentation
- **Implementation**: `scripts/vm-management/README.md` - Complete setup guides
- **SSH Keys**: `patterns/networking/ssh-key-management.md` - Key lifecycle management
- **Examples**: `examples/networking/ssh-homelab-setup.md` - SSH integration patterns
- **Reference**: `reference/vm-management/troubleshooting.md` - Common issues and solutions

View File

@ -1,498 +0,0 @@
#!/usr/bin/env python3
"""
Tdarr API Monitoring Script
Monitors Tdarr server via its web API endpoints:
- Server status and health
- Queue status and statistics
- Node status and performance
- Library scan progress
- Worker activity
Usage:
python3 tdarr_monitor.py --server http://10.10.0.43:8265 --check all
python3 tdarr_monitor.py --server http://10.10.0.43:8265 --check queue
python3 tdarr_monitor.py --server http://10.10.0.43:8265 --check nodes
"""
import argparse
import json
import logging
import sys
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Dict, List, Optional, Any
import requests
from urllib.parse import urljoin
@dataclass
class ServerStatus:
timestamp: str
server_url: str
status: str
error: Optional[str] = None
version: Optional[str] = None
server_id: Optional[str] = None
uptime: Optional[str] = None
system_info: Optional[Dict[str, Any]] = None
@dataclass
class QueueStats:
total_files: int
queued: int
processing: int
completed: int
queue_items: List[Dict[str, Any]]
@dataclass
class QueueStatus:
timestamp: str
queue_stats: Optional[QueueStats] = None
error: Optional[str] = None
@dataclass
class NodeInfo:
id: Optional[str]
nodeName: Optional[str]
status: str
lastSeen: Optional[int]
version: Optional[str]
platform: Optional[str]
workers: Dict[str, int]
processing: List[Dict[str, Any]]
@dataclass
class NodeSummary:
total_nodes: int
online_nodes: int
offline_nodes: int
online_details: List[NodeInfo]
offline_details: List[NodeInfo]
@dataclass
class NodeStatus:
timestamp: str
nodes: List[Dict[str, Any]]
node_summary: Optional[NodeSummary] = None
error: Optional[str] = None
@dataclass
class LibraryInfo:
name: Optional[str]
path: Optional[str]
file_count: int
scan_progress: int
last_scan: Optional[str]
is_scanning: bool
@dataclass
class ScanStatus:
total_libraries: int
total_files: int
scanning_libraries: int
@dataclass
class LibraryStatus:
timestamp: str
libraries: List[LibraryInfo]
scan_status: Optional[ScanStatus] = None
error: Optional[str] = None
@dataclass
class Statistics:
total_transcodes: int
space_saved: int
total_files_processed: int
failed_transcodes: int
processing_speed: int
eta: Optional[str]
@dataclass
class StatisticsStatus:
timestamp: str
statistics: Optional[Statistics] = None
error: Optional[str] = None
@dataclass
class HealthCheck:
status: str
healthy: bool
online_count: Optional[int] = None
total_count: Optional[int] = None
accessible: Optional[bool] = None
total_items: Optional[int] = None
@dataclass
class HealthStatus:
timestamp: str
overall_status: str
checks: Dict[str, HealthCheck]
class TdarrMonitor:
def __init__(self, server_url: str, timeout: int = 30):
"""Initialize Tdarr monitor with server URL."""
self.server_url = server_url.rstrip('/')
self.timeout = timeout
self.session = requests.Session()
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s'
)
self.logger = logging.getLogger(__name__)
def _make_request(self, endpoint: str) -> Optional[Dict[str, Any]]:
"""Make HTTP request to Tdarr API endpoint."""
url = urljoin(self.server_url, endpoint)
try:
response = self.session.get(url, timeout=self.timeout)
response.raise_for_status()
return response.json()
except requests.exceptions.RequestException as e:
self.logger.error(f"Request failed for {url}: {e}")
return None
except json.JSONDecodeError as e:
self.logger.error(f"JSON decode failed for {url}: {e}")
return None
def get_server_status(self) -> ServerStatus:
"""Get overall server status and configuration."""
timestamp = datetime.now().isoformat()
# Try to get server info from API
data = self._make_request('/api/v2/get-server-info')
if data:
return ServerStatus(
timestamp=timestamp,
server_url=self.server_url,
status='online',
version=data.get('version'),
server_id=data.get('serverId'),
uptime=data.get('uptime'),
system_info=data.get('systemInfo', {})
)
else:
return ServerStatus(
timestamp=timestamp,
server_url=self.server_url,
status='offline',
error='Unable to connect to Tdarr server'
)
def get_queue_status(self) -> QueueStatus:
"""Get transcoding queue status and statistics."""
timestamp = datetime.now().isoformat()
# Get queue information
data = self._make_request('/api/v2/get-queue')
if data:
queue_data = data.get('queue', [])
# Calculate queue statistics
total_files = len(queue_data)
queued_files = len([f for f in queue_data if f.get('status') == 'Queued'])
processing_files = len([f for f in queue_data if f.get('status') == 'Processing'])
completed_files = len([f for f in queue_data if f.get('status') == 'Completed'])
queue_stats = QueueStats(
total_files=total_files,
queued=queued_files,
processing=processing_files,
completed=completed_files,
queue_items=queue_data[:10] # First 10 items for details
)
return QueueStatus(
timestamp=timestamp,
queue_stats=queue_stats
)
else:
return QueueStatus(
timestamp=timestamp,
error='Unable to fetch queue data'
)
def get_node_status(self) -> NodeStatus:
"""Get status of all connected nodes."""
timestamp = datetime.now().isoformat()
# Get nodes information
data = self._make_request('/api/v2/get-nodes')
if data:
nodes = data.get('nodes', [])
# Process node information
online_nodes = []
offline_nodes = []
for node in nodes:
node_info = NodeInfo(
id=node.get('_id'),
nodeName=node.get('nodeName'),
status='online' if node.get('lastSeen', 0) > 0 else 'offline',
lastSeen=node.get('lastSeen'),
version=node.get('version'),
platform=node.get('platform'),
workers={
'cpu': node.get('workers', {}).get('CPU', 0),
'gpu': node.get('workers', {}).get('GPU', 0)
},
processing=node.get('currentJobs', [])
)
if node_info.status == 'online':
online_nodes.append(node_info)
else:
offline_nodes.append(node_info)
node_summary = NodeSummary(
total_nodes=len(nodes),
online_nodes=len(online_nodes),
offline_nodes=len(offline_nodes),
online_details=online_nodes,
offline_details=offline_nodes
)
return NodeStatus(
timestamp=timestamp,
nodes=nodes,
node_summary=node_summary
)
else:
return NodeStatus(
timestamp=timestamp,
nodes=[],
error='Unable to fetch node data'
)
def get_library_status(self) -> LibraryStatus:
"""Get library scan status and file statistics."""
timestamp = datetime.now().isoformat()
# Get library information
data = self._make_request('/api/v2/get-libraries')
if data:
libraries = data.get('libraries', [])
library_stats = []
total_files = 0
for lib in libraries:
lib_info = LibraryInfo(
name=lib.get('name'),
path=lib.get('path'),
file_count=lib.get('totalFiles', 0),
scan_progress=lib.get('scanProgress', 0),
last_scan=lib.get('lastScan'),
is_scanning=lib.get('isScanning', False)
)
library_stats.append(lib_info)
total_files += lib_info.file_count
scan_status = ScanStatus(
total_libraries=len(libraries),
total_files=total_files,
scanning_libraries=len([l for l in library_stats if l.is_scanning])
)
return LibraryStatus(
timestamp=timestamp,
libraries=library_stats,
scan_status=scan_status
)
else:
return LibraryStatus(
timestamp=timestamp,
libraries=[],
error='Unable to fetch library data'
)
def get_statistics(self) -> StatisticsStatus:
"""Get overall Tdarr statistics and health metrics."""
timestamp = datetime.now().isoformat()
# Get statistics
data = self._make_request('/api/v2/get-stats')
if data:
stats = data.get('stats', {})
statistics = Statistics(
total_transcodes=stats.get('totalTranscodes', 0),
space_saved=stats.get('spaceSaved', 0),
total_files_processed=stats.get('totalFilesProcessed', 0),
failed_transcodes=stats.get('failedTranscodes', 0),
processing_speed=stats.get('processingSpeed', 0),
eta=stats.get('eta')
)
return StatisticsStatus(
timestamp=timestamp,
statistics=statistics
)
else:
return StatisticsStatus(
timestamp=timestamp,
error='Unable to fetch statistics'
)
def health_check(self) -> HealthStatus:
"""Perform comprehensive health check."""
timestamp = datetime.now().isoformat()
# Server connectivity
server_status = self.get_server_status()
server_check = HealthCheck(
status=server_status.status,
healthy=server_status.status == 'online'
)
# Node connectivity
node_status = self.get_node_status()
nodes_healthy = (
node_status.node_summary.online_nodes > 0 if node_status.node_summary else False
) and not node_status.error
nodes_check = HealthCheck(
status='online' if nodes_healthy else 'offline',
healthy=nodes_healthy,
online_count=node_status.node_summary.online_nodes if node_status.node_summary else 0,
total_count=node_status.node_summary.total_nodes if node_status.node_summary else 0
)
# Queue status
queue_status = self.get_queue_status()
queue_healthy = not queue_status.error
queue_check = HealthCheck(
status='accessible' if queue_healthy else 'error',
healthy=queue_healthy,
accessible=queue_healthy,
total_items=queue_status.queue_stats.total_files if queue_status.queue_stats else 0
)
checks = {
'server': server_check,
'nodes': nodes_check,
'queue': queue_check
}
# Determine overall health
all_checks_healthy = all(check.healthy for check in checks.values())
overall_status = 'healthy' if all_checks_healthy else 'unhealthy'
return HealthStatus(
timestamp=timestamp,
overall_status=overall_status,
checks=checks
)
def main():
parser = argparse.ArgumentParser(description='Monitor Tdarr server via API')
parser.add_argument('--server', required=True, help='Tdarr server URL (e.g., http://10.10.0.43:8265)')
parser.add_argument('--check', choices=['all', 'status', 'queue', 'nodes', 'libraries', 'stats', 'health'],
default='health', help='Type of check to perform')
parser.add_argument('--timeout', type=int, default=30, help='Request timeout in seconds')
parser.add_argument('--output', choices=['json', 'pretty'], default='pretty', help='Output format')
parser.add_argument('--verbose', action='store_true', help='Enable verbose logging')
args = parser.parse_args()
if args.verbose:
logging.getLogger().setLevel(logging.DEBUG)
# Initialize monitor
monitor = TdarrMonitor(args.server, args.timeout)
# Perform requested check
result = None
if args.check == 'all':
result = {
'server_status': monitor.get_server_status(),
'queue_status': monitor.get_queue_status(),
'node_status': monitor.get_node_status(),
'library_status': monitor.get_library_status(),
'statistics': monitor.get_statistics()
}
elif args.check == 'status':
result = monitor.get_server_status()
elif args.check == 'queue':
result = monitor.get_queue_status()
elif args.check == 'nodes':
result = monitor.get_node_status()
elif args.check == 'libraries':
result = monitor.get_library_status()
elif args.check == 'stats':
result = monitor.get_statistics()
elif args.check == 'health':
result = monitor.health_check()
# Output results
if args.output == 'json':
# Convert dataclasses to dictionaries for JSON serialization
if args.check == 'all':
json_result = {}
for key, value in result.items():
json_result[key] = asdict(value)
print(json.dumps(json_result, indent=2))
else:
print(json.dumps(asdict(result), indent=2))
else:
# Pretty print format
print(f"=== Tdarr Monitor Results - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')} ===")
if args.check == 'health' or (hasattr(result, 'overall_status') and result.overall_status):
health = result if hasattr(result, 'overall_status') else None
if health:
status = health.overall_status
print(f"Overall Status: {status.upper()}")
if health.checks:
print("\nHealth Checks:")
for check_name, check_data in health.checks.items():
status_icon = "" if check_data.healthy else ""
print(f" {status_icon} {check_name.title()}: {asdict(check_data)}")
if args.check == 'all':
for section, data in result.items():
print(f"\n=== {section.replace('_', ' ').title()} ===")
print(json.dumps(asdict(data), indent=2))
elif args.check != 'health':
print(json.dumps(asdict(result), indent=2))
# Exit with appropriate code
if result:
# Check for unhealthy status in health check
if isinstance(result, HealthStatus) and result.overall_status == 'unhealthy':
sys.exit(1)
# Check for errors in individual status objects (all status classes except HealthStatus have error attribute)
elif (isinstance(result, (ServerStatus, QueueStatus, NodeStatus, LibraryStatus, StatisticsStatus))
and result.error):
sys.exit(1)
# Check for errors in 'all' results
elif isinstance(result, dict):
for status_obj in result.values():
if (isinstance(status_obj, (ServerStatus, QueueStatus, NodeStatus, LibraryStatus, StatisticsStatus))
and status_obj.error):
sys.exit(1)
sys.exit(0)
if __name__ == '__main__':
main()

View File

@ -1,6 +0,0 @@
#!/bin/bash
# Tdarr Manager - Quick access to Tdarr scheduler controls
# This is a convenience script that forwards to the main manager
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
exec "${SCRIPT_DIR}/tdarr/tdarr-schedule-manager.sh" "$@"

152
tdarr/CONTEXT.md Normal file
View File

@ -0,0 +1,152 @@
# Tdarr Transcoding System - Technology Context
## Overview
Tdarr is a distributed transcoding system that converts media files to optimized formats. This implementation uses an intelligent gaming-aware scheduler with unmapped node architecture for optimal performance and system stability.
## Architecture Patterns
### Distributed Unmapped Node Architecture (Recommended)
**Pattern**: Server-Node separation with local high-speed cache
- **Server**: Tdarr Server manages queue, web interface, and coordination
- **Node**: Unmapped nodes with local NVMe cache for processing
- **Benefits**: 3-5x performance improvement, network I/O reduction, linear scaling
**When to Use**:
- Multiple transcoding nodes across network
- High-performance requirements (10GB+ files)
- Network bandwidth limitations
- Gaming systems requiring GPU priority management
### Configuration Principles
1. **Cache Optimization**: Use local NVMe storage for work directories
2. **Gaming Detection**: Automatic pause during GPU-intensive activities
3. **Resource Isolation**: Container limits prevent kernel-level crashes
4. **Monitoring Integration**: Automated cleanup and Discord notifications
## Core Components
### Gaming-Aware Scheduler
**Purpose**: Automatically manages Tdarr node to avoid conflicts with gaming
**Location**: `scripts/tdarr-schedule-manager.sh`
**Key Features**:
- Detects gaming processes (Steam, Lutris, Wine, etc.)
- GPU usage monitoring (>15% threshold)
- Configurable time windows
- Automated temporary directory cleanup
**Schedule Format**: `"HOUR_START-HOUR_END:DAYS"`
- `"22-07:daily"` - Overnight transcoding
- `"09-17:1-5"` - Business hours weekdays only
- `"14-16:6,7"` - Weekend afternoon window
### Monitoring System
**Purpose**: Prevents staging section timeouts and system instability
**Location**: `scripts/monitoring/tdarr-timeout-monitor.sh`
**Capabilities**:
- Staging timeout detection (300-second hardcoded limit)
- Automatic work directory cleanup
- Discord notifications with user pings
- Log rotation and retention management
### Container Architecture
**Server Configuration**:
```yaml
# Hybrid storage with resource limits
services:
  tdarr:
    image: ghcr.io/haveagitgat/tdarr:latest
    ports:
      - "8265:8265"  # webUI
      - "8266:8266"  # server
    volumes:
      - "./tdarr-data:/app/configs"
      - "/mnt/media:/media"
```
**Node Configuration**:
```bash
# Unmapped node with local cache
podman run -d \
--name tdarr-node-gpu \
-e nodeType=unmapped \
-v "/mnt/NV2/tdarr-cache:/cache" \
--device nvidia.com/gpu=all \
ghcr.io/haveagitgat/tdarr_node:latest
```
## Implementation Patterns
### Performance Optimization
1. **Local Cache Strategy**: Download → Process → Upload (vs. streaming)
2. **Resource Limits**: Prevent memory exhaustion and kernel crashes
3. **Network Resilience**: CIFS mount options for stability
4. **Automated Cleanup**: Prevent accumulation of stuck directories
### Error Prevention
1. **Plugin Safety**: Null-safe forEach operations `(streams || []).forEach()`
2. **Clean Installation**: Avoid custom plugin mounts causing version conflicts
3. **Container Isolation**: Resource limits prevent system-level crashes
4. **Network Stability**: Unmapped architecture reduces CIFS dependency
### Gaming Integration
1. **Process Detection**: Monitor for gaming applications and utilities
2. **GPU Threshold**: Stop transcoding when GPU usage >15%
3. **Time Windows**: Respect user-defined allowed transcoding hours
4. **Manual Override**: Direct start/stop commands bypass scheduler
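The detection steps above roughly combine into a check like this (a sketch only; the process list and 15% threshold mirror the defaults described in this document, and the real scheduler script may differ):
```bash
# Return success when transcoding should pause (gaming process found or GPU busy)
GPU_THRESHOLD=15
GAMING_PROCESSES="steam lutris heroic wine bottles gamemode mangohud"

gaming_active() {
  local p gpu
  for p in $GAMING_PROCESSES; do
    pgrep -f "$p" >/dev/null 2>&1 && return 0
  done
  gpu=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits 2>/dev/null | head -1)
  [ -n "$gpu" ] && [ "$gpu" -gt "$GPU_THRESHOLD" ]
}

gaming_active && podman stop tdarr-node-gpu >/dev/null 2>&1
```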
## Common Workflows
### Initial Setup
1. Start server with "Allow unmapped Nodes" enabled
2. Configure node as unmapped with local cache
3. Install gaming-aware scheduler via cron
4. Set up monitoring system for automated cleanup
### Troubleshooting Patterns
1. **forEach Errors**: Clean plugin installation, avoid custom mounts
2. **Staging Timeouts**: Monitoring system handles automatic cleanup
3. **System Crashes**: Convert to unmapped node architecture
4. **Network Issues**: Implement CIFS resilience options
### Performance Tuning
1. **Cache Size**: 100-500GB NVMe for concurrent jobs
2. **Bandwidth**: Unmapped nodes reduce streaming requirements
3. **Scaling**: Linear scaling with additional unmapped nodes
4. **GPU Priority**: Gaming detection ensures responsive system
## Best Practices
### Production Deployment
- Use unmapped node architecture for stability
- Implement comprehensive monitoring
- Configure gaming-aware scheduling for desktop systems
- Set appropriate container resource limits
### Development Guidelines
- Test with internal Tdarr test files first
- Implement null-safety checks in custom plugins
- Use structured logging for troubleshooting
- Separate concerns: scheduling, monitoring, processing
### Security Considerations
- Container isolation prevents system-level failures
- Resource limits protect against memory exhaustion
- Network mount resilience prevents kernel crashes
- Automated cleanup prevents disk space issues
## Migration Patterns
### From Mapped to Unmapped Nodes
1. Enable "Allow unmapped Nodes" in server options
2. Update node configuration (add nodeType=unmapped)
3. Change cache volume to local storage
4. Remove media volume mapping
5. Test workflow and monitor performance
### Plugin System Cleanup
1. Remove all custom plugin mounts
2. Force server restart to regenerate plugin ZIP
3. Restart nodes to download fresh plugins
4. Verify forEach fixes in downloaded plugins
This technology context provides the foundation for implementing, troubleshooting, and optimizing Tdarr transcoding systems in home lab environments.

View File

@ -0,0 +1,143 @@
# Tdarr CIFS Troubleshooting Session - 2025-08-11
## Problem Statement
The Tdarr unmapped node experienced persistent download timeouts around 9:08 PM with large files (31GB+ remuxes), producing "Cancelling" messages and stuck downloads. Downloads would hang for 33+ minutes before timing out, even though the container remained running.
## Initial Hypothesis: Mapped vs Unmapped Node Issue
**Status**: ❌ **DISPROVEN**
- Suspected unmapped node timeout configuration differences
- Windows PC running mapped Tdarr node works fine (slow but stable)
- Both mapped and unmapped Linux nodes exhibited identical timeout issues
- **Conclusion**: Architecture type was not the root cause
## Key Insight: Windows vs Linux Performance Difference
**Observation**: Windows Tdarr node (mapped mode) works without timeouts, Linux nodes (both mapped/unmapped) fail
**Implication**: Platform-specific issue, likely network stack or CIFS implementation
## Root Cause Discovery Process
### Phase 1: Linux Client CIFS Analysis
**Method**: Direct CIFS mount testing on Tdarr node machine (nobara-pc)
**Initial CIFS Mount Configuration** (problematic):
```bash
//10.10.0.35/media on /mnt/media type cifs (rw,relatime,vers=3.1.1,cache=strict,upcall_target=app,username=root,uid=1000,forceuid,gid=1000,forcegid,addr=10.10.0.35,file_mode=0755,dir_mode=0755,soft,nounix,serverino,mapposix,noperm,reparse=nfs,nativesocket,symlink=native,rsize=4194304,wsize=4194304,bsize=1048576,retrans=1,echo_interval=60,actimeo=30,closetimeo=1,_netdev,x-systemd.automount,x-systemd.device-timeout=10,x-systemd.mount-timeout=30)
```
**Critical Issues Identified**:
- `soft` - Mount fails on timeout instead of retrying indefinitely
- `retrans=1` - Only 1 retry attempt (NFS option, invalid for CIFS)
- `closetimeo=1` - Very short close timeout (1 second)
- `cache=strict` - No local caching, poor performance for large files
- `x-systemd.mount-timeout=30` - 30-second mount timeout
**Optimization Applied**:
```bash
//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,hard,rsize=16777216,wsize=16777216,cache=loose,actimeo=60,echo_interval=30,_netdev,x-systemd.automount,x-systemd.device-timeout=60,x-systemd.mount-timeout=120,noperm 0 0
```
**Performance Testing Results**:
- **Local SSD**: `dd` 800MB in 0.217s (4.0 GB/s) - baseline
- **CIFS 1MB blocks**: 42.7 MB/s - fast, no issues
- **CIFS 4MB blocks**: 205 MB/s - fast, no issues
- **CIFS 8MB blocks**: 83.1 MB/s - **3-minute terminal freeze**
**Critical Discovery**: Block size dependency causing I/O blocking with large transfers
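The block-size behaviour can be reproduced with a quick `dd` sweep through the CIFS mount (test file path and sizes are illustrative):
```bash
# Write ~800MB at 1M/4M/8M block sizes and compare sustained throughput
for bs in 1M 4M 8M; do
  echo "=== bs=${bs} ==="
  dd if=/dev/zero of=/mnt/media/.cifs-bs-test bs="$bs" count=$((800 / ${bs%M})) conv=fdatasync status=progress
  rm -f /mnt/media/.cifs-bs-test
done
```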
### Phase 2: Tdarr Server-Side Analysis
**Method**: Test Tdarr API download path directly
**API Test Command**:
```bash
curl -X POST "http://10.10.0.43:8265/api/v2/file/download" \
-H "Content-Type: application/json" \
-d '{"filePath": "/media/Movies/Jumanji (1995)/Jumanji (1995) Remux-1080p Proper.mkv"}' \
-o /tmp/tdarr-api-test.mkv
```
**Results**:
- **Performance**: 55.7-58.6 MB/s sustained
- **Progress**: Downloaded 15.3GB of 23GB (66%)
- **Failure**: **Download hung at 66% completion**
- **Timing**: Hung after ~5 minutes (consistent with previous timeout patterns)
### Phase 3: Tdarr Server CIFS Configuration Analysis
**Method**: Examine server-side storage mount
**Server CIFS Mount** (problematic):
```bash
//10.10.0.35/media /mnt/truenas-share cifs credentials=/root/.truenascreds,vers=3.1.1,rsize=4194304,wsize=4194304,cache=strict,actimeo=30,echo_interval=60,noperm 0 0
```
**Server Issues Identified**:
- **Missing `hard`** - Defaults to `soft` mount behavior
- `cache=strict` - No local caching (same issue as client)
- **No retry/timeout extensions** - Uses unreliable kernel defaults
- **No systemd timeout protection**
## Root Cause Confirmed
**Primary Issue**: Tdarr server's CIFS mount to TrueNAS using suboptimal configuration
**Impact**: Large file streaming via Tdarr API hangs when server's CIFS mount hits I/O blocking
**Evidence**: API download hung at exact same pattern as node timeouts (66% through large file)
## Solution Strategy
**Fix Tdarr Server CIFS Mount Configuration**:
```bash
//10.10.0.35/media /mnt/truenas-share cifs credentials=/root/.truenascreds,vers=3.1.1,hard,rsize=4194304,wsize=4194304,cache=loose,actimeo=60,echo_interval=30,_netdev,x-systemd.device-timeout=60,x-systemd.mount-timeout=120,noperm 0 0
```
**Key Optimizations**:
- `hard` - Retry indefinitely instead of timing out
- `cache=loose` - Enable local caching for large file performance
- `actimeo=60` - Longer attribute caching
- `echo_interval=30` - More frequent keep-alives
- Extended systemd timeouts for reliability
## Implementation Steps
1. **Update server `/etc/fstab`** with optimized CIFS configuration
2. **Remount server storage**:
```bash
ssh tdarr "sudo umount /mnt/truenas-share"
ssh tdarr "sudo systemctl daemon-reload"
ssh tdarr "sudo mount /mnt/truenas-share"
```
3. **Test large file API download** to verify fix
4. **Resume Tdarr transcoding** with confidence in large file handling
## Technical Insights
### CIFS vs SMB Protocol Differences
- **Windows nodes**: Use native SMB implementation (stable)
- **Linux nodes**: Use kernel CIFS module (prone to I/O blocking with poor configuration)
- **Block size sensitivity**: Large block transfers require careful timeout/retry configuration
### Tdarr Architecture Impact
- **Unmapped nodes**: Download entire files via API before processing (high bandwidth, vulnerable to server CIFS issues)
- **Mapped nodes**: Stream files during processing (lower bandwidth, still vulnerable to server CIFS issues)
- **Root cause affects both architectures** since server-side storage access is the bottleneck
### Performance Expectations Post-Fix
- **Consistent 50-100 MB/s** for large file downloads
- **No timeout failures** with properly configured hard mounts
- **Reliable processing** of 31GB+ remux files
## Files Modified
- **Client**: `/etc/fstab` on nobara-pc (CIFS optimization applied)
- **Server**: `/etc/fstab` on tdarr server (pending optimization)
## Monitoring and Validation
- **Success criteria**: Tdarr API download of 23GB+ file completes without hanging
- **Performance target**: Sustained 50+ MB/s throughout entire transfer
- **Reliability target**: No timeouts during large file processing
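A validation run can reuse the same API download as Phase 2 and let curl report the overall throughput (output location is an example):
```bash
# Time a full large-file download through the Tdarr API and print average speed
curl -s -X POST "http://10.10.0.43:8265/api/v2/file/download" \
  -H "Content-Type: application/json" \
  -d '{"filePath": "/media/Movies/Jumanji (1995)/Jumanji (1995) Remux-1080p Proper.mkv"}' \
  -o /tmp/tdarr-api-validate.mkv \
  -w 'size: %{size_download} bytes, time: %{time_total}s, avg: %{speed_download} B/s\n'
rm -f /tmp/tdarr-api-validate.mkv
```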
## Session Outcome
**Status**: ✅ **ROOT CAUSE IDENTIFIED AND SOLUTION READY**
- Eliminated client-side variables through systematic testing
- Confirmed server-side CIFS configuration as bottleneck
- Validated fix strategy through client-side optimization success
- Ready to implement server-side solution
---
*Session Date: 2025-08-11*
*Duration: ~3 hours*
*Methods: Direct testing, API analysis, mount configuration review*

View File

@ -0,0 +1,183 @@
# Tdarr Node Container Configurations
## Overview
Complete examples for running Tdarr transcoding nodes in containers, covering both CPU-only and GPU-accelerated setups.
## CPU-Only Configuration (Docker Compose)
For systems without GPU or when GPU isn't needed:
```yaml
version: "3.4"
services:
tdarr-node:
container_name: tdarr-node-cpu
image: ghcr.io/haveagitgat/tdarr_node:latest
restart: unless-stopped
environment:
- TZ=America/Chicago
- UMASK_SET=002
- nodeName=local-workstation-cpu
- serverIP=YOUR_TDARR_SERVER_IP # Replace with your tdarr server IP
- serverPort=8266
- inContainer=true
- ffmpegVersion=6
volumes:
# Mount your media from the same NAS share as the server
- /path/to/your/media:/media # Replace with your local media mount
# Temp directory for transcoding cache
- ./temp:/temp
```
**Use case**:
- CPU-only transcoding
- Testing Tdarr functionality
- Systems without dedicated GPU
- When GPU drivers aren't available
## GPU-Accelerated Configuration (Podman)
**Recommended for Fedora/RHEL/CentOS/Nobara systems:**
### Mapped Node (Direct Media Access)
```bash
podman run -d --name tdarr-node-gpu-mapped \
--gpus all \
--restart unless-stopped \
-e TZ=America/Chicago \
-e UMASK_SET=002 \
-e nodeName=local-workstation-gpu-mapped \
-e serverIP=10.10.0.43 \
-e serverPort=8266 \
-e inContainer=true \
-e ffmpegVersion=6 \
-e NVIDIA_DRIVER_CAPABILITIES=all \
-e NVIDIA_VISIBLE_DEVICES=all \
-v /mnt/NV2/tdarr-cache:/cache \
-v /mnt/media/TV:/media/TV \
-v /mnt/media/Movies:/media/Movies \
-v /mnt/media/tdarr/tdarr-cache-clean:/temp \
ghcr.io/haveagitgat/tdarr_node:latest
```
### Unmapped Node (Downloads Files)
```bash
podman run -d --name tdarr-node-gpu-unmapped \
--gpus all \
--restart unless-stopped \
-e TZ=America/Chicago \
-e UMASK_SET=002 \
-e nodeName=local-workstation-gpu-unmapped \
-e serverIP=10.10.0.43 \
-e serverPort=8266 \
-e inContainer=true \
-e ffmpegVersion=6 \
-e NVIDIA_DRIVER_CAPABILITIES=all \
-e NVIDIA_VISIBLE_DEVICES=all \
-v /mnt/NV2/tdarr-cache:/cache \
-v /mnt/media:/media \
-v /mnt/media/tdarr/tdarr-cache-clean:/temp \
ghcr.io/haveagitgat/tdarr_node:latest
```
**Use cases**:
- **Mapped**: Direct media access, faster processing, no file downloads
- **Unmapped**: Works when network shares aren't available locally
- Hardware video encoding/decoding (NVENC/NVDEC)
- High-performance transcoding with NVMe cache
- Multiple concurrent streams
- Fedora-based systems where Podman works better than Docker
## GPU-Accelerated Configuration (Docker)
**For Ubuntu/Debian systems where Docker GPU support works:**
```yaml
version: "3.4"
services:
tdarr-node:
container_name: tdarr-node-gpu
image: ghcr.io/haveagitgat/tdarr_node:latest
restart: unless-stopped
environment:
- TZ=America/Chicago
- UMASK_SET=002
- nodeName=local-workstation-gpu
- serverIP=YOUR_TDARR_SERVER_IP
- serverPort=8266
- inContainer=true
- ffmpegVersion=6
- NVIDIA_DRIVER_CAPABILITIES=all
- NVIDIA_VISIBLE_DEVICES=all
volumes:
- /path/to/your/media:/media
- ./temp:/temp
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
```
## Configuration Parameters
### Required Environment Variables
- `TZ`: Timezone (e.g., `America/Chicago`)
- `nodeName`: Unique identifier for this node
- `serverIP`: IP address of Tdarr server
- `serverPort`: Tdarr server port (typically 8266)
- `inContainer`: Set to `true` for containerized deployments
- `ffmpegVersion`: FFmpeg version to use (6 recommended)
### GPU-Specific Variables
- `NVIDIA_DRIVER_CAPABILITIES`: Set to `all` for full GPU access
- `NVIDIA_VISIBLE_DEVICES`: `all` for all GPUs, or specific GPU IDs
### Volume Mounts
- `/media`: Mount point for media files (must match server configuration)
- `/temp`: Temporary directory for transcoding cache
## Platform-Specific Recommendations
### Fedora/RHEL/CentOS/Nobara
- **GPU**: Use Podman (Docker Desktop has GPU issues)
- **CPU**: Docker or Podman both work fine
### Ubuntu/Debian
- **GPU**: Use Docker with nvidia-container-toolkit
- **CPU**: Docker recommended
### Testing GPU Functionality
Verify GPU access inside container:
```bash
# For Podman
podman exec tdarr-node-gpu nvidia-smi
# For Docker
docker exec tdarr-node-gpu nvidia-smi
```
Test NVENC encoding:
```bash
# For Podman
podman exec tdarr-node-gpu /usr/local/bin/tdarr-ffmpeg -f lavfi -i testsrc2=duration=5:size=1920x1080:rate=30 -c:v h264_nvenc -t 5 /tmp/test.mp4
# For Docker
docker exec tdarr-node-gpu /usr/local/bin/tdarr-ffmpeg -f lavfi -i testsrc2=duration=5:size=1920x1080:rate=30 -c:v h264_nvenc -t 5 /tmp/test.mp4
```
## Troubleshooting
- **GPU not detected**: See `reference/docker/nvidia-gpu-troubleshooting.md`
- **Permission issues**: Ensure proper UMASK_SET and volume permissions
- **Connection issues**: Verify serverIP and firewall settings
- **Performance issues**: Monitor CPU/GPU utilization during transcoding
## Related Documentation
- `patterns/docker/gpu-acceleration.md` - GPU acceleration patterns
- `reference/docker/nvidia-gpu-troubleshooting.md` - Detailed GPU troubleshooting
- `start-tdarr-gpu-podman.sh` - Ready-to-use Podman startup script

View File

@ -0,0 +1,28 @@
version: "3.4"
services:
tdarr-node:
container_name: tdarr-node-local-cpu
image: ghcr.io/haveagitgat/tdarr_node:latest
restart: unless-stopped
environment:
- TZ=America/Chicago
- UMASK_SET=002
- nodeName=local-workstation-cpu
- serverIP=192.168.1.100 # Replace with your Tdarr server IP
- serverPort=8266
- inContainer=true
- ffmpegVersion=6
volumes:
# Media access (same as server)
- /mnt/media:/media # Replace with your media path
# Local transcoding cache
- ./temp:/temp
# Resource limits for CPU transcoding
deploy:
resources:
limits:
cpus: '14' # Leave some cores for system (16-core = use 14)
memory: 32G # Generous for 4K transcoding
reservations:
cpus: '8' # Minimum guaranteed cores
memory: 16G

View File

@ -0,0 +1,45 @@
version: "3.4"
services:
tdarr-node:
container_name: tdarr-node-local-gpu
image: ghcr.io/haveagitgat/tdarr_node:latest
restart: unless-stopped
environment:
- TZ=America/Chicago
- UMASK_SET=002
- nodeName=local-workstation-gpu
- serverIP=192.168.1.100 # Replace with your Tdarr server IP
- serverPort=8266
- inContainer=true
- ffmpegVersion=6
# NVIDIA environment variables
- NVIDIA_DRIVER_CAPABILITIES=all
- NVIDIA_VISIBLE_DEVICES=all
volumes:
# Media access (same as server)
- /mnt/media:/media # Replace with your media path
# Local transcoding cache
- ./temp:/temp
devices:
- /dev/dri:/dev/dri # Intel/AMD GPU fallback
# GPU configuration - choose ONE method:
# Method 1: Deploy syntax (recommended)
deploy:
resources:
limits:
memory: 16G # GPU transcoding uses less RAM
reservations:
memory: 8G
devices:
- driver: nvidia
count: all
capabilities: [gpu]
# Method 2: Runtime (alternative)
# runtime: nvidia
# Method 3: CDI (future)
# devices:
# - nvidia.com/gpu=all

View File

@ -0,0 +1,83 @@
#!/bin/bash
# Tdarr Mapped Node with GPU Support - Example Script
# This script starts a MAPPED Tdarr node container with NVIDIA GPU acceleration using Podman
#
# MAPPED NODES: Direct access to media files via volume mounts
# Use this approach when you want the node to directly access your media library
# for local processing without server coordination for file transfers
#
# Configure these variables for your setup:
set -e
CONTAINER_NAME="tdarr-node-gpu-mapped"
SERVER_IP="YOUR_SERVER_IP" # e.g., "10.10.0.43" or "192.168.1.100"
SERVER_PORT="8266" # Default Tdarr server port
NODE_NAME="YOUR_NODE_NAME" # e.g., "workstation-gpu" or "local-gpu-node"
MEDIA_PATH="/path/to/your/media" # e.g., "/mnt/media" or "/home/user/Videos"
CACHE_PATH="/path/to/cache" # e.g., "/mnt/ssd/tdarr-cache"
echo "🚀 Starting MAPPED Tdarr Node with GPU support using Podman..."
echo " Media Path: ${MEDIA_PATH}"
echo " Cache Path: ${CACHE_PATH}"
# Stop and remove existing container if it exists
if podman ps -a --format "{{.Names}}" | grep -q "^${CONTAINER_NAME}$"; then
echo "🛑 Stopping existing container: ${CONTAINER_NAME}"
podman stop "${CONTAINER_NAME}" 2>/dev/null || true
podman rm "${CONTAINER_NAME}" 2>/dev/null || true
fi
# Start Tdarr node with GPU support
echo "🎬 Starting Tdarr Node container..."
podman run -d --name "${CONTAINER_NAME}" \
--gpus all \
--restart unless-stopped \
-e TZ=America/Chicago \
-e UMASK_SET=002 \
-e nodeName="${NODE_NAME}" \
-e serverIP="${SERVER_IP}" \
-e serverPort="${SERVER_PORT}" \
-e inContainer=true \
-e ffmpegVersion=6 \
-e logLevel=DEBUG \
-e NVIDIA_DRIVER_CAPABILITIES=all \
-e NVIDIA_VISIBLE_DEVICES=all \
-v "${MEDIA_PATH}:/media" \
-v "${CACHE_PATH}:/temp" \
ghcr.io/haveagitgat/tdarr_node:latest
echo "⏳ Waiting for container to initialize..."
sleep 5
# Check container status
if podman ps --format "{{.Names}}" | grep -q "^${CONTAINER_NAME}$"; then
echo "✅ Mapped Tdarr Node is running successfully!"
echo ""
echo "📊 Container Status:"
podman ps --filter "name=${CONTAINER_NAME}" --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
echo ""
echo "🔍 Testing GPU Access:"
if podman exec "${CONTAINER_NAME}" nvidia-smi --query-gpu=name --format=csv,noheader,nounits 2>/dev/null; then
echo "🎉 GPU is accessible in container!"
else
echo "⚠️ GPU test failed, but container is running"
fi
echo ""
echo "🌐 Connection Details:"
echo " Server: ${SERVER_IP}:${SERVER_PORT}"
echo " Node Name: ${NODE_NAME}"
echo ""
echo "🧪 Test NVENC encoding:"
echo " podman exec ${CONTAINER_NAME} /usr/local/bin/tdarr-ffmpeg -f lavfi -i testsrc2=duration=5:size=1920x1080:rate=30 -c:v h264_nvenc -preset fast -t 5 /tmp/test.mp4"
echo ""
echo "📋 Container Management:"
echo " View logs: podman logs ${CONTAINER_NAME}"
echo " Stop: podman stop ${CONTAINER_NAME}"
echo " Remove: podman rm ${CONTAINER_NAME}"
else
echo "❌ Failed to start container"
echo "📋 Checking logs..."
podman logs "${CONTAINER_NAME}" --tail 10
exit 1
fi

View File

@ -0,0 +1,69 @@
# Tdarr Server Setup Example
## Directory Structure
```
~/container-data/tdarr/
├── docker-compose.yml
├── stonefish-tdarr-plugins/ # Custom plugins
├── tdarr/
│ ├── server/ # Local storage
│ ├── configs/
│ └── logs/
└── temp/ # Local temp if needed
```
## Storage Strategy
### Local Storage (Fast Access)
- **Database**: SQLite requires local filesystem for WAL mode
- **Configs**: Frequently accessed during startup
- **Logs**: Regular writes during operation
### Network Storage (Capacity)
- **Backups**: Infrequent access, large files
- **Media**: Read-only during transcoding
- **Cache**: Temporary transcoding files
## Upgrade Process
### Major Version Upgrades
1. **Backup current state**
```bash
docker-compose down
cp docker-compose.yml docker-compose.yml.backup
```
2. **For clean start** (recommended for major versions):
```bash
# Remove old database
sudo rm -rf ./tdarr/server
mkdir -p ./tdarr/server
# Pull latest image
docker-compose pull
# Start fresh
docker-compose up -d
```
3. **Monitor initialization**
```bash
docker-compose logs -f
```
## Common Issues
### Disk Space
- Monitor local database growth
- Regular cleanup of old backups
- Use network storage for large static data
### Permissions
- Container runs as PUID/PGID (usually 0/0)
- Ensure proper ownership of mounted directories
- Use `sudo rm -rf` for root-owned container files
### Network Filesystem Issues
- SQLite incompatible with NFS/SMB for database
- Keep database local, only backups on network
- Monitor transcoding cache disk usage
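A quick way to confirm the database stays on local storage (path assumes the directory layout above):
```bash
# SQLite WAL mode needs a local filesystem; cifs/nfs here means the layout is wrong
fstype=$(df -T ~/container-data/tdarr/tdarr/server | awk 'NR==2 {print $2}')
case "$fstype" in
  cifs|nfs*) echo "WARNING: database directory is on a network filesystem ($fstype)" ;;
  *)         echo "OK: database directory is on $fstype" ;;
esac
```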

View File

@ -0,0 +1,37 @@
version: "3.4"
services:
tdarr:
container_name: tdarr
image: ghcr.io/haveagitgat/tdarr:latest
restart: unless-stopped
network_mode: bridge
ports:
- 8265:8265 # webUI port
- 8266:8266 # server port
environment:
- TZ=America/Chicago
- PUID=0
- PGID=0
- UMASK_SET=002
- serverIP=0.0.0.0
- serverPort=8266
- webUIPort=8265
- internalNode=false # Disable for distributed setup
- inContainer=true
- ffmpegVersion=6
- nodeName=docker-server
volumes:
# Plugin mounts (stonefish example)
- ./stonefish-tdarr-plugins/FlowPlugins/:/app/server/Tdarr/Plugins/FlowPlugins/
- ./stonefish-tdarr-plugins/FlowPluginsTs/:/app/server/Tdarr/Plugins/FlowPluginsTs/
- ./stonefish-tdarr-plugins/Community/:/app/server/Tdarr/Plugins/Community/
# Hybrid storage strategy
- ./tdarr/server:/app/server # Local: Database, configs, logs
- ./tdarr/configs:/app/configs
- ./tdarr/logs:/app/logs
- /mnt/truenas-share/tdarr/tdarr-server/Backups:/app/server/Tdarr/Backups # Network: Backups
# Media and cache
- /mnt/truenas-share:/media
- /mnt/truenas-share/tdarr/tdarr-cache:/temp

212
tdarr/scripts/CONTEXT.md Normal file
View File

@ -0,0 +1,212 @@
# Tdarr Scripts - Operational Context
## Script Overview
This directory contains active operational scripts for Tdarr transcoding automation, gaming-aware scheduling, and system management.
## Core Scripts
### Gaming-Aware Scheduler
**Primary Script**: `tdarr-schedule-manager.sh`
**Purpose**: Comprehensive management interface for gaming-aware Tdarr scheduling
**Key Functions**:
- **Preset Management**: Quick schedule templates (night-only, work-safe, weekend-heavy, gaming-only)
- **Installation**: Automated cron job setup and configuration
- **Status Monitoring**: Real-time status and logging
- **Configuration**: Interactive schedule editing and validation
**Usage Patterns**:
```bash
# Quick setup
./tdarr-schedule-manager.sh preset work-safe
./tdarr-schedule-manager.sh install
# Monitoring
./tdarr-schedule-manager.sh status
./tdarr-schedule-manager.sh logs
# Testing
./tdarr-schedule-manager.sh test
```
### Container Management
**Start Script**: `start-tdarr-gpu-podman-clean.sh`
**Purpose**: Launch unmapped Tdarr node with optimized configuration
**Key Features**:
- **Unmapped Node Configuration**: Local cache for optimal performance
- **GPU Support**: Full NVIDIA device passthrough
- **Resource Optimization**: Direct NVMe cache mapping
- **Clean Architecture**: No media volume dependencies
**Stop Script**: `stop-tdarr-gpu-podman.sh`
**Purpose**: Graceful container shutdown with cleanup
### Scheduling Engine
**Core Engine**: `tdarr-cron-check-configurable.sh`
**Purpose**: Minute-by-minute decision engine for Tdarr state management
**Decision Logic**:
1. **Gaming Detection**: Check for active gaming processes
2. **GPU Monitoring**: Verify GPU usage below threshold (15%)
3. **Time Window Validation**: Ensure current time within allowed schedule
4. **State Management**: Start/stop Tdarr based on conditions
**Gaming Process Detection**:
- Steam, Lutris, Heroic Games Launcher
- Wine, Bottles (Windows compatibility layers)
- GameMode, MangoHUD (gaming utilities)
- GPU usage monitoring via nvidia-smi
### Configuration Management
**Config File**: `tdarr-schedule.conf`
**Purpose**: Centralized configuration for scheduler behavior
**Configuration Structure**:
```bash
# Time blocks: "HOUR_START-HOUR_END:DAYS"
SCHEDULE_BLOCKS="22-07:daily 09-17:1-5"
# Gaming detection settings
GPU_THRESHOLD=15
GAMING_PROCESSES="steam lutris heroic wine bottles gamemode mangohud"
# Operational settings
LOG_FILE="/tmp/tdarr-scheduler.log"
CONTAINER_NAME="tdarr-node-gpu"
```
## Operational Patterns
### Automated Maintenance
**Cron Integration**: Two automated systems running simultaneously
1. **Scheduler** (every minute): `tdarr-cron-check-configurable.sh`
2. **Cleanup** (every 6 hours): Temporary directory maintenance
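For reference, the installed crontab ends up with an entry along these lines (`./tdarr-schedule-manager.sh install` writes it; the script path shown here is an assumption for this repository layout):
```bash
# Decision engine runs every minute; see the cleanup entry below for the 6-hour job
* * * * * /mnt/NV2/Development/claude-home/tdarr/scripts/tdarr-cron-check-configurable.sh >> /tmp/tdarr-scheduler.log 2>&1
```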
**Cleanup Automation**:
```bash
# Removes abandoned transcoding directories
0 */6 * * * find /tmp -name "tdarr-workDir2-*" -type d -mmin +360 -exec rm -rf {} \; 2>/dev/null || true
```
### Logging Strategy
**Log Location**: `/tmp/tdarr-scheduler.log`
**Log Format**: Timestamped entries with decision reasoning
**Log Rotation**: Manual cleanup, focused on recent activity
**Log Examples**:
```
[2025-08-13 14:30:01] Gaming detected (steam), stopping Tdarr
[2025-08-13 14:35:01] Gaming ended, but outside allowed hours (14:35 not in 22-07:daily)
[2025-08-13 22:00:01] Starting Tdarr (no gaming, within schedule)
```
### System Integration
**Gaming Detection**: Real-time process monitoring
**GPU Monitoring**: nvidia-smi integration for usage thresholds
**Container Management**: Podman-based lifecycle management
**Cron Integration**: Standard system scheduler for automation
## Configuration Presets
### Preset Profiles
**night-only**: `"22-07:daily"` - Overnight transcoding only
**work-safe**: `"22-07:daily 09-17:1-5"` - Nights + work hours
**weekend-heavy**: `"22-07:daily 09-17:1-5 08-20:6-7"` - Maximum time
**gaming-only**: No time limits, gaming detection only
### Schedule Format Specification
**Format**: `"HOUR_START-HOUR_END:DAYS"`
**Examples**:
- `"22-07:daily"` - 10PM to 7AM every day (overnight)
- `"09-17:1-5"` - 9AM to 5PM Monday-Friday
- `"14-16:6,7"` - 2PM to 4PM Saturday and Sunday
- `"08-20:6-7"` - 8AM to 8PM weekends only
## Container Architecture
### Unmapped Node Configuration
**Architecture Choice**: Local cache with API-based transfers
**Benefits**: 3-5x performance improvement, reduced network dependency
**Container Environment**:
```bash
-e nodeType=unmapped
-e unmappedNodeCache=/cache
-e enableGpu=true
-e TZ=America/New_York
```
**Volume Configuration**:
```bash
# Local high-speed cache (NVMe)
-v "/mnt/NV2/tdarr-cache:/cache"
# Temp/processing space
-v "/mnt/NV2/tdarr-cache-clean:/temp"
# No media volumes (unmapped mode uses API)
```
### Resource Management
**GPU Access**: Full NVIDIA device passthrough
**Memory**: Controlled by container limits
**CPU**: Shared with host system
**Storage**: Local NVMe for optimal I/O performance
## Troubleshooting Context
### Common Issues
1. **Gaming Not Detected**: Check process names in configuration
2. **Time Window Issues**: Verify schedule block format
3. **Container Start Failures**: Check GPU device access
4. **Log File Growth**: Manual cleanup of scheduler logs
### Diagnostic Commands
```bash
# Test current conditions
./tdarr-schedule-manager.sh test
# View real-time logs
./tdarr-schedule-manager.sh logs
# Check container status
podman ps | grep tdarr
# Verify GPU access
podman exec tdarr-node-gpu nvidia-smi
```
### Recovery Procedures
```bash
# Reset to defaults
./tdarr-schedule-manager.sh preset work-safe
# Reinstall scheduler
./tdarr-schedule-manager.sh install
# Manual container restart
./stop-tdarr-gpu-podman.sh
./start-tdarr-gpu-podman-clean.sh
```
## Integration Points
### External Dependencies
- **Podman**: Container runtime for node management
- **nvidia-smi**: GPU monitoring and device access
- **cron**: System scheduler for automation
- **SSH**: Remote server access (monitoring scripts)
### File System Dependencies
- **Cache Directory**: `/mnt/NV2/tdarr-cache` (local NVMe)
- **Temp Directory**: `/mnt/NV2/tdarr-cache-clean` (processing space)
- **Log Files**: `/tmp/tdarr-scheduler.log` (operational logs)
- **Configuration**: Local `tdarr-schedule.conf` file
### Network Dependencies
- **Tdarr Server**: API communication for unmapped node operation
- **Discord Webhooks**: Optional notification integration (via monitoring)
- **NAS Access**: For final file storage (post-processing only)
This operational context provides comprehensive guidance for managing active Tdarr automation scripts in production environments.

272
tdarr/troubleshooting.md Normal file
View File

@ -0,0 +1,272 @@
# Tdarr Troubleshooting Guide
## forEach Error Resolution
### Problem: TypeError: Cannot read properties of undefined (reading 'forEach')
**Symptoms**: Scanning phase fails at "Tagging video res" step, preventing all transcodes
**Root Cause**: Custom plugin mounts override community plugins with incompatible versions
### Solution: Clean Plugin Installation
1. **Remove custom plugin mounts** from docker-compose.yml
2. **Force plugin regeneration**:
```bash
ssh tdarr "docker restart tdarr"
podman restart tdarr-node-gpu
```
3. **Verify clean plugins**: Check for null-safety fixes `(streams || []).forEach()`
### Plugin Safety Patterns
```javascript
// ❌ Unsafe - causes forEach errors
args.variables.ffmpegCommand.streams.forEach()
// ✅ Safe - null-safe forEach
(args.variables.ffmpegCommand.streams || []).forEach()
```
## Staging Section Timeout Issues
### Problem: Files removed from staging after 300 seconds
**Symptoms**:
- `.tmp` files stuck in work directories
- ENOTEMPTY errors during cleanup
- Subsequent jobs blocked
### Solution: Automated Monitoring System
**Monitor Script**: `/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh`
**Automatic Actions**:
- Detects staging timeouts every 20 minutes
- Removes stuck work directories
- Sends Discord notifications
- Logs all cleanup activities
### Manual Cleanup Commands
```bash
# Check staging section
ssh tdarr "docker logs tdarr | tail -50"
# Find stuck work directories
find /mnt/NV2/tdarr-cache -name "tdarr-workDir*" -type d
# Force cleanup stuck directory
rm -rf /mnt/NV2/tdarr-cache/tdarr-workDir-[ID]
```
## System Stability Issues
### Problem: Kernel crashes during intensive transcoding
**Root Cause**: CIFS network issues during large file streaming (mapped nodes)
### Solution: Convert to Unmapped Node Architecture
1. **Enable unmapped nodes** in server Options
2. **Update node configuration**:
```bash
# Add to container environment
-e nodeType=unmapped
-e unmappedNodeCache=/cache
# Use local cache volume
-v "/mnt/NV2/tdarr-cache:/cache"
# Remove media volume (no longer needed)
```
3. **Benefits**: Eliminates CIFS streaming, prevents kernel crashes
### Container Resource Limits
```yaml
# Prevent memory exhaustion
deploy:
  resources:
    limits:
      memory: 8G
      cpus: '6'
```
## Gaming Detection Issues
### Problem: Tdarr doesn't stop during gaming
**Check gaming detection**:
```bash
# Test current gaming detection
./tdarr-schedule-manager.sh test
# View scheduler logs
tail -f /tmp/tdarr-scheduler.log
# Verify GPU usage detection
nvidia-smi
```
### Gaming Process Detection
**Monitored Processes**:
- Steam, Lutris, Heroic Games Launcher
- Wine, Bottles (Windows compatibility)
- GameMode, MangoHUD (utilities)
- **GPU usage >15%** (configurable threshold)
### Configuration Adjustments
```bash
# Edit gaming detection threshold
./tdarr-schedule-manager.sh edit
# Apply preset configurations
./tdarr-schedule-manager.sh preset gaming-only # No time limits
./tdarr-schedule-manager.sh preset night-only # 10PM-7AM only
```
## Network and Access Issues
### Server Connection Problems
**Server Access Commands**:
```bash
# SSH to Tdarr server
ssh tdarr
# Check server status
ssh tdarr "docker ps | grep tdarr"
# View server logs
ssh tdarr "docker logs tdarr"
# Access server container
ssh tdarr "docker exec -it tdarr /bin/bash"
```
### Node Registration Issues
```bash
# Check node logs
podman logs tdarr-node-gpu
# Verify node registration
# Look for "Node registered" in server logs
ssh tdarr "docker logs tdarr | grep -i node"
# Test node connectivity
curl http://10.10.0.43:8265/api/v2/status
```
## Performance Issues
### Slow Transcoding Performance
**Diagnosis**:
1. **Check cache location**: Should be local NVMe, not network
2. **Verify unmapped mode**: `nodeType=unmapped` in container
3. **Monitor I/O**: `iotop` during transcoding
**Expected Performance**:
- **Mapped nodes**: Constant SMB streaming (~100MB/s)
- **Unmapped nodes**: Download once → Process locally → Upload once
### GPU Utilization Problems
```bash
# Monitor GPU usage during transcoding
watch nvidia-smi
# Check GPU device access in container
podman exec tdarr-node-gpu nvidia-smi
# Verify NVENC encoder availability
podman exec tdarr-node-gpu ffmpeg -encoders | grep nvenc
```
## Plugin System Issues
### Plugin Loading Failures
**Troubleshooting Steps**:
1. **Check plugin directory**: Ensure no custom mounts override community plugins
2. **Verify dependencies**: FlowHelper files (`metadataUtils.js`, `letterboxUtils.js`)
3. **Test plugin syntax**:
```bash
# Test plugin in Node.js
node -e "require('./path/to/plugin.js')"
```
### Custom Plugin Integration
**Safe Integration Pattern**:
1. **Selective mounting**: Mount only specific required plugins
2. **Dependency verification**: Include all FlowHelper dependencies
3. **Version compatibility**: Ensure plugins match Tdarr version
4. **Null-safety checks**: Add `|| []` to forEach operations
## Monitoring and Logging
### Log Locations
```bash
# Scheduler logs
tail -f /tmp/tdarr-scheduler.log
# Monitor logs
tail -f /tmp/tdarr-monitor/monitor.log
# Server logs
ssh tdarr "docker logs tdarr"
# Node logs
podman logs tdarr-node-gpu
```
### Discord Notification Issues
**Check webhook configuration**:
```bash
# Test Discord webhook
curl -X POST [WEBHOOK_URL] \
-H "Content-Type: application/json" \
-d '{"content": "Test message"}'
```
**Common Issues**:
- JSON escaping in message content
- Markdown formatting in Discord
- User ping placement (outside code blocks)
## Emergency Recovery
### Complete System Reset
```bash
# Stop all containers
podman stop tdarr-node-gpu
ssh tdarr "docker stop tdarr"
# Clean cache directories
rm -rf /mnt/NV2/tdarr-cache/tdarr-workDir*
# Remove scheduler
crontab -e # Delete tdarr lines
# Restart with clean configuration
./start-tdarr-gpu-podman-clean.sh
./tdarr-schedule-manager.sh preset work-safe
./tdarr-schedule-manager.sh install
```
### Data Recovery
**Important**: Tdarr works on copies in its cache; original files remain untouched until a transcode completes successfully
- **Queue data**: Stored in server configuration (`/app/configs`)
- **Progress data**: Lost on container restart (unmapped nodes)
- **Cache files**: Safe to delete, will re-download
## Common Error Patterns
### "Copy failed" in Staging Section
**Cause**: Network timeout during file transfer to unmapped node
**Solution**: Monitoring system automatically retries
### "ENOTEMPTY" Directory Cleanup Errors
**Cause**: Partial downloads leave files in work directories
**Solution**: Force remove directories, monitoring handles automatically
### Node Disconnection During Processing
**Cause**: Gaming detection or manual stop during active job
**Result**: File returns to queue automatically, safe to restart
## Prevention Best Practices
1. **Use unmapped node architecture** for stability
2. **Implement monitoring system** for automatic cleanup
3. **Configure gaming-aware scheduling** for desktop systems
4. **Set container resource limits** to prevent crashes
5. **Use clean plugin installation** to avoid forEach errors
6. **Monitor system resources** during intensive operations
This troubleshooting guide covers the most common issues and their resolutions for production Tdarr deployments.

296
vm-management/CONTEXT.md Normal file
View File

@ -0,0 +1,296 @@
# Virtual Machine Management - Technology Context
## Overview
Virtual machine management for home lab environments with focus on automated provisioning, infrastructure as code, and security-first configuration. This context covers VM lifecycle management, Proxmox integration, and standardized deployment patterns.
## Architecture Patterns
### Infrastructure as Code (IaC) Approach
**Pattern**: Declarative VM configuration with repeatable deployments
```yaml
#cloud-config
# Cloud-init user-data template pattern
users:
  - name: cal
    groups: [sudo, docker]
    ssh_authorized_keys:
      - ssh-rsa AAAAB3... primary-key
      - ssh-rsa AAAAB3... emergency-key
packages:
  - docker.io
  - docker-compose
runcmd:
  - systemctl enable docker
  - usermod -aG docker cal
```
### Template-Based Deployment Strategy
**Pattern**: Standardized VM templates with cloud-init automation
- **Base Templates**: Ubuntu Server with cloud-init support
- **Resource Allocation**: Standardized sizing (2CPU/4GB/20GB baseline)
- **Network Configuration**: Predefined VLAN assignments (10.10.0.x internal)
- **Security Hardening**: SSH keys only, password auth disabled
## Provisioning Strategies
### Cloud-Init Deployment (Recommended for New VMs)
**Purpose**: Fully automated VM provisioning from first boot
**Implementation**:
1. Create VM in Proxmox with cloud-init support
2. Apply standardized cloud-init template
3. VM configures itself automatically on first boot
4. No manual intervention required
**Benefits**:
- Zero-touch deployment
- Consistent configuration
- Security hardening from first boot
- Immediate productivity
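Attaching the template to a Proxmox VM might look like the following sketch (VM ID, host alias, and the `local` snippets storage are assumptions; the storage must have the snippets content type enabled):
```bash
# Copy the user-data template into Proxmox's snippets store and attach it to VM 105
scp scripts/vm-management/cloud-init-user-data.yaml root@proxmox:/var/lib/vz/snippets/user-data.yaml
ssh root@proxmox "qm set 105 --cicustom 'user=local:snippets/user-data.yaml'"
ssh root@proxmox "qm start 105"
```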
### Post-Install Scripting (Existing VMs)
**Purpose**: Standardize existing VM configurations
**Implementation**:
```bash
./vm-post-install.sh <vm-ip> [username]
# Automated: updates, SSH keys, Docker, hardening
```
**Use Cases**:
- Legacy VM standardization
- Imported VM configuration
- Recovery and remediation
- Incremental improvements
## Security Architecture
### SSH Key-Based Authentication
**Pattern**: Dual key deployment for security and redundancy
```bash
# Primary access key
~/.ssh/homelab_rsa # Daily operations
# Emergency access key
~/.ssh/emergency_homelab_rsa # Backup/recovery access
```
**Security Controls**:
- Password authentication completely disabled
- Root login prohibited
- SSH keys managed centrally
- Automatic key deployment
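These controls map to a small sshd drop-in like the one below (file name is an example; assumes an OpenSSH version that reads `/etc/ssh/sshd_config.d/`):
```bash
# Disable password and root login, then validate and reload sshd
sudo tee /etc/ssh/sshd_config.d/99-hardening.conf >/dev/null <<'EOF'
PasswordAuthentication no
PermitRootLogin no
PubkeyAuthentication yes
EOF
sudo sshd -t && sudo systemctl reload ssh
```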
### User Privilege Management
**Pattern**: Least privilege with sudo elevation
```bash
# User configuration
username: cal
groups: [sudo, docker] # Minimal required groups
shell: /bin/bash
sudo: ALL=(ALL) NOPASSWD:ALL # Operational convenience
```
**Access Controls**:
- Non-root user accounts only
- Sudo required for administrative tasks
- Docker group for container management
- SSH key authentication mandatory
### Network Security
**Pattern**: Network segmentation and access control
- **Internal Network**: 10.10.0.x/24 for VM communication
- **Management Access**: SSH (port 22) only
- **Service Isolation**: Application-specific port exposure
- **Firewall Ready**: iptables/ufw configuration prepared
## Lifecycle Management Patterns
### VM Creation Workflow
1. **Template Selection**: Choose appropriate base image
2. **Resource Allocation**: Size based on workload requirements
3. **Network Assignment**: VLAN and IP address planning
4. **Cloud-Init Configuration**: Apply standardized template
5. **Automated Provisioning**: Zero-touch deployment
6. **Verification**: Automated connectivity and configuration tests
### Configuration Management
**Pattern**: Standardized system configuration
```bash
# Essential packages
packages: [
"curl", "wget", "git", "vim", "htop", "unzip",
"docker.io", "docker-compose-plugin"
]
# System services
runcmd:
- systemctl enable docker
- systemctl enable ssh
- systemctl enable unattended-upgrades
```
### Maintenance Automation
**Pattern**: Automated updates and maintenance
- **Security Updates**: Automatic installation enabled
- **Package Management**: Standardized package selection
- **Service Management**: Consistent service configuration
- **Log Management**: Centralized logging ready
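On Ubuntu-based images, the automatic security updates above are typically enabled with something like the following (a sketch; assumes apt and the stock `unattended-upgrades` package):
```bash
# Install and enable unattended security upgrades, then confirm the service
sudo apt-get update && sudo apt-get install -y unattended-upgrades
sudo dpkg-reconfigure -f noninteractive unattended-upgrades
systemctl status unattended-upgrades --no-pager
```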
## Resource Management
### Sizing Standards
**Pattern**: Standardized VM resource allocation
```yaml
# Basic workload (web services, small databases)
vcpus: 2
memory: 4096 # 4GB
disk: 20 # 20GB
# Medium workload (application servers, medium databases)
vcpus: 4
memory: 8192 # 8GB
disk: 40 # 40GB
# Heavy workload (transcoding, large databases)
vcpus: 6
memory: 16384 # 16GB
disk: 100 # 100GB
```
### Storage Strategy
**Pattern**: Application-appropriate storage allocation
- **System Disk**: OS and applications (20-40GB)
- **Data Volumes**: Application data (variable)
- **Backup Storage**: Network-attached for persistence
- **Cache Storage**: Local fast storage for performance
### Network Planning
**Pattern**: Structured network addressing
```yaml
# Network segments
management: 10.10.0.x/24 # VM management and SSH access
services: 10.10.1.x/24 # Application services
storage: 10.10.2.x/24 # Storage and backup traffic
dmz: 10.10.10.x/24 # External-facing services
```
## Monitoring and Operations
### Health Monitoring
**Pattern**: Automated system health checks
```yaml
# Resource monitoring
cpu_usage: <80%
memory_usage: <90%
disk_usage: <85%
network_connectivity: verified
# Service monitoring
ssh_service: active
docker_service: active
unattended_upgrades: active
```
### Backup Strategies
**Pattern**: Multi-tier backup approach
- **VM Snapshots**: Point-in-time recovery (Proxmox)
- **Application Data**: Specific application backup procedures
- **Configuration Backup**: Cloud-init templates and scripts
- **SSH Keys**: Centralized key management backup
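For the snapshot tier, a hedged vzdump example run on the Proxmox host; the storage name and archive path are placeholders:
```bash
# Point-in-time backup of VM 200 while it keeps running
vzdump 200 --mode snapshot --storage backup-nfs --compress zstd

# Restore the archive to a new VM ID if recovery is needed
qmrestore /mnt/pve/backup-nfs/dump/vzdump-qemu-200-<timestamp>.vma.zst 201
```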
### Performance Tuning
**Pattern**: Workload-optimized configuration
```yaml
# CPU optimization
cpu_type: host # Performance over compatibility
numa: enabled # NUMA awareness for multi-socket
# Memory optimization
ballooning: enabled # Dynamic memory allocation
hugepages: disabled # Unless specifically needed
# Storage optimization
cache: writethrough # Balance performance and safety
io_thread: enabled # Improve I/O performance
```
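Applied through Proxmox, these tunables map roughly onto `qm set` flags as below; the VM ID and disk volume are placeholders, and iothread additionally assumes the VirtIO SCSI single controller:
```bash
# CPU and memory tuning
qm set 200 --cpu host --numa 1
qm set 200 --memory 4096 --balloon 2048

# Storage tuning on the existing system disk
qm set 200 --scsihw virtio-scsi-single
qm set 200 --scsi0 local-lvm:vm-200-disk-0,cache=writethrough,iothread=1
```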
## Integration Patterns
### Container Platform Integration
**Pattern**: Docker-ready VM deployment
- Automated docker.io installation
- docker-compose plugin
- User added to the docker group
- Service auto-start enabled
- Container runtime verified
### SSH Infrastructure Integration
**Pattern**: Centralized SSH key management
```yaml
# Key deployment automation
primary_key: ~/.ssh/homelab_rsa.pub
emergency_key: ~/.ssh/emergency_homelab_rsa.pub
backup_system: automated
rotation_policy: annual
```
### Network Services Integration
**Pattern**: Ready for service deployment
- **Reverse Proxy**: Nginx/Traefik ready configuration
- **DNS**: Local DNS registration prepared
- **Certificates**: Let's Encrypt integration ready
- **Monitoring**: Prometheus/Grafana agent ready
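As a sketch of the "reverse proxy ready" claim, a minimal Nginx site fronting a containerized service on the VM; the domain and upstream port are placeholders:
```bash
sudo tee /etc/nginx/sites-available/app.homelab.lan << 'EOF'
server {
    listen 80;
    server_name app.homelab.lan;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
EOF

sudo ln -s /etc/nginx/sites-available/app.homelab.lan /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
```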
## Common Implementation Workflows
### New VM Deployment
1. **Create VM** in Proxmox with cloud-init support
2. **Configure resources** based on workload requirements
3. **Apply cloud-init template** with standardized configuration
4. **Start VM** and wait for automated provisioning
5. **Verify deployment** via SSH key authentication
6. **Deploy applications** using container or package management
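A minimal verification pass for step 5, assuming the standard key paths and the cloud-init template above:
```bash
VM_IP=10.10.0.200

# Key-based SSH works and password auth is disabled
ssh -i ~/.ssh/homelab_rsa -o BatchMode=yes cal@$VM_IP 'echo SSH OK'
ssh cal@$VM_IP 'sudo sshd -T | grep -i passwordauthentication'

# Cloud-init finished and Docker is usable without sudo
ssh cal@$VM_IP 'cloud-init status --long'
ssh cal@$VM_IP 'docker run --rm hello-world'
```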
### Existing VM Standardization
1. **Assess current configuration** and identify gaps
2. **Run post-install script** for automated updates
3. **Verify SSH key deployment** and password auth disable
4. **Test Docker installation** and user permissions
5. **Update documentation** with new configuration
6. **Schedule regular maintenance** and monitoring
### VM Migration and Recovery
1. **Create VM snapshot** before changes
2. **Export VM configuration** and cloud-init template
3. **Test recovery procedure** in staging environment
4. **Document recovery steps** and verification procedures
5. **Implement backup automation** for critical VMs
## Best Practices
### Security Hardening
1. **SSH Keys Only**: Disable password authentication completely
2. **Emergency Access**: Deploy backup SSH keys for recovery
3. **User Separation**: Non-root users with sudo privileges
4. **Automatic Updates**: Enable security update automation
5. **Network Isolation**: Use VLANs and firewall rules
### Operational Excellence
1. **Infrastructure as Code**: Use cloud-init for reproducible deployments
2. **Standardization**: Consistent VM sizing and configuration
3. **Automation**: Minimize manual configuration steps
4. **Documentation**: Maintain deployment templates and procedures
5. **Testing**: Verify deployments before production use
### Performance Optimization
1. **Resource Right-Sizing**: Match resources to workload requirements
2. **Storage Strategy**: Use appropriate storage tiers
3. **Network Optimization**: Plan network topology for performance
4. **Monitoring**: Implement resource usage monitoring
5. **Capacity Planning**: Plan for growth and scaling
This technology context provides comprehensive guidance for implementing virtual machine management in home lab and production environments using modern IaC principles and security best practices.

View File

@ -0,0 +1,652 @@
# Virtual Machine Management Troubleshooting Guide
## VM Provisioning Issues
### Cloud-Init Configuration Problems
#### Cloud-Init Not Executing
**Symptoms**:
- VM starts but user accounts not created
- SSH keys not deployed
- Packages not installed
- Configuration not applied
**Diagnosis**:
```bash
# Check cloud-init status and logs
ssh root@<vm-ip> 'cloud-init status --long'
ssh root@<vm-ip> 'cat /var/log/cloud-init.log'
ssh root@<vm-ip> 'cat /var/log/cloud-init-output.log'
# Verify cloud-init configuration
ssh root@<vm-ip> 'cloud-init query userdata'
# Check for YAML syntax errors
ssh root@<vm-ip> 'cloud-init devel schema --config-file /var/lib/cloud/instance/user-data.txt'
```
**Solutions**:
```bash
# Re-run cloud-init (CAUTION: may overwrite changes)
ssh root@<vm-ip> 'cloud-init clean --logs'
ssh root@<vm-ip> 'cloud-init init --local'
ssh root@<vm-ip> 'cloud-init init'
ssh root@<vm-ip> 'cloud-init modules --mode=config'
ssh root@<vm-ip> 'cloud-init modules --mode=final'
# Manual user creation if cloud-init fails
ssh root@<vm-ip> 'useradd -m -s /bin/bash -G sudo,docker cal'
ssh root@<vm-ip> 'mkdir -p /home/cal/.ssh'
ssh root@<vm-ip> 'chown cal:cal /home/cal/.ssh'
ssh root@<vm-ip> 'chmod 700 /home/cal/.ssh'
```
#### Invalid Cloud-Init YAML
**Symptoms**:
- Cloud-init fails with syntax errors
- Parser errors in cloud-init logs
- Partial configuration application
**Common YAML Issues**:
```yaml
# ❌ Incorrect indentation
users:
  - name: cal
  groups: [sudo, docker]            # wrong: key not nested under the list item

# ✅ Correct indentation
users:
  - name: cal
    groups: [sudo, docker]          # properly nested under the list item

# ❌ Missing quotes for special characters
ssh_authorized_keys:
  - ssh-rsa AAAAB3NzaC1... user@host        # may fail with special chars

# ✅ Quoted strings
ssh_authorized_keys:
  - "ssh-rsa AAAAB3NzaC1... user@host"
```
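User-data can also be validated before it ever reaches a VM; this assumes cloud-init is installed on the workstation (newer releases use `cloud-init schema`, older ones `cloud-init devel schema`):
```bash
# Plain YAML syntax check
python3 -c "import yaml; yaml.safe_load(open('cloud-init-user-data.yaml'))"

# Cloud-init schema validation
cloud-init schema --config-file cloud-init-user-data.yaml
```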
### VM Boot and Startup Issues
#### VM Won't Start
**Symptoms**:
- VM fails to boot from Proxmox
- Kernel panic messages
- Boot loop or hanging
**Diagnosis**:
```bash
# Check VM configuration
pvesh get /nodes/pve/qemu/<vmid>/config
# Check resource allocation
pvesh get /nodes/pve/qemu/<vmid>/status/current
# Review VM logs via Proxmox console
# Use Proxmox web interface -> VM -> Console
# Check Proxmox host resources
pvesh get /nodes/pve/status
```
**Solutions**:
```bash
# Increase memory allocation
pvesh set /nodes/pve/qemu/<vmid>/config -memory 4096
# Reset CPU configuration
pvesh set /nodes/pve/qemu/<vmid>/config -cpu host -cores 2
# Check and repair disk
# Stop VM, then:
pvesh get /nodes/pve/qemu/<vmid>/config | grep scsi0
# Use fsck on the disk image if needed
```
#### Resource Constraints
**Symptoms**:
- VM extremely slow performance
- Out-of-memory kills
- Disk I/O bottlenecks
**Diagnosis**:
```bash
# Inside VM resource check
free -h
df -h
iostat 1 5
vmstat 1 5
# Proxmox host resource check
pvesh get /nodes/pve/status
cat /proc/meminfo
df -h /var/lib/vz
```
**Solutions**:
```bash
# Increase VM resources via Proxmox
pvesh set /nodes/pve/qemu/<vmid>/config -memory 8192
pvesh set /nodes/pve/qemu/<vmid>/config -cores 4
# Resize VM disk
# Proxmox GUI: Hardware -> Hard Disk -> Resize
# Then extend filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```
## SSH Access Issues
### SSH Connection Failures
#### Cannot Connect to VM
**Symptoms**:
- Connection timeout
- Connection refused
- Host unreachable
**Diagnosis**:
```bash
# Network connectivity tests
ping <vm-ip>
traceroute <vm-ip>
# SSH service tests
nc -zv <vm-ip> 22
nmap -p 22 <vm-ip>
# From Proxmox console, check SSH service
systemctl status sshd
ss -tlnp | grep :22
```
**Solutions**:
```bash
# Via Proxmox console - restart SSH
systemctl start sshd
systemctl enable sshd
# Check and configure firewall
ufw status
# If blocking SSH:
ufw allow ssh
ufw allow 22/tcp
# Network configuration reset
ip addr show
dhclient # For DHCP
systemctl restart networking
```
#### SSH Key Authentication Failures
**Symptoms**:
- Password prompts despite key installation
- "Permission denied (publickey)"
- "No more authentication methods"
**Diagnosis**:
```bash
# Verbose SSH debugging
ssh -vvv cal@<vm-ip>
# Check key files locally
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*
# Via console or password auth, check VM
ls -la ~/.ssh/
cat ~/.ssh/authorized_keys
```
**Solutions**:
```bash
# Fix SSH directory permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
chown -R cal:cal ~/.ssh
# Re-deploy SSH keys
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key
EOF
# Verify SSH server configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"
```
#### SSH Security Configuration Issues
**Symptoms**:
- Password authentication still enabled
- Root login allowed
- Insecure SSH settings
**Diagnosis**:
```bash
# Check effective SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot|allowusers)"
# Review SSH config files
cat /etc/ssh/sshd_config
ls /etc/ssh/sshd_config.d/
```
**Solutions**:
```bash
# Apply security hardening
sudo tee /etc/ssh/sshd_config.d/99-homelab-security.conf << 'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
Protocol 2
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
X11Forwarding no
EOF
sudo systemctl restart sshd
```
## Docker Installation and Configuration Issues
### Docker Installation Failures
#### Package Installation Fails
**Symptoms**:
- Docker packages not found
- GPG key verification errors
- Repository access failures
**Diagnosis**:
```bash
# Test internet connectivity
ping google.com
curl -I https://download.docker.com
# Check repository configuration
cat /etc/apt/sources.list.d/docker.list
apt-cache policy docker-ce
# Check for package conflicts
dpkg -l | grep docker
```
**Solutions**:
```bash
# Remove conflicting packages
sudo apt remove -y docker docker-engine docker.io containerd runc
# Reinstall Docker repository
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list
# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```
#### Docker Service Issues
**Symptoms**:
- Docker daemon won't start
- Socket connection errors
- Service failure on boot
**Diagnosis**:
```bash
# Check service status
systemctl status docker
journalctl -u docker.service -f
# Check system resources
df -h
free -h
# Test daemon manually
sudo dockerd --debug
```
**Solutions**:
```bash
# Restart Docker service
sudo systemctl stop docker
sudo systemctl start docker
sudo systemctl enable docker
# Clear corrupted Docker data
sudo systemctl stop docker
sudo rm -rf /var/lib/docker/tmp/*
sudo systemctl start docker
# Reset Docker configuration
sudo mv /etc/docker/daemon.json /etc/docker/daemon.json.bak 2>/dev/null || true
sudo systemctl restart docker
```
### Docker Permission and Access Issues
#### Permission Denied Errors
**Symptoms**:
- Must use sudo for Docker commands
- "Permission denied" when accessing Docker socket
- User not in docker group
**Diagnosis**:
```bash
# Check user groups
groups
groups cal
getent group docker
# Check Docker socket permissions
ls -la /var/run/docker.sock
# Verify Docker service is running
systemctl status docker
```
**Solutions**:
```bash
# Add user to docker group
sudo usermod -aG docker cal
# Create docker group if missing
sudo groupadd docker 2>/dev/null || true
sudo usermod -aG docker cal
# Apply group membership (requires logout/login or):
newgrp docker
# Fix socket permissions
sudo chown root:docker /var/run/docker.sock
sudo chmod 664 /var/run/docker.sock
```
## Network Configuration Problems
### IP Address and Connectivity Issues
#### Incorrect IP Configuration
**Symptoms**:
- VM has wrong IP address
- No network connectivity
- Cannot reach default gateway
**Diagnosis**:
```bash
# Check network configuration
ip addr show
ip route show
cat /etc/netplan/*.yaml
# Test connectivity
ping $(ip route | grep default | awk '{print $3}') # Gateway
ping 8.8.8.8 # External connectivity
```
**Solutions**:
```bash
# Fix netplan configuration
sudo tee /etc/netplan/00-installer-config.yaml << 'EOF'
network:
version: 2
ethernets:
ens18:
dhcp4: false
addresses: [10.10.0.200/24]
gateway4: 10.10.0.1
nameservers:
addresses: [10.10.0.16, 8.8.8.8]
EOF
# Apply network configuration
sudo netplan apply
```
#### DNS Resolution Problems
**Symptoms**:
- Cannot resolve domain names
- Package downloads fail
- Host lookup failures
**Diagnosis**:
```bash
# Check DNS configuration
cat /etc/resolv.conf
systemd-resolve --status
# Test DNS resolution
nslookup google.com
dig google.com @8.8.8.8
```
**Solutions**:
```bash
# Fix DNS in netplan (see above example)
sudo netplan apply
# Temporary DNS fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
# Restart DNS services
sudo systemctl restart systemd-resolved
sudo systemctl restart networking
```
## System Maintenance Issues
### Package Management Problems
#### Update Failures
**Symptoms**:
- apt update fails
- Repository signature errors
- Dependency conflicts
**Diagnosis**:
```bash
# Check repository status
sudo apt update
apt-cache policy
# Check disk space
df -h /
df -h /var
# Check for held packages
apt-mark showhold
```
**Solutions**:
```bash
# Fix broken packages
sudo apt --fix-broken install
sudo dpkg --configure -a
# Clean package cache
sudo apt clean
sudo apt autoclean
sudo apt autoremove
# Reset problematic repositories
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys <KEYID>
sudo apt update
```
### Storage and Disk Space Issues
#### Disk Space Exhaustion
**Symptoms**:
- Cannot install packages
- Docker operations fail
- System becomes unresponsive
**Diagnosis**:
```bash
# Check disk usage
df -h
du -sh /home/* /var/* /opt/* 2>/dev/null
# Find large files
find / -size +100M 2>/dev/null | head -20
```
**Solutions**:
```bash
# Clean system files
sudo apt clean
sudo apt autoremove
sudo journalctl --vacuum-time=7d
# Clean Docker data
docker system prune -a -f
docker volume prune -f
# Extend disk (Proxmox GUI: Hardware -> Resize)
# Then extend filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```
## Emergency Recovery Procedures
### SSH Access Recovery
#### Complete SSH Lockout
**Recovery Steps**:
1. **Use Proxmox console** for direct VM access
2. **Reset SSH configuration**:
```bash
# Via console
sudo cp /etc/ssh/sshd_config.backup /etc/ssh/sshd_config 2>/dev/null || true
sudo systemctl restart sshd
```
3. **Re-enable emergency access**:
```bash
# Temporary password access for recovery
sudo passwd cal
sudo sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
sudo systemctl restart sshd
```
#### Emergency SSH Key Deployment
**If primary keys fail**:
```bash
# Use emergency key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>
# Or deploy keys via console
mkdir -p ~/.ssh
chmod 700 ~/.ssh
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key
EOF
chmod 600 ~/.ssh/authorized_keys
```
### VM Recovery and Rebuild
#### Corrupt VM Recovery
**Steps**:
1. **Create snapshot** before attempting recovery
2. **Export VM data**:
```bash
# Backup important data
rsync -av cal@<vm-ip>:/home/cal/ ./vm-backup/
```
3. **Restore from template**:
```bash
# Delete corrupt VM
pvesh delete /nodes/pve/qemu/<vmid>
# Clone from template
pvesh create /nodes/pve/qemu/<template-id>/clone -newid <vmid> -name <vm-name>
```
#### Post-Install Script Recovery
**If automation fails**:
```bash
# Run in debug mode
bash -x ./scripts/vm-management/vm-post-install.sh <vm-ip> <user>
# Manual step execution
ssh cal@<vm-ip> 'sudo apt update && sudo apt upgrade -y'
ssh cal@<vm-ip> 'curl -fsSL https://get.docker.com | sh'
ssh cal@<vm-ip> 'sudo usermod -aG docker cal'
```
## Prevention and Monitoring
### Pre-Deployment Validation
```bash
# Verify prerequisites
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*
ping 10.10.0.1
# Test cloud-init YAML
python3 -c "import yaml; yaml.safe_load(open('cloud-init-user-data.yaml'))"
```
### Health Monitoring Script
```bash
#!/bin/bash
# vm-health-check.sh
VM_IPS="10.10.0.200 10.10.0.201 10.10.0.202"
for ip in $VM_IPS; do
if ssh -o ConnectTimeout=5 -o BatchMode=yes cal@$ip 'uptime' >/dev/null 2>&1; then
echo "✅ $ip: SSH OK"
# Check Docker
if ssh cal@$ip 'docker info >/dev/null 2>&1'; then
echo "✅ $ip: Docker OK"
else
echo "❌ $ip: Docker FAILED"
fi
else
echo "❌ $ip: SSH FAILED"
fi
done
```
### Automated Backup
```bash
#!/bin/bash
# vm-backup.sh (schedule via crontab: 0 2 * * * /path/to/vm-backup.sh)
for vm_ip in 10.10.0.{200..210}; do
if ping -c1 $vm_ip >/dev/null 2>&1; then
rsync -av --exclude='.cache' cal@$vm_ip:/home/cal/ ./backups/$vm_ip/
fi
done
```
## Quick Reference Commands
### Essential VM Management
```bash
# VM control via Proxmox
pvesh get /nodes/pve/qemu/<vmid>/status/current
pvesh create /nodes/pve/qemu/<vmid>/status/start
pvesh create /nodes/pve/qemu/<vmid>/status/stop
# SSH with alternative keys
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>
# System health checks
free -h && df -h && systemctl status docker
docker system info && docker system df
```
### Recovery Resources
- **SSH Keys Backup**: `/mnt/NV2/ssh-keys/backup-*/`
- **Proxmox Console**: Direct VM access when SSH fails
- **Emergency Contact**: Use Discord notifications for critical issues
This troubleshooting guide covers diagnosis and recovery procedures for the most common VM management failures in home lab environments.