# VM to LXC Migration Testing Checklist
Comprehensive validation checklist for VM to LXC container migrations.
## Pre-Migration Checklist
### Planning Phase
- [ ] VM analyzed with migration tool: `python3 migrate_vm_to_lxc.py analyze --vmid <id>`
- [ ] Migration suitability confirmed (excellent or good)
- [ ] Migration plan generated and reviewed
- [ ] Target LXC container ID selected (not in use)
- [ ] Static IP address planned (if needed)
- [ ] Maintenance window scheduled (low-traffic period)
- [ ] Stakeholders notified (if production service)
- [ ] Rollback plan documented and understood
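
The "not in use" check for the target container ID can be scripted. The following is a minimal sketch, assuming a single Proxmox node where `qm list` and `pct list` together show every guest; the function name `id_is_free` and the ID `205` are hypothetical.

```bash
# Succeed only if the given ID is absent from the first column of
# `qm list` / `pct list` style output read on stdin.
id_is_free() {
    local wanted="$1"
    # Header rows ("VMID ...") never equal a numeric ID, so they are ignored.
    awk -v id="$wanted" '$1 == id { found = 1 } END { exit (found ? 1 : 0) }'
}
```

Possible usage on the Proxmox host: `{ qm list; pct list; } | id_is_free 205 && echo "205 is free"`. On a multi-node cluster this would miss guests on other nodes, so treat it as a local check only.
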
### Backup Phase
- [ ] VM snapshot created with the name `pre-migration-YYYY-MM-DD`
- [ ] VM snapshot verified in Proxmox UI
- [ ] Docker Compose files backed up from VM
- [ ] Docker volumes/data backed up (if applicable)
- [ ] List of running containers documented
- [ ] Environment variables documented
- [ ] Network configuration documented (IP, ports, DNS)
- [ ] External dependencies documented (databases, APIs, etc.)
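
Snapshot creation and verification can also be done from the Proxmox host shell instead of the UI. This sketch assumes the `pre-migration-YYYY-MM-DD` naming from the list above; the function names and the description text are illustrative, not part of the migration tool.

```bash
# Build the snapshot name used by this checklist, e.g. pre-migration-2025-01-11.
snapshot_name() {
    printf 'pre-migration-%s\n' "$(date +%Y-%m-%d)"
}

# Create the snapshot, then confirm it appears in the VM's snapshot list.
snapshot_vm() {
    local vmid="$1" name
    name="$(snapshot_name)"
    qm snapshot "$vmid" "$name" --description "Before VM to LXC migration" || return 1
    qm listsnapshot "$vmid" | grep -qF -- "$name"
}
```

Possible usage: `snapshot_vm <vmid> && echo "snapshot verified"`.
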
### Infrastructure Validation
- [ ] Docker LXC template exists (ID 9001 or custom)
- [ ] Target container ID available
- [ ] Sufficient storage space on target storage pool
- [ ] Network configuration confirmed (VLAN, bridge, gateway)
- [ ] DNS entries documented (update after migration if needed)
- [ ] Firewall rules documented
- [ ] Reverse proxy configuration backed up (if using NPM/Traefik)
---
## Migration Execution Checklist
### Phase 1: Pre-Migration Testing
- [ ] VM is running and healthy
- [ ] All services responding normally
- [ ] No error logs in VM
- [ ] Docker containers all running: `docker ps -a`
- [ ] Resource usage documented (CPU, RAM, disk)
- [ ] Performance baseline captured (response times, etc.)
### Phase 2: VM Shutdown
- [ ] Services gracefully stopped (if order matters)
- [ ] Docker containers stopped: `docker compose down` (optional)
- [ ] VM shut down gracefully: `shutdown -h now` inside the VM, or `qm shutdown <vmid>` from the Proxmox host
- [ ] VM status confirmed: `stopped`
- [ ] Snapshot remains intact
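
The shutdown and status confirmation steps could be combined as follows. This is a sketch only: the function name and the default of 30 polls at 10-second intervals are assumptions, not values from the migration tool.

```bash
# Request a clean guest shutdown, then poll until Proxmox reports "stopped".
# Returns non-zero if the VM has not stopped after the given number of polls.
shutdown_and_wait() {
    local vmid="$1"
    local tries="${2:-30}"      # number of polls (assumed default)
    local interval="${3:-10}"   # seconds between polls (assumed default)
    qm shutdown "$vmid" || return 1
    while [ "$tries" -gt 0 ]; do
        if qm status "$vmid" | grep -q 'status: stopped'; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$interval"
    done
    return 1
}
```

Possible usage: `shutdown_and_wait <vmid> || echo "VM did not stop, investigate before continuing"`.
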
### Phase 3: LXC Creation
- [ ] LXC created from template
- [ ] Container ID matches plan
- [ ] Hostname configured correctly
- [ ] Memory allocation set (estimated from analysis)
- [ ] CPU cores allocated (match or reduce from VM)
- [ ] Storage configured correctly
- [ ] Network configured (static IP or DHCP)
- [ ] Docker features enabled: `nesting=1,keyctl=1`
- [ ] Container set to privileged mode (unprivileged=0)
- [ ] Container configuration reviewed in Proxmox UI
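
As a rough illustration of Phase 3, the container could be cloned from the Docker template and then checked against the settings above. The function names, the 2048 MB memory size, and the 2 cores are placeholders; take the real values from the migration plan. Privilege level is not changed here, only verified.

```bash
# Clone the template and apply resource and feature settings.
# Memory and core counts are placeholders, not recommendations.
create_docker_lxc() {
    local template_id="$1" ctid="$2" hostname="$3"
    pct clone "$template_id" "$ctid" --hostname "$hostname" --full || return 1
    pct set "$ctid" --memory 2048 --cores 2 --features nesting=1,keyctl=1
}

# Verify the Docker-related settings in the resulting configuration.
verify_docker_lxc() {
    local cfg
    cfg="$(pct config "$1")" || return 1
    echo "$cfg" | grep -q '^features:.*nesting=1' || return 1
    echo "$cfg" | grep -q '^features:.*keyctl=1' || return 1
    # An unprivileged container is reported as "unprivileged: 1".
    if echo "$cfg" | grep -q '^unprivileged: 1'; then
        return 1
    fi
    return 0
}
```

Possible usage: `create_docker_lxc 9001 <ctid> <hostname> && verify_docker_lxc <ctid>`.
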
### Phase 4: LXC Initial Start
- [ ] Container started successfully
- [ ] Container status: `running`
- [ ] Container accessible via SSH
- [ ] Network connectivity confirmed: `ping 8.8.8.8`
- [ ] DNS resolution working: `nslookup google.com`
- [ ] Docker service running: `systemctl status docker`
- [ ] Docker working: `docker ps` (should be empty initially)
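
The Phase 4 checks can be run from the Proxmox host in one pass with `pct exec`. The helper names `check` and `phase4_checks` are hypothetical, and the sketch assumes `ping`, `nslookup`, and `systemctl` are available inside the container.

```bash
# Run one check quietly and print PASS or FAIL with its description.
check() {
    local desc="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS  $desc"
    else
        echo "FAIL  $desc"
        return 1
    fi
}

# Run the Phase 4 checks inside the given container; non-zero if any fail.
phase4_checks() {
    local ctid="$1" failed=0
    check "network connectivity" pct exec "$ctid" -- ping -c 3 8.8.8.8 || failed=1
    check "DNS resolution" pct exec "$ctid" -- nslookup google.com || failed=1
    check "Docker service" pct exec "$ctid" -- systemctl is-active docker || failed=1
    check "Docker CLI" pct exec "$ctid" -- docker ps || failed=1
    return "$failed"
}
```

Possible usage: `phase4_checks <ctid>`. Every check runs even if an earlier one fails, so a single pass shows the full picture.
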
---
## Service Migration Checklist
### Phase 5: Docker Configuration Transfer
- [ ] Docker Compose files copied to LXC
- [ ] Directory structure matches VM layout
- [ ] File permissions verified
- [ ] Environment files copied (.env files)
- [ ] Docker volumes path confirmed
- [ ] Data directories created (if needed)
- [ ] Configuration files reviewed for absolute paths
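
Reviewing configuration for absolute paths can be partly automated. The sketch below flags Compose volume entries whose source is an absolute host path, since those directories must also exist in the LXC. The function name is hypothetical and the pattern is a heuristic: it covers short-form entries such as `- /opt/appdata:/config` and long-form `source: /path` lines, and may miss other layouts.

```bash
# Print (with line numbers) Compose lines that bind-mount an absolute host path.
find_absolute_mounts() {
    grep -nE '^[[:space:]]*(-[[:space:]]+"?/[^:]+:|source:[[:space:]]+/)' "$1"
}
```

Possible usage: `find_absolute_mounts docker-compose.yml`, then confirm each listed path exists in the container with the expected ownership.
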
### Phase 6: Docker Containers Deployment
- [ ] Docker Compose files validated: `docker compose config`
- [ ] Images pulled successfully: `docker compose pull`
- [ ] Containers created: `docker compose up -d`
- [ ] All containers started: `docker compose ps`
- [ ] No container restart loops: `docker ps` (check STATUS)
- [ ] Container logs checked: `docker compose logs`
- [ ] No error messages in logs
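
The restart-loop and log checks above can be reduced to two small helpers. Both function names are hypothetical, and the error pattern (`error`, `fatal`, `panic`) is an assumption that will need tuning per service.

```bash
# List the names of containers currently in the "restarting" state.
restarting_containers() {
    docker ps --filter status=restarting --format '{{.Names}}'
}

# Count log lines in the current Compose project that look like errors.
compose_error_count() {
    docker compose logs --no-color 2>&1 | grep -ciE 'error|fatal|panic'
}
```

Run `compose_error_count` from the directory containing the Compose file. A container caught between restarts may briefly show as running, so it can also help to look at `docker inspect --format '{{.RestartCount}}' <container>`.
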
### Phase 7: Service Validation
- [ ] All expected containers running
- [ ] Services responding on correct ports
- [ ] Web interfaces accessible (if applicable)
- [ ] APIs responding correctly (if applicable)
- [ ] Health check endpoints passing (if configured)
- [ ] Data persistence verified (check databases, files)
- [ ] Inter-container communication working
- [ ] External service connections working (databases, APIs)
---
## Network & Connectivity Checklist
### Phase 8: Network Validation
- [ ] LXC has correct IP address: `ip addr show`
- [ ] Gateway reachable: `ping <gateway-ip>`
- [ ] Internal network access verified
- [ ] Internet access confirmed
- [ ] DNS resolution working for all required domains
- [ ] Ports accessible from other hosts: `nc -zv <lxc-ip> <port>`
- [ ] Firewall rules applied (if needed)
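
When a service exposes several ports, the `nc` check can be looped from another host on the network. This is a sketch: the function name is hypothetical and the 3-second timeout is an assumed value.

```bash
# Probe each TCP port on a host; print one line per port and
# return non-zero if any port is closed.
check_ports() {
    local host="$1" rc=0 port
    shift
    for port in "$@"; do
        if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
            echo "open    $host:$port"
        else
            echo "closed  $host:$port"
            rc=1
        fi
    done
    return "$rc"
}
```

Possible usage: `check_ports <lxc-ip> 80 443`.
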
### Phase 9: External Access
- [ ] Service accessible from local network
- [ ] Service accessible from internet (if required)
- [ ] Reverse proxy updated (if using NPM/Traefik)
- [ ] SSL certificates working (if HTTPS)
- [ ] Domain names resolving correctly
- [ ] Load balancer updated (if applicable)
---
## Performance & Stability Checklist
### Phase 10: Performance Validation
- [ ] CPU usage reasonable: `top` or `htop`
- [ ] Memory usage lower than VM: `free -h`
- [ ] Disk I/O acceptable: `iostat` or monitor in Proxmox
- [ ] Network throughput adequate: test with actual traffic
- [ ] Response times equal to or better than VM
- [ ] No performance degradation under load
### Phase 11: Resource Monitoring (First 24 Hours)
- [ ] Hour 1: Services stable, no crashes
- [ ] Hour 2: Resource usage normal
- [ ] Hour 4: No memory leaks detected
- [ ] Hour 8: Performance consistent
- [ ] Hour 24: All metrics stable
- [ ] Proxmox graphs show healthy trends
- [ ] No OOM (Out of Memory) kills: `dmesg | grep -i oom`
- [ ] No kernel errors: `dmesg | grep -i error`
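
To spot memory leaks across the 24-hour window, memory usage can be sampled at intervals and compared later. The helper below is a sketch that assumes the column layout of a recent `free` (procps-ng), where the `Mem:` row ends with the `available` column; the function name and CSV format are hypothetical.

```bash
# Read `free -m` output on stdin and print "used_mb,available_mb".
mem_sample() {
    awk '/^Mem:/ { print $3 "," $7 }'
}
```

Possible usage inside the container: `echo "$(date +%FT%T),$(free -m | mem_sample)" >> mem.csv`, run from cron every few minutes. A steadily rising first value at constant load would suggest a leak.
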
### Phase 12: Functional Testing
- [ ] Primary functionality tested end-to-end
- [ ] User workflows validated
- [ ] Scheduled jobs running (cron, etc.)
- [ ] Backups configured and tested
- [ ] Monitoring alerts configured
- [ ] Logging working correctly
- [ ] Integrations with other services functioning
---
## Data Integrity Checklist
### Phase 13: Data Validation
- [ ] Database connections working
- [ ] Data readable and writable
- [ ] File uploads/downloads working
- [ ] Cache functioning correctly
- [ ] Sessions persisting correctly
- [ ] User data accessible
- [ ] No data corruption detected
- [ ] Database migrations applied (if needed)
### Phase 14: Backup Validation
- [ ] Backup jobs configured for LXC
- [ ] Test backup created successfully
- [ ] Test restore validated
- [ ] Backup storage sufficient
- [ ] Backup retention policy set
- [ ] Backup monitoring alerts configured
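
A test backup and restore could look like the sketch below, run on the Proxmox host. The function names, the `snapshot` mode, and `zstd` compression are assumptions; the archive name is whatever `vzdump` reports on completion. The restore targets a scratch container ID and does not start it, because the restored copy carries the same network settings as the original.

```bash
# Create a backup of the container on the given storage.
backup_ct() {
    local ctid="$1" storage="$2"
    vzdump "$ctid" --storage "$storage" --mode snapshot --compress zstd
}

# Restore an archive to a scratch ID and confirm the config is readable.
test_restore() {
    local archive="$1" scratch_id="$2" storage="$3"
    pct restore "$scratch_id" "$archive" --storage "$storage" || return 1
    pct config "$scratch_id" >/dev/null
}
```

After inspecting the restored container, remove it with `pct destroy <scratch-id>`.
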
---
## Extended Monitoring Checklist
### Phase 15: Week 1 Monitoring
- [ ] Day 1: Initial 24 hours stable
- [ ] Day 2: Resource usage patterns established
- [ ] Day 3: Performance benchmarks met
- [ ] Day 4: No unexpected issues
- [ ] Day 5: Load testing passed (if applicable)
- [ ] Day 6: Weekend operations normal (if applicable)
- [ ] Day 7: Weekly summary reviewed, all green
### Phase 16: Week 2 Validation
- [ ] Week 2: Continued stability
- [ ] No memory leaks over extended period
- [ ] Disk usage growth as expected
- [ ] No unexpected restarts or crashes
- [ ] Resource utilization optimized
- [ ] Documentation updated with final configuration
---
## Rollback Checklist (If Needed)
### Emergency Rollback
- [ ] Stop LXC container: `pct stop <ctid>`
- [ ] Start original VM: `qm start <vmid>`
- [ ] Verify VM services starting
- [ ] Validate VM functionality
- [ ] Restore network access (update DNS/proxy if changed)
- [ ] Document rollback reason for analysis
- [ ] Plan remediation before retry
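
The first two rollback steps can be wrapped in one function so they run in the right order under pressure. This is a sketch with a hypothetical function name. The container is stopped first so that the VM and the LXC never hold the same IP address at once.

```bash
# Stop the LXC, start the original VM, and confirm the VM is running.
rollback_to_vm() {
    local ctid="$1" vmid="$2"
    # Continue even if the container is already stopped.
    pct stop "$ctid" || echo "warning: could not stop container $ctid" >&2
    qm start "$vmid" || return 1
    qm status "$vmid" | grep -q 'status: running'
}
```

Possible usage: `rollback_to_vm <ctid> <vmid> && echo "VM running, verify services next"`.
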
---
## Final Migration Completion Checklist
### Phase 17: Production Validation
- [ ] 1-2 weeks of stable operation confirmed
- [ ] All stakeholders confirm service quality
- [ ] Performance metrics meet or exceed VM baseline
- [ ] No outstanding issues or concerns
- [ ] Monitoring and alerting fully operational
- [ ] Documentation complete and accurate
### Phase 18: Cleanup
- [ ] VM confirmed as no longer needed
- [ ] Original VM stopped and archived
- [ ] VM snapshot retained for safety (30 days recommended) before the VM is removed
- [ ] Resources freed up (document savings)
- [ ] Migration marked complete in tracking system
- [ ] Lessons learned documented
### Phase 19: Documentation Updates
- [ ] Network diagram updated (if exists)
- [ ] IP address spreadsheet updated
- [ ] Service inventory updated
- [ ] Runbooks updated for new LXC location
- [ ] Backup documentation updated
- [ ] Disaster recovery plan updated
- [ ] Team knowledge base updated
---
## Quick Reference: Common Issues & Solutions
### Issue: Container won't start
**Check:**
- [ ] Storage space available: `pvesm status`
- [ ] Container configuration valid: `pct config <ctid>`
- [ ] No resource limits exceeded
- [ ] Logs: `journalctl -u pve-container@<ctid>`
### Issue: Docker won't start
**Check:**
- [ ] Nesting enabled: `pct config <ctid> | grep features`
- [ ] Container is privileged: `pct config <ctid> | grep unprivileged` (no output, or `unprivileged: 0`, means privileged)
- [ ] Docker service: `systemctl status docker`
- [ ] Logs: `journalctl -u docker`
### Issue: Network not working
**Check:**
- [ ] Network interface configured: `ip addr show`
- [ ] Gateway configured: `ip route show`
- [ ] DNS configured: `cat /etc/resolv.conf`
- [ ] Firewall rules: `iptables -L`
### Issue: Poor performance
**Check:**
- [ ] Resource allocation sufficient: `pct config <ctid>`
- [ ] CPU not overloaded: `cat /proc/loadavg` (load average should stay below the allocated core count)
- [ ] Memory not exhausted: `free -h`
- [ ] No I/O bottleneck: `iostat -x 1`
### Issue: Can't access services
**Check:**
- [ ] Containers running: `docker ps`
- [ ] Ports exposed: `docker ps` (PORTS column)
- [ ] Firewall rules: `iptables -L`
- [ ] Service binding: `netstat -tlnp | grep <port>` (or `ss -tlnp | grep <port>` if `netstat` is not installed)
- [ ] Reverse proxy config updated
---
## Service-Specific Checklists
### Discord Bots
- [ ] Bot token configured correctly
- [ ] Bot connected to Discord: check bot status
- [ ] Commands responding
- [ ] Database connections working (if applicable)
- [ ] Scheduled tasks running
- [ ] Logs showing normal operation
### Databases (PostgreSQL, MySQL, MongoDB)
- [ ] Database service running
- [ ] Data directory mounted correctly
- [ ] Connections from applications working
- [ ] Queries executing normally
- [ ] Backups configured
- [ ] Replication working (if applicable)
- [ ] Performance acceptable: run query benchmarks
### Plex Media Server
- [ ] Media libraries accessible
- [ ] Transcoding working (CPU or GPU)
- [ ] Streaming playback smooth
- [ ] Metadata refreshing
- [ ] Remote access configured (if needed)
- [ ] Hardware acceleration working (if configured)
### Docker-Based Web Apps
- [ ] Web interface accessible
- [ ] Login/authentication working
- [ ] Database connections functional
- [ ] File uploads working
- [ ] API endpoints responding
- [ ] SSL/TLS certificates valid
- [ ] Caching working correctly
---
## Migration Success Criteria
### Minimum Criteria (Must Have)
- ✅ All services running and accessible
- ✅ No data loss or corruption
- ✅ Performance equal to or better than VM
- ✅ 24 hours of stable operation
- ✅ No critical errors in logs
- ✅ Rollback plan tested and ready
### Optimal Criteria (Should Have)
- ✅ Resource usage reduced vs VM
- ✅ Faster startup times
- ✅ Improved I/O performance
- ✅ 1 week of stable operation
- ✅ Monitoring and alerts configured
- ✅ Documentation complete
### Excellence Criteria (Nice to Have)
- ✅ 2 weeks of flawless operation
- ✅ Measurable performance improvements
- ✅ Resource optimization completed
- ✅ Automated backups validated
- ✅ Team trained on new setup
- ✅ Migration lessons documented
---
## Notes & Best Practices
**Timing:**
- Migrate non-critical services first
- Schedule during low-traffic periods
- Allow extra time for first migration
- Plan for 2-4 hours per service initially
**Safety:**
- Always have VM snapshot before starting
- Keep VM stopped but available for 1-2 weeks
- Test rollback procedure before committing
- Document every step for repeatability
**Monitoring:**
- Watch resource usage closely first 48 hours
- Set up alerts for anomalies
- Compare to VM baseline metrics
- Keep detailed migration notes
**Optimization:**
- Start with conservative resource allocation
- Tune after monitoring actual usage
- Document optimal settings for future migrations
- Share learnings with team
---
**Checklist Version:** 1.0
**Last Updated:** 2025-01-11
**For:** Cal's Home Lab Proxmox Infrastructure