# Voice Automation Implementation Details & Insights
## Comprehensive Technical Analysis

This document captures detailed implementation insights, technical considerations, and lessons learned from analyzing a voice-controlled home automation architecture that integrates Home Assistant with Claude Code.

## Core Architecture Deep Dive

### Speech-to-Text Engine Comparison

#### OpenAI Whisper (Primary Recommendation)
**Technical Specifications:**
- **Models Available:** tiny (39MB), base (74MB), small (244MB), medium (769MB), large (1550MB)
- **Languages:** 99+ languages with varying accuracy levels
- **Accuracy:** State-of-the-art, especially for English
- **Latency:**
  - Tiny: ~100ms on CPU, ~50ms on GPU
  - Base: ~200ms on CPU, ~100ms on GPU
  - Large: ~1s on CPU, ~300ms on GPU
- **Resource Usage:**
  - CPU: 1-4 cores depending on model size
  - RAM: 1-4GB depending on model size
  - GPU: Optional but significant speedup (2-10x faster)

**Container Options:**
```bash
# Official Whisper in container
docker run -p 9000:9000 onerahmet/openai-whisper-asr-webservice:latest

# Custom optimized version with GPU
docker run --gpus all -p 9000:9000 whisper-gpu:latest
```

**API Interface:**
```python
# RESTful API example
import requests

response = requests.post(
    'http://localhost:9000/asr',
    files={'audio_file': open('command.wav', 'rb')},
    data={'task': 'transcribe', 'language': 'english'}
)
text = response.json()['text']
```
#### Vosk Alternative Analysis

**Pros:**
- Smaller memory footprint (100-200MB models)
- Faster real-time processing
- Better for streaming audio
- Multiple model sizes per language

**Cons:**
- Lower accuracy than Whisper for natural speech
- Fewer supported languages
- Less robust with accents/noise

**Use Case:** Better for command-word recognition, worse for natural language.
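The streaming-first design is where Vosk differs most from Whisper: it can emit partial results while audio is still arriving. Below is a minimal sketch of that API, assuming the `vosk` package is installed and a small English model has been downloaded; the model directory and WAV file names are placeholders.

```python
# Minimal Vosk streaming sketch (model path and audio file are placeholders)
import json
import wave

from vosk import KaldiRecognizer, Model

model = Model("vosk-model-small-en-us")
wf = wave.open("command.wav", "rb")            # expects 16 kHz mono PCM
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)                 # feed audio in small chunks
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):               # a full utterance was recognized
        print(json.loads(rec.Result())["text"])

print(json.loads(rec.FinalResult())["text"])   # flush any trailing audio
```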
#### wav2vec2 Consideration

**Facebook's Model:**
- Excellent accuracy, competitive with Whisper
- More complex setup and deployment
- Less containerized ecosystem
- **Recommendation:** Skip unless you have specific requirements
### Voice Activity Detection (VAD) Deep Dive

#### Wake Word Detection Systems

**Porcupine by Picovoice (Recommended)**
```python
# Porcupine integration example
import pvporcupine

# Built-in keywords work out of the box; custom wake words like 'hey-claude'
# are trained in the Picovoice Console and loaded via keyword_paths instead
keywords = ['hey-claude', 'computer', 'assistant']

porcupine = pvporcupine.create(
    access_key='your-access-key',
    keywords=keywords
)

while True:
    pcm = get_next_audio_frame()  # placeholder: 16kHz, 16-bit mono frame of porcupine.frame_length samples
    keyword_index = porcupine.process(pcm)
    if keyword_index >= 0:
        print(f"Wake word detected: {keywords[keyword_index]}")
        # Trigger STT pipeline
```
**Technical Requirements:**
- Continuous audio monitoring
- Low CPU usage (< 1% on modern CPUs)
- Custom wake word training available
- Privacy: all processing local

**Alternative: Snowboy (Open Source)**
- No longer actively maintained
- Still functional for basic wake words
- Completely free and local
- Lower accuracy than Porcupine
#### Push-to-Talk Implementations

**Hardware Button Integration:**
```python
# GPIO button on a Raspberry Pi
import RPi.GPIO as GPIO

BUTTON_PIN = 18

GPIO.setmode(GPIO.BCM)  # use BCM pin numbering
GPIO.setup(BUTTON_PIN, GPIO.IN, pull_up_down=GPIO.PUD_UP)

def button_callback(channel):
    if GPIO.input(channel) == GPIO.LOW:   # button pressed (pulled to ground)
        start_recording()
    else:                                 # button released
        stop_recording_and_process()

GPIO.add_event_detect(BUTTON_PIN, GPIO.BOTH, callback=button_callback, bouncetime=50)
```
**Mobile App Integration:**
- Home Assistant mobile app can trigger automations
- Custom webhook endpoints (see the sketch after this list)
- WebSocket connections for real-time triggers
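As a concrete illustration of the webhook route, the sketch below POSTs to a Home Assistant webhook from a phone shortcut or any script. The webhook ID `voice_ptt` and the payload fields are assumptions; a matching automation with a webhook trigger would need to exist in Home Assistant first.

```python
# Hypothetical push-to-talk trigger via an HA webhook; webhook ID and URL are
# placeholders. HA webhooks accept unauthenticated POSTs by design, so keep
# them reachable from the local network only.
import requests

HA_URL = "http://homeassistant.local:8123"

def trigger_push_to_talk(room: str) -> None:
    requests.post(
        f"{HA_URL}/api/webhook/voice_ptt",
        json={"room": room},
        timeout=5,
    )

trigger_push_to_talk("living_room")
```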
### Claude Code Integration Architecture

#### API Bridge Service Implementation

**Service Architecture:**
```python
# voice-bridge service structure
import json
import os

from anthropic import AsyncAnthropic  # async client, since the calls below are awaited
from fastapi import FastAPI

import homeassistant_api
import whisper_client

app = FastAPI()

class VoiceBridge:
    def __init__(self):
        self.claude = AsyncAnthropic(api_key=os.getenv('CLAUDE_API_KEY'))
        self.ha = homeassistant_api.Client(
            url=os.getenv('HA_URL'),
            token=os.getenv('HA_TOKEN')
        )
        self.whisper = whisper_client.Client(os.getenv('WHISPER_URL'))

    async def process_audio(self, audio_data):
        # Step 1: Convert audio to text
        transcript = await self.whisper.transcribe(audio_data)

        # Step 2: Send to Claude for interpretation
        ha_context = await self.get_ha_context()
        claude_response = await self.claude.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"""
                Convert this voice command to Home Assistant API calls:
                Command: "{transcript}"

                Available entities: {ha_context}

                Return JSON format for HA API calls.
                """
            }]
        )

        # Step 3: Execute HA commands
        commands = json.loads(claude_response.content[0].text)
        results = []
        for cmd in commands:
            result = await self.ha.call_service(**cmd)
            results.append(result)

        return {
            'transcript': transcript,
            'commands': commands,
            'results': results
        }

    async def get_ha_context(self):
        # Get the current state of all entities, grouped by domain
        states = await self.ha.get_states()
        return {
            'lights': [e for e in states if e['entity_id'].startswith('light.')],
            'sensors': [e for e in states if e['entity_id'].startswith('sensor.')],
            'switches': [e for e in states if e['entity_id'].startswith('switch.')],
            # ... other entity types
        }
```
#### Command Translation Patterns

**Direct Device Commands:**
```json
{
  "speech": "Turn on the living room lights",
  "claude_interpretation": {
    "intent": "light_control",
    "target": "light.living_room",
    "action": "turn_on"
  },
  "ha_api_call": {
    "service": "light.turn_on",
    "target": {"entity_id": "light.living_room"}
  }
}
```

**Scene Activation:**
```json
{
  "speech": "Set movie mode",
  "claude_interpretation": {
    "intent": "scene_activation",
    "scene": "movie_mode"
  },
  "ha_api_call": {
    "service": "scene.turn_on",
    "target": {"entity_id": "scene.movie_mode"}
  }
}
```

**Complex Logic:**
```json
{
  "speech": "Turn on lights in occupied rooms",
  "claude_interpretation": {
    "intent": "conditional_light_control",
    "condition": "occupancy_detected",
    "action": "turn_on_lights"
  },
  "ha_api_calls": [
    {
      "service": "light.turn_on",
      "target": {"entity_id": "light.bedroom"},
      "condition": "binary_sensor.bedroom_occupancy == 'on'"
    },
    {
      "service": "light.turn_on",
      "target": {"entity_id": "light.living_room"},
      "condition": "binary_sensor.living_room_occupancy == 'on'"
    }
  ]
}
```
#### Context Management Strategy

**Conversation Memory:**
```python
from datetime import datetime

class ConversationContext:
    def __init__(self):
        self.history = []
        self.context_window = 10  # Keep last 10 interactions

    def add_interaction(self, speech, response, timestamp):
        self.history.append({
            'speech': speech,
            'response': response,
            'timestamp': timestamp,
            'ha_state_snapshot': self.capture_ha_state()
        })

        # Maintain sliding window
        if len(self.history) > self.context_window:
            self.history.pop(0)

    def get_context_for_claude(self):
        return {
            'recent_commands': self.history[-3:],
            'current_time': datetime.now(),
            'house_state': self.get_current_house_state()
        }
```
**Ambiguity Resolution:**
```python
# Handle ambiguous commands
KNOWN_ROOMS = ['living room', 'bedroom', 'kitchen']

def resolve_ambiguity(transcript, available_entities):
    specific_room_mentioned = any(room in transcript.lower() for room in KNOWN_ROOMS)
    if "lights" in transcript.lower() and not specific_room_mentioned:
        return {
            'type': 'clarification_needed',
            'message': 'Which lights? I can control: living room, bedroom, kitchen',
            'options': ['light.living_room', 'light.bedroom', 'light.kitchen']
        }
```
### Home Assistant Integration Patterns

#### API Authentication & Security
```python
# Secure API setup
import os

HA_CONFIG = {
    'url': 'http://homeassistant:8123',
    'token': os.getenv('HA_LONG_LIVED_TOKEN'),  # Never hardcode tokens
    'ssl_verify': True,  # In production
    'timeout': 10
}

# Create a long-lived access token in HA:
# your user Profile -> Security -> Long-lived access tokens
```
#### WebSocket Integration for Real-time Updates
```python
import json

import websockets

async def ha_websocket_listener():
    uri = "ws://homeassistant:8123/api/websocket"

    async with websockets.connect(uri) as websocket:
        # HA sends an auth_required message first; reply with the token
        await websocket.recv()
        await websocket.send(json.dumps({
            'type': 'auth',
            'access_token': HA_TOKEN
        }))
        await websocket.recv()  # expect auth_ok before issuing commands

        # Subscribe to state changes
        await websocket.send(json.dumps({
            'id': 1,
            'type': 'subscribe_events',
            'event_type': 'state_changed'
        }))

        async for message in websocket:
            data = json.loads(message)
            if data.get('type') == 'event':
                # Process state changes for voice responses
                await process_state_change(data['event'])
```
#### Voice Response Integration
```python
# HA TTS integration for voice feedback
async def speak_response(message, entity_id='media_player.living_room'):
    await ha_client.call_service(
        'tts', 'speak',
        target={'entity_id': entity_id},
        service_data={
            'message': message,
            'language': 'en',
            'options': {'voice': 'neural'}
        }
    )

# Usage examples:
await speak_response("Living room lights turned on")
await speak_response("I couldn't find that device. Please be more specific.")
await speak_response("Movie mode activated. Enjoy your film!")
```

### Hardware & Deployment Considerations

#### Microphone Hardware Analysis

**USB Microphones (Recommended for testing):**
- Blue Yeti: Excellent quality, multiple pickup patterns
- Audio-Technica ATR2100x-USB: Professional quality
- Samson Go Mic: Compact, budget-friendly

**Professional Audio Interfaces:**
- Focusrite Scarlett Solo: Single input, professional quality
- Behringer U-Phoria UM2: Budget 2-input interface
- PreSonus AudioBox USB 96: Mid-range option

**Raspberry Pi Integration:**
```bash
# ReSpeaker HAT for Raspberry Pi
# Provides 2-4 microphone array with hardware VAD
# I2S connection, low latency
# Built-in LED ring for visual feedback

# GPIO microphone setup
sudo apt install python3-pyaudio
# Configure ALSA for USB microphones
```

**Network Microphone Distribution:**
```python
# Distributed microphone system
MICROPHONE_NODES = {
    'living_room': 'http://pi-living:8080',
    'bedroom': 'http://pi-bedroom:8080',
    'kitchen': 'http://pi-kitchen:8080'
}

# Each Pi runs a lightweight audio capture service and
# sends audio to central Whisper processing
```
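To make the distributed setup concrete, here is a hedged sketch of what a per-room capture node could look like. The `sounddevice`/`soundfile` libraries, the bridge URL, and the `/audio` endpoint are illustrative assumptions, not part of the architecture above.

```python
# Hypothetical capture node for one room; library choices, URL, and endpoint
# are placeholders for illustration only.
import io

import requests
import sounddevice as sd
import soundfile as sf

BRIDGE_URL = "http://voice-bridge:8080/audio"
SAMPLE_RATE = 16000

def capture_and_forward(seconds: float = 4.0, room: str = "kitchen") -> None:
    # Record a short mono clip at 16 kHz (what Whisper and Porcupine expect)
    audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="int16")
    sd.wait()  # block until the recording finishes

    # Encode as WAV in memory and forward to the central voice-bridge
    buf = io.BytesIO()
    sf.write(buf, audio, SAMPLE_RATE, format="WAV")
    buf.seek(0)
    requests.post(
        BRIDGE_URL,
        files={"audio": ("clip.wav", buf, "audio/wav")},
        data={"room": room},
        timeout=10,
    )
```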
#### GPU Acceleration Setup

**NVIDIA GPU Configuration:**
```yaml
# Docker Compose GPU configuration
whisper-gpu:
  image: whisper-gpu:latest
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
  environment:
    - NVIDIA_VISIBLE_DEVICES=all
```

**Performance Benchmarks (estimated):**
- **CPU Only (8-core):**
  - Whisper base: ~500ms latency
  - Whisper large: ~2000ms latency
- **With GPU (GTX 1660+):**
  - Whisper base: ~150ms latency
  - Whisper large: ~400ms latency
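Because these numbers are estimates, it is worth measuring on your own hardware. A minimal sketch against the Whisper webservice from the earlier API example (endpoint and clip name match that example; the number of runs is arbitrary):

```python
# Rough end-to-end latency measurement; results depend on hardware, model
# size, and clip length.
import time

import requests

def measure_latency(path: str, runs: int = 5) -> float:
    timings = []
    for _ in range(runs):
        with open(path, "rb") as f:
            start = time.monotonic()
            requests.post(
                "http://localhost:9000/asr",
                files={"audio_file": f},
                data={"task": "transcribe", "language": "english"},
                timeout=60,
            )
        timings.append(time.monotonic() - start)
    return sum(timings) / len(timings)

print(f"Average transcription latency: {measure_latency('command.wav'):.3f}s")
```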
#### Container Orchestration Strategy

**Complete Docker Compose Stack:**
```yaml
version: '3.8'

services:
  # Core Home Assistant
  homeassistant:
    container_name: homeassistant
    image: ghcr.io/home-assistant/home-assistant:stable
    volumes:
      - ./ha-config:/config
      - /etc/localtime:/etc/localtime:ro
    restart: unless-stopped
    # With host networking, other containers reach HA via the Docker host's IP,
    # not the "homeassistant" service name
    network_mode: host

  # Speech-to-Text Engine
  whisper:
    container_name: whisper-stt
    image: onerahmet/openai-whisper-asr-webservice:latest
    ports:
      - "9000:9000"
    environment:
      - ASR_MODEL=base
      - ASR_ENGINE=openai_whisper
    volumes:
      - whisper-models:/root/.cache/whisper
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # Voice Processing Bridge
  voice-bridge:
    container_name: voice-bridge
    build:
      context: ./voice-bridge
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      - CLAUDE_API_KEY=${CLAUDE_API_KEY}
      - HA_URL=http://homeassistant:8123  # use the host IP while HA runs with network_mode: host
      - HA_TOKEN=${HA_LONG_LIVED_TOKEN}
      - WHISPER_URL=http://whisper:9000
      - PORCUPINE_ACCESS_KEY=${PORCUPINE_ACCESS_KEY}
    volumes:
      - ./voice-bridge-config:/app/config
      - /dev/snd:/dev/snd  # Audio device access
    depends_on:
      - homeassistant
      - whisper
    restart: unless-stopped
    privileged: true  # For audio device access

  # Optional: MQTT for device communication
  mosquitto:
    container_name: mqtt-broker
    image: eclipse-mosquitto:latest
    ports:
      - "1883:1883"
      - "9001:9001"
    volumes:
      - ./mosquitto:/mosquitto
    restart: unless-stopped

  # Optional: Node-RED for visual automation
  node-red:
    container_name: node-red
    image: nodered/node-red:latest
    ports:
      - "1880:1880"
    volumes:
      - node-red-data:/data
    restart: unless-stopped

volumes:
  whisper-models:
  node-red-data:
```
### Privacy & Security Deep Analysis

#### Data Flow Security Model

**Audio Data Privacy:**
```
[Microphone] → [Local VAD] → [Local STT] → [Text Only] → [Claude API]
      ↓              ↓             ↓             ↓
 Never leaves   Never leaves  Never leaves   Encrypted,
  local net      local net     local net    HTTPS only
```

**Security Boundaries:**
1. **Audio Capture Layer:** Hardware → Local processing only
2. **Speech Recognition:** Local Whisper → No cloud STT
3. **Command Interpretation:** Text-only to Claude Code API
4. **Automation Execution:** Local Home Assistant only
#### Network Security Configuration

**Firewall Rules:**
```bash
# Only allow outbound HTTPS to the Claude API endpoint
# (hostname-based iptables rules resolve to IPs when inserted, so refresh them
# if the API's addresses change)
iptables -A OUTPUT -p tcp --dport 443 -d api.anthropic.com -j ACCEPT
iptables -A OUTPUT -p tcp --dport 443 -j DROP  # Block other outbound HTTPS

# Block all other outbound traffic from the voice containers
iptables -A OUTPUT -s voice-bridge-ip -j DROP
```
**API Key Security:**
```bash
# Environment variable best practices
echo "CLAUDE_API_KEY=your-key-here" >> .env
echo "HA_LONG_LIVED_TOKEN=your-token-here" >> .env
chmod 600 .env

# Container secrets mounting
docker run --env-file .env voice-bridge:latest
```
#### Privacy Controls Implementation

**Audio Retention Policy:**
```python
import gc

class AudioPrivacyManager:
    def __init__(self):
        self.max_retention_seconds = 5      # Keep audio only during processing
        self.transcript_retention_days = 7  # Keep transcripts short-term

    async def process_audio(self, audio_data):
        try:
            transcript = await self.stt_engine.transcribe(audio_data)
            # Process immediately
            result = await self.process_command(transcript)

            # Store transcript with expiration
            await self.store_transcript(transcript, expires_in=7*24*3600)

            return result
        finally:
            # Always delete audio data immediately
            del audio_data
            gc.collect()
```
**User Consent & Controls:**
```python
# Voice system controls in Home Assistant
VOICE_CONTROLS = {
    'input_boolean.voice_system_enabled': 'Global voice control toggle',
    'input_boolean.voice_learning_mode': 'Allow transcript storage for improvement',
    'input_select.voice_privacy_level': ['minimal', 'standard', 'enhanced'],
    'button.clear_voice_history': 'Clear all stored transcripts'
}
```
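One way the bridge could honor the global toggle is sketched below, reusing the `get_states()` call from the bridge service earlier; it assumes the `input_boolean.voice_system_enabled` helper already exists in Home Assistant.

```python
# Sketch: skip audio processing while input_boolean.voice_system_enabled is off.
async def voice_enabled(ha) -> bool:
    states = await ha.get_states()
    toggle = next(
        (s for s in states if s['entity_id'] == 'input_boolean.voice_system_enabled'),
        None,
    )
    return bool(toggle) and toggle['state'] == 'on'
```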
### Advanced Features & Future Expansion

#### Multi-Room Microphone Network

**Distributed Audio Architecture:**
```python
# Central coordinator service
import asyncio

class MultiRoomVoiceCoordinator:
    def __init__(self):
        self.microphone_nodes = {
            'living_room': MicrophoneNode('192.168.1.101'),
            'bedroom': MicrophoneNode('192.168.1.102'),
            'kitchen': MicrophoneNode('192.168.1.103')
        }

    async def listen_all_rooms(self):
        # Listen on all nodes simultaneously; the first room to hear a command wins
        tasks = [asyncio.create_task(node.listen())
                 for node in self.microphone_nodes.values()]
        done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()

        # Process audio from the first responding room
        audio_data, source_room = done.pop().result()
        return await self.process_with_context(audio_data, source_room)

    async def process_with_context(self, audio_data, room):
        # Add room context to Claude processing
        transcript = await self.stt.transcribe(audio_data)

        claude_prompt = f"""
        Voice command from {room}: "{transcript}"

        Room-specific devices available:
        {self.get_room_devices(room)}

        Convert to Home Assistant API calls.
        """
```
**Room-Aware Processing:**
```python
def get_room_devices(self, room):
    """Return devices specific to the source room"""
    room_entities = {
        'living_room': [
            'light.living_room_ceiling',
            'media_player.living_room_tv',
            'climate.living_room_thermostat'
        ],
        'bedroom': [
            'light.bedroom_bedside',
            'switch.bedroom_fan',
            'binary_sensor.bedroom_window'
        ]
    }
    return room_entities.get(room, [])
```

#### Context-Aware Conversations

**Advanced Context Management:**
```python
from datetime import datetime

class AdvancedContextManager:
    def __init__(self):
        self.conversation_sessions = {}
        self.house_state_history = []

    def create_claude_context(self, user_id, transcript):
        session = self.get_or_create_session(user_id)

        context = {
            'transcript': transcript,
            'conversation_history': session.history[-5:],
            'current_time': datetime.now().isoformat(),
            'house_state': {
                'lights_on': self.get_lights_status(),
                'occupancy': self.get_occupancy_status(),
                'weather': self.get_weather(),
                'recent_events': self.get_recent_ha_events(minutes=15)
            },
            'user_preferences': self.get_user_preferences(user_id),
            'location_context': self.get_location_context()
        }

        return self.format_claude_prompt(context)

    def format_claude_prompt(self, context):
        return f"""
        You are controlling a Home Assistant smart home system via voice commands.

        Current situation:
        - Time: {context['current_time']}
        - House state: {context['house_state']}
        - Recent conversation: {context['conversation_history']}

        User said: "{context['transcript']}"

        Convert this to Home Assistant API calls. Consider:
        1. Current device states (don't turn on lights that are already on)
        2. Time of day (different responses for morning vs night)
        3. Recent conversation context
        4. User's typical preferences

        Respond with a JSON array of Home Assistant service calls.
        """
```
#### Voice Response & Feedback Systems

**Advanced TTS Integration:**
```python
class VoiceResponseManager:
    def __init__(self):
        # Service names depend on which TTS platforms are configured in HA;
        # the 'tts' domain is passed separately in call_service below
        self.tts_engines = {
            'neural': 'cloud_say',   # High quality
            'local': 'piper_say',    # Local processing
            'espeak': 'espeak_say'   # Fallback
        }

    async def respond_with_voice(self, message, room=None, urgency='normal'):
        # Select an appropriate TTS engine based on context
        tts_engine = self.select_tts_engine(urgency)

        # Select speakers based on room or system state
        speakers = self.select_speakers(room)

        # Format message for natural speech
        speech_message = self.format_for_speech(message)

        # Send to the selected speakers
        for speaker in speakers:
            await self.ha_client.call_service(
                'tts', tts_engine,
                target={'entity_id': speaker},
                service_data={
                    'message': speech_message,
                    'options': {
                        'voice': 'neural2-en-us-standard-a',
                        'speed': 1.0,
                        'pitch': 0.0
                    }
                }
            )

    def format_for_speech(self, message):
        """Convert technical responses to natural speech"""
        replacements = {
            'light.living_room': 'living room lights',
            'switch.bedroom_fan': 'bedroom fan',
            'climate.main_thermostat': 'thermostat',
            'scene.movie_mode': 'movie mode'
        }

        for tech_term, natural_term in replacements.items():
            message = message.replace(tech_term, natural_term)

        return message
```
**Visual Feedback Integration:**
```python
# LED ring feedback on the microphone nodes (Adafruit CircuitPython NeoPixel)
import board
import neopixel

class MicrophoneVisualFeedback:
    def __init__(self, led_pin_count=12):
        self.leds = neopixel.NeoPixel(board.D18, led_pin_count)

    def show_listening(self):
        # Blue pulsing pattern
        self.animate_pulse(color=(0, 0, 255))

    def show_processing(self):
        # Spinning orange pattern
        self.animate_spin(color=(255, 165, 0))

    def show_success(self):
        # Green flash
        self.flash(color=(0, 255, 0), duration=1.0)

    def show_error(self):
        # Red flash
        self.flash(color=(255, 0, 0), duration=2.0)
```
### Performance Optimization Strategies

#### Caching & Response Time Optimization

**STT Model Caching:**
```python
class WhisperModelCache:
    def __init__(self):
        self.models = {}
        self.model_locks = {}

    async def get_model(self, model_size='base'):
        if model_size not in self.models:
            if model_size not in self.model_locks:
                self.model_locks[model_size] = asyncio.Lock()

            async with self.model_locks[model_size]:
                if model_size not in self.models:
                    self.models[model_size] = whisper.load_model(model_size)

        return self.models[model_size]
```

**Command Pattern Caching:**
```python
class CommandCache:
    def __init__(self, ttl_seconds=300):  # 5 minute TTL
        self.cache = {}
        self.ttl = ttl_seconds

    def get_cached_response(self, transcript_hash):
        if transcript_hash in self.cache:
            entry = self.cache[transcript_hash]
            if time.time() - entry['timestamp'] < self.ttl:
                return entry['response']
        return None

    def cache_response(self, transcript_hash, response):
        self.cache[transcript_hash] = {
            'response': response,
            'timestamp': time.time()
        }
```

#### Resource Management

**Memory Management:**
```python
class ResourceManager:
    def __init__(self):
        self.max_memory_usage = 4 * 1024**3  # 4GB limit

    async def process_with_memory_management(self, audio_data):
        initial_memory = self.get_memory_usage()

        try:
            if initial_memory > self.max_memory_usage * 0.8:
                await self.cleanup_memory()

            result = await self.process_audio(audio_data)
            return result

        finally:
            # Force garbage collection after each request
            gc.collect()

    async def cleanup_memory(self):
        # Clear caches, unload unused models
        self.command_cache.clear()
        self.whisper_cache.clear_unused()
        gc.collect()
```

### Error Handling & Reliability

#### Comprehensive Error Recovery

**STT Failure Handling:**
```python
class RobustSTTProcessor:
    def __init__(self):
        self.stt_engines = [
            WhisperSTT(model='base'),
            VoskSTT(),
            DeepSpeechSTT()  # Fallback options
        ]

    async def transcribe_with_fallbacks(self, audio_data):
        for i, engine in enumerate(self.stt_engines):
            try:
                transcript = await engine.transcribe(audio_data)
                if self.validate_transcript(transcript):
                    return transcript
            except Exception as e:
                logger.warning(f"STT engine {i} failed: {e}")
                if i == len(self.stt_engines) - 1:
                    raise
                continue

    def validate_transcript(self, transcript):
        # Basic validation rules
        if len(transcript.strip()) < 3:
            return False
        if transcript.count('?') > len(transcript) / 4:  # Too much uncertainty
            return False
        return True
```
**Claude API Failure Handling:**
```python
class RobustClaudeClient:
    def __init__(self):
        self.client = anthropic.AsyncAnthropic()
        self.fallback_patterns = self.load_fallback_patterns()

    async def process_command_with_fallback(self, transcript):
        try:
            # Attempt Claude processing
            response = await self.client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=1024,
                messages=[{"role": "user", "content": self.create_prompt(transcript)}],
                timeout=10.0
            )
            return json.loads(response.content[0].text)

        except (anthropic.APIError, asyncio.TimeoutError) as e:
            logger.warning(f"Claude API failed: {e}")

            # Attempt local pattern matching as a fallback
            return self.fallback_command_processing(transcript)

    def fallback_command_processing(self, transcript):
        """Simple pattern matching for basic commands when Claude is unavailable"""
        transcript = transcript.lower()

        # Basic light controls
        if 'turn on' in transcript and 'light' in transcript:
            room = self.extract_room(transcript)
            return [{
                'service': 'light.turn_on',
                'target': {'entity_id': f'light.{room or "all"}'}
            }]

        # Basic switch controls
        if 'turn off' in transcript and ('light' in transcript or 'switch' in transcript):
            room = self.extract_room(transcript)
            return [{
                'service': 'light.turn_off',
                'target': {'entity_id': f'light.{room or "all"}'}
            }]

        # Scene activation
        scenes = ['movie', 'bedtime', 'morning', 'evening']
        for scene in scenes:
            if scene in transcript:
                return [{
                    'service': 'scene.turn_on',
                    'target': {'entity_id': f'scene.{scene}'}
                }]

        # If no patterns match, return an error
        return [{'error': 'Command not recognized in offline mode'}]
```
#### Health Monitoring & Diagnostics

**System Health Monitoring:**
```python
class VoiceSystemHealthMonitor:
    def __init__(self):
        self.health_checks = {
            'whisper_api': self.check_whisper_health,
            'claude_api': self.check_claude_health,
            'homeassistant_api': self.check_ha_health,
            'microphone_nodes': self.check_microphone_health
        }

    async def run_health_checks(self):
        results = {}

        for service, check_func in self.health_checks.items():
            try:
                results[service] = await check_func()
            except Exception as e:
                results[service] = {
                    'status': 'unhealthy',
                    'error': str(e),
                    'timestamp': datetime.now().isoformat()
                }

        return results

    async def check_whisper_health(self):
        start_time = time.time()
        async with aiohttp.ClientSession() as session:
            async with session.get('http://whisper:9000/health') as response:
                latency = time.time() - start_time
                status = response.status

        return {
            'status': 'healthy' if status == 200 else 'unhealthy',
            'latency_ms': int(latency * 1000),
            'timestamp': datetime.now().isoformat()
        }

    # Similar checks for the other services...
```
**Automated Recovery Actions:**
```python
class AutoRecoveryManager:
    def __init__(self):
        self.recovery_actions = {
            'whisper_unhealthy': self.restart_whisper_service,
            'high_memory_usage': self.cleanup_resources,
            'claude_rate_limited': self.enable_fallback_mode,
            'microphone_disconnected': self.reinitialize_audio
        }

    async def handle_health_issue(self, issue_type, details):
        if issue_type in self.recovery_actions:
            logger.info(f"Attempting recovery for {issue_type}")
            await self.recovery_actions[issue_type](details)
        else:
            logger.error(f"No recovery action for {issue_type}")
            await self.alert_administrators(issue_type, details)
```

### Testing & Validation Strategies

#### Audio Processing Testing

**STT Accuracy Testing:**
```python
class STTAccuracyTester:
    def __init__(self):
        self.test_phrases = [
            "Turn on the living room lights",
            "Set the temperature to 72 degrees",
            "Activate movie mode",
            "What's the weather like outside",
            "Turn off all lights",
            "Lock all doors"
        ]

    async def run_accuracy_tests(self, stt_engine):
        results = []

        for phrase in self.test_phrases:
            # Generate synthetic audio from the phrase
            audio_data = await self.text_to_speech(phrase)

            # Test STT accuracy
            transcript = await stt_engine.transcribe(audio_data)

            accuracy = self.calculate_word_accuracy(phrase, transcript)
            results.append({
                'original': phrase,
                'transcript': transcript,
                'accuracy': accuracy
            })

        return results

    def calculate_word_accuracy(self, reference, hypothesis):
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()

        # Simple per-position word accuracy (a rough proxy for word error rate)
        correct = sum(1 for r, h in zip(ref_words, hyp_words) if r == h)
        return correct / len(ref_words) if ref_words else 0
```
#### End-to-End Integration Testing

**Complete Pipeline Testing:**
```python
class E2ETestSuite:
    def __init__(self):
        self.test_scenarios = [
            {
                'name': 'basic_light_control',
                'audio_file': 'tests/audio/turn_on_lights.wav',
                'expected_ha_calls': [
                    {'service': 'light.turn_on', 'target': {'entity_id': 'light.living_room'}}
                ]
            },
            {
                'name': 'complex_scene_activation',
                'audio_file': 'tests/audio/movie_mode.wav',
                'expected_ha_calls': [
                    {'service': 'scene.turn_on', 'target': {'entity_id': 'scene.movie'}}
                ]
            }
        ]

    async def run_full_pipeline_tests(self):
        results = []

        for scenario in self.test_scenarios:
            result = await self.test_scenario(scenario)
            results.append(result)

        return results

    async def test_scenario(self, scenario):
        # Load test audio
        with open(scenario['audio_file'], 'rb') as f:
            audio_data = f.read()

        # Run through the complete pipeline
        try:
            actual_calls = await self.voice_bridge.process_audio(audio_data)

            # Compare with expected results
            match = self.compare_ha_calls(
                scenario['expected_ha_calls'],
                actual_calls['commands']
            )

            return {
                'scenario': scenario['name'],
                'success': match,
                'expected': scenario['expected_ha_calls'],
                'actual': actual_calls['commands']
            }

        except Exception as e:
            return {
                'scenario': scenario['name'],
                'success': False,
                'error': str(e)
            }
```
## Implementation Timeline & Milestones

### Phase 1: Foundation (Weeks 1-2)
**Goals:**
- Home Assistant stable deployment
- Basic container infrastructure
- Initial device integration

**Success Criteria:**
- HA accessible and controlling existing devices
- Container stack running reliably
- Basic automations working

### Phase 2: Core Voice System (Weeks 3-4)
**Goals:**
- Whisper STT deployment
- Basic voice-bridge service
- Simple command processing

**Success Criteria:**
- Speech-to-text working with test audio files
- Claude Code API integration functional
- Basic "turn on lights" commands working

### Phase 3: Production Features (Weeks 5-6)
**Goals:**
- Wake word detection
- Multi-room microphone support
- Advanced error handling

**Success Criteria:**
- Hands-free operation with wake words
- Reliable operation across multiple rooms
- Graceful failure modes working

### Phase 4: Optimization & Polish (Weeks 7-8)
**Goals:**
- Performance optimization
- Advanced context awareness
- Visual/audio feedback systems

**Success Criteria:**
- Sub-500ms response times
- Context-aware conversations
- Family-friendly operation

## Cost Analysis

### Hardware Costs
- **Microphones:** $50-200 per room
- **Processing Hardware:** Covered by existing Proxmox setup
- **Additional Storage:** ~50GB for models and logs

### Service Costs
- **Claude Code API:** ~$0.01-0.10 per command (depending on context size)
- **Porcupine Wake Words:** $0.50-2.00 per month per wake word
- **No cloud STT costs** (fully local)

### Estimated Monthly Operating Costs
- **Light Usage (10 commands/day):** ~$3-10/month
- **Heavy Usage (50 commands/day):** ~$15-50/month
- **Wake word licensing:** ~$2-5/month
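A quick back-of-the-envelope check of the heavy-usage range, using assumed mid-range inputs from the Service Costs list above rather than measured values:

```python
# Worked example of the "heavy usage" estimate; all inputs are assumptions
# drawn from the ranges above.
commands_per_day = 50
cost_per_command = 0.02      # USD, mid-range of the $0.01-0.10 estimate
wake_word_license = 2.00     # USD per month

monthly = commands_per_day * 30 * cost_per_command + wake_word_license
print(f"Estimated monthly cost: ${monthly:.2f}")  # -> $32.00 with these inputs
```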
## Conclusion & Next Steps

This voice automation system represents a cutting-edge approach to local smart home control, combining the latest in speech recognition with advanced AI interpretation. The architecture prioritizes privacy, reliability, and extensibility while maintaining the local-only operation you want.

**Key Success Factors:**
1. **Proven Technology Stack:** Whisper + Claude Code + Home Assistant
2. **Privacy-First Design:** Audio never leaves the local network
3. **Flexible Architecture:** Easy to extend and customize
4. **Reliable Fallbacks:** Multiple failure recovery mechanisms

**Recommended Implementation Approach:**
1. Start with the Home Assistant foundation
2. Add voice components incrementally
3. Test thoroughly at each phase
4. Optimize for your specific use patterns

The combination of your technical expertise, existing infrastructure, and this architecture plan sets you up to build a truly advanced, private, and powerful voice-controlled smart home.

This system provides the advanced automation capabilities that Apple Home lacks while preserving the local control and privacy that drove your original interest in Home Assistant. Adding Claude Code as the natural language processing layer bridges the gap between human speech and technical automation in a way that would be extremely difficult to achieve with traditional rule-based systems.