Voice Automation Implementation Details & Insights

Comprehensive Technical Analysis

This document captures detailed implementation insights, technical considerations, and lessons learned from analyzing voice-controlled home automation architecture for integration with Home Assistant and Claude Code.

Core Architecture Deep Dive

Speech-to-Text Engine Comparison

OpenAI Whisper (Primary Recommendation)

Technical Specifications:

  • Models Available: tiny (39M parameters), base (74M), small (244M), medium (769M), large (1550M)
  • Languages: 99+ languages with varying accuracy levels
  • Accuracy: State-of-the-art, especially for English
  • Latency:
    • Tiny: ~100ms on CPU, ~50ms on GPU
    • Base: ~200ms on CPU, ~100ms on GPU
    • Large: ~1s on CPU, ~300ms on GPU
  • Resource Usage:
    • CPU: 1-4 cores depending on model size
    • RAM: 1-4GB depending on model size
    • GPU: Optional but significant speedup (2-10x faster)

Container Options:

# Official Whisper in container
docker run -p 9000:9000 onerahmet/openai-whisper-asr-webservice:latest

# Custom optimized version with GPU
docker run --gpus all -p 9000:9000 whisper-gpu:latest

API Interface:

# RESTful API example (whisper-asr-webservice takes task/language/output as query parameters)
import requests

response = requests.post(
    'http://localhost:9000/asr',
    params={'task': 'transcribe', 'language': 'en', 'output': 'json'},
    files={'audio_file': open('command.wav', 'rb')},
)
text = response.json()['text']

Vosk Alternative Analysis

Pros:

  • Smaller memory footprint (100-200MB models)
  • Faster real-time processing
  • Better for streaming audio
  • Multiple model sizes per language

Cons:

  • Lower accuracy than Whisper for natural speech
  • Fewer supported languages
  • Less robust with accents/noise

Use Case: Better for command-word recognition, worse for natural language
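
For comparison, a minimal Vosk sketch (assuming a model directory downloaded from the Vosk site and a 16 kHz mono WAV file):

import json
import wave

from vosk import Model, KaldiRecognizer

model = Model('vosk-model-small-en-us-0.15')  # path to an unpacked Vosk model directory
wf = wave.open('command.wav', 'rb')
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if not data:
        break
    rec.AcceptWaveform(data)  # audio is fed incrementally, which is why Vosk streams well

print(json.loads(rec.FinalResult())['text'])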

wav2vec2 Consideration

Facebook's Model:

  • Excellent accuracy competitive with Whisper
  • More complex setup and deployment
  • Less containerized ecosystem
  • Recommendation: Skip unless specific requirements

Voice Activity Detection (VAD) Deep Dive

Wake Word Detection Systems

Porcupine by Picovoice (Recommended)

# Porcupine integration example
import pvporcupine

# Built-in keywords ship with the library; a custom phrase like "hey Claude"
# requires a trained .ppn file passed via keyword_paths instead of keywords.
keywords = ['computer', 'jarvis']
porcupine = pvporcupine.create(
    access_key='your-access-key',
    keywords=keywords
)

while True:
    pcm = get_next_audio_frame()  # porcupine.frame_length samples: 16kHz, 16-bit, mono
    keyword_index = porcupine.process(pcm)
    if keyword_index >= 0:
        print(f"Wake word detected: {keywords[keyword_index]}")
        # Trigger STT pipeline

Technical Requirements:

  • Continuous audio monitoring
  • Low CPU usage (< 1% on modern CPUs)
  • Custom wake word training available
  • Privacy: all processing local

Alternative: Snowboy (Open Source)

  • No longer actively maintained
  • Still functional for basic wake words
  • Completely free and local
  • Lower accuracy than Porcupine

Push-to-Talk Implementations

Hardware Button Integration:

# GPIO button on Raspberry Pi
import RPi.GPIO as GPIO

BUTTON_PIN = 18
GPIO.setmode(GPIO.BCM)  # Broadcom pin numbering
GPIO.setup(BUTTON_PIN, GPIO.IN, pull_up_down=GPIO.PUD_UP)

def button_callback(channel):
    if GPIO.input(channel) == GPIO.LOW:
        start_recording()               # button pressed
    else:
        stop_recording_and_process()    # button released

GPIO.add_event_detect(BUTTON_PIN, GPIO.BOTH, callback=button_callback, bouncetime=50)

Mobile App Integration:

  • Home Assistant mobile app can trigger automations
  • Custom webhook endpoints
  • WebSocket connections for real-time triggers
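
For the webhook option above, a minimal push-to-talk sketch: an endpoint on the voice-bridge FastAPI service (described in the next section) that accepts a recorded clip from the Home Assistant companion app or any other HTTP client. The /ptt route name is illustrative.

from fastapi import FastAPI, UploadFile

app = FastAPI()
bridge = VoiceBridge()  # bridge class shown in the next section

@app.post("/ptt")
async def push_to_talk(audio: UploadFile):
    audio_bytes = await audio.read()
    # Hand off to the normal STT -> Claude -> Home Assistant pipeline
    return await bridge.process_audio(audio_bytes)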

Claude Code Integration Architecture

API Bridge Service Implementation

Service Architecture:

# voice-bridge service structure
import os
import json
import asyncio

from fastapi import FastAPI
from anthropic import AsyncAnthropic
import homeassistant_api
import whisper_client  # project-specific wrapper around the Whisper container API

app = FastAPI()

class VoiceBridge:
    def __init__(self):
        self.claude = AsyncAnthropic(api_key=os.getenv('CLAUDE_API_KEY'))
        self.ha = homeassistant_api.Client(
            url=os.getenv('HA_URL'),
            token=os.getenv('HA_TOKEN')
        )
        self.whisper = whisper_client.Client(os.getenv('WHISPER_URL'))
        
    async def process_audio(self, audio_data):
        # Step 1: Convert audio to text
        transcript = await self.whisper.transcribe(audio_data)
        
        # Step 2: Send to Claude for interpretation
        ha_context = await self.get_ha_context()
        claude_response = await self.claude.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{
                "role": "user", 
                "content": f"""
                Convert this voice command to Home Assistant API calls:
                Command: "{transcript}"
                
                Available entities: {ha_context}
                
                Return JSON format for HA API calls.
                """
            }]
        )
        
        # Step 3: Execute HA commands
        commands = json.loads(claude_response.content[0].text)
        results = []
        for cmd in commands:
            result = await self.ha.call_service(**cmd)
            results.append(result)
            
        return {
            'transcript': transcript,
            'commands': commands,
            'results': results
        }
        
    async def get_ha_context(self):
        # Get current state of all entities
        states = await self.ha.get_states()
        return {
            'lights': [e for e in states if e['entity_id'].startswith('light.')],
            'sensors': [e for e in states if e['entity_id'].startswith('sensor.')],
            'switches': [e for e in states if e['entity_id'].startswith('switch.')],
            # ... other entity types
        }

Command Translation Patterns

Direct Device Commands:

{
  "speech": "Turn on the living room lights",
  "claude_interpretation": {
    "intent": "light_control",
    "target": "light.living_room",
    "action": "turn_on"
  },
  "ha_api_call": {
    "service": "light.turn_on",
    "target": {"entity_id": "light.living_room"}
  }
}

Scene Activation:

{
  "speech": "Set movie mode",
  "claude_interpretation": {
    "intent": "scene_activation",
    "scene": "movie_mode"
  },
  "ha_api_call": {
    "service": "scene.turn_on",
    "target": {"entity_id": "scene.movie_mode"}
  }
}

Complex Logic:

{
  "speech": "Turn on lights in occupied rooms",
  "claude_interpretation": {
    "intent": "conditional_light_control",
    "condition": "occupancy_detected",
    "action": "turn_on_lights"
  },
  "ha_api_calls": [
    {
      "service": "light.turn_on",
      "target": {"entity_id": "light.bedroom"},
      "condition": "binary_sensor.bedroom_occupancy == 'on'"
    },
    {
      "service": "light.turn_on", 
      "target": {"entity_id": "light.living_room"},
      "condition": "binary_sensor.living_room_occupancy == 'on'"
    }
  ]
}
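
Note that Home Assistant service calls do not accept a condition field, so the bridge has to evaluate those conditions against current entity states before executing. A minimal sketch, assuming a get_state helper that wraps GET /api/states/<entity_id> alongside the call_service client used elsewhere in this document:

async def execute_conditional_calls(ha, ha_api_calls):
    """Run each call only if its (optional) simple condition holds."""
    results = []
    for call in ha_api_calls:
        condition = call.pop('condition', None)
        if condition:
            # Parse the "binary_sensor.x_occupancy == 'on'" form used above
            entity_id, _, expected = condition.partition('==')
            state = await ha.get_state(entity_id.strip())  # assumed helper
            if state != expected.strip().strip("'"):
                continue  # condition not met, skip this call
        results.append(await ha.call_service(**call))
    return results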

Context Management Strategy

Conversation Memory:

class ConversationContext:
    def __init__(self):
        self.history = []
        self.context_window = 10  # Keep last 10 interactions
        
    def add_interaction(self, speech, response, timestamp):
        self.history.append({
            'speech': speech,
            'response': response, 
            'timestamp': timestamp,
            'ha_state_snapshot': self.capture_ha_state()
        })
        
        # Maintain sliding window
        if len(self.history) > self.context_window:
            self.history.pop(0)
            
    def get_context_for_claude(self):
        return {
            'recent_commands': self.history[-3:],
            'current_time': datetime.now(),
            'house_state': self.get_current_house_state()
        }

Ambiguity Resolution:

# Handle ambiguous commands
def resolve_ambiguity(transcript, available_entities):
    known_rooms = ['living room', 'bedroom', 'kitchen']
    specific_room_mentioned = any(room in transcript.lower() for room in known_rooms)

    if "lights" in transcript.lower() and not specific_room_mentioned:
        return {
            'type': 'clarification_needed',
            'message': 'Which lights? I can control: living room, bedroom, kitchen',
            'options': ['light.living_room', 'light.bedroom', 'light.kitchen']
        }
    return None  # No clarification needed

Home Assistant Integration Patterns

API Authentication & Security

# Secure API setup
HA_CONFIG = {
    'url': 'http://homeassistant:8123',
    'token': os.getenv('HA_LONG_LIVED_TOKEN'),  # Never hardcode
    'ssl_verify': True,  # In production
    'timeout': 10
}

# Create a long-lived access token in HA:
# Profile (click your username) -> Security -> Long-lived access tokens
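
A quick way to verify the token works is to hit the standard Home Assistant REST API directly:

import os
import requests

headers = {'Authorization': f"Bearer {os.getenv('HA_LONG_LIVED_TOKEN')}"}

# GET /api/states is part of the standard HA REST API
response = requests.get('http://homeassistant:8123/api/states', headers=headers, timeout=10)
for entity in response.json()[:5]:
    print(entity['entity_id'], entity['state'])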

WebSocket Integration for Real-time Updates

import websockets
import json

async def ha_websocket_listener():
    uri = "ws://homeassistant:8123/api/websocket"
    
    async with websockets.connect(uri) as websocket:
        # HA sends an auth_required message first, then expects the token
        await websocket.recv()  # auth_required
        await websocket.send(json.dumps({
            'type': 'auth',
            'access_token': HA_TOKEN
        }))
        await websocket.recv()  # auth_ok (or auth_invalid)
        
        # Subscribe to state changes
        await websocket.send(json.dumps({
            'id': 1,
            'type': 'subscribe_events',
            'event_type': 'state_changed'
        }))
        
        async for message in websocket:
            data = json.loads(message)
            if data.get('type') == 'event':
                # Process state changes for voice responses
                await process_state_change(data['event'])

Voice Response Integration

# HA TTS integration for voice feedback
# (exact service name and data fields depend on the TTS integration configured in HA)
async def speak_response(message, entity_id='media_player.living_room'):
    await ha_client.call_service(
        'tts', 'speak',
        target={'entity_id': entity_id},
        service_data={
            'message': message,
            'language': 'en',
            'options': {'voice': 'neural'}
        }
    )

# Usage examples:
await speak_response("Living room lights turned on")
await speak_response("I couldn't find that device. Please be more specific.")
await speak_response("Movie mode activated. Enjoy your film!")

Hardware & Deployment Considerations

Microphone Hardware Analysis

USB Microphones (Recommended for testing):

  • Blue Yeti: Excellent quality, multiple pickup patterns
  • Audio-Technica ATR2100x-USB: Professional quality
  • Samson Go Mic: Compact, budget-friendly

Professional Audio Interfaces:

  • Focusrite Scarlett Solo: Single input, professional quality
  • Behringer U-Phoria UM2: Budget 2-input interface
  • PreSonus AudioBox USB 96: Mid-range option

Raspberry Pi Integration:

# ReSpeaker HAT for Raspberry Pi
# Provides 2-4 microphone array with hardware VAD
# I2S connection, low latency
# Built-in LED ring for visual feedback

# GPIO microphone setup
sudo apt install python3-pyaudio
# Configure ALSA for USB microphones

Network Microphone Distribution:

# Distributed microphone system
MICROPHONE_NODES = {
    'living_room': 'http://pi-living:8080',
    'bedroom': 'http://pi-bedroom:8080', 
    'kitchen': 'http://pi-kitchen:8080'
}

# Each Pi runs lightweight audio capture service
# Sends audio to central Whisper processing
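
A rough sketch of one of those per-room capture nodes: record a short 16 kHz mono clip with PyAudio and forward it to the central bridge for processing. The endpoint URL and clip length are illustrative.

import io
import wave

import pyaudio
import requests

RATE, CHUNK, SECONDS = 16000, 1024, 5
BRIDGE_URL = 'http://voice-bridge:8080/ptt'  # illustrative central endpoint

def record_clip():
    """Capture a short clip from the default input device and return WAV bytes."""
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, frames_per_buffer=CHUNK)
    frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * SECONDS))]
    stream.stop_stream()
    stream.close()
    pa.terminate()

    buf = io.BytesIO()
    with wave.open(buf, 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit samples
        wf.setframerate(RATE)
        wf.writeframes(b''.join(frames))
    return buf.getvalue()

requests.post(BRIDGE_URL, files={'audio': ('clip.wav', record_clip(), 'audio/wav')})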

GPU Acceleration Setup

NVIDIA GPU Configuration:

# Docker Compose GPU configuration
whisper-gpu:
  image: whisper-gpu:latest
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
  environment:
    - NVIDIA_VISIBLE_DEVICES=all

Performance Benchmarks (estimated):

  • CPU Only (8-core):
    • Whisper base: ~500ms latency
    • Whisper large: ~2000ms latency
  • With GPU (GTX 1660+):
    • Whisper base: ~150ms latency
    • Whisper large: ~400ms latency
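
These numbers are rough estimates; a quick way to measure real latency on your own hardware with the openai-whisper Python package (test_command.wav is any short recorded command):

import time

import whisper

model = whisper.load_model('base')  # swap in 'tiny', 'small', etc. to compare

start = time.time()
result = model.transcribe('test_command.wav')
print(f"Transcript: {result['text']}")
print(f"Latency: {(time.time() - start) * 1000:.0f} ms")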

Container Orchestration Strategy

Complete Docker Compose Stack:

version: '3.8'

services:
  # Core Home Assistant
  homeassistant:
    container_name: homeassistant
    image: ghcr.io/home-assistant/home-assistant:stable
    volumes:
      - ./ha-config:/config
      - /etc/localtime:/etc/localtime:ro
    restart: unless-stopped
    network_mode: host
    
  # Speech-to-Text Engine  
  whisper:
    container_name: whisper-stt
    image: onerahmet/openai-whisper-asr-webservice:latest
    ports:
      - "9000:9000"
    environment:
      - ASR_MODEL=base
      - ASR_ENGINE=openai_whisper
    volumes:
      - whisper-models:/root/.cache/whisper
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
              
  # Voice Processing Bridge
  voice-bridge:
    container_name: voice-bridge
    build: 
      context: ./voice-bridge
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      - CLAUDE_API_KEY=${CLAUDE_API_KEY}
      - HA_URL=http://homeassistant:8123  # use the Docker host's IP here, since HA runs with network_mode: host
      - HA_TOKEN=${HA_LONG_LIVED_TOKEN}
      - WHISPER_URL=http://whisper:9000
      - PORCUPINE_ACCESS_KEY=${PORCUPINE_ACCESS_KEY}
    volumes:
      - ./voice-bridge-config:/app/config
      - /dev/snd:/dev/snd  # Audio device access
    depends_on:
      - homeassistant
      - whisper
    restart: unless-stopped
    privileged: true  # For audio device access
    
  # Optional: MQTT for device communication
  mosquitto:
    container_name: mqtt-broker
    image: eclipse-mosquitto:latest
    ports:
      - "1883:1883"
      - "9001:9001"
    volumes:
      - ./mosquitto:/mosquitto
    restart: unless-stopped
    
  # Optional: Node-RED for visual automation
  node-red:
    container_name: node-red
    image: nodered/node-red:latest
    ports:
      - "1880:1880"
    volumes:
      - node-red-data:/data
    restart: unless-stopped
    
volumes:
  whisper-models:
  node-red-data:

Privacy & Security Deep Analysis

Data Flow Security Model

Audio Data Privacy:

[Microphone] → [Local VAD] → [Local STT] → [Text Only] → [Claude API]
     ↓              ↓             ↓            ↓
 Never leaves   Never leaves  Never leaves  Encrypted
  local net     local net     local net     HTTPS only

Security Boundaries:

  1. Audio Capture Layer: Hardware → Local processing only
  2. Speech Recognition: Local Whisper → No cloud STT
  3. Command Interpretation: Text-only to Claude Code API
  4. Automation Execution: Local Home Assistant only

Network Security Configuration

Firewall Rules:

# Only allow outbound HTTPS for the Claude API
# (note: -d with a hostname resolves to fixed IPs when the rule is added,
#  so a proxy or DNS-based egress filter is more robust for a CDN-backed API)
iptables -A OUTPUT -p tcp --dport 443 -d api.anthropic.com -j ACCEPT
iptables -A OUTPUT -p tcp --dport 443 -j DROP  # Block other HTTPS

# Block all other outbound traffic from the voice containers (substitute the container IP)
iptables -A OUTPUT -s <voice-bridge-ip> -j DROP

API Key Security:

# Environment variable best practices
echo "CLAUDE_API_KEY=your-key-here" >> .env
echo "HA_LONG_LIVED_TOKEN=your-token-here" >> .env
chmod 600 .env

# Container secrets mounting
docker run --env-file .env voice-bridge:latest

Privacy Controls Implementation

Audio Retention Policy:

class AudioPrivacyManager:
    def __init__(self):
        self.max_retention_seconds = 5  # Keep audio only during processing
        self.transcript_retention_days = 7  # Keep transcripts short-term
        
    async def process_audio(self, audio_data):
        try:
            transcript = await self.stt_engine.transcribe(audio_data)
            # Process immediately
            result = await self.process_command(transcript)
            
            # Store transcript with expiration
            await self.store_transcript(transcript, expires_in=7*24*3600)
            
            return result
        finally:
            # Always delete audio data immediately
            del audio_data
            gc.collect()

User Consent & Controls:

# Voice system controls in Home Assistant
VOICE_CONTROLS = {
    'input_boolean.voice_system_enabled': 'Global voice control toggle',
    'input_boolean.voice_learning_mode': 'Allow transcript storage for improvement',
    'input_select.voice_privacy_level': ['minimal', 'standard', 'enhanced'],
    'button.clear_voice_history': 'Clear all stored transcripts'
}
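
A sketch of how the voice-bridge could honor those controls before doing any processing; get_state is an assumed thin wrapper over the HA REST API, and the entity IDs match the table above:

async def get_voice_permissions(ha):
    """Read the voice-control entities before running the pipeline."""
    enabled = await ha.get_state('input_boolean.voice_system_enabled') == 'on'
    # Only keep transcripts when the user has opted in via learning mode
    store_transcripts = await ha.get_state('input_boolean.voice_learning_mode') == 'on'
    return {'enabled': enabled, 'store_transcripts': store_transcripts}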

Advanced Features & Future Expansion

Multi-Room Microphone Network

Distributed Audio Architecture:

# Central coordinator service
class MultiRoomVoiceCoordinator:
    def __init__(self):
        self.microphone_nodes = {
            'living_room': MicrophoneNode('192.168.1.101'),
            'bedroom': MicrophoneNode('192.168.1.102'),
            'kitchen': MicrophoneNode('192.168.1.103')
        }
        
    async def listen_all_rooms(self):
        # Simultaneous listening across all nodes
        tasks = [asyncio.create_task(node.listen()) for node in self.microphone_nodes.values()]
        done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()
        
        # Process audio from the first room that captured a command
        audio_data, source_room = done.pop().result()
        return await self.process_with_context(audio_data, source_room)
        
    async def process_with_context(self, audio_data, room):
        # Add room context to Claude processing
        transcript = await self.stt.transcribe(audio_data)
        
        claude_prompt = f"""
        Voice command from {room}: "{transcript}"
        
        Room-specific devices available:
        {self.get_room_devices(room)}
        
        Convert to Home Assistant API calls.
        """

Room-Aware Processing:

def get_room_devices(self, room):
    """Return devices specific to the source room"""
    room_entities = {
        'living_room': [
            'light.living_room_ceiling',
            'media_player.living_room_tv',
            'climate.living_room_thermostat'
        ],
        'bedroom': [
            'light.bedroom_bedside',
            'switch.bedroom_fan',
            'binary_sensor.bedroom_window'
        ]
    }
    return room_entities.get(room, [])

Context-Aware Conversations

Advanced Context Management:

class AdvancedContextManager:
    def __init__(self):
        self.conversation_sessions = {}
        self.house_state_history = []
        
    def create_claude_context(self, user_id, transcript):
        session = self.get_or_create_session(user_id)
        
        context = {
            'transcript': transcript,
            'conversation_history': session.history[-5:],
            'current_time': datetime.now().isoformat(),
            'house_state': {
                'lights_on': self.get_lights_status(),
                'occupancy': self.get_occupancy_status(),
                'weather': self.get_weather(),
                'recent_events': self.get_recent_ha_events(minutes=15)
            },
            'user_preferences': self.get_user_preferences(user_id),
            'location_context': self.get_location_context()
        }
        
        return self.format_claude_prompt(context)
        
    def format_claude_prompt(self, context):
        return f"""
        You are controlling a Home Assistant smart home system via voice commands.
        
        Current situation:
        - Time: {context['current_time']}
        - House state: {context['house_state']}
        - Recent conversation: {context['conversation_history']}
        
        User said: "{context['transcript']}"
        
        Convert this to Home Assistant API calls. Consider:
        1. Current device states (don't turn on lights that are already on)
        2. Time of day (different responses for morning vs night)
        3. Recent conversation context
        4. User's typical preferences
        
        Respond with JSON array of Home Assistant service calls.
        """

Voice Response & Feedback Systems

Advanced TTS Integration:

class VoiceResponseManager:
    def __init__(self):
        self.tts_engines = {
            'neural': 'tts.cloud_say',  # High quality
            'local': 'tts.piper_say',   # Local processing
            'espeak': 'tts.espeak_say'  # Fallback
        }
        
    async def respond_with_voice(self, message, room=None, urgency='normal'):
        # Select appropriate TTS based on context
        tts_engine = self.select_tts_engine(urgency)
        
        # Select speakers based on room or system state
        speakers = self.select_speakers(room)
        
        # Format message for natural speech
        speech_message = self.format_for_speech(message)
        
        # Send to appropriate speakers
        for speaker in speakers:
            await self.ha_client.call_service(
                'tts', tts_engine,
                target={'entity_id': speaker},
                service_data={
                    'message': speech_message,
                    'options': {
                        'voice': 'neural2-en-us-standard-a',
                        'speed': 1.0,
                        'pitch': 0.0
                    }
                }
            )
            
    def format_for_speech(self, message):
        """Convert technical responses to natural speech"""
        replacements = {
            'light.living_room': 'living room lights',
            'switch.bedroom_fan': 'bedroom fan',
            'climate.main_thermostat': 'thermostat',
            'scene.movie_mode': 'movie mode'
        }
        
        for tech_term, natural_term in replacements.items():
            message = message.replace(tech_term, natural_term)
            
        return message

Visual Feedback Integration:

# LED ring feedback on microphone nodes
import board
import neopixel

class MicrophoneVisualFeedback:
    def __init__(self, led_pin_count=12):
        self.leds = neopixel.NeoPixel(board.D18, led_pin_count)
        
    def show_listening(self):
        # Blue pulsing pattern
        self.animate_pulse(color=(0, 0, 255))
        
    def show_processing(self):
        # Spinning orange pattern  
        self.animate_spin(color=(255, 165, 0))
        
    def show_success(self):
        # Green flash
        self.flash(color=(0, 255, 0), duration=1.0)
        
    def show_error(self):
        # Red flash
        self.flash(color=(255, 0, 0), duration=2.0)

Performance Optimization Strategies

Caching & Response Time Optimization

STT Model Caching:

class WhisperModelCache:
    def __init__(self):
        self.models = {}
        self.model_locks = {}
        
    async def get_model(self, model_size='base'):
        if model_size not in self.models:
            if model_size not in self.model_locks:
                self.model_locks[model_size] = asyncio.Lock()
                
            async with self.model_locks[model_size]:
                if model_size not in self.models:
                    self.models[model_size] = whisper.load_model(model_size)
                    
        return self.models[model_size]

Command Pattern Caching:

class CommandCache:
    def __init__(self, ttl_seconds=300):  # 5 minute TTL
        self.cache = {}
        self.ttl = ttl_seconds
        
    def get_cached_response(self, transcript_hash):
        if transcript_hash in self.cache:
            entry = self.cache[transcript_hash]
            if time.time() - entry['timestamp'] < self.ttl:
                return entry['response']
        return None
        
    def cache_response(self, transcript_hash, response):
        self.cache[transcript_hash] = {
            'response': response,
            'timestamp': time.time()
        }

Resource Management

Memory Management:

class ResourceManager:
    def __init__(self):
        self.max_memory_usage = 4 * 1024**3  # 4GB limit
        
    async def process_with_memory_management(self, audio_data):
        initial_memory = self.get_memory_usage()
        
        try:
            if initial_memory > self.max_memory_usage * 0.8:
                await self.cleanup_memory()
                
            result = await self.process_audio(audio_data)
            return result
            
        finally:
            # Force garbage collection after each request
            gc.collect()
            
    async def cleanup_memory(self):
        # Clear caches, unload unused models
        self.command_cache.clear()
        self.whisper_cache.clear_unused()
        gc.collect()

Error Handling & Reliability

Comprehensive Error Recovery

STT Failure Handling:

class RobustSTTProcessor:
    def __init__(self):
        self.stt_engines = [
            WhisperSTT(model='base'),
            VoskSTT(),
            DeepSpeechSTT()  # Fallback options
        ]
        
    async def transcribe_with_fallbacks(self, audio_data):
        for i, engine in enumerate(self.stt_engines):
            try:
                transcript = await engine.transcribe(audio_data)
                if self.validate_transcript(transcript):
                    return transcript
            except Exception as e:
                logger.warning(f"STT engine {i} failed: {e}")
                if i == len(self.stt_engines) - 1:
                    raise
                continue
                
    def validate_transcript(self, transcript):
        # Basic validation rules
        if len(transcript.strip()) < 3:
            return False
        if transcript.count('?') > len(transcript) / 4:  # Too much uncertainty
            return False
        return True

Claude API Failure Handling:

class RobustClaudeClient:
    def __init__(self):
        self.client = anthropic.AsyncAnthropic()
        self.fallback_patterns = self.load_fallback_patterns()
        
    async def process_command_with_fallback(self, transcript):
        try:
            # Attempt Claude processing
            response = await self.client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=1024,
                messages=[{"role": "user", "content": self.create_prompt(transcript)}],
                timeout=10.0
            )
            return json.loads(response.content[0].text)
            
        except (anthropic.APIError, asyncio.TimeoutError) as e:
            logger.warning(f"Claude API failed: {e}")
            
            # Attempt local pattern matching as fallback
            return self.fallback_command_processing(transcript)
            
    def fallback_command_processing(self, transcript):
        """Simple pattern matching for basic commands when Claude is unavailable"""
        transcript = transcript.lower()
        
        # Basic light controls
        if 'turn on' in transcript and 'light' in transcript:
            room = self.extract_room(transcript)
            return [{
                'service': 'light.turn_on',
                'target': {'entity_id': f'light.{room or "all"}'}
            }]
            
        # Basic switch controls
        if 'turn off' in transcript and ('light' in transcript or 'switch' in transcript):
            room = self.extract_room(transcript) 
            return [{
                'service': 'light.turn_off',
                'target': {'entity_id': f'light.{room or "all"}'}
            }]
            
        # Scene activation
        scenes = ['movie', 'bedtime', 'morning', 'evening']
        for scene in scenes:
            if scene in transcript:
                return [{
                    'service': 'scene.turn_on',
                    'target': {'entity_id': f'scene.{scene}'}
                }]
                
        # If no patterns match, return error
        return [{'error': 'Command not recognized in offline mode'}]

Health Monitoring & Diagnostics

System Health Monitoring:

class VoiceSystemHealthMonitor:
    def __init__(self):
        self.health_checks = {
            'whisper_api': self.check_whisper_health,
            'claude_api': self.check_claude_health,
            'homeassistant_api': self.check_ha_health,
            'microphone_nodes': self.check_microphone_health
        }
        
    async def run_health_checks(self):
        results = {}
        
        for service, check_func in self.health_checks.items():
            try:
                results[service] = await check_func()
            except Exception as e:
                results[service] = {
                    'status': 'unhealthy',
                    'error': str(e),
                    'timestamp': datetime.now().isoformat()
                }
                
        return results
        
    async def check_whisper_health(self):
        start_time = time.time()
        async with aiohttp.ClientSession() as session:
            async with session.get('http://whisper:9000/health') as response:
                latency = time.time() - start_time
                
                return {
                    'status': 'healthy' if response.status == 200 else 'unhealthy',
                    'latency_ms': int(latency * 1000),
                    'timestamp': datetime.now().isoformat()
                }
        
    # Similar checks for other services...

Automated Recovery Actions:

class AutoRecoveryManager:
    def __init__(self):
        self.recovery_actions = {
            'whisper_unhealthy': self.restart_whisper_service,
            'high_memory_usage': self.cleanup_resources,
            'claude_rate_limited': self.enable_fallback_mode,
            'microphone_disconnected': self.reinitialize_audio
        }
        
    async def handle_health_issue(self, issue_type, details):
        if issue_type in self.recovery_actions:
            logger.info(f"Attempting recovery for {issue_type}")
            await self.recovery_actions[issue_type](details)
        else:
            logger.error(f"No recovery action for {issue_type}")
            await self.alert_administrators(issue_type, details)

Testing & Validation Strategies

Audio Processing Testing

STT Accuracy Testing:

class STTAccuracyTester:
    def __init__(self):
        self.test_phrases = [
            "Turn on the living room lights",
            "Set the temperature to 72 degrees",
            "Activate movie mode",
            "What's the weather like outside",
            "Turn off all lights",
            "Lock all doors"
        ]
        
    async def run_accuracy_tests(self, stt_engine):
        results = []
        
        for phrase in self.test_phrases:
            # Generate synthetic audio from phrase
            audio_data = await self.text_to_speech(phrase)
            
            # Test STT accuracy
            transcript = await stt_engine.transcribe(audio_data)
            
            accuracy = self.calculate_word_accuracy(phrase, transcript)
            results.append({
                'original': phrase,
                'transcript': transcript,
                'accuracy': accuracy
            })
            
        return results
        
    def calculate_word_accuracy(self, reference, hypothesis):
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()
        
        # Simple word-match accuracy (position-by-position), not a true word error rate
        correct = sum(1 for r, h in zip(ref_words, hyp_words) if r == h)
        return correct / len(ref_words) if ref_words else 0

End-to-End Integration Testing

Complete Pipeline Testing:

class E2ETestSuite:
    def __init__(self):
        self.test_scenarios = [
            {
                'name': 'basic_light_control',
                'audio_file': 'tests/audio/turn_on_lights.wav',
                'expected_ha_calls': [
                    {'service': 'light.turn_on', 'target': {'entity_id': 'light.living_room'}}
                ]
            },
            {
                'name': 'complex_scene_activation', 
                'audio_file': 'tests/audio/movie_mode.wav',
                'expected_ha_calls': [
                    {'service': 'scene.turn_on', 'target': {'entity_id': 'scene.movie'}}
                ]
            }
        ]
        
    async def run_full_pipeline_tests(self):
        results = []
        
        for scenario in self.test_scenarios:
            result = await self.test_scenario(scenario)
            results.append(result)
            
        return results
        
    async def test_scenario(self, scenario):
        # Load test audio
        with open(scenario['audio_file'], 'rb') as f:
            audio_data = f.read()
            
        # Run through complete pipeline
        try:
            actual_calls = await self.voice_bridge.process_audio(audio_data)
            
            # Compare with expected results
            match = self.compare_ha_calls(
                scenario['expected_ha_calls'], 
                actual_calls['commands']
            )
            
            return {
                'scenario': scenario['name'],
                'success': match,
                'expected': scenario['expected_ha_calls'],
                'actual': actual_calls['commands']
            }
            
        except Exception as e:
            return {
                'scenario': scenario['name'],
                'success': False,
                'error': str(e)
            }

Implementation Timeline & Milestones

Phase 1: Foundation (Weeks 1-2)

Goals:

  • Home Assistant stable deployment
  • Basic container infrastructure
  • Initial device integration

Success Criteria:

  • HA accessible and controlling existing devices
  • Container stack running reliably
  • Basic automations working

Phase 2: Core Voice System (Weeks 3-4)

Goals:

  • Whisper STT deployment
  • Basic voice-bridge service
  • Simple command processing

Success Criteria:

  • Speech-to-text working with test audio files
  • Claude Code API integration functional
  • Basic "turn on lights" commands working

Phase 3: Production Features (Weeks 5-6)

Goals:

  • Wake word detection
  • Multi-room microphone support
  • Advanced error handling

Success Criteria:

  • Hands-free operation with wake words
  • Reliable operation across multiple rooms
  • Graceful failure modes working

Phase 4: Optimization & Polish (Weeks 7-8)

Goals:

  • Performance optimization
  • Advanced context awareness
  • Visual/audio feedback systems

Success Criteria:

  • Sub-500ms response times
  • Context-aware conversations
  • Family-friendly operation

Cost Analysis

Hardware Costs

  • Microphones: $50-200 per room
  • Processing Hardware: Covered by existing Proxmox setup
  • Additional Storage: ~50GB for models and logs

Service Costs

  • Claude Code API: ~$0.01-0.10 per command (depending on context size)
  • Porcupine Wake Words: $0.50-2.00 per month per wake word
  • No cloud STT costs (fully local)

Estimated Monthly Operating Costs

  • Light Usage (10 commands/day, ~300/month): ~$3-30/month in API costs
  • Heavy Usage (50 commands/day, ~1,500/month): ~$15-150/month in API costs
  • Wake word licensing: ~$2-5/month

Conclusion & Next Steps

This voice automation system represents a modern approach to local smart home control, combining state-of-the-art speech recognition with AI-driven command interpretation. The architecture prioritizes privacy, reliability, and extensibility while keeping the local-first operation you want: audio never leaves the network, and only text transcripts are sent, encrypted, to the Claude API.

Key Success Factors:

  1. Proven Technology Stack: Whisper + Claude Code + Home Assistant
  2. Privacy-First Design: Audio never leaves local network
  3. Flexible Architecture: Easy to extend and customize
  4. Reliable Fallbacks: Multiple failure recovery mechanisms

Recommended Implementation Approach:

  1. Start with Home Assistant foundation
  2. Add voice components incrementally
  3. Test thoroughly at each phase
  4. Optimize for your specific use patterns

The combination of your technical expertise, existing infrastructure, and this comprehensive architecture plan sets you up for success in creating a truly advanced, private, and powerful voice-controlled smart home system.

This system will provide the advanced automation capabilities that Apple Home lacks while maintaining the local control and privacy that drove your original Home Assistant interest. The addition of Claude Code as the natural language processing layer bridges the gap between human speech and technical automation in a way that would be extremely difficult to achieve with traditional rule-based systems.