# Voice Automation Implementation Details & Insights
## Comprehensive Technical Analysis
This document captures detailed implementation insights, technical considerations, and lessons learned from analyzing voice-controlled home automation architecture for integration with Home Assistant and Claude Code.
## Core Architecture Deep Dive
### Speech-to-Text Engine Comparison
#### OpenAI Whisper (Primary Recommendation)
**Technical Specifications:**
- **Models Available:** tiny (39MB), base (74MB), small (244MB), medium (769MB), large (1550MB)
- **Languages:** 99+ languages with varying accuracy levels
- **Accuracy:** State-of-the-art, especially for English
- **Latency:**
  - Tiny: ~100ms on CPU, ~50ms on GPU
  - Base: ~200ms on CPU, ~100ms on GPU
  - Large: ~1s on CPU, ~300ms on GPU
- **Resource Usage:**
  - CPU: 1-4 cores depending on model size
  - RAM: 1-4GB depending on model size
  - GPU: optional, but a significant speedup (2-10x faster)
**Container Options:**
```bash
# Official Whisper in container
docker run -p 9000:9000 onerahmet/openai-whisper-asr-webservice:latest
# Custom optimized version with GPU
docker run --gpus all -p 9000:9000 whisper-gpu:latest
```
**API Interface:**
```python
# RESTful API example against the Whisper ASR webservice
import requests

response = requests.post(
    'http://localhost:9000/asr',
    files={'audio_file': open('command.wav', 'rb')},
    data={'task': 'transcribe', 'language': 'english'}
)
text = response.json()['text']
```
#### Vosk Alternative Analysis
**Pros:**
- Smaller memory footprint (100-200MB models)
- Faster real-time processing
- Better for streaming audio
- Multiple model sizes per language
**Cons:**
- Lower accuracy than Whisper for natural speech
- Fewer supported languages
- Less robust with accents/noise
**Use Case:** Better suited to fixed command-word recognition than to open natural language (a minimal streaming sketch follows below)
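For comparison, here is a minimal Vosk streaming-recognition sketch. It assumes a small English model has been downloaded and unpacked into `model/`, and `command.wav` is a placeholder 16kHz mono file:
```python
# Minimal Vosk streaming recognition sketch (model path and WAV file are placeholders)
import json
import wave

from vosk import Model, KaldiRecognizer

model = Model("model")                   # path to a downloaded Vosk model
wf = wave.open("command.wav", "rb")      # 16kHz, 16-bit, mono WAV
recognizer = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if recognizer.AcceptWaveform(data):
        # A complete utterance has been recognized
        print(json.loads(recognizer.Result())["text"])

# Flush any remaining partial result
print(json.loads(recognizer.FinalResult())["text"])
```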
#### wav2vec2 Consideration
**Facebook's Model:**
- Excellent accuracy competitive with Whisper
- More complex setup and deployment
- Less containerized ecosystem
- **Recommendation:** Skip unless specific requirements
### Voice Activity Detection (VAD) Deep Dive
#### Wake Word Detection Systems
**Porcupine by Picovoice (Recommended)**
```python
# Porcupine integration example (requires a Picovoice access key;
# custom wake words such as "hey-claude" must be trained in the Picovoice console)
import pvporcupine

keywords = ['hey-claude', 'computer', 'assistant']
porcupine = pvporcupine.create(
    access_key='your-access-key',
    keywords=keywords
)

while True:
    pcm = get_next_audio_frame()  # 16kHz, 16-bit, mono frame
    keyword_index = porcupine.process(pcm)
    if keyword_index >= 0:
        print(f"Wake word detected: {keywords[keyword_index]}")
        # Trigger STT pipeline
```
**Technical Requirements:**
- Continuous audio monitoring
- Low CPU usage (< 1% on modern CPUs)
- Custom wake word training available
- Privacy: all processing local
**Alternative: Snowboy (Open Source)**
- No longer actively maintained
- Still functional for basic wake words
- Completely free and local
- Lower accuracy than Porcupine
#### Push-to-Talk Implementations
**Hardware Button Integration:**
```python
# GPIO push-to-talk button on Raspberry Pi
import RPi.GPIO as GPIO

BUTTON_PIN = 18
GPIO.setmode(GPIO.BCM)
GPIO.setup(BUTTON_PIN, GPIO.IN, pull_up_down=GPIO.PUD_UP)

def button_callback(channel):
    if GPIO.input(channel) == GPIO.LOW:
        start_recording()             # button pressed
    else:
        stop_recording_and_process()  # button released

GPIO.add_event_detect(BUTTON_PIN, GPIO.BOTH, callback=button_callback, bouncetime=50)
```
**Mobile App Integration:**
- Home Assistant mobile app can trigger automations
- Custom webhook endpoints (a minimal endpoint sketch follows below)
- WebSocket connections for real-time triggers
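As a sketch of the custom-webhook option above, the voice-bridge could expose a push-to-talk endpoint that the Home Assistant mobile app or an automation POSTs to. The paths, payload shape, and `record_from_node` helper below are assumptions for illustration, not an existing API:
```python
# Hypothetical push-to-talk webhooks on the voice-bridge service (FastAPI)
from fastapi import FastAPI, UploadFile
from pydantic import BaseModel

app = FastAPI()

class TriggerPayload(BaseModel):
    room: str = "living_room"   # which microphone node should start listening

@app.post("/webhook/push-to-talk")
async def push_to_talk(payload: TriggerPayload):
    # Start a short recording on the named node, then run the normal pipeline
    audio = await record_from_node(payload.room, seconds=5)   # assumed helper
    return await voice_bridge.process_audio(audio)            # see VoiceBridge below

@app.post("/webhook/audio")
async def audio_upload(audio_file: UploadFile):
    # Alternative: the phone records locally and uploads the clip directly
    return await voice_bridge.process_audio(await audio_file.read())
```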
### Claude Code Integration Architecture
#### API Bridge Service Implementation
**Service Architecture:**
```python
# voice-bridge service structure
import os
import json

from fastapi import FastAPI
from anthropic import AsyncAnthropic
import homeassistant_api
import whisper_client

app = FastAPI()

class VoiceBridge:
    def __init__(self):
        self.claude = AsyncAnthropic(api_key=os.getenv('CLAUDE_API_KEY'))
        self.ha = homeassistant_api.Client(
            url=os.getenv('HA_URL'),
            token=os.getenv('HA_TOKEN')
        )
        self.whisper = whisper_client.Client(os.getenv('WHISPER_URL'))

    async def process_audio(self, audio_data):
        # Step 1: Convert audio to text
        transcript = await self.whisper.transcribe(audio_data)

        # Step 2: Send to Claude for interpretation
        ha_context = await self.get_ha_context()
        claude_response = await self.claude.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"""
                Convert this voice command to Home Assistant API calls:
                Command: "{transcript}"
                Available entities: {ha_context}
                Return JSON format for HA API calls.
                """
            }]
        )

        # Step 3: Execute HA commands
        commands = json.loads(claude_response.content[0].text)
        results = []
        for cmd in commands:
            result = await self.ha.call_service(**cmd)
            results.append(result)

        return {
            'transcript': transcript,
            'commands': commands,
            'results': results
        }

    async def get_ha_context(self):
        # Get current state of all entities
        states = await self.ha.get_states()
        return {
            'lights': [e for e in states if e['entity_id'].startswith('light.')],
            'sensors': [e for e in states if e['entity_id'].startswith('sensor.')],
            'switches': [e for e in states if e['entity_id'].startswith('switch.')],
            # ... other entity types
        }
```
#### Command Translation Patterns
**Direct Device Commands:**
```json
{
  "speech": "Turn on the living room lights",
  "claude_interpretation": {
    "intent": "light_control",
    "target": "light.living_room",
    "action": "turn_on"
  },
  "ha_api_call": {
    "service": "light.turn_on",
    "target": {"entity_id": "light.living_room"}
  }
}
```
**Scene Activation:**
```json
{
  "speech": "Set movie mode",
  "claude_interpretation": {
    "intent": "scene_activation",
    "scene": "movie_mode"
  },
  "ha_api_call": {
    "service": "scene.turn_on",
    "target": {"entity_id": "scene.movie_mode"}
  }
}
```
**Complex Logic:**
```json
{
  "speech": "Turn on lights in occupied rooms",
  "claude_interpretation": {
    "intent": "conditional_light_control",
    "condition": "occupancy_detected",
    "action": "turn_on_lights"
  },
  "ha_api_calls": [
    {
      "service": "light.turn_on",
      "target": {"entity_id": "light.bedroom"},
      "condition": "binary_sensor.bedroom_occupancy == 'on'"
    },
    {
      "service": "light.turn_on",
      "target": {"entity_id": "light.living_room"},
      "condition": "binary_sensor.living_room_occupancy == 'on'"
    }
  ]
}
```
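Note that Home Assistant service calls do not accept a `condition` field directly, so the bridge itself has to check the referenced entity state before issuing each call. A minimal sketch of that evaluation step, assuming the `ha` client from the VoiceBridge example and a hypothetical `get_state` helper:
```python
# Sketch: evaluate Claude's per-call conditions before executing them.
# Conditions are assumed to arrive as "entity_id == 'state'" strings.
async def execute_conditional_calls(ha, ha_api_calls):
    results = []
    for call in ha_api_calls:
        condition = call.pop('condition', None)
        if condition:
            entity_id, _, expected = condition.partition('==')
            state = await ha.get_state(entity_id.strip())   # assumed helper
            if state != expected.strip().strip("'\""):
                results.append({'call': call, 'skipped': True})
                continue
        results.append({'call': call, 'result': await ha.call_service(**call)})
    return results
```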
#### Context Management Strategy
**Conversation Memory:**
```python
from datetime import datetime

class ConversationContext:
    def __init__(self):
        self.history = []
        self.context_window = 10  # Keep last 10 interactions

    def add_interaction(self, speech, response, timestamp):
        self.history.append({
            'speech': speech,
            'response': response,
            'timestamp': timestamp,
            'ha_state_snapshot': self.capture_ha_state()
        })
        # Maintain sliding window
        if len(self.history) > self.context_window:
            self.history.pop(0)

    def get_context_for_claude(self):
        return {
            'recent_commands': self.history[-3:],
            'current_time': datetime.now(),
            'house_state': self.get_current_house_state()
        }
```
**Ambiguity Resolution:**
```python
# Handle ambiguous commands
def resolve_ambiguity(transcript, available_entities):
    specific_room_mentioned = any(
        room in transcript.lower() for room in ('living room', 'bedroom', 'kitchen')
    )
    if "lights" in transcript.lower() and not specific_room_mentioned:
        return {
            'type': 'clarification_needed',
            'message': 'Which lights? I can control: living room, bedroom, kitchen',
            'options': ['light.living_room', 'light.bedroom', 'light.kitchen']
        }
```
### Home Assistant Integration Patterns
#### API Authentication & Security
```python
# Secure API setup
import os

HA_CONFIG = {
    'url': 'http://homeassistant:8123',
    'token': os.getenv('HA_LONG_LIVED_TOKEN'),  # Never hardcode
    'ssl_verify': True,  # In production
    'timeout': 10
}

# Create a long-lived access token in HA:
# Settings -> People -> [Your User] -> Long-lived access tokens
```
#### WebSocket Integration for Real-time Updates
```python
import websockets
import json

async def ha_websocket_listener():
    uri = "ws://homeassistant:8123/api/websocket"
    async with websockets.connect(uri) as websocket:
        # Authenticate
        await websocket.send(json.dumps({
            'type': 'auth',
            'access_token': HA_TOKEN
        }))

        # Subscribe to state changes
        await websocket.send(json.dumps({
            'id': 1,
            'type': 'subscribe_events',
            'event_type': 'state_changed'
        }))

        async for message in websocket:
            data = json.loads(message)
            if data.get('type') == 'event':
                # Process state changes for voice responses
                await process_state_change(data['event'])
```
#### Voice Response Integration
```python
# HA TTS integration for voice feedback
async def speak_response(message, entity_id='media_player.living_room'):
    await ha_client.call_service(
        'tts', 'speak',
        target={'entity_id': entity_id},
        service_data={
            'message': message,
            'language': 'en',
            'options': {'voice': 'neural'}
        }
    )

# Usage examples:
await speak_response("Living room lights turned on")
await speak_response("I couldn't find that device. Please be more specific.")
await speak_response("Movie mode activated. Enjoy your film!")
```
### Hardware & Deployment Considerations
#### Microphone Hardware Analysis
**USB Microphones (Recommended for testing):**
- Blue Yeti: Excellent quality, multiple pickup patterns
- Audio-Technica ATR2100x-USB: Professional quality
- Samson Go Mic: Compact, budget-friendly
**Professional Audio Interfaces:**
- Focusrite Scarlett Solo: Single input, professional quality
- Behringer U-Phoria UM2: Budget 2-input interface
- PreSonus AudioBox USB 96: Mid-range option
**Raspberry Pi Integration:**
```bash
# ReSpeaker HAT for Raspberry Pi
# Provides 2-4 microphone array with hardware VAD
# I2S connection, low latency
# Built-in LED ring for visual feedback
# GPIO microphone setup
sudo apt install python3-pyaudio
# Configure ALSA for USB microphones
```
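Before wiring a microphone into the pipeline, it is worth confirming what ALSA actually exposes. A quick enumeration sketch using the `sounddevice` PortAudio binding (one option among several):
```python
# List capture-capable audio devices so the right USB mic / ReSpeaker HAT
# can be referenced by index or name in the capture service.
import sounddevice as sd

for index, device in enumerate(sd.query_devices()):
    if device['max_input_channels'] > 0:
        print(f"{index}: {device['name']} "
              f"({device['max_input_channels']} ch @ {device['default_samplerate']:.0f} Hz)")
```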
**Network Microphone Distribution:**
```python
# Distributed microphone system
MICROPHONE_NODES = {
    'living_room': 'http://pi-living:8080',
    'bedroom': 'http://pi-bedroom:8080',
    'kitchen': 'http://pi-kitchen:8080'
}
# Each Pi runs a lightweight audio capture service and
# sends audio to the central Whisper instance for processing.
```
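A minimal sketch of that per-Pi capture service follows: it records a short clip and forwards it to the central Whisper container using the same `/asr` endpoint shown earlier. The clip length, sample rate, and node behavior are assumptions for illustration:
```python
# Lightweight capture-and-forward loop for a microphone node (sketch).
import io
import wave

import requests
import sounddevice as sd

WHISPER_URL = "http://whisper:9000/asr"
SAMPLE_RATE = 16000
CLIP_SECONDS = 5

def record_clip() -> bytes:
    """Record a short mono clip and return it as WAV bytes."""
    frames = sd.rec(int(CLIP_SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                    channels=1, dtype='int16')
    sd.wait()
    buf = io.BytesIO()
    with wave.open(buf, 'wb') as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(frames.tobytes())
    return buf.getvalue()

def send_to_whisper(wav_bytes: bytes) -> str:
    response = requests.post(
        WHISPER_URL,
        files={'audio_file': ('clip.wav', wav_bytes, 'audio/wav')},
        data={'task': 'transcribe', 'language': 'english'}
    )
    return response.json()['text']

if __name__ == '__main__':
    print(send_to_whisper(record_clip()))
```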
#### GPU Acceleration Setup
**NVIDIA GPU Configuration:**
```yaml
# Docker Compose GPU configuration
whisper-gpu:
  image: whisper-gpu:latest
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
  environment:
    - NVIDIA_VISIBLE_DEVICES=all
```
**Performance Benchmarks (estimated; a quick measurement sketch follows below):**
- **CPU only (8-core):**
  - Whisper base: ~500ms latency
  - Whisper large: ~2000ms latency
- **With GPU (GTX 1660+):**
  - Whisper base: ~150ms latency
  - Whisper large: ~400ms latency
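These figures are estimates only; the sketch below measures end-to-end STT latency against the running Whisper container (same `/asr` endpoint as above; the test clip path is a placeholder):
```python
# Rough STT latency measurement against the Whisper webservice.
import statistics
import time

import requests

WHISPER_URL = "http://localhost:9000/asr"
TEST_CLIP = "tests/audio/turn_on_lights.wav"  # placeholder path
RUNS = 5

with open(TEST_CLIP, "rb") as f:
    audio_bytes = f.read()

latencies = []
for _ in range(RUNS):
    start = time.perf_counter()
    requests.post(
        WHISPER_URL,
        files={"audio_file": ("clip.wav", audio_bytes, "audio/wav")},
        data={"task": "transcribe", "language": "english"},
    )
    latencies.append((time.perf_counter() - start) * 1000)

print(f"median latency: {statistics.median(latencies):.0f} ms over {RUNS} runs")
```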
#### Container Orchestration Strategy
**Complete Docker Compose Stack:**
```yaml
version: '3.8'

services:
  # Core Home Assistant
  homeassistant:
    container_name: homeassistant
    image: ghcr.io/home-assistant/home-assistant:stable
    volumes:
      - ./ha-config:/config
      - /etc/localtime:/etc/localtime:ro
    restart: unless-stopped
    network_mode: host

  # Speech-to-Text Engine
  whisper:
    container_name: whisper-stt
    image: onerahmet/openai-whisper-asr-webservice:latest
    ports:
      - "9000:9000"
    environment:
      - ASR_MODEL=base
      - ASR_ENGINE=openai_whisper
    volumes:
      - whisper-models:/root/.cache/whisper
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # Voice Processing Bridge
  voice-bridge:
    container_name: voice-bridge
    build:
      context: ./voice-bridge
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      - CLAUDE_API_KEY=${CLAUDE_API_KEY}
      - HA_URL=http://homeassistant:8123
      - HA_TOKEN=${HA_LONG_LIVED_TOKEN}
      - WHISPER_URL=http://whisper:9000
      - PORCUPINE_ACCESS_KEY=${PORCUPINE_ACCESS_KEY}
    volumes:
      - ./voice-bridge-config:/app/config
      - /dev/snd:/dev/snd  # Audio device access
    depends_on:
      - homeassistant
      - whisper
    restart: unless-stopped
    privileged: true  # For audio device access

  # Optional: MQTT for device communication
  mosquitto:
    container_name: mqtt-broker
    image: eclipse-mosquitto:latest
    ports:
      - "1883:1883"
      - "9001:9001"
    volumes:
      - ./mosquitto:/mosquitto
    restart: unless-stopped

  # Optional: Node-RED for visual automation
  node-red:
    container_name: node-red
    image: nodered/node-red:latest
    ports:
      - "1880:1880"
    volumes:
      - node-red-data:/data
    restart: unless-stopped

volumes:
  whisper-models:
  node-red-data:
```
### Privacy & Security Deep Analysis
#### Data Flow Security Model
**Audio Data Privacy:**
```
[Microphone] → [Local VAD] → [Local STT] → [Text only] → [Claude API]
      ↓              ↓             ↓              ↓
 never leaves   never leaves   never leaves   encrypted
 local network  local network  local network  HTTPS only
```
**Security Boundaries:**
1. **Audio Capture Layer:** hardware and local processing only
2. **Speech Recognition:** local Whisper; no cloud STT
3. **Command Interpretation:** only text is sent to the Claude Code API
4. **Automation Execution:** local Home Assistant only
#### Network Security Configuration
**Firewall Rules:**
```bash
# Only allow outbound HTTPS to the Claude API host
# (iptables resolves the hostname once, when the rule is added)
iptables -A OUTPUT -p tcp --dport 443 -d api.anthropic.com -j ACCEPT
iptables -A OUTPUT -p tcp --dport 443 -j DROP  # Block all other HTTPS

# Block all other outbound traffic from the voice containers
iptables -A OUTPUT -s <voice-bridge-ip> -j DROP
```
**API Key Security:**
```bash
# Environment variable best practices
echo "CLAUDE_API_KEY=your-key-here" >> .env
echo "HA_LONG_LIVED_TOKEN=your-token-here" >> .env
chmod 600 .env
# Container secrets mounting
docker run --env-file .env voice-bridge:latest
```
#### Privacy Controls Implementation
**Audio Retention Policy:**
```python
import gc

class AudioPrivacyManager:
    def __init__(self):
        self.max_retention_seconds = 5      # Keep audio only during processing
        self.transcript_retention_days = 7  # Keep transcripts short-term

    async def process_audio(self, audio_data):
        try:
            transcript = await self.stt_engine.transcribe(audio_data)
            # Process immediately
            result = await self.process_command(transcript)
            # Store transcript with expiration
            await self.store_transcript(transcript, expires_in=7 * 24 * 3600)
            return result
        finally:
            # Always delete audio data immediately
            del audio_data
            gc.collect()
```
**User Consent & Controls:**
```python
# Voice system controls exposed as Home Assistant helpers
VOICE_CONTROLS = {
    'input_boolean.voice_system_enabled': 'Global voice control toggle',
    'input_boolean.voice_learning_mode': 'Allow transcript storage for improvement',
    'input_select.voice_privacy_level': ['minimal', 'standard', 'enhanced'],
    'button.clear_voice_history': 'Clear all stored transcripts'
}
```
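To make the global toggle actually gate processing, the voice-bridge can check `input_boolean.voice_system_enabled` through Home Assistant's standard REST states endpoint before touching any audio. A minimal sketch, assuming the same `HA_URL`/`HA_TOKEN` environment variables used elsewhere in this document:
```python
# Sketch: respect the Home Assistant voice-control toggle before processing audio.
import os

import aiohttp

HA_URL = os.getenv("HA_URL", "http://homeassistant:8123")
HA_TOKEN = os.getenv("HA_TOKEN")

async def voice_system_enabled() -> bool:
    """Return True only when input_boolean.voice_system_enabled is 'on'."""
    headers = {"Authorization": f"Bearer {HA_TOKEN}"}
    url = f"{HA_URL}/api/states/input_boolean.voice_system_enabled"
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as response:
            if response.status != 200:
                return False  # fail closed if the helper doesn't exist
            state = await response.json()
            return state.get("state") == "on"
```
The voice-bridge's `process_audio` path would simply return early (and optionally speak a short notice) when this check returns False.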
### Advanced Features & Future Expansion
#### Multi-Room Microphone Network
**Distributed Audio Architecture:**
```python
# Central coordinator service
import asyncio

class MultiRoomVoiceCoordinator:
    def __init__(self):
        self.microphone_nodes = {
            'living_room': MicrophoneNode('192.168.1.101'),
            'bedroom': MicrophoneNode('192.168.1.102'),
            'kitchen': MicrophoneNode('192.168.1.103')
        }

    async def listen_all_rooms(self):
        # Simultaneous listening across all nodes
        tasks = [asyncio.create_task(node.listen())
                 for node in self.microphone_nodes.values()]
        done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()
        # Process audio from the first responding room
        audio_data, source_room = done.pop().result()
        return await self.process_with_context(audio_data, source_room)

    async def process_with_context(self, audio_data, room):
        # Add room context to Claude processing
        transcript = await self.stt.transcribe(audio_data)
        claude_prompt = f"""
        Voice command from {room}: "{transcript}"
        Room-specific devices available:
        {self.get_room_devices(room)}
        Convert to Home Assistant API calls.
        """
```
**Room-Aware Processing:**
```python
def get_room_devices(self, room):
    """Return devices specific to the source room."""
    room_entities = {
        'living_room': [
            'light.living_room_ceiling',
            'media_player.living_room_tv',
            'climate.living_room_thermostat'
        ],
        'bedroom': [
            'light.bedroom_bedside',
            'switch.bedroom_fan',
            'binary_sensor.bedroom_window'
        ]
    }
    return room_entities.get(room, [])
```
#### Context-Aware Conversations
**Advanced Context Management:**
```python
from datetime import datetime

class AdvancedContextManager:
    def __init__(self):
        self.conversation_sessions = {}
        self.house_state_history = []

    def create_claude_context(self, user_id, transcript):
        session = self.get_or_create_session(user_id)
        context = {
            'transcript': transcript,
            'conversation_history': session.history[-5:],
            'current_time': datetime.now().isoformat(),
            'house_state': {
                'lights_on': self.get_lights_status(),
                'occupancy': self.get_occupancy_status(),
                'weather': self.get_weather(),
                'recent_events': self.get_recent_ha_events(minutes=15)
            },
            'user_preferences': self.get_user_preferences(user_id),
            'location_context': self.get_location_context()
        }
        return self.format_claude_prompt(context)

    def format_claude_prompt(self, context):
        return f"""
        You are controlling a Home Assistant smart home system via voice commands.

        Current situation:
        - Time: {context['current_time']}
        - House state: {context['house_state']}
        - Recent conversation: {context['conversation_history']}

        User said: "{context['transcript']}"

        Convert this to Home Assistant API calls. Consider:
        1. Current device states (don't turn on lights that are already on)
        2. Time of day (different responses for morning vs night)
        3. Recent conversation context
        4. User's typical preferences

        Respond with a JSON array of Home Assistant service calls.
        """
```
#### Voice Response & Feedback Systems
**Advanced TTS Integration:**
```python
class VoiceResponseManager:
    def __init__(self):
        self.tts_engines = {
            'neural': 'tts.cloud_say',  # High quality
            'local': 'tts.piper_say',   # Local processing
            'espeak': 'tts.espeak_say'  # Fallback
        }

    async def respond_with_voice(self, message, room=None, urgency='normal'):
        # Select appropriate TTS based on context
        tts_engine = self.select_tts_engine(urgency)
        # Select speakers based on room or system state
        speakers = self.select_speakers(room)
        # Format message for natural speech
        speech_message = self.format_for_speech(message)
        # Send to appropriate speakers
        for speaker in speakers:
            await self.ha_client.call_service(
                'tts', tts_engine,
                target={'entity_id': speaker},
                service_data={
                    'message': speech_message,
                    'options': {
                        'voice': 'neural2-en-us-standard-a',
                        'speed': 1.0,
                        'pitch': 0.0
                    }
                }
            )

    def format_for_speech(self, message):
        """Convert technical responses to natural speech."""
        replacements = {
            'light.living_room': 'living room lights',
            'switch.bedroom_fan': 'bedroom fan',
            'climate.main_thermostat': 'thermostat',
            'scene.movie_mode': 'movie mode'
        }
        for tech_term, natural_term in replacements.items():
            message = message.replace(tech_term, natural_term)
        return message
```
**Visual Feedback Integration:**
```python
# LED ring feedback on microphone nodes
import board
import neopixel

class MicrophoneVisualFeedback:
    def __init__(self, led_pin_count=12):
        self.leds = neopixel.NeoPixel(board.D18, led_pin_count)

    def show_listening(self):
        # Blue pulsing pattern
        self.animate_pulse(color=(0, 0, 255))

    def show_processing(self):
        # Spinning orange pattern
        self.animate_spin(color=(255, 165, 0))

    def show_success(self):
        # Green flash
        self.flash(color=(0, 255, 0), duration=1.0)

    def show_error(self):
        # Red flash
        self.flash(color=(255, 0, 0), duration=2.0)
```
### Performance Optimization Strategies
#### Caching & Response Time Optimization
**STT Model Caching:**
```python
import asyncio
import whisper

class WhisperModelCache:
    def __init__(self):
        self.models = {}
        self.model_locks = {}

    async def get_model(self, model_size='base'):
        if model_size not in self.models:
            if model_size not in self.model_locks:
                self.model_locks[model_size] = asyncio.Lock()
            async with self.model_locks[model_size]:
                if model_size not in self.models:
                    self.models[model_size] = whisper.load_model(model_size)
        return self.models[model_size]
```
**Command Pattern Caching:**
```python
import time

class CommandCache:
    def __init__(self, ttl_seconds=300):  # 5 minute TTL
        self.cache = {}
        self.ttl = ttl_seconds

    def get_cached_response(self, transcript_hash):
        if transcript_hash in self.cache:
            entry = self.cache[transcript_hash]
            if time.time() - entry['timestamp'] < self.ttl:
                return entry['response']
        return None

    def cache_response(self, transcript_hash, response):
        self.cache[transcript_hash] = {
            'response': response,
            'timestamp': time.time()
        }
```
#### Resource Management
**Memory Management:**
```python
import gc

class ResourceManager:
    def __init__(self):
        self.max_memory_usage = 4 * 1024**3  # 4GB limit

    async def process_with_memory_management(self, audio_data):
        initial_memory = self.get_memory_usage()
        try:
            if initial_memory > self.max_memory_usage * 0.8:
                await self.cleanup_memory()
            result = await self.process_audio(audio_data)
            return result
        finally:
            # Force garbage collection after each request
            gc.collect()

    async def cleanup_memory(self):
        # Clear caches, unload unused models
        self.command_cache.clear()
        self.whisper_cache.clear_unused()
        gc.collect()
```
### Error Handling & Reliability
#### Comprehensive Error Recovery
**STT Failure Handling:**
```python
import logging

logger = logging.getLogger(__name__)

class RobustSTTProcessor:
    def __init__(self):
        self.stt_engines = [
            WhisperSTT(model='base'),
            VoskSTT(),
            DeepSpeechSTT()  # Fallback options
        ]

    async def transcribe_with_fallbacks(self, audio_data):
        for i, engine in enumerate(self.stt_engines):
            try:
                transcript = await engine.transcribe(audio_data)
                if self.validate_transcript(transcript):
                    return transcript
            except Exception as e:
                logger.warning(f"STT engine {i} failed: {e}")
                if i == len(self.stt_engines) - 1:
                    raise
                continue

    def validate_transcript(self, transcript):
        # Basic validation rules
        if len(transcript.strip()) < 3:
            return False
        if transcript.count('?') > len(transcript) / 4:  # Too much uncertainty
            return False
        return True
```
**Claude API Failure Handling:**
```python
import asyncio
import json
import logging

import anthropic

logger = logging.getLogger(__name__)

class RobustClaudeClient:
    def __init__(self):
        self.client = anthropic.AsyncAnthropic()
        self.fallback_patterns = self.load_fallback_patterns()

    async def process_command_with_fallback(self, transcript):
        try:
            # Attempt Claude processing
            response = await self.client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=1024,
                messages=[{"role": "user", "content": self.create_prompt(transcript)}],
                timeout=10.0
            )
            return json.loads(response.content[0].text)
        except (anthropic.APIError, asyncio.TimeoutError) as e:
            logger.warning(f"Claude API failed: {e}")
            # Attempt local pattern matching as fallback
            return self.fallback_command_processing(transcript)

    def fallback_command_processing(self, transcript):
        """Simple pattern matching for basic commands when Claude is unavailable."""
        transcript = transcript.lower()

        # Basic light controls
        if 'turn on' in transcript and 'light' in transcript:
            room = self.extract_room(transcript)
            return [{
                'service': 'light.turn_on',
                'target': {'entity_id': f'light.{room or "all"}'}
            }]

        # Basic switch controls
        if 'turn off' in transcript and ('light' in transcript or 'switch' in transcript):
            room = self.extract_room(transcript)
            return [{
                'service': 'light.turn_off',
                'target': {'entity_id': f'light.{room or "all"}'}
            }]

        # Scene activation
        scenes = ['movie', 'bedtime', 'morning', 'evening']
        for scene in scenes:
            if scene in transcript:
                return [{
                    'service': 'scene.turn_on',
                    'target': {'entity_id': f'scene.{scene}'}
                }]

        # If no patterns match, return an error marker
        return [{'error': 'Command not recognized in offline mode'}]
```
#### Health Monitoring & Diagnostics
**System Health Monitoring:**
```python
import time
from datetime import datetime

import aiohttp

class VoiceSystemHealthMonitor:
    def __init__(self):
        self.health_checks = {
            'whisper_api': self.check_whisper_health,
            'claude_api': self.check_claude_health,
            'homeassistant_api': self.check_ha_health,
            'microphone_nodes': self.check_microphone_health
        }

    async def run_health_checks(self):
        results = {}
        for service, check_func in self.health_checks.items():
            try:
                results[service] = await check_func()
            except Exception as e:
                results[service] = {
                    'status': 'unhealthy',
                    'error': str(e),
                    'timestamp': datetime.now().isoformat()
                }
        return results

    async def check_whisper_health(self):
        start_time = time.time()
        async with aiohttp.ClientSession() as session:
            async with session.get('http://whisper:9000/health') as response:
                latency = time.time() - start_time
                return {
                    'status': 'healthy' if response.status == 200 else 'unhealthy',
                    'latency_ms': int(latency * 1000),
                    'timestamp': datetime.now().isoformat()
                }

    # Similar checks for the other services...
```
**Automated Recovery Actions:**
```python
import logging

logger = logging.getLogger(__name__)

class AutoRecoveryManager:
    def __init__(self):
        self.recovery_actions = {
            'whisper_unhealthy': self.restart_whisper_service,
            'high_memory_usage': self.cleanup_resources,
            'claude_rate_limited': self.enable_fallback_mode,
            'microphone_disconnected': self.reinitialize_audio
        }

    async def handle_health_issue(self, issue_type, details):
        if issue_type in self.recovery_actions:
            logger.info(f"Attempting recovery for {issue_type}")
            await self.recovery_actions[issue_type](details)
        else:
            logger.error(f"No recovery action for {issue_type}")
            await self.alert_administrators(issue_type, details)
```
### Testing & Validation Strategies
#### Audio Processing Testing
**STT Accuracy Testing:**
```python
class STTAccuracyTester:
    def __init__(self):
        self.test_phrases = [
            "Turn on the living room lights",
            "Set the temperature to 72 degrees",
            "Activate movie mode",
            "What's the weather like outside",
            "Turn off all lights",
            "Lock all doors"
        ]

    async def run_accuracy_tests(self, stt_engine):
        results = []
        for phrase in self.test_phrases:
            # Generate synthetic audio from the phrase
            audio_data = await self.text_to_speech(phrase)
            # Test STT accuracy
            transcript = await stt_engine.transcribe(audio_data)
            accuracy = self.calculate_word_accuracy(phrase, transcript)
            results.append({
                'original': phrase,
                'transcript': transcript,
                'accuracy': accuracy
            })
        return results

    def calculate_word_accuracy(self, reference, hypothesis):
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()
        # Simple positional word-accuracy measure (not a full word error rate)
        correct = sum(1 for r, h in zip(ref_words, hyp_words) if r == h)
        return correct / len(ref_words) if ref_words else 0
```
#### End-to-End Integration Testing
**Complete Pipeline Testing:**
```python
class E2ETestSuite:
    def __init__(self):
        self.test_scenarios = [
            {
                'name': 'basic_light_control',
                'audio_file': 'tests/audio/turn_on_lights.wav',
                'expected_ha_calls': [
                    {'service': 'light.turn_on', 'target': {'entity_id': 'light.living_room'}}
                ]
            },
            {
                'name': 'complex_scene_activation',
                'audio_file': 'tests/audio/movie_mode.wav',
                'expected_ha_calls': [
                    {'service': 'scene.turn_on', 'target': {'entity_id': 'scene.movie'}}
                ]
            }
        ]

    async def run_full_pipeline_tests(self):
        results = []
        for scenario in self.test_scenarios:
            result = await self.test_scenario(scenario)
            results.append(result)
        return results

    async def test_scenario(self, scenario):
        # Load test audio
        with open(scenario['audio_file'], 'rb') as f:
            audio_data = f.read()

        # Run through the complete pipeline
        try:
            actual_calls = await self.voice_bridge.process_audio(audio_data)
            # Compare with expected results
            match = self.compare_ha_calls(
                scenario['expected_ha_calls'],
                actual_calls['commands']
            )
            return {
                'scenario': scenario['name'],
                'success': match,
                'expected': scenario['expected_ha_calls'],
                'actual': actual_calls['commands']
            }
        except Exception as e:
            return {
                'scenario': scenario['name'],
                'success': False,
                'error': str(e)
            }
```
## Implementation Timeline & Milestones
### Phase 1: Foundation (Weeks 1-2)
**Goals:**
- Home Assistant stable deployment
- Basic container infrastructure
- Initial device integration
**Success Criteria:**
- HA accessible and controlling existing devices
- Container stack running reliably
- Basic automations working
### Phase 2: Core Voice System (Weeks 3-4)
**Goals:**
- Whisper STT deployment
- Basic voice-bridge service
- Simple command processing
**Success Criteria:**
- Speech-to-text working with test audio files
- Claude Code API integration functional
- Basic "turn on lights" commands working
### Phase 3: Production Features (Weeks 5-6)
**Goals:**
- Wake word detection
- Multi-room microphone support
- Advanced error handling
**Success Criteria:**
- Hands-free operation with wake words
- Reliable operation across multiple rooms
- Graceful failure modes working
### Phase 4: Optimization & Polish (Weeks 7-8)
**Goals:**
- Performance optimization
- Advanced context awareness
- Visual/audio feedback systems
**Success Criteria:**
- Sub-500ms response times
- Context-aware conversations
- Family-friendly operation
## Cost Analysis
### Hardware Costs
- **Microphones:** $50-200 per room
- **Processing Hardware:** Covered by existing Proxmox setup
- **Additional Storage:** ~50GB for models and logs
### Service Costs
- **Claude Code API:** ~$0.01-0.10 per command, depending on context size (see the worked example below)
- **Porcupine Wake Words:** $0.50-2.00 per month per wake word
- **No cloud STT costs** (fully local)
### Estimated Monthly Operating Costs
- **Light Usage (10 commands/day):** ~$3-10/month
- **Heavy Usage (50 commands/day):** ~$15-50/month
- **Wake word licensing:** ~$2-5/month
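A rough sanity check of the per-command figure, assuming published Claude 3.5 Sonnet pricing of about $3 per million input tokens and $15 per million output tokens (verify current rates before relying on this):
```python
# Back-of-envelope per-command and monthly cost estimate (pricing and token counts are assumptions).
INPUT_PER_MTOK = 3.00    # USD per million input tokens
OUTPUT_PER_MTOK = 15.00  # USD per million output tokens

input_tokens = 1500      # prompt + entity context per command (assumed)
output_tokens = 150      # JSON service calls in the response (assumed)

per_command = (input_tokens / 1e6) * INPUT_PER_MTOK + (output_tokens / 1e6) * OUTPUT_PER_MTOK
print(f"~${per_command:.3f} per command")                 # roughly $0.007
print(f"~${per_command * 10 * 30:.2f}/month at 10/day")   # roughly $2
print(f"~${per_command * 50 * 30:.2f}/month at 50/day")   # roughly $10
```
With a fuller house-state dump in the prompt (say ~10,000 input tokens), the per-command cost rises to roughly $0.03, which is where the upper end of the estimates above comes from.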
## Conclusion & Next Steps
This voice automation system represents a modern approach to local smart home control, combining state-of-the-art speech recognition with AI-based command interpretation. The architecture prioritizes privacy, reliability, and extensibility: audio processing stays entirely local, and only transcribed text leaves the network for interpretation.
**Key Success Factors:**
1. **Proven Technology Stack:** Whisper + Claude Code + Home Assistant
2. **Privacy-First Design:** Audio never leaves local network
3. **Flexible Architecture:** Easy to extend and customize
4. **Reliable Fallbacks:** Multiple failure recovery mechanisms
**Recommended Implementation Approach:**
1. Start with Home Assistant foundation
2. Add voice components incrementally
3. Test thoroughly at each phase
4. Optimize for your specific use patterns
The combination of your technical expertise, existing infrastructure, and this comprehensive architecture plan sets you up for success in creating a truly advanced, private, and powerful voice-controlled smart home system.
This system will provide the advanced automation capabilities that Apple Home lacks while maintaining the local control and privacy that drove your original Home Assistant interest. The addition of Claude Code as the natural language processing layer bridges the gap between human speech and technical automation in a way that would be extremely difficult to achieve with traditional rule-based systems.