# Voice Automation Implementation Details & Insights
## Comprehensive Technical Analysis

This document captures detailed implementation insights, technical considerations, and lessons learned from analyzing a voice-controlled home automation architecture that integrates Home Assistant with Claude Code.

## Core Architecture Deep Dive

### Speech-to-Text Engine Comparison

#### OpenAI Whisper (Primary Recommendation)
**Technical Specifications:**
- **Models Available:** tiny (39MB), base (74MB), small (244MB), medium (769MB), large (1550MB)
- **Languages:** 99+ languages with varying accuracy levels
- **Accuracy:** State-of-the-art, especially for English
- **Latency:**
  - Tiny: ~100ms on CPU, ~50ms on GPU
  - Base: ~200ms on CPU, ~100ms on GPU
  - Large: ~1s on CPU, ~300ms on GPU
- **Resource Usage:**
  - CPU: 1-4 cores depending on model size
  - RAM: 1-4GB depending on model size
  - GPU: Optional but significant speedup (2-10x faster)

**Container Options:**
```bash
# Official Whisper in container
docker run -p 9000:9000 onerahmet/openai-whisper-asr-webservice:latest

# Custom optimized version with GPU
docker run --gpus all -p 9000:9000 whisper-gpu:latest
```

**API Interface:**
```python
# RESTful API example
import requests

response = requests.post(
    'http://localhost:9000/asr',
    files={'audio_file': open('command.wav', 'rb')},
    data={'task': 'transcribe', 'language': 'english'}
)
text = response.json()['text']
```
#### Vosk Alternative Analysis

**Pros:**
- Smaller memory footprint (100-200MB models)
- Faster real-time processing
- Better for streaming audio
- Multiple model sizes per language

**Cons:**
- Lower accuracy than Whisper for natural speech
- Fewer supported languages
- Less robust with accents/noise

**Use Case:** Better for command-word recognition, worse for natural language.
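The streaming-first design is where Vosk differs most from Whisper: it can emit partial results while audio is still arriving. Below is a minimal sketch of that API, assuming the `vosk` package is installed and a small English model has been downloaded; the model directory and WAV file names are placeholders.

```python
# Minimal Vosk streaming sketch (model path and audio file are placeholders)
import json
import wave

from vosk import KaldiRecognizer, Model

model = Model("vosk-model-small-en-us")
wf = wave.open("command.wav", "rb")            # expects 16 kHz mono PCM
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)                 # feed audio in small chunks
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):               # a full utterance was recognized
        print(json.loads(rec.Result())["text"])

print(json.loads(rec.FinalResult())["text"])   # flush any trailing audio
```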
#### wav2vec2 Consideration

**Facebook's Model:**
- Excellent accuracy, competitive with Whisper
- More complex setup and deployment
- Less containerized ecosystem
- **Recommendation:** Skip unless you have specific requirements
### Voice Activity Detection (VAD) Deep Dive

#### Wake Word Detection Systems

**Porcupine by Picovoice (Recommended)**
```python
# Porcupine integration example
import pvporcupine

# Built-in keywords work out of the box; custom wake words like 'hey-claude'
# are trained in the Picovoice Console and loaded via keyword_paths instead
keywords = ['hey-claude', 'computer', 'assistant']

porcupine = pvporcupine.create(
    access_key='your-access-key',
    keywords=keywords
)

while True:
    pcm = get_next_audio_frame()  # placeholder: 16kHz, 16-bit mono frame of porcupine.frame_length samples
    keyword_index = porcupine.process(pcm)
    if keyword_index >= 0:
        print(f"Wake word detected: {keywords[keyword_index]}")
        # Trigger STT pipeline
```
**Technical Requirements:**
- Continuous audio monitoring
- Low CPU usage (< 1% on modern CPUs)
- Custom wake word training available
- Privacy: all processing local

**Alternative: Snowboy (Open Source)**
- No longer actively maintained
- Still functional for basic wake words
- Completely free and local
- Lower accuracy than Porcupine
#### Push-to-Talk Implementations

**Hardware Button Integration:**
```python
# GPIO button on a Raspberry Pi
import RPi.GPIO as GPIO

BUTTON_PIN = 18

GPIO.setmode(GPIO.BCM)  # use BCM pin numbering
GPIO.setup(BUTTON_PIN, GPIO.IN, pull_up_down=GPIO.PUD_UP)

def button_callback(channel):
    if GPIO.input(channel) == GPIO.LOW:   # button pressed (pulled to ground)
        start_recording()
    else:                                 # button released
        stop_recording_and_process()

GPIO.add_event_detect(BUTTON_PIN, GPIO.BOTH, callback=button_callback, bouncetime=50)
```
**Mobile App Integration:**
- Home Assistant mobile app can trigger automations
- Custom webhook endpoints (see the sketch after this list)
- WebSocket connections for real-time triggers
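As a concrete illustration of the webhook route, the sketch below POSTs to a Home Assistant webhook from a phone shortcut or any script. The webhook ID `voice_ptt` and the payload fields are assumptions; a matching automation with a webhook trigger would need to exist in Home Assistant first.

```python
# Hypothetical push-to-talk trigger via an HA webhook; webhook ID and URL are
# placeholders. HA webhooks accept unauthenticated POSTs by design, so keep
# them reachable from the local network only.
import requests

HA_URL = "http://homeassistant.local:8123"

def trigger_push_to_talk(room: str) -> None:
    requests.post(
        f"{HA_URL}/api/webhook/voice_ptt",
        json={"room": room},
        timeout=5,
    )

trigger_push_to_talk("living_room")
```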
### Claude Code Integration Architecture

#### API Bridge Service Implementation

**Service Architecture:**
```python
# voice-bridge service structure
import json
import os

from anthropic import AsyncAnthropic  # async client, since the calls below are awaited
from fastapi import FastAPI

import homeassistant_api
import whisper_client

app = FastAPI()

class VoiceBridge:
    def __init__(self):
        self.claude = AsyncAnthropic(api_key=os.getenv('CLAUDE_API_KEY'))
        self.ha = homeassistant_api.Client(
            url=os.getenv('HA_URL'),
            token=os.getenv('HA_TOKEN')
        )
        self.whisper = whisper_client.Client(os.getenv('WHISPER_URL'))

    async def process_audio(self, audio_data):
        # Step 1: Convert audio to text
        transcript = await self.whisper.transcribe(audio_data)

        # Step 2: Send to Claude for interpretation
        ha_context = await self.get_ha_context()
        claude_response = await self.claude.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"""
                Convert this voice command to Home Assistant API calls:
                Command: "{transcript}"

                Available entities: {ha_context}

                Return JSON format for HA API calls.
                """
            }]
        )

        # Step 3: Execute HA commands
        commands = json.loads(claude_response.content[0].text)
        results = []
        for cmd in commands:
            result = await self.ha.call_service(**cmd)
            results.append(result)

        return {
            'transcript': transcript,
            'commands': commands,
            'results': results
        }

    async def get_ha_context(self):
        # Get the current state of all entities, grouped by domain
        states = await self.ha.get_states()
        return {
            'lights': [e for e in states if e['entity_id'].startswith('light.')],
            'sensors': [e for e in states if e['entity_id'].startswith('sensor.')],
            'switches': [e for e in states if e['entity_id'].startswith('switch.')],
            # ... other entity types
        }
```
#### Command Translation Patterns

**Direct Device Commands:**
```json
{
  "speech": "Turn on the living room lights",
  "claude_interpretation": {
    "intent": "light_control",
    "target": "light.living_room",
    "action": "turn_on"
  },
  "ha_api_call": {
    "service": "light.turn_on",
    "target": {"entity_id": "light.living_room"}
  }
}
```

**Scene Activation:**
```json
{
  "speech": "Set movie mode",
  "claude_interpretation": {
    "intent": "scene_activation",
    "scene": "movie_mode"
  },
  "ha_api_call": {
    "service": "scene.turn_on",
    "target": {"entity_id": "scene.movie_mode"}
  }
}
```

**Complex Logic:**
```json
{
  "speech": "Turn on lights in occupied rooms",
  "claude_interpretation": {
    "intent": "conditional_light_control",
    "condition": "occupancy_detected",
    "action": "turn_on_lights"
  },
  "ha_api_calls": [
    {
      "service": "light.turn_on",
      "target": {"entity_id": "light.bedroom"},
      "condition": "binary_sensor.bedroom_occupancy == 'on'"
    },
    {
      "service": "light.turn_on",
      "target": {"entity_id": "light.living_room"},
      "condition": "binary_sensor.living_room_occupancy == 'on'"
    }
  ]
}
```
#### Context Management Strategy

**Conversation Memory:**
```python
from datetime import datetime

class ConversationContext:
    def __init__(self):
        self.history = []
        self.context_window = 10  # Keep last 10 interactions

    def add_interaction(self, speech, response, timestamp):
        self.history.append({
            'speech': speech,
            'response': response,
            'timestamp': timestamp,
            'ha_state_snapshot': self.capture_ha_state()
        })

        # Maintain sliding window
        if len(self.history) > self.context_window:
            self.history.pop(0)

    def get_context_for_claude(self):
        return {
            'recent_commands': self.history[-3:],
            'current_time': datetime.now(),
            'house_state': self.get_current_house_state()
        }
```
**Ambiguity Resolution:**
```python
# Handle ambiguous commands
KNOWN_ROOMS = ['living room', 'bedroom', 'kitchen']

def resolve_ambiguity(transcript, available_entities):
    specific_room_mentioned = any(room in transcript.lower() for room in KNOWN_ROOMS)
    if "lights" in transcript.lower() and not specific_room_mentioned:
        return {
            'type': 'clarification_needed',
            'message': 'Which lights? I can control: living room, bedroom, kitchen',
            'options': ['light.living_room', 'light.bedroom', 'light.kitchen']
        }
```
### Home Assistant Integration Patterns

#### API Authentication & Security
```python
# Secure API setup
import os

HA_CONFIG = {
    'url': 'http://homeassistant:8123',
    'token': os.getenv('HA_LONG_LIVED_TOKEN'),  # Never hardcode tokens
    'ssl_verify': True,  # In production
    'timeout': 10
}

# Create a long-lived access token in HA:
# your user Profile -> Security -> Long-lived access tokens
```
#### WebSocket Integration for Real-time Updates
```python
import json

import websockets

async def ha_websocket_listener():
    uri = "ws://homeassistant:8123/api/websocket"

    async with websockets.connect(uri) as websocket:
        # HA sends an auth_required message first; reply with the token
        await websocket.recv()
        await websocket.send(json.dumps({
            'type': 'auth',
            'access_token': HA_TOKEN
        }))
        await websocket.recv()  # expect auth_ok before issuing commands

        # Subscribe to state changes
        await websocket.send(json.dumps({
            'id': 1,
            'type': 'subscribe_events',
            'event_type': 'state_changed'
        }))

        async for message in websocket:
            data = json.loads(message)
            if data.get('type') == 'event':
                # Process state changes for voice responses
                await process_state_change(data['event'])
```
#### Voice Response Integration
```python
# HA TTS integration for voice feedback
async def speak_response(message, entity_id='media_player.living_room'):
    await ha_client.call_service(
        'tts', 'speak',
        target={'entity_id': entity_id},
        service_data={
            'message': message,
            'language': 'en',
            'options': {'voice': 'neural'}
        }
    )

# Usage examples:
await speak_response("Living room lights turned on")
await speak_response("I couldn't find that device. Please be more specific.")
await speak_response("Movie mode activated. Enjoy your film!")
```

### Hardware & Deployment Considerations

#### Microphone Hardware Analysis

**USB Microphones (Recommended for testing):**
- Blue Yeti: Excellent quality, multiple pickup patterns
- Audio-Technica ATR2100x-USB: Professional quality
- Samson Go Mic: Compact, budget-friendly

**Professional Audio Interfaces:**
- Focusrite Scarlett Solo: Single input, professional quality
- Behringer U-Phoria UM2: Budget 2-input interface
- PreSonus AudioBox USB 96: Mid-range option

**Raspberry Pi Integration:**
```bash
# ReSpeaker HAT for Raspberry Pi
# Provides 2-4 microphone array with hardware VAD
# I2S connection, low latency
# Built-in LED ring for visual feedback

# GPIO microphone setup
sudo apt install python3-pyaudio
# Configure ALSA for USB microphones
```

**Network Microphone Distribution:**
```python
# Distributed microphone system
MICROPHONE_NODES = {
    'living_room': 'http://pi-living:8080',
    'bedroom': 'http://pi-bedroom:8080',
    'kitchen': 'http://pi-kitchen:8080'
}

# Each Pi runs a lightweight audio capture service and
# sends audio to central Whisper processing
```
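To make the distributed setup concrete, here is a hedged sketch of what a per-room capture node could look like. The `sounddevice`/`soundfile` libraries, the bridge URL, and the `/audio` endpoint are illustrative assumptions, not part of the architecture above.

```python
# Hypothetical capture node for one room; library choices, URL, and endpoint
# are placeholders for illustration only.
import io

import requests
import sounddevice as sd
import soundfile as sf

BRIDGE_URL = "http://voice-bridge:8080/audio"
SAMPLE_RATE = 16000

def capture_and_forward(seconds: float = 4.0, room: str = "kitchen") -> None:
    # Record a short mono clip at 16 kHz (what Whisper and Porcupine expect)
    audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="int16")
    sd.wait()  # block until the recording finishes

    # Encode as WAV in memory and forward to the central voice-bridge
    buf = io.BytesIO()
    sf.write(buf, audio, SAMPLE_RATE, format="WAV")
    buf.seek(0)
    requests.post(
        BRIDGE_URL,
        files={"audio": ("clip.wav", buf, "audio/wav")},
        data={"room": room},
        timeout=10,
    )
```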
#### GPU Acceleration Setup

**NVIDIA GPU Configuration:**
```yaml
# Docker Compose GPU configuration
whisper-gpu:
  image: whisper-gpu:latest
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
  environment:
    - NVIDIA_VISIBLE_DEVICES=all
```

**Performance Benchmarks (estimated):**
- **CPU Only (8-core):**
  - Whisper base: ~500ms latency
  - Whisper large: ~2000ms latency
- **With GPU (GTX 1660+):**
  - Whisper base: ~150ms latency
  - Whisper large: ~400ms latency
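Because these numbers are estimates, it is worth measuring on your own hardware. A minimal sketch against the Whisper webservice from the earlier API example (endpoint and clip name match that example; the number of runs is arbitrary):

```python
# Rough end-to-end latency measurement; results depend on hardware, model
# size, and clip length.
import time

import requests

def measure_latency(path: str, runs: int = 5) -> float:
    timings = []
    for _ in range(runs):
        with open(path, "rb") as f:
            start = time.monotonic()
            requests.post(
                "http://localhost:9000/asr",
                files={"audio_file": f},
                data={"task": "transcribe", "language": "english"},
                timeout=60,
            )
        timings.append(time.monotonic() - start)
    return sum(timings) / len(timings)

print(f"Average transcription latency: {measure_latency('command.wav'):.3f}s")
```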
#### Container Orchestration Strategy

**Complete Docker Compose Stack:**
```yaml
version: '3.8'

services:
  # Core Home Assistant
  homeassistant:
    container_name: homeassistant
    image: ghcr.io/home-assistant/home-assistant:stable
    volumes:
      - ./ha-config:/config
      - /etc/localtime:/etc/localtime:ro
    restart: unless-stopped
    # With host networking, other containers reach HA via the Docker host's IP,
    # not the "homeassistant" service name
    network_mode: host

  # Speech-to-Text Engine
  whisper:
    container_name: whisper-stt
    image: onerahmet/openai-whisper-asr-webservice:latest
    ports:
      - "9000:9000"
    environment:
      - ASR_MODEL=base
      - ASR_ENGINE=openai_whisper
    volumes:
      - whisper-models:/root/.cache/whisper
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # Voice Processing Bridge
  voice-bridge:
    container_name: voice-bridge
    build:
      context: ./voice-bridge
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      - CLAUDE_API_KEY=${CLAUDE_API_KEY}
      - HA_URL=http://homeassistant:8123  # use the host IP while HA runs with network_mode: host
      - HA_TOKEN=${HA_LONG_LIVED_TOKEN}
      - WHISPER_URL=http://whisper:9000
      - PORCUPINE_ACCESS_KEY=${PORCUPINE_ACCESS_KEY}
    volumes:
      - ./voice-bridge-config:/app/config
      - /dev/snd:/dev/snd  # Audio device access
    depends_on:
      - homeassistant
      - whisper
    restart: unless-stopped
    privileged: true  # For audio device access

  # Optional: MQTT for device communication
  mosquitto:
    container_name: mqtt-broker
    image: eclipse-mosquitto:latest
    ports:
      - "1883:1883"
      - "9001:9001"
    volumes:
      - ./mosquitto:/mosquitto
    restart: unless-stopped

  # Optional: Node-RED for visual automation
  node-red:
    container_name: node-red
    image: nodered/node-red:latest
    ports:
      - "1880:1880"
    volumes:
      - node-red-data:/data
    restart: unless-stopped

volumes:
  whisper-models:
  node-red-data:
```
### Privacy & Security Deep Analysis

#### Data Flow Security Model

**Audio Data Privacy:**
```
[Microphone] → [Local VAD] → [Local STT] → [Text Only] → [Claude API]
      ↓              ↓             ↓             ↓
 Never leaves   Never leaves  Never leaves   Encrypted,
  local net      local net     local net    HTTPS only
```

**Security Boundaries:**
1. **Audio Capture Layer:** Hardware → Local processing only
2. **Speech Recognition:** Local Whisper → No cloud STT
3. **Command Interpretation:** Text-only to Claude Code API
4. **Automation Execution:** Local Home Assistant only
#### Network Security Configuration

**Firewall Rules:**
```bash
# Only allow outbound HTTPS to the Claude API endpoint
# (hostname-based iptables rules resolve to IPs when inserted, so refresh them
# if the API's addresses change)
iptables -A OUTPUT -p tcp --dport 443 -d api.anthropic.com -j ACCEPT
iptables -A OUTPUT -p tcp --dport 443 -j DROP  # Block other outbound HTTPS

# Block all other outbound traffic from the voice containers
iptables -A OUTPUT -s voice-bridge-ip -j DROP
```
**API Key Security:**
```bash
# Environment variable best practices
echo "CLAUDE_API_KEY=your-key-here" >> .env
echo "HA_LONG_LIVED_TOKEN=your-token-here" >> .env
chmod 600 .env

# Container secrets mounting
docker run --env-file .env voice-bridge:latest
```
#### Privacy Controls Implementation

**Audio Retention Policy:**
```python
import gc

class AudioPrivacyManager:
    def __init__(self):
        self.max_retention_seconds = 5      # Keep audio only during processing
        self.transcript_retention_days = 7  # Keep transcripts short-term

    async def process_audio(self, audio_data):
        try:
            transcript = await self.stt_engine.transcribe(audio_data)
            # Process immediately
            result = await self.process_command(transcript)

            # Store transcript with expiration
            await self.store_transcript(transcript, expires_in=7*24*3600)

            return result
        finally:
            # Always delete audio data immediately
            del audio_data
            gc.collect()
```
**User Consent & Controls:**
```python
# Voice system controls in Home Assistant
VOICE_CONTROLS = {
    'input_boolean.voice_system_enabled': 'Global voice control toggle',
    'input_boolean.voice_learning_mode': 'Allow transcript storage for improvement',
    'input_select.voice_privacy_level': ['minimal', 'standard', 'enhanced'],
    'button.clear_voice_history': 'Clear all stored transcripts'
}
```
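One way the bridge could honor the global toggle is sketched below, reusing the `get_states()` call from the bridge service earlier; it assumes the `input_boolean.voice_system_enabled` helper already exists in Home Assistant.

```python
# Sketch: skip audio processing while input_boolean.voice_system_enabled is off.
async def voice_enabled(ha) -> bool:
    states = await ha.get_states()
    toggle = next(
        (s for s in states if s['entity_id'] == 'input_boolean.voice_system_enabled'),
        None,
    )
    return bool(toggle) and toggle['state'] == 'on'
```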
### Advanced Features & Future Expansion

#### Multi-Room Microphone Network

**Distributed Audio Architecture:**
```python
# Central coordinator service
import asyncio

class MultiRoomVoiceCoordinator:
    def __init__(self):
        self.microphone_nodes = {
            'living_room': MicrophoneNode('192.168.1.101'),
            'bedroom': MicrophoneNode('192.168.1.102'),
            'kitchen': MicrophoneNode('192.168.1.103')
        }

    async def listen_all_rooms(self):
        # Listen on all nodes simultaneously; the first room to hear a command wins
        tasks = [asyncio.create_task(node.listen())
                 for node in self.microphone_nodes.values()]
        done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()

        # Process audio from the first responding room
        audio_data, source_room = done.pop().result()
        return await self.process_with_context(audio_data, source_room)

    async def process_with_context(self, audio_data, room):
        # Add room context to Claude processing
        transcript = await self.stt.transcribe(audio_data)

        claude_prompt = f"""
        Voice command from {room}: "{transcript}"

        Room-specific devices available:
        {self.get_room_devices(room)}

        Convert to Home Assistant API calls.
        """
```
**Room-Aware Processing:**
```python
def get_room_devices(self, room):
    """Return devices specific to the source room"""
    room_entities = {
        'living_room': [
            'light.living_room_ceiling',
            'media_player.living_room_tv',
            'climate.living_room_thermostat'
        ],
        'bedroom': [
            'light.bedroom_bedside',
            'switch.bedroom_fan',
            'binary_sensor.bedroom_window'
        ]
    }
    return room_entities.get(room, [])
```

#### Context-Aware Conversations

**Advanced Context Management:**
```python
from datetime import datetime

class AdvancedContextManager:
    def __init__(self):
        self.conversation_sessions = {}
        self.house_state_history = []

    def create_claude_context(self, user_id, transcript):
        session = self.get_or_create_session(user_id)

        context = {
            'transcript': transcript,
            'conversation_history': session.history[-5:],
            'current_time': datetime.now().isoformat(),
            'house_state': {
                'lights_on': self.get_lights_status(),
                'occupancy': self.get_occupancy_status(),
                'weather': self.get_weather(),
                'recent_events': self.get_recent_ha_events(minutes=15)
            },
            'user_preferences': self.get_user_preferences(user_id),
            'location_context': self.get_location_context()
        }

        return self.format_claude_prompt(context)

    def format_claude_prompt(self, context):
        return f"""
        You are controlling a Home Assistant smart home system via voice commands.

        Current situation:
        - Time: {context['current_time']}
        - House state: {context['house_state']}
        - Recent conversation: {context['conversation_history']}

        User said: "{context['transcript']}"

        Convert this to Home Assistant API calls. Consider:
        1. Current device states (don't turn on lights that are already on)
        2. Time of day (different responses for morning vs night)
        3. Recent conversation context
        4. User's typical preferences

        Respond with a JSON array of Home Assistant service calls.
        """
```
#### Voice Response & Feedback Systems

**Advanced TTS Integration:**
```python
class VoiceResponseManager:
    def __init__(self):
        # Service names depend on which TTS platforms are configured in HA;
        # the 'tts' domain is passed separately in call_service below
        self.tts_engines = {
            'neural': 'cloud_say',   # High quality
            'local': 'piper_say',    # Local processing
            'espeak': 'espeak_say'   # Fallback
        }

    async def respond_with_voice(self, message, room=None, urgency='normal'):
        # Select an appropriate TTS engine based on context
        tts_engine = self.select_tts_engine(urgency)

        # Select speakers based on room or system state
        speakers = self.select_speakers(room)

        # Format message for natural speech
        speech_message = self.format_for_speech(message)

        # Send to the selected speakers
        for speaker in speakers:
            await self.ha_client.call_service(
                'tts', tts_engine,
                target={'entity_id': speaker},
                service_data={
                    'message': speech_message,
                    'options': {
                        'voice': 'neural2-en-us-standard-a',
                        'speed': 1.0,
                        'pitch': 0.0
                    }
                }
            )

    def format_for_speech(self, message):
        """Convert technical responses to natural speech"""
        replacements = {
            'light.living_room': 'living room lights',
            'switch.bedroom_fan': 'bedroom fan',
            'climate.main_thermostat': 'thermostat',
            'scene.movie_mode': 'movie mode'
        }

        for tech_term, natural_term in replacements.items():
            message = message.replace(tech_term, natural_term)

        return message
```
**Visual Feedback Integration:**
```python
# LED ring feedback on the microphone nodes (Adafruit CircuitPython NeoPixel)
import board
import neopixel

class MicrophoneVisualFeedback:
    def __init__(self, led_pin_count=12):
        self.leds = neopixel.NeoPixel(board.D18, led_pin_count)

    def show_listening(self):
        # Blue pulsing pattern
        self.animate_pulse(color=(0, 0, 255))

    def show_processing(self):
        # Spinning orange pattern
        self.animate_spin(color=(255, 165, 0))

    def show_success(self):
        # Green flash
        self.flash(color=(0, 255, 0), duration=1.0)

    def show_error(self):
        # Red flash
        self.flash(color=(255, 0, 0), duration=2.0)
```
### Performance Optimization Strategies

#### Caching & Response Time Optimization

**STT Model Caching:**
```python
class WhisperModelCache:
    def __init__(self):
        self.models = {}
        self.model_locks = {}

    async def get_model(self, model_size='base'):
        if model_size not in self.models:
            if model_size not in self.model_locks:
                self.model_locks[model_size] = asyncio.Lock()

            async with self.model_locks[model_size]:
                if model_size not in self.models:
                    self.models[model_size] = whisper.load_model(model_size)

        return self.models[model_size]
```

**Command Pattern Caching:**
```python
class CommandCache:
    def __init__(self, ttl_seconds=300):  # 5 minute TTL
        self.cache = {}
        self.ttl = ttl_seconds

    def get_cached_response(self, transcript_hash):
        if transcript_hash in self.cache:
            entry = self.cache[transcript_hash]
            if time.time() - entry['timestamp'] < self.ttl:
                return entry['response']
        return None

    def cache_response(self, transcript_hash, response):
        self.cache[transcript_hash] = {
            'response': response,
            'timestamp': time.time()
        }
```

#### Resource Management

**Memory Management:**
```python
class ResourceManager:
    def __init__(self):
        self.max_memory_usage = 4 * 1024**3  # 4GB limit

    async def process_with_memory_management(self, audio_data):
        initial_memory = self.get_memory_usage()

        try:
            if initial_memory > self.max_memory_usage * 0.8:
                await self.cleanup_memory()

            result = await self.process_audio(audio_data)
            return result

        finally:
            # Force garbage collection after each request
            gc.collect()

    async def cleanup_memory(self):
        # Clear caches, unload unused models
        self.command_cache.clear()
        self.whisper_cache.clear_unused()
        gc.collect()
```

### Error Handling & Reliability

#### Comprehensive Error Recovery

**STT Failure Handling:**
```python
class RobustSTTProcessor:
    def __init__(self):
        self.stt_engines = [
            WhisperSTT(model='base'),
            VoskSTT(),
            DeepSpeechSTT()  # Fallback options
        ]

    async def transcribe_with_fallbacks(self, audio_data):
        for i, engine in enumerate(self.stt_engines):
            try:
                transcript = await engine.transcribe(audio_data)
                if self.validate_transcript(transcript):
                    return transcript
            except Exception as e:
                logger.warning(f"STT engine {i} failed: {e}")
                if i == len(self.stt_engines) - 1:
                    raise
                continue

    def validate_transcript(self, transcript):
        # Basic validation rules
        if len(transcript.strip()) < 3:
            return False
        if transcript.count('?') > len(transcript) / 4:  # Too much uncertainty
            return False
        return True
```
**Claude API Failure Handling:**
```python
class RobustClaudeClient:
    def __init__(self):
        self.client = anthropic.AsyncAnthropic()
        self.fallback_patterns = self.load_fallback_patterns()

    async def process_command_with_fallback(self, transcript):
        try:
            # Attempt Claude processing
            response = await self.client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=1024,
                messages=[{"role": "user", "content": self.create_prompt(transcript)}],
                timeout=10.0
            )
            return json.loads(response.content[0].text)

        except (anthropic.APIError, asyncio.TimeoutError) as e:
            logger.warning(f"Claude API failed: {e}")

            # Attempt local pattern matching as a fallback
            return self.fallback_command_processing(transcript)

    def fallback_command_processing(self, transcript):
        """Simple pattern matching for basic commands when Claude is unavailable"""
        transcript = transcript.lower()

        # Basic light controls
        if 'turn on' in transcript and 'light' in transcript:
            room = self.extract_room(transcript)
            return [{
                'service': 'light.turn_on',
                'target': {'entity_id': f'light.{room or "all"}'}
            }]

        # Basic switch controls
        if 'turn off' in transcript and ('light' in transcript or 'switch' in transcript):
            room = self.extract_room(transcript)
            return [{
                'service': 'light.turn_off',
                'target': {'entity_id': f'light.{room or "all"}'}
            }]

        # Scene activation
        scenes = ['movie', 'bedtime', 'morning', 'evening']
        for scene in scenes:
            if scene in transcript:
                return [{
                    'service': 'scene.turn_on',
                    'target': {'entity_id': f'scene.{scene}'}
                }]

        # If no patterns match, return an error
        return [{'error': 'Command not recognized in offline mode'}]
```
#### Health Monitoring & Diagnostics

**System Health Monitoring:**
```python
class VoiceSystemHealthMonitor:
    def __init__(self):
        self.health_checks = {
            'whisper_api': self.check_whisper_health,
            'claude_api': self.check_claude_health,
            'homeassistant_api': self.check_ha_health,
            'microphone_nodes': self.check_microphone_health
        }

    async def run_health_checks(self):
        results = {}

        for service, check_func in self.health_checks.items():
            try:
                results[service] = await check_func()
            except Exception as e:
                results[service] = {
                    'status': 'unhealthy',
                    'error': str(e),
                    'timestamp': datetime.now().isoformat()
                }

        return results

    async def check_whisper_health(self):
        start_time = time.time()
        async with aiohttp.ClientSession() as session:
            async with session.get('http://whisper:9000/health') as response:
                latency = time.time() - start_time
                status = response.status

        return {
            'status': 'healthy' if status == 200 else 'unhealthy',
            'latency_ms': int(latency * 1000),
            'timestamp': datetime.now().isoformat()
        }

    # Similar checks for the other services...
```
**Automated Recovery Actions:**
```python
class AutoRecoveryManager:
    def __init__(self):
        self.recovery_actions = {
            'whisper_unhealthy': self.restart_whisper_service,
            'high_memory_usage': self.cleanup_resources,
            'claude_rate_limited': self.enable_fallback_mode,
            'microphone_disconnected': self.reinitialize_audio
        }

    async def handle_health_issue(self, issue_type, details):
        if issue_type in self.recovery_actions:
            logger.info(f"Attempting recovery for {issue_type}")
            await self.recovery_actions[issue_type](details)
        else:
            logger.error(f"No recovery action for {issue_type}")
            await self.alert_administrators(issue_type, details)
```

### Testing & Validation Strategies

#### Audio Processing Testing

**STT Accuracy Testing:**
```python
class STTAccuracyTester:
    def __init__(self):
        self.test_phrases = [
            "Turn on the living room lights",
            "Set the temperature to 72 degrees",
            "Activate movie mode",
            "What's the weather like outside",
            "Turn off all lights",
            "Lock all doors"
        ]

    async def run_accuracy_tests(self, stt_engine):
        results = []

        for phrase in self.test_phrases:
            # Generate synthetic audio from the phrase
            audio_data = await self.text_to_speech(phrase)

            # Test STT accuracy
            transcript = await stt_engine.transcribe(audio_data)

            accuracy = self.calculate_word_accuracy(phrase, transcript)
            results.append({
                'original': phrase,
                'transcript': transcript,
                'accuracy': accuracy
            })

        return results

    def calculate_word_accuracy(self, reference, hypothesis):
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()

        # Simple per-position word accuracy (a rough proxy for word error rate)
        correct = sum(1 for r, h in zip(ref_words, hyp_words) if r == h)
        return correct / len(ref_words) if ref_words else 0
```
#### End-to-End Integration Testing

**Complete Pipeline Testing:**
```python
class E2ETestSuite:
    def __init__(self):
        self.test_scenarios = [
            {
                'name': 'basic_light_control',
                'audio_file': 'tests/audio/turn_on_lights.wav',
                'expected_ha_calls': [
                    {'service': 'light.turn_on', 'target': {'entity_id': 'light.living_room'}}
                ]
            },
            {
                'name': 'complex_scene_activation',
                'audio_file': 'tests/audio/movie_mode.wav',
                'expected_ha_calls': [
                    {'service': 'scene.turn_on', 'target': {'entity_id': 'scene.movie'}}
                ]
            }
        ]

    async def run_full_pipeline_tests(self):
        results = []

        for scenario in self.test_scenarios:
            result = await self.test_scenario(scenario)
            results.append(result)

        return results

    async def test_scenario(self, scenario):
        # Load test audio
        with open(scenario['audio_file'], 'rb') as f:
            audio_data = f.read()

        # Run through the complete pipeline
        try:
            actual_calls = await self.voice_bridge.process_audio(audio_data)

            # Compare with expected results
            match = self.compare_ha_calls(
                scenario['expected_ha_calls'],
                actual_calls['commands']
            )

            return {
                'scenario': scenario['name'],
                'success': match,
                'expected': scenario['expected_ha_calls'],
                'actual': actual_calls['commands']
            }

        except Exception as e:
            return {
                'scenario': scenario['name'],
                'success': False,
                'error': str(e)
            }
```
## Implementation Timeline & Milestones

### Phase 1: Foundation (Weeks 1-2)
**Goals:**
- Home Assistant stable deployment
- Basic container infrastructure
- Initial device integration

**Success Criteria:**
- HA accessible and controlling existing devices
- Container stack running reliably
- Basic automations working

### Phase 2: Core Voice System (Weeks 3-4)
**Goals:**
- Whisper STT deployment
- Basic voice-bridge service
- Simple command processing

**Success Criteria:**
- Speech-to-text working with test audio files
- Claude Code API integration functional
- Basic "turn on lights" commands working

### Phase 3: Production Features (Weeks 5-6)
**Goals:**
- Wake word detection
- Multi-room microphone support
- Advanced error handling

**Success Criteria:**
- Hands-free operation with wake words
- Reliable operation across multiple rooms
- Graceful failure modes working

### Phase 4: Optimization & Polish (Weeks 7-8)
**Goals:**
- Performance optimization
- Advanced context awareness
- Visual/audio feedback systems

**Success Criteria:**
- Sub-500ms response times
- Context-aware conversations
- Family-friendly operation

## Cost Analysis

### Hardware Costs
- **Microphones:** $50-200 per room
- **Processing Hardware:** Covered by existing Proxmox setup
- **Additional Storage:** ~50GB for models and logs

### Service Costs
- **Claude Code API:** ~$0.01-0.10 per command (depending on context size)
- **Porcupine Wake Words:** $0.50-2.00 per month per wake word
- **No cloud STT costs** (fully local)

### Estimated Monthly Operating Costs
- **Light Usage (10 commands/day):** ~$3-10/month
- **Heavy Usage (50 commands/day):** ~$15-50/month
- **Wake word licensing:** ~$2-5/month
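A quick back-of-the-envelope check of the heavy-usage range, using assumed mid-range inputs from the Service Costs list above rather than measured values:

```python
# Worked example of the "heavy usage" estimate; all inputs are assumptions
# drawn from the ranges above.
commands_per_day = 50
cost_per_command = 0.02      # USD, mid-range of the $0.01-0.10 estimate
wake_word_license = 2.00     # USD per month

monthly = commands_per_day * 30 * cost_per_command + wake_word_license
print(f"Estimated monthly cost: ${monthly:.2f}")  # -> $32.00 with these inputs
```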
## Conclusion & Next Steps

This voice automation system represents a cutting-edge approach to local smart home control, combining the latest in speech recognition with advanced AI interpretation. The architecture prioritizes privacy, reliability, and extensibility while maintaining the local-only operation you want.

**Key Success Factors:**
1. **Proven Technology Stack:** Whisper + Claude Code + Home Assistant
2. **Privacy-First Design:** Audio never leaves the local network
3. **Flexible Architecture:** Easy to extend and customize
4. **Reliable Fallbacks:** Multiple failure recovery mechanisms

**Recommended Implementation Approach:**
1. Start with the Home Assistant foundation
2. Add voice components incrementally
3. Test thoroughly at each phase
4. Optimize for your specific use patterns

The combination of your technical expertise, existing infrastructure, and this architecture plan sets you up to build a truly advanced, private, and powerful voice-controlled smart home.

This system provides the advanced automation capabilities that Apple Home lacks while preserving the local control and privacy that drove your original interest in Home Assistant. Adding Claude Code as the natural language processing layer bridges the gap between human speech and technical automation in a way that would be extremely difficult to achieve with traditional rule-based systems.