Product Requirements Document: Local Voice Server
Version: 1.0
Date: 2025-12-18
Author: Atlas (Principal Software Architect)
Project: Local HTTP Voice Server for Text-to-Speech
Table of Contents
- Executive Summary
- Goals and Non-Goals
- Technical Requirements
- System Architecture
- API Specification
- TTS Engine Analysis
- Web Framework Selection
- Audio Playback Strategy
- Error Handling Strategy
- Implementation Checklist
- Testing Strategy
- Future Considerations
Executive Summary
Project Overview
This project delivers a local HTTP service that accepts POST requests containing text strings and converts them to speech through the computer's speakers. The service will run locally on Linux (Nobara/Fedora 42), providing fast, offline text-to-speech capabilities without requiring external API calls or internet connectivity.
Success Metrics
- Response Time: TTS conversion and playback initiation within 200ms for short texts (< 100 characters)
- Reliability: 99.9% successful request handling under normal operating conditions
- Concurrency: Support for at least 5 concurrent TTS requests with proper queuing
- Audio Quality: Clear, intelligible speech output comparable to Google TTS quality
- Startup Time: Server ready to accept requests within 2 seconds of launch
Technical Stack
| Component | Technology | Justification |
|---|---|---|
| Web Framework | FastAPI | Async support, high performance (15k-20k req/s), automatic API documentation |
| TTS Engine | Piper TTS | Neural voice quality, offline, optimized for local inference, ONNX-based |
| Audio Playback | sounddevice | Cross-platform, Pythonic API, excellent NumPy integration, non-blocking playback |
| Package Manager | uv | Fast Python package management (user preference) |
| ASGI Server | Uvicorn | High-performance ASGI server, native FastAPI integration |
| Async Runtime | asyncio | Built-in Python async support for concurrent request handling |
Timeline Estimate
- Phase 1 - Core Implementation: 2-3 days (basic HTTP server + TTS integration)
- Phase 2 - Error Handling & Testing: 1-2 days (comprehensive error handling, unit tests)
- Phase 3 - Concurrency & Queue Management: 1-2 days (async queue, concurrent playback)
- Total Estimated Time: 4-7 days for production-ready v1.0
Resource Requirements
- Development: 1 full-stack Python developer with async programming experience
- Testing: Access to Linux environment (Nobara/Fedora 42) with audio hardware
- Infrastructure: Local development machine with 2+ CPU cores, 4GB+ RAM
Goals and Non-Goals
Goals
Primary Goals:
- Create a local HTTP service that accepts text via POST requests
- Convert text to speech using high-quality offline TTS
- Play audio through system speakers with minimal latency
- Support concurrent requests with proper queue management
- Provide comprehensive error handling and logging
- Maintain zero external dependencies (fully offline capable)
Secondary Goals:
- Automatic API documentation via FastAPI's built-in OpenAPI support
- Configurable TTS parameters (voice, speed, volume) via request parameters
- Health check endpoint for service monitoring
- Graceful handling of long-running text conversions
- Support for multiple voice models
Non-Goals
Explicitly Out of Scope:
- Cloud-based or external API integration
- Speech-to-text (STT) capabilities
- Audio file storage or retrieval
- User authentication or authorization
- Rate limiting or quota management
- Multi-language UI or web interface
- Real-time streaming audio synthesis
- Mobile app integration
- Persistent audio history or logging
- Advanced audio effects (reverb, pitch shifting, etc.)
Technical Requirements
Functional Requirements
FR1: HTTP Server
- FR1.1: Server SHALL listen on configurable host and port (default: 0.0.0.0:8888)
- FR1.2: Server SHALL accept POST requests to the /notify endpoint
- FR1.3: Server SHALL accept a JSON payload with a message field containing the text
- FR1.4: Server SHALL return HTTP 202 (Accepted) confirming the request was queued
- FR1.5: Server SHALL support CORS for local development
FR2: Text-to-Speech Conversion
- FR2.1: System SHALL convert text strings to audio using Piper TTS
- FR2.2: System SHALL support configurable voice models via request parameters
- FR2.3: System SHALL support adjustable speech rate (50-400 words per minute)
- FR2.4: System SHALL handle text inputs from 1 to 10,000 characters
- FR2.5: System SHALL use default voice if not specified in request
FR3: Audio Playback
- FR3.1: System SHALL play generated audio through default system audio output
- FR3.2: System SHALL support non-blocking audio playback
- FR3.3: System SHALL queue concurrent requests in FIFO order
- FR3.4: System SHALL allow configurable maximum queue size (default: 50)
- FR3.5: System SHALL provide feedback when queue is full
FR4: Configuration
- FR4.1: System SHALL support configuration via environment variables
- FR4.2: System SHALL support configuration via command-line arguments
- FR4.3: System SHALL provide sensible defaults for all configuration values
- FR4.4: System SHALL validate configuration at startup
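The FR4 requirements above can be sketched as follows. This is a hedged, illustrative layering of environment variables, CLI flags, and startup validation; the variable names (`VOICE_SERVER_*`) and flag names are assumptions, not part of the specification.

```python
import argparse
import os

def load_config(argv=None):
    """FR4 sketch: env vars set defaults, CLI flags override, values validated."""
    parser = argparse.ArgumentParser(description="Local voice server")
    parser.add_argument("--host",
                        default=os.environ.get("VOICE_SERVER_HOST", "0.0.0.0"))
    parser.add_argument("--port", type=int,
                        default=int(os.environ.get("VOICE_SERVER_PORT", "8888")))
    parser.add_argument("--max-queue-size", type=int,
                        default=int(os.environ.get("VOICE_SERVER_QUEUE_SIZE", "50")))
    config = parser.parse_args(argv)
    # FR4.4: fail fast at startup on invalid configuration
    if not 1 <= config.port <= 65535:
        raise ValueError(f"Invalid port: {config.port}")
    if config.max_queue_size < 1:
        raise ValueError(f"Invalid queue size: {config.max_queue_size}")
    return config
```

Running with `--port 9000` would override both the built-in default and any `VOICE_SERVER_PORT` value, satisfying FR4.1-FR4.3.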
FR5: Error Handling
- FR5.1: System SHALL return appropriate HTTP error codes for failures
- FR5.2: System SHALL log all errors with timestamps and context
- FR5.3: System SHALL continue operating after non-fatal errors
- FR5.4: System SHALL gracefully handle TTS engine failures
- FR5.5: System SHALL provide detailed error messages in responses
Non-Functional Requirements
NFR1: Performance
- NFR1.1: API response time SHALL be < 50ms (excluding TTS processing)
- NFR1.2: TTS conversion SHALL complete in < 2 seconds for 500 character texts
- NFR1.3: System SHALL handle 20+ requests per second without degradation
- NFR1.4: Memory usage SHALL remain < 500MB under normal load
- NFR1.5: CPU usage SHALL average < 30% during active TTS processing
NFR2: Reliability
- NFR2.1: System SHALL maintain 99.9% uptime during operation
- NFR2.2: System SHALL recover from audio device disconnections
- NFR2.3: System SHALL handle Out-of-Memory conditions gracefully
- NFR2.4: System SHALL log all critical errors for debugging
NFR3: Maintainability
- NFR3.1: Code SHALL maintain > 80% test coverage
- NFR3.2: All functions SHALL include docstrings with type hints
- NFR3.3: Code SHALL follow PEP 8 style guidelines
- NFR3.4: Dependencies SHALL be pinned to specific versions
- NFR3.5: README SHALL provide clear setup and usage instructions
NFR4: Security
- NFR4.1: System SHALL sanitize all text inputs to prevent injection attacks
- NFR4.2: System SHALL limit request payload size to 1MB
- NFR4.3: System SHALL not expose internal stack traces in API responses
- NFR4.4: System SHALL log all incoming requests for audit purposes
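The NFR4.2 payload cap can be expressed as a small check run before the body is parsed. In the real server this would live in FastAPI middleware (or a reverse proxy); the helper below is an illustrative sketch of the decision logic only.

```python
from typing import Optional

MAX_BODY_BYTES = 1 * 1024 * 1024  # NFR4.2: 1MB request payload limit

def payload_within_limit(content_length: Optional[str]) -> bool:
    """Return True if a request's declared Content-Length fits the 1MB cap."""
    if content_length is None:
        return True  # no declared length; must be enforced while streaming
    try:
        return int(content_length) <= MAX_BODY_BYTES
    except ValueError:
        return False  # malformed header: reject
```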
NFR5: Compatibility
- NFR5.1: System SHALL run on Linux (Nobara/Fedora 42)
- NFR5.2: System SHALL support Python 3.9+
- NFR5.3: System SHALL work with standard ALSA/PulseAudio setups
- NFR5.4: System SHALL be deployable as a systemd service
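NFR5.4 could be satisfied with a unit file along these lines. Paths, the service user, and the `uv run uvicorn` invocation are placeholders to adapt to the actual install location; because audio output typically needs access to the user's PulseAudio/PipeWire session, a user-level unit (`systemctl --user`) may be more appropriate than a system one.

```ini
[Unit]
Description=Local voice server (Piper TTS)
After=network.target sound.target

[Service]
# Paths, user, and module name are placeholders; adjust for the real install.
WorkingDirectory=/opt/voice-server
ExecStart=/usr/bin/uv run uvicorn voice_server:app --host 0.0.0.0 --port 8888
User=voiceuser
Restart=on-failure
RestartSec=5

[Install]
WantedBy=default.target
```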
System Architecture
High-Level Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Client Applications │
│ (AI Agents, Scripts, Other Services) │
└────────────────────────────┬────────────────────────────────────┘
│ HTTP POST /notify
│ JSON: {"message": "text"}
▼
┌─────────────────────────────────────────────────────────────────┐
│ FastAPI Web Server │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ /notify │ │ /health │ │ /docs │ │
│ │ endpoint │ │ endpoint │ │ (Swagger) │ │
│ └──────┬───────┘ └──────────────┘ └──────────────┘ │
│ │ │
│ │ Validates & Enqueues │
│ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Async Request Queue │ │
│ │ (asyncio.Queue with max size limit) │ │
│ └──────────────────┬───────────────────────────────┘ │
└────────────────────┬┼───────────────────────────────────────────┘
││
││ Background Task Processing
▼▼
┌─────────────────────────────────────────────────────────────────┐
│ TTS Processing Layer │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Piper TTS Engine │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Voice Models │ │ ONNX Runtime │ │ │
│ │ │ (.onnx + │ │ Inference │ │ │
│ │ │ .json) │ │ Engine │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────┬──────────────────────────┘ │
│ │ Generate WAV │
│ ▼ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ In-Memory Audio Buffer │ │
│ │ (NumPy array / bytes) │ │
│ └─────────────────────────┬──────────────────────────┘ │
└────────────────────────────┼───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Audio Playback Layer │
│ ┌────────────────────────────────────────────────────┐ │
│ │              sounddevice Stream Manager           │        │
│ │ - Callback-based playback │ │
│ │ - Non-blocking operation │ │
│ │ - Stream lifecycle management │ │
│ └─────────────────────────┬──────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ System Audio Output (ALSA/PulseAudio) │ │
│ └────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
🔊 Computer Speakers
Component Descriptions
1. FastAPI Web Server
Responsibilities:
- Accept and validate HTTP POST requests
- Provide automatic OpenAPI documentation
- Handle CORS configuration
- Route requests to appropriate handlers
- Return HTTP responses with appropriate status codes
Dependencies:
- FastAPI framework
- Uvicorn ASGI server
- Pydantic for request/response validation
2. Async Request Queue
Responsibilities:
- Queue incoming TTS requests in FIFO order
- Prevent queue overflow with configurable max size
- Enable asynchronous processing without blocking HTTP responses
- Provide queue status information
Implementation:
- asyncio.Queue for async-safe queuing
- Background task workers to process queue
- Queue metrics (size, processed count, errors)
3. TTS Processing Layer
Responsibilities:
- Load and manage Piper TTS voice models
- Convert text to audio waveforms
- Handle voice model selection
- Configure TTS parameters (rate, pitch, volume)
- Generate in-memory audio buffers
Implementation:
- Piper TTS Python bindings
- ONNX Runtime for model inference
- Voice model caching for performance
- Error handling for model loading failures
4. Audio Playback Layer
Responsibilities:
- Initialize audio output streams
- Play audio buffers through system speakers
- Support non-blocking playback
- Handle audio device errors
- Manage stream lifecycle
Implementation:
- sounddevice for cross-platform audio I/O
- Non-blocking sd.play() with background playback
- Simple NumPy array integration
- Graceful handling of audio device disconnections
Data Flow
Request Processing Flow:
1. HTTP Request Reception:
- Client sends POST to /notify with JSON payload
- FastAPI validates request schema using Pydantic models
- Request is immediately acknowledged with HTTP 202 (Accepted)
2. Request Enqueueing:
- Validated request is added to async queue
- If queue is full, return HTTP 503 (Service Unavailable)
- Queue position is logged for monitoring
3. Background Processing:
- Background worker retrieves request from queue
- Text is passed to Piper TTS for conversion
- Piper generates WAV audio in memory
4. Audio Playback:
- Audio buffer is passed to sounddevice
- sounddevice streams audio to system output
- Playback occurs in a background thread (non-blocking)
- Completion is logged
5. Error Handling:
- Errors at any stage are caught and logged
- Failed requests are removed from queue
- Error metrics are updated
Technology Stack Justification
FastAPI vs Flask
Decision: FastAPI
Rationale:
- Performance: FastAPI handles 15,000-20,000 req/s vs Flask's 2,000-3,000 req/s (Strapi Comparison)
- Async Native: Built on ASGI with native async/await support, critical for non-blocking TTS processing
- Type Safety: Pydantic integration provides automatic request validation and serialization
- Documentation: Automatic OpenAPI (Swagger) documentation generation
- Modern Architecture: Designed for microservices and high-concurrency applications
- Growing Adoption: 78k GitHub stars, 38% developer adoption in 2025 (40% YoY increase)
Trade-offs:
- Steeper learning curve compared to Flask
- Smaller ecosystem of extensions (though growing rapidly)
- Requires ASGI server (Uvicorn) vs Flask's built-in development server
Piper TTS Engine Selection
Decision: Piper TTS
Rationale:
- Voice Quality: Neural TTS with "Google TTS level quality" (AntiX Forum)
- Offline Operation: Fully local, no internet required
- Performance: Optimized for local inference using ONNX Runtime
- Resource Efficiency: Runs on Raspberry Pi 4, suitable for desktop Linux
- Easy Installation: Available via pip (pip install piper-tts)
- Active Development: Maintained project with 2025 updates
- Multiple Voices: Extensive voice model library with quality/speed trade-offs
Comparison with Alternatives:
| Engine | Voice Quality | Speed | Resource Usage | Offline | Ease of Use |
|---|---|---|---|---|---|
| Piper TTS | ⭐⭐⭐⭐⭐ Neural | ⭐⭐⭐⭐ Fast | ⭐⭐⭐⭐ Medium | ✅ Yes | ⭐⭐⭐⭐ Easy |
| pyttsx3 | ⭐⭐ Robotic | ⭐⭐⭐⭐⭐ Very Fast | ⭐⭐⭐⭐⭐ Very Low | ✅ Yes | ⭐⭐⭐⭐⭐ Very Easy |
| eSpeak | ⭐⭐ Robotic | ⭐⭐⭐⭐⭐ Very Fast | ⭐⭐⭐⭐⭐ Very Low | ✅ Yes | ⭐⭐⭐⭐ Easy |
| gTTS | ⭐⭐⭐⭐⭐ Neural | ⭐⭐ Slow | ⭐⭐⭐⭐ Low | ❌ No | ⭐⭐⭐⭐⭐ Very Easy |
| Coqui TTS | ⭐⭐⭐⭐⭐ Neural | ⭐⭐⭐ Medium | ⭐⭐ High | ✅ Yes | ⭐⭐ Complex |
Trade-offs:
- Larger model files (~20-100MB per voice) vs simple engines
- Higher resource usage than pyttsx3/eSpeak
- Requires ONNX Runtime dependency
sounddevice for Audio Playback
Decision: sounddevice
Rationale:
- Pythonic API: Clean, intuitive interface that feels native to Python
- NumPy Integration: Direct support for NumPy arrays (perfect for Piper TTS output)
- Non-Blocking: Simple sd.play() returns immediately, audio plays in background
- Cross-Platform: Works on Linux, Windows, macOS via PortAudio backend
- Active Maintenance: Well-maintained with regular updates
- Simple Async: Easy integration with asyncio via sd.wait() or callbacks
Comparison with Alternatives:
| Library | Non-Blocking | Dependencies | Maintenance | Linux Support |
|---|---|---|---|---|
| sounddevice | ✅ Native | PortAudio | ⭐⭐⭐⭐ Active | ✅ Excellent |
| PyAudio | ✅ Callbacks | PortAudio | ⭐⭐⭐ Active | ✅ Excellent |
| simpleaudio | ✅ Async | None | ❌ Archived | ⭐⭐⭐ Good |
| pygame | ⭐ Limited | SDL | ⭐⭐⭐⭐ Active | ⭐⭐⭐⭐ Excellent |
Why sounddevice over PyAudio:
- Simpler API - sd.play(audio, samplerate) vs PyAudio's stream setup
- Better NumPy support - no conversion needed from Piper's output
- More Pythonic - feels like a modern Python library
- Easier async integration - works naturally with asyncio
API Specification
Endpoint: POST /notify
Description: Accept text string and queue for TTS playback
Request Schema:
{
"message": "string (required)",
"voice": "string (optional)",
"rate": "integer (optional, default: 170)",
"voice_enabled": "boolean (optional, default: true)"
}
Request Parameters:
| Parameter | Type | Required | Default | Constraints | Description |
|---|---|---|---|---|---|
| message | string | Yes | - | 1-10000 chars | Text to convert to speech |
| voice | string | No | en_US-lessac-medium | Valid voice model name | Piper voice model to use |
| rate | integer | No | 170 | 50-400 | Speech rate in words per minute |
| voice_enabled | boolean | No | true | - | Enable/disable TTS (for debugging) |
Example Request:
curl -X POST http://localhost:8888/notify \
-H "Content-Type: application/json" \
-d '{
"message": "Hello, this is a test of the voice server",
"rate": 200,
"voice_enabled": true
}'
Response Schema (Success - 202 Accepted):
{
"status": "queued",
"message_length": 42,
"queue_position": 3,
"estimated_duration": 2.5,
"voice_model": "en_US-lessac-medium"
}
Response Schema (Error - 400 Bad Request):
{
"error": "validation_error",
"detail": "message field is required",
"timestamp": "2025-12-18T10:30:45.123Z"
}
Response Schema (Error - 503 Service Unavailable):
{
"error": "queue_full",
"detail": "TTS queue is full, please retry later",
"queue_size": 50,
"timestamp": "2025-12-18T10:30:45.123Z"
}
HTTP Status Codes:
| Code | Meaning | Scenario |
|---|---|---|
| 202 | Accepted | Request successfully queued for processing |
| 400 | Bad Request | Invalid request parameters or malformed JSON |
| 413 | Payload Too Large | Message exceeds 10,000 characters |
| 422 | Unprocessable Entity | Valid JSON but invalid parameter values |
| 500 | Internal Server Error | TTS engine failure or unexpected error |
| 503 | Service Unavailable | Queue is full or service is shutting down |
Endpoint: GET /health
Description: Health check endpoint for monitoring
Request: No parameters
Response Schema (Healthy - 200 OK):
{
"status": "healthy",
"uptime_seconds": 3600,
"queue_size": 2,
"queue_capacity": 50,
"tts_engine": "piper",
"audio_output": "available",
"voice_models_loaded": ["en_US-lessac-medium"],
"total_requests": 1523,
"failed_requests": 12,
"timestamp": "2025-12-18T10:30:45.123Z"
}
Response Schema (Unhealthy - 503 Service Unavailable):
{
"status": "unhealthy",
"errors": [
"Audio output device unavailable",
"TTS engine failed to initialize"
],
"timestamp": "2025-12-18T10:30:45.123Z"
}
Endpoint: GET /docs
Description: Automatic Swagger UI documentation (provided by FastAPI)
Access: http://localhost:8888/docs
Features:
- Interactive API testing
- Schema visualization
- Request/response examples
- Authentication testing (if implemented)
Endpoint: GET /voices
Description: List available TTS voice models
Request: No parameters
Response Schema (200 OK):
{
"voices": [
{
"name": "en_US-lessac-medium",
"language": "en_US",
"quality": "medium",
"size_mb": 63.5,
"installed": true
},
{
"name": "en_US-libritts-high",
"language": "en_US",
"quality": "high",
"size_mb": 108.2,
"installed": false
}
],
"default_voice": "en_US-lessac-medium"
}
TTS Engine Analysis
Detailed Comparison Matrix
| Engine | Voice Quality | Latency | CPU Usage | Memory | Offline | Linux Support | Python API | Maintenance |
|---|---|---|---|---|---|---|---|---|
| Piper TTS | ⭐⭐⭐⭐⭐ | ~500ms | Medium | ~200MB | ✅ | ✅ Excellent | ✅ Native | 🟢 Active |
| pyttsx3 | ⭐⭐ | ~100ms | Low | ~50MB | ✅ | ✅ Good | ✅ Native | 🟢 Active |
| eSpeak-ng | ⭐⭐ | ~50ms | Very Low | ~20MB | ✅ | ✅ Excellent | ⚠️ Wrapper | 🟢 Active |
| gTTS | ⭐⭐⭐⭐⭐ | ~2000ms | Low | ~30MB | ❌ | ✅ Good | ✅ Native | 🟢 Active |
| Coqui TTS | ⭐⭐⭐⭐⭐ | ~1500ms | High | ~500MB | ✅ | ✅ Good | ✅ Native | 🟡 Slow |
| Festival | ⭐⭐⭐ | ~300ms | Low | ~100MB | ✅ | ✅ Excellent | ⚠️ Wrapper | 🟡 Slow |
| Mimic3 | ⭐⭐⭐⭐ | ~800ms | Medium | ~300MB | ✅ | ✅ Good | ❌ HTTP only | 🟢 Active |
Detailed Engine Profiles
1. Piper TTS (RECOMMENDED)
Pros:
- Neural TTS with natural-sounding voices
- Optimized for local inference (ONNX Runtime)
- Multiple quality levels (low/medium/high)
- Extensive language and voice support
- Active development and community
- Easy pip installation
- GPU acceleration support (CUDA)
Cons:
- Larger model files (20-100MB per voice)
- Higher resource usage than simple engines
- Initial model download required
- Slightly higher latency than robotic engines
Installation:
uv pip install piper-tts
Usage Example:
from piper import PiperVoice
import wave
voice = PiperVoice.load("en_US-lessac-medium.onnx")
with wave.open("output.wav", "wb") as wav_file:
voice.synthesize("Hello world", wav_file)
Voice Quality Sample:
- Low Quality: Faster, smaller models (~20MB), decent quality
- Medium Quality: Balanced performance (~60MB), recommended default
- High Quality: Best quality (~100MB), slower inference
2. pyttsx3
Pros:
- Extremely lightweight and fast
- Cross-platform (Windows SAPI5, macOS NSSpeech, Linux eSpeak)
- Zero external dependencies
- Simple API
- No model downloads required
Cons:
- Robotic voice quality
- Limited voice customization
- Depends on system TTS engines
Installation:
uv pip install pyttsx3
Usage Example:
import pyttsx3
engine = pyttsx3.init()
engine.say("Hello world")
engine.runAndWait()
3. eSpeak-ng
Pros:
- Ultra-fast synthesis
- 100+ language support
- Minimal resource usage
- Highly customizable
- System-level installation
Cons:
- Robotic, mechanical voice quality
- Python wrapper required (not native)
- Less natural prosody
Installation:
# System package
sudo dnf install espeak-ng
# Python wrapper
uv pip install py3-tts # Uses eSpeak backend
Usage Example:
echo "Hello world" | espeak-ng
4. Coqui TTS
Pros:
- State-of-the-art neural voices
- Custom voice training support
- Multiple model architectures
- High-quality output
Cons:
- Very high resource requirements
- Slower inference
- Complex setup
- Larger memory footprint
- Development has slowed
Installation:
uv pip install TTS
Usage Example:
from TTS.api import TTS
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world", file_path="output.wav")
Recommendation: Piper TTS
Final Decision: Piper TTS is the optimal choice for this project.
Justification:
- Quality: Neural voices with Google TTS-level quality
- Offline: Fully local, no internet required (critical requirement)
- Performance: Optimized for local inference, suitable for desktop Linux
- Python Native: First-class Python API, easy integration
- Maintenance: Actively maintained with 2025 updates
- Flexibility: Multiple quality levels allow performance tuning
- Ease of Use: Simple pip installation, straightforward API
Configuration Strategy:
- Default Voice: en_US-lessac-medium (balanced quality/performance)
- GPU Acceleration: Optional CUDA support for faster inference
- Model Caching: Pre-load voice models at startup to reduce latency
- Quality Toggle: Allow clients to request different quality levels
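The model-caching point above could be sketched with a small LRU cache: load each PiperVoice once, keep recent models hot, and warm the default at startup. `MODELS_DIR` and the cache size are assumptions; `PiperVoice.load` follows the piper-tts usage shown earlier in this document.

```python
from functools import lru_cache
from pathlib import Path

MODELS_DIR = Path("/opt/voice-server/models")  # assumed install location

@lru_cache(maxsize=4)
def get_voice(name: str):
    """Load a voice model once; later calls with the same name hit the cache."""
    from piper import PiperVoice  # imported lazily; piper-tts must be installed
    return PiperVoice.load(str(MODELS_DIR / f"{name}.onnx"))

def preload_default(default: str = "en_US-lessac-medium") -> None:
    """Call at startup (e.g. in the FastAPI lifespan) to avoid first-request latency."""
    get_voice(default)
```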
Web Framework Selection
FastAPI: Detailed Analysis
Why FastAPI is Ideal for This Project:
1. Async-First Architecture
FastAPI is built on Starlette (ASGI framework) with native async/await support. This is critical for our use case:
@app.post("/notify")
async def notify(request: NotifyRequest):
# Non-blocking enqueueing
await tts_queue.put(request)
return {"status": "queued"}
# Background worker runs concurrently
async def process_queue():
while True:
request = await tts_queue.get()
await generate_and_play_tts(request)
Benefit: HTTP responses return immediately while TTS processing happens in background.
2. Performance Benchmarks
According to TechEmpower benchmarks (Better Stack):
- FastAPI: 15,000-20,000 requests/second
- Flask: 2,000-3,000 requests/second
Benefit: 5-10x higher throughput for handling concurrent TTS requests.
3. Automatic API Documentation
FastAPI generates interactive OpenAPI (Swagger) documentation automatically:
@app.post("/notify", response_model=NotifyResponse)
async def notify(request: NotifyRequest):
"""
Convert text to speech and play through speakers.
- **message**: Text to convert (1-10000 characters)
- **rate**: Speech rate in WPM (50-400)
- **voice**: Voice model name (optional)
"""
...
Benefit: Instant API documentation at /docs without manual maintenance.
4. Type Safety with Pydantic
Automatic request validation and serialization:
from pydantic import BaseModel, Field, validator
class NotifyRequest(BaseModel):
message: str = Field(..., min_length=1, max_length=10000)
rate: int = Field(170, ge=50, le=400)
voice_enabled: bool = True
@validator('message')
def sanitize_message(cls, v):
# Automatic validation before handler runs
return v.strip()
Benefit: Eliminates manual validation code, reduces bugs.
5. Dependency Injection
Clean separation of concerns:
async def get_tts_engine():
return global_tts_engine
@app.post("/notify")
async def notify(
request: NotifyRequest,
tts_engine: PiperVoice = Depends(get_tts_engine)
):
# tts_engine automatically injected
...
Benefit: Testable, maintainable code with clear dependencies.
6. Background Tasks
Built-in support for fire-and-forget tasks:
from fastapi import BackgroundTasks
@app.post("/notify")
async def notify(request: NotifyRequest, background_tasks: BackgroundTasks):
background_tasks.add_task(generate_tts, request.message)
return {"status": "queued"}
Benefit: Simplified async task management.
Flask Comparison (Why Not Flask)
Flask Limitations for This Project:
- WSGI-Based: Synchronous by default, requires Gunicorn/gevent for async
- Lower Performance: 2,000-3,000 req/s vs FastAPI's 15,000-20,000 req/s
- Manual Documentation: Requires Flask-RESTPlus or manual OpenAPI setup
- Manual Validation: No built-in request validation, requires Flask-Pydantic extension
- Blocking I/O: Natural behavior blocks request threads during TTS processing
When Flask Would Be Better:
- Simple synchronous applications
- Heavy reliance on Flask extensions (Flask-Login, Flask-Admin)
- Team already experienced with Flask
- Need for Jinja2 templating (not needed here)
Verdict: FastAPI is the clear winner for this async-heavy, high-performance use case.
Audio Playback Strategy
sounddevice Implementation Details
Non-Blocking Playback
sounddevice provides simple, non-blocking audio playback out of the box:
import sounddevice as sd
import numpy as np
class AudioPlayer:
"""Simple audio player using sounddevice."""
def __init__(self, sample_rate: int = 22050):
self.sample_rate = sample_rate
self._current_stream = None
def play(self, audio_data: np.ndarray, sample_rate: int = None):
"""
Non-blocking audio playback.
Args:
audio_data: NumPy array of audio samples (float32 or int16)
sample_rate: Sample rate in Hz (defaults to instance default)
"""
rate = sample_rate or self.sample_rate
# Stop any currently playing audio
self.stop()
# Play audio - returns immediately, audio plays in background
sd.play(audio_data, rate)
    def is_playing(self) -> bool:
        """Check if audio is currently playing."""
        try:
            # sd.get_stream() raises RuntimeError if play() was never called
            return sd.get_stream().active
        except RuntimeError:
            return False
def stop(self):
"""Stop current playback."""
sd.stop()
def wait(self):
"""Block until current playback completes."""
sd.wait()
async def wait_async(self):
"""Async wait for playback completion."""
import asyncio
while self.is_playing():
await asyncio.sleep(0.05)
Benefits of sounddevice:
- sd.play() returns immediately - audio plays in a background thread
- Direct NumPy array support - no conversion needed from Piper TTS
- Simple API - one line to play audio
- Built-in sd.wait() for synchronous waiting when needed
Handling Concurrent Requests
Strategy: Queue-based sequential playback with async queue management.
Rationale:
- Playing multiple TTS outputs simultaneously would create audio chaos
- Sequential playback ensures clarity
- Queue allows buffering during high request volume
Implementation:
import asyncio
import sounddevice as sd
import numpy as np
from typing import Dict, Any
class TTSQueue:
def __init__(self, max_size: int = 50):
self.queue = asyncio.Queue(maxsize=max_size)
self.player = AudioPlayer()
self.stats = {"processed": 0, "errors": 0}
async def enqueue(self, request: Dict[str, Any]):
"""Add TTS request to queue."""
try:
await asyncio.wait_for(
self.queue.put(request),
timeout=1.0
)
return self.queue.qsize()
except asyncio.TimeoutError:
raise QueueFullError("TTS queue is full")
async def process_queue(self):
"""Background worker to process TTS queue."""
while True:
request = await self.queue.get()
try:
# Generate TTS audio
audio_data = await self.generate_tts(request)
# Play audio (non-blocking start)
self.player.play(audio_data, sample_rate=22050)
# Wait for playback to complete (async-friendly)
await self.player.wait_async()
self.stats["processed"] += 1
except Exception as e:
logger.error(f"TTS processing error: {e}")
self.stats["errors"] += 1
finally:
self.queue.task_done()
async def generate_tts(self, request: Dict[str, Any]) -> np.ndarray:
"""Generate TTS audio using Piper."""
# Run CPU-intensive TTS in thread pool
loop = asyncio.get_event_loop()
audio_data = await loop.run_in_executor(
None,
self._sync_generate_tts,
request["message"],
request.get("voice", "en_US-lessac-medium")
)
return audio_data
def _sync_generate_tts(self, text: str, voice: str) -> np.ndarray:
"""Synchronous TTS generation (runs in thread pool)."""
# Piper TTS generation code
...
return audio_array
Startup:
from contextlib import asynccontextmanager
@asynccontextmanager
async def lifespan(app: FastAPI):
# Startup: initialize queue and start processor
global tts_queue
tts_queue = TTSQueue(max_size=50)
asyncio.create_task(tts_queue.process_queue())
yield
# Shutdown: stop audio playback
sd.stop()
app = FastAPI(lifespan=lifespan)
Audio Device Error Handling
Common Issues:
- Audio device disconnected (headphones unplugged)
- PulseAudio/ALSA daemon crashed
- No audio devices available
- Device in use by another process
Handling Strategy:
import sounddevice as sd
import numpy as np
import time
import logging
logger = logging.getLogger(__name__)
class RobustAudioPlayer:
"""Audio player with automatic retry and device recovery."""
def __init__(self, retry_attempts: int = 3, sample_rate: int = 22050):
self.retry_attempts = retry_attempts
self.sample_rate = sample_rate
self.verify_audio_devices()
def verify_audio_devices(self):
"""Verify audio devices are available."""
try:
devices = sd.query_devices()
output_devices = [d for d in devices if d['max_output_channels'] > 0]
if not output_devices:
raise AudioDeviceError("No audio output devices found")
logger.info(f"Audio initialized: {len(output_devices)} output devices found")
logger.debug(f"Default output: {sd.query_devices(kind='output')['name']}")
except Exception as e:
logger.error(f"Audio initialization failed: {e}")
raise
def play(self, audio_data: np.ndarray, sample_rate: int = None):
"""Play audio with automatic retry on device errors."""
rate = sample_rate or self.sample_rate
for attempt in range(self.retry_attempts):
try:
sd.play(audio_data, rate)
return
except sd.PortAudioError as e:
logger.warning(f"Audio playback failed (attempt {attempt+1}): {e}")
if attempt < self.retry_attempts - 1:
# Wait and retry - device may become available
sd.stop()
time.sleep(0.5)
self.verify_audio_devices()
else:
raise AudioPlaybackError(f"Failed after {self.retry_attempts} attempts: {e}")
    def is_playing(self) -> bool:
        """Check if audio is currently playing."""
        try:
            # sd.get_stream() raises RuntimeError if play() was never called
            return sd.get_stream().active
        except RuntimeError:
            return False
def stop(self):
"""Stop current playback."""
sd.stop()
async def wait_async(self):
"""Async wait for playback completion."""
import asyncio
while self.is_playing():
await asyncio.sleep(0.05)
Device Query for Diagnostics:
def get_audio_diagnostics() -> dict:
"""Get audio system diagnostics for health check."""
try:
devices = sd.query_devices()
default_output = sd.query_devices(kind='output')
return {
"status": "available",
"device_count": len(devices),
"default_output": default_output['name'],
"sample_rate": default_output['default_samplerate']
}
except Exception as e:
return {
"status": "unavailable",
"error": str(e)
}
Error Handling Strategy
Error Categories and Handling
1. Request Validation Errors
Scenarios:
- Missing required fields
- Invalid parameter types
- Out-of-range values
- Malformed JSON
Handling:
from datetime import datetime
from fastapi import HTTPException, status
from fastapi.responses import JSONResponse
from pydantic import BaseModel, Field, ValidationError
class NotifyRequest(BaseModel):
message: str = Field(..., min_length=1, max_length=10000)
rate: int = Field(170, ge=50, le=400)
voice: str = Field("en_US-lessac-medium", regex=r"^[\w-]+$")
@app.exception_handler(ValidationError)
async def validation_exception_handler(request, exc):
return JSONResponse(
status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
content={
"error": "validation_error",
"detail": str(exc),
"timestamp": datetime.utcnow().isoformat()
}
)
HTTP Status: 422 Unprocessable Entity
2. Queue Full Errors
Scenario: Too many concurrent requests, queue is at capacity
Handling:
```python
class QueueFullError(Exception):
    pass

@app.post("/notify")
async def notify(request: NotifyRequest):
    try:
        position = await tts_queue.enqueue(request)
        return {
            "status": "queued",
            "queue_position": position,
        }
    except QueueFullError:
        raise HTTPException(
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
            detail={
                "error": "queue_full",
                "message": "TTS queue is full, please retry later",
                "queue_size": tts_queue.max_size,
                "retry_after": 5,  # seconds
            },
        )
```
HTTP Status: 503 Service Unavailable
Client Action: Implement exponential backoff retry
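A client can honor the `retry_after` hint with exponential backoff using only the standard library. This sketch (function names are illustrative, not part of the API) retries solely on 503 responses:

```python
import json
import time
import urllib.error
import urllib.request

def backoff_delay(attempt: int, base: float = 5.0, cap: float = 60.0) -> float:
    """Exponential backoff: base * 2^attempt seconds, capped at `cap`."""
    return min(base * (2 ** attempt), cap)

def notify_with_retry(url: str, message: str, max_attempts: int = 4) -> bool:
    """POST to /notify, retrying with backoff only on 503 (queue full)."""
    body = json.dumps({"message": message}).encode()
    for attempt in range(max_attempts):
        req = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"}
        )
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return 200 <= resp.status < 300
        except urllib.error.HTTPError as e:
            if e.code != 503:
                raise  # non-retryable error, surface it
            time.sleep(backoff_delay(attempt))
    return False
```

The 5-second base matches the `retry_after` value returned by the server above.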
3. TTS Engine Errors
Scenarios:
- Voice model not found
- ONNX Runtime errors
- Memory allocation failures
- Corrupted model files
Handling:
```python
class TTSEngineError(Exception):
    pass

async def generate_tts(text: str, voice: str) -> np.ndarray:
    try:
        # Attempt TTS generation
        audio = piper_voice.synthesize(text)
        return audio
    except FileNotFoundError:
        raise TTSEngineError(f"Voice model '{voice}' not found")
    except MemoryError:
        raise TTSEngineError("Insufficient memory for TTS generation")
    except Exception as e:
        logger.error(f"TTS generation failed: {e}", exc_info=True)
        raise TTSEngineError(f"TTS generation failed: {str(e)}")

@app.exception_handler(TTSEngineError)
async def tts_engine_exception_handler(request, exc):
    return JSONResponse(
        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
        content={
            "error": "tts_engine_error",
            "detail": str(exc),
            "timestamp": datetime.utcnow().isoformat(),
        },
    )
```
HTTP Status: 500 Internal Server Error
4. Audio Playback Errors
Scenarios:
- No audio devices available
- Audio device disconnected
- ALSA/PulseAudio errors
- Permission denied
Handling:
```python
class AudioPlaybackError(Exception):
    pass

async def play_audio(audio_data: np.ndarray):
    try:
        player.play(audio_data, sample_rate=22050)  # AudioPlayer.play() retries internally
    except AudioDeviceError as e:  # custom exception raised when no output device is found
        logger.error(f"Audio device error: {e}")
        raise AudioPlaybackError("No audio output devices available")
    except OSError as e:
        logger.error(f"Audio system error: {e}")
        raise AudioPlaybackError(f"Audio playback failed: {str(e)}")

# In the queue processor:
try:
    await play_audio(audio_data)
except AudioPlaybackError as e:
    logger.error(f"Playback error: {e}")
    # Continue processing queue, don't crash server
    stats["errors"] += 1
```
Action: Log error, continue processing queue (don't crash server)
5. System Resource Errors
Scenarios:
- Out of memory
- CPU overload
- Disk space exhausted
Handling:
```python
import psutil

async def check_system_resources():
    """Monitor system resources."""
    memory = psutil.virtual_memory()
    if memory.percent > 90:
        logger.warning(f"High memory usage: {memory.percent}%")
    # Note: interval=1 blocks for a second; prefer interval=None in the request path
    cpu = psutil.cpu_percent(interval=1)
    if cpu > 90:
        logger.warning(f"High CPU usage: {cpu}%")

@app.middleware("http")
async def resource_monitoring_middleware(request, call_next):
    """Monitor resources on each request."""
    await check_system_resources()
    response = await call_next(request)
    return response
```
Action: Log warnings, implement queue size limits to prevent resource exhaustion
Logging Strategy
Log Levels:
```python
import logging
from logging.handlers import RotatingFileHandler

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        RotatingFileHandler(
            'voice-server.log',
            maxBytes=10 * 1024 * 1024,  # 10MB
            backupCount=5
        ),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

# Log levels usage:
logger.debug("TTS parameters: rate=%d, voice=%s", rate, voice)   # DEBUG
logger.info("Request queued: position=%d", queue_position)       # INFO
logger.warning("Queue nearly full: %d/%d", current, max_size)    # WARNING
logger.error("TTS generation failed: %s", error, exc_info=True)  # ERROR
logger.critical("Audio system unavailable, shutting down")       # CRITICAL
```
Structured Logging:
```python
import json
from datetime import datetime

def log_request(request_id: str, message: str, status: str):
    """Structured JSON logging."""
    log_entry = {
        "timestamp": datetime.utcnow().isoformat(),
        "request_id": request_id,
        "message_length": len(message),
        "status": status,
        "event_type": "tts_request",
    }
    logger.info(json.dumps(log_entry))
```
Health Check Implementation
Comprehensive Health Checks:
```python
@app.get("/health")
async def health_check():
    """Detailed health status."""
    health_status = {
        "status": "healthy",
        "timestamp": datetime.utcnow().isoformat(),
        "checks": {}
    }

    # Check TTS engine
    try:
        tts_engine.test_synthesis("test")
        health_status["checks"]["tts_engine"] = "healthy"
    except Exception as e:
        health_status["checks"]["tts_engine"] = f"unhealthy: {str(e)}"
        health_status["status"] = "unhealthy"

    # Check audio output
    try:
        audio_player.test_output()
        health_status["checks"]["audio_output"] = "healthy"
    except Exception as e:
        health_status["checks"]["audio_output"] = f"unhealthy: {str(e)}"
        health_status["status"] = "unhealthy"

    # Check queue status
    queue_size = tts_queue.qsize()
    health_status["checks"]["queue"] = {
        "size": queue_size,
        "capacity": tts_queue.max_size,
        "utilization": f"{(queue_size / tts_queue.max_size) * 100:.1f}%"
    }

    # Check system resources
    health_status["checks"]["system"] = {
        "memory_percent": psutil.virtual_memory().percent,
        "cpu_percent": psutil.cpu_percent(interval=0.1)
    }

    status_code = 200 if health_status["status"] == "healthy" else 503
    return JSONResponse(status_code=status_code, content=health_status)
```
Implementation Checklist
Phase 1: Core Infrastructure (Days 1-2)
1.1 Project Setup
- Initialize project directory `/mnt/NV2/Development/voice-server`
- Create Python virtual environment using `uv`
- Install core dependencies: `uv pip install fastapi "uvicorn[standard]" piper-tts sounddevice numpy pydantic python-dotenv`
- Create `requirements.txt` with pinned versions
- Create `.env.example` for configuration template
- Initialize git repository
- Create `.gitignore` (Python, IDEs, .env, voice models)
1.2 FastAPI Application Structure
- Create `app/main.py` with FastAPI app initialization
- Implement `/notify` endpoint skeleton
- Implement `/health` endpoint skeleton
- Implement `/voices` endpoint skeleton
- Configure CORS middleware
- Configure JSON logging middleware
- Create Pydantic models for request/response schemas
- Test basic server startup: `uvicorn app.main:app --reload`
1.3 Configuration Management
- Create `app/config.py` for configuration loading
- Implement environment variable loading
- Define configuration schema (host, port, queue size, etc.)
- Implement configuration validation at startup
- Create CLI argument parsing for overrides
- Document all configuration options in README
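A minimal sketch of what `app/config.py` could look like; the `VOICE_SERVER_*` environment variable names are illustrative assumptions, not part of the spec:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    host: str = "127.0.0.1"
    port: int = 8888
    queue_max_size: int = 50
    default_voice: str = "en_US-lessac-medium"

    @classmethod
    def from_env(cls, env=None) -> "Settings":
        """Build settings from environment variables, falling back to defaults."""
        env = os.environ if env is None else env
        return cls(
            host=env.get("VOICE_SERVER_HOST", cls.host),
            port=int(env.get("VOICE_SERVER_PORT", cls.port)),
            queue_max_size=int(env.get("VOICE_SERVER_QUEUE_SIZE", cls.queue_max_size)),
            default_voice=env.get("VOICE_SERVER_VOICE", cls.default_voice),
        )
```

A frozen dataclass keeps configuration immutable after startup; CLI overrides can be layered on top of `Settings.from_env()`.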
Phase 2: TTS Integration (Days 2-3)
2.1 Piper TTS Setup
- Create `app/tts_engine.py` module
- Implement `PiperTTSEngine` class
- Download default voice model (`en_US-lessac-medium`)
- Implement voice model loading with caching
- Implement text-to-audio synthesis method
- Add support for configurable speech rate
- Test TTS generation with sample text
- Measure TTS latency for various text lengths
2.2 Voice Model Management
- Create `models/` directory for voice model storage
- Implement voice model discovery (scan `models/` directory)
- Implement lazy loading of voice models (load on first use)
- Create model metadata cache (name, language, quality, size)
- Implement `/voices` endpoint to list available models
- Add error handling for missing/corrupted models
- Document voice model installation process
2.3 TTS Parameter Support
- Implement speech rate adjustment (50-400 WPM)
- Test rate adjustment across range
- Add voice selection via request parameter
- Implement voice validation (reject unknown voices)
- Add `voice_enabled` flag for debugging/testing
- Create comprehensive TTS unit tests
Phase 3: Audio Playback (Day 3)
3.1 sounddevice Integration
- Create `app/audio_player.py` module
- Implement `AudioPlayer` class with non-blocking `sd.play()`
- Verify sounddevice detects audio devices at startup
- Implement non-blocking playback method
- Implement async `wait_async()` method for queue processing
- Test audio playback with sample NumPy array
- Verify non-blocking behavior with concurrent requests
3.2 Audio Error Handling
- Implement audio device detection
- Add retry logic for device failures
- Handle device disconnection gracefully
- Test with headphones unplugged during playback
- Implement fallback to different audio devices
- Add detailed audio error logging
- Create audio system health check
3.3 Playback Testing
- Test simultaneous playback (should queue)
- Test rapid successive requests
- Measure audio latency (request → sound output)
- Test with various audio formats
- Verify memory cleanup after playback
- Test long-running playback (10+ minutes)
Phase 4: Queue Management (Day 4)
4.1 Async Queue Implementation
- Create `app/queue_manager.py` module
- Implement `TTSQueue` class with `asyncio.Queue`
- Set configurable max queue size (default: 50)
- Implement queue full detection
- Create background queue processor task
- Implement graceful queue shutdown
- Add queue metrics (size, processed, errors)
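The queue described above can be sketched with `asyncio.Queue`; `QueueFullError` mirrors the exception defined in the error-handling strategy, and `worker` is the background task started at app startup. This is a sketch, not the final implementation:

```python
import asyncio

class QueueFullError(Exception):
    """Raised when the bounded queue cannot accept more work (maps to HTTP 503)."""

class TTSQueue:
    def __init__(self, max_size: int = 50):
        self.max_size = max_size
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=max_size)
        self.processed = 0
        self.errors = 0

    async def enqueue(self, item) -> int:
        """Add an item without blocking; return its 1-based queue position."""
        try:
            self._queue.put_nowait(item)
        except asyncio.QueueFull:
            raise QueueFullError("TTS queue is full") from None
        return self._queue.qsize()

    def qsize(self) -> int:
        return self._queue.qsize()

    async def worker(self, process):
        """Background task: process items one at a time (no overlapping audio)."""
        while True:
            item = await self._queue.get()
            try:
                await process(item)
                self.processed += 1
            except Exception:
                self.errors += 1
            finally:
                self._queue.task_done()
```

Using `put_nowait` keeps the `/notify` handler from blocking when the queue is full, so the 503 path stays fast.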
4.2 Request Processing Pipeline
- Implement request enqueueing in `/notify` endpoint
- Create background worker to process queue
- Integrate TTS generation in worker
- Integrate audio playback in worker
- Implement sequential playback (one at a time)
- Add request timeout handling (max 60s per request)
- Test queue with 100+ concurrent requests
4.3 Queue Monitoring
- Add queue size to `/health` endpoint
- Implement queue utilization metrics
- Add logging for queue events (enqueue, process, error)
- Create queue performance benchmarks
- Test queue overflow scenarios
- Document queue behavior and limits
Phase 5: Error Handling (Day 5)
5.1 Exception Handlers
- Implement custom exception classes
- Create `QueueFullError` exception handler
- Create `TTSEngineError` exception handler
- Create `AudioPlaybackError` exception handler
- Create `ValidationError` exception handler
- Implement generic exception handler (catch-all)
- Test all error scenarios
5.2 Logging Infrastructure
- Configure structured JSON logging
- Implement rotating file handler (10MB, 5 backups)
- Add request ID tracking across logs
- Implement log levels appropriately (DEBUG, INFO, WARNING, ERROR)
- Create log aggregation for queue processor
- Test log rotation
- Document log file locations and format
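Request ID tracking can be implemented with a `ContextVar` plus a logging filter, which survives `await` points in async handlers; the names here are illustrative:

```python
import logging
import uuid
from contextvars import ContextVar

# Holds the current request's ID; set once per request (e.g. in middleware).
request_id_var: ContextVar[str] = ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Injects the current request ID into every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True

def new_request_id() -> str:
    """Short random ID, unique enough for log correlation."""
    return uuid.uuid4().hex[:12]
```

Register the filter on each handler (`handler.addFilter(RequestIdFilter())`) and add `%(request_id)s` to the log format string; middleware sets `request_id_var` at the start of each request.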
5.3 Health Monitoring
- Implement comprehensive `/health` endpoint
- Add TTS engine health check
- Add audio system health check
- Add queue status to health check
- Add system resource metrics (CPU, memory)
- Test health endpoint under load
- Create health check monitoring script
Phase 6: Testing (Days 5-6)
6.1 Unit Tests
- Create `tests/` directory structure
- Install pytest: `uv pip install pytest pytest-asyncio`
- Write tests for Pydantic models
- Write tests for TTS engine
- Write tests for audio player
- Write tests for queue manager
- Write tests for configuration loading
- Achieve 80%+ code coverage
6.2 Integration Tests
- Write tests for `/notify` endpoint
- Write tests for `/health` endpoint
- Write tests for `/voices` endpoint
- Test end-to-end request flow
- Test concurrent request handling
- Test queue overflow scenarios
- Test error scenarios (TTS failure, audio failure)
6.3 Performance Tests
- Create load testing script with `locust` or `wrk`
- Test 100 concurrent requests
- Measure request latency (p50, p95, p99)
- Measure TTS generation time
- Measure audio playback latency
- Measure memory usage under load
- Document performance characteristics
6.4 System Tests
- Test on target Linux environment (Nobara/Fedora 42)
- Test with different audio devices
- Test with PulseAudio and ALSA
- Test headphone disconnect/reconnect
- Test system resource exhaustion scenarios
- Test server restart recovery
- Test long-running stability (24+ hours)
Phase 7: Documentation & Deployment (Days 6-7)
7.1 Documentation
- Create comprehensive README.md:
- Project overview
- Installation instructions
- Configuration options
- Usage examples
- API documentation
- Troubleshooting guide
- Create CONTRIBUTING.md (if open source)
- Create CHANGELOG.md
- Document voice model installation
- Create architecture diagrams
- Add inline code documentation
- Create example client scripts (curl, Python)
7.2 Deployment Preparation
- Create systemd service file (`voice-server.service`)
- Test systemd service installation
- Test automatic restart on failure
- Create deployment script (`deploy.sh`)
- Document deployment process
- Create backup/restore procedures
- Test upgrade procedure
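A user-level systemd unit is a reasonable starting point, since audio playback needs access to the user's PipeWire/PulseAudio session; every path below is illustrative and assumes the `uv`-created virtualenv from Phase 1:

```ini
[Unit]
Description=Local Voice Server (TTS)
After=network.target sound.target

[Service]
Type=simple
WorkingDirectory=/mnt/NV2/Development/voice-server
ExecStart=/mnt/NV2/Development/voice-server/.venv/bin/uvicorn app.main:app --host 127.0.0.1 --port 8888
Restart=on-failure
RestartSec=5

[Install]
WantedBy=default.target
```

Install with `systemctl --user daemon-reload && systemctl --user enable --now voice-server`; a system-level unit would additionally need access to the user's audio session.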
7.3 Production Hardening
- Enable production logging (disable debug logs)
- Configure log rotation
- Set up monitoring (optional: Prometheus, Grafana)
- Implement graceful shutdown (SIGTERM handling)
- Test crash recovery
- Implement rate limiting (optional)
- Security audit (input sanitization, resource limits)
- Performance tuning (queue size, worker count)
Testing Strategy
Unit Testing
Framework: pytest with pytest-asyncio
Test Coverage Requirements:
- Minimum 80% code coverage
- 100% coverage for critical paths (TTS, audio playback)
- All error handlers must have tests
Test Structure:
```
tests/
├── __init__.py
├── conftest.py              # Shared fixtures
├── unit/
│   ├── test_config.py       # Configuration loading tests
│   ├── test_models.py       # Pydantic model tests
│   ├── test_tts_engine.py   # TTS engine tests
│   ├── test_audio_player.py # Audio player tests
│   └── test_queue.py        # Queue manager tests
├── integration/
│   ├── test_api.py          # API endpoint tests
│   ├── test_end_to_end.py   # Full request flow tests
│   └── test_errors.py       # Error scenario tests
└── performance/
    └── test_load.py         # Load testing
```
Sample Unit Test:
```python
# tests/unit/test_tts_engine.py
import numpy as np
import pytest

from app.tts_engine import PiperTTSEngine

@pytest.fixture
def tts_engine():
    """Create TTS engine instance."""
    return PiperTTSEngine(model_dir="models/")

def test_tts_engine_initialization(tts_engine):
    """Test TTS engine initializes successfully."""
    assert tts_engine is not None
    assert tts_engine.default_voice == "en_US-lessac-medium"

def test_text_to_audio_conversion(tts_engine):
    """Test converting text to audio."""
    audio = tts_engine.synthesize("Hello world")
    assert audio is not None
    assert len(audio) > 0
    assert audio.dtype == np.float32

def test_invalid_voice_raises_error(tts_engine):
    """Test that invalid voice raises appropriate error."""
    with pytest.raises(ValueError, match="Voice model .* not found"):
        tts_engine.synthesize("Hello", voice="invalid_voice")

@pytest.mark.asyncio
async def test_async_synthesis(tts_engine):
    """Test async TTS synthesis."""
    audio = await tts_engine.synthesize_async("Hello world")
    assert audio is not None
```
Sample Integration Test:
```python
# tests/integration/test_api.py
import pytest
from fastapi.testclient import TestClient

from app.main import app

@pytest.fixture
def client():
    """Create test client."""
    return TestClient(app)

def test_notify_endpoint_success(client):
    """Test successful /notify request."""
    response = client.post(
        "/notify",
        json={"message": "Test message", "rate": 180}
    )
    assert response.status_code == 202
    data = response.json()
    assert data["status"] == "queued"
    assert data["message_length"] == 12

def test_notify_endpoint_validation_error(client):
    """Test /notify with invalid parameters."""
    response = client.post(
        "/notify",
        json={"message": "", "rate": 1000}  # Empty message, invalid rate
    )
    assert response.status_code == 422

def test_health_endpoint(client):
    """Test /health endpoint."""
    response = client.get("/health")
    assert response.status_code == 200
    data = response.json()
    assert "status" in data
    assert "queue_size" in data
```
Load Testing
Tool: wrk or locust
Sample wrk Test:
```bash
# Install wrk
sudo dnf install wrk

# Run load test: 100 concurrent connections, 30 seconds
wrk -t4 -c100 -d30s -s post.lua http://localhost:8888/notify

# post.lua script:
# wrk.method = "POST"
# wrk.headers["Content-Type"] = "application/json"
# wrk.body = '{"message": "Load test message"}'
```
Sample locust Test:
```python
# locustfile.py
from locust import HttpUser, task, between

class VoiceServerUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def notify(self):
        self.client.post("/notify", json={
            "message": "This is a load test message",
            "rate": 180
        })

    @task(5)
    def health_check(self):
        self.client.get("/health")

# Run: locust -f locustfile.py --host=http://localhost:8888
```
Performance Benchmarks:
| Metric | Target | Acceptable | Unacceptable |
|---|---|---|---|
| API Response Time (p95) | < 50ms | < 100ms | > 200ms |
| TTS Generation (500 chars) | < 2s | < 5s | > 10s |
| Requests/Second | > 50 | > 20 | < 10 |
| Memory Usage (idle) | < 200MB | < 500MB | > 1GB |
| Memory Usage (load) | < 500MB | < 1GB | > 2GB |
| Queue Processing Rate | > 10/s | > 5/s | < 2/s |
Manual Testing Checklist
Functional Testing:
- Send POST request with valid message → Hear audio playback
- Send request with long text (5000 chars) → Successful playback
- Send request with special characters → Successful sanitization
- Send request with invalid voice → Receive 422 error
- Send request with rate=50 → Slow speech playback
- Send request with rate=400 → Fast speech playback
- Send 10 concurrent requests → All play sequentially
- Fill queue to capacity → Receive 503 error
- Check /health endpoint → Receive status information
- Check /voices endpoint → See available voice models
- Check /docs endpoint → See Swagger documentation
Error Scenario Testing:
- Unplug headphones during playback → Graceful error handling
- Kill PulseAudio daemon → Audio error logged, server continues
- Send malformed JSON → Receive 400 error
- Send empty message → Receive 422 error
- Send 11,000 character message → Receive 413 error
- Restart server during playback → Queue cleared, server restarts
System Testing:
- Run server for 24 hours → No memory leaks
- Send 10,000 requests → All processed successfully
- Monitor CPU usage during load → < 50% average
- Monitor memory usage during load → < 1GB
- Test on Fedora 42 → Successful operation
- Test with ALSA (without PulseAudio) → Successful operation
Future Considerations
Optional Features (Post-v1.0)
1. Advanced Voice Control
- Pitch adjustment: Allow clients to specify pitch modification
- Volume control: Per-request volume settings
- Emotion/tone control: Happy, sad, angry voice modulation (if TTS engine supports)
- Voice cloning: Custom voice model training (Coqui TTS integration)
Implementation Complexity: Medium
User Value: High for accessibility and personalization
2. Audio Format Options
- Output format selection: Support WAV, MP3, OGG output
- Sample rate options: Allow 16kHz, 22kHz, 44.1kHz selection
- Compression levels: Configurable audio quality vs file size
Implementation Complexity: Low
User Value: Medium (mostly for file storage use cases)
3. Streaming Audio
- Real-time streaming: Stream audio as it's generated (WebSocket or SSE)
- Chunked TTS: Generate and stream long texts in chunks
- Lower latency: Start playback before full text is synthesized
Implementation Complexity: High
User Value: High for very long texts
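The chunking step can be prototyped independently of the TTS engine. This sketch (the helper name `chunk_text` is illustrative) splits on sentence boundaries so each chunk can be synthesized and played while the next is generated:

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text on sentence boundaries into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk if appending would exceed the budget
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip() if current else sentence
    if current:
        chunks.append(current)
    return chunks
```

Very long sentences can still exceed `max_chars`; a production version would add a hard split on commas or word boundaries as a fallback.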
4. SSML Support
- Prosody control: Fine-grained control over speech characteristics
- Break insertion: Explicit pauses and timing control
- Phoneme specification: Correct pronunciation for unusual words
- Multi-voice support: Different voices within single text
Example:
```xml
<speak>
  Hello, <break time="500ms"/> this is <emphasis>important</emphasis>.
  <voice name="en_US-libritts">A different voice.</voice>
</speak>
```
Implementation Complexity: Medium
User Value: High for advanced use cases
5. Caching Layer
- TTS result caching: Cache frequently requested texts
- Cache invalidation: LRU eviction policy
- Cache persistence: Store cache across restarts
- Cache statistics: Hit rate monitoring
Implementation Complexity: Low
User Value: High for repeated texts (notifications, alerts)
Sample Implementation:
```python
import hashlib

class TTSCache:
    def __init__(self, max_size: int = 1000):
        self.cache = {}
        self.max_size = max_size

    def get_cache_key(self, text: str, voice: str, rate: int) -> str:
        """Generate cache key from TTS parameters."""
        content = f"{text}|{voice}|{rate}"
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, text: str, voice: str, rate: int):
        """Retrieve cached audio."""
        key = self.get_cache_key(text, voice, rate)
        return self.cache.get(key)

    def put(self, text: str, voice: str, rate: int, audio_data):
        """Store audio in cache with eviction."""
        if len(self.cache) >= self.max_size:
            # Evict oldest entry (simple FIFO; use OrderedDict for true LRU)
            self.cache.pop(next(iter(self.cache)))
        key = self.get_cache_key(text, voice, rate)
        self.cache[key] = audio_data
```
6. Multi-Language Support
- Automatic language detection: Detect input language
- Language-specific voice selection: Match voice to detected language
- Mixed-language support: Handle multilingual texts
Implementation Complexity: Medium
User Value: High for international users
7. Audio Effects
- Reverb: Add spatial audio effects
- Echo: Add echo effects
- Speed adjustment: Time-stretch without pitch change
- Normalization: Automatic volume leveling
Implementation Complexity: Medium (requires audio processing library like pydub or librosa)
User Value: Medium (aesthetic enhancement)
8. Queue Priority System
- Priority levels: High, normal, low priority requests
- Priority queues: Separate queues for different priorities
- Preemption: Allow high-priority requests to interrupt low-priority
Implementation Complexity: Medium
User Value: Medium for multi-tenant scenarios
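A minimal sketch of the priority levels, assuming the asyncio-based queue design used elsewhere in this document: `asyncio.PriorityQueue` orders entries by tuple comparison, and a monotonic counter preserves FIFO order within each priority level (preemption of already-playing audio is out of scope here):

```python
import asyncio
import itertools

class PriorityTTSQueue:
    """Priority queue: lower number = higher priority."""
    HIGH, NORMAL, LOW = 0, 1, 2

    def __init__(self, max_size: int = 50):
        self._queue: asyncio.PriorityQueue = asyncio.PriorityQueue(maxsize=max_size)
        self._counter = itertools.count()  # tiebreaker keeps FIFO within a level

    def put_nowait(self, item, priority: int = NORMAL):
        self._queue.put_nowait((priority, next(self._counter), item))

    async def get(self):
        _, _, item = await self._queue.get()
        return item
```

The counter also guarantees that the queued items themselves are never compared, so arbitrary (non-orderable) request objects can be enqueued.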
9. Webhook Notifications
- Completion webhooks: Notify external service when TTS completes
- Error webhooks: Notify on TTS failures
- Webhook retry logic: Handle webhook delivery failures
Example Request:
```json
{
  "message": "Hello world",
  "webhook_url": "https://example.com/tts-complete"
}
```
Implementation Complexity: Low
User Value: High for integration scenarios
10. Authentication & Authorization
- API key authentication: Secure endpoint access
- Rate limiting: Per-user request limits
- Usage quotas: Daily/monthly request quotas
- Multi-tenant support: Isolated queues per user
Implementation Complexity: High
User Value: High for shared/production deployments
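Of the features above, per-user rate limiting is the most self-contained; a token-bucket sketch (the class name is illustrative and not tied to any library):

```python
import time

class TokenBucket:
    """Per-client rate limiter: `capacity` burst tokens, refilled at `refill_per_sec`."""
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; False means the client is over its limit."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The server would keep one bucket per API key (e.g. in a dict) and return HTTP 429 when `allow()` returns False.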
11. Web Interface
- Simple web UI: Browser-based TTS interface
- Queue visualization: Real-time queue status display
- Voice model management: Upload/download voice models via UI
- Settings configuration: Web-based configuration editor
Implementation Complexity: Medium
User Value: High for non-technical users
12. Docker Deployment
- Dockerfile: Container image for easy deployment
- Docker Compose: Multi-container setup with monitoring
- Volume management: Persistent voice model storage
- Health check integration: Container health monitoring
Sample Dockerfile:
```dockerfile
FROM python:3.11-slim

# Install system dependencies (PortAudio for sounddevice)
RUN apt-get update && apt-get install -y \
    libportaudio2 \
    portaudio19-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY app/ ./app/

# Download default voice model
RUN python -c "from piper import PiperVoice; PiperVoice.download('en_US-lessac-medium')"

EXPOSE 8888
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8888"]
```
Implementation Complexity: Low
User Value: High for deployment consistency
13. Metrics & Monitoring
- Prometheus metrics: Request count, latency, queue size
- Grafana dashboards: Visual monitoring
- Alerting: Notify on errors, high queue size, etc.
- Performance profiling: Identify bottlenecks
Sample Metrics:
```python
from prometheus_client import Counter, Histogram, Gauge

request_counter = Counter('tts_requests_total', 'Total TTS requests')
latency_histogram = Histogram('tts_latency_seconds', 'TTS latency')
queue_size_gauge = Gauge('tts_queue_size', 'Current queue size')

@app.post("/notify")
async def notify(request: NotifyRequest):
    request_counter.inc()
    with latency_histogram.time():
        # Process request
        ...
    queue_size_gauge.set(tts_queue.qsize())
```
Implementation Complexity: Medium
User Value: High for production deployments
Scalability Considerations
Horizontal Scaling:
- Use Redis for shared queue across multiple server instances
- Implement distributed locking for audio device access
- Load balance requests across multiple servers
Vertical Scaling:
- Increase queue size for higher throughput
- Use GPU acceleration for TTS (CUDA support in Piper)
- Optimize voice model loading (keep models in memory)
Architecture Evolution:
- Separate TTS generation and audio playback into microservices
- Use message queue (RabbitMQ, Kafka) for request distribution
- Implement worker pool for parallel TTS generation
Appendix: References
Technical Documentation
- FastAPI Official Documentation
- Piper TTS GitHub Repository
- PyAudio Documentation
- Uvicorn Documentation
Research & Comparisons
- FastAPI vs Flask Performance Comparison - Strapi
- Flask vs FastAPI - Better Stack
- Python TTS Engines Comparison - Smallest AI
- TTS Converters for Raspberry Pi - Circuit Digest
- Piper TTS Tutorial - RMauro Dev
- Python Audio Playback - simpleaudio Docs
Tools & Libraries
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-12-18 | Atlas | Initial PRD creation |
Document Status: ✅ Complete - Ready for Implementation
Next Steps:
- Review PRD with stakeholders
- Approve technical stack decisions
- Begin Phase 1 implementation
- Set up project tracking (GitHub Issues, Jira, etc.)
- Assign development resources
Questions or Feedback: Contact Atlas at [atlas@manticorum.com]