
Product Requirements Document: Local Voice Server

Version: 1.0
Date: 2025-12-18
Author: Atlas (Principal Software Architect)
Project: Local HTTP Voice Server for Text-to-Speech


Table of Contents

  1. Executive Summary
  2. Goals and Non-Goals
  3. Technical Requirements
  4. System Architecture
  5. API Specification
  6. TTS Engine Analysis
  7. Web Framework Selection
  8. Audio Playback Strategy
  9. Error Handling Strategy
  10. Implementation Checklist
  11. Testing Strategy
  12. Future Considerations

Executive Summary

Project Overview

This project delivers a local HTTP service that accepts POST requests containing text strings and converts them to speech through the computer's speakers. The service will run locally on Linux (Nobara/Fedora 42), providing fast, offline text-to-speech capabilities without requiring external API calls or internet connectivity.

Success Metrics

  • Response Time: TTS conversion and playback initiation within 200ms for short texts (< 100 characters)
  • Reliability: 99.9% successful request handling under normal operating conditions
  • Concurrency: Support for at least 5 concurrent TTS requests with proper queuing
  • Audio Quality: Clear, intelligible speech output comparable to Google TTS quality
  • Startup Time: Server ready to accept requests within 2 seconds of launch

Technical Stack

| Component | Technology | Justification |
|---|---|---|
| Web Framework | FastAPI | Async support, high performance (15k-20k req/s), automatic API documentation |
| TTS Engine | Piper TTS | Neural voice quality, offline, optimized for local inference, ONNX-based |
| Audio Playback | sounddevice | Cross-platform, Pythonic API, excellent NumPy integration, non-blocking playback |
| Package Manager | uv | Fast Python package management (user preference) |
| ASGI Server | Uvicorn | High-performance ASGI server, native FastAPI integration |
| Async Runtime | asyncio | Built-in Python async support for concurrent request handling |

Timeline Estimate

  • Phase 1 - Core Implementation: 2-3 days (basic HTTP server + TTS integration)
  • Phase 2 - Error Handling & Testing: 1-2 days (comprehensive error handling, unit tests)
  • Phase 3 - Concurrency & Queue Management: 1-2 days (async queue, concurrent playback)
  • Total Estimated Time: 4-7 days for production-ready v1.0

Resource Requirements

  • Development: 1 full-stack Python developer with async programming experience
  • Testing: Access to Linux environment (Nobara/Fedora 42) with audio hardware
  • Infrastructure: Local development machine with 2+ CPU cores, 4GB+ RAM

Goals and Non-Goals

Goals

Primary Goals:

  1. Create a local HTTP service that accepts text via POST requests
  2. Convert text to speech using high-quality offline TTS
  3. Play audio through system speakers with minimal latency
  4. Support concurrent requests with proper queue management
  5. Provide comprehensive error handling and logging
  6. Maintain zero external dependencies (fully offline capable)

Secondary Goals:

  1. Automatic API documentation via FastAPI's built-in OpenAPI support
  2. Configurable TTS parameters (voice, speed, volume) via request parameters
  3. Health check endpoint for service monitoring
  4. Graceful handling of long-running text conversions
  5. Support for multiple voice models

Non-Goals

Explicitly Out of Scope:

  1. Cloud-based or external API integration
  2. Speech-to-text (STT) capabilities
  3. Audio file storage or retrieval
  4. User authentication or authorization
  5. Rate limiting or quota management
  6. Multi-language UI or web interface
  7. Real-time streaming audio synthesis
  8. Mobile app integration
  9. Persistent audio history or logging
  10. Advanced audio effects (reverb, pitch shifting, etc.)

Technical Requirements

Functional Requirements

FR1: HTTP Server

  • FR1.1: Server SHALL listen on configurable host and port (default: 0.0.0.0:8888)
  • FR1.2: Server SHALL accept POST requests to /notify endpoint
  • FR1.3: Server SHALL accept JSON payload with message field containing text
  • FR1.4: Server SHALL return HTTP 202 (Accepted) when a request is successfully queued
  • FR1.5: Server SHALL support CORS for local development

FR2: Text-to-Speech Conversion

  • FR2.1: System SHALL convert text strings to audio using Piper TTS
  • FR2.2: System SHALL support configurable voice models via request parameters
  • FR2.3: System SHALL support adjustable speech rate (50-400 words per minute)
  • FR2.4: System SHALL handle text inputs from 1 to 10,000 characters
  • FR2.5: System SHALL use default voice if not specified in request

FR3: Audio Playback

  • FR3.1: System SHALL play generated audio through default system audio output
  • FR3.2: System SHALL support non-blocking audio playback
  • FR3.3: System SHALL queue concurrent requests in FIFO order
  • FR3.4: System SHALL allow configurable maximum queue size (default: 50)
  • FR3.5: System SHALL provide feedback when queue is full

FR4: Configuration

  • FR4.1: System SHALL support configuration via environment variables
  • FR4.2: System SHALL support configuration via command-line arguments
  • FR4.3: System SHALL provide sensible defaults for all configuration values
  • FR4.4: System SHALL validate configuration at startup
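The layered configuration in FR4.1-FR4.4 (defaults, then environment variables, then command-line flags, validated at startup) can be sketched with the standard library. The variable names `VOICE_HOST`, `VOICE_PORT`, and `VOICE_QUEUE_SIZE` are illustrative, not part of this specification:

```python
import argparse
import os


def load_config(argv=None):
    """Build config: defaults, overridden by env vars, overridden by CLI flags."""
    parser = argparse.ArgumentParser(description="Local voice server")
    parser.add_argument("--host",
                        default=os.environ.get("VOICE_HOST", "0.0.0.0"))
    parser.add_argument("--port", type=int,
                        default=int(os.environ.get("VOICE_PORT", "8888")))
    parser.add_argument("--queue-size", type=int,
                        default=int(os.environ.get("VOICE_QUEUE_SIZE", "50")))
    config = parser.parse_args(argv)

    # FR4.4: validate at startup so bad values fail fast, not mid-request
    if not 1 <= config.port <= 65535:
        raise ValueError(f"port out of range: {config.port}")
    if config.queue_size < 1:
        raise ValueError(f"queue size must be positive: {config.queue_size}")
    return config
```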

FR5: Error Handling

  • FR5.1: System SHALL return appropriate HTTP error codes for failures
  • FR5.2: System SHALL log all errors with timestamps and context
  • FR5.3: System SHALL continue operating after non-fatal errors
  • FR5.4: System SHALL gracefully handle TTS engine failures
  • FR5.5: System SHALL provide detailed error messages in responses

Non-Functional Requirements

NFR1: Performance

  • NFR1.1: API response time SHALL be < 50ms (excluding TTS processing)
  • NFR1.2: TTS conversion SHALL complete in < 2 seconds for 500 character texts
  • NFR1.3: System SHALL handle 20+ requests per second without degradation
  • NFR1.4: Memory usage SHALL remain < 500MB under normal load
  • NFR1.5: CPU usage SHALL average < 30% during active TTS processing

NFR2: Reliability

  • NFR2.1: System SHALL maintain 99.9% uptime during operation
  • NFR2.2: System SHALL recover from audio device disconnections
  • NFR2.3: System SHALL handle Out-of-Memory conditions gracefully
  • NFR2.4: System SHALL log all critical errors for debugging

NFR3: Maintainability

  • NFR3.1: Code SHALL maintain > 80% test coverage
  • NFR3.2: All functions SHALL include docstrings with type hints
  • NFR3.3: Code SHALL follow PEP 8 style guidelines
  • NFR3.4: Dependencies SHALL be pinned to specific versions
  • NFR3.5: README SHALL provide clear setup and usage instructions

NFR4: Security

  • NFR4.1: System SHALL sanitize all text inputs to prevent injection attacks
  • NFR4.2: System SHALL limit request payload size to 1MB
  • NFR4.3: System SHALL not expose internal stack traces in API responses
  • NFR4.4: System SHALL log all incoming requests for audit purposes
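NFR4.1's input sanitization could look like the following sketch: strip control characters, collapse whitespace, and enforce the length bounds from FR2.4. The exact cleaning rules are an assumption; the requirement above does not prescribe them:

```python
import re

MAX_MESSAGE_CHARS = 10_000


def sanitize_message(text: str) -> str:
    """Strip control characters and collapse whitespace before TTS (sketch)."""
    # Drop C0 control chars (except tab/newline, collapsed below) and DEL
    cleaned = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    if not 1 <= len(cleaned) <= MAX_MESSAGE_CHARS:
        raise ValueError("message length out of range")
    return cleaned
```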

NFR5: Compatibility

  • NFR5.1: System SHALL run on Linux (Nobara/Fedora 42)
  • NFR5.2: System SHALL support Python 3.9+
  • NFR5.3: System SHALL work with standard ALSA/PulseAudio setups
  • NFR5.4: System SHALL be deployable as a systemd service
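For NFR5.4, a minimal user-level unit (run via `systemctl --user`, so the service can reach the user's PulseAudio session) might look like the fragment below. The paths and the uv invocation are assumptions about the install layout, not part of this spec:

```ini
[Unit]
Description=Local voice server (Piper TTS)
After=sound.target network.target

[Service]
# Illustrative paths; adjust to the actual checkout and uv location
WorkingDirectory=%h/voice-server
ExecStart=%h/.local/bin/uv run uvicorn main:app --host 0.0.0.0 --port 8888
Restart=on-failure

[Install]
WantedBy=default.target
```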

System Architecture

High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Client Applications                       │
│              (AI Agents, Scripts, Other Services)               │
└────────────────────────────┬────────────────────────────────────┘
                             │ HTTP POST /notify
                             │ JSON: {"message": "text"}
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                     FastAPI Web Server                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐        │
│  │   /notify    │  │   /health    │  │    /docs     │        │
│  │  endpoint    │  │  endpoint    │  │  (Swagger)   │        │
│  └──────┬───────┘  └──────────────┘  └──────────────┘        │
│         │                                                       │
│         │ Validates & Enqueues                                 │
│         ▼                                                       │
│  ┌──────────────────────────────────────────────────┐         │
│  │          Async Request Queue                     │         │
│  │  (asyncio.Queue with max size limit)            │         │
│  └──────────────────┬───────────────────────────────┘         │
└────────────────────┬┼───────────────────────────────────────────┘
                     ││
                     ││ Background Task Processing
                     ▼▼
┌─────────────────────────────────────────────────────────────────┐
│                    TTS Processing Layer                         │
│  ┌────────────────────────────────────────────────────┐        │
│  │              Piper TTS Engine                      │        │
│  │  ┌──────────────┐  ┌──────────────┐               │        │
│  │  │ Voice Models │  │ ONNX Runtime │               │        │
│  │  │  (.onnx +    │  │  Inference   │               │        │
│  │  │   .json)     │  │    Engine    │               │        │
│  │  └──────────────┘  └──────────────┘               │        │
│  └─────────────────────────┬──────────────────────────┘        │
│                            │ Generate WAV                       │
│                            ▼                                    │
│  ┌────────────────────────────────────────────────────┐        │
│  │          In-Memory Audio Buffer                    │        │
│  │        (NumPy array / bytes)                       │        │
│  └─────────────────────────┬──────────────────────────┘        │
└────────────────────────────┼───────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                  Audio Playback Layer                           │
│  ┌────────────────────────────────────────────────────┐        │
│  │            sounddevice Stream Manager              │        │
│  │  - Callback-based playback                         │        │
│  │  - Non-blocking operation                          │        │
│  │  - Stream lifecycle management                     │        │
│  └─────────────────────────┬──────────────────────────┘        │
│                            │                                    │
│                            ▼                                    │
│  ┌────────────────────────────────────────────────────┐        │
│  │         System Audio Output (ALSA/PulseAudio)     │        │
│  └────────────────────────────────────────────────────┘        │
└─────────────────────────────────────────────────────────────────┘
                             │
                             ▼
                    🔊 Computer Speakers

Component Descriptions

1. FastAPI Web Server

  • Responsibilities:

    • Accept and validate HTTP POST requests
    • Provide automatic OpenAPI documentation
    • Handle CORS configuration
    • Route requests to appropriate handlers
    • Return HTTP responses with appropriate status codes
  • Dependencies:

    • FastAPI framework
    • Uvicorn ASGI server
    • Pydantic for request/response validation

2. Async Request Queue

  • Responsibilities:

    • Queue incoming TTS requests in FIFO order
    • Prevent queue overflow with configurable max size
    • Enable asynchronous processing without blocking HTTP responses
    • Provide queue status information
  • Implementation:

    • asyncio.Queue for async-safe queuing
    • Background task workers to process queue
    • Queue metrics (size, processed count, errors)

3. TTS Processing Layer

  • Responsibilities:

    • Load and manage Piper TTS voice models
    • Convert text to audio waveforms
    • Handle voice model selection
    • Configure TTS parameters (rate, pitch, volume)
    • Generate in-memory audio buffers
  • Implementation:

    • Piper TTS Python bindings
    • ONNX Runtime for model inference
    • Voice model caching for performance
    • Error handling for model loading failures

4. Audio Playback Layer

  • Responsibilities:

    • Initialize audio output streams
    • Play audio buffers through system speakers
    • Support non-blocking playback
    • Handle audio device errors
    • Manage stream lifecycle
  • Implementation:

    • sounddevice for cross-platform audio I/O
    • Non-blocking sd.play() with background playback
    • Simple NumPy array integration
    • Graceful handling of audio device disconnections

Data Flow

Request Processing Flow:

  1. HTTP Request Reception:

    • Client sends POST to /notify with JSON payload
    • FastAPI validates request schema using Pydantic models
    • Request is immediately acknowledged with HTTP 202 (Accepted)
  2. Request Enqueueing:

    • Validated request is added to async queue
    • If queue is full, return HTTP 503 (Service Unavailable)
    • Queue position is logged for monitoring
  3. Background Processing:

    • Background worker retrieves request from queue
    • Text is passed to Piper TTS for conversion
    • Piper generates WAV audio in memory
  4. Audio Playback:

    • Audio buffer is passed to sounddevice
    • sounddevice streams audio to system output
    • Playback occurs in a background PortAudio thread (non-blocking)
    • Completion is logged
  5. Error Handling:

    • Errors at any stage are caught and logged
    • Failed requests are removed from queue
    • Error metrics are updated

Technology Stack Justification

FastAPI vs Flask

Decision: FastAPI

Rationale:

  • Performance: FastAPI handles 15,000-20,000 req/s vs Flask's 2,000-3,000 req/s (Strapi Comparison)
  • Async Native: Built on ASGI with native async/await support, critical for non-blocking TTS processing
  • Type Safety: Pydantic integration provides automatic request validation and serialization
  • Documentation: Automatic OpenAPI (Swagger) documentation generation
  • Modern Architecture: Designed for microservices and high-concurrency applications
  • Growing Adoption: 78k GitHub stars, 38% developer adoption in 2025 (40% YoY increase)

Trade-offs:

  • Steeper learning curve compared to Flask
  • Smaller ecosystem of extensions (though growing rapidly)
  • Requires ASGI server (Uvicorn) vs Flask's built-in development server

Piper TTS Engine Selection

Decision: Piper TTS

Rationale:

  • Voice Quality: Neural TTS with "Google TTS level quality" (AntiX Forum)
  • Offline Operation: Fully local, no internet required
  • Performance: Optimized for local inference using ONNX Runtime
  • Resource Efficiency: Runs on Raspberry Pi 4, suitable for desktop Linux
  • Easy Installation: Available via pip (pip install piper-tts)
  • Active Development: Maintained project with 2025 updates
  • Multiple Voices: Extensive voice model library with quality/speed trade-offs

Comparison with Alternatives:

| Engine | Voice Quality | Speed | Resource Usage | Offline | Ease of Use |
|---|---|---|---|---|---|
| Piper TTS | Neural | Fast | Medium | Yes | Easy |
| pyttsx3 | Robotic | Very Fast | Very Low | Yes | Very Easy |
| eSpeak | Robotic | Very Fast | Very Low | Yes | Easy |
| gTTS | Neural | Slow | Low | No | Very Easy |
| Coqui TTS | Neural | Medium | High | Yes | Complex |

Trade-offs:

  • Larger model files (~20-100MB per voice) vs simple engines
  • Higher resource usage than pyttsx3/eSpeak
  • Requires ONNX Runtime dependency

sounddevice for Audio Playback

Decision: sounddevice

Rationale:

  • Pythonic API: Clean, intuitive interface that feels native to Python
  • NumPy Integration: Direct support for NumPy arrays (perfect for Piper TTS output)
  • Non-Blocking: Simple sd.play() returns immediately, audio plays in background
  • Cross-Platform: Works on Linux, Windows, macOS via PortAudio backend
  • Active Maintenance: Well-maintained with regular updates
  • Simple Async: Easy integration with asyncio via sd.wait() or callbacks

Comparison with Alternatives:

| Library | Non-Blocking | Dependencies | Maintenance | Linux Support |
|---|---|---|---|---|
| sounddevice | Native | PortAudio | Active | Excellent |
| PyAudio | Callbacks | PortAudio | Active | Excellent |
| simpleaudio | Async | None | Archived | Good |
| pygame | Limited | SDL | Active | Excellent |

Why sounddevice over PyAudio:

  • Simpler API - sd.play(audio, samplerate) vs PyAudio's stream setup
  • Better NumPy support - no conversion needed from Piper's output
  • More Pythonic - feels like a modern Python library
  • Easier async integration - works naturally with asyncio

API Specification

Endpoint: POST /notify

Description: Accept text string and queue for TTS playback

Request Schema:

{
  "message": "string (required)",
  "voice": "string (optional)",
  "rate": "integer (optional, default: 170)",
  "voice_enabled": "boolean (optional, default: true)"
}

Request Parameters:

| Parameter | Type | Required | Default | Constraints | Description |
|---|---|---|---|---|---|
| message | string | Yes | - | 1-10000 chars | Text to convert to speech |
| voice | string | No | en_US-lessac-medium | Valid voice model name | Piper voice model to use |
| rate | integer | No | 170 | 50-400 | Speech rate in words per minute |
| voice_enabled | boolean | No | true | - | Enable/disable TTS (for debugging) |

Example Request:

curl -X POST http://localhost:8888/notify \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Hello, this is a test of the voice server",
    "rate": 200,
    "voice_enabled": true
  }'

Response Schema (Success - 202 Accepted):

{
  "status": "queued",
  "message_length": 42,
  "queue_position": 3,
  "estimated_duration": 2.5,
  "voice_model": "en_US-lessac-medium"
}
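The estimated_duration field suggests a words-per-minute heuristic derived from the request's rate; one possible sketch (the formula is an assumption, not specified by this document):

```python
def estimate_duration(message: str, rate_wpm: int = 170) -> float:
    """Rough playback-time estimate: word count divided by words per second."""
    words = len(message.split())
    return round(words / (rate_wpm / 60), 1)
```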

Response Schema (Error - 400 Bad Request):

{
  "error": "validation_error",
  "detail": "message field is required",
  "timestamp": "2025-12-18T10:30:45.123Z"
}

Response Schema (Error - 503 Service Unavailable):

{
  "error": "queue_full",
  "detail": "TTS queue is full, please retry later",
  "queue_size": 50,
  "timestamp": "2025-12-18T10:30:45.123Z"
}

HTTP Status Codes:

| Code | Meaning | Scenario |
|---|---|---|
| 202 | Accepted | Request successfully queued for processing |
| 400 | Bad Request | Invalid request parameters or malformed JSON |
| 413 | Payload Too Large | Message exceeds 10,000 characters |
| 422 | Unprocessable Entity | Valid JSON but invalid parameter values |
| 500 | Internal Server Error | TTS engine failure or unexpected error |
| 503 | Service Unavailable | Queue is full or service is shutting down |

Endpoint: GET /health

Description: Health check endpoint for monitoring

Request: No parameters

Response Schema (Healthy - 200 OK):

{
  "status": "healthy",
  "uptime_seconds": 3600,
  "queue_size": 2,
  "queue_capacity": 50,
  "tts_engine": "piper",
  "audio_output": "available",
  "voice_models_loaded": ["en_US-lessac-medium"],
  "total_requests": 1523,
  "failed_requests": 12,
  "timestamp": "2025-12-18T10:30:45.123Z"
}

Response Schema (Unhealthy - 503 Service Unavailable):

{
  "status": "unhealthy",
  "errors": [
    "Audio output device unavailable",
    "TTS engine failed to initialize"
  ],
  "timestamp": "2025-12-18T10:30:45.123Z"
}
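A handler might assemble these bodies from live counters. The framework-agnostic sketch below (the `build_health_payload` helper and its parameters are illustrative; the field set is trimmed) returns the unhealthy shape whenever any error is present:

```python
import time
from typing import Any, Dict, List

START_TIME = time.monotonic()


def build_health_payload(queue_size: int, queue_capacity: int,
                         errors: List[str]) -> Dict[str, Any]:
    """Assemble the /health response body; callers map status to 200/503."""
    if errors:
        return {"status": "unhealthy", "errors": errors}
    return {
        "status": "healthy",
        "uptime_seconds": int(time.monotonic() - START_TIME),
        "queue_size": queue_size,
        "queue_capacity": queue_capacity,
    }
```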

Endpoint: GET /docs

Description: Automatic Swagger UI documentation (provided by FastAPI)

Access: http://localhost:8888/docs

Features:

  • Interactive API testing
  • Schema visualization
  • Request/response examples
  • Authentication testing (if implemented)

Endpoint: GET /voices

Description: List available TTS voice models

Request: No parameters

Response Schema (200 OK):

{
  "voices": [
    {
      "name": "en_US-lessac-medium",
      "language": "en_US",
      "quality": "medium",
      "size_mb": 63.5,
      "installed": true
    },
    {
      "name": "en_US-libritts-high",
      "language": "en_US",
      "quality": "high",
      "size_mb": 108.2,
      "installed": false
    }
  ],
  "default_voice": "en_US-lessac-medium"
}
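Since Piper models ship as .onnx files with a sidecar .json config, the /voices listing could be derived from a directory scan. In this sketch, the models_dir layout and the name-parsing convention (language-speaker-quality) are assumptions:

```python
from pathlib import Path
from typing import Any, Dict, List


def list_voice_models(models_dir: str) -> List[Dict[str, Any]]:
    """Scan a directory for Piper voices named like en_US-lessac-medium.onnx."""
    voices = []
    for onnx in sorted(Path(models_dir).glob("*.onnx")):
        parts = onnx.stem.split("-")
        voices.append({
            "name": onnx.stem,
            "language": parts[0],
            "quality": parts[-1],
            "size_mb": round(onnx.stat().st_size / 1_048_576, 1),
            # Piper pairs each model with a <name>.onnx.json config file
            "installed": (onnx.parent / (onnx.name + ".json")).exists(),
        })
    return voices
```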

TTS Engine Analysis

Detailed Comparison Matrix

| Engine | Voice Quality | Latency | CPU Usage | Memory | Offline | Linux Support | Python API | Maintenance |
|---|---|---|---|---|---|---|---|---|
| Piper TTS | Neural | ~500ms | Medium | ~200MB | Yes | Excellent | Native | 🟢 Active |
| pyttsx3 | Robotic | ~100ms | Low | ~50MB | Yes | Good | Native | 🟢 Active |
| eSpeak-ng | Robotic | ~50ms | Very Low | ~20MB | Yes | Excellent | ⚠️ Wrapper | 🟢 Active |
| gTTS | Neural | ~2000ms | Low | ~30MB | No | Good | Native | 🟢 Active |
| Coqui TTS | Neural | ~1500ms | High | ~500MB | Yes | Good | Native | 🟡 Slow |
| Festival | Robotic | ~300ms | Low | ~100MB | Yes | Excellent | ⚠️ Wrapper | 🟡 Slow |
| Mimic3 | Neural | ~800ms | Medium | ~300MB | Yes | Good | HTTP only | 🟢 Active |

Detailed Engine Profiles

1. Piper TTS

Pros:

  • Neural TTS with natural-sounding voices
  • Optimized for local inference (ONNX Runtime)
  • Multiple quality levels (low/medium/high)
  • Extensive language and voice support
  • Active development and community
  • Easy pip installation
  • GPU acceleration support (CUDA)

Cons:

  • Larger model files (20-100MB per voice)
  • Higher resource usage than simple engines
  • Initial model download required
  • Slightly higher latency than robotic engines

Installation:

uv pip install piper-tts

Usage Example:

from piper import PiperVoice
import wave

voice = PiperVoice.load("en_US-lessac-medium.onnx")
with wave.open("output.wav", "wb") as wav_file:
    voice.synthesize("Hello world", wav_file)

Voice Quality Sample:

  • Low Quality: Faster, smaller models (~20MB), decent quality
  • Medium Quality: Balanced performance (~60MB), recommended default
  • High Quality: Best quality (~100MB), slower inference


2. pyttsx3

Pros:

  • Extremely lightweight and fast
  • Cross-platform (Windows SAPI5, macOS NSSpeech, Linux eSpeak)
  • Zero external dependencies
  • Simple API
  • No model downloads required

Cons:

  • Robotic voice quality
  • Limited voice customization
  • Depends on system TTS engines

Installation:

uv pip install pyttsx3

Usage Example:

import pyttsx3

engine = pyttsx3.init()
engine.say("Hello world")
engine.runAndWait()


3. eSpeak-ng

Pros:

  • Ultra-fast synthesis
  • 100+ language support
  • Minimal resource usage
  • Highly customizable
  • System-level installation

Cons:

  • Robotic, mechanical voice quality
  • Python wrapper required (not native)
  • Less natural prosody

Installation:

# System package
sudo dnf install espeak-ng

# Python wrapper
uv pip install py3-tts  # Uses eSpeak backend

Usage Example:

echo "Hello world" | espeak-ng


4. Coqui TTS

Pros:

  • State-of-the-art neural voices
  • Custom voice training support
  • Multiple model architectures
  • High-quality output

Cons:

  • Very high resource requirements
  • Slower inference
  • Complex setup
  • Larger memory footprint
  • Development has slowed

Installation:

uv pip install TTS

Usage Example:

from TTS.api import TTS

tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world", file_path="output.wav")


Recommendation: Piper TTS

Final Decision: Piper TTS is the optimal choice for this project.

Justification:

  1. Quality: Neural voices with Google TTS-level quality
  2. Offline: Fully local, no internet required (critical requirement)
  3. Performance: Optimized for local inference, suitable for desktop Linux
  4. Python Native: First-class Python API, easy integration
  5. Maintenance: Actively maintained with 2025 updates
  6. Flexibility: Multiple quality levels allow performance tuning
  7. Ease of Use: Simple pip installation, straightforward API

Configuration Strategy:

  • Default Voice: en_US-lessac-medium (balanced quality/performance)
  • GPU Acceleration: Optional CUDA support for faster inference
  • Model Caching: Pre-load voice models at startup to reduce latency
  • Quality Toggle: Allow clients to request different quality levels
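The model-caching point above can be sketched as a small cache keyed by voice name. The loader is injected as a callable so the sketch stays independent of the Piper API; in practice it would wrap PiperVoice.load:

```python
from typing import Any, Callable, Dict, Iterable


class VoiceCache:
    """Load each voice model once at startup and reuse it across requests."""

    def __init__(self, loader: Callable[[str], Any]):
        self._loader = loader
        self._models: Dict[str, Any] = {}

    def preload(self, names: Iterable[str]) -> None:
        """Eagerly load models so the first request pays no load latency."""
        for name in names:
            self.get(name)

    def get(self, name: str) -> Any:
        """Return a cached model, loading it on first use."""
        if name not in self._models:
            self._models[name] = self._loader(name)
        return self._models[name]
```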

Web Framework Selection

FastAPI: Detailed Analysis

Why FastAPI is Ideal for This Project:

1. Async-First Architecture

FastAPI is built on Starlette (ASGI framework) with native async/await support. This is critical for our use case:

@app.post("/notify")
async def notify(request: NotifyRequest):
    # Non-blocking enqueueing
    await tts_queue.put(request)
    return {"status": "queued"}

# Background worker runs concurrently
async def process_queue():
    while True:
        request = await tts_queue.get()
        await generate_and_play_tts(request)

Benefit: HTTP responses return immediately while TTS processing happens in background.

2. Performance Benchmarks

According to TechEmpower benchmarks (Better Stack):

  • FastAPI: 15,000-20,000 requests/second
  • Flask: 2,000-3,000 requests/second

Benefit: 5-10x higher throughput for handling concurrent TTS requests.

3. Automatic API Documentation

FastAPI generates interactive OpenAPI (Swagger) documentation automatically:

@app.post("/notify", response_model=NotifyResponse)
async def notify(request: NotifyRequest):
    """
    Convert text to speech and play through speakers.

    - **message**: Text to convert (1-10000 characters)
    - **rate**: Speech rate in WPM (50-400)
    - **voice**: Voice model name (optional)
    """
    ...

Benefit: Instant API documentation at /docs without manual maintenance.

4. Type Safety with Pydantic

Automatic request validation and serialization:

from pydantic import BaseModel, Field, validator

class NotifyRequest(BaseModel):
    message: str = Field(..., min_length=1, max_length=10000)
    rate: int = Field(170, ge=50, le=400)
    voice_enabled: bool = True

    @validator('message')
    def sanitize_message(cls, v):
        # Automatic validation before handler runs
        return v.strip()

Benefit: Eliminates manual validation code, reduces bugs.

5. Dependency Injection

Clean separation of concerns:

async def get_tts_engine():
    return global_tts_engine

@app.post("/notify")
async def notify(
    request: NotifyRequest,
    tts_engine: PiperVoice = Depends(get_tts_engine)
):
    # tts_engine automatically injected
    ...

Benefit: Testable, maintainable code with clear dependencies.

6. Background Tasks

Built-in support for fire-and-forget tasks:

from fastapi import BackgroundTasks

@app.post("/notify")
async def notify(request: NotifyRequest, background_tasks: BackgroundTasks):
    background_tasks.add_task(generate_tts, request.message)
    return {"status": "queued"}

Benefit: Simplified async task management.

Flask Comparison (Why Not Flask)

Flask Limitations for This Project:

  1. WSGI-Based: Synchronous by default, requires Gunicorn/gevent for async
  2. Lower Performance: 2,000-3,000 req/s vs FastAPI's 15,000-20,000 req/s
  3. Manual Documentation: Requires Flask-RESTPlus or manual OpenAPI setup
  4. Manual Validation: No built-in request validation, requires Flask-Pydantic extension
  5. Blocking I/O: Natural behavior blocks request threads during TTS processing

When Flask Would Be Better:

  • Simple synchronous applications
  • Heavy reliance on Flask extensions (Flask-Login, Flask-Admin)
  • Team already experienced with Flask
  • Need for Jinja2 templating (not needed here)

Verdict: FastAPI is the clear winner for this async-heavy, high-performance use case.


Audio Playback Strategy

sounddevice Implementation Details

Non-Blocking Playback

sounddevice provides simple, non-blocking audio playback out of the box:

import sounddevice as sd
import numpy as np

class AudioPlayer:
    """Simple audio player using sounddevice."""

    def __init__(self, sample_rate: int = 22050):
        self.sample_rate = sample_rate
        self._current_stream = None

    def play(self, audio_data: np.ndarray, sample_rate: int = None):
        """
        Non-blocking audio playback.

        Args:
            audio_data: NumPy array of audio samples (float32 or int16)
            sample_rate: Sample rate in Hz (defaults to instance default)
        """
        rate = sample_rate or self.sample_rate

        # Stop any currently playing audio
        self.stop()

        # Play audio - returns immediately, audio plays in background
        sd.play(audio_data, rate)

    def is_playing(self) -> bool:
        """Check if audio is currently playing."""
        try:
            # sd.get_stream() raises RuntimeError before the first play()
            return sd.get_stream().active
        except RuntimeError:
            return False

    def stop(self):
        """Stop current playback."""
        sd.stop()

    def wait(self):
        """Block until current playback completes."""
        sd.wait()

    async def wait_async(self):
        """Async wait for playback completion."""
        import asyncio
        while self.is_playing():
            await asyncio.sleep(0.05)

Benefits of sounddevice:

  • sd.play() returns immediately - audio plays in background thread
  • Direct NumPy array support - no conversion needed from Piper TTS
  • Simple API - one line to play audio
  • Built-in sd.wait() for synchronous waiting when needed

Handling Concurrent Requests

Strategy: Queue-based sequential playback with async queue management.

Rationale:

  • Playing multiple TTS outputs simultaneously would create audio chaos
  • Sequential playback ensures clarity
  • Queue allows buffering during high request volume

Implementation:

import asyncio
import logging
import sounddevice as sd
import numpy as np
from typing import Dict, Any

logger = logging.getLogger(__name__)

class QueueFullError(Exception):
    """Raised when the TTS queue cannot accept a new request."""

class TTSQueue:
    def __init__(self, max_size: int = 50):
        self.queue = asyncio.Queue(maxsize=max_size)
        self.player = AudioPlayer()
        self.stats = {"processed": 0, "errors": 0}

    async def enqueue(self, request: Dict[str, Any]):
        """Add TTS request to queue."""
        try:
            await asyncio.wait_for(
                self.queue.put(request),
                timeout=1.0
            )
            return self.queue.qsize()
        except asyncio.TimeoutError:
            raise QueueFullError("TTS queue is full")

    async def process_queue(self):
        """Background worker to process TTS queue."""
        while True:
            request = await self.queue.get()

            try:
                # Generate TTS audio
                audio_data = await self.generate_tts(request)

                # Play audio (non-blocking start)
                self.player.play(audio_data, sample_rate=22050)

                # Wait for playback to complete (async-friendly)
                await self.player.wait_async()

                self.stats["processed"] += 1

            except Exception as e:
                logger.error(f"TTS processing error: {e}")
                self.stats["errors"] += 1

            finally:
                self.queue.task_done()

    async def generate_tts(self, request: Dict[str, Any]) -> np.ndarray:
        """Generate TTS audio using Piper."""
        # Run CPU-intensive TTS in thread pool
        loop = asyncio.get_running_loop()
        audio_data = await loop.run_in_executor(
            None,
            self._sync_generate_tts,
            request["message"],
            request.get("voice", "en_US-lessac-medium")
        )
        return audio_data

    def _sync_generate_tts(self, text: str, voice: str) -> np.ndarray:
        """Synchronous TTS generation (runs in thread pool)."""
        # Piper TTS generation code
        ...
        return audio_array

Startup:

from contextlib import asynccontextmanager

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: initialize queue and start processor
    global tts_queue
    tts_queue = TTSQueue(max_size=50)
    asyncio.create_task(tts_queue.process_queue())
    yield
    # Shutdown: stop audio playback
    sd.stop()

app = FastAPI(lifespan=lifespan)

Audio Device Error Handling

Common Issues:

  1. Audio device disconnected (headphones unplugged)
  2. PulseAudio/ALSA daemon crashed
  3. No audio devices available
  4. Device in use by another process

Handling Strategy:

import sounddevice as sd
import numpy as np
import time
import logging

logger = logging.getLogger(__name__)

class AudioDeviceError(Exception):
    """Raised when no usable audio output device is available."""

class RobustAudioPlayer:
    """Audio player with automatic retry and device recovery."""

    def __init__(self, retry_attempts: int = 3, sample_rate: int = 22050):
        self.retry_attempts = retry_attempts
        self.sample_rate = sample_rate
        self.verify_audio_devices()

    def verify_audio_devices(self):
        """Verify audio devices are available."""
        try:
            devices = sd.query_devices()
            output_devices = [d for d in devices if d['max_output_channels'] > 0]
            if not output_devices:
                raise AudioDeviceError("No audio output devices found")
            logger.info(f"Audio initialized: {len(output_devices)} output devices found")
            logger.debug(f"Default output: {sd.query_devices(kind='output')['name']}")
        except Exception as e:
            logger.error(f"Audio initialization failed: {e}")
            raise

    def play(self, audio_data: np.ndarray, sample_rate: int | None = None):
        """Play audio with automatic retry on device errors."""
        rate = sample_rate or self.sample_rate

        for attempt in range(self.retry_attempts):
            try:
                sd.play(audio_data, rate)
                return
            except sd.PortAudioError as e:
                logger.warning(f"Audio playback failed (attempt {attempt+1}): {e}")

                if attempt < self.retry_attempts - 1:
                    # Wait and retry - device may become available
                    sd.stop()
                    time.sleep(0.5)
                    self.verify_audio_devices()
                else:
                    raise AudioPlaybackError(f"Failed after {self.retry_attempts} attempts: {e}")

    def is_playing(self) -> bool:
        """Check if audio is currently playing."""
        try:
            stream = sd.get_stream()
        except RuntimeError:
            # sd.get_stream() raises if play() has never been called
            return False
        return stream.active

    def stop(self):
        """Stop current playback."""
        sd.stop()

    async def wait_async(self):
        """Async wait for playback completion."""
        import asyncio
        while self.is_playing():
            await asyncio.sleep(0.05)

Device Query for Diagnostics:

def get_audio_diagnostics() -> dict:
    """Get audio system diagnostics for health check."""
    try:
        devices = sd.query_devices()
        default_output = sd.query_devices(kind='output')
        return {
            "status": "available",
            "device_count": len(devices),
            "default_output": default_output['name'],
            "sample_rate": default_output['default_samplerate']
        }
    except Exception as e:
        return {
            "status": "unavailable",
            "error": str(e)
        }

Error Handling Strategy

Error Categories and Handling

1. Request Validation Errors

Scenarios:

  • Missing required fields
  • Invalid parameter types
  • Out-of-range values
  • Malformed JSON

Handling:

from datetime import datetime

from fastapi import HTTPException, status
from fastapi.exceptions import RequestValidationError
from fastapi.responses import JSONResponse
from pydantic import BaseModel, Field

class NotifyRequest(BaseModel):
    message: str = Field(..., min_length=1, max_length=10000)
    rate: int = Field(170, ge=50, le=400)
    # Pydantic v2 uses pattern= (v1 used regex=)
    voice: str = Field("en_US-lessac-medium", pattern=r"^[\w-]+$")

# FastAPI wraps request-body validation failures in RequestValidationError
@app.exception_handler(RequestValidationError)
async def validation_exception_handler(request, exc):
    return JSONResponse(
        status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
        content={
            "error": "validation_error",
            "detail": str(exc),
            "timestamp": datetime.utcnow().isoformat()
        }
    )

HTTP Status: 422 Unprocessable Entity


2. Queue Full Errors

Scenario: Too many concurrent requests, queue is at capacity

Handling:

class QueueFullError(Exception):
    pass

@app.post("/notify")
async def notify(request: NotifyRequest):
    try:
        position = await tts_queue.enqueue(request)
        return {
            "status": "queued",
            "queue_position": position
        }
    except QueueFullError:
        raise HTTPException(
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
            detail={
                "error": "queue_full",
                "message": "TTS queue is full, please retry later",
                "queue_size": tts_queue.max_size,
                "retry_after": 5  # seconds
            }
        )

HTTP Status: 503 Service Unavailable
Client Action: Implement exponential backoff retry
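On the client side, the suggested backoff can be sketched with nothing but the standard library (the helper names and the retry schedule here are illustrative, not part of the server contract):

```python
import json
import time
import urllib.error
import urllib.request

def backoff_delays(attempts: int, base: float = 0.5, cap: float = 8.0) -> list[float]:
    """Exponential backoff schedule: base * 2^n, capped."""
    return [min(base * (2 ** n), cap) for n in range(attempts)]

def notify_with_retry(url: str, message: str, attempts: int = 5) -> dict:
    """POST to /notify, retrying on 503 (queue full) with backoff."""
    body = json.dumps({"message": message}).encode()
    for delay in backoff_delays(attempts):
        req = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"}
        )
        try:
            with urllib.request.urlopen(req) as resp:
                return json.loads(resp.read())
        except urllib.error.HTTPError as e:
            if e.code != 503:
                raise  # only retry when the queue is full
            time.sleep(delay)
    raise RuntimeError(f"Server still busy after {attempts} attempts")
```

The `retry_after` hint in the 503 body could be used as the initial delay instead of the fixed base.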


3. TTS Engine Errors

Scenarios:

  • Voice model not found
  • ONNX Runtime errors
  • Memory allocation failures
  • Corrupted model files

Handling:

class TTSEngineError(Exception):
    pass

async def generate_tts(text: str, voice: str) -> np.ndarray:
    try:
        # Attempt TTS generation
        audio = piper_voice.synthesize(text)
        return audio
    except FileNotFoundError:
        raise TTSEngineError(f"Voice model '{voice}' not found")
    except MemoryError:
        raise TTSEngineError("Insufficient memory for TTS generation")
    except Exception as e:
        logger.error(f"TTS generation failed: {e}", exc_info=True)
        raise TTSEngineError(f"TTS generation failed: {str(e)}")

@app.exception_handler(TTSEngineError)
async def tts_engine_exception_handler(request, exc):
    return JSONResponse(
        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
        content={
            "error": "tts_engine_error",
            "detail": str(exc),
            "timestamp": datetime.utcnow().isoformat()
        }
    )

HTTP Status: 500 Internal Server Error


4. Audio Playback Errors

Scenarios:

  • No audio devices available
  • Audio device disconnected
  • ALSA/PulseAudio errors
  • Permission denied

Handling:

class AudioPlaybackError(Exception):
    pass

async def play_audio(audio_data: np.ndarray):
    try:
        player.play(audio_data, sample_rate=22050)  # retries internally (see RobustAudioPlayer)
    except AudioDeviceError as e:
        logger.error(f"Audio device error: {e}")
        raise AudioPlaybackError("No audio output devices available")
    except OSError as e:
        logger.error(f"Audio system error: {e}")
        raise AudioPlaybackError(f"Audio playback failed: {str(e)}")

# In queue processor
try:
    await play_audio(audio_data)
except AudioPlaybackError as e:
    logger.error(f"Playback error: {e}")
    # Continue processing queue, don't crash server
    stats["errors"] += 1

Action: Log error, continue processing queue (don't crash server)


5. System Resource Errors

Scenarios:

  • Out of memory
  • CPU overload
  • Disk space exhausted

Handling:

import psutil

async def check_system_resources():
    """Monitor system resources."""
    memory = psutil.virtual_memory()
    if memory.percent > 90:
        logger.warning(f"High memory usage: {memory.percent}%")

    # interval=None returns the usage since the last call without blocking;
    # interval=1 would add a full second of latency to every request
    cpu = psutil.cpu_percent(interval=None)
    if cpu > 90:
        logger.warning(f"High CPU usage: {cpu}%")

@app.middleware("http")
async def resource_monitoring_middleware(request, call_next):
    """Monitor resources on each request."""
    await check_system_resources()
    response = await call_next(request)
    return response

Action: Log warnings, implement queue size limits to prevent resource exhaustion


Logging Strategy

Log Levels:

import logging
from logging.handlers import RotatingFileHandler

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        RotatingFileHandler(
            'voice-server.log',
            maxBytes=10*1024*1024,  # 10MB
            backupCount=5
        ),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger(__name__)

# Log levels usage:
logger.debug("TTS parameters: rate=%d, voice=%s", rate, voice)  # DEBUG
logger.info("Request queued: position=%d", queue_position)       # INFO
logger.warning("Queue nearly full: %d/%d", current, max_size)    # WARNING
logger.error("TTS generation failed: %s", error, exc_info=True)  # ERROR
logger.critical("Audio system unavailable, shutting down")       # CRITICAL

Structured Logging:

import json
from datetime import datetime

def log_request(request_id: str, message: str, status: str):
    """Structured JSON logging."""
    log_entry = {
        "timestamp": datetime.utcnow().isoformat(),
        "request_id": request_id,
        "message_length": len(message),
        "status": status,
        "event_type": "tts_request"
    }
    logger.info(json.dumps(log_entry))

Health Check Implementation

Comprehensive Health Checks:

@app.get("/health")
async def health_check():
    """Detailed health status."""
    health_status = {
        "status": "healthy",
        "timestamp": datetime.utcnow().isoformat(),
        "checks": {}
    }

    # Check TTS engine
    try:
        tts_engine.test_synthesis("test")
        health_status["checks"]["tts_engine"] = "healthy"
    except Exception as e:
        health_status["checks"]["tts_engine"] = f"unhealthy: {str(e)}"
        health_status["status"] = "unhealthy"

    # Check audio output
    try:
        audio_player.test_output()
        health_status["checks"]["audio_output"] = "healthy"
    except Exception as e:
        health_status["checks"]["audio_output"] = f"unhealthy: {str(e)}"
        health_status["status"] = "unhealthy"

    # Check queue status
    queue_size = tts_queue.qsize()
    health_status["checks"]["queue"] = {
        "size": queue_size,
        "capacity": tts_queue.max_size,
        "utilization": f"{(queue_size/tts_queue.max_size)*100:.1f}%"
    }

    # Check system resources
    health_status["checks"]["system"] = {
        "memory_percent": psutil.virtual_memory().percent,
        "cpu_percent": psutil.cpu_percent(interval=0.1)
    }

    status_code = 200 if health_status["status"] == "healthy" else 503
    return JSONResponse(status_code=status_code, content=health_status)

Implementation Checklist

Phase 1: Core Infrastructure (Days 1-2)

1.1 Project Setup

  • Initialize project directory /mnt/NV2/Development/voice-server
  • Create Python virtual environment using uv
  • Install core dependencies:
    • uv pip install fastapi
    • uv pip install uvicorn[standard]
    • uv pip install piper-tts
    • uv pip install sounddevice
    • uv pip install numpy
    • uv pip install pydantic
    • uv pip install python-dotenv
  • Create requirements.txt with pinned versions
  • Create .env.example for configuration template
  • Initialize git repository
  • Create .gitignore (Python, IDEs, .env, voice models)

1.2 FastAPI Application Structure

  • Create app/main.py with FastAPI app initialization
  • Implement /notify endpoint skeleton
  • Implement /health endpoint skeleton
  • Implement /voices endpoint skeleton
  • Configure CORS middleware
  • Configure JSON logging middleware
  • Create Pydantic models for request/response schemas
  • Test basic server startup: uvicorn app.main:app --reload

1.3 Configuration Management

  • Create app/config.py for configuration loading
  • Implement environment variable loading
  • Define configuration schema (host, port, queue size, etc.)
  • Implement configuration validation at startup
  • Create CLI argument parsing for overrides
  • Document all configuration options in README

Phase 2: TTS Integration (Days 2-3)

2.1 Piper TTS Setup

  • Create app/tts_engine.py module
  • Implement PiperTTSEngine class
  • Download default voice model (en_US-lessac-medium)
  • Implement voice model loading with caching
  • Implement text-to-audio synthesis method
  • Add support for configurable speech rate
  • Test TTS generation with sample text
  • Measure TTS latency for various text lengths

2.2 Voice Model Management

  • Create models/ directory for voice model storage
  • Implement voice model discovery (scan models/ directory)
  • Implement lazy loading of voice models (load on first use)
  • Create model metadata cache (name, language, quality, size)
  • Implement /voices endpoint to list available models
  • Add error handling for missing/corrupted models
  • Document voice model installation process

2.3 TTS Parameter Support

  • Implement speech rate adjustment (50-400 WPM)
  • Test rate adjustment across range
  • Add voice selection via request parameter
  • Implement voice validation (reject unknown voices)
  • Add voice_enabled flag for debugging/testing
  • Create comprehensive TTS unit tests

Phase 3: Audio Playback (Day 3)

3.1 sounddevice Integration

  • Create app/audio_player.py module
  • Implement AudioPlayer class with non-blocking sd.play()
  • Verify sounddevice detects audio devices at startup
  • Implement non-blocking playback method
  • Implement async wait_async() method for queue processing
  • Test audio playback with sample NumPy array
  • Verify non-blocking behavior with concurrent requests

3.2 Audio Error Handling

  • Implement audio device detection
  • Add retry logic for device failures
  • Handle device disconnection gracefully
  • Test with headphones unplugged during playback
  • Implement fallback to different audio devices
  • Add detailed audio error logging
  • Create audio system health check

3.3 Playback Testing

  • Test simultaneous playback (should queue)
  • Test rapid successive requests
  • Measure audio latency (request → sound output)
  • Test with various audio formats
  • Verify memory cleanup after playback
  • Test long-running playback (10+ minutes)

Phase 4: Queue Management (Day 4)

4.1 Async Queue Implementation

  • Create app/queue_manager.py module
  • Implement TTSQueue class with asyncio.Queue
  • Set configurable max queue size (default: 50)
  • Implement queue full detection
  • Create background queue processor task
  • Implement graceful queue shutdown
  • Add queue metrics (size, processed, errors)
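The graceful-shutdown item above can be sketched as a drain-then-cancel sequence against the plain asyncio.Queue used earlier (the timeout value is an assumption):

```python
import asyncio

async def shutdown_queue(queue: asyncio.Queue, worker: asyncio.Task,
                         drain_timeout: float = 10.0) -> None:
    """Gracefully stop the queue worker: drain pending items, then cancel."""
    try:
        # Let already-enqueued requests finish playing
        await asyncio.wait_for(queue.join(), timeout=drain_timeout)
    except asyncio.TimeoutError:
        pass  # give up on remaining items after the timeout
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass  # expected: the worker loop was cancelled
```

Called from the lifespan shutdown path, this ensures a restart never interrupts audio mid-sentence unless the drain timeout expires.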

4.2 Request Processing Pipeline

  • Implement request enqueueing in /notify endpoint
  • Create background worker to process queue
  • Integrate TTS generation in worker
  • Integrate audio playback in worker
  • Implement sequential playback (one at a time)
  • Add request timeout handling (max 60s per request)
  • Test queue with 100+ concurrent requests

4.3 Queue Monitoring

  • Add queue size to /health endpoint
  • Implement queue utilization metrics
  • Add logging for queue events (enqueue, process, error)
  • Create queue performance benchmarks
  • Test queue overflow scenarios
  • Document queue behavior and limits

Phase 5: Error Handling (Day 5)

5.1 Exception Handlers

  • Implement custom exception classes
  • Create QueueFullError exception handler
  • Create TTSEngineError exception handler
  • Create AudioPlaybackError exception handler
  • Create ValidationError exception handler
  • Implement generic exception handler (catch-all)
  • Test all error scenarios

5.2 Logging Infrastructure

  • Configure structured JSON logging
  • Implement rotating file handler (10MB, 5 backups)
  • Add request ID tracking across logs
  • Implement log levels appropriately (DEBUG, INFO, WARNING, ERROR)
  • Create log aggregation for queue processor
  • Test log rotation
  • Document log file locations and format
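Request-ID tracking can be sketched with a contextvar plus a logging filter; the FastAPI wiring is shown as comments because it assumes the `app` object from earlier snippets:

```python
import contextvars
import logging
import uuid

request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Inject the current request ID into every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True

# FastAPI wiring (attach the filter to handlers, then add %(request_id)s
# to the log format):
# @app.middleware("http")
# async def add_request_id(request, call_next):
#     request_id_var.set(request.headers.get("X-Request-ID", uuid.uuid4().hex))
#     response = await call_next(request)
#     response.headers["X-Request-ID"] = request_id_var.get()
#     return response
```

Because contextvars are task-local, concurrent requests keep independent IDs even though they share one event loop.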

5.3 Health Monitoring

  • Implement comprehensive /health endpoint
  • Add TTS engine health check
  • Add audio system health check
  • Add queue status to health check
  • Add system resource metrics (CPU, memory)
  • Test health endpoint under load
  • Create health check monitoring script

Phase 6: Testing (Days 5-6)

6.1 Unit Tests

  • Create tests/ directory structure
  • Install pytest: uv pip install pytest pytest-asyncio
  • Write tests for Pydantic models
  • Write tests for TTS engine
  • Write tests for audio player
  • Write tests for queue manager
  • Write tests for configuration loading
  • Achieve 80%+ code coverage

6.2 Integration Tests

  • Write tests for /notify endpoint
  • Write tests for /health endpoint
  • Write tests for /voices endpoint
  • Test end-to-end request flow
  • Test concurrent request handling
  • Test queue overflow scenarios
  • Test error scenarios (TTS failure, audio failure)

6.3 Performance Tests

  • Create load testing script with locust or wrk
  • Test 100 concurrent requests
  • Measure request latency (p50, p95, p99)
  • Measure TTS generation time
  • Measure audio playback latency
  • Measure memory usage under load
  • Document performance characteristics

6.4 System Tests

  • Test on target Linux environment (Nobara/Fedora 42)
  • Test with different audio devices
  • Test with PulseAudio and ALSA
  • Test headphone disconnect/reconnect
  • Test system resource exhaustion scenarios
  • Test server restart recovery
  • Test long-running stability (24+ hours)

Phase 7: Documentation & Deployment (Days 6-7)

7.1 Documentation

  • Create comprehensive README.md:
    • Project overview
    • Installation instructions
    • Configuration options
    • Usage examples
    • API documentation
    • Troubleshooting guide
  • Create CONTRIBUTING.md (if open source)
  • Create CHANGELOG.md
  • Document voice model installation
  • Create architecture diagrams
  • Add inline code documentation
  • Create example client scripts (curl, Python)

7.2 Deployment Preparation

  • Create systemd service file (voice-server.service)
  • Test systemd service installation
  • Test automatic restart on failure
  • Create deployment script (deploy.sh)
  • Document deployment process
  • Create backup/restore procedures
  • Test upgrade procedure
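The service file item above can be sketched as follows; paths, user, and port are assumptions to adapt. Note that a system-level unit usually cannot reach the desktop user's PulseAudio/PipeWire session, so installing this as a user unit (systemctl --user) is often the simpler route:

```ini
# voice-server.service (sketch)
[Unit]
Description=Local TTS voice server
After=network.target sound.target

[Service]
Type=simple
WorkingDirectory=/mnt/NV2/Development/voice-server
ExecStart=/mnt/NV2/Development/voice-server/.venv/bin/uvicorn app.main:app --host 127.0.0.1 --port 8888
Restart=on-failure
RestartSec=5

[Install]
WantedBy=default.target
```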

7.3 Production Hardening

  • Enable production logging (disable debug logs)
  • Configure log rotation
  • Set up monitoring (optional: Prometheus, Grafana)
  • Implement graceful shutdown (SIGTERM handling)
  • Test crash recovery
  • Implement rate limiting (optional)
  • Security audit (input sanitization, resource limits)
  • Performance tuning (queue size, worker count)

Testing Strategy

Unit Testing

Framework: pytest with pytest-asyncio

Test Coverage Requirements:

  • Minimum 80% code coverage
  • 100% coverage for critical paths (TTS, audio playback)
  • All error handlers must have tests

Test Structure:

tests/
├── __init__.py
├── conftest.py              # Shared fixtures
├── unit/
│   ├── test_config.py       # Configuration loading tests
│   ├── test_models.py       # Pydantic model tests
│   ├── test_tts_engine.py   # TTS engine tests
│   ├── test_audio_player.py # Audio player tests
│   └── test_queue.py        # Queue manager tests
├── integration/
│   ├── test_api.py          # API endpoint tests
│   ├── test_end_to_end.py   # Full request flow tests
│   └── test_errors.py       # Error scenario tests
└── performance/
    └── test_load.py         # Load testing

Sample Unit Test:

# tests/unit/test_tts_engine.py
import numpy as np
import pytest

from app.tts_engine import PiperTTSEngine

@pytest.fixture
def tts_engine():
    """Create TTS engine instance."""
    return PiperTTSEngine(model_dir="models/")

def test_tts_engine_initialization(tts_engine):
    """Test TTS engine initializes successfully."""
    assert tts_engine is not None
    assert tts_engine.default_voice == "en_US-lessac-medium"

def test_text_to_audio_conversion(tts_engine):
    """Test converting text to audio."""
    audio = tts_engine.synthesize("Hello world")
    assert audio is not None
    assert len(audio) > 0
    assert audio.dtype == np.float32

def test_invalid_voice_raises_error(tts_engine):
    """Test that invalid voice raises appropriate error."""
    with pytest.raises(ValueError, match="Voice model .* not found"):
        tts_engine.synthesize("Hello", voice="invalid_voice")

@pytest.mark.asyncio
async def test_async_synthesis(tts_engine):
    """Test async TTS synthesis."""
    audio = await tts_engine.synthesize_async("Hello world")
    assert audio is not None

Sample Integration Test:

# tests/integration/test_api.py
import pytest
from fastapi.testclient import TestClient
from app.main import app

@pytest.fixture
def client():
    """Create test client."""
    return TestClient(app)

def test_notify_endpoint_success(client):
    """Test successful /notify request."""
    response = client.post(
        "/notify",
        json={"message": "Test message", "rate": 180}
    )
    assert response.status_code == 202
    data = response.json()
    assert data["status"] == "queued"
    assert data["message_length"] == 12

def test_notify_endpoint_validation_error(client):
    """Test /notify with invalid parameters."""
    response = client.post(
        "/notify",
        json={"message": "", "rate": 1000}  # Empty message, invalid rate
    )
    assert response.status_code == 422

def test_health_endpoint(client):
    """Test /health endpoint."""
    response = client.get("/health")
    assert response.status_code == 200
    data = response.json()
    assert "status" in data
    assert "checks" in data  # matches the /health response structure

Load Testing

Tool: wrk or locust

Sample wrk Test:

# Install wrk
sudo dnf install wrk

# Run load test: 100 concurrent connections, 30 seconds
wrk -t4 -c100 -d30s -s post.lua http://localhost:8888/notify

# post.lua script:
# wrk.method = "POST"
# wrk.headers["Content-Type"] = "application/json"
# wrk.body = '{"message": "Load test message"}'

Sample locust Test:

# locustfile.py
from locust import HttpUser, task, between

class VoiceServerUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def notify(self):
        self.client.post("/notify", json={
            "message": "This is a load test message",
            "rate": 180
        })

    @task(5)
    def health_check(self):
        self.client.get("/health")

# Run: locust -f locustfile.py --host=http://localhost:8888

Performance Benchmarks:

Metric                       Target     Acceptable   Unacceptable
API Response Time (p95)      < 50ms     < 100ms      > 200ms
TTS Generation (500 chars)   < 2s       < 5s         > 10s
Requests/Second              > 50       > 20         < 10
Memory Usage (idle)          < 200MB    < 500MB      > 1GB
Memory Usage (load)          < 500MB    < 1GB        > 2GB
Queue Processing Rate        > 10/s     > 5/s        < 2/s

Manual Testing Checklist

Functional Testing:

  • Send POST request with valid message → Hear audio playback
  • Send request with long text (5000 chars) → Successful playback
  • Send request with special characters → Successful sanitization
  • Send request with invalid voice → Receive 422 error
  • Send request with rate=50 → Slow speech playback
  • Send request with rate=400 → Fast speech playback
  • Send 10 concurrent requests → All play sequentially
  • Fill queue to capacity → Receive 503 error
  • Check /health endpoint → Receive status information
  • Check /voices endpoint → See available voice models
  • Check /docs endpoint → See Swagger documentation

Error Scenario Testing:

  • Unplug headphones during playback → Graceful error handling
  • Kill PulseAudio daemon → Audio error logged, server continues
  • Send malformed JSON → Receive 422 error (FastAPI validation error for an unparseable body)
  • Send empty message → Receive 422 error
  • Send 11,000 character message → Receive 422 error (exceeds max_length=10000)
  • Restart server during playback → Queue cleared, server restarts

System Testing:

  • Run server for 24 hours → No memory leaks
  • Send 10,000 requests → All processed successfully
  • Monitor CPU usage during load → < 50% average
  • Monitor memory usage during load → < 1GB
  • Test on Fedora 42 → Successful operation
  • Test with ALSA (without PulseAudio) → Successful operation

Future Considerations

Optional Features (Post-v1.0)

1. Advanced Voice Control

  • Pitch adjustment: Allow clients to specify pitch modification
  • Volume control: Per-request volume settings
  • Emotion/tone control: Happy, sad, angry voice modulation (if TTS engine supports)
  • Voice cloning: Custom voice model training (Coqui TTS integration)

Implementation Complexity: Medium
User Value: High for accessibility and personalization


2. Audio Format Options

  • Output format selection: Support WAV, MP3, OGG output
  • Sample rate options: Allow 16kHz, 22kHz, 44.1kHz selection
  • Compression levels: Configurable audio quality vs file size

Implementation Complexity: Low
User Value: Medium (mostly for file storage use cases)


3. Streaming Audio

  • Real-time streaming: Stream audio as it's generated (WebSocket or SSE)
  • Chunked TTS: Generate and stream long texts in chunks
  • Lower latency: Start playback before full text is synthesized

Implementation Complexity: High
User Value: High for very long texts


4. SSML Support

  • Prosody control: Fine-grained control over speech characteristics
  • Break insertion: Explicit pauses and timing control
  • Phoneme specification: Correct pronunciation for unusual words
  • Multi-voice support: Different voices within single text

Example:

<speak>
  Hello, <break time="500ms"/> this is <emphasis>important</emphasis>.
  <voice name="en_US-libritts">A different voice.</voice>
</speak>

Implementation Complexity: Medium
User Value: High for advanced use cases


5. Caching Layer

  • TTS result caching: Cache frequently requested texts
  • Cache invalidation: LRU eviction policy
  • Cache persistence: Store cache across restarts
  • Cache statistics: Hit rate monitoring

Implementation Complexity: Low
User Value: High for repeated texts (notifications, alerts)

Sample Implementation:

import hashlib

class TTSCache:
    def __init__(self, max_size: int = 1000):
        self.cache = {}
        self.max_size = max_size

    def get_cache_key(self, text: str, voice: str, rate: int) -> str:
        """Generate cache key from TTS parameters."""
        content = f"{text}|{voice}|{rate}"
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, text: str, voice: str, rate: int):
        """Retrieve cached audio."""
        key = self.get_cache_key(text, voice, rate)
        return self.cache.get(key)

    def put(self, text: str, voice: str, rate: int, audio_data):
        """Store audio in cache with LRU eviction."""
        if len(self.cache) >= self.max_size:
            # Evict oldest entry (simple FIFO, use OrderedDict for true LRU)
            self.cache.pop(next(iter(self.cache)))

        key = self.get_cache_key(text, voice, rate)
        self.cache[key] = audio_data
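For the true LRU that the eviction comment alludes to, OrderedDict.move_to_end gives an O(1) variant (class name is illustrative):

```python
import hashlib
from collections import OrderedDict

class LRUTTSCache:
    """True-LRU TTS cache using OrderedDict recency ordering."""

    def __init__(self, max_size: int = 1000):
        self.cache: OrderedDict[str, object] = OrderedDict()
        self.max_size = max_size

    @staticmethod
    def _key(text: str, voice: str, rate: int) -> str:
        return hashlib.sha256(f"{text}|{voice}|{rate}".encode()).hexdigest()

    def get(self, text: str, voice: str, rate: int):
        key = self._key(text, voice, rate)
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        return None

    def put(self, text: str, voice: str, rate: int, audio_data) -> None:
        key = self._key(text, voice, rate)
        self.cache[key] = audio_data
        self.cache.move_to_end(key)
        if len(self.cache) > self.max_size:
            self.cache.popitem(last=False)  # evict least recently used
```

A `get` refreshes recency, so frequently replayed notifications stay cached while one-off texts age out first.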

6. Multi-Language Support

  • Automatic language detection: Detect input language
  • Language-specific voice selection: Match voice to detected language
  • Mixed-language support: Handle multilingual texts

Implementation Complexity: Medium
User Value: High for international users


7. Audio Effects

  • Reverb: Add spatial audio effects
  • Echo: Add echo effects
  • Speed adjustment: Time-stretch without pitch change
  • Normalization: Automatic volume leveling

Implementation Complexity: Medium (requires audio processing library like pydub or librosa)
User Value: Medium (aesthetic enhancement)


8. Queue Priority System

  • Priority levels: High, normal, low priority requests
  • Priority queues: Separate queues for different priorities
  • Preemption: Allow high-priority requests to interrupt low-priority

Implementation Complexity: Medium
User Value: Medium for multi-tenant scenarios


9. Webhook Notifications

  • Completion webhooks: Notify external service when TTS completes
  • Error webhooks: Notify on TTS failures
  • Webhook retry logic: Handle webhook delivery failures

Example Request:

{
  "message": "Hello world",
  "webhook_url": "https://example.com/tts-complete"
}

Implementation Complexity: Low
User Value: High for integration scenarios


10. Authentication & Authorization

  • API key authentication: Secure endpoint access
  • Rate limiting: Per-user request limits
  • Usage quotas: Daily/monthly request quotas
  • Multi-tenant support: Isolated queues per user

Implementation Complexity: High
User Value: High for shared/production deployments


11. Web Interface

  • Simple web UI: Browser-based TTS interface
  • Queue visualization: Real-time queue status display
  • Voice model management: Upload/download voice models via UI
  • Settings configuration: Web-based configuration editor

Implementation Complexity: Medium
User Value: High for non-technical users


12. Docker Deployment

  • Dockerfile: Container image for easy deployment
  • Docker Compose: Multi-container setup with monitoring
  • Volume management: Persistent voice model storage
  • Health check integration: Container health monitoring

Sample Dockerfile:

FROM python:3.11-slim

# Install system dependencies (PortAudio for sounddevice)
RUN apt-get update && apt-get install -y \
    libportaudio2 \
    portaudio19-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY app/ ./app/

# Download default voice model (the exact command depends on the piper-tts
# version; recent releases ship a download helper module)
RUN python -m piper.download_voices en_US-lessac-medium

EXPOSE 8888

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8888"]

Implementation Complexity: Low
User Value: High for deployment consistency


13. Metrics & Monitoring

  • Prometheus metrics: Request count, latency, queue size
  • Grafana dashboards: Visual monitoring
  • Alerting: Notify on errors, high queue size, etc.
  • Performance profiling: Identify bottlenecks

Sample Metrics:

from prometheus_client import Counter, Histogram, Gauge

request_counter = Counter('tts_requests_total', 'Total TTS requests')
latency_histogram = Histogram('tts_latency_seconds', 'TTS latency')
queue_size_gauge = Gauge('tts_queue_size', 'Current queue size')

@app.post("/notify")
async def notify(request: NotifyRequest):
    request_counter.inc()
    with latency_histogram.time():
        # Process request
        ...
    queue_size_gauge.set(tts_queue.qsize())

Implementation Complexity: Medium
User Value: High for production deployments


Scalability Considerations

Horizontal Scaling:

  • Use Redis for shared queue across multiple server instances
  • Implement distributed locking for audio device access
  • Load balance requests across multiple servers

Vertical Scaling:

  • Increase queue size for higher throughput
  • Use GPU acceleration for TTS (CUDA support in Piper)
  • Optimize voice model loading (keep models in memory)

Architecture Evolution:

  • Separate TTS generation and audio playback into microservices
  • Use message queue (RabbitMQ, Kafka) for request distribution
  • Implement worker pool for parallel TTS generation


Document History

Version  Date        Author  Changes
1.0      2025-12-18  Atlas   Initial PRD creation

Document Status: Complete - Ready for Implementation

Next Steps:

  1. Review PRD with stakeholders
  2. Approve technical stack decisions
  3. Begin Phase 1 implementation
  4. Set up project tracking (GitHub Issues, Jira, etc.)
  5. Assign development resources

Questions or Feedback: Contact Atlas at atlas@manticorum.com