Product Requirements Document: Local Voice Server
Version: 1.0
Date: 2025-12-18
Author: Atlas (Principal Software Architect)
Project: Local HTTP Voice Server for Text-to-Speech
Table of Contents
- Executive Summary
- Goals and Non-Goals
- Technical Requirements
- System Architecture
- API Specification
- TTS Engine Analysis
- Web Framework Selection
- Audio Playback Strategy
- Error Handling Strategy
- Implementation Checklist
- Testing Strategy
- Future Considerations
Executive Summary
Project Overview
This project delivers a local HTTP service that accepts POST requests containing text strings and converts them to speech through the computer's speakers. The service will run locally on Linux (Nobara/Fedora 42), providing fast, offline text-to-speech capabilities without requiring external API calls or internet connectivity.
Success Metrics
- Response Time: TTS conversion and playback initiation within 200ms for short texts (< 100 characters)
- Reliability: 99.9% successful request handling under normal operating conditions
- Concurrency: Support for at least 5 concurrent TTS requests with proper queuing
- Audio Quality: Clear, intelligible speech output comparable to Google TTS quality
- Startup Time: Server ready to accept requests within 2 seconds of launch
Technical Stack
| Component | Technology | Justification |
|---|---|---|
| Web Framework | FastAPI | Async support, high performance (15k-20k req/s), automatic API documentation |
| TTS Engine | Piper TTS | Neural voice quality, offline, optimized for local inference, ONNX-based |
| Audio Playback | sounddevice | Cross-platform, Pythonic API, excellent NumPy integration, non-blocking playback |
| Package Manager | uv | Fast Python package management (user preference) |
| ASGI Server | Uvicorn | High-performance ASGI server, native FastAPI integration |
| Async Runtime | asyncio | Built-in Python async support for concurrent request handling |
Timeline Estimate
- Phase 1 - Core Implementation: 2-3 days (basic HTTP server + TTS integration)
- Phase 2 - Error Handling & Testing: 1-2 days (comprehensive error handling, unit tests)
- Phase 3 - Concurrency & Queue Management: 1-2 days (async queue, concurrent playback)
- Total Estimated Time: 4-7 days for production-ready v1.0
Resource Requirements
- Development: 1 full-stack Python developer with async programming experience
- Testing: Access to Linux environment (Nobara/Fedora 42) with audio hardware
- Infrastructure: Local development machine with 2+ CPU cores, 4GB+ RAM
Goals and Non-Goals
Goals
Primary Goals:
- Create a local HTTP service that accepts text via POST requests
- Convert text to speech using high-quality offline TTS
- Play audio through system speakers with minimal latency
- Support concurrent requests with proper queue management
- Provide comprehensive error handling and logging
- Maintain zero external dependencies (fully offline capable)
Secondary Goals:
- Automatic API documentation via FastAPI's built-in OpenAPI support
- Configurable TTS parameters (voice, speed, volume) via request parameters
- Health check endpoint for service monitoring
- Graceful handling of long-running text conversions
- Support for multiple voice models
Non-Goals
Explicitly Out of Scope:
- Cloud-based or external API integration
- Speech-to-text (STT) capabilities
- Audio file storage or retrieval
- User authentication or authorization
- Rate limiting or quota management
- Multi-language UI or web interface
- Real-time streaming audio synthesis
- Mobile app integration
- Persistent audio history or logging
- Advanced audio effects (reverb, pitch shifting, etc.)
Technical Requirements
Functional Requirements
FR1: HTTP Server
- FR1.1: Server SHALL listen on configurable host and port (default: 0.0.0.0:8888)
- FR1.2: Server SHALL accept POST requests to the /notify endpoint
- FR1.3: Server SHALL accept a JSON payload with a message field containing the text
- FR1.4: Server SHALL return HTTP 202 (Accepted) confirming the request was queued
- FR1.5: Server SHALL support CORS for local development
FR2: Text-to-Speech Conversion
- FR2.1: System SHALL convert text strings to audio using Piper TTS
- FR2.2: System SHALL support configurable voice models via request parameters
- FR2.3: System SHALL support adjustable speech rate (50-400 words per minute)
- FR2.4: System SHALL handle text inputs from 1 to 10,000 characters
- FR2.5: System SHALL use default voice if not specified in request
FR3: Audio Playback
- FR3.1: System SHALL play generated audio through default system audio output
- FR3.2: System SHALL support non-blocking audio playback
- FR3.3: System SHALL queue concurrent requests in FIFO order
- FR3.4: System SHALL allow configurable maximum queue size (default: 50)
- FR3.5: System SHALL provide feedback when queue is full
FR4: Configuration
- FR4.1: System SHALL support configuration via environment variables
- FR4.2: System SHALL support configuration via command-line arguments
- FR4.3: System SHALL provide sensible defaults for all configuration values
- FR4.4: System SHALL validate configuration at startup
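The FR4 requirements above can be sketched as follows. This is a hedged, illustrative layering of environment variables, CLI flags, and startup validation; the variable names (`VOICE_SERVER_*`) and flag names are assumptions, not part of the specification.

```python
import argparse
import os

def load_config(argv=None):
    """FR4 sketch: env vars set defaults, CLI flags override, values validated."""
    parser = argparse.ArgumentParser(description="Local voice server")
    parser.add_argument("--host",
                        default=os.environ.get("VOICE_SERVER_HOST", "0.0.0.0"))
    parser.add_argument("--port", type=int,
                        default=int(os.environ.get("VOICE_SERVER_PORT", "8888")))
    parser.add_argument("--max-queue-size", type=int,
                        default=int(os.environ.get("VOICE_SERVER_QUEUE_SIZE", "50")))
    config = parser.parse_args(argv)
    # FR4.4: fail fast at startup on invalid configuration
    if not 1 <= config.port <= 65535:
        raise ValueError(f"Invalid port: {config.port}")
    if config.max_queue_size < 1:
        raise ValueError(f"Invalid queue size: {config.max_queue_size}")
    return config
```

Running with `--port 9000` would override both the built-in default and any `VOICE_SERVER_PORT` value, satisfying FR4.1-FR4.3.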
FR5: Error Handling
- FR5.1: System SHALL return appropriate HTTP error codes for failures
- FR5.2: System SHALL log all errors with timestamps and context
- FR5.3: System SHALL continue operating after non-fatal errors
- FR5.4: System SHALL gracefully handle TTS engine failures
- FR5.5: System SHALL provide detailed error messages in responses
Non-Functional Requirements
NFR1: Performance
- NFR1.1: API response time SHALL be < 50ms (excluding TTS processing)
- NFR1.2: TTS conversion SHALL complete in < 2 seconds for 500 character texts
- NFR1.3: System SHALL handle 20+ requests per second without degradation
- NFR1.4: Memory usage SHALL remain < 500MB under normal load
- NFR1.5: CPU usage SHALL average < 30% during active TTS processing
NFR2: Reliability
- NFR2.1: System SHALL maintain 99.9% uptime during operation
- NFR2.2: System SHALL recover from audio device disconnections
- NFR2.3: System SHALL handle Out-of-Memory conditions gracefully
- NFR2.4: System SHALL log all critical errors for debugging
NFR3: Maintainability
- NFR3.1: Code SHALL maintain > 80% test coverage
- NFR3.2: All functions SHALL include docstrings with type hints
- NFR3.3: Code SHALL follow PEP 8 style guidelines
- NFR3.4: Dependencies SHALL be pinned to specific versions
- NFR3.5: README SHALL provide clear setup and usage instructions
NFR4: Security
- NFR4.1: System SHALL sanitize all text inputs to prevent injection attacks
- NFR4.2: System SHALL limit request payload size to 1MB
- NFR4.3: System SHALL not expose internal stack traces in API responses
- NFR4.4: System SHALL log all incoming requests for audit purposes
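The NFR4.2 payload cap can be expressed as a small check run before the body is parsed. In the real server this would live in FastAPI middleware (or a reverse proxy); the helper below is an illustrative sketch of the decision logic only.

```python
from typing import Optional

MAX_BODY_BYTES = 1 * 1024 * 1024  # NFR4.2: 1MB request payload limit

def payload_within_limit(content_length: Optional[str]) -> bool:
    """Return True if a request's declared Content-Length fits the 1MB cap."""
    if content_length is None:
        return True  # no declared length; must be enforced while streaming
    try:
        return int(content_length) <= MAX_BODY_BYTES
    except ValueError:
        return False  # malformed header: reject
```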
NFR5: Compatibility
- NFR5.1: System SHALL run on Linux (Nobara/Fedora 42)
- NFR5.2: System SHALL support Python 3.9+
- NFR5.3: System SHALL work with standard ALSA/PulseAudio setups
- NFR5.4: System SHALL be deployable as a systemd service
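NFR5.4 could be satisfied with a unit file along these lines. Paths, the service user, and the `uv run uvicorn` invocation are placeholders to adapt to the actual install location; because audio output typically needs access to the user's PulseAudio/PipeWire session, a user-level unit (`systemctl --user`) may be more appropriate than a system one.

```ini
[Unit]
Description=Local voice server (Piper TTS)
After=network.target sound.target

[Service]
# Paths, user, and module name are placeholders; adjust for the real install.
WorkingDirectory=/opt/voice-server
ExecStart=/usr/bin/uv run uvicorn voice_server:app --host 0.0.0.0 --port 8888
User=voiceuser
Restart=on-failure
RestartSec=5

[Install]
WantedBy=default.target
```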
System Architecture
High-Level Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Client Applications │
│ (AI Agents, Scripts, Other Services) │
└────────────────────────────┬────────────────────────────────────┘
│ HTTP POST /notify
│ JSON: {"message": "text"}
▼
┌─────────────────────────────────────────────────────────────────┐
│ FastAPI Web Server │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ /notify │ │ /health │ │ /docs │ │
│ │ endpoint │ │ endpoint │ │ (Swagger) │ │
│ └──────┬───────┘ └──────────────┘ └──────────────┘ │
│ │ │
│ │ Validates & Enqueues │
│ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Async Request Queue │ │
│ │ (asyncio.Queue with max size limit) │ │
│ └──────────────────┬───────────────────────────────┘ │
└────────────────────┬┼───────────────────────────────────────────┘
││
││ Background Task Processing
▼▼
┌─────────────────────────────────────────────────────────────────┐
│ TTS Processing Layer │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Piper TTS Engine │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Voice Models │ │ ONNX Runtime │ │ │
│ │ │ (.onnx + │ │ Inference │ │ │
│ │ │ .json) │ │ Engine │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────┬──────────────────────────┘ │
│ │ Generate WAV │
│ ▼ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ In-Memory Audio Buffer │ │
│ │ (NumPy array / bytes) │ │
│ └─────────────────────────┬──────────────────────────┘ │
└────────────────────────────┼───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Audio Playback Layer │
│ ┌────────────────────────────────────────────────────┐ │
│ │              sounddevice Stream Manager           │        │
│ │ - Callback-based playback │ │
│ │ - Non-blocking operation │ │
│ │ - Stream lifecycle management │ │
│ └─────────────────────────┬──────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ System Audio Output (ALSA/PulseAudio) │ │
│ └────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
🔊 Computer Speakers
Component Descriptions
1. FastAPI Web Server
Responsibilities:
- Accept and validate HTTP POST requests
- Provide automatic OpenAPI documentation
- Handle CORS configuration
- Route requests to appropriate handlers
- Return HTTP responses with appropriate status codes
Dependencies:
- FastAPI framework
- Uvicorn ASGI server
- Pydantic for request/response validation
2. Async Request Queue
Responsibilities:
- Queue incoming TTS requests in FIFO order
- Prevent queue overflow with configurable max size
- Enable asynchronous processing without blocking HTTP responses
- Provide queue status information
Implementation:
- asyncio.Queue for async-safe queuing
- Background task workers to process queue
- Queue metrics (size, processed count, errors)
3. TTS Processing Layer
Responsibilities:
- Load and manage Piper TTS voice models
- Convert text to audio waveforms
- Handle voice model selection
- Configure TTS parameters (rate, pitch, volume)
- Generate in-memory audio buffers
Implementation:
- Piper TTS Python bindings
- ONNX Runtime for model inference
- Voice model caching for performance
- Error handling for model loading failures
4. Audio Playback Layer
Responsibilities:
- Initialize audio output streams
- Play audio buffers through system speakers
- Support non-blocking playback
- Handle audio device errors
- Manage stream lifecycle
Implementation:
- sounddevice for cross-platform audio I/O
- Non-blocking sd.play() with background playback
- Simple NumPy array integration
- Graceful handling of audio device disconnections
Data Flow
Request Processing Flow:
1. HTTP Request Reception:
- Client sends POST to /notify with JSON payload
- FastAPI validates request schema using Pydantic models
- Request is immediately acknowledged with HTTP 202 (Accepted)
2. Request Enqueueing:
- Validated request is added to async queue
- If queue is full, return HTTP 503 (Service Unavailable)
- Queue position is logged for monitoring
3. Background Processing:
- Background worker retrieves request from queue
- Text is passed to Piper TTS for conversion
- Piper generates WAV audio in memory
4. Audio Playback:
- Audio buffer is passed to sounddevice
- sounddevice streams audio to system output
- Playback occurs in a background thread (non-blocking)
- Completion is logged
5. Error Handling:
- Errors at any stage are caught and logged
- Failed requests are removed from queue
- Error metrics are updated
Technology Stack Justification
FastAPI vs Flask
Decision: FastAPI
Rationale:
- Performance: FastAPI handles 15,000-20,000 req/s vs Flask's 2,000-3,000 req/s (Strapi Comparison)
- Async Native: Built on ASGI with native async/await support, critical for non-blocking TTS processing
- Type Safety: Pydantic integration provides automatic request validation and serialization
- Documentation: Automatic OpenAPI (Swagger) documentation generation
- Modern Architecture: Designed for microservices and high-concurrency applications
- Growing Adoption: 78k GitHub stars, 38% developer adoption in 2025 (40% YoY increase)
Trade-offs:
- Steeper learning curve compared to Flask
- Smaller ecosystem of extensions (though growing rapidly)
- Requires ASGI server (Uvicorn) vs Flask's built-in development server
Piper TTS Engine Selection
Decision: Piper TTS
Rationale:
- Voice Quality: Neural TTS with "Google TTS level quality" (AntiX Forum)
- Offline Operation: Fully local, no internet required
- Performance: Optimized for local inference using ONNX Runtime
- Resource Efficiency: Runs on Raspberry Pi 4, suitable for desktop Linux
- Easy Installation: Available via pip (pip install piper-tts)
- Active Development: Maintained project with 2025 updates
- Multiple Voices: Extensive voice model library with quality/speed trade-offs
Comparison with Alternatives:
| Engine | Voice Quality | Speed | Resource Usage | Offline | Ease of Use |
|---|---|---|---|---|---|
| Piper TTS | ⭐⭐⭐⭐⭐ Neural | ⭐⭐⭐⭐ Fast | ⭐⭐⭐⭐ Medium | ✅ Yes | ⭐⭐⭐⭐ Easy |
| pyttsx3 | ⭐⭐ Robotic | ⭐⭐⭐⭐⭐ Very Fast | ⭐⭐⭐⭐⭐ Very Low | ✅ Yes | ⭐⭐⭐⭐⭐ Very Easy |
| eSpeak | ⭐⭐ Robotic | ⭐⭐⭐⭐⭐ Very Fast | ⭐⭐⭐⭐⭐ Very Low | ✅ Yes | ⭐⭐⭐⭐ Easy |
| gTTS | ⭐⭐⭐⭐⭐ Neural | ⭐⭐ Slow | ⭐⭐⭐⭐ Low | ❌ No | ⭐⭐⭐⭐⭐ Very Easy |
| Coqui TTS | ⭐⭐⭐⭐⭐ Neural | ⭐⭐⭐ Medium | ⭐⭐ High | ✅ Yes | ⭐⭐ Complex |
Trade-offs:
- Larger model files (~20-100MB per voice) vs simple engines
- Higher resource usage than pyttsx3/eSpeak
- Requires ONNX Runtime dependency
sounddevice for Audio Playback
Decision: sounddevice
Rationale:
- Pythonic API: Clean, intuitive interface that feels native to Python
- NumPy Integration: Direct support for NumPy arrays (perfect for Piper TTS output)
- Non-Blocking: Simple sd.play() returns immediately, audio plays in background
- Cross-Platform: Works on Linux, Windows, macOS via PortAudio backend
- Active Maintenance: Well-maintained with regular updates
- Simple Async: Easy integration with asyncio via sd.wait() or callbacks
Comparison with Alternatives:
| Library | Non-Blocking | Dependencies | Maintenance | Linux Support |
|---|---|---|---|---|
| sounddevice | ✅ Native | PortAudio | ⭐⭐⭐⭐ Active | ✅ Excellent |
| PyAudio | ✅ Callbacks | PortAudio | ⭐⭐⭐ Active | ✅ Excellent |
| simpleaudio | ✅ Async | None | ❌ Archived | ⭐⭐⭐ Good |
| pygame | ⭐ Limited | SDL | ⭐⭐⭐⭐ Active | ⭐⭐⭐⭐ Excellent |
Why sounddevice over PyAudio:
- Simpler API - sd.play(audio, samplerate) vs PyAudio's stream setup
- Better NumPy support - no conversion needed from Piper's output
- More Pythonic - feels like a modern Python library
- Easier async integration - works naturally with asyncio
API Specification
Endpoint: POST /notify
Description: Accept text string and queue for TTS playback
Request Schema:
{
"message": "string (required)",
"voice": "string (optional)",
"rate": "integer (optional, default: 170)",
"voice_enabled": "boolean (optional, default: true)"
}
Request Parameters:
| Parameter | Type | Required | Default | Constraints | Description |
|---|---|---|---|---|---|
| message | string | Yes | - | 1-10000 chars | Text to convert to speech |
| voice | string | No | en_US-lessac-medium | Valid voice model name | Piper voice model to use |
| rate | integer | No | 170 | 50-400 | Speech rate in words per minute |
| voice_enabled | boolean | No | true | - | Enable/disable TTS (for debugging) |
Example Request:
curl -X POST http://localhost:8888/notify \
-H "Content-Type: application/json" \
-d '{
"message": "Hello, this is a test of the voice server",
"rate": 200,
"voice_enabled": true
}'
Response Schema (Success - 202 Accepted):
{
"status": "queued",
"message_length": 42,
"queue_position": 3,
"estimated_duration": 2.5,
"voice_model": "en_US-lessac-medium"
}
Response Schema (Error - 400 Bad Request):
{
"error": "validation_error",
"detail": "message field is required",
"timestamp": "2025-12-18T10:30:45.123Z"
}
Response Schema (Error - 503 Service Unavailable):
{
"error": "queue_full",
"detail": "TTS queue is full, please retry later",
"queue_size": 50,
"timestamp": "2025-12-18T10:30:45.123Z"
}
HTTP Status Codes:
| Code | Meaning | Scenario |
|---|---|---|
| 202 | Accepted | Request successfully queued for processing |
| 400 | Bad Request | Invalid request parameters or malformed JSON |
| 413 | Payload Too Large | Message exceeds 10,000 characters |
| 422 | Unprocessable Entity | Valid JSON but invalid parameter values |
| 500 | Internal Server Error | TTS engine failure or unexpected error |
| 503 | Service Unavailable | Queue is full or service is shutting down |
Endpoint: GET /health
Description: Health check endpoint for monitoring
Request: No parameters
Response Schema (Healthy - 200 OK):
{
"status": "healthy",
"uptime_seconds": 3600,
"queue_size": 2,
"queue_capacity": 50,
"tts_engine": "piper",
"audio_output": "available",
"voice_models_loaded": ["en_US-lessac-medium"],
"total_requests": 1523,
"failed_requests": 12,
"timestamp": "2025-12-18T10:30:45.123Z"
}
Response Schema (Unhealthy - 503 Service Unavailable):
{
"status": "unhealthy",
"errors": [
"Audio output device unavailable",
"TTS engine failed to initialize"
],
"timestamp": "2025-12-18T10:30:45.123Z"
}
Endpoint: GET /docs
Description: Automatic Swagger UI documentation (provided by FastAPI)
Access: http://localhost:8888/docs
Features:
- Interactive API testing
- Schema visualization
- Request/response examples
- Authentication testing (if implemented)
Endpoint: GET /voices
Description: List available TTS voice models
Request: No parameters
Response Schema (200 OK):
{
"voices": [
{
"name": "en_US-lessac-medium",
"language": "en_US",
"quality": "medium",
"size_mb": 63.5,
"installed": true
},
{
"name": "en_US-libritts-high",
"language": "en_US",
"quality": "high",
"size_mb": 108.2,
"installed": false
}
],
"default_voice": "en_US-lessac-medium"
}
TTS Engine Analysis
Detailed Comparison Matrix
| Engine | Voice Quality | Latency | CPU Usage | Memory | Offline | Linux Support | Python API | Maintenance |
|---|---|---|---|---|---|---|---|---|
| Piper TTS | ⭐⭐⭐⭐⭐ | ~500ms | Medium | ~200MB | ✅ | ✅ Excellent | ✅ Native | 🟢 Active |
| pyttsx3 | ⭐⭐ | ~100ms | Low | ~50MB | ✅ | ✅ Good | ✅ Native | 🟢 Active |
| eSpeak-ng | ⭐⭐ | ~50ms | Very Low | ~20MB | ✅ | ✅ Excellent | ⚠️ Wrapper | 🟢 Active |
| gTTS | ⭐⭐⭐⭐⭐ | ~2000ms | Low | ~30MB | ❌ | ✅ Good | ✅ Native | 🟢 Active |
| Coqui TTS | ⭐⭐⭐⭐⭐ | ~1500ms | High | ~500MB | ✅ | ✅ Good | ✅ Native | 🟡 Slow |
| Festival | ⭐⭐⭐ | ~300ms | Low | ~100MB | ✅ | ✅ Excellent | ⚠️ Wrapper | 🟡 Slow |
| Mimic3 | ⭐⭐⭐⭐ | ~800ms | Medium | ~300MB | ✅ | ✅ Good | ❌ HTTP only | 🟢 Active |
Detailed Engine Profiles
1. Piper TTS (RECOMMENDED)
Pros:
- Neural TTS with natural-sounding voices
- Optimized for local inference (ONNX Runtime)
- Multiple quality levels (low/medium/high)
- Extensive language and voice support
- Active development and community
- Easy pip installation
- GPU acceleration support (CUDA)
Cons:
- Larger model files (20-100MB per voice)
- Higher resource usage than simple engines
- Initial model download required
- Slightly higher latency than robotic engines
Installation:
uv pip install piper-tts
Usage Example:
from piper import PiperVoice
import wave
voice = PiperVoice.load("en_US-lessac-medium.onnx")
with wave.open("output.wav", "wb") as wav_file:
voice.synthesize("Hello world", wav_file)
Voice Quality Sample:
- Low Quality: Faster, smaller models (~20MB), decent quality
- Medium Quality: Balanced performance (~60MB), recommended default
- High Quality: Best quality (~100MB), slower inference
2. pyttsx3
Pros:
- Extremely lightweight and fast
- Cross-platform (Windows SAPI5, macOS NSSpeech, Linux eSpeak)
- Zero external dependencies
- Simple API
- No model downloads required
Cons:
- Robotic voice quality
- Limited voice customization
- Depends on system TTS engines
Installation:
uv pip install pyttsx3
Usage Example:
import pyttsx3
engine = pyttsx3.init()
engine.say("Hello world")
engine.runAndWait()
3. eSpeak-ng
Pros:
- Ultra-fast synthesis
- 100+ language support
- Minimal resource usage
- Highly customizable
- System-level installation
Cons:
- Robotic, mechanical voice quality
- Python wrapper required (not native)
- Less natural prosody
Installation:
# System package
sudo dnf install espeak-ng
# Python wrapper
uv pip install py3-tts # Uses eSpeak backend
Usage Example:
echo "Hello world" | espeak-ng
4. Coqui TTS
Pros:
- State-of-the-art neural voices
- Custom voice training support
- Multiple model architectures
- High-quality output
Cons:
- Very high resource requirements
- Slower inference
- Complex setup
- Larger memory footprint
- Development has slowed
Installation:
uv pip install TTS
Usage Example:
from TTS.api import TTS
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world", file_path="output.wav")
Recommendation: Piper TTS
Final Decision: Piper TTS is the optimal choice for this project.
Justification:
- Quality: Neural voices with Google TTS-level quality
- Offline: Fully local, no internet required (critical requirement)
- Performance: Optimized for local inference, suitable for desktop Linux
- Python Native: First-class Python API, easy integration
- Maintenance: Actively maintained with 2025 updates
- Flexibility: Multiple quality levels allow performance tuning
- Ease of Use: Simple pip installation, straightforward API
Configuration Strategy:
- Default Voice: en_US-lessac-medium (balanced quality/performance)
- GPU Acceleration: Optional CUDA support for faster inference
- Model Caching: Pre-load voice models at startup to reduce latency
- Quality Toggle: Allow clients to request different quality levels
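The model-caching point above could be sketched with a small LRU cache: load each PiperVoice once, keep recent models hot, and warm the default at startup. `MODELS_DIR` and the cache size are assumptions; `PiperVoice.load` follows the piper-tts usage shown earlier in this document.

```python
from functools import lru_cache
from pathlib import Path

MODELS_DIR = Path("/opt/voice-server/models")  # assumed install location

@lru_cache(maxsize=4)
def get_voice(name: str):
    """Load a voice model once; later calls with the same name hit the cache."""
    from piper import PiperVoice  # imported lazily; piper-tts must be installed
    return PiperVoice.load(str(MODELS_DIR / f"{name}.onnx"))

def preload_default(default: str = "en_US-lessac-medium") -> None:
    """Call at startup (e.g. in the FastAPI lifespan) to avoid first-request latency."""
    get_voice(default)
```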
Web Framework Selection
FastAPI: Detailed Analysis
Why FastAPI is Ideal for This Project:
1. Async-First Architecture
FastAPI is built on Starlette (ASGI framework) with native async/await support. This is critical for our use case:
@app.post("/notify")
async def notify(request: NotifyRequest):
# Non-blocking enqueueing
await tts_queue.put(request)
return {"status": "queued"}
# Background worker runs concurrently
async def process_queue():
while True:
request = await tts_queue.get()
await generate_and_play_tts(request)
Benefit: HTTP responses return immediately while TTS processing happens in background.
2. Performance Benchmarks
According to TechEmpower benchmarks (Better Stack):
- FastAPI: 15,000-20,000 requests/second
- Flask: 2,000-3,000 requests/second
Benefit: 5-10x higher throughput for handling concurrent TTS requests.
3. Automatic API Documentation
FastAPI generates interactive OpenAPI (Swagger) documentation automatically:
@app.post("/notify", response_model=NotifyResponse)
async def notify(request: NotifyRequest):
"""
Convert text to speech and play through speakers.
- **message**: Text to convert (1-10000 characters)
- **rate**: Speech rate in WPM (50-400)
- **voice**: Voice model name (optional)
"""
...
Benefit: Instant API documentation at /docs without manual maintenance.
4. Type Safety with Pydantic
Automatic request validation and serialization:
from pydantic import BaseModel, Field, validator
class NotifyRequest(BaseModel):
message: str = Field(..., min_length=1, max_length=10000)
rate: int = Field(170, ge=50, le=400)
voice_enabled: bool = True
@validator('message')
def sanitize_message(cls, v):
# Automatic validation before handler runs
return v.strip()
Benefit: Eliminates manual validation code, reduces bugs.
5. Dependency Injection
Clean separation of concerns:
async def get_tts_engine():
return global_tts_engine
@app.post("/notify")
async def notify(
request: NotifyRequest,
tts_engine: PiperVoice = Depends(get_tts_engine)
):
# tts_engine automatically injected
...
Benefit: Testable, maintainable code with clear dependencies.
6. Background Tasks
Built-in support for fire-and-forget tasks:
from fastapi import BackgroundTasks
@app.post("/notify")
async def notify(request: NotifyRequest, background_tasks: BackgroundTasks):
background_tasks.add_task(generate_tts, request.message)
return {"status": "queued"}
Benefit: Simplified async task management.
Flask Comparison (Why Not Flask)
Flask Limitations for This Project:
- WSGI-Based: Synchronous by default, requires Gunicorn/gevent for async
- Lower Performance: 2,000-3,000 req/s vs FastAPI's 15,000-20,000 req/s
- Manual Documentation: Requires Flask-RESTPlus or manual OpenAPI setup
- Manual Validation: No built-in request validation, requires Flask-Pydantic extension
- Blocking I/O: Natural behavior blocks request threads during TTS processing
When Flask Would Be Better:
- Simple synchronous applications
- Heavy reliance on Flask extensions (Flask-Login, Flask-Admin)
- Team already experienced with Flask
- Need for Jinja2 templating (not needed here)
Verdict: FastAPI is the clear winner for this async-heavy, high-performance use case.
Audio Playback Strategy
sounddevice Implementation Details
Non-Blocking Playback
sounddevice provides simple, non-blocking audio playback out of the box:
import sounddevice as sd
import numpy as np
class AudioPlayer:
"""Simple audio player using sounddevice."""
def __init__(self, sample_rate: int = 22050):
self.sample_rate = sample_rate
self._current_stream = None
def play(self, audio_data: np.ndarray, sample_rate: int = None):
"""
Non-blocking audio playback.
Args:
audio_data: NumPy array of audio samples (float32 or int16)
sample_rate: Sample rate in Hz (defaults to instance default)
"""
rate = sample_rate or self.sample_rate
# Stop any currently playing audio
self.stop()
# Play audio - returns immediately, audio plays in background
sd.play(audio_data, rate)
    def is_playing(self) -> bool:
        """Check if audio is currently playing."""
        try:
            # sd.get_stream() raises RuntimeError if play() was never called
            return sd.get_stream().active
        except RuntimeError:
            return False
def stop(self):
"""Stop current playback."""
sd.stop()
def wait(self):
"""Block until current playback completes."""
sd.wait()
async def wait_async(self):
"""Async wait for playback completion."""
import asyncio
while self.is_playing():
await asyncio.sleep(0.05)
Benefits of sounddevice:
- sd.play() returns immediately - audio plays in a background thread
- Direct NumPy array support - no conversion needed from Piper TTS
- Simple API - one line to play audio
- Built-in sd.wait() for synchronous waiting when needed
Handling Concurrent Requests
Strategy: Queue-based sequential playback with async queue management.
Rationale:
- Playing multiple TTS outputs simultaneously would create audio chaos
- Sequential playback ensures clarity
- Queue allows buffering during high request volume
Implementation:
import asyncio
import sounddevice as sd
import numpy as np
from typing import Dict, Any
class TTSQueue:
def __init__(self, max_size: int = 50):
self.queue = asyncio.Queue(maxsize=max_size)
self.player = AudioPlayer()
self.stats = {"processed": 0, "errors": 0}
async def enqueue(self, request: Dict[str, Any]):
"""Add TTS request to queue."""
try:
await asyncio.wait_for(
self.queue.put(request),
timeout=1.0
)
return self.queue.qsize()
except asyncio.TimeoutError:
raise QueueFullError("TTS queue is full")
async def process_queue(self):
"""Background worker to process TTS queue."""
while True:
request = await self.queue.get()
try:
# Generate TTS audio
audio_data = await self.generate_tts(request)
# Play audio (non-blocking start)
self.player.play(audio_data, sample_rate=22050)
# Wait for playback to complete (async-friendly)
await self.player.wait_async()
self.stats["processed"] += 1
except Exception as e:
logger.error(f"TTS processing error: {e}")
self.stats["errors"] += 1
finally:
self.queue.task_done()
async def generate_tts(self, request: Dict[str, Any]) -> np.ndarray:
"""Generate TTS audio using Piper."""
# Run CPU-intensive TTS in thread pool
loop = asyncio.get_event_loop()
audio_data = await loop.run_in_executor(
None,
self._sync_generate_tts,
request["message"],
request.get("voice", "en_US-lessac-medium")
)
return audio_data
def _sync_generate_tts(self, text: str, voice: str) -> np.ndarray:
"""Synchronous TTS generation (runs in thread pool)."""
# Piper TTS generation code
...
return audio_array
Startup:
from contextlib import asynccontextmanager
@asynccontextmanager
async def lifespan(app: FastAPI):
# Startup: initialize queue and start processor
global tts_queue
tts_queue = TTSQueue(max_size=50)
asyncio.create_task(tts_queue.process_queue())
yield
# Shutdown: stop audio playback
sd.stop()
app = FastAPI(lifespan=lifespan)
Audio Device Error Handling
Common Issues:
- Audio device disconnected (headphones unplugged)
- PulseAudio/ALSA daemon crashed
- No audio devices available
- Device in use by another process
Handling Strategy:
import sounddevice as sd
import numpy as np
import time
import logging
logger = logging.getLogger(__name__)
class RobustAudioPlayer:
"""Audio player with automatic retry and device recovery."""
def __init__(self, retry_attempts: int = 3, sample_rate: int = 22050):
self.retry_attempts = retry_attempts
self.sample_rate = sample_rate
self.verify_audio_devices()
def verify_audio_devices(self):
"""Verify audio devices are available."""
try:
devices = sd.query_devices()
output_devices = [d for d in devices if d['max_output_channels'] > 0]
if not output_devices:
raise AudioDeviceError("No audio output devices found")
logger.info(f"Audio initialized: {len(output_devices)} output devices found")
logger.debug(f"Default output: {sd.query_devices(kind='output')['name']}")
except Exception as e:
logger.error(f"Audio initialization failed: {e}")
raise
def play(self, audio_data: np.ndarray, sample_rate: int = None):
"""Play audio with automatic retry on device errors."""
rate = sample_rate or self.sample_rate
for attempt in range(self.retry_attempts):
try:
sd.play(audio_data, rate)
return
except sd.PortAudioError as e:
logger.warning(f"Audio playback failed (attempt {attempt+1}): {e}")
if attempt < self.retry_attempts - 1:
# Wait and retry - device may become available
sd.stop()
time.sleep(0.5)
self.verify_audio_devices()
else:
raise AudioPlaybackError(f"Failed after {self.retry_attempts} attempts: {e}")
    def is_playing(self) -> bool:
        """Check if audio is currently playing."""
        try:
            # sd.get_stream() raises RuntimeError if play() was never called
            return sd.get_stream().active
        except RuntimeError:
            return False
def stop(self):
"""Stop current playback."""
sd.stop()
async def wait_async(self):
"""Async wait for playback completion."""
import asyncio
while self.is_playing():
await asyncio.sleep(0.05)
Device Query for Diagnostics:
def get_audio_diagnostics() -> dict:
"""Get audio system diagnostics for health check."""
try:
devices = sd.query_devices()
default_output = sd.query_devices(kind='output')
return {
"status": "available",
"device_count": len(devices),
"default_output": default_output['name'],
"sample_rate": default_output['default_samplerate']
}
except Exception as e:
return {
"status": "unavailable",
"error": str(e)
}
Error Handling Strategy
Error Categories and Handling
1. Request Validation Errors
Scenarios:
- Missing required fields
- Invalid parameter types
- Out-of-range values
- Malformed JSON
Handling:
from datetime import datetime
from fastapi import HTTPException, status
from fastapi.responses import JSONResponse
from pydantic import BaseModel, Field, ValidationError
class NotifyRequest(BaseModel):
message: str = Field(..., min_length=1, max_length=10000)
rate: int = Field(170, ge=50, le=400)
voice: str = Field("en_US-lessac-medium", regex=r"^[\w-]+$")
@app.exception_handler(ValidationError)
async def validation_exception_handler(request, exc):
return JSONResponse(
status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
content={
"error": "validation_error",
"detail": str(exc),
"timestamp": datetime.utcnow().isoformat()
}
)
HTTP Status: 422 Unprocessable Entity
2. Queue Full Errors
Scenario: Too many concurrent requests, queue is at capacity
Handling:
```python
class QueueFullError(Exception):
    pass

@app.post("/notify")
async def notify(request: NotifyRequest):
    try:
        position = await tts_queue.enqueue(request)
        return {
            "status": "queued",
            "queue_position": position,
        }
    except QueueFullError:
        raise HTTPException(
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
            detail={
                "error": "queue_full",
                "message": "TTS queue is full, please retry later",
                "queue_size": tts_queue.max_size,
                "retry_after": 5,  # seconds
            },
        )
```
HTTP Status: 503 Service Unavailable
Client Action: Implement exponential backoff retry
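A client can honor the `retry_after` hint with exponential backoff using only the standard library. This sketch (function names are illustrative, not part of the API) retries solely on 503 responses:

```python
import json
import time
import urllib.error
import urllib.request

def backoff_delay(attempt: int, base: float = 5.0, cap: float = 60.0) -> float:
    """Exponential backoff: base * 2^attempt seconds, capped at `cap`."""
    return min(base * (2 ** attempt), cap)

def notify_with_retry(url: str, message: str, max_attempts: int = 4) -> bool:
    """POST to /notify, retrying with backoff only on 503 (queue full)."""
    body = json.dumps({"message": message}).encode()
    for attempt in range(max_attempts):
        req = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"}
        )
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return 200 <= resp.status < 300
        except urllib.error.HTTPError as e:
            if e.code != 503:
                raise  # non-retryable error, surface it
            time.sleep(backoff_delay(attempt))
    return False
```

The 5-second base matches the `retry_after` value returned by the server above.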
3. TTS Engine Errors
Scenarios:
- Voice model not found
- ONNX Runtime errors
- Memory allocation failures
- Corrupted model files
Handling:
```python
class TTSEngineError(Exception):
    pass

async def generate_tts(text: str, voice: str) -> np.ndarray:
    try:
        # Attempt TTS generation
        audio = piper_voice.synthesize(text)
        return audio
    except FileNotFoundError:
        raise TTSEngineError(f"Voice model '{voice}' not found")
    except MemoryError:
        raise TTSEngineError("Insufficient memory for TTS generation")
    except Exception as e:
        logger.error(f"TTS generation failed: {e}", exc_info=True)
        raise TTSEngineError(f"TTS generation failed: {str(e)}")

@app.exception_handler(TTSEngineError)
async def tts_engine_exception_handler(request, exc):
    return JSONResponse(
        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
        content={
            "error": "tts_engine_error",
            "detail": str(exc),
            "timestamp": datetime.utcnow().isoformat(),
        },
    )
```
HTTP Status: 500 Internal Server Error
4. Audio Playback Errors
Scenarios:
- No audio devices available
- Audio device disconnected
- ALSA/PulseAudio errors
- Permission denied
Handling:
```python
class AudioPlaybackError(Exception):
    pass

async def play_audio(audio_data: np.ndarray):
    try:
        player.play(audio_data, sample_rate=22050)  # AudioPlayer.play() retries internally
    except AudioDeviceError as e:  # custom exception raised when no output device is found
        logger.error(f"Audio device error: {e}")
        raise AudioPlaybackError("No audio output devices available")
    except OSError as e:
        logger.error(f"Audio system error: {e}")
        raise AudioPlaybackError(f"Audio playback failed: {str(e)}")

# In the queue processor:
try:
    await play_audio(audio_data)
except AudioPlaybackError as e:
    logger.error(f"Playback error: {e}")
    # Continue processing queue, don't crash server
    stats["errors"] += 1
```
Action: Log error, continue processing queue (don't crash server)
5. System Resource Errors
Scenarios:
- Out of memory
- CPU overload
- Disk space exhausted
Handling:
```python
import psutil

async def check_system_resources():
    """Monitor system resources."""
    memory = psutil.virtual_memory()
    if memory.percent > 90:
        logger.warning(f"High memory usage: {memory.percent}%")
    # Note: interval=1 blocks for a second; prefer interval=None in the request path
    cpu = psutil.cpu_percent(interval=1)
    if cpu > 90:
        logger.warning(f"High CPU usage: {cpu}%")

@app.middleware("http")
async def resource_monitoring_middleware(request, call_next):
    """Monitor resources on each request."""
    await check_system_resources()
    response = await call_next(request)
    return response
```
Action: Log warnings, implement queue size limits to prevent resource exhaustion
Logging Strategy
Log Levels:
```python
import logging
from logging.handlers import RotatingFileHandler

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        RotatingFileHandler(
            'voice-server.log',
            maxBytes=10 * 1024 * 1024,  # 10MB
            backupCount=5
        ),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

# Log levels usage:
logger.debug("TTS parameters: rate=%d, voice=%s", rate, voice)   # DEBUG
logger.info("Request queued: position=%d", queue_position)       # INFO
logger.warning("Queue nearly full: %d/%d", current, max_size)    # WARNING
logger.error("TTS generation failed: %s", error, exc_info=True)  # ERROR
logger.critical("Audio system unavailable, shutting down")       # CRITICAL
```
Structured Logging:
```python
import json
from datetime import datetime

def log_request(request_id: str, message: str, status: str):
    """Structured JSON logging."""
    log_entry = {
        "timestamp": datetime.utcnow().isoformat(),
        "request_id": request_id,
        "message_length": len(message),
        "status": status,
        "event_type": "tts_request",
    }
    logger.info(json.dumps(log_entry))
```
Health Check Implementation
Comprehensive Health Checks:
```python
@app.get("/health")
async def health_check():
    """Detailed health status."""
    health_status = {
        "status": "healthy",
        "timestamp": datetime.utcnow().isoformat(),
        "checks": {}
    }

    # Check TTS engine
    try:
        tts_engine.test_synthesis("test")
        health_status["checks"]["tts_engine"] = "healthy"
    except Exception as e:
        health_status["checks"]["tts_engine"] = f"unhealthy: {str(e)}"
        health_status["status"] = "unhealthy"

    # Check audio output
    try:
        audio_player.test_output()
        health_status["checks"]["audio_output"] = "healthy"
    except Exception as e:
        health_status["checks"]["audio_output"] = f"unhealthy: {str(e)}"
        health_status["status"] = "unhealthy"

    # Check queue status
    queue_size = tts_queue.qsize()
    health_status["checks"]["queue"] = {
        "size": queue_size,
        "capacity": tts_queue.max_size,
        "utilization": f"{(queue_size / tts_queue.max_size) * 100:.1f}%"
    }

    # Check system resources
    health_status["checks"]["system"] = {
        "memory_percent": psutil.virtual_memory().percent,
        "cpu_percent": psutil.cpu_percent(interval=0.1)
    }

    status_code = 200 if health_status["status"] == "healthy" else 503
    return JSONResponse(status_code=status_code, content=health_status)
```
Implementation Checklist
Phase 1: Core Infrastructure (Days 1-2)
1.1 Project Setup
- Initialize project directory `/mnt/NV2/Development/voice-server`
- Create Python virtual environment using `uv`
- Install core dependencies: `uv pip install fastapi "uvicorn[standard]" piper-tts sounddevice numpy pydantic python-dotenv`
- Create `requirements.txt` with pinned versions
- Create `.env.example` for configuration template
- Initialize git repository
- Create `.gitignore` (Python, IDEs, .env, voice models)
1.2 FastAPI Application Structure
- Create `app/main.py` with FastAPI app initialization
- Implement `/notify` endpoint skeleton
- Implement `/health` endpoint skeleton
- Implement `/voices` endpoint skeleton
- Configure CORS middleware
- Configure JSON logging middleware
- Create Pydantic models for request/response schemas
- Test basic server startup: `uvicorn app.main:app --reload`
1.3 Configuration Management
- Create `app/config.py` for configuration loading
- Implement environment variable loading
- Define configuration schema (host, port, queue size, etc.)
- Implement configuration validation at startup
- Create CLI argument parsing for overrides
- Document all configuration options in README
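A minimal sketch of what `app/config.py` could look like; the `VOICE_SERVER_*` environment variable names are illustrative assumptions, not part of the spec:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    host: str = "127.0.0.1"
    port: int = 8888
    queue_max_size: int = 50
    default_voice: str = "en_US-lessac-medium"

    @classmethod
    def from_env(cls, env=None) -> "Settings":
        """Build settings from environment variables, falling back to defaults."""
        env = os.environ if env is None else env
        return cls(
            host=env.get("VOICE_SERVER_HOST", cls.host),
            port=int(env.get("VOICE_SERVER_PORT", cls.port)),
            queue_max_size=int(env.get("VOICE_SERVER_QUEUE_SIZE", cls.queue_max_size)),
            default_voice=env.get("VOICE_SERVER_VOICE", cls.default_voice),
        )
```

A frozen dataclass keeps configuration immutable after startup; CLI overrides can be layered on top of `Settings.from_env()`.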
Phase 2: TTS Integration (Days 2-3)
2.1 Piper TTS Setup
- Create `app/tts_engine.py` module
- Implement `PiperTTSEngine` class
- Download default voice model (`en_US-lessac-medium`)
- Implement voice model loading with caching
- Implement text-to-audio synthesis method
- Add support for configurable speech rate
- Test TTS generation with sample text
- Measure TTS latency for various text lengths
2.2 Voice Model Management
- Create `models/` directory for voice model storage
- Implement voice model discovery (scan `models/` directory)
- Implement lazy loading of voice models (load on first use)
- Create model metadata cache (name, language, quality, size)
- Implement `/voices` endpoint to list available models
- Add error handling for missing/corrupted models
- Document voice model installation process
2.3 TTS Parameter Support
- Implement speech rate adjustment (50-400 WPM)
- Test rate adjustment across range
- Add voice selection via request parameter
- Implement voice validation (reject unknown voices)
- Add `voice_enabled` flag for debugging/testing
- Create comprehensive TTS unit tests
Phase 3: Audio Playback (Day 3)
3.1 sounddevice Integration
- Create `app/audio_player.py` module
- Implement `AudioPlayer` class with non-blocking `sd.play()`
- Verify sounddevice detects audio devices at startup
- Implement non-blocking playback method
- Implement async `wait_async()` method for queue processing
- Test audio playback with sample NumPy array
- Verify non-blocking behavior with concurrent requests
3.2 Audio Error Handling
- Implement audio device detection
- Add retry logic for device failures
- Handle device disconnection gracefully
- Test with headphones unplugged during playback
- Implement fallback to different audio devices
- Add detailed audio error logging
- Create audio system health check
3.3 Playback Testing
- Test simultaneous playback (should queue)
- Test rapid successive requests
- Measure audio latency (request → sound output)
- Test with various audio formats
- Verify memory cleanup after playback
- Test long-running playback (10+ minutes)
Phase 4: Queue Management (Day 4)
4.1 Async Queue Implementation
- Create `app/queue_manager.py` module
- Implement `TTSQueue` class with `asyncio.Queue`
- Set configurable max queue size (default: 50)
- Implement queue full detection
- Create background queue processor task
- Implement graceful queue shutdown
- Add queue metrics (size, processed, errors)
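The queue described above can be sketched with `asyncio.Queue`; `QueueFullError` mirrors the exception defined in the error-handling strategy, and `worker` is the background task started at app startup. This is a sketch, not the final implementation:

```python
import asyncio

class QueueFullError(Exception):
    """Raised when the bounded queue cannot accept more work (maps to HTTP 503)."""

class TTSQueue:
    def __init__(self, max_size: int = 50):
        self.max_size = max_size
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=max_size)
        self.processed = 0
        self.errors = 0

    async def enqueue(self, item) -> int:
        """Add an item without blocking; return its 1-based queue position."""
        try:
            self._queue.put_nowait(item)
        except asyncio.QueueFull:
            raise QueueFullError("TTS queue is full") from None
        return self._queue.qsize()

    def qsize(self) -> int:
        return self._queue.qsize()

    async def worker(self, process):
        """Background task: process items one at a time (no overlapping audio)."""
        while True:
            item = await self._queue.get()
            try:
                await process(item)
                self.processed += 1
            except Exception:
                self.errors += 1
            finally:
                self._queue.task_done()
```

Using `put_nowait` keeps the `/notify` handler from blocking when the queue is full, so the 503 path stays fast.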
4.2 Request Processing Pipeline
- Implement request enqueueing in `/notify` endpoint
- Create background worker to process queue
- Integrate TTS generation in worker
- Integrate audio playback in worker
- Implement sequential playback (one at a time)
- Add request timeout handling (max 60s per request)
- Test queue with 100+ concurrent requests
4.3 Queue Monitoring
- Add queue size to `/health` endpoint
- Implement queue utilization metrics
- Add logging for queue events (enqueue, process, error)
- Create queue performance benchmarks
- Test queue overflow scenarios
- Document queue behavior and limits
Phase 5: Error Handling (Day 5)
5.1 Exception Handlers
- Implement custom exception classes
- Create `QueueFullError` exception handler
- Create `TTSEngineError` exception handler
- Create `AudioPlaybackError` exception handler
- Create `ValidationError` exception handler
- Implement generic exception handler (catch-all)
- Test all error scenarios
5.2 Logging Infrastructure
- Configure structured JSON logging
- Implement rotating file handler (10MB, 5 backups)
- Add request ID tracking across logs
- Implement log levels appropriately (DEBUG, INFO, WARNING, ERROR)
- Create log aggregation for queue processor
- Test log rotation
- Document log file locations and format
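Request ID tracking can be implemented with a `ContextVar` plus a logging filter, which survives `await` points in async handlers; the names here are illustrative:

```python
import logging
import uuid
from contextvars import ContextVar

# Holds the current request's ID; set once per request (e.g. in middleware).
request_id_var: ContextVar[str] = ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Injects the current request ID into every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True

def new_request_id() -> str:
    """Short random ID, unique enough for log correlation."""
    return uuid.uuid4().hex[:12]
```

Register the filter on each handler (`handler.addFilter(RequestIdFilter())`) and add `%(request_id)s` to the log format string; middleware sets `request_id_var` at the start of each request.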
5.3 Health Monitoring
- Implement comprehensive `/health` endpoint
- Add TTS engine health check
- Add audio system health check
- Add queue status to health check
- Add system resource metrics (CPU, memory)
- Test health endpoint under load
- Create health check monitoring script
Phase 6: Testing (Days 5-6)
6.1 Unit Tests
- Create `tests/` directory structure
- Install pytest: `uv pip install pytest pytest-asyncio`
- Write tests for Pydantic models
- Write tests for TTS engine
- Write tests for audio player
- Write tests for queue manager
- Write tests for configuration loading
- Achieve 80%+ code coverage
6.2 Integration Tests
- Write tests for `/notify` endpoint
- Write tests for `/health` endpoint
- Write tests for `/voices` endpoint
- Test end-to-end request flow
- Test concurrent request handling
- Test queue overflow scenarios
- Test error scenarios (TTS failure, audio failure)
6.3 Performance Tests
- Create load testing script with `locust` or `wrk`
- Test 100 concurrent requests
- Measure request latency (p50, p95, p99)
- Measure TTS generation time
- Measure audio playback latency
- Measure memory usage under load
- Document performance characteristics
6.4 System Tests
- Test on target Linux environment (Nobara/Fedora 42)
- Test with different audio devices
- Test with PulseAudio and ALSA
- Test headphone disconnect/reconnect
- Test system resource exhaustion scenarios
- Test server restart recovery
- Test long-running stability (24+ hours)
Phase 7: Documentation & Deployment (Days 6-7)
7.1 Documentation
- Create comprehensive README.md:
- Project overview
- Installation instructions
- Configuration options
- Usage examples
- API documentation
- Troubleshooting guide
- Create CONTRIBUTING.md (if open source)
- Create CHANGELOG.md
- Document voice model installation
- Create architecture diagrams
- Add inline code documentation
- Create example client scripts (curl, Python)
7.2 Deployment Preparation
- Create systemd service file (`voice-server.service`)
- Test systemd service installation
- Test automatic restart on failure
- Create deployment script (`deploy.sh`)
- Document deployment process
- Create backup/restore procedures
- Test upgrade procedure
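A user-level systemd unit is a reasonable starting point, since audio playback needs access to the user's PipeWire/PulseAudio session; every path below is illustrative and assumes the `uv`-created virtualenv from Phase 1:

```ini
[Unit]
Description=Local Voice Server (TTS)
After=network.target sound.target

[Service]
Type=simple
WorkingDirectory=/mnt/NV2/Development/voice-server
ExecStart=/mnt/NV2/Development/voice-server/.venv/bin/uvicorn app.main:app --host 127.0.0.1 --port 8888
Restart=on-failure
RestartSec=5

[Install]
WantedBy=default.target
```

Install with `systemctl --user daemon-reload && systemctl --user enable --now voice-server`; a system-level unit would additionally need access to the user's audio session.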
7.3 Production Hardening
- Enable production logging (disable debug logs)
- Configure log rotation
- Set up monitoring (optional: Prometheus, Grafana)
- Implement graceful shutdown (SIGTERM handling)
- Test crash recovery
- Implement rate limiting (optional)
- Security audit (input sanitization, resource limits)
- Performance tuning (queue size, worker count)
Testing Strategy
Unit Testing
Framework: pytest with pytest-asyncio
Test Coverage Requirements:
- Minimum 80% code coverage
- 100% coverage for critical paths (TTS, audio playback)
- All error handlers must have tests
Test Structure:
```
tests/
├── __init__.py
├── conftest.py              # Shared fixtures
├── unit/
│   ├── test_config.py       # Configuration loading tests
│   ├── test_models.py       # Pydantic model tests
│   ├── test_tts_engine.py   # TTS engine tests
│   ├── test_audio_player.py # Audio player tests
│   └── test_queue.py        # Queue manager tests
├── integration/
│   ├── test_api.py          # API endpoint tests
│   ├── test_end_to_end.py   # Full request flow tests
│   └── test_errors.py       # Error scenario tests
└── performance/
    └── test_load.py         # Load testing
```
Sample Unit Test:
```python
# tests/unit/test_tts_engine.py
import numpy as np
import pytest

from app.tts_engine import PiperTTSEngine

@pytest.fixture
def tts_engine():
    """Create TTS engine instance."""
    return PiperTTSEngine(model_dir="models/")

def test_tts_engine_initialization(tts_engine):
    """Test TTS engine initializes successfully."""
    assert tts_engine is not None
    assert tts_engine.default_voice == "en_US-lessac-medium"

def test_text_to_audio_conversion(tts_engine):
    """Test converting text to audio."""
    audio = tts_engine.synthesize("Hello world")
    assert audio is not None
    assert len(audio) > 0
    assert audio.dtype == np.float32

def test_invalid_voice_raises_error(tts_engine):
    """Test that invalid voice raises appropriate error."""
    with pytest.raises(ValueError, match="Voice model .* not found"):
        tts_engine.synthesize("Hello", voice="invalid_voice")

@pytest.mark.asyncio
async def test_async_synthesis(tts_engine):
    """Test async TTS synthesis."""
    audio = await tts_engine.synthesize_async("Hello world")
    assert audio is not None
```
Sample Integration Test:
```python
# tests/integration/test_api.py
import pytest
from fastapi.testclient import TestClient

from app.main import app

@pytest.fixture
def client():
    """Create test client."""
    return TestClient(app)

def test_notify_endpoint_success(client):
    """Test successful /notify request."""
    response = client.post(
        "/notify",
        json={"message": "Test message", "rate": 180}
    )
    assert response.status_code == 202
    data = response.json()
    assert data["status"] == "queued"
    assert data["message_length"] == 12

def test_notify_endpoint_validation_error(client):
    """Test /notify with invalid parameters."""
    response = client.post(
        "/notify",
        json={"message": "", "rate": 1000}  # Empty message, invalid rate
    )
    assert response.status_code == 422

def test_health_endpoint(client):
    """Test /health endpoint."""
    response = client.get("/health")
    assert response.status_code == 200
    data = response.json()
    assert "status" in data
    assert "queue_size" in data
```
Load Testing
Tool: wrk or locust
Sample wrk Test:
```bash
# Install wrk
sudo dnf install wrk

# Run load test: 100 concurrent connections, 30 seconds
wrk -t4 -c100 -d30s -s post.lua http://localhost:8888/notify

# post.lua script:
# wrk.method = "POST"
# wrk.headers["Content-Type"] = "application/json"
# wrk.body = '{"message": "Load test message"}'
```
Sample locust Test:
```python
# locustfile.py
from locust import HttpUser, task, between

class VoiceServerUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def notify(self):
        self.client.post("/notify", json={
            "message": "This is a load test message",
            "rate": 180
        })

    @task(5)
    def health_check(self):
        self.client.get("/health")

# Run: locust -f locustfile.py --host=http://localhost:8888
```
Performance Benchmarks:
| Metric | Target | Acceptable | Unacceptable |
|---|---|---|---|
| API Response Time (p95) | < 50ms | < 100ms | > 200ms |
| TTS Generation (500 chars) | < 2s | < 5s | > 10s |
| Requests/Second | > 50 | > 20 | < 10 |
| Memory Usage (idle) | < 200MB | < 500MB | > 1GB |
| Memory Usage (load) | < 500MB | < 1GB | > 2GB |
| Queue Processing Rate | > 10/s | > 5/s | < 2/s |
Manual Testing Checklist
Functional Testing:
- Send POST request with valid message → Hear audio playback
- Send request with long text (5000 chars) → Successful playback
- Send request with special characters → Successful sanitization
- Send request with invalid voice → Receive 422 error
- Send request with rate=50 → Slow speech playback
- Send request with rate=400 → Fast speech playback
- Send 10 concurrent requests → All play sequentially
- Fill queue to capacity → Receive 503 error
- Check /health endpoint → Receive status information
- Check /voices endpoint → See available voice models
- Check /docs endpoint → See Swagger documentation
Error Scenario Testing:
- Unplug headphones during playback → Graceful error handling
- Kill PulseAudio daemon → Audio error logged, server continues
- Send malformed JSON → Receive 400 error
- Send empty message → Receive 422 error
- Send 11,000 character message → Receive 413 error
- Restart server during playback → Queue cleared, server restarts
System Testing:
- Run server for 24 hours → No memory leaks
- Send 10,000 requests → All processed successfully
- Monitor CPU usage during load → < 50% average
- Monitor memory usage during load → < 1GB
- Test on Fedora 42 → Successful operation
- Test with ALSA (without PulseAudio) → Successful operation
Future Considerations
Optional Features (Post-v1.0)
1. Advanced Voice Control
- Pitch adjustment: Allow clients to specify pitch modification
- Volume control: Per-request volume settings
- Emotion/tone control: Happy, sad, angry voice modulation (if TTS engine supports)
- Voice cloning: Custom voice model training (Coqui TTS integration)
Implementation Complexity: Medium
User Value: High for accessibility and personalization
2. Audio Format Options
- Output format selection: Support WAV, MP3, OGG output
- Sample rate options: Allow 16kHz, 22kHz, 44.1kHz selection
- Compression levels: Configurable audio quality vs file size
Implementation Complexity: Low
User Value: Medium (mostly for file storage use cases)
3. Streaming Audio
- Real-time streaming: Stream audio as it's generated (WebSocket or SSE)
- Chunked TTS: Generate and stream long texts in chunks
- Lower latency: Start playback before full text is synthesized
Implementation Complexity: High
User Value: High for very long texts
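The chunking step can be prototyped independently of the TTS engine. This sketch (the helper name `chunk_text` is illustrative) splits on sentence boundaries so each chunk can be synthesized and played while the next is generated:

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text on sentence boundaries into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk if appending would exceed the budget
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip() if current else sentence
    if current:
        chunks.append(current)
    return chunks
```

Very long sentences can still exceed `max_chars`; a production version would add a hard split on commas or word boundaries as a fallback.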
4. SSML Support
- Prosody control: Fine-grained control over speech characteristics
- Break insertion: Explicit pauses and timing control
- Phoneme specification: Correct pronunciation for unusual words
- Multi-voice support: Different voices within single text
Example:
```xml
<speak>
  Hello, <break time="500ms"/> this is <emphasis>important</emphasis>.
  <voice name="en_US-libritts">A different voice.</voice>
</speak>
```
Implementation Complexity: Medium
User Value: High for advanced use cases
5. Caching Layer
- TTS result caching: Cache frequently requested texts
- Cache invalidation: LRU eviction policy
- Cache persistence: Store cache across restarts
- Cache statistics: Hit rate monitoring
Implementation Complexity: Low
User Value: High for repeated texts (notifications, alerts)
Sample Implementation:
```python
import hashlib

class TTSCache:
    def __init__(self, max_size: int = 1000):
        self.cache = {}
        self.max_size = max_size

    def get_cache_key(self, text: str, voice: str, rate: int) -> str:
        """Generate cache key from TTS parameters."""
        content = f"{text}|{voice}|{rate}"
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, text: str, voice: str, rate: int):
        """Retrieve cached audio."""
        key = self.get_cache_key(text, voice, rate)
        return self.cache.get(key)

    def put(self, text: str, voice: str, rate: int, audio_data):
        """Store audio in cache with eviction."""
        if len(self.cache) >= self.max_size:
            # Evict oldest entry (simple FIFO; use OrderedDict for true LRU)
            self.cache.pop(next(iter(self.cache)))
        key = self.get_cache_key(text, voice, rate)
        self.cache[key] = audio_data
```
6. Multi-Language Support
- Automatic language detection: Detect input language
- Language-specific voice selection: Match voice to detected language
- Mixed-language support: Handle multilingual texts
Implementation Complexity: Medium
User Value: High for international users
7. Audio Effects
- Reverb: Add spatial audio effects
- Echo: Add echo effects
- Speed adjustment: Time-stretch without pitch change
- Normalization: Automatic volume leveling
Implementation Complexity: Medium (requires audio processing library like pydub or librosa)
User Value: Medium (aesthetic enhancement)
8. Queue Priority System
- Priority levels: High, normal, low priority requests
- Priority queues: Separate queues for different priorities
- Preemption: Allow high-priority requests to interrupt low-priority
Implementation Complexity: Medium
User Value: Medium for multi-tenant scenarios
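A minimal sketch of the priority levels, assuming the asyncio-based queue design used elsewhere in this document: `asyncio.PriorityQueue` orders entries by tuple comparison, and a monotonic counter preserves FIFO order within each priority level (preemption of already-playing audio is out of scope here):

```python
import asyncio
import itertools

class PriorityTTSQueue:
    """Priority queue: lower number = higher priority."""
    HIGH, NORMAL, LOW = 0, 1, 2

    def __init__(self, max_size: int = 50):
        self._queue: asyncio.PriorityQueue = asyncio.PriorityQueue(maxsize=max_size)
        self._counter = itertools.count()  # tiebreaker keeps FIFO within a level

    def put_nowait(self, item, priority: int = NORMAL):
        self._queue.put_nowait((priority, next(self._counter), item))

    async def get(self):
        _, _, item = await self._queue.get()
        return item
```

The counter also guarantees that the queued items themselves are never compared, so arbitrary (non-orderable) request objects can be enqueued.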
9. Webhook Notifications
- Completion webhooks: Notify external service when TTS completes
- Error webhooks: Notify on TTS failures
- Webhook retry logic: Handle webhook delivery failures
Example Request:
```json
{
  "message": "Hello world",
  "webhook_url": "https://example.com/tts-complete"
}
```
Implementation Complexity: Low
User Value: High for integration scenarios
10. Authentication & Authorization
- API key authentication: Secure endpoint access
- Rate limiting: Per-user request limits
- Usage quotas: Daily/monthly request quotas
- Multi-tenant support: Isolated queues per user
Implementation Complexity: High
User Value: High for shared/production deployments
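Of the features above, per-user rate limiting is the most self-contained; a token-bucket sketch (the class name is illustrative and not tied to any library):

```python
import time

class TokenBucket:
    """Per-client rate limiter: `capacity` burst tokens, refilled at `refill_per_sec`."""
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; False means the client is over its limit."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The server would keep one bucket per API key (e.g. in a dict) and return HTTP 429 when `allow()` returns False.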
11. Web Interface
- Simple web UI: Browser-based TTS interface
- Queue visualization: Real-time queue status display
- Voice model management: Upload/download voice models via UI
- Settings configuration: Web-based configuration editor
Implementation Complexity: Medium
User Value: High for non-technical users
12. Docker Deployment
- Dockerfile: Container image for easy deployment
- Docker Compose: Multi-container setup with monitoring
- Volume management: Persistent voice model storage
- Health check integration: Container health monitoring
Sample Dockerfile:
```dockerfile
FROM python:3.11-slim

# Install system dependencies (PortAudio for sounddevice)
RUN apt-get update && apt-get install -y \
    libportaudio2 \
    portaudio19-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY app/ ./app/

# Download default voice model
RUN python -c "from piper import PiperVoice; PiperVoice.download('en_US-lessac-medium')"

EXPOSE 8888
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8888"]
```
Implementation Complexity: Low
User Value: High for deployment consistency
13. Metrics & Monitoring
- Prometheus metrics: Request count, latency, queue size
- Grafana dashboards: Visual monitoring
- Alerting: Notify on errors, high queue size, etc.
- Performance profiling: Identify bottlenecks
Sample Metrics:
```python
from prometheus_client import Counter, Histogram, Gauge

request_counter = Counter('tts_requests_total', 'Total TTS requests')
latency_histogram = Histogram('tts_latency_seconds', 'TTS latency')
queue_size_gauge = Gauge('tts_queue_size', 'Current queue size')

@app.post("/notify")
async def notify(request: NotifyRequest):
    request_counter.inc()
    with latency_histogram.time():
        # Process request
        ...
    queue_size_gauge.set(tts_queue.qsize())
```
Implementation Complexity: Medium
User Value: High for production deployments
Scalability Considerations
Horizontal Scaling:
- Use Redis for shared queue across multiple server instances
- Implement distributed locking for audio device access
- Load balance requests across multiple servers
Vertical Scaling:
- Increase queue size for higher throughput
- Use GPU acceleration for TTS (CUDA support in Piper)
- Optimize voice model loading (keep models in memory)
Architecture Evolution:
- Separate TTS generation and audio playback into microservices
- Use message queue (RabbitMQ, Kafka) for request distribution
- Implement worker pool for parallel TTS generation
Appendix: References
Technical Documentation
- FastAPI Official Documentation
- Piper TTS GitHub Repository
- PyAudio Documentation
- Uvicorn Documentation
Research & Comparisons
- FastAPI vs Flask Performance Comparison - Strapi
- Flask vs FastAPI - Better Stack
- Python TTS Engines Comparison - Smallest AI
- TTS Converters for Raspberry Pi - Circuit Digest
- Piper TTS Tutorial - RMauro Dev
- Python Audio Playback - simpleaudio Docs
Tools & Libraries
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-12-18 | Atlas | Initial PRD creation |
Document Status: ✅ Complete - Ready for Implementation
Next Steps:
- Review PRD with stakeholders
- Approve technical stack decisions
- Begin Phase 1 implementation
- Set up project tracking (GitHub Issues, Jira, etc.)
- Assign development resources
Questions or Feedback: Contact Atlas at [atlas@manticorum.com]