# Product Requirements Document: Local Voice Server
**Version:** 1.0
**Date:** 2025-12-18
**Author:** Atlas (Principal Software Architect)
**Project:** Local HTTP Voice Server for Text-to-Speech
---
## Table of Contents
1. [Executive Summary](#executive-summary)
2. [Goals and Non-Goals](#goals-and-non-goals)
3. [Technical Requirements](#technical-requirements)
4. [System Architecture](#system-architecture)
5. [API Specification](#api-specification)
6. [TTS Engine Analysis](#tts-engine-analysis)
7. [Web Framework Selection](#web-framework-selection)
8. [Audio Playback Strategy](#audio-playback-strategy)
9. [Error Handling Strategy](#error-handling-strategy)
10. [Implementation Checklist](#implementation-checklist)
11. [Testing Strategy](#testing-strategy)
12. [Future Considerations](#future-considerations)
---
## Executive Summary
### Project Overview
This project delivers a local HTTP service that accepts POST requests containing text strings and converts them to speech through the computer's speakers. The service will run locally on Linux (Nobara/Fedora 42), providing fast, offline text-to-speech capabilities without requiring external API calls or internet connectivity.
### Success Metrics
- **Response Time:** TTS conversion and playback initiation within 200ms for short texts (< 100 characters)
- **Reliability:** 99.9% successful request handling under normal operating conditions
- **Concurrency:** Support for at least 5 concurrent TTS requests with proper queuing
- **Audio Quality:** Clear, intelligible speech output comparable to Google TTS quality
- **Startup Time:** Server ready to accept requests within 2 seconds of launch
### Technical Stack
| Component | Technology | Justification |
|-----------|-----------|---------------|
| Web Framework | FastAPI | Async support, high performance (15k-20k req/s), automatic API documentation |
| TTS Engine | Piper TTS | Neural voice quality, offline, optimized for local inference, ONNX-based |
| Audio Playback | sounddevice | Cross-platform, Pythonic API, excellent NumPy integration, non-blocking playback |
| Package Manager | uv | Fast Python package management (user preference) |
| ASGI Server | Uvicorn | High-performance ASGI server, native FastAPI integration |
| Async Runtime | asyncio | Built-in Python async support for concurrent request handling |
### Timeline Estimate
- **Phase 1 - Core Implementation:** 2-3 days (basic HTTP server + TTS integration)
- **Phase 2 - Error Handling & Testing:** 1-2 days (comprehensive error handling, unit tests)
- **Phase 3 - Concurrency & Queue Management:** 1-2 days (async queue, concurrent playback)
- **Total Estimated Time:** 4-7 days for production-ready v1.0
### Resource Requirements
- **Development:** 1 full-stack Python developer with async programming experience
- **Testing:** Access to Linux environment (Nobara/Fedora 42) with audio hardware
- **Infrastructure:** Local development machine with 2+ CPU cores, 4GB+ RAM
---
## Goals and Non-Goals
### Goals
**Primary Goals:**
1. Create a local HTTP service that accepts text via POST requests
2. Convert text to speech using high-quality offline TTS
3. Play audio through system speakers with minimal latency
4. Support concurrent requests with proper queue management
5. Provide comprehensive error handling and logging
6. Maintain zero external dependencies (fully offline capable)
**Secondary Goals:**
1. Automatic API documentation via FastAPI's built-in OpenAPI support
2. Configurable TTS parameters (voice, speed, volume) via request parameters
3. Health check endpoint for service monitoring
4. Graceful handling of long-running text conversions
5. Support for multiple voice models
### Non-Goals
**Explicitly Out of Scope:**
1. Cloud-based or external API integration
2. Speech-to-text (STT) capabilities
3. Audio file storage or retrieval
4. User authentication or authorization
5. Rate limiting or quota management
6. Multi-language UI or web interface
7. Real-time streaming audio synthesis
8. Mobile app integration
9. Persistent audio history or logging
10. Advanced audio effects (reverb, pitch shifting, etc.)
---
## Technical Requirements
### Functional Requirements
#### FR1: HTTP Server
- **FR1.1:** Server SHALL listen on configurable host and port (default: `0.0.0.0:8888`)
- **FR1.2:** Server SHALL accept POST requests to `/notify` endpoint
- **FR1.3:** Server SHALL accept JSON payload with `message` field containing text
- **FR1.4:** Server SHALL return HTTP 202 (Accepted) confirming the request was queued
- **FR1.5:** Server SHALL support CORS for local development
#### FR2: Text-to-Speech Conversion
- **FR2.1:** System SHALL convert text strings to audio using Piper TTS
- **FR2.2:** System SHALL support configurable voice models via request parameters
- **FR2.3:** System SHALL support adjustable speech rate (50-400 words per minute)
- **FR2.4:** System SHALL handle text inputs from 1 to 10,000 characters
- **FR2.5:** System SHALL use default voice if not specified in request
#### FR3: Audio Playback
- **FR3.1:** System SHALL play generated audio through default system audio output
- **FR3.2:** System SHALL support non-blocking audio playback
- **FR3.3:** System SHALL queue concurrent requests in FIFO order
- **FR3.4:** System SHALL allow configurable maximum queue size (default: 50)
- **FR3.5:** System SHALL provide feedback when queue is full
#### FR4: Configuration
- **FR4.1:** System SHALL support configuration via environment variables
- **FR4.2:** System SHALL support configuration via command-line arguments
- **FR4.3:** System SHALL provide sensible defaults for all configuration values
- **FR4.4:** System SHALL validate configuration at startup
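The environment-variable configuration described in FR4 could be sketched as follows. This is a minimal illustration, not the spec's implementation; the variable names (`VOICE_SERVER_HOST`, etc.) are hypothetical:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    """Server configuration resolved from environment variables with defaults (FR4.3)."""
    host: str
    port: int
    max_queue_size: int

def load_config(env=os.environ) -> Config:
    """Read configuration from the environment, validating at startup (FR4.4)."""
    port = int(env.get("VOICE_SERVER_PORT", "8888"))
    if not (1 <= port <= 65535):
        raise ValueError(f"invalid port: {port}")
    queue_size = int(env.get("VOICE_SERVER_QUEUE_SIZE", "50"))
    if queue_size < 1:
        raise ValueError(f"invalid queue size: {queue_size}")
    return Config(
        host=env.get("VOICE_SERVER_HOST", "0.0.0.0"),
        port=port,
        max_queue_size=queue_size,
    )
```

Passing an empty mapping exercises the defaults, which keeps FR4.3's "sensible defaults" requirement testable.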
#### FR5: Error Handling
- **FR5.1:** System SHALL return appropriate HTTP error codes for failures
- **FR5.2:** System SHALL log all errors with timestamps and context
- **FR5.3:** System SHALL continue operating after non-fatal errors
- **FR5.4:** System SHALL gracefully handle TTS engine failures
- **FR5.5:** System SHALL provide detailed error messages in responses
### Non-Functional Requirements
#### NFR1: Performance
- **NFR1.1:** API response time SHALL be < 50ms (excluding TTS processing)
- **NFR1.2:** TTS conversion SHALL complete in < 2 seconds for 500 character texts
- **NFR1.3:** System SHALL handle 20+ requests per second without degradation
- **NFR1.4:** Memory usage SHALL remain < 500MB under normal load
- **NFR1.5:** CPU usage SHALL average < 30% during active TTS processing
#### NFR2: Reliability
- **NFR2.1:** System SHALL maintain 99.9% uptime during operation
- **NFR2.2:** System SHALL recover from audio device disconnections
- **NFR2.3:** System SHALL handle Out-of-Memory conditions gracefully
- **NFR2.4:** System SHALL log all critical errors for debugging
#### NFR3: Maintainability
- **NFR3.1:** Code SHALL maintain > 80% test coverage
- **NFR3.2:** All functions SHALL include docstrings with type hints
- **NFR3.3:** Code SHALL follow PEP 8 style guidelines
- **NFR3.4:** Dependencies SHALL be pinned to specific versions
- **NFR3.5:** README SHALL provide clear setup and usage instructions
#### NFR4: Security
- **NFR4.1:** System SHALL sanitize all text inputs to prevent injection attacks
- **NFR4.2:** System SHALL limit request payload size to 1MB
- **NFR4.3:** System SHALL not expose internal stack traces in API responses
- **NFR4.4:** System SHALL log all incoming requests for audit purposes
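NFR4.1 and NFR4.2 could be enforced with helpers along these lines. A hedged sketch only; the function names and the choice of which characters to strip are illustrative, not mandated by this document:

```python
MAX_PAYLOAD_BYTES = 1_048_576  # NFR4.2: 1MB request payload limit

def sanitize_text(text: str) -> str:
    """Drop control characters before the text reaches the TTS engine (NFR4.1).

    Keeps printable characters plus common whitespace.
    """
    cleaned = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t ")
    return cleaned.strip()

def check_payload_size(raw: bytes) -> None:
    """Reject oversized payloads before JSON parsing (NFR4.2)."""
    if len(raw) > MAX_PAYLOAD_BYTES:
        raise ValueError("payload exceeds 1MB limit")
```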
#### NFR5: Compatibility
- **NFR5.1:** System SHALL run on Linux (Nobara/Fedora 42)
- **NFR5.2:** System SHALL support Python 3.9+
- **NFR5.3:** System SHALL work with standard ALSA/PulseAudio setups
- **NFR5.4:** System SHALL be deployable as a systemd service
---
## System Architecture
### High-Level Architecture
```
┌─────────────────────────────────────────────────────┐
│                Client Applications                  │
│        (AI Agents, Scripts, Other Services)         │
└──────────────────────────┬──────────────────────────┘
                           │ HTTP POST /notify
                           │ JSON: {"message": "text"}
                           ▼
┌─────────────────────────────────────────────────────┐
│                 FastAPI Web Server                  │
│  ┌───────────┐   ┌───────────┐   ┌───────────┐      │
│  │  /notify  │   │  /health  │   │   /docs   │      │
│  │ endpoint  │   │ endpoint  │   │ (Swagger) │      │
│  └─────┬─────┘   └───────────┘   └───────────┘      │
│        │ Validates & Enqueues                       │
│        ▼                                            │
│  ┌───────────────────────────────────────────┐      │
│  │           Async Request Queue             │      │
│  │    (asyncio.Queue with max size limit)    │      │
│  └─────────────────────┬─────────────────────┘      │
└────────────────────────┼────────────────────────────┘
                         │ Background Task Processing
                         ▼
┌─────────────────────────────────────────────────────┐
│                TTS Processing Layer                 │
│  ┌───────────────────────────────────────────┐      │
│  │             Piper TTS Engine              │      │
│  │  ┌──────────────┐     ┌──────────────┐    │      │
│  │  │ Voice Models │     │ ONNX Runtime │    │      │
│  │  │ (.onnx +     │     │ Inference    │    │      │
│  │  │  .json)      │     │ Engine       │    │      │
│  │  └──────────────┘     └──────────────┘    │      │
│  └─────────────────────┬─────────────────────┘      │
│                        │ Generate WAV               │
│                        ▼                            │
│  ┌───────────────────────────────────────────┐      │
│  │          In-Memory Audio Buffer           │      │
│  │          (NumPy array / bytes)            │      │
│  └─────────────────────┬─────────────────────┘      │
└────────────────────────┼────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────┐
│                Audio Playback Layer                 │
│  ┌───────────────────────────────────────────┐      │
│  │       sounddevice Stream Manager          │      │
│  │       - Non-blocking playback             │      │
│  │       - Stream lifecycle management       │      │
│  └─────────────────────┬─────────────────────┘      │
│                        ▼                            │
│  ┌───────────────────────────────────────────┐      │
│  │  System Audio Output (ALSA/PulseAudio)    │      │
│  └───────────────────────────────────────────┘      │
└──────────────────────────┬──────────────────────────┘
                           ▼
                   🔊 Computer Speakers
```
### Component Descriptions
#### 1. FastAPI Web Server
- **Responsibilities:**
- Accept and validate HTTP POST requests
- Provide automatic OpenAPI documentation
- Handle CORS configuration
- Route requests to appropriate handlers
- Return HTTP responses with appropriate status codes
- **Dependencies:**
- FastAPI framework
- Uvicorn ASGI server
- Pydantic for request/response validation
#### 2. Async Request Queue
- **Responsibilities:**
- Queue incoming TTS requests in FIFO order
- Prevent queue overflow with configurable max size
- Enable asynchronous processing without blocking HTTP responses
- Provide queue status information
- **Implementation:**
- `asyncio.Queue` for async-safe queuing
- Background task workers to process queue
- Queue metrics (size, processed count, errors)
#### 3. TTS Processing Layer
- **Responsibilities:**
- Load and manage Piper TTS voice models
- Convert text to audio waveforms
- Handle voice model selection
- Configure TTS parameters (rate, pitch, volume)
- Generate in-memory audio buffers
- **Implementation:**
- Piper TTS Python bindings
- ONNX Runtime for model inference
- Voice model caching for performance
- Error handling for model loading failures
#### 4. Audio Playback Layer
- **Responsibilities:**
- Initialize audio output streams
- Play audio buffers through system speakers
- Support non-blocking playback
- Handle audio device errors
- Manage stream lifecycle
- **Implementation:**
- sounddevice for cross-platform audio I/O
- Non-blocking `sd.play()` with background playback
- Simple NumPy array integration
- Graceful handling of audio device disconnections
### Data Flow
**Request Processing Flow:**
1. **HTTP Request Reception:**
- Client sends POST to `/notify` with JSON payload
- FastAPI validates request schema using Pydantic models
- Request is immediately acknowledged with HTTP 202 (Accepted)
2. **Request Enqueueing:**
- Validated request is added to async queue
- If queue is full, return HTTP 503 (Service Unavailable)
- Queue position is logged for monitoring
3. **Background Processing:**
- Background worker retrieves request from queue
- Text is passed to Piper TTS for conversion
- Piper generates WAV audio in memory
4. **Audio Playback:**
- Audio buffer is passed to sounddevice
- sounddevice streams audio to the system output
- Playback occurs in a background thread (non-blocking)
- Completion is logged
5. **Error Handling:**
- Errors at any stage are caught and logged
- Failed requests are removed from queue
- Error metrics are updated
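The enqueue step above (and the HTTP 503 queue-full path) can be sketched with the stdlib alone. The `QueueFullError` name and the request shape are illustrative:

```python
import asyncio

class QueueFullError(Exception):
    """Raised when the TTS queue cannot accept more work (surfaced as HTTP 503)."""

def enqueue(queue: asyncio.Queue, request: dict) -> int:
    """Enqueue without blocking; return the caller's queue position."""
    try:
        queue.put_nowait(request)
    except asyncio.QueueFull:
        raise QueueFullError("TTS queue is full, please retry later")
    return queue.qsize()

async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)
    positions = [enqueue(queue, {"message": m}) for m in ("one", "two")]
    try:
        enqueue(queue, {"message": "three"})
    except QueueFullError:
        positions.append(-1)  # the endpoint would translate this into a 503
    return positions
```

Running `asyncio.run(demo())` fills the queue to capacity and then demonstrates the rejection path for the third request.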
### Technology Stack Justification
#### FastAPI vs Flask
**Decision: FastAPI**
**Rationale:**
- **Performance:** FastAPI handles 15,000-20,000 req/s vs Flask's 2,000-3,000 req/s ([Strapi Comparison](https://strapi.io/blog/fastapi-vs-flask-python-framework-comparison))
- **Async Native:** Built on ASGI with native async/await support, critical for non-blocking TTS processing
- **Type Safety:** Pydantic integration provides automatic request validation and serialization
- **Documentation:** Automatic OpenAPI (Swagger) documentation generation
- **Modern Architecture:** Designed for microservices and high-concurrency applications
- **Growing Adoption:** 78k GitHub stars, 38% developer adoption in 2025 (40% YoY increase)
**Trade-offs:**
- Steeper learning curve compared to Flask
- Smaller ecosystem of extensions (though growing rapidly)
- Requires ASGI server (Uvicorn) vs Flask's built-in development server
#### Piper TTS Engine Selection
**Decision: Piper TTS**
**Rationale:**
- **Voice Quality:** Neural TTS with "Google TTS level quality" ([AntiX Forum](https://www.antixforum.com/forums/topic/tts-text-to-speech-in-linux-piper/))
- **Offline Operation:** Fully local, no internet required
- **Performance:** Optimized for local inference using ONNX Runtime
- **Resource Efficiency:** Runs on Raspberry Pi 4, suitable for desktop Linux
- **Easy Installation:** Available via pip (`pip install piper-tts`)
- **Active Development:** Maintained project with 2025 updates
- **Multiple Voices:** Extensive voice model library with quality/speed trade-offs
**Comparison with Alternatives:**
| Engine | Voice Quality | Speed | Resource Usage | Offline | Ease of Use |
|--------|---------------|-------|----------------|---------|-------------|
| **Piper TTS** | ⭐⭐⭐⭐⭐ Neural | ⭐⭐⭐⭐ Fast | ⭐⭐⭐⭐ Medium | ✅ Yes | ⭐⭐⭐⭐ Easy |
| pyttsx3 | ⭐⭐ Robotic | ⭐⭐⭐⭐⭐ Very Fast | ⭐⭐⭐⭐⭐ Very Low | ✅ Yes | ⭐⭐⭐⭐⭐ Very Easy |
| eSpeak | ⭐⭐ Robotic | ⭐⭐⭐⭐⭐ Very Fast | ⭐⭐⭐⭐⭐ Very Low | ✅ Yes | ⭐⭐⭐⭐ Easy |
| gTTS | ⭐⭐⭐⭐⭐ Neural | ⭐⭐ Slow | ⭐⭐⭐⭐ Low | ❌ No | ⭐⭐⭐⭐⭐ Very Easy |
| Coqui TTS | ⭐⭐⭐⭐⭐ Neural | ⭐⭐⭐ Medium | ⭐⭐ High | ✅ Yes | ⭐⭐ Complex |
**Trade-offs:**
- Larger model files (~20-100MB per voice) vs simple engines
- Higher resource usage than pyttsx3/eSpeak
- Requires ONNX Runtime dependency
#### sounddevice for Audio Playback
**Decision: sounddevice**
**Rationale:**
- **Pythonic API:** Clean, intuitive interface that feels native to Python
- **NumPy Integration:** Direct support for NumPy arrays (perfect for Piper TTS output)
- **Non-Blocking:** Simple `sd.play()` returns immediately, audio plays in background
- **Cross-Platform:** Works on Linux, Windows, macOS via PortAudio backend
- **Active Maintenance:** Well-maintained with regular updates
- **Simple Async:** Easy integration with asyncio via `sd.wait()` or callbacks
**Comparison with Alternatives:**
| Library | Non-Blocking | Dependencies | Maintenance | Linux Support |
|---------|-------------|--------------|-------------|---------------|
| **sounddevice** | ✅ Native | PortAudio | ⭐⭐⭐⭐ Active | ✅ Excellent |
| PyAudio | ✅ Callbacks | PortAudio | ⭐⭐⭐ Active | ✅ Excellent |
| simpleaudio | ✅ Async | None | ❌ Archived | ⭐⭐⭐ Good |
| pygame | ⭐ Limited | SDL | ⭐⭐⭐⭐ Active | ⭐⭐⭐⭐ Excellent |
**Why sounddevice over PyAudio:**
- Simpler API - `sd.play(audio, samplerate)` vs PyAudio's stream setup
- Better NumPy support - no conversion needed from Piper's output
- More Pythonic - feels like a modern Python library
- Easier async integration - works naturally with asyncio
---
## API Specification
### Endpoint: POST /notify
**Description:** Accept text string and queue for TTS playback
**Request Schema:**
```json
{
  "message": "string (required)",
  "voice": "string (optional)",
  "rate": "integer (optional, default: 170)",
  "voice_enabled": "boolean (optional, default: true)"
}
```
**Request Parameters:**
| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| `message` | string | Yes | - | 1-10000 chars | Text to convert to speech |
| `voice` | string | No | `en_US-lessac-medium` | Valid voice model name | Piper voice model to use |
| `rate` | integer | No | `170` | 50-400 | Speech rate in words per minute |
| `voice_enabled` | boolean | No | `true` | - | Enable/disable TTS (for debugging) |
**Example Request:**
```bash
curl -X POST http://localhost:8888/notify \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Hello, this is a test of the voice server",
    "rate": 200,
    "voice_enabled": true
  }'
```
**Response Schema (Success - 202 Accepted):**
```json
{
  "status": "queued",
  "message_length": 42,
  "queue_position": 3,
  "estimated_duration": 2.5,
  "voice_model": "en_US-lessac-medium"
}
```
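The `estimated_duration` field could be approximated from word count and the requested rate. A rough sketch only; actual output length varies with the Piper voice model:

```python
def estimate_duration(message: str, rate_wpm: int = 170) -> float:
    """Estimate speech duration in seconds from word count and rate (WPM).

    A heuristic for the response's estimated_duration field, not an exact figure.
    """
    words = max(1, len(message.split()))
    return round(words / rate_wpm * 60.0, 1)
```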
**Response Schema (Error - 400 Bad Request):**
```json
{
  "error": "validation_error",
  "detail": "message field is required",
  "timestamp": "2025-12-18T10:30:45.123Z"
}
```
**Response Schema (Error - 503 Service Unavailable):**
```json
{
  "error": "queue_full",
  "detail": "TTS queue is full, please retry later",
  "queue_size": 50,
  "timestamp": "2025-12-18T10:30:45.123Z"
}
```
**HTTP Status Codes:**
| Code | Meaning | Scenario |
|------|---------|----------|
| 202 | Accepted | Request successfully queued for processing |
| 400 | Bad Request | Invalid request parameters or malformed JSON |
| 413 | Payload Too Large | Message exceeds 10,000 characters |
| 422 | Unprocessable Entity | Valid JSON but invalid parameter values |
| 500 | Internal Server Error | TTS engine failure or unexpected error |
| 503 | Service Unavailable | Queue is full or service is shutting down |
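A client honoring the 503 contract would retry with backoff. Sketched here with an injected `send` callable (any function returning a status code) so no particular HTTP library is assumed:

```python
import time

def post_with_retry(send, payload: dict, retries: int = 3, base_delay: float = 0.5):
    """Retry on 503 (queue full) with exponential backoff.

    `send` is a caller-supplied callable returning an HTTP status code,
    e.g. a thin wrapper around urllib or httpx; injected to keep this
    sketch transport-agnostic.
    """
    for attempt in range(retries):
        status = send(payload)
        if status != 503:
            return status
        time.sleep(base_delay * (2 ** attempt))  # back off before retrying
    raise RuntimeError("TTS queue still full after retries")
```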
---
### Endpoint: GET /health
**Description:** Health check endpoint for monitoring
**Request:** No parameters
**Response Schema (Healthy - 200 OK):**
```json
{
  "status": "healthy",
  "uptime_seconds": 3600,
  "queue_size": 2,
  "queue_capacity": 50,
  "tts_engine": "piper",
  "audio_output": "available",
  "voice_models_loaded": ["en_US-lessac-medium"],
  "total_requests": 1523,
  "failed_requests": 12,
  "timestamp": "2025-12-18T10:30:45.123Z"
}
```
**Response Schema (Unhealthy - 503 Service Unavailable):**
```json
{
  "status": "unhealthy",
  "errors": [
    "Audio output device unavailable",
    "TTS engine failed to initialize"
  ],
  "timestamp": "2025-12-18T10:30:45.123Z"
}
```
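The health payloads above can be assembled from simple counters. A sketch whose field names follow the schema; the counter sources themselves are illustrative:

```python
import time
from datetime import datetime, timezone

def build_health(start_time: float, queue_size: int, queue_capacity: int,
                 total: int, failed: int, errors: list):
    """Return (HTTP status, body) matching the /health response schema."""
    now = datetime.now(timezone.utc).isoformat()
    if errors:
        # Any recorded error (audio device, TTS init) flips the service to 503
        return 503, {"status": "unhealthy", "errors": errors, "timestamp": now}
    return 200, {
        "status": "healthy",
        "uptime_seconds": int(time.time() - start_time),
        "queue_size": queue_size,
        "queue_capacity": queue_capacity,
        "total_requests": total,
        "failed_requests": failed,
        "timestamp": now,
    }
```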
---
### Endpoint: GET /docs
**Description:** Automatic Swagger UI documentation (provided by FastAPI)
**Access:** `http://localhost:8888/docs`
**Features:**
- Interactive API testing
- Schema visualization
- Request/response examples
- Authentication testing (if implemented)
---
### Endpoint: GET /voices
**Description:** List available TTS voice models
**Request:** No parameters
**Response Schema (200 OK):**
```json
{
  "voices": [
    {
      "name": "en_US-lessac-medium",
      "language": "en_US",
      "quality": "medium",
      "size_mb": 63.5,
      "installed": true
    },
    {
      "name": "en_US-libritts-high",
      "language": "en_US",
      "quality": "high",
      "size_mb": 108.2,
      "installed": false
    }
  ],
  "default_voice": "en_US-lessac-medium"
}
```
---
## TTS Engine Analysis
### Detailed Comparison Matrix
| Engine | Voice Quality | Latency | CPU Usage | Memory | Offline | Linux Support | Python API | Maintenance |
|--------|---------------|---------|-----------|--------|---------|---------------|------------|-------------|
| **Piper TTS** | ⭐⭐⭐⭐⭐ | ~500ms | Medium | ~200MB | ✅ | ✅ Excellent | ✅ Native | 🟢 Active |
| **pyttsx3** | ⭐⭐ | ~100ms | Low | ~50MB | ✅ | ✅ Good | ✅ Native | 🟢 Active |
| **eSpeak-ng** | ⭐⭐ | ~50ms | Very Low | ~20MB | ✅ | ✅ Excellent | ⚠️ Wrapper | 🟢 Active |
| **gTTS** | ⭐⭐⭐⭐⭐ | ~2000ms | Low | ~30MB | ❌ | ✅ Good | ✅ Native | 🟢 Active |
| **Coqui TTS** | ⭐⭐⭐⭐⭐ | ~1500ms | High | ~500MB | ✅ | ✅ Good | ✅ Native | 🟡 Slow |
| **Festival** | ⭐⭐⭐ | ~300ms | Low | ~100MB | ✅ | ✅ Excellent | ⚠️ Wrapper | 🟡 Slow |
| **Mimic3** | ⭐⭐⭐⭐ | ~800ms | Medium | ~300MB | ✅ | ✅ Good | ❌ HTTP only | 🟢 Active |
### Detailed Engine Profiles
#### 1. Piper TTS (RECOMMENDED)
**Pros:**
- Neural TTS with natural-sounding voices
- Optimized for local inference (ONNX Runtime)
- Multiple quality levels (low/medium/high)
- Extensive language and voice support
- Active development and community
- Easy pip installation
- GPU acceleration support (CUDA)
**Cons:**
- Larger model files (20-100MB per voice)
- Higher resource usage than simple engines
- Initial model download required
- Slightly higher latency than robotic engines
**Installation:**
```bash
uv pip install piper-tts
```
**Usage Example:**
```python
from piper import PiperVoice
import wave

voice = PiperVoice.load("en_US-lessac-medium.onnx")
with wave.open("output.wav", "wb") as wav_file:
    voice.synthesize("Hello world", wav_file)
```
**Voice Quality Sample:**
- **Low Quality:** Faster, smaller models (~20MB), decent quality
- **Medium Quality:** Balanced performance (~60MB), recommended default
- **High Quality:** Best quality (~100MB), slower inference
**References:**
- [GitHub Repository](https://github.com/rhasspy/piper)
- [PyPI Package](https://pypi.org/project/piper-tts/)
- [Voice Model Library](https://github.com/rhasspy/piper/blob/master/VOICES.md)
---
#### 2. pyttsx3
**Pros:**
- Extremely lightweight and fast
- Cross-platform (Windows SAPI5, macOS NSSpeech, Linux eSpeak)
- Zero external dependencies
- Simple API
- No model downloads required
**Cons:**
- Robotic voice quality
- Limited voice customization
- Depends on system TTS engines
**Installation:**
```bash
uv pip install pyttsx3
```
**Usage Example:**
```python
import pyttsx3
engine = pyttsx3.init()
engine.say("Hello world")
engine.runAndWait()
```
**References:**
- [PyPI Package](https://pypi.org/project/pyttsx3/)
- [GitHub Repository](https://github.com/nateshmbhat/pyttsx3)
---
#### 3. eSpeak-ng
**Pros:**
- Ultra-fast synthesis
- 100+ language support
- Minimal resource usage
- Highly customizable
- System-level installation
**Cons:**
- Robotic, mechanical voice quality
- Python wrapper required (not native)
- Less natural prosody
**Installation:**
```bash
# System package
sudo dnf install espeak-ng
# Python wrapper
uv pip install py3-tts # Uses eSpeak backend
```
**Usage Example:**
```bash
echo "Hello world" | espeak-ng
```
**References:**
- [eSpeak-ng Homepage](https://github.com/espeak-ng/espeak-ng)
- [Circuit Digest Comparison](https://circuitdigest.com/microcontroller-projects/best-text-to-speech-tts-converter-for-raspberry-pi-espeak-festival-google-tts-pico-and-pyttsx3)
---
#### 4. Coqui TTS
**Pros:**
- State-of-the-art neural voices
- Custom voice training support
- Multiple model architectures
- High-quality output
**Cons:**
- Very high resource requirements
- Slower inference
- Complex setup
- Larger memory footprint
- Development has slowed
**Installation:**
```bash
uv pip install TTS
```
**Usage Example:**
```python
from TTS.api import TTS
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world", file_path="output.wav")
```
**References:**
- [Coqui TTS GitHub](https://github.com/coqui-ai/TTS)
---
### Recommendation: Piper TTS
**Final Decision:** Piper TTS is the optimal choice for this project.
**Justification:**
1. **Quality:** Neural voices with Google TTS-level quality
2. **Offline:** Fully local, no internet required (critical requirement)
3. **Performance:** Optimized for local inference, suitable for desktop Linux
4. **Python Native:** First-class Python API, easy integration
5. **Maintenance:** Actively maintained with 2025 updates
6. **Flexibility:** Multiple quality levels allow performance tuning
7. **Ease of Use:** Simple pip installation, straightforward API
**Configuration Strategy:**
- **Default Voice:** `en_US-lessac-medium` (balanced quality/performance)
- **GPU Acceleration:** Optional CUDA support for faster inference
- **Model Caching:** Pre-load voice models at startup to reduce latency
- **Quality Toggle:** Allow clients to request different quality levels
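The model-caching strategy above can be sketched with a loader-agnostic cache. In the real server the loader would presumably be `PiperVoice.load`; here it is injected so the sketch stays self-contained:

```python
from typing import Any, Callable, Dict

class VoiceModelCache:
    """Cache loaded voice models so each .onnx file is read only once."""

    def __init__(self, loader: Callable[[str], Any]):
        self._loader = loader          # e.g. PiperVoice.load in production
        self._models: Dict[str, Any] = {}

    def get(self, name: str) -> Any:
        """Return the cached model, loading it on first use."""
        if name not in self._models:
            self._models[name] = self._loader(name)
        return self._models[name]

    def preload(self, names: list) -> None:
        """Warm the cache at startup to cut first-request latency."""
        for name in names:
            self.get(name)
```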
---
## Web Framework Selection
### FastAPI: Detailed Analysis
**Why FastAPI is Ideal for This Project:**
#### 1. Async-First Architecture
FastAPI is built on Starlette (ASGI framework) with native async/await support. This is critical for our use case:
```python
@app.post("/notify")
async def notify(request: NotifyRequest):
    # Non-blocking enqueueing
    await tts_queue.put(request)
    return {"status": "queued"}

# Background worker runs concurrently
async def process_queue():
    while True:
        request = await tts_queue.get()
        await generate_and_play_tts(request)
```
**Benefit:** HTTP responses return immediately while TTS processing happens in background.
#### 2. Performance Benchmarks
According to TechEmpower benchmarks ([Better Stack](https://betterstack.com/community/guides/scaling-python/flask-vs-fastapi/)):
- **FastAPI:** 15,000-20,000 requests/second
- **Flask:** 2,000-3,000 requests/second
**Benefit:** 5-10x higher throughput for handling concurrent TTS requests.
#### 3. Automatic API Documentation
FastAPI generates interactive OpenAPI (Swagger) documentation automatically:
```python
@app.post("/notify", response_model=NotifyResponse)
async def notify(request: NotifyRequest):
    """
    Convert text to speech and play through speakers.

    - **message**: Text to convert (1-10000 characters)
    - **rate**: Speech rate in WPM (50-400)
    - **voice**: Voice model name (optional)
    """
    ...
```
**Benefit:** Instant API documentation at `/docs` without manual maintenance.
#### 4. Type Safety with Pydantic
Automatic request validation and serialization:
```python
from pydantic import BaseModel, Field, validator

class NotifyRequest(BaseModel):
    message: str = Field(..., min_length=1, max_length=10000)
    rate: int = Field(170, ge=50, le=400)
    voice_enabled: bool = True

    @validator('message')
    def sanitize_message(cls, v):
        # Automatic validation before handler runs
        return v.strip()
```
**Benefit:** Eliminates manual validation code, reduces bugs.
#### 5. Dependency Injection
Clean separation of concerns:
```python
from fastapi import Depends

async def get_tts_engine():
    return global_tts_engine

@app.post("/notify")
async def notify(
    request: NotifyRequest,
    tts_engine: PiperVoice = Depends(get_tts_engine)
):
    # tts_engine automatically injected
    ...
```
**Benefit:** Testable, maintainable code with clear dependencies.
#### 6. Background Tasks
Built-in support for fire-and-forget tasks:
```python
from fastapi import BackgroundTasks

@app.post("/notify")
async def notify(request: NotifyRequest, background_tasks: BackgroundTasks):
    background_tasks.add_task(generate_tts, request.message)
    return {"status": "queued"}
```
**Benefit:** Simplified async task management.
### Flask Comparison (Why Not Flask)
**Flask Limitations for This Project:**
1. **WSGI-Based:** Synchronous by default, requires Gunicorn/gevent for async
2. **Lower Performance:** 2,000-3,000 req/s vs FastAPI's 15,000-20,000 req/s
3. **Manual Documentation:** Requires Flask-RESTPlus or manual OpenAPI setup
4. **Manual Validation:** No built-in request validation, requires Flask-Pydantic extension
5. **Blocking I/O:** Natural behavior blocks request threads during TTS processing
**When Flask Would Be Better:**
- Simple synchronous applications
- Heavy reliance on Flask extensions (Flask-Login, Flask-Admin)
- Team already experienced with Flask
- Need for Jinja2 templating (not needed here)
**Verdict:** FastAPI is the clear winner for this async-heavy, high-performance use case.
---
## Audio Playback Strategy
### sounddevice Implementation Details
#### Non-Blocking Playback
sounddevice provides simple, non-blocking audio playback out of the box:
```python
import asyncio

import numpy as np
import sounddevice as sd

class AudioPlayer:
    """Simple audio player using sounddevice."""

    def __init__(self, sample_rate: int = 22050):
        self.sample_rate = sample_rate
        self._current_stream = None

    def play(self, audio_data: np.ndarray, sample_rate: int = None):
        """
        Non-blocking audio playback.

        Args:
            audio_data: NumPy array of audio samples (float32 or int16)
            sample_rate: Sample rate in Hz (defaults to instance default)
        """
        rate = sample_rate or self.sample_rate
        # Stop any currently playing audio
        self.stop()
        # Play audio - returns immediately, audio plays in background
        sd.play(audio_data, rate)

    def is_playing(self) -> bool:
        """Check if audio is currently playing."""
        stream = sd.get_stream()
        return stream is not None and stream.active

    def stop(self):
        """Stop current playback."""
        sd.stop()

    def wait(self):
        """Block until current playback completes."""
        sd.wait()

    async def wait_async(self):
        """Async wait for playback completion."""
        while self.is_playing():
            await asyncio.sleep(0.05)
```
**Benefits of sounddevice:**
- `sd.play()` returns immediately - audio plays in background thread
- Direct NumPy array support - no conversion needed from Piper TTS
- Simple API - one line to play audio
- Built-in `sd.wait()` for synchronous waiting when needed
---
#### Handling Concurrent Requests
**Strategy:** Queue-based sequential playback with async queue management.
**Rationale:**
- Playing multiple TTS outputs simultaneously would create audio chaos
- Sequential playback ensures clarity
- Queue allows buffering during high request volume
**Implementation:**
```python
import asyncio
import logging
from typing import Any, Dict

import numpy as np
import sounddevice as sd

logger = logging.getLogger(__name__)

class QueueFullError(Exception):
    """Raised when the TTS queue is at capacity."""

class TTSQueue:
    def __init__(self, max_size: int = 50):
        self.queue = asyncio.Queue(maxsize=max_size)
        self.player = AudioPlayer()
        self.stats = {"processed": 0, "errors": 0}

    async def enqueue(self, request: Dict[str, Any]):
        """Add TTS request to queue."""
        try:
            await asyncio.wait_for(
                self.queue.put(request),
                timeout=1.0
            )
            return self.queue.qsize()
        except asyncio.TimeoutError:
            raise QueueFullError("TTS queue is full")

    async def process_queue(self):
        """Background worker to process TTS queue."""
        while True:
            request = await self.queue.get()
            try:
                # Generate TTS audio
                audio_data = await self.generate_tts(request)
                # Play audio (non-blocking start)
                self.player.play(audio_data, sample_rate=22050)
                # Wait for playback to complete (async-friendly)
                await self.player.wait_async()
                self.stats["processed"] += 1
            except Exception as e:
                logger.error(f"TTS processing error: {e}")
                self.stats["errors"] += 1
            finally:
                self.queue.task_done()

    async def generate_tts(self, request: Dict[str, Any]) -> np.ndarray:
        """Generate TTS audio using Piper."""
        # Run CPU-intensive TTS in thread pool
        loop = asyncio.get_event_loop()
        audio_data = await loop.run_in_executor(
            None,
            self._sync_generate_tts,
            request["message"],
            request.get("voice", "en_US-lessac-medium")
        )
        return audio_data

    def _sync_generate_tts(self, text: str, voice: str) -> np.ndarray:
        """Synchronous TTS generation (runs in thread pool)."""
        # Piper TTS generation code
        ...
        return audio_array
```
**Startup:**
```python
import asyncio
from contextlib import asynccontextmanager

import sounddevice as sd
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: initialize queue and start processor
    global tts_queue
    tts_queue = TTSQueue(max_size=50)
    asyncio.create_task(tts_queue.process_queue())
    yield
    # Shutdown: stop audio playback
    sd.stop()

app = FastAPI(lifespan=lifespan)
```
---
#### Audio Device Error Handling
**Common Issues:**
1. Audio device disconnected (headphones unplugged)
2. PulseAudio/ALSA daemon crashed
3. No audio devices available
4. Device in use by another process
**Handling Strategy:**
```python
import sounddevice as sd
import numpy as np
import time
import logging
logger = logging.getLogger(__name__)
class RobustAudioPlayer:
"""Audio player with automatic retry and device recovery."""
def __init__(self, retry_attempts: int = 3, sample_rate: int = 22050):
self.retry_attempts = retry_attempts
self.sample_rate = sample_rate
self.verify_audio_devices()
def verify_audio_devices(self):
"""Verify audio devices are available."""
try:
devices = sd.query_devices()
output_devices = [d for d in devices if d['max_output_channels'] > 0]
if not output_devices:
raise AudioDeviceError("No audio output devices found")
logger.info(f"Audio initialized: {len(output_devices)} output devices found")
logger.debug(f"Default output: {sd.query_devices(kind='output')['name']}")
except Exception as e:
logger.error(f"Audio initialization failed: {e}")
raise
def play(self, audio_data: np.ndarray, sample_rate: int = None):
"""Play audio with automatic retry on device errors."""
rate = sample_rate or self.sample_rate
for attempt in range(self.retry_attempts):
try:
sd.play(audio_data, rate)
return
except sd.PortAudioError as e:
logger.warning(f"Audio playback failed (attempt {attempt+1}): {e}")
if attempt < self.retry_attempts - 1:
# Wait and retry - device may become available
sd.stop()
time.sleep(0.5)
self.verify_audio_devices()
else:
raise AudioPlaybackError(f"Failed after {self.retry_attempts} attempts: {e}")
    def is_playing(self) -> bool:
        """Check if audio is currently playing."""
        try:
            stream = sd.get_stream()
        except RuntimeError:
            # sd.get_stream() raises RuntimeError if play() has never been called
            return False
        return stream.active
def stop(self):
"""Stop current playback."""
sd.stop()
async def wait_async(self):
"""Async wait for playback completion."""
import asyncio
while self.is_playing():
await asyncio.sleep(0.05)
```
**Device Query for Diagnostics:**
```python
def get_audio_diagnostics() -> dict:
"""Get audio system diagnostics for health check."""
try:
devices = sd.query_devices()
default_output = sd.query_devices(kind='output')
return {
"status": "available",
"device_count": len(devices),
"default_output": default_output['name'],
"sample_rate": default_output['default_samplerate']
}
except Exception as e:
return {
"status": "unavailable",
"error": str(e)
}
```
---
## Error Handling Strategy
### Error Categories and Handling
#### 1. Request Validation Errors
**Scenarios:**
- Missing required fields
- Invalid parameter types
- Out-of-range values
- Malformed JSON
**Handling:**
```python
from fastapi import HTTPException, status
from pydantic import BaseModel, Field, ValidationError
class NotifyRequest(BaseModel):
message: str = Field(..., min_length=1, max_length=10000)
rate: int = Field(170, ge=50, le=400)
    voice: str = Field("en_US-lessac-medium", pattern=r"^[\w-]+$")  # use regex= on Pydantic v1
@app.exception_handler(ValidationError)
async def validation_exception_handler(request, exc):
return JSONResponse(
status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
content={
"error": "validation_error",
"detail": str(exc),
"timestamp": datetime.utcnow().isoformat()
}
)
```
**HTTP Status:** 422 Unprocessable Entity
---
#### 2. Queue Full Errors
**Scenario:** Too many concurrent requests, queue is at capacity
**Handling:**
```python
class QueueFullError(Exception):
pass
@app.post("/notify")
async def notify(request: NotifyRequest):
try:
position = await tts_queue.enqueue(request)
return {
"status": "queued",
"queue_position": position
}
except QueueFullError:
raise HTTPException(
status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
detail={
"error": "queue_full",
"message": "TTS queue is full, please retry later",
"queue_size": tts_queue.max_size,
"retry_after": 5 # seconds
}
)
```
**HTTP Status:** 503 Service Unavailable
**Client Action:** Implement exponential backoff retry
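That backoff can be sketched on the client side (a hypothetical helper; `send_fn` stands in for whatever HTTP call the client actually makes):

```python
import random
import time

def post_with_backoff(send_fn, payload, max_attempts: int = 5, base_delay: float = 0.5):
    """Retry a request with exponential backoff plus jitter while the server returns 503."""
    for attempt in range(max_attempts):
        status = send_fn(payload)
        if status != 503:
            return status  # success or a non-retryable error
        # Exponential backoff: base, 2*base, 4*base, ... plus up to 100ms of jitter
        delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
        time.sleep(delay)
    return 503  # queue still full after all attempts
```

The jitter spreads out retries from multiple clients so they do not hammer the queue in lockstep.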
---
#### 3. TTS Engine Errors
**Scenarios:**
- Voice model not found
- ONNX Runtime errors
- Memory allocation failures
- Corrupted model files
**Handling:**
```python
class TTSEngineError(Exception):
pass
async def generate_tts(text: str, voice: str) -> np.ndarray:
try:
# Attempt TTS generation
audio = piper_voice.synthesize(text)
return audio
except FileNotFoundError:
raise TTSEngineError(f"Voice model '{voice}' not found")
except MemoryError:
raise TTSEngineError("Insufficient memory for TTS generation")
except Exception as e:
logger.error(f"TTS generation failed: {e}", exc_info=True)
raise TTSEngineError(f"TTS generation failed: {str(e)}")
@app.exception_handler(TTSEngineError)
async def tts_engine_exception_handler(request, exc):
return JSONResponse(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
content={
"error": "tts_engine_error",
"detail": str(exc),
"timestamp": datetime.utcnow().isoformat()
}
)
```
**HTTP Status:** 500 Internal Server Error
---
#### 4. Audio Playback Errors
**Scenarios:**
- No audio devices available
- Audio device disconnected
- ALSA/PulseAudio errors
- Permission denied
**Handling:**
```python
class AudioPlaybackError(Exception):
pass
async def play_audio(audio_data: np.ndarray):
try:
        player.play(audio_data, sample_rate=22050)  # RobustAudioPlayer.play retries internally
except AudioDeviceError as e:
logger.error(f"Audio device error: {e}")
raise AudioPlaybackError("No audio output devices available")
except OSError as e:
logger.error(f"Audio system error: {e}")
raise AudioPlaybackError(f"Audio playback failed: {str(e)}")
# In queue processor
try:
await play_audio(audio_data)
except AudioPlaybackError as e:
logger.error(f"Playback error: {e}")
# Continue processing queue, don't crash server
stats["errors"] += 1
```
**Action:** Log error, continue processing queue (don't crash server)
---
#### 5. System Resource Errors
**Scenarios:**
- Out of memory
- CPU overload
- Disk space exhausted
**Handling:**
```python
import psutil
async def check_system_resources():
"""Monitor system resources."""
memory = psutil.virtual_memory()
if memory.percent > 90:
logger.warning(f"High memory usage: {memory.percent}%")
    # interval=None is non-blocking (uses time since the last call); interval=1
    # would add a full second of latency to every request in the middleware below
    cpu = psutil.cpu_percent(interval=None)
    if cpu > 90:
        logger.warning(f"High CPU usage: {cpu}%")
@app.middleware("http")
async def resource_monitoring_middleware(request, call_next):
"""Monitor resources on each request."""
await check_system_resources()
response = await call_next(request)
return response
```
**Action:** Log warnings, implement queue size limits to prevent resource exhaustion
---
### Logging Strategy
**Log Levels:**
```python
import logging
from logging.handlers import RotatingFileHandler
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
handlers=[
RotatingFileHandler(
'voice-server.log',
maxBytes=10*1024*1024, # 10MB
backupCount=5
),
logging.StreamHandler()
]
)
logger = logging.getLogger(__name__)
# Log levels usage:
logger.debug("TTS parameters: rate=%d, voice=%s", rate, voice) # DEBUG
logger.info("Request queued: position=%d", queue_position) # INFO
logger.warning("Queue nearly full: %d/%d", current, max_size) # WARNING
logger.error("TTS generation failed: %s", error, exc_info=True) # ERROR
logger.critical("Audio system unavailable, shutting down") # CRITICAL
```
**Structured Logging:**
```python
import json
from datetime import datetime
def log_request(request_id: str, message: str, status: str):
"""Structured JSON logging."""
log_entry = {
"timestamp": datetime.utcnow().isoformat(),
"request_id": request_id,
"message_length": len(message),
"status": status,
"event_type": "tts_request"
}
logger.info(json.dumps(log_entry))
```
---
### Health Check Implementation
**Comprehensive Health Checks:**
```python
@app.get("/health")
async def health_check():
"""Detailed health status."""
health_status = {
"status": "healthy",
"timestamp": datetime.utcnow().isoformat(),
"checks": {}
}
# Check TTS engine
try:
tts_engine.test_synthesis("test")
health_status["checks"]["tts_engine"] = "healthy"
except Exception as e:
health_status["checks"]["tts_engine"] = f"unhealthy: {str(e)}"
health_status["status"] = "unhealthy"
# Check audio output
try:
audio_player.test_output()
health_status["checks"]["audio_output"] = "healthy"
except Exception as e:
health_status["checks"]["audio_output"] = f"unhealthy: {str(e)}"
health_status["status"] = "unhealthy"
# Check queue status
queue_size = tts_queue.qsize()
health_status["checks"]["queue"] = {
"size": queue_size,
"capacity": tts_queue.max_size,
"utilization": f"{(queue_size/tts_queue.max_size)*100:.1f}%"
}
# Check system resources
health_status["checks"]["system"] = {
"memory_percent": psutil.virtual_memory().percent,
"cpu_percent": psutil.cpu_percent(interval=0.1)
}
status_code = 200 if health_status["status"] == "healthy" else 503
return JSONResponse(status_code=status_code, content=health_status)
```
---
## Implementation Checklist
### Phase 1: Core Infrastructure (Days 1-2)
#### 1.1 Project Setup
- [ ] Initialize project directory `/mnt/NV2/Development/voice-server`
- [ ] Create Python virtual environment using `uv`
- [ ] Install core dependencies:
- [ ] `uv pip install fastapi`
- [ ] `uv pip install uvicorn[standard]`
- [ ] `uv pip install piper-tts`
- [ ] `uv pip install sounddevice`
- [ ] `uv pip install numpy`
- [ ] `uv pip install pydantic`
- [ ] `uv pip install python-dotenv`
- [ ] Create `requirements.txt` with pinned versions
- [ ] Create `.env.example` for configuration template
- [ ] Initialize git repository
- [ ] Create `.gitignore` (Python, IDEs, .env, voice models)
#### 1.2 FastAPI Application Structure
- [ ] Create `app/main.py` with FastAPI app initialization
- [ ] Implement `/notify` endpoint skeleton
- [ ] Implement `/health` endpoint skeleton
- [ ] Implement `/voices` endpoint skeleton
- [ ] Configure CORS middleware
- [ ] Configure JSON logging middleware
- [ ] Create Pydantic models for request/response schemas
- [ ] Test basic server startup: `uvicorn app.main:app --reload`
#### 1.3 Configuration Management
- [ ] Create `app/config.py` for configuration loading
- [ ] Implement environment variable loading
- [ ] Define configuration schema (host, port, queue size, etc.)
- [ ] Implement configuration validation at startup
- [ ] Create CLI argument parsing for overrides
- [ ] Document all configuration options in README
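One possible shape for `app/config.py`, using only the standard library (field names and environment variable names here are illustrative, not part of the spec):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    host: str = "127.0.0.1"
    port: int = 8888
    max_queue_size: int = 50
    default_voice: str = "en_US-lessac-medium"

def load_settings() -> Settings:
    """Load settings from environment variables, falling back to defaults."""
    settings = Settings(
        host=os.getenv("VOICE_HOST", "127.0.0.1"),
        port=int(os.getenv("VOICE_PORT", "8888")),
        max_queue_size=int(os.getenv("VOICE_QUEUE_SIZE", "50")),
        default_voice=os.getenv("VOICE_DEFAULT_VOICE", "en_US-lessac-medium"),
    )
    # Validate at startup so misconfiguration fails fast, not mid-request
    if not (1 <= settings.port <= 65535):
        raise ValueError(f"Invalid port: {settings.port}")
    if settings.max_queue_size < 1:
        raise ValueError("Queue size must be positive")
    return settings
```

CLI overrides can then be layered on top by constructing a new `Settings` from parsed arguments.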
### Phase 2: TTS Integration (Days 2-3)
#### 2.1 Piper TTS Setup
- [ ] Create `app/tts_engine.py` module
- [ ] Implement `PiperTTSEngine` class
- [ ] Download default voice model (`en_US-lessac-medium`)
- [ ] Implement voice model loading with caching
- [ ] Implement text-to-audio synthesis method
- [ ] Add support for configurable speech rate
- [ ] Test TTS generation with sample text
- [ ] Measure TTS latency for various text lengths
#### 2.2 Voice Model Management
- [ ] Create `models/` directory for voice model storage
- [ ] Implement voice model discovery (scan `models/` directory)
- [ ] Implement lazy loading of voice models (load on first use)
- [ ] Create model metadata cache (name, language, quality, size)
- [ ] Implement `/voices` endpoint to list available models
- [ ] Add error handling for missing/corrupted models
- [ ] Document voice model installation process
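Model discovery can be sketched as a directory scan for Piper's `.onnx` / `.onnx.json` file pairs (a sketch; the metadata fields returned are illustrative):

```python
from pathlib import Path

def discover_voices(model_dir: str) -> dict:
    """Scan a directory for Piper voice models.

    A usable model is an .onnx file with a matching .onnx.json config
    sidecar; models missing their config are skipped.
    """
    voices = {}
    for onnx_path in Path(model_dir).glob("*.onnx"):
        config_path = Path(str(onnx_path) + ".json")  # e.g. en_US-lessac-medium.onnx.json
        if config_path.exists():
            name = onnx_path.stem  # "en_US-lessac-medium"
            voices[name] = {
                "model": str(onnx_path),
                "config": str(config_path),
                "size_bytes": onnx_path.stat().st_size,
            }
    return voices
```

The `/voices` endpoint can serve this dictionary directly, and lazy loading keys off the same names.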
#### 2.3 TTS Parameter Support
- [ ] Implement speech rate adjustment (50-400 WPM)
- [ ] Test rate adjustment across range
- [ ] Add voice selection via request parameter
- [ ] Implement voice validation (reject unknown voices)
- [ ] Add `voice_enabled` flag for debugging/testing
- [ ] Create comprehensive TTS unit tests
### Phase 3: Audio Playback (Day 3)
#### 3.1 sounddevice Integration
- [ ] Create `app/audio_player.py` module
- [ ] Implement `AudioPlayer` class with non-blocking `sd.play()`
- [ ] Verify sounddevice detects audio devices at startup
- [ ] Implement non-blocking playback method
- [ ] Implement async `wait_async()` method for queue processing
- [ ] Test audio playback with sample NumPy array
- [ ] Verify non-blocking behavior with concurrent requests
#### 3.2 Audio Error Handling
- [ ] Implement audio device detection
- [ ] Add retry logic for device failures
- [ ] Handle device disconnection gracefully
- [ ] Test with headphones unplugged during playback
- [ ] Implement fallback to different audio devices
- [ ] Add detailed audio error logging
- [ ] Create audio system health check
#### 3.3 Playback Testing
- [ ] Test simultaneous playback (should queue)
- [ ] Test rapid successive requests
- [ ] Measure audio latency (request → sound output)
- [ ] Test with various audio formats
- [ ] Verify memory cleanup after playback
- [ ] Test long-running playback (10+ minutes)
### Phase 4: Queue Management (Day 4)
#### 4.1 Async Queue Implementation
- [ ] Create `app/queue_manager.py` module
- [ ] Implement `TTSQueue` class with `asyncio.Queue`
- [ ] Set configurable max queue size (default: 50)
- [ ] Implement queue full detection
- [ ] Create background queue processor task
- [ ] Implement graceful queue shutdown
- [ ] Add queue metrics (size, processed, errors)
#### 4.2 Request Processing Pipeline
- [ ] Implement request enqueueing in `/notify` endpoint
- [ ] Create background worker to process queue
- [ ] Integrate TTS generation in worker
- [ ] Integrate audio playback in worker
- [ ] Implement sequential playback (one at a time)
- [ ] Add request timeout handling (max 60s per request)
- [ ] Test queue with 100+ concurrent requests
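The 60-second per-request timeout from the checklist can be wrapped around the worker's generate-and-play step with `asyncio.wait_for` (a sketch; `handle_request` stands in for that step):

```python
import asyncio

async def process_one(handle_request, request, timeout: float = 60.0) -> str:
    """Run a single TTS request with a hard timeout; report the outcome.

    On timeout, wait_for cancels the in-flight coroutine, so the worker can
    safely move on to the next queue item.
    """
    try:
        await asyncio.wait_for(handle_request(request), timeout=timeout)
        return "processed"
    except asyncio.TimeoutError:
        return "timeout"
```

The queue processor would call this per item and bump the `processed`/`errors` counters accordingly.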
#### 4.3 Queue Monitoring
- [ ] Add queue size to `/health` endpoint
- [ ] Implement queue utilization metrics
- [ ] Add logging for queue events (enqueue, process, error)
- [ ] Create queue performance benchmarks
- [ ] Test queue overflow scenarios
- [ ] Document queue behavior and limits
### Phase 5: Error Handling (Day 5)
#### 5.1 Exception Handlers
- [ ] Implement custom exception classes
- [ ] Create `QueueFullError` exception handler
- [ ] Create `TTSEngineError` exception handler
- [ ] Create `AudioPlaybackError` exception handler
- [ ] Create `ValidationError` exception handler
- [ ] Implement generic exception handler (catch-all)
- [ ] Test all error scenarios
#### 5.2 Logging Infrastructure
- [ ] Configure structured JSON logging
- [ ] Implement rotating file handler (10MB, 5 backups)
- [ ] Add request ID tracking across logs
- [ ] Implement log levels appropriately (DEBUG, INFO, WARNING, ERROR)
- [ ] Create log aggregation for queue processor
- [ ] Test log rotation
- [ ] Document log file locations and format
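Request ID tracking can be sketched with a `contextvars.ContextVar`, so log calls pick up the current request's ID without threading it through every function (names are illustrative; in FastAPI the variable would be set in an `@app.middleware("http")` hook):

```python
import contextvars
import uuid

# Each request gets an ID that any log call in the same task can read
request_id_var = contextvars.ContextVar("request_id", default="-")

def new_request_id() -> str:
    """Short, unique-enough ID for correlating log lines of one request."""
    return uuid.uuid4().hex[:12]

def log_line(message: str) -> str:
    """Prefix a log message with the current request ID."""
    return f"[{request_id_var.get()}] {message}"
```

Because `ContextVar` values are task-local under asyncio, concurrent requests do not clobber each other's IDs.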
#### 5.3 Health Monitoring
- [ ] Implement comprehensive `/health` endpoint
- [ ] Add TTS engine health check
- [ ] Add audio system health check
- [ ] Add queue status to health check
- [ ] Add system resource metrics (CPU, memory)
- [ ] Test health endpoint under load
- [ ] Create health check monitoring script
### Phase 6: Testing (Days 5-6)
#### 6.1 Unit Tests
- [ ] Create `tests/` directory structure
- [ ] Install pytest: `uv pip install pytest pytest-asyncio`
- [ ] Write tests for Pydantic models
- [ ] Write tests for TTS engine
- [ ] Write tests for audio player
- [ ] Write tests for queue manager
- [ ] Write tests for configuration loading
- [ ] Achieve 80%+ code coverage
#### 6.2 Integration Tests
- [ ] Write tests for `/notify` endpoint
- [ ] Write tests for `/health` endpoint
- [ ] Write tests for `/voices` endpoint
- [ ] Test end-to-end request flow
- [ ] Test concurrent request handling
- [ ] Test queue overflow scenarios
- [ ] Test error scenarios (TTS failure, audio failure)
#### 6.3 Performance Tests
- [ ] Create load testing script with `locust` or `wrk`
- [ ] Test 100 concurrent requests
- [ ] Measure request latency (p50, p95, p99)
- [ ] Measure TTS generation time
- [ ] Measure audio playback latency
- [ ] Measure memory usage under load
- [ ] Document performance characteristics
#### 6.4 System Tests
- [ ] Test on target Linux environment (Nobara/Fedora 42)
- [ ] Test with different audio devices
- [ ] Test with PulseAudio and ALSA
- [ ] Test headphone disconnect/reconnect
- [ ] Test system resource exhaustion scenarios
- [ ] Test server restart recovery
- [ ] Test long-running stability (24+ hours)
### Phase 7: Documentation & Deployment (Days 6-7)
#### 7.1 Documentation
- [ ] Create comprehensive README.md:
- [ ] Project overview
- [ ] Installation instructions
- [ ] Configuration options
- [ ] Usage examples
- [ ] API documentation
- [ ] Troubleshooting guide
- [ ] Create CONTRIBUTING.md (if open source)
- [ ] Create CHANGELOG.md
- [ ] Document voice model installation
- [ ] Create architecture diagrams
- [ ] Add inline code documentation
- [ ] Create example client scripts (curl, Python)
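An example Python client using only the standard library (the endpoint shape follows the API spec above; `base_url` is whatever host/port the server is configured with):

```python
import json
import urllib.request

def build_notify_request(message: str, rate: int = 170,
                         base_url: str = "http://localhost:8888") -> urllib.request.Request:
    """Build a POST /notify request with a JSON body."""
    body = json.dumps({"message": message, "rate": rate}).encode()
    return urllib.request.Request(
        f"{base_url}/notify",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def notify(message: str, rate: int = 170) -> dict:
    """Send the request and return the decoded JSON response."""
    req = build_notify_request(message, rate)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())
```

The curl equivalent is `curl -X POST http://localhost:8888/notify -H 'Content-Type: application/json' -d '{"message": "Build finished"}'`.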
#### 7.2 Deployment Preparation
- [ ] Create systemd service file (`voice-server.service`)
- [ ] Test systemd service installation
- [ ] Test automatic restart on failure
- [ ] Create deployment script (`deploy.sh`)
- [ ] Document deployment process
- [ ] Create backup/restore procedures
- [ ] Test upgrade procedure
#### 7.3 Production Hardening
- [ ] Enable production logging (disable debug logs)
- [ ] Configure log rotation
- [ ] Set up monitoring (optional: Prometheus, Grafana)
- [ ] Implement graceful shutdown (SIGTERM handling)
- [ ] Test crash recovery
- [ ] Implement rate limiting (optional)
- [ ] Security audit (input sanitization, resource limits)
- [ ] Performance tuning (queue size, worker count)
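Graceful shutdown can be sketched as a queue drain with a grace period, run from the lifespan shutdown phase (uvicorn translates SIGTERM into that phase; the grace period value is an assumption):

```python
import asyncio

async def drain_queue(queue: asyncio.Queue, grace_seconds: float = 10.0) -> bool:
    """Give in-flight requests a grace period to finish before shutdown.

    Returns True if the queue emptied in time, False if remaining items
    had to be discarded so shutdown could proceed.
    """
    try:
        await asyncio.wait_for(queue.join(), timeout=grace_seconds)
        return True
    except asyncio.TimeoutError:
        # Grace period expired; drop what's left rather than block shutdown
        while not queue.empty():
            queue.get_nowait()
            queue.task_done()
        return False
```

The return value is worth logging: repeated `False` results on shutdown suggest the grace period or queue size needs tuning.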
---
## Testing Strategy
### Unit Testing
**Framework:** pytest with pytest-asyncio
**Test Coverage Requirements:**
- Minimum 80% code coverage
- 100% coverage for critical paths (TTS, audio playback)
- All error handlers must have tests
**Test Structure:**
```
tests/
├── __init__.py
├── conftest.py # Shared fixtures
├── unit/
│ ├── test_config.py # Configuration loading tests
│ ├── test_models.py # Pydantic model tests
│ ├── test_tts_engine.py # TTS engine tests
│ ├── test_audio_player.py # Audio player tests
│ └── test_queue.py # Queue manager tests
├── integration/
│ ├── test_api.py # API endpoint tests
│ ├── test_end_to_end.py # Full request flow tests
│ └── test_errors.py # Error scenario tests
└── performance/
└── test_load.py # Load testing
```
**Sample Unit Test:**
```python
# tests/unit/test_tts_engine.py
import pytest
import numpy as np
from app.tts_engine import PiperTTSEngine
@pytest.fixture
def tts_engine():
"""Create TTS engine instance."""
return PiperTTSEngine(model_dir="models/")
def test_tts_engine_initialization(tts_engine):
"""Test TTS engine initializes successfully."""
assert tts_engine is not None
assert tts_engine.default_voice == "en_US-lessac-medium"
def test_text_to_audio_conversion(tts_engine):
"""Test converting text to audio."""
audio = tts_engine.synthesize("Hello world")
assert audio is not None
assert len(audio) > 0
assert audio.dtype == np.float32
def test_invalid_voice_raises_error(tts_engine):
"""Test that invalid voice raises appropriate error."""
with pytest.raises(ValueError, match="Voice model .* not found"):
tts_engine.synthesize("Hello", voice="invalid_voice")
@pytest.mark.asyncio
async def test_async_synthesis(tts_engine):
"""Test async TTS synthesis."""
audio = await tts_engine.synthesize_async("Hello world")
assert audio is not None
```
**Sample Integration Test:**
```python
# tests/integration/test_api.py
import pytest
from fastapi.testclient import TestClient
from app.main import app
@pytest.fixture
def client():
"""Create test client."""
return TestClient(app)
def test_notify_endpoint_success(client):
"""Test successful /notify request."""
response = client.post(
"/notify",
json={"message": "Test message", "rate": 180}
)
assert response.status_code == 202
data = response.json()
assert data["status"] == "queued"
assert data["message_length"] == 12
def test_notify_endpoint_validation_error(client):
"""Test /notify with invalid parameters."""
response = client.post(
"/notify",
json={"message": "", "rate": 1000} # Empty message, invalid rate
)
assert response.status_code == 422
def test_health_endpoint(client):
"""Test /health endpoint."""
response = client.get("/health")
assert response.status_code == 200
data = response.json()
assert "status" in data
assert "queue_size" in data
```
---
### Load Testing
**Tool:** wrk or locust
**Sample wrk Test:**
```bash
# Install wrk
sudo dnf install wrk
# Run load test: 100 concurrent connections, 30 seconds
wrk -t4 -c100 -d30s -s post.lua http://localhost:8888/notify
# post.lua script:
# wrk.method = "POST"
# wrk.headers["Content-Type"] = "application/json"
# wrk.body = '{"message": "Load test message"}'
```
**Sample locust Test:**
```python
# locustfile.py
from locust import HttpUser, task, between
class VoiceServerUser(HttpUser):
wait_time = between(1, 3)
@task
def notify(self):
self.client.post("/notify", json={
"message": "This is a load test message",
"rate": 180
})
@task(5)
def health_check(self):
self.client.get("/health")
# Run: locust -f locustfile.py --host=http://localhost:8888
```
**Performance Benchmarks:**
| Metric | Target | Acceptable | Unacceptable |
|--------|--------|------------|--------------|
| API Response Time (p95) | < 50ms | < 100ms | > 200ms |
| TTS Generation (500 chars) | < 2s | < 5s | > 10s |
| Requests/Second | > 50 | > 20 | < 10 |
| Memory Usage (idle) | < 200MB | < 500MB | > 1GB |
| Memory Usage (load) | < 500MB | < 1GB | > 2GB |
| Queue Processing Rate | > 10/s | > 5/s | < 2/s |
---
### Manual Testing Checklist
**Functional Testing:**
- [ ] Send POST request with valid message → Hear audio playback
- [ ] Send request with long text (5000 chars) → Successful playback
- [ ] Send request with special characters → Successful sanitization
- [ ] Send request with invalid voice → Receive 422 error
- [ ] Send request with rate=50 → Slow speech playback
- [ ] Send request with rate=400 → Fast speech playback
- [ ] Send 10 concurrent requests → All play sequentially
- [ ] Fill queue to capacity → Receive 503 error
- [ ] Check /health endpoint → Receive status information
- [ ] Check /voices endpoint → See available voice models
- [ ] Check /docs endpoint → See Swagger documentation
**Error Scenario Testing:**
- [ ] Unplug headphones during playback → Graceful error handling
- [ ] Kill PulseAudio daemon → Audio error logged, server continues
- [ ] Send malformed JSON → Receive 422 error
- [ ] Send empty message → Receive 422 error
- [ ] Send 11,000 character message → Receive 422 error (exceeds 10,000-character limit)
- [ ] Restart server during playback → Queue cleared, server restarts
**System Testing:**
- [ ] Run server for 24 hours → No memory leaks
- [ ] Send 10,000 requests → All processed successfully
- [ ] Monitor CPU usage during load → < 50% average
- [ ] Monitor memory usage during load → < 1GB
- [ ] Test on Fedora 42 → Successful operation
- [ ] Test with ALSA (without PulseAudio) → Successful operation
---
## Future Considerations
### Optional Features (Post-v1.0)
#### 1. Advanced Voice Control
- **Pitch adjustment:** Allow clients to specify pitch modification
- **Volume control:** Per-request volume settings
- **Emotion/tone control:** Happy, sad, angry voice modulation (if TTS engine supports)
- **Voice cloning:** Custom voice model training (Coqui TTS integration)
**Implementation Complexity:** Medium
**User Value:** High for accessibility and personalization
---
#### 2. Audio Format Options
- **Output format selection:** Support WAV, MP3, OGG output
- **Sample rate options:** Allow 16kHz, 22kHz, 44.1kHz selection
- **Compression levels:** Configurable audio quality vs file size
**Implementation Complexity:** Low
**User Value:** Medium (mostly for file storage use cases)
---
#### 3. Streaming Audio
- **Real-time streaming:** Stream audio as it's generated (WebSocket or SSE)
- **Chunked TTS:** Generate and stream long texts in chunks
- **Lower latency:** Start playback before full text is synthesized
**Implementation Complexity:** High
**User Value:** High for very long texts
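The chunked approach can be sketched as sentence-boundary splitting, so each chunk can be synthesized and played while the next is still generating (a sketch; the chunk size and boundary regex are tunable assumptions):

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list:
    """Split text at sentence boundaries into chunks of roughly max_chars.

    A single sentence longer than max_chars becomes its own oversized
    chunk rather than being split mid-sentence.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Splitting at sentence boundaries matters for quality: Piper's prosody sounds natural per sentence, whereas cutting mid-sentence produces audible seams.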
---
#### 4. SSML Support
- **Prosody control:** Fine-grained control over speech characteristics
- **Break insertion:** Explicit pauses and timing control
- **Phoneme specification:** Correct pronunciation for unusual words
- **Multi-voice support:** Different voices within single text
**Example:**
```xml
<speak>
Hello, <break time="500ms"/> this is <emphasis>important</emphasis>.
<voice name="en_US-libritts">A different voice.</voice>
</speak>
```
**Implementation Complexity:** Medium
**User Value:** High for advanced use cases
---
#### 5. Caching Layer
- **TTS result caching:** Cache frequently requested texts
- **Cache invalidation:** LRU eviction policy
- **Cache persistence:** Store cache across restarts
- **Cache statistics:** Hit rate monitoring
**Implementation Complexity:** Low
**User Value:** High for repeated texts (notifications, alerts)
**Sample Implementation:**
```python
from functools import lru_cache
import hashlib
class TTSCache:
def __init__(self, max_size: int = 1000):
self.cache = {}
self.max_size = max_size
def get_cache_key(self, text: str, voice: str, rate: int) -> str:
"""Generate cache key from TTS parameters."""
content = f"{text}|{voice}|{rate}"
return hashlib.sha256(content.encode()).hexdigest()
def get(self, text: str, voice: str, rate: int):
"""Retrieve cached audio."""
key = self.get_cache_key(text, voice, rate)
return self.cache.get(key)
def put(self, text: str, voice: str, rate: int, audio_data):
"""Store audio in cache with LRU eviction."""
if len(self.cache) >= self.max_size:
# Evict oldest entry (simple FIFO, use OrderedDict for true LRU)
self.cache.pop(next(iter(self.cache)))
key = self.get_cache_key(text, voice, rate)
self.cache[key] = audio_data
```
---
#### 6. Multi-Language Support
- **Automatic language detection:** Detect input language
- **Language-specific voice selection:** Match voice to detected language
- **Mixed-language support:** Handle multilingual texts
**Implementation Complexity:** Medium
**User Value:** High for international users
---
#### 7. Audio Effects
- **Reverb:** Add spatial audio effects
- **Echo:** Add echo effects
- **Speed adjustment:** Time-stretch without pitch change
- **Normalization:** Automatic volume leveling
**Implementation Complexity:** Medium (requires audio processing library like `pydub` or `librosa`)
**User Value:** Medium (aesthetic enhancement)
---
#### 8. Queue Priority System
- **Priority levels:** High, normal, low priority requests
- **Priority queues:** Separate queues for different priorities
- **Preemption:** Allow high-priority requests to interrupt low-priority
**Implementation Complexity:** Medium
**User Value:** Medium for multi-tenant scenarios
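A sketch with `asyncio.PriorityQueue`, where a monotonic counter preserves FIFO order within a priority level (the priority names mirror the list above; everything else is illustrative):

```python
import asyncio
import itertools

# Lower number = higher priority; the counter breaks ties in arrival order
PRIORITIES = {"high": 0, "normal": 1, "low": 2}
_counter = itertools.count()

async def enqueue(queue: asyncio.PriorityQueue, request: dict,
                  priority: str = "normal") -> None:
    """Add a request with a priority tag; unknown priorities raise KeyError."""
    await queue.put((PRIORITIES[priority], next(_counter), request))

async def dequeue(queue: asyncio.PriorityQueue) -> dict:
    """Pop the highest-priority (then oldest) request."""
    _, _, request = await queue.get()
    return request
```

The counter is essential: without it, two requests at the same priority would be compared by their dict payloads, which raises `TypeError`.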
---
#### 9. Webhook Notifications
- **Completion webhooks:** Notify external service when TTS completes
- **Error webhooks:** Notify on TTS failures
- **Webhook retry logic:** Handle webhook delivery failures
**Example Request:**
```json
{
"message": "Hello world",
"webhook_url": "https://example.com/tts-complete"
}
```
**Implementation Complexity:** Low
**User Value:** High for integration scenarios
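A webhook sender with simple retry logic could look like this (a sketch using only the standard library; the injectable `opener` parameter is an assumption added to make the function testable without a network):

```python
import json
import time
import urllib.request

def send_webhook(url: str, payload: dict, attempts: int = 3,
                 delay: float = 0.5, opener=None) -> bool:
    """POST a JSON payload, retrying on failure; returns True on success.

    `opener` defaults to urllib's urlopen and can be replaced in tests.
    """
    opener = opener or urllib.request.urlopen
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    for attempt in range(attempts):
        try:
            with opener(req, timeout=5):
                return True
        except OSError:
            if attempt < attempts - 1:
                time.sleep(delay * (attempt + 1))  # simple linear backoff
    return False
```

Webhook delivery should run off the hot path (e.g. as a fire-and-forget task after playback completes) so a slow endpoint cannot stall the queue.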
---
#### 10. Authentication & Authorization
- **API key authentication:** Secure endpoint access
- **Rate limiting:** Per-user request limits
- **Usage quotas:** Daily/monthly request quotas
- **Multi-tenant support:** Isolated queues per user
**Implementation Complexity:** High
**User Value:** High for shared/production deployments
---
#### 11. Web Interface
- **Simple web UI:** Browser-based TTS interface
- **Queue visualization:** Real-time queue status display
- **Voice model management:** Upload/download voice models via UI
- **Settings configuration:** Web-based configuration editor
**Implementation Complexity:** Medium
**User Value:** High for non-technical users
---
#### 12. Docker Deployment
- **Dockerfile:** Container image for easy deployment
- **Docker Compose:** Multi-container setup with monitoring
- **Volume management:** Persistent voice model storage
- **Health check integration:** Container health monitoring
**Sample Dockerfile:**
```dockerfile
FROM python:3.11-slim
# Install system dependencies (PortAudio for sounddevice)
RUN apt-get update && apt-get install -y \
libportaudio2 \
portaudio19-dev \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application
COPY app/ ./app/
# Download the default voice model at build time (the exact command depends on
# the installed piper-tts version; alternatively, mount a models/ volume at runtime)
RUN python -m piper.download_voices en_US-lessac-medium
EXPOSE 8888
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8888"]
```
**Implementation Complexity:** Low
**User Value:** High for deployment consistency
---
#### 13. Metrics & Monitoring
- **Prometheus metrics:** Request count, latency, queue size
- **Grafana dashboards:** Visual monitoring
- **Alerting:** Notify on errors, high queue size, etc.
- **Performance profiling:** Identify bottlenecks
**Sample Metrics:**
```python
from prometheus_client import Counter, Histogram, Gauge
request_counter = Counter('tts_requests_total', 'Total TTS requests')
latency_histogram = Histogram('tts_latency_seconds', 'TTS latency')
queue_size_gauge = Gauge('tts_queue_size', 'Current queue size')
@app.post("/notify")
async def notify(request: NotifyRequest):
request_counter.inc()
with latency_histogram.time():
# Process request
...
queue_size_gauge.set(tts_queue.qsize())
```
**Implementation Complexity:** Medium
**User Value:** High for production deployments
---
### Scalability Considerations
**Horizontal Scaling:**
- Use Redis for shared queue across multiple server instances
- Implement distributed locking for audio device access
- Load balance requests across multiple servers
**Vertical Scaling:**
- Increase queue size for higher throughput
- Use GPU acceleration for TTS (CUDA support in Piper)
- Optimize voice model loading (keep models in memory)
**Architecture Evolution:**
- Separate TTS generation and audio playback into microservices
- Use message queue (RabbitMQ, Kafka) for request distribution
- Implement worker pool for parallel TTS generation
---
## Appendix: References
### Technical Documentation
- [FastAPI Official Documentation](https://fastapi.tiangolo.com/)
- [Piper TTS GitHub Repository](https://github.com/rhasspy/piper)
- [python-sounddevice Documentation](https://python-sounddevice.readthedocs.io/)
- [Uvicorn Documentation](https://www.uvicorn.org/)
### Research & Comparisons
- [FastAPI vs Flask Performance Comparison - Strapi](https://strapi.io/blog/fastapi-vs-flask-python-framework-comparison)
- [Flask vs FastAPI - Better Stack](https://betterstack.com/community/guides/scaling-python/flask-vs-fastapi/)
- [Python TTS Engines Comparison - Smallest AI](https://smallest.ai/blog/python-packages-realistic-text-to-speech)
- [TTS Converters for Raspberry Pi - Circuit Digest](https://circuitdigest.com/microcontroller-projects/best-text-to-speech-tts-converter-for-raspberry-pi-espeak-festival-google-tts-pico-and-pyttsx3)
- [Piper TTS Tutorial - RMauro Dev](https://rmauro.dev/how-to-run-piper-tts-on-your-raspberry-pi-offline-voice-zero-internet-needed/)
- [Python Audio Playback - simpleaudio Docs](https://simpleaudio.readthedocs.io/)
### Tools & Libraries
- [uv - Fast Python Package Manager](https://github.com/astral-sh/uv)
- [pytest - Testing Framework](https://docs.pytest.org/)
- [locust - Load Testing](https://locust.io/)
---
## Document History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2025-12-18 | Atlas | Initial PRD creation |
---
**Document Status:** Complete - Ready for Implementation
**Next Steps:**
1. Review PRD with stakeholders
2. Approve technical stack decisions
3. Begin Phase 1 implementation
4. Set up project tracking (GitHub Issues, Jira, etc.)
5. Assign development resources
**Questions or Feedback:** Contact Atlas at [atlas@manticorum.com]