# Product Requirements Document: Local Voice Server
**Version:** 1.0

**Date:** 2025-12-18

**Author:** Atlas (Principal Software Architect)

**Project:** Local HTTP Voice Server for Text-to-Speech

---

## Table of Contents

1. [Executive Summary](#executive-summary)
2. [Goals and Non-Goals](#goals-and-non-goals)
3. [Technical Requirements](#technical-requirements)
4. [System Architecture](#system-architecture)
5. [API Specification](#api-specification)
6. [TTS Engine Analysis](#tts-engine-analysis)
7. [Web Framework Selection](#web-framework-selection)
8. [Audio Playback Strategy](#audio-playback-strategy)
9. [Error Handling Strategy](#error-handling-strategy)
10. [Implementation Checklist](#implementation-checklist)
11. [Testing Strategy](#testing-strategy)
12. [Future Considerations](#future-considerations)

---
## Executive Summary
### Project Overview

This project delivers a local HTTP service that accepts POST requests containing text strings and converts them to speech through the computer's speakers. The service will run locally on Linux (Nobara/Fedora 42), providing fast, offline text-to-speech capabilities without requiring external API calls or internet connectivity.

### Success Metrics

- **Response Time:** TTS conversion and playback initiation within 200ms for short texts (< 100 characters)
- **Reliability:** 99.9% successful request handling under normal operating conditions
- **Concurrency:** Support for at least 5 concurrent TTS requests with proper queuing
- **Audio Quality:** Clear, intelligible speech output comparable to Google TTS quality
- **Startup Time:** Server ready to accept requests within 2 seconds of launch

### Technical Stack

| Component | Technology | Justification |
|-----------|-----------|---------------|
| Web Framework | FastAPI | Async support, high performance (15k-20k req/s), automatic API documentation |
| TTS Engine | Piper TTS | Neural voice quality, offline, optimized for local inference, ONNX-based |
| Audio Playback | sounddevice | Cross-platform, Pythonic API, excellent NumPy integration, non-blocking playback |
| Package Manager | uv | Fast Python package management (user preference) |
| ASGI Server | Uvicorn | High-performance ASGI server, native FastAPI integration |
| Async Runtime | asyncio | Built-in Python async support for concurrent request handling |

### Timeline Estimate

- **Phase 1 - Core Implementation:** 2-3 days (basic HTTP server + TTS integration)
- **Phase 2 - Error Handling & Testing:** 1-2 days (comprehensive error handling, unit tests)
- **Phase 3 - Concurrency & Queue Management:** 1-2 days (async queue, concurrent playback)
- **Total Estimated Time:** 4-7 days for production-ready v1.0

### Resource Requirements

- **Development:** 1 full-stack Python developer with async programming experience
- **Testing:** Access to Linux environment (Nobara/Fedora 42) with audio hardware
- **Infrastructure:** Local development machine with 2+ CPU cores, 4GB+ RAM

---
## Goals and Non-Goals

### Goals

**Primary Goals:**

1. Create a local HTTP service that accepts text via POST requests
2. Convert text to speech using high-quality offline TTS
3. Play audio through system speakers with minimal latency
4. Support concurrent requests with proper queue management
5. Provide comprehensive error handling and logging
6. Maintain zero external dependencies (fully offline capable)

**Secondary Goals:**

1. Automatic API documentation via FastAPI's built-in OpenAPI support
2. Configurable TTS parameters (voice, speed, volume) via request parameters
3. Health check endpoint for service monitoring
4. Graceful handling of long-running text conversions
5. Support for multiple voice models

### Non-Goals

**Explicitly Out of Scope:**

1. Cloud-based or external API integration
2. Speech-to-text (STT) capabilities
3. Audio file storage or retrieval
4. User authentication or authorization
5. Rate limiting or quota management
6. Multi-language UI or web interface
7. Real-time streaming audio synthesis
8. Mobile app integration
9. Persistent audio history or logging
10. Advanced audio effects (reverb, pitch shifting, etc.)

---
## Technical Requirements

### Functional Requirements

#### FR1: HTTP Server

- **FR1.1:** Server SHALL listen on configurable host and port (default: `0.0.0.0:8888`)
- **FR1.2:** Server SHALL accept POST requests to `/notify` endpoint
- **FR1.3:** Server SHALL accept JSON payload with `message` field containing text
- **FR1.4:** Server SHALL return HTTP 202 (Accepted) with queue confirmation
- **FR1.5:** Server SHALL support CORS for local development

#### FR2: Text-to-Speech Conversion

- **FR2.1:** System SHALL convert text strings to audio using Piper TTS
- **FR2.2:** System SHALL support configurable voice models via request parameters
- **FR2.3:** System SHALL support adjustable speech rate (50-400 words per minute)
- **FR2.4:** System SHALL handle text inputs from 1 to 10,000 characters
- **FR2.5:** System SHALL use default voice if not specified in request

#### FR3: Audio Playback

- **FR3.1:** System SHALL play generated audio through default system audio output
- **FR3.2:** System SHALL support non-blocking audio playback
- **FR3.3:** System SHALL queue concurrent requests in FIFO order
- **FR3.4:** System SHALL allow configurable maximum queue size (default: 50)
- **FR3.5:** System SHALL provide feedback when queue is full

#### FR4: Configuration

- **FR4.1:** System SHALL support configuration via environment variables
- **FR4.2:** System SHALL support configuration via command-line arguments
- **FR4.3:** System SHALL provide sensible defaults for all configuration values
- **FR4.4:** System SHALL validate configuration at startup
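FR4.1-FR4.4 can be met with a small loader that reads the environment, applies defaults, and fails fast on bad values. The sketch below assumes variable names (`VOICE_SERVER_HOST`, `VOICE_SERVER_PORT`, `VOICE_SERVER_QUEUE_SIZE`) that the spec does not fix:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    host: str = "0.0.0.0"
    port: int = 8888
    max_queue_size: int = 50


def load_config(env: dict) -> Config:
    """Read settings from an environment mapping, falling back to defaults (FR4.3)."""
    cfg = Config(
        host=env.get("VOICE_SERVER_HOST", "0.0.0.0"),
        port=int(env.get("VOICE_SERVER_PORT", "8888")),
        max_queue_size=int(env.get("VOICE_SERVER_QUEUE_SIZE", "50")),
    )
    # Validate at startup so bad settings fail immediately (FR4.4)
    if not 1 <= cfg.port <= 65535:
        raise ValueError(f"port out of range: {cfg.port}")
    if cfg.max_queue_size < 1:
        raise ValueError(f"queue size must be positive: {cfg.max_queue_size}")
    return cfg


print(load_config({"VOICE_SERVER_PORT": "9000"}))
```

In the real server this would read `os.environ`, with command-line flags (FR4.2) layered on top via `argparse`.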

#### FR5: Error Handling

- **FR5.1:** System SHALL return appropriate HTTP error codes for failures
- **FR5.2:** System SHALL log all errors with timestamps and context
- **FR5.3:** System SHALL continue operating after non-fatal errors
- **FR5.4:** System SHALL gracefully handle TTS engine failures
- **FR5.5:** System SHALL provide detailed error messages in responses

### Non-Functional Requirements

#### NFR1: Performance

- **NFR1.1:** API response time SHALL be < 50ms (excluding TTS processing)
- **NFR1.2:** TTS conversion SHALL complete in < 2 seconds for 500 character texts
- **NFR1.3:** System SHALL handle 20+ requests per second without degradation
- **NFR1.4:** Memory usage SHALL remain < 500MB under normal load
- **NFR1.5:** CPU usage SHALL average < 30% during active TTS processing

#### NFR2: Reliability

- **NFR2.1:** System SHALL maintain 99.9% uptime during operation
- **NFR2.2:** System SHALL recover from audio device disconnections
- **NFR2.3:** System SHALL handle Out-of-Memory conditions gracefully
- **NFR2.4:** System SHALL log all critical errors for debugging

#### NFR3: Maintainability

- **NFR3.1:** Code SHALL maintain > 80% test coverage
- **NFR3.2:** All functions SHALL include docstrings with type hints
- **NFR3.3:** Code SHALL follow PEP 8 style guidelines
- **NFR3.4:** Dependencies SHALL be pinned to specific versions
- **NFR3.5:** README SHALL provide clear setup and usage instructions

#### NFR4: Security

- **NFR4.1:** System SHALL sanitize all text inputs to prevent injection attacks
- **NFR4.2:** System SHALL limit request payload size to 1MB
- **NFR4.3:** System SHALL not expose internal stack traces in API responses
- **NFR4.4:** System SHALL log all incoming requests for audit purposes

#### NFR5: Compatibility

- **NFR5.1:** System SHALL run on Linux (Nobara/Fedora 42)
- **NFR5.2:** System SHALL support Python 3.9+
- **NFR5.3:** System SHALL work with standard ALSA/PulseAudio setups
- **NFR5.4:** System SHALL be deployable as a systemd service

---
## System Architecture

### High-Level Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                       Client Applications                       │
│              (AI Agents, Scripts, Other Services)               │
└────────────────────────────┬────────────────────────────────────┘
                             │ HTTP POST /notify
                             │ JSON: {"message": "text"}
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                       FastAPI Web Server                        │
│   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│   │   /notify    │  │   /health    │  │    /docs     │          │
│   │   endpoint   │  │   endpoint   │  │  (Swagger)   │          │
│   └──────┬───────┘  └──────────────┘  └──────────────┘          │
│          │                                                      │
│          │ Validates & Enqueues                                 │
│          ▼                                                      │
│   ┌──────────────────────────────────────────────────┐          │
│   │              Async Request Queue                 │          │
│   │       (asyncio.Queue with max size limit)        │          │
│   └────────────────┬┬────────────────────────────────┘          │
└────────────────────┼┼───────────────────────────────────────────┘
                     ││
                     ││ Background Task Processing
                     ▼▼
┌─────────────────────────────────────────────────────────────────┐
│                      TTS Processing Layer                       │
│   ┌────────────────────────────────────────────────────┐        │
│   │                 Piper TTS Engine                   │        │
│   │   ┌──────────────┐        ┌──────────────┐         │        │
│   │   │ Voice Models │        │ ONNX Runtime │         │        │
│   │   │  (.onnx +    │        │  Inference   │         │        │
│   │   │   .json)     │        │   Engine     │         │        │
│   │   └──────────────┘        └──────────────┘         │        │
│   └─────────────────────────┬──────────────────────────┘        │
│                             │ Generate WAV                      │
│                             ▼                                   │
│   ┌────────────────────────────────────────────────────┐        │
│   │             In-Memory Audio Buffer                 │        │
│   │             (NumPy array / bytes)                  │        │
│   └─────────────────────────┬──────────────────────────┘        │
└─────────────────────────────┼───────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Audio Playback Layer                       │
│   ┌────────────────────────────────────────────────────┐        │
│   │             sounddevice Stream Manager             │        │
│   │   - Non-blocking sd.play() playback                │        │
│   │   - Background-thread operation                    │        │
│   │   - Stream lifecycle management                    │        │
│   └─────────────────────────┬──────────────────────────┘        │
│                             │                                   │
│                             ▼                                   │
│   ┌────────────────────────────────────────────────────┐        │
│   │       System Audio Output (ALSA/PulseAudio)        │        │
│   └────────────────────────────────────────────────────┘        │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
                    🔊 Computer Speakers
```

### Component Descriptions

#### 1. FastAPI Web Server

- **Responsibilities:**
  - Accept and validate HTTP POST requests
  - Provide automatic OpenAPI documentation
  - Handle CORS configuration
  - Route requests to appropriate handlers
  - Return HTTP responses with appropriate status codes

- **Dependencies:**
  - FastAPI framework
  - Uvicorn ASGI server
  - Pydantic for request/response validation

#### 2. Async Request Queue

- **Responsibilities:**
  - Queue incoming TTS requests in FIFO order
  - Prevent queue overflow with configurable max size
  - Enable asynchronous processing without blocking HTTP responses
  - Provide queue status information

- **Implementation:**
  - `asyncio.Queue` for async-safe queuing
  - Background task workers to process queue
  - Queue metrics (size, processed count, errors)

#### 3. TTS Processing Layer

- **Responsibilities:**
  - Load and manage Piper TTS voice models
  - Convert text to audio waveforms
  - Handle voice model selection
  - Configure TTS parameters (rate, pitch, volume)
  - Generate in-memory audio buffers

- **Implementation:**
  - Piper TTS Python bindings
  - ONNX Runtime for model inference
  - Voice model caching for performance
  - Error handling for model loading failures

#### 4. Audio Playback Layer

- **Responsibilities:**
  - Initialize audio output streams
  - Play audio buffers through system speakers
  - Support non-blocking playback
  - Handle audio device errors
  - Manage stream lifecycle

- **Implementation:**
  - sounddevice for cross-platform audio I/O
  - Non-blocking `sd.play()` with background playback
  - Simple NumPy array integration
  - Graceful handling of audio device disconnections
### Data Flow

**Request Processing Flow:**

1. **HTTP Request Reception:**
   - Client sends POST to `/notify` with JSON payload
   - FastAPI validates request schema using Pydantic models
   - Request is immediately acknowledged with HTTP 202 (Accepted)

2. **Request Enqueueing:**
   - Validated request is added to async queue
   - If queue is full, return HTTP 503 (Service Unavailable)
   - Queue position is logged for monitoring

3. **Background Processing:**
   - Background worker retrieves request from queue
   - Text is passed to Piper TTS for conversion
   - Piper generates WAV audio in memory

4. **Audio Playback:**
   - Audio buffer is passed to sounddevice
   - sounddevice streams audio to system output
   - Playback occurs in a background thread (non-blocking)
   - Completion is logged

5. **Error Handling:**
   - Errors at any stage are caught and logged
   - Failed requests are removed from queue
   - Error metrics are updated
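
The background-processing and error-handling steps above can be sketched with asyncio alone; `fake_tts` stands in for the Piper synthesis and sounddevice playback calls, so only the queue and error-recovery logic is shown:

```python
import asyncio


async def worker(queue: asyncio.Queue, synthesize_and_play, metrics: dict):
    """Drain the queue in FIFO order; one failure never kills the worker (step 5)."""
    while True:
        message = await queue.get()
        try:
            await synthesize_and_play(message)
            metrics["processed"] += 1
        except Exception as exc:  # log and keep serving (FR5.3)
            metrics["errors"] += 1
            print(f"TTS failed for {message!r}: {exc}")
        finally:
            queue.task_done()


async def demo():
    metrics = {"processed": 0, "errors": 0}
    queue: asyncio.Queue = asyncio.Queue(maxsize=50)

    async def fake_tts(text: str):
        if not text:
            raise ValueError("empty message")
        await asyncio.sleep(0)  # stands in for synthesis + playback

    task = asyncio.create_task(worker(queue, fake_tts, metrics))
    for msg in ["hello", "", "world"]:
        queue.put_nowait(msg)
    await queue.join()  # wait until every item is processed
    task.cancel()
    return metrics


metrics = asyncio.run(demo())
print(metrics)  # → {'processed': 2, 'errors': 1}
```

Note the `finally: queue.task_done()` pairing, which keeps `queue.join()` accurate even when a request fails.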

### Technology Stack Justification

#### FastAPI vs Flask

**Decision: FastAPI**

**Rationale:**

- **Performance:** FastAPI handles 15,000-20,000 req/s vs Flask's 2,000-3,000 req/s ([Strapi Comparison](https://strapi.io/blog/fastapi-vs-flask-python-framework-comparison))
- **Async Native:** Built on ASGI with native async/await support, critical for non-blocking TTS processing
- **Type Safety:** Pydantic integration provides automatic request validation and serialization
- **Documentation:** Automatic OpenAPI (Swagger) documentation generation
- **Modern Architecture:** Designed for microservices and high-concurrency applications
- **Growing Adoption:** 78k GitHub stars, 38% developer adoption in 2025 (40% YoY increase)

**Trade-offs:**

- Steeper learning curve compared to Flask
- Smaller ecosystem of extensions (though growing rapidly)
- Requires ASGI server (Uvicorn) vs Flask's built-in development server

#### Piper TTS Engine Selection

**Decision: Piper TTS**

**Rationale:**

- **Voice Quality:** Neural TTS with "Google TTS level quality" ([AntiX Forum](https://www.antixforum.com/forums/topic/tts-text-to-speech-in-linux-piper/))
- **Offline Operation:** Fully local, no internet required
- **Performance:** Optimized for local inference using ONNX Runtime
- **Resource Efficiency:** Runs on Raspberry Pi 4, suitable for desktop Linux
- **Easy Installation:** Available via pip (`pip install piper-tts`)
- **Active Development:** Maintained project with 2025 updates
- **Multiple Voices:** Extensive voice model library with quality/speed trade-offs

**Comparison with Alternatives:**

| Engine | Voice Quality | Speed | Resource Usage | Offline | Ease of Use |
|--------|---------------|-------|----------------|---------|-------------|
| **Piper TTS** | ⭐⭐⭐⭐⭐ Neural | ⭐⭐⭐⭐ Fast | ⭐⭐⭐⭐ Medium | ✅ Yes | ⭐⭐⭐⭐ Easy |
| pyttsx3 | ⭐⭐ Robotic | ⭐⭐⭐⭐⭐ Very Fast | ⭐⭐⭐⭐⭐ Very Low | ✅ Yes | ⭐⭐⭐⭐⭐ Very Easy |
| eSpeak | ⭐⭐ Robotic | ⭐⭐⭐⭐⭐ Very Fast | ⭐⭐⭐⭐⭐ Very Low | ✅ Yes | ⭐⭐⭐⭐ Easy |
| gTTS | ⭐⭐⭐⭐⭐ Neural | ⭐⭐ Slow | ⭐⭐⭐⭐ Low | ❌ No | ⭐⭐⭐⭐⭐ Very Easy |
| Coqui TTS | ⭐⭐⭐⭐⭐ Neural | ⭐⭐⭐ Medium | ⭐⭐ High | ✅ Yes | ⭐⭐ Complex |

**Trade-offs:**

- Larger model files (~20-100MB per voice) vs simple engines
- Higher resource usage than pyttsx3/eSpeak
- Requires ONNX Runtime dependency

#### sounddevice for Audio Playback

**Decision: sounddevice**

**Rationale:**

- **Pythonic API:** Clean, intuitive interface that feels native to Python
- **NumPy Integration:** Direct support for NumPy arrays (perfect for Piper TTS output)
- **Non-Blocking:** Simple `sd.play()` returns immediately, audio plays in background
- **Cross-Platform:** Works on Linux, Windows, macOS via PortAudio backend
- **Active Maintenance:** Well-maintained with regular updates
- **Simple Async:** Easy integration with asyncio via `sd.wait()` or callbacks

**Comparison with Alternatives:**

| Library | Non-Blocking | Dependencies | Maintenance | Linux Support |
|---------|-------------|--------------|-------------|---------------|
| **sounddevice** | ✅ Native | PortAudio | ⭐⭐⭐⭐ Active | ✅ Excellent |
| PyAudio | ✅ Callbacks | PortAudio | ⭐⭐⭐ Active | ✅ Excellent |
| simpleaudio | ✅ Async | None | ❌ Archived | ⭐⭐⭐ Good |
| pygame | ⭐ Limited | SDL | ⭐⭐⭐⭐ Active | ⭐⭐⭐⭐ Excellent |

**Why sounddevice over PyAudio:**

- Simpler API - `sd.play(audio, samplerate)` vs PyAudio's stream setup
- Better NumPy support - no conversion needed from Piper's output
- More Pythonic - feels like a modern Python library
- Easier async integration - works naturally with asyncio

---

## API Specification

### Endpoint: POST /notify

**Description:** Accept text string and queue for TTS playback

**Request Schema:**

```json
{
  "message": "string (required)",
  "voice": "string (optional)",
  "rate": "integer (optional, default: 170)",
  "voice_enabled": "boolean (optional, default: true)"
}
```

**Request Parameters:**

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| `message` | string | Yes | - | 1-10000 chars | Text to convert to speech |
| `voice` | string | No | `en_US-lessac-medium` | Valid voice model name | Piper voice model to use |
| `rate` | integer | No | `170` | 50-400 | Speech rate in words per minute |
| `voice_enabled` | boolean | No | `true` | - | Enable/disable TTS (for debugging) |

**Example Request:**

```bash
curl -X POST http://localhost:8888/notify \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Hello, this is a test of the voice server",
    "rate": 200,
    "voice_enabled": true
  }'
```

**Response Schema (Success - 202 Accepted):**

```json
{
  "status": "queued",
  "message_length": 42,
  "queue_position": 3,
  "estimated_duration": 2.5,
  "voice_model": "en_US-lessac-medium"
}
```
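
The spec does not define how `estimated_duration` is derived; one plausible formula, shown here as an assumption rather than the required behavior, divides the word count by the requested words-per-minute rate:

```python
def estimated_duration(message: str, rate_wpm: int = 170) -> float:
    """Rough playback estimate in seconds: word count over words-per-minute."""
    words = max(1, len(message.split()))
    return round(words / rate_wpm * 60, 1)


print(estimated_duration("Hello, this is a test of the voice server", 200))  # → 2.7
```

A production server might instead measure the synthesized buffer length (`frames / sample_rate`), which is exact once Piper has run.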

**Response Schema (Error - 400 Bad Request):**

```json
{
  "error": "validation_error",
  "detail": "message field is required",
  "timestamp": "2025-12-18T10:30:45.123Z"
}
```

**Response Schema (Error - 503 Service Unavailable):**

```json
{
  "error": "queue_full",
  "detail": "TTS queue is full, please retry later",
  "queue_size": 50,
  "timestamp": "2025-12-18T10:30:45.123Z"
}
```

**HTTP Status Codes:**

| Code | Meaning | Scenario |
|------|---------|----------|
| 202 | Accepted | Request successfully queued for processing |
| 400 | Bad Request | Invalid request parameters or malformed JSON |
| 413 | Payload Too Large | Message exceeds 10,000 characters |
| 422 | Unprocessable Entity | Valid JSON but invalid parameter values |
| 500 | Internal Server Error | TTS engine failure or unexpected error |
| 503 | Service Unavailable | Queue is full or service is shutting down |
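
Of the codes above, only 503 is worth retrying from the client side. This sketch separates the retry policy from the transport so it can be exercised without a running server; `send` is any callable returning a status code, an assumption for illustration:

```python
import time

RETRYABLE = {503}  # queue full - transient by design


def post_with_retry(send, attempts: int = 3, base_delay: float = 0.2) -> int:
    """Call send() until a non-retryable status comes back, backing off exponentially."""
    for attempt in range(attempts):
        status = send()
        if status not in RETRYABLE:
            return status
        if attempt < attempts - 1:
            time.sleep(base_delay * (2 ** attempt))  # 0.2s, 0.4s, ...
    return status


# Simulate a queue that is full twice, then accepts the request
responses = iter([503, 503, 202])
print(post_with_retry(lambda: next(responses), base_delay=0))  # → 202
```

4xx responses are returned immediately: retrying a validation error cannot succeed.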

---

### Endpoint: GET /health

**Description:** Health check endpoint for monitoring

**Request:** No parameters

**Response Schema (Healthy - 200 OK):**

```json
{
  "status": "healthy",
  "uptime_seconds": 3600,
  "queue_size": 2,
  "queue_capacity": 50,
  "tts_engine": "piper",
  "audio_output": "available",
  "voice_models_loaded": ["en_US-lessac-medium"],
  "total_requests": 1523,
  "failed_requests": 12,
  "timestamp": "2025-12-18T10:30:45.123Z"
}
```

**Response Schema (Unhealthy - 503 Service Unavailable):**

```json
{
  "status": "unhealthy",
  "errors": [
    "Audio output device unavailable",
    "TTS engine failed to initialize"
  ],
  "timestamp": "2025-12-18T10:30:45.123Z"
}
```
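
The healthy/unhealthy decision above can be kept testable by assembling the payload in a pure helper, with the handler only gathering live state. The function name and parameters below are illustrative, not part of the spec:

```python
from datetime import datetime, timezone


def build_health(queue_size: int, queue_capacity: int,
                 audio_ok: bool, tts_ok: bool) -> tuple[int, dict]:
    """Return (http_status, body) in the shape of the /health responses."""
    errors = []
    if not audio_ok:
        errors.append("Audio output device unavailable")
    if not tts_ok:
        errors.append("TTS engine failed to initialize")
    ts = datetime.now(timezone.utc).isoformat()
    if errors:
        return 503, {"status": "unhealthy", "errors": errors, "timestamp": ts}
    return 200, {"status": "healthy", "queue_size": queue_size,
                 "queue_capacity": queue_capacity, "timestamp": ts}


status, body = build_health(2, 50, audio_ok=True, tts_ok=True)
print(status, body["status"])  # → 200 healthy
```

The FastAPI handler would call this with real probes (e.g. an audio-device query and the TTS engine's loaded flag) and set the response status code accordingly.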

---

### Endpoint: GET /docs

**Description:** Automatic Swagger UI documentation (provided by FastAPI)

**Access:** `http://localhost:8888/docs`

**Features:**

- Interactive API testing
- Schema visualization
- Request/response examples
- Authentication testing (if implemented)

---

### Endpoint: GET /voices

**Description:** List available TTS voice models

**Request:** No parameters

**Response Schema (200 OK):**

```json
{
  "voices": [
    {
      "name": "en_US-lessac-medium",
      "language": "en_US",
      "quality": "medium",
      "size_mb": 63.5,
      "installed": true
    },
    {
      "name": "en_US-libritts-high",
      "language": "en_US",
      "quality": "high",
      "size_mb": 108.2,
      "installed": false
    }
  ],
  "default_voice": "en_US-lessac-medium"
}
```
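
Whether a voice counts as `installed` can be decided by scanning the model directory for paired `.onnx`/`.onnx.json` files. The directory layout is an assumption here (Piper does not mandate one), and the demo uses a throwaway directory in place of the real model folder:

```python
import tempfile
from pathlib import Path


def installed_voices(model_dir: Path) -> list[dict]:
    """List voices that have an .onnx model, noting whether its .onnx.json config exists."""
    voices = []
    for onnx in sorted(model_dir.glob("*.onnx")):
        config = onnx.with_name(onnx.name + ".json")  # en_US-...onnx -> en_US-...onnx.json
        voices.append({
            "name": onnx.stem,
            "size_mb": round(onnx.stat().st_size / 1e6, 1),
            "installed": config.exists(),
        })
    return voices


with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "en_US-lessac-medium.onnx").write_bytes(b"\0" * 1000)
    (root / "en_US-lessac-medium.onnx.json").write_text("{}")
    print(installed_voices(root))
```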

---

## TTS Engine Analysis

### Detailed Comparison Matrix

| Engine | Voice Quality | Latency | CPU Usage | Memory | Offline | Linux Support | Python API | Maintenance |
|--------|---------------|---------|-----------|--------|---------|---------------|------------|-------------|
| **Piper TTS** | ⭐⭐⭐⭐⭐ | ~500ms | Medium | ~200MB | ✅ | ✅ Excellent | ✅ Native | 🟢 Active |
| **pyttsx3** | ⭐⭐ | ~100ms | Low | ~50MB | ✅ | ✅ Good | ✅ Native | 🟢 Active |
| **eSpeak-ng** | ⭐⭐ | ~50ms | Very Low | ~20MB | ✅ | ✅ Excellent | ⚠️ Wrapper | 🟢 Active |
| **gTTS** | ⭐⭐⭐⭐⭐ | ~2000ms | Low | ~30MB | ❌ | ✅ Good | ✅ Native | 🟢 Active |
| **Coqui TTS** | ⭐⭐⭐⭐⭐ | ~1500ms | High | ~500MB | ✅ | ✅ Good | ✅ Native | 🟡 Slow |
| **Festival** | ⭐⭐⭐ | ~300ms | Low | ~100MB | ✅ | ✅ Excellent | ⚠️ Wrapper | 🟡 Slow |
| **Mimic3** | ⭐⭐⭐⭐ | ~800ms | Medium | ~300MB | ✅ | ✅ Good | ❌ HTTP only | 🟢 Active |

### Detailed Engine Profiles

#### 1. Piper TTS (RECOMMENDED)

**Pros:**

- Neural TTS with natural-sounding voices
- Optimized for local inference (ONNX Runtime)
- Multiple quality levels (low/medium/high)
- Extensive language and voice support
- Active development and community
- Easy pip installation
- GPU acceleration support (CUDA)

**Cons:**

- Larger model files (20-100MB per voice)
- Higher resource usage than simple engines
- Initial model download required
- Slightly higher latency than robotic engines

**Installation:**

```bash
uv pip install piper-tts
```

**Usage Example:**

```python
from piper import PiperVoice
import wave

voice = PiperVoice.load("en_US-lessac-medium.onnx")
with wave.open("output.wav", "wb") as wav_file:
    voice.synthesize("Hello world", wav_file)
```
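
Piper writes standard 16-bit PCM WAV. Before handing audio to a playback library you typically need the sample rate and raw frames, which the stdlib `wave` module provides. The demo below builds a tiny silent WAV in memory as a stand-in for Piper's output, so no model download is needed:

```python
import io
import wave


def wav_to_frames(wav_bytes: bytes) -> tuple[int, int, bytes]:
    """Return (sample_rate, channels, pcm_frames) from a WAV byte string."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return (wav_file.getframerate(),
                wav_file.getnchannels(),
                wav_file.readframes(wav_file.getnframes()))


# Stand-in for Piper output: 220 frames of mono 16-bit silence
buf = io.BytesIO()
with wave.open(buf, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)       # 16-bit PCM
    wav_file.setframerate(22050)   # a common rate for medium-quality voices
    wav_file.writeframes(b"\x00\x00" * 220)

rate, channels, frames = wav_to_frames(buf.getvalue())
print(rate, channels, len(frames))  # → 22050 1 440
```

From here, converting `frames` to an int16 NumPy array is a one-liner (`np.frombuffer(frames, dtype=np.int16)`) ready for `sd.play(..., rate)`.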

**Voice Quality Sample:**

- **Low Quality:** Faster, smaller models (~20MB), decent quality
- **Medium Quality:** Balanced performance (~60MB), recommended default
- **High Quality:** Best quality (~100MB), slower inference

**References:**

- [GitHub Repository](https://github.com/rhasspy/piper)
- [PyPI Package](https://pypi.org/project/piper-tts/)
- [Voice Model Library](https://github.com/rhasspy/piper/blob/master/VOICES.md)

---

#### 2. pyttsx3

**Pros:**

- Extremely lightweight and fast
- Cross-platform (Windows SAPI5, macOS NSSpeech, Linux eSpeak)
- Zero external dependencies
- Simple API
- No model downloads required

**Cons:**

- Robotic voice quality
- Limited voice customization
- Depends on system TTS engines

**Installation:**

```bash
uv pip install pyttsx3
```

**Usage Example:**

```python
import pyttsx3

engine = pyttsx3.init()
engine.say("Hello world")
engine.runAndWait()
```

**References:**

- [PyPI Package](https://pypi.org/project/pyttsx3/)
- [GitHub Repository](https://github.com/nateshmbhat/pyttsx3)

---

#### 3. eSpeak-ng

**Pros:**

- Ultra-fast synthesis
- 100+ language support
- Minimal resource usage
- Highly customizable
- System-level installation

**Cons:**

- Robotic, mechanical voice quality
- Python wrapper required (not native)
- Less natural prosody

**Installation:**

```bash
# System package
sudo dnf install espeak-ng

# Python wrapper
uv pip install py3-tts  # Uses eSpeak backend
```

**Usage Example:**

```bash
echo "Hello world" | espeak-ng
```

**References:**

- [eSpeak-ng Homepage](https://github.com/espeak-ng/espeak-ng)
- [Circuit Digest Comparison](https://circuitdigest.com/microcontroller-projects/best-text-to-speech-tts-converter-for-raspberry-pi-espeak-festival-google-tts-pico-and-pyttsx3)

---

#### 4. Coqui TTS

**Pros:**

- State-of-the-art neural voices
- Custom voice training support
- Multiple model architectures
- High-quality output

**Cons:**

- Very high resource requirements
- Slower inference
- Complex setup
- Larger memory footprint
- Development has slowed

**Installation:**

```bash
uv pip install TTS
```

**Usage Example:**

```python
from TTS.api import TTS

tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world", file_path="output.wav")
```

**References:**

- [Coqui TTS GitHub](https://github.com/coqui-ai/TTS)

---

### Recommendation: Piper TTS

**Final Decision:** Piper TTS is the optimal choice for this project.

**Justification:**

1. **Quality:** Neural voices with Google TTS-level quality
2. **Offline:** Fully local, no internet required (critical requirement)
3. **Performance:** Optimized for local inference, suitable for desktop Linux
4. **Python Native:** First-class Python API, easy integration
5. **Maintenance:** Actively maintained with 2025 updates
6. **Flexibility:** Multiple quality levels allow performance tuning
7. **Ease of Use:** Simple pip installation, straightforward API

**Configuration Strategy:**

- **Default Voice:** `en_US-lessac-medium` (balanced quality/performance)
- **GPU Acceleration:** Optional CUDA support for faster inference
- **Model Caching:** Pre-load voice models at startup to reduce latency
- **Quality Toggle:** Allow clients to request different quality levels
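
The model-caching point above can be a plain dict keyed by voice name. In this sketch `loader` is a stand-in for `PiperVoice.load`, so the cache behavior can be shown without downloading models:

```python
class VoiceCache:
    """Load each voice model once and reuse it across requests."""

    def __init__(self, loader):
        self._loader = loader
        self._models = {}
        self.loads = 0  # counts real (slow) loads

    def get(self, name: str = "en_US-lessac-medium"):
        if name not in self._models:
            self._models[name] = self._loader(name)
            self.loads += 1
        return self._models[name]

    def preload(self, names):
        """Warm the cache at startup so the first request pays no load cost."""
        for name in names:
            self.get(name)


cache = VoiceCache(loader=lambda name: f"<model {name}>")
cache.preload(["en_US-lessac-medium"])
cache.get("en_US-lessac-medium")  # served from cache, no second load
print(cache.loads)  # → 1
```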

---

## Web Framework Selection

### FastAPI: Detailed Analysis

**Why FastAPI is Ideal for This Project:**

#### 1. Async-First Architecture

FastAPI is built on Starlette (ASGI framework) with native async/await support. This is critical for our use case:

```python
@app.post("/notify", status_code=202)
async def notify(request: NotifyRequest):
    try:
        # Non-blocking enqueue - never stalls the HTTP response
        tts_queue.put_nowait(request)
    except asyncio.QueueFull:
        # Honors the 503 contract when the queue is full
        raise HTTPException(status_code=503, detail="TTS queue is full")
    return {"status": "queued"}

# Background worker runs concurrently
async def process_queue():
    while True:
        request = await tts_queue.get()
        await generate_and_play_tts(request)
```

**Benefit:** HTTP responses return immediately while TTS processing happens in background.

#### 2. Performance Benchmarks

According to TechEmpower benchmarks ([Better Stack](https://betterstack.com/community/guides/scaling-python/flask-vs-fastapi/)):

- **FastAPI:** 15,000-20,000 requests/second
- **Flask:** 2,000-3,000 requests/second

**Benefit:** 5-10x higher throughput for handling concurrent TTS requests.

#### 3. Automatic API Documentation

FastAPI generates interactive OpenAPI (Swagger) documentation automatically:

```python
@app.post("/notify", response_model=NotifyResponse)
async def notify(request: NotifyRequest):
    """
    Convert text to speech and play through speakers.

    - **message**: Text to convert (1-10000 characters)
    - **rate**: Speech rate in WPM (50-400)
    - **voice**: Voice model name (optional)
    """
    ...
```

**Benefit:** Instant API documentation at `/docs` without manual maintenance.

#### 4. Type Safety with Pydantic

Automatic request validation and serialization:

```python
from pydantic import BaseModel, Field, field_validator

class NotifyRequest(BaseModel):
    message: str = Field(..., min_length=1, max_length=10000)
    rate: int = Field(170, ge=50, le=400)
    voice_enabled: bool = True

    @field_validator('message')
    @classmethod
    def sanitize_message(cls, v):
        # Runs automatically before the handler (Pydantic v2 style)
        return v.strip()
```

**Benefit:** Eliminates manual validation code, reduces bugs.

#### 5. Dependency Injection

Clean separation of concerns:

```python
from fastapi import Depends

async def get_tts_engine():
    return global_tts_engine

@app.post("/notify")
async def notify(
    request: NotifyRequest,
    tts_engine: PiperVoice = Depends(get_tts_engine)
):
    # tts_engine automatically injected
    ...
```

**Benefit:** Testable, maintainable code with clear dependencies.

#### 6. Background Tasks

Built-in support for fire-and-forget tasks:

```python
from fastapi import BackgroundTasks

@app.post("/notify")
async def notify(request: NotifyRequest, background_tasks: BackgroundTasks):
    background_tasks.add_task(generate_tts, request.message)
    return {"status": "queued"}
```

**Benefit:** Simplified async task management.

### Flask Comparison (Why Not Flask)

**Flask Limitations for This Project:**

1. **WSGI-Based:** Synchronous by default, requires Gunicorn/gevent for async
2. **Lower Performance:** 2,000-3,000 req/s vs FastAPI's 15,000-20,000 req/s
3. **Manual Documentation:** Requires Flask-RESTPlus or manual OpenAPI setup
4. **Manual Validation:** No built-in request validation, requires Flask-Pydantic extension
5. **Blocking I/O:** Natural behavior blocks request threads during TTS processing

**When Flask Would Be Better:**

- Simple synchronous applications
- Heavy reliance on Flask extensions (Flask-Login, Flask-Admin)
- Team already experienced with Flask
- Need for Jinja2 templating (not needed here)

**Verdict:** FastAPI is the clear winner for this async-heavy, high-performance use case.

---

## Audio Playback Strategy

### sounddevice Implementation Details

#### Non-Blocking Playback

sounddevice provides simple, non-blocking audio playback out of the box:

```python
import asyncio

import numpy as np
import sounddevice as sd


class AudioPlayer:
    """Simple audio player using sounddevice."""

    def __init__(self, sample_rate: int = 22050):
        self.sample_rate = sample_rate

    def play(self, audio_data: np.ndarray, sample_rate: int = None):
        """
        Non-blocking audio playback.

        Args:
            audio_data: NumPy array of audio samples (float32 or int16)
            sample_rate: Sample rate in Hz (defaults to instance default)
        """
        rate = sample_rate or self.sample_rate

        # Stop any currently playing audio
        self.stop()

        # Play audio - returns immediately, audio plays in background
        sd.play(audio_data, rate)

    def is_playing(self) -> bool:
        """Check if audio is currently playing."""
        try:
            stream = sd.get_stream()
        except RuntimeError:
            # sd.get_stream() raises if play() has never been called
            return False
        return stream.active

    def stop(self):
        """Stop current playback."""
        sd.stop()

    def wait(self):
        """Block until current playback completes."""
        sd.wait()

    async def wait_async(self):
        """Async wait for playback completion."""
        while self.is_playing():
            await asyncio.sleep(0.05)
```
|
|
|
|
**Benefits of sounddevice:**
|
|
- `sd.play()` returns immediately - audio plays in background thread
|
|
- Direct NumPy array support - no conversion needed from Piper TTS
|
|
- Simple API - one line to play audio
|
|
- Built-in `sd.wait()` for synchronous waiting when needed
|
|
|
|
---

#### Handling Concurrent Requests

**Strategy:** Queue-based sequential playback with async queue management.

**Rationale:**
- Playing multiple TTS outputs simultaneously would create audio chaos
- Sequential playback ensures clarity
- Queue allows buffering during high request volume

**Implementation:**

```python
import asyncio
import logging
from typing import Any, Dict

import numpy as np

logger = logging.getLogger(__name__)


class TTSQueue:
    def __init__(self, max_size: int = 50):
        self.max_size = max_size
        self.queue = asyncio.Queue(maxsize=max_size)
        self.player = AudioPlayer()
        self.stats = {"processed": 0, "errors": 0}

    async def enqueue(self, request: Dict[str, Any]):
        """Add TTS request to queue."""
        try:
            await asyncio.wait_for(
                self.queue.put(request),
                timeout=1.0
            )
            return self.queue.qsize()
        except asyncio.TimeoutError:
            raise QueueFullError("TTS queue is full")

    async def process_queue(self):
        """Background worker to process TTS queue."""
        while True:
            request = await self.queue.get()

            try:
                # Generate TTS audio
                audio_data = await self.generate_tts(request)

                # Play audio (non-blocking start)
                self.player.play(audio_data, sample_rate=22050)

                # Wait for playback to complete (async-friendly)
                await self.player.wait_async()

                self.stats["processed"] += 1

            except Exception as e:
                logger.error(f"TTS processing error: {e}")
                self.stats["errors"] += 1

            finally:
                self.queue.task_done()

    async def generate_tts(self, request: Dict[str, Any]) -> np.ndarray:
        """Generate TTS audio using Piper."""
        # Run CPU-intensive TTS in thread pool
        loop = asyncio.get_running_loop()
        audio_data = await loop.run_in_executor(
            None,
            self._sync_generate_tts,
            request["message"],
            request.get("voice", "en_US-lessac-medium")
        )
        return audio_data

    def _sync_generate_tts(self, text: str, voice: str) -> np.ndarray:
        """Synchronous TTS generation (runs in thread pool)."""
        # Piper TTS generation code
        ...
        return audio_array
```

**Startup:**

```python
from contextlib import asynccontextmanager

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: initialize queue and start processor
    global tts_queue
    tts_queue = TTSQueue(max_size=50)
    # Keep a reference so the task is not garbage-collected
    app.state.queue_task = asyncio.create_task(tts_queue.process_queue())
    yield
    # Shutdown: stop audio playback
    sd.stop()

app = FastAPI(lifespan=lifespan)
```
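The shutdown branch above only stops audio. A hedged sketch (plain asyncio, outside FastAPI) of the fuller pattern — drain the queue, then cancel the worker task; function names are illustrative:

```python
import asyncio

async def drain_and_cancel() -> int:
    """Let a worker finish all queued items, then cancel it cleanly."""
    queue: asyncio.Queue[str] = asyncio.Queue()
    processed = []

    async def worker() -> None:
        while True:
            item = await queue.get()
            processed.append(item)
            queue.task_done()

    task = asyncio.create_task(worker())
    for msg in ("first", "second", "third"):
        queue.put_nowait(msg)

    await queue.join()   # blocks until every item is marked done
    task.cancel()        # worker is idle now; cancel it
    try:
        await task
    except asyncio.CancelledError:
        pass
    return len(processed)
```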

---

#### Audio Device Error Handling

**Common Issues:**
1. Audio device disconnected (headphones unplugged)
2. PulseAudio/ALSA daemon crashed
3. No audio devices available
4. Device in use by another process

**Handling Strategy:**

```python
import asyncio
import logging
import time

import numpy as np
import sounddevice as sd

logger = logging.getLogger(__name__)


class AudioDeviceError(Exception):
    """Raised when no usable audio output device is available."""


class AudioPlaybackError(Exception):
    """Raised when playback fails after all retry attempts."""


class RobustAudioPlayer:
    """Audio player with automatic retry and device recovery."""

    def __init__(self, retry_attempts: int = 3, sample_rate: int = 22050):
        self.retry_attempts = retry_attempts
        self.sample_rate = sample_rate
        self.verify_audio_devices()

    def verify_audio_devices(self):
        """Verify audio devices are available."""
        try:
            devices = sd.query_devices()
            output_devices = [d for d in devices if d['max_output_channels'] > 0]
            if not output_devices:
                raise AudioDeviceError("No audio output devices found")
            logger.info(f"Audio initialized: {len(output_devices)} output devices found")
            logger.debug(f"Default output: {sd.query_devices(kind='output')['name']}")
        except Exception as e:
            logger.error(f"Audio initialization failed: {e}")
            raise

    def play(self, audio_data: np.ndarray, sample_rate: int | None = None):
        """Play audio with automatic retry on device errors."""
        rate = sample_rate or self.sample_rate

        for attempt in range(self.retry_attempts):
            try:
                sd.play(audio_data, rate)
                return
            except sd.PortAudioError as e:
                logger.warning(f"Audio playback failed (attempt {attempt+1}): {e}")

                if attempt < self.retry_attempts - 1:
                    # Wait and retry - device may become available
                    sd.stop()
                    time.sleep(0.5)
                    self.verify_audio_devices()
                else:
                    raise AudioPlaybackError(f"Failed after {self.retry_attempts} attempts: {e}")

    def is_playing(self) -> bool:
        """Check if audio is currently playing."""
        try:
            stream = sd.get_stream()
        except RuntimeError:
            # sd.get_stream() raises if play() has never been called
            return False
        return stream.active

    def stop(self):
        """Stop current playback."""
        sd.stop()

    async def wait_async(self):
        """Async wait for playback completion."""
        while self.is_playing():
            await asyncio.sleep(0.05)
```

**Device Query for Diagnostics:**

```python
def get_audio_diagnostics() -> dict:
    """Get audio system diagnostics for health check."""
    try:
        devices = sd.query_devices()
        default_output = sd.query_devices(kind='output')
        return {
            "status": "available",
            "device_count": len(devices),
            "default_output": default_output['name'],
            "sample_rate": default_output['default_samplerate']
        }
    except Exception as e:
        return {
            "status": "unavailable",
            "error": str(e)
        }
```

---

## Error Handling Strategy

### Error Categories and Handling

#### 1. Request Validation Errors

**Scenarios:**
- Missing required fields
- Invalid parameter types
- Out-of-range values
- Malformed JSON

**Handling:**

```python
from datetime import datetime

from fastapi import HTTPException, status
from fastapi.exceptions import RequestValidationError
from fastapi.responses import JSONResponse
from pydantic import BaseModel, Field

class NotifyRequest(BaseModel):
    message: str = Field(..., min_length=1, max_length=10000)
    rate: int = Field(170, ge=50, le=400)
    voice: str = Field("en_US-lessac-medium", regex=r"^[\w-]+$")

# FastAPI wraps request validation failures in RequestValidationError;
# a handler registered for plain pydantic ValidationError would never fire.
@app.exception_handler(RequestValidationError)
async def validation_exception_handler(request, exc):
    return JSONResponse(
        status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
        content={
            "error": "validation_error",
            "detail": str(exc),
            "timestamp": datetime.utcnow().isoformat()
        }
    )
```

**HTTP Status:** 422 Unprocessable Entity

---

#### 2. Queue Full Errors

**Scenario:** Too many concurrent requests, queue is at capacity

**Handling:**

```python
class QueueFullError(Exception):
    pass

@app.post("/notify")
async def notify(request: NotifyRequest):
    try:
        position = await tts_queue.enqueue(request)
        return {
            "status": "queued",
            "queue_position": position
        }
    except QueueFullError:
        raise HTTPException(
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
            detail={
                "error": "queue_full",
                "message": "TTS queue is full, please retry later",
                "queue_size": tts_queue.max_size,
                "retry_after": 5  # seconds
            }
        )
```

**HTTP Status:** 503 Service Unavailable
**Client Action:** Implement exponential backoff retry
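A client-side sketch of that backoff policy ("full jitter"; the base and cap values are illustrative):

```python
import random

def backoff_delays(attempts: int, base: float = 0.5, cap: float = 8.0) -> list[float]:
    """Delay before retry i is drawn uniformly from [0, min(cap, base * 2**i)]."""
    return [random.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]
```

A client would sleep for `delays[i]` after the i-th 503 response before retrying.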

---

#### 3. TTS Engine Errors

**Scenarios:**
- Voice model not found
- ONNX Runtime errors
- Memory allocation failures
- Corrupted model files

**Handling:**

```python
class TTSEngineError(Exception):
    pass

async def generate_tts(text: str, voice: str) -> np.ndarray:
    try:
        # Attempt TTS generation (illustrative call; the exact Piper
        # synthesis signature depends on the piper-tts version)
        audio = piper_voice.synthesize(text)
        return audio
    except FileNotFoundError:
        raise TTSEngineError(f"Voice model '{voice}' not found")
    except MemoryError:
        raise TTSEngineError("Insufficient memory for TTS generation")
    except Exception as e:
        logger.error(f"TTS generation failed: {e}", exc_info=True)
        raise TTSEngineError(f"TTS generation failed: {str(e)}")

@app.exception_handler(TTSEngineError)
async def tts_engine_exception_handler(request, exc):
    return JSONResponse(
        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
        content={
            "error": "tts_engine_error",
            "detail": str(exc),
            "timestamp": datetime.utcnow().isoformat()
        }
    )
```

**HTTP Status:** 500 Internal Server Error

---

#### 4. Audio Playback Errors

**Scenarios:**
- No audio devices available
- Audio device disconnected
- ALSA/PulseAudio errors
- Permission denied

**Handling:**

```python
class AudioPlaybackError(Exception):
    pass

async def play_audio(audio_data: np.ndarray):
    try:
        # RobustAudioPlayer.play() already retries internally
        player.play(audio_data, sample_rate=22050)
    except AudioDeviceError as e:
        logger.error(f"Audio device error: {e}")
        raise AudioPlaybackError("No audio output devices available")
    except OSError as e:
        logger.error(f"Audio system error: {e}")
        raise AudioPlaybackError(f"Audio playback failed: {str(e)}")

# In queue processor
try:
    await play_audio(audio_data)
except AudioPlaybackError as e:
    logger.error(f"Playback error: {e}")
    # Continue processing queue, don't crash server
    stats["errors"] += 1
```

**Action:** Log error, continue processing queue (don't crash server)

---

#### 5. System Resource Errors

**Scenarios:**
- Out of memory
- CPU overload
- Disk space exhausted

**Handling:**

```python
import psutil

async def check_system_resources():
    """Monitor system resources."""
    memory = psutil.virtual_memory()
    if memory.percent > 90:
        logger.warning(f"High memory usage: {memory.percent}%")

    # interval=None is non-blocking: it measures CPU use since the previous
    # call rather than sleeping for a sampling window inside the request path
    cpu = psutil.cpu_percent(interval=None)
    if cpu > 90:
        logger.warning(f"High CPU usage: {cpu}%")

@app.middleware("http")
async def resource_monitoring_middleware(request, call_next):
    """Monitor resources on each request."""
    await check_system_resources()
    response = await call_next(request)
    return response
```

**Action:** Log warnings, implement queue size limits to prevent resource exhaustion

---

### Logging Strategy

**Log Levels:**

```python
import logging
from logging.handlers import RotatingFileHandler

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        RotatingFileHandler(
            'voice-server.log',
            maxBytes=10*1024*1024,  # 10MB
            backupCount=5
        ),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger(__name__)

# Log levels usage:
logger.debug("TTS parameters: rate=%d, voice=%s", rate, voice)   # DEBUG
logger.info("Request queued: position=%d", queue_position)       # INFO
logger.warning("Queue nearly full: %d/%d", current, max_size)    # WARNING
logger.error("TTS generation failed: %s", error, exc_info=True)  # ERROR
logger.critical("Audio system unavailable, shutting down")       # CRITICAL
```

**Structured Logging:**

```python
import json
from datetime import datetime

def log_request(request_id: str, message: str, status: str):
    """Structured JSON logging."""
    log_entry = {
        "timestamp": datetime.utcnow().isoformat(),
        "request_id": request_id,
        "message_length": len(message),
        "status": status,
        "event_type": "tts_request"
    }
    logger.info(json.dumps(log_entry))
```
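The `log_request` helper serializes entries by hand. An alternative sketch pushes the same idea into a `logging.Formatter`, so every record on a handler becomes one JSON line (class name is illustrative):

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render every log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),  # applies %-style arguments
        }
        return json.dumps(entry)
```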

---

### Health Check Implementation

**Comprehensive Health Checks:**

```python
@app.get("/health")
async def health_check():
    """Detailed health status."""
    health_status = {
        "status": "healthy",
        "timestamp": datetime.utcnow().isoformat(),
        "checks": {}
    }

    # Check TTS engine (test_synthesis is a lightweight self-test helper)
    try:
        tts_engine.test_synthesis("test")
        health_status["checks"]["tts_engine"] = "healthy"
    except Exception as e:
        health_status["checks"]["tts_engine"] = f"unhealthy: {str(e)}"
        health_status["status"] = "unhealthy"

    # Check audio output
    try:
        audio_player.test_output()
        health_status["checks"]["audio_output"] = "healthy"
    except Exception as e:
        health_status["checks"]["audio_output"] = f"unhealthy: {str(e)}"
        health_status["status"] = "unhealthy"

    # Check queue status (the asyncio.Queue lives on the TTSQueue wrapper)
    queue_size = tts_queue.queue.qsize()
    health_status["checks"]["queue"] = {
        "size": queue_size,
        "capacity": tts_queue.max_size,
        "utilization": f"{(queue_size/tts_queue.max_size)*100:.1f}%"
    }

    # Check system resources
    health_status["checks"]["system"] = {
        "memory_percent": psutil.virtual_memory().percent,
        "cpu_percent": psutil.cpu_percent(interval=0.1)
    }

    status_code = 200 if health_status["status"] == "healthy" else 503
    return JSONResponse(status_code=status_code, content=health_status)
```

---

## Implementation Checklist

### Phase 1: Core Infrastructure (Days 1-2)

#### 1.1 Project Setup
- [ ] Initialize project directory `/mnt/NV2/Development/voice-server`
- [ ] Create Python virtual environment using `uv`
- [ ] Install core dependencies:
  - [ ] `uv pip install fastapi`
  - [ ] `uv pip install uvicorn[standard]`
  - [ ] `uv pip install piper-tts`
  - [ ] `uv pip install sounddevice`
  - [ ] `uv pip install numpy`
  - [ ] `uv pip install pydantic`
  - [ ] `uv pip install python-dotenv`
- [ ] Create `requirements.txt` with pinned versions
- [ ] Create `.env.example` for configuration template
- [ ] Initialize git repository
- [ ] Create `.gitignore` (Python, IDEs, .env, voice models)

#### 1.2 FastAPI Application Structure
- [ ] Create `app/main.py` with FastAPI app initialization
- [ ] Implement `/notify` endpoint skeleton
- [ ] Implement `/health` endpoint skeleton
- [ ] Implement `/voices` endpoint skeleton
- [ ] Configure CORS middleware
- [ ] Configure JSON logging middleware
- [ ] Create Pydantic models for request/response schemas
- [ ] Test basic server startup: `uvicorn app.main:app --reload`

#### 1.3 Configuration Management
- [ ] Create `app/config.py` for configuration loading
- [ ] Implement environment variable loading
- [ ] Define configuration schema (host, port, queue size, etc.)
- [ ] Implement configuration validation at startup
- [ ] Create CLI argument parsing for overrides
- [ ] Document all configuration options in README

### Phase 2: TTS Integration (Days 2-3)

#### 2.1 Piper TTS Setup
- [ ] Create `app/tts_engine.py` module
- [ ] Implement `PiperTTSEngine` class
- [ ] Download default voice model (`en_US-lessac-medium`)
- [ ] Implement voice model loading with caching
- [ ] Implement text-to-audio synthesis method
- [ ] Add support for configurable speech rate
- [ ] Test TTS generation with sample text
- [ ] Measure TTS latency for various text lengths

#### 2.2 Voice Model Management
- [ ] Create `models/` directory for voice model storage
- [ ] Implement voice model discovery (scan `models/` directory)
- [ ] Implement lazy loading of voice models (load on first use)
- [ ] Create model metadata cache (name, language, quality, size)
- [ ] Implement `/voices` endpoint to list available models
- [ ] Add error handling for missing/corrupted models
- [ ] Document voice model installation process
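Piper distributes each voice as a `<name>.onnx` model with a `<name>.onnx.json` config beside it. A discovery sketch along those lines (the helper name is illustrative):

```python
from pathlib import Path

def discover_voices(model_dir: str) -> list[str]:
    """List voice names that have both a .onnx model and its .onnx.json config."""
    voices = []
    for onnx in sorted(Path(model_dir).glob("*.onnx")):
        config = onnx.parent / (onnx.name + ".json")
        if config.exists():  # skip models missing their config file
            voices.append(onnx.stem)
    return voices
```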

#### 2.3 TTS Parameter Support
- [ ] Implement speech rate adjustment (50-400 WPM)
- [ ] Test rate adjustment across range
- [ ] Add voice selection via request parameter
- [ ] Implement voice validation (reject unknown voices)
- [ ] Add `voice_enabled` flag for debugging/testing
- [ ] Create comprehensive TTS unit tests
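For the speech-rate item: Piper itself exposes no WPM parameter; speed is controlled through `length_scale` (larger means slower). One way to honour the API's 50-400 WPM range is to map it onto that scale; the 170 WPM baseline here is an assumption that should be calibrated per voice model:

```python
def wpm_to_length_scale(rate_wpm: int, baseline_wpm: int = 170) -> float:
    """Map a requested WPM rate onto Piper's length_scale (inverse of speed)."""
    if not 50 <= rate_wpm <= 400:
        raise ValueError("rate must be between 50 and 400 WPM")
    return baseline_wpm / rate_wpm
```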

### Phase 3: Audio Playback (Day 3)

#### 3.1 sounddevice Integration
- [ ] Create `app/audio_player.py` module
- [ ] Implement `AudioPlayer` class with non-blocking `sd.play()`
- [ ] Verify sounddevice detects audio devices at startup
- [ ] Implement non-blocking playback method
- [ ] Implement async `wait_async()` method for queue processing
- [ ] Test audio playback with sample NumPy array
- [ ] Verify non-blocking behavior with concurrent requests

#### 3.2 Audio Error Handling
- [ ] Implement audio device detection
- [ ] Add retry logic for device failures
- [ ] Handle device disconnection gracefully
- [ ] Test with headphones unplugged during playback
- [ ] Implement fallback to different audio devices
- [ ] Add detailed audio error logging
- [ ] Create audio system health check

#### 3.3 Playback Testing
- [ ] Test simultaneous playback (should queue)
- [ ] Test rapid successive requests
- [ ] Measure audio latency (request → sound output)
- [ ] Test with various audio formats
- [ ] Verify memory cleanup after playback
- [ ] Test long-running playback (10+ minutes)

### Phase 4: Queue Management (Day 4)

#### 4.1 Async Queue Implementation
- [ ] Create `app/queue_manager.py` module
- [ ] Implement `TTSQueue` class with `asyncio.Queue`
- [ ] Set configurable max queue size (default: 50)
- [ ] Implement queue full detection
- [ ] Create background queue processor task
- [ ] Implement graceful queue shutdown
- [ ] Add queue metrics (size, processed, errors)

#### 4.2 Request Processing Pipeline
- [ ] Implement request enqueueing in `/notify` endpoint
- [ ] Create background worker to process queue
- [ ] Integrate TTS generation in worker
- [ ] Integrate audio playback in worker
- [ ] Implement sequential playback (one at a time)
- [ ] Add request timeout handling (max 60s per request)
- [ ] Test queue with 100+ concurrent requests

#### 4.3 Queue Monitoring
- [ ] Add queue size to `/health` endpoint
- [ ] Implement queue utilization metrics
- [ ] Add logging for queue events (enqueue, process, error)
- [ ] Create queue performance benchmarks
- [ ] Test queue overflow scenarios
- [ ] Document queue behavior and limits

### Phase 5: Error Handling (Day 5)

#### 5.1 Exception Handlers
- [ ] Implement custom exception classes
- [ ] Create `QueueFullError` exception handler
- [ ] Create `TTSEngineError` exception handler
- [ ] Create `AudioPlaybackError` exception handler
- [ ] Create `ValidationError` exception handler
- [ ] Implement generic exception handler (catch-all)
- [ ] Test all error scenarios

#### 5.2 Logging Infrastructure
- [ ] Configure structured JSON logging
- [ ] Implement rotating file handler (10MB, 5 backups)
- [ ] Add request ID tracking across logs
- [ ] Implement log levels appropriately (DEBUG, INFO, WARNING, ERROR)
- [ ] Create log aggregation for queue processor
- [ ] Test log rotation
- [ ] Document log file locations and format
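For the request-ID tracking item, `contextvars` carries an ID across `await` points without threading it through every call; the names here are illustrative:

```python
import contextvars
import uuid

# Holds the current request's ID for any code running in the same task context
request_id_var: contextvars.ContextVar[str] = contextvars.ContextVar(
    "request_id", default="-"
)

def assign_request_id() -> str:
    """Generate a short ID and bind it to the current context."""
    rid = uuid.uuid4().hex[:12]
    request_id_var.set(rid)
    return rid
```

A middleware would call `assign_request_id()` once per request, and a logging filter would read `request_id_var.get()` to stamp every record.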

#### 5.3 Health Monitoring
- [ ] Implement comprehensive `/health` endpoint
- [ ] Add TTS engine health check
- [ ] Add audio system health check
- [ ] Add queue status to health check
- [ ] Add system resource metrics (CPU, memory)
- [ ] Test health endpoint under load
- [ ] Create health check monitoring script

### Phase 6: Testing (Days 5-6)

#### 6.1 Unit Tests
- [ ] Create `tests/` directory structure
- [ ] Install pytest: `uv pip install pytest pytest-asyncio`
- [ ] Write tests for Pydantic models
- [ ] Write tests for TTS engine
- [ ] Write tests for audio player
- [ ] Write tests for queue manager
- [ ] Write tests for configuration loading
- [ ] Achieve 80%+ code coverage

#### 6.2 Integration Tests
- [ ] Write tests for `/notify` endpoint
- [ ] Write tests for `/health` endpoint
- [ ] Write tests for `/voices` endpoint
- [ ] Test end-to-end request flow
- [ ] Test concurrent request handling
- [ ] Test queue overflow scenarios
- [ ] Test error scenarios (TTS failure, audio failure)

#### 6.3 Performance Tests
- [ ] Create load testing script with `locust` or `wrk`
- [ ] Test 100 concurrent requests
- [ ] Measure request latency (p50, p95, p99)
- [ ] Measure TTS generation time
- [ ] Measure audio playback latency
- [ ] Measure memory usage under load
- [ ] Document performance characteristics

#### 6.4 System Tests
- [ ] Test on target Linux environment (Nobara/Fedora 42)
- [ ] Test with different audio devices
- [ ] Test with PulseAudio and ALSA
- [ ] Test headphone disconnect/reconnect
- [ ] Test system resource exhaustion scenarios
- [ ] Test server restart recovery
- [ ] Test long-running stability (24+ hours)

### Phase 7: Documentation & Deployment (Days 6-7)

#### 7.1 Documentation
- [ ] Create comprehensive README.md:
  - [ ] Project overview
  - [ ] Installation instructions
  - [ ] Configuration options
  - [ ] Usage examples
  - [ ] API documentation
  - [ ] Troubleshooting guide
- [ ] Create CONTRIBUTING.md (if open source)
- [ ] Create CHANGELOG.md
- [ ] Document voice model installation
- [ ] Create architecture diagrams
- [ ] Add inline code documentation
- [ ] Create example client scripts (curl, Python)

#### 7.2 Deployment Preparation
- [ ] Create systemd service file (`voice-server.service`)
- [ ] Test systemd service installation
- [ ] Test automatic restart on failure
- [ ] Create deployment script (`deploy.sh`)
- [ ] Document deployment process
- [ ] Create backup/restore procedures
- [ ] Test upgrade procedure
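A minimal unit sketch for the `voice-server.service` item; the virtual-environment path is an assumption and the project directory comes from Phase 1.1:

```ini
[Unit]
Description=Local voice server (TTS over HTTP)
After=network.target sound.target

[Service]
Type=simple
WorkingDirectory=/mnt/NV2/Development/voice-server
ExecStart=/mnt/NV2/Development/voice-server/.venv/bin/uvicorn app.main:app --host 127.0.0.1 --port 8888
Restart=on-failure
RestartSec=3

[Install]
WantedBy=default.target
```

Installed as a user service (`systemctl --user enable --now voice-server`), it inherits the desktop session's PulseAudio/PipeWire connection, which a system-level unit would not see.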

#### 7.3 Production Hardening
- [ ] Enable production logging (disable debug logs)
- [ ] Configure log rotation
- [ ] Set up monitoring (optional: Prometheus, Grafana)
- [ ] Implement graceful shutdown (SIGTERM handling)
- [ ] Test crash recovery
- [ ] Implement rate limiting (optional)
- [ ] Security audit (input sanitization, resource limits)
- [ ] Performance tuning (queue size, worker count)

---

## Testing Strategy

### Unit Testing

**Framework:** pytest with pytest-asyncio

**Test Coverage Requirements:**
- Minimum 80% code coverage
- 100% coverage for critical paths (TTS, audio playback)
- All error handlers must have tests

**Test Structure:**

```
tests/
├── __init__.py
├── conftest.py                 # Shared fixtures
├── unit/
│   ├── test_config.py          # Configuration loading tests
│   ├── test_models.py          # Pydantic model tests
│   ├── test_tts_engine.py      # TTS engine tests
│   ├── test_audio_player.py    # Audio player tests
│   └── test_queue.py           # Queue manager tests
├── integration/
│   ├── test_api.py             # API endpoint tests
│   ├── test_end_to_end.py      # Full request flow tests
│   └── test_errors.py          # Error scenario tests
└── performance/
    └── test_load.py            # Load testing
```

**Sample Unit Test:**

```python
# tests/unit/test_tts_engine.py
import numpy as np
import pytest

from app.tts_engine import PiperTTSEngine

@pytest.fixture
def tts_engine():
    """Create TTS engine instance."""
    return PiperTTSEngine(model_dir="models/")

def test_tts_engine_initialization(tts_engine):
    """Test TTS engine initializes successfully."""
    assert tts_engine is not None
    assert tts_engine.default_voice == "en_US-lessac-medium"

def test_text_to_audio_conversion(tts_engine):
    """Test converting text to audio."""
    audio = tts_engine.synthesize("Hello world")
    assert audio is not None
    assert len(audio) > 0
    assert audio.dtype == np.float32

def test_invalid_voice_raises_error(tts_engine):
    """Test that invalid voice raises appropriate error."""
    with pytest.raises(ValueError, match="Voice model .* not found"):
        tts_engine.synthesize("Hello", voice="invalid_voice")

@pytest.mark.asyncio
async def test_async_synthesis(tts_engine):
    """Test async TTS synthesis."""
    audio = await tts_engine.synthesize_async("Hello world")
    assert audio is not None
```

**Sample Integration Test:**

```python
# tests/integration/test_api.py
import pytest
from fastapi.testclient import TestClient
from app.main import app

@pytest.fixture
def client():
    """Create test client."""
    return TestClient(app)

def test_notify_endpoint_success(client):
    """Test successful /notify request."""
    response = client.post(
        "/notify",
        json={"message": "Test message", "rate": 180}
    )
    assert response.status_code == 202
    data = response.json()
    assert data["status"] == "queued"
    assert data["message_length"] == 12

def test_notify_endpoint_validation_error(client):
    """Test /notify with invalid parameters."""
    response = client.post(
        "/notify",
        json={"message": "", "rate": 1000}  # Empty message, invalid rate
    )
    assert response.status_code == 422

def test_health_endpoint(client):
    """Test /health endpoint."""
    response = client.get("/health")
    assert response.status_code == 200
    data = response.json()
    assert "status" in data
    assert "queue_size" in data
```

---

### Load Testing

**Tool:** wrk or locust

**Sample wrk Test:**

```bash
# Install wrk
sudo dnf install wrk

# Run load test: 100 concurrent connections, 30 seconds
wrk -t4 -c100 -d30s -s post.lua http://localhost:8888/notify

# post.lua script:
# wrk.method = "POST"
# wrk.headers["Content-Type"] = "application/json"
# wrk.body = '{"message": "Load test message"}'
```

**Sample locust Test:**

```python
# locustfile.py
from locust import HttpUser, task, between

class VoiceServerUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def notify(self):
        self.client.post("/notify", json={
            "message": "This is a load test message",
            "rate": 180
        })

    @task(5)
    def health_check(self):
        self.client.get("/health")

# Run: locust -f locustfile.py --host=http://localhost:8888
```

**Performance Benchmarks:**

| Metric | Target | Acceptable | Unacceptable |
|--------|--------|------------|--------------|
| API Response Time (p95) | < 50ms | < 100ms | > 200ms |
| TTS Generation (500 chars) | < 2s | < 5s | > 10s |
| Requests/Second | > 50 | > 20 | < 10 |
| Memory Usage (idle) | < 200MB | < 500MB | > 1GB |
| Memory Usage (load) | < 500MB | < 1GB | > 2GB |
| Queue Processing Rate | > 10/s | > 5/s | < 2/s |
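The p50/p95/p99 targets above are nearest-rank percentiles; for a quick benchmark script the definition is small enough to inline:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[k]
```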

---

### Manual Testing Checklist

**Functional Testing:**
- [ ] Send POST request with valid message → Hear audio playback
- [ ] Send request with long text (5000 chars) → Successful playback
- [ ] Send request with special characters → Successful sanitization
- [ ] Send request with invalid voice → Receive 422 error
- [ ] Send request with rate=50 → Slow speech playback
- [ ] Send request with rate=400 → Fast speech playback
- [ ] Send 10 concurrent requests → All play sequentially
- [ ] Fill queue to capacity → Receive 503 error
- [ ] Check /health endpoint → Receive status information
- [ ] Check /voices endpoint → See available voice models
- [ ] Check /docs endpoint → See Swagger documentation

**Error Scenario Testing:**
- [ ] Unplug headphones during playback → Graceful error handling
- [ ] Kill PulseAudio daemon → Audio error logged, server continues
- [ ] Send malformed JSON → Receive 422 error (FastAPI reports body decode failures through its validation handler)
- [ ] Send empty message → Receive 422 error
- [ ] Send 11,000 character message → Receive 422 error (fails `max_length` validation)
- [ ] Restart server during playback → Queue cleared, server restarts

**System Testing:**
- [ ] Run server for 24 hours → No memory leaks
- [ ] Send 10,000 requests → All processed successfully
- [ ] Monitor CPU usage during load → < 50% average
- [ ] Monitor memory usage during load → < 1GB
- [ ] Test on Fedora 42 → Successful operation
- [ ] Test with ALSA (without PulseAudio) → Successful operation

---

## Future Considerations

### Optional Features (Post-v1.0)

#### 1. Advanced Voice Control
- **Pitch adjustment:** Allow clients to specify pitch modification
- **Volume control:** Per-request volume settings
- **Emotion/tone control:** Happy, sad, angry voice modulation (if the TTS engine supports it)
- **Voice cloning:** Custom voice model training (Coqui TTS integration)

**Implementation Complexity:** Medium
**User Value:** High for accessibility and personalization

---

#### 2. Audio Format Options
- **Output format selection:** Support WAV, MP3, OGG output
- **Sample rate options:** Allow 16kHz, 22kHz, 44.1kHz selection
- **Compression levels:** Configurable audio quality vs file size

**Implementation Complexity:** Low
**User Value:** Medium (mostly for file storage use cases)

---

#### 3. Streaming Audio
- **Real-time streaming:** Stream audio as it's generated (WebSocket or SSE)
- **Chunked TTS:** Generate and stream long texts in chunks
- **Lower latency:** Start playback before full text is synthesized

**Implementation Complexity:** High
**User Value:** High for very long texts

---

#### 4. SSML Support
- **Prosody control:** Fine-grained control over speech characteristics
- **Break insertion:** Explicit pauses and timing control
- **Phoneme specification:** Correct pronunciation for unusual words
- **Multi-voice support:** Different voices within a single text

**Example:**
```xml
<speak>
    Hello, <break time="500ms"/> this is <emphasis>important</emphasis>.
    <voice name="en_US-libritts">A different voice.</voice>
</speak>
```

**Implementation Complexity:** Medium
**User Value:** High for advanced use cases

---

#### 5. Caching Layer
- **TTS result caching:** Cache frequently requested texts
- **Cache invalidation:** LRU eviction policy
- **Cache persistence:** Store cache across restarts
- **Cache statistics:** Hit rate monitoring

**Implementation Complexity:** Low
**User Value:** High for repeated texts (notifications, alerts)

**Sample Implementation:**

```python
import hashlib

class TTSCache:
    def __init__(self, max_size: int = 1000):
        self.cache = {}
        self.max_size = max_size

    def get_cache_key(self, text: str, voice: str, rate: int) -> str:
        """Generate cache key from TTS parameters."""
        content = f"{text}|{voice}|{rate}"
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, text: str, voice: str, rate: int):
        """Retrieve cached audio."""
        key = self.get_cache_key(text, voice, rate)
        return self.cache.get(key)

    def put(self, text: str, voice: str, rate: int, audio_data):
        """Store audio in cache with LRU eviction."""
        if len(self.cache) >= self.max_size:
            # Evict oldest entry (simple FIFO; use OrderedDict for true LRU)
            self.cache.pop(next(iter(self.cache)))

        key = self.get_cache_key(text, voice, rate)
        self.cache[key] = audio_data
```
|
|
|
|
---

#### 6. Multi-Language Support

- **Automatic language detection:** Detect the input language
- **Language-specific voice selection:** Match voice to detected language
- **Mixed-language support:** Handle multilingual texts

**Implementation Complexity:** Medium
**User Value:** High for international users
---

#### 7. Audio Effects

- **Reverb:** Add spatial audio effects
- **Echo:** Add echo effects
- **Speed adjustment:** Time-stretch without pitch change
- **Normalization:** Automatic volume leveling

**Implementation Complexity:** Medium (requires an audio processing library such as `pydub` or `librosa`)
**User Value:** Medium (aesthetic enhancement)
---

#### 8. Queue Priority System

- **Priority levels:** High, normal, and low priority requests
- **Priority queues:** Separate queues for different priorities
- **Preemption:** Allow high-priority requests to interrupt low-priority ones

**Implementation Complexity:** Medium
**User Value:** Medium for multi-tenant scenarios
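A single heap can stand in for separate per-priority queues: entries sort by (priority level, arrival counter), so higher-priority requests jump ahead while same-priority requests stay FIFO. The class and level names below are illustrative, not part of the spec.

```python
import heapq
import itertools

# Lower number = higher priority; the counter keeps FIFO order within a level.
PRIORITIES = {"high": 0, "normal": 1, "low": 2}


class PriorityTTSQueue:
    def __init__(self):
        self._heap: list[tuple[int, int, str]] = []
        self._counter = itertools.count()

    def push(self, text: str, priority: str = "normal") -> None:
        entry = (PRIORITIES[priority], next(self._counter), text)
        heapq.heappush(self._heap, entry)

    def pop(self) -> str:
        # Raises IndexError when empty, mirroring queue semantics.
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

The same ordering tuple works with `asyncio.PriorityQueue` if the async queue worker from the main design is kept; preemption of audio already playing is a separate (and harder) problem.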
---

#### 9. Webhook Notifications

- **Completion webhooks:** Notify an external service when TTS completes
- **Error webhooks:** Notify on TTS failures
- **Webhook retry logic:** Handle webhook delivery failures

**Example Request:**

```json
{
  "message": "Hello world",
  "webhook_url": "https://example.com/tts-complete"
}
```

**Implementation Complexity:** Low
**User Value:** High for integration scenarios
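The retry logic could look like the sketch below. The injected `send` callable and the backoff constants are assumptions for illustration; a real build would POST to `webhook_url` with `httpx` or `requests` and make the delays configurable.

```python
import time
from typing import Callable


def deliver_webhook(
    send: Callable[[], bool],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `send` until it reports success, with exponential backoff.

    `send` returns True on a 2xx response; `sleep` is injectable so tests
    don't actually wait.
    """
    for attempt in range(max_attempts):
        if send():
            return True
        if attempt < max_attempts - 1:
            # 0.5s, 1s, 2s, ... between attempts.
            sleep(base_delay * (2 ** attempt))
    return False
```

Keeping the HTTP call behind a callable also makes the failure path easy to unit-test without a network.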
---

#### 10. Authentication & Authorization

- **API key authentication:** Secure endpoint access
- **Rate limiting:** Per-user request limits
- **Usage quotas:** Daily/monthly request quotas
- **Multi-tenant support:** Isolated queues per user

**Implementation Complexity:** High
**User Value:** High for shared/production deployments
---

#### 11. Web Interface

- **Simple web UI:** Browser-based TTS interface
- **Queue visualization:** Real-time queue status display
- **Voice model management:** Upload/download voice models via the UI
- **Settings configuration:** Web-based configuration editor

**Implementation Complexity:** Medium
**User Value:** High for non-technical users
---

#### 12. Docker Deployment

- **Dockerfile:** Container image for easy deployment
- **Docker Compose:** Multi-container setup with monitoring
- **Volume management:** Persistent voice model storage
- **Health check integration:** Container health monitoring

**Sample Dockerfile:**

```dockerfile
FROM python:3.11-slim

# Install system dependencies (PortAudio for sounddevice, wget for model download)
RUN apt-get update && apt-get install -y \
    libportaudio2 \
    portaudio19-dev \
    wget \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY app/ ./app/

# Download the default voice model (.onnx plus its .onnx.json config) from the
# Piper voices repository on Hugging Face; pin exact revisions for reproducibility
RUN mkdir -p /app/voices && \
    wget -q -O /app/voices/en_US-lessac-medium.onnx \
      "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx" && \
    wget -q -O /app/voices/en_US-lessac-medium.onnx.json \
      "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json"

EXPOSE 8888

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8888"]
```

**Implementation Complexity:** Low
**User Value:** High for deployment consistency
---

#### 13. Metrics & Monitoring

- **Prometheus metrics:** Request count, latency, queue size
- **Grafana dashboards:** Visual monitoring
- **Alerting:** Notify on errors, high queue size, etc.
- **Performance profiling:** Identify bottlenecks

**Sample Metrics:**

```python
from prometheus_client import Counter, Histogram, Gauge

request_counter = Counter('tts_requests_total', 'Total TTS requests')
latency_histogram = Histogram('tts_latency_seconds', 'TTS latency')
queue_size_gauge = Gauge('tts_queue_size', 'Current queue size')


# `app`, `NotifyRequest`, and `tts_queue` come from the main application module.
@app.post("/notify")
async def notify(request: NotifyRequest):
    request_counter.inc()
    with latency_histogram.time():
        # Process the request (enqueue text for TTS)
        ...
    queue_size_gauge.set(tts_queue.qsize())
```

**Implementation Complexity:** Medium
**User Value:** High for production deployments
---

### Scalability Considerations

**Horizontal Scaling:**
- Use Redis for a shared queue across multiple server instances
- Implement distributed locking for audio device access
- Load-balance requests across multiple servers

**Vertical Scaling:**
- Increase queue size for higher throughput
- Use GPU acceleration for TTS (CUDA support in Piper)
- Optimize voice model loading (keep models in memory)

**Architecture Evolution:**
- Separate TTS generation and audio playback into microservices
- Use a message queue (RabbitMQ, Kafka) for request distribution
- Implement a worker pool for parallel TTS generation
---

## Appendix: References

### Technical Documentation

- [FastAPI Official Documentation](https://fastapi.tiangolo.com/)
- [Piper TTS GitHub Repository](https://github.com/rhasspy/piper)
- [PyAudio Documentation](https://people.csail.mit.edu/hubert/pyaudio/docs/)
- [Uvicorn Documentation](https://www.uvicorn.org/)

### Research & Comparisons

- [FastAPI vs Flask Performance Comparison - Strapi](https://strapi.io/blog/fastapi-vs-flask-python-framework-comparison)
- [Flask vs FastAPI - Better Stack](https://betterstack.com/community/guides/scaling-python/flask-vs-fastapi/)
- [Python TTS Engines Comparison - Smallest AI](https://smallest.ai/blog/python-packages-realistic-text-to-speech)
- [TTS Converters for Raspberry Pi - Circuit Digest](https://circuitdigest.com/microcontroller-projects/best-text-to-speech-tts-converter-for-raspberry-pi-espeak-festival-google-tts-pico-and-pyttsx3)
- [Piper TTS Tutorial - RMauro Dev](https://rmauro.dev/how-to-run-piper-tts-on-your-raspberry-pi-offline-voice-zero-internet-needed/)
- [Python Audio Playback - simpleaudio Docs](https://simpleaudio.readthedocs.io/)

### Tools & Libraries

- [uv - Fast Python Package Manager](https://github.com/astral-sh/uv)
- [pytest - Testing Framework](https://docs.pytest.org/)
- [locust - Load Testing](https://locust.io/)
---

## Document History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2025-12-18 | Atlas | Initial PRD creation |
---

**Document Status:** ✅ Complete - Ready for Implementation

**Next Steps:**

1. Review the PRD with stakeholders
2. Approve technical stack decisions
3. Begin Phase 1 implementation
4. Set up project tracking (GitHub Issues, Jira, etc.)
5. Assign development resources

**Questions or Feedback:** Contact Atlas at [atlas@manticorum.com]