Custom TTS integration
Learn to integrate your own text-to-speech system with VAPI
Overview
Integrate your own Text-to-Speech (TTS) system with VAPI Assistant for complete control over voice synthesis. Whether you need brand-specific voices, advanced audio quality, or cost optimization, custom TTS gives you the flexibility to use any TTS provider while maintaining real-time performance.
In this guide, you’ll learn to:
- Set up webhook authentication between VAPI and your TTS endpoint
- Build a TTS server that handles VAPI’s audio requirements
- Process text requests and return properly formatted audio
- Handle edge cases and troubleshoot common issues
Custom TTS maintains VAPI’s real-time performance while giving you complete flexibility over voice synthesis, language support, and audio quality.
How custom TTS works
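VAPI’s custom TTS system operates through a webhook pattern: VAPI sends a POST request containing the text to synthesize to your endpoint, your server generates the audio, and the response returns raw PCM data that VAPI streams into the call.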
What you’ll need
- Web server that can receive POST requests and return audio data
- TTS system (cloud API, local model, or custom solution)
- VAPI account with access to custom voice configuration
Your TTS system must generate audio that meets specific PCM format requirements to ensure proper playback quality.
Authentication setup
VAPI needs secure communication with your TTS endpoint. Choose from these authentication options:
Secret header authentication
The most common approach uses a secret token in the X-VAPI-SECRET header:
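On the server side, the check can be sketched as follows. This is a minimal example, not an official implementation: the function name `hasValidSecret` is hypothetical, and it assumes a Node.js server, where incoming header names are exposed in lowercase.

```javascript
const crypto = require('crypto');

// Returns true when the request carries the expected shared secret.
// Node.js lowercases incoming header names, so X-VAPI-SECRET arrives
// as 'x-vapi-secret'.
function hasValidSecret(headers, expectedSecret) {
  const provided = headers['x-vapi-secret'];
  if (typeof provided !== 'string' || !expectedSecret) return false;

  // Hash both values so the buffers have equal length, which
  // crypto.timingSafeEqual requires.
  const a = crypto.createHash('sha256').update(provided).digest();
  const b = crypto.createHash('sha256').update(expectedSecret).digest();
  return crypto.timingSafeEqual(a, b);
}
```

A constant-time comparison is used here so that response timing does not leak how much of the token matched. A plain `===` comparison also works functionally.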
Enhanced authentication with custom headers
Add extra headers for API versioning or enhanced security:
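A server could verify such extra headers along these lines. The header names and values below (`x-api-version`, `x-client-id`) are purely illustrative assumptions, as is the helper name `findHeaderProblems`. Substitute whatever headers you configure.

```javascript
// Hypothetical extra headers; the names and values are illustrative only.
const REQUIRED_HEADERS = {
  'x-api-version': 'v1',
  'x-client-id': 'vapi-assistant',
};

// Returns the names of required headers that are missing or mismatched.
// Header names are lowercased because Node.js exposes them that way.
function findHeaderProblems(headers, required = REQUIRED_HEADERS) {
  return Object.entries(required)
    .filter(([name, expected]) => headers[name.toLowerCase()] !== expected)
    .map(([name]) => name);
}
```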
Enterprise customers can use OAuth2 authentication through webhook credentials for larger deployments.
Building your TTS integration
Configure your VAPI assistant
Set up your assistant to use your custom TTS endpoint with fallback options:
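The shape of such a configuration might look like the sketch below. Only `timeoutSeconds` is named in this guide. Every other field name (`provider`, `server`, `secret`, `headers`, `fallbackPlan`), the URL, and all values are assumptions for illustration and should be checked against the VAPI API reference.

```javascript
// Sketch of a voice configuration for an assistant. Only timeoutSeconds is
// named in this guide; every other field name here is an assumption.
const voiceConfig = {
  provider: 'custom-voice', // assumed provider identifier
  server: {
    url: 'https://tts.example.com/synthesize', // placeholder endpoint
    secret: process.env.VAPI_TTS_SECRET || 'replace-me', // shared secret
    timeoutSeconds: 30, // illustrative value
    headers: { 'X-API-Version': 'v1' }, // optional extra headers (assumed)
  },
  fallbackPlan: {
    // Assumed shape for fallback voices used if the endpoint fails.
    voices: [{ provider: 'example-provider', voiceId: 'example-voice' }],
  },
};

console.log(JSON.stringify({ voice: voiceConfig }, null, 2));
```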
Build your TTS server
Here’s a complete Node.js implementation that handles all requirements:
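A minimal sketch of such a server, using only Node.js built-ins, is shown here. It assumes the request body is JSON with a `message` envelope carrying `text` and `sampleRate`. The envelope name, the `VAPI_TTS_SECRET` environment variable, and all function names are assumptions. The `synthesize` function is a placeholder that emits a test tone, so replace it with a call to your real TTS system.

```javascript
const http = require('http');

// Shared secret; the environment variable name is an assumption.
const SECRET = process.env.VAPI_TTS_SECRET || 'replace-me';

// Placeholder synthesizer: returns a 440 Hz tone as 16-bit little-endian
// mono PCM. Swap in a call to your real TTS system here.
function synthesize(text, sampleRate) {
  const seconds = Math.min(3, Math.max(0.2, text.length * 0.05));
  const total = Math.floor(seconds * sampleRate);
  const pcm = Buffer.alloc(total * 2); // 2 bytes per sample, 1 channel
  for (let i = 0; i < total; i++) {
    const sample = Math.sin((2 * Math.PI * 440 * i) / sampleRate) * 0.3;
    pcm.writeInt16LE(Math.round(sample * 32767), i * 2);
  }
  return pcm;
}

function reply(status, text) {
  return { status, headers: { 'Content-Type': 'text/plain' }, body: Buffer.from(text) };
}

// Core logic, kept apart from the HTTP plumbing so it is easy to test.
function handleTtsRequest(headers, rawBody) {
  if (headers['x-vapi-secret'] !== SECRET) return reply(401, 'Unauthorized');

  let message;
  try {
    message = JSON.parse(rawBody).message; // assumed envelope
  } catch {
    return reply(400, 'Invalid JSON');
  }

  const { text, sampleRate } = message || {};
  if (typeof text !== 'string' || text.trim() === '' ||
      !Number.isInteger(sampleRate) || sampleRate <= 0) {
    return reply(400, 'Expected message.text and message.sampleRate');
  }

  const body = synthesize(text, sampleRate);
  return {
    status: 200,
    headers: { 'Content-Type': 'application/octet-stream', 'Content-Length': body.length },
    body,
  };
}

// Builds the HTTP server without starting it.
function createServer() {
  return http.createServer((req, res) => {
    if (req.method !== 'POST') {
      res.writeHead(405).end();
      return;
    }
    const chunks = [];
    req.on('data', (chunk) => chunks.push(chunk));
    req.on('end', () => {
      const result = handleTtsRequest(req.headers, Buffer.concat(chunks).toString('utf8'));
      res.writeHead(result.status, result.headers);
      res.end(result.body);
    });
  });
}
```

To run it, call `createServer().listen(3000)` and point your assistant configuration at the resulting URL.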
Handle text processing
Implement proper text validation and preprocessing:
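One possible shape for this step is sketched below. The function name `prepareText` and the 2000-character limit are assumptions chosen for illustration. Pick limits that suit your TTS system.

```javascript
const MAX_TEXT_LENGTH = 2000; // illustrative limit, tune for your TTS system

// Validates and normalizes text before synthesis.
function prepareText(input) {
  if (typeof input !== 'string') {
    throw new TypeError('text must be a string');
  }
  const cleaned = input
    .normalize('NFC') // canonical Unicode form
    .replace(/[\u0000-\u001f\u007f]/g, ' ') // control characters become spaces
    .replace(/\s+/g, ' ') // collapse runs of whitespace
    .trim();
  if (cleaned.length === 0) {
    throw new RangeError('text is empty');
  }
  // Truncate overly long input rather than failing the whole request.
  return cleaned.slice(0, MAX_TEXT_LENGTH);
}
```

For example, `prepareText('  Hello,\n\tworld!  ')` returns `'Hello, world!'`.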
Request and response formats
VAPI request structure
Every TTS request from VAPI follows this format:
Required fields
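A sketch of validating an incoming request follows. This guide confirms that a request carries the text to synthesize and a `sampleRate`. The `message` envelope, the `type` value `voice-request`, and the helper name `validateVoiceRequest` are assumptions, so verify them against a real request before relying on them.

```javascript
// Assumed request shape:
// { "message": { "type": "voice-request", "text": "...", "sampleRate": 24000 } }
//
// Returns a list of validation errors (empty when the request is usable).
function validateVoiceRequest(body) {
  const message = body && body.message;
  if (!message || typeof message !== 'object') {
    return ['missing message object'];
  }
  const errors = [];
  if (message.type !== 'voice-request') {
    errors.push('unexpected message.type');
  }
  if (typeof message.text !== 'string' || message.text.trim() === '') {
    errors.push('message.text must be a non-empty string');
  }
  if (!Number.isInteger(message.sampleRate) || message.sampleRate <= 0) {
    errors.push('message.sampleRate must be a positive integer');
  }
  return errors;
}
```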
Your response requirements
Your endpoint must respond with:
- HTTP 200 status
- Content-Type: application/octet-stream
- Raw PCM audio data in the response body
Audio format requirements
PCM specifications
Your TTS system must generate audio with these exact specifications:
- Format: Raw PCM (no headers or containers)
- Channels: 1 (mono only)
- Bit Depth: 16-bit signed integer
- Byte Order: Little-endian
- Sample Rate: Must exactly match the sampleRate in the request
Because VAPI streams audio in real time during phone calls, any deviation from these specifications will cause audio distortion, playback failures, or call quality issues.
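Many TTS engines produce floating-point samples, so a conversion step is often needed. The sketch below shows one way to produce the format listed above, assuming mono input samples in the range -1 to 1. Both function names are hypothetical.

```javascript
// Converts float samples in [-1, 1] to raw 16-bit signed little-endian
// mono PCM with no header.
function floatToPcm16le(samples) {
  const pcm = Buffer.alloc(samples.length * 2); // 2 bytes per sample, 1 channel
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    // Scale asymmetrically so -1 maps to -32768 and +1 maps to 32767.
    const value = clamped < 0 ? clamped * 32768 : clamped * 32767;
    pcm.writeInt16LE(Math.round(value), i * 2);
  }
  return pcm;
}

// Expected byte count for a clip: seconds * sampleRate * 2 bytes * 1 channel.
function expectedPcmBytes(seconds, sampleRate) {
  return Math.round(seconds * sampleRate) * 2;
}
```

One second of audio at a `sampleRate` of 24000 is therefore 48000 bytes. A response whose size is far from the expected value usually points to a wrong sample rate, bit depth, or channel count.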
Testing your integration
Create a test call
Use VAPI’s API to create a test call that exercises your TTS system:
Monitor TTS requests
Set up logging to see exactly what VAPI sends to your endpoint:
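A logging helper might look like this sketch. It assumes the same `message` envelope as elsewhere in this guide's examples, and the name `describeRequest` is hypothetical. It records whether a secret was present without writing the secret itself to the log.

```javascript
// Builds a log-friendly summary of an incoming TTS request.
function describeRequest(headers, body) {
  const message = (body && body.message) || {}; // assumed envelope
  const text = typeof message.text === 'string' ? message.text : '';
  return {
    receivedAt: new Date().toISOString(),
    // Record presence only; never log the secret value.
    hasSecret: typeof headers['x-vapi-secret'] === 'string',
    type: message.type,
    sampleRate: message.sampleRate,
    textLength: text.length,
    textPreview: text.slice(0, 80),
  };
}
```

Inside a request handler, `console.log(JSON.stringify(describeRequest(req.headers, parsedBody)))` emits one structured line per request.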
Quick endpoint test
Test your endpoint directly before full integration:
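A direct probe could be sketched as follows. It requires Node.js 18 or later for the global `fetch`. The payload shape and the function names `checkTtsResponse` and `probeEndpoint` are assumptions. The checks themselves come from the response and audio requirements in this guide.

```javascript
// Checks a response against the requirements in this guide and returns a
// list of problems (empty when everything looks right).
function checkTtsResponse(status, contentType, body) {
  const problems = [];
  if (status !== 200) {
    problems.push(`expected HTTP 200, got ${status}`);
  }
  if (!/^application\/octet-stream\b/i.test(contentType || '')) {
    problems.push(`unexpected Content-Type: ${contentType}`);
  }
  if (body.length === 0) {
    problems.push('empty body');
  }
  if (body.length % 2 !== 0) {
    problems.push('odd byte count: not whole 16-bit samples');
  }
  if (body.length >= 4 && body.toString('latin1', 0, 4) === 'RIFF') {
    problems.push('body starts with a WAV header; send raw PCM');
  }
  return problems;
}

// Sends one request to the endpoint and reports problems plus clip length.
async function probeEndpoint(url, secret, sampleRate = 24000) {
  const response = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'X-VAPI-SECRET': secret },
    body: JSON.stringify({
      // Assumed payload shape.
      message: { type: 'voice-request', text: 'Hello from a test.', sampleRate },
    }),
  });
  const body = Buffer.from(await response.arrayBuffer());
  const problems = checkTtsResponse(response.status, response.headers.get('content-type'), body);
  return { problems, seconds: body.length / 2 / sampleRate };
}
```

Calling `probeEndpoint('http://localhost:3000/', 'your-secret')` against a local server should resolve with an empty `problems` list and a plausible duration in `seconds`.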
Troubleshooting
Request timeouts
Symptoms: VAPI doesn’t receive your audio response, and calls may drop
Common causes:
- TTS processing takes longer than configured timeout
- Network connectivity issues between VAPI and your server
- Server overload or unresponsiveness
Solutions:
- Increase timeoutSeconds in your assistant configuration
- Optimize your TTS processing speed
- Implement proper error handling and timeout protection
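Timeout protection on the server side can be sketched with a small wrapper around the synthesis call. The helper name `withTimeout` is hypothetical. Note that it only stops waiting; it does not cancel the underlying TTS work.

```javascript
// Rejects if `promise` has not settled within `ms` milliseconds.
// The underlying work is not cancelled; only the wait is abandoned.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`TTS timed out after ${ms} ms`)), ms);
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Choosing a limit somewhat below the configured `timeoutSeconds` lets your server return an error response before VAPI gives up on the request.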
Audio playback problems
Symptoms: No audio during calls, or distorted/garbled sound
Common causes:
- Wrong audio format (not raw PCM)
- Incorrect sample rate
- Sending stereo audio instead of mono
- Including audio file headers in response
Solutions:
- Ensure raw PCM format with no headers
- Match the exact sample rate from the request
- Generate mono audio only
- Verify 16-bit little-endian format
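A frequent source of stray headers is a TTS provider that returns WAV files. The sketch below extracts the raw samples from a WAV container by walking its RIFF chunks. The function name `stripWavHeader` is hypothetical. It only removes the container, so the audio inside must already be 16-bit mono at the requested sample rate.

```javascript
// If `audio` is a WAV file, returns only the bytes of its "data" chunk;
// otherwise returns the buffer unchanged.
function stripWavHeader(audio) {
  const isWav =
    audio.length >= 12 &&
    audio.toString('ascii', 0, 4) === 'RIFF' &&
    audio.toString('ascii', 8, 12) === 'WAVE';
  if (!isWav) return audio;

  let offset = 12; // first chunk follows the 12-byte RIFF/WAVE preamble
  while (offset + 8 <= audio.length) {
    const id = audio.toString('ascii', offset, offset + 4);
    const size = audio.readUInt32LE(offset + 4);
    const start = offset + 8;
    if (id === 'data') {
      return audio.subarray(start, Math.min(start + size, audio.length));
    }
    offset = start + size + (size % 2); // chunks are padded to even length
  }
  throw new Error('WAV file has no data chunk');
}
```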
Authentication failures
Symptoms: 401 Unauthorized responses in your server logs
Common causes:
- Secret token mismatch between VAPI config and server
- Missing X-VAPI-SECRET header validation
- Case sensitivity issues with header names
Solutions:
- Verify secret token matches exactly
- Implement proper header validation
- Match header names case-insensitively (HTTP header names are case-insensitive, and some servers normalize them to lowercase)
High latency
Symptoms: Noticeable delays during conversations
Common causes:
- TTS model loading time on first request
- Complex text processing before synthesis
- Network latency between services
Solutions:
- Pre-load TTS models at startup
- Optimize text preprocessing
- Use faster TTS models for real-time performance
- Consider geographic proximity of services
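Pre-loading can be sketched as a shared promise that is started at boot rather than on the first call. Here `loadModel` is a hypothetical stand-in for whatever initialization your TTS system needs, and the returned object is a dummy.

```javascript
// Hypothetical loader standing in for your TTS system's startup work.
async function loadModel() {
  return { synthesize: async (text) => Buffer.alloc(text.length * 2) };
}

let modelPromise = null;

// Returns the same promise to every caller, so the model loads only once.
function getModel() {
  if (!modelPromise) modelPromise = loadModel();
  return modelPromise;
}

// Kick off loading at startup instead of during the first phone call.
getModel();
```

Request handlers then `await getModel()` and pay the loading cost only if a request arrives before startup has finished.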
Next steps
Now that you have custom TTS integration working:
- Advanced features: Explore SSML support for enhanced voice control
- Performance optimization: Implement caching and model warming strategies
- Voice cloning: Integrate voice cloning APIs for personalized experiences
- Multi-language support: Add language detection and voice switching
Consider implementing a fallback voice provider in your assistant configuration to ensure call continuity if your custom TTS endpoint experiences issues.