Custom TTS integration

Learn to integrate your own text-to-speech system with VAPI

Overview

Integrate your own Text-to-Speech (TTS) system with VAPI Assistant for complete control over voice synthesis. Whether you need brand-specific voices, advanced audio quality, or cost optimization, custom TTS gives you the flexibility to use any TTS provider while maintaining real-time performance.

In this guide, you’ll learn to:

  • Set up webhook authentication between VAPI and your TTS endpoint
  • Build a TTS server that handles VAPI’s audio requirements
  • Process text requests and return properly formatted audio
  • Handle edge cases and troubleshoot common issues

Custom TTS maintains VAPI’s real-time performance while giving you complete flexibility over voice synthesis, language support, and audio quality.

How custom TTS works

VAPI’s custom TTS system operates through a webhook pattern:

1. Text conversion trigger: during a conversation, VAPI needs to convert text to speech.

2. Request to your endpoint: VAPI sends a POST request to your TTS endpoint with the text and audio specifications.

3. Audio generation: your system generates the audio and returns it as raw PCM data.

4. Real-time playback: VAPI streams the audio to the caller in real time.

What you’ll need

  • Web server that can receive POST requests and return audio data
  • TTS system (cloud API, local model, or custom solution)
  • VAPI account with access to custom voice configuration

Your TTS system must generate audio that meets specific PCM format requirements to ensure proper playback quality.

Authentication setup

VAPI needs secure communication with your TTS endpoint. Choose from these authentication options:

Secret header authentication

The most common approach uses a secret token in the X-VAPI-SECRET header:

Assistant Configuration
{
  "voice": {
    "provider": "custom-voice",
    "server": {
      "url": "https://your-tts-api.com/synthesize",
      "secret": "your-secret-token-here",
      "timeoutSeconds": 30
    }
  }
}
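
On the server side, you can verify this header before any synthesis work runs. The middleware below is a minimal sketch; the TTS_SHARED_SECRET environment variable is an assumed name for wherever you store the expected token.

Secret Validation Middleware (sketch)
// Sketch: reject requests whose X-VAPI-SECRET header doesn't match the shared secret.
// TTS_SHARED_SECRET is an assumed environment variable name for this example.
app.use('/api/synthesize', (req, res, next) => {
  const provided = req.headers['x-vapi-secret']; // Node lowercases incoming header names
  if (!provided || provided !== process.env.TTS_SHARED_SECRET) {
    return res.status(401).json({ error: 'Invalid or missing X-VAPI-SECRET header' });
  }
  next();
});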

Enhanced authentication with custom headers

Add extra headers for API versioning or enhanced security:

Assistant Configuration with Custom Headers
{
  "voice": {
    "provider": "custom-voice",
    "server": {
      "url": "https://your-tts-api.com/synthesize",
      "secret": "your-secret-token",
      "headers": {
        "X-API-Version": "v1",
        "X-Client-ID": "vapi-integration"
      }
    }
  }
}

Enterprise customers can use OAuth2 authentication through webhook credentials for larger deployments.

Building your TTS integration

Configure your VAPI assistant

Set up your assistant to use your custom TTS endpoint with fallback options:

Complete Assistant Configuration
{
  "name": "Custom Voice Assistant",
  "voice": {
    "provider": "custom-voice",
    "server": {
      "url": "https://your-tts-endpoint.com/api/synthesize",
      "secret": "your-webhook-secret",
      "timeoutSeconds": 45,
      "headers": {
        "Content-Type": "application/json",
        "X-API-Version": "v1"
      }
    },
    "fallbackPlan": {
      "voices": [
        {
          "provider": "eleven-labs",
          "voiceId": "21m00Tcm4TlvDq8ikWAM"
        }
      ]
    }
  }
}
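
If you prefer to apply this configuration programmatically rather than through the dashboard, the sketch below posts it to VAPI's assistant-creation endpoint. It assumes the standard POST https://api.vapi.ai/assistant route and a VAPI_API_KEY environment variable; adjust the URL, names, and secret to match your setup.

Assistant Creation via API (sketch)
// Sketch: create an assistant that uses the custom-voice configuration above.
// Assumes POST https://api.vapi.ai/assistant and a VAPI_API_KEY environment variable.
async function createCustomVoiceAssistant() {
  const response = await fetch('https://api.vapi.ai/assistant', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.VAPI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      name: 'Custom Voice Assistant',
      voice: {
        provider: 'custom-voice',
        server: {
          url: 'https://your-tts-endpoint.com/api/synthesize',
          secret: 'your-webhook-secret',
          timeoutSeconds: 45,
        },
      },
    }),
  });

  if (!response.ok) {
    throw new Error(`Assistant creation failed: ${await response.text()}`);
  }
  return response.json();
}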

Build your TTS server

Here’s a complete Node.js implementation that handles all requirements:

tts-server.js
const express = require('express');
const crypto = require('crypto');

const app = express();
app.use(express.json({ limit: '50mb' }));

// Main TTS endpoint
app.post('/api/synthesize', async (req, res) => {
  const requestId = crypto.randomUUID();
  const startTime = Date.now();

  // Set up timeout protection
  const timeout = setTimeout(() => {
    if (!res.headersSent) {
      res.status(408).json({ error: 'Request timeout' });
    }
  }, 30000);

  try {
    console.log(`TTS request started: ${requestId}`);

    // Extract and validate the request
    const { message } = req.body;
    if (!message) {
      clearTimeout(timeout);
      return res.status(400).json({ error: 'Missing message object' });
    }

    const { type, text, sampleRate } = message;

    // Validate message type
    if (type !== 'voice-request') {
      clearTimeout(timeout);
      return res.status(400).json({ error: 'Invalid message type' });
    }

    // Validate text content
    if (!text || typeof text !== 'string' || text.trim().length === 0) {
      clearTimeout(timeout);
      return res.status(400).json({ error: 'Invalid or missing text' });
    }

    // Validate sample rate
    const validSampleRates = [8000, 16000, 22050, 24000, 44100];
    if (!validSampleRates.includes(sampleRate)) {
      clearTimeout(timeout);
      return res.status(400).json({
        error: 'Unsupported sample rate',
        supportedRates: validSampleRates,
      });
    }

    console.log(
      `Synthesizing: ${requestId}, length=${text.length}, rate=${sampleRate}Hz`
    );

    // Generate the audio (replace with your TTS implementation)
    const audioBuffer = await synthesizeAudio(text, sampleRate);

    if (!audioBuffer || audioBuffer.length === 0) {
      throw new Error('TTS synthesis produced no audio');
    }

    clearTimeout(timeout);

    // Return raw PCM audio to VAPI
    res.setHeader('Content-Type', 'application/octet-stream');
    res.setHeader('Content-Length', audioBuffer.length);
    res.write(audioBuffer);
    res.end();

    const duration = Date.now() - startTime;
    console.log(
      `TTS completed: ${requestId}, ${duration}ms, ${audioBuffer.length} bytes`
    );
  } catch (error) {
    clearTimeout(timeout);
    const duration = Date.now() - startTime;
    console.error(`TTS failed: ${requestId}, ${duration}ms, ${error.message}`);

    if (!res.headersSent) {
      res.status(500).json({ error: 'TTS synthesis failed', requestId });
    }
  }
});

// Replace with your actual TTS implementation
async function synthesizeAudio(text, sampleRate) {
  // Example: Call your TTS API here
  // const audioBuffer = await yourTTSProvider.synthesize(text, sampleRate);

  // Demo implementation (replace with real TTS)
  await new Promise(resolve => setTimeout(resolve, 100));

  const duration = Math.min(text.length * 0.1, 10); // Max 10 seconds
  const samples = Math.floor(duration * sampleRate);
  const buffer = Buffer.alloc(samples * 2); // 16-bit = 2 bytes per sample

  for (let i = 0; i < samples; i++) {
    const value = Math.sin((2 * Math.PI * 440 * i) / sampleRate) * 16000;
    buffer.writeInt16LE(Math.round(value), i * 2);
  }

  return buffer;
}

// Error handling middleware
app.use((error, req, res, next) => {
  console.error('Unhandled error:', error.message);
  res.status(500).json({ error: 'Internal server error' });
});

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`TTS server listening on port ${PORT}`);
});

module.exports = app;

Handle text processing

Implement proper text validation and preprocessing:

Text Processing Functions
function preprocessText(text) {
  // Handle SSML tags if your TTS supports them
  if (text.includes('<speak>')) {
    return parseSSML(text);
  }

  // Clean up problematic characters
  return text
    .replace(/[^\w\s\.,!?-]/g, '') // Remove special characters
    .replace(/\s+/g, ' ') // Normalize whitespace
    .trim();
}

function validateText(text) {
  if (!text || text.trim().length === 0) {
    throw new Error('Empty text provided');
  }

  if (!/[a-zA-Z]/.test(text)) {
    throw new Error('No readable text found');
  }

  return text.trim();
}
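
preprocessText above hands SSML input to a parseSSML helper that is not shown. If your TTS engine does not accept SSML directly, one minimal fallback is to strip the markup and synthesize the remaining plain text; the sketch below does only that and is not a full SSML parser.

Minimal SSML Fallback (sketch)
// Sketch: strip SSML/XML tags and keep the spoken text.
// Not a real SSML parser; it ignores <break>, prosody, and other semantics.
function parseSSML(ssml) {
  return ssml
    .replace(/<[^>]+>/g, ' ') // Drop all tags
    .replace(/\s+/g, ' ') // Normalize leftover whitespace
    .trim();
}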

Request and response formats

VAPI request structure

Every TTS request from VAPI follows this format:

VAPI TTS Request
{
  "message": {
    "type": "voice-request",
    "text": "Hello, world! How can I help you today?",
    "sampleRate": 24000,
    "timestamp": 1677123456789,
    "call": {
      "id": "call-123",
      "orgId": "org-456"
    },
    "assistant": {
      "id": "assistant-789",
      "name": "Customer Service Bot"
    },
    "customer": {
      "number": "+1234567890"
    }
  }
}

Required fields

Field        Type     Description
type         string   Always "voice-request"
text         string   Text to synthesize
sampleRate   number   Target audio sample rate (8000, 16000, 22050, or 24000 Hz)
timestamp    number   Unix timestamp in milliseconds

Your response requirements

Your endpoint must respond with:

  • HTTP 200 status
  • Content-Type: application/octet-stream
  • Raw PCM audio data in the response body

Correct Response Headers
HTTP/1.1 200 OK
Content-Type: application/octet-stream
Transfer-Encoding: chunked

[Raw PCM audio bytes]

Audio format requirements

PCM specifications

Your TTS system must generate audio with these exact specifications:

  • Format: Raw PCM (no headers or containers)
  • Channels: 1 (mono only)
  • Bit Depth: 16-bit signed integer
  • Byte Order: Little-endian
  • Sample Rate: Must exactly match the sampleRate in the request

Any deviation from these specifications will cause audio distortion, playback failures, or call quality issues. VAPI streams audio in real-time during phone calls.
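
Many TTS SDKs return 32-bit float samples or WAV files rather than raw 16-bit PCM. The sketch below shows one way to convert mono Float32 samples into the little-endian 16-bit buffer VAPI expects; if your provider returns WAV, you also need to strip the header (44 bytes for a canonical PCM WAV file) before responding.

Float to 16-bit PCM Conversion (sketch)
// Sketch: convert mono Float32 samples in the range [-1, 1] to 16-bit little-endian PCM.
function floatToPCM16LE(samples) {
  const buffer = Buffer.alloc(samples.length * 2); // 2 bytes per 16-bit sample
  for (let i = 0; i < samples.length; i++) {
    // Clamp before scaling so loud samples don't overflow the 16-bit range
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    buffer.writeInt16LE(Math.round(clamped * 32767), i * 2);
  }
  return buffer;
}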

Testing your integration

Create a test call

Use VAPI’s API to create a test call that exercises your TTS system:

Test Call Creation
async function testTTSWithVAPICall() {
  const vapiApiKey = 'your-vapi-api-key';
  const assistantId = 'your-assistant-id'; // Assistant with custom TTS

  const callData = {
    assistant: { id: assistantId },
    phoneNumberId: 'your-phone-number-id',
    customer: { number: '+1234567890' }, // Your test number
  };

  try {
    const response = await fetch('https://api.vapi.ai/call', {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${vapiApiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(callData),
    });

    const call = await response.json();
    console.log('Test call created:', call.id);
    return call;
  } catch (error) {
    console.error('Failed to create test call:', error);
  }
}

Monitor TTS requests

Set up logging to see exactly what VAPI sends to your endpoint:

Request Monitoring Middleware
app.use('/api/synthesize', (req, res, next) => {
  console.log('=== Incoming TTS Request ===');
  console.log('Headers:', req.headers);
  console.log('Body:', JSON.stringify(req.body, null, 2));
  console.log('==============================');
  next();
});

Quick endpoint test

Test your endpoint directly before full integration:

Direct Endpoint Test
async function quickEndpointTest() {
  const testPayload = {
    message: {
      type: 'voice-request',
      text: 'Hello, this is a test of the custom TTS system.',
      sampleRate: 24000,
      timestamp: Date.now(),
    },
  };

  try {
    const response = await fetch('http://localhost:3000/api/synthesize', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-VAPI-SECRET': 'your-test-secret',
      },
      body: JSON.stringify(testPayload),
    });

    if (response.ok) {
      const audioBuffer = Buffer.from(await response.arrayBuffer());
      console.log(`Audio generated: ${audioBuffer.length} bytes`);
      require('fs').writeFileSync('test-output.pcm', audioBuffer);
    } else {
      console.error('Test failed:', await response.text());
    }
  } catch (error) {
    console.error('Test error:', error);
  }
}
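
You can spot-check the saved file locally with a raw-PCM-aware player, for example ffplay -f s16le -ar 24000 -ac 1 test-output.pcm. If you hear clean speech (or, for the demo implementation above, a steady tone), the format and sample rate are correct.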

Troubleshooting

Timeout errors

Symptoms: VAPI doesn't receive your audio response, and calls may drop

Common causes:

  • TTS processing takes longer than configured timeout
  • Network connectivity issues between VAPI and your server
  • Server overload or unresponsiveness

Solutions:

  • Increase timeoutSeconds in your assistant configuration
  • Optimize your TTS processing speed
  • Implement proper error handling and timeout protection

Audio quality problems

Symptoms: No audio during calls, or distorted/garbled sound

Common causes:

  • Wrong audio format (not raw PCM)
  • Incorrect sample rate
  • Sending stereo audio instead of mono
  • Including audio file headers in response

Solutions:

  • Ensure raw PCM format with no headers
  • Match the exact sample rate from the request
  • Generate mono audio only
  • Verify 16-bit little-endian format

Authentication failures

Symptoms: 401 Unauthorized responses in your server logs

Common causes:

  • Secret token mismatch between VAPI config and server
  • Missing X-VAPI-SECRET header validation
  • Case sensitivity issues with header names

Solutions:

  • Verify secret token matches exactly
  • Implement proper header validation
  • Check for case-sensitive header handling
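
Note that Node.js lowercases incoming header names, so on an Express server the secret arrives as req.headers['x-vapi-secret']; comparing against a mixed-case key will never match.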

High response latency

Symptoms: Noticeable delays during conversations

Common causes:

  • TTS model loading time on first request
  • Complex text processing before synthesis
  • Network latency between services

Solutions:

  • Pre-load TTS models at startup (see the warm-up sketch after this list)
  • Optimize text preprocessing
  • Use faster TTS models for real-time performance
  • Consider geographic proximity of services
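
A warm-up can be as simple as running one synthesis before the server starts accepting calls. The sketch below reuses the synthesizeAudio function from the server above; with a real engine you would load or prime the model and any API clients here instead.

TTS Warm-Up (sketch)
// Sketch: run one synthesis at startup so the first caller doesn't pay the
// model-loading cost. Call this from the existing app.listen callback.
async function warmUpTTS() {
  try {
    await synthesizeAudio('Warm-up request.', 24000);
    console.log('TTS warm-up complete');
  } catch (error) {
    console.error('TTS warm-up failed:', error.message);
  }
}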

Next steps

Now that you have custom TTS integration working:

  • Advanced features: Explore SSML support for enhanced voice control
  • Performance optimization: Implement caching and model warming strategies (see the cache sketch below)
  • Voice cloning: Integrate voice cloning APIs for personalized experiences
  • Multi-language support: Add language detection and voice switching
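
As a starting point for the caching item above, a small in-memory cache keyed by sample rate and text avoids re-synthesizing repeated prompts such as greetings. This is a sketch with an arbitrary size cap; a production setup typically needs an LRU or external store and some notion of expiry.

In-Memory Audio Cache (sketch)
// Sketch: cache synthesized audio for repeated prompts (greetings, IVR menus, etc.).
// MAX_CACHE_ENTRIES is an arbitrary example value; replace with an LRU for production.
const audioCache = new Map();
const MAX_CACHE_ENTRIES = 500;

async function synthesizeWithCache(text, sampleRate) {
  const key = `${sampleRate}:${text}`;
  if (audioCache.has(key)) {
    return audioCache.get(key);
  }

  const audioBuffer = await synthesizeAudio(text, sampleRate);

  if (audioCache.size >= MAX_CACHE_ENTRIES) {
    // Evict the oldest entry; Maps iterate in insertion order
    audioCache.delete(audioCache.keys().next().value);
  }
  audioCache.set(key, audioBuffer);
  return audioBuffer;
}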

Consider implementing a fallback voice provider in your assistant configuration to ensure call continuity if your custom TTS endpoint experiences issues.