Custom TTS integration

Overview

Integrate your own Text-to-Speech (TTS) system with VAPI Assistant for complete control over voice synthesis. Whether you need brand-specific voices, advanced audio quality, or cost optimization, custom TTS gives you the flexibility to use any TTS provider while maintaining real-time performance.

In this guide, you’ll learn to:

Set up webhook authentication between VAPI and your TTS endpoint
Build a TTS server that handles VAPI’s audio requirements
Process text requests and return properly formatted audio
Handle edge cases and troubleshoot common issues

Custom TTS maintains VAPI’s real-time performance while giving you complete flexibility over voice synthesis, language support, and audio quality.

How custom TTS works

VAPI’s custom TTS system operates through a webhook pattern:

Text conversion trigger

During a conversation, VAPI needs to convert text to speech

Request to your endpoint

VAPI sends a POST request to your TTS endpoint with text and audio specifications

Audio generation

Your system generates audio and returns it as raw PCM data

Real-time playback

VAPI streams the audio to the caller in real time

What you’ll need

Web server that can receive POST requests and return audio data
TTS system (cloud API, local model, or custom solution)
VAPI account with access to custom voice configuration

Your TTS system must generate audio in the exact PCM format described under "Audio format requirements" to ensure proper playback quality.

Authentication setup

VAPI needs secure communication with your TTS endpoint. Choose from these authentication options:

Secret header authentication

The most common approach uses a secret token in the X-VAPI-SECRET header:

Assistant Configuration

{
  "voice": {
    "provider": "custom-voice",
    "server": {
      "url": "https://your-tts-api.com/synthesize",
      "secret": "your-secret-token-here",
      "timeoutSeconds": 30
    }
  }
}
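
The configuration above only tells VAPI which secret to send; your server still has to check it. The following is a minimal sketch of that check, assuming the secret arrives in the X-VAPI-SECRET header described above. Node.js lowercases incoming header names, so the header is read as `x-vapi-secret`. The helper name `isAuthorized` is hypothetical, not part of any VAPI SDK.

```javascript
const crypto = require('crypto');

// Hypothetical helper: returns true only when the request carries the expected secret.
// `headers` is the Node.js request header object (names are already lowercased).
function isAuthorized(headers, expectedSecret) {
  const provided = headers['x-vapi-secret'];
  if (typeof provided !== 'string' || typeof expectedSecret !== 'string') {
    return false;
  }
  if (expectedSecret.length === 0) {
    return false; // never accept requests when no secret is configured
  }

  // Hash both values so the buffers have equal length, which
  // crypto.timingSafeEqual requires, then compare in constant time.
  const a = crypto.createHash('sha256').update(provided).digest();
  const b = crypto.createHash('sha256').update(expectedSecret).digest();
  return crypto.timingSafeEqual(a, b);
}
```

In an Express handler you could call `isAuthorized(req.headers, process.env.VAPI_SECRET)` and respond with 401 when it returns false; the environment variable name is only an example.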

Enhanced authentication with custom headers

Add extra headers for API versioning or enhanced security:

Assistant Configuration with Custom Headers

{
  "voice": {
    "provider": "custom-voice",
    "server": {
      "url": "https://your-tts-api.com/synthesize",
      "secret": "your-secret-token",
      "headers": {
        "X-API-Version": "v1",
        "X-Client-ID": "vapi-integration"
      }
    }
  }
}

Enterprise customers can use OAuth2 authentication through webhook credentials for larger deployments.

Building your TTS integration

Configure your VAPI assistant

Set up your assistant to use your custom TTS endpoint with fallback options:

Complete Assistant Configuration

{
  "name": "Custom Voice Assistant",
  "voice": {
    "provider": "custom-voice",
    "server": {
      "url": "https://your-tts-endpoint.com/api/synthesize",
      "secret": "your-webhook-secret",
      "timeoutSeconds": 45,
      "headers": {
        "Content-Type": "application/json",
        "X-API-Version": "v1"
      }
    },
    "fallbackPlan": {
      "voices": [
        {
          "provider": "eleven-labs",
          "voiceId": "21m00Tcm4TlvDq8ikWAM"
        }
      ]
    }
  }
}
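
If you manage several assistants, it can help to generate this configuration in code instead of copying JSON by hand. The sketch below assembles the same `voice` object from a few inputs. `buildVoiceConfig` is a hypothetical helper, and the URL check is this sketch's own guard, not a documented VAPI rule.

```javascript
// Hypothetical helper: builds the "voice" section shown above.
function buildVoiceConfig({ url, secret, timeoutSeconds = 30, headers = {}, fallbackVoices = [] }) {
  new URL(url); // throws a TypeError if the endpoint URL is malformed
  if (typeof secret !== 'string' || secret.length === 0) {
    throw new Error('A webhook secret is required');
  }

  const voice = {
    provider: 'custom-voice',
    server: { url, secret, timeoutSeconds },
  };
  // Only include optional sections when they have content
  if (Object.keys(headers).length > 0) {
    voice.server.headers = headers;
  }
  if (fallbackVoices.length > 0) {
    voice.fallbackPlan = { voices: fallbackVoices };
  }
  return voice;
}

const voice = buildVoiceConfig({
  url: 'https://your-tts-endpoint.com/api/synthesize',
  secret: 'your-webhook-secret',
  timeoutSeconds: 45,
  headers: { 'X-API-Version': 'v1' },
  fallbackVoices: [{ provider: 'eleven-labs', voiceId: '21m00Tcm4TlvDq8ikWAM' }],
});

console.log(JSON.stringify({ name: 'Custom Voice Assistant', voice }, null, 2));
```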

Build your TTS server

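Here’s a Node.js (Express) implementation that validates the request, enforces a timeout, and returns raw PCM audio. The synthesizeAudio function is a placeholder that generates a test tone; replace it with your own TTS call: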

tts-server.js

const express = require('express');
const crypto = require('crypto');

const app = express();
app.use(express.json({ limit: '50mb' }));

// Main TTS endpoint
app.post('/api/synthesize', async (req, res) => {
  const requestId = crypto.randomUUID();
  const startTime = Date.now();

  // Set up timeout protection
  const timeout = setTimeout(() => {
    if (!res.headersSent) {
      res.status(408).json({ error: 'Request timeout' });
    }
  }, 30000);

  try {
    console.log(`TTS request started: ${requestId}`);

    // Extract and validate the request
    const { message } = req.body;
    if (!message) {
      clearTimeout(timeout);
      return res.status(400).json({ error: 'Missing message object' });
    }

    const { type, text, sampleRate } = message;

    // Validate message type
    if (type !== 'voice-request') {
      clearTimeout(timeout);
      return res.status(400).json({ error: 'Invalid message type' });
    }

    // Validate text content
    if (!text || typeof text !== 'string' || text.trim().length === 0) {
      clearTimeout(timeout);
      return res.status(400).json({ error: 'Invalid or missing text' });
    }

    // Validate sample rate
    const validSampleRates = [8000, 16000, 22050, 24000, 44100];
    if (!validSampleRates.includes(sampleRate)) {
      clearTimeout(timeout);
      return res.status(400).json({
        error: 'Unsupported sample rate',
        supportedRates: validSampleRates,
      });
    }

    console.log(
      `Synthesizing: ${requestId}, length=${text.length}, rate=${sampleRate}Hz`
    );

    // Generate the audio (replace with your TTS implementation)
    const audioBuffer = await synthesizeAudio(text, sampleRate);

    if (!audioBuffer || audioBuffer.length === 0) {
      throw new Error('TTS synthesis produced no audio');
    }

    clearTimeout(timeout);

    // The timeout handler may already have responded with 408
    if (res.headersSent) {
      return;
    }

    // Return raw PCM audio to VAPI
    res.setHeader('Content-Type', 'application/octet-stream');
    res.setHeader('Content-Length', audioBuffer.length);
    res.end(audioBuffer);

    const duration = Date.now() - startTime;
    console.log(
      `TTS completed: ${requestId}, ${duration}ms, ${audioBuffer.length} bytes`
    );
  } catch (error) {
    clearTimeout(timeout);
    const duration = Date.now() - startTime;
    console.error(`TTS failed: ${requestId}, ${duration}ms, ${error.message}`);

    if (!res.headersSent) {
      res.status(500).json({ error: 'TTS synthesis failed', requestId });
    }
  }
});

// Replace with your actual TTS implementation
async function synthesizeAudio(text, sampleRate) {
  // Example: Call your TTS API here
  // const audioBuffer = await yourTTSProvider.synthesize(text, sampleRate);

  // Demo implementation (replace with real TTS): a 440 Hz sine tone
  await new Promise(resolve => setTimeout(resolve, 100));

  const duration = Math.min(text.length * 0.1, 10); // Max 10 seconds
  const samples = Math.floor(duration * sampleRate);
  const buffer = Buffer.alloc(samples * 2); // 16-bit = 2 bytes per sample

  for (let i = 0; i < samples; i++) {
    const value = Math.sin((2 * Math.PI * 440 * i) / sampleRate) * 16000;
    buffer.writeInt16LE(Math.round(value), i * 2);
  }

  return buffer;
}

// Error handling middleware
app.use((error, req, res, next) => {
  console.error('Unhandled error:', error.message);
  res.status(500).json({ error: 'Internal server error' });
});

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`TTS server listening on port ${PORT}`);
});

module.exports = app;

Handle text processing

Implement proper text validation and preprocessing:

Text Processing Functions

function preprocessText(text) {
  // Handle SSML tags if your TTS supports them.
  // parseSSML is a placeholder for your own parser; it is not defined here.
  if (text.includes('<speak>')) {
    return parseSSML(text);
  }

  // Clean up problematic characters
  return text
    .replace(/[^\p{L}\p{N}\s.,!?'-]/gu, '') // Keep letters, digits, whitespace, and basic punctuation
    .replace(/\s+/g, ' ') // Normalize whitespace
    .trim();
}

function validateText(text) {
  if (!text || text.trim().length === 0) {
    throw new Error('Empty text provided');
  }

  // Require at least one letter or digit in any script
  if (!/[\p{L}\p{N}]/u.test(text)) {
    throw new Error('No readable text found');
  }

  return text.trim();
}

Request and response formats

VAPI request structure

Every TTS request from VAPI follows this format:

VAPI TTS Request

{
  "message": {
    "type": "voice-request",
    "text": "Hello, world! How can I help you today?",
    "sampleRate": 24000,
    "timestamp": 1677123456789,
    "call": {
      "id": "call-123",
      "orgId": "org-456"
    },
    "assistant": {
      "id": "assistant-789",
      "name": "Customer Service Bot"
    },
    "customer": {
      "number": "+1234567890"
    }
  }
}

Required fields

Field	Type	Description
`type`	string	Always “voice-request”
`text`	string	Text to synthesize
`sampleRate`	number	Target audio sample rate (8000, 16000, 22050, or 24000 Hz)
`timestamp`	number	Unix timestamp in milliseconds
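
The required fields in the table can be checked in one place before any synthesis work starts. The following sketch collects every problem instead of stopping at the first one, which tends to make logs more useful. `validateVoiceRequest` is a hypothetical helper name, and the sample rate list is taken from the table above.

```javascript
// Sample rates listed in the required-fields table
const SUPPORTED_SAMPLE_RATES = [8000, 16000, 22050, 24000];

// Hypothetical helper: returns a list of problems; an empty list means the body is valid.
function validateVoiceRequest(body) {
  const message = body && body.message;
  if (!message || typeof message !== 'object') {
    return ['missing message object'];
  }

  const errors = [];
  if (message.type !== 'voice-request') {
    errors.push('type must be "voice-request"');
  }
  if (typeof message.text !== 'string' || message.text.trim() === '') {
    errors.push('text must be a non-empty string');
  }
  if (!SUPPORTED_SAMPLE_RATES.includes(message.sampleRate)) {
    errors.push(`sampleRate must be one of ${SUPPORTED_SAMPLE_RATES.join(', ')}`);
  }
  if (!Number.isFinite(message.timestamp)) {
    errors.push('timestamp must be a number (Unix time in milliseconds)');
  }
  return errors;
}
```

A handler can then respond with 400 and the returned list whenever it is non-empty.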

Your response requirements

Your endpoint must respond with:

HTTP 200 status
Content-Type: application/octet-stream
Raw PCM audio data in the response body

Correct Response Headers

HTTP/1.1 200 OK
Content-Type: application/octet-stream
Transfer-Encoding: chunked

[Raw PCM audio bytes]
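
As a rough sketch, the three response requirements can be wrapped in one function. It assumes `res` is a Node.js `http.ServerResponse` (Express response objects extend it) and that the whole clip is already in memory, so it sends a Content-Length header instead of chunked encoding. `sendPcm` is a hypothetical name, and the even-length check is this sketch's own sanity test for 16-bit samples.

```javascript
// Hypothetical helper: writes a finished PCM clip as the HTTP response.
function sendPcm(res, pcm) {
  if (!Buffer.isBuffer(pcm) || pcm.length === 0) {
    throw new TypeError('pcm must be a non-empty Buffer');
  }
  // 16-bit mono audio uses 2 bytes per sample, so the byte count must be even
  if (pcm.length % 2 !== 0) {
    throw new RangeError('16-bit PCM must contain an even number of bytes');
  }

  res.writeHead(200, {
    'Content-Type': 'application/octet-stream',
    'Content-Length': pcm.length,
  });
  res.end(pcm);
}
```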

Audio format requirements

PCM specifications

Your TTS system must generate audio with these exact specifications:

Format: Raw PCM (no headers or containers)
Channels: 1 (mono only)
Bit Depth: 16-bit signed integer
Byte Order: Little-endian
Sample Rate: Must exactly match the sampleRate in the request

VAPI streams audio in real time during phone calls, so any deviation from these specifications can cause audio distortion, playback failures, or call quality issues.
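
Many TTS models emit floating-point samples in the range -1 to 1 instead of 16-bit integers. The sketch below shows one common way to convert such output to the format listed above, plus a helper for checking clip length. It assumes mono input. Both function names are hypothetical.

```javascript
// Convert float samples in [-1, 1] to 16-bit signed little-endian PCM.
function floatToPcm16le(samples) {
  const out = Buffer.alloc(samples.length * 2); // 2 bytes per sample
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range values do not overflow the 16-bit range
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    // Scale each half separately: -1 maps to -32768 and 1 maps to 32767
    const value = clamped < 0 ? clamped * 32768 : clamped * 32767;
    out.writeInt16LE(Math.round(value), i * 2);
  }
  return out;
}

// Duration of a mono 16-bit PCM buffer in seconds.
function pcmDurationSeconds(pcm, sampleRate) {
  return pcm.length / 2 / sampleRate;
}
```

Comparing `pcmDurationSeconds` against the duration you expect is a quick way to spot a sample rate mismatch: audio generated at 48000 Hz but labeled as 24000 Hz reports twice the expected length.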

Testing your integration

Create a test call

Use VAPI’s API to create a test call that exercises your TTS system:

Test Call Creation

async function testTTSWithVAPICall() {
  const vapiApiKey = 'your-vapi-api-key';
  const assistantId = 'your-assistant-id'; // Assistant with custom TTS

  const callData = {
    assistant: { id: assistantId },
    phoneNumberId: 'your-phone-number-id',
    customer: { number: '+1234567890' }, // Your test number
  };

  try {
    const response = await fetch('https://api.vapi.ai/call', {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${vapiApiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(callData),
    });

    // Surface API errors instead of logging an undefined call ID
    if (!response.ok) {
      throw new Error(`HTTP ${response.status}: ${await response.text()}`);
    }

    const call = await response.json();
    console.log('Test call created:', call.id);
    return call;
  } catch (error) {
    console.error('Failed to create test call:', error);
  }
}

Monitor TTS requests

Set up logging to see exactly what VAPI sends to your endpoint:

Request Monitoring Middleware

// Register this before the app.post('/api/synthesize', ...) route so it runs first
app.use('/api/synthesize', (req, res, next) => {
  console.log('=== Incoming TTS Request ===');
  console.log('Headers:', req.headers);
  console.log('Body:', JSON.stringify(req.body, null, 2));
  console.log('==============================');
  next();
});
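
The monitoring middleware above logs every header, which includes the X-VAPI-SECRET value. If those logs are shared or stored, you may prefer to mask credentials first. A minimal sketch, where `redactHeaders` and the list of sensitive names are assumptions you should adapt:

```javascript
// Header names (lowercase) whose values should never reach the logs
const SENSITIVE_HEADERS = new Set(['x-vapi-secret', 'authorization']);

// Returns a copy of the headers with sensitive values masked.
function redactHeaders(headers) {
  const safe = {};
  for (const [name, value] of Object.entries(headers)) {
    safe[name] = SENSITIVE_HEADERS.has(name.toLowerCase()) ? '[redacted]' : value;
  }
  return safe;
}
```

In the middleware, logging `redactHeaders(req.headers)` in place of `req.headers` keeps the rest of the output unchanged.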

Quick endpoint test

Test your endpoint directly before full integration:

Direct Endpoint Test

// Requires Node.js 18+ for the built-in fetch
async function quickEndpointTest() {
  const testPayload = {
    message: {
      type: 'voice-request',
      text: 'Hello, this is a test of the custom TTS system.',
      sampleRate: 24000,
      timestamp: Date.now(),
    },
  };

  try {
    const response = await fetch('http://localhost:3000/api/synthesize', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-VAPI-SECRET': 'your-test-secret',
      },
      body: JSON.stringify(testPayload),
    });

    if (response.ok) {
      // The built-in fetch has no response.buffer(); convert the ArrayBuffer instead
      const audioBuffer = Buffer.from(await response.arrayBuffer());
      console.log(`Audio generated: ${audioBuffer.length} bytes`);
      require('fs').writeFileSync('test-output.pcm', audioBuffer);
    } else {
      console.error('Test failed:', await response.text());
    }
  } catch (error) {
    console.error('Test error:', error);
  }
}
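
The quick test above writes test-output.pcm, which most audio players cannot open because raw PCM carries no format information. For listening checks only, you can wrap the data in a standard 44-byte WAV header. The sketch below assumes 16-bit mono input at the sample rate you requested. `pcmToWav` is a hypothetical helper. Never send the wrapped file back to VAPI, which expects headerless PCM.

```javascript
// Wrap raw 16-bit mono PCM in a canonical 44-byte WAV (RIFF) header.
function pcmToWav(pcm, sampleRate) {
  const channels = 1;
  const bitsPerSample = 16;
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame

  const header = Buffer.alloc(44);
  header.write('RIFF', 0, 'ascii');
  header.writeUInt32LE(36 + pcm.length, 4); // file size minus the first 8 bytes
  header.write('WAVE', 8, 'ascii');
  header.write('fmt ', 12, 'ascii');
  header.writeUInt32LE(16, 16); // size of the fmt chunk
  header.writeUInt16LE(1, 20); // audio format 1 = uncompressed PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(sampleRate * blockAlign, 28); // bytes per second
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write('data', 36, 'ascii');
  header.writeUInt32LE(pcm.length, 40); // size of the audio data

  return Buffer.concat([header, pcm]);
}
```

For example, `fs.writeFileSync('test-output.wav', pcmToWav(fs.readFileSync('test-output.pcm'), 24000))` produces a file you can play in an ordinary audio player.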

Troubleshooting

Request timeouts

Symptoms: VAPI doesn’t receive your audio response, calls may drop

Common causes:

TTS processing takes longer than configured timeout
Network connectivity issues between VAPI and your server
Server overload or unresponsiveness

Solutions:

Increase timeoutSeconds in your assistant configuration
Optimize your TTS processing speed
Implement proper error handling and timeout protection
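
One way to add timeout protection around a slow TTS call is to race it against a timer. The sketch below is a generic pattern, not a VAPI API; `withTimeout` is a hypothetical helper. Choosing a limit somewhat below the timeoutSeconds value in your assistant configuration lets your server return an error before VAPI gives up on the request.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  // Whichever settles first wins; always clear the timer afterwards
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

A call such as `await withTimeout(synthesizeAudio(text, sampleRate), 25000, 'TTS synthesis')` would fit inside a 30-second budget. Note that the race only stops waiting; it does not cancel the underlying synthesis work.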

Audio playback problems

Symptoms: No audio during calls, or distorted/garbled sound

Common causes:

Wrong audio format (not raw PCM)
Incorrect sample rate
Sending stereo audio instead of mono
Including audio file headers in response

Solutions:

Ensure raw PCM format with no headers
Match the exact sample rate from the request
Generate mono audio only
Verify 16-bit little-endian format
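
If your TTS provider returns WAV files or stereo audio, two small conversions cover the most common causes above. This sketch assumes uncompressed 16-bit little-endian samples inside the WAV container; compressed formats such as MP3 need a real decoder instead. Both function names are hypothetical.

```javascript
// Return only the sample data from a WAV file; raw PCM input is returned unchanged.
function stripWavHeader(audio) {
  const isWav =
    audio.length >= 12 &&
    audio.toString('ascii', 0, 4) === 'RIFF' &&
    audio.toString('ascii', 8, 12) === 'WAVE';
  if (!isWav) {
    return audio; // assume the buffer is already headerless PCM
  }

  // Walk the RIFF chunks until the "data" chunk is found
  let offset = 12;
  while (offset + 8 <= audio.length) {
    const chunkId = audio.toString('ascii', offset, offset + 4);
    const chunkSize = audio.readUInt32LE(offset + 4);
    const start = offset + 8;
    if (chunkId === 'data') {
      return audio.subarray(start, Math.min(start + chunkSize, audio.length));
    }
    offset = start + chunkSize + (chunkSize % 2); // chunks are padded to even sizes
  }
  throw new Error('WAV file has no data chunk');
}

// Average interleaved left/right 16-bit samples into a single mono channel.
function stereoToMono(pcm) {
  const frames = Math.floor(pcm.length / 4); // 4 bytes per stereo frame
  const mono = Buffer.alloc(frames * 2);
  for (let i = 0; i < frames; i++) {
    const left = pcm.readInt16LE(i * 4);
    const right = pcm.readInt16LE(i * 4 + 2);
    mono.writeInt16LE(Math.round((left + right) / 2), i * 2);
  }
  return mono;
}
```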

Authentication failures

Symptoms: 401 Unauthorized responses in your server logs

Common causes:

Secret token mismatch between VAPI config and server
Missing X-VAPI-SECRET header validation
Case sensitivity issues with header names

Solutions:

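Verify that the secret token matches exactly, including case and surrounding whitespace
Validate the X-VAPI-SECRET header on every request and return 401 when it is missing or wrong
Read headers case-insensitively; Node.js lowercases incoming header names, so use req.headers['x-vapi-secret']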

High latency

Symptoms: Noticeable delays during conversations

Common causes:

TTS model loading time on first request
Complex text processing before synthesis
Network latency between services

Solutions:

Pre-load TTS models at startup
Optimize text preprocessing
Use faster TTS models for real-time performance
Consider geographic proximity of services
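
Assistants often repeat the same phrases, such as greetings or confirmations, so caching synthesized audio can remove the synthesis step for those requests. The following is a minimal in-memory sketch with least-recently-used eviction. `AudioCache` and `synthesizeCached` are hypothetical names, and `synthesize` stands for any function with the same shape as synthesizeAudio(text, sampleRate) in the server example.

```javascript
// Small LRU cache keyed by sample rate and text. A Map preserves insertion
// order, so the first key is always the least recently used entry.
class AudioCache {
  constructor(maxEntries = 100) {
    this.maxEntries = maxEntries;
    this.entries = new Map();
  }

  static key(text, sampleRate) {
    return `${sampleRate}:${text}`;
  }

  get(text, sampleRate) {
    const key = AudioCache.key(text, sampleRate);
    if (!this.entries.has(key)) {
      return undefined;
    }
    const audio = this.entries.get(key);
    // Re-insert to mark the entry as most recently used
    this.entries.delete(key);
    this.entries.set(key, audio);
    return audio;
  }

  set(text, sampleRate, audio) {
    const key = AudioCache.key(text, sampleRate);
    this.entries.delete(key);
    this.entries.set(key, audio);
    if (this.entries.size > this.maxEntries) {
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
  }
}

// Return cached audio when available; otherwise synthesize and store it.
async function synthesizeCached(cache, synthesize, text, sampleRate) {
  const cached = cache.get(text, sampleRate);
  if (cached) {
    return cached;
  }
  const audio = await synthesize(text, sampleRate);
  cache.set(text, sampleRate, audio);
  return audio;
}
```

The cache lives in process memory, so it is lost on restart and is not shared between server instances. Size `maxEntries` with the clip length in mind: at 24000 Hz, each second of 16-bit mono audio takes 48000 bytes.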

Next steps

Now that you have custom TTS integration working:

Advanced features: Explore SSML support for enhanced voice control
Performance optimization: Implement caching and model warming strategies
Voice cloning: Integrate voice cloning APIs for personalized experiences
Multi-language support: Add language detection and voice switching

Consider implementing a fallback voice provider in your assistant configuration to ensure call continuity if your custom TTS endpoint experiences issues.