Configure start and stop speaking plans for natural conversation flow
Overview
Configure Vapi’s voice pipeline to create natural conversation experiences through precise timing control. This guide covers how voice data moves through the processing stages and how to optimize endpointing and interruption detection.
Voice pipeline configuration enables you to:
Fine-tune conversation timing for specific use cases
Control when and how your assistant begins responding
Configure interruption detection and recovery behavior
Optimize response timing for different languages and contexts
If threshold met → Clear pipeline → Apply backoffSeconds → Ready for next input
Start speaking plan
The start speaking plan determines when your assistant begins responding after a user stops talking.
Transcription endpointing
Analyzes transcription text to determine user completion based on patterns like punctuation and numbers.
This plan is only used if smartEndpointingPlan is not set and transcriber does not have built-in endpointing capabilities. If both are provided, smartEndpointingPlan takes precedence. This plan will also be overridden by any matching customEndpointingRules.
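The precedence described above can be summarized as a small decision function. This is a sketch for illustration, not Vapi's implementation: `select_endpointing_method` and its arguments are hypothetical names, and the ordering is taken only from the rules stated in this guide.

```python
def select_endpointing_method(start_speaking_plan: dict,
                              transcriber_has_builtin_eot: bool,
                              custom_rule_matched: bool) -> str:
    """Return the name of the mechanism that decides the end of the user's turn.

    Hypothetical helper: the order mirrors the precedence described in the text.
    """
    # A matching custom endpointing rule overrides everything else.
    if custom_rule_matched:
        return "customEndpointingRules"
    # An explicitly configured smart endpointing plan comes next.
    if "smartEndpointingPlan" in start_speaking_plan:
        return "smartEndpointingPlan"
    # Without a smart plan, a transcriber's own end-of-turn detection is used.
    if transcriber_has_builtin_eot:
        return "transcriber"
    # Otherwise the text-based transcription endpointing plan applies.
    return "transcriptionEndpointingPlan"


plan = {"transcriptionEndpointingPlan": {"onPunctuationSeconds": 0.1}}
print(select_endpointing_method(plan, False, False))
```

With only a transcriptionEndpointingPlan set and a transcriber that has no built-in end-of-turn detection, the sketch selects the transcription endpointing plan.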
Configuration
{
  "startSpeakingPlan": {
    "transcriptionEndpointingPlan": {
      "onPunctuationSeconds": 0.1,
      "onNoPunctuationSeconds": 1.5,
      "onNumberSeconds": 0.5
    },
    "waitSeconds": 0.4
  }
}
When to use:
Non-English languages, which LiveKit does not support
Fallback when smart endpointing unavailable
Predictable, rule-based endpointing behavior
Smart endpointing
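Uses AI models to analyze speech patterns, context, and audio cues to predict when users have finished speaking. Language support depends on the provider: LiveKit works with English conversations only, while other providers cover non-English conversations.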
Important: If your transcriber has built-in end-of-turn detection (like Deepgram Flux or Assembly) and you don’t configure a smart endpointing plan, the system will automatically use the transcriber’s EOT detection instead of smart endpointing.
Deepgram Flux: English and multilingual conversations using Deepgram as the transcriber.
Assembly: English conversations where Assembly is already your transcriber provider, using its integrated end-of-turn detection.
LiveKit: English conversations where Deepgram is not the transcriber of choice.
Vapi: Non-English conversations with default stop speaking plan settings
Krisp: Non-English conversations with a robustly configured stop speaking plan
Deepgram Flux configuration
Deepgram Flux’s end-of-turn detection is configured at the transcriber level, allowing you to fine-tune how aggressive or conservative the bot should be in detecting when users finish speaking. To use Deepgram’s end-of-turn events, do NOT set a smartEndpointingPlan.
Configuration parameters:
eotThreshold (Default: 0.7): Confidence level required to trigger end-of-turn detection
0.5-0.6: Aggressive detection - responds quickly but may interrupt users mid-sentence
0.6-0.8: Balanced detection (default: 0.7) - good balance between responsiveness and accuracy
0.9-1.0: Conservative detection - waits longer to ensure users have finished speaking
eotTimeoutMs (Default: 5000): Maximum wait time in milliseconds before forcing turn end
2000-3000: Fast timeout for quick interactions
4000-6000: Standard timeout (default: 5000) - natural conversation flow
7000-10000: Extended timeout for complex or thoughtful responses
Configuration examples:
English
{
  "transcriber": {
    "provider": "deepgram",
    "model": "flux-general-en",
    "language": "en",
    "eotThreshold": 0.7,
    "eotTimeoutMs": 5000
  }
}
LiveKit’s Wait function
Mathematical expression that determines wait time based on speech completion probability. The function takes a confidence value (0-1) and returns a wait time in milliseconds.
Behavior: Waits for natural pauses in conversation
Use case: Most conversations, general purpose
Timing: ~800ms wait at 50% confidence, ~300ms at 90% confidence
Conservative (Careful Response):
"waitFunction": "700 + 4000 * max(0, x-0.5)"
Behavior: Very patient, rarely interrupts users
Use case: Healthcare, formal settings, sensitive conversations
Timing: ~2700ms wait at 50% confidence, ~700ms at 90% confidence
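To see what the conservative expression produces, it can be evaluated directly. The sketch below only performs the arithmetic of the expression as written. `conservative_wait_ms` is a hypothetical helper name, and nothing is assumed about how Vapi derives x or how x corresponds to the confidence percentages quoted above.

```python
def conservative_wait_ms(x: float) -> float:
    """Evaluate the expression 700 + 4000 * max(0, x - 0.5) for a given x in [0, 1]."""
    return 700 + 4000 * max(0.0, x - 0.5)


# Sample the curve at a few points across the input range.
for x in (0.0, 0.5, 0.75, 1.0):
    print(f"x={x:.2f} -> {conservative_wait_ms(x):.0f} ms")
```

As written, the expression stays flat at 700 ms for x up to 0.5 and then rises linearly to 2700 ms at x = 1.0, so every wait it produces falls between those two values.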
Vapi heuristic endpointing
Vapi’s text-based endpointing uses heuristic rules to analyze transcription patterns and determine when users have finished speaking. The system applies these rules in priority order using the transcriptionEndpointingPlan settings:
Heuristic priority order:
Number detection: If the latest message ends with a number, waits for onNumberSeconds (default: 0.5)
Punctuation detection: If the message contains punctuation, waits for onPunctuationSeconds (default: 0.1)
No punctuation fallback: If no punctuation is detected, waits for onNoPunctuationSeconds (default: 1.5)
Default: If no rules match, waits 0ms (immediate response)
How it works:
The system continuously analyzes the latest user message and applies the first matching rule. Each rule sets a specific timeout delay before triggering the end-of-turn event.
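The priority order above can be sketched as a first-match-wins function. This is an illustration under stated assumptions, not Vapi's actual rule engine: `heuristic_wait_seconds` is a hypothetical name, "ends with a number" is approximated as "last character is a digit", the punctuation set is a guess, and an empty message is treated as the "no rules match" case.

```python
import re

# Default values as listed in the heuristic priority order.
DEFAULT_PLAN = {
    "onPunctuationSeconds": 0.1,
    "onNoPunctuationSeconds": 1.5,
    "onNumberSeconds": 0.5,
}


def heuristic_wait_seconds(message: str, plan: dict = DEFAULT_PLAN) -> float:
    """Return the wait (in seconds) chosen by the first matching heuristic rule."""
    text = message.strip()
    if not text:
        return 0.0  # assumption: nothing to analyze, so no rule matches
    if re.search(r"\d$", text):  # assumption: a trailing digit counts as a number
        return plan["onNumberSeconds"]
    if re.search(r"[.,!?;:]", text):  # assumption: illustrative punctuation set
        return plan["onPunctuationSeconds"]
    return plan["onNoPunctuationSeconds"]


for msg in ("my number is 415", "I'm done.", "well I was thinking"):
    print(repr(msg), "->", heuristic_wait_seconds(msg), "s")
```

Because the number rule is checked first, a message such as "It costs 3.50" waits for onNumberSeconds even though it also contains punctuation.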
Configuration example:
{
  "startSpeakingPlan": {
    "smartEndpointingPlan": {
      "provider": "vapi"
    },
    "transcriptionEndpointingPlan": {
      "onPunctuationSeconds": 0.1,
      "onNoPunctuationSeconds": 1.5,
      "onNumberSeconds": 0.5
    }
  }
}
When to use:
Non-English conversations where LiveKit isn’t available
Fallback option when other smart endpointing providers aren’t suitable
Krisp threshold configuration
Krisp’s audio-based model returns a probability between 0 and 1, where 1 means the user definitely stopped speaking and 0 means they’re still speaking.
Threshold settings:
0.0-0.3: Very aggressive detection - responds quickly but may interrupt users mid-sentence
0.4-0.6: Balanced detection (default: 0.5) - good balance between responsiveness and accuracy
0.7-1.0: Conservative detection - waits longer to ensure users have finished speaking
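A short sketch shows why a lower threshold behaves more aggressively. `krisp_end_of_turn` is a hypothetical helper, and the use of "greater than or equal" for the comparison is an assumption, since the exact comparison is not stated here.

```python
def krisp_end_of_turn(probability: float, threshold: float = 0.5) -> bool:
    """Decide end-of-turn from a stop-speaking probability (1 = definitely stopped)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # Assumption: the turn ends once the probability reaches the threshold.
    return probability >= threshold


# The same mid-sentence pause (probability 0.45) under three thresholds.
for threshold in (0.3, 0.5, 0.8):
    print(f"threshold={threshold}: end of turn = {krisp_end_of_turn(0.45, threshold)}")
```

With a probability of 0.45, only the 0.3 threshold ends the turn, which is the "may interrupt users mid-sentence" behavior described for the aggressive range.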
Configuration example:
{
  "startSpeakingPlan": {
    "smartEndpointingPlan": {
      "provider": "krisp",
      "threshold": 0.5
    }
  }
}
Important considerations:
Since Krisp is audio-based, it always notifies when the user is done speaking, even for brief acknowledgments. Configure the stop speaking plan with appropriate acknowledgementPhrases and numWords settings to handle backchanneling properly.
Assembly turn detection
AssemblyAI’s turn detection model uses a neural network to detect when someone has finished speaking. The model understands the meaning and flow of speech to make better decisions about when a turn has ended.
When the model detects an end-of-turn, it returns end_of_turn=True in the response.
Quick start configurations:
To use Assembly’s turn detection, set Assembly as your transcriber provider and configure these fields in the assistant’s transcriber (do not set any smartEndpointingPlan):
Aggressive (Fast Response):
{
  "endOfTurnConfidenceThreshold": 0.4,
  "minEndOfTurnSilenceWhenConfident": 160,
  "maxTurnSilence": 400
}
Use cases: Agent Assist, IVR replacements, Retail/E-commerce, Telecom
Behavior: Ends turns very quickly, optimized for short responses
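This guide does not spell out how the three Assembly fields interact. One plausible reading, offered purely as an assumption based on the field names, is that a confident prediction needs only a short silence, while an unconfident one must wait for the maximum silence. The sketch below models that reading; `assembly_turn_ended` is a hypothetical helper and the silence values are assumed to be milliseconds.

```python
AGGRESSIVE = {
    "endOfTurnConfidenceThreshold": 0.4,
    "minEndOfTurnSilenceWhenConfident": 160,  # assumed to be milliseconds
    "maxTurnSilence": 400,                    # assumed to be milliseconds
}


def assembly_turn_ended(confidence: float, silence_ms: int, config: dict) -> bool:
    """Model of the assumed interaction between the three turn-detection fields."""
    confident = confidence >= config["endOfTurnConfidenceThreshold"]
    # Assumed: a confident prediction ends the turn after a short silence.
    if confident and silence_ms >= config["minEndOfTurnSilenceWhenConfident"]:
        return True
    # Assumed: otherwise the turn ends only once the maximum silence has elapsed.
    return silence_ms >= config["maxTurnSilence"]


print(assembly_turn_ended(0.6, 200, AGGRESSIVE))
print(assembly_turn_ended(0.2, 300, AGGRESSIVE))
```

Under this reading, lowering the three values shortens both paths to an end-of-turn, which matches the "ends turns very quickly" behavior described for the aggressive preset.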
Endpointing timing: Varies by method (smart vs transcription)
LLM processing: ~800ms average for standard responses
TTS generation: ~500ms average for short responses
waitSeconds: Applied as final delay before audio output
Complete pipeline timeline
Understanding exact timing helps optimize your voice pipeline configuration. This timeline shows what happens at every moment during the conversation flow.
0.0s: User stops speaking
0.1s: Smart endpointing evaluation begins
0.6s: Smart endpointing triggers (varies by waitFunction)
The 0.6s endpointing time varies based on your waitFunction choice
Aggressive functions reduce endpointing to ~0.2s
Conservative functions increase endpointing to ~2.7s
Total response time ranges from 1.9s (aggressive) to 4.7s (conservative)
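The totals can be approximated by adding up the stage timings quoted in this guide. The sketch assumes the stages run strictly back to back with no other overhead, and that waitSeconds is 0.4 as in the example configurations; `total_response_ms` is a hypothetical helper.

```python
LLM_MS = 800    # average LLM processing quoted in this guide
TTS_MS = 500    # average TTS generation quoted in this guide
WAIT_MS = 400   # waitSeconds of 0.4 from the example configurations


def total_response_ms(endpointing_ms: int) -> int:
    """Sum endpointing, LLM, TTS, and waitSeconds, assuming strictly sequential stages."""
    return endpointing_ms + LLM_MS + TTS_MS + WAIT_MS


for label, endpointing_ms in (("aggressive", 200), ("default", 600), ("conservative", 2700)):
    print(f"{label}: {total_response_ms(endpointing_ms) / 1000:.1f}s")
```

Under these assumptions the aggressive case comes to 1.9s and the default case to 2.3s. The conservative case comes to 4.4s, slightly below the 4.7s quoted above, which suggests the quoted figure includes time not itemized in this sketch.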
Custom endpointing rules
Custom endpointing rules have the highest priority: when a rule’s pattern matches, it overrides all other endpointing decisions.
{
  "customEndpointingRules": [
    {
      "type": "assistant",
      "regex": "(phone|email|address)",
      "timeoutSeconds": 3.0
    },
    {
      "type": "user",
      "regex": "\\d{3}-\\d{3}-\\d{4}",
      "timeoutSeconds": 2.0
    }
  ]
}
Use cases:
Data collection: Extended wait times for phone numbers, addresses
Spelling: Extra time for letter-by-letter input
Complex responses: Additional processing time for detailed information
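The example rules can be exercised with a small matcher. Several things here are assumptions rather than documented behavior: that "assistant" rules are tested against the assistant's last message and "user" rules against the user's latest transcript, that the first matching rule wins, and that Python's `re` module behaves like the regex engine Vapi uses. `custom_timeout_seconds` is a hypothetical helper.

```python
import re
from typing import Optional

# The two rules from the configuration example, written as Python data.
RULES = [
    {"type": "assistant", "regex": r"(phone|email|address)", "timeoutSeconds": 3.0},
    {"type": "user", "regex": r"\d{3}-\d{3}-\d{4}", "timeoutSeconds": 2.0},
]


def custom_timeout_seconds(last_assistant_message: str,
                           latest_user_message: str,
                           rules: list = RULES) -> Optional[float]:
    """Return the timeout of the first matching rule, or None if no rule matches."""
    for rule in rules:
        # Assumption: the rule type selects which side of the conversation is tested.
        text = last_assistant_message if rule["type"] == "assistant" else latest_user_message
        if re.search(rule["regex"], text):
            return rule["timeoutSeconds"]
    return None  # no match: the other endpointing mechanisms decide


print(custom_timeout_seconds("What is your phone number?", "it's"))
print(custom_timeout_seconds("How can I help?", "call 555-123-4567"))
```

When the assistant has just asked for a phone number, the first rule extends the wait to 3.0 seconds so the caller is not cut off while reciting digits.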
Stop speaking plan
The stop speaking plan controls how interruptions are detected and handled when users speak while the assistant is talking.
Number of words
Sets the interruption detection method and threshold.
VAD-based (numWords = 0):
{
  "stopSpeakingPlan": {
    "numWords": 0,
    "voiceSeconds": 0.2
  }
}
How it works: Uses Voice Activity Detection for faster interruption (50-100ms)
Benefits: Language independent, very responsive
Considerations: More sensitive to background noise
Transcription-based (numWords > 0):
{
  "stopSpeakingPlan": {
    "numWords": 2
  }
}
How it works: Waits for specified number of transcribed words
Benefits: More accurate, reduces false positives
Considerations: Slower response (200-500ms delay)
Range: 0-10 words (Default: 0)
Voice seconds
VAD duration threshold when numWords = 0. Determines how long voice activity must be detected before triggering an interruption.
Range: 0-0.5 seconds (Default: 0.2)
Recommended settings:
0.1: Very sensitive (risk of background noise triggering)
0.2: Balanced sensitivity (recommended)
0.4: Conservative (reduces false positives)
The numWords=0 and voiceSeconds relationship
When numWords = 0, the voice pipeline uses Voice Activity Detection (VAD) instead of waiting for transcription:
User Starts Speaking → VAD Detects Voice → Continuous for voiceSeconds Duration → Interrupt Assistant
Why this matters:
Faster: VAD detection ~50-100ms vs transcription 200-500ms
More sensitive: Detects “um”, “uh”, throat clearing, background noise
Language independent: Works with any language
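The two interruption modes can be sketched as a single decision function. `should_interrupt` and its inputs are hypothetical names: the sketch assumes the pipeline can report how long voice activity has been continuous and how many words have been transcribed so far. The defaults come from the ranges listed above.

```python
def should_interrupt(stop_speaking_plan: dict,
                     continuous_voice_seconds: float,
                     transcribed_word_count: int) -> bool:
    """Decide whether user speech should interrupt the assistant."""
    num_words = stop_speaking_plan.get("numWords", 0)  # default: 0
    if num_words == 0:
        # VAD-based: interrupt once voice has been continuous for voiceSeconds.
        voice_seconds = stop_speaking_plan.get("voiceSeconds", 0.2)  # default: 0.2
        return continuous_voice_seconds >= voice_seconds
    # Transcription-based: interrupt once enough words have been transcribed.
    return transcribed_word_count >= num_words


vad_plan = {"numWords": 0, "voiceSeconds": 0.2}
word_plan = {"numWords": 2}
print(should_interrupt(vad_plan, 0.25, 0))
print(should_interrupt(word_plan, 1.0, 1))
```

A quarter second of sound with no transcribed words interrupts under the VAD plan, while a full second of speech containing a single word does not interrupt under the two-word plan. This is the trade-off between responsiveness and false positives described above.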
Backoff seconds
Duration that blocks all assistant audio output after user interruption, creating a recovery period.
Range: 0-10 seconds (Default: 1.0)
Recommended settings:
0.5: Quick recovery for fast-paced interactions
1.0: Natural pause for most conversations
2.0: Deliberate pause for formal settings
Pipeline timing relationship
User Interrupts → Assistant Audio Stopped → backoffSeconds Blocks All Output → Ready for New Input
Relationship with waitSeconds:
backoffSeconds: Applied during interruption (blocks output)
waitSeconds: Applied to normal responses (delays output)
Sequential, not cumulative: backoffSeconds completes first, then normal flow resumes with waitSeconds
Complete interruption timeline
How to read this timeline: This shows the complete flow from interruption to recovery. Notice how backoffSeconds creates a “quiet period” before normal processing resumes.
0.0s: Assistant speaking: "I can help you book..."
1.2s: User interrupts: "Actually, wait"
1.2s: backoffSeconds (1.0s) starts → All audio blocked
2.2s: backoffSeconds completes → Ready for new input
2.5s: User says: "What about tomorrow?"
3.0s: Endpointing triggers → LLM processes
3.8s: TTS completes → waitSeconds (0.4s) starts
4.2s: Assistant responds: "For tomorrow..."
Total recovery time: backoffSeconds (1.0s) + listening and processing (1.6s) + waitSeconds (0.4s) = 3.0s, measured from the interruption at 1.2s to the assistant’s response at 4.2s
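The phase durations can be read straight off the timestamps in the timeline. The sketch below uses only those timestamps, converted to milliseconds to avoid floating-point rounding; `phase_durations_ms` is a hypothetical helper and the event labels are paraphrased.

```python
# (timestamp in ms, event) pairs taken from the interruption timeline.
TIMELINE_MS = [
    (1200, "user interrupts, backoffSeconds starts"),
    (2200, "backoffSeconds completes"),
    (2500, "user starts next utterance"),
    (3000, "endpointing triggers"),
    (3800, "TTS completes, waitSeconds starts"),
    (4200, "assistant responds"),
]


def phase_durations_ms(timeline: list) -> list:
    """Return (event reached, milliseconds since the previous event) pairs."""
    return [(later[1], later[0] - earlier[0])
            for earlier, later in zip(timeline, timeline[1:])]


for event, ms in phase_durations_ms(TIMELINE_MS):
    print(f"{ms:>5} ms until: {event}")

elapsed_ms = TIMELINE_MS[-1][0] - TIMELINE_MS[0][0]
print(f"interruption to response: {elapsed_ms / 1000:.1f}s")
```

Measured this way, the time from the interruption at 1.2s to the assistant's response at 4.2s is 3.0s, of which backoffSeconds accounts for 1.0s and waitSeconds for 0.4s.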
Key insight: Adjust backoffSeconds based on how quickly you want the assistant to recover from interruptions. Healthcare might use 2.0s for deliberate pauses, while gaming might use 0.5s for quick recovery.