All messages sent to your Server URL are POST requests with this body shape:
Common metadata included on most events:
- phoneNumber, timestamp
- artifact (recording, transcript, messages, etc.)
- assistant, customer, call, chat

Most events are informational and do not require a response. Responses are only expected for these types sent to your Server URL:
Note: Some specialized messages like “voice-request” and “call.endpointing.request” are sent to their dedicated servers if configured (e.g. assistant.voice.server.url, assistant.startSpeakingPlan.smartEndpointingPlan.server.url).
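A minimal sketch of a dispatcher that separates events needing a reply from informational ones. It assumes the POST body wraps the event in a top-level `message` object carrying a `type` field. The contents of `RESPONSE_TYPES` are also assumptions (the two names marked below do not appear on this page), and the handler mapping is hypothetical.

```python
from typing import Any, Callable, Dict, Optional

Handler = Callable[[Dict[str, Any]], Dict[str, Any]]

# Assumed set of event types that expect a JSON reply; adjust to your setup.
RESPONSE_TYPES = {
    "tool-calls",
    "assistant-request",
    "transfer-destination-request",  # assumed name
    "knowledge-base-request",        # assumed name
}


def dispatch(body: Dict[str, Any], handlers: Dict[str, Handler]) -> Optional[Dict[str, Any]]:
    """Return a reply payload for response-bearing events, else None."""
    message = body.get("message", {})  # assumed envelope key
    event_type = message.get("type")
    if event_type in RESPONSE_TYPES and event_type in handlers:
        return handlers[event_type](message)
    return None  # informational event: nothing to send back
```

A web framework route would call `dispatch` with the parsed JSON body and serialize the returned dictionary, if any, as the HTTP response.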
Vapi supports OpenAI-style tool/function calling. Assistants can ping your server to perform actions.
Example assistant configuration (excerpt):
When tools are triggered, your Server URL receives a tool-calls message:
Respond with results for each tool call:
Optionally include a message to speak to the user while or after running the tool.
If a tool does not need a response immediately, you can design it to be asynchronous.
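The request/response cycle for tool calls might look like the sketch below. The field names (`toolCallList`, `id`, `function.name`, `function.arguments` or `function.parameters`, and the reply keys `results`, `toolCallId`, `result`) are assumptions based on OpenAI-style tool calling, since the payload examples are not reproduced here. `TOOLS` and `get_weather` are hypothetical.

```python
import json
from typing import Any, Callable, Dict, List


# Hypothetical tool; stands in for whatever action your server performs.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"


TOOLS: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def handle_tool_calls(message: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Build one result entry per tool call (field names are assumptions)."""
    results = []
    for call in message.get("toolCallList", []):
        fn = call.get("function", {})
        args = fn.get("arguments") or fn.get("parameters") or {}
        if isinstance(args, str):  # OpenAI-style calls may send JSON text
            args = json.loads(args)
        tool = TOOLS.get(fn.get("name"))
        try:
            output = tool(**args) if tool else f"Unknown tool: {fn.get('name')}"
        except Exception as exc:  # report failures instead of dropping the call
            output = f"Tool failed: {exc}"
        results.append({"toolCallId": call.get("id"), "result": output})
    return {"results": results}
```

Returning an entry for every call, including failed or unknown ones, keeps each result matched to the tool call that produced it.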
For inbound phone calls, you can specify the assistant dynamically. If a PhoneNumber doesn’t have an assistantId, Vapi may request one from your server:
You must respond to the assistant-request webhook within 7.5 seconds end-to-end. This limit is fixed and not configurable: the telephony provider enforces a 15-second cap, and Vapi reserves ~7.5 seconds for call setup. The timeout value shown elsewhere in the dashboard does not apply to this webhook.
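One way to stay inside the deadline described above is to run any slow lookup under a time budget and fall back to a cheap reply when it overruns. This is a sketch, not a prescribed approach: `lookup` and `fallback` are hypothetical, and the 6-second default is just a margin below the 7.5-second limit.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Any, Callable, Dict


def respond_within_budget(
    lookup: Callable[[], Dict[str, Any]],
    fallback: Dict[str, Any],
    budget_s: float = 6.0,
) -> Dict[str, Any]:
    """Run lookup() in a worker thread; return fallback if it overruns."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(lookup)
    try:
        return future.result(timeout=budget_s)
    except FutureTimeout:
        return fallback  # the slow lookup keeps running in the background
    finally:
        pool.shutdown(wait=False)  # do not block the reply on the worker
```

For example, `respond_within_budget(fetch_crm_assistant, {"assistantId": "default-id"})` would return the default assistant whenever the hypothetical `fetch_crm_assistant` takes longer than the budget.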
To avoid timeouts:
- Return an existing assistantId or a minimal assistant, then enrich context asynchronously after the call starts using Live Call Control.
- Host your server close to us-west-2 to reduce latency, and target a response time under ~6s to allow for network jitter.

Respond with either an existing assistant ID, a transient assistant, or a transfer destination:
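A small helper can make the first two options explicit. This sketch uses the `assistantId` and `assistant` keys named on this page; the contents of a transient assistant (such as `firstMessage` in the usage line) are illustrative only.

```python
from typing import Any, Dict, Optional


def assistant_request_response(
    assistant_id: Optional[str] = None,
    assistant: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return exactly one of an existing assistant ID or a transient assistant."""
    if (assistant_id is None) == (assistant is None):
        raise ValueError("Provide exactly one of assistant_id or assistant")
    if assistant_id is not None:
        return {"assistantId": assistant_id}
    return {"assistant": assistant}
```

Usage: `assistant_request_response(assistant_id="your-assistant-id")` or `assistant_request_response(assistant={"firstMessage": "Hi"})`.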
If you want to immediately transfer the call without using an assistant, return a destination in your assistant-request response. This bypasses AI handling.
When destination is present in the assistant-request response, the call forwards immediately and assistantId, assistant, squadId, and squad are ignored.
You must still respond within 7.5 seconds.
To transfer silently, set destination.message to an empty string.
For caller ID behavior, see Call features.
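The transfer reply could be built as in the sketch below. Only `destination` and `destination.message` are named on this page; the `type` and `number` keys inside the destination are assumptions about its layout.

```python
from typing import Any, Dict, Optional


def transfer_response(number: str, message: Optional[str] = None) -> Dict[str, Any]:
    """Build an assistant-request reply that forwards the call immediately.

    The destination layout (type/number) is an assumption; pass message=""
    for a silent transfer, or None to leave the field out entirely.
    """
    destination: Dict[str, Any] = {"type": "number", "number": number}
    if message is not None:
        destination["message"] = message
    # No assistantId/assistant/squad keys: they would be ignored anyway.
    return {"destination": destination}
```

Passing an empty string rather than omitting `message` is what distinguishes a silent transfer from one that uses whatever default message applies.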
Or return an error message to be spoken to the caller:
- scheduled: Call scheduled.
- queued: Call queued.
- ringing: The call is ringing.
- in-progress: The call has started.
- forwarding: The call is about to be forwarded.
- ended: The call has ended.

Use this to surface delays or notify your team.
Sent when an update is committed to the conversation history.
Partial and final transcripts from the transcriber.
For final-only events, you may receive type: "transcript[transcriptType=\"final\"]".
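Because final transcripts can arrive under either type string, a handler may want to normalize the check. The sketch below assumes the plain `transcript` event exposes a `transcriptType` field directly on the message.

```python
from typing import Any, Dict

FINAL_TYPE = 'transcript[transcriptType="final"]'


def is_final_transcript(message: Dict[str, Any]) -> bool:
    """True for final transcripts in either of the two forms described above."""
    event_type = message.get("type")
    if event_type == FINAL_TYPE:
        return True
    return event_type == "transcript" and message.get("transcriptType") == "final"
```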
Sent as the assistant begins speaking each segment of a turn, synchronized to audio playback. Designed for live captions, karaoke-style word highlighting, and any UI that needs to track what’s being spoken in real time.
This event is opt-in. Add "assistant.speechStarted" to your assistant’s serverMessages and/or clientMessages to receive it.
timing.type: "word-alignment" (ElevenLabs)
Per-word timestamps from ElevenLabs' alignment API. Events arrive at audio playback cadence (~50–200ms apart). The words[] array includes space entries with real timing: join them and track a running character cursor to highlight text up to that position. No client-side interpolation is needed.
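The running character cursor can be sketched as follows. This assumes each `words[]` entry has a `word` string, that each event carries only the words newly played since the previous event, and that the message exposes a `turn` counter; all three are assumptions about the payload.

```python
from typing import Any, Dict


class CaptionCursor:
    """Tracks how many characters of a turn have been spoken so far."""

    def __init__(self) -> None:
        self.turn = None
        self.spoken = ""

    def on_speech_started(self, message: Dict[str, Any]) -> int:
        """Append this event's words and return the running character cursor."""
        if message.get("turn") != self.turn:  # new turn: reset the caption
            self.turn = message.get("turn")
            self.spoken = ""
        timing = message.get("timing") or {}
        if timing.get("type") == "word-alignment":
            # Space entries are part of words[], so a plain join is enough.
            self.spoken += "".join(w.get("word", "") for w in timing.get("words", []))
        return len(self.spoken)
```

A caption UI would highlight the first `cursor` characters of the turn's full text after each event.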
timing.type: "word-progress" (Minimax, with voice.subtitleType: "word")
Cursor-based per-segment progress.
Minimax only attaches subtitle data to the final audio chunk of each synthesis segment, so each assistant.speechStarted event for a Minimax turn fires near the end of that segment’s audio playback — not at the start, and not per-word. The wordsSpoken value jumps in segment-sized increments, and the words[] array carries timestamps for the segment that just finished. Use it to retroactively animate that segment, or to extrapolate forward — but it cannot drive smooth real-time highlighting during the current segment. For true playback-cadence per-word events, use ElevenLabs.
totalWords: 0 is a valid sentinel on the very first event of a turn before Minimax confirms its word count — guard against divide-by-zero when computing a progress fraction. See the Minimax voice provider page for full configuration details.
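A guarded progress computation might look like this minimal sketch, which takes the `wordsSpoken` and `totalWords` values from the event as plain integers:

```python
def progress_fraction(words_spoken: int, total_words: int) -> float:
    """Fraction of the turn spoken, clamped to [0, 1]; 0.0 while totalWords is 0."""
    if total_words <= 0:  # sentinel on the first event of a turn
        return 0.0
    return max(0.0, min(1.0, words_spoken / total_words))
```

Clamping is a defensive choice here, so that a progress bar never exceeds 100% if the two counters briefly disagree.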
No timing field: text-only fallback
All other providers (Cartesia, Deepgram, Azure, OpenAI, Inworld, etc.) emit text-only events with no timing object. One event is sent per TTS chunk, gated to actual audio playback. Display the text as a caption block, or interpolate a word cursor at a flat rate (~3.5 words/sec) between events for an approximate cursor.
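The flat-rate interpolation for text-only events can be sketched as a pure function. The caller is assumed to track the seconds elapsed since the event arrived and the number of words in the chunk's text.

```python
def estimated_word_cursor(
    elapsed_s: float, chunk_word_count: int, words_per_sec: float = 3.5
) -> int:
    """Approximate how many words of the current chunk have been spoken."""
    if elapsed_s <= 0:
        return 0
    return min(chunk_word_count, int(elapsed_s * words_per_sec))
```

The result is capped at the chunk's word count so the cursor waits at the end of the chunk until the next event arrives.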
- force-say events always emit as text-only, even on ElevenLabs and Minimax: there's no provider-level alignment for forced utterances (firstMessage, queued say actions).
- Listen for the user-interrupted message and use the most recent wordsSpoken (or joined char cursor) to know what was actually spoken.
- There is no assistant.speechStopped event. Use speech-update (status: "stopped") or watch turn increment to detect end-of-turn.
- [...] timing.words[]; raw PCM responses produce text-only events.

Tokens or tool-call outputs, sent as the model generates them. The optional turnId groups all tokens from the same LLM response, so you can correlate output with a specific turn.
Requested when the model wants to transfer but the destination is not yet known and must be provided by your server.
This event is emitted only if the assistant did not supply a destination when calling a transferCall tool (for example, it did not include a custom parameter like phoneNumber). If the assistant includes the destination directly, Vapi will transfer immediately and will not send this webhook.
Respond with a destination and optionally a message:
Fires whenever a transfer occurs.
Sent when the user interrupts the assistant. The optional turnId identifies the LLM turn that was interrupted, matching the turnId on model-output messages so you can discard that turn’s tokens.
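The turnId correlation can be sketched as a small buffer that accumulates model-output tokens per turn and drops a turn when it is interrupted. The `output` field name for the token text is an assumption; `turnId` and the two event types come from this page.

```python
from collections import defaultdict
from typing import Any, DefaultDict, Dict, List


class TurnBuffer:
    """Collects model-output tokens per turnId and drops interrupted turns."""

    def __init__(self) -> None:
        self.tokens: DefaultDict[str, List[str]] = defaultdict(list)

    def handle(self, message: Dict[str, Any]) -> None:
        turn_id = message.get("turnId")
        if turn_id is None:  # turnId is optional; nothing to correlate
            return
        if message.get("type") == "model-output":
            # "output" as the token field name is an assumption.
            self.tokens[turn_id].append(str(message.get("output", "")))
        elif message.get("type") == "user-interrupted":
            self.tokens.pop(turn_id, None)  # discard the interrupted turn

    def text(self, turn_id: str) -> str:
        return "".join(self.tokens.get(turn_id, []))
```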
Sent when the transcriber switches based on detected language.
When requested in assistant.serverMessages, hangup and forwarding are delegated to your server.
Sent when assistant.knowledgeBase.provider is set to "custom-knowledge-base".
Respond with documents (and optionally a custom message to speak):
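As an illustration only, the sketch below ranks an in-memory corpus by naive word overlap and wraps the hits in a `documents` list. The per-document keys (`content`, `similarity`) are assumptions, and a real deployment would more likely use a vector store or search index than this toy scoring.

```python
from typing import Any, Dict, List


def knowledge_base_response(query: str, corpus: List[str], top_k: int = 2) -> Dict[str, Any]:
    """Rank corpus entries by naive word overlap with the query."""
    query_words = set(query.lower().split())
    scored = []
    for text in corpus:
        overlap = len(query_words & set(text.lower().split()))
        if overlap:
            scored.append((overlap / max(len(query_words), 1), text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # The per-document keys ("content", "similarity") are assumptions.
    return {"documents": [{"content": t, "similarity": s} for s, t in scored[:top_k]]}
```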
Sent to assistant.voice.server.url. Respond with raw 1-channel 16-bit PCM audio at the requested sample rate (not JSON).
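To show the expected byte layout, the sketch below generates a short sine tone as mono 16-bit PCM. It stands in for a real TTS engine, and little-endian byte order is an assumption. The returned bytes would be sent as the raw response body.

```python
import math
import struct


def sine_pcm(sample_rate: int, duration_s: float = 0.5, freq_hz: float = 440.0) -> bytes:
    """Return mono 16-bit little-endian PCM for a sine tone at sample_rate."""
    n_samples = int(sample_rate * duration_s)
    amplitude = 0.3 * 32767  # leave headroom below the int16 maximum
    samples = (
        int(amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n_samples)
    )
    return struct.pack(f"<{n_samples}h", *samples)
```

Each sample occupies two bytes, so half a second at 16000 Hz yields 16000 bytes.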
Sent to assistant.startSpeakingPlan.smartEndpointingPlan.server.url.
Respond with the timeout before considering the user’s speech finished:
- chat.created: Sent when a new chat is created.
- chat.deleted: Sent when a chat is deleted.
- session.created: Sent when a session is created.
- session.updated: Sent when a session is updated.
- session.deleted: Sent when a session is deleted.