Web calls
Overview
Build powerful voice applications that work across web browsers, mobile apps, and backend systems. This guide covers both client-side voice interfaces and server-side call management using Vapi’s comprehensive SDK ecosystem.
In this quickstart, you’ll learn to:
- Create real-time voice interfaces for web and mobile
- Build automated outbound and inbound call systems
- Handle events and webhooks for call management
- Implement voice widgets and backend integrations
Developing locally? The Vapi CLI makes it easy to initialize projects and forward webhooks to your local server for testing.
Choose your integration approach
Client SDKs. Best for: user-facing applications, voice widgets, mobile apps
- Browser-based voice assistants and widgets
- Real-time voice conversations
- Mobile voice applications (iOS, Android, React Native, Flutter)
- Direct user interaction with assistants
Server SDKs. Best for: backend automation, bulk operations, system integrations
- Automated outbound call campaigns
- Inbound call routing and management
- CRM integrations and bulk operations
- Webhook processing and real-time events
Web voice interfaces
Build browser-based voice assistants and widgets for real-time user interaction.
Installation and setup
Client SDKs are available for the web (browser), React Native, Flutter, and native iOS.
Build browser-based voice interfaces:
Live captions and word-level timing
For UIs that need to render live captions or karaoke-style word highlighting as the assistant speaks, subscribe to the opt-in assistant.speechStarted message. Add it to your assistant’s clientMessages:
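For example (the exact enum string for the opt-in message is an assumption here; check the client-messages reference for the canonical value):

```typescript
// Assistant config sketch: opt in to speech-started client messages.
const assistantConfig = {
  clientMessages: [
    "transcript",
    "speech-update",
    "assistant.speechStarted", // opt-in: live-caption / word-timing events (name assumed)
  ],
};
```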
Each event carries the full assistant turn text, the turn number, the source ("model", "force-say", or "custom-voice"), and optional timing data whose shape depends on your voice provider.
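As a rough shape, inferred from the description above (a sketch, not the authoritative schema; see the Assistant Speech Started reference for the real field names):

```typescript
// Hypothetical event shape for illustration only.
interface AssistantSpeechStarted {
  text: string;                                   // full assistant turn text
  turn: number;                                   // turn number
  source: "model" | "force-say" | "custom-voice"; // where the speech came from
  timing?: {                                      // only some providers include this
    words: { word: string; startMs: number }[];   // per-word shape is an assumption
  };
}
```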
Cadence and granularity vary significantly by voice provider — pick the one that matches your UI requirements:
- ElevenLabs (`word-alignment`) is the only provider that emits at true playback cadence with real per-word timestamps. Best for smooth karaoke-style highlighting with no client-side interpolation.
- Minimax (`word-progress`) with `subtitleType: "word"` emits once per synthesis segment, near the end of that segment's playback. The per-word `timing.words[]` array carries timestamps for the segment that just finished: useful for retroactive animation or forward extrapolation, but not for driving real-time highlighting during that segment. See the Minimax provider page for details.
- All other providers emit text-only events (no `timing`). One event per TTS chunk; you can interpolate a word cursor at a flat rate (~3.5 words/sec) between events for an approximate position.
`force-say` events (your `firstMessage`, `say` actions) always emit as text-only, even on ElevenLabs and Minimax. On user barge-in, no further events fire for the interrupted turn; pair with the `user-interrupted` message to know what was actually spoken.
For the full event schema and field reference, see Server events → Assistant Speech Started.
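The two strategies above can be combined in one helper. This is a self-contained sketch: with real per-word timestamps it returns the last word that has started, and for text-only events it interpolates at a flat ~3.5 words/sec. The `{ word, startMs }` shape is an assumption, not the documented schema.

```typescript
type WordTiming = { word: string; startMs: number };

// Returns the index of the word the cursor should highlight.
function wordCursor(
  text: string,
  elapsedMs: number,
  words?: WordTiming[],
  wordsPerSec = 3.5
): number {
  if (words && words.length > 0) {
    // Real timestamps (e.g. ElevenLabs): last word that has already started.
    let idx = 0;
    for (let i = 0; i < words.length; i++) {
      if (words[i].startMs <= elapsedMs) idx = i;
    }
    return idx;
  }
  // Text-only fallback: flat-rate interpolation between events.
  const total = text.trim().split(/\s+/).length;
  return Math.min(total - 1, Math.floor((elapsedMs / 1000) * wordsPerSec));
}
```

Reset `elapsedMs` on each new event, and stop advancing the cursor when a `user-interrupted` message arrives.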
Voice widget implementation
Create a voice widget for your website:
You can embed the widget with a plain HTML script tag (the fastest way to get started) or as a React/TypeScript component.
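As a sketch of the TypeScript route, here is the wiring for a toggle button against a Vapi-style client. The `VapiLike` interface mirrors the Web SDK's `start`/`stop`/`on` surface, but treat the details as assumptions:

```typescript
type Handler = () => void;

// Minimal surface of the Web SDK client that the widget needs.
interface VapiLike {
  start(assistantId: string): void;
  stop(): void;
  on(event: "call-start" | "call-end", cb: Handler): void;
}

// Wires a button so it starts a call when idle and ends it when active.
function attachVoiceWidget(
  vapi: VapiLike,
  button: { textContent: string | null; onclick: (() => void) | null },
  assistantId: string
) {
  let active = false;
  button.textContent = "Talk to assistant";
  vapi.on("call-start", () => { active = true; button.textContent = "End call"; });
  vapi.on("call-end", () => { active = false; button.textContent = "Talk to assistant"; });
  button.onclick = () => (active ? vapi.stop() : vapi.start(assistantId));
  return button;
}
```

In a real page you would pass an instance of the `@vapi-ai/web` client and a DOM `<button>` element.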
Server-side call management
Automate outbound calls and handle inbound call processing with server-side SDKs.
Installation and setup
Server SDKs are available for TypeScript, Python, Java, Ruby, C#, and Go.
Install the TypeScript Server SDK:
Creating assistants
Bulk operations
Run automated call campaigns for sales, surveys, or notifications:
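A self-contained sketch of a throttled campaign loop. The `calls.create` request shape (`assistantId`, `phoneNumberId`, `customer.number`) mirrors Vapi's outbound call API, but verify exact field names against the API reference:

```typescript
// Minimal surface of the calls API the campaign needs.
interface CallsApi {
  create(req: {
    assistantId: string;
    phoneNumberId: string;
    customer: { number: string };
  }): Promise<{ id: string }>;
}

// Dials each number in sequence, recording successes and failures.
async function runCampaign(
  calls: CallsApi,
  assistantId: string,
  phoneNumberId: string,
  numbers: string[],
  delayMs = 1000 // simple throttle between calls; tune to your rate limits
): Promise<{ number: string; callId?: string; error?: string }[]> {
  const results: { number: string; callId?: string; error?: string }[] = [];
  for (const number of numbers) {
    try {
      const call = await calls.create({ assistantId, phoneNumberId, customer: { number } });
      results.push({ number, callId: call.id });
    } catch (err) {
      results.push({ number, error: String(err) }); // keep going on failures
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return results;
}
```

In production you would pass the Server SDK's calls client as the `CallsApi` implementation.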
Webhook integration
Handle real-time events for both client and server applications:
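As a sketch, a dispatcher for incoming server events. The event type names below (`status-update`, `end-of-call-report`, `transcript`) follow Vapi's server-message events, but verify them against the webhooks reference:

```typescript
type ServerMessage = { type: string; [key: string]: unknown };

// Routes a webhook payload to the right handler and acknowledges it.
function handleServerMessage(message: ServerMessage): { received: boolean } {
  switch (message.type) {
    case "status-update":
      console.log("Call status:", message.status);
      break;
    case "end-of-call-report":
      console.log("Call ended. Summary:", message.summary);
      break;
    case "transcript":
      console.log("Transcript:", message.transcript);
      break;
    default:
      console.log("Unhandled event type:", message.type);
  }
  return { received: true }; // respond 200 so the event is not retried
}
```

In an HTTP framework route you would call this with the event body (commonly nested under a `message` key, which you should confirm in the webhooks reference) and return a 200 response.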
Next steps
Now that you understand both client and server SDK capabilities:
- Explore use cases: Check out our examples section for complete implementations
- Add tools: Connect your voice agents to external APIs and databases with custom tools
- Configure models: Try different speech and language models for better performance
- Scale with squads: Use Squads for multi-assistant setups and complex processes
Resources
Client SDKs:
Server SDKs:
Documentation: