Simulations advanced
Overview
This guide covers advanced simulation strategies, testing patterns, and best practices for building robust test suites that ensure your AI voice agents work reliably in production.
You’ll learn:
- Advanced scenario configuration (tool mocks, hooks)
- Strategic testing approaches (smoke, regression, edge cases)
- Performance optimization techniques
- CI/CD integration strategies
- Maintenance and troubleshooting methods
Advanced scenario configuration
Tool mocks
Mock tool call responses at the scenario level to test specific paths without calling real APIs. This is useful for:
- Testing error handling paths
- Simulating unavailable services
- Deterministic test results
- Faster test execution (no real API calls)
Common tool mock patterns:
- Success response
- Error response
- Partial success
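The exact scenario schema may differ by API version; the fragment below is only a sketch of what these patterns might look like. The tool names are placeholders for functions configured on your assistant, and the toolMocks/name/response field names are assumptions (the enabled flag is described in the tips below), so verify them against the API reference.

```json
{
  "toolMocks": [
    {
      "name": "check_availability",
      "enabled": true,
      "response": { "available": true, "slots": ["2024-06-03T10:00:00Z", "2024-06-03T14:00:00Z"] }
    },
    {
      "name": "book_appointment",
      "enabled": true,
      "response": { "error": "SERVICE_UNAVAILABLE", "message": "Booking service is temporarily down" }
    },
    {
      "name": "lookup_customer",
      "enabled": true,
      "response": { "found": true, "email": null, "note": "Profile found but email is missing" }
    }
  ]
}
```

The first mock exercises a success path, the second an error-handling path, and the third a partial success where some fields are missing.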
Tool mock tips:
- Mock tool names must exactly match the function name configured in your assistant’s tools
- Use realistic error responses that match your actual API error formats
- Create separate scenarios for success paths and error paths
- Disable mocks (enabled: false) to test against real APIs
Simulation hooks
Trigger actions on simulation lifecycle events. Hooks are useful for:
- Notifying external systems when tests start/end
- Logging test execution to your own systems
- Triggering follow-up workflows
- Custom analytics and reporting
Hooks are only supported in voice mode: they require the vapi.websocket transport and will not trigger with vapi.webchat (chat mode).
Webhook payload examples:
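The exact payload schema is defined by the platform; the sketch below only illustrates the kind of fields a lifecycle webhook delivery might carry. Every field name here is an assumption, so inspect a real delivery to your endpoint to confirm the actual shape.

```json
{
  "event": "simulation.started",
  "simulationRunId": "run_abc123",
  "scenarioId": "scenario_def456",
  "timestamp": "2024-06-03T10:00:00Z"
}
```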
Using existing structured outputs
Instead of defining inline structured outputs in each scenario, you can reference structured outputs you’ve already created. This provides:
- Reusability across multiple scenarios
- Centralized management of evaluation criteria
- Consistency in how data is extracted
In the dashboard:
- Go to Structured Outputs in the sidebar
- Create a new structured output or find an existing one
- Copy the ID
- In your scenario, select Use Existing when adding an evaluation
- Paste the structured output ID
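Via the API, the equivalent is typically to reference the ID inside the scenario's evaluation configuration. The fragment below is a sketch; evaluations and structuredOutputId are assumed field names, so confirm them against the API reference.

```json
{
  "evaluations": [
    { "structuredOutputId": "so_abc123" }
  ]
}
```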
When to use existing vs inline:
- Existing (by ID): When the same evaluation criteria are used across multiple scenarios
- Inline: For scenario-specific evaluations that won’t be reused
Testing strategies
Smoke tests
Quick validation that core functionality works. Run these first to catch obvious issues.
Purpose: Verify your assistant responds and basic conversation flow works before running comprehensive tests.
Characteristics:
- Minimal evaluation criteria (just check for any response)
- Fast execution (simple instructions)
- Run before detailed tests
- Use chat mode for speed
When to use:
- Before running expensive voice test suites
- After deploying configuration changes
- As health checks in monitoring
- Quick validation during development
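As a sketch, a smoke-test scenario can be as small as a greeting plus a single lenient check. The field names below are assumptions that follow the patterns used elsewhere in this guide; adjust them to match the real scenario schema.

```json
{
  "name": "Smoke: Assistant responds",
  "instructions": "Greet the assistant and ask one simple question, such as the business hours.",
  "evaluations": [
    { "description": "Did the assistant give any relevant response to the question?" }
  ]
}
```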
Regression tests
Ensure fixes and updates don’t break existing functionality.
Purpose: Validate that known issues stay fixed and features keep working.
When adding a regression scenario:
- Name scenarios with a “Regression: ” prefix
- Include issue ticket number in the name
- Add the exact scenario that previously failed
- Document what was fixed
Example:
- Name: “Regression: Appointment Parsing Bug #1234”
- Instructions: Scenario that triggered the bug
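As a sketch, the regression scenario above might be captured like this (field names are assumptions; the name and issue number come from the example):

```json
{
  "name": "Regression: Appointment Parsing Bug #1234",
  "instructions": "Ask to book an appointment using the exact phrasing that originally triggered the parsing bug, e.g. 'next Tuesday at noon'.",
  "evaluations": [
    { "description": "Was the requested date and time parsed and confirmed correctly?" }
  ]
}
```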
Best practices:
- Name tests after bugs they prevent
- Include ticket/issue numbers
- Add regression tests when fixing bugs
- Run full regression suite before major releases
Edge case testing
Test boundary conditions and unusual inputs your assistant might encounter.
Common edge cases to test:
- Confused or unclear requests
- Rapid topic changes
- Interruptions (use voice mode with the vapi.websocket transport to test actual audio interruptions)
- Invalid data input
Edge case categories to cover:
- Input boundaries: Empty, maximum length, special characters
- Data formats: Invalid dates, malformed phone numbers, unusual names
- Conversation patterns: Interruptions, topic changes, contradictions
- Emotional scenarios: Frustrated caller, confused caller, impatient caller
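For example, an invalid-data scenario might feed the assistant deliberately malformed details and check that it asks for clarification instead of proceeding. The shape below is illustrative, with assumed field names.

```json
{
  "name": "Edge Case: Invalid Date and Phone Number",
  "instructions": "Try to book an appointment for 'the 32nd of Febtember' and give a phone number with only five digits. Only correct yourself if the assistant asks.",
  "evaluations": [
    { "description": "Did the assistant ask for a valid date and phone number instead of booking?" }
  ]
}
```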
Best practices
Evaluation design principles
Each evaluation should test one specific outcome.
✅ Good: “Was the appointment booked?”
❌ Bad: “Was the appointment booked, confirmed, and email sent?”
Use descriptive names that explain what’s being tested.
✅ Good: “Booking - Handles Unavailable Slot”
❌ Bad: “Test 1” or “Scenario ABC”
Model test personalities after actual customer types.
Consider: decisive, confused, impatient, detail-oriented, non-native speakers
Use boolean or numeric structured outputs that produce clear pass/fail results.
Avoid subjective criteria that are hard to evaluate consistently.
Choosing voice vs chat mode
Use chat mode for fast, inexpensive iteration (development loops, smoke tests, and per-pull-request CI runs), and use voice mode for final validation of behavior that depends on audio and transport, such as interruptions and hooks.
CI/CD integration
Automate simulation runs in your deployment pipeline.
Basic workflow
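A minimal sketch of a pipeline step, assuming a VAPI_API_KEY secret, curl and jq on the runner, and hypothetical simulation-run endpoints and response fields (check the API reference for the real paths and shapes):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical endpoint paths and response fields; verify against the API reference.
API="https://api.vapi.ai"
AUTH="Authorization: Bearer ${VAPI_API_KEY}"

# 1. Trigger the smoke-test scenarios (chat mode keeps PR pipelines fast).
RUN_ID=$(curl -sf -X POST "$API/simulation-run" \
  -H "$AUTH" -H "Content-Type: application/json" \
  -d '{"scenarioIds": ["scenario_smoke_1", "scenario_smoke_2"]}' | jq -r '.id')

# 2. Poll until the run finishes.
while true; do
  STATUS=$(curl -sf -H "$AUTH" "$API/simulation-run/$RUN_ID" | jq -r '.status')
  [ "$STATUS" = "completed" ] && break
  [ "$STATUS" = "failed" ] && { echo "Simulation run failed to execute"; exit 1; }
  sleep 10
done

# 3. Fail the pipeline if any scenario failed its evaluations.
FAILED=$(curl -sf -H "$AUTH" "$API/simulation-run/$RUN_ID" \
  | jq '[.results[] | select(.passed == false)] | length')
if [ "$FAILED" -gt 0 ]; then
  echo "$FAILED scenario(s) failed"
  exit 1
fi
echo "All scenarios passed"
```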
Advanced patterns
Staging validation before production
Run full simulation suite against staging before promoting to production:
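For example, if the basic workflow above is saved as run-simulations.sh and reads an ASSISTANT_ID variable to decide which assistant to target (both names are placeholders), the promote step might look like:

```bash
# Validate the staging assistant first; the deploy step only runs if the suite passes.
ASSISTANT_ID="$STAGING_ASSISTANT_ID" ./run-simulations.sh
./deploy-to-production.sh   # placeholder for your existing deploy step
```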
Scheduled nightly regression
Run full regression suite nightly:
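One simple option is a cron entry on a CI runner or utility host (script path, flag, and log path are placeholders):

```bash
# Run the full regression suite every night at 02:00 and keep a log.
0 2 * * * /opt/ci/run-simulations.sh --suite regression >> /var/log/simulation-nightly.log 2>&1
```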
Quality gates
Block deployment if pass rate falls below threshold:
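A sketch of a pass-rate gate, assuming the run results have been fetched as JSON with a boolean passed flag per scenario (the results structure is an assumption):

```bash
#!/usr/bin/env bash
set -euo pipefail

THRESHOLD=90            # minimum pass rate (%) required to deploy
RESULTS_JSON="$1"       # path to the JSON results of a completed simulation run

TOTAL=$(jq '.results | length' "$RESULTS_JSON")
PASSED=$(jq '[.results[] | select(.passed == true)] | length' "$RESULTS_JSON")
RATE=$(( TOTAL > 0 ? PASSED * 100 / TOTAL : 0 ))

echo "Pass rate: ${RATE}% (${PASSED}/${TOTAL})"
if [ "$RATE" -lt "$THRESHOLD" ]; then
  echo "Pass rate is below the ${THRESHOLD}% threshold; blocking deployment"
  exit 1
fi
```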
Maintenance strategies
Regular review cycle
Weekly: Review failed tests
Investigate every failure. Update the test if requirements changed, or fix the assistant if its behavior regressed.
Monthly: Audit coverage
Check that new features, personalities, and known edge cases are each exercised by at least one scenario.
When to update simulations
Update scenarios when requirements change, when assistant behavior is intentionally modified, and whenever you fix a bug (add a matching regression scenario at the same time).
Troubleshooting
Common issues
Debugging tips
Getting help
Include these details when reporting issues:
- Simulation run ID
- Scenario and personality IDs
- Transport mode used (voice/chat)
- Expected vs actual behavior
- Assistant configuration
Next steps
- Return to the quickstart guide for basic setup
- Learn about chat-based testing with mock conversations
- Learn how to define structured outputs for evaluations
- Create and configure assistants to test
Summary
Key takeaways for advanced simulation testing:
Configuration:
- Use tool mocks to test error paths without real API calls
- Use hooks for external notifications (voice mode only)
- Reference existing structured outputs for consistency
Testing strategy:
- Start with smoke tests, then regression, then edge cases
- Use chat mode for speed, voice mode for final validation
- Create personalities based on real customer types
CI/CD:
- Automate smoke tests in PR pipelines
- Run full regression before production deploys
- Set quality gate thresholds
Maintenance:
- Review failures weekly
- Audit coverage monthly
- Add regression tests when fixing bugs