This quickstart guide will help you set up automated testing for your AI assistants and squads. In just a few minutes, you’ll create mock conversations, define expected behaviors, and validate your agents work correctly before production.
Evals is Vapi’s AI agent testing framework that enables you to systematically test assistants and squads using mock conversations with automated validation. Test your agents by:
Evals help you maintain quality and catch issues early:
An evaluation suite for an appointment booking assistant that tests:
Sign up at dashboard.vapi.ai
Get your API key from API Keys in sidebar
You’ll also need an existing assistant or squad to test. You can create one in the Dashboard or use the API.
Define a mock conversation to test your assistant’s greeting behavior.
Message structure: Each conversation turn has a role (user, assistant,
system, or tool). Assistant messages with judgePlan define what to validate.
Execute the evaluation against your assistant or squad.
You can also run evaluations with transient assistant or squad configurations
by providing assistant or squad objects instead of IDs in the target.
Learn to interpret evaluation results and identify issues.
When all checks pass, you’ll see:
Pass criteria:
status is “ended”endedReason is “mockConversation.done”results[0].status is “pass”judge.status values are “pass”When validation fails, you’ll see details:
Failure indicators:
results[0].status is “fail”judge.status is “fail”judge.failureReason explains why validation failedIf endedReason is not “mockConversation.done”, the test encountered an error
(like “assistant-error” or “pipeline-error-openai-llm-failed”). Check your
assistant configuration.
Validate that your assistant calls functions with correct arguments.
Test appointment booking with exact argument matching:
date: “2025-01-20”time: “14:00”{"status": "success", "confirmationId": "APT-12345"}Exact match - Full validation:
Validates both function name AND all argument values exactly.
Partial match - Name only:
Validates only that the function was called (arguments can vary).
Multiple tool calls:
Validates multiple function calls in sequence.
Tool calls are validated in the order they’re defined. Use type: "exact" for
strict validation or type: "regex" for flexible validation.
When responses vary slightly (like names, dates, or IDs), use regex patterns for flexible matching.
Greeting variations:
Matches: “Hello, I can help…”, “Hi I’ll help…”, “Hey let me help…”
Responses with variables:
Matches any confirmation message with appointment ID format.
Date patterns:
Matches responses mentioning weekdays.
Case-insensitive matching:
The (?i) flag makes matching case-insensitive.
.*appointment.*(confirmed|booked).*\d{1,2}:\d{2}.*Regex tips: - Use .* to match any characters - Use (option1|option2)
for alternatives - Use \d for digits, \s for whitespace - Use .*? for
non-greedy matching - Test your patterns with sample responses first
For complex validation criteria beyond pattern matching, use AI-powered judges to evaluate responses semantically.
Template structure:
Template variables:
{{messages}} - The entire conversation history (all messages exchanged){{messages[-1]}} - The last assistant message onlyModels: gpt-4o, gpt-4-turbo, gpt-3.5-turbo
Best for general-purpose evaluation
Models: claude-3-5-sonnet-20241022, claude-3-opus-20240229 Best for nuanced evaluation
Models: gemini-1.5-pro, gemini-1.5-flash Best for multilingual content
Models: llama-3.1-70b-versatile, mixtral-8x7b-32768
Best for fast evaluation
Custom LLM:
Tips for reliable AI judging: - Be specific with pass/fail criteria (avoid
ambiguous requirements) - Use “ALL pass criteria must be met” logic - Use “ANY
fail criteria triggers fail” logic - Include conversation context with {{ messages }} syntax - Request exact “pass” or “fail” output (no
explanations) - Test criteria with known good/bad responses before production
Define what happens after an evaluation passes or fails using continuePlan.
Stop the test immediately if a critical check fails:
Use case: Skip expensive subsequent tests when initial validation fails.
Provide fallback responses to continue testing even when validation fails:
Use case: Test error recovery paths or force specific tool calls for subsequent validation.
If exitOnFailureEnabled is true and validation fails, the test stops
immediately. Subsequent conversation turns are not executed. Use this for
critical checkpoints.
Validate multi-turn interactions that simulate real user conversations.
Create a comprehensive test:
bookAppointment{"status": "success", "confirmationId": "APT-12345"}sendEmailInject system prompts mid-conversation to test dynamic behavior changes:
Multi-turn testing tips: - Keep conversations focused (5-10 turns for most tests) - Use exit-on-failure for early turns to save time - Test one primary flow per evaluation - Mix judge types (exact, regex, AI) for comprehensive validation - Include tool responses to simulate real interactions
List, update, and organize your evaluation suite.
Deleting an evaluation does NOT delete its run history. Past run results remain accessible.
Indicators of success:
status is “ended”endedReason is “mockConversation.done”results[0].status is “pass”judge.status values are “pass”Indicators of failure:
results[0].status is “fail”judge.status is “fail”judge.failureReason provides specific detailsFull conversation transcripts show both expected and actual values, making debugging straightforward.
Combine exact, regex, and AI judges for comprehensive testing:
Validate smooth transitions between squad members:
Organize related tests for systematic validation:
Run multiple evals sequentially to validate all greeting scenarios.
“mockConversation.done” not reached:
endedReason for actual error (e.g., “assistant-error”, “pipeline-error-openai-llm-failed”)Judge validation fails unexpectedly:
failureReasonTool calls not validated:
If you see endedReason: "assistant-error", your assistant configuration has
issues. Test the assistant manually first before running evals.
Learn testing patterns, best practices, and CI/CD integration
Create and configure assistants to test
Build custom tools and validate their behavior
Complete API documentation for evals
Best practices for reliable testing: - Start simple with exact matches, then add complexity - One behavior per evaluation turn keeps tests focused - Use descriptive names that explain what’s being tested - Test both happy paths and edge cases - Version control your evals alongside assistant configs - Run critical tests first to fail fast - Review failure reasons promptly and iterate - Document why each test exists (use descriptions)
Need assistance? We’re here to help: