Evals quickstart
Overview
This quickstart guide will help you set up automated testing for your AI assistants and squads. In just a few minutes, you'll create mock conversations, define expected behaviors, and validate that your agents behave correctly before they reach production.
What are Evals?
Evals is Vapi’s AI agent testing framework that enables you to systematically test assistants and squads using mock conversations with automated validation. Test your agents by:
- Creating mock conversations - Define user messages and expected assistant responses
- Validating behavior - Use exact match, regex patterns, or AI-powered judging
- Testing tool calls - Verify function calls with specific arguments
- Running automated tests - Execute tests and receive detailed pass/fail results
- Debugging failures - Review full conversation transcripts with evaluation details
When are Evals useful?
Evals help you maintain quality and catch issues early:
- Pre-deployment testing - Validate new assistant configurations before going live
- Regression testing - Ensure prompt or tool changes don’t break existing behaviors
- Conversation flow validation - Test multi-turn interactions and complex scenarios
- Tool calling verification - Validate function calls with correct arguments
- Squad handoff testing - Ensure smooth transitions between squad members
- CI/CD integration - Automate quality gates in your deployment pipeline
What you’ll build
An evaluation suite for an appointment booking assistant that tests:
- Greeting and initial response validation
- Tool call execution with specific arguments
- Response pattern matching with regex
- Semantic validation using AI judges
- Multi-turn conversation flows
Prerequisites
- Sign up at dashboard.vapi.ai
- Get your API key from API Keys in the sidebar
You’ll also need an existing assistant or squad to test. You can create one in the Dashboard or use the API.
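The cURL examples later in this guide assume your API key is available in your shell. The variable name below is just a convention used on this page, not something required by the API:

```bash
# Store your API key (Dashboard -> API Keys) for the cURL examples in this guide.
# The variable name VAPI_API_KEY is only a convention used here.
export VAPI_API_KEY="your-api-key"
```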
Step 1: Create your first evaluation
Define a mock conversation to test your assistant’s greeting behavior.
Dashboard
cURL
Message structure: Each conversation turn has a role (`user`, `assistant`, `system`, or `tool`). Assistant messages with a `judgePlan` define what to validate.
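For the cURL tab, a request along these lines creates the eval. The endpoint path and the exact payload schema are illustrative assumptions based on the concepts above (a mock conversation of messages, with a `judgePlan` on the assistant turn); check the Evals API reference for the authoritative shape.

```bash
# Sketch only: endpoint path and payload schema are assumptions; see the Evals API reference.
curl https://api.vapi.ai/eval \
  -X POST \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Greeting Test",
    "mockConversation": {
      "messages": [
        { "role": "user", "content": "Hello" },
        {
          "role": "assistant",
          "content": "Hello! How can I help you today?",
          "judgePlan": { "type": "exact" }
        }
      ]
    }
  }'
```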
Step 2: Run your evaluation
Execute the evaluation against your assistant or squad.
Dashboard
cURL
Open your evaluation
- Navigate to Evals in the sidebar
- Click on “Greeting Test” from your evaluations list
You can also run evaluations with transient assistant or squad configurations by providing `assistant` or `squad` objects instead of IDs in the target.
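For the cURL tab, a run request might look like the sketch below. The endpoint path and payload shape are assumptions; the `target` concept (an ID or a transient object) comes from the note above.

```bash
# Sketch only: endpoint path and payload shape are assumptions; see the Evals API reference.
curl https://api.vapi.ai/eval/run \
  -X POST \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "evalId": "YOUR_EVAL_ID",
    "target": { "assistantId": "YOUR_ASSISTANT_ID" }
  }'

# To test a transient configuration, provide an inline object instead of an ID, e.g.:
#   "target": { "assistant": { "model": { "provider": "openai", "model": "gpt-4o" } } }
```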
Step 3: Understand test results
Learn to interpret evaluation results and identify issues.
Successful evaluation
When all checks pass, you’ll see:
Pass criteria:
- `status` is "ended"
- `endedReason` is "mockConversation.done"
- `results[0].status` is "pass"
- All `judge.status` values are "pass"
Failed evaluation
When validation fails, you’ll see details:
Failure indicators:
- `results[0].status` is "fail"
- `judge.status` is "fail"
- `judge.failureReason` explains why validation failed
If `endedReason` is not "mockConversation.done", the test encountered an error (like "assistant-error" or "pipeline-error-openai-llm-failed"). Check your assistant configuration.
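If you have saved a run's JSON response to a file, a quick command-line check of the criteria above might look like this. The field names come from this guide, but their exact nesting is an assumption, hence the recursive search for judge results:

```bash
# Assumes the run response was saved to run.json.
jq '.status' run.json             # expect "ended"
jq '.endedReason' run.json        # expect "mockConversation.done"
jq '.results[0].status' run.json  # expect "pass"
# Collect every judge.status in the payload, wherever it is nested:
jq '[.. | .judge? | objects | .status]' run.json
```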
Step 4: Test tool/function calls
Validate that your assistant calls functions with correct arguments.
Basic tool call validation
Test appointment booking with exact argument matching:
Dashboard
cURL
- Create new evaluation: “Appointment Booking Test”
- Add user message: “Book me an appointment for next Monday at 2pm”
- Add assistant message with evaluation enabled
- Select Exact Match judge type
- Click Add Tool Call
- Enter function name: “bookAppointment”
- Add arguments:
  - `date`: "2025-01-20"
  - `time`: "14:00"
- Add tool response message:
  - Type: Tool
  - Content: `{"status": "success", "confirmationId": "APT-12345"}`
- Add final assistant message to verify confirmation
- Save evaluation
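Expressed as an API payload, the same test might look like the sketch below. The endpoint path and the exact `judgePlan`/tool-call field names are assumptions modeled on the dashboard fields above; check the Evals API reference.

```bash
# Sketch only: endpoint path and schema are assumptions; see the Evals API reference.
curl https://api.vapi.ai/eval \
  -X POST \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Appointment Booking Test",
    "mockConversation": {
      "messages": [
        { "role": "user", "content": "Book me an appointment for next Monday at 2pm" },
        {
          "role": "assistant",
          "judgePlan": {
            "type": "exact",
            "toolCalls": [
              {
                "name": "bookAppointment",
                "arguments": { "date": "2025-01-20", "time": "14:00" }
              }
            ]
          }
        },
        { "role": "tool", "content": "{\"status\": \"success\", \"confirmationId\": \"APT-12345\"}" },
        { "role": "assistant", "judgePlan": { "type": "regex", "pattern": ".*(confirmed|booked).*" } }
      ]
    }
  }'
```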
Tool call validation modes
Exact match - Full validation:
Validates both function name AND all argument values exactly.
Partial match - Name only:
Validates only that the function was called (arguments can vary).
Multiple tool calls:
Validates multiple function calls in sequence.
Tool calls are validated in the order they're defined. Use `type: "exact"` for strict validation or `type: "regex"` for flexible validation.
Step 5: Use regex for flexible validation
When responses vary slightly (like names, dates, or IDs), use regex patterns for flexible matching.
Common regex patterns
Greeting variations:
Matches: “Hello, I can help…”, “Hi I’ll help…”, “Hey let me help…”
Responses with variables:
Matches any confirmation message with appointment ID format.
Date patterns:
Matches responses mentioning weekdays.
Case-insensitive matching:
The `(?i)` flag makes matching case-insensitive.
Example: Flexible booking confirmation
Dashboard
cURL
- Add assistant message with evaluation enabled
- Select Regex as judge type
- Enter pattern: `.*appointment.*(confirmed|booked).*\d{1,2}:\d{2}.*`
- This matches various confirmation phrasings with time mentions
Regex tips:
- Use `.*` to match any characters
- Use `(option1|option2)` for alternatives
- Use `\d` for digits, `\s` for whitespace
- Use `.*?` for non-greedy matching
- Test your patterns with sample responses first
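Per the last tip, it helps to sanity-check a pattern against sample responses before saving the eval. A quick local check with standard tools (note that POSIX `grep -E` has no `\d`, so `[0-9]` stands in for it here):

```bash
# Quick local pattern check before adding it to an eval.
# grep -E uses POSIX ERE (no \d, so use [0-9]); -i gives case-insensitive matching like (?i).
sample="Your appointment is confirmed for Monday at 14:00."
pattern='.*appointment.*(confirmed|booked).*[0-9]{1,2}:[0-9]{2}.*'
echo "$sample" | grep -Eiq "$pattern" && echo "match" || echo "no match"
```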
Step 6: Use AI judge for semantic validation
For complex validation criteria beyond pattern matching, use AI-powered judges to evaluate responses semantically.
AI judge structure
Writing effective judge prompts
Template structure:
Template variables:
- `{{messages}}` - The entire conversation history (all messages exchanged)
- `{{messages[-1]}}` - The last assistant message only
Example: Evaluate helpfulness and tone
Dashboard
cURL
- Add assistant message with evaluation enabled
- Select AI Judge as judge type
- Choose provider: OpenAI
- Select model: gpt-4o
- Enter evaluation prompt (see template above)
- Customize pass/fail criteria for your use case
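The equivalent assistant turn in an API payload might look like the sketch below. Apart from `{{messages}}` and the provider/model names listed in this guide, the field names (`type`, `model`, `prompt`) are assumptions; confirm them in the Evals API reference.

```bash
# Sketch only: AI-judge field names are assumptions; see the Evals API reference.
# This fragment is one entry in mockConversation.messages.
cat > ai-judge-turn.json <<'EOF'
{
  "role": "assistant",
  "judgePlan": {
    "type": "ai",
    "model": { "provider": "openai", "model": "gpt-4o" },
    "prompt": "Here is the conversation so far: {{messages}}. PASS only if ALL pass criteria are met: the response acknowledges the scheduling request and the tone is polite and helpful. FAIL if ANY fail criterion is met: the response is dismissive, incorrect, or off-topic. Respond with exactly pass or fail."
  }
}
EOF
```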
Supported AI judge providers
- OpenAI - Models: gpt-4o, gpt-4-turbo, gpt-3.5-turbo. Best for general-purpose evaluation.
- Anthropic - Models: claude-3-5-sonnet-20241022, claude-3-opus-20240229. Best for nuanced evaluation.
- Google - Models: gemini-1.5-pro, gemini-1.5-flash. Best for multilingual content.
- Groq - Models: llama-3.1-70b-versatile, mixtral-8x7b-32768. Best for fast evaluation.
- Custom LLM - Use your own model endpoint for evaluation.
AI judge best practices
Tips for reliable AI judging:
- Be specific with pass/fail criteria (avoid ambiguous requirements)
- Use "ALL pass criteria must be met" logic
- Use "ANY fail criteria triggers fail" logic
- Include conversation context with the `{{messages}}` syntax
- Request exact "pass" or "fail" output (no explanations)
- Test criteria with known good/bad responses before production
- Use consistent evaluation standards across similar tests
Step 7: Control flow with Continue Plan
Define what happens after an evaluation passes or fails using `continuePlan`.
Exit on failure
Stop the test immediately if a critical check fails:
Use case: Skip expensive subsequent tests when initial validation fails.
Override responses on failure
Provide fallback responses to continue testing even when validation fails:
Use case: Test error recovery paths or force specific tool calls for subsequent validation.
Example: Multi-step with exit control
Dashboard
cURL
- Create evaluation with multiple conversation turns
- For each assistant message with critical validation:
- Enable evaluation
- Configure judge plan (exact, regex, or AI)
- Toggle Exit on Failure to stop test early
- For non-critical checks, leave Exit on Failure disabled
If `exitOnFailureEnabled` is true and validation fails, the test stops immediately. Subsequent conversation turns are not executed. Use this for critical checkpoints.
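A sketch of how this might look on an assistant turn. Only `exitOnFailureEnabled` is named in this guide; the surrounding structure and the override field are assumptions, so confirm them against the API reference.

```bash
# Sketch only: continuePlan structure is an assumption apart from exitOnFailureEnabled.
# This fragment is one entry in mockConversation.messages.
cat > critical-turn.json <<'EOF'
{
  "role": "assistant",
  "judgePlan": { "type": "exact", "content": "Hello! How can I help you today?" },
  "continuePlan": { "exitOnFailureEnabled": true }
}
EOF
# For the override pattern, leave exitOnFailureEnabled false and configure a fallback
# assistant response in continuePlan (see the API reference for the exact field name).
```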
Step 8: Test complete conversation flows
Validate multi-turn interactions that simulate real user conversations.
Complete booking flow example
Dashboard
cURL
Create a comprehensive test:
- Turn 1 - Initial request:
  - User: "I need to schedule an appointment"
  - Assistant evaluation: AI judge checking acknowledgment
- Turn 2 - Provide details:
  - User: "Next Monday at 2pm"
  - Assistant evaluation: Exact match on tool call `bookAppointment`
- Turn 3 - Tool response:
  - Tool: `{"status": "success", "confirmationId": "APT-12345"}`
- Turn 4 - Confirmation:
  - Assistant evaluation: Regex matching confirmation with ID
- Turn 5 - Follow-up:
  - User: "Can I get that via email?"
  - Assistant evaluation: Exact match on tool call `sendEmail`
System message injection
Inject system prompts mid-conversation to test dynamic behavior changes:
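A compact sketch of the messages array for this flow, with a system prompt injected before turn 2 to change behavior mid-conversation. The judge details are trimmed and the schema is an assumption; the roles and judge types mirror the steps above.

```bash
# Sketch only: schema is an assumption; see the Evals API reference.
cat > booking-flow-messages.json <<'EOF'
[
  { "role": "user", "content": "I need to schedule an appointment" },
  { "role": "assistant", "judgePlan": { "type": "ai", "prompt": "..." } },

  { "role": "system", "content": "From now on, offer the earliest available slot first." },

  { "role": "user", "content": "Next Monday at 2pm" },
  { "role": "assistant",
    "judgePlan": {
      "type": "exact",
      "toolCalls": [ { "name": "bookAppointment", "arguments": { "date": "2025-01-20", "time": "14:00" } } ]
    }
  },
  { "role": "tool", "content": "{\"status\": \"success\", \"confirmationId\": \"APT-12345\"}" },
  { "role": "assistant", "judgePlan": { "type": "regex", "pattern": ".*APT-\\d+.*" } }
]
EOF
```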
Multi-turn testing tips:
- Keep conversations focused (5-10 turns for most tests)
- Use exit-on-failure for early turns to save time
- Test one primary flow per evaluation
- Mix judge types (exact, regex, AI) for comprehensive validation
- Include tool responses to simulate real interactions
Step 9: Manage evaluations
List, update, and organize your evaluation suite.
List all evaluations
Dashboard
cURL
- Navigate to Evals in the sidebar
- View all evaluations in a table with:
- Name and description
- Created date
- Last run status
- Actions (Edit, Run, Delete)
- Use search to filter by name
- Sort by date or status
Update an evaluation
Dashboard
cURL
- Navigate to Evals and click on an evaluation
- Click Edit button
- Modify conversation turns, judge plans, or settings
- Click Save Changes
- Previous test runs remain unchanged
Delete an evaluation
Dashboard
cURL
- Navigate to Evals
- Click on an evaluation
- Click Delete button
- Confirm deletion
Deleting an evaluation does NOT delete its run history. Past run results remain accessible.
View run history
Dashboard
cURL
- Navigate to Evals
- Click on an evaluation
- View Runs tab showing:
- Run timestamp
- Target (assistant/squad)
- Status (pass/fail)
- Duration
- Click any run to view detailed results
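For the cURL tabs, the management calls might look like this. The endpoint paths are assumptions, so confirm them in the API reference.

```bash
# Sketch only: endpoint paths are assumptions; see the Evals API reference.

# List all evaluations
curl https://api.vapi.ai/eval \
  -H "Authorization: Bearer $VAPI_API_KEY"

# Update an evaluation (previous runs remain unchanged)
curl https://api.vapi.ai/eval/YOUR_EVAL_ID \
  -X PATCH \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "name": "Greeting Test v2" }'

# Delete an evaluation (run history is preserved)
curl https://api.vapi.ai/eval/YOUR_EVAL_ID \
  -X DELETE \
  -H "Authorization: Bearer $VAPI_API_KEY"

# View run history for an evaluation
curl https://api.vapi.ai/eval/YOUR_EVAL_ID/runs \
  -H "Authorization: Bearer $VAPI_API_KEY"
```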
Expected output
Successful run
Indicators of success:
- ✅ `status` is "ended"
- ✅ `endedReason` is "mockConversation.done"
- ✅ `results[0].status` is "pass"
- ✅ All `judge.status` values are "pass"
Failed run
Indicators of failure:
- ❌ `results[0].status` is "fail"
- ❌ `judge.status` is "fail"
- ❌ `judge.failureReason` provides specific details
Full conversation transcripts show both expected and actual values, making debugging straightforward.
Common patterns
Multiple validation types in one eval
Combine exact, regex, and AI judges for comprehensive testing:
Test squad handoffs
Validate smooth transitions between squad members:
Regression test suite
Organize related tests for systematic validation:
Run multiple evals sequentially to validate all greeting scenarios.
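A minimal sketch of such a suite as a script, reusing the (assumed) run endpoint from Step 2. It also assumes the run call returns results synchronously; poll the run if it does not. The loop stops at the first failure so critical tests fail fast.

```bash
# Sketch only: endpoint path and response fields are assumptions; see the Evals API reference.
ASSISTANT_ID="YOUR_ASSISTANT_ID"
for eval_id in EVAL_ID_GREETING_FORMAL EVAL_ID_GREETING_CASUAL EVAL_ID_GREETING_RETURNING_USER; do
  status=$(curl -s https://api.vapi.ai/eval/run \
    -X POST \
    -H "Authorization: Bearer $VAPI_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{ \"evalId\": \"$eval_id\", \"target\": { \"assistantId\": \"$ASSISTANT_ID\" } }" \
    | jq -r '.results[0].status')
  echo "$eval_id: $status"
  [ "$status" = "pass" ] || { echo "Failing fast on $eval_id"; exit 1; }
done
```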
Troubleshooting
Common errors
“mockConversation.done” not reached:
- Check `endedReason` for the actual error (e.g., "assistant-error", "pipeline-error-openai-llm-failed")
- Verify assistant configuration (model, voice, tools)
- Check API key validity and rate limits
Judge validation fails unexpectedly:
- Review actual vs expected output in `failureReason`
- For exact match: Check for extra spaces, punctuation, or case differences
- For regex: Test pattern with online regex validators
- For AI judge: Verify prompt clarity and binary pass/fail logic
Tool calls not validated:
- Ensure tool is properly configured in assistant
- Check argument types match exactly (string “14:00” vs number 14)
- Verify tool function names are spelled correctly
If you see `endedReason: "assistant-error"`, your assistant configuration has issues. Test the assistant manually first before running evals.
Next steps
- Learn testing patterns, best practices, and CI/CD integration
- Create and configure assistants to test
- Build custom tools and validate their behavior
- Complete API documentation for evals
Tips for success
Best practices for reliable testing:
- Start simple with exact matches, then add complexity
- One behavior per evaluation turn keeps tests focused
- Use descriptive names that explain what's being tested
- Test both happy paths and edge cases
- Version control your evals alongside assistant configs
- Run critical tests first to fail fast
- Review failure reasons promptly and iterate
- Document why each test exists (use descriptions)
Get help
Need assistance? We're here to help.