Evals quickstart

Get started with AI agent testing in 5 minutes

Overview

This quickstart guide will help you set up automated testing for your AI assistants and squads. In just a few minutes, you’ll create mock conversations, define expected behaviors, and validate your agents work correctly before production.

What are Evals?

Evals is Vapi’s AI agent testing framework that enables you to systematically test assistants and squads using mock conversations with automated validation. Test your agents by:

  1. Creating mock conversations - Define user messages and expected assistant responses
  2. Validating behavior - Use exact match, regex patterns, or AI-powered judging
  3. Testing tool calls - Verify function calls with specific arguments
  4. Running automated tests - Execute tests and receive detailed pass/fail results
  5. Debugging failures - Review full conversation transcripts with evaluation details

When are Evals useful?

Evals help you maintain quality and catch issues early:

  • Pre-deployment testing - Validate new assistant configurations before going live
  • Regression testing - Ensure prompt or tool changes don’t break existing behaviors
  • Conversation flow validation - Test multi-turn interactions and complex scenarios
  • Tool calling verification - Validate function calls with correct arguments
  • Squad handoff testing - Ensure smooth transitions between squad members
  • CI/CD integration - Automate quality gates in your deployment pipeline

What you’ll build

An evaluation suite for an appointment booking assistant that tests:

  • Greeting and initial response validation
  • Tool call execution with specific arguments
  • Response pattern matching with regex
  • Semantic validation using AI judges
  • Multi-turn conversation flows

Prerequisites

  • A Vapi account
  • An API key (get yours from the API Keys page in the Dashboard sidebar)

You’ll also need an existing assistant or squad to test. You can create one in the Dashboard or use the API.

Step 1: Create your first evaluation

Define a mock conversation to test your assistant’s greeting behavior. From Evals in the sidebar, create a new evaluation, then configure it as follows.

Configure basic settings

  1. Name: Enter “Greeting Test”
  2. Description: Add “Verify assistant greets users appropriately”
  3. Type: Automatically set to “chat.mockConversation”

Add conversation turns

  1. Click Add Message
  2. Select User message type
  3. Enter content: “Hello”
  4. Click Add Message again
  5. Select Assistant message type
  6. Click Enable Evaluation toggle
  7. Select Exact Match as judge type
  8. Enter expected content: “Hello! How can I help you today?”
  9. Click Save Evaluation

Your evaluation is now saved. You can run it against any assistant or squad.

Message structure: Each conversation turn has a role (user, assistant, system, or tool). Assistant messages with judgePlan define what to validate.
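Put together, the evaluation you just created corresponds roughly to the following JSON. The field names mirror the examples later in this guide; treat it as a sketch rather than the exact API schema.

{
  "name": "Greeting Test",
  "description": "Verify assistant greets users appropriately",
  "type": "chat.mockConversation",
  "messages": [
    {
      "role": "user",
      "content": "Hello"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "content": "Hello! How can I help you today?"
      }
    }
  ]
}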

Step 2: Run your evaluation

Execute the evaluation against your assistant or squad.


Open your evaluation

  1. Navigate to Evals in the sidebar
  2. Click on “Greeting Test” from your evaluations list

Select target and run

  1. In the evaluation detail page, find the Run Test section
  2. Select Assistant or Squad as the target type
  3. Choose your assistant/squad from the dropdown
  4. Click Run Evaluation
  5. Watch real-time progress as the test executes

View results

Results appear automatically when the test completes:

  • Green checkmark indicates evaluation passed
  • Red X indicates evaluation failed
  • Click View Details to see full conversation transcript

You can also run evaluations with transient assistant or squad configurations by providing assistant or squad objects instead of IDs in the target.
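If you prefer to trigger runs from a script or CI job, a minimal sketch using the REST API might look like the one below. The endpoint path and request body are assumptions based on the run objects shown in this guide; check the Vapi API reference for the exact eval-run schema before relying on it.

// Minimal sketch: trigger an eval run from a script or CI job.
// NOTE: the endpoint path and body shape are assumptions; consult the
// Vapi API reference for the exact eval-run schema.
const apiKey = process.env.VAPI_API_KEY!;

async function runEval(evalId: string) {
  const response = await fetch("https://api.vapi.ai/eval/run", { // hypothetical path
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      evalId,
      // Saved target by ID; alternatively pass a transient assistant object here.
      target: { type: "assistant", assistantId: "your-assistant-id" },
    }),
  });
  return response.json();
}

runEval("550e8400-e29b-41d4-a716-446655440000").then((run) =>
  console.log(run.status, run.endedReason)
);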

Step 3: Understand test results

Learn to interpret evaluation results and identify issues.

Successful evaluation

When all checks pass, you’ll see:

{
  "id": "eval-run-123",
  "evalId": "550e8400-e29b-41d4-a716-446655440000",
  "status": "ended",
  "endedReason": "mockConversation.done",
  "results": [
    {
      "status": "pass",
      "messages": [
        {
          "role": "user",
          "content": "Hello"
        },
        {
          "role": "assistant",
          "content": "Hello! How can I help you today?",
          "judge": {
            "status": "pass"
          }
        }
      ]
    }
  ]
}

Pass criteria:

  • status is “ended”
  • endedReason is “mockConversation.done”
  • results[0].status is “pass”
  • All judge.status values are “pass”
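If you consume run results programmatically (for example in a CI gate), the pass criteria above translate directly into a small check. This sketch assumes the result shape shown in the example; field names may differ in the live API.

// Sketch of a pass/fail check over an eval-run result, mirroring the
// criteria above. Field names follow the example JSON in this guide.
interface JudgedMessage {
  role: string;
  content?: string;
  judge?: { status: string };
}

interface EvalRunResult {
  status: string;
  messages: JudgedMessage[];
}

interface EvalRun {
  status: string;
  endedReason: string;
  results: EvalRunResult[];
}

function runPassed(run: EvalRun): boolean {
  return (
    run.status === "ended" &&
    run.endedReason === "mockConversation.done" &&
    run.results.length > 0 &&
    run.results.every(
      (r) =>
        r.status === "pass" &&
        r.messages.every((m) => !m.judge || m.judge.status === "pass")
    )
  );
}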

Failed evaluation

When validation fails, you’ll see details:

{
  "status": "ended",
  "endedReason": "mockConversation.done",
  "results": [
    {
      "status": "fail",
      "messages": [
        {
          "role": "user",
          "content": "Hello"
        },
        {
          "role": "assistant",
          "content": "Hi there! What can I do for you?",
          "judge": {
            "status": "fail",
            "failureReason": "Expected exact match: 'Hello! How can I help you today?' but got: 'Hi there! What can I do for you?'"
          }
        }
      ]
    }
  ]
}

Failure indicators:

  • results[0].status is “fail”
  • judge.status is “fail”
  • judge.failureReason explains why validation failed

If endedReason is not “mockConversation.done”, the test encountered an error (like “assistant-error” or “pipeline-error-openai-llm-failed”). Check your assistant configuration.

Step 4: Test tool/function calls

Validate that your assistant calls functions with correct arguments.

Basic tool call validation

Test appointment booking with exact argument matching (a full payload sketch follows the steps below):

  1. Create new evaluation: “Appointment Booking Test”
  2. Add user message: “Book me an appointment for next Monday at 2pm”
  3. Add assistant message with evaluation enabled
  4. Select Exact Match judge type
  5. Click Add Tool Call
  6. Enter function name: “bookAppointment”
  7. Add arguments:
    • date: “2025-01-20”
    • time: “14:00”
  8. Add tool response message:
    • Type: Tool
    • Content: {"status": "success", "confirmationId": "APT-12345"}
  9. Add final assistant message to verify confirmation
  10. Save evaluation
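Expressed as a mock-conversation payload, those steps correspond roughly to the following. This is a sketch built from the judgePlan formats shown in this guide; in particular, the tool-response message format may differ in your setup.

{
  "messages": [
    {
      "role": "user",
      "content": "Book me an appointment for next Monday at 2pm"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [
          {
            "name": "bookAppointment",
            "arguments": { "date": "2025-01-20", "time": "14:00" }
          }
        ]
      }
    },
    {
      "role": "tool",
      "content": "{\"status\": \"success\", \"confirmationId\": \"APT-12345\"}"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "regex",
        "content": ".*APT-12345.*"
      }
    }
  ]
}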

Tool call validation modes

Exact match - Full validation:

{
  "judgePlan": {
    "type": "exact",
    "toolCalls": [
      {
        "name": "bookAppointment",
        "arguments": {
          "date": "2025-01-20",
          "time": "14:00"
        }
      }
    ]
  }
}

Validates both function name AND all argument values exactly.

Partial match - Name only:

{
  "judgePlan": {
    "type": "regex",
    "toolCalls": [
      {
        "name": "bookAppointment"
      }
    ]
  }
}

Validates only that the function was called (arguments can vary).

Multiple tool calls:

{
  "judgePlan": {
    "type": "exact",
    "toolCalls": [
      {
        "name": "checkAvailability",
        "arguments": { "date": "2025-01-20" }
      },
      {
        "name": "bookAppointment",
        "arguments": { "date": "2025-01-20", "time": "14:00" }
      }
    ]
  }
}

Validates multiple function calls in sequence.

Tool calls are validated in the order they’re defined. Use type: "exact" for strict validation or type: "regex" for flexible validation.

Step 5: Use regex for flexible validation

When responses vary slightly (like names, dates, or IDs), use regex patterns for flexible matching.

Common regex patterns

Greeting variations:

{
  "judgePlan": {
    "type": "regex",
    "content": "^(Hello|Hi|Hey),? (I can|I'll|let me) help.*"
  }
}

Matches: “Hello, I can help…”, “Hi I’ll help…”, “Hey let me help…”

Responses with variables:

{
  "judgePlan": {
    "type": "regex",
    "content": ".*appointment.*confirmed.*[A-Z]{3}-[0-9]{5}.*"
  }
}

Matches any confirmation message with appointment ID format.

Date patterns:

{
  "judgePlan": {
    "type": "regex",
    "content": ".*scheduled for (Monday|Tuesday|Wednesday|Thursday|Friday).*"
  }
}

Matches responses mentioning weekdays.

Case-insensitive matching:

{
  "judgePlan": {
    "type": "regex",
    "content": "(?i)booking confirmed"
  }
}

The (?i) flag makes matching case-insensitive.

Example: Flexible booking confirmation

  1. Add assistant message with evaluation enabled
  2. Select Regex as judge type
  3. Enter pattern: .*appointment.*(confirmed|booked).*\d{1,2}:\d{2}.*
  4. This matches various confirmation phrasings with time mentions

Regex tips:

  • Use .* to match any characters
  • Use (option1|option2) for alternatives
  • Use \d for digits, \s for whitespace
  • Use .*? for non-greedy matching
  • Test your patterns with sample responses first
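A quick way to check a pattern before saving it is to run it against a few sample responses locally, for example:

// Quick local check of a judge regex against sample assistant responses.
const pattern = /.*appointment.*(confirmed|booked).*\d{1,2}:\d{2}.*/;

const samples = [
  "Your appointment is confirmed for 14:00 on Monday.",
  "Your appointment has been booked for 2:30 PM.",
  "Sorry, no slots are available next week.",
];

for (const response of samples) {
  // Prints true for the first two samples and false for the last one.
  console.log(pattern.test(response), response);
}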

Step 6: Use AI judge for semantic validation

For complex validation criteria beyond pattern matching, use AI-powered judges to evaluate responses semantically.

AI judge structure

{
  "role": "assistant",
  "judgePlan": {
    "type": "ai",
    "model": {
      "provider": "openai",
      "model": "gpt-4o",
      "messages": [
        {
          "role": "system",
          "content": "Your evaluation prompt here"
        }
      ]
    }
  }
}

Writing effective judge prompts

Template structure:

You are an LLM-Judge. Evaluate ONLY the last assistant message in the mock conversation: {{messages[-1]}}.
Include the full conversation history for context: {{messages}}
Decision rule:
- PASS if ALL "pass criteria" are satisfied AND NONE of the "fail criteria" are triggered.
- Otherwise FAIL.
Pass criteria:
- [Specific requirement 1]
- [Specific requirement 2]
Fail criteria (any one triggers FAIL):
- [Specific failure condition 1]
- [Specific failure condition 2]
Output format: respond with exactly one word: pass or fail
- No explanations
- No punctuation
- No additional text

Template variables:

  • {{messages}} - The entire conversation history (all messages exchanged)
  • {{messages[-1]}} - The last assistant message only

Example: Evaluate helpfulness and tone

  1. Add assistant message with evaluation enabled
  2. Select AI Judge as judge type
  3. Choose provider: OpenAI
  4. Select model: gpt-4o
  5. Enter evaluation prompt (see template above)
  6. Customize pass/fail criteria for your use case
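Putting the template and the structure together, the judgePlan for this example might look like the following. The prompt text is illustrative; adapt the pass/fail criteria to your own assistant.

{
  "role": "assistant",
  "judgePlan": {
    "type": "ai",
    "model": {
      "provider": "openai",
      "model": "gpt-4o",
      "messages": [
        {
          "role": "system",
          "content": "You are an LLM-Judge. Evaluate ONLY the last assistant message: {{messages[-1]}}. Context: {{messages}}. Pass criteria: the response is helpful and maintains a professional, friendly tone. Fail criteria: the response is dismissive, rude, or ignores the user's request. Respond with exactly one word: pass or fail."
        }
      ]
    }
  }
}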

Supported AI judge providers

OpenAI

Models: gpt-4o, gpt-4-turbo, gpt-3.5-turbo
Best for general-purpose evaluation

Anthropic

Models: claude-3-5-sonnet-20241022, claude-3-opus-20240229
Best for nuanced evaluation

Google

Models: gemini-1.5-pro, gemini-1.5-flash
Best for multilingual content

Groq

Models: llama-3.1-70b-versatile, mixtral-8x7b-32768
Best for fast evaluation

Custom LLM:

{
  "model": {
    "provider": "custom-llm",
    "model": "your-model-name",
    "url": "https://your-api-endpoint.com/chat/completions",
    "messages": [...]
  }
}

AI judge best practices

Tips for reliable AI judging:

  • Be specific with pass/fail criteria (avoid ambiguous requirements)
  • Use “ALL pass criteria must be met” logic
  • Use “ANY fail criteria triggers fail” logic
  • Include conversation context with the {{messages}} syntax
  • Request exact “pass” or “fail” output (no explanations)
  • Test criteria with known good/bad responses before production
  • Use consistent evaluation standards across similar tests

Step 7: Control flow with Continue Plan

Define what happens after an evaluation passes or fails using continuePlan.

Exit on failure

Stop the test immediately if a critical check fails:

{
  "role": "assistant",
  "judgePlan": {
    "type": "exact",
    "content": "I can help you with that."
  },
  "continuePlan": {
    "exitOnFailureEnabled": true
  }
}

Use case: Skip expensive subsequent tests when initial validation fails.

Override responses on failure

Provide fallback responses to continue testing even when validation fails:

{
  "role": "assistant",
  "judgePlan": {
    "type": "exact",
    "content": "I've processed your request."
  },
  "continuePlan": {
    "exitOnFailureEnabled": false,
    "contentOverride": "Let me rephrase that...",
    "toolCallsOverride": [
      {
        "name": "retryProcessing",
        "arguments": { "retry": "true" }
      }
    ]
  }
}

Use case: Test error recovery paths or force specific tool calls for subsequent validation.

Example: Multi-step with exit control

  1. Create evaluation with multiple conversation turns
  2. For each assistant message with critical validation:
    • Enable evaluation
    • Configure judge plan (exact, regex, or AI)
    • Toggle Exit on Failure to stop test early
  3. For non-critical checks, leave Exit on Failure disabled

If exitOnFailureEnabled is true and validation fails, the test stops immediately. Subsequent conversation turns are not executed. Use this for critical checkpoints.
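As a sketch, a two-checkpoint conversation where only the first check stops the run on failure could look like this (message contents are illustrative):

{
  "messages": [
    { "role": "user", "content": "I need to book an appointment" },
    {
      "role": "assistant",
      "judgePlan": { "type": "regex", "content": ".*appointment.*" },
      "continuePlan": { "exitOnFailureEnabled": true }
    },
    { "role": "user", "content": "Next Monday at 2pm" },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [
          {
            "name": "bookAppointment",
            "arguments": { "date": "2025-01-20", "time": "14:00" }
          }
        ]
      },
      "continuePlan": { "exitOnFailureEnabled": false }
    }
  ]
}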

Step 8: Test complete conversation flows

Validate multi-turn interactions that simulate real user conversations.

Complete booking flow example

Create a comprehensive test (assembled as a full payload sketch after this list):

  1. Turn 1 - Initial request:
    • User: “I need to schedule an appointment”
    • Assistant evaluation: AI judge checking acknowledgment
  2. Turn 2 - Provide details:
    • User: “Next Monday at 2pm”
    • Assistant evaluation: Exact match on tool call bookAppointment
  3. Turn 3 - Tool response:
    • Tool: {"status": "success", "confirmationId": "APT-12345"}
  4. Turn 4 - Confirmation:
    • Assistant evaluation: Regex matching confirmation with ID
  5. Turn 5 - Follow-up:
    • User: “Can I get that via email?”
    • Assistant evaluation: Exact match on tool call sendEmail
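Assembled as a single mock conversation, the flow above corresponds roughly to the messages array below. Tool names, judge prompts, and the tool-response shape are illustrative; the final check validates only that sendEmail was called, using the name-only form shown in Step 4.

{
  "messages": [
    { "role": "user", "content": "I need to schedule an appointment" },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            { "role": "system", "content": "PASS if the response acknowledges the booking request. Output: pass or fail" }
          ]
        }
      }
    },
    { "role": "user", "content": "Next Monday at 2pm" },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [
          { "name": "bookAppointment", "arguments": { "date": "2025-01-20", "time": "14:00" } }
        ]
      }
    },
    { "role": "tool", "content": "{\"status\": \"success\", \"confirmationId\": \"APT-12345\"}" },
    {
      "role": "assistant",
      "judgePlan": { "type": "regex", "content": ".*confirmed.*APT-[0-9]{5}.*" }
    },
    { "role": "user", "content": "Can I get that via email?" },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "regex",
        "toolCalls": [{ "name": "sendEmail" }]
      }
    }
  ]
}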

System message injection

Inject system prompts mid-conversation to test dynamic behavior changes:

{
  "messages": [
    {
      "role": "user",
      "content": "Hello"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "regex",
        "content": ".*help.*"
      }
    },
    {
      "role": "system",
      "content": "You are now in urgent mode. Prioritize speed."
    },
    {
      "role": "user",
      "content": "I need immediate help"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            {
              "role": "system",
              "content": "PASS if response shows urgency. FAIL if response is casual. Output: pass or fail"
            }
          ]
        }
      }
    }
  ]
}

Multi-turn testing tips:

  • Keep conversations focused (5-10 turns for most tests)
  • Use exit-on-failure for early turns to save time
  • Test one primary flow per evaluation
  • Mix judge types (exact, regex, AI) for comprehensive validation
  • Include tool responses to simulate real interactions

Step 9: Manage evaluations

List, update, and organize your evaluation suite.

List all evaluations

  1. Navigate to Evals in the sidebar
  2. View all evaluations in a table with:
    • Name and description
    • Created date
    • Last run status
    • Actions (Edit, Run, Delete)
  3. Use search to filter by name
  4. Sort by date or status
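If you manage a large suite, you may also want to list evaluations programmatically. The sketch below assumes a conventional REST listing endpoint; verify the exact path and response shape in the API reference.

// Sketch: list evaluations via the REST API.
// NOTE: the /eval path is an assumption; check the Vapi API reference.
async function listEvals() {
  const res = await fetch("https://api.vapi.ai/eval", {
    headers: { Authorization: `Bearer ${process.env.VAPI_API_KEY}` },
  });
  const evals = await res.json();
  for (const e of evals) {
    console.log(e.id, e.name);
  }
}

listEvals();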

Update an evaluation

  1. Navigate to Evals and click on an evaluation
  2. Click Edit button
  3. Modify conversation turns, judge plans, or settings
  4. Click Save Changes
  5. Previous test runs remain unchanged

Delete an evaluation

  1. Navigate to Evals
  2. Click on an evaluation
  3. Click Delete button
  4. Confirm deletion

Deleting an evaluation does NOT delete its run history. Past run results remain accessible.

View run history

  1. Navigate to Evals
  2. Click on an evaluation
  3. View Runs tab showing:
    • Run timestamp
    • Target (assistant/squad)
    • Status (pass/fail)
    • Duration
  4. Click any run to view detailed results

Expected output

Successful run

{
  "id": "eval-run-123",
  "evalId": "550e8400-e29b-41d4-a716-446655440000",
  "orgId": "org-123",
  "status": "ended",
  "endedReason": "mockConversation.done",
  "createdAt": "2024-01-15T09:35:00Z",
  "updatedAt": "2024-01-15T09:35:45Z",
  "results": [
    {
      "status": "pass",
      "messages": [
        {
          "role": "user",
          "content": "Hello"
        },
        {
          "role": "assistant",
          "content": "Hello! How can I help you today?",
          "judge": {
            "status": "pass"
          }
        }
      ]
    }
  ],
  "target": {
    "type": "assistant",
    "assistantId": "your-assistant-id"
  }
}

Indicators of success:

  • status is “ended”
  • endedReason is “mockConversation.done”
  • results[0].status is “pass”
  • All judge.status values are “pass”

Failed run

{
  "id": "eval-run-124",
  "status": "ended",
  "endedReason": "mockConversation.done",
  "results": [
    {
      "status": "fail",
      "messages": [
        {
          "role": "user",
          "content": "Book an appointment for Monday at 2pm"
        },
        {
          "role": "assistant",
          "content": "Sure, let me help you with that.",
          "toolCalls": [
            {
              "name": "bookAppointment",
              "arguments": {
                "date": "2025-01-20",
                "time": "2:00 PM"
              }
            }
          ],
          "judge": {
            "status": "fail",
            "failureReason": "Tool call arguments mismatch. Expected time: '14:00' but got: '2:00 PM'"
          }
        }
      ]
    }
  ]
}

Indicators of failure:

  • results[0].status is “fail”
  • judge.status is “fail”
  • judge.failureReason provides specific details

Full conversation transcripts show both expected and actual values, making debugging straightforward.

Common patterns

Multiple validation types in one eval

Combine exact, regex, and AI judges for comprehensive testing:

{
  "messages": [
    {
      "role": "user",
      "content": "Hello"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "content": "Hello! How can I help you?"
      }
    },
    {
      "role": "user",
      "content": "Book appointment for Monday"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "regex",
        "content": ".*(Monday|next week).*"
      }
    },
    {
      "role": "user",
      "content": "Thanks for your help"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            {
              "role": "system",
              "content": "PASS if response is polite and acknowledges thanks. Output: pass or fail"
            }
          ]
        }
      }
    }
  ]
}

Test squad handoffs

Validate smooth transitions between squad members:

{
  "name": "Squad Handoff Test",
  "messages": [
    {
      "role": "user",
      "content": "I need technical support"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [
          {
            "name": "transferToSquadMember",
            "arguments": {
              "destination": "technical-support-agent"
            }
          }
        ]
      }
    }
  ],
  "target": {
    "type": "squad",
    "squadId": "your-squad-id"
  }
}

Regression test suite

Organize related tests for systematic validation:

{
  "name": "Greeting Regression Suite",
  "tests": [
    "Greeting Test - Formal",
    "Greeting Test - Casual",
    "Greeting Test - Multilingual"
  ]
}

Run multiple evals sequentially to validate all greeting scenarios.
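For example, a small script can run each evaluation in the suite in order and report the results. This sketch reuses the same hypothetical run endpoint as the Step 2 sketch and assumes the endpoint returns the completed run; if runs are asynchronous in your setup, poll the run by ID instead. Check the API reference for the exact schema.

// Sketch: run a suite of related evals sequentially against one assistant
// and report pass/fail for each. The /eval/run path is an assumption.
const apiKey = process.env.VAPI_API_KEY!;
const suite = ["eval-greeting-formal", "eval-greeting-casual", "eval-greeting-multilingual"];

async function runSuite(assistantId: string) {
  for (const evalId of suite) {
    const res = await fetch("https://api.vapi.ai/eval/run", { // hypothetical path
      method: "POST",
      headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
      body: JSON.stringify({ evalId, target: { type: "assistant", assistantId } }),
    });
    const run = await res.json();
    const passed =
      run.status === "ended" &&
      run.endedReason === "mockConversation.done" &&
      run.results?.every((r: { status: string }) => r.status === "pass");
    console.log(`${evalId}: ${passed ? "pass" : "fail"}`);
  }
}

runSuite("your-assistant-id");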

Troubleshooting

  • Eval always fails: Verify exact match strings character-by-character. Consider using regex for flexibility.
  • AI judge inconsistent: Make pass/fail criteria more specific and binary. Test with known examples.
  • Tool calls not matching: Check argument types (string vs number). Ensure exact spelling of function names.
  • Run stuck in “running”: Verify assistant configuration. Check for errors in the assistant’s tools or prompts.
  • Timeout errors: Reduce conversation length or simplify evaluations. Check assistant response times.
  • Regex not matching: Test regex patterns separately. Remember to escape special characters like . or ?.
  • Empty results array: Check the endedReason field. The assistant may have encountered an error before completion.
  • Missing judge results: Verify judgePlan is properly configured in assistant messages.

Common errors

“mockConversation.done” not reached:

  • Check endedReason for actual error (e.g., “assistant-error”, “pipeline-error-openai-llm-failed”)
  • Verify assistant configuration (model, voice, tools)
  • Check API key validity and rate limits

Judge validation fails unexpectedly:

  • Review actual vs expected output in failureReason
  • For exact match: Check for extra spaces, punctuation, or case differences
  • For regex: Test pattern with online regex validators
  • For AI judge: Verify prompt clarity and binary pass/fail logic

Tool calls not validated:

  • Ensure tool is properly configured in assistant
  • Check argument types match exactly (string “14:00” vs number 14)
  • Verify tool function names are spelled correctly

If you see endedReason: "assistant-error", your assistant configuration has issues. Test the assistant manually first before running evals.

Next steps

Tips for success

Best practices for reliable testing:

  • Start simple with exact matches, then add complexity
  • One behavior per evaluation turn keeps tests focused
  • Use descriptive names that explain what’s being tested
  • Test both happy paths and edge cases
  • Version control your evals alongside assistant configs
  • Run critical tests first to fail fast
  • Review failure reasons promptly and iterate
  • Document why each test exists (use descriptions)

Get help

Need assistance? We’re here to help.