Evals quickstart

Get started with AI agent testing in 5 minutes

Overview

This quickstart guide will help you set up automated testing for your AI assistants and squads. In just a few minutes, you’ll create mock conversations, define expected behaviors, and validate your agents work correctly before production.

What are Evals?

Evals is Vapi’s AI agent testing framework that enables you to systematically test assistants and squads using mock conversations with automated validation. Test your agents by:

  1. Creating mock conversations - Define user messages and expected assistant responses
  2. Validating behavior - Use exact match, regex patterns, or AI-powered judging
  3. Testing tool calls - Verify function calls with specific arguments
  4. Running automated tests - Execute tests and receive detailed pass/fail results
  5. Debugging failures - Review full conversation transcripts with evaluation details

When are Evals useful?

Evals help you maintain quality and catch issues early:

  • Pre-deployment testing - Validate new assistant configurations before going live
  • Regression testing - Ensure prompt or tool changes don’t break existing behaviors
  • Conversation flow validation - Test multi-turn interactions and complex scenarios
  • Tool calling verification - Validate function calls with correct arguments
  • Squad handoff testing - Ensure smooth transitions between squad members
  • CI/CD integration - Automate quality gates in your deployment pipeline

What you’ll build

An evaluation suite for an appointment booking assistant that tests:

  • Greeting and initial response validation
  • Tool call execution with specific arguments
  • Response pattern matching with regex
  • Semantic validation using AI judges
  • Multi-turn conversation flows

Prerequisites

  • A Vapi account
  • An API key (get yours from the API Keys page in the Dashboard sidebar)

You’ll also need an existing assistant or squad to test. You can create one in the Dashboard or use the API.

Step 1: Create your first evaluation

Define a mock conversation to test your assistant’s greeting behavior. From Evals in the sidebar, create a new evaluation, then configure it as follows.

Configure basic settings

  1. Name: Enter “Greeting Test”
  2. Description: Add “Verify assistant greets users appropriately”
  3. Type: Automatically set to “chat.mockConversation”

Add conversation turns

  1. Click Add Message
  2. Select User message type
  3. Enter content: “Hello”
  4. Click Add Message again
  5. Select Assistant message type
  6. Click Enable Evaluation toggle
  7. Select Exact Match as judge type
  8. Enter expected content: “Hello! How can I help you today?”
  9. Click Save Evaluation

Your evaluation is now saved. You can run it against any assistant or squad.

Message structure: Each conversation turn has a role (user, assistant, system, or tool). Assistant messages with judgePlan define what to validate.
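Put together, the evaluation you just created corresponds roughly to the following JSON. The field names mirror the examples later in this guide; treat it as a sketch rather than the exact API schema.

{
  "name": "Greeting Test",
  "description": "Verify assistant greets users appropriately",
  "type": "chat.mockConversation",
  "messages": [
    {
      "role": "user",
      "content": "Hello"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "content": "Hello! How can I help you today?"
      }
    }
  ]
}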

Step 2: Run your evaluation

Execute the evaluation against your assistant or squad.


Open your evaluation

  1. Navigate to Evals in the sidebar
  2. Click on “Greeting Test” from your evaluations list

Select target and run

  1. In the evaluation detail page, find the Run Test section
  2. Select Assistant or Squad as the target type
  3. Choose your assistant/squad from the dropdown
  4. Click Run Evaluation
  5. Watch real-time progress as the test executes

View results

Results appear automatically when the test completes:

  • Green checkmark indicates evaluation passed
  • Red X indicates evaluation failed
  • Click View Details to see full conversation transcript

You can also run evaluations with transient assistant or squad configurations by providing assistant or squad objects instead of IDs in the target.
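If you prefer to trigger runs from a script or CI job, a minimal sketch using the REST API might look like the one below. The endpoint path and request body are assumptions based on the run objects shown in this guide; check the Vapi API reference for the exact eval-run schema before relying on it.

// Minimal sketch: trigger an eval run from a script or CI job.
// NOTE: the endpoint path and body shape are assumptions; consult the
// Vapi API reference for the exact eval-run schema.
const apiKey = process.env.VAPI_API_KEY!;

async function runEval(evalId: string) {
  const response = await fetch("https://api.vapi.ai/eval/run", { // hypothetical path
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      evalId,
      // Saved target by ID; alternatively pass a transient assistant object here.
      target: { type: "assistant", assistantId: "your-assistant-id" },
    }),
  });
  return response.json();
}

runEval("550e8400-e29b-41d4-a716-446655440000").then((run) =>
  console.log(run.status, run.endedReason)
);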

Step 3: Understand test results

Learn to interpret evaluation results and identify issues.

Successful evaluation

When all checks pass, you’ll see:

{
  "id": "eval-run-123",
  "evalId": "550e8400-e29b-41d4-a716-446655440000",
  "status": "ended",
  "endedReason": "mockConversation.done",
  "results": [
    {
      "status": "pass",
      "messages": [
        {
          "role": "user",
          "content": "Hello"
        },
        {
          "role": "assistant",
          "content": "Hello! How can I help you today?",
          "judge": {
            "status": "pass"
          }
        }
      ]
    }
  ]
}

Pass criteria:

  • status is “ended”
  • endedReason is “mockConversation.done”
  • results[0].status is “pass”
  • All judge.status values are “pass”
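If you consume run results programmatically (for example in a CI gate), the pass criteria above translate directly into a small check. This sketch assumes the result shape shown in the example; field names may differ in the live API.

// Sketch of a pass/fail check over an eval-run result, mirroring the
// criteria above. Field names follow the example JSON in this guide.
interface JudgedMessage {
  role: string;
  content?: string;
  judge?: { status: string };
}

interface EvalRunResult {
  status: string;
  messages: JudgedMessage[];
}

interface EvalRun {
  status: string;
  endedReason: string;
  results: EvalRunResult[];
}

function runPassed(run: EvalRun): boolean {
  return (
    run.status === "ended" &&
    run.endedReason === "mockConversation.done" &&
    run.results.length > 0 &&
    run.results.every(
      (r) =>
        r.status === "pass" &&
        r.messages.every((m) => !m.judge || m.judge.status === "pass")
    )
  );
}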

Failed evaluation

When validation fails, you’ll see details:

{
  "status": "ended",
  "endedReason": "mockConversation.done",
  "results": [
    {
      "status": "fail",
      "messages": [
        {
          "role": "user",
          "content": "Hello"
        },
        {
          "role": "assistant",
          "content": "Hi there! What can I do for you?",
          "judge": {
            "status": "fail",
            "failureReason": "Expected exact match: 'Hello! How can I help you today?' but got: 'Hi there! What can I do for you?'"
          }
        }
      ]
    }
  ]
}

Failure indicators:

  • results[0].status is “fail”
  • judge.status is “fail”
  • judge.failureReason explains why validation failed

If endedReason is not “mockConversation.done”, the test encountered an error (like “assistant-error” or “pipeline-error-openai-llm-failed”). Check your assistant configuration.

Step 4: Test tool/function calls

Validate that your assistant calls functions with correct arguments.

Basic tool call validation

Test appointment booking with exact argument matching (a full payload sketch follows the steps below):

  1. Create new evaluation: “Appointment Booking Test”
  2. Add user message: “Book me an appointment for next Monday at 2pm”
  3. Add assistant message with evaluation enabled
  4. Select Exact Match judge type
  5. Click Add Tool Call
  6. Enter function name: “bookAppointment”
  7. Add arguments:
    • date: “2025-01-20”
    • time: “14:00”
  8. Add tool response message:
    • Type: Tool
    • Content: {"status": "success", "confirmationId": "APT-12345"}
  9. Add final assistant message to verify confirmation
  10. Save evaluation
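Expressed as a mock-conversation payload, those steps correspond roughly to the following. This is a sketch built from the judgePlan formats shown in this guide; in particular, the tool-response message format may differ in your setup.

{
  "messages": [
    {
      "role": "user",
      "content": "Book me an appointment for next Monday at 2pm"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [
          {
            "name": "bookAppointment",
            "arguments": { "date": "2025-01-20", "time": "14:00" }
          }
        ]
      }
    },
    {
      "role": "tool",
      "content": "{\"status\": \"success\", \"confirmationId\": \"APT-12345\"}"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "regex",
        "content": ".*APT-12345.*"
      }
    }
  ]
}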

Tool call validation modes

Exact match - Full validation:

{
  "judgePlan": {
    "type": "exact",
    "toolCalls": [
      {
        "name": "bookAppointment",
        "arguments": {
          "date": "2025-01-20",
          "time": "14:00"
        }
      }
    ]
  }
}

Validates both function name AND all argument values exactly.

Partial match - Name only:

{
  "judgePlan": {
    "type": "regex",
    "toolCalls": [
      {
        "name": "bookAppointment"
      }
    ]
  }
}

Validates only that the function was called (arguments can vary).

Multiple tool calls:

{
  "judgePlan": {
    "type": "exact",
    "toolCalls": [
      {
        "name": "checkAvailability",
        "arguments": { "date": "2025-01-20" }
      },
      {
        "name": "bookAppointment",
        "arguments": { "date": "2025-01-20", "time": "14:00" }
      }
    ]
  }
}

Validates multiple function calls in sequence.

Tool calls are validated in the order they’re defined. Use type: "exact" for strict validation or type: "regex" for flexible validation.

Step 5: Use regex for flexible validation

When responses vary slightly (like names, dates, or IDs), use regex patterns for flexible matching.

Common regex patterns

Greeting variations:

{
  "judgePlan": {
    "type": "regex",
    "content": "^(Hello|Hi|Hey),? (I can|I'll|let me) help.*"
  }
}

Matches: “Hello, I can help…”, “Hi I’ll help…”, “Hey let me help…”

Responses with variables:

{
  "judgePlan": {
    "type": "regex",
    "content": ".*appointment.*confirmed.*[A-Z]{3}-[0-9]{5}.*"
  }
}

Matches any confirmation message with appointment ID format.

Date patterns:

{
  "judgePlan": {
    "type": "regex",
    "content": ".*scheduled for (Monday|Tuesday|Wednesday|Thursday|Friday).*"
  }
}

Matches responses mentioning weekdays.

Case-insensitive matching:

{
  "judgePlan": {
    "type": "regex",
    "content": "(?i)booking confirmed"
  }
}

The (?i) flag makes matching case-insensitive.

Example: Flexible booking confirmation

  1. Add assistant message with evaluation enabled
  2. Select Regex as judge type
  3. Enter pattern: .*appointment.*(confirmed|booked).*\d{1,2}:\d{2}.*
  4. This matches various confirmation phrasings with time mentions

Regex tips:

  • Use .* to match any characters
  • Use (option1|option2) for alternatives
  • Use \d for digits, \s for whitespace
  • Use .*? for non-greedy matching
  • Test your patterns with sample responses first
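A quick way to check a pattern before saving it is to run it against a few sample responses locally, for example:

// Quick local check of a judge regex against sample assistant responses.
const pattern = /.*appointment.*(confirmed|booked).*\d{1,2}:\d{2}.*/;

const samples = [
  "Your appointment is confirmed for 14:00 on Monday.",
  "Your appointment has been booked for 2:30 PM.",
  "Sorry, no slots are available next week.",
];

for (const response of samples) {
  // Prints true for the first two samples and false for the last one.
  console.log(pattern.test(response), response);
}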

Step 6: Use AI judge for semantic validation

For complex validation criteria beyond pattern matching, use AI-powered judges to evaluate responses semantically.

AI judge structure

{
  "role": "assistant",
  "judgePlan": {
    "type": "ai",
    "model": {
      "provider": "openai",
      "model": "gpt-4o",
      "messages": [
        {
          "role": "system",
          "content": "Your evaluation prompt here"
        }
      ]
    }
  }
}

Writing effective judge prompts

Template structure:

You are an LLM-Judge. Evaluate ONLY the last assistant message in the mock conversation: {{messages[-1]}}.
Include the full conversation history for context: {{messages}}
Decision rule:
- PASS if ALL "pass criteria" are satisfied AND NONE of the "fail criteria" are triggered.
- Otherwise FAIL.
Pass criteria:
- [Specific requirement 1]
- [Specific requirement 2]
Fail criteria (any one triggers FAIL):
- [Specific failure condition 1]
- [Specific failure condition 2]
Output format: respond with exactly one word: pass or fail
- No explanations
- No punctuation
- No additional text

Template variables:

  • {{messages}} - The entire conversation history (all messages exchanged)
  • {{messages[-1]}} - The last assistant message only

Example: Evaluate helpfulness and tone

  1. Add assistant message with evaluation enabled
  2. Select AI Judge as judge type
  3. Choose provider: OpenAI
  4. Select model: gpt-4o
  5. Enter evaluation prompt (see template above)
  6. Customize pass/fail criteria for your use case
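Putting the template and the structure together, the judgePlan for this example might look like the following. The prompt text is illustrative; adapt the pass/fail criteria to your own assistant.

{
  "role": "assistant",
  "judgePlan": {
    "type": "ai",
    "model": {
      "provider": "openai",
      "model": "gpt-4o",
      "messages": [
        {
          "role": "system",
          "content": "You are an LLM-Judge. Evaluate ONLY the last assistant message: {{messages[-1]}}. Context: {{messages}}. Pass criteria: the response is helpful and maintains a professional, friendly tone. Fail criteria: the response is dismissive, rude, or ignores the user's request. Respond with exactly one word: pass or fail."
        }
      ]
    }
  }
}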

Supported AI judge providers

OpenAI

Models: gpt-4o, gpt-4-turbo, gpt-3.5-turbo
Best for general-purpose evaluation

Anthropic

Models: claude-3-5-sonnet-20241022, claude-3-opus-20240229
Best for nuanced evaluation

Google

Models: gemini-1.5-pro, gemini-1.5-flash
Best for multilingual content

Groq

Models: llama-3.1-70b-versatile, mixtral-8x7b-32768
Best for fast evaluation

Custom LLM:

{
  "model": {
    "provider": "custom-llm",
    "model": "your-model-name",
    "url": "https://your-api-endpoint.com/chat/completions",
    "messages": [...]
  }
}

AI judge best practices

Tips for reliable AI judging:

  • Be specific with pass/fail criteria (avoid ambiguous requirements)
  • Use “ALL pass criteria must be met” logic
  • Use “ANY fail criteria triggers fail” logic
  • Include conversation context with the {{messages}} syntax
  • Request exact “pass” or “fail” output (no explanations)
  • Test criteria with known good/bad responses before production
  • Use consistent evaluation standards across similar tests

Step 7: Control flow with Continue Plan

Define what happens after an evaluation passes or fails using continuePlan.

Exit on failure

Stop the test immediately if a critical check fails:

{
  "role": "assistant",
  "judgePlan": {
    "type": "exact",
    "content": "I can help you with that."
  },
  "continuePlan": {
    "exitOnFailureEnabled": true
  }
}

Use case: Skip expensive subsequent tests when initial validation fails.

Override responses on failure

Provide fallback responses to continue testing even when validation fails:

{
  "role": "assistant",
  "judgePlan": {
    "type": "exact",
    "content": "I've processed your request."
  },
  "continuePlan": {
    "exitOnFailureEnabled": false,
    "contentOverride": "Let me rephrase that...",
    "toolCallsOverride": [
      {
        "name": "retryProcessing",
        "arguments": { "retry": "true" }
      }
    ]
  }
}

Use case: Test error recovery paths or force specific tool calls for subsequent validation.

Example: Multi-step with exit control

  1. Create evaluation with multiple conversation turns
  2. For each assistant message with critical validation:
    • Enable evaluation
    • Configure judge plan (exact, regex, or AI)
    • Toggle Exit on Failure to stop test early
  3. For non-critical checks, leave Exit on Failure disabled

If exitOnFailureEnabled is true and validation fails, the test stops immediately. Subsequent conversation turns are not executed. Use this for critical checkpoints.
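As a sketch, a two-checkpoint conversation where only the first check stops the run on failure could look like this (message contents are illustrative):

{
  "messages": [
    { "role": "user", "content": "I need to book an appointment" },
    {
      "role": "assistant",
      "judgePlan": { "type": "regex", "content": ".*appointment.*" },
      "continuePlan": { "exitOnFailureEnabled": true }
    },
    { "role": "user", "content": "Next Monday at 2pm" },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [
          {
            "name": "bookAppointment",
            "arguments": { "date": "2025-01-20", "time": "14:00" }
          }
        ]
      },
      "continuePlan": { "exitOnFailureEnabled": false }
    }
  ]
}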

Step 8: Test complete conversation flows

Validate multi-turn interactions that simulate real user conversations.

Complete booking flow example

Create a comprehensive test (assembled as a full payload sketch after this list):

  1. Turn 1 - Initial request:
    • User: “I need to schedule an appointment”
    • Assistant evaluation: AI judge checking acknowledgment
  2. Turn 2 - Provide details:
    • User: “Next Monday at 2pm”
    • Assistant evaluation: Exact match on tool call bookAppointment
  3. Turn 3 - Tool response:
    • Tool: {"status": "success", "confirmationId": "APT-12345"}
  4. Turn 4 - Confirmation:
    • Assistant evaluation: Regex matching confirmation with ID
  5. Turn 5 - Follow-up:
    • User: “Can I get that via email?”
    • Assistant evaluation: Exact match on tool call sendEmail
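Assembled as a single mock conversation, the flow above corresponds roughly to the messages array below. Tool names, judge prompts, and the tool-response shape are illustrative; the final check validates only that sendEmail was called, using the name-only form shown in Step 4.

{
  "messages": [
    { "role": "user", "content": "I need to schedule an appointment" },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            { "role": "system", "content": "PASS if the response acknowledges the booking request. Output: pass or fail" }
          ]
        }
      }
    },
    { "role": "user", "content": "Next Monday at 2pm" },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [
          { "name": "bookAppointment", "arguments": { "date": "2025-01-20", "time": "14:00" } }
        ]
      }
    },
    { "role": "tool", "content": "{\"status\": \"success\", \"confirmationId\": \"APT-12345\"}" },
    {
      "role": "assistant",
      "judgePlan": { "type": "regex", "content": ".*confirmed.*APT-[0-9]{5}.*" }
    },
    { "role": "user", "content": "Can I get that via email?" },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "regex",
        "toolCalls": [{ "name": "sendEmail" }]
      }
    }
  ]
}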

System message injection

Inject system prompts mid-conversation to test dynamic behavior changes:

{
  "messages": [
    {
      "role": "user",
      "content": "Hello"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "regex",
        "content": ".*help.*"
      }
    },
    {
      "role": "system",
      "content": "You are now in urgent mode. Prioritize speed."
    },
    {
      "role": "user",
      "content": "I need immediate help"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            {
              "role": "system",
              "content": "PASS if response shows urgency. FAIL if response is casual. Output: pass or fail"
            }
          ]
        }
      }
    }
  ]
}

Multi-turn testing tips:

  • Keep conversations focused (5-10 turns for most tests)
  • Use exit-on-failure for early turns to save time
  • Test one primary flow per evaluation
  • Mix judge types (exact, regex, AI) for comprehensive validation
  • Include tool responses to simulate real interactions

Step 9: Manage evaluations

List, update, and organize your evaluation suite.

List all evaluations

  1. Navigate to Evals in the sidebar
  2. View all evaluations in a table with:
    • Name and description
    • Created date
    • Last run status
    • Actions (Edit, Run, Delete)
  3. Use search to filter by name
  4. Sort by date or status
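If you manage a large suite, you may also want to list evaluations programmatically. The sketch below assumes a conventional REST listing endpoint; verify the exact path and response shape in the API reference.

// Sketch: list evaluations via the REST API.
// NOTE: the /eval path is an assumption; check the Vapi API reference.
async function listEvals() {
  const res = await fetch("https://api.vapi.ai/eval", {
    headers: { Authorization: `Bearer ${process.env.VAPI_API_KEY}` },
  });
  const evals = await res.json();
  for (const e of evals) {
    console.log(e.id, e.name);
  }
}

listEvals();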

Update an evaluation

  1. Navigate to Evals and click on an evaluation
  2. Click Edit button
  3. Modify conversation turns, judge plans, or settings
  4. Click Save Changes
  5. Previous test runs remain unchanged

Delete an evaluation

  1. Navigate to Evals
  2. Click on an evaluation
  3. Click Delete button
  4. Confirm deletion

Deleting an evaluation does NOT delete its run history. Past run results remain accessible.

View run history

  1. Navigate to Evals
  2. Click on an evaluation
  3. View Runs tab showing:
    • Run timestamp
    • Target (assistant/squad)
    • Status (pass/fail)
    • Duration
  4. Click any run to view detailed results

Expected output

Successful run

{
  "id": "eval-run-123",
  "evalId": "550e8400-e29b-41d4-a716-446655440000",
  "orgId": "org-123",
  "status": "ended",
  "endedReason": "mockConversation.done",
  "createdAt": "2024-01-15T09:35:00Z",
  "updatedAt": "2024-01-15T09:35:45Z",
  "results": [
    {
      "status": "pass",
      "messages": [
        {
          "role": "user",
          "content": "Hello"
        },
        {
          "role": "assistant",
          "content": "Hello! How can I help you today?",
          "judge": {
            "status": "pass"
          }
        }
      ]
    }
  ],
  "target": {
    "type": "assistant",
    "assistantId": "your-assistant-id"
  }
}

Indicators of success:

  • status is “ended”
  • endedReason is “mockConversation.done”
  • results[0].status is “pass”
  • All judge.status values are “pass”

Failed run

{
  "id": "eval-run-124",
  "status": "ended",
  "endedReason": "mockConversation.done",
  "results": [
    {
      "status": "fail",
      "messages": [
        {
          "role": "user",
          "content": "Book an appointment for Monday at 2pm"
        },
        {
          "role": "assistant",
          "content": "Sure, let me help you with that.",
          "toolCalls": [
            {
              "name": "bookAppointment",
              "arguments": {
                "date": "2025-01-20",
                "time": "2:00 PM"
              }
            }
          ],
          "judge": {
            "status": "fail",
            "failureReason": "Tool call arguments mismatch. Expected time: '14:00' but got: '2:00 PM'"
          }
        }
      ]
    }
  ]
}

Indicators of failure:

  • results[0].status is “fail”
  • judge.status is “fail”
  • judge.failureReason provides specific details

Full conversation transcripts show both expected and actual values, making debugging straightforward.

Common patterns

Multiple validation types in one eval

Combine exact, regex, and AI judges for comprehensive testing:

{
  "messages": [
    {
      "role": "user",
      "content": "Hello"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "content": "Hello! How can I help you?"
      }
    },
    {
      "role": "user",
      "content": "Book appointment for Monday"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "regex",
        "content": ".*(Monday|next week).*"
      }
    },
    {
      "role": "user",
      "content": "Thanks for your help"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            {
              "role": "system",
              "content": "PASS if response is polite and acknowledges thanks. Output: pass or fail"
            }
          ]
        }
      }
    }
  ]
}

Test squad handoffs

Validate smooth transitions between squad members:

{
  "name": "Squad Handoff Test",
  "messages": [
    {
      "role": "user",
      "content": "I need technical support"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [
          {
            "name": "transferToSquadMember",
            "arguments": {
              "destination": "technical-support-agent"
            }
          }
        ]
      }
    }
  ],
  "target": {
    "type": "squad",
    "squadId": "your-squad-id"
  }
}

Regression test suite

Organize related tests for systematic validation:

{
  "name": "Greeting Regression Suite",
  "tests": [
    "Greeting Test - Formal",
    "Greeting Test - Casual",
    "Greeting Test - Multilingual"
  ]
}

Run multiple evals sequentially to validate all greeting scenarios.
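For example, a small script can run each evaluation in the suite in order and report the results. This sketch reuses the same hypothetical run endpoint as the Step 2 sketch and assumes the endpoint returns the completed run; if runs are asynchronous in your setup, poll the run by ID instead. Check the API reference for the exact schema.

// Sketch: run a suite of related evals sequentially against one assistant
// and report pass/fail for each. The /eval/run path is an assumption.
const apiKey = process.env.VAPI_API_KEY!;
const suite = ["eval-greeting-formal", "eval-greeting-casual", "eval-greeting-multilingual"];

async function runSuite(assistantId: string) {
  for (const evalId of suite) {
    const res = await fetch("https://api.vapi.ai/eval/run", { // hypothetical path
      method: "POST",
      headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
      body: JSON.stringify({ evalId, target: { type: "assistant", assistantId } }),
    });
    const run = await res.json();
    const passed =
      run.status === "ended" &&
      run.endedReason === "mockConversation.done" &&
      run.results?.every((r: { status: string }) => r.status === "pass");
    console.log(`${evalId}: ${passed ? "pass" : "fail"}`);
  }
}

runSuite("your-assistant-id");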

Troubleshooting

  • Eval always fails: Verify exact match strings character-by-character. Consider using regex for flexibility.
  • AI judge inconsistent: Make pass/fail criteria more specific and binary. Test with known examples.
  • Tool calls not matching: Check argument types (string vs number). Ensure exact spelling of function names.
  • Run stuck in “running”: Verify assistant configuration. Check for errors in the assistant’s tools or prompts.
  • Timeout errors: Reduce conversation length or simplify evaluations. Check assistant response times.
  • Regex not matching: Test regex patterns separately. Remember to escape special characters like . or ?.
  • Empty results array: Check the endedReason field. The assistant may have encountered an error before completion.
  • Missing judge results: Verify judgePlan is properly configured in assistant messages.

Common errors

“mockConversation.done” not reached:

  • Check endedReason for actual error (e.g., “assistant-error”, “pipeline-error-openai-llm-failed”)
  • Verify assistant configuration (model, voice, tools)
  • Check API key validity and rate limits

Judge validation fails unexpectedly:

  • Review actual vs expected output in failureReason
  • For exact match: Check for extra spaces, punctuation, or case differences
  • For regex: Test pattern with online regex validators
  • For AI judge: Verify prompt clarity and binary pass/fail logic

Tool calls not validated:

  • Ensure tool is properly configured in assistant
  • Check argument types match exactly (string “14:00” vs number 14)
  • Verify tool function names are spelled correctly

If you see endedReason: "assistant-error", your assistant configuration has issues. Test the assistant manually first before running evals.

Next steps

Tips for success

Best practices for reliable testing:

  • Start simple with exact matches, then add complexity
  • One behavior per evaluation turn keeps tests focused
  • Use descriptive names that explain what’s being tested
  • Test both happy paths and edge cases
  • Version control your evals alongside assistant configs
  • Run critical tests first to fail fast
  • Review failure reasons promptly and iterate
  • Document why each test exists (use descriptions)

Get help

Need assistance? We’re here to help.