
# Evals quickstart

## Overview

This quickstart guide will help you set up automated testing for your AI assistants and squads. In just a few minutes, you'll create mock conversations, define expected behaviors, and validate that your agents work correctly before they reach production.

<iframe src="https://www.tella.tv/video/cmgu6muyb002m0bktda8r6nou/embed?b=0&title=0&a=1&loop=0&t=0&muted=0&wt=1" allowfullscreen allowtransparency />

### What are Evals?

Evals is Vapi's AI agent testing framework that enables you to systematically test assistants and squads using mock conversations with automated validation. Test your agents by:

1. **Creating mock conversations** - Define user messages and expected assistant responses
2. **Validating behavior** - Use exact match, regex patterns, or AI-powered judging
3. **Testing tool calls** - Verify function calls with specific arguments
4. **Running automated tests** - Execute tests and receive detailed pass/fail results
5. **Debugging failures** - Review full conversation transcripts with evaluation details

### When are Evals useful?

Evals help you maintain quality and catch issues early:

* **Pre-deployment testing** - Validate new assistant configurations before going live
* **Regression testing** - Ensure prompt or tool changes don't break existing behaviors
* **Conversation flow validation** - Test multi-turn interactions and complex scenarios
* **Tool calling verification** - Validate function calls with correct arguments
* **Squad handoff testing** - Ensure smooth transitions between squad members
* **CI/CD integration** - Automate quality gates in your deployment pipeline

### What you'll build

An evaluation suite for an appointment booking assistant that tests:

* Greeting and initial response validation
* Tool call execution with specific arguments
* Response pattern matching with regex
* Semantic validation using AI judges
* Multi-turn conversation flows

## Prerequisites

* Sign up at [dashboard.vapi.ai](https://dashboard.vapi.ai)
* Get your API key from **API Keys** in the sidebar
* Have an existing assistant or squad to test. You can create one in the Dashboard or through the API.

## Step 1: Create your first evaluation

Define a mock conversation to test your assistant's greeting behavior.

1. Log in to [dashboard.vapi.ai](https://dashboard.vapi.ai)
2. Click on **Evals** in the left sidebar (under Observability)
3. Click **Create Evaluation**

1) **Name**: Enter "Greeting Test"
2) **Description**: Add "Verify assistant greets users appropriately"
3) **Type**: Automatically set to "chat.mockConversation"

1. Click **Add Message**
2. Select **User** message type
3. Enter content: "Hello"
4. Click **Add Message** again
5. Select **Assistant** message type
6. Click **Enable Evaluation** toggle
7. Select **Exact Match** as judge type
8. Enter expected content: "Hello! How can I help you today?"
9. Click **Save Evaluation**

Your evaluation is now saved. You can run it against any assistant or squad.

```bash
curl -X POST "https://api.vapi.ai/eval" \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Greeting Test",
    "description": "Verify assistant greets users appropriately",
    "type": "chat.mockConversation",
    "messages": [
      {
        "role": "user",
        "content": "Hello"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "exact",
          "content": "Hello! How can I help you today?"
        }
      }
    ]
  }'
```

**Response:**

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "orgId": "org-123",
  "type": "chat.mockConversation",
  "name": "Greeting Test",
  "description": "Verify assistant greets users appropriately",
  "messages": [...],
  "createdAt": "2024-01-15T09:30:00Z",
  "updatedAt": "2024-01-15T09:30:00Z"
}
```

Save the returned `id` - you'll need it to run the evaluation.

For complete API details, see [Create Eval](/api-reference/eval/create).

**Message structure:** Each conversation turn has a `role` (user, assistant,
system, or tool). Assistant messages with `judgePlan` define what to validate.
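
When evals are generated from code rather than typed by hand, the same message structure can be assembled with small helpers. The following is a minimal Python sketch, not an official SDK: `user_turn`, `assistant_turn`, and `build_eval` are hypothetical helper names, and the payload only uses fields that appear in the request above.

```python
import json


def user_turn(content):
    """A user message: a role plus the text the mock user says."""
    return {"role": "user", "content": content}


def assistant_turn(judge_type, content):
    """An assistant message whose reply is checked by a judgePlan."""
    return {"role": "assistant", "judgePlan": {"type": judge_type, "content": content}}


def build_eval(name, description, messages):
    """Assemble a request body for POST /eval (fields as in the curl example)."""
    return {
        "name": name,
        "description": description,
        "type": "chat.mockConversation",
        "messages": messages,
    }


payload = build_eval(
    "Greeting Test",
    "Verify assistant greets users appropriately",
    [user_turn("Hello"), assistant_turn("exact", "Hello! How can I help you today?")],
)
print(json.dumps(payload, indent=2))
```

The printed JSON can be passed to `curl -d` or sent with any HTTP client.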

## Step 2: Run your evaluation

Execute the evaluation against your assistant or squad.

1. Navigate to **Evals** in the sidebar
2. Click on "Greeting Test" from your evaluations list

1) In the evaluation detail page, find the **Run Test** section
2) Select **Assistant** or **Squad** as the target type
3) Choose your assistant/squad from the dropdown
4) Click **Run Evaluation**
5) Watch real-time progress as the test executes

Results appear automatically when the test completes:

* ✅ **Green checkmark** indicates evaluation passed
* ❌ **Red X** indicates evaluation failed
* Click **View Details** to see full conversation transcript

**Create an eval run:**

```bash
curl -X POST "https://api.vapi.ai/eval/run" \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "evalId": "550e8400-e29b-41d4-a716-446655440000",
    "target": {
      "type": "assistant",
      "assistantId": "your-assistant-id"
    }
  }'
```

**Response:**

```json
{
  "id": "eval-run-123",
  "evalId": "550e8400-e29b-41d4-a716-446655440000",
  "orgId": "org-123",
  "status": "queued",
  "createdAt": "2024-01-15T09:35:00Z",
  "updatedAt": "2024-01-15T09:35:00Z"
}
```

**Check results:**

```bash
curl -X GET "https://api.vapi.ai/eval/run/eval-run-123" \
  -H "Authorization: Bearer $VAPI_API_KEY"
```

For complete API details, see [Create Eval Run](/api-reference/eval/run) and [Get Eval Run](/api-reference/eval/get-run).

You can also run evaluations with transient assistant or squad configurations
by providing `assistant` or `squad` objects instead of IDs in the target.
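
Because a new run starts with `status` set to "queued", a client usually has to call Get Eval Run repeatedly until the run ends. The sketch below shows one way to write that loop in Python. The function name `wait_for_run`, the polling interval, and the timeout are assumptions for illustration; `fetch_run` stands for whatever code performs the GET request and returns the parsed JSON.

```python
import time


def wait_for_run(fetch_run, run_id, interval=2.0, timeout=120.0, sleep=time.sleep):
    """Poll until the run's status is "ended" or the timeout elapses.

    fetch_run is any callable that returns the parsed JSON of
    GET /eval/run/{id}; it is injected so the loop can be tested offline.
    """
    waited = 0.0
    while True:
        run = fetch_run(run_id)
        if run.get("status") == "ended":
            return run
        if waited >= timeout:
            raise TimeoutError(f"run {run_id} still {run.get('status')!r} after {timeout}s")
        sleep(interval)
        waited += interval


# Offline demonstration: a fake fetcher that reports "ended" on the third poll.
_responses = iter([{"status": "queued"}, {"status": "running"}, {"status": "ended"}])
final = wait_for_run(lambda _id: next(_responses), "eval-run-123", sleep=lambda s: None)
print(final["status"])  # ended
```

Injecting `fetch_run` and `sleep` keeps the loop independent of any particular HTTP library.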

## Step 3: Understand test results

Learn to interpret evaluation results and identify issues.

### Successful evaluation

When all checks pass, you'll see:

```json
{
  "id": "eval-run-123",
  "evalId": "550e8400-e29b-41d4-a716-446655440000",
  "status": "ended",
  "endedReason": "mockConversation.done",
  "results": [
    {
      "status": "pass",
      "messages": [
        {
          "role": "user",
          "content": "Hello"
        },
        {
          "role": "assistant",
          "content": "Hello! How can I help you today?",
          "judge": {
            "status": "pass"
          }
        }
      ]
    }
  ]
}
```

**Pass criteria:**

* `status` is "ended"
* `endedReason` is "mockConversation.done"
* `results[0].status` is "pass"
* All `judge.status` values are "pass"

### Failed evaluation

When validation fails, you'll see details:

```json
{
  "status": "ended",
  "endedReason": "mockConversation.done",
  "results": [
    {
      "status": "fail",
      "messages": [
        {
          "role": "user",
          "content": "Hello"
        },
        {
          "role": "assistant",
          "content": "Hi there! What can I do for you?",
          "judge": {
            "status": "fail",
            "failureReason": "Expected exact match: 'Hello! How can I help you today?' but got: 'Hi there! What can I do for you?'"
          }
        }
      ]
    }
  ]
}
```

**Failure indicators:**

* `results[0].status` is "fail"
* `judge.status` is "fail"
* `judge.failureReason` explains why validation failed

If `endedReason` is not "mockConversation.done", the test encountered an error
(like "assistant-error" or "pipeline-error-openai-llm-failed"). Check your
assistant configuration.
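
The pass criteria and failure indicators above can be folded into a single check. This is a hedged Python sketch that works on the parsed JSON of a finished run; `summarize_run` is a hypothetical helper, and it relies only on the fields shown in the example responses.

```python
def summarize_run(run):
    """Classify a finished eval run as "pass", "fail", or "error".

    Returns (verdict, details), where details lists failure reasons
    or describes why the run could not be judged.
    """
    if run.get("status") != "ended":
        return "error", [f"run not finished: status={run.get('status')!r}"]
    if run.get("endedReason") != "mockConversation.done":
        return "error", [f"endedReason={run.get('endedReason')!r}"]

    results = run.get("results", [])
    if not results:
        return "error", ["empty results array"]

    reasons = []
    for result in results:
        for message in result.get("messages", []):
            judge = message.get("judge")
            if judge and judge.get("status") != "pass":
                reasons.append(judge.get("failureReason", "no failureReason given"))

    if not reasons and all(r.get("status") == "pass" for r in results):
        return "pass", []
    return "fail", reasons


example = {
    "status": "ended",
    "endedReason": "mockConversation.done",
    "results": [{"status": "fail", "messages": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there!",
         "judge": {"status": "fail", "failureReason": "Expected exact match"}},
    ]}],
}
print(summarize_run(example))  # ('fail', ['Expected exact match'])
```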

## Step 4: Test tool/function calls

Validate that your assistant calls functions with correct arguments.

### Basic tool call validation

Test appointment booking with exact argument matching:

1. Create new evaluation: "Appointment Booking Test"
2. Add user message: "Book me an appointment for next Monday at 2pm"
3. Add assistant message with evaluation enabled
4. Select **Exact Match** judge type
5. Click **Add Tool Call**
6. Enter function name: "bookAppointment"
7. Add arguments:
   * `date`: "2025-01-20"
   * `time`: "14:00"
8. Add tool response message:
   * Type: **Tool**
   * Content: `{"status": "success", "confirmationId": "APT-12345"}`
9. Add final assistant message to verify confirmation
10. Save evaluation

```bash
curl -X POST "https://api.vapi.ai/eval" \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Appointment Booking Test",
    "type": "chat.mockConversation",
    "messages": [
      {
        "role": "user",
        "content": "Book me an appointment for next Monday at 2pm"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "exact",
          "toolCalls": [{
            "name": "bookAppointment",
            "arguments": {
              "date": "2025-01-20",
              "time": "14:00"
            }
          }]
        }
      },
      {
        "role": "tool",
        "content": "{\"status\": \"success\", \"confirmationId\": \"APT-12345\"}"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "regex",
          "content": ".*confirmed.*APT-12345.*"
        }
      }
    ]
  }'
```

For API details, see [Create Eval](/api-reference/eval/create).

### Tool call validation modes

**Exact match - Full validation:**

```json
{
  "judgePlan": {
    "type": "exact",
    "toolCalls": [
      {
        "name": "bookAppointment",
        "arguments": {
          "date": "2025-01-20",
          "time": "14:00"
        }
      }
    ]
  }
}
```

Validates both function name AND all argument values exactly.

**Partial match - Name only:**

```json
{
  "judgePlan": {
    "type": "regex",
    "toolCalls": [
      {
        "name": "bookAppointment"
      }
    ]
  }
}
```

Validates only that the function was called (arguments can vary).

**Multiple tool calls:**

```json
{
  "judgePlan": {
    "type": "exact",
    "toolCalls": [
      {
        "name": "checkAvailability",
        "arguments": { "date": "2025-01-20" }
      },
      {
        "name": "bookAppointment",
        "arguments": { "date": "2025-01-20", "time": "14:00" }
      }
    ]
  }
}
```

Validates multiple function calls in sequence.

Tool calls are validated in the order they're defined. Use `type: "exact"` for
strict validation or `type: "regex"` for flexible validation.
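
To make the difference between full and name-only validation concrete, here is a small local model of the comparison in Python. It is an approximation of the behavior described above, not Vapi's implementation: `tool_calls_match` is a hypothetical function, and the rule that the number of calls must be equal is an assumption.

```python
def tool_calls_match(expected, actual):
    """Compare expected tool calls with the calls the assistant made, in order.

    A name must always match. Arguments are compared only when the
    expectation lists them, mirroring the name-only example above.
    """
    if len(expected) != len(actual):
        return False
    for want, got in zip(expected, actual):
        if want["name"] != got.get("name"):
            return False
        if "arguments" in want and want["arguments"] != got.get("arguments"):
            return False
    return True


made = [{"name": "bookAppointment", "arguments": {"date": "2025-01-20", "time": "14:00"}}]

full = [{"name": "bookAppointment", "arguments": {"date": "2025-01-20", "time": "14:00"}}]
name_only = [{"name": "bookAppointment"}]
wrong_type = [{"name": "bookAppointment", "arguments": {"date": "2025-01-20", "time": 14}}]

print(tool_calls_match(full, made))        # True
print(tool_calls_match(name_only, made))   # True
print(tool_calls_match(wrong_type, made))  # False: the string "14:00" is not the number 14
```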

## Step 5: Use regex for flexible validation

When responses vary slightly (like names, dates, or IDs), use regex patterns for flexible matching.

### Common regex patterns

**Greeting variations:**

```json
{
  "judgePlan": {
    "type": "regex",
    "content": "^(Hello|Hi|Hey),? (I can|I'll|let me) help.*"
  }
}
```

Matches: "Hello, I can help...", "Hi I'll help...", "Hey let me help..."

**Responses with variables:**

```json
{
  "judgePlan": {
    "type": "regex",
    "content": ".*appointment.*confirmed.*[A-Z]{3}-[0-9]{5}.*"
  }
}
```

Matches any confirmation message with appointment ID format.

**Date patterns:**

```json
{
  "judgePlan": {
    "type": "regex",
    "content": ".*scheduled for (Monday|Tuesday|Wednesday|Thursday|Friday).*"
  }
}
```

Matches responses that contain "scheduled for" followed by a weekday (Monday through Friday).

**Case-insensitive matching:**

```json
{
  "judgePlan": {
    "type": "regex",
    "content": "(?i)booking confirmed"
  }
}
```

The `(?i)` flag makes matching case-insensitive.

### Example: Flexible booking confirmation

1. Add assistant message with evaluation enabled
2. Select **Regex** as judge type
3. Enter pattern: `.*appointment.*(confirmed|booked).*\d{1,2}:\d{2}.*`
4. This matches various confirmation phrasings with time mentions

```bash
curl -X POST "https://api.vapi.ai/eval" \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Flexible Booking Test",
    "type": "chat.mockConversation",
    "messages": [
      {
        "role": "user",
        "content": "I need to schedule an appointment"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "regex",
          "content": ".*(schedule|book|set up).*appointment.*"
        }
      }
    ]
  }'
```

**Regex tips:**

* Use `.*` to match any characters
* Use `(option1|option2)` for alternatives
* Use `\d` for digits, `\s` for whitespace
* Use `.*?` for non-greedy matching
* In a JSON request body, double each backslash (`\\d`, `\\s`)
* Test your patterns with sample responses first
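
One way to test a pattern before saving it is to run it against a handful of sample responses locally. The sketch below uses Python's `re` module as a stand-in; the regex engine used by Evals may differ in flavor, and the use of search semantics (match anywhere in the response) is an assumption. `check_pattern` and the sample sentences are made up for illustration.

```python
import json
import re

PATTERN = r".*appointment.*(confirmed|booked).*\d{1,2}:\d{2}.*"

# Sample assistant responses mapped to whether the pattern should match.
samples = {
    "Your appointment is confirmed for 2:00 tomorrow.": True,
    "Great, the appointment has been booked at 14:00.": True,
    "Your appointment is confirmed.": False,  # no time mentioned
    "I can book that for 14:00.": False,      # the word "appointment" is missing
}


def check_pattern(pattern, expectations):
    """Return the samples whose match result differs from what was expected."""
    compiled = re.compile(pattern)
    return [
        text
        for text, should_match in expectations.items()
        if bool(compiled.search(text)) != should_match
    ]


mismatches = check_pattern(PATTERN, samples)
print(mismatches)  # []

# Serializing to JSON doubles each backslash, so \d appears as \\d in the body.
print(json.dumps({"type": "regex", "content": PATTERN}))
```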

## Step 6: Use AI judge for semantic validation

For complex validation criteria beyond pattern matching, use AI-powered judges to evaluate responses semantically.

### AI judge structure

```json
{
  "role": "assistant",
  "judgePlan": {
    "type": "ai",
    "model": {
      "provider": "openai",
      "model": "gpt-4o",
      "messages": [
        {
          "role": "system",
          "content": "Your evaluation prompt here"
        }
      ]
    }
  }
}
```

### Writing effective judge prompts

**Template structure:**

```
You are an LLM-Judge. Evaluate ONLY the last assistant message in the mock conversation: {{messages[-1]}}.

Include the full conversation history for context: {{messages}}

Decision rule:
- PASS if ALL "pass criteria" are satisfied AND NONE of the "fail criteria" are triggered.
- Otherwise FAIL.

Pass criteria:
- [Specific requirement 1]
- [Specific requirement 2]

Fail criteria (any one triggers FAIL):
- [Specific failure condition 1]
- [Specific failure condition 2]

Output format: respond with exactly one word: pass or fail
- No explanations
- No punctuation
- No additional text
```

**Template variables:**

* `{{messages}}` - The entire conversation history (all messages exchanged)
* `{{messages[-1]}}` - The last assistant message only
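
If several evals share the same judge template, the prompt can be generated from lists of criteria. The following Python sketch is one possible approach; `build_judge_prompt` and `ai_judge_plan` are hypothetical helpers. The `{{messages}}` placeholders are kept as literal text, on the assumption that they are substituted when the eval runs rather than by this code.

```python
def build_judge_prompt(pass_criteria, fail_criteria):
    """Fill the judge template with concrete pass and fail criteria."""
    lines = [
        "You are an LLM-Judge. Evaluate ONLY the last assistant message "
        "in the mock conversation: {{messages[-1]}}.",
        "",
        "Include the full conversation history for context: {{messages}}",
        "",
        "Decision rule:",
        '- PASS if ALL "pass criteria" are satisfied AND NONE of the "fail criteria" are triggered.',
        "- Otherwise FAIL.",
        "",
        "Pass criteria:",
        *[f"- {criterion}" for criterion in pass_criteria],
        "",
        "Fail criteria (any one triggers FAIL):",
        *[f"- {criterion}" for criterion in fail_criteria],
        "",
        "Output format: respond with exactly one word: pass or fail",
    ]
    return "\n".join(lines)


def ai_judge_plan(prompt, provider="openai", model="gpt-4o"):
    """Wrap a prompt in the judgePlan structure shown under "AI judge structure"."""
    return {
        "type": "ai",
        "model": {
            "provider": provider,
            "model": model,
            "messages": [{"role": "system", "content": prompt}],
        },
    }


plan = ai_judge_plan(build_judge_prompt(
    ["Response acknowledges the user request", "Tone is professional and friendly"],
    ["Response is rude or dismissive"],
))
print(plan["model"]["messages"][0]["content"])
```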

### Example: Evaluate helpfulness and tone

1. Add assistant message with evaluation enabled
2. Select **AI Judge** as judge type
3. Choose provider: **OpenAI**
4. Select model: **gpt-4o**
5. Enter evaluation prompt (see template above)
6. Customize pass/fail criteria for your use case

```bash
curl -X POST "https://api.vapi.ai/eval" \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Helpfulness Test",
    "type": "chat.mockConversation",
    "messages": [
      {
        "role": "user",
        "content": "I need help with my account"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "ai",
          "model": {
            "provider": "openai",
            "model": "gpt-4o",
            "messages": [{
              "role": "system",
              "content": "You are an LLM-Judge. Evaluate ONLY the last assistant message: {{messages[-1]}}.\n\nInclude context: {{messages}}\n\nDecision rule:\n- PASS if ALL pass criteria are met AND NO fail criteria are triggered.\n- Otherwise FAIL.\n\nPass criteria:\n- Response acknowledges the user request\n- Response offers specific help or next steps\n- Tone is professional and friendly\n\nFail criteria (any triggers FAIL):\n- Response is rude or dismissive\n- Response ignores the user request\n- Response provides no actionable information\n\nOutput format: respond with exactly one word: pass or fail"
            }]
          }
        }
      }
    ]
  }'
```

### Supported AI judge providers

* **OpenAI** - Models: gpt-4o, gpt-4-turbo, gpt-3.5-turbo. Best for general-purpose evaluation.
* **Anthropic** - Models: claude-3-5-sonnet-20241022, claude-3-opus-20240229. Best for nuanced evaluation.
* **Google** - Models: gemini-1.5-pro, gemini-1.5-flash. Best for multilingual content.
* **Groq** - Models: llama-3.1-70b-versatile, mixtral-8x7b-32768. Best for fast evaluation.

**Custom LLM:**

```json
{
  "model": {
    "provider": "custom-llm",
    "model": "your-model-name",
    "url": "https://your-api-endpoint.com/chat/completions",
    "messages": [...]
  }
}
```

### AI judge best practices

**Tips for reliable AI judging:**

* Be specific with pass/fail criteria (avoid ambiguous requirements)
* Use "ALL pass criteria must be met" logic
* Use "ANY fail criteria triggers fail" logic
* Include conversation context with `{{messages}}` syntax
* Request exact "pass" or "fail" output (no explanations)
* Test criteria with known good/bad responses before production
* Use consistent evaluation standards across similar tests

## Step 7: Control flow with Continue Plan

Define what happens after an evaluation passes or fails using `continuePlan`.

### Exit on failure

Stop the test immediately if a critical check fails:

```json
{
  "role": "assistant",
  "judgePlan": {
    "type": "exact",
    "content": "I can help you with that."
  },
  "continuePlan": {
    "exitOnFailureEnabled": true
  }
}
```

**Use case:** Skip expensive subsequent tests when initial validation fails.

### Override responses on failure

Provide fallback responses to continue testing even when validation fails:

```json
{
  "role": "assistant",
  "judgePlan": {
    "type": "exact",
    "content": "I've processed your request."
  },
  "continuePlan": {
    "exitOnFailureEnabled": false,
    "contentOverride": "Let me rephrase that...",
    "toolCallsOverride": [
      {
        "name": "retryProcessing",
        "arguments": { "retry": "true" }
      }
    ]
  }
}
```

**Use case:** Test error recovery paths or force specific tool calls for subsequent validation.

### Example: Multi-step with exit control

1. Create evaluation with multiple conversation turns
2. For each assistant message with critical validation:
   * Enable evaluation
   * Configure judge plan (exact, regex, or AI)
   * Toggle **Exit on Failure** to stop test early
3. For non-critical checks, leave **Exit on Failure** disabled

```bash
curl -X POST "https://api.vapi.ai/eval" \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Multi-Step with Control",
    "type": "chat.mockConversation",
    "messages": [
      {
        "role": "user",
        "content": "I want to book an appointment"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "exact",
          "content": "I can help you book an appointment."
        },
        "continuePlan": {
          "exitOnFailureEnabled": true
        }
      },
      {
        "role": "user",
        "content": "Monday at 2pm"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "exact",
          "toolCalls": [{"name": "bookAppointment"}]
        },
        "continuePlan": {
          "exitOnFailureEnabled": false,
          "contentOverride": "Booking confirmed for Monday at 2pm.",
          "toolCallsOverride": [{
            "name": "bookAppointment",
            "arguments": {"date": "2025-01-20", "time": "14:00"}
          }]
        }
      }
    ]
  }'
```

If `exitOnFailureEnabled` is `true` and validation fails, the test stops
immediately. Subsequent conversation turns are not executed. Use this for
critical checkpoints.
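
The effect of `exitOnFailureEnabled` on which turns run can be modeled with a few lines of Python. This is a toy simulation of the control flow described above, not the actual runner: `executed_turns` is a hypothetical function, and treating a missing `continuePlan` as "do not exit" is an assumption.

```python
def executed_turns(messages, judge_passed):
    """Return the indexes of the messages that run before the test stops.

    judge_passed holds one boolean per judged assistant message, in order.
    """
    outcomes = iter(judge_passed)
    executed = []
    for index, message in enumerate(messages):
        executed.append(index)
        if message.get("role") == "assistant" and "judgePlan" in message:
            passed = next(outcomes)
            exit_on_failure = message.get("continuePlan", {}).get("exitOnFailureEnabled", False)
            if not passed and exit_on_failure:
                break  # critical checkpoint failed: later turns are skipped
    return executed


conversation = [
    {"role": "user", "content": "I want to book an appointment"},
    {"role": "assistant", "judgePlan": {"type": "exact"},
     "continuePlan": {"exitOnFailureEnabled": True}},
    {"role": "user", "content": "Monday at 2pm"},
    {"role": "assistant", "judgePlan": {"type": "exact"},
     "continuePlan": {"exitOnFailureEnabled": False}},
]

print(executed_turns(conversation, [False, True]))  # [0, 1]
print(executed_turns(conversation, [True, False]))  # [0, 1, 2, 3]
```

In the first call the critical first check fails, so the second user turn never runs. In the second call the failure occurs on a turn that does not exit, so the conversation runs to the end.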

## Step 8: Test complete conversation flows

Validate multi-turn interactions that simulate real user conversations.

### Complete booking flow example

Create a comprehensive test:

1. **Turn 1 - Initial request:**
   * User: "I need to schedule an appointment"
   * Assistant evaluation: AI judge checking acknowledgment
2. **Turn 2 - Provide details:**
   * User: "Next Monday at 2pm"
   * Assistant evaluation: Exact match on tool call `bookAppointment`
3. **Turn 3 - Tool response:**
   * Tool: `{"status": "success", "confirmationId": "APT-12345"}`
4. **Turn 4 - Confirmation:**
   * Assistant evaluation: Regex matching confirmation with ID
5. **Turn 5 - Follow-up:**
   * User: "Can I get that via email?"
   * Assistant evaluation: Exact match on tool call `sendEmail`

```bash
curl -X POST "https://api.vapi.ai/eval" \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Complete Booking Flow",
    "description": "Test full appointment booking conversation",
    "type": "chat.mockConversation",
    "messages": [
      {
        "role": "user",
        "content": "I need to schedule an appointment"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "ai",
          "model": {
            "provider": "openai",
            "model": "gpt-4o",
            "messages": [{
              "role": "system",
              "content": "Evaluate: {{messages[-1]}}\n\nPASS if:\n- Response acknowledges appointment request\n- Response asks for details or preferences\n\nFAIL if:\n- Response is dismissive\n- Response ignores request\n\nOutput: pass or fail"
            }]
          }
        }
      },
      {
        "role": "user",
        "content": "Next Monday at 2pm"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "exact",
          "toolCalls": [{
            "name": "bookAppointment",
            "arguments": {
              "date": "2025-01-20",
              "time": "14:00"
            }
          }]
        }
      },
      {
        "role": "tool",
        "content": "{\"status\": \"success\", \"confirmationId\": \"APT-12345\"}"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "regex",
          "content": ".*confirmed.*APT-12345.*"
        }
      },
      {
        "role": "user",
        "content": "Can I get that via email?"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "exact",
          "toolCalls": [{
            "name": "sendEmail"
          }]
        }
      }
    ]
  }'
```

For API details, see [Create Eval](/api-reference/eval/create).

### System message injection

Inject system prompts mid-conversation to test dynamic behavior changes:

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Hello"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "regex",
        "content": ".*help.*"
      }
    },
    {
      "role": "system",
      "content": "You are now in urgent mode. Prioritize speed."
    },
    {
      "role": "user",
      "content": "I need immediate help"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            {
              "role": "system",
              "content": "PASS if response shows urgency. FAIL if response is casual. Output: pass or fail"
            }
          ]
        }
      }
    }
  ]
}
```

**Multi-turn testing tips:**

* Keep conversations focused (5-10 turns for most tests)
* Use exit-on-failure for early turns to save time
* Test one primary flow per evaluation
* Mix judge types (exact, regex, AI) for comprehensive validation
* Include tool responses to simulate real interactions
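
Longer multi-turn definitions are easy to get subtly wrong, so a local pre-flight check before calling Create Eval can help. The Python sketch below is an illustration only: `lint_messages` is a hypothetical helper, its rules are inferred from the examples in this guide rather than from the API schema, and it counts messages as a rough proxy for turns.

```python
ALLOWED_ROLES = {"user", "assistant", "system", "tool"}
JUDGE_TYPES = {"exact", "regex", "ai"}


def lint_messages(messages, max_messages=10):
    """Return a list of human-readable problems found in an eval's messages."""
    problems = []
    if len(messages) > max_messages:
        problems.append(f"{len(messages)} messages; consider splitting the eval")
    for i, message in enumerate(messages):
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
        plan = message.get("judgePlan")
        if plan is not None:
            if role != "assistant":
                problems.append(f"message {i}: judgePlan on a {role} message")
            if plan.get("type") not in JUDGE_TYPES:
                problems.append(f"message {i}: unknown judge type {plan.get('type')!r}")
        elif role in {"user", "system", "tool"} and not message.get("content"):
            problems.append(f"message {i}: {role} message has no content")
    return problems


flow = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "judgePlan": {"type": "regex", "content": ".*help.*"}},
    {"role": "system", "content": "You are now in urgent mode. Prioritize speed."},
    {"role": "user", "content": "I need immediate help"},
    {"role": "assistant", "judgePlan": {"type": "fuzzy"}},  # deliberately invalid
]
print(lint_messages(flow))
```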

## Step 9: Manage evaluations

List, update, and organize your evaluation suite.

### List all evaluations

1. Navigate to **Evals** in the sidebar
2. View all evaluations in a table with:
   * Name and description
   * Created date
   * Last run status
   * Actions (Edit, Run, Delete)
3. Use search to filter by name
4. Sort by date or status

```bash
curl -X GET "https://api.vapi.ai/eval" \
  -H "Authorization: Bearer $VAPI_API_KEY"
```

**Response:**

```json
{
  "results": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "name": "Greeting Test",
      "description": "Verify assistant greets users appropriately",
      "type": "chat.mockConversation",
      "createdAt": "2024-01-15T09:30:00Z",
      "updatedAt": "2024-01-15T09:30:00Z"
    }
  ],
  "page": 1,
  "total": 1
}
```

For API details, see [List Evals](/api-reference/eval/list).

### Update an evaluation

1. Navigate to **Evals** and click on an evaluation
2. Click **Edit** button
3. Modify conversation turns, judge plans, or settings
4. Click **Save Changes**
5. Previous test runs remain unchanged

```bash
curl -X PATCH "https://api.vapi.ai/eval/550e8400-e29b-41d4-a716-446655440000" \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Updated Greeting Test",
    "description": "Enhanced greeting validation",
    "messages": [
      {
        "role": "user",
        "content": "Hi there"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "regex",
          "content": "^(Hello|Hi|Hey).*"
        }
      }
    ]
  }'
```

For API details, see [Update Eval](/api-reference/eval/update).

### Delete an evaluation

1. Navigate to **Evals**
2. Click on an evaluation
3. Click **Delete** button
4. Confirm deletion

Deleting an evaluation does NOT delete its run history. Past run results remain accessible.

```bash
curl -X DELETE "https://api.vapi.ai/eval/550e8400-e29b-41d4-a716-446655440000" \
  -H "Authorization: Bearer $VAPI_API_KEY"
```

For API details, see [Delete Eval](/api-reference/eval/delete).

### View run history

1. Navigate to **Evals**
2. Click on an evaluation
3. View **Runs** tab showing:
   * Run timestamp
   * Target (assistant/squad)
   * Status (pass/fail)
   * Duration
4. Click any run to view detailed results

**List all runs:**

```bash
curl -X GET "https://api.vapi.ai/eval/run" \
  -H "Authorization: Bearer $VAPI_API_KEY"
```

**Filter by eval ID:**

```bash
curl -X GET "https://api.vapi.ai/eval/run?evalId=550e8400-e29b-41d4-a716-446655440000" \
  -H "Authorization: Bearer $VAPI_API_KEY"
```

For API details, see [List Eval Runs](/api-reference/eval/list-runs).
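
The same list request can be prepared from a script. This Python sketch builds the request with the standard library but does not send it; `list_runs_request` is a hypothetical helper, and the `evalId` query parameter is taken from the curl example above.

```python
import os
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://api.vapi.ai"


def list_runs_request(api_key, eval_id=None):
    """Build (but do not send) the GET /eval/run request, optionally filtered by evalId."""
    url = f"{BASE_URL}/eval/run"
    if eval_id:
        url += "?" + urlencode({"evalId": eval_id})
    return Request(url, headers={"Authorization": f"Bearer {api_key}"}, method="GET")


request = list_runs_request(
    os.environ.get("VAPI_API_KEY", "test-key"),
    eval_id="550e8400-e29b-41d4-a716-446655440000",
)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the call.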

## Expected output

### Successful run

```json
{
  "id": "eval-run-123",
  "evalId": "550e8400-e29b-41d4-a716-446655440000",
  "orgId": "org-123",
  "status": "ended",
  "endedReason": "mockConversation.done",
  "createdAt": "2024-01-15T09:35:00Z",
  "updatedAt": "2024-01-15T09:35:45Z",
  "results": [
    {
      "status": "pass",
      "messages": [
        {
          "role": "user",
          "content": "Hello"
        },
        {
          "role": "assistant",
          "content": "Hello! How can I help you today?",
          "judge": {
            "status": "pass"
          }
        }
      ]
    }
  ],
  "target": {
    "type": "assistant",
    "assistantId": "your-assistant-id"
  }
}
```

**Indicators of success:**

* ✅ `status` is "ended"
* ✅ `endedReason` is "mockConversation.done"
* ✅ `results[0].status` is "pass"
* ✅ All `judge.status` values are "pass"

### Failed run

```json
{
  "id": "eval-run-124",
  "status": "ended",
  "endedReason": "mockConversation.done",
  "results": [
    {
      "status": "fail",
      "messages": [
        {
          "role": "user",
          "content": "Book an appointment for Monday at 2pm"
        },
        {
          "role": "assistant",
          "content": "Sure, let me help you with that.",
          "toolCalls": [
            {
              "name": "bookAppointment",
              "arguments": {
                "date": "2025-01-20",
                "time": "2:00 PM"
              }
            }
          ],
          "judge": {
            "status": "fail",
            "failureReason": "Tool call arguments mismatch. Expected time: '14:00' but got: '2:00 PM'"
          }
        }
      ]
    }
  ]
}
```

**Indicators of failure:**

* ❌ `results[0].status` is "fail"
* ❌ `judge.status` is "fail"
* ❌ `judge.failureReason` provides specific details

Full conversation transcripts show both expected and actual values, making
debugging straightforward.

## Common patterns

### Multiple validation types in one eval

Combine exact, regex, and AI judges for comprehensive testing:

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Hello"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "content": "Hello! How can I help you?"
      }
    },
    {
      "role": "user",
      "content": "Book appointment for Monday"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "regex",
        "content": ".*(Monday|next week).*"
      }
    },
    {
      "role": "user",
      "content": "Thanks for your help"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            {
              "role": "system",
              "content": "PASS if response is polite and acknowledges thanks. Output: pass or fail"
            }
          ]
        }
      }
    }
  ]
}
```

### Test squad handoffs

Validate smooth transitions between squad members:

```json
{
  "name": "Squad Handoff Test",
  "messages": [
    {
      "role": "user",
      "content": "I need technical support"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [
          {
            "name": "transferToSquadMember",
            "arguments": {
              "destination": "technical-support-agent"
            }
          }
        ]
      }
    }
  ],
  "target": {
    "type": "squad",
    "squadId": "your-squad-id"
  }
}
```

### Regression test suite

Organize related tests for systematic validation:

```json
{
  "name": "Greeting Regression Suite",
  "tests": [
    "Greeting Test - Formal",
    "Greeting Test - Casual",
    "Greeting Test - Multilingual"
  ]
}
```

Run multiple evals sequentially to validate all greeting scenarios.
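
Running a group of evals in sequence and turning the outcome into an exit code is one way to use them as a CI quality gate. The Python sketch below shows the aggregation step only. `run_suite` and `exit_code` are hypothetical names, and `run_eval` stands for whatever code creates a run for an eval ID and returns the finished run as a dict.

```python
def run_suite(eval_ids, run_eval):
    """Run each eval in order and record whether it passed."""
    verdicts = {}
    for eval_id in eval_ids:
        run = run_eval(eval_id)
        results = run.get("results", [])
        verdicts[eval_id] = (
            run.get("endedReason") == "mockConversation.done"
            and bool(results)
            and all(result.get("status") == "pass" for result in results)
        )
    return verdicts


def exit_code(verdicts):
    """0 when every eval passed, 1 otherwise."""
    return 0 if verdicts and all(verdicts.values()) else 1


# Offline demonstration with canned runs instead of real API calls.
canned = {
    "greeting-formal": {"endedReason": "mockConversation.done", "results": [{"status": "pass"}]},
    "greeting-casual": {"endedReason": "mockConversation.done", "results": [{"status": "fail"}]},
}
verdicts = run_suite(list(canned), canned.get)
print(verdicts)             # {'greeting-formal': True, 'greeting-casual': False}
print(exit_code(verdicts))  # 1
```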

## Troubleshooting

| Issue                   | Solution                                                                                |
| ----------------------- | --------------------------------------------------------------------------------------- |
| Eval always fails       | Verify exact match strings character-by-character. Consider using regex for flexibility |
| AI judge inconsistent   | Make pass/fail criteria more specific and binary. Test with known examples              |
| Tool calls not matching | Check argument types (string vs number). Ensure exact spelling of function names        |
| Run stuck in "running"  | Verify assistant configuration. Check for errors in assistant's tools or prompts        |
| Timeout errors          | Reduce conversation length or simplify evaluations. Check assistant response times      |
| Regex not matching      | Test regex patterns separately. Remember to escape special characters like `.` or `?`   |
| Empty results array     | Check `endedReason` field. Assistant may have encountered an error before completion    |
| Missing judge results   | Verify `judgePlan` is properly configured in assistant messages                         |

### Common errors

**"mockConversation.done" not reached:**

* Check `endedReason` for actual error (e.g., "assistant-error", "pipeline-error-openai-llm-failed")
* Verify assistant configuration (model, voice, tools)
* Check API key validity and rate limits

**Judge validation fails unexpectedly:**

* Review actual vs expected output in `failureReason`
* For exact match: Check for extra spaces, punctuation, or case differences
* For regex: Test pattern with online regex validators
* For AI judge: Verify prompt clarity and binary pass/fail logic
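
For exact-match failures, a character-level comparison often reveals a stray space or punctuation mark faster than reading the two strings. The following is a small Python sketch with a hypothetical helper, `first_difference`, that reports where two strings diverge.

```python
def first_difference(expected, actual):
    """Return the index of the first differing character, or None if the strings are equal."""
    for index, (want, got) in enumerate(zip(expected, actual)):
        if want != got:
            return index
    if len(expected) != len(actual):
        # One string is a prefix of the other: they diverge where the shorter one ends.
        return min(len(expected), len(actual))
    return None


expected = "Hello! How can I help you today?"
actual = "Hello! How can I help you today? "  # trailing space

position = first_difference(expected, actual)
print(position, repr(actual[position:]))
```

Printing with `repr` makes invisible characters such as trailing spaces visible.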

**Tool calls not validated:**

* Ensure tool is properly configured in assistant
* Check argument types match exactly (string "14:00" vs number 14)
* Verify tool function names are spelled correctly

If you see `endedReason: "assistant-error"`, your assistant configuration has
issues. Test the assistant manually first before running evals.

## Next steps

* Learn testing patterns, best practices, and CI/CD integration
* Create and configure assistants to test
* Build custom tools and validate their behavior
* Complete API documentation for evals

## Tips for success

**Best practices for reliable testing:**

* Start simple with exact matches, then add complexity
* One behavior per evaluation turn keeps tests focused
* Use descriptive names that explain what's being tested
* Test both happy paths and edge cases
* Version control your evals alongside assistant configs
* Run critical tests first to fail fast
* Review failure reasons promptly and iterate
* Document why each test exists (use descriptions)

## Get help

Need assistance? We're here to help:

* [Eval API Reference](/api-reference/eval/eval-controller-create)
* [Discord Community](https://discord.gg/pUFNcf2WmH)
* [Support](mailto:support@vapi.ai)