Advanced eval testing

Master testing strategies and best practices for production AI agents

Overview

This guide covers advanced evaluation strategies, testing patterns, and best practices for building robust test suites that ensure your AI agents work reliably in production.

You’ll learn:

  • Strategic testing approaches (smoke, regression, edge case)
  • Testing patterns for different scenarios
  • Performance optimization techniques
  • Maintenance and CI/CD integration strategies
  • Advanced troubleshooting methods

Testing strategies

Smoke tests

Quick validation that core functionality works. Run these first to catch obvious issues.

Purpose: Verify assistant responds and basic conversation flow works.

{
  "name": "Smoke Test - Basic Response",
  "description": "Verify assistant responds to simple greeting",
  "type": "chat.mockConversation",
  "messages": [
    {
      "role": "user",
      "content": "Hello"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "regex",
        "content": ".+"
      }
    }
  ]
}

Characteristics:

  • Minimal validation (just check for any response)
  • Fast execution (1-2 turns)
  • Run before detailed tests
  • Exit early if smoke tests fail

When to use:

  • Before running expensive test suites
  • After deploying configuration changes
  • As health checks in monitoring
  • Quick validation during development
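
If you trigger evals over the API (as in the CI/CD section later in this guide), a smoke run doubles as a health check. A minimal sketch, assuming VAPI_API_KEY, SMOKE_TEST_ID, and ASSISTANT_ID are set for your own project and that the run response exposes results[0].status as shown later:

# Run the smoke eval and bail out before the expensive suites if it fails
result=$(curl -s -X POST "https://api.vapi.ai/eval/run" \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -d "{\"evalId\": \"$SMOKE_TEST_ID\", \"target\": {\"type\": \"assistant\", \"assistantId\": \"$ASSISTANT_ID\"}}")

if [ "$(echo "$result" | jq -r '.results[0].status')" != "pass" ]; then
  echo "Smoke test failed - skipping the full suite"
  exit 1
fi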

Regression tests

Ensure fixes and updates don’t break existing functionality.

Purpose: Validate that known issues stay fixed and features keep working.

  1. Create an evaluation named with a “Regression: ” prefix
  2. Include issue ticket number in description
  3. Add exact scenario that previously failed
  4. Validate the fix still works

Example:

  • Name: “Regression: Date Parsing Bug #1234”
  • Description: “Verify dates like ‘3/15’ parse correctly after bug fix”
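
A complete regression eval for this example might look like the sketch below; the user message, tool call, and resolved date are illustrative and follow the mock-conversation schema used throughout this guide:

{
  "name": "Regression: Date Parsing Bug #1234",
  "description": "Verify dates like '3/15' parse correctly after bug fix (ticket #1234)",
  "type": "chat.mockConversation",
  "messages": [
    { "role": "user", "content": "Book me for 3/15 at 10am" },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [
          {
            "name": "bookAppointment",
            "arguments": { "date": "2025-03-15", "time": "10:00" }
          }
        ]
      }
    }
  ]
}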

Best practices:

  • Name tests after bugs they prevent
  • Include ticket/issue numbers in descriptions
  • Add regression tests when fixing bugs
  • Run full regression suite before major releases
  • Archive tests only when features are removed

Edge case testing

Test boundary conditions and unusual inputs.

Common edge cases to test:

Empty input:

{
  "messages": [
    {"role": "user", "content": ""},
    {"role": "assistant", "judgePlan": {
      "type": "ai",
      "model": {
        "provider": "openai",
        "model": "gpt-4o",
        "messages": [{
          "role": "system",
          "content": "PASS if response asks for clarification politely. Output: pass or fail"
        }]
      }
    }}
  ]
}

Very long input:

{
  "messages": [
    {
      "role": "user",
      "content": "I need help with... (repeat 1000 times)"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "regex",
        "content": ".+"
      },
      "continuePlan": {
        "exitOnFailureEnabled": true
      }
    }
  ]
}

Special characters and non-Latin names:

{
  "messages": [
    {
      "role": "user",
      "content": "My name is François José 王明"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [{
            "role": "system",
            "content": "PASS if response correctly acknowledges the name with special characters. Output: pass or fail"
          }]
        }
      }
    }
  ]
}

Gibberish input:

{
  "messages": [
    {
      "role": "user",
      "content": "asdfghjkl"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [{
            "role": "system",
            "content": "PASS if response asks for clarification without being rude. Output: pass or fail"
          }]
        }
      }
    }
  ]
}

Mid-conversation topic change:

{
  "messages": [
    {"role": "user", "content": "Book appointment"},
    {"role": "assistant", "judgePlan": {"type": "regex", "content": ".*appointment.*"}},
    {"role": "user", "content": "Actually, cancel that. I need tech support."},
    {"role": "assistant", "judgePlan": {
      "type": "ai",
      "model": {
        "provider": "openai",
        "model": "gpt-4o",
        "messages": [{
          "role": "system",
          "content": "PASS if response pivots to tech support without confusion. Output: pass or fail"
        }]
      }
    }}
  ]
}

Edge case categories:

  • Input boundaries: Empty, maximum length, special characters
  • Data formats: Invalid dates, malformed phone numbers, unusual names
  • Conversation patterns: Interruptions, topic changes, contradictions
  • Timing: Very fast responses, long pauses, timeout scenarios

Testing patterns

Happy path testing

Validate ideal user journeys where everything works correctly.

Structure:

  1. User provides clear, complete information
  2. Assistant responds appropriately
  3. Tools execute successfully
  4. Conversation completes with desired outcome

Example: Perfect booking flow

{
  "name": "Happy Path - Complete Booking",
  "description": "User provides all info clearly, booking succeeds",
  "type": "chat.mockConversation",
  "messages": [
    {
      "role": "user",
      "content": "I'd like to book an appointment"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            {
              "role": "system",
              "content": "PASS if response asks for date/time preferences. Output: pass or fail"
            }
          ]
        }
      }
    },
    {
      "role": "user",
      "content": "Next Monday at 2pm please"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [
          {
            "name": "bookAppointment",
            "arguments": {
              "date": "2025-01-20",
              "time": "14:00"
            }
          }
        ]
      }
    },
    {
      "role": "tool",
      "content": "{\"status\": \"success\", \"confirmationId\": \"APT-12345\"}"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "regex",
        "content": ".*(confirmed|booked).*APT-12345.*"
      }
    }
  ]
}

Happy path coverage:

  • Test primary user goals
  • Verify expected tool executions
  • Validate success messages
  • Confirm data accuracy

Error handling testing

Test how your assistant handles failures gracefully.

Tool failure scenarios:

{
  "name": "Error Handling - Booking Unavailable",
  "messages": [
    {
      "role": "user",
      "content": "Book Monday at 2pm"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [{ "name": "bookAppointment" }]
      }
    },
    {
      "role": "tool",
      "content": "{\"status\": \"error\", \"message\": \"Time slot unavailable\"}"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            {
              "role": "system",
              "content": "Evaluate: {{messages[-1]}}\n\nPASS if:\n- Response acknowledges the time is unavailable\n- Response offers alternatives or asks for different time\n- Tone remains helpful (not apologetic to excess)\n\nFAIL if:\n- Response ignores the error\n- Response doesn't offer next steps\n- Tone is frustrated or rude\n\nOutput: pass or fail"
            }
          ]
        }
      }
    }
  ]
}

Invalid input handling:

{
  "name": "Error Handling - Invalid Date Format",
  "messages": [
    {
      "role": "user",
      "content": "Book me for the 45th of Octember"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            {
              "role": "system",
              "content": "PASS if response politely asks for valid date without mocking user. Output: pass or fail"
            }
          ]
        }
      }
    }
  ]
}

API timeout simulation:

{
  "name": "Error Handling - Tool Timeout",
  "messages": [
    {
      "role": "user",
      "content": "Check my order status"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [{ "name": "checkOrderStatus" }]
      }
    },
    {
      "role": "tool",
      "content": "{\"status\": \"error\", \"message\": \"Request timeout\"}"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            {
              "role": "system",
              "content": "PASS if response acknowledges technical issue and suggests retry or alternative. Output: pass or fail"
            }
          ]
        }
      }
    }
  ]
}

Error categories to test:

  • Tool/API failures
  • Invalid user input
  • Timeout scenarios
  • Rate limit errors
  • Partial data availability
  • Permission/authorization issues
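
Most of these can be simulated with the same mocked-tool pattern shown above. A sketch for the rate-limit case, reusing the checkOrderStatus tool from the timeout example; the error message and judge wording are illustrative:

{
  "name": "Error Handling - Rate Limit",
  "messages": [
    {
      "role": "user",
      "content": "Check my order status"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [{ "name": "checkOrderStatus" }]
      }
    },
    {
      "role": "tool",
      "content": "{\"status\": \"error\", \"message\": \"Rate limit exceeded\"}"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [{
            "role": "system",
            "content": "PASS if response explains there is a temporary issue and asks the user to try again shortly. Output: pass or fail"
          }]
        }
      }
    }
  ]
}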

Boundary testing

Test limits and thresholds of your system.

Maximum conversation length:

{
  "name": "Boundary - Max Turns",
  "description": "Test assistant handles long conversations (20+ turns)",
  "messages": [
    { "role": "user", "content": "Question 1" },
    { "role": "assistant", "judgePlan": { "type": "regex", "content": ".+" } },
    { "role": "user", "content": "Question 2" },
    { "role": "assistant", "judgePlan": { "type": "regex", "content": ".+" } },
    // ... repeat up to boundary ...
    { "role": "user", "content": "Final question" },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            {
              "role": "system",
              "content": "PASS if response is coherent and maintains context from earlier conversation. Output: pass or fail"
            }
          ]
        }
      }
    }
  ]
}

Rate limits:

Test behavior at or near rate limits:

  • Multiple tool calls in succession
  • Rapid user input
  • Large data processing requests

Data size boundaries:

{
  "name": "Boundary - Large Data Response",
  "messages": [
    {
      "role": "user",
      "content": "Get all customer records"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [{ "name": "getAllCustomers" }]
      }
    },
    {
      "role": "tool",
      "content": "{\"customers\": [/* 1000 customer objects */]}"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            {
              "role": "system",
              "content": "PASS if response summarizes data rather than reading full list. Output: pass or fail"
            }
          ]
        }
      }
    }
  ]
}

Best practices

Evaluation design principles

Single responsibility

Each evaluation should test one specific behavior or feature.

Good: “Test greeting acknowledgment”

Bad: “Test greeting, booking, and error handling”

Clear naming

Use descriptive names that explain what’s being tested.

Good: “Booking - Validates Date Format”

Bad: “Test 1” or “Eval ABC”

Comprehensive descriptions

Document why the test exists and what it validates. Include context: business requirement, bug ticket, or feature spec.

Maintainable complexity

Keep evaluations focused (5-10 turns max).

Split complex scenarios into multiple targeted tests.

Validation approach selection

Choose the right judge type for each scenario:

Exact match

Ideal for:

  • Critical business data (confirmation IDs, totals, dates)
  • Tool call validation with specific arguments
  • Compliance-required exact wording
  • Success/failure status messages

Example: Booking confirmation ID must be exact

{
  "judgePlan": {
    "type": "exact",
    "content": "Your confirmation ID is APT-12345"
  }
}

Regex

Ideal for:

  • Responses with variable data (names, dates, IDs)
  • Pattern matching (email formats, phone numbers)
  • Flexible phrasing with specific keywords
  • Multiple acceptable phrasings

Example: Confirmation with variable ID format

{
  "judgePlan": {
    "type": "regex",
    "content": ".*confirmation (ID|number|code): [A-Z]{3}-[0-9]{5}.*"
  }
}

AI judge

Ideal for:

  • Semantic meaning validation
  • Tone and sentiment evaluation
  • Contextual appropriateness
  • Complex multi-factor criteria
  • Helpfulness assessment

Example: Validate polite rejection

{
  "judgePlan": {
    "type": "ai",
    "model": {
      "provider": "openai",
      "model": "gpt-4o",
      "messages": [{
        "role": "system",
        "content": "PASS if response politely declines without being rude and offers alternative. Output: pass or fail"
      }]
    }
  }
}

Decision tree:

Is the exact wording critical?
├─ Yes → Use Exact Match
└─ No → Does it follow a pattern?
   ├─ Yes → Use Regex
   └─ No → Does it require understanding context/tone?
      ├─ Yes → Use AI Judge
      └─ No → Use Regex with flexible pattern

Performance optimization

Minimize test execution time:

  1. Use exit-on-failure for early steps:
{
  "continuePlan": {
    "exitOnFailureEnabled": true
  }
}

Stops test immediately when critical validation fails.

  2. Run critical tests first: Organize test suites so smoke tests and critical validations run before expensive tests.

  3. Keep conversations focused: Aim for 5-10 turns maximum. Split longer scenarios into multiple tests.

  4. Batch related tests: Group similar evaluations to run sequentially rather than as one-off runs.

  5. Optimize AI judge prompts (see the sketch after this list):

  • Use faster models (gpt-3.5-turbo) for simple validations
  • Use advanced models (gpt-4o) only for complex semantic evaluation
  • Keep prompts concise and specific
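
For example, a simple clarification check can use a cheaper judge model. A minimal sketch; the model choice here is illustrative, not a requirement:

{
  "judgePlan": {
    "type": "ai",
    "model": {
      "provider": "openai",
      "model": "gpt-3.5-turbo",
      "messages": [{
        "role": "system",
        "content": "PASS if response asks a clarifying question. Output: pass or fail"
      }]
    }
  }
}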

Performance comparison:

Judge Type     Speed        Cost         Use Case
Exact          ⚡⚡⚡ Fast     $ Low        Critical exact matches
Regex          ⚡⚡ Fast      $ Low        Pattern matching
AI (GPT-3.5)   ⚡ Medium     $$ Medium    Simple semantic checks
AI (GPT-4)     ⏱ Slower     $$$ Higher   Complex evaluation

Maintenance strategies

Version control your evaluations:

Store evaluation definitions alongside your codebase:

/tests
  /evals
    /greeting
      - basic-greeting.json
      - multilingual-greeting.json
    /booking
      - happy-path-booking.json
      - error-handling-booking.json
    /regression
      - date-parsing-bug-1234.json

Regular review cycle:

Weekly: Review failed tests

Investigate all failures. Update tests if expectations changed, or fix the assistant if behavior regressed.

Monthly: Audit test coverage

Review test suite completeness:

  • All critical user flows covered?
  • New features have tests?
  • Deprecated features removed?

Quarterly: Refactor and optimize

  • Remove duplicate tests
  • Update outdated validation criteria
  • Optimize slow-running tests
  • Document test rationale

Update tests when:

  • Assistant prompts or behavior change intentionally
  • New features are added
  • Bugs are fixed (add regression tests)
  • User feedback reveals edge cases
  • Business requirements evolve

Deprecation strategy:

Don’t delete tests immediately when features change:

  1. Mark test as “deprecated” in description
  2. Update expected behavior to match new requirements
  3. Run for one release cycle to verify
  4. Archive after confirmed stable

CI/CD integration

Automate evaluation runs in your deployment pipeline.

Basic workflow:

# .github/workflows/test-assistant.yml
name: Test Assistant Changes

on:
  pull_request:
    paths:
      - "assistants/**"
      - "prompts/**"

jobs:
  run-evals:
    runs-on: ubuntu-latest
    steps:
      - name: Run critical evals
        run: |
          # Run smoke tests
          curl -X POST "https://api.vapi.ai/eval/run" \
            -H "Authorization: Bearer ${{ secrets.VAPI_API_KEY }}" \
            -d '{"evalId": "$SMOKE_TEST_ID", "target": {...}}'

          # Check results
          # Fail build if tests fail

Advanced patterns:

Run full eval suite against staging before production deploy:

# Run all evals against staging assistant
for eval_id in $EVAL_IDS; do
  run_result=$(curl -X POST "https://api.vapi.ai/eval/run" \
    -H "Authorization: Bearer $VAPI_API_KEY" \
    -d "{\"evalId\": \"$eval_id\", \"target\": {\"type\": \"assistant\", \"assistantId\": \"$STAGING_ASSISTANT_ID\"}}")

  # Check if passed
  status=$(echo "$run_result" | jq -r '.results[0].status')
  if [ "$status" != "pass" ]; then
    echo "Eval $eval_id failed!"
    exit 1
  fi
done

Run multiple evals concurrently to speed up CI:

# Run evals in parallel
for eval_id in $EVAL_IDS; do
  (curl -X POST "https://api.vapi.ai/eval/run" \
    -H "Authorization: Bearer $VAPI_API_KEY" \
    -d "{\"evalId\": \"$eval_id\", ...}" > "results_$eval_id.json") &
done
wait

# Aggregate results
for result_file in results_*.json; do
  # Check each result (placeholder)
  :
done

Block deployment if the test pass rate falls below a threshold:

# Calculate pass rate
total_tests=10
passed_tests=$(grep -c '"status":"pass"' all_results.json)
pass_rate=$((passed_tests * 100 / total_tests))

if [ $pass_rate -lt 95 ]; then
  echo "Pass rate $pass_rate% below threshold 95%"
  exit 1
fi

Run full regression suite nightly:

# .github/workflows/nightly-regression.yml
on:
  schedule:
    - cron: '0 2 * * *' # 2 AM daily

jobs:
  regression-suite:
    runs-on: ubuntu-latest
    steps:
      - name: Run regression tests
        run: ./scripts/run-regression-suite.sh

      - name: Notify on failures
        if: failure()
        run: |
          # Send Slack notification
          # Create GitHub issue

Advanced troubleshooting

Debugging failed evaluations

Step-by-step investigation:

1. Examine failure reason

Check judge.failureReason for specific details:

{
  "judge": {
    "status": "fail",
    "failureReason": "Expected exact match: 'confirmed' but got: 'booked'"
  }
}

This tells you exactly what differed.

2. Review full conversation transcript

Look at results[0].messages to see the complete interaction:

  • What did the user actually say?
  • How did the assistant respond?
  • Were tool calls made correctly?
  • Did tool responses contain expected data?

3. Compare expected vs actual

For exact match failures:

  • Check for extra spaces or newlines
  • Verify punctuation matches exactly
  • Look for case sensitivity issues

For tool call failures:

  • Verify argument types (string vs number)
  • Check for extra/missing arguments
  • Validate argument values

4. Test validation logic separately

For regex:

  • Test pattern with online validators
  • Try pattern against actual response
  • Check for escaped special characters

For AI judge:

  • Test prompt with known good/bad examples
  • Verify binary pass/fail criteria
  • Check for ambiguous requirements

5. Reproduce manually

Test the assistant interactively:

  • Use same input as eval
  • Compare live behavior to eval results
  • Check if issue is with assistant or eval validation

Common failure patterns

Problem: Expected “Hello, how can I help?” but got “Hello, how may I help?”

Solutions:

  • Switch to regex for flexibility: Hello, how (can|may) I help\?
  • Use AI judge for semantic matching
  • Update expected value if new phrasing is acceptable
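
Expressed as a judgePlan, the regex option might look like this; the pattern is the one suggested above, with its backslash escaped for JSON:

{
  "judgePlan": {
    "type": "regex",
    "content": "Hello, how (can|may) I help\\?"
  }
}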

Problem: Arguments have different types or extra fields

Solutions:

  • Check argument types: "14:00" (string) vs 14 (number)
  • Use partial matching: omit arguments to match only function name
  • Normalize data formats in tool implementation

Problem: Same response sometimes passes, sometimes fails

Solutions:

  • Make criteria more specific and binary
  • Add explicit examples of pass/fail cases in prompt
  • Use temperature=0 for deterministic evaluation
  • Switch to regex if pattern-based validation works

Problem: Eval status stuck in “running”

Solutions:

  • Check assistant configuration for errors
  • Verify tool endpoints are accessible
  • Reduce conversation complexity
  • Check for infinite loops in assistant logic

Problem: Pattern seems correct but fails

Solutions:

  • Escape special regex characters: ., ?, *, +, (, )
  • Use .* for flexible matching around keywords
  • Test pattern with online regex validators
  • Check for hidden characters or unicode

Debugging tools and techniques

Use structured logging:

Track eval executions systematically:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "evalId": "eval-123",
  "evalName": "Booking Flow Test",
  "runId": "run-456",
  "target": "assistant-789",
  "result": "fail",
  "failedStep": 3,
  "failureReason": "Tool call mismatch",
  "actualBehavior": "Called cancelAppointment instead of bookAppointment"
}

Isolate variables:

When tests fail inconsistently:

  1. Run same eval multiple times
  2. Test with different assistants (A/B comparison)
  3. Simplify conversation to minimum reproduction
  4. Check for race conditions or timing issues
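
A quick way to check for flakiness is to re-run the same eval several times and tally the outcomes. A minimal sketch, reusing the API call shape from the CI examples above; EVAL_ID and ASSISTANT_ID are placeholders for your own values:

# Run the same eval five times and count how often each status appears
for i in 1 2 3 4 5; do
  curl -s -X POST "https://api.vapi.ai/eval/run" \
    -H "Authorization: Bearer $VAPI_API_KEY" \
    -d "{\"evalId\": \"$EVAL_ID\", \"target\": {\"type\": \"assistant\", \"assistantId\": \"$ASSISTANT_ID\"}}" \
    | jq -r '.results[0].status'
done | sort | uniq -c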

Progressive validation:

Build up complexity gradually:

// Step 1: Verify basic response
{"judgePlan": {"type": "regex", "content": ".+"}}

// Step 2: Verify contains keyword
{"judgePlan": {"type": "regex", "content": ".*appointment.*"}}

// Step 3: Verify exact format
{"judgePlan": {"type": "exact", "content": "Appointment confirmed"}}

Troubleshooting reference

Status and error codes

Status    Ended Reason              Meaning                             Action
ended     mockConversation.done     ✅ Test completed normally           Check results[0].status for pass/fail
ended     assistant-error           ❌ Assistant configuration error     Fix assistant setup, re-run
ended     pipeline-error-*          ❌ Provider API error                Check provider status, API keys
running   -                         ⏳ Test in progress                  Wait or check for timeout
queued    -                         ⏳ Test waiting to start             Normal, should start soon
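
If you save a run's JSON locally, a one-liner can surface both fields from the table above; run.json here is just a placeholder filename:

# Print the ended reason and the pass/fail result of the first test case
jq -r '"endedReason: \(.endedReason)  result: \(.results[0].status // "n/a")"' run.json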

Quick diagnostic checklist

When an eval fails, check:

  • endedReason is “mockConversation.done”
  • Assistant works correctly in manual testing
  • Tool endpoints are accessible
  • Validation criteria match actual behavior
  • Regex patterns are properly escaped
  • AI judge prompts are specific and binary
  • Arguments match expected types (string vs number)
  • API keys and permissions are valid
  • No rate limits or quota issues

Getting help

Include these details when reporting issues:

  • Eval ID and run ID
  • Full endedReason value
  • Conversation transcript (results[0].messages)
  • Expected vs actual behavior
  • Assistant/squad configuration
  • Provider and model being used


Summary

Key takeaways for advanced eval testing:

Testing strategy:

  • Use smoke tests before comprehensive suites
  • Build regression tests when fixing bugs
  • Cover edge cases systematically

Validation selection:

  • Exact match for critical data
  • Regex for pattern matching
  • AI judge for semantic evaluation

Performance:

  • Exit early on critical failures
  • Keep conversations focused (5-10 turns)
  • Batch related tests together

Maintenance:

  • Version control evaluations
  • Review failures promptly
  • Update tests with features
  • Document test purpose clearly

CI/CD:

  • Automate critical tests in pipelines
  • Use staging for full suite validation
  • Set quality gate thresholds
  • Run regression suites regularly