Advanced eval testing

Master testing strategies and best practices for production AI agents

Overview

This guide covers advanced evaluation strategies, testing patterns, and best practices for building robust test suites that ensure your AI agents work reliably in production.

You’ll learn:

  • Strategic testing approaches (smoke, regression, edge case)
  • Testing patterns for different scenarios
  • Performance optimization techniques
  • Maintenance and CI/CD integration strategies
  • Advanced troubleshooting methods

Testing strategies

Smoke tests

Quick validation that core functionality works. Run these first to catch obvious issues.

Purpose: Verify assistant responds and basic conversation flow works.

{
  "name": "Smoke Test - Basic Response",
  "description": "Verify assistant responds to simple greeting",
  "type": "chat.mockConversation",
  "messages": [
    {
      "role": "user",
      "content": "Hello"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "regex",
        "content": ".+"
      }
    }
  ]
}

Characteristics:

  • Minimal validation (just check for any response)
  • Fast execution (1-2 turns)
  • Run before detailed tests
  • Exit early if smoke tests fail

When to use:

  • Before running expensive test suites
  • After deploying configuration changes
  • As health checks in monitoring
  • Quick validation during development
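
If you trigger evals over the API (as in the CI/CD section later in this guide), a smoke run doubles as a health check. A minimal sketch, assuming VAPI_API_KEY, SMOKE_TEST_ID, and ASSISTANT_ID are set for your own project and that the run response exposes results[0].status as shown later:

# Run the smoke eval and bail out before the expensive suites if it fails
result=$(curl -s -X POST "https://api.vapi.ai/eval/run" \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -d "{\"evalId\": \"$SMOKE_TEST_ID\", \"target\": {\"type\": \"assistant\", \"assistantId\": \"$ASSISTANT_ID\"}}")

if [ "$(echo "$result" | jq -r '.results[0].status')" != "pass" ]; then
  echo "Smoke test failed - skipping the full suite"
  exit 1
fi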

Regression tests

Ensure fixes and updates don’t break existing functionality.

Purpose: Validate that known issues stay fixed and features keep working.

  1. Create an evaluation named with a “Regression: ” prefix
  2. Include issue ticket number in description
  3. Add exact scenario that previously failed
  4. Validate the fix still works

Example:

  • Name: “Regression: Date Parsing Bug #1234”
  • Description: “Verify dates like ‘3/15’ parse correctly after bug fix”
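
A complete regression eval for this example might look like the sketch below; the user message, tool call, and resolved date are illustrative and follow the mock-conversation schema used throughout this guide:

{
  "name": "Regression: Date Parsing Bug #1234",
  "description": "Verify dates like '3/15' parse correctly after bug fix (ticket #1234)",
  "type": "chat.mockConversation",
  "messages": [
    { "role": "user", "content": "Book me for 3/15 at 10am" },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [
          {
            "name": "bookAppointment",
            "arguments": { "date": "2025-03-15", "time": "10:00" }
          }
        ]
      }
    }
  ]
}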

Best practices:

  • Name tests after bugs they prevent
  • Include ticket/issue numbers in descriptions
  • Add regression tests when fixing bugs
  • Run full regression suite before major releases
  • Archive tests only when features are removed

Edge case testing

Test boundary conditions and unusual inputs.

Common edge cases to test:

Empty input:

{
  "messages": [
    {"role": "user", "content": ""},
    {"role": "assistant", "judgePlan": {
      "type": "ai",
      "model": {
        "provider": "openai",
        "model": "gpt-4o",
        "messages": [{
          "role": "system",
          "content": "PASS if response asks for clarification politely. Output: pass or fail"
        }]
      }
    }}
  ]
}

Very long input:

{
  "messages": [
    {
      "role": "user",
      "content": "I need help with... (repeat 1000 times)"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "regex",
        "content": ".+"
      },
      "continuePlan": {
        "exitOnFailureEnabled": true
      }
    }
  ]
}

Special characters and non-Latin names:

{
  "messages": [
    {
      "role": "user",
      "content": "My name is François José 王明"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [{
            "role": "system",
            "content": "PASS if response correctly acknowledges the name with special characters. Output: pass or fail"
          }]
        }
      }
    }
  ]
}

Gibberish input:

{
  "messages": [
    {
      "role": "user",
      "content": "asdfghjkl"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [{
            "role": "system",
            "content": "PASS if response asks for clarification without being rude. Output: pass or fail"
          }]
        }
      }
    }
  ]
}

Mid-conversation topic change:

{
  "messages": [
    {"role": "user", "content": "Book appointment"},
    {"role": "assistant", "judgePlan": {"type": "regex", "content": ".*appointment.*"}},
    {"role": "user", "content": "Actually, cancel that. I need tech support."},
    {"role": "assistant", "judgePlan": {
      "type": "ai",
      "model": {
        "provider": "openai",
        "model": "gpt-4o",
        "messages": [{
          "role": "system",
          "content": "PASS if response pivots to tech support without confusion. Output: pass or fail"
        }]
      }
    }}
  ]
}

Edge case categories:

  • Input boundaries: Empty, maximum length, special characters
  • Data formats: Invalid dates, malformed phone numbers, unusual names
  • Conversation patterns: Interruptions, topic changes, contradictions
  • Timing: Very fast responses, long pauses, timeout scenarios

Testing patterns

Happy path testing

Validate ideal user journeys where everything works correctly.

Structure:

  1. User provides clear, complete information
  2. Assistant responds appropriately
  3. Tools execute successfully
  4. Conversation completes with desired outcome

Example: Perfect booking flow

{
  "name": "Happy Path - Complete Booking",
  "description": "User provides all info clearly, booking succeeds",
  "type": "chat.mockConversation",
  "messages": [
    {
      "role": "user",
      "content": "I'd like to book an appointment"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            {
              "role": "system",
              "content": "PASS if response asks for date/time preferences. Output: pass or fail"
            }
          ]
        }
      }
    },
    {
      "role": "user",
      "content": "Next Monday at 2pm please"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [
          {
            "name": "bookAppointment",
            "arguments": {
              "date": "2025-01-20",
              "time": "14:00"
            }
          }
        ]
      }
    },
    {
      "role": "tool",
      "content": "{\"status\": \"success\", \"confirmationId\": \"APT-12345\"}"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "regex",
        "content": ".*(confirmed|booked).*APT-12345.*"
      }
    }
  ]
}

Happy path coverage:

  • Test primary user goals
  • Verify expected tool executions
  • Validate success messages
  • Confirm data accuracy

Error handling testing

Test how your assistant handles failures gracefully.

Tool failure scenarios:

{
  "name": "Error Handling - Booking Unavailable",
  "messages": [
    {
      "role": "user",
      "content": "Book Monday at 2pm"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [{ "name": "bookAppointment" }]
      }
    },
    {
      "role": "tool",
      "content": "{\"status\": \"error\", \"message\": \"Time slot unavailable\"}"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            {
              "role": "system",
              "content": "Evaluate: {{messages[-1]}}\n\nPASS if:\n- Response acknowledges the time is unavailable\n- Response offers alternatives or asks for different time\n- Tone remains helpful (not apologetic to excess)\n\nFAIL if:\n- Response ignores the error\n- Response doesn't offer next steps\n- Tone is frustrated or rude\n\nOutput: pass or fail"
            }
          ]
        }
      }
    }
  ]
}

Invalid input handling:

{
  "name": "Error Handling - Invalid Date Format",
  "messages": [
    {
      "role": "user",
      "content": "Book me for the 45th of Octember"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            {
              "role": "system",
              "content": "PASS if response politely asks for valid date without mocking user. Output: pass or fail"
            }
          ]
        }
      }
    }
  ]
}

API timeout simulation:

{
  "name": "Error Handling - Tool Timeout",
  "messages": [
    {
      "role": "user",
      "content": "Check my order status"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [{ "name": "checkOrderStatus" }]
      }
    },
    {
      "role": "tool",
      "content": "{\"status\": \"error\", \"message\": \"Request timeout\"}"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            {
              "role": "system",
              "content": "PASS if response acknowledges technical issue and suggests retry or alternative. Output: pass or fail"
            }
          ]
        }
      }
    }
  ]
}

Error categories to test:

  • Tool/API failures
  • Invalid user input
  • Timeout scenarios
  • Rate limit errors
  • Partial data availability
  • Permission/authorization issues
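
Most of these can be simulated with the same mocked-tool pattern shown above. A sketch for the rate-limit case, reusing the checkOrderStatus tool from the timeout example; the error message and judge wording are illustrative:

{
  "name": "Error Handling - Rate Limit",
  "messages": [
    {
      "role": "user",
      "content": "Check my order status"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [{ "name": "checkOrderStatus" }]
      }
    },
    {
      "role": "tool",
      "content": "{\"status\": \"error\", \"message\": \"Rate limit exceeded\"}"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [{
            "role": "system",
            "content": "PASS if response explains there is a temporary issue and asks the user to try again shortly. Output: pass or fail"
          }]
        }
      }
    }
  ]
}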

Boundary testing

Test limits and thresholds of your system.

Maximum conversation length:

{
  "name": "Boundary - Max Turns",
  "description": "Test assistant handles long conversations (20+ turns)",
  "messages": [
    { "role": "user", "content": "Question 1" },
    { "role": "assistant", "judgePlan": { "type": "regex", "content": ".+" } },
    { "role": "user", "content": "Question 2" },
    { "role": "assistant", "judgePlan": { "type": "regex", "content": ".+" } },
    // ... repeat up to boundary ...
    { "role": "user", "content": "Final question" },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            {
              "role": "system",
              "content": "PASS if response is coherent and maintains context from earlier conversation. Output: pass or fail"
            }
          ]
        }
      }
    }
  ]
}

Rate limits:

Test behavior at or near rate limits:

  • Multiple tool calls in succession
  • Rapid user input
  • Large data processing requests

Data size boundaries:

{
  "name": "Boundary - Large Data Response",
  "messages": [
    {
      "role": "user",
      "content": "Get all customer records"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [{ "name": "getAllCustomers" }]
      }
    },
    {
      "role": "tool",
      "content": "{\"customers\": [/* 1000 customer objects */]}"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            {
              "role": "system",
              "content": "PASS if response summarizes data rather than reading full list. Output: pass or fail"
            }
          ]
        }
      }
    }
  ]
}

Best practices

Evaluation design principles

Single responsibility

Each evaluation should test one specific behavior or feature.

Good: “Test greeting acknowledgment”

Bad: “Test greeting, booking, and error handling”

Clear naming

Use descriptive names that explain what’s being tested.

Good: “Booking - Validates Date Format”

Bad: “Test 1” or “Eval ABC”

Comprehensive descriptions

Document why the test exists and what it validates. Include context: business requirement, bug ticket, or feature spec.

Maintainable complexity

Keep evaluations focused (5-10 turns max).

Split complex scenarios into multiple targeted tests.

Validation approach selection

Choose the right judge type for each scenario:

Exact match

Ideal for:

  • Critical business data (confirmation IDs, totals, dates)
  • Tool call validation with specific arguments
  • Compliance-required exact wording
  • Success/failure status messages

Example: Booking confirmation ID must be exact

{
  "judgePlan": {
    "type": "exact",
    "content": "Your confirmation ID is APT-12345"
  }
}

Regex

Ideal for:

  • Responses with variable data (names, dates, IDs)
  • Pattern matching (email formats, phone numbers)
  • Flexible phrasing with specific keywords
  • Multiple acceptable phrasings

Example: Confirmation with variable ID format

{
  "judgePlan": {
    "type": "regex",
    "content": ".*confirmation (ID|number|code): [A-Z]{3}-[0-9]{5}.*"
  }
}

AI judge

Ideal for:

  • Semantic meaning validation
  • Tone and sentiment evaluation
  • Contextual appropriateness
  • Complex multi-factor criteria
  • Helpfulness assessment

Example: Validate polite rejection

{
  "judgePlan": {
    "type": "ai",
    "model": {
      "provider": "openai",
      "model": "gpt-4o",
      "messages": [{
        "role": "system",
        "content": "PASS if response politely declines without being rude and offers alternative. Output: pass or fail"
      }]
    }
  }
}

Decision tree:

Is the exact wording critical?
├─ Yes → Use Exact Match
└─ No → Does it follow a pattern?
   ├─ Yes → Use Regex
   └─ No → Does it require understanding context/tone?
      ├─ Yes → Use AI Judge
      └─ No → Use Regex with flexible pattern

Performance optimization

Minimize test execution time:

  1. Use exit-on-failure for early steps:
{
  "continuePlan": {
    "exitOnFailureEnabled": true
  }
}

Stops test immediately when critical validation fails.

  2. Run critical tests first: Organize test suites so smoke tests and critical validations run before expensive tests.

  3. Keep conversations focused: Aim for 5-10 turns maximum. Split longer scenarios into multiple tests.

  4. Batch related tests: Group similar evaluations to run sequentially rather than as one-off runs.

  5. Optimize AI judge prompts (see the sketch after this list):

  • Use faster models (gpt-3.5-turbo) for simple validations
  • Use advanced models (gpt-4o) only for complex semantic evaluation
  • Keep prompts concise and specific
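
For example, a simple clarification check can use a cheaper judge model. A minimal sketch; the model choice here is illustrative, not a requirement:

{
  "judgePlan": {
    "type": "ai",
    "model": {
      "provider": "openai",
      "model": "gpt-3.5-turbo",
      "messages": [{
        "role": "system",
        "content": "PASS if response asks a clarifying question. Output: pass or fail"
      }]
    }
  }
}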

Performance comparison:

Judge Type     Speed        Cost         Use Case
Exact          ⚡⚡⚡ Fast     $ Low        Critical exact matches
Regex          ⚡⚡ Fast      $ Low        Pattern matching
AI (GPT-3.5)   ⚡ Medium     $$ Medium    Simple semantic checks
AI (GPT-4)     ⏱ Slower     $$$ Higher   Complex evaluation

Maintenance strategies

Version control your evaluations:

Store evaluation definitions alongside your codebase:

/tests
  /evals
    /greeting
      - basic-greeting.json
      - multilingual-greeting.json
    /booking
      - happy-path-booking.json
      - error-handling-booking.json
    /regression
      - date-parsing-bug-1234.json

Regular review cycle:

Weekly: Review failed tests

Investigate all failures. Update tests if expectations changed, or fix the assistant if behavior regressed.

Monthly: Audit test coverage

Review test suite completeness:

  • All critical user flows covered?
  • New features have tests?
  • Deprecated features removed?

Quarterly: Refactor and optimize

  • Remove duplicate tests
  • Update outdated validation criteria
  • Optimize slow-running tests
  • Document test rationale

Update tests when:

  • Assistant prompts or behavior change intentionally
  • New features are added
  • Bugs are fixed (add regression tests)
  • User feedback reveals edge cases
  • Business requirements evolve

Deprecation strategy:

Don’t delete tests immediately when features change:

  1. Mark test as “deprecated” in description
  2. Update expected behavior to match new requirements
  3. Run for one release cycle to verify
  4. Archive after confirmed stable

CI/CD integration

Automate evaluation runs in your deployment pipeline.

Basic workflow:

# .github/workflows/test-assistant.yml
name: Test Assistant Changes

on:
  pull_request:
    paths:
      - "assistants/**"
      - "prompts/**"

jobs:
  run-evals:
    runs-on: ubuntu-latest
    steps:
      - name: Run critical evals
        run: |
          # Run smoke tests
          curl -X POST "https://api.vapi.ai/eval/run" \
            -H "Authorization: Bearer ${{ secrets.VAPI_API_KEY }}" \
            -d '{"evalId": "$SMOKE_TEST_ID", "target": {...}}'

          # Check results
          # Fail build if tests fail

Advanced patterns:

Run full eval suite against staging before production deploy:

# Run all evals against staging assistant
for eval_id in $EVAL_IDS; do
  run_result=$(curl -X POST "https://api.vapi.ai/eval/run" \
    -H "Authorization: Bearer $VAPI_API_KEY" \
    -d "{\"evalId\": \"$eval_id\", \"target\": {\"type\": \"assistant\", \"assistantId\": \"$STAGING_ASSISTANT_ID\"}}")

  # Check if passed
  status=$(echo "$run_result" | jq -r '.results[0].status')
  if [ "$status" != "pass" ]; then
    echo "Eval $eval_id failed!"
    exit 1
  fi
done

Run multiple evals concurrently to speed up CI:

# Run evals in parallel
for eval_id in $EVAL_IDS; do
  (curl -X POST "https://api.vapi.ai/eval/run" \
    -H "Authorization: Bearer $VAPI_API_KEY" \
    -d "{\"evalId\": \"$eval_id\", ...}" > "results_$eval_id.json") &
done
wait

# Aggregate results
for result_file in results_*.json; do
  # Check each result (placeholder)
  :
done

Block deployment if the test pass rate falls below a threshold:

# Calculate pass rate
total_tests=10
passed_tests=$(grep -c '"status":"pass"' all_results.json)
pass_rate=$((passed_tests * 100 / total_tests))

if [ $pass_rate -lt 95 ]; then
  echo "Pass rate $pass_rate% below threshold 95%"
  exit 1
fi

Run full regression suite nightly:

# .github/workflows/nightly-regression.yml
on:
  schedule:
    - cron: '0 2 * * *' # 2 AM daily

jobs:
  regression-suite:
    runs-on: ubuntu-latest
    steps:
      - name: Run regression tests
        run: ./scripts/run-regression-suite.sh

      - name: Notify on failures
        if: failure()
        run: |
          # Send Slack notification
          # Create GitHub issue

Advanced troubleshooting

Debugging failed evaluations

Step-by-step investigation:

1. Examine failure reason

Check judge.failureReason for specific details:

{
  "judge": {
    "status": "fail",
    "failureReason": "Expected exact match: 'confirmed' but got: 'booked'"
  }
}

This tells you exactly what differed.

2. Review full conversation transcript

Look at results[0].messages to see the complete interaction:

  • What did the user actually say?
  • How did the assistant respond?
  • Were tool calls made correctly?
  • Did tool responses contain expected data?

3. Compare expected vs actual

For exact match failures:

  • Check for extra spaces or newlines
  • Verify punctuation matches exactly
  • Look for case sensitivity issues

For tool call failures:

  • Verify argument types (string vs number)
  • Check for extra/missing arguments
  • Validate argument values

4. Test validation logic separately

For regex:

  • Test pattern with online validators
  • Try pattern against actual response
  • Check for escaped special characters

For AI judge:

  • Test prompt with known good/bad examples
  • Verify binary pass/fail criteria
  • Check for ambiguous requirements

5. Reproduce manually

Test the assistant interactively:

  • Use same input as eval
  • Compare live behavior to eval results
  • Check if issue is with assistant or eval validation

Common failure patterns

Problem: Expected “Hello, how can I help?” but got “Hello, how may I help?”

Solutions:

  • Switch to regex for flexibility: Hello, how (can|may) I help\?
  • Use AI judge for semantic matching
  • Update expected value if new phrasing is acceptable
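
Expressed as a judgePlan, the regex option might look like this; the pattern is the one suggested above, with its backslash escaped for JSON:

{
  "judgePlan": {
    "type": "regex",
    "content": "Hello, how (can|may) I help\\?"
  }
}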

Problem: Arguments have different types or extra fields

Solutions:

  • Check argument types: "14:00" (string) vs 14 (number)
  • Use partial matching: omit arguments to match only function name
  • Normalize data formats in tool implementation

Problem: Same response sometimes passes, sometimes fails

Solutions:

  • Make criteria more specific and binary
  • Add explicit examples of pass/fail cases in prompt
  • Use temperature=0 for deterministic evaluation
  • Switch to regex if pattern-based validation works

Problem: Eval status stuck in “running”

Solutions:

  • Check assistant configuration for errors
  • Verify tool endpoints are accessible
  • Reduce conversation complexity
  • Check for infinite loops in assistant logic

Problem: Pattern seems correct but fails

Solutions:

  • Escape special regex characters: ., ?, *, +, (, )
  • Use .* for flexible matching around keywords
  • Test pattern with online regex validators
  • Check for hidden characters or unicode

Debugging tools and techniques

Use structured logging:

Track eval executions systematically:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "evalId": "eval-123",
  "evalName": "Booking Flow Test",
  "runId": "run-456",
  "target": "assistant-789",
  "result": "fail",
  "failedStep": 3,
  "failureReason": "Tool call mismatch",
  "actualBehavior": "Called cancelAppointment instead of bookAppointment"
}

Isolate variables:

When tests fail inconsistently:

  1. Run same eval multiple times
  2. Test with different assistants (A/B comparison)
  3. Simplify conversation to minimum reproduction
  4. Check for race conditions or timing issues
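
A quick way to check for flakiness is to re-run the same eval several times and tally the outcomes. A minimal sketch, reusing the API call shape from the CI examples above; EVAL_ID and ASSISTANT_ID are placeholders for your own values:

# Run the same eval five times and count how often each status appears
for i in 1 2 3 4 5; do
  curl -s -X POST "https://api.vapi.ai/eval/run" \
    -H "Authorization: Bearer $VAPI_API_KEY" \
    -d "{\"evalId\": \"$EVAL_ID\", \"target\": {\"type\": \"assistant\", \"assistantId\": \"$ASSISTANT_ID\"}}" \
    | jq -r '.results[0].status'
done | sort | uniq -c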

Progressive validation:

Build up complexity gradually:

// Step 1: Verify basic response
{"judgePlan": {"type": "regex", "content": ".+"}}

// Step 2: Verify contains keyword
{"judgePlan": {"type": "regex", "content": ".*appointment.*"}}

// Step 3: Verify exact format
{"judgePlan": {"type": "exact", "content": "Appointment confirmed"}}

Troubleshooting reference

Status and error codes

Status    Ended Reason              Meaning                             Action
ended     mockConversation.done     ✅ Test completed normally           Check results[0].status for pass/fail
ended     assistant-error           ❌ Assistant configuration error     Fix assistant setup, re-run
ended     pipeline-error-*          ❌ Provider API error                Check provider status, API keys
running   -                         ⏳ Test in progress                  Wait or check for timeout
queued    -                         ⏳ Test waiting to start             Normal, should start soon
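
If you save a run's JSON locally, a one-liner can surface both fields from the table above; run.json here is just a placeholder filename:

# Print the ended reason and the pass/fail result of the first test case
jq -r '"endedReason: \(.endedReason)  result: \(.results[0].status // "n/a")"' run.json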

Quick diagnostic checklist

When an eval fails, check:

  • endedReason is “mockConversation.done”
  • Assistant works correctly in manual testing
  • Tool endpoints are accessible
  • Validation criteria match actual behavior
  • Regex patterns are properly escaped
  • AI judge prompts are specific and binary
  • Arguments match expected types (string vs number)
  • API keys and permissions are valid
  • No rate limits or quota issues

Getting help

Include these details when reporting issues:

  • Eval ID and run ID
  • Full endedReason value
  • Conversation transcript (results[0].messages)
  • Expected vs actual behavior
  • Assistant/squad configuration
  • Provider and model being used


Summary

Key takeaways for advanced eval testing:

Testing strategy:

  • Use smoke tests before comprehensive suites
  • Build regression tests when fixing bugs
  • Cover edge cases systematically

Validation selection:

  • Exact match for critical data
  • Regex for pattern matching
  • AI judge for semantic evaluation

Performance:

  • Exit early on critical failures
  • Keep conversations focused (5-10 turns)
  • Batch related tests together

Maintenance:

  • Version control evaluations
  • Review failures promptly
  • Update tests with features
  • Document test purpose clearly

CI/CD:

  • Automate critical tests in pipelines
  • Use staging for full suite validation
  • Set quality gate thresholds
  • Run regression suites regularly