Advanced eval testing
Overview
This guide covers advanced evaluation strategies, testing patterns, and best practices for building robust test suites that ensure your AI agents work reliably in production.
You’ll learn:
- Strategic testing approaches (smoke, regression, edge case)
- Testing patterns for different scenarios
- Performance optimization techniques
- Maintenance and CI/CD integration strategies
- Advanced troubleshooting methods
Testing strategies
Smoke tests
Quick validation that core functionality works. Run these first to catch obvious issues.
Purpose: Verify assistant responds and basic conversation flow works.
Characteristics:
- Minimal validation (just check for any response)
- Fast execution (1-2 turns)
- Run before detailed tests
- Exit early if smoke tests fail
When to use:
- Before running expensive test suites
- After deploying configuration changes
- As health checks in monitoring
- Quick validation during development
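A minimal smoke-test sketch, assuming a hypothetical endpoint and a synchronous run for brevity; the real endpoint path, payload, and run lifecycle are in the Eval API Reference:

```bash
#!/usr/bin/env bash
# Smoke test: confirm the assistant responds at all before running the full suite.
# The endpoint URL and response shape are assumptions for illustration.
set -euo pipefail
: "${API_KEY:?Set API_KEY first}"

SMOKE_EVAL_ID="smoke-basic-greeting"

status=$(curl -s -X POST "https://api.example.com/eval/${SMOKE_EVAL_ID}/run" \
  -H "Authorization: Bearer ${API_KEY}" | jq -r '.status')

if [ "$status" != "pass" ]; then
  echo "Smoke test failed (status: $status) -- skipping full suite."
  exit 1
fi
echo "Smoke test passed."
```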
Regression tests
Ensure fixes and updates don’t break existing functionality.
Purpose: Validate that known issues stay fixed and features keep working.
To add a regression test (in the Dashboard or via the API):
- Create evaluation named with “Regression: ” prefix
- Include issue ticket number in description
- Add exact scenario that previously failed
- Validate the fix still works
Example:
- Name: “Regression: Date Parsing Bug #1234”
- Description: “Verify dates like ‘3/15’ parse correctly after bug fix”
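A hedged cURL sketch of creating such an eval; the endpoint path and field names are illustrative, not the documented schema:

```bash
# Hypothetical request -- check the Eval API Reference for the real endpoint and fields.
curl -s -X POST "https://api.example.com/eval" \
  -H "Authorization: Bearer ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Regression: Date Parsing Bug #1234",
    "description": "Verify dates like 3/15 parse correctly after bug fix",
    "messages": [
      { "role": "user", "content": "Book me for 3/15 at 2pm" }
    ]
  }'
```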
Best practices:
- Name tests after bugs they prevent
- Include ticket/issue numbers in descriptions
- Add regression tests when fixing bugs
- Run full regression suite before major releases
- Archive tests only when features are removed
Edge case testing
Test boundary conditions and unusual inputs.
Common edge cases to test:
Empty or minimal input
Very long input
Special characters and unicode
Ambiguous or unclear requests
Rapid conversation changes
Edge case categories:
- Input boundaries: Empty, maximum length, special characters
- Data formats: Invalid dates, malformed phone numbers, unusual names
- Conversation patterns: Interruptions, topic changes, contradictions
- Timing: Very fast responses, long pauses, timeout scenarios
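One way to keep edge-case inputs in a single place is a small table-driven script; the inputs below are illustrative:

```bash
#!/usr/bin/env bash
# Illustrative edge-case inputs, each intended to seed its own eval.
declare -a edge_cases=(
  ""                                      # empty input
  "$(printf 'a%.0s' {1..5000})"           # very long input (5000 chars)
  "Café naïve 日本語 🚀 ©"                 # unicode and special characters
  "book it... actually no, cancel. wait"  # rapid topic changes / contradictions
  "3/15 or maybe the 16th, whatever"      # ambiguous request
)

for input in "${edge_cases[@]}"; do
  printf 'Testing input (%d chars): %.60s\n' "${#input}" "$input"
  # Here you would create or run an eval whose first user message is this input.
done
```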
Testing patterns
Happy path testing
Validate ideal user journeys where everything works correctly.
Structure:
- User provides clear, complete information
- Assistant responds appropriately
- Tools execute successfully
- Conversation completes with desired outcome
Example: Perfect booking flow
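A sketch of how that flow might be expressed as a mock conversation; the field names (mockConversation, expectedToolCalls) are assumptions for illustration, not the documented schema:

```bash
# Hypothetical happy-path booking eval definition.
cat > happy_path_booking.json <<'EOF'
{
  "name": "Happy Path: Complete Booking",
  "mockConversation": [
    { "role": "user", "content": "I'd like to book a table for 2 tomorrow at 7pm" },
    { "role": "assistant", "expected": "confirms date, time, and party size" },
    { "role": "user", "content": "Yes, that's right. Name is Alex." }
  ],
  "expectedToolCalls": [
    { "name": "create_booking", "arguments": { "partySize": 2, "time": "19:00" } }
  ]
}
EOF
```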
Happy path coverage:
- Test primary user goals
- Verify expected tool executions
- Validate success messages
- Confirm data accuracy
Error handling testing
Test how your assistant handles failures gracefully.
Scenarios to simulate include tool failures, invalid user input, and API timeouts.
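A sketch of a tool-failure scenario, again with illustrative field names; the same pattern extends to invalid-input and timeout simulations:

```bash
# Hypothetical scenario: the mock tool returns an error and the AI judge checks
# that the assistant recovers gracefully. Field names are illustrative.
cat > tool_failure_booking.json <<'EOF'
{
  "name": "Error Handling: Booking API Failure",
  "mockConversation": [
    { "role": "user", "content": "Book a table for tonight at 8pm" },
    { "role": "tool", "name": "create_booking",
      "mockResponse": { "error": "SERVICE_UNAVAILABLE", "httpStatus": 503 } }
  ],
  "judge": {
    "type": "ai",
    "criteria": "Assistant acknowledges the failure, apologizes, and offers an alternative (retry or human handoff). It must not claim the booking succeeded."
  }
}
EOF
```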
Error categories to test:
- Tool/API failures
- Invalid user input
- Timeout scenarios
- Rate limit errors
- Partial data availability
- Permission/authorization issues
Boundary testing
Test limits and thresholds of your system.
Areas to probe:
- Maximum conversation length: verify behavior as the turn count approaches any configured limit
- Rate limits: multiple tool calls in succession, rapid user input, large data processing requests
- Data size boundaries: very large user messages or tool responses
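A sketch of generating boundary-sized inputs; the 10,000-character limit here is an arbitrary example, not a documented maximum:

```bash
#!/usr/bin/env bash
# Generate inputs at and just over an assumed size limit for data-size boundary tests.
MAX_CHARS=10000

at_limit=$(printf 'x%.0s' $(seq 1 "$MAX_CHARS"))
over_limit=$(printf 'x%.0s' $(seq 1 $((MAX_CHARS + 1))))

echo "At limit: ${#at_limit} chars"
echo "Over limit: ${#over_limit} chars"
# Feed each into its own eval and assert the assistant degrades gracefully
# (truncates, summarizes, or asks the user to shorten) rather than erroring out.
```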
Best practices
Evaluation design principles
Each evaluation should test one specific behavior or feature.
✅ Good: “Test greeting acknowledgment”
❌ Bad: “Test greeting, booking, and error handling”
Use descriptive names that explain what’s being tested.
✅ Good: “Booking - Validates Date Format”
❌ Bad: “Test 1” or “Eval ABC”
Document why the test exists and what it validates. Include context: business requirement, bug ticket, or feature spec.
Keep evaluations focused (5-10 turns max).
Split complex scenarios into multiple targeted tests.
Validation approach selection
Choose the right judge type for each scenario:
Use Exact Match when...
Ideal for:
- Critical business data (confirmation IDs, totals, dates)
- Tool call validation with specific arguments
- Compliance-required exact wording
- Success/failure status messages
Example: Booking confirmation ID must be exact
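A sketch with illustrative field names:

```bash
# Hypothetical exact-match judge: the confirmation message must match verbatim.
cat > exact_match_confirmation.json <<'EOF'
{
  "judge": {
    "type": "exactMatch",
    "expected": "Your booking is confirmed. Confirmation ID: BK-48213."
  }
}
EOF
```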
Use Regex when...
Ideal for:
- Responses with variable data (names, dates, IDs)
- Pattern matching (email formats, phone numbers)
- Flexible phrasing with specific keywords
- Multiple acceptable phrasings
Example: Confirmation with variable ID format
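A sketch, again with illustrative field names:

```bash
# Hypothetical regex judge: accept any confirmation ID of the form BK- plus 4-6 digits.
cat > regex_confirmation.json <<'EOF'
{
  "judge": {
    "type": "regex",
    "pattern": "confirmed.*Confirmation ID: BK-[0-9]{4,6}"
  }
}
EOF
```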
Use AI Judge when...
Ideal for:
- Semantic meaning validation
- Tone and sentiment evaluation
- Contextual appropriateness
- Complex multi-factor criteria
- Helpfulness assessment
Example: Validate polite rejection
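A sketch of AI-judge criteria, with illustrative field names:

```bash
# Hypothetical AI-judge criteria for validating a polite rejection.
cat > ai_judge_polite_rejection.json <<'EOF'
{
  "judge": {
    "type": "ai",
    "criteria": "The assistant declines the out-of-scope request politely, briefly explains why it cannot help, and suggests a reasonable alternative. Fail if the tone is dismissive or if the assistant attempts the request anyway."
  }
}
EOF
```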
Decision tree: if the expected output is fully deterministic, use exact match; if it varies but follows a predictable pattern, use regex; otherwise, use an AI judge for semantic, tone, or multi-factor criteria.
Performance optimization
Minimize test execution time:
- Use exit-on-failure for early steps: stops the test immediately when a critical validation fails (see the sketch after this list).
- Run critical tests first: organize test suites so smoke tests and critical validations run before expensive tests.
- Keep conversations focused: aim for 5-10 turns maximum. Split longer scenarios into multiple tests.
- Batch related tests: group similar evaluations to run sequentially rather than one-off.
- Optimize AI judge prompts:
  - Use faster models (gpt-3.5-turbo) for simple validations
  - Use advanced models (gpt-4o) only for complex semantic evaluation
  - Keep prompts concise and specific
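A sketch of exit-on-failure at the suite level, assuming a hypothetical run_eval helper that prints "pass" or "fail" for an eval ID; the product's per-step setting may differ:

```bash
#!/usr/bin/env bash
# Stop the suite as soon as a critical eval fails, before running expensive tests.
critical_evals=("smoke-basic-greeting" "regression-date-parsing-1234")

for eval_id in "${critical_evals[@]}"; do
  if [ "$(run_eval "$eval_id")" != "pass" ]; then
    echo "Critical eval $eval_id failed -- aborting remaining tests."
    exit 1
  fi
done

# Only reached if all critical evals passed.
./run_full_suite.sh
```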
Performance comparison: exact match and regex judges add negligible latency, while AI judges add a model call per validation; reserve AI judges for checks that genuinely need semantic evaluation.
Maintenance strategies
Version control your evaluations:
Store evaluation definitions alongside your codebase:
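For example (layout and commands are illustrative):

```bash
# Keep eval definitions in the repo and review changes to them like code.
mkdir -p evals/smoke evals/regression evals/features
cp regression-date-parsing-1234.json evals/regression/
git add evals/
git commit -m "Add regression eval for date parsing bug #1234"
```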
Regular review cycle:
- Weekly: Review failed tests. Investigate all failures; update tests if expectations changed, or fix the assistant if behavior regressed.
Update tests when:
- Assistant prompts or behavior change intentionally
- New features are added
- Bugs are fixed (add regression tests)
- User feedback reveals edge cases
- Business requirements evolve
Deprecation strategy:
Don’t delete tests immediately when features change:
- Mark test as “deprecated” in description
- Update expected behavior to match new requirements
- Run for one release cycle to verify
- Archive after confirmed stable
CI/CD integration
Automate evaluation runs in your deployment pipeline.
Basic workflow:
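A minimal sketch of the basic workflow as a CI step, assuming a hypothetical run_eval.sh wrapper that runs one eval via the API and exits non-zero on failure:

```bash
#!/usr/bin/env bash
# ci_run_evals.sh -- run critical evals on every push; fail the build on any failure.
set -euo pipefail

for eval_id in smoke-basic-greeting regression-date-parsing-1234 booking-happy-path; do
  echo "Running $eval_id..."
  ./run_eval.sh "$eval_id"
done
echo "All critical evals passed."
```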
Advanced patterns:
Staging environment validation
Run the full eval suite against staging before a production deploy, using the same workflow as above pointed at your staging assistant and credentials.
Parallel test execution
Run multiple evals concurrently to speed up CI:
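A sketch using plain shell job control; run_eval.sh is the same hypothetical wrapper as above:

```bash
#!/usr/bin/env bash
# Run independent evals concurrently and fail if any of them fails.
pids=()
for eval_id in booking-happy-path booking-error-handling edge-case-unicode; do
  ./run_eval.sh "$eval_id" &
  pids+=($!)
done

failed=0
for pid in "${pids[@]}"; do
  wait "$pid" || failed=1
done
exit "$failed"
```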
Quality gates
Block deployment if the test pass rate falls below a threshold:

```bash
# Calculate pass rate
total_tests=10
passed_tests=$(grep -c '"status":"pass"' all_results.json)
pass_rate=$((passed_tests * 100 / total_tests))

if [ $pass_rate -lt 95 ]; then
  echo "Pass rate $pass_rate% below threshold 95%"
  exit 1
fi
```
Scheduled regression runs
Run full regression suite nightly:
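For example, with cron (paths are illustrative):

```bash
# crontab entry: run the full regression suite nightly at 02:00.
0 2 * * * /opt/evals/run_regression_suite.sh >> /var/log/eval-regression.log 2>&1
```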
Advanced troubleshooting
Debugging failed evaluations
Step-by-step investigation:
Examine the failure reason
Check judge.failureReason for specific details; this tells you exactly what differed.
Review the full conversation transcript
Look at results[0].messages to see the complete interaction:
- What did the user actually say?
- How did the assistant respond?
- Were tool calls made correctly?
- Did tool responses contain expected data?
Compare expected vs actual
For exact match failures:
- Check for extra spaces or newlines
- Verify punctuation matches exactly
- Look for case sensitivity issues
For tool call failures:
- Verify argument types (string vs number)
- Check for extra/missing arguments
- Validate argument values
Common failure patterns
Exact match fails with similar text
Problem: Expected “Hello, how can I help?” but got “Hello, how may I help?”
Solutions:
- Switch to regex for flexibility: Hello, how (can|may) I help\?
- Use AI judge for semantic matching
- Update expected value if new phrasing is acceptable
Tool calls don't match
Problem: Arguments have different types or extra fields
Solutions:
- Check argument types: "14:00" (string) vs 14 (number)
- Use partial matching: omit arguments to match only the function name
- Normalize data formats in the tool implementation
AI judge inconsistent results
Problem: The same response sometimes passes, sometimes fails
Solutions:
- Make criteria more specific and binary
- Add explicit examples of pass/fail cases in the prompt
- Use temperature=0 for deterministic evaluation
- Switch to regex if pattern-based validation works
Test times out or hangs
Problem: Eval status stuck in “running”
Solutions:
- Check assistant configuration for errors
- Verify tool endpoints are accessible
- Reduce conversation complexity
- Check for infinite loops in assistant logic
Regex doesn't match expected
Problem: Pattern seems correct but fails
Solutions:
- Escape special regex characters: . ? * + ( )
- Use .* for flexible matching around keywords
- Test the pattern with an online regex validator
- Check for hidden characters or unicode
Debugging tools and techniques
Use structured logging:
Track eval executions systematically:
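A sketch of structured logging as newline-delimited JSON using jq; the field names are illustrative:

```bash
#!/usr/bin/env bash
# Append one JSON line per eval run so failures can be queried later with jq.
log_eval_result() {
  local eval_id="$1" run_id="$2" status="$3" reason="$4"
  jq -nc \
    --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --arg eid "$eval_id" --arg run "$run_id" \
    --arg status "$status" --arg reason "$reason" \
    '{timestamp: $ts, evalId: $eid, runId: $run, status: $status, failureReason: $reason}' \
    >> eval_runs.jsonl
}

# Example usage:
log_eval_result "booking-happy-path" "run_123" "fail" "expected BK-48213, got BK-48214"

# Query recent failures:
jq -r 'select(.status == "fail") | "\(.timestamp) \(.evalId): \(.failureReason)"' eval_runs.jsonl
```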
Isolate variables:
When tests fail inconsistently:
- Run same eval multiple times
- Test with different assistants (A/B comparison)
- Simplify conversation to minimum reproduction
- Check for race conditions or timing issues
Progressive validation:
Build up complexity gradually: start by checking that any response is returned, then validate tool calls, then exact content, and only then add semantic (AI judge) criteria.
Troubleshooting reference
Status and error codes
For the full list of run status values and endedReason codes, see the Eval API Reference.
Quick diagnostic checklist
When an eval fails, check:
- [ ] endedReason is “mockConversation.done”
- [ ] Assistant works correctly in manual testing
- [ ] Tool endpoints are accessible
- [ ] Validation criteria match actual behavior
- [ ] Regex patterns are properly escaped
- [ ] AI judge prompts are specific and binary
- [ ] Arguments match expected types (string vs number)
- [ ] API keys and permissions are valid
- [ ] No rate limits or quota issues
Getting help
Include these details when reporting issues:
- Eval ID and run ID
- Full endedReason value
- Conversation transcript (results[0].messages)
- Expected vs actual behavior
- Assistant/squad configuration
- Provider and model being used
Resources:
- Eval API Reference
- Discord Community - #testing channel
- Support - Include eval run ID
Next steps
- Return to the quickstart guide for basic evaluation setup
- Learn to build and configure assistants for testing
- Build custom tools and test their behavior
- Complete API documentation for evaluations
Summary
Key takeaways for advanced eval testing:
Testing strategy:
- Use smoke tests before comprehensive suites
- Build regression tests when fixing bugs
- Cover edge cases systematically
Validation selection:
- Exact match for critical data
- Regex for pattern matching
- AI judge for semantic evaluation
Performance:
- Exit early on critical failures
- Keep conversations focused (5-10 turns)
- Batch related tests together
Maintenance:
- Version control evaluations
- Review failures promptly
- Update tests with features
- Document test purpose clearly
CI/CD:
- Automate critical tests in pipelines
- Use staging for full suite validation
- Set quality gate thresholds
- Run regression suites regularly