Master testing strategies and best practices for production AI agents
Overview
This guide covers advanced evaluation strategies, testing patterns, and best practices for building robust test suites that ensure your AI agents work reliably in production.
"content": "Evaluate: {{messages[-1]}}\n\nPASS if:\n- Response acknowledges the time is unavailable\n- Response offers alternatives or asks for different time\n- Tone remains helpful (not apologetic to excess)\n\nFAIL if:\n- Response ignores the error\n- Response doesn't offer next steps\n- Tone is frustrated or rude\n\nOutput: pass or fail"
}
]
}
}
}
]
}
Invalid input handling:
{
  "name": "Error Handling - Invalid Date Format",
  "messages": [
    {
      "role": "user",
      "content": "Book me for the 45th of Octember"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            {
              "role": "system",
              "content": "PASS if response politely asks for valid date without mocking user. Output: pass or fail"
            }
          ]
        }
      }
    }
  ]
}
Look at results[0].messages to see the complete interaction:
- What did the user actually say?
- How did the assistant respond?
- Were tool calls made correctly?
- Did tool responses contain expected data?
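Reading the interaction is easier when each message is printed on its own line. The following is a minimal sketch, not part of any official SDK: the function name `format_transcript` is hypothetical, and it assumes the run payload is a dict with a `results` list whose first entry holds a `messages` list of objects with `role` and, optionally, `content` keys. Adjust the key names to the payload you actually receive.

```python
import json


def format_transcript(run: dict) -> str:
    """Render results[0].messages as one line per message."""
    # Assumption: run["results"][0]["messages"] is a list of dicts that
    # carry a "role" key and, for text messages, a "content" key.
    lines = []
    for index, message in enumerate(run["results"][0]["messages"]):
        role = message.get("role", "?")
        content = message.get("content")
        if content is None:
            # No text content (for example a tool call): dump the remaining
            # fields as JSON so nothing is hidden from the reader.
            rest = {k: v for k, v in message.items() if k != "role"}
            content = json.dumps(rest, sort_keys=True)
        lines.append(f"[{index}] {role}: {content}")
    return "\n".join(lines)
```

Printing the result for a failed run answers the four questions above in one pass.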
For exact match failures:
- Check for extra spaces or newlines
- Verify punctuation matches exactly
- Look for case sensitivity issues

For tool call failures:
- Verify argument types (string vs number)
- Check for extra/missing arguments
- Validate argument values
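The exact match checks above can be automated with a small local helper. This is a sketch under the assumption that you have the expected and actual strings in hand; `diagnose_exact_mismatch` is a hypothetical name, not a platform function. It only reports a cause when that cause alone explains the difference, and otherwise points at the first differing character.

```python
import string

# Translation table that deletes ASCII punctuation
_PUNCTUATION = str.maketrans("", "", string.punctuation)


def diagnose_exact_mismatch(expected: str, actual: str) -> list[str]:
    """Return likely reasons why two strings fail an exact comparison."""
    if expected == actual:
        return []
    reasons = []
    # Collapse all runs of whitespace (including newlines) before comparing
    if " ".join(expected.split()) == " ".join(actual.split()):
        reasons.append("whitespace or newlines")
    if expected.lower() == actual.lower():
        reasons.append("case")
    if expected.translate(_PUNCTUATION) == actual.translate(_PUNCTUATION):
        reasons.append("punctuation")
    if not reasons:
        # No single cause explains it: report where the strings diverge
        limit = min(len(expected), len(actual))
        index = next((i for i in range(limit) if expected[i] != actual[i]), limit)
        reasons.append(f"content differs at index {index}")
    return reasons
```

For example, a trailing newline in the assistant's reply is reported as "whitespace or newlines", which usually means the expected value or the prompt needs adjusting rather than the assistant's logic.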
For regex:
- Test pattern with online validators
- Try pattern against actual response
- Check for escaped special characters

For AI judge:
- Test prompt with known good/bad examples
- Verify binary pass/fail criteria
- Check for ambiguous requirements
Check whether the issue is with the assistant or with the eval validation
Common failure patterns
Exact match fails with similar text
Problem: Expected “Hello, how can I help?” but got “Hello, how may I help?”
Solutions:
- Switch to regex for flexibility: Hello, how (can|may) I help\?
- Use AI judge for semantic matching
- Update expected value if new phrasing is acceptable
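Before switching an eval to regex, it can help to confirm locally that the pattern accepts every phrasing you consider valid. The sketch below uses Python's `re` module with the pattern from the first solution. Whether the platform anchors the pattern to the whole response or searches within it is an assumption to verify; `fullmatch` is used here as the stricter choice, and `matches_greeting` is a hypothetical helper name.

```python
import re

# Pattern from the solution above; "\?" matches a literal question mark
GREETING = re.compile(r"Hello, how (can|may) I help\?")


def matches_greeting(response: str) -> bool:
    """True when the whole response is one of the accepted greetings."""
    # Assumption: the entire response must match. Use GREETING.search
    # instead if the greeting may appear inside a longer reply.
    return GREETING.fullmatch(response) is not None
```

Both "Hello, how can I help?" and "Hello, how may I help?" pass this check, while a phrasing outside the alternation, such as "Hello, how might I help?", still fails.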
Tool calls don't match
Problem: Arguments have different types or extra fields
Solutions:
- Check argument types: "14:00" (string) vs 14 (number)
- Use partial matching: omit arguments to match only function name
- Normalize data formats in tool implementation
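To see exactly which argument causes a mismatch, a local comparison along these lines can be useful. It is a sketch, not the platform's matching logic: `tool_call_problems` is a hypothetical helper, and it assumes each tool call is a dict with a `name` and an optional `arguments` dict. It mirrors the partial matching idea above by comparing only the function name when the expected call omits `arguments`.

```python
def tool_call_problems(expected: dict, actual: dict) -> list[str]:
    """Return a sorted list of mismatches; an empty list means a match."""
    if expected["name"] != actual.get("name"):
        return [f"name: expected {expected['name']!r}, got {actual.get('name')!r}"]
    # Partial matching: with no expected arguments, only the name is compared
    if "arguments" not in expected:
        return []
    expected_args = expected["arguments"]
    actual_args = actual.get("arguments", {})
    problems = []
    for key in expected_args.keys() - actual_args.keys():
        problems.append(f"{key}: missing argument")
    for key in actual_args.keys() - expected_args.keys():
        problems.append(f"{key}: unexpected extra argument")
    for key in expected_args.keys() & actual_args.keys():
        want, got = expected_args[key], actual_args[key]
        if type(want) is not type(got):
            # Catches cases such as "14:00" (str) versus 14 (int)
            problems.append(
                f"{key}: type mismatch (expected {type(want).__name__}, "
                f"got {type(got).__name__})"
            )
        elif want != got:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    return sorted(problems)
```

Checking types before values keeps a string-versus-number mismatch from being reported as a plain value difference.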
AI judge inconsistent results
Problem: Same response sometimes passes, sometimes fails
Solutions:
- Make criteria more specific and binary
- Add explicit examples of pass/fail cases in the prompt
- Use temperature=0 to reduce run-to-run variation in the judge's output
- Switch to regex if pattern-based validation works
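One way to measure inconsistency is to run the judge several times over responses whose correct verdict you already know. The sketch below is provider-agnostic by design: `judge` is any callable you supply that sends a response to your judge prompt and returns "pass" or "fail", and `judge_consistency` is a hypothetical helper name. No real API is called here.

```python
from collections import Counter
from typing import Callable


def judge_consistency(
    judge: Callable[[str], str],
    examples: list[tuple[str, str]],
    repeats: int = 5,
) -> list[dict]:
    """Run the judge `repeats` times per labeled example and summarize."""
    report = []
    for response, expected in examples:
        # Normalize verdicts so "Pass\n" and "pass" count as the same answer
        verdicts = Counter(judge(response).strip().lower() for _ in range(repeats))
        report.append({
            "response": response,
            "expected": expected,
            "verdicts": dict(verdicts),
            "stable": len(verdicts) == 1,  # same verdict on every run
            "correct": verdicts.get(expected, 0) == repeats,
        })
    return report
```

An entry with `stable` set to false points to ambiguous criteria, while an entry that is stable but not `correct` suggests the prompt is clear but encodes the wrong rule.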
Test times out or hangs
Problem: Eval status stuck in “running”
Solutions:
- Check assistant configuration for errors
- Verify tool endpoints are accessible
- Reduce conversation complexity
- Check for infinite loops in assistant logic
Regex doesn't match expected
Problem: Pattern seems correct but fails
Solutions:
- Escape special regex characters: ., ?, *, +, (, )
- Use .* for flexible matching around keywords
- Test pattern with online regex validators
- Check for hidden characters or unicode
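Two of these checks are easy to run locally with Python's standard library. The sketch below uses `re.escape` to build a pattern that matches a string literally, and `unicodedata` to list characters that are invisible or easy to confuse. The helper names are hypothetical, and the sketch assumes the platform's regex syntax is close enough to Python's for the escaping to carry over.

```python
import re
import unicodedata


def literal_pattern(text: str) -> str:
    """Escape regex metacharacters so the pattern matches `text` literally."""
    return re.escape(text)


def hidden_characters(text: str) -> list[tuple[int, str]]:
    """List (index, name) for non-ASCII and control characters in `text`."""
    found = []
    for index, char in enumerate(text):
        is_control = unicodedata.category(char) == "Cc" and char not in "\n\t"
        if ord(char) > 127 or is_control:
            # Fall back to the code point when the character has no name
            found.append((index, unicodedata.name(char, f"U+{ord(char):04X}")))
    return found
```

Running `hidden_characters` on the actual response reveals characters such as zero-width spaces or curly quotes that make a visually identical pattern fail. Legitimate accented letters are listed too, so treat the output as items to review rather than errors.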
Debugging tools and techniques
Use structured logging:
Track eval executions systematically:
{
  "timestamp": "2024-01-15T10:30:00Z",
  "evalId": "eval-123",
  "evalName": "Booking Flow Test",
  "runId": "run-456",
  "target": "assistant-789",
  "result": "fail",
  "failedStep": 3,
  "failureReason": "Tool call mismatch",
  "actualBehavior": "Called cancelAppointment instead of bookAppointment"
}
When an eval fails, check:
- [ ] endedReason is “mockConversation.done”
- [ ] Assistant works correctly in manual testing
- [ ] Tool endpoints are accessible
- [ ] Validation criteria match actual behavior
- [ ] Regex patterns are properly escaped
- [ ] AI judge prompts are specific and binary
- [ ] Arguments match expected types (string vs number)
- [ ] API keys and permissions are valid
- [ ] No rate limits or quota issues