Evaluation Execution & Results Processing

  1. Evaluation Execution Engine: Run comprehensive assistant evaluations with EvalRun and CreateEvalRunDTO. Execute your mock conversations against live assistants and squads to validate performance and behavior in controlled environments (see the request sketch after this list).

  2. Multiple Evaluation Models: Choose from various AI models for LLM-as-a-judge evaluation:

    • EvalOpenAIModel: GPT models including GPT-4.1, o1-mini, o3, and regional variants
    • EvalAnthropicModel: Claude models with optional thinking features for complex evaluations
    • EvalGoogleModel: Gemini models from 1.0 Pro to 2.5 Pro for diverse evaluation needs
    • EvalGroqModel: High-speed inference models including Llama and custom options
    • EvalCustomModel: Your own evaluation models with custom endpoints
  3. Evaluation Results: Comprehensive result tracking with EvalRunResult:

    • status: Pass/fail evaluation outcomes
    • messages: Complete conversation transcript from the evaluation
    • startedAt and endedAt: Precise timing information for performance analysis
  4. Target Flexibility: Run evaluations against different targets: individual assistants or entire squads, each with optional configuration overrides (illustrated in the target sketch under Flexible Targets below).

  5. Evaluation Status Tracking: Monitor evaluation progress with detailed status information:

    • queued: Evaluation waiting to start
    • running: Evaluation in progress
    • ended: Evaluation completed
    • Detailed endedReason including success, error, timeout, and cancellation states
  6. Judge Configuration: Optimize evaluation accuracy with model-specific settings:

    • maxTokens: 50-10000 tokens recommended (a single token suffices for a bare pass/fail verdict)
    • temperature: 0-0.3 recommended for LLM-as-a-judge to reduce hallucinations
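A minimal sketch of what starting a run could look like over plain HTTP. The endpoint URL, auth header, and every DTO field below are assumptions for illustration; consult the API reference for the real CreateEvalRunDTO shape.

```typescript
// Hypothetical DTO shape -- only the type name comes from the changelog above.
interface CreateEvalRunDTO {
  evalId: string;       // the mock conversation / eval to execute (assumed field)
  assistantId?: string; // target an existing assistant... (assumed field)
  squadId?: string;     // ...or an entire squad (assumed field)
}

// POST the DTO and return the created EvalRun. URL and auth scheme are placeholders.
async function createEvalRun(dto: CreateEvalRunDTO, apiKey: string): Promise<{ id: string }> {
  const res = await fetch("https://api.example.com/eval/run", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(dto),
  });
  if (!res.ok) throw new Error(`eval run creation failed: ${res.status}`);
  return res.json();
}
```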

For LLM-as-a-judge evaluations, the judge model must respond with exactly “pass” or “fail”. Design your evaluation prompts to ensure clear, deterministic responses.
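For instance, a judge prompt and configuration tuned for that contract might look like the following. The field names are assumptions; the one-word pass/fail requirement and the maxTokens/temperature guidance come straight from the notes above.

```typescript
// Prompt engineered for a deterministic single-token verdict.
const judgeSystemPrompt = [
  "You are grading a voice-assistant conversation against a rubric.",
  "Respond with exactly one word: pass or fail.",
  "Do not add punctuation, explanation, or any other text.",
].join("\n");

// Hypothetical config object; the 1-token budget and 0 temperature follow
// the recommendations in item 6 above.
const judgeConfig = {
  systemPrompt: judgeSystemPrompt,
  maxTokens: 1,   // one token is all a bare "pass"/"fail" needs
  temperature: 0, // within the recommended 0-0.3 band
};
```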

Evaluation Capabilities

Multi-Model Support

Choose from OpenAI, Anthropic, Google, Groq, or custom models for evaluation, matching your quality and performance requirements.
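One way to picture the choice is as a discriminated union over providers. Only the Eval*Model type names and provider families come from this changelog; the field names and model strings below are illustrative placeholders.

```typescript
// Sketch of the judge-model options as a tagged union (assumed shape).
type EvalModel =
  | { provider: "openai"; model: string }                        // e.g. GPT-4.1, o1-mini, o3
  | { provider: "anthropic"; model: string; thinking?: boolean } // optional thinking features
  | { provider: "google"; model: string }                        // Gemini 1.0 Pro .. 2.5 Pro
  | { provider: "groq"; model: string }                          // high-speed Llama and custom
  | { provider: "custom"; url: string; model: string };          // your own endpoint

const judge: EvalModel = { provider: "anthropic", model: "claude-placeholder", thinking: true };
```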

Comprehensive Results

Detailed pass/fail results with complete conversation transcripts and timing information for thorough analysis.
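A sketch of consuming one result. The four field names come from the EvalRunResult description above; their exact types are assumptions.

```typescript
// Assumed field types for EvalRunResult; only the names are documented above.
interface EvalRunResult {
  status: "pass" | "fail";
  messages: Array<{ role: string; content: string }>;
  startedAt: string; // assumed ISO 8601 timestamps
  endedAt: string;
}

// Derive a one-line summary: outcome, wall-clock duration, transcript length.
function summarize(result: EvalRunResult): string {
  const ms = new Date(result.endedAt).getTime() - new Date(result.startedAt).getTime();
  return `${result.status} in ${ms} ms across ${result.messages.length} messages`;
}
```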

Flexible Targets

Test individual assistants or entire squads with optional configuration overrides for comprehensive validation.
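Illustrative target payloads, assuming hypothetical names for the ID and override fields:

```typescript
// Run against a single existing assistant (IDs are placeholders).
const assistantTarget = { assistantId: "asst_123" };

// Run against a whole squad, overriding one setting for the test only.
// "assistantOverrides" and "firstMessage" are assumed names, not documented ones.
const squadTarget = {
  squadId: "squad_456",
  assistantOverrides: { firstMessage: "Hi, this is a scripted test run." },
};
```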

Status Monitoring

Real-time evaluation status tracking with detailed reason codes for failures, timeouts, and cancellations.
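A polling sketch under the same assumptions as earlier (hypothetical GET endpoint and response shape); the queued/running/ended statuses and the endedReason categories are the documented ones.

```typescript
type EvalRunStatus = "queued" | "running" | "ended";

// Poll until the run ends, then surface non-success endedReason values
// (error, timeout, cancellation) as exceptions.
async function waitForEvalRun(id: string, apiKey: string) {
  for (;;) {
    const res = await fetch(`https://api.example.com/eval/run/${id}`, {
      headers: { Authorization: `Bearer ${apiKey}` },
    });
    const run = (await res.json()) as { status: EvalRunStatus; endedReason?: string };
    if (run.status === "ended") {
      if (run.endedReason !== "success") throw new Error(`eval ended: ${run.endedReason}`);
      return run;
    }
    await new Promise((resolve) => setTimeout(resolve, 2000)); // check every 2 s
  }
}
```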