Evaluation Execution & Results Processing
- **Evaluation Execution Engine**: Run comprehensive assistant evaluations with `EvalRun` and `CreateEvalRunDTO`. Execute your mock conversations against live assistants and squads to validate performance and behavior in controlled environments.
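As a minimal sketch, creating an eval run amounts to assembling a `CreateEvalRunDTO`-shaped request body. The exact field names and payload shape below are assumptions for illustration, not taken from the actual API reference:

```python
# Hypothetical sketch of a CreateEvalRunDTO-style payload.
# Field names ("target", "model", "assistantId") are assumptions.
def build_eval_run_payload(assistant_id: str, judge_model: dict) -> dict:
    """Assemble a request body for creating an eval run."""
    return {
        # Point the run at a single assistant (see target types below)
        "target": {"type": "assistant", "assistantId": assistant_id},
        # LLM-as-a-judge model configuration (see model types below)
        "model": judge_model,
    }

payload = build_eval_run_payload(
    "assistant-123",
    {"provider": "openai", "model": "gpt-4.1", "temperature": 0},
)
print(payload["target"]["type"])  # → assistant
```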
- **Multiple Evaluation Models**: Choose from various AI models for LLM-as-a-judge evaluation:
  - `EvalOpenAIModel`: GPT models including GPT-4.1, o1-mini, o3, and regional variants
  - `EvalAnthropicModel`: Claude models with optional thinking features for complex evaluations
  - `EvalGoogleModel`: Gemini models from 1.0 Pro to 2.5 Pro for diverse evaluation needs
  - `EvalGroqModel`: High-speed inference models including Llama and custom options
  - `EvalCustomModel`: Your own evaluation models with custom endpoints
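A hedged sketch of selecting among these judge-model families. The provider keys, config fields, and the `gemini-2.5-pro` model slug are illustrative assumptions; only the provider families themselves come from the list above:

```python
# Assumed provider slugs for the five judge-model families.
SUPPORTED_PROVIDERS = {"openai", "anthropic", "google", "groq", "custom-llm"}

def judge_model_config(provider: str, model: str, **extra) -> dict:
    """Build a judge-model config dict, rejecting unknown providers."""
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unsupported judge provider: {provider}")
    return {"provider": provider, "model": model, **extra}

# Gemini 2.5 Pro as judge (model slug is an assumption)
cfg = judge_model_config("google", "gemini-2.5-pro")

# A custom model with its own endpoint ("url" field is an assumption)
custom = judge_model_config("custom-llm", "my-eval-model",
                            url="https://example.com/v1")
```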
- **Evaluation Results**: Comprehensive result tracking with `EvalRunResult`:
  - `status`: Pass/fail evaluation outcomes
  - `messages`: Complete conversation transcript from the evaluation
  - `startedAt` and `endedAt`: Precise timing information for performance analysis
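The result fields above can be summarized for analysis like so. This is a sketch assuming a dict-shaped result with ISO-8601 timestamps; only the field names come from the list above:

```python
from datetime import datetime

def summarize_result(result: dict) -> dict:
    """Condense an EvalRunResult-shaped dict into pass/fail, duration, and turn count."""
    started = datetime.fromisoformat(result["startedAt"])
    ended = datetime.fromisoformat(result["endedAt"])
    return {
        "passed": result["status"] == "pass",
        "durationSeconds": (ended - started).total_seconds(),
        "turns": len(result["messages"]),
    }

summary = summarize_result({
    "status": "pass",
    "messages": [{"role": "assistant", "content": "Hello!"}],
    "startedAt": "2025-01-01T00:00:00+00:00",
    "endedAt": "2025-01-01T00:00:42+00:00",
})
print(summary["durationSeconds"])  # → 42.0
```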
- **Target Flexibility**: Run evaluations against different targets:
  - `EvalRunTargetAssistant`: Test individual assistants with optional overrides
  - `EvalRunTargetSquad`: Evaluate entire squad performance and coordination
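The two target shapes can be sketched as small builder helpers. The discriminator key (`"type"`) and field names are assumptions; only the assistant/squad distinction and the optional overrides come from the list above:

```python
from typing import Optional

def assistant_target(assistant_id: str, overrides: Optional[dict] = None) -> dict:
    """EvalRunTargetAssistant-shaped dict; overrides are optional per-run tweaks."""
    target = {"type": "assistant", "assistantId": assistant_id}
    if overrides:
        target["assistantOverrides"] = overrides  # contents are illustrative
    return target

def squad_target(squad_id: str) -> dict:
    """EvalRunTargetSquad-shaped dict for evaluating a whole squad."""
    return {"type": "squad", "squadId": squad_id}
```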
- **Evaluation Status Tracking**: Monitor evaluation progress with detailed status information:
  - `queued`: Evaluation waiting to start
  - `running`: Evaluation in progress
  - `ended`: Evaluation completed
  - Detailed `endedReason` values including success, error, timeout, and cancellation states
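The queued/running/ended lifecycle above lends itself to a simple polling loop. `fetch_run` here is a hypothetical stand-in for whatever client call retrieves an `EvalRun`; the status values come from the list above:

```python
import time

def wait_for_eval(fetch_run, run_id: str,
                  poll_seconds: float = 2.0, timeout: float = 300.0) -> dict:
    """Poll a run until its status reaches "ended", or raise on client-side timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        run = fetch_run(run_id)
        if run["status"] == "ended":
            # Inspect run["endedReason"] for success/error/timeout/cancellation
            return run
        time.sleep(poll_seconds)
    raise TimeoutError(f"eval run {run_id} did not end within {timeout}s")
```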
- **Judge Configuration**: Optimize evaluation accuracy with model-specific settings:
  - `maxTokens`: Recommended 50-10000 tokens (1 token suffices for simple pass/fail responses)
  - `temperature`: 0-0.3 recommended for LLM-as-a-judge to reduce hallucinations

For LLM-as-a-judge evaluations, the judge model must respond with exactly "pass" or "fail". Design your evaluation prompts to ensure clear, deterministic responses.
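Putting the judge settings and the pass/fail contract together, a minimal sketch (field names assumed; the recommendations come from the settings above):

```python
# Judge settings per the recommendations above; key names are assumptions.
JUDGE_SETTINGS = {
    "maxTokens": 1,      # 1 token is enough for a bare pass/fail verdict
    "temperature": 0.0,  # 0-0.3 recommended to reduce hallucinations
}

def parse_verdict(raw: str) -> bool:
    """Enforce the pass/fail contract, normalizing only whitespace and case."""
    verdict = raw.strip().lower()
    if verdict not in ("pass", "fail"):
        raise ValueError(f"judge must answer exactly 'pass' or 'fail', got: {raw!r}")
    return verdict == "pass"

print(parse_verdict("pass"))  # → True
```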
Evaluation Capabilities

- Choose from OpenAI, Anthropic, Google, Groq, or custom models for evaluation, matching your quality and performance requirements.
- Detailed pass/fail results with complete conversation transcripts and timing information for thorough analysis.
- Test individual assistants or entire squads with optional configuration overrides for comprehensive validation.
- Real-time evaluation status tracking with detailed reason codes for failures, timeouts, and cancellations.