Evaluation Execution & Results Processing

  1. Evaluation Execution Engine: Run comprehensive assistant evaluations with EvalRun and CreateEvalRunDTO. Execute your mock conversations against live assistants and squads to validate performance and behavior in controlled environments (see the request sketch after this list).

  2. Multiple Evaluation Models: Choose from various AI models for LLM-as-a-judge evaluation:

    • EvalOpenAIModel: GPT models including GPT-4.1, o1-mini, o3, and regional variants
    • EvalAnthropicModel: Claude models with optional thinking features for complex evaluations
    • EvalGoogleModel: Gemini models from 1.0 Pro to 2.5 Pro for diverse evaluation needs
    • EvalGroqModel: High-speed inference models including Llama and custom options
    • EvalCustomModel: Your own evaluation models with custom endpoints
  3. Evaluation Results: Comprehensive result tracking with EvalRunResult:

    • status: Pass/fail evaluation outcomes
    • messages: Complete conversation transcript from the evaluation
    • startedAt and endedAt: Precise timing information for performance analysis
  4. Target Flexibility: Run evaluations against different targets: individual assistants or entire squads, each with optional configuration overrides (illustrated in the target sketch under Flexible Targets below).

  5. Evaluation Status Tracking: Monitor evaluation progress with detailed status information:

    • queued: Evaluation waiting to start
    • running: Evaluation in progress
    • ended: Evaluation completed
    • Detailed endedReason including success, error, timeout, and cancellation states
  6. Judge Configuration: Optimize evaluation accuracy with model-specific settings:

    • maxTokens: 50-10000 tokens recommended (a single token suffices for a bare pass/fail verdict)
    • temperature: 0-0.3 recommended for LLM-as-a-judge to reduce hallucinations
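A minimal sketch of what starting a run could look like over plain HTTP. The endpoint URL, auth header, and every DTO field below are assumptions for illustration; consult the API reference for the real CreateEvalRunDTO shape.

```typescript
// Hypothetical DTO shape -- only the type name comes from the changelog above.
interface CreateEvalRunDTO {
  evalId: string;       // the mock conversation / eval to execute (assumed field)
  assistantId?: string; // target an existing assistant... (assumed field)
  squadId?: string;     // ...or an entire squad (assumed field)
}

// POST the DTO and return the created EvalRun. URL and auth scheme are placeholders.
async function createEvalRun(dto: CreateEvalRunDTO, apiKey: string): Promise<{ id: string }> {
  const res = await fetch("https://api.example.com/eval/run", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(dto),
  });
  if (!res.ok) throw new Error(`eval run creation failed: ${res.status}`);
  return res.json();
}
```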

For LLM-as-a-judge evaluations, the judge model must respond with exactly “pass” or “fail”. Design your evaluation prompts to ensure clear, deterministic responses.
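For instance, a judge prompt and configuration tuned for that contract might look like the following. The field names are assumptions; the one-word pass/fail requirement and the maxTokens/temperature guidance come straight from the notes above.

```typescript
// Prompt engineered for a deterministic single-token verdict.
const judgeSystemPrompt = [
  "You are grading a voice-assistant conversation against a rubric.",
  "Respond with exactly one word: pass or fail.",
  "Do not add punctuation, explanation, or any other text.",
].join("\n");

// Hypothetical config object; the 1-token budget and 0 temperature follow
// the recommendations in item 6 above.
const judgeConfig = {
  systemPrompt: judgeSystemPrompt,
  maxTokens: 1,   // one token is all a bare "pass"/"fail" needs
  temperature: 0, // within the recommended 0-0.3 band
};
```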

Evaluation Capabilities

Multi-Model Support

Choose from OpenAI, Anthropic, Google, Groq, or custom models for evaluation, matching your quality and performance requirements.
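One way to picture the choice is as a discriminated union over providers. Only the Eval*Model type names and provider families come from this changelog; the field names and model strings below are illustrative placeholders.

```typescript
// Sketch of the judge-model options as a tagged union (assumed shape).
type EvalModel =
  | { provider: "openai"; model: string }                        // e.g. GPT-4.1, o1-mini, o3
  | { provider: "anthropic"; model: string; thinking?: boolean } // optional thinking features
  | { provider: "google"; model: string }                        // Gemini 1.0 Pro .. 2.5 Pro
  | { provider: "groq"; model: string }                          // high-speed Llama and custom
  | { provider: "custom"; url: string; model: string };          // your own endpoint

const judge: EvalModel = { provider: "anthropic", model: "claude-placeholder", thinking: true };
```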

Comprehensive Results

Detailed pass/fail results with complete conversation transcripts and timing information for thorough analysis.
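A sketch of consuming one result. The four field names come from the EvalRunResult description above; their exact types are assumptions.

```typescript
// Assumed field types for EvalRunResult; only the names are documented above.
interface EvalRunResult {
  status: "pass" | "fail";
  messages: Array<{ role: string; content: string }>;
  startedAt: string; // assumed ISO 8601 timestamps
  endedAt: string;
}

// Derive a one-line summary: outcome, wall-clock duration, transcript length.
function summarize(result: EvalRunResult): string {
  const ms = new Date(result.endedAt).getTime() - new Date(result.startedAt).getTime();
  return `${result.status} in ${ms} ms across ${result.messages.length} messages`;
}
```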

Flexible Targets

Test individual assistants or entire squads with optional configuration overrides for comprehensive validation.
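Illustrative target payloads, assuming hypothetical names for the ID and override fields:

```typescript
// Run against a single existing assistant (IDs are placeholders).
const assistantTarget = { assistantId: "asst_123" };

// Run against a whole squad, overriding one setting for the test only.
// "assistantOverrides" and "firstMessage" are assumed names, not documented ones.
const squadTarget = {
  squadId: "squad_456",
  assistantOverrides: { firstMessage: "Hi, this is a scripted test run." },
};
```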

Status Monitoring

Real-time evaluation status tracking with detailed reason codes for failures, timeouts, and cancellations.
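A polling sketch under the same assumptions as earlier (hypothetical GET endpoint and response shape); the queued/running/ended statuses and the endedReason categories are the documented ones.

```typescript
type EvalRunStatus = "queued" | "running" | "ended";

// Poll until the run ends, then surface non-success endedReason values
// (error, timeout, cancellation) as exceptions.
async function waitForEvalRun(id: string, apiKey: string) {
  for (;;) {
    const res = await fetch(`https://api.example.com/eval/run/${id}`, {
      headers: { Authorization: `Bearer ${apiKey}` },
    });
    const run = (await res.json()) as { status: EvalRunStatus; endedReason?: string };
    if (run.status === "ended") {
      if (run.endedReason !== "success") throw new Error(`eval ended: ${run.endedReason}`);
      return run;
    }
    await new Promise((resolve) => setTimeout(resolve, 2000)); // check every 2 s
  }
}
```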