
What's New?

September 28, 2025

Evaluation Execution & Results Processing

  1. Evaluation Execution Engine: Run comprehensive assistant evaluations with EvalRun and CreateEvalRunDTO. Execute your mock conversations against live assistants and squads to validate performance and behavior in controlled environments.

  2. Multiple Evaluation Models: Choose from various AI models for LLM-as-a-judge evaluation:

    • EvalOpenAIModel: OpenAI models including GPT-4.1, o1-mini, o3, and regional variants
    • EvalAnthropicModel: Claude models with optional thinking features for complex evaluations
    • EvalGoogleModel: Gemini models from 1.0 Pro to 2.5 Pro for diverse evaluation needs
    • EvalGroqModel: High-speed inference models including Llama and custom options
    • EvalCustomModel: Your own evaluation models with custom endpoints
  3. Evaluation Results: Comprehensive result tracking with EvalRunResult:

    • status: Pass/fail evaluation outcomes
    • messages: Complete conversation transcript from the evaluation
    • startedAt and endedAt: Precise timing information for performance analysis
  4. Target Flexibility: Run evaluations against different targets:

    • EvalRunTargetAssistant: Test individual assistants with optional overrides
    • EvalRunTargetSquad: Evaluate entire squad performance and coordination
  5. Evaluation Status Tracking: Monitor evaluation progress with detailed status information:

    • running: Evaluation in progress
    • ended: Evaluation completed
    • queued: Evaluation waiting to start
    • Detailed endedReason including success, error, timeout, and cancellation states
  6. Judge Configuration: Optimize evaluation accuracy with model-specific settings:

    • maxTokens: Recommended range is 50-10000 tokens; 1 token is enough for simple pass/fail responses
    • temperature: 0-0.3 recommended for LLM-as-a-judge to reduce hallucinations
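
As a minimal sketch of the judge recommendations above, a client-side check could flag settings outside the suggested ranges before a run is created. The `JudgeSettings` class and `check_judge_settings` function are hypothetical helpers written for illustration, not SDK types; only the `maxTokens` and `temperature` names and their ranges come from the list above.

```python
from dataclasses import dataclass


@dataclass
class JudgeSettings:
    """Hypothetical local container for judge settings; not an SDK type."""
    max_tokens: int      # corresponds to maxTokens
    temperature: float   # corresponds to temperature


def check_judge_settings(settings: JudgeSettings) -> list[str]:
    """Return a warning for each value outside the recommended range."""
    warnings = []
    # 1 token is called out as enough for simple pass/fail responses,
    # so it is accepted alongside the 50-10000 range.
    if settings.max_tokens != 1 and not 50 <= settings.max_tokens <= 10000:
        warnings.append("maxTokens outside recommended 50-10000 range")
    if not 0 <= settings.temperature <= 0.3:
        warnings.append("temperature outside recommended 0-0.3 range")
    return warnings
```

An empty list means both values fall inside the recommended ranges; the check is advisory and does not reflect any validation the platform itself performs.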

For LLM-as-a-judge evaluations, the judge model must respond with exactly “pass” or “fail”. Design your evaluation prompts to ensure clear, deterministic responses.
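
When testing a judge prompt locally, a strict check like the one sketched here can catch responses that drift from the required format. The `parse_verdict` function is a hypothetical helper, and it takes the strictest reading of "exactly": whether the platform tolerates different casing, punctuation, or surrounding whitespace is not stated here, so the sketch rejects all of them.

```python
from typing import Optional


def parse_verdict(raw: str) -> Optional[bool]:
    """Return True for "pass", False for "fail", and None for anything else."""
    if raw == "pass":
        return True
    if raw == "fail":
        return False
    # Any other text ("Pass.", "pass\n", an explanation) counts as invalid.
    return None
```

A `None` result during prompt testing suggests the judge prompt needs a tighter instruction about its output format.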

Evaluation Capabilities

Multi-Model Support

Choose from OpenAI, Anthropic, Google, Groq, or custom models for evaluation, matching your quality and performance requirements.

Comprehensive Results

Detailed pass/fail results with complete conversation transcripts and timing information for thorough analysis.

Flexible Targets

Test individual assistants or entire squads with optional configuration overrides for comprehensive validation.

Status Monitoring

Real-time evaluation status tracking with detailed reason codes for failures, timeouts, and cancellations.
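
Status tracking can be consumed with a simple polling loop, sketched below under stated assumptions. The `fetch_run` callable is hypothetical: it stands in for whatever API or SDK call returns the current run, and the run is assumed to arrive as a dict with a `status` key. Only the `queued`, `running`, and `ended` status values are taken from this changelog.

```python
import time
from typing import Callable


def wait_for_eval_run(
    fetch_run: Callable[[], dict],
    poll_seconds: float = 2.0,
    max_polls: int = 150,
) -> dict:
    """Poll until the run reports status "ended", then return that run."""
    for _ in range(max_polls):
        run = fetch_run()  # hypothetical: wraps the real retrieval call
        status = run.get("status")
        if status == "ended":
            return run
        if status not in ("queued", "running"):
            raise ValueError(f"unexpected status: {status!r}")
        time.sleep(poll_seconds)
    raise TimeoutError("evaluation run did not end within the polling budget")
```

Once the run is returned, its `endedReason` can be inspected to tell a successful completion apart from an error, timeout, or cancellation.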