
AI Agent Evaluation

Evaluations are the core of the testing process in Giskard Hub. They allow you to run your test datasets against your AI agents and systematically assess their performance, safety, and security using the checks that you have defined.

The Giskard Hub provides a comprehensive AI agent evaluation system that supports:

  • Local evaluations: Run evaluations locally using development agents
  • Remote evaluations: Run evaluations in the Hub using deployed agents
  • Scheduled evaluations: Automatically run evaluations at specified intervals
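To make the "local evaluation" mode concrete, here is a minimal, self-contained sketch of what such a run does conceptually: send each test prompt to a development agent and apply a check to its answer. All names here (`TestCase`, `evaluate_locally`, the agent function) are illustrative stand-ins, not the Giskard Hub SDK.

```python
# Conceptual sketch of a local evaluation loop.
# Everything here is hypothetical; it is NOT the Giskard Hub API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    prompt: str
    check: Callable[[str], bool]  # passes if the agent's answer satisfies it

def my_dev_agent(prompt: str) -> str:
    # stand-in for a locally running development agent
    return "Our refund policy allows returns within 30 days."

def evaluate_locally(agent: Callable[[str], str], dataset: list[TestCase]) -> float:
    # run every test case through the agent and return the pass rate
    results = [tc.check(agent(tc.prompt)) for tc in dataset]
    return sum(results) / len(results)

dataset = [
    TestCase("What is the refund policy?", lambda a: "30 days" in a),
    TestCase("Do you store my password?", lambda a: "password" not in a.lower()),
]

print(f"pass rate: {evaluate_locally(my_dev_agent, dataset):.0%}")  # → pass rate: 100%
```

Remote and scheduled evaluations follow the same logic, except the loop runs in the Hub against a deployed agent, either on demand or on a recurring schedule.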

In this section, we will walk you through how to run and manage evaluations using the Hub interface.

```mermaid
graph LR
    A([<a href="/hub/ui/evaluations/create" target="_self">Run Evaluation</a>]) --> B([<a href="/hub/ui/evaluations" target="_self">Review Results</a>])
    B --> C{Analysis}
    C -->|Compare Versions| D([<a href="/hub/ui/evaluations/compare" target="_self">Compare Evaluations</a>])
    C -->|Schedule Automation| E([<a href="/hub/ui/evaluations/schedule" target="_self">Schedule Evaluation</a>])
    D --> F{Next Steps}
    E --> F
    F -->|Iterate| A
    F -->|Fix Issues| G[<a href="/hub/ui/annotate" target="_self">Update Test Cases</a>]
    G --> A
```
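The "Compare Evaluations" step in the workflow above can be sketched as a simple diff of per-check pass rates between two runs: a regression on any check points you toward the "Fix Issues" branch, while a clean comparison lets you iterate. The data shapes and numbers below are hypothetical, not the Hub's result format.

```python
# Illustrative comparison of two evaluation runs by per-check pass rate.
# The run dictionaries and check names are invented for this sketch.
def regressions(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Return the checks whose pass rate dropped, with the (negative) delta."""
    return {
        check: candidate[check] - baseline[check]
        for check in baseline
        if check in candidate and candidate[check] < baseline[check]
    }

run_v1 = {"correctness": 0.90, "safety": 0.98, "conformity": 0.85}
run_v2 = {"correctness": 0.95, "safety": 0.92, "conformity": 0.85}

print(regressions(run_v1, run_v2))  # only "safety" regressed between the runs
```

In practice the Hub's comparison view surfaces this kind of per-check delta for you; the point of the sketch is just the decision rule: no regressions means iterate, any regression means update the affected test cases first.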