Run, review, schedule and compare evaluation runs

Evaluations are the core of the testing process in Giskard Hub. They allow you to run your test datasets against your agents and evaluate their performance using the checks that you have defined.

The Giskard Hub provides a comprehensive evaluation system that supports:

Local evaluations: Run evaluations locally using development agents
Remote evaluations: Run evaluations in the Hub using deployed agents
Scheduled evaluations: Automatically run evaluations at specified intervals

In this section, we will walk you through how to run and manage evaluations using the Hub interface.

Tip

💡 When to execute your tests?

Depending on the stage of your AI lifecycle, you may have different reasons to execute your tests:

Development time: Compare agent versions so developers can identify the right correction strategies.
Deployment time: Let DevOps teams perform non-regression testing in the CI/CD pipeline.
Production time: Give business executives high-level reporting on key vulnerabilities in a running agent.

Run evaluations

Create, run, and review evaluations.

Schedule evaluations

Schedule evaluations to run automatically.

Compare evaluations

Compare evaluations to see if there are any regressions.

High-level workflow

graph LR
    A([<a href="create.html" target="_self">Run Evaluation</a>]) --> B([<a href="index.html" target="_self">Review Results</a>])
    B --> C{Analysis}
    C -->|Compare Versions| D([<a href="compare.html" target="_self">Compare Evaluations</a>])
    C -->|Schedule Automation| E([<a href="schedule.html" target="_self">Schedule Evaluation</a>])
    D --> F{Next Steps}
    E --> F
    F -->|Iterate| A
    F -->|Fix Issues| G[<a href="../annotate/index.html" target="_self">Update Test Cases</a>]
    G --> A
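
The "Compare Versions" branch of the workflow above comes down to checking whether a newer run scores worse than an earlier one. The sketch below illustrates that idea in plain Python. It does not use the Giskard Hub SDK, and it is not how the Hub computes comparisons: the per-check pass-rate dictionaries, the `find_regressions` helper, the `CheckDelta` type, and the `tolerance` parameter are all hypothetical names chosen for illustration.

```python
from __future__ import annotations

from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckDelta:
    """Pass rates of one check in two evaluation runs (hypothetical type)."""

    check: str
    baseline: float   # pass rate in the earlier run, between 0.0 and 1.0
    candidate: float  # pass rate in the newer run, between 0.0 and 1.0

    @property
    def drop(self) -> float:
        # Positive when the newer run passes less often than the earlier one.
        return self.baseline - self.candidate


def find_regressions(
    baseline: Mapping[str, float],
    candidate: Mapping[str, float],
    tolerance: float = 0.02,
) -> list[CheckDelta]:
    """Return checks whose pass rate dropped by more than `tolerance`.

    Both arguments map a check name to its pass rate. This input shape is an
    assumption made for the sketch, not the format of a Hub export.
    """
    regressions = []
    for check, base_rate in baseline.items():
        cand_rate = candidate.get(check)
        if cand_rate is None:
            # The check was not evaluated in the newer run; nothing to compare.
            continue
        if base_rate - cand_rate > tolerance:
            regressions.append(CheckDelta(check, base_rate, cand_rate))
    # Largest drop first, so the most serious regression is listed at the top.
    return sorted(regressions, key=lambda delta: delta.drop, reverse=True)
```

In a CI/CD job, a non-empty result from `find_regressions` could be turned into a failing exit status (for example with `sys.exit(1)`), which is one way to implement the non-regression gate mentioned for deployment time.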

Tip

Local evaluations are supported via the SDK. To run evaluations against local development agents, see Run local evaluations.