Run, review, schedule and compare evaluation runs

Evaluations are the core of the testing process in Giskard Hub. They allow you to run your test datasets against your agents and evaluate their performance using the checks that you have defined.

The Giskard Hub provides a comprehensive evaluation system that supports:

  • Local evaluations: Run evaluations locally using development agents

  • Remote evaluations: Run evaluations in the Hub using deployed agents

  • Scheduled evaluations: Automatically run evaluations at specified intervals

In this section, we will walk you through how to run and manage evaluations using the Hub interface.

Tip

💡 When to execute your tests?

Depending on your AI lifecycle, you may have different reasons to execute your tests:

  • Development time: Compare agent versions during development and identify the right correction strategies for developers.

  • Deployment time: Perform non-regression testing in the CI/CD pipeline for DevOps.

  • Production time: Provide high-level reporting for business executives to stay informed about key vulnerabilities in a running agent.
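As an illustration of the deployment-time case, a CI job can turn evaluation results into a pass/fail signal. The sketch below is a minimal, self-contained example and does not use the Giskard Hub SDK: the `CheckResult` structure, the check names, and the 0.9 threshold are all hypothetical stand-ins for whatever results your pipeline exports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """Hypothetical per-check summary; not a Giskard Hub type."""
    check: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # Treat a check with no test cases as failing rather than dividing by zero.
        return self.passed / self.total if self.total else 0.0


def failing_checks(results: list[CheckResult], min_pass_rate: float) -> list[str]:
    """Return the names of checks whose pass rate is below the threshold."""
    return [r.check for r in results if r.pass_rate < min_pass_rate]


def gate_exit_code(results: list[CheckResult], min_pass_rate: float = 0.9) -> int:
    """Return 0 when every check meets the threshold, 1 otherwise."""
    return 1 if failing_checks(results, min_pass_rate) else 0


# Example data, invented for illustration.
example_results = [
    CheckResult(check="correctness", passed=47, total=50),
    CheckResult(check="groundedness", passed=41, total=50),
]
```

In a CI script, the value returned by `gate_exit_code` could be passed to `sys.exit` so that a failing check stops the pipeline.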

The following pages explain how to manage evaluations in Giskard Hub:

  • Run evaluations: Create, run, and review evaluations.

  • Schedule evaluations: Schedule evaluations to run automatically.

  • Compare evaluations: Compare evaluations to see if there are any regressions.
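To make the idea of a regression concrete, the sketch below compares per-check pass rates from two runs and reports the checks that got worse. It is self-contained and does not call the Hub: the dictionaries, check names, and `tolerance` parameter are hypothetical, and the Hub's comparison view may define regressions differently.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, float]:
    """Map each regressed check to its drop in pass rate (baseline - candidate).

    Only checks present in both runs are compared.
    """
    regressions = {}
    for check in baseline.keys() & candidate.keys():
        drop = baseline[check] - candidate[check]
        if drop > tolerance:
            # Round to avoid reporting floating-point noise.
            regressions[check] = round(drop, 4)
    return regressions


# Pass rates invented for illustration.
baseline_run = {"correctness": 0.94, "groundedness": 0.82, "conformity": 1.0}
candidate_run = {"correctness": 0.96, "groundedness": 0.70, "conformity": 1.0}
```

A non-zero `tolerance` can be useful when small run-to-run fluctuations should not be treated as regressions.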

High-level workflow

graph LR
    A([<a href="create.html" target="_self">Run Evaluation</a>]) --> B([<a href="index.html" target="_self">Review Results</a>])
    B --> C{Analysis}
    C -->|Compare Versions| D([<a href="compare.html" target="_self">Compare Evaluations</a>])
    C -->|Schedule Automation| E([<a href="schedule.html" target="_self">Schedule Evaluation</a>])
    D --> F{Next Steps}
    E --> F
    F -->|Iterate| A
    F -->|Fix Issues| G[<a href="../annotate/index.html" target="_self">Update Test Cases</a>]
    G --> A

Tip

Local evaluations are supported via the SDK. To run evaluations against local development agents, see Run local evaluations.