Compare evaluations
Comparing evaluations is a crucial part of maintaining and improving your LLM agents over time. By comparing results across different versions, datasets, or time periods, you can:
- Detect regressions: Identify when agent performance has degraded (see the sketch after this list)
- Track improvements: Measure the impact of changes and optimizations
- Maintain quality standards: Ensure consistent performance across deployments
- Make data-driven decisions: Use metrics to guide development priorities
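For instance, regression detection can be reduced to a simple threshold check in CI. Below is a minimal sketch, assuming you have already extracted each run's overall success rate; the field names and the 2% tolerance are illustrative, not part of the Hub API:

```python
# Minimal sketch: flag a regression when the candidate run's success
# rate drops below the baseline by more than a tolerance.
# `baseline` and `candidate` are hypothetical result dicts, not Hub objects.

def detect_regression(baseline: dict, candidate: dict, tolerance: float = 0.02) -> bool:
    """Return True if the candidate success rate dropped beyond `tolerance`."""
    drop = baseline["success_rate"] - candidate["success_rate"]
    return drop > tolerance

baseline = {"name": "agent-v1", "success_rate": 0.91}
candidate = {"name": "agent-v2", "success_rate": 0.84}

if detect_regression(baseline, candidate):
    print(f"Regression: {candidate['name']} dropped "
          f"{baseline['success_rate'] - candidate['success_rate']:.0%} "
          f"vs {baseline['name']}")
```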
In this section, we will walk you through how to compare evaluations in Giskard Hub.
How to compare evaluations
On the Evaluations page, select at least two evaluations to compare, then click the "Compare" button in the top right corner of the table. The page will display a comparison of the selected evaluations.
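You can also retrieve evaluation runs programmatically with the Giskard Hub Python SDK and compare them outside the UI. A minimal sketch, assuming a `HubClient` configured with your Hub URL and API key; the `evaluations.list` call and its arguments are illustrative, so check the SDK reference for the exact names in your version:

```python
# Hedged sketch: fetch two evaluation runs to compare side by side.
# The `evaluations.list` method and its arguments below are assumptions;
# verify them against the SDK reference for your installed version.
from giskard_hub import HubClient

client = HubClient(
    hub_url="https://your-instance.giskard.ai",  # hypothetical instance URL
    api_key="GSK_API_KEY",
)

runs = client.evaluations.list(project_id="my-project-id")  # hypothetical call
baseline, candidate = runs[0], runs[1]

print(baseline.name, getattr(baseline, "metrics", None))
print(candidate.name, getattr(candidate, "metrics", None))
```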

Understanding the comparison view
First, the view shows the success rate: the percentage of conversations that passed all checks in each evaluation. It also displays the pass rate for each individual check. Below that, a table lists the conversations, which can be filtered by result, for example to show only conversations that passed or failed the checks in both evaluations.
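To make these figures concrete, here is a minimal sketch of the arithmetic behind the comparison view, using hypothetical per-conversation records in which each record maps a check name to a pass/fail result:

```python
# Sketch of the comparison-view arithmetic with hypothetical records.
# Each conversation result maps check names to a pass/fail boolean.
from collections import defaultdict

results_v1 = [
    {"correctness": True,  "no_toxicity": True},
    {"correctness": False, "no_toxicity": True},
]
results_v2 = [
    {"correctness": True, "no_toxicity": True},
    {"correctness": True, "no_toxicity": False},
]

def success_rate(results):
    """Share of conversations where every check passed."""
    return sum(all(r.values()) for r in results) / len(results)

def per_check_rates(results):
    """Pass rate of each individual check across conversations."""
    totals = defaultdict(int)
    for r in results:
        for check, passed in r.items():
            totals[check] += passed
    return {check: count / len(results) for check, count in totals.items()}

for name, res in [("v1", results_v1), ("v2", results_v2)]:
    print(name, f"{success_rate(res):.0%}", per_check_rates(res))
```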
Conversation-level analysis
Clicking on a conversation will show a detailed comparison.

Within this comparison, you can explore the agent's performance on a specific conversation, together with the associated check results and metrics.
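The conversation-level view is particularly useful for spotting flips, i.e. conversations that pass in one evaluation but fail in the other. A small sketch of that filter, again over hypothetical data keyed by conversation ID:

```python
# Sketch: find conversations whose overall result flipped between runs.
# `outcomes_*` are hypothetical mappings of conversation ID -> passed?
outcomes_v1 = {"conv-1": True, "conv-2": False, "conv-3": True}
outcomes_v2 = {"conv-1": True, "conv-2": True, "conv-3": False}

flips = {
    conv_id: (outcomes_v1[conv_id], outcomes_v2[conv_id])
    for conv_id in outcomes_v1.keys() & outcomes_v2.keys()
    if outcomes_v1[conv_id] != outcomes_v2[conv_id]
}
for conv_id, (before, after) in sorted(flips.items()):
    label = "fixed" if after else "regressed"
    print(f"{conv_id}: {label} (v1={before}, v2={after})")
```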
Next steps
Now that you have compared evaluations, you can take action on the results.
- Schedule evaluations: run evaluations automatically on a recurring schedule so regressions are caught early