To get started, we’ll implement a minimal agent that exposes both its reasoning
steps and its final answer. Returning an AgentResponse that includes the full
steps list is what makes tool selection and reasoning quality testable —
without that structure, your checks can only inspect the final string.
First, let’s create an agent to test:
```python
from typing import Literal, Optional

from pydantic import BaseModel


class Tool(BaseModel):
    name: str
    description: str


class AgentStep(BaseModel):
    thought: str
    tool: str
    tool_input: str
    observation: str


class AgentResponse(BaseModel):
    steps: list[AgentStep]
    final_answer: str
    success: bool


class SimpleAgent:
    def __init__(self):
        self.tools = {
            "search": Tool(
                name="search", description="Search the internet for information"
            ),
            "calculator": Tool(
                name="calculator",
                description="Perform mathematical calculations",
            ),
            "database": Tool(
                name="database",
                description="Query a database for structured data",
            ),
        }

    # The run() and _use_tool() implementations are omitted in this excerpt.
```
With the agent built, we can now write our first test. This scenario verifies
three things at once: that the agent invoked at least one tool, that it chose
the right tool for a math task, and that it completed successfully. Checking all
three together gives you a tight specification for the most basic agent
behavior.
Verify that the agent selects appropriate tools:
```python
import asyncio

from giskard.checks import Scenario, FnCheck, Equals

agent = SimpleAgent()


async def test_tool_selection():
    tc = (
        Scenario("tool_selection_calculator")
        .interact(
            inputs="What is 15 * 23?", outputs=lambda inputs: agent.run(inputs)
        )
        # The check definitions are omitted in this excerpt; the report below
        # shows the three checks: used_tools, selected_calculator, task_successful.
    )
```

```
──────────────────────────── ✅ PASSED ────────────────────────────
used_tools            PASS
selected_calculator   PASS
task_successful       PASS
────────────────────────────── Trace ──────────────────────────────
─────────────────────────── Interaction 1 ─────────────────────────
Inputs: 'What is 15 * 23?'
Outputs: AgentResponse(steps=[AgentStep(thought='I need to use the calculator for this math problem',
tool='calculator', tool_input='15', observation='15')], final_answer='The answer is 15', success=True)
─────────────────────────── 1 step in 5ms ─────────────────────────
```
Building on Test 1, we now evaluate whether the agent’s internal thought process
makes sense — not just whether the right tool was called. The LLMJudge check
is the appropriate tool here because reasoning quality is a semantic property
that can’t be reduced to a string match.
Evaluate the quality of the agent’s reasoning:
```python
from giskard.agents.generators import Generator
from giskard.checks import Scenario, LLMJudge, FnCheck, set_default_generator
```
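Before handing reasoning quality to an LLM judge, it can help to pin down the deterministic part with a plain predicate. The helper below is a framework-free sketch, not part of the giskard API (the `Step` stand-in and function name are illustrative): it flags steps whose thought never mentions the tool that was actually called.

```python
from dataclasses import dataclass


@dataclass
class Step:
    # Minimal stand-in for the AgentStep model defined earlier.
    thought: str
    tool: str


def thought_mentions_tool(steps: list[Step]) -> bool:
    """True when every step's thought names the tool it actually invoked."""
    return all(step.tool.lower() in step.thought.lower() for step in steps)


good = [Step("I need the calculator for this math problem", "calculator")]
bad = [Step("I will search the web", "calculator")]
print(thought_mentions_tool(good))  # True
print(thought_mentions_tool(bad))   # False
```

A predicate like this catches the blunt mismatches cheaply, leaving the LLM judge to assess the genuinely semantic question of whether the thought was sensible.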
Next, we’ll test an agent that must chain multiple tools in a specific order.
The three FnCheck checks assert that each required tool was used, while the
LLMJudge check validates that the steps appeared in a logical sequence —
catching cases where the agent calculates before it has gathered the data to
calculate from.
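The ordering requirement itself can also be stated as a deterministic predicate. Here is a framework-free sketch (the function name is illustrative): it treats the required tools as a subsequence of the tools actually called, so every required tool must appear, in the given relative order.

```python
def tools_in_order(called: list[str], required: list[str]) -> bool:
    """True when `required` is a subsequence of `called`:
    each required tool is used, in the stated relative order."""
    it = iter(called)
    # `tool in it` advances the iterator, so later matches must come later.
    return all(tool in it for tool in required)


print(tools_in_order(["search", "database", "calculator"],
                     ["search", "calculator"]))   # True
print(tools_in_order(["calculator", "search"],
                     ["search", "calculator"]))   # False
```

This covers the structural half of the requirement; judging whether the sequence was *logical* for the specific task remains a semantic question for the LLMJudge check.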
With happy-path tests in place, we now test the recovery path. The RobustAgent
below attempts the calculator first and falls back to search when it fails — and
our checks verify both that the fallback was triggered and that the agent
ultimately succeeded despite the initial error.
Verify that agents handle errors gracefully:
```python
class RobustAgent(SimpleAgent):
    def run(self, task: str) -> AgentResponse:
        steps = []

        # Try first approach
        thought = "I'll try using the calculator"
        observation = self._use_tool("calculator", task)
        steps.append(
            AgentStep(
                thought=thought,
                tool="calculator",
                tool_input=task,
                observation=observation,
            )
        )

        if "Error" in observation:
            # Fallback strategy
            thought = "Calculator failed, I'll search instead"
            observation = self._use_tool("search", task)
            steps.append(
                AgentStep(
                    thought=thought,
                    tool="search",
                    tool_input=task,
                    observation=observation,
                )
            )

        final_answer = f"After trying different approaches: {observation}"
        return AgentResponse(
            steps=steps,
            final_answer=final_answer,
            success="Error" not in observation,
        )
```
Now we’ll verify that an agent can reference information from an earlier turn.
The scenario uses two .interact() calls, and the check on the second
interaction examines trace.last.outputs.final_answer to confirm the agent
correctly recalled what was discussed before.
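The stateful pattern itself can be sketched framework-free. The toy agent below (class and logic are illustrative, not the document's implementation) appends every turn to a history list and answers a recall question from it, which is exactly the behavior the second `.interact()` check inspects.

```python
class MemoryAgent:
    """Toy stateful agent: remembers prior turns and can recall them."""

    def __init__(self):
        self.history: list[str] = []

    def run(self, message: str) -> str:
        if "what did" in message.lower() and self.history:
            # Recall path: reference the first thing discussed.
            answer = f"Earlier you mentioned: {self.history[0]}"
        else:
            answer = f"Noted: {message}"
        self.history.append(message)
        return answer


memory_agent = MemoryAgent()
memory_agent.run("My favorite city is Lisbon")
print(memory_agent.run("What did I say earlier?"))
# Earlier you mentioned: My favorite city is Lisbon
```

The check on the second interaction then only needs to assert that the recalled content appears in the final answer.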
Building on the stateful agent pattern, we now test a more structured workflow
where the agent tracks a task queue. This scenario walks through all four
lifecycle steps — add, add, complete, status — and checks at each stage that the
internal state matches what the responses claim.
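The task-queue lifecycle can be sketched in a few lines of plain Python (class and method names here are illustrative, not the document's agent), which makes the "state matches the claim" checks concrete: after each stage, the internal counters must agree with what `status()` reports.

```python
class TaskQueueAgent:
    """Toy workflow agent tracking pending and completed tasks."""

    def __init__(self):
        self.pending: list[str] = []
        self.done: list[str] = []

    def add(self, task: str) -> str:
        self.pending.append(task)
        return f"Added '{task}' ({len(self.pending)} pending)"

    def complete(self, task: str) -> str:
        self.pending.remove(task)
        self.done.append(task)
        return f"Completed '{task}'"

    def status(self) -> str:
        return f"{len(self.pending)} pending, {len(self.done)} done"


# Walk the four lifecycle steps: add, add, complete, status.
queue_agent = TaskQueueAgent()
queue_agent.add("write report")
queue_agent.add("review PR")
queue_agent.complete("write report")
print(queue_agent.status())  # 1 pending, 1 done
```

A per-stage check then compares `queue_agent.pending` and `queue_agent.done` against the counts the response string claims.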
Verify that complex tasks are fully completed:
```python
from giskard.checks import Scenario, LLMJudge, FnCheck
```
Now we’ll bring all the individual tests together into a reusable suite class.
The add_test and add_scenario methods let you compose the suite
incrementally, and run_all reports both categories in a single pass so you can
see the full picture at a glance.
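The shape of such a suite can be sketched without the framework. The class below keeps the `add_test`, `add_scenario`, and `run_all` names from the text, but its internals are assumptions: plain callables for tests, coroutine functions for scenarios, and a single results dict covering both categories.

```python
import asyncio


class AgentTestSuite:
    """Minimal suite sketch: collects tests and scenarios, runs both in one pass."""

    def __init__(self):
        self.tests = []       # (name, callable) pairs
        self.scenarios = []   # (name, coroutine function) pairs

    def add_test(self, name, fn):
        self.tests.append((name, fn))

    def add_scenario(self, name, coro_fn):
        self.scenarios.append((name, coro_fn))

    def run_all(self) -> dict:
        results = {"tests": {}, "scenarios": {}}
        for name, fn in self.tests:
            results["tests"][name] = fn()
        for name, coro_fn in self.scenarios:
            results["scenarios"][name] = asyncio.run(coro_fn())
        return results


suite = AgentTestSuite()
suite.add_test("always_passes", lambda: True)


async def demo_scenario():
    return True

suite.add_scenario("demo", demo_scenario)
print(suite.run_all())
# {'tests': {'always_passes': True}, 'scenarios': {'demo': True}}
```

Keeping both categories in one `run_all` report is what gives the at-a-glance view the text describes: a single dict (or printout) covering every registered check.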