Most checks are stateless — they inspect the current trace and return a result. Stateful checks maintain internal state across multiple scenario runs, enabling patterns like uniqueness tracking, accumulated counts, or cross-scenario consistency validation.
Stateful checks make individual test results depend on execution order and prior runs. Use them deliberately, and prefer trace-based state when possible.
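The order dependence is easy to see with a framework-free sketch (plain Python; `SeenBefore` is an illustrative stand-in, not part of giskard):

```python
class SeenBefore:
    """Minimal stateful checker: passes the first time it sees a value."""

    def __init__(self):
        self._seen = set()

    def check(self, value: str) -> bool:
        if value in self._seen:
            return False  # duplicate: fails
        self._seen.add(value)
        return True  # first sighting: passes


checker = SeenBefore()
# The same input passes or fails depending on what ran before it.
print(checker.check("hello"))  # first run
print(checker.check("hello"))  # second run: same input, different result
```

The result for any given input is a function of the checker's history, which is exactly why stateful checks need careful lifecycle management.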
Use a stateful check when you need to:

- Track uniqueness of outputs across scenario runs
- Accumulate counts (e.g. refusals) across a batch
- Validate cross-scenario consistency

For within-scenario state (e.g. “turn 2 references turn 1”), use the trace directly: `trace.interactions[0]` is always available without stateful checks.
To get started with stateful checks, we’ll implement the most common pattern: asserting that a model never returns the same response twice across a batch of distinct inputs. The check stores previously seen outputs in a set that persists for the lifetime of the instance.
```python
from giskard.checks import Check, CheckResult, Trace


@Check.register("uniqueness_tracker")
class UniquenessTracker(Check):
    """Fails if the same output is seen more than once across runs."""

    def __init__(self, **data):
        super().__init__(**data)
        self._seen: set[str] = set()

    async def run(self, trace: Trace) -> CheckResult:
        output = str(trace.last.outputs)

        if output in self._seen:
            return CheckResult.failure(
                message=f"Duplicate output detected: {output!r}",
                details={"unique_count": len(self._seen)},
            )

        self._seen.add(output)
        return CheckResult.success(
            message="Output is unique",
            details={"unique_count": len(self._seen)},
        )
```

Notice that the check must be a single shared instance — passing `UniquenessTracker(name="unique_responses")` inside the loop would create a fresh instance for every scenario and defeat the purpose. Use the same instance across all scenarios so the state accumulates:
```python
import asyncio

from giskard.checks import Scenario

tracker = UniquenessTracker(name="unique_responses")


def chatbot(prompt: str) -> str:
    # Your chatbot — for this example it always returns the same string
    return "I can help with that."


scenarios = [
    Scenario(f"test_{i}")
    .interact(
        inputs=f"Question {i}",
        outputs=lambda inputs: chatbot(inputs),
    )
    .check(tracker)  # same tracker instance
    for i in range(3)
]


async def main():
    # gather must run inside an event loop, so wrap it in a coroutine
    return await asyncio.gather(*(s.run() for s in scenarios))


results = asyncio.run(main())

for i, result in enumerate(results):
    status = "PASS" if result.passed else "FAIL"
    print(f"[{status}] test_{i}: {result.steps[0].results[0].message}")
```

Output:

```
[PASS] test_0: Output is unique
[FAIL] test_1: Duplicate output detected: 'I can help with that.'
[FAIL] test_2: Duplicate output detected: 'I can help with that.'
```
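Why sharing the instance matters can be shown without the framework. In this sketch (plain Python, no giskard imports; `UniqueSet` is an illustrative name), fresh instances never accumulate state, while a single shared instance catches the duplicates:

```python
class UniqueSet:
    """Stand-in for a stateful uniqueness check."""

    def __init__(self):
        self._seen = set()

    def is_unique(self, output: str) -> bool:
        if output in self._seen:
            return False
        self._seen.add(output)
        return True


outputs = ["I can help with that."] * 3

# Fresh instance per scenario: every output looks unique (wrong).
fresh_results = [UniqueSet().is_unique(o) for o in outputs]

# One shared instance: duplicates are caught from the second run on.
shared = UniqueSet()
shared_results = [shared.is_unique(o) for o in outputs]

print(fresh_results)   # all pass
print(shared_results)  # first passes, the rest fail
```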
Next, we’ll build on the uniqueness pattern to count how many times a condition occurs rather than just whether it has occurred before. This lets you set a tolerance threshold — for example, allowing a small number of refusals in a large dataset without failing the entire batch.
Track how many responses satisfy a condition across a batch and fail if the count exceeds a threshold:
```python
from giskard.checks import Check, CheckResult, Trace


@Check.register("refusal_counter")
class RefusalCounter(Check):
    """Fails if the model refuses more than `max_refusals` times."""

    max_refusals: int = 2

    def __init__(self, **data):
        super().__init__(**data)
        self._refusal_count: int = 0

    async def run(self, trace: Trace) -> CheckResult:
        output = str(trace.last.outputs).lower()
        refused = any(
            kw in output for kw in ["cannot", "sorry", "i'm unable", "i can't"]
        )

        if refused:
            self._refusal_count += 1

        if self._refusal_count > self.max_refusals:
            return CheckResult.failure(
                message=(
                    f"Model has refused {self._refusal_count} times "
                    f"(max allowed: {self.max_refusals})"
                ),
                details={"refusal_count": self._refusal_count},
            )

        return CheckResult.success(
            message=f"Refusal count within limit ({self._refusal_count})",
            details={"refusal_count": self._refusal_count},
        )
```

With stateful checks in use, you need to be careful not to carry state from one test session into another. A pytest fixture that constructs a fresh instance for each test is the cleanest way to guarantee isolation.
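The thresholded-count logic itself is plain integer bookkeeping and can be exercised without the framework. A sketch of the same tolerance pattern (illustrative names, no giskard dependency):

```python
REFUSAL_KEYWORDS = ["cannot", "sorry", "i'm unable", "i can't"]


def looks_like_refusal(output: str) -> bool:
    text = output.lower()
    return any(kw in text for kw in REFUSAL_KEYWORDS)


def run_batch(outputs: list[str], max_refusals: int = 2) -> list[bool]:
    """Per-output pass/fail: fail once the running refusal count exceeds the cap."""
    count = 0
    results = []
    for output in outputs:
        if looks_like_refusal(output):
            count += 1
        results.append(count <= max_refusals)
    return results


batch = [
    "Sure, here you go.",
    "Sorry, I can't help with that.",
    "I cannot do that.",
    "Sorry again.",  # third refusal: exceeds max_refusals=2
]
print(run_batch(batch))  # [True, True, True, False]
```

The first two refusals are tolerated; only once the count exceeds the cap do subsequent runs fail, which is exactly the batch-level tolerance behavior `RefusalCounter` implements.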
If you run the same stateful check across multiple test sessions (e.g. in pytest), reset state in a fixture to prevent cross-test contamination:
```python
import asyncio

import pytest

from giskard.checks import Scenario


@pytest.fixture
def fresh_tracker():
    return UniquenessTracker(name="unique_responses")


@pytest.mark.asyncio
async def test_no_duplicate_responses(fresh_tracker):
    inputs = ["Hello", "What time is it?", "Tell me a joke"]
    scenarios = [
        Scenario(f"test_{i}")
        .interact(
            inputs=inp,
            outputs=lambda inputs: chatbot(inputs),
        )
        .check(fresh_tracker)
        for i, inp in enumerate(inputs)
    ]

    results = await asyncio.gather(*(s.run() for s in scenarios))
    assert all(r.passed for r in results), "Duplicate responses detected"
```

Before reaching for a stateful check, consider whether you can express the constraint using the trace. Multi-turn scenarios keep the full history, so cross-turn assertions like “does turn 2 reference what was said in turn 1?” are naturally captured without any external state:
```python
from giskard.checks import Scenario, FnCheck

# This does NOT need a stateful check — the trace has both turns
scenario = (
    Scenario("context_retained")
    .interact(
        inputs="My name is Alice.",
        outputs=lambda inputs: chatbot(inputs),
    )
    .interact(
        inputs="What is my name?",
        outputs=lambda inputs: chatbot(inputs),
    )
    .check(
        FnCheck(
            fn=lambda trace: "Alice" in trace.last.outputs,
            name="recalls_name",
        )
    )
)
```

Use stateful checks only when the constraint genuinely spans multiple independent scenario runs, not multiple turns within a single scenario.