Batch evaluation runs the same scenario pattern across many inputs and
aggregates the results into a pass/fail summary. Use it to evaluate a dataset of
test cases, measure regression coverage, or compare outputs across prompt
variants.
To get started, we'll implement the core batch loop. The key insight is that
asyncio.gather schedules all scenarios concurrently, so when the underlying
calls are non-blocking, the total runtime tracks the slowest single call rather
than the number of test cases. This matters most when each interaction involves
an LLM.
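To make the runtime claim concrete, here is a minimal sketch that uses only the standard library. `fake_llm_call` is a hypothetical stand-in that sleeps instead of calling a model, and the delays are arbitrary; neither is part of giskard.

```python
import asyncio
import time


async def fake_llm_call(question: str, delay: float) -> str:
    # Hypothetical stand-in for a non-blocking model call.
    await asyncio.sleep(delay)
    return f"answer to {question!r}"


async def timed_batch(delays: list[float]) -> tuple[list[str], float]:
    start = time.perf_counter()
    # gather schedules every coroutine at once and preserves input order.
    answers = await asyncio.gather(
        *(fake_llm_call(f"q{i}", d) for i, d in enumerate(delays))
    )
    return list(answers), time.perf_counter() - start


if __name__ == "__main__":
    delays = [0.05, 0.10, 0.15]
    answers, elapsed = asyncio.run(timed_batch(delays))
    # Elapsed time should be close to max(delays), not sum(delays).
    print(f"{len(answers)} calls in {elapsed:.2f}s (sum of delays: {sum(delays):.2f}s)")
```

The speed-up only appears when the underlying call yields to the event loop. A blocking synchronous function invoked inside a coroutine still runs one call at a time.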
Define your test cases as a list of (input, expected) pairs, create a scenario
for each pair, run them all concurrently with asyncio.gather, then summarise:
import asyncio

from giskard.checks import Scenario, StringMatching

test_cases = [
    ("How long do we retain KYC records?", "5 years"),
    ("Can we share customer data with third parties?", "only with consent"),
    ("Is medical advice allowed in the chatbot?", "no"),
]


def my_qa_system(question: str) -> str:
    # Your QA system
    return "..."


async def run_batch():
    # Build one (question, scenario) pair per test case.
    scenarios = [
        (
            question,
            Scenario(f"qa_{i}")
            .interact(
                inputs=question,
                # Bind the current question as a default argument so each
                # lambda keeps its own value.
                outputs=lambda inputs, q=question: my_qa_system(q),
            )
            .check(
                StringMatching(
                    name="contains_expected",
                    keyword=expected,
                    text_key="trace.last.outputs",
                )
            ),
        )
        for i, (question, expected) in enumerate(test_cases)
    ]
    # Run all scenarios concurrently.
    results = await asyncio.gather(*(s.run() for _, s in scenarios))
    passed = sum(1 for r in results if r.passed)
    total = len(results)
    print(f"Passed: {passed}/{total} ({passed / total * 100:.1f}%)")
    for result in results:
        result.print_report()
    return results


_ = asyncio.run(run_batch())
Output
Passed: 0/3 (0.0%)

──────────────────── FAILED ────────────────────
contains_expected  FAIL  The answer does not contain the keyword '5 years'
──────────────────── Trace ────────────────────
──────────────────── Interaction 1 ────────────────────
Inputs: 'How long do we retain KYC records?'
Outputs: '...'
──────────────────── 1 step in 5ms ────────────────────

──────────────────── FAILED ────────────────────
contains_expected  FAIL  The answer does not contain the keyword 'only with consent'
──────────────────── Trace ────────────────────
──────────────────── Interaction 1 ────────────────────
Inputs: 'Can we share customer data with third parties?'
Outputs: '...'
──────────────────── 1 step in 4ms ────────────────────

──────────────────── FAILED ────────────────────
contains_expected  FAIL  The answer does not contain the keyword 'no'
──────────────────── Trace ────────────────────
──────────────────── Interaction 1 ────────────────────
Inputs: 'Is medical advice allowed in the chatbot?'
Outputs: '...'
──────────────────── 1 step in 3ms ────────────────────
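The batch loop keeps each question next to its scenario, which makes it possible to report which inputs failed rather than only how many. The sketch below assumes nothing about a result beyond the `passed` attribute used in the loop; `FakeResult` and `summarise` are hypothetical names introduced for illustration, and the empty-batch guard is an addition.

```python
from dataclasses import dataclass


@dataclass
class FakeResult:
    # Hypothetical stand-in exposing only the `passed` attribute.
    passed: bool


def summarise(questions, results):
    """Return (passed, total, pass_rate_percent, failed_questions)."""
    if len(questions) != len(results):
        raise ValueError("questions and results must have the same length")
    failed = [q for q, r in zip(questions, results) if not r.passed]
    total = len(results)
    passed = total - len(failed)
    # Guard against an empty batch to avoid dividing by zero.
    rate = passed / total * 100 if total else 0.0
    return passed, total, rate, failed


if __name__ == "__main__":
    questions = ["q1", "q2", "q3"]
    results = [FakeResult(True), FakeResult(False), FakeResult(True)]
    passed, total, rate, failed = summarise(questions, results)
    print(f"Passed: {passed}/{total} ({rate:.1f}%)")
    for q in failed:
        print(f"  failed: {q}")
```

Because asyncio.gather returns results in the order the awaitables were passed, zipping them with the original questions pairs each result with the right input.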
The asyncio.gather approach above gives you aggregate pass/fail counts, but a
CI pipeline benefits from individual failure markers. Next, we'll convert the
same test cases into a parametrized pytest function so each input gets its own
entry in the test report.
To get per-test failure reporting in CI, use @pytest.mark.parametrize:
import pytest

from giskard.checks import Scenario, StringMatching

QA_CASES = [
    ("How long do we retain KYC records?", "5 years"),
    ("Can we share customer data with third parties?", "only with consent"),
    ("Is medical advice allowed in the chatbot?", "no"),
]
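One possible shape for the parametrized test is sketched below. To stay self-contained it replaces the scenario and `StringMatching` check with a plain substring assertion; in a real suite the test body would build the scenario as in the batch loop and assert on `result.passed`. The canned answers and the name `test_answer_contains_expected` are hypothetical, chosen only so the example runs.

```python
import pytest

QA_CASES = [
    ("How long do we retain KYC records?", "5 years"),
    ("Can we share customer data with third parties?", "only with consent"),
    ("Is medical advice allowed in the chatbot?", "no"),
]

# Hypothetical canned answers so the example passes; replace with your system.
_CANNED_ANSWERS = {
    "How long do we retain KYC records?": "We retain them for 5 years.",
    "Can we share customer data with third parties?": "Yes, but only with consent.",
    "Is medical advice allowed in the chatbot?": "no",
}


def my_qa_system(question: str) -> str:
    # Placeholder lookup standing in for a real QA system.
    return _CANNED_ANSWERS.get(question, "...")


# pytest generates one test item per (question, expected) tuple, so each
# case passes or fails independently in the report.
@pytest.mark.parametrize("question,expected", QA_CASES)
def test_answer_contains_expected(question: str, expected: str) -> None:
    answer = my_qa_system(question)
    # Simplified stand-in for the StringMatching check (assumption).
    assert expected in answer, f"{expected!r} not found in {answer!r}"
```

Running `pytest -v` on a file containing this test lists one line per case.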
With the basic batch loop established, we can now swap in an LLMJudge check.
The generator is configured once before the loop; every scenario created inside
it reuses that single configuration, so you aren't reinitializing a client on
every iteration.
LLM-based checks work in batch too. Set a default generator once before the
loop:
import asyncio
from giskard.agents.generators import Generator
from giskard.checks import Scenario, LLMJudge, set_default_generator
Output
──────────────────── PASSED ────────────────────
factual_consistency  PASS
──────────────────── Trace ────────────────────
──────────────────── Interaction 1 ────────────────────
Inputs: 'The new policy requires all employees to complete security training annually.'
Outputs: 'Summary of: The new policy requires all employees to...'
──────────────────── 1 step in 4573ms ────────────────────

──────────────────── PASSED ────────────────────
factual_consistency  PASS
──────────────────── Trace ────────────────────
──────────────────── Interaction 1 ────────────────────
Inputs: 'The quarterly report shows a 12% increase in revenue compared to last year.'
Outputs: 'Summary of: The quarterly report shows a 12% increas...'
──────────────────── 1 step in 4228ms ────────────────────

──────────────────── PASSED ────────────────────
factual_consistency  PASS
──────────────────── Trace ────────────────────
──────────────────── Interaction 1 ────────────────────
Inputs: 'Our refund policy allows returns within 30 days of purchase with a receipt.'
Outputs: 'Summary of: Our refund policy allows returns within ...'
──────────────────── 1 step in 3371ms ────────────────────
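The configure-once idea can be illustrated without any LLM dependency. The sketch below is not giskard's implementation: `set_default_judge`, `get_default_judge`, `make_check`, and `CountingJudge` are hypothetical stand-ins that show a module-level default being set once and then shared by every check created in a loop.

```python
from typing import Callable, Optional

Judge = Callable[[str], bool]
_default_judge: Optional[Judge] = None


def set_default_judge(judge: Judge) -> None:
    # Store a single shared judge at module level.
    global _default_judge
    _default_judge = judge


def get_default_judge() -> Judge:
    if _default_judge is None:
        raise RuntimeError("call set_default_judge() before creating checks")
    return _default_judge


class CountingJudge:
    """Hypothetical judge that records how often it is constructed."""

    instances = 0

    def __init__(self) -> None:
        CountingJudge.instances += 1

    def __call__(self, text: str) -> bool:
        # Trivial rule for demonstration: non-empty text passes.
        return bool(text.strip())


def make_check() -> Judge:
    # Each check looks up the shared default instead of building its own.
    return get_default_judge()


if __name__ == "__main__":
    set_default_judge(CountingJudge())
    checks = [make_check() for _ in range(3)]
    print(all(c is checks[0] for c in checks), CountingJudge.instances)
```

Setting the default before the loop means the expensive object is constructed once, however many checks are created afterwards.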
Beyond pass/fail, you can collect numeric data from each result to compute
statistics across the whole batch. This is useful for monitoring response
quality trends over time rather than just asserting a binary threshold.
If your checks emit numeric metrics, collect them to compute aggregates:
import asyncio

from giskard.checks import Scenario, FnCheck

test_inputs = [
    "This is a short response.",
    "This is a slightly longer response with more words in it.",
    "Short.",
]


def my_model(text: str) -> str:
    return text  # Echo for demonstration


async def run_with_metrics():
    scenarios = [
        Scenario(f"length_{i}")
        .interact(
            inputs=inp,
            # Bind the current input as a default argument.
            outputs=lambda inputs, x=inp: my_model(x),
        )
        .check(
            FnCheck(
                # Pass when the output has at least three words.
                fn=lambda trace: len(trace.last.outputs.split()) >= 3,
                name="min_word_count",
                success_message="Meets minimum word count",
                failure_message="Response too short",
            )
        )
        for i, inp in enumerate(test_inputs)
    ]
    results = await asyncio.gather(*(s.run() for s in scenarios))
    # Collect one numeric value per result, then aggregate.
    word_counts = [len(r.final_trace.last.outputs.split()) for r in results]
    print(f"Average word count: {sum(word_counts) / len(word_counts):.1f}")
    print(f"Passed: {sum(1 for r in results if r.passed)}/{len(results)}")
    for r in results:
        r.print_report()


asyncio.run(run_with_metrics())
Output
Average word count: 5.7
Passed: 2/3

──────────────────── PASSED ────────────────────
min_word_count  PASS
──────────────────── Trace ────────────────────
──────────────────── Interaction 1 ────────────────────
Inputs: 'This is a short response.'
Outputs: 'This is a short response.'
──────────────────── 1 step in 0ms ────────────────────

──────────────────── PASSED ────────────────────
min_word_count  PASS
──────────────────── Trace ────────────────────
──────────────────── Interaction 1 ────────────────────
Inputs: 'This is a slightly longer response with more words in it.'
Outputs: 'This is a slightly longer response with more words in it.'
──────────────────── 1 step in 0ms ────────────────────

──────────────────── FAILED ────────────────────
min_word_count  FAIL  Response too short
──────────────────── Trace ────────────────────
──────────────────── Interaction 1 ────────────────────
Inputs: 'Short.'
Outputs: 'Short.'
──────────────────── 1 step in 0ms ────────────────────
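Once the per-result values are in a list, the standard library's `statistics` module covers the usual aggregates. This sketch recomputes the word counts directly from the three inputs used in the example (the demonstration model echoes its input, so outputs equal inputs); `summarise_counts` is a hypothetical helper.

```python
import statistics

outputs = [
    "This is a short response.",
    "This is a slightly longer response with more words in it.",
    "Short.",
]


def summarise_counts(texts: list[str]) -> dict[str, float]:
    # Whitespace-separated word count per text, as in the batch example.
    counts = [len(t.split()) for t in texts]
    return {
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "min": min(counts),
        "max": max(counts),
    }


if __name__ == "__main__":
    stats = summarise_counts(outputs)
    print(
        f"mean={stats['mean']:.1f} median={stats['median']} "
        f"min={stats['min']} max={stats['max']}"
    )
    # prints: mean=5.7 median=5 min=1 max=11
```

Storing these aggregates per run, rather than only the pass count, is what makes trend monitoring possible.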