Run remote evaluations
This section will guide you through running evaluations against a remote agent.
In production, you will want to run evaluations against agents that are configured in the Hub and exposed through an API.
As usual, let’s initialize the Hub client and set our current project ID:
import os
from giskard_hub import HubClient
hub = HubClient()
project_id = os.getenv("HUB_PROJECT_ID")
Configure an agent
First, we need to configure the agent that we want to evaluate. The agent will need to be accessible from a remote API endpoint.
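For reference, such an endpoint is typically a simple HTTP service that accepts chat messages and returns the agent's answer. Below is a hypothetical sketch using FastAPI; the exact request and response schema expected by the Hub is not covered here, so the route, field names, and header check are assumptions for illustration only.
from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()

class ChatMessage(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    # Assumed payload: a list of chat messages
    messages: list[ChatMessage]

@app.post("/chat")
def chat(request: ChatRequest, x_api_key: str = Header(None)):
    # Check the secret header configured in the Hub (X-API-Key in our example)
    if x_api_key != "SECRET_TOKEN":
        raise HTTPException(status_code=401, detail="Invalid API key")
    # Call your actual agent here; we return a fixed answer for illustration
    return {"role": "assistant", "content": "It is sunny!"}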
We can configure the agent endpoint in the Hub:
agent = hub.models.create(
    project_id=project_id,
    name="MyAgent (staging)",
    description="An agent that answers questions about the weather",
    url="https://my-agent.staging.example.com/chat",
    supported_languages=["en"],
    # if your agent endpoint needs special headers:
    headers={"X-API-Key": "SECRET_TOKEN"},
)
You can test that everything is working by sending a test request to the agent:
response = agent.chat(messages=[{
    "role": "user",
    "content": "What's the weather like in Rome?"
}])
print(response)
# ModelOutput(message=ChatMessage(role='assistant', content='It is sunny!'))
Create an evaluation
Now that the agent is configured, we can launch an evaluation run. We first need to know which dataset to run the evaluation on. If you are running this in a CI/CD pipeline, we recommend setting the dataset ID in an environment variable.
dataset_id = os.getenv("HUB_EVAL_DATASET_ID")
We can now launch the evaluation run:
eval_run = hub.evaluations.create(
    model_id=agent.id,
    dataset_id=dataset_id,
    # optionally, you can also set:
    tags=["staging", "build"],
    run_count=1,  # number of runs per case
    name="staging-build-a4f321",
)
The evaluation run will be queued and processed by the Hub. The hub.evaluations.create method immediately returns an EvaluationRun object while the evaluation is still running. Note, however, that this object will not contain the evaluation results until the evaluation is completed.
You can block until the evaluation run has finished with the wait_for_completion method:
eval_run.wait_for_completion(
    # optionally, specify a timeout in seconds (10 min by default)
    timeout=600
)
This will block until the evaluation is completed and update the eval_run object in place. By default, the method waits up to 10 minutes for the evaluation to complete; if it takes longer, it raises a TimeoutError.
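In a CI/CD job, for example, you may want to catch this exception and fail the build explicitly. A minimal sketch, using only the calls shown above:
import sys

try:
    # wait for the evaluation, raising TimeoutError after 10 minutes
    eval_run.wait_for_completion(timeout=600)
except TimeoutError:
    print("Evaluation did not complete within 10 minutes, aborting.")
    sys.exit(1)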
Then, you can print the results:
# Let's print the evaluation results
eval_run.print_metrics()

Once the evaluation is completed, you may want to compare the results against thresholds to decide whether to promote the agent to production or not.
You can retrieve the metrics from eval_run.metrics: this attribute contains a list of Metric objects.
For example:
import sys

# make sure to wait for completion or the metrics may be empty
eval_run.wait_for_completion()

for metric in eval_run.metrics:
    print(metric.name, metric.percentage)
    if metric.percentage < 90:
        print(f"FAILED: {metric.name} is below 90%.")
        sys.exit(1)
That covers the basics of running evaluations in the Hub. You can now integrate this code into your CI/CD pipeline to automatically evaluate your agents every time you deploy a new version.
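For reference, here is a minimal sketch of such a CI/CD step, combining the calls shown above. The HUB_MODEL_ID environment variable name and the 90% threshold are assumptions; adapt them to your setup.
import os
import sys

from giskard_hub import HubClient

hub = HubClient()

# IDs of the agent and dataset configured in the Hub (env var names are assumptions)
eval_run = hub.evaluations.create(
    model_id=os.getenv("HUB_MODEL_ID"),
    dataset_id=os.getenv("HUB_EVAL_DATASET_ID"),
    tags=["staging", "ci"],
    name="staging-ci-eval",
)

# Block until the evaluation is done (raises TimeoutError after 10 minutes by default)
eval_run.wait_for_completion()

# Fail the pipeline if any metric falls below the threshold
failed = False
for metric in eval_run.metrics:
    print(metric.name, metric.percentage)
    if metric.percentage < 90:
        print(f"FAILED: {metric.name} is below 90%.")
        failed = True

sys.exit(1 if failed else 0)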
Note
If you want to run evaluations on a local model that is not yet exposed with an API, check Run local evaluations.