Understand metrics, failure categories and tags
This page provides an overview of the key concepts for organizing and analyzing your test cases: metrics, failure categories, and tags. Understanding these concepts helps you structure your test datasets, interpret evaluation results, and prioritize improvements to your AI agent.
- Metrics provide quantitative measurements showing how well your agent performs on different checks
- Failure categories help you understand the root causes of failures and prioritize fixes for each category
- Tags help you organize and filter your test cases by business context, user type, or scenario
By combining these three concepts, you can:
- Understand which checks (metrics) are failing most often
- Determine the root causes (failure categories) of those failures
- Identify which types of test cases (tags) have the highest failure rates
- Prioritize fixes for each failure category
You can then focus on improving your agent's compliance with business rules specifically for customer support scenarios.
Metrics
Metrics provide quantitative measurements of your agent's performance across different checks. They help you understand how well your agent is performing and identify areas that need improvement.
Create a check
To create a check, click on the "Create a new check" button in the upper right corner of the screen.

Next, configure the check parameters, which depend on the check type. The configuration will look something like this:

After configuring the check parameters, save the check by clicking on the "Save" button in the upper right corner of the screen. The full list of check configuration parameters can be found below.
Available checks
Correctness
Check whether all information from the reference answer is present in the agent answer without contradiction. Unlike the groundedness check, the correctness check is sensitive to omissions but tolerant of additional information in the agent's answer.
Conformity
Given a rule or criterion, check whether the agent answer complies with this rule. This can be used to check business-specific behavior or constraints. A conformity check may have several rules. Each rule should check a unique and unambiguous behavior. Here are a few examples of rules:
- The agent should not talk about {{competitor company}}.
- The agent should only answer in English.
- The agent should always keep a professional tone.
String Matching
Check whether the given keyword or sentence is present in the agent answer.
Keyword: "Hello"
Failure example:
- Hi, can I help you?
- Reason: The agent answer does not contain the keyword "Hello"
Success example:
- Hello, how may I help you today?
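The string-matching logic amounts to a substring test. Here is a minimal sketch of that idea (an illustration only, not the platform's actual implementation; the `case_sensitive` option is an assumption):

```python
def string_match(agent_answer: str, keyword: str, case_sensitive: bool = True) -> bool:
    """Check whether the keyword or sentence appears in the agent answer."""
    if not case_sensitive:
        agent_answer, keyword = agent_answer.lower(), keyword.lower()
    return keyword in agent_answer

# Keyword: "Hello"
print(string_match("Hi, can I help you?", "Hello"))               # False: keyword absent
print(string_match("Hello, how may I help you today?", "Hello"))  # True
```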
Metadata
Check whether the agent answer contains the expected value at the specified JSON path. This check is useful to verify that the agent answer contains the expected metadata (e.g. whether a tool is called). The metadata check can be used to check for specific values in the metadata of the agent answer, such as a specific date or a specific name.
We recommend using a tool like json-path-evaluator to evaluate the JSON path rules.
JSON Path rule: Expecting `John` (string) at `$.user.name`
Failure example:
- Metadata: `{"user": {"name": "Doe"}}`
  - Reason: Expected `John` at `$.user.name` but got `Doe`
Success examples:
- Metadata: `{"user": {"name": "John"}}`
- Metadata: `{"user": {"name": "John Doe"}}`
JSON Path rule: Expecting `true` (boolean) at `$.output.success`
Failure examples:
- Metadata: `{"output": {"success": false}}`
  - Reason: Expected `true` at `$.output.success` but got `false`
- Metadata: `{"output": {}}`
  - Reason: JSON path `$.output.success` does not exist in metadata
Success example:
- Metadata: `{"output": {"success": true}}`
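A metadata check of this kind can be sketched as follows. This is a simplified evaluator for plain dotted paths only (no filters or wildcards), and its comparison semantics are inferred from the examples above: since `{"user": {"name": "John Doe"}}` passes when expecting `John`, strings appear to be compared by containment, while other types use equality. The actual check implementation may differ:

```python
from typing import Any, Tuple

def check_metadata(metadata: dict, json_path: str, expected: Any) -> Tuple[bool, str]:
    """Evaluate a simple dotted JSON path rule like '$.user.name' against metadata."""
    node: Any = metadata
    # Walk the path one key at a time ('$.user.name' -> ['user', 'name']).
    for key in json_path.lstrip("$.").split("."):
        if not isinstance(node, dict) or key not in node:
            return False, f"JSON path {json_path} does not exist in metadata"
        node = node[key]
    # Strings: containment (so "John Doe" satisfies "John"); others: equality.
    if isinstance(expected, str) and isinstance(node, str):
        passed = expected in node
    else:
        passed = node == expected
    if not passed:
        return False, f"Expected {expected!r} at {json_path} but got {node!r}"
    return True, "ok"

print(check_metadata({"user": {"name": "Doe"}}, "$.user.name", "John"))
print(check_metadata({"user": {"name": "John Doe"}}, "$.user.name", "John"))
print(check_metadata({"output": {}}, "$.output.success", True))
```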
Semantic Similarity
Check whether the agent's response is semantically similar to the reference. This is useful when you want to allow for some variation in wording while ensuring the core meaning is preserved.
Query: What is the capital of France?
Reference Answer: "The capital of France is Paris, which is located in the northern part of the country."
Threshold: 0.8
Failure example:
- The capital of France is Paris, which is located in the southern part of the country.
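The pass/fail mechanics are a similarity score compared against the threshold. The sketch below uses a bag-of-words cosine similarity purely as a stand-in for the embedding model the real check relies on; note how it rates the two sentences as almost identical (~0.96) even though they contradict each other, which is precisely why semantic (embedding-based) scoring is needed:

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a lexical stand-in for an embedding model."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

reference = "The capital of France is Paris, which is located in the northern part of the country."
answer = "The capital of France is Paris, which is located in the southern part of the country."

threshold = 0.8
score = cosine_similarity(reference, answer)
# Lexical overlap is high despite the northern/southern contradiction.
print(f"score={score:.2f}, passed={score >= threshold}")
```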
Custom Checks
Custom checks are built on top of the built-in checks (Conformity, Correctness, Groundedness, String Matching, Metadata, and Semantic Similarity) and can be used to evaluate the quality of your agent's responses.
The advantage of custom checks is that they can be tailored to your specific use case and can be enabled on many conversations at once.
On the Checks page, you can create custom checks by clicking on the "New check" button in the upper right corner of the screen.

Next, set the parameters for the check:
- Name: Give your check a name.
- Identifier: A unique identifier for the check. It should be a string without spaces.
- Description: A brief description of the check.
- Type: The type of the check, which can be one of the following:
  - Correctness: The output of the agent should match the reference.
  - Conformity: The conversation should follow a set of rules.
  - Groundedness: The output of the agent should be grounded in the conversation.
  - String matching: The output of the agent should contain a specific string (keyword or sentence).
  - Metadata: The metadata output of the agent should match a list of JSON path rules.
  - Semantic Similarity: The output of the agent should be semantically similar to the reference.
- A set of parameters specific to the check type. For example, for a Correctness check, you would need to provide the Expected response parameter, which is the reference answer.

Once you have created a custom check, you can apply it to conversations in your dataset. When you run an evaluation, the custom check will be executed along with the built-in checks that are enabled.
Failure categories
Failure categories help you understand the root cause of test failures and identify patterns in how your agent is failing. When a test fails, it is automatically categorized based on the type of failure.
Create a failure category
To add or edit failure categories, go to Settings -> Project Settings. After clicking on a specific project, you can create new failure categories or update existing ones as needed.
Assign failure categories
When a test fails, a failure category is assigned to it automatically; however, you can manually change it to a different one.

You can read about modifying test cases in Modify test cases.
Defining the right failure categories
When creating failure categories, stick to a naming convention that your team agreed on beforehand, and ensure that similar failures are grouped together based on root cause, impact, and other relevant criteria.
- Accuracy-Related Failures: These categories capture failures related to the correctness and completeness of information in the agent's response.
  Examples: "Contradiction", "Omission", "Addition", "Incorrect Information"
- Security-Related Failures: These categories relate to failures that pose security risks or vulnerabilities.
  Examples: "Prompt Injection", "Data Disclosure", "Unauthorized Access"
- Compliance-Related Failures: These categories pertain to failures where the agent violates business rules, policies, or scope constraints.
  Examples: "Business Out of Scope", "Non-Conform Input", "Policy Violation"
- Content Quality Failures: These categories describe failures related to the appropriateness and quality of the agent's response.
  Examples: "Inappropriate Content", "Unprofessional Tone", "Off-Topic Response"
- Behavioral Failures: These categories capture failures related to the agent's behavior or interaction style.
  Examples: "Sycophancy", "Denial of Answer", "Overly Defensive"
- Context-Awareness Failures: These categories relate to failures where the agent fails to properly understand or use the provided context.
  Examples: "Context Misunderstanding", "Missing Context Reference", "Context Contradiction"
- Create Categories Based on Root Causes: Focus on categorizing failures by their underlying root cause rather than surface-level symptoms to enable more effective fixes.
  Example: Instead of creating separate categories for "Wrong Date" and "Wrong Name", consider a broader "Factual Error" category that captures the root cause.
- Use Categories for Prioritization: Focus on fixing the most common failure categories first to have the greatest impact on your agent's performance.
  Example: If "Accuracy-Related Failures" are the most frequent, prioritize improving your agent's fact-checking and information retrieval capabilities.
- Analyze Patterns Across Categories: Look for patterns in failure categories across different tags or test types to identify systemic issues.
  Example: If "Security-Related Failures" are concentrated in conversations tagged with "Adversarial Testing", you may need to strengthen your agent's security defenses.
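The prioritization and cross-category analysis described above boil down to counting failures per category and per category/tag pair. A minimal sketch over exported results might look like this (the record shape and field names are hypothetical, not an actual export format):

```python
from collections import Counter

# Hypothetical failed-test records; the field names are illustrative only.
failed_tests = [
    {"category": "Factual Error", "tags": ["Loan Application"]},
    {"category": "Prompt Injection", "tags": ["Adversarial Testing"]},
    {"category": "Factual Error", "tags": ["Balance Inquiry"]},
    {"category": "Prompt Injection", "tags": ["Adversarial Testing"]},
    {"category": "Unprofessional Tone", "tags": ["Angry Customer"]},
]

# Which failure categories are most frequent? Fix these first.
by_category = Counter(t["category"] for t in failed_tests)
for category, count in by_category.most_common():
    print(f"{category}: {count}")

# Where do categories concentrate? Cross-tabulate category x tag
# to spot systemic issues (e.g. security failures under adversarial tests).
by_pair = Counter((t["category"], tag) for t in failed_tests for tag in t["tags"])
print(by_pair[("Prompt Injection", "Adversarial Testing")])  # 2
```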
Tags
Tags are optional but highly recommended labels that help you organize and filter your test cases. Tags help you analyze evaluation results by allowing you to:
- Filter results - Focus on specific test types or scenarios
- Compare performance - See how your agent performs across different test categories
- Identify weak areas - Discover which types of tests have higher failure rates
- Organize reviews - Review test results by category or domain
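Filtering by tag and comparing per-tag failure rates can be sketched as follows (the records are hypothetical; on the platform itself this is done through the UI filters):

```python
# Hypothetical test-case records carrying tags and a pass/fail outcome.
test_cases = [
    {"id": 1, "tags": {"Confused User", "Loan Application"}, "passed": False},
    {"id": 2, "tags": {"Balance Inquiry"}, "passed": True},
    {"id": 3, "tags": {"Loan Application"}, "passed": True},
    {"id": 4, "tags": {"Angry Customer", "Balance Inquiry"}, "passed": True},
]

def filter_by_tag(cases, tag):
    """Keep only the test cases labeled with the given tag."""
    return [c for c in cases if tag in c["tags"]]

def failure_rate(cases):
    """Fraction of failed cases; 0.0 for an empty selection."""
    return sum(not c["passed"] for c in cases) / len(cases) if cases else 0.0

loan_cases = filter_by_tag(test_cases, "Loan Application")
print(f"Loan Application failure rate: {failure_rate(loan_cases):.0%}")  # 50%
```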
Create a tag
To create a tag, first open a conversation and click on the "Add tag" button in the "Properties" section on the right side of the screen.

Before creating a tag, we recommend reading about the best practices for modifying test cases in Modify test cases.
Choosing the right tag structure
When choosing tags, stick to a naming convention that your team agreed on beforehand. Ensure that similar conversations are grouped together based on categories, business functions, and other relevant criteria. For example, if your team is located in different regions, you can have a tag for each region, such as "Normandy" and "Brittany".
- Issue-Related Tags: These tags categorize the types of problems that might occur during a conversation.
  Examples: "Hallucination", "Misunderstanding", "Incorrect Information"
- Attack-Oriented Tags: These tags relate to specific types of adversarial testing or attacks.
  Examples: "SQL Injection Attempt", "Phishing Query", "Illegal Request"
- Legitimate Question Tags: These tags categorize standard, everyday user queries.
  Examples: "Balance Inquiry", "Loan Application", "Account Opening"
- Context-Specific Tags: These tags pertain to specific business contexts or types of interactions.
  Examples: "Caisse d'Epargne", "Banco Popular", "Corporate Banking"
- User Behavior Tags: These tags describe the nature of the user's behavior or the style of interaction.
  Examples: "Confused User", "Angry Customer", "New User"
- Temporal Tags: These tags mark the stage of the agent's testing life cycle.
  Examples: "red teaming phase 1", "red teaming phase 2"
- Use Multiple Tags if Necessary: Apply multiple tags to a single conversation to cover all relevant aspects.
  Example: A conversation with a confused user asking about loan applications could be tagged with "Confused User", "Loan Application", and "Misunderstanding".
- Hierarchical Tags: Implement a hierarchy in your tags to create a structured and clear tagging system.
  Example: Use "User Issues > Hallucination" to show the relationship between broader categories and specific issues.
- Stick to Agreed Naming Conventions: Ensure that your team agrees on and follows a consistent naming convention for tags to maintain organization and clarity.
  Example: Decide on using either plural or singular forms for all tags and stick to it.
Next Steps
Now that you understand the fundamentals of test organization, you can:
- Review test results
- Modify test cases
- Run evaluations (see Create evaluations)