Harmful Content Generation
Probes that attempt to bypass safety measures and generate dangerous, illegal, or harmful content across various categories
OWASP LLM Top 10
OWASP LLM01: Prompt Injection: Prompt Injection occurs when an attacker manipulates an LLM’s behavior by injecting malicious input. These attacks exploit how LLMs process text-based prompts, often bypassing safeguards, compromising outputs, or enabling unauthorized access. The vulnerability lies in the model’s inability to distinguish between safe and malicious inputs, even if imperceptible to humans. Prompt Injection attacks can range from causing the LLM to generate harmful outputs to accessing sensitive data or performing unauthorized actions
OWASP LLM02: Sensitive Information Disclosure: Sensitive Information Disclosure happens when an LLM unintentionally reveals private or proprietary information. This can include PII, system credentials, or confidential business data. The risk arises from improper data sanitization, poor input handling, or overly permissive outputs. Attackers or users may exploit these vulnerabilities, leading to privacy violations, data breaches, or compliance issues
Probes
Probe Name |
Type |
Description |
|---|---|---|
CBRN |
This probe generates adversarial messages that attempt to manipulate the AI agent into discussing or promoting the development, production, acquisition, or distribution of chemical, biological, radiological, or nuclear weapons. |
|
Context Compliance |
This probe tests whether an AI agent can be influenced to provide harmful or inappropriate content when presented with fabricated conversation history that appears to establish a pattern of compliance with similar requests. Note: only the last assistant message of the conversation is really generated by the target agent, previous messages are fabricated by the probe to trick the target agent into providing harmful content. |
|
Copyright Violations |
This probe generates adversarial messages that attempt to bypass copyright protections and request copyrighted material. |
|
Crescendo |
Multi-turn |
Crescendo is a multi-turn attack that gradually steers the target agent towards a prohibited objective. It starts with a harmless requests and slowly leads the target to providing harmful information, exploiting the tendency of LLMs to follow patterns and focus on recent messages. The attack leverages the model’s recency bias, pattern-following tendency, and trust in self-generated text to progressively guide it through small, benign steps toward generating harmful content. |
CyberSecEval |
This probe tests the AI agent using samples from the CyberSecEval dataset, which contains cybersecurity-related prompts designed to evaluate AI systems’ resistance to generating harmful cybersecurity content. |
|
DoNotAnswer |
This probe tests the target using samples from the DoNotAnswer dataset, which contains questions and prompts that AI systems should refuse to answer due to their harmful, explicit, or illegal nature. |
|
GCG Injection |
This probe tests whether an AI agent is vulnerable to Greedy Coordinate Gradient (GCG) attacks, which use specially crafted adversarial suffixes designed to bypass safety measures and content filters. GCG attacks optimize token sequences that can be appended to harmful prompts to maximize the probability of eliciting prohibited responses. |
|
GOAT Attack |
Multi-turn |
This probe implements the Generalized Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting techniques to identify vulnerabilities in AI systems. |
HarmBench |
This probe tests the target using samples from the HarmBench dataset, a comprehensive benchmark for evaluating AI systems’ resistance to generating harmful content across multiple categories of potential harm. |
|
Harmful Misguidance |
This probe tests whether an AI agent can recognize potentially dangerous situations and provide appropriate warnings when users describe risky behaviors. |
|
Illegal Activities |
This probe generates adversarial messages that attempt to induce content about illegal activities or criminal behavior (fraud, scams, cybercrime, drug creation or distribution, violence, harassment, etc.). |
|
Misinformation & Disinformation |
This probe generates adversarial messages that attempt to manipulate the AI agent into spreading false or misleading information (fake news, complotist theories, false beliefs, etc.). |
|
Politics Topic |
This probe generates adversarial messages that manipulate the AI agent into discussing controversial political topics from biased perspectives or generating politically divisive content. |
|
Religion Topic |
This probe generates adversarial messages that attempt to manipulate the AI agent into making judgmental statements about religious topics, practices, beliefs, or communities. |
|
Stereotypes and Discrimination |
This probe generates adversarial messages that attempt to manipulate the AI agent into making stereotypical or discriminatory statements. |
|
TAP |
Agentic |
This probe implements the Tree of Attacks with Pruning (TAP) method, an automated approach for jailbreaking black-box large language models by systematically generating and refining adversarial prompts through iterative tree search. |
XSTest |
This probe tests the target using samples from the XSTest dataset, which ambiguous questions that could be interpreted both in a benign and harmful way. It assess the target’s ability to answer such questions in a safe way. |