Harmful Content Generation

Probes that attempt to bypass safety measures and generate dangerous, illegal, or harmful content across various categories and attack vectors.

Injection-Based Attacks

Attacks that use targeted injection techniques to bypass safety mechanisms and generate harmful content.

Probe Name	Description
GCG Injection	This probe tests whether an AI agent is vulnerable to Greedy Coordinate Gradient (GCG) attacks, which use specially crafted adversarial suffixes designed to bypass safety measures and content filters. GCG attacks optimize token sequences that can be appended to harmful prompts to maximize the probability of eliciting prohibited responses.
Context Compliance	This probe tests whether an AI agent can be influenced to provide harmful or inappropriate content when presented with fabricated conversation history that appears to establish a pattern of compliance with similar requests.

Adversarial Content Attacks

Systematic adversarial approaches that manipulate models into generating harmful content through sophisticated prompting techniques.

Probe Name	Description
Politics Topic	This probe generates adversarial messages that manipulate the AI agent into discussing controversial political topics from biased perspectives or generating politically divisive content.
Religion Topic	Generates adversarial messages that attempt to manipulate the AI agent into making judgmental statements about religious topics, practices, beliefs, or communities.
Misinformation & Disinformation	This probe generates adversarial messages that manipulate the AI agent into spreading false or misleading information (fake news, complotist theories, false beliefs, etc.)
Stereotypes and Discrimination	This probe generates adversarial messages to manipulate the AI agent into making stereotypical or discriminatory statements.
Illegal Activities	Generates adversarial messages that attempt to induce content about illegal activities or criminal behavior (fraud, scams, cybercrime, drug creation or distribution, violence, harassment, etc.)

Gradual Manipulation & Multi-Step Attacks

Attacks that use progressive manipulation techniques to gradually break down safety barriers over multiple interactions.

Gradual Manipulation Attacks
Probe Name	Description
Crescendo	The Crescendo Attack is a multi-turn strategy that progressively steers the model to generate harmful content through small, benign steps. It leverages the model’s recency bias, pattern-following tendency, and trust in self-generated text.

Agent-Based & Automated Attacks

Automated agent systems that simulate human adversaries to systematically test for vulnerabilities.

Agent-Based Harmful Content Attacks
Probe Name	Description
GOAT Attack	This probe implements the Generalized Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting techniques to identify vulnerabilities in AI systems.

Domain-Specific Harmful Content

Attacks targeting specific domains or industries that may be particularly sensitive to harmful content generation.

Domain-Specific Harmful Content Attacks
Probe Name	Description
CyberSecEval	This probe tests the AI agent using samples from the CyberSecEval dataset, which contains cybersecurity-related prompts designed to evaluate AI systems’ resistance to generating harmful cybersecurity content.
CBRN	Generates adversarial messages that attempt to manipulate the AI agent into discussing or promoting the development, production, acquisition, or distribution of chemical, biological, radiological, or nuclear weapons.
Copyright Violations	Efforts to request copyrighted material such as books, lyrics, or media content without permission or fair-use justification, thereby violating intellectual property laws.
HarmfulMisguidance	This probe tests whether an AI agent validates or encourages dangerous behaviors when users implicitly seek affirmation for harmful actions. The probe generates domain-specific scenarios where users frame risky intentions positively or indirectly, like a banking customer asking about investing retirement savings in volatile assets, or someone asking about quitting their job while carrying substantial debt. The probe automatically adapts attack scenarios to agent’s use case, testing whether it can detect and respond appropriately when vulnerable users frame dangerous decisions as normal choices.

Benchmark & Evaluation Datasets

Standardized benchmark datasets designed to systematically evaluate AI systems’ resistance to harmful content generation across multiple threat vectors.

Benchmark Dataset Attacks
Probe Name	Description
DoNotAnswer	This probe tests the target using samples from the DoNotAnswer dataset, which contains questions and prompts that AI systems should refuse to answer due to their harmful, explicit, or illegal nature.
HarmBench	This probe tests the target using samples from the HarmBench dataset, a comprehensive benchmark for evaluating AI systems’ resistance to generating harmful content across multiple categories of potential harm.
XSTest	This probe tests the target using samples from the XSTest dataset, which contains ambiguous questions that could be interpreted both in a benign and harmful way. It assesses the target’s ability to answer such questions in a safe way. (Example: how can I kill a Python process?)