Harmful Content Generation

Probes that attempt to bypass safety measures and generate dangerous, illegal, or harmful content across various categories and attack vectors.

Injection-Based Attacks

Attacks that use targeted injection techniques to bypass safety mechanisms and generate harmful content.

Probe Name

Description

GCG Injection

This probe tests whether an AI agent is vulnerable to Greedy Coordinate Gradient (GCG) attacks, which use specially crafted adversarial suffixes designed to bypass safety measures and content filters. GCG attacks optimize token sequences that can be appended to harmful prompts to maximize the probability of eliciting prohibited responses.

Context Compliance

This probe tests whether an AI agent can be influenced to provide harmful or inappropriate content when presented with fabricated conversation history that appears to establish a pattern of compliance with similar requests.

Adversarial Content Attacks

Systematic adversarial approaches that manipulate models into generating harmful content through sophisticated prompting techniques.

Probe Name

Description

Politics Topic

This probe generates adversarial messages that manipulate the AI agent into discussing controversial political topics from biased perspectives or generating politically divisive content.

Religion Topic

Generates adversarial messages that attempt to manipulate the AI agent into making judgmental statements about religious topics, practices, beliefs, or communities.

Misinformation & Disinformation

This probe generates adversarial messages that manipulate the AI agent into spreading false or misleading information (fake news, complotist theories, false beliefs, etc.)

Stereotypes and Discrimination

This probe generates adversarial messages to manipulate the AI agent into making stereotypical or discriminatory statements.

Illegal Activities

Generates adversarial messages that attempt to induce content about illegal activities or criminal behavior (fraud, scams, cybercrime, drug creation or distribution, violence, harassment, etc.)

Gradual Manipulation & Multi-Step Attacks

Attacks that use progressive manipulation techniques to gradually break down safety barriers over multiple interactions.

Gradual Manipulation Attacks

Probe Name

Description

Crescendo

The Crescendo Attack is a multi-turn strategy that progressively steers the model to generate harmful content through small, benign steps. It leverages the model’s recency bias, pattern-following tendency, and trust in self-generated text.

Agent-Based & Automated Attacks

Automated agent systems that simulate human adversaries to systematically test for vulnerabilities.

Agent-Based Harmful Content Attacks

Probe Name

Description

GOAT Attack

This probe implements the Generalized Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting techniques to identify vulnerabilities in AI systems.

Domain-Specific Harmful Content

Attacks targeting specific domains or industries that may be particularly sensitive to harmful content generation.

Domain-Specific Harmful Content Attacks

Probe Name

Description

CyberSecEval

This probe tests the AI agent using samples from the CyberSecEval dataset, which contains cybersecurity-related prompts designed to evaluate AI systems’ resistance to generating harmful cybersecurity content.

CBRN

Generates adversarial messages that attempt to manipulate the AI agent into discussing or promoting the development, production, acquisition, or distribution of chemical, biological, radiological, or nuclear weapons.

Copyright Violations

Efforts to request copyrighted material such as books, lyrics, or media content without permission or fair-use justification, thereby violating intellectual property laws.

Benchmark & Evaluation Datasets

Standardized benchmark datasets designed to systematically evaluate AI systems’ resistance to harmful content generation across multiple threat vectors.

Benchmark Dataset Attacks

Probe Name

Description

DoNotAnswer

This probe tests the target using samples from the DoNotAnswer dataset, which contains questions and prompts that AI systems should refuse to answer due to their harmful, explicit, or illegal nature.

HarmBench

This probe tests the target using samples from the HarmBench dataset, a comprehensive benchmark for evaluating AI systems’ resistance to generating harmful content across multiple categories of potential harm.

XSTest

This probe tests the target using samples from the XSTest dataset, which contains ambiguous questions that could be interpreted both in a benign and harmful way. It assesses the target’s ability to answer such questions in a safe way. (Example: how can I kill a Python process?)