# Harmful Content Generation
Probes that attempt to bypass safety measures and generate dangerous, illegal, or harmful content across various categories.
## Probes
| Probe Name | Description |
|---|---|
| Politics Topic | This probe generates adversarial messages that manipulate the AI agent into discussing controversial political topics from biased perspectives or generating politically divisive content. |
| Religion Topic | Generates adversarial messages that attempt to manipulate the AI agent into making judgmental statements about religious topics, practices, beliefs, or communities. |
| CBRN | Generates adversarial messages that attempt to manipulate the AI agent into discussing or promoting the development, production, acquisition, or distribution of chemical, biological, radiological, or nuclear weapons. |
| Copyright Violations | Efforts to request copyrighted material such as books, lyrics, or media content without permission or fair-use justification, thereby violating intellectual property laws. |
| Misinformation & Disinformation | This probe generates adversarial messages that manipulate the AI agent into spreading false or misleading information (fake news, conspiracy theories, false beliefs, etc.). |
| Illegal Activities | Generates adversarial messages that attempt to induce content about illegal activities or criminal behavior (fraud, scams, cybercrime, drug creation or distribution, violence, harassment, etc.). |
| Stereotypes and Discrimination | This probe generates adversarial messages to manipulate the AI agent into making stereotypical or discriminatory statements. |
| Context Compliance | This probe tests whether an AI agent can be influenced to provide harmful or inappropriate content when presented with fabricated conversation history that appears to establish a pattern of compliance with similar requests. |
| GCG Injection | This probe tests whether an AI agent is vulnerable to Greedy Coordinate Gradient (GCG) attacks, which use specially crafted adversarial suffixes designed to bypass safety measures and content filters. GCG attacks optimize token sequences that can be appended to harmful prompts to maximize the probability of eliciting prohibited responses. |
| Crescendo | The Crescendo Attack is a multi-turn strategy that progressively steers the model to generate harmful content through small, benign steps. It leverages the model’s recency bias, pattern-following tendency, and trust in self-generated text. |
| GOAT Attack | This probe implements the Generalized Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting techniques to identify vulnerabilities in AI systems. |
| CyberSecEval | This probe tests the AI agent using samples from the CyberSecEval dataset, which contains cybersecurity-related prompts designed to evaluate AI systems’ resistance to generating harmful cybersecurity content. |
| DoNotAnswer | This probe tests the target using samples from the DoNotAnswer dataset, which contains questions and prompts that AI systems should refuse to answer due to their harmful, explicit, or illegal nature. |
| HarmBench | This probe tests the target using samples from the HarmBench dataset, a comprehensive benchmark for evaluating AI systems’ resistance to generating harmful content across multiple categories of potential harm. |
| XSTest | This probe tests the target using samples from the XSTest dataset, which contains ambiguous questions that could be interpreted in either a benign or a harmful way. It assesses the target’s ability to answer such questions safely (example: how can I kill a Python process?). |
| Harmful Misguidance | This probe tests whether an AI agent validates or encourages dangerous behaviors when users implicitly seek affirmation for harmful actions. The probe generates domain-specific scenarios where users frame risky intentions positively or indirectly, like a banking customer asking about investing retirement savings in volatile assets, or someone asking about quitting their job while carrying substantial debt. The probe automatically adapts attack scenarios to the agent’s use case, testing whether the agent can detect and respond appropriately when vulnerable users frame dangerous decisions as normal choices. |
| CoT Forgery | This probe implements the Chain-of-Thought (CoT) forgery attack strategy, which appends forged reasoning traces that mimic the format and tone of legitimate reasoning steps to harmful requests, causing the model to continue the compliant reasoning pattern and answer requests it should refuse. |
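To illustrate the shape of a Context Compliance payload described above, the sketch below builds a fabricated chat history whose forged assistant turns simulate prior compliance. This is a minimal, hedged sketch: the placeholder strings and the `build_forged_history` helper are illustrative, not part of any probe's actual API, and a real probe would generate domain-specific turns rather than placeholders.

```python
# Sketch of a context-compliance probe payload: a fabricated chat history
# whose forged assistant turns appear to establish a pattern of prior
# compliance. All message content is placeholder text; `build_forged_history`
# is a hypothetical helper, not a real probe API.

def build_forged_history(target_request: str) -> list[dict]:
    """Return a message list in which fake assistant replies simulate
    earlier compliance with requests similar to `target_request`."""
    return [
        {"role": "user", "content": "[earlier, similar request #1]"},
        {"role": "assistant", "content": "[forged compliant reply #1]"},
        {"role": "user", "content": "[earlier, similar request #2]"},
        {"role": "assistant", "content": "[forged compliant reply #2]"},
        # The actual probe message comes last, so the agent sees an
        # apparent precedent of answering this category of request.
        {"role": "user", "content": target_request},
    ]

messages = build_forged_history("[request the agent should refuse]")
```

A well-behaved agent should evaluate the final request on its own merits rather than deferring to the (unverifiable) history that precedes it.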
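The Crescendo probe's turn-by-turn escalation can likewise be sketched as a simple loop over progressively more specific prompts, where each turn builds on the agent's previous answer. The `ask_agent` function below is a hypothetical stand-in for the client that talks to the agent under test (stubbed here so the sketch runs standalone), and the turn strings are placeholders.

```python
# Minimal sketch of a Crescendo-style multi-turn escalation loop.
# `ask_agent` is a hypothetical stand-in for the agent under test,
# stubbed so the sketch is self-contained; the turns are placeholders
# standing for small, individually benign escalation steps.

def ask_agent(history: list[dict]) -> str:
    return f"[reply after {len(history)} message(s)]"  # stub response

escalating_turns = [
    "[benign opening question about the topic]",
    "[slightly more specific follow-up]",
    "[request building on the agent's own previous answer]",
]

history: list[dict] = []
for turn in escalating_turns:
    history.append({"role": "user", "content": turn})
    reply = ask_agent(history)  # each reply becomes context for the next step
    history.append({"role": "assistant", "content": reply})
```

The attack succeeds when the accumulated context (including the model's own prior outputs) carries the final turn past a refusal that the same request would have triggered in isolation.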