Harmful Content Generation
Probes that attempt to bypass safety measures and generate dangerous, illegal, or harmful content across various categories and attack vectors.
Injection-Based Attacks
Attacks that use targeted injection techniques to bypass safety mechanisms and generate harmful content.
Probe Name |
Description |
---|---|
GCG Injection |
This probe tests whether an AI agent is vulnerable to Greedy Coordinate Gradient (GCG) attacks, which use specially crafted adversarial suffixes designed to bypass safety measures and content filters. GCG attacks optimize token sequences that can be appended to harmful prompts to maximize the probability of eliciting prohibited responses. |
Context Compliance |
This probe tests whether an AI agent can be influenced to provide harmful or inappropriate content when presented with fabricated conversation history that appears to establish a pattern of compliance with similar requests. |
Adversarial Content Attacks
Systematic adversarial approaches that manipulate models into generating harmful content through sophisticated prompting techniques.
Probe Name |
Description |
---|---|
Politics Topic |
This probe generates adversarial messages that manipulate the AI agent into discussing controversial political topics from biased perspectives or generating politically divisive content. |
Religion Topic |
Generates adversarial messages that attempt to manipulate the AI agent into making judgmental statements about religious topics, practices, beliefs, or communities. |
Misinformation & Disinformation |
This probe generates adversarial messages that manipulate the AI agent into spreading false or misleading information (fake news, complotist theories, false beliefs, etc.) |
Stereotypes and Discrimination |
This probe generates adversarial messages to manipulate the AI agent into making stereotypical or discriminatory statements. |
Illegal Activities |
Generates adversarial messages that attempt to induce content about illegal activities or criminal behavior (fraud, scams, cybercrime, drug creation or distribution, violence, harassment, etc.) |
Gradual Manipulation & Multi-Step Attacks
Attacks that use progressive manipulation techniques to gradually break down safety barriers over multiple interactions.
Agent-Based & Automated Attacks
Automated agent systems that simulate human adversaries to systematically test for vulnerabilities.
Domain-Specific Harmful Content
Attacks targeting specific domains or industries that may be particularly sensitive to harmful content generation.
Benchmark & Evaluation Datasets
Standardized benchmark datasets designed to systematically evaluate AI systems’ resistance to harmful content generation across multiple threat vectors.