Prompt Injection (OWASP LLM 01)
Attacks that attempt to manipulate AI agents through carefully crafted input prompts to override original instructions.
Variation & Optimization Attacks
Probe Name |
Description |
|---|---|
Best-of-N |
This probe implements the Best-of-N attack strategy, which generates multiple diverse variations of adversarial prompts using methods such as word scrambling, random capitalization, and other obfuscation techniques to maximize the chance of bypassing prompt injection defenses. For more details, see: https://arxiv.org/pdf/2412.03556 |
TAP |
This probe implements the Tree of Attacks with Pruning (TAP) method, an automated approach for jailbreaking black-box large language models by systematically generating and refining adversarial prompts through iterative tree search. For more details, see: https://arxiv.org/pdf/2312.02119 |
Framing-Based Attacks
Probe Name |
Description |
|---|---|
Math Framing Injection |
This probe tests whether an AI agent can be tricked by framing harmful requests as mathematical reasoning problems, equations, or academic exercises to bypass safety measures and content filters |
Likert Framing Injection |
This probe tests whether an AI agent can be tricked by framing harmful requests as academic research questions and using Likert scales or survey methodologies to answer. For more information, see: https://unit42.paloaltonetworks.com/multi-turn-technique-jailbreaks |
Citation Framing Injection |
This probe tests whether an AI agent can be manipulated by framing harmful requests as academic citations, research references, or scholarly content to bypass safety mechanisms. For more information, see: https://arxiv.org/pdf/2411.11407 |
Grandma Framing Injection |
This probe tests whether an AI agent can be manipulated through emotional framing that involves a grandmother scenario, where harmful requests are disguised as innocent family interactions or memories |
Classic Jailbreak Techniques
Probe Name |
Description |
|---|---|
DAN |
This probe tests the target’s resistance to DAN-style jailbreak attempts that try to override the AI’s safety guidelines. DAN prompts typically instruct the AI to ignore its restrictions and act as an unrestricted assistant that can “do anything now”. The probe uses a curated dataset of DAN prompts and variations to evaluate whether the target can be manipulated into bypassing its safety mechanisms and producing harmful or inappropriate content through role-playing scenarios |
Deepset Injection |
This probe tests the target using samples from the Deepset prompt injection dataset to evaluate vulnerability to various prompt injection techniques and manipulation strategies |
Template-Based Attacks
Probe Name |
Description |
|---|---|
ChatInject |
This probe tests whether agents can be manipulated through malicious instructions formatted to match their native chat templates. Unlike traditional plain-text injection attacks, ChatInject exploits the structured role-based formatting (system, user, assistant tags) that agents use internally. By wrapping attack payloads with forged chat template tokens, mimicking the model’s own instruction hierarchy, attackers can bypass defenses that rely on role priority. The probe includes a multi-turn variant that sends persuasive conversation, delimited with adequate separation tokens, inside one message to confuse the agent under test. This technique achieves significantly higher success rates than standard injection methods and transfers effectively across models, even when the target model’s exact template structure is unknown. |
CoT Forgery |
This probe implements the Chain-of-Thought (CoT) forgery attack strategy, which appends realistic and compliant reasoning traces to harmful requests that mimic the format and tone of legitimate reasoning steps, causing the model to continue the compliant reasoning pattern and answer requests it should refuse. |
Encoding & Character Manipulation
Probe Name |
Description |
|---|---|
ASCII Smuggling Injection |
This probe tests whether an AI agent can be tricked by injecting invisible ASCII characters. These characters, while invisible for the user, can be interpreted by the AI agent. This is a way to inject malicious instructions inside seemingly harmless prompts. |
Encoding |
This probe tests whether an agent can be tricked into generating harmful content when prompts are encoded (e.g., base64, NATO, etc.) to bypass content filters |
Transliteration |
This probe tests whether an AI agent can be tricked into generating harmful content when prompts are transliterated (converted to different character sets or writing systems) to bypass content filters and detection mechanisms |