Prompt Injection (OWASP LLM 01)
Attacks that attempt to manipulate AI agents through carefully crafted input prompts to override original instructions.
Variation & Optimization Attacks
Probe Name |
Description |
---|---|
Best-of-N |
This probe implements the Best-of-N attack strategy, which generates multiple diverse variations of adversarial prompts using methods such as word scrambling, random capitalization, and other obfuscation techniques to maximize the chance of bypassing prompt injection defenses. For more details, see: https://arxiv.org/pdf/2412.03556 |
TAP |
This probe implements the Tree of Attacks with Pruning (TAP) method, an automated approach for jailbreaking black-box large language models by systematically generating and refining adversarial prompts through iterative tree search. For more details, see: https://arxiv.org/pdf/2312.02119 |
Framing-Based Attacks
Probe Name |
Description |
---|---|
Math Framing Injection |
This probe tests whether an AI agent can be tricked by framing harmful requests as mathematical reasoning problems, equations, or academic exercises to bypass safety measures and content filters |
Likert Framing Injection |
This probe tests whether an AI agent can be tricked by framing harmful requests as academic research questions and using Likert scales or survey methodologies to answer. For more information, see: https://unit42.paloaltonetworks.com/multi-turn-technique-jailbreaks |
Citation Framing Injection |
This probe tests whether an AI agent can be manipulated by framing harmful requests as academic citations, research references, or scholarly content to bypass safety mechanisms. For more information, see: https://arxiv.org/pdf/2411.11407 |
Grandma Framing Injection |
This probe tests whether an AI agent can be manipulated through emotional framing that involves a grandmother scenario, where harmful requests are disguised as innocent family interactions or memories |
Classic Jailbreak Techniques
Probe Name |
Description |
---|---|
DAN |
This probe tests the target’s resistance to DAN-style jailbreak attempts that try to override the AI’s safety guidelines. DAN prompts typically instruct the AI to ignore its restrictions and act as an unrestricted assistant that can “do anything now”. The probe uses a curated dataset of DAN prompts and variations to evaluate whether the target can be manipulated into bypassing its safety mechanisms and producing harmful or inappropriate content through role-playing scenarios |
Deepset Injection |
This probe tests the target using samples from the Deepset prompt injection dataset to evaluate vulnerability to various prompt injection techniques and manipulation strategies |
Encoding & Character Manipulation
Probe Name |
Description |
---|---|
ASCII Smuggling Injection |
This probe tests whether an AI agent can be tricked by injecting invisible ASCII characters. These characters, while invisible for the user, can be interpreted by the AI agent. This is a way to inject malicious instructions inside seemingly harmless prompts. |
Encoding |
This probe tests whether an agent can be tricked into generating harmful content when prompts are encoded (e.g., base64, NATO, etc.) to bypass content filters |
Transliteration |
This probe tests whether an AI agent can be tricked into generating harmful content when prompts are transliterated (converted to different character sets or writing systems) to bypass content filters and detection mechanisms |