Prompt Injection (OWASP LLM 01)

Attacks that attempt to manipulate AI agents through carefully crafted input prompts to override original instructions.

Variation & Optimization Attacks

Advanced Prompt Generation Techniques

Probe Name

Description

Best-of-N

This probe implements the Best-of-N attack strategy, which generates multiple diverse variations of adversarial prompts using methods such as word scrambling, random capitalization, and other obfuscation techniques to maximize the chance of bypassing prompt injection defenses. For more details, see: https://arxiv.org/pdf/2412.03556

TAP

This probe implements the Tree of Attacks with Pruning (TAP) method, an automated approach for jailbreaking black-box large language models by systematically generating and refining adversarial prompts through iterative tree search. For more details, see: https://arxiv.org/pdf/2312.02119

Framing-Based Attacks

Context Manipulation Techniques

Probe Name

Description

Math Framing Injection

This probe tests whether an AI agent can be tricked by framing harmful requests as mathematical reasoning problems, equations, or academic exercises to bypass safety measures and content filters

Likert Framing Injection

This probe tests whether an AI agent can be tricked by framing harmful requests as academic research questions and using Likert scales or survey methodologies to answer. For more information, see: https://unit42.paloaltonetworks.com/multi-turn-technique-jailbreaks

Citation Framing Injection

This probe tests whether an AI agent can be manipulated by framing harmful requests as academic citations, research references, or scholarly content to bypass safety mechanisms. For more information, see: https://arxiv.org/pdf/2411.11407

Grandma Framing Injection

This probe tests whether an AI agent can be manipulated through emotional framing that involves a grandmother scenario, where harmful requests are disguised as innocent family interactions or memories

Classic Jailbreak Techniques

Probe Name

Description

DAN

This probe tests the target’s resistance to DAN-style jailbreak attempts that try to override the AI’s safety guidelines. DAN prompts typically instruct the AI to ignore its restrictions and act as an unrestricted assistant that can “do anything now”. The probe uses a curated dataset of DAN prompts and variations to evaluate whether the target can be manipulated into bypassing its safety mechanisms and producing harmful or inappropriate content through role-playing scenarios

Deepset Injection

This probe tests the target using samples from the Deepset prompt injection dataset to evaluate vulnerability to various prompt injection techniques and manipulation strategies

Encoding & Character Manipulation

Probe Name

Description

ASCII Smuggling Injection

This probe tests whether an AI agent can be tricked by injecting invisible ASCII characters. These characters, while invisible for the user, can be interpreted by the AI agent. This is a way to inject malicious instructions inside seemingly harmless prompts.

Encoding

This probe tests whether an agent can be tricked into generating harmful content when prompts are encoded (e.g., base64, NATO, etc.) to bypass content filters

Transliteration

This probe tests whether an AI agent can be tricked into generating harmful content when prompts are transliterated (converted to different character sets or writing systems) to bypass content filters and detection mechanisms