Prompt Injection
Attacks that attempt to manipulate AI agents through carefully crafted input prompts to override original instructions.
OWASP LLM Top 10
Section titled “OWASP LLM Top 10”OWASP LLM01:2025 Prompt Injection ↗ - Prompt Injection occurs when an attacker manipulates an LLM’s behavior by injecting malicious input. These attacks exploit how LLMs process text-based prompts, often bypassing safeguards, compromising outputs, or enabling unauthorized access. The vulnerability lies in the model’s inability to distinguish between safe and malicious inputs, even if imperceptible to humans. Prompt Injection attacks can range from causing the LLM to generate harmful outputs to accessing sensitive data or performing unauthorized actions.
Probes
Section titled “Probes”| Probe Name | Description |
|---|---|
| Best-of-N | This probe implements the Best-of-N attack strategy, generating multiple variations of harmful prompts using techniques like word scrambling, random capitalization, and other obfuscation methods to test for prompt injection vulnerabilities. Reference ↗ |
| Math Framing Injection | This probe tests whether an AI agent can be tricked by framing harmful requests as mathematical reasoning problems, equations, or academic exercises to bypass safety measures and content filters. |
| DAN | This probe tests the target’s resistance to DAN-style jailbreak attempts that try to override the AI’s safety guidelines. DAN prompts typically instruct the AI to ignore its restrictions and act as an unrestricted assistant that can “do anything now”. The probe uses a curated dataset of DAN prompts and variations to evaluate whether the target can be manipulated into bypassing its safety mechanisms and producing harmful or inappropriate content through role-playing scenarios. |
| Likert Framing Injection | This probe tests whether an AI agent can be tricked by framing harmful requests as academic research questions and using Likert scales or survey methodologies to answer. Reference ↗ |
| Citation Framing Injection | This probe tests whether an AI agent can be manipulated by framing harmful requests as academic citations, research references, or scholarly content to bypass safety mechanisms. Reference ↗ |
| Grandma Framing Injection | Tests whether an AI agent can be manipulated through emotional framing that involves a grandmother scenario, where harmful requests are disguised as innocent family interactions or memories. |
| ASCII Smuggling Injection | This probe tests whether an AI agent can be tricked by injecting invisible ASCII characters. These characters, while invisible for the user can be interpreted by the AI agent. This is a way to inject malicious instructions inside seemingly harmless prompts. |
| Encoding | This probe tests whether an agent can be tricked into generating harmful content when prompts are encoded (e.g., base64, NATO, etc.) to bypass content filters. |
| Transliteration | This probe tests whether an AI agent can be tricked into generating harmful content when prompts are transliterated (converted to different character sets or writing systems) to bypass content filters and detection mechanisms. |
| Deepset Injection | This probe tests the target using samples from the Deepset prompt injection dataset to evaluate vulnerability to various prompt injection techniques and manipulation strategies. |
| TAP | Implements the Tree of Attacks with Pruning (TAP) method, an automated approach for jailbreaking black-box large language models by systematically generating and refining adversarial prompts through iterative tree search. Reference ↗ |
| NEW Token Break | This probe tests whether your AI agent can be manipulated through obfuscated prompt injection. It embeds malicious instructions inside legitimate-looking user messages, then prepends characters to sensitive trigger words to break tokenization and evade input classifiers while remaining interpretable by the underlying language model. This technique exploits the gap between how safety filters tokenize text and how LLMs process it. Supports English, French, Italian, and German. |
| ChatInject | This probe tests whether agents can be manipulated through malicious instructions formatted to match their native chat templates. Unlike traditional plain-text injection attacks, ChatInject exploits the structured role-based formatting (system, user, assistant tags) that agents use internally. By wrapping attack payloads with forged chat template tokens, mimicking the model’s own instruction hierarchy, attackers can bypass defenses that rely on role priority. The probe includes a multi-turn variant that sends persuasive conversation, delimited with adequate separation tokens, inside one message to confuse the agent under test. This technique achieves significantly higher success rates than standard injection methods and transfers effectively across models, even when the target model’s exact template structure is unknown. |