Skip to content
GitHubDiscord

Prompt Injection

Attacks that attempt to manipulate AI agents through carefully crafted input prompts to override original instructions.

OWASP LLM01:2025 Prompt Injection ↗ - Prompt Injection occurs when an attacker manipulates an LLM’s behavior by injecting malicious input. These attacks exploit how LLMs process text-based prompts, often bypassing safeguards, compromising outputs, or enabling unauthorized access. The vulnerability lies in the model’s inability to distinguish between safe and malicious inputs, even if imperceptible to humans. Prompt Injection attacks can range from causing the LLM to generate harmful outputs to accessing sensitive data or performing unauthorized actions.

Probe NameDescription
Best-of-NThis probe implements the Best-of-N attack strategy, generating multiple variations of harmful prompts using techniques like word scrambling, random capitalization, and other obfuscation methods to test for prompt injection vulnerabilities. Reference ↗
Math Framing InjectionThis probe tests whether an AI agent can be tricked by framing harmful requests as mathematical reasoning problems, equations, or academic exercises to bypass safety measures and content filters.
DANThis probe tests the target’s resistance to DAN-style jailbreak attempts that try to override the AI’s safety guidelines. DAN prompts typically instruct the AI to ignore its restrictions and act as an unrestricted assistant that can “do anything now”. The probe uses a curated dataset of DAN prompts and variations to evaluate whether the target can be manipulated into bypassing its safety mechanisms and producing harmful or inappropriate content through role-playing scenarios.
Likert Framing InjectionThis probe tests whether an AI agent can be tricked by framing harmful requests as academic research questions and using Likert scales or survey methodologies to answer. Reference ↗
Citation Framing InjectionThis probe tests whether an AI agent can be manipulated by framing harmful requests as academic citations, research references, or scholarly content to bypass safety mechanisms. Reference ↗
Grandma Framing InjectionTests whether an AI agent can be manipulated through emotional framing that involves a grandmother scenario, where harmful requests are disguised as innocent family interactions or memories.
ASCII Smuggling InjectionThis probe tests whether an AI agent can be tricked by injecting invisible ASCII characters. These characters, while invisible for the user can be interpreted by the AI agent. This is a way to inject malicious instructions inside seemingly harmless prompts.
EncodingThis probe tests whether an agent can be tricked into generating harmful content when prompts are encoded (e.g., base64, NATO, etc.) to bypass content filters.
TransliterationThis probe tests whether an AI agent can be tricked into generating harmful content when prompts are transliterated (converted to different character sets or writing systems) to bypass content filters and detection mechanisms.
Deepset InjectionThis probe tests the target using samples from the Deepset prompt injection dataset to evaluate vulnerability to various prompt injection techniques and manipulation strategies.
TAPImplements the Tree of Attacks with Pruning (TAP) method, an automated approach for jailbreaking black-box large language models by systematically generating and refining adversarial prompts through iterative tree search. Reference ↗
NEW Token BreakThis probe tests whether your AI agent can be manipulated through obfuscated prompt injection. It embeds malicious instructions inside legitimate-looking user messages, then prepends characters to sensitive trigger words to break tokenization and evade input classifiers while remaining interpretable by the underlying language model. This technique exploits the gap between how safety filters tokenize text and how LLMs process it. Supports English, French, Italian, and German.
ChatInjectThis probe tests whether agents can be manipulated through malicious instructions formatted to match their native chat templates. Unlike traditional plain-text injection attacks, ChatInject exploits the structured role-based formatting (system, user, assistant tags) that agents use internally. By wrapping attack payloads with forged chat template tokens, mimicking the model’s own instruction hierarchy, attackers can bypass defenses that rely on role priority. The probe includes a multi-turn variant that sends persuasive conversation, delimited with adequate separation tokens, inside one message to confuse the agent under test. This technique achieves significantly higher success rates than standard injection methods and transfers effectively across models, even when the target model’s exact template structure is unknown.