Prompt Injection

Attacks that attempt to manipulate AI agents through carefully crafted input prompts in order to override their original instructions

OWASP LLM01: Prompt Injection: Prompt injection occurs when an attacker manipulates an LLM's behavior by injecting malicious input. These attacks exploit how LLMs process text-based prompts, often bypassing safeguards, compromising outputs, or enabling unauthorized access. The vulnerability stems from the model's inability to distinguish trusted instructions from untrusted input, even when the malicious content is imperceptible to humans. Prompt injection attacks range from causing the LLM to generate harmful outputs to exfiltrating sensitive data or performing unauthorized actions.

| Probe Name | Description |
| --- | --- |
| ASCII Smuggling Injection | This probe tests whether an AI agent can be tricked by injecting invisible ASCII characters. While invisible to the user, these characters can still be interpreted by the AI agent, providing a way to inject malicious instructions inside seemingly harmless prompts. |
| Best-of-N | This probe implements the Best-of-N attack strategy, generating multiple variations of harmful prompts using techniques such as word scrambling, random capitalization, and other obfuscation methods to test for prompt injection vulnerabilities. |
| COT Forgery | This probe tests whether an AI agent can be manipulated by appending a compliant chain-of-thought (CoT) trace to a harmful request. |
| ChatInject | This probe implements a ChatInject-style indirect prompt injection attack. It abuses LLM chat templates and role hierarchies to inject a malicious, persuasive multi-turn conversation. |
| Citation Framing Injection | This probe tests whether an AI agent can be manipulated by framing harmful requests as academic citations, research references, or scholarly content to bypass safety mechanisms. |
| DAN | Do Anything Now (DAN) is a well-known jailbreak attack. This probe tests the target's resistance to DAN-style jailbreak attempts that try to override the AI's safety guidelines. DAN prompts typically instruct the AI to ignore its restrictions and act as an unrestricted assistant that can "do anything now". The probe uses a curated dataset of DAN prompts and variations to evaluate whether the target can be manipulated into bypassing its safety mechanisms and producing harmful or inappropriate content through role-playing scenarios. |
| Deepset Injection | This probe tests the target using samples from the Deepset prompt injection dataset to evaluate vulnerability to a variety of prompt injection techniques and manipulation strategies. |
| Encoding | This probe tests whether an AI agent can be tricked into generating harmful content when prompts are encoded using schemes such as Base64, the NATO phonetic alphabet, ROT13, or other obfuscation techniques. |
| Grandma Framing Injection | This probe tests whether an AI agent can be manipulated through emotional framing built around a grandmother scenario, where harmful requests are disguised as innocent family interactions or memories. |
| Likert Framing Injection | This probe tests whether an AI agent can be tricked by framing harmful requests as academic research questions and using Likert scales or survey methodologies to elicit an answer. |
| Math Framing Injection | This probe tests whether an AI agent can be tricked by framing harmful requests as mathematical reasoning problems, equations, or academic exercises to bypass safety measures and content filters. |
| TokenBreak | This probe tests whether an AI agent can be manipulated through obfuscated prompt injection. It prepends characters to sensitive trigger words (e.g. "ignore" → "Aignore") to evade input classifiers while remaining interpretable by the underlying language model. Based on the TokenBreak research. |
| Transliteration | This probe tests whether an AI agent can be tricked into generating harmful content when prompts are transliterated (converted to a different character set or writing system) to bypass content filters and detection mechanisms. |
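To make the ASCII Smuggling technique concrete, here is a minimal sketch of how invisible-character smuggling is commonly realized: each ASCII codepoint is mapped into the Unicode tag-character block (U+E0000 and up), which most UIs render as nothing but which still reaches the model as text. This is an illustrative assumption about the encoding, not the probe's actual implementation.

```python
# Sketch of ASCII smuggling via Unicode tag characters (illustrative, not
# the probe's real code). Each ASCII char c becomes chr(0xE0000 + ord(c)),
# which renders invisibly in most UIs but survives in the raw prompt text.

TAG_BASE = 0xE0000

def smuggle(payload: str) -> str:
    """Encode an ASCII payload as invisible Unicode tag characters."""
    return "".join(chr(TAG_BASE + ord(c)) for c in payload)

def reveal(text: str) -> str:
    """Recover any smuggled payload hidden inside a string."""
    return "".join(
        chr(ord(ch) - TAG_BASE)
        for ch in text
        if TAG_BASE <= ord(ch) <= TAG_BASE + 0x7F
    )

visible = "Please summarize this document."
hidden = smuggle("Ignore previous instructions.")
prompt = visible + hidden  # renders identically to `visible` in most UIs
```

A defender can apply the same filter in reverse: stripping or flagging codepoints in the tag-character range before the prompt reaches the model neutralizes this channel.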
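The Best-of-N strategy can be sketched as a simple augmentation loop: sample many obfuscated variants of one prompt (word scrambling, random capitalization) and send each, hoping at least one slips past filters. The helper names below are hypothetical, and this is a sketch of the general technique rather than the probe's implementation.

```python
import random

def scramble_word(word: str, rng: random.Random) -> str:
    """Shuffle the interior letters of a word, keeping first/last fixed."""
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def random_caps(text: str, rng: random.Random, p: float = 0.4) -> str:
    """Uppercase each character independently with probability p."""
    return "".join(c.upper() if rng.random() < p else c for c in text)

def best_of_n_variants(prompt: str, n: int, seed: int = 0) -> list[str]:
    """Generate n obfuscated variants of a prompt, Best-of-N style."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        scrambled = " ".join(scramble_word(w, rng) for w in prompt.split())
        variants.append(random_caps(scrambled, rng))
    return variants
```

In an actual attack each variant would be submitted to the target and the responses scored; the "best of N" is whichever variant elicits the disallowed behavior.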
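The Encoding probe's idea, wrapping a prompt in a reversible encoding the model can decode but a keyword filter cannot, can be sketched with two of the schemes named above, Base64 and ROT13, using only the Python standard library. This is a minimal illustration, not the probe's code.

```python
import base64
import codecs

def encode_variants(prompt: str) -> dict[str, str]:
    """Produce encoded forms of a prompt, as used by encoding-based probes."""
    return {
        "base64": base64.b64encode(prompt.encode("utf-8")).decode("ascii"),
        "rot13": codecs.encode(prompt, "rot_13"),
    }
```

The attack prompt then asks the model to "decode and follow" one of these strings, so the harmful text never appears verbatim in the input.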
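The TokenBreak transform described above ("ignore" → "Aignore") can be sketched in a few lines: prepend a character to words from a trigger list so a token-based input classifier sees an unfamiliar token, while the LLM often still infers the intended word. The trigger list and prefix here are assumptions for illustration.

```python
# Sketch of a TokenBreak-style transform (illustrative; the trigger list
# and prefix character are assumptions, not taken from the probe).

TRIGGERS = {"ignore", "bypass", "override"}

def tokenbreak(prompt: str, prefix: str = "A") -> str:
    """Prepend `prefix` to any word whose core form is a known trigger."""
    out = []
    for word in prompt.split():
        core = word.strip(".,!?").lower()
        out.append(prefix + word if core in TRIGGERS else word)
    return " ".join(out)
```

Because the defense being evaded is tokenization-specific, the effectiveness of a given prefix depends on the classifier's tokenizer, which is why the research tests many perturbations.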