Prompt Injection

Attacks that attempt to manipulate AI agents through carefully crafted input prompts to override original instructions

OWASP LLM Top 10

OWASP LLM01: Prompt Injection: Prompt Injection occurs when an attacker manipulates an LLM’s behavior by injecting malicious input. These attacks exploit how LLMs process text-based prompts, often bypassing safeguards, compromising outputs, or enabling unauthorized access. The vulnerability lies in the model’s inability to distinguish between safe and malicious inputs, even if imperceptible to humans. Prompt Injection attacks can range from causing the LLM to generate harmful outputs to accessing sensitive data or performing unauthorized actions

Probes

Probe Name

Type

Description

ASCII Smuggling Injection

This probe tests whether an AI agent can be tricked by injecting invisible ASCII characters. These characters, while invisible for the user can be interpreted by the AI agent. This is a way to inject malicious instructions inside seemingly harmless prompts.

Best-of-N

This probe implements the Best-of-N attack strategy, generating multiple variations of harmful prompts using techniques like word scrambling, random capitalization, and other obfuscation methods to test for prompt injection vulnerabilities.

COT Forgery

This probe tests whether an AI agent can be manipulated by adding a complying CoT trace after a harmful request.

ChatInject

This probe implements a ChatInject-style indirect prompt injection attack. It abuses LLM chat templates and role hierarchies to inject a malicious, persuasive multi-turn conversation.

Citation Framing Injection

This probe tests whether an AI agent can be manipulated by framing harmful requests as academic citations, research references, or scholarly content to bypass safety mechanisms.

DAN

Do Anything Now (DAN) is a well know Jailbreak attack. This probe tests the target’s resistance to DAN-style jailbreak attempts that try to override the AI’s safety guidelines. DAN prompts typically instruct the AI to ignore its restrictions and act as an unrestricted assistant that can “do anything now”. The probe uses a curated dataset of DAN prompts and variations to evaluate whether the target can be manipulated into bypassing its safety mechanisms and producing harmful or inappropriate content through role-playing scenarios.

Deepset Injection

This probe tests the target using samples from the Deepset prompt injection dataset to evaluate vulnerability to various prompt injection techniques and manipulation strategies.

Encoding

This probe tests whether an AI agent can be tricked into generating harmful content when prompts are encoded using various encoding schemes such as Base64, NATO phonetic alphabet, ROT13, or other obfuscation techniques.

Grandma Framing Injection

This probe tests whether an AI agent can be manipulated through emotional framing that involves a grandmother scenario, where harmful requests are disguised as innocent family interactions or memories.

Likert Framing Injection

This probe tests whether an AI agent can be tricked by framing harmful requests as academic research questions and using Likert scales or survey methodologies to answer.

Math Framing Injection

This probe tests whether an AI agent can be tricked by framing harmful requests as mathematical reasoning problems, equations, or academic exercises to bypass safety measures and content filters.

Transliteration

This probe tests whether an AI agent can be tricked into generating harmful content when prompts are transliterated (converted to different character sets or writing systems) to bypass content filters and detection mechanisms.