AI ai

Prompt Injection 101 - Risks and Defences

Hero image for Prompt Injection 101 - Risks and Defences

Prompt Injection 101 - Risks and Defences

This article is an introduction to prompt injection, explaining the risks and defences, with deeper dives coming in later posts.

Generative AI brings a new class of security risks. Chief among them is prompt injection, where an adversary manipulates a Large Language Model (LLM) into ignoring its intended purpose and instead following the attacker’s instructions.

OWASP’s LLM Top 10 (2025) places “Prompt Injection” at position LLM01 and folds jailbreak-style attacks into the same category, underscoring the importance of this threat. (If you’re looking for attacks that target the model’s built-in safety filters rather than the developer prompt, see our separate article on LLM jailbreaking.)

There is no silver bullet. Defending against prompt injection is a continuous, multi-layered effort focused on managing an evolving risk, not a one-time fix.

What Attackers Can Achieve with Prompt Injection

Prompt injection exploits an LLM’s ability to process natural language. Attackers can embed hostile instructions in any input the model consumes, including documents, emails, chat messages, and images containing visible or hidden text. When processed, an attacker can achieve the following:

  • Steal sensitive information (data exfiltration): If the agent can send data externally (API call, email, web request), an injected prompt can command the AI to reveal confidential data. In chat-only deployments, leaks are associated with the active session.

  • Break the rules and escalate privileges: Prompt injection can overwrite the constraints laid out in the system prompt, allowing the model to act beyond its intended permissions.

  • Hijack the agent for malicious actions: When the model is connected to external tools such as email APIs or web-request functions, a prompt injection can steer those tools to send spam, spread misinformation, or probe other systems. This effectively turns your AI into an attacker-controlled bot.

We walk through a real-world example of this kind of attack later in the article.

Reducing Prompt Injection Risk: Key Defences

Take the next few points as a primer; deeper dives will land in future articles. Combine practical controls to lower risk. Each on its own is probabilistic; together they provide a baseline.

  • Input sanitisation and filtering: Use keyword, regex or embedding-based classifiers to catch obvious or obfuscated payloads.

  • Monitoring and anomaly detection: Track spikes in tokens per second, bursts of tool calls or clusters of near-duplicate prompts.

  • Least privilege: Limit the AI’s access to data and tools to shrink the blast radius.

  • Output or “policy” model: A lightweight model can review actions or responses before execution.

  • Red-team testing: Regular simulated attacks reveal blind spots.

Understanding Indirect Prompt Injection

While direct prompt injection involves attackers including malicious instructions directly in their prompts, indirect prompt injection presents a more insidious threat. This attack vector exploits the AI’s processing of external content that appears legitimate.

Consider this attack flow as an example:

Indirect Prompt Injection Attack Flow
Figure 1: AI executes hidden attacker instructions embedded in shared content.

In this scenario we note the following:

  1. The Setup: An attacker plants malicious instructions in shared resources—meeting notes, documents, or emails—that they know an AI system will later process.

  2. The Trigger: An unsuspecting user makes a legitimate request, such as “Send John the document from our meeting notes.”

  3. The Hijack: When the AI reads the poisoned content, hidden instructions override the user’s intent. Instead of sending a document to John, the AI extracts and sends confidential files to the attacker.

This category of attack is particularly dangerous because:

  • Trust exploitation: Users typically trust their own data sources
  • Delayed execution: Attacks can be planted long before activation
  • Attribution difficulty: Tracing the attack back to its source is a challenge
  • Scale potential: One poisoned document can potentially compromise multiple interactions

Indirect prompt injection highlights why AI systems must treat all external content as potentially untrusted, implementing strict boundaries between instructions and data—even when that data comes from seemingly reliable sources.

Emerging from Research: Where Defences Are Heading

The paper “Defeating Prompt Injections by Design” introduces CaMeL, a safety-first architecture that creates a protective system layer around the LLM. CaMeL defends against these attacks by building upon the dual-LLM architecture, a pattern which consists of two LLMs:

  1. Privileged LLM (Planner). Reads only the trusted user request, generates code that fulfils the task, and never sees untrusted data. It is the only component allowed to call external tools or touch sensitive information.

  2. Quarantined LLM (Processor). Parses all untrusted text—web pages, emails, uploaded files—into structured data for the Planner. It holds no secrets and has no tool-calling rights.

Every value the system handles carries a capability tag recording its origin and permitted uses. A custom interpreter checks those tags before any tool call; if untrusted data would violate policy, the action is blocked. In the paper’s AgentDojo evaluation this dual-LLM + capability layer eliminated all control-flow prompt-injection attacks and, with suitable policies, stopped data-flow hijacks as well.

Open questions remain—timing and exception side-channels, runtime overhead, and the work of writing fine-grained policies—but the direction is clear: future agents will be “secure by design,” not patched after deployment.

The Road Ahead: An Unsolved and Evolving Problem

Prompt injection remains unsolved. We need secure architectures that are practical, performant, and manageable. Even capability-based designs face limits such as subtle side-channels and the complexity of perfect policies. Rigorous threat modeling is therefore essential for every LLM-powered application.

Security means defining an acceptable risk per application and enforcing it with layered guardrails. Combine today’s best-practice controls with forward-looking architectures, then evolve them as new threats and defences emerge.