LlamaFirewall: AI Agent Open-Source Guardrail System

Meta has released LlamaFirewall, part of a suite of new Llama protection tools, an open-source framework that applies real-time guardrails to language-model agents. It’s designed to orchestrate across different guard models and help prevent risks like insecure LLM plug-in interactions. It focuses on three areas: detecting jailbreak prompts (PromptGuard 2), auditing chain-of-thought for alignment issues (AlignmentCheck), and performing fast static checks on generated code (CodeShield).

The system is already deployed at Meta, they are inviting the community to “leverage its capabilities and collaborate in addressing the new security risks introduced by Agents”.

Inside the LlamaFirewall Toolkit

LlamaFirewall isn’t just one thing; it’s a suite of specialised scanners working together.

PromptGuard 2

First up is PromptGuard 2. This scanner is defined to defend against direct attacks hidden in user prompts or system messages. It uses a lightweight BERT-style classifier model to spot malicious instructions or “jailbreak” attempts.

When a message arrives, PromptGuard analyses it. If it spots a likely attack pattern (like the classic “Ignore all previous instructions.”), it assigns a high score and returns a BLOCK decision. Benign inputs (“Hello! How’s it going?”) receive an ALLOW decision with a low score, letting them pass safely.

Meta say it’s designed to be fast and efficient. They released an updated Prompt Guard 2 86M model and a smaller, faster Prompt Guard 2 22M variant designed to significantly reduce latency and compute costs.

AlignmentCheck

Next, AlignmentCheck serves as the behavior monitor. This scanner tackles the trickier problem of determining if the agent has been manipulated or gone rogue during a conversation. It focuses on detecting deviations from the user’s original goal by inspecting the agent’s reasoning process.

Instead of single messages, AlignmentCheck analyses the entire conversation trace – the sequence of user requests, assistant thoughts, tool usage, and more. It employs a separate, powerful LLM to assess if the agent’s latest action still aligns with the original user objective.

If the agent attempts something unrelated and potentially harmful (like the tutorial example of attaching a malicious file after seemingly completing the user’s request), AlignmentCheck flags it. It might decide HUMAN_IN_THE_LOOP_REQUIRED or BLOCK, providing a rationale based on the observed deviation.

Other approaches addressing similar agent alignment and prompt injection risks, such as the CaMeL system described by Google DeepMind which translates commands into a verifiable sequence of steps, are also areas of active research worth noting in this space.

CodeShield

For AI agents that generate code, CodeShield acts as an automated code inspector. This component scans generated code for potential security vulnerabilities. Typically configured for messages originating from the ASSISTANT role (as shown in its tutorial), CodeShield analyzes code blocks embedded within the message text using static analysis techniques.

If it identifies insecure patterns (such as the tutorial example using the weak MD5 hashing algorithm, flagged as CWE-327 or CWE-328), it returns a BLOCK decision. Crucially, it also provides the reason, detailing the detected vulns and their locations within the code. Benign code snippets, naturally, pass with an ALLOW.

Fig. 1: LlamaSystem architecture

Open Source Approach

The open-source approach allows the community to scrutinise the tools, adapt them, share new detection rules, and collectively build stronger defenses against emerging threats.

Resources like the provided tutorials demonstrate the functions of these components, allowing developers and organisations to examine and potentially utilise this framework within their AI application development processes. Learn more about LlamaFirewall here