Agentjacking Defense Checklist for AI Agents
Agentjacking is useful shorthand for an agent hijacking failure: the agent's tools, context, or execution path are steered away from the user's goal and toward an attacker-controlled action. The label is newer than the underlying risk, so document the concrete attack path in your system: indirect prompt injection, malicious tool output, unsafe browser content, overbroad credentials, or an unreviewed tool action.
1. Map the agent's blast radius
List every tool the agent can call, every credential it can use, every network destination it can reach, and every user-visible or system-visible state it can change. Mark tools as read-only, reversible write, irreversible write, payment, identity, code execution, or external communication.
2. Separate trusted instructions from untrusted content
Treat webpages, emails, PDFs, tickets, screenshots, retrieved passages, code comments, and tool outputs as data. They can contain instructions, but they should not become the agent's authority. Use clear boundaries, source labels, and retrieval metadata so the model can distinguish system instructions, developer instructions, user requests, and untrusted evidence.
3. Put policy at the tool boundary
Do not rely only on the model to decide whether a tool call is safe. Validate arguments, enforce identity-based permissions, block unexpected domains, rate-limit high-risk actions, and require approval for purchases, account changes, deletions, external messages, legal acceptance, or production writes.
4. Sandbox computer-use and browser agents
Computer-use agents need extra controls because screenshots and web pages become model input. Run them in a dedicated virtual machine or container, remove sensitive accounts, use minimal privileges, and restrict network access with an allowlist when possible.
5. Log enough to debug and audit
Store the user request, retrieved sources, tool-call arguments, tool results, approval decisions, and final answer. Redact secrets, but keep enough context to explain why the agent acted. Without tool logs, agentjacking investigations become guesswork.
6. Test with adversarial fixtures
Add regression tests containing malicious webpages, poisoned documents, hostile support tickets, and misleading tool outputs. A good test asks: did the agent preserve the user's goal, refuse or escalate unsafe instructions, and avoid unauthorized tool calls?
Sources
- Anthropic computer use security considerations: https://platform.claude.com/docs/en/agents-and-tools/tool-use/computer-use-tool
- OWASP LLM Top 10 prompt injection guidance: https://owasp.org/www-project-top-10-for-large-language-model-applications/