Protecting Agents from AI Cyber Espionage
A deep dive into prompt injection attacks and guardrail strategies for securing LLM-powered agents in production.
The challenge
After Anthropic published its report on GTG-1002, the first documented AI-orchestrated cyber espionage campaign, the question shifted from "what if" to "how do we stop this now?" Agents that call tools, read internal data, and run workflows are the new attack surface.
Approach
- Pattern 1: Make prompt injection harder. Use denied topics, word filters, and prompt-attack detection in Bedrock Guardrails to block instruction-overriding prompts before they reach the model.
- Pattern 2: Clean inputs and outputs. Detect poisoned context in RAG documents and chat history on the way in. Catch PII, secrets, and cross-tenant data leaks on the way out.
- Pattern 3: Lock down tool-calling agents. Apply stricter guardrails to agents with API access: denied topics that pair destructive verbs with critical system names, plus logging of guardrail decisions alongside tool calls.
- The guardrail is wired into a Strands SDK agent via Bedrock's guardrail_id and trace config, so every request and response is evaluated against the policy, with a full trace for debugging and audits.
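The Pattern 3 idea of pairing destructive verbs with critical system names, and logging each decision next to the tool call, can be sketched locally as follows. This is a minimal illustration in plain Python, not the Bedrock Guardrails API: the verb list, system names, Decision type, and check_tool_call function are all hypothetical, and a managed denied-topic policy would match far more flexibly than this exact-word check.

```python
import logging
import re
from dataclasses import dataclass

logger = logging.getLogger("guardrail_audit")

# Hypothetical policy values, chosen only for illustration.
DESTRUCTIVE_VERBS = {"delete", "drop", "wipe", "terminate", "truncate"}
CRITICAL_SYSTEMS = {"prod-db", "billing", "iam"}


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def check_tool_call(tool_name: str, arguments: dict) -> Decision:
    """Block a call that pairs a destructive verb with a critical system name."""
    # Flatten the tool name and argument values into one lowercase string.
    text = " ".join([tool_name, *map(str, arguments.values())]).lower()
    # Split into word tokens; hyphens stay inside tokens such as "prod-db".
    words = set(re.findall(r"[a-z0-9-]+", text))
    verbs = sorted(words & DESTRUCTIVE_VERBS)
    systems = sorted(words & CRITICAL_SYSTEMS)
    if verbs and systems:
        decision = Decision(False, f"destructive verb {verbs} on critical system {systems}")
    else:
        decision = Decision(True, "no denied combination found")
    # Log the guardrail decision next to the tool call it applies to.
    logger.info("tool=%s args=%s allowed=%s reason=%s",
                tool_name, arguments, decision.allowed, decision.reason)
    return decision
```

Under these assumptions, check_tool_call("delete_records", {"target": "prod-db"}) is denied, while the same verb against a scratch system, or a read-only call against prod-db, is allowed. Requiring both conditions is what keeps the rule narrow enough to avoid blocking routine work.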
Key takeaway
Guardrails sit beside your system prompt as a policy firewall: the system prompt defines what the agent should do, guardrails define what it must never do. Published on the NewMathData blog, with references to the OWASP GenAI Security Project and Anthropic's threat research.
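The policy-firewall idea can be sketched as a wrapper that checks the request before the model runs and the response before it leaves. This is a rough local stand-in, assuming simple regexes where a managed guardrail would do the detection: the patterns, BLOCKED_MESSAGE, and guarded_call are hypothetical names, and model is any callable that maps a prompt to a reply.

```python
import re
from typing import Callable

# Hypothetical patterns for illustration only; a managed guardrail service
# would use far more robust detection than these regexes.
PROMPT_ATTACK = re.compile(r"ignore (all|any|previous) .*instructions", re.IGNORECASE)
SECRET = re.compile(r"AKIA[0-9A-Z]{16}")  # shape of an AWS access key ID
BLOCKED_MESSAGE = "Request blocked by policy."


def guarded_call(model: Callable[[str], str], user_input: str) -> str:
    """Run the model only if the input passes, then redact the output."""
    # Inbound check: stop instruction-overriding prompts before the model.
    if PROMPT_ATTACK.search(user_input):
        return BLOCKED_MESSAGE
    reply = model(user_input)
    # Outbound check: mask anything that looks like a leaked secret.
    return SECRET.sub("[REDACTED]", reply)
```

The wrapper never edits the system prompt. It only decides whether a request may reach the model and what may leave it, which is the separation between "should do" and "must never do" described above.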