The Anthropic team published a guide titled Zero Trust for AI agents on the Claude blog, focusing on the secure deployment of autonomous AI agents in corporate environments. The document outlines key risks associated with agent systems and a cybersecurity approach for businesses.
AI Accelerates Attack Cycles
According to Anthropic, advanced models have reduced the time between the discovery of a vulnerability and its exploitation from months to hours. The company suggests considering not only AI-accelerated attacks on infrastructure but also the risks posed by the agents themselves, which can interpret goals, select tools, and perform multi-step actions without constant human involvement.
The guide is based on Zero Trust principles: do not trust by default, verify every action, and assume potential compromise. Anthropic references recommendations from the NIST SP 800-207, published in 2020, and a series of Zero Trust Implementation Guidelines that the NSA began releasing in 2026. The guide is positioned as a practical framework for security teams, architects, and engineers, rather than a universal compliance scheme.
Key threats listed in the document include direct and indirect prompt injections, tool contamination, identity and privilege abuse, memory and context poisoning, and supply chain attacks.
Direct prompt poisoning is described as the insertion of malicious instructions through user input, while indirect poisoning occurs through web pages, emails, documents, and other external sources that the agent processes during its operation.
The document examines the replacement of legitimate tools with malicious ones and dangerous call chains, where individually safe tools combine to produce risky outcomes. Anthropic uses the concepts of “blast radius” and “least agency”: this involves not only minimal access rights but also strict limitations on the agent’s actions, call frequency, and accessible areas.
Zero Trust for Agent Systems
For protection, the company proposes a three-tier maturity model and a set of basic technical measures. At the initial level, the guide recommends assigning each agent instance a unique cryptographic identity, using short-lived tokens, applying “deny by default,” and “role-based access control.” For agents working with untrusted inputs like web content and documents, the “sandbox execution” method is deemed practically mandatory.
At higher levels, Anthropic suggests implementing:
- the mTLS standard with mutual client-server authentication using digital certificates;
- hardware-bound identity through HSM or TPM, as well as remote attestation.
Static API keys and shared service account passwords are deemed unsuitable even for the basic level.
A significant section is dedicated to observability. Anthropic recommends detailed logging of all agent actions, including tool calls, data access, and external communications, and then sending events to SIEM for real-time correlation. Key metrics include dwell time and coverage. For critical systems, the target time for anomaly detection is set at within an hour. The guide also suggests building a “traceability matrix” to link each agent action to the original request and reconstruct the full decision chain.
Future of Security Operations Center — Human-Controlled Agents
In terms of response, Anthropic formulates the principle: automate the bureaucracy around incidents, but not the key decisions. Agents and models are proposed to handle artifact collection and initial sorting, conduct parallel investigation branches, and draft postmortems. Decisions on containment, incident disclosure, and client communication should remain human responsibilities. The same approach is applied to “protection operations” — with a mention of transitioning from classic SOAR to agent-based operations.
The document also provides quantitative benchmarks. Anthropic cites Microsoft’s Spotlighting research, where the success rate of indirect prompt poisoning attacks in experiments dropped from over 50% to less than 2%. The company also shares its own results on using “constitutional classifiers,” which reportedly block over 95% of jailbreak attempts with minimal false positive growth.
In the supply chain section, Anthropic recommends using AI-BOM, OpenSSF Scorecard, dependency audits, and access feasibility analysis. As evidence, the company cites its own research indicating that 250 malicious documents are sufficient to embed a backdoor in models ranging from 600 million to 13 billion parameters.
Ultimately, Anthropic concludes that AI agents require more than just targeted filters and perimeter defenses. The company advocates for building security around identity, minimal privileges, pre-limited damage, and constant action verification. According to Anthropic, organizations with stronger foundational security architecture, rather than the most advanced AI, will be better positioned.
In June, the Anthropic team warned about the risks of achieving recursive self-improvement in AI.
