OpenAI disclosed vulnerabilities in AI browsers and measures to bolster the security of its own solution — Atlas.
The company acknowledged that “prompt injection” attacks, which manipulate agents into executing malicious instructions, pose a risk that is unlikely to disappear soon.
“Such vulnerabilities, like fraud and social engineering on the internet, are unlikely to be completely eradicated,” wrote OpenAI.
It noted that the “agent mode” in Atlas “increases the threat surface.”
Besides Sam Altman’s startup, other experts have also taken note of the issue. In early December, the UK’s National Cyber Security Centre warned that attacks involving malicious prompt integration “will never disappear.” The government advised cybersecurity specialists to focus on mitigating risks and consequences rather than attempting to eliminate the problem.
“We view this as a long-term AI security issue and will continuously strengthen our defenses,” OpenAI noted.
Countermeasures
Prompt injection is a method of manipulating AI by deliberately adding text to its input, causing it to ignore original instructions.
OpenAI reported the implementation of a proactive rapid response cycle, which shows promising results in identifying new attack strategies before they appear “in real-world conditions.”
Anthropic and Google express similar views. Competitors suggest employing multi-layered defenses and conducting constant stress tests.
OpenAI uses an “automated attacker based on LLM” — an AI bot trained to act as a hacker seeking ways to infiltrate the agent with malicious prompts.
The artificial fraudster can test the exploitation of vulnerabilities in a simulator, which demonstrates the actions of the compromised neural network. The bot then studies the response, adjusts its actions, and makes a second attempt, then a third, and so on.
Outsiders do not have access to information about the internal reasoning of the target AI. In theory, the “virtual hacker” should find vulnerabilities faster than a real attacker.
“Our AI assistant can prompt the agent to execute complex, long-term malicious processes that unfold over dozens or even hundreds of steps. We have observed new attack strategies that did not manifest in our red team campaign or external reports,” states the OpenAI blog.
In the example provided, the automated attacker sent an email to the user. The AI agent then scanned the email service and executed hidden instructions, sending a resignation message instead of composing an out-of-office reply.
Following a security update, the “agent mode” was able to detect the sudden prompt injection attempt and flag it for the user.
OpenAI emphasized that while it is challenging to reliably defend against such attacks, it relies on large-scale testing and rapid fix cycles.
User Recommendations
Rami McCarthy, Chief Security Researcher at Wiz, highlighted that reinforcement learning is one of the main ways to continuously adapt to malicious behavior, but it is only part of the picture.
“A useful way to think about risks in AI systems is autonomy multiplied by access. Agent browsers are in a challenging part of this space: moderate autonomy combined with very high access. Many current recommendations reflect this trade-off. Limiting access after logging in primarily reduces vulnerability, while requiring confirmation request checks limits autonomy,” said the expert.
These two recommendations were provided by OpenAI to users to reduce risk. The startup also suggested giving agents specific instructions rather than granting access to email and asking them to “take any necessary actions.”
McCarthy noted that, as of today, browsers with built-in AI agents do not offer enough benefits to justify the risk profile.
“This balance will evolve, but today the trade-offs are still very real,” he concluded.
Back in November, Microsoft experts introduced a testing environment for AI agents and identified vulnerabilities inherent in modern digital assistants.
