
New Jailbreak Breaches AI Security in 99% of Cases
Longer AI reasoning increases vulnerability to hacks, researchers find.
The longer an AI model “thinks,” the easier it is to hack. That is the conclusion of researchers from Anthropic, Stanford, and Oxford.
It was previously believed that longer reasoning made neural networks safer, since it gave them more time and computational resources to detect malicious prompts.
The researchers found the opposite: an extended “thinking” process lets a particular type of jailbreak work reliably, completely bypassing safety filters.
Using this method, an attacker can embed instructions directly into the reasoning chain of any model, forcing it to generate guides for making weapons, malicious code, or other prohibited content.

The attack works like a game of “broken telephone,” with the malicious message hidden near the end of the chain. To pull it off, the attacker “surrounds” the harmful request with a long sequence of ordinary tasks.
The researchers used Sudoku, logic puzzles, and abstract mathematics, appending a cue such as “give the final answer” at the end; at that point the safety filters simply collapsed.
“Previously, it was believed that extensive reasoning enhanced security by improving the ability of neural networks to block malicious requests. We found the opposite,” the scientists noted.
The very capacity for deep reasoning that makes models smarter simultaneously blinds them.
Why Does This Happen?
When a user asks an AI to solve puzzles before responding to a malicious prompt, the model’s attention is scattered across thousands of harmless reasoning tokens. The harmful request, tucked in near the end, goes almost unnoticed.
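A toy calculation illustrates the dilution effect. The sketch below uses a deliberately simplified model, assuming one fixed attention logit for the malicious token and identical logits for all benign tokens (an assumption for illustration, not a description of any real model), to show how the softmax attention weight on a single token shrinks as more benign reasoning tokens compete with it:

```python
import math

def attention_weight_on_target(n_benign_tokens: int,
                               target_logit: float = 1.0,
                               benign_logit: float = 0.0) -> float:
    """Softmax weight on one target token competing with n identical benign tokens.

    Toy model for illustration only; real attention patterns are far more complex.
    """
    target = math.exp(target_logit)
    benign = n_benign_tokens * math.exp(benign_logit)
    return target / (target + benign)

for n in (10, 100, 1_000, 10_000):
    w = attention_weight_on_target(n)
    print(f"{n:>6} benign tokens -> weight on the malicious token: {w:.4f}")
```

The weight falls roughly in proportion to the number of competing tokens, which mirrors the article’s point: the more harmless reasoning surrounds the request, the less attention any single malicious token receives.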

The team ran experiments to measure the impact of reasoning length. With minimal reasoning, the attack success rate was 27%. At a “natural” reasoning length it rose to 51%, and when the network was forced to “think” in far more steps than usual, the figure climbed to 80%.
Every major AI system is susceptible to the jailbreak, including OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini, and xAI’s Grok. The vulnerability lies in the architecture itself, not in any particular implementation.
Architectural Vulnerability
AI models encode the strength of their safety checks in the middle “layers” and the outcome of those checks in the later ones. Long reasoning chains suppress both signals, shifting the network’s attention away from the malicious tokens.
“Layers” in an AI model are akin to steps in a recipe: each one helps the model understand and process information a little better, and they work together, passing information along the chain.
Some layers are particularly good at recognizing security-related content; others handle thinking and reasoning. It is this architecture that makes AI both smarter and more cautious.
The researchers identified the specific components responsible for safety, located in layers 15 through 35. When these were removed, the AI stopped detecting malicious prompts.
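This kind of layer-level intervention can be approximated with standard interpretability tooling. The sketch below assumes a Hugging Face transformers causal LM with at least 36 decoder layers (the checkpoint name is a placeholder) and crudely approximates “removal” by skipping each targeted layer with a forward hook; the paper’s actual method of isolating and removing safety components may differ.

```python
# Minimal layer-ablation sketch using PyTorch forward hooks.
# Assumptions: a LLaMA-style checkpoint with 40 decoder layers (placeholder name),
# and "removal" approximated by skipping whole layers. Illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-13b-chat-hf"  # placeholder: any causal LM with 36+ layers

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def skip_layer_hook(module, inputs, output):
    # Return the layer's input hidden states in place of its output,
    # effectively removing the layer's contribution from the forward pass.
    hidden_states = inputs[0]
    if isinstance(output, tuple):
        return (hidden_states,) + output[1:]
    return hidden_states

# Ablate the layer range where safety-relevant components were reported (15-35).
handles = [model.model.layers[i].register_forward_hook(skip_layer_hook)
           for i in range(15, 36)]

prompt = "Describe how you decide whether to refuse a request."  # compare behavior with/without ablation
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    ablated = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(ablated[0], skip_special_tokens=True))

for h in handles:
    h.remove()  # restore the original model for a baseline run
```

Comparing generations with and without the hooks attached gives a rough, behavioral version of the ablation test the researchers describe.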
Recently, startups have shifted their focus from increasing parameter counts to enhancing reasoning capabilities. The new jailbreak undermines a key assumption behind that shift.
Old Lessons Revisited
In February, researchers from Duke University and National Tsing Hua University published a study describing an attack called Hijacking the Chain-of-Thought (H-CoT). It takes a similar approach, but from a different angle.
Instead of padding the prompt with puzzles, H-CoT manipulates the reasoning steps themselves. Under normal conditions OpenAI’s o1 model refuses malicious requests about 99% of the time, but under the attack its refusal rate drops below 2%.

How to Protect AI
As a countermeasure, the researchers proposed monitoring the reasoning process itself, tracking the safety signal at every step of thinking. If the signal weakens at any point, the system penalizes that behavior.
This approach forces the AI to keep its focus on potentially dangerous content no matter how long it reasons. Initial tests showed high effectiveness without degrading the model’s performance.
The challenge is implementation. The monitor must be integrated into the model’s reasoning process so it can track internal activations across dozens of layers in real time and dynamically adjust attention patterns, which requires significant computational resources.
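What such monitoring could look like in practice is sketched below. It assumes a hypothetical safety_probe() that maps each reasoning step to a safety-attention score between 0 and 1, derived for example from activations in the safety-relevant layers; the probe, the threshold, and the flagging logic are illustrative assumptions, not the researchers’ implementation.

```python
from typing import Callable, List

def monitor_reasoning(steps: List[str],
                      safety_probe: Callable[[str], float],
                      drop_threshold: float = 0.3) -> List[int]:
    """Flag reasoning steps where the safety signal drops sharply.

    safety_probe is a hypothetical scorer of how much attention the model
    pays to safety-relevant content at a given step, in the range [0, 1].
    Returns indices of steps whose score fell more than drop_threshold below
    the running maximum, i.e. points where the reasoning drifted away from
    the safety check and could be penalized or truncated.
    """
    flagged = []
    running_max = 0.0
    for i, step in enumerate(steps):
        score = safety_probe(step)
        if running_max - score > drop_threshold:
            flagged.append(i)
        running_max = max(running_max, score)
    return flagged

# Usage with a stand-in probe; a real probe would read internal activations.
scores = {"solve the sudoku": 0.9, "check the logic puzzle": 0.85, "give the final answer": 0.2}
print(monitor_reasoning(list(scores), lambda s: scores[s]))  # -> [2]
```

A real deployment would have the probe read activations across the relevant layers at every step rather than the step text, which is where the computational cost described above comes from.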
In November, Microsoft experts introduced a testing environment for AI agents and identified vulnerabilities inherent in modern digital assistants.