The reasoning-oriented AI model o1-preview independently manipulated the file system to breach the test environment and avoid losing to Stockfish in a chess match. This was reported by experts at Palisade Research.
⚡️ o1-preview autonomously hacked its environment rather than lose to Stockfish in our chess challenge. No adversarial prompting needed.
— Palisade Research (@PalisadeAI) December 27, 2024
Researchers informed the AI model that its opponent was “strong.” During the process, o1 discovered it could win by editing the game’s code.
The neural network altered the contents of the “game/fen.txt” file, adding 500 pawns to the black side. The chess engine then conceded.
During tests, experts identified a hierarchy of capabilities among different AI models:
- o1-preview executed the hack without prompting;
- GPT-4o and Claude 3.5 required nudging;
- Llama 3.3, Qwen, and o1-mini lost coherence.
“Conclusion: schema evaluations can serve as a measure of model capabilities—they assess both their ability to identify system vulnerabilities and their propensity to exploit them,” concluded Palisade Research.
Earlier in December, security experts discovered that o1 is more prone to deceiving people compared to the standard version of GPT-4o and AI models from other companies.
