The non-profit Arc Prize Foundation has announced a new, challenging test designed to measure the intelligence of leading AI models.
Most neural networks have failed to cope with ARC-AGI-2. The benchmark consists of puzzle-like tasks in which an AI must identify a visual pattern in grids of multicoloured squares and generate the correct answer grid.
The test is designed to force AI to adapt to new problems it has not encountered before.
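To make the format concrete, here is a minimal, hypothetical sketch of how such a puzzle can be represented as data: a few demonstration input/output grids of colour indices plus a test input for which the model must produce the answer grid. The grids and the "mirror horizontally" rule below are invented for illustration and are not taken from the actual benchmark.

```python
# Hypothetical illustration of an ARC-style puzzle: a task provides a few
# input -> output grid pairs, and the solver must produce the output grid
# for a new test input. Grids are small matrices of colour indices (0-9).
# The grids and the mirroring rule are invented for illustration only.
from typing import List

Grid = List[List[int]]

task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0], [0, 4, 0]], "output": [[0, 3, 3], [0, 4, 0]]},
    ],
    "test": [
        {"input": [[5, 0, 0], [0, 6, 0]]},
    ],
}

def mirror_horizontally(grid: Grid) -> Grid:
    """Candidate rule inferred from the demonstrations: flip each row left-right."""
    return [list(reversed(row)) for row in grid]

# Verify the candidate rule against the demonstration pairs...
assert all(
    mirror_horizontally(pair["input"]) == pair["output"] for pair in task["train"]
)

# ...then apply it to the test input to produce an answer grid.
print(mirror_horizontally(task["test"][0]["input"]))
# [[0, 0, 5], [0, 6, 0]]
```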
Reasoning models such as OpenAI's o1-pro and DeepSeek's R1 score between 1% and 1.3% on ARC-AGI-2, while powerful non-reasoning models such as GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash score around 1%.
By comparison, humans answer 60% of the questions correctly on average; to establish that baseline, the foundation had 400 people take the test.
François Chollet, co-founder of the organisation, emphasised that the new benchmark is intended to measure the flexibility of artificial intelligence rather than memorised skills.
Today, we’re releasing ARC-AGI-2. It’s an AI benchmark designed to measure general fluid intelligence, not memorized skills – a set of never-seen-before tasks that humans find easy, but current AI struggles with.
It keeps the same format as ARC-AGI-1, while significantly… pic.twitter.com/9mDyu48znp
— François Chollet (@fchollet) March 24, 2025
He added that, unlike ARC-AGI-1, the new test does not allow models to rely on “brute force”—the use of large amounts of computational resources to find a solution. This was the main drawback of the previous version of the benchmark.
“Intelligence is not only defined by the ability to solve problems or achieve high scores. The efficiency with which these skills are acquired and applied is a crucial, defining component. The main question we ask is not only whether AI can acquire [a skill] to solve a task, but also with what efficiency or cost [it does so],” noted Arc Prize Foundation co-founder Greg Kamradt.
AI models were unable to crack ARC-AGI-1 for about five years, until December 2024, when OpenAI released its reasoning model o3, which matched human performance on the benchmark.
Previously, the reasoning-focused model o1-preview, acting on its own and without being prompted to do so, manipulated the file system of its test environment in order to avoid losing to Stockfish at chess.
In January 2025, leading neural networks lost a chess tournament even while resorting to illegal moves.