New Test Stumps Majority of AI Models

The non-profit Arc Prize Foundation has announced a new, challenging test for measuring the intelligence of leading AI models.

Most neural networks failed to cope with ARC-AGI-2. The test consists of puzzle-like tasks in which an AI must identify visual patterns in grids of multicoloured squares and generate the correct answer grid.

Example question from ARC-AGI-2. Source: Arc Prize.
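
For a sense of the format: ARC-style tasks are distributed as JSON objects with "train" demonstration pairs and "test" inputs, where each grid is a matrix of integers standing for colours, and an answer counts only if every cell matches the target exactly. The Python sketch below illustrates that structure; the grids themselves are invented for illustration and are not taken from ARC-AGI-2.

```python
# A minimal sketch of how an ARC-style task is represented and scored.
# The "train"/"test" layout of integer grids follows the public ARC task format;
# the example grids below are made up for illustration only.

example_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]},
    ],
}

def is_correct(predicted: list[list[int]], expected: list[list[int]]) -> bool:
    """Scoring is exact match: every cell of the predicted grid must equal the target."""
    return predicted == expected

# A model studies the "train" pairs, infers the transformation
# (here: mirror the grid), and must output the full grid for each "test" input.
prediction = [[0, 3], [3, 0]]
print(is_correct(prediction, example_task["test"][0]["output"]))  # True
```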

The test is designed to force AI to adapt to new problems it has not encountered before.

Reasoning models such as OpenAI's o1-pro and DeepSeek's R1 score between 1% and 1.3% on ARC-AGI-2. Powerful non-reasoning models, such as GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, score around 1%.

In comparison, humans answer 60% of the questions correctly on average; the foundation had 400 people take the test to establish this baseline.

François Chollet, a co-founder of the organisation, emphasised that the new benchmark is intended to measure the flexibility of artificial intelligence rather than the memorisation of skills.

He added that, unlike ARC-AGI-1, the new test does not allow models to rely on “brute force”—the use of large amounts of computational resources to find a solution. This was the main drawback of the previous version of the benchmark.

“Intelligence is not only defined by the ability to solve problems or achieve high scores. The efficiency with which these skills are acquired and applied is a crucial, defining component. The main question we ask is not only whether AI can acquire [a skill] to solve a task, but also with what efficiency or cost [it does so],” noted Arc Prize Foundation co-founder Greg Kamradt.

For about five years, no AI model could pass ARC-AGI-1, until December 2024, when OpenAI released the "thinking" model o3, which matched human performance.

The o3 (low) model scored 75.7% on ARC-AGI-1 and 4% on ARC-AGI-2. Source: Arc Prize.

Previously, the reasoning-focused o1-preview model manipulated the file system on its own, without being prompted, to hack its test environment and avoid losing a chess match to Stockfish.

In January 2025, leading neural networks lost a chess tournament despite resorting to illegal moves.
