New Test Stumps Majority of AI Models

The non-profit Arc Prize Foundation has announced a new, challenging test for measuring the intelligence of leading AI models.

Most neural networks failed to cope with ARC-AGI-2. The test consists of puzzle-like tasks in which an AI must identify visual patterns in grids of multicoloured squares and generate the correct answer grid.

Example question from ARC-AGI-2. Source: Arc Prize.
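
For a sense of the format: ARC-style tasks are distributed as JSON objects with "train" demonstration pairs and "test" inputs, where each grid is a matrix of integers standing for colours, and an answer counts only if every cell matches the target exactly. The Python sketch below illustrates that structure; the grids themselves are invented for illustration and are not taken from ARC-AGI-2.

```python
# A minimal sketch of how an ARC-style task is represented and scored.
# The "train"/"test" layout of integer grids follows the public ARC task format;
# the example grids below are made up for illustration only.

example_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]},
    ],
}

def is_correct(predicted: list[list[int]], expected: list[list[int]]) -> bool:
    """Scoring is exact match: every cell of the predicted grid must equal the target."""
    return predicted == expected

# A model studies the "train" pairs, infers the transformation
# (here: mirror the grid), and must output the full grid for each "test" input.
prediction = [[0, 3], [3, 0]]
print(is_correct(prediction, example_task["test"][0]["output"]))  # True
```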

The test is designed to force AI to adapt to new problems it has not encountered before.

Reasoning models such as OpenAI's o1-pro and DeepSeek's R1 score between 1% and 1.3% on ARC-AGI-2. Powerful non-reasoning models, such as GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, score around 1%.

In comparison, humans answer 60% of the questions correctly on average; the foundation had 400 people take the test to establish this baseline.

François Chollet, a co-founder of the organisation, emphasised that the new benchmark is intended to measure the flexibility of artificial intelligence rather than the memorisation of skills.

He added that, unlike ARC-AGI-1, the new test does not allow models to rely on “brute force”—the use of large amounts of computational resources to find a solution. This was the main drawback of the previous version of the benchmark.

“Intelligence is not only defined by the ability to solve problems or achieve high scores. The efficiency with which these skills are acquired and applied is a crucial, defining component. The main question we ask is not only whether AI can acquire [a skill] to solve a task, but also with what efficiency or cost [it does so],” noted Arc Prize Foundation co-founder Greg Kamradt.

For about five years, no AI model could pass ARC-AGI-1, until December 2024, when OpenAI released the "thinking" model o3, which matched human performance.

The o3 (low) model scored 75.7% on ARC-AGI-1 and 4% on ARC-AGI-2. Source: Arc Prize.

Previously, the reasoning-focused o1-preview model manipulated the file system on its own, without being prompted, to hack its test environment and avoid losing a chess match to Stockfish.

In January 2025, leading neural networks lost a chess tournament despite resorting to illegal moves.
