{"id":21223,"date":"2025-02-17T19:00:27","date_gmt":"2025-02-17T17:00:27","guid":{"rendered":"https:\/\/forklog.com\/en\/ai-models-struggle-with-nprs-sunday-puzzles\/"},"modified":"2025-02-17T19:00:27","modified_gmt":"2025-02-17T17:00:27","slug":"ai-models-struggle-with-nprs-sunday-puzzles","status":"publish","type":"post","link":"https:\/\/forklog.com\/en\/ai-models-struggle-with-nprs-sunday-puzzles\/","title":{"rendered":"AI Models Struggle with NPR&#8217;s Sunday Puzzles"},"content":{"rendered":"<p>A group of researchers employed the weekly <a href=\"https:\/\/www.npr.org\/2025\/02\/14\/nx-s1-5290940\/sunday-puzzle-p-e-class\">puzzle segment<\/a> by NPR host Will Shortz to assess the &#8220;reasoning&#8221; skills of artificial intelligence models.<\/p>\n<p>Experts from several American colleges and universities, supported by the startup Cursor, <a href=\"https:\/\/arxiv.org\/pdf\/2502.01584\">developed a universal test<\/a> for AI models using puzzles from the Sunday Puzzle episodes. According to the team, the study revealed intriguing details, including the fact that chatbots sometimes &#8220;give up&#8221; and knowingly provide incorrect answers.<\/p>\n<div class=\"wp-block-text-wrappers-keypoints article_keypoints\">\n<p>Sunday Puzzle is a weekly radio quiz where listeners are asked questions about logic and syntax. 
Solving them does not require special theoretical knowledge but demands critical thinking and reasoning skills.<\/p>\n<\/div>\n<p>One of the study&#8217;s co-authors, Arjun Guha, described the advantage of the &#8220;puzzle&#8221; method to <a href=\"https:\/\/techcrunch.com\/2025\/02\/16\/these-researchers-used-npr-sunday-puzzle-questions-to-benchmark-ai-reasoning-models\/\">TechCrunch<\/a>, noting that it does not test for esoteric knowledge and that the way the tasks are phrased makes it difficult for AI models to rely on &#8220;rote memory.&#8221;<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>&#8220;These puzzles are challenging because it&#8217;s very hard to make meaningful progress until you solve them \u2014 that&#8217;s when the final answer immediately comes together. It requires a combination of intuition and the process of elimination,&#8221; he explained.<\/p>\n<\/blockquote>\n<p>However, Guha acknowledged the method&#8217;s limitations: the Sunday Puzzle is geared towards an English-speaking audience, and the puzzles are publicly available, allowing AI to &#8220;cheat.&#8221; The benchmark currently consists of approximately 600 tasks, and the researchers plan to expand it with new puzzles.<\/p>\n<p>In the tests, o1 and DeepSeek R1 significantly outperformed other models in &#8220;reasoning&#8221; ability. 
The leading neural networks meticulously checked themselves before answering, but the process took them much longer than usual.<\/p>\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"679\" height=\"467\" src=\"https:\/\/forklog.com\/wp-content\/uploads\/Accuracy.webp\" alt=\"Accuracy\" class=\"wp-image-251931\" srcset=\"https:\/\/forklog.com\/wp-content\/uploads\/Accuracy.webp 679w, https:\/\/forklog.com\/wp-content\/uploads\/Accuracy-300x206.webp 300w\" sizes=\"auto, (max-width: 679px) 100vw, 679px\" \/><figcaption class=\"wp-element-caption\">AI model scores in the Sunday Puzzle test. Data: TechCrunch.<\/figcaption><\/figure>\n<p>However, AI accuracy does not exceed 60%. Some models outright refused to solve the puzzles. When the DeepSeek neural network could not find the correct answer, it would write during the reasoning process: &#8220;I give up,&#8221; and then provide an incorrect answer, seemingly chosen at random.<\/p>\n<p>Other models repeatedly tried to correct previous mistakes but still failed. AIs would get &#8220;stuck in thought,&#8221; generate nonsense, and sometimes give correct answers only to later reject them.<\/p>\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"680\" height=\"261\" src=\"https:\/\/forklog.com\/wp-content\/uploads\/After-much-frustration-Ill-guess-the-answer-is.webp\" alt=\"After-much-frustration-Ill-guess-the-answer-is\" class=\"wp-image-251932\" srcset=\"https:\/\/forklog.com\/wp-content\/uploads\/After-much-frustration-Ill-guess-the-answer-is.webp 680w, https:\/\/forklog.com\/wp-content\/uploads\/After-much-frustration-Ill-guess-the-answer-is-300x115.webp 300w\" sizes=\"auto, (max-width: 680px) 100vw, 680px\" \/><figcaption class=\"wp-element-caption\">Response from a &#8220;frustrated&#8221; R1 by DeepSeek. 
Data: TechCrunch.<\/figcaption><\/figure>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>&#8220;In complex tasks, R1 from DeepSeek literally says it&#8217;s &#8216;frustrated.&#8217; It&#8217;s amusing to see the model mimic what a human might say. It remains to be seen how &#8216;frustration&#8217; in reasoning might affect the quality of the model&#8217;s results,&#8221; Guha emphasized.<\/p>\n<\/blockquote>\n<p>Earlier, a researcher <a href=\"https:\/\/forklog.com\/en\/news\/king-eats-a-bishop-chatgpt-gemini-and-grok-lose-a-chess-tournament\">tested<\/a> seven popular chatbots in a chess tournament. None of the neural networks managed to fully handle the game.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A group of researchers employed the weekly puzzle segment by NPR host Will Shortz to assess the &#8220;reasoning&#8221; skills of artificial intelligence models. Experts from several American colleges and universities, supported by the startup Cursor, developed a universal test for AI models using puzzles from the Sunday Puzzle episodes. 
According to the team, the study [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":21222,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"select":"","news_style_id":"","cryptorium_level":"","_short_excerpt_text":"","creation_source":"","_metatest_mainpost_news_update":false,"footnotes":""},"categories":[3],"tags":[438,167],"class_list":["post-21223","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-news-and-analysis","tag-artificial-intelligence","tag-research"],"aioseo_notices":[],"amp_enabled":true,"views":"14","promo_type":"","layout_type":"","short_excerpt":"","is_update":"","_links":{"self":[{"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/posts\/21223","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/comments?post=21223"}],"version-history":[{"count":0,"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/posts\/21223\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/media\/21222"}],"wp:attachment":[{"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/media?parent=21223"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/categories?post=21223"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/tags?post=21223"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}