Apple Machine Learning Research has published a paper titled The Illusion of Thinking, which examines the capabilities and limitations of Large Reasoning Models (LRMs) when faced with increasingly complex puzzles. The study finds that as puzzle complexity grows, these models hit a “collapse” threshold beyond which accuracy breaks down and, counterintuitively, they reduce their reasoning effort, highlighting limits to how well their reasoning scales.
In their experiments, the Apple researchers selected four puzzle problems, including the well-known Tower of Hanoi, and tested several LRMs, such as o3-mini and DeepSeek-R1, alongside standard Large Language Models (LLMs). The complexity of each puzzle could be adjusted, for example by varying the number of disks in the Tower of Hanoi. The study found that model behavior fell into three distinct regimes: on simple problems, reasoning and non-reasoning models performed similarly well; at medium complexity, reasoning models using Chain-of-Thought (CoT) inference outperformed standard LLMs; and at high complexity, both groups’ performance “collapsed to zero.”
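The Tower of Hanoi is a convenient complexity dial because the shortest solution grows exponentially with the number of disks: 2^n − 1 moves for n disks. The snippet below is a minimal illustration of that scaling using the classic recursive solver; it is not taken from Apple’s evaluation setup.

```python
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Classic recursive Tower of Hanoi solver.

    Returns the list of (from_peg, to_peg) moves that transfers
    n disks from `source` to `target`.
    """
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # clear the n-1 smaller disks out of the way
    moves.append((source, target))              # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)  # stack the n-1 disks back on top of it
    return moves

# The optimal solution length is 2**n - 1, so difficulty grows exponentially with disk count.
for n in range(1, 11):
    solution = hanoi(n)
    assert len(solution) == 2**n - 1
    print(f"{n} disks -> {len(solution)} moves")
```

Each added disk roughly doubles the length of a correct solution, which lets the researchers increase compositional depth without changing the puzzle’s rules.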
According to Apple,
“In this study, we probe the reasoning mechanisms of frontier LRMs through the lens of problem complexity… Our findings reveal fundamental limitations in current models: despite sophisticated self-reflection mechanisms, these models fail to develop generalizable reasoning capabilities beyond certain complexity thresholds.”
These insights challenge prevailing assumptions about LRM capabilities and suggest that current methodologies may face inherent barriers to achieving generalizable reasoning.
Understanding Large Reasoning Models
LRMs such as o3 and DeepSeek-R1 are advanced LLMs fine-tuned to generate step-by-step reasoning, effectively “thinking out loud”, before producing a final answer. This approach lets them outperform their standard LLM counterparts on coding, mathematics, and science benchmarks. During the experiments, the Apple team analyzed the reasoning traces the models generated. They observed that on simpler problems, models often “overthink”: the correct solution appears early in the trace, yet the models continue to explore incorrect ideas. On medium-complexity problems, models explored incorrect solutions before arriving at the correct one.
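As a rough illustration of how such a trace can be scored, the sketch below checks where in a sequence of candidate solutions the first correct one appears, using a small Tower of Hanoi simulator. It assumes candidate move lists have already been extracted from the model’s chain of thought as (from_peg, to_peg) pairs; the function names are hypothetical and not Apple’s tooling.

```python
def is_valid_solution(moves, n_disks):
    """Simulate a Tower of Hanoi move sequence and check whether it solves the puzzle.

    `moves` is a list of (from_peg, to_peg) pairs over pegs "A", "B", "C";
    all disks start on peg "A" and must end up on peg "C".
    """
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}
    for src, dst in moves:
        if not pegs[src]:
            return False                     # illegal: moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                     # illegal: larger disk on top of a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n_disks, 0, -1))


def first_correct_position(candidates, n_disks):
    """Return the relative position (0.0 = start, 1.0 = end) of the first
    correct candidate solution in a trace, or None if none is correct."""
    for i, candidate in enumerate(candidates):
        if is_valid_solution(candidate, n_disks):
            return i / max(len(candidates) - 1, 1)
    return None
```

On simple instances, a low relative position matches the “overthinking” pattern described above: the right answer shows up early, but the model keeps generating alternatives afterwards.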
Community Reactions and Expert Opinions
Apple’s paper has ignited a wide-ranging debate within the AI community. Gary Marcus, a cognitive scientist and vocal critic of current AI developments, commented on the research, stating:
“What the Apple paper shows, most fundamentally, regardless of how you define [Artificial General Intelligence (AGI)], is that LLMs are no substitute for good well-specified conventional algorithms. They also can’t play chess as well as conventional algorithms, can’t fold proteins like special-purpose neurosymbolic hybrids, can’t run databases as well as conventional databases, etc.”
Meanwhile, open-source developer and AI commentator Simon Willison offered a different perspective:
“I’m not interested in whether or not LLMs are the ‘road to AGI’. I continue to care only about whether they have useful applications today, once you’ve understood their limitations. Reasoning LLMs are a relatively new and interesting twist on the genre. They are demonstrably able to solve a whole bunch of problems that previous LLMs were unable to handle, hence why we’ve seen a rush of new models from OpenAI and Anthropic and Gemini and DeepSeek and Qwen and Mistral… They’re already useful to me today, whether or not they can reliably solve the Tower of Hanoi.”
Limitations and Future Directions
The researchers acknowledge several limitations in their work, particularly that their experiments relied primarily on “black box” API calls, which restricted their ability to examine the models’ internal states. They also concede that using puzzles means their conclusions may not transfer to all reasoning domains.
The announcement comes as the field of AI continues to grapple with the challenge of developing models that not only perform well on specific tasks but also exhibit generalizable reasoning capabilities. As researchers push the boundaries of what LRMs can achieve, the insights from Apple’s study may guide future developments and encourage a reevaluation of current approaches.
Looking ahead, the AI community is keenly observing how these findings will influence the development of more robust and versatile reasoning models. As technology evolves, the quest for models that can seamlessly handle complex reasoning tasks remains a pivotal goal.