When billion-dollar AIs break down over puzzles a child can do, it's time to rethink the hype | Gary Marcus

TruthLens AI Suggested Headline:

"Apple Research Paper Questions the Reasoning Capabilities of Large Language Models"

AI Analysis Average Score: 8.4
These scores (0-10 scale) are generated by TruthLens AI's analysis, assessing the article's objectivity, accuracy, and transparency. Higher scores indicate better alignment with journalistic standards.

TruthLens AI Summary

A recent research paper by Apple has sparked significant discussion in the tech community by challenging the prevailing belief in the reasoning capabilities of large language models (LLMs) and their newer counterparts, large reasoning models (LRMs). The study highlighted that although models like ChatGPT, Claude, and Deepseek may appear intelligent, they falter when faced with increased complexity. This observation aligns with the views of venture capitalist Josh Wolfe, who suggested that Apple has effectively 'Gary Marcus’d' the reasoning abilities of LLMs, emphasizing the need for critical assessment of AI's limitations. The paper underscores that while these models excel at pattern recognition, they struggle with novel scenarios that exceed their training data, a limitation that has been consistently pointed out for over two decades. The findings suggest that despite advancements, the fundamental weaknesses in LLMs remain unaddressed, challenging the notion that simply scaling these models will lead to improvements in their reasoning abilities.

The research further illustrated the inadequacy of LLMs through classic puzzles like the Tower of Hanoi. Apple’s tests revealed that leading generative models could barely solve problems involving seven discs, achieving less than 80% accuracy, and struggled significantly with eight discs. This performance raises concerns about the reliability of LLMs in solving problems that have been tackled using classical AI techniques for decades. The paper’s co-lead author noted that even when provided with the solution algorithm, these models failed to demonstrate logical and intelligent reasoning. The study calls into question the anthropomorphism often associated with LLMs, highlighting that they do not mimic human problem-solving strategies effectively. Ultimately, the paper suggests that while LLMs may serve useful functions in areas like coding and brainstorming, they are not a substitute for robust, well-defined algorithms necessary for genuine artificial general intelligence (AGI), indicating that reliance on generative AI for complex tasks could lead to unreliable outcomes.

TruthLens AI Analysis

The article presents a critical examination of the capabilities of large language models (LLMs) and their limitations in reasoning, as highlighted by a recent research paper from Apple. This assessment appears to challenge the prevailing narrative about the effectiveness of AI technologies in handling complex tasks.

Purpose of the Publication

The primary intent behind sharing this article seems to be to question the hype surrounding AI's reasoning capabilities, specifically LLMs. By referencing the research paper's findings, the article aims to inform readers about the shortcomings of these models, thus fostering a more cautious and informed perspective on AI technology.

Public Perception

The article likely seeks to create skepticism regarding the overinflated expectations of AI systems, especially among investors and tech enthusiasts. It aims to shift the dialogue from unbridled optimism to a more nuanced understanding of AI's capabilities and limitations.

Hidden Aspects

There appears to be no overt attempt to conceal information; rather, the article emphasizes the need for transparency regarding AI capabilities. However, it may inadvertently downplay the advancements made in AI by focusing on its failures, which could lead to a skewed perception of the technology's overall progress.

Truthfulness of the Content

The research referenced is likely credible, given that it comes from Apple, a major player in the tech industry. The claims made in the article are supported by empirical evidence, thereby enhancing its reliability.

Societal Implications

The discussion surrounding the limitations of AI could lead to increased scrutiny in the tech sector. Businesses may become more cautious about investments in AI technologies, and policymakers might push for stricter regulations regarding AI development and deployment.

Community Support

The article may resonate more with communities that are critical of technology, such as ethicists, academics, and AI skeptics. These groups often advocate for responsible AI use and may see this article as reinforcing their views.

Market Impact

The publication could influence the stock prices of companies heavily invested in AI technologies. Investors may reassess their positions based on the perceived viability of LLMs and related technologies, potentially affecting the valuations of companies like OpenAI, Google, or Apple.

Geopolitical Context

In a broader context, the article addresses the ongoing discourse about the role of AI in society and its implications for global power dynamics. Countries that heavily invest in AI technologies may face increased scrutiny regarding their ethical use and the impact on job markets and privacy.

Use of AI in Writing

There is a possibility that AI tools were employed in drafting this article, particularly in structuring the argumentation or generating specific phrases. However, the critical perspective and analytical depth suggest a human touch in the writing process.

Manipulative Elements

While the article does not overtly manipulate information, the choice of language and emphasis on limitations could be seen as a form of steering public opinion against AI technologies. The framing of LLMs as unreliable may inadvertently foster fear or distrust among the general populace.

In conclusion, the article serves as a significant critique of AI capabilities, pushing for a more realistic understanding of what these technologies can achieve. It encourages readers to approach AI advancements with caution and critical thinking.

Unanalyzed Article Content

A research paper by Apple has taken the tech world by storm, all but eviscerating the popular notion that large language models (LLMs, and their newest variant, LRMs, large reasoning models) are able to reason reliably. Some are shocked by it, some are not. The well-known venture capitalist Josh Wolfe went so far as to post on X that “Apple [had] just GaryMarcus’d LLM reasoning ability” – coining a new verb (and a compliment to me), referring to “the act of critically exposing or debunking the overhyped capabilities of artificial intelligence … by highlighting their limitations in reasoning, understanding, or general intelligence”.

Apple did this by showing that leading models such as ChatGPT, Claude and Deepseek may “look smart – but when complexity rises, they collapse”. In short, these models are very good at a kind of pattern recognition, but often fail when they encounter novelty that forces them beyond the limits of their training, despite being, as the paper notes, “explicitly designed for reasoning tasks”.

As discussed later, there is a loose end that the paper doesn’t tie up, but on the whole, its force is undeniable. So much so that LLM advocates are already partly conceding the blow while hinting at, or at least hoping for, happier futures ahead.

In many ways the paper echoes and amplifies an argument that I have been making since 1998: neural networks of various kinds can generalise within a distribution of data they are exposed to, but their generalisations tend to break down beyond that distribution. A simple example of this is that I once trained an older model to solve a very basic mathematical equation using only even-numbered training data. The model was able to generalise a little bit: it could solve for even numbers it hadn’t seen before, but it was unable to do so for problems where the answer was an odd number.
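
For concreteness, here is a minimal sketch of that kind of test (an illustrative reconstruction in Python, not the original 1998 setup): a small network is fitted on binary codes whose values are always even, then probed on odd numbers. The encoding, network size and library here are assumptions made purely for the example.

```python
# Illustrative reconstruction (not the original experiment): learn the identity
# function over 8-bit binary codes, training only on even numbers, so the
# lowest-order output bit never sees a target of 1.
import numpy as np
from sklearn.neural_network import MLPRegressor

def to_bits(n, width=8):
    # Encode an integer as a vector of bits, index 0 being the lowest-order bit.
    return np.array([(n >> i) & 1 for i in range(width)], dtype=float)

train_nums = np.arange(0, 256, 2)                     # even numbers only
X_train = np.stack([to_bits(n) for n in train_nums])
y_train = X_train.copy()                              # identity: output == input

model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=5000, random_state=0)
model.fit(X_train, y_train)

odd_test = np.stack([to_bits(n) for n in (3, 41, 255)])
pred = model.predict(odd_test)

# The lowest-order output has only ever been trained toward 0, so in runs of
# this kind it typically stays near 0 and odd inputs come back looking even.
print(pred[:, 0].round(2))
```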

More than a quarter of a century later, when a task is close to the training data, these systems work pretty well. But as they stray further away from that data, they often break down, as they did in the Apple paper’s more stringent tests. Such limits arguably remain the single most important serious weakness in LLMs.

The hope, as always, has been that “scaling” the models by making them bigger would solve these problems. The new Apple paper resoundingly rebuts these hopes. They challenged some of the latest, greatest, most expensive models with classic puzzles, such as the Tower of Hanoi – and found that deep problems lingered. Combined with numerous hugely expensive failures in efforts to build GPT-5 level systems, this is very bad news.

The Tower of Hanoi is a classic game with three pegs and multiple discs, in which you need to move all the discs on the left peg to the right peg, never stacking a larger disc on top of a smaller one. With practice, though, a bright (and patient) seven-year-old can do it.
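
For comparison with the models’ struggles described below, here is a minimal sketch of the textbook recursive procedure in Python (an illustration only; it is not the algorithm Apple supplied to the models):

```python
# Textbook recursive solution to the Tower of Hanoi: a few lines of
# conventional code handle the eight-disc case the models stumble over.
def hanoi(n, source="A", target="C", spare="B", moves=None):
    # Return the list of (from_peg, to_peg) moves that transfers n discs.
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)   # park the n-1 smaller discs on the spare peg
    moves.append((source, target))               # move the largest disc to the target peg
    hanoi(n - 1, spare, target, source, moves)   # stack the smaller discs back on top of it
    return moves

print(len(hanoi(7)), len(hanoi(8)))  # 127 and 255 moves for the seven- and eight-disc cases
```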

What Apple found was that leading generative models could barely do seven discs, getting less than 80% accuracy, and pretty much couldn’t get scenarios with eight discs correct at all. It is truly embarrassing that LLMs cannot reliably solve Hanoi.

And, as the paper’s co-lead-author Iman Mirzadeh told me via DM, “it’s not just about ‘solving’ the puzzle. We have an experiment where we give the solution algorithm to the model, and [the model still failed] … based on what we observe from their thoughts, their process is not logical and intelligent”.

The new paper also echoes and amplifies several arguments that Arizona State University computer scientist Subbarao Kambhampati has been making about the newly popular LRMs. He has observed that people tend to anthropomorphise these systems, to assume they use something resembling “steps a human might take when solving a challenging problem”. And he has previously shown that in fact they have the same kind of problem that Apple documents.

If you can’t use a billion-dollar AI system to solve a problem that Herb Simon (one of the actual godfathers of AI) solved with classical (but out of fashion) AI techniques in 1957, the chances that models such as Claude or o3 are going to reach artificial general intelligence (AGI) seem truly remote.

So what’s the loose thread that I warned you about? Well, humans aren’t perfect either. On a puzzle like Hanoi, ordinary humans actually have a bunch of (well-known) limits that somewhat parallel what the Apple team discovered. Many (not all) humans screw up on versions of the Tower of Hanoi with eight discs.

But look, that’s why we invented computers, and for that matter calculators: to reliably compute solutions to large, tedious problems. AGI shouldn’t be about perfectly replicating a human; it should be about combining the best of both worlds: human adaptiveness with computational brute force and reliability. We don’t want an AGI that fails to “carry the one” in basic arithmetic just because sometimes humans do.

Whenever people ask me why I actually like AI (contrary to the widespread myth that I am against it), and think that future forms of AI (though not necessarily generative AI systems such as LLMs) may ultimately be of great benefit to humanity, I point to the advances in science and technology we might make if we could combine the causal reasoning abilities of our best scientists with the sheer compute power of modern digital computers.

What the Apple paper shows, most fundamentally, regardless of how you define AGI, is that these LLMs that have generated so much hype are no substitute for good, well-specified conventional algorithms. (They also can’t play chess as well as conventional algorithms, can’t fold proteins like special-purpose neurosymbolic hybrids, can’t run databases as well as conventional databases, etc.)

What this means for business is that you can’t simply drop o3 or Claude into some complex problem and expect them to work reliably. What it means for society is that we can never fully trust generative AI; its outputs are just too hit-or-miss.

One of the most striking findings in the new paper was that an LLM may well work in an easy test set (such as Hanoi with four discs) and seduce you into thinking it has built a proper, generalisable solution when it has not.

To be sure, LLMs will continue to have their uses, especially for coding and brainstorming and writing, with humans in the loop.

But anybody who thinks LLMs are a direct route to the sort of AGI that could fundamentally transform society for the good is kidding themselves.

This essay was adapted from Gary Marcus’s newsletter, Marcus on AI

Gary Marcus is a professor emeritus at New York University, the founder of two AI companies, and the author of six books, including Taming Silicon Valley

Source: The Guardian