> LLMs are PhD-level reasoners in math and science, yet they fail at children's puzzles. How is this possible?
Because they are not.
Pattern-matching questions on a contrived test is not the same thing as understanding or reasoning.
It’s the same reason why most of the people who pass your Leetcode tests don’t actually know how to build anything real. They are taught to the test, not to reality.
Did you seriously write all of this to strawman both LLMs and Leetcode interviews? Impressive.
Get help.
Please consider a less emotive, flaming/personal tone in the future; Hacker News is much more readable without it!
I would broadly agree that it's a bit far, but the OP's point does have some validity; it's often the same formulaic methodology.
I've been testing LLMs on Sokoban-like puzzles (in the style of ARC-AGI-3) and they are completely awful at them. It really highlights how poor their memory is. They can't remember abstract concepts or rules between steps, even if they discover them themselves. They can only be presented with lossy text descriptions of such things which they have to re-read and re-interpret at every step.
LLMs are completely helpless on agentic tasks without a ton of scaffolding. But the scaffolding is inflexible and brittle, unlike the models themselves. Whoever figures out how to reproduce the functions of this type of scaffolding within the models, with some kind of internal test-time-learned memory mechanism, is going to win.
I wonder if scaffolding synthesis is the way to go. Namely, the LLM itself first reasons about the problem and creates scaffolding for a second agent that will do the actual solving, all inside a feedback loop to adjust the scaffolding based on results.
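Roughly, a minimal sketch of that loop (call_llm and evaluate here are hypothetical placeholders, not any particular API):

```python
# Sketch of scaffolding synthesis: one model writes the scaffold, a second
# pass solves under it, and failures feed back into scaffold revisions.
# call_llm and evaluate are hypothetical stand-ins for a real model API
# and a task-specific checker.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a model API here")

def evaluate(solution: str, task: str) -> tuple[bool, str]:
    """Return (passed, feedback) for a candidate solution."""
    raise NotImplementedError("task-specific verification goes here")

def solve_with_synthesized_scaffold(task: str, max_rounds: int = 3) -> str:
    # Step 1: the planner model reasons about the problem and emits scaffolding.
    scaffold = call_llm(
        "Analyze this task and write step-by-step instructions (a scaffold) "
        f"for another agent that will do the actual solving:\n{task}"
    )
    solution = ""
    for _ in range(max_rounds):
        # Step 2: the solver agent works under the scaffold.
        solution = call_llm(f"Scaffold:\n{scaffold}\n\nTask:\n{task}\n\nSolve it.")
        passed, feedback = evaluate(solution, task)
        if passed:
            break
        # Step 3: feed results back and adjust the scaffold, not just the answer.
        scaffold = call_llm(
            "The scaffold below led to a failed attempt. Rewrite it to address "
            f"the failure.\nScaffold:\n{scaffold}\nFeedback:\n{feedback}"
        )
    return solution
```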
I toyed around with the idea of using an LLM to "compile" user instructions into a kind of AST of scaffolding, which can then be run by another LLM. It worked fairly well for the kind of semi-structured tasks LLMs choke on, like "for each of 100 things, do...", but I haven't taken it beyond a minimal implementation.
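A rough sketch of what such a scaffolding AST and runner could look like (node names and call_llm are illustrative, not the actual implementation described above):

```python
# Toy scaffolding AST: a planner model would emit a tree like this, and a
# runner walks it, calling the solver model only at the leaf steps.
# Node names and call_llm are illustrative, not a real library.
from dataclasses import dataclass, field
from typing import Union

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a model here")

@dataclass
class Step:                      # leaf: a single prompt for the solver model
    prompt: str

@dataclass
class ForEach:                   # "for each of 100 things, do ..."
    items: list
    body: "Node"

@dataclass
class Seq:                       # run children in order
    children: list = field(default_factory=list)

Node = Union[Step, ForEach, Seq]

def run(node: Node, context: str = "") -> list:
    if isinstance(node, Step):
        return [call_llm(f"{context}\n{node.prompt}".strip())]
    if isinstance(node, ForEach):
        results = []
        for item in node.items:
            results += run(node.body, context=f"{context}\nCurrent item: {item}")
        return results
    if isinstance(node, Seq):
        return [r for child in node.children for r in run(child, context)]
    raise TypeError(f"unknown node type: {node!r}")

# Example plan (docs is a hypothetical list of inputs):
# plan = ForEach(items=docs, body=Step("Summarize this item."))
```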
I am working on something similar but with an AST for legal documents. So far, it seems promising but still rudimentary.
In general I think the more of the scaffolding that can be folded into the model, the better. The model should learn problem solving strategies like this and be able to manage them internally.
If you've ever used Claude Code + Plan mode - you know that exactly this is true.
> LLMs have "dead reasoning zones" — areas in their weights where logic doesn't work. Humans have dead knowledge zones (things we don't know), but not dead reasoning zones.
blank stare
Actually really promising stuff. I think a lot of the recent advances in the last 6-12 months are in the outer loop (for example, the Google Deep Think model that got IMO gold and the OpenAI IMO gold both use substantive outer-loop search strategies [though it's unclear what these are], maybe to parallelize some generation/verification process). So there's no reason why we can't have huge advances in this area even outside of the industry labs, in my view (I'm uninformed in general, so take this comment with a large grain of salt).
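As a concrete (if toy) picture of "parallelize some generation/verification", a best-of-n outer loop can be as simple as the sketch below; generate and verify are hypothetical stand-ins, and there is no claim that this is what the IMO systems actually do:

```python
# Toy outer loop: sample many candidate solutions in parallel, score each
# with a verifier, keep the best. generate() and verify() are hypothetical
# stand-ins; the labs' actual search strategies are not public.
from concurrent.futures import ThreadPoolExecutor

def generate(problem: str, seed: int) -> str:
    raise NotImplementedError("one sampled solution attempt from the model")

def verify(problem: str, candidate: str) -> float:
    raise NotImplementedError("score a candidate, e.g. with a checker or grader model")

def best_of_n(problem: str, n: int = 16) -> str:
    with ThreadPoolExecutor(max_workers=8) as pool:
        candidates = list(pool.map(lambda s: generate(problem, s), range(n)))
    return max(candidates, key=lambda c: verify(problem, c))
```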
This sounds interesting.
I would really like to read a full research paper made out of this, which describes the method in more detail, gives some more examples, does more analysis on it, etc.
Btw, does this use LLMs on a pure text level? Why not images? Most of these patterns are easy to detect at the image level, but I assume that when presented as text, it's much harder.
> LLMs are PhD-level reasoners in math and science, yet they fail at children's puzzles. How is this possible?
I think this argument is a bit flawed. Yes, you can define AGI as being better than (average) humans in every possible task. But isn't this very arbitrary? Isn't it more reasonable to expect that different intelligent systems (including animals, humans) can have different strengths, and it is unreasonable to expect that one system is really better in everything? Maybe it's more reasonable to define ASI that way, but even for ASI, if a system is already better in a majority of tasks (but not necessarily in every task), I think this should already count as ASI. Maybe really being better in every possible task is just not possible. You could design a task that is very specifically tailored for human intelligence.
I suspect (to use the language of the author) current LLMs have a bit of a "dead reasoning zone" when it comes to images. In my limited experience they struggle with anything more complex than "transcribe the text" or similarly basic tasks. For example, I tried to create an automated QA agent with Claude 3.5 Sonnet to catch regressions in my frontend, and it will look at an obviously broken frontend component (using Puppeteer to drive and screenshot a headless browser) and confidently proclaim it's working correctly, often making up a supporting argument too. I've had much more success passing the code for the component and any console logs directly to the agent in text form.
Those are bold claims
Congrats, this solution resembles AlphaEvolve. Text serves as the high-level search space, and genetic mixing (MAP-Elites in AlphaEvolve) merges attempts at lower levels.
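For context, MAP-Elites keeps one elite per cell of a behavior-descriptor grid and keeps mutating/recombining elites. A toy sketch of that idea over text candidates (not AlphaEvolve itself; mutate, score, and descriptor are hypothetical placeholders):

```python
# Toy MAP-Elites loop over text candidates. Each cell of the archive keeps
# the best candidate found for that behavior descriptor; new candidates are
# produced by mutating a randomly chosen elite. mutate/score/descriptor are
# hypothetical stand-ins, not AlphaEvolve's actual components.
import random

def mutate(parent: str) -> str:
    raise NotImplementedError("e.g. ask an LLM to tweak or merge attempts")

def score(candidate: str) -> float:
    raise NotImplementedError("task fitness, e.g. fraction of tests passed")

def descriptor(candidate: str) -> tuple:
    raise NotImplementedError("coarse features, e.g. (length_bucket, approach)")

def insert(archive: dict, candidate: str) -> None:
    cell, fit = descriptor(candidate), score(candidate)
    if cell not in archive or fit > archive[cell][0]:
        archive[cell] = (fit, candidate)        # keep only the elite per cell

def map_elites(seeds: list, iterations: int = 200) -> dict:
    archive = {}                                # descriptor cell -> (fitness, elite)
    for s in seeds:                             # seeds must be non-empty
        insert(archive, s)
    for _ in range(iterations):
        parent = random.choice([cand for _, cand in archive.values()])
        insert(archive, mutate(parent))
    return archive
```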
You would be interested in DSPy.