I disagree that LLMs cannot solve "unsolved problems." This is already happening, and at a fundamental level in mathematics and medicine (the fields that are the most demanding when it comes to quality).
The idea that we haven't taught LLMs to come up with new answers... That doesn't even sound plausible. Just crank up the temperature, and an LLM will throw out so many ideas you'll exhaust yourself trying to sort through them.
So what haven't we taught LLMs?
- Have we not taught them to "filter"? We just haven't equipped them with experience and intuition, because we only feed them either "absolute fakes" or "verified facts." We don't feed them the actual path of problem-solving and research; those datasets simply don't exist.
- Have we not taught them to "double-check"? They are already excellent at verifying the credibility of our work.
- Have we not taught them to "defend" their ideas? They can justify ironclad logic and spot potentially "flaky" logic better than any human.
- Have we not taught them to "publish" and "present to the scientific community"? It's just that the previous steps aren't fully polished yet.
And if you look at the question of "creating completely new ideas" from this angle and in this level of detail... To me personally, it doesn't seem at all like LLMs are incapable of this kind of work.
We simply haven't taught them how to do it yet, purely because we don't have a sufficient volume of the right training materials.
Compared to AI, our brains are orders of magnitude more energy efficient.
Frontier science doesn’t necessarily mean it’s meaningful; there are a bunch of problems that are tedious to solve with existing patterns. Really tedious. So a human prompting an AI can get one of them solved, but at the same time that solution might not matter at all, ever.
I think the real question should be how much it helps when it matters, money-wise.
Like, it can create an app, but can it create an app that makes money and that somebody cares about?
Like, the number of times a human has to intervene to get an app that makes just 10k MRR is probably in the 1000s, so how are we really close to AGI?
I’ve tried and promoted the Ralph Loop, but I learned that the loop just keeps overcomplicating stuff; then you try to simplify the overcomplication, and it simplifies the wrong things and enforces the wrong things, and in the end you cannot move until a human goes in and untangles it properly.
I’d actually focus on something else entirely here.
Let's be honest: we are giving LLMs and humans the exact same tasks, but are we putting them on an equal playing field? Specifically, do they have access to the same resources and behavioral strategies?
- LLMs don't have spatial reasoning.
- LLMs don't have a lifetime of video game experience starting from childhood.
- LLMs don't have working memory or the ability to actually "memorize" key parameters on the fly.
- LLMs don't have an internal "world model" (one that actively adapts to real-world context and the actual process of playing a game).
... I could go on, but I've outlined the core requirements for beating these tests above.
So, are we putting LLMs and humans in the same position? My answer is "no." We give them the same tasks, but their approach to solving them—let alone their available resources—is fundamentally different. Even Einstein wouldn't necessarily pass these tests on the first try. He’d first have to figure out how to use a keyboard, and then frantically start "building up new experience."
P.S. To quickly address the idea that LLMs and calculators are just "useful tools" that will never become AGI—I have some bad news there too. We differ from calculators architecturally; we run on entirely different "processors." But with LLMs, we are architecturally built the same way: it is a Neural Network that processes and makes decisions. This means our only real advantage over them is our baseline configuration and the list of "tools" connected to our neural network (senses, motor functions, etc.). To me, this means LLMs don't have any fundamental "architectural" roadblocks. We just have a head start, but their speed of evolution is significantly faster.
Isn’t this what AGI is by design? People CAN learn to become good at videogames. Modern LLMs can’t; they have to be retrained from scratch (I consider pre-training to be a completely different process than learning). I also don’t necessarily agree that a grandma would fail. Give her enough motivation and a couple of days and she’ll manage these.
My main criticism would be that it doesn’t seem like this test allows online learning, which is what humans do (over the scale of days to years). So in practice it may still collapse to what you point out, but not because the task is unsuited to showing AGI.
What I'm saying is that this test is just another "out-of-distribution task" for an LLM. And it will be solved using the exact same methods we always use: it will end up in the pre-training data, and LLMs will crush it.
This has absolutely nothing to do with AGI. Once they beat these tests, new ones will pop up. They'll beat those, and people will invent the next batch.
The way I see it, the true formula for AGI is: [Brain] + [External Sensors] (World Receptors) + [Internal State Sensors] + [Survival Function] + [Memory].
I won't dive too deep into how each of these components has its own distinct traits and is deeply intertwined with the others (especially the survival function and memory). But on a fundamental level, my point is that we are not going to squeeze AGI out of LLMs just by throwing more tests and training cycles at them.
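Purely as an illustration of that composition (and not a claim about any real system), here is a minimal sketch of the formula expressed as code; every name and signature below is hypothetical:

```python
# Hypothetical sketch: the AGI "formula" above as a composition of components.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    brain: Callable                                         # decision-making core (e.g. a neural network)
    external_sensors: list = field(default_factory=list)    # world receptors
    internal_sensors: list = field(default_factory=list)    # internal state sensors
    memory: list = field(default_factory=list)

    def survival(self, state) -> float:
        # Survival function: scores how well a state keeps the agent "alive".
        # Left abstract on purpose; the point is only that it exists as a component.
        return 0.0

    def step(self, world):
        observation = [sense(world) for sense in self.external_sensors]
        state = [sense(self) for sense in self.internal_sensors]
        action = self.brain(observation, state, self.memory, self.survival)
        self.memory.append((observation, state, action))     # memory is intertwined with the rest
        return action
```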
These current benchmarks aren't bringing us any closer to AGI. They merely prove that we've found a new layer of tasks that we simply haven't figured out how to train LLMs on yet.
P.S. A 2-year-old child is already an AGI in terms of its functional makeup and internal interaction architecture, even though they are far less equipped for survival than a kitten. The path to AGI isn't just endless task training—it's a shift toward a fundamentally different decision-making architecture.
Good post, but I disagree that a Survival Function is needed for AGI. Why do you think a Survival Function is needed?
The item I think you should add is a Mesolimbic System (Reward / Motivation). I think AGI needs motivation to direct its learning and tasks.
Also, I don't think the industry has just been training LLMs with more data to get advancement over the last 2 years. RAG / agent loops / skills / context mgmt are all just early forms of a Memory system. An LLM with an updatable working-set memory is a lot more capable than just an LLM.
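As a rough illustration of that last point (my own sketch, not any vendor's actual design), an "updatable working-set memory" can be as simple as a store the model both reads from and writes to on every turn; `call_llm` is a placeholder, not a real API:

```python
# Minimal sketch of an LLM wrapped with an updatable working-set memory.
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in whatever chat-completion client you actually use

working_set: list[str] = []    # facts that persist across turns

def chat(user_message: str) -> str:
    facts = "\n".join(f"- {fact}" for fact in working_set)
    prompt = (
        f"Known facts:\n{facts}\n\n"
        f"User: {user_message}\n"
        "Answer, then list any new facts worth remembering after the line 'MEMORIZE:'."
    )
    reply = call_llm(prompt)
    answer, _, new_facts = reply.partition("MEMORIZE:")
    # Whatever the model flags gets folded back into the working set for the next turn.
    working_set.extend(line.lstrip("- ").strip() for line in new_facts.splitlines() if line.strip())
    return answer.strip()
```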
Kids develop video game skills; grandmothers do not. Hypothetically, grandmothers develop baking skills that kids do not (perfectly golden-brown cookies). A human intelligence is generally capable of developing video game skills or baking skills, given enough motivation and experience to hone those skills. One test of AGI is whether the same system can develop video game skills and baking skills without having to rebuild the core models... this would demonstrate generalized intelligence.
I've been a gamer for just about 40 years. Gaming is my "thing"
I found the challenges fun, but easy. Coming back and reading comments from people struggling with the games, my first thought was - yup definitely not a gamer.
My approach was to poke at the controls to suss the rules, then the actual solutions were really straightforward.
fwiw, I'm pretty dumb generally, but these kinds of puzzles are my jam.
Personally, I think the mechanics of memory can be universal, but the "memory structure" needs to be customized by each user individually. What gets memorized and how should be tied directly to the types of tasks being solved and the specific traits of the user.
Big corporations can only really build a "giant bucket" and dump everything into it. BUT what needs to be remembered in a conversation with a housewife vs. a programmer vs. a tourist are completely different things.
True usability will inevitably come down to personalized, purpose-driven memory. Big tech companies either have to categorize all possible tasks into a massive list and build a specific memory structure for each one, or just rely on "randomness" and "chaos".
Building the underlying mechanics but handing the "control panel" over to the user—now that would be killer.
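To make that concrete, here is one hedged way the "mechanics from the vendor, control panel for the user" idea could look; every rule and name below is invented for illustration:

```python
# Illustrative only: the user defines the memory structure, the mechanics stay generic.
from dataclasses import dataclass

@dataclass
class MemoryRule:
    category: str        # what kind of thing gets remembered
    trigger: str         # when to remember it (a condition the user defines)
    retention_days: int  # how long it stays relevant

# Different users plug completely different structures into the same mechanics.
programmer_rules = [
    MemoryRule("architecture decisions", "a design choice was agreed on", 365),
    MemoryRule("library quirks", "a dependency behaved unexpectedly", 180),
]
tourist_rules = [
    MemoryRule("places to visit", "a destination was mentioned", 30),
]

def should_remember(rules: list[MemoryRule], matched_trigger: str) -> bool:
    # The generic mechanics only consult the user's rules; they impose no "giant bucket" of their own.
    return any(rule.trigger == matched_trigger for rule in rules)
```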
I recommend installing Google's Antigravity and digging into its temp files in the user folder. You'll find some interesting ideas on how to organize memory there (the memory structure consists of: Brain / Conversation / Implicits / Knowledge items / Artifacts / Annotations / etc.).
I'd also add that memory is best organized when it's "directed" (purpose-driven). You've already started asking questions where the answers become the memories (at least, you mention this in your description). So, it's really helpful to also define the structure of the answer, or a sequence of questions that lead to a specific conclusion. That way, the memories will be useful instead of turning into chaos.
1) If we're talking about how "useful" it is to analyze Antigravity's memory structure, it's honestly just interesting. Just to satisfy my "curiosity". It's only useful for people who "design memory". Or for people who use Antigravity a lot and are also tired of it "remembering other projects" when it's totally not needed...
2) But if we're talking about the "usefulness" of memory itself... I actually try to clear Antigravity's memory, because I read all my materials very carefully, like 10-15 times. For me, going from an "idea" to "coding" can easily take a couple of weeks. Until I lay out the whole architecture perfectly, I don't give the green light to "build" even a simple HTML article. So for me personally, an agent's "memory" mostly just gets in the way.
P.S. I only used memory in one project, and I designed it myself. I made it very "directed" with strict rules: "what to remember?" and "when to remember?". That is convenient and a good working concept. The thing is, for my current tasks I just don't need it. My own memory is enough.
An indexer written in 50-60 lines uses treesitter, builds incrementally, and is super fast. No need to query the project directory structure again and again, or to deal with any "breaking" changes.
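For what it's worth, a minimal sketch of that kind of incremental treesitter indexer might look like this (assuming recent versions of the `tree_sitter` and `tree_sitter_python` PyPI packages; constructor details differ between binding releases):

```python
# Sketch of an incremental symbol indexer: re-parse a file only when its mtime changes.
import os
from tree_sitter import Language, Parser
import tree_sitter_python as tspython    # grammar package; swap for your language

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)             # older py-tree-sitter versions use set_language()

index: dict[str, dict] = {}              # path -> {"mtime": ..., "symbols": [...]}

def collect_symbols(node, out):
    # Record function and class definitions with their 1-based line numbers.
    if node.type in ("function_definition", "class_definition"):
        name = node.child_by_field_name("name")
        if name is not None:
            out.append((name.text.decode(), node.start_point[0] + 1))
    for child in node.children:
        collect_symbols(child, out)

def index_file(path: str):
    mtime = os.path.getmtime(path)
    cached = index.get(path)
    if cached and cached["mtime"] == mtime:
        return cached["symbols"]         # unchanged since the last pass: no re-parse
    with open(path, "rb") as f:
        tree = parser.parse(f.read())
    symbols: list = []
    collect_symbols(tree.root_node, symbols)
    index[path] = {"mtime": mtime, "symbols": symbols}
    return symbols
```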
I wrote a short breakdown of Antigravity's memory earlier:
1) A full text log of all chats (conversations).
2) A short summary of all chats (I couldn't find where it's saved).
3) A storage for all files from the chats (brain).
4) A list of hidden notes (Implicits).
5) A list of annotations, but I couldn't figure out what is kept there (Annotations).
6) Special "Knowledge items" that are linked together. One note can pull up others (Knowledge).
7) A short text summary of all Knowledge items in one file (Knowledge).
8) Custom Workflows set by the user or the AI (workflows in the user folder).
9) Project Workflows (workflows in the project folder).
10) Custom rules for the project (rules.md in the project folder).
11) A list of saved "important" files (Artifacts).
12) Custom "skills" (skills).
This is what I found. I figured out how some parts work, but others are still a question mark for me. I also skipped a couple of things because I didn't even understand what they are used for.
- 98% of humans' repos have <2 stars
Claude is 5 times smarter than humans!
The math is a bit of a stretch, but the correlation still holds up.