This feels like a sold-out train to a popular destination, where people are already reselling their tickets at some crazy markup, and then the railway suddenly decides to add one more car and opens a flash ticket sale. Investors who feel they missed out on OpenAI and others are now hoping to catch this last ticket onto the AI train.
It's a highly risky bet, but not fundamentally unreasonable. One might believe that Ilya's research was genuinely critical to OpenAI's current situation. If one takes that premise, three potential corollaries follow: (1) OpenAI will struggle to produce future research breakthroughs without Ilya; (2) OpenAI will struggle to materially move beyond its current product lineup and variations thereof without said future research breakthroughs; (3) a startup led by Ilya could overcome both (1) and (2) with time.
An alternative line of reasoning places less emphasis on Ilya specifically and instead uses Ilya as an indicator of research health. Repeat (1), (2), and (3) above, but replace "Ilya" with something like "a strong and healthy fundamental research group". In this version, Ilya's departure is taken as an indication that OpenAI no longer has a strong and healthy fundamental research group, and that the company is "compromised" by relentless feature roadmaps for current products and their variations. That does not mean OpenAI will fail, but in this view it might mean that OpenAI is not well positioned to capture future research breakthroughs and the products they will generate.
From my perspective, it's just about impossible to know how true these premises really are. And that's what makes it a bet or gamble rather than anything with any degree of assurance. To me, just as likely is the scenario where it's revealed that Ilya is highly ineffective as a generalist leader and that research without healthy tension from the business goes nowhere.
Me after watching Channel 5: I think some of it should go to poor people instead of only billion-dollar roulette.
Though I feel the problem is really with even richer corporations and financial derivatives, and not entirely here.
Except in this case, the train driver from the original train was "sacked" (some believe unfairly) and decided to get their own train to drive. Of course, the smoothness of the ride depends on the driver.
The problem is content to train LLMs (I assume Ilya will continue this line of research). Big content holders are already raising moats, restricting access, or partnering with a single existing LLM corporation. And also time, because all this involves a lot of hardware. Any subsequent competitor will have to scale a higher and higher wall just to catch up (if LLM progress doesn't stall and hit diminishing returns).
Evergrande imploded because of massive amounts of debt that they had been rolling for years. Continually rolling this massive debt was working till property demand slowed and their revenues couldn't keep up adequately to qualify them to issue new debt.
For these kinds of capital-intensive startups, though, that almost seems like a requirement, and I guess there are really 2 "types" of valuations.
In this case, everyone knows it takes hundreds of millions to train models. So investors are essentially rolling the dice on an extremely well-regarded team. And if it takes about a billion just to get off the ground, the valuation needs to be at least in the couple-billion range to make it worth it for employees to work there.
That feels very different than, say, selling a company where founders are cashing out. In that case, the business should be expected to contribute meaningfully to revenue, and quickly.
The company's pitch isn't exactly a secret. The one and only thing they're planning to do is build an ML model smarter than a human being, which would be immensely valuable for a wide variety of tasks that currently require human input. You see a lot of commentators jumping through hoops to deny that anyone could believe this is possible in the near future, but clearly they and their investors do.
Agreed, the AI bubble is very, very real. Not that LLMs are all hype, they’re certainly impressive with useful applications, but AI companies are getting insane valuations with zero proof that they’re viable businesses.
The successful companies that came out of the dot com bubble era actually proved their business viability before getting major investment, though.
Amazon is one of the most famous successes of the era. Bezos quit his job and launched the business out of his garage, with seed money being $10K of his own savings, and was doing $20K/week in sales just 30 days later. And I believe their only VC round before going public was an $8 million investment from Kleiner Perkins. But they were a company that proved its viability early on, with a real product and rapid revenue growth before getting any VC $$.
I’d say this SSI round is more similar to Webvan, who went public with a valuation of $4.8 billion, and at that time had done a grand total of $395K in sales, with losses over $50 million.
I’m sure there are good investments out there for AI companies that are doing R&D and advancing the state of the art. However, a $1 billion investment at a $5 billion valuation, for a company with zero product or revenue, just an idea, that’s nuts IMO, and extremely similar to the type of insanity we saw during the dot com bubble. Even more so given that SSI seemingly don’t even want to be a business - direct quote from Ilya:
> This company is special in that its first product will be the safe superintelligence, and it will not do anything else up until then … It will be fully insulated from the outside pressures of having to deal with a large and complicated product and having to be stuck in a competitive rat race.
This doesn’t sound to me like someone who wants to build a business, it sounds like someone who wants to hack on AI with no oversight or proof of financial viability. Kinda wild to give him $1 billion to do that IMO.
This wave, whether or not it's a bubble, has very little in common with the dotcom era. It's simply a bad analogy.
The dotcom era was full of unprofitable startups pumping up their stock price in all sorts of ways, as they were completely dependent on continued capital inflows from investors to stay afloat. Also, a lot of that capital came from retail investors in various forms.
The AI wave that is currently ongoing is for the most part funded by some of the largest and most profitable corporations on the planet.
Companies like Alphabet, Meta, Tesla/X, Amazon and (to a lesser extent) Microsoft still have founders that either control or provide a direction for these companies.
What drives this wave is the fact that most of these founders hold a strong belief.
We know, for instance, that Larry Page and Elon Musk had a disagreement about the future role of AGI/ASI about 15 years ago, leading to Elon Musk helping to found OpenAI to make sure that Google would not gain a monopoly.
These are strong convictions held by very powerful people that have been held for decades. Short term stock market fluctuations are not going to suddenly collapse this "bubble".
As long as these founders continue to believe that AGI is close, they will continue to push, even if the stock market stops supporting the push.
SSI may fail, of course. But Ilya has a reputation (from people like Hinton and Elon) as perhaps the greatest and most capable visionary in the business.
>Agreed, the car bubble is very, very real. Not that the internal combustion carriage is all hype, it's certainly impressive with useful applications, but car manufacturers are getting insane valuations with zero proof they're viable businesses.
It means that AI startups are actually a really poor value proposition compared to traditional tech companies, because your multiplier is limited. A first round at a $50M valuation leaves a lot more opportunity to get rich.
This kind of structure isn't as unusual for capital intensive businesses.
How many niche vertical SaaSes raised something like $200 million only to go to zero? Even if this team can't beat OpenAI's models, a commodity LLM that is about as good (and they have proven they can build one) is probably worth close to the investment.
I'm neither a VC nor in the VC market, but I believe such a valuation comes primarily from the name Ilya Sutskever. Having such a high-profile figure as the founder gives the company more credibility, unlike what we witnessed in recent years, where companies like Theranos et al. were valued at tens of billions for no obvious reason. That said, we might still agree that the AI hype is probably the second generation of the dot-com bubble.
Totally blind on this, hoping for someone to shed some light: do these investors get some pitch, information, or roadmap of what the company intends to create, how it will earn revenue, how it will spend money, or how it will operate?
I heard this on a Reddit thread a while back, but it rings very true here.
> If you are seeking capital for a startup with a product, you have to sell the startup on realities (ie how much revenue you are making). If you are seeking capital for a startup with no product, you can sell the startup on dreams, which is much much easier but also way riskier for investors.
Since these guys don't have a product yet, they 100% sold it on big dreams combined with Ilya's track record at OpenAI.
I’m sure they have a pitch deck. It’s pretty obvious a big chunk will go to compute costs for model training & research. But mostly it’s about the people in any company at this stage, same as any seed funding but on a different monetary scale.
Are state-level actors the main market for AI security?
Using the definition from the article:
> AI safety, which refers to preventing AI from causing harm, is a hot topic amid fears that rogue AI could act against the interests of humanity or even cause human extinction.
If the purpose of a state is to ensure its continued existence, then they should be able to make >=$1 in profit.
It's certainly in their best interest not to tell us that it's just going to be another pile of LLMs that they've trained not to say or do anything that isn't business friendly.
AGI would definitely be a major historical milestone for humanity ...
... however, I'm on the camp that believes it's not going to be hyper-profitable for only one (or a few) single commercial entities.
AGI will not be a product like the iPhone where one company can "own" it and milk it for as long as they want. AGI feels more like "the internet", which will definitely create massive wealth overall but somehow distributed among millions of actors.
We've seen it with LLMs, they've been revolutionary and yet, one year after a major release, free to use "commodity" LLMs are already in the market. The future will not be Skynet controlling everything, it will be uncountable temu-tier AIs embedded into everything around you. Even @sama stated recently they're working on "intelligence so cheap that measuring its use becomes irrelevant".
In 2022 Ilya Sutskever claimed there wasn't a distinction:
> It may look—on the surface—that we are just learning statistical correlations in text. But it turns out that to ‘just learn’ the statistical correlations in text, to compress them really well, what the neural network learns is some representation of the process that produced the text. This text is actually a projection of the world.
This is transparently false - newer LLMs appear to be great at arithmetic, but they still fail basic counting tests. Computers can memorize a bunch of symbolic times tables without the slightest bit of quantitative reasoning. Transformer networks are dramatically dumber than lizards, and multimodal LLMs based on transformers are not capable of understanding what numbers are. (And if Claude/GPT/Llama aren't capable of understanding the concept of "three," it is hard to believe they are capable of understanding anything.)
Sutskever is not actually as stupid as that quote suggests, and I'm assuming he has since changed his mind... but maybe not. For a long time I thought OpenAI was pathologically dishonest and didn't consider that in many cases they aren't "lying", they're blinded by arrogance and high on their own marketing.
> But it turns out that to ‘just learn’ the statistical correlations in text, to compress them really well, what the neural network learns is some representation of the process that produced the text
This is pretty sloppy thinking.
The neural network learns some representation of a process that COULD HAVE produced the text. (This isn't some bold assertion; it's just the literal definition of a statistical model.)
There is no guarantee it is the same as the actual process. A lot of the "bow down before machine God" crowd is guilty of this same sloppy confusion.
It's not sloppy. It just doesn't matter in the limit of training.
1. An octopus and a raven have wildly different brains. Both are intelligent. So the idea that there is some "one true system" that the NN must discover or converge on is suspect. Even basic arithmetic has numerous methods.
2. In the limit of training on a diverse dataset (i.e. as val loss continues to go down), it will converge on the process (whatever that means), or a process sufficiently robust. What gets the job done gets the job done. There is no way an increasingly competent predictor will not learn representations of the concepts in the text, whether or not that looks like how humans do it.
Sure, I agree. But if that's what you're getting hung up on, I think you've missed his point entirely.
Whether the machine becomes a human-brain clone or something entirely alien is irrelevant. The point is, you can't cheat reality. Statistics is not magic. You can't predict text that demonstrates understanding without understanding.
Sure you can, and if your predictive engine doesn't have the generality and power of the original generative one, then you have no choice.
Machine learning isn't magic - the model will learn what it can to minimize the error over the specific provided loss function, and no more. Change the loss function and you change what the model learns.
In the case of an LLM trained with a predict next word loss function, what you are asking/causing the model to learn is NOT the generative process - you are asking it to learn the surface statistics of the training set, and the model will only learn what it needs to (and is able to, per the model architecture being trained) in order to do this.
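What that objective rewards can be made concrete with a toy model. This is a minimal sketch (made-up corpus, single-token context instead of a transformer; purely illustrative): the cross-entropy-optimal predictor for this setup is exactly the empirical next-token distribution of the training text, i.e. its surface statistics.

```python
import math
from collections import Counter, defaultdict

# Toy corpus; the only training signal is "which token follows which".
corpus = "the cat sat on the mat the cat ate".split()

# With a one-token context, the loss-minimizing predictor is just the
# empirical conditional distribution P(next | current): pure surface
# statistics, with nothing about *why* the text was produced.
follows = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    follows[cur][nxt] += 1

def p_next(cur, nxt):
    counts = follows[cur]
    return counts[nxt] / sum(counts.values())

def cross_entropy(tokens):
    # Average negative log-likelihood of each next token (in nats).
    nll = [-math.log(p_next(c, n)) for c, n in zip(tokens, tokens[1:])]
    return sum(nll) / len(nll)

print(p_next("the", "cat"))   # 2 of the 3 occurrences of "the" precede "cat"
print(cross_entropy(corpus))
```

Changing the loss (or the context the model can see) changes what statistics are worth learning, which is the point being made above.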
Now of course learning the surface statistics well does necessitate some level of "understanding" - are we dealing with a fairy tale or a scientific paper, for example - but there is only so much the model can do.

Chess is a good example, since it's easy to understand. The generative process for world-class chess (whether human, or for an engine) involves way more DEPTH (cf. layers) of computation than the transformer has available to model it, so the best it can do is to learn the surface statistics via much shallower pattern recognition of the state of the board. Now, given the size of these LLMs, if trained on enough games they will be able to play pretty well even using this pattern-matching technique, but one doesn't need to get too far into a chess game to reach a position that has never been seen before in recorded games (e.g. watch agadmator's YouTube chess channel - he will often comment when this point has been reached), and the model therefore has no choice but to play moves that were seen in the training set in similar, but not identical, positions... This is basically cargo-cult chess!

It's interesting that LLMs can reach the ELO level that they do (says more about chess than about LLMs), but this same "cargo-cult" (follow surface statistics) generation process when out of the training set applies to all inputs, not just chess...
>the model will learn what it can to minimize the error over the specific provided loss function, and no more. Change the loss function and you change what the model learns.
You clearly do not really understand what it means to predict internet-scale text with increasing accuracy. No more than that? Fantastic.
LLMs do not just learn surface statistics. So many papers have thoroughly disabused this that i'm just not going to bother. This is just straight up denial.
You have no idea what you are talking about. You've probably never even played with 3.5-turbo-instruct. That's how you can say this nonsense. You have your conclusion and keep working backwards to justify it.
>It's interesting that LLMs can reach the ELO level that they do (says more about chess than about LLMs)
When you say this for everything LLMs can do then it just becomes a meaningless cope statement.
No of course not - they also learn whatever is necessary, and possible, in order to replicate those surface statistics (e.g. understanding of fairy tales, etc, as I noted).
However, you seem to be engaged in magical thinking and believe these models are learning things beyond their architectural limits. You appear to be starstruck by what these models can do, and blind to what one can deduce - and SEE - that they are unable to do.
You've said a lot of things about LLM chess performance that are not true and can easily be shown to be not true. There is literally evidence right there showing the model learning the board state, rules, player skills, etc.
And then you've tried to paper over being shown that with a conveniently vague and nonsensical "says more about bla bla bla". No, you were wrong. Your model of this is wrong. It's that simple.
You start from your conclusions and work your way down from them. "Pattern matching technique"? Please. By all means, explain to all of us what this actually entails in a way we can test for. Not just vague words.
An LLM will learn what it CAN (and needs to, to reduce the loss), but not what it CAN'T. How difficult is that to understand?!
Tracking probable board state given a sequence of moves (which don't even need to go all the way back to the start of the game!) is relatively simple to do, and doesn't require hundreds of sequential steps that are beyond the architecture of the model. It's just a matter of incrementally updating the current board state "hypothesis" per each new move (essentially: "a knight just moved to square X, so it must have moved away from some square a knight's move away from X that we believe currently contains a knight").
Ditto for estimating player ELO rating in order to predict appropriately good or bad moves. It's basically just a matter of how often the player makes the same move as other players of a given ELO rating in the training data. No need for hundreds of steps of sequential computation that are beyond the architecture of the model.
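The incremental-update idea above is cheap to sketch. A toy illustration (coordinate-style moves and an abbreviated starting position; this is not a rules engine): each move costs one constant-time update to the running board hypothesis, with no deep lookahead required.

```python
# Minimal sketch of incremental board-state tracking from a move list.
# Moves are in "e2e4" coordinate form; legality is not checked - the
# point is only that each move needs one cheap update, not deep search.
START = {
    "e2": "P", "d2": "P", "g1": "N",   # a few white pieces
    "e7": "p", "b8": "n",              # a few black pieces
    "e1": "K", "e8": "k",              # kings
}  # abbreviated position, for illustration only

def apply_moves(board, moves):
    board = dict(board)                # keep the input position intact
    for mv in moves:
        src, dst = mv[:2], mv[2:4]
        board[dst] = board.pop(src)    # any captured piece is overwritten
    return board

pos = apply_moves(START, ["e2e4", "e7e5", "g1f3", "b8c6"])
print(pos["e4"], pos["f3"])  # P N
```

Lookahead search over candidate moves is the part that genuinely needs deep sequential computation; this bookkeeping does not.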
Doing an N-ply lookahead to reason about potential moves is a different story, but you want to ignore that and instead throw out a straw-man "counter-argument" about maintaining board state, as if that somehow proves that the LLM can magically apply more than N (= layers) steps of sequential reasoning to derive moves. Sorry, but this is precisely magical faith-based thinking - "it can do X, so it can do Y" - without any analysis of what it takes to do X and Y, and why one is possible and the other is not.
>An LLM will learn what it CAN (and needs to, to reduce the loss), but not what it CAN'T. How difficult is that to understand?!
Right, and the point is that you don't know what it CAN'T learn. You clearly don't quite understand this, because you say stuff like this:
>Chess is a good example, since it's easy to understand. The generative process for world class chess (whether human, or for an engine) involves way more DEPTH (cf layers) of computation than the transformer has available to model it
and it's just baffling because:
1. Humans don't play chess anything like chess engines. They literally can't because the brain has iterative computation limits well below that of a computer. Most Grandmasters are only evaluating 5 to 6 moves deep on average.
You keep trying to make the point that because a transformer architecturally has a depth limit for a given trained model, it cannot reach human level.
But this is nonsensical for various reasons.
- Nobody is stopping you from just increasing N such that every GI problem we care about is covered.
- You have shown literally no evidence that even the N that state-of-the-art models possess today is insufficient to match human iterative compute ability.
GPT-4o one-shots arbitrary arithmetic more accurately than any human brain, and that's supposedly something it's bad at.
You can clearly see it can reach world class chess play.
If you have some iterative computation benchmark that shows transformers zero shotting worse than an unaided human then feel free to show me.
I did not claim the state of the art was better at all forms of reasoning than all humans. I claimed the architecture isn't going to stop it from being so in the future but I guess constructing a strawman is always easier right ?
There are benchmarks that rightfully show the SOTA behind average human performance in other aspects of reasoning, so why are you fumbling so much to demonstrate this with unaided iterative computation? It's your biggest argument, so I just thought you'd have something more substantial than "It's limited, bro!"
You cannot even demonstrate this today nevermind some hypothetical scaled up model.
> so why are you fumbling so much to demonstrate this with unaided iterative computation
Well, you see, I've been a professional developer for the last 45 years, and often, gasp, think for long periods of time before coding, or even writing things down. "Look ma, no hands!".
I know this will come across as an excuse, but the thing is I assumed you were also vaguely familiar with things like software development, or other cases where humans think before acting, so evidently I did a poor job of convincing you of this.
I also assumed (my bad!) that you would at least know some people who were semi-intelligent and wouldn't be hopelessly confused about farmers and chickens, but now I realize that was a mistake.
Really, it's all on me.
I know that "just add more rules", "make it bigger" didn't work for CYC, but maybe as you suggest "increase N" is all that's needed in the case of LLMs, because they are special. Really - that's genius! I should have thought of it myself.
I'm sure Sam is OK, but he'd still appreciate you letting him know he can forget about Q* and Strawberries and all that nonsense, and just "increase N"! So much simpler and cheaper rather than hiring thousands of developers to try to figure this out!
Maybe drop Yann LeCun a note too - tell him that the Turing Award committee are asshats, and that he is too, and that LLMs will get us all the way to AGI.
>Well, you see, I've been a professional developer for the last 45 years, and often, gasp, think for long periods of time before coding, or even writing things down. "Look ma, no hands!".
>I know this will come across as an excuse, but the thing is I assumed you were also vaguely familiar with things like software development, or other cases where humans think before acting, so evidently I did a poor job of convincing you of this.
Really, you have the same train of thought for hours on end?
When you finish even your supposedly hours-long spiel, do you just proceed to write every line of code that solves your problem, just like that? Or do you write and think some more?
More importantly, are LLMs unable to produce the kind of code humans spend a train of thought on?
>Maybe drop Yann LeCun a note too - tell him that the Turing Award committee are asshats, and that he is too, and that LLMs will get us all the way to AGI.
You know, the appeal-to-authority fallacy is shifty at the best of times, but it's straight-up nonsensical when said authority does not have consensus on what you're appealing to.
Like, great, you mentioned LeCun. And I can just as easily bring in Hinton, Norvig, Ilya. Now what?
No amount of training would cause a fly brain to be able to do what an octopus or bird brain can, or to model their behavioral generating process.
No amount of training will cause a transformer to magically sprout feedback paths or internal memory, or an ability to alter its own weights, etc.
Architecture matters. The best you can hope for an LLM is that training will converge on the best LLM generating process it can, which can be great for in-distribution prediction, but lousy for novel reasoning tasks beyond the capability of the architecture.
>No amount of training would cause a fly brain to be able to do what an octopus or bird brain can, or to model their behavioral generating process.
Go back a few evolutionary steps and sure you can. Most ANN architectures basically have relatively little to no biases baked in and the Transformer might be the most blank slate we've built yet.
>No amount of training will cause a transformer to magically sprout feedback paths or internal memory, or an ability to alter its own weights, etc.
A transformer can perform any computation it likes in a forward pass, and you can arbitrarily increase inference compute with the token length. Feedback paths? Sure. Compute-inefficient? Perhaps. Some extra programming around the model to facilitate this? Maybe, but the architecture certainly isn't stopping you.
Even if it couldn't, limited ≠ trivial. The human brain is not Turing complete.
Internal memory?
Did you miss the memo? Recurrence is overrated. Attention is all you need.
That said, there are already state-keeping language model architectures around.
Altering weights?
Can a transformer continuously train? Sure. It's not really compute-efficient, but the architecture certainly doesn't prohibit it.
>Architecture matters
Compute efficiency? Sure. What it is capable of learning? Not so much.
> A transformer can perform any computation it likes in a forward pass
No it can't.
A transformer has a fixed number of layers - call it N. It performs N sequential steps of computation to derive its output.
If a computation requires more than N sequential steps, then a transformer most certainly cannot perform it in a forward pass.
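The fixed-depth point can be illustrated with a toy composition (the "layer" here is a stand-in increment, not a real transformer block; some computations can be parallelized across depth, but inherently sequential ones cannot): one forward pass applies exactly N sequential steps, so a task needing M > N sequential applications forces multiple passes.

```python
# Sketch: a forward pass is a fixed composition of N layer functions,
# so one pass performs exactly N sequential steps, whatever the input.
N_LAYERS = 4

def layer(x):
    # stand-in for one transformer block's sequential step
    return x + 1

def forward(x, n_layers=N_LAYERS):
    for _ in range(n_layers):   # depth is fixed when the model is built
        x = layer(x)
    return x

# A task needing M sequential applications of `layer`, with M > N_LAYERS,
# cannot finish in one pass; it needs ceil(M / N_LAYERS) passes.
M = 10
passes = -(-M // N_LAYERS)      # ceiling division
x = 0
for _ in range(passes):
    x = forward(x)
print(passes, x)  # 3 12
```

Whether emitting tokens between passes recovers the missing depth is exactly the disagreement in the replies that follow.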
FYI, "attention is all you need" has the implicit context of "if all you want to build is a language model". Attention is not all you need if what you actually want to build is a cognitive architecture.
Transformers produce the next token by manipulating K hidden vectors per layer, one vector per preceding token. So yes, you can increase compute length arbitrarily by emitting more tokens. Those tokens don't even have to carry any information for this to work.
And again, human brains are clearly limited in the number of steps they can compute without writing something down.
Limited ≠ trivial.
>FYI, "attention is all you need" has the implicit context of "if all you want to build is a language model".
Great. Do you know what a "language model" is capable of in the limit? No.
These top research labs aren't only working on Transformers as they currently exist but it doesn't make much sense to abandon a golden goose before it has hit a wall.
> And again, human brains are clearly limited in the number of steps it can compute without writing something down
No - there is a loop between the cortex and thalamus, feeding the outputs of the cortex back in as inputs. Our brain can iterate for as long as it likes before initiating any motor output, if any, such as writing something down.
The brain's ability to iterate on information is still constrained by certain cognitive limitations like working memory capacity and attention span.
In practice, the cortex-thalamus loop allows for some degree of internal iteration, but the brain cannot endlessly iterate without some form of external aid (e.g., writing something down) to offload information and prevent cognitive overload.
I'm not telling you anything here you don't experience in your everyday life. Try indefinitely iterating on any computation you like and see how well that works for you.
My point is that, for some bizarre reason, people have standards of reasoning (for machines) that only exist in fiction or their own imagination.
It is beyond silly to dump an architecture over a limitation the human brain also has. A reasoning engine that can iterate indefinitely with no external aid does not exist in real life. That the transformer shares this weakness is not a reason for it to have capabilities below a brain's, so the point is completely moot.
LLMs are here to stay until something better replaces them, and will be used for those things they are capable of.
It shouldn't be surprising they are not great at reasoning, or everything one would hope for from an AGI, since they simply were not built for that. If you look at the development history, the transformer was a successor to LSTM-based seq-2-seq models using Bahdanau attention, whose main goal was to more efficiently utilize parallel hardware by supporting parallel processing. Of course a good language model (word predictor) will look as if it's reasoning because it is trying to model the data it was trained on - a human reasoner.
As humans we routinely think for seconds/minutes or even hours before speaking or acting, while an LLM only has that fixed N steps (layers) of computation. I don't know why you claim this difference (among others) should make no difference, but it clearly does, with out-of-training-set reasoning weakness being a notable limitation that people such as Demis Hassabis have recently conceded.
Reasoning is reasoning. "Look as if it is reasoning" is an imaginary distinction you've made up. One that is very clear because everybody touting this "fake reasoning" rhetoric is still somehow unable to define a testable version of reasoning that disqualifies LLMs without also disqualifying some chunk of humans.
>As humans we routinely think for seconds/minutes or even hours before speaking or acting
No human is iterating on a base thought for hours uninterrupted lol so this is just moot
>with out-of-training-set reasoning weakness being a notable limitation that people such as Demis Hassabis have recently conceded.
Humans reason weaker out of training. LLMs are simply currently worse
> Reasoning is reasoning. "Look as if it is reasoning" is an imaginary distinction you've made up.
No - just because something has the surface appearance of reasoning doesn't mean that the generative process was reasoning, any more than a cargo-cult wooden aircraft reflects any understanding of aerodynamics or would be able to fly.
We've already touched on it, but the "farmer crossing river" problem is a great example. When the LLM sometimes degenerates into "cross bank A to B with chicken, cross bank B to A with chicken, cross bank A to B with chicken... that is the fewest trips possible", this is an example of "looks as if it is reasoning", aka cargo-cult surface-level copying of what a solution looks like. Real reasoning would never repeat a crossing without loading or unloading something, since that conflicts with the goal of the fewest trips possible.
I never said anything about the surface appearance of reasoning. Either the model demonstrates some understanding or reasoning in the text it generates, as it is perfectly capable of doing, or it reasons faultily or lacks understanding in that area. This does not mean LLMs don't reason, any more than it means you don't reason.
The idea that LLMs "fake reason" and Humans "really reason" is an imaginary distinction. If you cannot create any test that can distinguish the two then you are literally making things up.
Dude, I just gave you an example, and you straight-up ignore it and say "show me a test"?!
An averagely smart human does not have these failure modes where they answer a question with something that looks like an answer "cross A to B, then B to A. done. there you go!" but has zero logic to it.
Do you follow news in this field at all? Are you aware that poor reasoning is basically the #1 shortcoming that all the labs are working on?!!
Feel free to have the last word as this is just getting repetitive.
You are supposed to show me an example no human will fail. I didn't ignore anything. I'm just baffled that you genuinely believe this:
>An averagely smart human does not have these failure modes where they answer a question with something that looks like an answer "cross A to B, then B to A. done. there you go!" but has zero logic to it.
Humans are poor at logic in general. We make decisions and give rationales with logical contradictions and nonsense all the time. I just genuinely can't believe you think we don't. It happens so often that we have names for these cognitive shortcomings. Get any teacher you know and ask them this. No need to take my word for it. And I don't care about getting the last word.
You seem to repeatedly insist that hidden computation is a distinction of any relevance whatsoever.
First of all, your understanding of the architecture itself is mistaken. A transformer can iterate endlessly because each token it produces allows it a forward pass, and each of these tokens is appended to its input in the next inference. That's the autoregressive in autoregressive transformer, and the entire reason why it was proposed for arbitrary seq2seq transduction.
This means you get layers * tokens iterations, where tokens is up to two million, and is in practice unlimited due to the LLM being able to summarize and select from that. Parallelism is irrelevant, since the transformer is sequential in the output of tokens. A transformer can iterate endlessly, it simply has to output enough tokens.
And no, the throughput isn't limited either, since each token gets translated into a high-dimensional internal representation, that in turn is influenced by each other token in the model input. Models can encode whatever they want not just by choosing a token, but by choosing an arbitrary pattern of tokens encoding arbitrary latent-space interactions.
Secondly, internal thoughts are irrelevant, because something being "internal" is an arbitrary distinction without impact. If I trained an LLM to wrap some part of its output in <internal_thought> tags, and then simply didn't show that part, the LLM wouldn't magically become human. This is something many models do even today, in fact.
Similarly, if I were to take a human and modify their brain to only be able to iterate using pen and paper, or by speaking out loud, then I wouldn't magically make them into something non-human. And I would definitely not reduce their capacity for reasoning in any way whatsoever. There are people with aphantasia working in the arts, there are people without an internal monologue working as authors - how "internal" something is can be trivially changed with no influence on either the architecture or the capabilities of that architecture.
Reasoning itself isn't some unified process, and neither is it infinite iteration. It requires specific understanding of the domain being reasoned over, especially understanding of which transformation rules are applicable to produce desired states, where the judgement about which states are desirable has to be learned itself. LLMs can reason today, they're just not as good at it as humans are in some domains.
Sure - a transformer can iterate endlessly by generating tokens, but this is no substitute for iterating internally and maintaining internal context and goal-based attention.
One reason why just blathering on endlessly isn't the same as thinking deeply before answering, is that it's almost impossible to maintain long-term context/attention. Try it. "Think step by step" or other attempts to prompt the model into generating a longer reply that builds upon itself, will only get you so far because keeping a 1-dimensional context is no substitute for the thousands of connections we have in our brain between neurons, and the richness of context we're therefore able to maintain while thinking.
The reasoning weakness of LLMs isn't limited to "some domains" that they had less training data for - it's a fundamental architecturally-based limitation. This becomes obvious when you see the failure modes of simple problems like "how few trips does the farmer need to cross the river with his chicken & corn, etc" type problems. You don't need to morph the problem to require out-of-distribution knowledge to get it to fail - small changes to the problem statement can make the model state that crossing the river backwards and forwards multiple times without loading/unloading anything is the optimal way to cross the river.
But, hey, no need to believe me, some random internet dude. People like Demis Hassabis (CEO of DeepMind) acknowledge the weakness too.
>You don't need to morph the problem to require out-of-distribution knowledge to get it to fail
Make the slight variation look different from the version it has memorized and it often passes. Sometimes it's as straightforward as just changing the names. Humans have this failure mode too.
First of all, I would urge you to stop arbitrarily using negative words to make an argument. Saying that LLMs are "blathering" is equivalent to saying you and I are "smacking meat onto plastic to communicate" - it's completely empty of any meaning. This "vibes based arguing" is common in these discussions and a massive waste of time.
Now, I don't really understand what you mean by "almost impossible to maintain long-term context/attention". I'm writing fiction in my spare time, LLMs do very well on this by my testing, even subtle and complex simulations of environments, including keeping track of multiple "off-screen" dynamics like a pot boiling over.
There is nothing "1-dimensional" about the context, unless you mean that it is directional in time, which any human thought is as well, of course. As I said in my original reply, each token is represented by a multidimensional embedding, and even that is abstracted away by the time inference reaches the later layers. The word "citrus" isn't just a word for the LLM, just as it isn't just a word for you. Its internal representation retrieves all the contextual understanding that is related to it. Properties, associated feelings, usage - every relevant abstract concept is considered. And these concepts interact with every embedding of every other token in the input in a learned way, and with the position they have relative to each other. And then when an output is generated from that dynamic, said output influences the dynamic in a way that is just as multidimensional.
The model can maintain context as rich as it wants, and it can build upon that context in whatever way it wants as well. The problem is that in some domains, it didn't get enough training time to build robust transformation rules, leading it to draw false conclusions.
You should reflect on why you are only able to provide vague, underdefined, and often incorrect arguments here. You're drawing distinctions that don't really exist and trying to hide that by appealing to false intuitions.
> The reasoning weakness... it's a fundamental architecturally-based limitation...
You have provided no evidence or reasoning for that conclusion. The river crossing puzzle is exactly what I had in mind when talking about specific domains. It is a common trick question with little to no variation and LLMs have overfit on that specific form of the problem. Translate it to any other version - say transferring potatoes from one pot to the next, or even a mathematical description of sets being modified - and the models do just fine. This is like tricking a human with the "As I was going to Saint Ives" question, exploiting their expectation of having to do arithmetic because it looks superficially like a math problem, and then concluding that they are fundamentally unable to reason.
> People like Demis Hassabis (CEO of DeepMind) acknowledge the weakness too.
What weakness? That current LLMs aren't as good as humans when reasoning over certain domains? I don't follow him personally but I doubt he would have the confidence to make any claims about fundamental inabilities of the transformer architecture. And even if he did, I could name you a couple of CEOs of AI labs with better models that would disagree, or even Turing award laureates. This is by no means a consensus stance in the expert community.
> And even if he did, I could name you a couple of CEOs of AI labs with better models that would disagree, or even Turing award laureates. This is by no means a consensus stance in the expert community.
I disagree - there is pretty widespread agreement that reasoning is a weakness, even among the best models, (and note Chollet's $1M ARC prize competition to spur improvements), but the big labs all seem to think that post-training can fix it. To me this is whack-a-mole wishful thinking (reminds me of CYC - just add more rules!). At least one of your "Turing award laureates" thinks Transformers are a complete dead end as far as AGI goes.
A weakness of the current models in some domains considered useful, yes - but not a fundamental limitation of the architecture. I see no consensus on the latter whatsoever.
The ARC challenge tests spatial reasoning, something we humans are obviously quite good at, given 4 billion years of evolutionary optimization. But as I said, there is no "general reasoning", it's all domain dependent. A child does better at the spatial problems in ARC given that it has that previously mentioned evolutionary advantage, but just as we don't worship calculators as superior intelligences because they can multiply 10^9 digit numbers in milliseconds, we shouldn't draw fundamental conclusions from humans doing well at a problem that they are in many ways built to solve. If the failures of previous predictions - those that considered Chess or Go as unmistakable signals of true general reasoning - have taught us anything, it's that general reasoning simply does not exist.
The bet of current labs is synthetic data in pre-training, or slight changes to natural data that induce more generalization pressure for multi-step transformations on state in various domains. The goal is to change the data so models learn these transformations more readily and develop good heuristics for them - not the whack-a-mole patching that you suggest.
But yes, the next generation of models will probably reveal much more about where we're headed.
> If the failures of previous predictions - those that considered Chess or Go as unmistakable signals of true general reasoning - have taught us anything, it's that general reasoning simply does not exist.
I don't think DeepBlue or AlphaGo/etc were meant to teach us anything - they were just showcases of technological prowess by the companies involved, demonstrations of (narrow) machine intelligence.
But...
Reasoning (differentiated from simpler, shallow "reactive" intelligence) is basically multi-step chained what-if prediction, and may involve a branching exploration of alternatives ("ok, so that wouldn't work, so what if I did this instead..."), so it could be framed as a tree search of sorts, not entirely dissimilar to the searches used by DeepBlue (alpha-beta) or AlphaGo (MCTS).
Of course general reasoning is a lot more general than playing a game like Chess or Go since the type of moves/choices available/applicable will vary at each step (these aren't all "game move" steps), as will the "evaluation function" that predicts what'll happen if we took that step, but "tree search" isn't a bad way to conceptualize the process, and this is true regardless of the domain(s) of knowledge over which the reasoning is operating.
Which is to say that reasoning is in fact a generalized process, and one whose nature has some corresponding requirements (e.g. keeping track of state) for any machine to be capable of performing it...
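A minimal sketch of that framing, with every name illustrative: reasoning-as-tree-search is just "enumerate the applicable moves, predict the resulting state, backtrack when a branch fails", regardless of domain:

```python
# Generic sketch of reasoning-as-tree-search: from a state, enumerate
# applicable "moves", predict resulting states, and backtrack when a
# branch fails -- the same shape whether the domain is chess, route
# planning, or a word puzzle. All names here are illustrative.

def search(state, moves, apply_move, is_goal, depth=0, max_depth=10):
    """Depth-first what-if exploration with backtracking."""
    if is_goal(state):
        return []                       # found a plan: no more steps needed
    if depth == max_depth:
        return None                     # give up on this branch
    for m in moves(state):              # the available choices vary per state
        plan = search(apply_move(state, m), moves, apply_move,
                      is_goal, depth + 1, max_depth)
        if plan is not None:            # "ok, that works"
            return [m] + plan
    return None                         # "that wouldn't work, backtrack"

# Tiny usage example: reach 5 from 0 using +1 / +2 steps.
plan = search(0, lambda s: [1, 2], lambda s, m: s + m, lambda s: s == 5)
```

The domain-specific parts are exactly the `moves` generator (what's applicable) and the goal/evaluation test; the search skeleton itself is domain-independent, which is the point being made above.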
You are confusing number of sequential steps with total amount of compute spent.
The input sequence is processed in parallel, regardless of length, so number of tokens has no impact on number of sequential compute steps which is always N=layers.
> Do you know what a "language model" is capable of in the limit ?
Well, yeah, if the language model is an N-layer transformer ...
Then increase N (N is almost always increased when a model is scaled up) and retrain, or have the model write things down and continue.
A limitless iteration machine (without external aid) is currently an idea of fiction. Brains can't do it so I'm not particularly worried if machines can't either.
Increasing the number of layers isn't a smart way to solve it. In order to reason effectively and efficiently, the model needs to use as much, or as little, compute as needed for a given task. Completing "1+1=" should take fewer compute steps than "A winning sequence for white here is ...".
This lack of "variable compute" is a widely recognized shortcoming of transformer-based LLMs, and there are plenty of others. The point apropos this thread is that you can't just train an LLM to be something that it is not. If the generating process required variable compute (maybe 1000's of steps) - e.g. to come up with a chess move - then no amount of training can make the LLM converge to model this generative process... the best it can do is to model the outcome of the generative process, not the process itself. The difference is that without having learnt the generative process, the model will fail when presented with a novel input that it didn't see during training, and therefore didn't memorize the "cheat sheet" answer for.
>Increasing number of layers isn't a smart way to solve it.
The "smart way" is a luxury. Solving the problem is what matters. Think of a smart way later if you can. That's how a lot of technological advancement has worked.
>It order to be able to reason effectively and efficiently the model needs to use as much, or as little, compute as needed for a given task. Completing "1+1=" should take less compute steps than "A winning sequence for white here is ...".
Same thing. Efficiency is nice but a secondary concern.
>If the generating process required variable compute (maybe 1000's of steps) - e.g. to come up with a chess move - then no amount of training can make the LLM converge to model this generative process.
Every inference problem itself has a fixed number of compute steps it needs (yes, even your chess move). Variability between inferences (maybe move 1 required 500 steps but move 2 only 240, etc.) is a nice thing, but never a necessary thing.
3.5-turbo-instruct plays chess consistently at 1800 Elo so clearly the N of the current SOTA is already enough to play non-trivial chess at a level beyond most humans.
There is an N large enough for every GI problem humans care about. Not to sound like a broken record, but once again: limited ≠ trivial.
> In the limit of training on a diverse dataset (ie as val loss continues to go down), it will converge on the process (whatever that means) or a process sufficiently robust.
This is just moving the goal posts from "learning the actual process" to "any process sufficiently robust"
A photograph is not the same as its subject, and it is not sufficient to reconstruct the subject, but it's still a representation of the subject. Even a few sketched lines are something we recognise as a representation of a physical object.
I think it's fair to call one process that can imitate a more complex one a representation of that process. Especially when in the very next sentence he describes it as a "projection", which has the mathematical sense of a representation that loses some dimensions.
Yes, exactly. The trick is to have enough tough data so that you find the optimal one. I think as we scale models back down to smaller sizes we will discover viable/correct representations.
Which basic counting tests do they still fail? Recent examples I've seen fall well within the range of innumeracy that people routinely display. I feel like a lot of people are stuck in the mindset of 10 years ago, when transformers weren't even invented yet and state-of-the-art models couldn't identify a bird, no matter how much capabilities advance.
I have not used LLMs since 2023, when GPT-4 routinely failed almost every counting problem I could think of. I am sure the performance has improved since then, though "write an essay with 250 words" still seems unsolved.
The real problem is that LLM providers have to play a stupid game of whack-a-mole where an enormous number of trivial variations on a counting problem need to be specifically taught to the system. If the system was capable of true quantitative reasoning that wouldn't be necessary for basic problems.
There is also the deception that "chain of thought" prompting makes LLMs much better at counting. But that's cheating: if the LLM had quantitative reasoning it wouldn't need a human to indicate which problems were amenable to step-by-step thinking. (And this only works for O(n) counting problems, like "count the number of words in the sentence." CoT prompting fails to solve O(nm) counting problems like "count the number of words in this sentence which contain the letter 'e'." For this you need a more specific prompt, like "First, go step-by-step and select the words which contain 'e.' Then go step-by-step to count the selected words." It is worth emphasizing over and over that rats are not nearly this stupid; they can combine tasks to solve complex problems without a human holding their hand.)
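For what it's worth, the two-stage decomposition described above is trivial when written as explicit code - select first, then count - which is exactly the composition the prompt has to spell out for the model (function name illustrative):

```python
# The two-stage prompt from above, as plain code:
# step 1 selects the words containing 'e', step 2 counts the selection.
def count_words_with_e(sentence):
    words = sentence.split()
    selected = [w for w in words if "e" in w.lower()]   # step 1: select
    return len(selected)                                # step 2: count
```

The claim in the comment is that a system with genuine quantitative reasoning would compose these two steps on its own, rather than needing each stage dictated in the prompt.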
I don't know what you mean by "10 years ago" other than a desire to make an ad hominem attack about me being "stuck." My point is that these "capabilities" don't include "understands what a number is in the same way that rats and toddlers understand what numbers are." I suspect that level of AI is decades away.
Yeah, it's not clear what companies like OpenAI and Anthropic mean when they predict AGI coming out of scaled up LLMs, or even what they are really talking about when they say AGI or human-level intelligence. Do they believe that scale is all you need, or is it an unspoken assumption that they're really talking about scale plus some set of TBD architectural/training changes?!
I get the impression that they really do believe scale is all you need, other than perhaps some post-training changes to encourage longer horizon reasoning.
Maybe Ilya is in this camp, although frankly it does seem a bit naive to discount all the architectural and operational shortcomings of pre-trained Transformers, or assume they can be mitigated by wrapping the base LLM in an agent that provides what's missing.
> I honestly don't see a market for "AI security".
I suspect there's a big corporate market for LLMs with very predictable behaviour in terms of what the LLM knows from its training data, vs what it knows from RAG or its context window.
If you're making a chatbot for Hertz Car Hire, you want it to answer based on Hertz policy documents, even if the training data contained policy documents for Avis and Enterprise and Budget and Thrifty car hire.
Avoiding incorrect answers and hallucinations (when appropriate) is a type of AI safety.
Talent attracts capital. Ilya is a legendary visionary, with a proven track record of turning billions into hundreds of billions. Of course he can raise unlimited money.
There is so much talent in the world that didn’t join PayPal and get silicon valley investors and go on to make billions of dollars and found other companies.
The PayPal mafia includes Elon Musk, Peter Thiel, etc. They parlayed that capital into more platforms and can easily arrange investments. Heck, Peter Thiel even works with governments (Palantir) and got J.D. Vance on Trump's ticket, while Elon might be in his admin.
Kolomoisky got Zelensky elected in Ukraine by launching a show about an unlikely guy who wins the presidency and naming the party after the show. They call them oligarchs over there, but it's the same thing.
The first guy to 1 million followers on Twitter was Ashton Kutcher. He had already starred in sitcoms and movies for years.
This idea that you can just get huge audiences and investments due to raw talent, keeps a lot of people coming to Hollywood and Silicon Valley to “make it” and living on ramen. But even just coming there proves the point — a talented rando elsewhere in the world wouldn’t even have access to the capital and big boys networks.
I never understood this line of reasoning, because it presumes that everyone should have access to the same opportunities. It's clearly silly once you throw a few counter examples: should a Private in the military be able to skip the ranks and be promoted straight to General? Should a new grad software dev be able to be promoted to lead engineer without getting any experience?
Clearly there are reasons why opportunities are gated.
> This idea that you can just get huge audiences and investments due to raw talent, keeps a lot of people coming to Hollywood and Silicon Valley to “make it” and living on ramen. But even just coming there proves the point — a talented rando elsewhere in the world wouldn’t even have access to the capital and big boys networks.
All those people start somewhere though. Excluding nepotism, which is a tangential point, all those people started somewhere and then grew through execution and further opening of opportunity. It's not like they all got to where they are in one shot. Taking your Ashton Kutcher example - yes, he had a head start on Twitter followers, but that's because he executed on his career for years before. Why would it make sense for some rando to rack up a million followers before he did?
Talent will earn you opportunities, but it's not going to open the highest door until you've put in the time and work.
Of course, it's not to say inequity or unequal access to opportunities doesn't exist in the world. Of course it does. But even in an ideal, perfectly equitable world, not everyone would have the same access to opportunities.
So yes, it makes perfect sense that someone would give Ilya $1B instead of some smart 18 year old, even if that 18 year old was Ilya from the past.
Presumably the private and the general are in the SAME organization, and yes, the avenues for advancement are available equally to all; it's based on merit and the rules are clear.
The analogy would be if the private could become a major overnight because they knew a guy.
Yes but a private cannot become a general without decades of experience.
What we see with ilya is not dissimilar. I don't see why it's bad that people are more hesitant to give a talented 18 year old $1B than the guy who's been at the forefront of AI innovation.
> And sometimes not even necessary. Paris Hilton got a music distribution deal overnight cause of her dad’s capital!
Nepotism is a tangential point, and yes I agree that it's a bad thing. Ilya did not get this deal through nepotism, he got it through his past accomplishments, much like how a general gets promoted after many years of exemplary work.
Those people weren’t handed that success. You are acting as if they were born billionaires, which is far from true.
It’s not personally my goal to amass immense wealth and start giant companies (I would rather work minimally and live hedonically) but I am impressed by those that do so.
No, I’m saying it was those who went to silicon valley and got lucky to strike up relationships with CAPITAL who made it.
Overwhelmingly talent isnt sufficient. For most startups, the old boys network gets to choose who gets millions. And the next rounds a few people choose who will get billions.
If you look at the early stages of Google or Zip2, they were NOT swimming in money. Facebook only attracted serious investments when they already had something that looked promising. Apple started in a bedroom in the parent's house, before being moved to the garage, and so on.
If anything, I think startups from 2005-2020 were more likely to get funding easily than those giants.
But after succeeding, in some cases several times, all of the above found it easier to find investors.
Ilya has a similar track record. He contributed to several breakthroughs such as AlexNet, AlphaGo, and seq2seq (which would evolve into transformers after he left Google) before even joining OpenAI.
In fact, Elon sees it as one of his key contributions to OpenAI that he managed to recruit Ilya.
That Ilya is able to raise 1B now is hardly surprising. He's probably able to raise way more than that once he's hired a larger team.