Taking the opposite side of that bet, here is why: \* even if an openweight mode...

xml · 2026-02-26T09:50:06 1772099406

Even with inflated RAM prices, you can buy a Strix Halo Mini PC with 128GB unified memory right now for less than 2k. It will run gpt-oss-120b (59 GB) at an acceptable 45+ tokens per second: https://github.com/lhl/strix-halo-testing?tab=readme-ov-file...

I also believe that it should eventually be possible to train a model with somewhat persistent mixture of experts, so you only have to load different experts every few tokens. This will enable streaming experts from NVMe SSDs, so you can run state of the art models at interactive speeds with very little VRAM as long as they fit on your disk.
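As a rough back-of-envelope sketch of why expert persistence matters for streaming from disk: the read bandwidth needed scales with how often the active experts are swapped. All numbers below (active expert size, SSD speed, target speed) are assumed placeholders for illustration, not measurements of any real model or drive, and the function names are hypothetical.

```python
# Back-of-envelope sketch; every constant below is an assumed placeholder,
# not a measurement of any real model or SSD.

def required_read_gb_per_s(active_expert_gb, tokens_per_s, tokens_per_swap):
    """Disk read bandwidth needed if the active experts are reloaded
    once every `tokens_per_swap` tokens."""
    swaps_per_s = tokens_per_s / tokens_per_swap
    return active_expert_gb * swaps_per_s


def max_tokens_per_s(ssd_gb_per_s, active_expert_gb, tokens_per_swap):
    """Decode speed the disk alone could sustain (ignores compute and caching)."""
    return ssd_gb_per_s * tokens_per_swap / active_expert_gb


ACTIVE_EXPERT_GB = 3.0     # assumed size of the experts used for one token
SSD_GB_PER_S = 7.0         # assumed sequential read speed of an NVMe SSD
TARGET_TOKENS_PER_S = 20   # assumed "interactive" speed

for tokens_per_swap in (1, 8, 32):
    need = required_read_gb_per_s(ACTIVE_EXPERT_GB, TARGET_TOKENS_PER_S, tokens_per_swap)
    cap = max_tokens_per_s(SSD_GB_PER_S, ACTIVE_EXPERT_GB, tokens_per_swap)
    print(f"swap every {tokens_per_swap:>2} tokens: "
          f"need {need:.1f} GB/s, disk sustains about {cap:.1f} tok/s")
```

Under these assumptions, swapping experts on every token would need far more bandwidth than the drive provides, while swapping every 8 tokens or so brings the requirement close to what a single fast SSD can deliver.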

athrowaway3z · 2026-02-26T10:57:50 1772103470

I agree the parent is a bit too pessimistic, especially because we care about logical skills and context size more than remembering random factoids.

But on a tangent, why do you believe in mixture of experts?

Everything I know about them makes me believe they're a dead-end architecturally.

xml · 2026-02-26T12:05:48 1772107548

> But on a tangent, why do you believe in mixture of experts?

The fact that all big SoTA models use MoE is certainly a strong reason. They are more difficult to train, but the efficiency gains seem to be worth it.
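A rough sketch of where the efficiency gain comes from at inference time: single-user token generation is often limited by how fast weights can be read from memory, and an MoE model only reads the experts it activates for each token. The 59 GB figure is the model size mentioned earlier in the thread; the memory bandwidth and the active fraction are assumptions picked for illustration, and the result is an upper bound rather than a prediction.

```python
# Upper bound on decode speed when generation is memory-bandwidth-bound.
# MEM_GB_PER_S and ACTIVE_FRACTION are assumptions for illustration only.

def decode_upper_bound_tokens_per_s(mem_gb_per_s, weights_gb, active_fraction=1.0):
    """Tokens per second if every token requires reading
    `active_fraction` of the weights from memory exactly once."""
    gb_read_per_token = weights_gb * active_fraction
    return mem_gb_per_s / gb_read_per_token


WEIGHTS_GB = 59.0        # model size quoted in the thread
MEM_GB_PER_S = 256.0     # assumed unified-memory bandwidth
ACTIVE_FRACTION = 0.05   # assumed share of weights touched per token in an MoE

dense = decode_upper_bound_tokens_per_s(MEM_GB_PER_S, WEIGHTS_GB)
moe = decode_upper_bound_tokens_per_s(MEM_GB_PER_S, WEIGHTS_GB, ACTIVE_FRACTION)

print(f"dense-equivalent upper bound: {dense:.1f} tok/s")
print(f"MoE upper bound:              {moe:.1f} tok/s")
```

With these assumed inputs, a dense model of the same size would be stuck at a few tokens per second, while the MoE bound sits well above the 45+ tokens per second reported above.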

> Everything I know about them makes me believe they're a dead-end architecturally.

Something better will come around eventually, but I do not think that we need much change in architecture to achieve consumer-grade AI. Someone just has to come up with the right loss function for training, then one of the major research labs has to train a large model with it and we are set.

I just checked Google Scholar for a paper with a title like "Temporally Persistent Mixture of Experts" and could not find it yet, but the idea seems straightforward, so it will probably show up soon.

amelius · 2026-02-26T11:37:11 1772105831

> But on a tangent, why do you believe in mixture of experts?

In a hardware inference approach you can do tens of thousands of tokens per second and run your agents in a breadth-first style. It is all very simple conceptually, and not more than a few years away.

amelius · 2026-02-26T09:53:37 1772099617

There will be companies producing ICs for cheap models, like Taalas or Axelera.ai today. These models will not be as good as the SOTA models, but because they are so fast, in a multi-agent approach with internet/database connectivity they can be as good as SOTA models, at least for the general public.

MagicMoonlight · 2026-02-26T10:33:14 1772101994

All they need to do is produce one for GPT-OSS and it’s over. That model is good enough for real uses.

kavalg · 2026-02-26T11:37:55 1772105875

I wonder why they released it, then.

amelius · 2026-02-26T11:39:09 1772105949

Why did Google publish the Transformers paper?

WarmWash · 2026-02-26T15:09:37 1772118577

The GPU makers have been purposely stunting VRAM growth for years to not undercut their enterprise offerings.

vegabook · 2026-02-26T10:22:20 1772101340

Yeah, but effective GPU RAM has ramped up thanks to unified memory on Apple. The 5y thing doesn't hold anymore.

randusername · 2026-02-26T14:17:18 1772115438

I agree, but I'm holding out hope that ASICs, unified RAM, and/or enterprise to consumer trickle-down will outpace consumer GPU VRAM growth rates.

otabdeveloper4 · 2026-02-26T09:44:42 1772099082

Increasing model size doesn't make your model smarter, it just makes it know more facts.

There are easier ways to do that.