_0ffh | 62 days ago | on: Nvidia Stock Crash Prediction
You'd be surprised how quickly improvement of autoregressive language models levels off with epoch count (though, admittedly, one epoch is a LOT). Diffusion language models, otoh, do keep benefiting from extra epochs for much longer, fwiw.
zozbot234 | 62 days ago
Does this also apply to LLM training at scale? I would be a bit surprised if it does, fwiw.
_0ffh | 62 days ago
Yup, as soon as data is the bottleneck rather than compute, diffusion wins. Tested following the Chinchilla scaling strategy at model sizes from 7M to 2.5B parameters.
https://arxiv.org/abs/2507.15857
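
For intuition, here's a back-of-the-envelope sketch (mine, not from the linked paper) of why a fixed corpus turns into an epoch-count question at scale. It assumes the Chinchilla rule of thumb of roughly 20 training tokens per parameter; the 10B-unique-token corpus is an illustrative assumption, not a figure from the thread:

    # Illustrative only: Chinchilla-style token budgets vs. a fixed corpus.
    # The 20:1 tokens-per-parameter ratio follows Hoffmann et al. (2022);
    # the 10B-token unique corpus is a hypothetical assumption.

    TOKENS_PER_PARAM = 20        # Chinchilla rule of thumb
    UNIQUE_TOKENS = 10e9         # hypothetical fixed corpus size

    for params in [7e6, 160e6, 2.5e9]:   # sweep roughly spanning 7M..2.5B
        optimal_tokens = TOKENS_PER_PARAM * params
        epochs = optimal_tokens / UNIQUE_TOKENS
        print(f"{params/1e6:>7.0f}M params: "
              f"{optimal_tokens/1e9:6.2f}B compute-optimal tokens "
              f"~= {epochs:5.3f} epochs over the corpus")

At 2.5B parameters the compute-optimal budget under these assumptions already implies ~5 passes over that corpus, i.e. exactly the repeated-data regime where autoregressive gains reportedly flatten out and diffusion keeps improving.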