
> It exhibits Transformer-like scaling laws: we find empirically that BDH rivals GPT2-architecture Transformer performance on language and translation tasks, at the same number of parameters (10M to 1B), for the same training data.

I'm assuming they say "rivals" rather than "surpasses" or "exceeds" because they got close but didn't manage to create something better. Is that a fair assessment?
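(For context on the quote's "Transformer-like scaling laws": the claim is that eval loss falls roughly as a power law in parameter count, L(N) ≈ a · N^(-alpha), and that BDH's curve tracks the Transformer's over 10M to 1B params. A minimal sketch of fitting such a law, using placeholder numbers rather than the paper's data:

    import numpy as np

    # Hypothetical (params, loss) points -- NOT figures from the paper
    params = np.array([1e7, 1e8, 1e9])   # 10M to 1B, the range in the quote
    loss = np.array([4.2, 3.4, 2.8])     # placeholder eval losses

    # A power law L(N) = a * N**(-alpha) is a straight line in log-log space
    slope, intercept = np.polyfit(np.log(params), np.log(loss), 1)
    alpha, a = -slope, np.exp(intercept)
    print(f"L(N) ~= {a:.2f} * N^-{alpha:.3f}")

"Rivals" would then mean the two architectures' fitted curves roughly coincide, rather than BDH's sitting strictly below.)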



Pretty much.

Everyone and their dog says "transformer LLMs are flawed", but words are cheap - and in practice, no one seems to have come up with something that's radically better.

Sidegrades yes, domain-specific improvements yes, better performance across the board? Haha no. For how simple autoregressive transformers seem, they sure set a high bar.



