
For inference, even with continuous batching, getting 100% MFU is basically impossible in practice. Even the frontier labs struggle with this on highly efficient InfiniBand clusters. It's slightly better with training workloads, just due to all the batching and parallel compute, but still mostly unattainable on consumer rigs (you spend a lot of time waiting for I/O).
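For a rough sense of what MFU means here: it's just achieved FLOPs/s divided by the hardware's peak FLOPs/s. A back-of-the-envelope sketch (my own approximation and numbers, not the parent's; uses the common ~2 FLOPs per parameter per generated token estimate for decoder-only inference and an assumed ~1e15 FLOPs/s peak):

    def mfu(params: float, tokens_per_sec: float, peak_flops: float) -> float:
        """Approximate model FLOPs utilization for autoregressive decoding."""
        achieved_flops = 2.0 * params * tokens_per_sec  # ~2 FLOPs per parameter per token
        return achieved_flops / peak_flops

    # Example (hypothetical numbers): a 70B model decoding at 40 tok/s on a GPU
    # with an assumed ~1e15 FLOPs/s peak
    print(mfu(70e9, 40, 1e15))  # ~0.0056, i.e. well under 1% MFU for single-stream decode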

I also don't think 100% utilization is necessary, to be fair. I get a lot of value out of my two rigs (2x RTX Pro 6000, and 4x 3090) even though they may not be at 100% MFU 24/7. I'm always training, generating datasets, running agents, etc. I would never consider this a positive ROI measured against capex, though; that's not really the point.





Isn't this just saying that your GPU use is bottlenecked by things such as VRAM bandwidth and RAM-VRAM transfers? That's normal and expected.

No, I'm saying there are quite a few more bottlenecks than that (I/O being a big one). Even in the more efficient training frameworks, there's per-op dispatch overhead in Python itself: all the boxing/unboxing of Python objects into C++ handles, dispatcher lookup and setup, all the autograd bookkeeping, etc.
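To illustrate the per-op overhead (my own micro-benchmark sketch, not the commenter's setup, and the shapes/iteration counts are arbitrary): compare many tiny PyTorch ops against one large op doing the same total arithmetic.

    import time
    import torch

    x = torch.randn(1000, 1000)

    def many_small_ops():
        # 1000 separate op dispatches: pays the Python -> C++ dispatch,
        # argument parsing, and tensor-wrapper cost 1000 times
        return [x[i : i + 1] + 1.0 for i in range(1000)]

    def one_big_op():
        # same total arithmetic, but only one dispatch
        return x + 1.0

    def bench(fn, iters=100):
        fn()  # warm-up
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        return (time.perf_counter() - start) / iters

    print(f"many small ops: {bench(many_small_ops) * 1e3:.3f} ms/iter")
    print(f"one big op:     {bench(one_big_op) * 1e3:.3f} ms/iter")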

All of these bottlenecks in sum are why you'd never get to 100% MFU (but, as I was conceding, you probably don't need to in order to get value).



