fp4/fp8 for neural networks don't work the way you think they do - they are mere...

lifthrasiir · on Oct 9, 2024

You are correct that the accumulation (i.e. the additions in dot products) has to be done in higher precision; however, the multiplication can still be done via a LUT (lookup table). (Source: I currently work at an ML hardware accelerator startup.)
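To make the LUT idea concrete, here is a rough sketch in Python. It assumes a 4-bit E2M1 float layout (one sign bit, eight magnitudes); the names `FP4_MAGNITUDES`, `decode_fp4`, `PRODUCT_LUT` and `lut_dot` are made up for illustration and do not describe any particular chip. Every product comes from a 16 x 16 table, and only the additions happen in a wider type.

```python
# Magnitudes of a 4-bit E2M1 float (the format is an assumption of this sketch).
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_fp4(code):
    """Decode a 4-bit code: the top bit is the sign, the low 3 bits pick the magnitude."""
    magnitude = FP4_MAGNITUDES[code & 0b0111]
    return -magnitude if code & 0b1000 else magnitude


# All 16 x 16 possible products, computed once up front.
PRODUCT_LUT = [[decode_fp4(a) * decode_fp4(b) for b in range(16)] for a in range(16)]


def lut_dot(a_codes, b_codes):
    """Dot product where each multiplication is a table lookup."""
    acc = 0.0  # Python float (fp64) stands in for the higher-precision accumulator
    for a, b in zip(a_codes, b_codes):
        acc += PRODUCT_LUT[a][b]  # no multiplier involved, only a lookup and an add
    return acc


# Codes 2, 7, 0b1010 decode to 1.0, 6.0, -1.0; codes 3, 1, 4 decode to 1.5, 0.5, 2.0.
result = lut_dot([2, 7, 0b1010], [3, 1, 4])
```

With only 16 possible codes per operand the table has 256 entries, which is why this trick is plausible at 4 bits and gets much less attractive as the bit width grows.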

SuchAnonMuchWow · on Oct 9, 2024

The goal of this type of quantization is to move the multiplication by the fp32 rescale factor outside of the dot-product accumulation.

So the multiplications and additions are done on fp8/int8/int4/whatever (when the hardware supports those operators, of course) and accumulated in fp32 or similar, and only the final accumulator is multiplied by the rescale factor in fp32.
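A small Python sketch of that structure, assuming symmetric per-tensor int8 quantization with one scale per vector (the scheme and the names `quantize_int8` and `quantized_dot` are illustrative assumptions, not any specific library's API). The inner loop touches only integers, and the scale factors are applied once at the end.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: returns (int8 codes, float scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def quantized_dot(a_codes, a_scale, b_codes, b_scale):
    """Integer multiply-accumulate, followed by a single rescale."""
    acc = 0  # Python int stands in for the wide (int32/fp32) accumulator
    for a, b in zip(a_codes, b_codes):
        acc += a * b  # low-precision operands, no scale factor in the loop
    return acc * (a_scale * b_scale)  # the only multiplication by the rescale factor


a = [0.5, -1.0, 0.25]
b = [1.0, 0.5, -2.0]
a_codes, a_scale = quantize_int8(a)
b_codes, b_scale = quantize_int8(b)
approx = quantized_dot(a_codes, a_scale, b_codes, b_scale)
exact = sum(x * y for x, y in zip(a, b))  # -0.5
```

This works because sum((a_i * s_a) * (b_i * s_b)) equals s_a * s_b * sum(a_i * b_i) when the scales are constant across the dot product, so the rescale can be factored out of the accumulation.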

rajnathani · on Oct 9, 2024

That's how Nvidia's mixed-precision training worked with FP32-FP16, but it isn't the case for bfloat16 on TPUs and maybe (I'm not sure) for FP8 training on Nvidia Hopper GPUs.

imjonse · on Oct 9, 2024

With w8a8 quantization (8-bit weights, 8-bit activations), the hardware (Hopper or newer) can do the heavy math in fp8 twice as fast as in fp16.