
fp4/fp8 for neural networks don't work the way you think they do. They are merely compression formats: a set of, say, 256 fp32 weights from one neuron is lossily turned into one max value (stored in fp32 precision) and 256 fp4/fp8 numbers. Those compressed numbers are multiplied by the fp32 value at runtime to approximately restore the original weights, and then full fp32 multiplications and additions are executed.
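
A minimal sketch of the compression scheme described above, under stated assumptions: it uses signed integer codes in [-7, 7] as a stand-in for real fp4 values (actual fp4/fp8 encodings are not uniformly spaced), and the names `quantize_block` and `dequantize_block` are hypothetical, not from any library.

```python
def quantize_block(weights, levels=7):
    """Compress a block of floats into one float scale plus small integer codes.

    Assumption: symmetric integer codes in [-levels, levels] stand in for fp4.
    """
    # One scale per block, derived from the largest magnitude in the block.
    scale = max(abs(w) for w in weights) / levels or 1.0  # avoid a zero scale
    codes = [round(w / scale) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Approximately restore the weights by multiplying codes by the scale."""
    return [scale * c for c in codes]


weights = [0.5, -1.4, 0.07, 0.9]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
# The round trip is lossy: each restored weight is off by at most scale / 2.
errors = [abs(w - r) for w, r in zip(weights, restored)]
```

The restored weights would then feed ordinary full-precision multiply-adds, which is the behaviour this comment describes.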


You are correct that the accumulation (i.e. the additions in dot products) has to be done in a higher precision; however, the multiplication can still be done via a lookup table (LUT). (Source: I currently work at an ML hardware startup.)
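
To illustrate the LUT idea, here is a rough sketch assuming 4-bit signed integer codes stored as nibbles 0..15 (two's complement); real hardware and real fp4 encodings will differ, and `decode`, `LUT`, and `lut_dot` are made-up names. With 4-bit operands there are only 16 x 16 possible products, so each multiplication becomes a table lookup while the sum is kept in a wider accumulator.

```python
def decode(nibble):
    """Interpret a 4-bit value (0..15) as a two's-complement integer (-8..7)."""
    return nibble - 16 if nibble >= 8 else nibble


# Precompute every possible product of two 4-bit codes: 256 entries in total.
LUT = [[decode(a) * decode(b) for b in range(16)] for a in range(16)]


def lut_dot(a_codes, b_codes):
    """Dot product where each multiply is a lookup and the sum is kept wide."""
    acc = 0  # stands in for a higher-precision accumulator
    for a, b in zip(a_codes, b_codes):
        acc += LUT[a][b]
    return acc


a_codes = [1, 15, 8, 7]   # decode to 1, -1, -8, 7
b_codes = [2, 3, 15, 7]   # decode to 2, 3, -1, 7
result = lut_dot(a_codes, b_codes)
```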


The goal of this type of quantization is to move the multiplication by the fp32 rescale factor outside of the dot-product accumulation.

So the multiplications and additions are done in fp8/int8/int4/whatever (when the hardware supports those operations, of course) and accumulated in fp32 or similar, and only the final accumulator is multiplied by the rescale factor in fp32.
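
A small sketch of why the rescale factor can be pulled out of the accumulation, assuming integer codes and one scale per vector (the function names are hypothetical). Since sum((s_w * w_i) * (s_x * x_i)) = (s_w * s_x) * sum(w_i * x_i), the inner loop only touches low-precision codes and the floating-point multiply happens once at the end.

```python
def dot_dequantize_first(scale_w, w_codes, scale_x, x_codes):
    """Reference: dequantize every element, then do float multiply-adds."""
    return sum((scale_w * w) * (scale_x * x) for w, x in zip(w_codes, x_codes))


def integer_dot(w_codes, x_codes):
    """Multiply-accumulate directly on the low-precision integer codes."""
    return sum(w * x for w, x in zip(w_codes, x_codes))


def dot_rescale_last(scale_w, w_codes, scale_x, x_codes):
    """Accumulate on codes, then apply the combined scale exactly once."""
    return integer_dot(w_codes, x_codes) * (scale_w * scale_x)


scale_w, w_codes = 0.02, [3, -7, 1, 5]
scale_x, x_codes = 0.5, [2, 4, -6, 1]
reference = dot_dequantize_first(scale_w, w_codes, scale_x, x_codes)
fast = dot_rescale_last(scale_w, w_codes, scale_x, x_codes)
# The two results agree up to floating-point rounding.
```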


That's how Nvidia's mixed precision training worked with FP32-FP16, but it isn't the case for Bfloat16 on TPUs and maybe (I'm not sure) FP8 training on Nvidia Hopper GPUs.


With w8a8 quantization (8-bit weights and 8-bit activations), hardware from Hopper onward can do the heavy math in fp8 at roughly twice the fp16 throughput.



