
I suspect he could make it faster with the newer neural-network instructions such as AVX-512 VNNI (https://iq.opengenus.org/avx512-vnni/), which do four multiply-and-add operations in parallel instead of the two he used.


The problem with that instruction is that to do 4 at a time you would have to multiply by 1, 10, 100 and 1000 respectively. The last multiplier does not fit in a byte.
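
To make the range problem concrete, here is a minimal Python sketch that models one 32-bit lane of a VNNI-style byte dot product: four unsigned bytes times four signed bytes, summed into an accumulator. The name `dpbusd_lane` is made up for illustration, and 32-bit wraparound is ignored, so treat it as a model of the arithmetic rather than the instruction itself.

```python
def dpbusd_lane(acc, a_bytes, b_bytes):
    """Model one 32-bit lane: acc + sum of four (unsigned byte * signed byte)."""
    if len(a_bytes) != 4 or len(b_bytes) != 4:
        raise ValueError("each lane takes exactly four bytes per operand")
    for a in a_bytes:
        if not 0 <= a <= 255:
            raise ValueError(f"operand {a} does not fit in an unsigned byte")
    for b in b_bytes:
        if not -128 <= b <= 127:
            raise ValueError(f"multiplier {b} does not fit in a signed byte")
    return acc + sum(a * b for a, b in zip(a_bytes, b_bytes))


digits = [1, 2, 3, 4]  # digit values, most significant first

# Three digits work: 100, 10 and 1 all fit in a signed byte.
print(dpbusd_lane(0, digits, [100, 10, 1, 0]))  # 1*100 + 2*10 + 3*1

# Four digits do not: the 1000 multiplier is out of range.
try:
    dpbusd_lane(0, digits, [1000, 100, 10, 1])
except ValueError as err:
    print("rejected:", err)
```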


Use the expand operation to put the operands in wider fields. You have 512 bits to play with and the input took 128, so you can expand each byte to a 32-bit value and use 32-bit multipliers.

Then you can use larger-than-byte operations and larger-than-byte multipliers wherever you need them.
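
A rough Python model of the widening idea, assuming 16 input digits with the most significant digit of each group of four first. The function name `widen_and_combine` and the group-of-four layout are my assumptions; on real hardware the widening, the multiply and the horizontal add would each be separate vector steps.

```python
MASK32 = 0xFFFFFFFF


def widen_and_combine(digits16):
    """Widen 16 byte-sized digits to 32-bit lanes, then combine groups of four."""
    if len(digits16) != 16:
        raise ValueError("expected 16 digits (128 bits of input)")
    # Step 1: zero-extend every byte into its own 32-bit lane (16 * 32 = 512 bits).
    lanes = [d & 0xFF for d in digits16]
    # Step 2: with 32-bit lanes, a multiplier of 1000 is no longer a problem.
    weights = [1000, 100, 10, 1] * 4
    products = [(lane * w) & MASK32 for lane, w in zip(lanes, weights)]
    # Step 3: add each group of four products into one value.
    return [sum(products[i:i + 4]) & MASK32 for i in range(0, 16, 4)]


print(widen_and_combine([1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6]))
```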

Alternatively,

In the first pass, do all the 1, 10, 100 cases at once using the byte multipliers 1, 10, 100, 0 (the 0 throws out the last byte).

Then mask (or use the compress instructions as needed), switch to the wider-element versions, and do all the 1000-multiplier cases in one pass.

Then you merge them all in one pass.

It merges 4 at a time for the large pass and 3 at a time for the small pass instead of 2 at a time.
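
Here is one way to read the two-pass scheme, sketched in Python. It assumes groups of four digits with the most significant first, so the byte that gets the 0 multiplier in the small pass is the thousands digit. The function name `two_pass_combine` and that layout are my assumptions, and this only models the arithmetic, not the actual instruction sequence.

```python
def two_pass_combine(digits16):
    """Combine 16 digits into four 4-digit values using a byte pass and a wide pass."""
    if len(digits16) != 16:
        raise ValueError("expected 16 digits")
    groups = [digits16[i:i + 4] for i in range(0, 16, 4)]

    # Small pass: byte-sized multipliers only; the 0 discards the thousands digit.
    byte_weights = [0, 100, 10, 1]
    small = [sum(d * w for d, w in zip(g, byte_weights)) for g in groups]

    # Large pass: isolate the discarded digits (the "mask" step) and
    # multiply them by 1000 in wider lanes.
    large = [g[0] * 1000 for g in groups]

    # Merge pass: add the two partial results lane by lane.
    return [s + l for s, l in zip(small, large)]


print(two_pass_combine([1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6]))
```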

I'll write it up later today if I get time.



