This matches my experience using ICC for kernel development. It would vectorize loops and emit SIMD instructions where gcc would not.
It was quite strange, the first time I looked through the objdump output, to see things like punpcklwd and the xmm registers.
And then there was discovering what -fast does to things: it makes icc look at your whole program to optimize, so it does things like ignore CDECL and use whatever registers it can.
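For anyone who wants to see this for themselves, here is a minimal sketch of the kind of loop an auto-vectorizer tends to pick up. The function name `widen_add` and the 16-bit-to-32-bit widening are my own illustration, not code from the kernel in question. Building it with optimization on (for example `gcc -O3 -c`) and running `objdump -d` on the object file is one way to check for xmm registers and unpack instructions; what actually gets emitted depends on the compiler, its version, and the target flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: widen 16-bit samples to 32 bits and add them
 * into an accumulator array. The restrict qualifiers promise the
 * compiler that dst and src do not overlap, which removes one common
 * obstacle to vectorizing the loop. */
void widen_add(int32_t *restrict dst, const int16_t *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];   /* int16_t is promoted to int before the add */
}
```

The result is the same whether or not the loop is vectorized; only the generated instructions differ.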