
  > An optimization with a universal >=0 speedup across your entire suite of tests is a really hard thing to come by. Something is always going to have a negative speedup.
A common example of this: people can write matrix-matrix multiplication kernels that outperform standard implementations (including BLAS on CPU). But that's not a General Matrix-Matrix multiply (GEMM). Is the speedup still there for sparse matrices? Larger ones? Small ones? Ones whose dimensions aren't powers of 2? Non-square ones? And so on. You can beat the official implementation in any one of these cases, but good luck doing it everywhere. In fact, in your special case you *should* beat the official implementation, because you don't pay its overhead of checking which optimization to use.
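To make the overhead point concrete, here's a minimal sketch (my own illustration, not any real BLAS code): a fully specialized kernel that assumes dense 4x4 inputs and does no checking, next to a "general" entry point that must validate and dispatch before it can compute. The function names and the dispatch logic are hypothetical.

```python
import numpy as np

def specialized_matmul_4x4(a, b):
    # Hypothetical specialized kernel: assumes dense 4x4 float
    # arrays and does zero validation or dispatch. That missing
    # overhead is exactly why a narrow kernel can win its one case,
    # and exactly why it is not a general GEMM.
    out = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            s = 0.0
            for k in range(4):
                s += a[i, k] * b[k, j]
            out[i, j] = s
    return out

def general_matmul(a, b):
    # Sketch of a GEMM-like entry point: it pays for checks up
    # front before any arithmetic happens.
    if a.ndim != 2 or b.ndim != 2 or a.shape[1] != b.shape[0]:
        raise ValueError("incompatible shapes")
    # A real library would additionally dispatch on dtype, memory
    # layout, sparsity, and size thresholds here.
    return a @ b
```

The specialized kernel is correct only on the inputs it was written for; feed it anything else and it silently breaks or crashes, which is the invisible assumption the comment is pointing at.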

It's easy to oversimplify a problem without even realizing you've done so. There are always assumptions being made, and you shouldn't let them stay invisible.


