Thank you for the correction.

The best sort in the article appears to yield a 3x speedup over std::sort for an array that fits in L2 cache. In my testing (with much larger arrays) I got 2.5x without any small-array transition, and 3x with a branchless sorting network for ranges of 3 or fewer elements, all with dramatically less code than that presented in the article.
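For readers unfamiliar with the term, here is a minimal sketch of what a branchless three-element sorting network can look like. It is not the code I benchmarked; the names compare_exchange and sort3 are made up for illustration, and the element type is assumed to be int. Whether the compiler actually emits conditional moves instead of branches depends on the compiler and flags, so check the generated assembly.

```cpp
#include <algorithm>

// Compare-exchange: afterwards a <= b.
// Written with std::min/std::max so the compiler has the option of
// emitting conditional moves rather than a conditional jump
// (an assumption about codegen, not a guarantee).
inline void compare_exchange(int& a, int& b) {
    const int lo = std::min(a, b);
    const int hi = std::max(a, b);
    a = lo;
    b = hi;
}

// Three-element sorting network: the same three compare-exchanges
// run in the same order for every input, so there is no
// data-dependent control flow to mispredict.
inline void sort3(int* p) {
    compare_exchange(p[0], p[1]);
    compare_exchange(p[1], p[2]);
    compare_exchange(p[0], p[1]);
}
```

In a quicksort-style recursion, a routine like this would be called once a partition shrinks to three elements, with a single compare_exchange handling the two-element case.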

I suspect that the algorithms used in the article, implemented just a little differently, could still yield substantially better performance, beating my simpler code.

This is a consequence of our deal with the devil: caches and speculation make our code faster, but we can no longer know whether it is objectively fast, or how much faster it could be. What seemed fast becomes slow the moment somebody does it faster.