The key point of this article (I'm the author) is that even if cmov were a 50-cycle-latency instruction and the penalty of a mispredicted branch were zero (i.e. the moment the CPU sees a branch is mispredicted, it would immediately start executing the instructions from the correct branch), cmov would still be faster on modern CPUs.
This is because in branchless quicksort, no matter how long it takes to figure out where an element goes relative to the pivot, that latency doesn't obstruct the CPU's ability to start working on the next element the very next cycle.
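A sketch of what such a branchless partition step can look like (my own illustration of a Lomuto-style branchless partition, not the article's exact code): the comparison result feeds an index increment rather than a branch, so the next iteration can start regardless of how long the comparison takes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Branchless Lomuto-style partition (illustrative sketch).
// The swap is unconditional and the store index advances by the
// comparison result, which compiles to setcc/arithmetic instead of a
// conditional branch. Long comparison latency only delays 'store',
// not the fetch of the next element.
std::size_t partition_branchless(std::vector<int>& v, int pivot) {
    std::size_t store = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        int x = v[i];
        v[i] = v[store];
        v[store] = x;             // unconditional swap
        store += (x < pivot);     // branch-free advance
    }
    return store;                 // elements [0, store) are < pivot
}
```

The unconditional swap does more memory traffic than classic Lomuto, but there is no data-dependent branch for the predictor to miss.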
I agree with the first part -- I implemented the movable std::sort for g++, and I found it very hard to convince the compiler to keep the pivot in a register (although I did write it over 10 years ago, maybe the compiler would do a better job by now!)
I think life gets harder when you have to deal with iterators, and have to use iterator::reference (which might not be a raw C++ reference, although it almost always is in practice).
Yes, that is very much possible. In general you won't get four equal-length ranges though. For instance, in an already sorted array the median results in an (n/2, 0) and a (0, n/2) division. But the average case should be two sets of merge intervals, both of size (n/4, n/4). In reality I think you will quickly run into other limits: you need quite a few registers, and they shouldn't spill to the stack.
I think the median of the results works great in any case (but you are the expert here). If we find the result median in each of the two inputs, say 'a' and 'b', we have four merges (two forward, two reverse):
[0, ..) w (0, ..)
(.., a) w (.., b)
[a, ..) w [b, ..)
(.., n) w (.., n)
each of which should produce n/4 elements and then meet.
If the array is already sorted then great; that split will just "merge":
[0, ..) w nothing
(.., n/2) w nothing
nothing w [n/2, ..)
nothing w (.., n)
which should go very fast. Anyhow, possible I'm missing something!
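The split point described above can be found with a binary search over one input. A hedged sketch (function name and bounds handling are mine): find (i, j) with i + j = k such that merging a[0..i) with b[0..j) yields exactly the first k elements of the full merge.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Find the split (i, j), i + j = k, such that every element of
// a[0..i) and b[0..j) precedes every element of a[i..) and b[j..)
// in the merged output. Binary search on i: if a[i] < b[j-1], a's
// side must take more elements.
std::pair<std::size_t, std::size_t>
merge_split(const std::vector<int>& a, const std::vector<int>& b,
            std::size_t k) {
    std::size_t lo = k > b.size() ? k - b.size() : 0;
    std::size_t hi = std::min(k, a.size());
    while (lo < hi) {
        std::size_t i = lo + (hi - lo) / 2;
        std::size_t j = k - i;
        if (j > 0 && i < a.size() && a[i] < b[j - 1])
            lo = i + 1;   // split takes too few from a
        else
            hi = i;       // i is feasible; try smaller
    }
    return {lo, k - lo};
}
```

With k = n/2 this gives the midpoint split; the two halves can then be merged independently, forward and reverse. In the already-sorted case the search degenerates to taking everything from one side, matching the (n/2, 0) division above.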
A very interesting suggestion indeed. Shell sort is an insertion sort though. It owes its better complexity to being able to break out of the inner loop early. But those are hard-to-predict branches, which bubble sort avoids.
The "hybrid" as used in this article refers to a partition scheme that combines a "Lomuto partitioning like algorithm" with a "Hoare partitioning like algorithm".
This amplifies the strength of each scheme and mitigates the weakness. I.e. Lomuto does too many swaps, but naturally fits a branchfree implementation; Hoare is frugal with swaps, but rather branchy. The "hybrid" here inherits the frugality of swaps from Hoare's scheme, while benefiting from the branchfree implementation of Lomuto's scheme.
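To make the contrast concrete, here is a textbook Hoare-style partition (my own sketch, not the article's hybrid): only misplaced pairs get swapped, so swap count is low, but both inner scan loops branch on the data.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Textbook Hoare-style partition, for contrast with branchless
// Lomuto. Two pointers scan inward; only elements on the wrong side
// are swapped (frugal), but both while-loops terminate on a
// data-dependent comparison (branchy, hard to predict on random data).
std::size_t partition_hoare(std::vector<int>& v, int pivot) {
    std::size_t i = 0, j = v.size();
    while (true) {
        while (i < j && v[i] < pivot) ++i;       // data-dependent branch
        while (i < j && v[j - 1] >= pivot) --j;  // data-dependent branch
        if (i == j) return i;  // [0,i) < pivot, [i,n) >= pivot
        std::swap(v[i], v[j - 1]);               // only misplaced pairs
        ++i; --j;
    }
}
```

On random input roughly every comparison here is a coin flip for the branch predictor, which is exactly the weakness the hybrid avoids by keeping Lomuto's branchfree inner loop.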
What you seem to describe here is the classical and well-known hybridization of divide and conquer algorithms. This is also used and implicitly implied when branchless bubble sort is discussed. But the "hybrid" as used in the article wasn't meant to refer to this.
That was just hand analysis of the data flow together with Google Diagrams.
As I mention in the article, dependency analysis has, IMO, not received the widespread attention it should. I like Fabian Giesen's article if you want an introduction.
One particularly interesting example in this context is OSACA (Open Source Architecture Code Analyzer), https://github.com/RRZE-HPC/osaca
The related publications explain the approach used to model and perform critical path analysis relevant for modern superscalar out-of-order processors:
There's also a broader line of research on performance modeling in this vein (some pretty detailed, including microarchitectural details like branch misprediction penalties, ROB capacity, etc.):
https://gist.github.com/MattPD/85aad98ee8b135e675d49c571b67f...
More on modeling microarchitectural details (chronological order):
- Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith. "A mechanistic performance model for superscalar out-of-order processors." ACM Trans. Comput. Syst. 27, 2 (2009) - http://www.elis.ugent.be/~leeckhou/papers/tocs09.pdf
- Maximilien B. Breughe, Stijn Eyerman, and Lieven Eeckhout. "Mechanistic analytical modeling of superscalar in-order processor performance." ACM Trans. Architec. Code Optim. 11, 4, Article 50 (2014) - https://users.elis.ugent.be/~leeckhou/papers/taco2015-breugh....
For more clarification: the numbers 0 and 1 in these benchmarks denote the level of indirection. IndirectionSort<0, xxx> sorts pointers by their address, while IndirectionSort<1, xxx> sorts pointers by the value they point to. Notice how MergeSort regresses much more than QuickSort: without an indirection MergeSort is competitive; with an indirection QuickSort becomes relatively much better. As you point out, sorting pointers to structs is a much more important use case. This property matters even more because L1 cache is typically rather small, O(32 KB), while L2/L3 cache is large, O(1 MB). So when sorting pointers it's not unreasonable to hit L2/L3 cache but miss L1. With L2/L3 typically at 10-20 cycles latency (vs. ~5 cycles for L1), QuickSort is largely insensitive to the L1 cache misses.
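The two indirection levels can be illustrated like this (my reading of the benchmark names, not the benchmark's actual code): level 0 compares the pointer values themselves, level 1 dereferences both pointers on every comparison, adding a potentially L1-missing load to each one.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Level 0: sort the pointers by their address. Comparisons stay in
// registers; no extra memory traffic per comparison.
void sort_by_address(std::vector<const int*>& p) {
    std::sort(p.begin(), p.end());
}

// Level 1: sort the pointers by the value they point to. Every
// comparison performs two dependent loads, so comparison latency is
// dominated by wherever the pointees live in the cache hierarchy.
void sort_by_pointee(std::vector<const int*>& p) {
    std::sort(p.begin(), p.end(),
              [](const int* a, const int* b) { return *a < *b; });
}
```

With level 1, a branchless quicksort can keep issuing the loads for subsequent elements while earlier comparisons are still waiting on L2/L3, which is why it tolerates the extra latency so well.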