The key point of this article (I'm the author) is that even if cmov were a 50-cycle-latency instruction and the penalty of a mispredicted branch were zero (i.e. the moment the CPU sees a branch is mispredicted, it would immediately start executing the instructions from the correct branch), cmov would still be faster on modern CPUs.
This is because in branchless quicksort, no matter how long it takes to figure out where an element goes relative to the pivot, that latency doesn't obstruct the CPU's ability to start working on the next element the very next cycle.
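A sketch of what such a branchless partition step can look like (my own illustration of a Lomuto-style branchless partition, not the article's exact code): the comparison result feeds an index increment rather than a branch, so the next iteration can start regardless of how long the comparison takes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Branchless Lomuto-style partition (illustrative sketch).
// The swap is unconditional and the store index advances by the
// comparison result, which compiles to setcc/arithmetic instead of a
// conditional branch. Long comparison latency only delays 'store',
// not the fetch of the next element.
std::size_t partition_branchless(std::vector<int>& v, int pivot) {
    std::size_t store = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        int x = v[i];
        v[i] = v[store];
        v[store] = x;             // unconditional swap
        store += (x < pivot);     // branch-free advance
    }
    return store;                 // elements [0, store) are < pivot
}
```

The unconditional swap does more memory traffic than classic Lomuto, but there is no data-dependent branch for the predictor to miss.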
I agree with the first part -- I implemented the movable std::sort for g++, and I found it very hard to convince the compiler to keep the pivot in a register (although I did write it over 10 years ago, maybe the compiler would do a better job by now!)
I think life gets harder when you have to deal with iterators, and have to use iterator::reference (which might not be a raw C++ reference, although it almost always is in practice).
Yes, that is very much possible. In general you won't get four equal-length ranges though. For instance, in an already sorted array the median results in an (n/2, 0) and a (0, n/2) division. But the average case should be two sets of merge intervals, both of size (n/4, n/4). In reality I think you will quickly run into other limits: you need quite a few registers, and they shouldn't spill to the stack.
I think the median of the results works great in any case (but you are the expert here). If we find the result median in each of the two inputs, say 'a' and 'b', we have four merges (two forward, two reverse):
[0, ..) w (0, ..)
(.., a) w (.., b)
[a, ..) w [b, ..)
(.., n) w (.., n)
each of which should produce n/4 elements and then meet.
If the array is already sorted then great; that split will just "merge":
[0, ..) w nothing
(.., n/2) w nothing
nothing w [n/2, ..)
nothing w (.., n)
which should go very fast. Anyhow, possible I'm missing something!
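The split point described above can be found with a binary search over one input. A hedged sketch (function name and bounds handling are mine): find (i, j) with i + j = k such that merging a[0..i) with b[0..j) yields exactly the first k elements of the full merge.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Find the split (i, j), i + j = k, such that every element of
// a[0..i) and b[0..j) precedes every element of a[i..) and b[j..)
// in the merged output. Binary search on i: if a[i] < b[j-1], a's
// side must take more elements.
std::pair<std::size_t, std::size_t>
merge_split(const std::vector<int>& a, const std::vector<int>& b,
            std::size_t k) {
    std::size_t lo = k > b.size() ? k - b.size() : 0;
    std::size_t hi = std::min(k, a.size());
    while (lo < hi) {
        std::size_t i = lo + (hi - lo) / 2;
        std::size_t j = k - i;
        if (j > 0 && i < a.size() && a[i] < b[j - 1])
            lo = i + 1;   // split takes too few from a
        else
            hi = i;       // i is feasible; try smaller
    }
    return {lo, k - lo};
}
```

With k = n/2 this gives the midpoint split; the two halves can then be merged independently, forward and reverse. In the already-sorted case the search degenerates to taking everything from one side, matching the (n/2, 0) division above.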
A very interesting suggestion indeed. Shell sort is an insertion sort though. It owes its better complexity to being able to break out of the inner loop early. But those are hard-to-predict branches, which bubble sort avoids.
The "hybrid" as used in this article refers to a partition scheme that combines a "Lomuto partitioning like algorithm" with a "Hoare partitioning like algorithm".
This amplifies the strength of each scheme and mitigates the weakness. I.e. Lomuto does too many swaps, but naturally fits a branchfree implementation; Hoare is frugal with swaps, but rather branchy. The "hybrid" here inherits the frugality of swaps from Hoare's scheme, while benefiting from the branchfree implementation of Lomuto's scheme.
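To make the contrast concrete, here is a textbook Hoare-style partition (my own sketch, not the article's hybrid): only misplaced pairs get swapped, so swap count is low, but both inner scan loops branch on the data.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Textbook Hoare-style partition, for contrast with branchless
// Lomuto. Two pointers scan inward; only elements on the wrong side
// are swapped (frugal), but both while-loops terminate on a
// data-dependent comparison (branchy, hard to predict on random data).
std::size_t partition_hoare(std::vector<int>& v, int pivot) {
    std::size_t i = 0, j = v.size();
    while (true) {
        while (i < j && v[i] < pivot) ++i;       // data-dependent branch
        while (i < j && v[j - 1] >= pivot) --j;  // data-dependent branch
        if (i == j) return i;  // [0,i) < pivot, [i,n) >= pivot
        std::swap(v[i], v[j - 1]);               // only misplaced pairs
        ++i; --j;
    }
}
```

On random input roughly every comparison here is a coin flip for the branch predictor, which is exactly the weakness the hybrid avoids by keeping Lomuto's branchfree inner loop.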
What you seem to describe here is the classical and well-known hybridization of divide and conquer algorithms. This is also used and implicitly implied when branchless bubble sort is discussed. But the "hybrid" as used in the article wasn't meant to refer to this.
That was just hand analysis of the data flow together with Google Diagrams.
As I mention in the article, dependency analysis has, IMO, not received the widespread attention it should. I like Fabian Giesen's article if you want an introduction.
One particularly interesting example in this context is OSACA (Open Source Architecture Code Analyzer), https://github.com/RRZE-HPC/osaca
The related publications explain the approach used to model and perform critical path analysis relevant for modern superscalar out-of-order processors:
There's also a broader line of research on performance modeling in this vein (some pretty detailed, including microarchitectural details like branch misprediction penalties, ROB capacity, etc.):
https://gist.github.com/MattPD/85aad98ee8b135e675d49c571b67f...
More on modeling microarchitectural details (chronological order):
- Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith. "A mechanistic performance model for superscalar out-of-order processors." ACM Trans. Comput. Syst. 27, 2 (2009) - http://www.elis.ugent.be/~leeckhou/papers/tocs09.pdf
- Maximilien B. Breughe, Stijn Eyerman, and Lieven Eeckhout. "Mechanistic analytical modeling of superscalar in-order processor performance." ACM Trans. Architec. Code Optim. 11, 4, Article 50 (2014) - https://users.elis.ugent.be/~leeckhou/papers/taco2015-breugh....
For more clarification: the numbers 0 and 1 in these benchmarks denote the level of indirection. IndirectionSort<0, xxx> sorts pointers by their address, while IndirectionSort<1, xxx> sorts pointers by the value they point to. Notice how MergeSort regresses much more than QuickSort: without an indirection MergeSort is competitive; with an indirection QuickSort becomes relatively much better. As you point out, sorting pointers to structs is a much more important use case. This property matters even more because L1 cache is typically rather small, O(32 KB), while L2/L3 cache is large, O(1 MB). So when sorting pointers it's not unreasonable to hit L2/L3 cache but miss L1. With L2/L3 typically at 10-20 cycles latency (vs. ~5 cycles for L1), QuickSort is largely insensitive to the L1 cache misses.
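The two indirection levels can be illustrated like this (my reading of the benchmark names, not the benchmark's actual code): level 0 compares the pointer values themselves, level 1 dereferences both pointers on every comparison, adding a potentially L1-missing load to each one.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Level 0: sort the pointers by their address. Comparisons stay in
// registers; no extra memory traffic per comparison.
void sort_by_address(std::vector<const int*>& p) {
    std::sort(p.begin(), p.end());
}

// Level 1: sort the pointers by the value they point to. Every
// comparison performs two dependent loads, so comparison latency is
// dominated by wherever the pointees live in the cache hierarchy.
void sort_by_pointee(std::vector<const int*>& p) {
    std::sort(p.begin(), p.end(),
              [](const int* a, const int* b) { return *a < *b; });
}
```

With level 1, a branchless quicksort can keep issuing the loads for subsequent elements while earlier comparisons are still waiting on L2/L3, which is why it tolerates the extra latency so well.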