
It is only linear in std::distance(first, last), so it very well might be (and likely is) O(n log k).

Edit: see comment below for why it actually is O(n) and not O(n log k) as I had thought.



But std::distance(first, last) is how n is defined.

std::nth_element is typically implemented using a fancy version of quickselect called introselect. You can imagine quickselect as a version of quicksort where the recursive call is only executed on the side which contains the element. Introselect adds some fanciness to ensure worst-case linear running time if quickselect takes too long (bad pivot selection).
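
For concreteness, here's a tiny example of what std::nth_element actually guarantees (the values are just made up for illustration):

    #include <algorithm>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<int> v = {9, 1, 8, 2, 7, 3, 6, 4, 5};
        std::size_t k = 3;

        // After this call, v[k] holds the element that would sit at index k
        // if the whole range were sorted; everything before it is <= v[k],
        // everything after it is >= v[k], but neither side is itself sorted.
        std::nth_element(v.begin(), v.begin() + k, v.end());

        std::cout << "4th smallest element: " << v[k] << '\n';  // prints 4
        return 0;
    }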


Very cool! I stand corrected, I didn't think of this when I considered it. Sorry for the mix-up.


It's O(n log k) then? It maintains a heap, I'm guessing?


No, as per my previous comment, it does not use a heap. It's like quicksort, but only doing one of the recursive calls:

- choose a pivot p randomly from the elements

- partition into elements <= p and > p. This takes time O(n).

- if k (the rank of the element we want) is at most the number of elements <= p, do a recursive call on those elements. Otherwise recurse on the other elements (those > p), with k reduced by the number of elements just discarded.

Because a random pivot splits the elements well enough most of the time, this has an expected running time of O(n): the work per round shrinks geometrically, so it's n·(1 + ½ + ¼ + ⅛ + …) < 2n in the best case, and similar constants if the partitioning is a little worse. But in the worst case, when the pivot is the smallest/largest element in each iteration, it's O(n²). That's where the fanciness of introselect comes in: if quickselect doesn't converge, it switches to an algorithm with worst-case O(n) time, which is slower in most cases but saves the day when quickselect has trouble.
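
Roughly like this in code (just a sketch of the idea above, not how std::nth_element is actually written; a real introselect works in place and adds the worst-case fallback):

    #include <cstddef>
    #include <cstdlib>
    #include <utility>
    #include <vector>

    int quickselect(std::vector<int> v, std::size_t k) {
        // k is 0-based: k == 0 asks for the smallest element.
        while (v.size() > 1) {
            // Choose a pivot at random.
            int p = v[std::rand() % v.size()];

            // Three-way partition (equal elements kept separate so repeated
            // values can't cause an endless loop); O(n) work per round.
            std::vector<int> lt, eq, gt;
            for (int x : v) {
                if (x < p) lt.push_back(x);
                else if (x == p) eq.push_back(x);
                else gt.push_back(x);
            }

            if (k < lt.size()) {
                v = std::move(lt);               // answer is below the pivot
            } else if (k < lt.size() + eq.size()) {
                return p;                        // the pivot itself is the answer
            } else {
                k -= lt.size() + eq.size();      // answer is above the pivot
                v = std::move(gt);
            }
        }
        return v[0];
    }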


See "median of medians algorithm": https://en.wikipedia.org/wiki/Median_of_medians

It's hairier than a partial quicksort, but (if you erase the proof and weigh the chalk dust) guaranteed linear time.
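
For the curious, a rough sketch of the median-of-medians idea (illustration only; real implementations partition in place rather than copying around):

    #include <algorithm>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // k is 0-based. Worst-case O(n) selection via median-of-medians pivoting.
    int mom_select(std::vector<int> v, std::size_t k) {
        if (v.size() <= 5) {
            std::sort(v.begin(), v.end());
            return v[k];
        }

        // Median of each group of (up to) 5 elements.
        std::vector<int> medians;
        for (std::size_t i = 0; i < v.size(); i += 5) {
            std::size_t end = std::min(i + 5, v.size());
            std::vector<int> group(v.begin() + i, v.begin() + end);
            std::sort(group.begin(), group.end());
            medians.push_back(group[group.size() / 2]);
        }

        // The pivot is the median of the group medians (recursive call).
        // It's guaranteed to have a constant fraction of elements on each
        // side, which is what makes the whole thing worst-case linear.
        int p = mom_select(medians, medians.size() / 2);

        // Three-way partition around the pivot.
        std::vector<int> lt, eq, gt;
        for (int x : v) {
            if (x < p) lt.push_back(x);
            else if (x == p) eq.push_back(x);
            else gt.push_back(x);
        }

        if (k < lt.size()) return mom_select(std::move(lt), k);
        if (k < lt.size() + eq.size()) return p;
        return mom_select(std::move(gt), k - lt.size() - eq.size());
    }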


And median of medians can be used to implement quicksort with O(N log N) worst case. It is usually not worth it, as other hybrid quicksorts like introsort are faster in practice.

On an unrelated note, I was once asked to prove the upper bound for median of medians in an interview...


Thanks, right, the idea is to only recurse down one side on each pass, right? This is the (sub)family of "selection" algorithms?

This is interesting; I had not heard of introselect before. I will have to read up on this.


Yup, exactly. It's a fun area that gets even more interesting when you look at distributed systems, where every node only stores a small portion of the overall data and you want to keep communication volume low.


>"It's a fun area that gets even more interesting when you look at distributed systems"

Can you elaborate a little bit, specifically how selection algorithms help reduce "chatter" in distributed systems? I am not familiar with this, and it sounds like an interesting context.

I have heard of Merkle Trees for similar but that's obviously hashing and not selection algorithms, or is there some connection I am not making?


Well, if you'd like to select the k-th biggest element from a distributed array, it's impractical (if not impossible) to send everything to one machine, so another approach is needed. A fairly simple algorithm would select a pivot randomly on a random node and broadcast it to all other nodes. Each node then does the partitioning locally and counts partition sizes. Now you do a sum all-reduction on the sizes to decide which side you need to recurse on. This takes logarithmically many rounds in expectation, just like the sequential algorithm, but it can result in very unbalanced work if the data isn't distributed randomly, and bad pivot selection hurts even more here because every extra round costs a round of communication. There are various techniques to overcome these challenges, but it's a bit much to go into here.
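
If it helps, here's roughly what that could look like in MPI-flavoured C++. This is just my own sketch of the scheme described above (the pivot-selection shortcut and all the names are mine, not code from any real library):

    #include <mpi.h>
    #include <climits>
    #include <cstdlib>
    #include <utility>
    #include <vector>

    // Illustrative distributed quickselect: broadcast a pivot, partition
    // locally, sum-reduce the partition sizes, recurse on one side.
    // 'local' is this node's share of the data; 'k' is the 0-based global
    // rank (in ascending order) of the element we want.
    long distributed_select(std::vector<long> local, long k, MPI_Comm comm) {
        while (true) {
            // Each node proposes a random local element; take the maximum of
            // the proposals as the pivot. A simplification of "pick a random
            // pivot on a random node": biased, but always an actual element,
            // so every round removes at least the pivot itself.
            long cand = local.empty() ? LONG_MIN
                                      : local[std::rand() % local.size()];
            long pivot;
            MPI_Allreduce(&cand, &pivot, 1, MPI_LONG, MPI_MAX, comm);

            // Partition locally: O(local n) work, no data exchanged.
            std::vector<long> lt, eq, gt;
            for (long x : local) {
                if (x < pivot) lt.push_back(x);
                else if (x == pivot) eq.push_back(x);
                else gt.push_back(x);
            }

            // Sum all-reduction of the partition sizes; every node then makes
            // the same decision about which side the rank-k element lies in.
            long counts[2] = {(long)lt.size(), (long)eq.size()};
            long totals[2];
            MPI_Allreduce(counts, totals, 2, MPI_LONG, MPI_SUM, comm);

            if (k < totals[0]) {
                local = std::move(lt);
            } else if (k < totals[0] + totals[1]) {
                return pivot;                    // the pivot is the answer
            } else {
                k -= totals[0] + totals[1];
                local = std::move(gt);
            }
        }
    }

Note that only pivots and counts ever travel over the network; the data itself stays put, which is the whole point.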



