
That does not mean that B-trees are unequivocally better than binary search trees. There are some applications, like concurrency or persistence, where comparison count is not as important.

For instance, fewer values per node means less information to copy when copying the node. Locking is more granular with fewer values per node because fewer values are locked at a time.
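To make the copying cost concrete: in a persistent (immutable) binary search tree, an insert copies only the nodes on the search path, and each copied node carries a single value. Here is a minimal path-copying sketch (my own illustrative node layout, not code from any particular library); a persistent B-tree would have to copy every value in each node it touches:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Node:
    key: int
    left: "Optional[Node]" = None
    right: "Optional[Node]" = None

def insert(root: "Optional[Node]", key: int) -> Node:
    """Persistent insert: copies only the O(log n) nodes on the
    search path; untouched subtrees are shared between versions."""
    if root is None:
        return Node(key)
    if key < root.key:
        return Node(root.key, insert(root.left, key), root.right)
    if key > root.key:
        return Node(root.key, root.left, insert(root.right, key))
    return root  # key already present; share the whole subtree

v1 = insert(insert(insert(None, 2), 1), 3)
v2 = insert(v1, 0)           # new version
assert v1.left.left is None  # old version unchanged
assert v2.right is v1.right  # untouched subtree is shared
```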

The binary structure also makes it possible to support randomized balancing and self-adjusting strategies like splay trees.

I am disappointed to still see frequent mention of red-black trees – I see no reason for red-black trees to ever be the best choice in practice, in theory, or in school. LBSTs (logarithmic binary search trees) [1] are generally simpler, more intuitive, and more efficient [2] than red-black trees. They also support positional access inherently, so the balancing information (subtree size) is useful in its own right.

[1] https://www.semanticscholar.org/paper/A-New-Method-for-Balan... [2] https://rtheunissen.github.io/bst
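To sketch what "positional access for free" means: any BST that stores subtree sizes for balancing can answer "what is the k-th smallest key?" by walking down the tree. This is a generic size-augmented BST sketch (not code from the LBST paper):

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        # The size field used for balancing doubles as rank information.
        self.size = 1 + size(left) + size(right)

def size(node):
    return node.size if node else 0

def select(node, k):
    """Return the k-th smallest key (0-based) using subtree sizes."""
    while node:
        left = size(node.left)
        if k < left:
            node = node.left
        elif k == left:
            return node.key
        else:
            k -= left + 1
            node = node.right
    raise IndexError(k)
```

A red-black tree's color bits, by contrast, give you nothing beyond balance.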



> For instance, fewer values per node means less information to copy when copying the node. Locking is more granular with fewer values per node because fewer values are locked at a time.

But this also assumes single-threaded computation. Increasingly, high-performance computers are SIMD machines (i.e. GPUs — and even when not, you're at a minimum using AVX-512 or ARM SVE).

If you have 1024 threads per thread group (a common maximum on GPUs), those threads map naturally onto 1024-element B-Tree nodes, since a node of that size can be sorted in a single pass through a sorting network. (Bitonic network: https://en.wikipedia.org/wiki/Bitonic_sorter)
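For reference, here is a scalar sketch of the recursive bitonic sort (size must be a power of two). The key property for GPUs is that every compare-exchange within a merge stage is independent, so n/2 threads can each perform one exchange per stage:

```python
def bitonic_sort(a, lo=0, n=None, ascending=True):
    """Recursive bitonic sort of a[lo:lo+n]; n must be a power of two."""
    if n is None:
        n = len(a)
    if n <= 1:
        return a
    half = n // 2
    bitonic_sort(a, lo, half, True)          # ascending half
    bitonic_sort(a, lo + half, half, False)  # descending half -> bitonic
    _bitonic_merge(a, lo, n, ascending)
    return a

def _bitonic_merge(a, lo, n, ascending):
    if n <= 1:
        return
    half = n // 2
    # One stage: all n/2 compare-exchanges are independent of each other,
    # which is what lets a GPU run them as one thread per exchange.
    for i in range(lo, lo + half):
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    _bitonic_merge(a, lo, half, ascending)
    _bitonic_merge(a, lo + half, half, ascending)
```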

A single lock covering a whole 1024-element node would better match these very wide GPUs, which perform very wide SIMD execution in practice. Physically, GPUs are 32 lanes wide on NVIDIA and 32 or 64 lanes wide on AMD, but various software tricks increase the effective width, either by ganging up neighboring compute units in parallel and synchronizing them, or by running the same code over sequential passes, as in the AMD GCN architecture.

However, the programming interface of a 1024-wide-looking piece of hardware naturally maps to 1024-wide B-Tree nodes, a 1024-wide bitonic sort (or MergePath sort), and other such parallel sorting networks.
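Searching a wide node is itself data-parallel: each lane compares the query against one key, and the child index to descend into falls out of a reduction over the comparison results. A scalar Python sketch standing in for the SIMD version (hypothetical helper, just to show the mapping):

```python
def child_index(keys, query):
    """Pick which child to descend into for a B-tree node with sorted
    `keys`. Each comparison is independent, so a 1024-key node maps to
    one comparison per GPU thread followed by a sum (popcount) reduction."""
    return sum(1 for k in keys if k < query)

node_keys = [10, 20, 30, 40]
assert child_index(node_keys, 5) == 0    # leftmost child
assert child_index(node_keys, 25) == 2
assert child_index(node_keys, 99) == 4   # rightmost child
```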

----------

B-Trees have a fan-out of "some number larger than 2", and for modern high-performance SIMD-based GPUs, I think that number absolutely should be much larger than 2.

What I'm wondering is whether the optimal algorithm is 32-wide (i.e. one NVIDIA warp, the smallest unit of execution on an NVIDIA GPU), or 1024-wide (the maximum gang of warps that can work as a team). Or whether using that maximum gang across an even larger node (e.g. 4096-wide) has any benefits.

--------

We are reaching the point where compute is so cheap that even a 1024-wide parallel sort is a trifling thing.


Learning a lot here, thank you.



