> For instance, fewer values per node means less information to copy when copying the node. Locking is more granular with fewer values per node because fewer values are locked at a time.
But this also assumes single-threaded computation. Increasingly, high-performance computers are SIMD machines (i.e. GPUs, and if not GPUs, then at minimum AVX-512 or ARM SVE on the CPU).
If you have 1024 threads per thread group (a common maximum on GPUs), it's natural to use them to build 1024-element B-Tree nodes, since a node can be sorted in one fixed sequence of compare-exchange passes, i.e. a sorting network. (Bitonic network: https://en.wikipedia.org/wiki/Bitonic_sorter)
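To make that concrete, here's a minimal sketch (my own, not from any particular library) of what the single-node sort could look like in CUDA: one key per thread, one 1024-thread block per node. The kernel name and the int-key layout are assumptions for illustration.

```cuda
// Hypothetical sketch: sort one 1024-key node with one 1024-thread block
// via a bitonic network. Assumes NODE_WIDTH is a power of two and the
// node's keys fit in shared memory. Launch: sort_node<<<1, NODE_WIDTH>>>(keys)
#include <cuda_runtime.h>

#define NODE_WIDTH 1024  // one key per thread

__global__ void sort_node(int *keys)
{
    __shared__ int s[NODE_WIDTH];
    unsigned tid = threadIdx.x;
    s[tid] = keys[tid];
    __syncthreads();

    // Standard bitonic network: for N = 1024 there are 10 stages and
    // stage of size k contributes log2(k) compare-exchange passes.
    for (unsigned k = 2; k <= NODE_WIDTH; k <<= 1) {
        for (unsigned j = k >> 1; j > 0; j >>= 1) {
            unsigned partner = tid ^ j;
            if (partner > tid) {
                bool ascending = ((tid & k) == 0);
                int a = s[tid], b = s[partner];
                if ((a > b) == ascending) {  // compare-exchange
                    s[tid] = b;
                    s[partner] = a;
                }
            }
            __syncthreads();
        }
    }
    keys[tid] = s[tid];
}
```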
A single lock that covers the whole 1024-element node would better match these very wide GPUs, which perform very wide SIMD execution in practice. Physically, GPUs are 32-wide on NVidia and 32- or 64-wide on AMD. But various software tricks increase the effective width, either by ganging up neighboring compute units in parallel and synchronizing them, or by running the same code across sequential passes, as in the AMD GCN architecture.
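A sketch of one way the coarse lock could look (hypothetical helpers of mine, simplified memory-ordering-wise): thread 0 of the block takes a per-node spinlock on behalf of all 1024 threads, so lock granularity matches the SIMD width.

```cuda
// Hypothetical sketch: one lock word per 1024-key node, acquired by
// thread 0 on behalf of the whole block.
__device__ void lock_node(int *lock)
{
    if (threadIdx.x == 0)
        while (atomicCAS(lock, 0, 1) != 0) { /* spin */ }
    __syncthreads();  // the other 1023 threads wait until the lock is held
}

__device__ void unlock_node(int *lock)
{
    __syncthreads();   // every thread has finished touching the node
    __threadfence();   // publish the node's new contents to other blocks
    if (threadIdx.x == 0)
        atomicExch(lock, 0);
}
```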
However, the programming interface of hardware that looks 1024-wide naturally maps to 1024-wide B-Tree nodes, a 1024-wide bitonic sort (or a MergePath-based sort), and other such parallel sorting schemes.
----------
B-Trees have a branching factor of "some number larger than 2", and for modern high-performance SIMD-based GPUs I think that number absolutely should be much larger than 2.
What I'm wondering is whether the optimal node is 32-wide (one NVidia warp, the smallest unit of computation on an NVidia GPU), or 1024-wide (the maximum gang of warps that can work as a team), or whether running that maximum gang across an even larger node (e.g. 4096-wide) has any benefits.
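The appeal of the 32-wide case is that a single warp can search a 32-key node in one step with a ballot, no sorting network needed. A minimal sketch (the function name and int-key layout are my assumptions):

```cuda
// Hypothetical sketch: find the child index in a 32-key node using one warp.
// Each lane reads one key; the ballot yields a bitmap of lanes whose key is
// >= the search key, and the first set bit is the branch to take.
__device__ int find_child(const int *node_keys, int key)
{
    int lane = threadIdx.x & 31;
    bool ge = node_keys[lane] >= key;
    unsigned mask = __ballot_sync(0xffffffffu, ge);
    return mask ? (__ffs(mask) - 1) : 32;  // 32 = rightmost child
}
```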
--------
We are reaching the point where compute is so cheap that even a 1024-wide parallel sort is a trifling thing: with one thread per element, a 1024-element bitonic network is only 10*11/2 = 55 compare-exchange passes.