AFAIK, the main bottleneck in training is bandwidth (how fast data can be moved), not raw compute. Distributed GPU training has multiple orders of magnitude less bandwidth between workers than an equivalent number of colocated GPUs, because the workers don't share a physical bus or high-speed interconnect (NVLink/PCIe/InfiniBand) but communicate over an ordinary network connection instead. This work improves on that, but the fundamental limitation remains.
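
As a rough illustration of the gap (a back-of-envelope sketch; the model size and bandwidth figures below are my own assumed numbers, not from the paper): syncing fp16 gradients for a 7B-parameter model means moving on the order of 14 GB per worker per step, and the time that takes scales directly with the link bandwidth.

```python
# Back-of-envelope: time to move one full set of fp16 gradients per step
# under different interconnects. All figures are rough, assumed values.

PARAMS = 7e9                 # assumed model size: 7B parameters
BYTES_PER_PARAM = 2          # fp16 gradients
payload_gb = PARAMS * BYTES_PER_PARAM / 1e9   # ~14 GB per sync

# Approximate per-link bandwidths in GB/s (assumed; vary by hardware generation)
links = {
    "NVLink (same node)":         450.0,
    "InfiniBand (same cluster)":   50.0,
    "10 Gbit/s datacenter link":    1.25,
    "100 Mbit/s home internet":     0.0125,
}

for name, gbps in links.items():
    seconds = payload_gb / gbps
    print(f"{name:28s} ~{seconds:10.2f} s per gradient sync")
```

The spread between the first and last row is roughly four orders of magnitude, which is why naive synchronous data parallelism doesn't survive the move from a shared bus to a WAN link.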