Their formula for quantizing gradients involves quite a bit of extra computation, and it's not clear to me how much it would complicate an actual implementation (in software or hardware). To me, the most interesting question is whether we can train using just 8 bits everywhere (weights, activations, and gradients) without all those acrobatics. If so, we could get another significant (and free!) speedup on GPUs.
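For concreteness, here is roughly what I mean by training "without the acrobatics": just naive per-tensor 8-bit quantization with stochastic rounding applied to the gradients. This is a hedged illustration of that baseline, not the paper's formula; the function names and the choice of stochastic rounding are my own.

```python
import numpy as np

def quantize_int8(x, rng):
    """Naive per-tensor int8 quantization with stochastic rounding.

    Illustrative sketch only -- not the paper's gradient-quantization scheme.
    """
    scale = float(np.max(np.abs(x))) / 127.0
    if scale == 0.0:
        return np.zeros(x.shape, dtype=np.int8), 1.0
    scaled = x / scale
    # Stochastic rounding: round up with probability equal to the
    # fractional part, so the quantized value is unbiased in expectation.
    floor = np.floor(scaled)
    q = floor + (rng.random(x.shape) < (scaled - floor))
    q = np.clip(q, -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Round-trip a fake "gradient" tensor through 8-bit quantization.
rng = np.random.default_rng(1)
g = rng.normal(size=1000).astype(np.float32)
q, s = quantize_int8(g, rng)
err = float(np.abs(dequantize(q, s) - g).max())
```

The point of the sketch is how little machinery this needs: one scale per tensor, one clip, one rounding step. The open question is whether something this simple is accurate enough for the gradients, which is exactly where the paper spends its extra computation.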