As far as Blake3 in hardware for anything other than very low-power smart cards or similar:
Blake3 was designed from the ground up to be highly optimized for vector instructions operating on four vectors, each of 4 32-bit words. If you already have the usual 4x32 vector operations, plus a vector permute (to transform the operations across the diagonal of your 4x4 matrix into operations down columns) and the usual bypass network to reduce latency, I think it would rarely be worth the transistor budget to create dedicated Blake3 (or Blake2s/b) instructions.
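To make the column/diagonal structure concrete, here's a minimal scalar sketch (in Rust) of BLAKE3's G function and one round, using the first round's message ordering; the real compression function permutes the message words between rounds and handles the chaining value, counter, and flags around this:

    // One quarter-round. Each call mixes one column (or diagonal) of the 4x4
    // state `v`, plus two message words.
    fn g(v: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize, mx: u32, my: u32) {
        v[a] = v[a].wrapping_add(v[b]).wrapping_add(mx);
        v[d] = (v[d] ^ v[a]).rotate_right(16);
        v[c] = v[c].wrapping_add(v[d]);
        v[b] = (v[b] ^ v[c]).rotate_right(12);
        v[a] = v[a].wrapping_add(v[b]).wrapping_add(my);
        v[d] = (v[d] ^ v[a]).rotate_right(8);
        v[c] = v[c].wrapping_add(v[d]);
        v[b] = (v[b] ^ v[c]).rotate_right(7);
    }

    fn round(v: &mut [u32; 16], m: &[u32; 16]) {
        // Column step: the same mixing runs on all four columns, so with the
        // state held as four row vectors these are four identical vector ops.
        g(v, 0, 4, 8, 12, m[0], m[1]);
        g(v, 1, 5, 9, 13, m[2], m[3]);
        g(v, 2, 6, 10, 14, m[4], m[5]);
        g(v, 3, 7, 11, 15, m[6], m[7]);
        // Diagonal step: a vector permute (rotate rows 1/2/3 left by 1/2/3
        // lanes) turns these diagonals into columns, then the column code
        // runs again and the permute is undone.
        g(v, 0, 5, 10, 15, m[8], m[9]);
        g(v, 1, 6, 11, 12, m[10], m[11]);
        g(v, 2, 7, 8, 13, m[12], m[13]);
        g(v, 3, 4, 9, 14, m[14], m[15]);
    }

Everything here is 32-bit adds, XORs, and rotates, which is exactly what ordinary 4x32 vector units already provide.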
In contrast, SHA-3's state is conceptually five vectors each of five 64-bit words (a 5x5 grid of lanes), which doesn't map as neatly onto most vector ISAs. As I remember, its round mixes columns and rows of that grid (plus rotating each lane by a different amount), rather than the column and diagonal operations that parallelize so well on 4-lane vector hardware.
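For comparison, here's a minimal sketch (again Rust, with a state[x][y] indexing convention assumed) of just Keccak's theta step, where the five-lane column structure shows up; rho, pi, chi, and iota are left out:

    // Theta: XOR each lane with a parity value computed from two of the five
    // columns. This is the column-mixing part of one Keccak-f[1600] round.
    fn theta(state: &mut [[u64; 5]; 5]) {
        // Parity of each of the five columns.
        let mut c = [0u64; 5];
        for x in 0..5 {
            c[x] = state[x][0] ^ state[x][1] ^ state[x][2] ^ state[x][3] ^ state[x][4];
        }
        // Each column's correction combines its two neighboring columns.
        let mut d = [0u64; 5];
        for x in 0..5 {
            d[x] = c[(x + 4) % 5] ^ c[(x + 1) % 5].rotate_left(1);
        }
        // Apply the correction to every lane of the column.
        for x in 0..5 {
            for y in 0..5 {
                state[x][y] ^= d[x];
            }
        }
    }

Five 64-bit lanes per row or column is an awkward fit for 4-lane (or even 8-lane) vector registers, which is the mismatch described above.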
SHA-2 is a Merkle-Damgard construction whose compression function is a Davies-Meyer construction, with a highly unbalanced Feistel cipher as the internal block cipher. Conceptually, you have a queue of 8 words (32-bit or 64-bit, depending on the variant). Each round pops the first word from the queue, combines it in a nonlinear way with 6 of the other words, adds one word from the "key schedule" derived from the message (plus a round constant), and pushes the result onto the back of the queue. The one word that wasn't otherwise used is increased by part of that same mixture: the popped word, the key-schedule word, the round constant, and a non-linear function of 3 of the other words. As you might imagine, this doesn't map very well onto general-purpose vector instructions.

This cipher is wrapped in the Davies-Meyer step: save a copy of the state, encrypt it using the next block of the message as the key, and then add the saved copy to the encrypted result. That makes the compression function non-invertible, which makes meet-in-the-middle attacks much more difficult. The key schedule uses a variation on a lagged Fibonacci generator to expand each message block into a larger number of round keys.
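To make that concrete, here's a minimal Rust sketch of one SHA-256 round and the Davies-Meyer feedforward; the round constants K[t] and the expanded message words W[t] are taken as inputs, and the message-schedule expansion itself is left out:

    // One SHA-256 round. The 8-word state is the "queue": a new word goes in
    // at one end, everything else shifts by one, and d picks up an extra term.
    fn sha256_round(state: &mut [u32; 8], k_t: u32, w_t: u32) {
        let [a, b, c, d, e, f, g, h] = *state;
        // Mix the outgoing word h nonlinearly with e, f, g, the round
        // constant, and the message-derived word.
        let ch = (e & f) ^ (!e & g);
        let s1 = e.rotate_right(6) ^ e.rotate_right(11) ^ e.rotate_right(25);
        let t1 = h.wrapping_add(s1).wrapping_add(ch).wrapping_add(k_t).wrapping_add(w_t);
        // A second mixture of a, b, c.
        let maj = (a & b) ^ (a & c) ^ (b & c);
        let s0 = a.rotate_right(2) ^ a.rotate_right(13) ^ a.rotate_right(22);
        let t2 = s0.wrapping_add(maj);
        // Shift the queue: the otherwise-unused word d absorbs t1, and the new
        // word t1 + t2 is pushed in.
        *state = [t1.wrapping_add(t2), a, b, c, d.wrapping_add(t1), e, f, g];
    }

    // Davies-Meyer: save the chaining value, run the 64 rounds keyed by the
    // message block, then add the saved copy back in to make it one-way.
    fn compress(chaining: &mut [u32; 8], k: &[u32; 64], w: &[u32; 64]) {
        let saved = *chaining;
        for t in 0..64 {
            sha256_round(chaining, k[t], w[t]);
        }
        for i in 0..8 {
            chaining[i] = chaining[i].wrapping_add(saved[i]);
        }
    }

None of this maps onto the lanes of a wide vector the way the BLAKE column step does, which is part of why x86 and Arm both ended up adding dedicated SHA-256 instructions.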
> Blake3 was designed from the ground up to be highly optimized for vector instructions operating on four vectors, each of 4 32-bit words.
This is true, and the BLAKE family inherits this structure from ChaCha, but there's also more to it than that. If you have enough input to fill many blocks, you can run multiple blocks in parallel. In this situation, rather than dividing up the 16 words of a block into four vectors, you put each word in a different vector, and the words of each vector represent the same position in different blocks. (I.e., rather than representing columns or rows, the vectors point "out of the page".) There are several benefits to this arrangement (there's a code sketch of the layout after this list):
1. You don't need to do that diagonalization operation anymore.
2. If your CPU supports "instruction-level parallelism" for vector operations, working across the four words/vectors in a row gets to take advantage of that.
3. Best of all, you're no longer limited to 4-word vectors. If you have enough input to fill 8 blocks (AVX2) or 16 blocks (AVX-512), you can take full advantage of those much wider vector registers.
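To make that layout concrete, here's a scalar stand-in (Rust) where each "vector" holds the same word position from LANES independent blocks; 8 lanes would correspond to AVX2 and 16 to AVX-512, and every operation below is one wide SIMD instruction in the real thing:

    // Each of the 16 state words becomes a whole vector; lane i belongs to
    // block i. LANES = 8 is an arbitrary choice for the sketch.
    const LANES: usize = 8;
    type Lanes = [u32; LANES];

    fn add(a: Lanes, b: Lanes) -> Lanes {
        core::array::from_fn(|i| a[i].wrapping_add(b[i]))
    }

    fn xor_ror(a: Lanes, b: Lanes, r: u32) -> Lanes {
        core::array::from_fn(|i| (a[i] ^ b[i]).rotate_right(r))
    }

    // The same G mixing as before, but columns and diagonals are now just
    // different choices of which four vectors to pass in, so no
    // diagonalization shuffles are needed at all.
    fn g_wide(a: &mut Lanes, b: &mut Lanes, c: &mut Lanes, d: &mut Lanes, mx: Lanes, my: Lanes) {
        *a = add(add(*a, *b), mx);
        *d = xor_ror(*d, *a, 16);
        *c = add(*c, *d);
        *b = xor_ror(*b, *c, 12);
        *a = add(add(*a, *b), my);
        *d = xor_ror(*d, *a, 8);
        *c = add(*c, *d);
        *b = xor_ror(*b, *c, 7);
    }

One cost is transposing the message words from the blocks into this layout when you load them, but that's amortized over all the rounds.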
This is all easy to take advantage of in a stream cipher like ChaCha, because each block is independent. With a hash function, things are more complicated, because you usually have data dependencies between different blocks. That's why the tree structure of BLAKE3 (or, somewhat similarly, KangarooTwelve) is so important for performance. It's not just about multithreading; it's also about SIMD. See section 5.3 of the BLAKE3 paper for more on this.
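As a rough illustration of that (with placeholder leaf/parent functions rather than BLAKE3's real keying, counters, flags, or exact tree-shape rules), the point is that step 1 below has no data dependencies between chunks, so many chunk states can be advanced in lockstep with wide vectors, or farmed out to threads:

    // Hypothetical stand-ins for a chunk hash and a parent-node hash; the real
    // BLAKE3 functions take keys, counters, and domain-separation flags.
    const CHUNK_LEN: usize = 1024;

    fn leaf(chunk: &[u8]) -> u64 {
        // Placeholder (FNV-1a), only here to make the sketch self-contained.
        chunk.iter().fold(0xcbf29ce484222325u64, |h, &b| (h ^ b as u64).wrapping_mul(0x100000001b3))
    }

    fn parent(left: u64, right: u64) -> u64 {
        let mut buf = [0u8; 16];
        buf[..8].copy_from_slice(&left.to_le_bytes());
        buf[8..].copy_from_slice(&right.to_le_bytes());
        leaf(&buf)
    }

    fn tree_hash(input: &[u8]) -> u64 {
        // Step 1: hash every 1 KiB chunk independently -- this is the part
        // that vectorizes (and multithreads) because the chunks don't depend
        // on each other, unlike blocks in a sequential Merkle-Damgard chain.
        let mut nodes: Vec<u64> = input.chunks(CHUNK_LEN).map(leaf).collect();
        if nodes.is_empty() {
            nodes.push(leaf(&[]));
        }
        // Step 2: merge pairs of nodes level by level until one root remains.
        while nodes.len() > 1 {
            nodes = nodes
                .chunks(2)
                .map(|p| if p.len() == 2 { parent(p[0], p[1]) } else { p[0] })
                .collect();
        }
        nodes[0]
    }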