As far as Blake3 in hardware for anything other than very low-power smart cards or similar:
Blake3 was designed from the ground up to be highly optimized for vector instructions operating on four vectors, each of 4 32-bit words. If you already have the usual 4x32 vector operations, plus a vector permute (to transform the operations across the diagonal of your 4x4 matrix into operations down columns) and the usual bypass network to reduce latency, I think it would rarely be worth the transistor budget to create dedicated Blake3 (or Blake2s/b) instructions.
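To make the column/diagonal structure concrete, here's a minimal scalar sketch (in Rust) of BLAKE3's G function and one round, using the first round's message ordering; the real compression function permutes the message words between rounds and handles the chaining value, counter, and flags around this:

    // One quarter-round. Each call mixes one column (or diagonal) of the 4x4
    // state `v`, plus two message words.
    fn g(v: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize, mx: u32, my: u32) {
        v[a] = v[a].wrapping_add(v[b]).wrapping_add(mx);
        v[d] = (v[d] ^ v[a]).rotate_right(16);
        v[c] = v[c].wrapping_add(v[d]);
        v[b] = (v[b] ^ v[c]).rotate_right(12);
        v[a] = v[a].wrapping_add(v[b]).wrapping_add(my);
        v[d] = (v[d] ^ v[a]).rotate_right(8);
        v[c] = v[c].wrapping_add(v[d]);
        v[b] = (v[b] ^ v[c]).rotate_right(7);
    }

    fn round(v: &mut [u32; 16], m: &[u32; 16]) {
        // Column step: the same mixing runs on all four columns, so with the
        // state held as four row vectors these are four identical vector ops.
        g(v, 0, 4, 8, 12, m[0], m[1]);
        g(v, 1, 5, 9, 13, m[2], m[3]);
        g(v, 2, 6, 10, 14, m[4], m[5]);
        g(v, 3, 7, 11, 15, m[6], m[7]);
        // Diagonal step: a vector permute (rotate rows 1/2/3 left by 1/2/3
        // lanes) turns these diagonals into columns, then the column code
        // runs again and the permute is undone.
        g(v, 0, 5, 10, 15, m[8], m[9]);
        g(v, 1, 6, 11, 12, m[10], m[11]);
        g(v, 2, 7, 8, 13, m[12], m[13]);
        g(v, 3, 4, 9, 14, m[14], m[15]);
    }

Everything here is 32-bit adds, XORs, and rotates, which is exactly what ordinary 4x32 vector units already provide.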
In contrast, SHA-3's state is conceptually five vectors each of five 64-bit words (a 5x5 grid of lanes), which doesn't map as neatly onto most vector ISAs. As I remember, its round mixes columns and rows of that grid (plus rotating each lane by a different amount), rather than the column and diagonal operations that parallelize so well on 4-lane vector hardware.
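For comparison, here's a minimal sketch (again Rust, with a state[x][y] indexing convention assumed) of just Keccak's theta step, where the five-lane column structure shows up; rho, pi, chi, and iota are left out:

    // Theta: XOR each lane with a parity value computed from two of the five
    // columns. This is the column-mixing part of one Keccak-f[1600] round.
    fn theta(state: &mut [[u64; 5]; 5]) {
        // Parity of each of the five columns.
        let mut c = [0u64; 5];
        for x in 0..5 {
            c[x] = state[x][0] ^ state[x][1] ^ state[x][2] ^ state[x][3] ^ state[x][4];
        }
        // Each column's correction combines its two neighboring columns.
        let mut d = [0u64; 5];
        for x in 0..5 {
            d[x] = c[(x + 4) % 5] ^ c[(x + 1) % 5].rotate_left(1);
        }
        // Apply the correction to every lane of the column.
        for x in 0..5 {
            for y in 0..5 {
                state[x][y] ^= d[x];
            }
        }
    }

Five 64-bit lanes per row or column is an awkward fit for 4-lane (or even 8-lane) vector registers, which is the mismatch described above.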
SHA-2 is a Merkle-Damgard construction whose compression function is a Davies-Meyer construction, with a highly unbalanced Feistel cipher as the internal block cipher. Conceptually, you have a queue of 8 words (32-bit or 64-bit, depending on the variant). Each round pops the first word from the queue, combines it in a nonlinear way with 6 of the other words, adds one word from the "key schedule" derived from the message (plus a round constant), and pushes the result onto the back of the queue. The one word that wasn't otherwise used is increased by part of that same mixture: the popped word, the key-schedule word, the round constant, and a non-linear function of 3 of the other words. As you might imagine, this doesn't map very well onto general-purpose vector instructions.

This cipher is wrapped in the Davies-Meyer step: save a copy of the state, encrypt it using the next block of the message as the key, and then add the saved copy to the encrypted result. That makes the compression function non-invertible, which makes meet-in-the-middle attacks much more difficult. The key schedule uses a variation on a lagged Fibonacci generator to expand each message block into a larger number of round keys.
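To make that concrete, here's a minimal Rust sketch of one SHA-256 round and the Davies-Meyer feedforward; the round constants K[t] and the expanded message words W[t] are taken as inputs, and the message-schedule expansion itself is left out:

    // One SHA-256 round. The 8-word state is the "queue": a new word goes in
    // at one end, everything else shifts by one, and d picks up an extra term.
    fn sha256_round(state: &mut [u32; 8], k_t: u32, w_t: u32) {
        let [a, b, c, d, e, f, g, h] = *state;
        // Mix the outgoing word h nonlinearly with e, f, g, the round
        // constant, and the message-derived word.
        let ch = (e & f) ^ (!e & g);
        let s1 = e.rotate_right(6) ^ e.rotate_right(11) ^ e.rotate_right(25);
        let t1 = h.wrapping_add(s1).wrapping_add(ch).wrapping_add(k_t).wrapping_add(w_t);
        // A second mixture of a, b, c.
        let maj = (a & b) ^ (a & c) ^ (b & c);
        let s0 = a.rotate_right(2) ^ a.rotate_right(13) ^ a.rotate_right(22);
        let t2 = s0.wrapping_add(maj);
        // Shift the queue: the otherwise-unused word d absorbs t1, and the new
        // word t1 + t2 is pushed in.
        *state = [t1.wrapping_add(t2), a, b, c, d.wrapping_add(t1), e, f, g];
    }

    // Davies-Meyer: save the chaining value, run the 64 rounds keyed by the
    // message block, then add the saved copy back in to make it one-way.
    fn compress(chaining: &mut [u32; 8], k: &[u32; 64], w: &[u32; 64]) {
        let saved = *chaining;
        for t in 0..64 {
            sha256_round(chaining, k[t], w[t]);
        }
        for i in 0..8 {
            chaining[i] = chaining[i].wrapping_add(saved[i]);
        }
    }

None of this maps onto the lanes of a wide vector the way the BLAKE column step does, which is part of why x86 and Arm both ended up adding dedicated SHA-256 instructions.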
> Blake3 was designed from the ground up to be highly optimized for vector instructions operating on four vectors, each of 4 32-bit words.
This is true, and the BLAKE family inherits this structure from ChaCha, but there's also more to it than that. If you have enough input to fill many blocks, you can run multiple blocks in parallel. In this situation, rather than dividing up the 16 words of a block into four vectors, you put each word in a different vector, and the words of each vector represent the same position in different blocks. (I.e., rather than representing columns or rows, the vectors point "out of the page".) There are several benefits to this arrangement (there's a code sketch of the layout after this list):
1. You don't need to do that diagonalization operation anymore.
2. If your CPU supports "instruction-level parallelism" for vector operations, working across the four words/vectors in a row gets to take advantage of that.
3. Best of all, you're no longer limited to 4-word vectors. If you have enough input to fill 8 blocks (AVX2) or 16 blocks (AVX-512), you can take full advantage of those much wider vector registers.
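To make that layout concrete, here's a scalar stand-in (Rust) where each "vector" holds the same word position from LANES independent blocks; 8 lanes would correspond to AVX2 and 16 to AVX-512, and every operation below is one wide SIMD instruction in the real thing:

    // Each of the 16 state words becomes a whole vector; lane i belongs to
    // block i. LANES = 8 is an arbitrary choice for the sketch.
    const LANES: usize = 8;
    type Lanes = [u32; LANES];

    fn add(a: Lanes, b: Lanes) -> Lanes {
        core::array::from_fn(|i| a[i].wrapping_add(b[i]))
    }

    fn xor_ror(a: Lanes, b: Lanes, r: u32) -> Lanes {
        core::array::from_fn(|i| (a[i] ^ b[i]).rotate_right(r))
    }

    // The same G mixing as before, but columns and diagonals are now just
    // different choices of which four vectors to pass in, so no
    // diagonalization shuffles are needed at all.
    fn g_wide(a: &mut Lanes, b: &mut Lanes, c: &mut Lanes, d: &mut Lanes, mx: Lanes, my: Lanes) {
        *a = add(add(*a, *b), mx);
        *d = xor_ror(*d, *a, 16);
        *c = add(*c, *d);
        *b = xor_ror(*b, *c, 12);
        *a = add(add(*a, *b), my);
        *d = xor_ror(*d, *a, 8);
        *c = add(*c, *d);
        *b = xor_ror(*b, *c, 7);
    }

One cost is transposing the message words from the blocks into this layout when you load them, but that's amortized over all the rounds.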
This is all easy to take advantage of in a stream cipher like ChaCha, because each block is independent. With a hash function, things are more complicated, because you usually have data dependencies between different blocks. That's why the tree structure of BLAKE3 (or, somewhat similarly, KangarooTwelve) is so important for performance. It's not just about multithreading; it's also about SIMD. See section 5.3 of the BLAKE3 paper for more on this.
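As a rough illustration of that (with placeholder leaf/parent functions rather than BLAKE3's real keying, counters, flags, or exact tree-shape rules), the point is that step 1 below has no data dependencies between chunks, so many chunk states can be advanced in lockstep with wide vectors, or farmed out to threads:

    // Hypothetical stand-ins for a chunk hash and a parent-node hash; the real
    // BLAKE3 functions take keys, counters, and domain-separation flags.
    const CHUNK_LEN: usize = 1024;

    fn leaf(chunk: &[u8]) -> u64 {
        // Placeholder (FNV-1a), only here to make the sketch self-contained.
        chunk.iter().fold(0xcbf29ce484222325u64, |h, &b| (h ^ b as u64).wrapping_mul(0x100000001b3))
    }

    fn parent(left: u64, right: u64) -> u64 {
        let mut buf = [0u8; 16];
        buf[..8].copy_from_slice(&left.to_le_bytes());
        buf[8..].copy_from_slice(&right.to_le_bytes());
        leaf(&buf)
    }

    fn tree_hash(input: &[u8]) -> u64 {
        // Step 1: hash every 1 KiB chunk independently -- this is the part
        // that vectorizes (and multithreads) because the chunks don't depend
        // on each other, unlike blocks in a sequential Merkle-Damgard chain.
        let mut nodes: Vec<u64> = input.chunks(CHUNK_LEN).map(leaf).collect();
        if nodes.is_empty() {
            nodes.push(leaf(&[]));
        }
        // Step 2: merge pairs of nodes level by level until one root remains.
        while nodes.len() > 1 {
            nodes = nodes
                .chunks(2)
                .map(|p| if p.len() == 2 { parent(p[0], p[1]) } else { p[0] })
                .collect();
        }
        nodes[0]
    }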