You're not really supposed to write AVX yourself - the compiler should be doing that for you. And it will, if you write your code in a SIMD-compatible way and turn on the right compiler flags.
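To make "SIMD-compatible" concrete, here's a minimal sketch (function name and flags are my choice, not from the thread): a saxpy loop with no aliasing and no cross-iteration dependencies. Built with e.g. `gcc -O3 -mavx2`, the autovectorizer turns the loop body into packed AVX multiply-adds.

```c
#include <stddef.h>

/* The `restrict` qualifiers promise no aliasing, which is what
 * lets the compiler vectorize without emitting runtime overlap
 * checks. Stride-1 access and an independent loop body do the rest. */
void saxpy(float *restrict y, const float *restrict x,
           float a, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```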
I do agree that it is the wrong level of abstraction. Explicitly stating the SIMD width leads to a compatibility nightmare. RISC-V vector instructions instead use an explicit vector length register, which pretty much entirely solves this problem.
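A rough scalar sketch of the RISC-V vector length idea (all names here are mine, modeling the `vsetvl` instruction, not real RVV intrinsics): instead of hard-coding a SIMD width, each pass through the loop asks the hardware how many elements it may process, so the same binary runs on implementations with any vector width.

```c
#include <stddef.h>

#define VLMAX 8  /* hypothetical hardware maximum, unknown to the program */

/* Models vsetvl: the program requests a length, the hardware clamps it. */
static size_t vsetvl(size_t requested) {
    return requested < VLMAX ? requested : VLMAX;
}

/* Strip-mined loop: each iteration handles `vl` elements, where `vl`
 * is whatever the hardware grants. No width appears in the source. */
void vec_add(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n;) {
        size_t vl = vsetvl(n - i);
        for (size_t j = 0; j < vl; j++)  /* stands in for one vector op */
            dst[i + j] = a[i + j] + b[i + j];
        i += vl;
    }
}
```

The tail case falls out for free: the last `vsetvl` simply returns fewer elements, so there's no scalar cleanup loop.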
> You're not really supposed to write AVX yourself - the compiler should be doing that for you. And it will, if you write your code in a SIMD-compatible way and turn on the right compiler flags.
Take it from experience: sure you can write high-level code that is SIMD-compatible. But the compiler is garbage at understanding the semantics and will write terrible SIMD code.
The best thing a current compiler can provide is probably replacing intrinsics with more conventional-looking things like GCC's vector extensions[1] and C++'s std::simd. Even then you'd need to do a little bit of union work for the cooler operations.
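For instance, a sketch using GCC's vector extensions (compiles under GCC and Clang; the type and function names are mine): arithmetic is element-wise with no intrinsics in sight, and the union is exactly the kind of escape hatch mentioned above.

```c
/* A 4-lane float type the compiler maps onto SSE/NEON registers. */
typedef float v4sf __attribute__((vector_size(16)));

/* The "union work": some operations fall outside the extension's
 * element-wise arithmetic, so you punch through to scalar lanes. */
typedef union {
    v4sf v;
    float f[4];
} v4;

v4sf scaled_add(v4sf a, v4sf b, float s) {
    v4sf sv = {s, s, s, s};   /* broadcast the scalar */
    return a + b * sv;        /* element-wise, compiles to packed ops */
}

float horizontal_sum(v4sf x) {
    v4 u = { .v = x };        /* lane extraction isn't a vector op */
    return u.f[0] + u.f[1] + u.f[2] + u.f[3];
}
```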
Compilers will always be terrible at vectorizing code because the required transformation is architectural. It would require the compiler to understand your code well enough to replace the scalar algorithms and data structures with new ones that are semantically equivalent in all contexts with all necessary invariants preserved (e.g. how memory is organized). The code transformation would be very non-local.
Compilers can't generally rewrite your scalar code as vector code for the same reason they can't rewrite your b-tree as a skip list.
hm, that seems optimistic for this use-case.
I heard from a compiler engineer that autovectorizing a sort (which is full of permutes/shuffles) is much harder and is likely to remain infeasible for quite some time.
GPUs have a crossbar that allows for high-speed lane-to-lane permutes and bpermutes, but it's still slow compared to butterfly shuffles.
I do believe compilers can eventually optimize any movement pattern into the right butterfly shuffles (not today in the general case - modern CUDA compilers are impressive, but this is a hard problem). Still, I'm convinced the programmer needs to be aware of the low-level difficulty of many-to-many data movements on a 16-wide AVX-512 register, or a 32-wide GPU block / warp / wavefront.
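A scalar model of what a butterfly shuffle buys you (a sketch, my names): log2(N) exchange stages, each swapping lanes whose indices differ in one bit. Structured patterns like a full lane reversal decompose into these cheap stages, which is exactly what an arbitrary many-to-many permutation does not do.

```c
#define LANES 8  /* stand-in for an 8-wide register */

/* One butterfly stage: every lane trades with the lane whose
 * index differs in exactly the bit selected by `mask`. */
static void butterfly_stage(int lanes[LANES], int mask) {
    int out[LANES];
    for (int i = 0; i < LANES; i++)
        out[i] = lanes[i ^ mask];
    for (int i = 0; i < LANES; i++)
        lanes[i] = out[i];
}

/* Composing the stages with masks 4, 2, 1 maps lane i to lane
 * (i XOR 7) == 7 - i, i.e. a full reversal in log2(8) = 3 steps. */
void reverse_lanes(int lanes[LANES]) {
    for (int mask = LANES / 2; mask >= 1; mask /= 2)
        butterfly_stage(lanes, mask);
}
```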
--------
EDIT: I'm like 90% sure some dude at Bell Labs in the 1950s working on Clos network or Benes network design probably has an efficient representation for many-to-many data shuffles on a parallel architecture. But I'm not PhD enough to have read all those papers or keep up with those old designs.
Many-to-many data movement is traditionally a networking and routing problem. But SIMD programmers and SIMD chip designers are starting to run up against this problem... because a ton of parallel programming is about efficient movement of data between conceptual lanes and/or threads.