You're not really supposed to write AVX yourself - the compiler should be doing that for you. And it will, if you write your code in a SIMD-compatible way and turn on the right compiler flags.
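To make "SIMD-compatible" concrete, here's a minimal sketch (function name and flags are my choice, not from the thread): a saxpy loop with no aliasing and no cross-iteration dependencies. Built with e.g. `gcc -O3 -mavx2`, the autovectorizer turns the loop body into packed AVX multiply-adds.

```c
#include <stddef.h>

/* The `restrict` qualifiers promise no aliasing, which is what
 * lets the compiler vectorize without emitting runtime overlap
 * checks. Stride-1 access and an independent loop body do the rest. */
void saxpy(float *restrict y, const float *restrict x,
           float a, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```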
I do agree that it is the wrong level of abstraction. Explicitly stating the SIMD width leads to a compatibility nightmare. RISC-V vector instructions instead use an explicit vector length register, which pretty much entirely solves this problem.
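A rough scalar sketch of the RISC-V vector length idea (all names here are mine, modeling the `vsetvl` instruction, not real RVV intrinsics): instead of hard-coding a SIMD width, each pass through the loop asks the hardware how many elements it may process, so the same binary runs on implementations with any vector width.

```c
#include <stddef.h>

#define VLMAX 8  /* hypothetical hardware maximum, unknown to the program */

/* Models vsetvl: the program requests a length, the hardware clamps it. */
static size_t vsetvl(size_t requested) {
    return requested < VLMAX ? requested : VLMAX;
}

/* Strip-mined loop: each iteration handles `vl` elements, where `vl`
 * is whatever the hardware grants. No width appears in the source. */
void vec_add(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n;) {
        size_t vl = vsetvl(n - i);
        for (size_t j = 0; j < vl; j++)  /* stands in for one vector op */
            dst[i + j] = a[i + j] + b[i + j];
        i += vl;
    }
}
```

The tail case falls out for free: the last `vsetvl` simply returns fewer elements, so there's no scalar cleanup loop.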
> You're not really supposed to write AVX yourself - the compiler should be doing that for you. And it will, if you write your code in a SIMD-compatible way and turn on the right compiler flags.
Take it from experience: sure you can write high-level code that is SIMD-compatible. But the compiler is garbage at understanding the semantics and will write terrible SIMD code.
The best thing a current compiler can provide is probably replacing intrinsics with more conventional-looking things like GCC's vector extensions[1] and C++'s std::simd. Even then you'd need to do a little bit of union work for the cooler operations.
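For instance, a sketch using GCC's vector extensions (compiles under GCC and Clang; the type and function names are mine): arithmetic is element-wise with no intrinsics in sight, and the union is exactly the kind of escape hatch mentioned above.

```c
/* A 4-lane float type the compiler maps onto SSE/NEON registers. */
typedef float v4sf __attribute__((vector_size(16)));

/* The "union work": some operations fall outside the extension's
 * element-wise arithmetic, so you punch through to scalar lanes. */
typedef union {
    v4sf v;
    float f[4];
} v4;

v4sf scaled_add(v4sf a, v4sf b, float s) {
    v4sf sv = {s, s, s, s};   /* broadcast the scalar */
    return a + b * sv;        /* element-wise, compiles to packed ops */
}

float horizontal_sum(v4sf x) {
    v4 u = { .v = x };        /* lane extraction isn't a vector op */
    return u.f[0] + u.f[1] + u.f[2] + u.f[3];
}
```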
Compilers will always be terrible at vectorizing code because the required transformation is architectural. It would require the compiler to understand your code well enough to replace the scalar algorithms and data structures with new ones that are semantically equivalent in all contexts with all necessary invariants preserved (e.g. how memory is organized). The code transformation would be very non-local.
Compilers can't generally rewrite your scalar code as vector code for the same reason they can't rewrite your b-tree as a skip list.
hm, that seems optimistic for this use-case.
I heard from a compiler engineer that autovectorizing a sort (which is full of permutes/shuffles) is much harder and is likely to remain infeasible for quite some time.
GPUs have a crossbar that allows for high-speed lane-to-lane permutes and bpermutes, but it's still slow compared to butterfly shuffles.
I do believe compilers can eventually optimize any movement pattern into the right butterfly shuffles (not today in the general case - modern CUDA compilers are impressive, but this is a hard problem). Still, I'm convinced the programmer needs to be aware of the low-level difficulty of many-to-many data movements on a 16-wide AVX-512 register, or a 32-wide GPU block / warp / wavefront.
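A scalar model of what a butterfly shuffle buys you (a sketch, my names): log2(N) exchange stages, each swapping lanes whose indices differ in one bit. Structured patterns like a full lane reversal decompose into these cheap stages, which is exactly what an arbitrary many-to-many permutation does not do.

```c
#define LANES 8  /* stand-in for an 8-wide register */

/* One butterfly stage: every lane trades with the lane whose
 * index differs in exactly the bit selected by `mask`. */
static void butterfly_stage(int lanes[LANES], int mask) {
    int out[LANES];
    for (int i = 0; i < LANES; i++)
        out[i] = lanes[i ^ mask];
    for (int i = 0; i < LANES; i++)
        lanes[i] = out[i];
}

/* Composing the stages with masks 4, 2, 1 maps lane i to lane
 * (i XOR 7) == 7 - i, i.e. a full reversal in log2(8) = 3 steps. */
void reverse_lanes(int lanes[LANES]) {
    for (int mask = LANES / 2; mask >= 1; mask /= 2)
        butterfly_stage(lanes, mask);
}
```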
--------
EDIT: I'm like 90% sure some dude at Bell Labs in the 1950s working on Clos network or Benes network design probably has an efficient representation for many-to-many data shuffles on a parallel architecture. But I'm not PhD enough to have read all those papers or keep up with those old designs.
Many-to-many data movement is traditionally a networking and routing problem. But SIMD programmers and SIMD chip designers are starting to run up against this problem... because a ton of parallel programming is about efficient movement of data between conceptual lanes and/or threads.