This is good stuff. It will make writing assembly on Intel at least slightly less painful.
I am still amazed we don't have an x86 assembler that is anywhere near what DSP people use. I mean, we still have to track register usage manually and do manual instruction reordering to improve performance! The x86 world could learn a lot from the DSP world here (Texas Instruments tools are a good example).
Seriously, you're downmodding this? Write some C sometime, and watch the compiler output perfectly-optimized assembly for your architecture. You write a high-level solution to the problem, and the compiler makes it work efficiently.
If it doesn't, it's a compiler bug, and should be fixed at that level.
I write signal-processing code in C on x86. Sure, the compiler output is usually sufficient for my needs, but "perfectly-optimized assembly"? Not for the tight loops where it really matters. It's decent, and if you hold the compiler's hand it will vectorize some of the code, but the result is still heavy.
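To make "hold the compiler's hand" concrete, here is a minimal sketch in C of the kind of tight loop in question, loosely modelled on H.264-style weighted prediction. The function name `weight_pixels`, its parameters, and the simplifications (8-bit pixels, non-negative weight, `log2_denom >= 1`) are assumptions for illustration, not code from any real codec. The `restrict` qualifiers and the simple clamp are the sort of hints that tend to help an auto-vectorizer, though whether a given compiler actually vectorizes this well is something you would have to check in the generated assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an int to the 8-bit pixel range [0, 255]. */
static inline uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/*
 * Simplified weighted prediction over n pixels (hypothetical helper):
 *   dst[i] = clip(((src[i] * weight + round) >> log2_denom) + offset)
 * Assumes weight >= 0 and log2_denom >= 1, so the shifted value is never
 * negative. `restrict` promises the compiler that dst and src do not
 * overlap, which removes one common obstacle to vectorization.
 */
void weight_pixels(uint8_t *restrict dst, const uint8_t *restrict src,
                   size_t n, int weight, int offset, int log2_denom)
{
    assert(weight >= 0 && log2_denom >= 1);
    const int round = 1 << (log2_denom - 1);
    for (size_t i = 0; i < n; i++)
        dst[i] = clip_u8(((src[i] * weight + round) >> log2_denom) + offset);
}
```

Even with hints like these, the compiler has to widen each pixel to `int` for the multiply and narrow it again for the store, which is where hand-written SIMD usually still wins.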
It is obvious you don't know what you're talking about and have never seen tightly optimized signal-processing code (as in, say, H.264 weighted prediction or interpolation).