I am not sure if it is a good idea to compile code targeted to modern processors...

mysterymath · on Sept 21, 2021

Most of the suboptimality in the article isn't due to the issues you've raised, but rather due to us having only just started optimizing LLVM-MOS.

First, I have utterly no idea why there are so many calls to memset; it looks like it's unrolling a loop or something... poorly. It also doesn't seem to be reusing registers when setting up the calls; that's also bad and should be fixed.

Second, if you take a look at the actual structure of the prologue and epilogue, you might notice that it's copying zero page to an absolute memory region called __clear_screen_sstk. This is because LLVM-MOS ran a whole-program analysis on the program and proved that at most one activation of that function could occur at any given time. Thus, its "stack frame" was automatically allocated statically as a global array, not relative to a moving stack pointer.
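To make the static allocation idea concrete, here's a rough hand-written C sketch of the before and after. This is only an illustration of the concept, not what LLVM-MOS actually emits: the names screen, SCREEN_SIZE, clear_screen_dynamic, clear_screen_static and clear_screen_frame are all made up for the example, and a real compiler does this transformation on its own internal representation rather than on source.

```c
#include <assert.h>
#include <stdint.h>

#define SCREEN_SIZE 1000u /* hypothetical 40x25 text screen */

uint8_t screen[SCREEN_SIZE]; /* stand-in for screen memory */

/* Conventional form: `fill` and `i` live in a frame that is
 * addressed relative to a moving stack pointer. */
void clear_screen_dynamic(uint8_t fill) {
    uint16_t i;
    for (i = 0; i < SCREEN_SIZE; i++) {
        screen[i] = fill;
    }
}

/* Hand-written equivalent of a statically allocated frame.
 * This is only valid if at most one activation of the function
 * can exist at a time (no recursion, no reentry from interrupts),
 * which is the property the whole-program analysis has to prove. */
static struct {
    uint16_t i;
    uint8_t fill;
} clear_screen_frame; /* hypothetical analogue of __clear_screen_sstk */

void clear_screen_static(uint8_t fill) {
    clear_screen_frame.fill = fill;
    for (clear_screen_frame.i = 0;
         clear_screen_frame.i < SCREEN_SIZE;
         clear_screen_frame.i++) {
        assert(clear_screen_frame.i < SCREEN_SIZE); /* stays in bounds */
        screen[clear_screen_frame.i] = clear_screen_frame.fill;
    }
}
```

On a 6502, the second form matters because every access to the frame becomes a plain absolute-address load or store, with no stack pointer arithmetic involved.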

The reason that the prologue and epilogue spend so much time copying in and out of the zero page is just that we haven't taught LLVM-MOS how to access the stack directly, but there's no technical obstacle to doing so. Once that's done, the whole body of the function would operate on __clear_screen_sstk directly, and the prologue and epilogue would disappear completely.

Of course, from the first point, you shouldn't need any stack locations to do the body of this routine; there's a big ball of yarn here, but pulling on any of a number of threads would unravel it.