This covers a lot of ground! The author is interested in concurrency, in particular coroutines. They follow a standard approach of PL research to extend a lambda calculus, discussing typing and compilation. This will seem academic but there are certainly many PL researchers who have deep expertise in compiler design and implementation and who also deeply care about communicating these ideas in a precise, formalized manner.
To put this in perspective: even after decades of research, and industry and hobby projects that turned millions of professionals into users (before even thinking about a type system)... the question of how to expose concurrency to programmers is far from settled, important as it is. There are many approaches and different perspectives on what matters most.
Formalizing these approaches has likely been done before, but in practical terms we don't care about abstract properties but about performance.
That is why I like the article, implementation concerns are important, but the ability to fully specify behavior is also important. Keep it up.
> Each syntactic function (a lambda \x -> ...) gets a unique name
> Determine the stack size of its captures based on the largest captures of any lambda in the lambda set it's involved in.
If I understand these two points correctly, this would require a whole-program optimization and would mean that libraries need to be recompiled alongside the current program. For example, if there are libraries `foo` and `bar` to be used by a program `baz`, it seems that the names of functions in `foo`, `bar` and `baz` need to be distinct. But how do you ensure this if `foo` and `bar` are compiled separately? Similarly, I cannot compile `foo` without knowing about `baz`, because any higher-order function in `foo` needs to know what kind of lambda set it will have in `baz`. That would seem to imply a big increase in compile-times when working with a larger codebase.
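To make the concern concrete, here is a minimal sketch (in Python, with invented names like `AddN` and `apply`) of the defunctionalization idea the two quoted points describe: every syntactic lambda in the whole program becomes a tagged record of its captures, and all calls go through one global dispatch, which is exactly why the tags must be unique across `foo`, `bar`, and `baz`:

```python
# Hypothetical sketch of whole-program defunctionalization.
# Each lambda becomes a small record of its captures; one global
# dispatch function must know every lambda in the program.

from dataclasses import dataclass

@dataclass
class AddN:      # closure record for  \x -> x + n
    n: int

@dataclass
class MulN:      # closure record for  \x -> x * n
    n: int

def apply(f, x):
    # The single global dispatch: it enumerates every lambda tag,
    # so tags from separately compiled libraries must not collide.
    if isinstance(f, AddN):
        return x + f.n
    if isinstance(f, MulN):
        return x * f.n
    raise TypeError("unknown closure tag")

def twice(f, x):
    # A higher-order function compiled against the global dispatch.
    return apply(f, apply(f, x))

print(twice(AddN(3), 10))  # 16
print(twice(MulN(3), 10))  # 90
```

The records here are passed by value conceptually; the point is only that `apply` is closed over the whole program, so compiling `twice` in isolation is not straightforward.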
I don't know if this is well-known, but one thing I find helpful during compilation is to have the procedure that compiles an expression take a parameter indicating where the expression's result should be placed, rather than (in this case) always compiling to the stack and storing the result elsewhere later. This eliminates a lot of trivially reducible loads and stores.
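For readers unfamiliar with the idea, here is a minimal sketch (all names invented) of what such destination-driven compilation might look like for a toy expression language: each subexpression is emitted directly into its final destination, so no store-then-load pair ever appears.

```python
# Toy destination-driven compiler: compile_expr takes the place the
# result should end up in, instead of always pushing to a stack.
import itertools

_fresh = itertools.count()  # temp-register supply

def compile_expr(expr, dest, code):
    """Emit instructions that leave expr's value in `dest`."""
    if isinstance(expr, int):
        code.append(("const", dest, expr))   # result goes straight to dest
    else:
        op, lhs, rhs = expr
        t = f"t{next(_fresh)}"               # temp for the left operand only
        compile_expr(lhs, t, code)
        compile_expr(rhs, dest, code)        # right operand lands in dest
        code.append((op, dest, t, dest))     # dest := t <op> dest

code = []
compile_expr(("add", 1, ("mul", 2, 3)), "r0", code)
for ins in code:
    print(ins)
```

Compiling `1 + 2*3` this way produces five instructions with `r0` as the running destination; a stack-based scheme would emit extra moves to shuffle each intermediate result into place afterwards.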
Author here. Yes, the scheme requires whole-program compilation, which is unfortunate. If you’re okay with statically-linked dependencies, there remains only the large problem of making a compiler adept at incremental re-compilation. However, any monomorphizing compiler faces that challenge, so the problem is not unique.
Thanks for confirming, and that is a good point! However, I think that with monomorphizing compilers, programmers often write code in such a way as to avoid triggering a larger re-compilation. With monomorphization this might be easier to achieve, since you only need to control how your data structures are instantiated; with functions you would need to control how the functions are passed. For example, a use of a Haskell-style `lens` package would probably trigger a re-compilation of `lens -> kan-extensions -> profunctors -> comonad -> base`. Programmers can work around that, but it may place restrictions on what they can easily do with the language.
Yes - that’s a great observation. The scheme here means that you can end up with compilation dependencies that aren’t reflected in your explicit dependency graph, exactly e.g. via the path you describe.
The same is true of monomorphization in presence of typeclasses (or traits, concepts, etc). I have some ideas about how to do such incremental compilations optimally, but they haven’t been written down.
Anyway, I totally agree with you. I don’t mean to suggest this is the best way to do things - if anything Rust/C++/etc have taught us excessive monomorphization is probably not the way to go for developer experience reasons. You may be able to imagine some interesting derivative of the scheme presented here with something like Swift’s “witness table”-based compilation of protocols, which may be much more compile-time performant, and support separate compilation. But, I don’t even have a sketch of that. This is only one technique and the design space is very wide.
Another question: In the section on eliminating heap-allocated captures, you discuss creating a datatype for captures that is passed by-value. But what if that datatype is recursive? For example, how would you compile a CPS-transformed `reverse`:
    fun reverse(xs : list<a>, k : list<a> -> list<a>)
      match xs
        Cons(x, xx) -> reverse(xx, \ys -> k(Cons(x, ys)))
        Nil -> k(Nil)
Here, the `k` that is recursively passed depends on the previous `k`. As such, `k` grows as large as the input list and cannot be stack-allocated?
Yes - if the closure representation is recursive, it must be boxed. I believe your example is also the one given by the original author (William Brandon et al.). I don’t think there is any way around this, because at that point it is the same as any other recursive data type.
Although, if you know at compile time the size of the list, you could imagine compiling the recursive data type “unrolled” into a statically-sized array - but now you need dependent types.
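To see why the boxing is forced, here is a sketch (invented names, in Python) of the defunctionalized continuation for the CPS `reverse` above: each recursive call wraps the previous continuation in a new `ConsK` frame, so the continuation is itself a list-shaped, heap-allocated structure whose size tracks the input. (Note that, as in the snippet above, this continuation ends up rebuilding the list in its original order; the point is only its size.)

```python
# Defunctionalized continuation for the CPS reverse: the closure
# \ys -> k(Cons(x, ys)) becomes a record holding x and the previous k.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ConsK:                  # frame for  \ys -> k(Cons(x, ys))
    x: object
    k: "Optional[ConsK]"      # the previous continuation -- recursion!

def apply_k(k, ys):
    while k is not None:      # unwind the accumulated frames
        ys = [k.x] + ys       # Cons(x, ys)
        k = k.k
    return ys

def reverse_cps(xs, k=None):
    for x in xs:              # mirrors the recursive Cons case
        k = ConsK(x, k)       # captures grow with the input list
    return apply_k(k, [])     # Nil case: invoke the continuation

print(reverse_cps([1, 2, 3]))  # [1, 2, 3]
```

Because `ConsK` refers to another `ConsK`, its size is not statically bounded, which is exactly the condition under which the capture record must be boxed.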
Ah thanks! I didn't find that paper, is it online somewhere?
You may also be interested in the stack allocation system in OCaml: https://www.youtube.com/watch?v=yGRn5ZIbEW8 In particular, Stephen gives a nice example in the end how it can avoid some allocations of lambdas (but it also would need to box the closure in the CPS-reverse).
You may also be interested in
Jeffrey M. Bell, Françoise Bellegarde, and James Hook. “Type-Driven Defunctionalization.” ICFP ’97. Association for Computing Machinery, New York, NY, USA, 25–37, 1997. doi:10.1145/258948.258953.
and
Andrew Tolmach and Dino P. Oliva. “From ML to Ada: Strongly-Typed Language Interoperability via Source Translation.” Journal of Functional Programming 8 (4): 367–412. Cambridge University Press, 1998.
who thought about whole-program-defunctionalization before.
The manuscript is currently unpublished so it’s not available ):
Thanks for the links - I’ve seen them, and it’s interesting that defunctionalization is usually presented as an optimization. The “Type-driven …” paper is the earliest work I’m aware of that does it over surface types, but it doesn’t “go all the way”.
I’d love to chat more - a link to my email can be found around the post, including at the bottom.
As someone who lives in embedded land, but wishes we had better concurrency and approaches (as embedded gains more and more cores and performance while still remaining low power, and sequencing tasks and race-to-idle becomes even more important), whole-program compilation is fine for me haha. We already have to do that for the most part anyway. Awesome ideas, I'm definitely digging in deeper this evening! Heap-less directly called closures would be amazing.
>If I understand these two points correctly, this would require a whole-program optimization and would mean that libraries need to be recompiled alongside the current program. For example, if there are libraries `foo` and `bar` to be used by a program `baz`, it seems that the names of functions in `foo`, `bar` and `baz` need to be distinct. But how do you ensure this if `foo` and `bar` are compiled separately?
Couldn't you just prefix or namespace them behind the scenes when compiling each of them, e.g. with a hash of the function's code (or of the library it's in), or some similar way to get a unique id/hash/uuid per unit of code (library, file, etc.) that needs distinct function names across the whole program?
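A minimal sketch of that suggestion (function and table names invented): derive a stable symbol per lambda from its library name and source text, and keep a side table mapping mangled names back to the originals for diagnostics.

```python
# Hash-based name mangling: two same-named functions from different
# libraries get distinct global symbols, deterministically.
import hashlib

debug_names = {}  # mangled symbol -> (library, original name)

def mangle(library, original_name, source_text):
    digest = hashlib.sha256(f"{library}:{source_text}".encode()).hexdigest()[:12]
    return f"{original_name}__{digest}"

def register(library, original_name, source_text):
    sym = mangle(library, original_name, source_text)
    debug_names[sym] = (library, original_name)  # for error messages
    return sym

a = register("foo", "lambda_1", r"\x -> x + 1")
b = register("bar", "lambda_1", r"\x -> x + 1")  # same name, other library
print(a != b)          # True: no clash despite separate compilation
print(debug_names[a])  # ('foo', 'lambda_1')
```

This solves the naming half of the problem; it doesn't by itself address the other half discussed above, namely that a higher-order function's lambda set still depends on its callers.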
You could also keep the association with the original name to show when you print error messages or debug or whatever.
Interesting presentation, thanks. That problem with compilation of conditional statements makes me suspect that older languages didn't have first-class Boolean values but restricted Boolean expressions to only appear as tests for conditional/looping constructs exactly because of it.
> For example, if there are libraries `foo` and `bar` to be used by a program `baz`, it seems that the names of functions in `foo`, `bar` and `baz` need to be distinct. But how do you ensure this if `foo` and `bar` are compiled separately?
There’s an interesting question embedded here about the relative effectiveness of modern CPU branch predictors vs. (unconditional) branch target predictors.
Total speculation mode, but I’d guess that if you have >1 static conditions (on runtime values) that determine the target, you’re indeed better off encoding those as explicit branches. Both predictors are fundamentally pattern matchers, and it should be easier to identify patterns in independent conditions separately rather than in combination.
Unfortunately the link crashes my browser (Chrome/Chromium, but a little out of date).
I'd be very interested to read about this, not least as I'm extending the untyped LC with a notion of a 'tag' term, and a branch on tag operation, thus resulting in dynamic dispatch without (formal) types.
Any pointers to a suitable graph reduction engine gratefully received, I've built my own naive implementation (in Javascript!) but of course it is rather slow.
ok. so normally we would put these on the heap and garbage collect them. instead, we're going to run on a stack, which removes the gc overhead, but couples our ability to reclaim frames to the control flow.
I skimmed the article, but I didn't understand the goal here. Clearly this would work better for _some_ workloads.
> couples our ability to reclaim frames to the control flow.
I'm not sure what you mean here. Are you referring to the fact that we (probably*) have to be purely functional for this to work?
* Saying "probably" here because there may be some subsets of side-effects which work fine in this model, which I just don't feel like thinking through at the moment.
I don't know whether the OP's implementation passes arguments by copied value or not, but I will say that there's a fairly simple transformation that converts purely functional code into continuation-passing style (CPS): instead of returning, you take a continuation, and pass what you would normally return into the continuation, like this:
(def avg (a b) (/ (+ a b) 2))
...becomes:
(def avg (a b cont) (+ a b (lambda (s) (/ s 2 cont))))
Note that in the converted code + and / also don't return values, instead calling the continuations they're passed on their results.
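The same transform can be mirrored in Python, with the primitives likewise taking continuations (names here are just for illustration):

```python
# CPS versions of the primitives: they never return, they pass on.
def add(a, b, cont):
    cont(a + b)

def div(a, b, cont):
    cont(a / b)

# (def avg (a b cont) (+ a b (lambda (s) (/ s 2 cont))))
def avg(a, b, cont):
    add(a, b, lambda s: div(s, 2, cont))

result = []
avg(4, 8, result.append)  # the "final" continuation just records the value
print(result[0])          # 6.0
```

Every call here is a tail call, which is what makes the TCO-everything strategy described below applicable. (Python itself doesn't eliminate tail calls, so this is purely illustrative.)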
Pervasive CPS means that everything is a tail call, which means that you can tail call optimize (TCO) everything. The trivial implementation of TCO probably uses copy, but you can notably store values created in the current frame on the stack instead of the heap, and then pass by reference. Any unreferenced values can be removed, and the referenced values can be compacted within the frame to reduce stack space. However, in practice, it's faster to wait until you run out of stack space and then go back and compact everything in one go (this approach was pioneered by Chicken Scheme, I think). If you think through how this stack compaction might work, you'll quickly see that this is equivalent to a mark-sweep-compact garbage collector.
If your language is purely functional (Chicken Scheme isn't) then you have an additional property, which is that objects can only be referenced by objects higher on the stack. This means that if you start from the top of the stack, you can mark and sweep in a single pass, because when you arrive at a node, if it hasn't been marked by any of the objects higher in the stack, it's not referenced and can be swept. If your stack is stored in reverse order in memory (that is, the top of the stack is the earliest in memory) this mark/sweep pass is extremely cache friendly. You can also build up a backreference table during this pass, which means you only need one additional pass to compact. Again, the inverted stack is cache friendly, because even though you're passing over the stack in reverse order, backpatched references are likely to be local to the things they reference, which pulls objects into cache before they are referenced.
That's an absurdly long way of saying I don't know the answer to the question, but it's all to say that there are good reasons why an implementation of a heapless architecture might pass by reference rather than by copy.
Super cool - I have gone down the rabbit hole of trying to compile functional languages/lambda calculus with continuations a few times. I will be bookmarking and returning to this when I get the itch again!