Are there any projects that surgically augment the memory APIs to be more cooperative? Sure, `malloc` implementations make various tradeoffs, but what if the malloc/free APIs were expanded to expose more programmer intent, cooperative defrag, etc.?
- Perhaps `malloc` could ask for an intended lifecycle (think GC generation) that could swap arenas or other algo internals. This opens us up to meta-allocators.
- Perhaps program lifecycle hooks that give allocators more context into how already-allocated memory intends to be accessed. ("Done bootstrapping the server, so now new memory will be accessed by single threads and only live for one request", etc)
- Perhaps the programmer could periodically call `remalloc` to allow objects to be moved or promoted between arenas/generations.
- Perhaps mallocs/frees could be more intelligently batched (where explicit structs/arrays don't normally make sense) for surprise-free defrag or OS allocation.
- Intelligent pairing of malloc/free with explicit notions of intended re-use; some indication of "I will free this and ask for the same memory again in the same call", or auto-struct-ing mallocs which know that we tend to stack allocations: malloc(foo); malloc(bar); free(bar); free(foo). The goal is to avoid any actual OS/memory involvement whatsoever.
- Perhaps `identmalloc` could handle common flyweight patterns of instantiate-unless-exists; e.g. intelligently turn shared state into offsets into a contiguous arena. This may be more useful in C++ with existing conventions around copy-by-value constructs. (A hypothetical header sketching these APIs follows this list.)
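To make these concrete, here's a hypothetical header sketching what such an API surface might look like. None of these functions exist in any real allocator; `malloc_hinted`, `malloc_phase`, and the hint names are made up for illustration (`remalloc` and `identmalloc` are the names proposed above):

```c
/* Hypothetical API sketch -- nothing here exists in a real allocator. */
#include <stddef.h>

/* Intended-lifetime hint, loosely analogous to a GC generation. */
typedef enum {
    LIFE_EPHEMERAL,   /* dies within the current request/frame */
    LIFE_SESSION,     /* lives for one connection or task */
    LIFE_PERMANENT    /* lives until process exit */
} life_hint;

void *malloc_hinted(size_t size, life_hint hint);

/* Program-phase hook: "allocations from here on are single-threaded
 * and request-scoped", etc. */
void malloc_phase(const char *phase_name);

/* Give the allocator permission to move/promote an object; returns the
 * (possibly new) address, like realloc but without a size change. */
void *remalloc(void *ptr);

/* Flyweight: return the canonical object for `key`, allocating it in a
 * shared contiguous arena only if it doesn't already exist. */
void *identmalloc(const void *key, size_t key_len, size_t obj_size);
```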
Some of these are indeed components of existing mallocs/allocators (e.g. libpas is type-aware), but my point is that allocators have to jump through hoops to derive heuristics, and these heuristics often amount to "the programmer often intends to use memory like X." Performance engineers then pattern-match for X rather than cooperating with the compiler to expose X intrinsically from the beginning. The above ideas impose slightly different programming paradigms, so there's still the white whale of a perfect drop-in heuristic, but it seems less subject to temporary or local maxima.
So: Are there compelling (albeit slightly less general-purpose) alternatives where the allocator and programmer communicate additional hints or other API contracts that allow the allocator to move away from heuristics and into proven patterns? Or is this basically akin to "use C++/smart pointers or just get a GC"?
I've thought about this a lot. I think that overall, malloc/free/new/delete are already so hard to use that if you added more stuff, it would create too much cognitive load for the programmer. The result would be that the hints the programmer gave you would be more wrong than the malloc's best guess.
Let me go through these point by point and offer some thoughts.
- Perhaps `malloc` could ask for an intended lifecycle (think GC generation) that could swap arenas or other algo internals. This opens us up to meta-allocators.
It could, and if that information was accurate, then it would be profitable. But if that information was inaccurate, then you'd get worse perf and worse memory usage than if you let malloc just do whatever it wants. Also, anything like this risks more fragmentation, since it gives the programmer a new way to force the allocator to pick something other than the first fit.
- Perhaps program lifecycle hooks that give allocators more context into how already-allocated memory intends to be accessed. ("Done bootstrapping the server, so now new memory will be accessed by single threads and only live for one request", etc)
If you convey this information after you've already allocated the object, then there isn't much the malloc can do. If the malloc uses this information to decide where the next malloc call puts memory, then you've created a fragmentation risk.
- Perhaps the programmer could periodically call `remalloc` to allow objects to be moved or promoted between arenas/generations.
Yes, they could, and this is a great idea. I believe that calling realloc (or just mallocing a new object, memcpying or copy-constructing, and freeing the old one) relieves fragmentation because it gives the malloc an opportunity to put the object in a more profitable location.
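A minimal sketch of that pattern, using only standard C (the helper name `remalloc_by_copy` is made up for illustration):

```c
#include <stdlib.h>
#include <string.h>

/* Hand-rolled "remalloc": give the allocator a chance to place a
 * long-lived object somewhere denser. It may hand back an equally bad
 * spot -- the point is that it gets the option. */
void *remalloc_by_copy(void *old, size_t size)
{
    void *fresh = malloc(size);
    if (!fresh)
        return old;            /* allocation failed: keep the original */
    memcpy(fresh, old, size);
    free(old);
    return fresh;
}
```

The catch, and the reason a malloc can't do this unilaterally, is that the caller has to know about and update every pointer to the object.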
- Perhaps mallocs/frees could be more intelligently batched (where explicit structs/arrays don't normally make sense) for surprise-free defrag or OS allocation.
libpas batches allocation and deallocation. These strategies help performance. They have only a small effect on memory usage (and that effect is that you use a bit more memory by batching).
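For illustration, caller-side batching could look like the sketch below (a made-up, non-thread-safe helper; a real version would also need an explicit flush at shutdown). Allocators like libpas get the same effect internally via per-thread caches, which is why doing it by hand rarely pays:

```c
#include <stdlib.h>

/* Deferring frees touches the allocator's bookkeeping once per batch
 * instead of once per object. */
#define BATCH 128

static void  *pending[BATCH];
static size_t npending;

void free_batched(void *p)
{
    pending[npending++] = p;
    if (npending == BATCH) {
        for (size_t i = 0; i < BATCH; i++)
            free(pending[i]);
        npending = 0;
    }
}
```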
- Intelligent pairing of malloc/free with explicit notions of intended re-use; some indication of "I will free this and ask for the same memory again in the same call", or auto-struct-ing mallocs which know that we tend to stack allocations: malloc(foo); malloc(bar); free(bar); free(foo). The goal is to avoid any actual OS/memory involvement whatsoever.
Allocators like libpas (and mimalloc and many others) have very cheap malloc/free calls that perform great in these situations. They don't do any locking or call into the OS in the common case when you do this.
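A heavily simplified sketch of why that pattern is cheap (real allocators like libpas and mimalloc are far more elaborate; this pretends there is a single size class): the common case is a per-thread free list, so neither lock nor syscall is touched.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy fast path: free() pushes the block itself onto a thread-local
 * list; the next malloc of the same size pops it back off. */
typedef struct freenode { struct freenode *next; } freenode;

static _Thread_local freenode *tls_free_list;  /* one size class shown */

void *fast_malloc(size_t size)
{
    freenode *n = tls_free_list;
    if (n) {                      /* fast path: no lock, no OS call */
        tls_free_list = n->next;
        return n;
    }
    return malloc(size);          /* slow path: defer to real malloc */
}

void fast_free(void *p)
{
    freenode *n = p;              /* the freed block stores the link */
    n->next = tls_free_list;
    tls_free_list = n;
}
```

In the malloc(foo); malloc(bar); free(bar); free(foo) sequence, every call after warm-up hits the thread-local list.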
- Perhaps `identmalloc` could handle common flyweight patterns of instantiate-unless-exists; e.g. intelligently turn shared state into offsets into a contiguous arena. This may be more useful in C++ with existing conventions around copy-by-value constructs.
I think that most fast mallocs like libpas (and others) make this situation cheap enough that you don't need anything else.
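For instance, string interning by hand (a sketch; the hash table is deliberately crude): with a fast malloc, the lookup dominates the cost, which is why a dedicated `identmalloc` wouldn't buy much.

```c
#include <stdlib.h>
#include <string.h>

/* Flyweight-by-hand: equal string contents share one allocation. */
#define NBUCKETS 1024

typedef struct intern_node {
    struct intern_node *next;
    char str[];                   /* flexible array member */
} intern_node;

static intern_node *buckets[NBUCKETS];

static size_t hash(const char *s)
{
    size_t h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

const char *intern(const char *s)
{
    size_t b = hash(s);
    for (intern_node *n = buckets[b]; n; n = n->next)
        if (strcmp(n->str, s) == 0)
            return n->str;        /* already exists: share it */
    intern_node *n = malloc(sizeof *n + strlen(s) + 1);
    if (!n)
        return NULL;
    strcpy(n->str, s);
    n->next = buckets[b];
    buckets[b] = n;
    return n->str;
}
```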
- Some of these are indeed components of existing mallocs/allocators (e.g. libpas is type-aware), but my point is that allocators have to jump through hoops to derive heuristics, and these heuristics often amount to "the programmer often intends to use memory like X." Performance engineers then pattern-match for X rather than cooperating with the compiler to expose X intrinsically from the beginning. The above ideas impose slightly different programming paradigms, so there's still the white whale of a perfect drop-in heuristic, but it seems less subject to temporary or local maxima.
If the program is simple enough that you understand the object lifetimes completely, or you have enough resources to throw at the problem that you can completely understand lifetimes even for a complex program, then the best path forward is to just not call malloc. Lay out your memory a priori or write a highly customized "allocator" (that may not even have an API that looks anything like malloc).
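The degenerate case of such a custom "allocator" is a bump arena (a minimal sketch; it assumes `base` is suitably aligned and that everything allocated from it dies together):

```c
#include <stddef.h>

/* Allocation is pointer arithmetic; "free" is resetting the whole
 * arena when the phase or request ends. */
typedef struct {
    char  *base;
    size_t used, cap;
} arena;

void *arena_alloc(arena *a, size_t size)
{
    size = (size + 15) & ~(size_t)15;       /* round up: 16-byte align */
    if (a->used + size > a->cap)
        return NULL;                        /* out of arena space */
    void *p = a->base + a->used;
    a->used += size;
    return p;
}

void arena_reset(arena *a) { a->used = 0; } /* free everything at once */
```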
If the program is too complex for such reasoning, then I believe there's no way for the programmer to tell the malloc anything it can't deduce with heuristics. The programmer can say some stuff, but when you have a large sophisticated heap, then the programmer will not be able to predict what it is about the heap that creates pain for the malloc. In most experiments where I've tried to do this kind of tuning, I end up with something that runs slower or uses more memory, and the investigation usually leads me to learn that my hypothesis about the malloc's pain point was totally wrong.
- So: Are there compelling (albeit slightly less general-purpose) alternatives where the allocator and programmer communicate additional hints or other API contracts that allow the allocator to move away from heuristics and into proven patterns? Or is this basically akin to "use C++/smart pointers or just get a GC"?
Maybe! I'm not proving a negative here. But everything I've tried in this area has been a negative result for me.
I've thought about this a lot. I think that overall, malloc/free/new/delete are already so hard to use that if you added more stuff, it would create too much cognitive load for the programmer. The result would be that the hints the programmer gave you would be more wrong than the malloc's best guess.
I think this is a generally applicable lesson. Programmers overestimate their ability to statically optimize the dynamic behavior of their code, especially relative to a runtime that can access and act on dynamic state.
This comes up in the database world as application developers desiring to "pin" certain tables in memory. Experience has shown this to be an anti-feature: As the schema and workload of a database system evolves, these pinning decisions can become obsolete, starving more relevant tables of buffer space. Dynamic decisions, even made by simple heuristics like LRU, are an empirically winning strategy.
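For concreteness, "simple heuristic like LRU" means something like the sketch below (an assumed illustration, not any particular database's buffer manager): every page competes on recency alone, with no pinning escape hatch.

```c
#include <stddef.h>

/* Intrusive doubly linked list ordered by recency of access. */
typedef struct page {
    struct page *prev, *next;
    int id;
} page;

typedef struct { page *head, *tail; } lru;

static void lru_unlink(lru *l, page *p)
{
    if (p->prev) p->prev->next = p->next; else l->head = p->next;
    if (p->next) p->next->prev = p->prev; else l->tail = p->prev;
}

/* Call on every access; p must already be in the list (initial
 * insertion of a new page is elided). */
void lru_touch(lru *l, page *p)
{
    lru_unlink(l, p);
    p->prev = NULL;
    p->next = l->head;
    if (l->head) l->head->prev = p; else l->tail = p;
    l->head = p;
}

page *lru_evict(lru *l)            /* victim = least recently used */
{
    page *victim = l->tail;
    if (victim)
        lru_unlink(l, victim);
    return victim;
}
```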
The failure of Itanium/EPIC and the "sufficiently smart compiler" to beat traditional OoO execution is another example from a different computing domain.
I'm more bullish than ever on computers assisting humans with memory allocation. I think we as humans should just write programs and assume an unboundedly intelligent compiler will study our code, determine which lifetimes, if any, are clear from usage (basically type-inferring ownership rather than type-checking it), do trial region allocation with dynamic profiling and adaptive optimization, and GC the rest with the best collector chosen by dynamic tuning.
Basically, just write assuming GC, then wherever a hint could possibly exist, throw computers at it. If anything, such hints should live outside the source code and have no semantic impact (i.e. the program works fine if you throw them out).