Apple was running a model double the size of the available memory. I'm not sure whether that was a sweet spot they found, or whether you could sacrifice response time to run even bigger models.
The paper is worth a read in full; what they are doing is pretty cool:
"Then, we introduce two complementary techniques to minimize data transfer and maximize flash memory throughput:
• Windowing: We load parameters for only the past few tokens, reusing activations from recently computed tokens. This sliding window approach reduces the number of IO requests to load weights.
• Row-column bundling: We store a concatenated row and column of the up-projection and down-projection layers to read bigger contiguous chunks from flash memory. This increases throughput by reading larger chunks."
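The row-column bundling idea can be sketched in a few lines: for feed-forward neuron i, the i-th row of the up-projection and the i-th column of the down-projection are always needed together, so storing them contiguously lets one read fetch both. This is a toy NumPy illustration with made-up shapes, not the paper's actual storage layout:

```python
import numpy as np

# Illustrative FFN shapes (not from the paper).
d_model, d_ff = 8, 32
W_up = np.random.randn(d_ff, d_model)    # rows indexed by neuron
W_down = np.random.randn(d_model, d_ff)  # columns indexed by neuron

# Bundle: one contiguous record of length 2*d_model per neuron, so a
# single sequential read from flash returns both halves.
bundled = np.concatenate([W_up, W_down.T], axis=1)  # shape (d_ff, 2*d_model)

def read_neuron(i):
    rec = bundled[i]                     # one contiguous read
    return rec[:d_model], rec[d_model:]  # (up-projection row, down-projection column)
```

On flash, larger sequential reads get much closer to peak throughput than many small random reads, which is why doubling the chunk size per request helps.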
Half the memory was just an example, not a sweet spot. With smaller window sizes you can use less memory, but at the cost of loading more from flash.
P.S.: The window size in the paper indicates how many tokens' feed-forward layers are kept in memory.
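A minimal sketch of that tradeoff (the class name and the `load_from_flash` callback are invented for illustration): the cache keeps only the feed-forward rows that were active for the last `window` tokens, loads from flash only rows not already resident, and evicts rows no token in the window still uses. A bigger window means more resident memory but fewer flash loads.

```python
from collections import deque

class SlidingWindowCache:
    """Toy version of the paper's windowing: cache FFN rows for the
    last `window` tokens; fetch from flash only what is missing."""

    def __init__(self, window, load_from_flash):
        self.window = window
        self.load_from_flash = load_from_flash  # row_id -> weights
        self.recent = deque()                   # per-token sets of active row ids
        self.cache = {}                         # row_id -> weights

    def rows_for_token(self, active_rows):
        # Load only rows not already resident; count the flash reads.
        loaded = 0
        for r in active_rows:
            if r not in self.cache:
                self.cache[r] = self.load_from_flash(r)
                loaded += 1
        self.recent.append(set(active_rows))
        # Evict rows unused by every token still inside the window.
        if len(self.recent) > self.window:
            self.recent.popleft()
            live = set().union(*self.recent)
            for r in list(self.cache):
                if r not in live:
                    del self.cache[r]
        return loaded
```

Since consecutive tokens activate heavily overlapping sets of neurons, most lookups hit the cache and the number of IO requests per token drops.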
I was thinking a couple of days ago about this concept of windowing for LLMs, but I lack the technical skills to implement it. Now Apple has just published a paper on it. This is what I call synchronicity.
They are targeting limited GPU memory and limited CPU-to-GPU memory transfer. I don't know how useful it would be on Macs, because MacBooks have unified memory and you don't necessarily need that transfer.
https://arxiv.org/pdf/2312.11514