
Apple was running a model double the size of the available memory. Not sure if that was a sweet spot they found or if you could sacrifice response time to run even bigger models.

The paper is worth a read in full as what they are doing is pretty cool:

https://arxiv.org/pdf/2312.11514

Highlight from the paper...

"Then, we introduce two complementary techniques to minimize data transfer and maximize flash memory throughput:

• Windowing: We load parameters for only the past few tokens, reusing activations from recently computed tokens. This sliding window approach reduces the number of IO requests to load weights.

• Row-column bundling: We store a concatenated row and column of the up-projection and down-projection layers to read bigger contiguous chunks from flash memory. This increases throughput by reading larger chunks."
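To make the row-column bundling idea concrete, here is a toy NumPy sketch (my own illustration, not the paper's code): for each FFN neuron i, the i-th row of the up-projection and the i-th column of the down-projection are stored side by side, so fetching one neuron's weights becomes a single contiguous read. Dimensions and names are made up for the example.

```python
import numpy as np

d_model, d_ff = 8, 16  # toy sizes; real models are far larger

up = np.random.randn(d_ff, d_model).astype(np.float32)    # up-projection: row i belongs to neuron i
down = np.random.randn(d_model, d_ff).astype(np.float32)  # down-projection: column i belongs to neuron i

# Row-column bundling: concatenate row i of `up` with column i of `down`
# so that loading neuron i from flash is one contiguous read of 2*d_model
# floats instead of two scattered reads.
bundled = np.concatenate([up, down.T], axis=1)  # shape (d_ff, 2*d_model)

def load_neuron(i):
    chunk = bundled[i]  # one contiguous read
    return chunk[:d_model], chunk[d_model:]  # (up row, down column)

u, d = load_neuron(3)
assert np.allclose(u, up[3]) and np.allclose(d, down[:, 3])
```

Larger contiguous reads matter because flash throughput is much better for big sequential chunks than for many small random reads.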



Half the memory was just an example, not a sweet spot. With smaller window sizes you can use less memory, but at the cost of loading more from flash.

P.S.: The window size in the paper indicates how many tokens' feed-forward layers are kept in memory.
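A rough simulation of that tradeoff (my own simplification of the windowing idea, not the paper's code): keep in memory only the FFN neurons that were active within the last `window` tokens, evict older ones, and count how many flash reads each window size would incur. The active sets here are invented toy data.

```python
def simulate(active_per_token, window):
    """Count flash loads for a sliding-window neuron cache (toy model)."""
    in_memory = {}      # neuron id -> last token index at which it was active
    flash_loads = 0
    for t, active in enumerate(active_per_token):
        for n in active:
            if n not in in_memory:
                flash_loads += 1  # neuron not cached: would need a flash read
            in_memory[n] = t
        # evict neurons not used within the last `window` tokens
        in_memory = {n: last for n, last in in_memory.items() if t - last < window}
    return flash_loads

# Active neuron sets overlap between consecutive tokens, so a larger
# window re-reads less from flash at the price of more resident memory.
tokens = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {1, 2, 3}]
print(simulate(tokens, window=1), simulate(tokens, window=4))  # prints: 7 5
```

The exact numbers depend entirely on how much the active sets overlap, but the direction of the tradeoff is the point: shrinking the window trades memory for extra flash traffic.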


I was thinking a couple of days ago about this concept of windowing for LLMs, but I lack the technical skills to implement it. Now Apple has just published a paper on it. This is what I call synchronicity.



By the time we put our paper on arXiv, this paper was not out yet, so we were not aware of it, but it is similar in some ways. Both this paper and ours build on our previous paper https://arxiv.org/abs/2310.04564 and Deja Vu https://arxiv.org/abs/2310.17157.

They are targeting limited GPU memory and limited CPU-to-GPU memory transfer. I'm not sure how useful it would be on Macs, since MacBooks have unified memory and you don't necessarily need to do that transfer.



