Apple was running a model double the size of the available memory. I'm not sure whether that was a sweet spot they found, or whether you could sacrifice response time to run even bigger models.
The paper is worth a read in full; what they are doing is pretty cool:
"Then, we introduce two complementary techniques to minimize data transfer and maximize flash memory throughput:
• Windowing: We load parameters for only the past few tokens, reusing activations from recently computed tokens. This sliding window approach reduces the number of IO requests to load weights.
• Row-column bundling: We store a concatenated row and column of the up-projection and down-projection layers to read bigger contiguous chunks from flash memory. This increases throughput by reading larger chunks."
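The row-column bundling idea can be sketched in a few lines: for feed-forward neuron i, the i-th row of the up-projection and the i-th column of the down-projection are always needed together, so storing them contiguously lets one read fetch both. This is a toy NumPy illustration with made-up shapes, not the paper's actual storage layout:

```python
import numpy as np

# Illustrative FFN shapes (not from the paper).
d_model, d_ff = 8, 32
W_up = np.random.randn(d_ff, d_model)    # rows indexed by neuron
W_down = np.random.randn(d_model, d_ff)  # columns indexed by neuron

# Bundle: one contiguous record of length 2*d_model per neuron, so a
# single sequential read from flash returns both halves.
bundled = np.concatenate([W_up, W_down.T], axis=1)  # shape (d_ff, 2*d_model)

def read_neuron(i):
    rec = bundled[i]                     # one contiguous read
    return rec[:d_model], rec[d_model:]  # (up-projection row, down-projection column)
```

On flash, larger sequential reads get much closer to peak throughput than many small random reads, which is why doubling the chunk size per request helps.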
Half the memory was just an example, not a sweet spot. With smaller window sizes you can use less memory, but at the cost of loading more from flash.
P.S.: The window size in the paper indicates how many tokens' feed-forward layers are kept in memory.
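A minimal sketch of that tradeoff (the class name and the `load_from_flash` callback are invented for illustration): the cache keeps only the feed-forward rows that were active for the last `window` tokens, loads from flash only rows not already resident, and evicts rows no token in the window still uses. A bigger window means more resident memory but fewer flash loads.

```python
from collections import deque

class SlidingWindowCache:
    """Toy version of the paper's windowing: cache FFN rows for the
    last `window` tokens; fetch from flash only what is missing."""

    def __init__(self, window, load_from_flash):
        self.window = window
        self.load_from_flash = load_from_flash  # row_id -> weights
        self.recent = deque()                   # per-token sets of active row ids
        self.cache = {}                         # row_id -> weights

    def rows_for_token(self, active_rows):
        # Load only rows not already resident; count the flash reads.
        loaded = 0
        for r in active_rows:
            if r not in self.cache:
                self.cache[r] = self.load_from_flash(r)
                loaded += 1
        self.recent.append(set(active_rows))
        # Evict rows unused by every token still inside the window.
        if len(self.recent) > self.window:
            self.recent.popleft()
            live = set().union(*self.recent)
            for r in list(self.cache):
                if r not in live:
                    del self.cache[r]
        return loaded
```

Since consecutive tokens activate heavily overlapping sets of neurons, most lookups hit the cache and the number of IO requests per token drops.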
I was thinking a couple of days ago about this concept of windowing for LLMs, but I lack the technical skills to implement it. Now Apple has just published a paper on it. This is what I call synchronicity.
They are targeting limited GPU memory and limited CPU-to-GPU memory transfer. I don't know how useful it would be on Macs, because MacBooks have unified memory and you don't necessarily need that transfer.
https://arxiv.org/pdf/2312.11514