Q4 comes out to ~26GB, but Apple doesn't let you load it on a 32GB Mac: the max usable unified memory for the GPU is capped at ~21GB (`device.recommendedMaxWorkingSetSize`) [1]. So for Q4 Mixtral MoE you'd unfortunately need a 64GB Mac.
Unless you use this hack [2].

There’s also a brand-new hybrid quantization for Mixtral that uses 4 bits for the shared neurons and 2 bits for the experts; it doesn’t bleed much perplexity and fits into a 32GB machine. Haven’t had it in hand yet and no link here on mobile, but can’t wait to try.
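For a sense of why the hybrid quant fits: here's a back-of-envelope size estimate. The parameter breakdown is an assumption derived from Mixtral 8x7B's published config (32 layers, d_model 4096, FFN 14336, 8 experts, ~46.7B total params), and real quants add per-block scale overhead on top, so treat this as a rough sketch, not the actual file size.

```python
# Rough size estimate for a 4-bit-shared / 2-bit-expert hybrid quant of
# Mixtral 8x7B. Shapes are assumptions from the published config; real
# quantization formats carry extra per-block scale/metadata overhead.

LAYERS, D_MODEL, D_FFN, EXPERTS = 32, 4096, 14336, 8

# Each expert is a SwiGLU FFN: three d_model x d_ffn matrices per expert.
expert_params = LAYERS * EXPERTS * 3 * D_MODEL * D_FFN   # ~45.1B
total_params = 46.7e9                                    # published total
shared_params = total_params - expert_params             # attention, router, embeddings, norms

GB = 1024 ** 3
size_gb = (expert_params * 2 / 8 + shared_params * 4 / 8) / GB

print(f"~{size_gb:.1f} GB")  # well under the ~21GB working-set cap
```

The experts dominate the parameter count, which is why dropping just them to 2 bits buys so much: the 4-bit shared weights are a rounding error by comparison.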
[1] https://developer.apple.com/forums/thread/732035
[2] https://github.com/ggerganov/llama.cpp/discussions/2182#disc...