This is the sparsest model thats been put out in a while (maybe ever, kinda forg...

coder543 · on April 24, 2024

It has 480B parameters total, apparently. You would only need 512GB of RAM if you were running at 8-bit. It could probably fit into 256GB at 4-bit, and 4-bit quantization is broadly accepted as a good trade-off these days. Still... that's a lot of memory.

EDIT: This[0] confirms 240GB at 4-bit.

[0]: https://github.com/ggerganov/llama.cpp/issues/6877#issue-226...

kaibee · on April 24, 2024

I know quantizing larger models seems to be more forgiving but I’m wondering if that applies less to these extreme-MoE models. It seems to be that it should be more like quantizing a 3B model.

coder543 · on April 24, 2024

4-bit is fine for models of all sizes, in my experience.

The only reason I personally don’t quantize tiny models very much is because I don’t have to, not because the accuracy gains from running at 8-bit or fp16 are that great. I tried out 4-bit Phi-3 yesterday, and it was just fine.

refulgentis · on April 24, 2024

Yeah, and usually GPU RAM, unless you enjoy waiting for a minute for filling the context :(

Manabu-eo · on April 24, 2024

The old google's Switch-C transformer [1] had 2048 experts, 1.6T parameters, with only one activated for each layer, so much more sparse. But also severely undertrained as the other models of that era, and thus useless now.

1. https://huggingface.co/google/switch-c-2048

imachine1980_ · on April 24, 2024

it performs worst than 8b llama 3 so you probably don't need that much.

coder543 · on April 24, 2024

Where do you see that? This comparison[0] shows it outperforming Llama-3-8B on 5 out of 6 benchmarks. I'm not going to claim that this model looks incredible, but it's not that easily dismissed for a model that has the compute complexity of a 17B model.

[0]: https://www.snowflake.com/wp-content/uploads/2024/04/table-3...