I have no formal credentials to say this, but intuitively it feels obviously wrong. You couldn't have taken 50 rat brains, "mixed" them, and expected the result to produce new science.
For some uninteresting regurgitation, sure. But size - width and depth - seems like an important piece of the ability to extract a deep understanding of the universe.
Also, as I understand it, MoE will inherently be unable to glean insight into, reason about, or come up with novel understanding of cross-expert areas.
MoE models are essentially trained as a single model. It's not 7 independent models; individually (AFAIK) the experts are all totally useless without each other.
It's just that each expert picks up different "parts" of the training more strongly, and those parts can be selectively activated at runtime. This is actually kinda analogous to animals, which don't fire every single neuron as frequently as monolithic models do.
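Here's a rough sketch of what that selective activation looks like, assuming a PyTorch-style top-k routed MoE feed-forward layer (the class name, dimensions, expert count, and k are all illustrative, not taken from any specific model):

```python
# Minimal sketch of top-k expert routing; numbers are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, dim=512, num_experts=8, k=2):
        super().__init__()
        # Each "expert" is just a feed-forward block, not a standalone
        # model: it is only ever trained jointly, behind the router.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts)  # learned gating
        self.k = k

    def forward(self, x):                           # x: (tokens, dim)
        logits = self.router(x)                     # (tokens, num_experts)
        weights, idx = logits.topk(self.k, dim=-1)  # k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected experts do compute for each token; all the
        # other parameters sit idle (but stay resident in memory).
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

Note that the router and the experts are optimized together end to end, which is exactly why no expert works as an independent model.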
The tradeoff, at equivalent quality, is essentially increased VRAM usage in exchange for faster, more splittable inference and training, though the exact balance of that tradeoff is an excellent question.
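To put rough numbers on that tradeoff, here's a back-of-envelope calculation for the feed-forward parameters of a single hypothetical MoE layer (all figures invented for illustration, not from any real model):

```python
# Hypothetical single-layer numbers; purely illustrative.
dim, ff_mult, num_experts, k = 4096, 4, 8, 2
params_per_expert = 2 * dim * (ff_mult * dim)  # up- plus down-projection
resident = num_experts * params_per_expert     # must all fit in VRAM
active = k * params_per_expert                 # actually computed per token
print(f"resident: {resident/1e6:.0f}M params, active per token: {active/1e6:.0f}M")
# resident: 1074M params, active per token: 268M
```

So at equal per-token compute, this layer holds roughly num_experts/k times more parameters in memory than a dense layer with the same active size, which is where the extra VRAM goes.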
I believe size matters, a lot.