I have no formal credentials to say this, but intuitively it feels obviously wrong. You couldn't have taken 50 rat brains, "mixed" them, and expected the result to produce new science.
For some uninteresting regurgitation, sure. But size - width and depth - seems like an important piece of the ability to extract a deep understanding of the universe.
Also, as I understand it, MoE will inherently be unable to glean insight into, reason about, or come up with novel understanding of cross-expert areas.
MoE models are essentially trained as a single model. It's not 7 independent models; individually (AFAIK) the experts are all totally useless without each other.
It's just that each expert picks up different "parts" of the training more strongly, and those parts can be selectively activated at runtime. This is actually kinda analogous to animals, which don't fire every single neuron as frequently as monolithic models do.
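Here's a rough sketch of what that selective activation looks like, assuming a PyTorch-style top-k routed MoE feed-forward layer (the class name, dimensions, expert count, and k are all illustrative, not taken from any specific model):

```python
# Minimal sketch of top-k expert routing; numbers are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, dim=512, num_experts=8, k=2):
        super().__init__()
        # Each "expert" is just a feed-forward block, not a standalone
        # model: it is only ever trained jointly, behind the router.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts)  # learned gating
        self.k = k

    def forward(self, x):                           # x: (tokens, dim)
        logits = self.router(x)                     # (tokens, num_experts)
        weights, idx = logits.topk(self.k, dim=-1)  # k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected experts do compute for each token; all the
        # other parameters sit idle (but stay resident in memory).
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

Note that the router and the experts are optimized together end to end, which is exactly why no expert works as an independent model.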
The tradeoff, at equivalent quality, is essentially increased VRAM usage in exchange for faster, more splittable inference and training, though the exact balance of that tradeoff is an excellent question.
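To put rough numbers on that tradeoff, here's a back-of-envelope calculation for the feed-forward parameters of a single hypothetical MoE layer (all figures invented for illustration, not from any real model):

```python
# Hypothetical single-layer numbers; purely illustrative.
dim, ff_mult, num_experts, k = 4096, 4, 8, 2
params_per_expert = 2 * dim * (ff_mult * dim)  # up- plus down-projection
resident = num_experts * params_per_expert     # must all fit in VRAM
active = k * params_per_expert                 # actually computed per token
print(f"resident: {resident/1e6:.0f}M params, active per token: {active/1e6:.0f}M")
# resident: 1074M params, active per token: 268M
```

So at equal per-token compute, this layer holds roughly num_experts/k times more parameters in memory than a dense layer with the same active size, which is where the extra VRAM goes.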
I believe size matters, a lot.