It would be interesting to see whether these abilities go away if you subjected the large model to dropout/pruning as you continued training, until it was reduced to the size of the small model.
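A minimal sketch of that experiment, using PyTorch's built-in magnitude pruning as the shrinkage mechanism (the toy model, data, and schedule here are illustrative assumptions, not a claim about the actual setup):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)

# Stand-in "large" model; in the real experiment this would be the big LM.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(256, 16), torch.randn(256, 1)

for step in range(5):
    # Each round, prune 20% of the remaining weights (by L1 magnitude)
    # in every Linear layer, then keep training so the network can
    # adapt to the reduced capacity before the next cut.
    for layer in model:
        if isinstance(layer, nn.Linear):
            prune.l1_unstructured(layer, name="weight", amount=0.2)
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Fraction of first-layer weights zeroed after repeated pruning rounds;
# the question is whether the "ability" (here, just the loss) survives.
sparsity = float((model[0].weight == 0).float().mean())
print(f"first-layer sparsity: {sparsity:.2f}, final loss: {loss.item():.3f}")
```

After each pruning round you would re-evaluate the benchmark that exhibited the emergent ability, and watch for the point in the sparsity schedule where it disappears (if it does).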
I think once an "ability" is learned by the model, it is useful for compressing information, and so is more likely than not (>50%) to be retained.