
I'm not saying everything is memorized; I'm guessing there are some tasks you can succeed on by having seen very similar examples, or maybe even through some kind of knowledge transfer.

But how do you know that a "new" task you are making up doesn't have very similar examples in the (pre-existing) training set? Or that, if you could isolate the data that plays into the "ability", a much smaller model trained purely on that data wouldn't have a similar success rate? Or that your "ability" is a good proxy for the corresponding ability in people? In the extreme case, you can create a model that has the "ability" to perfectly solve 80 four-option questions using just 20 bytes, by way of a lookup table, but it wouldn't be a very general model.
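
To make the arithmetic concrete: 80 questions with four options each need 2 bits per answer, i.e. 160 bits = 20 bytes. A minimal sketch of such a lookup-table "model" in Python (the answer key and packing scheme here are made up purely for illustration):

    # Hypothetical answer key: 80 answers, each in {0, 1, 2, 3}.
    answers = [i % 4 for i in range(80)]

    # Pack 4 two-bit answers per byte -> 80 * 2 / 8 = 20 bytes total.
    table = bytearray(20)
    for i, a in enumerate(answers):
        table[i // 4] |= a << (2 * (i % 4))

    def solve(question_id: int) -> int:
        """Return the memorized answer (0-3) for a question index."""
        return (table[question_id // 4] >> (2 * (question_id % 4))) & 0b11

    assert len(table) == 20
    assert all(solve(i) == answers[i] for i in range(80))  # "perfect" score

It scores 100% on exactly those 80 questions and nothing else, which is the sense in which benchmark accuracy alone can't distinguish memorization from a general ability.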

My complaint is that we have the veneer of empirical evidence, but almost every aspect is loosey-goosey and full of potential confounders. We don't know the data sets, we don't really know why we chose the tasks we test, we don't know whether those tasks are similar to or different from other tasks, and the internal operations are illegible so we can't inspect the model's "thinking"; yet we are willing to list a specific number of tasks with detailed charts, as though the concept of "emergence" were scientifically grounded rather than potentially being an artifact of something else.


