Now consider the case where you tell GPT to "think out loud" before giving you the answer, which, incidentally, is a well-known trick that tends to significantly improve its ability to produce good results. Is that thinking?
Maybe. Mechanically, we might also describe it as causing the model to condition explicitly on specific tokens in its context rather than relying on the implicit conditioning encoded in the model parameters. This tends to more tightly constrain the output space, making a smaller haystack in which to look for a needle. It also leverages the fact that "next token prediction" enforces some consistency with the preceding tokens.
It could be thinking, but I don’t think that’s strong evidence that it is thinking.
I would say that it's very strong evidence that it is thinking, if that "thinking out loud" output affects the final answer in ways that are consistent with logical reasoning over that output. This is easy to test: edit the intermediate output before it's submitted back to the model and see how the model's behavior changes.
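The intervention test described above can be sketched as follows. This is a minimal illustration, not a real LLM call: `toy_model` is a hypothetical stand-in whose final answer is derived entirely from the "thinking out loud" text it is given, which is the property the test is probing for in an actual model.

```python
import re

def toy_model(question: str, chain_of_thought: str) -> str:
    """Hypothetical stand-in for an LLM's final-answer step: it conditions
    only on the chain-of-thought tokens, reporting the last number mentioned
    in them as its answer."""
    numbers = re.findall(r"-?\d+", chain_of_thought)
    return numbers[-1] if numbers else "unknown"

def intervention_test(question: str, original_cot: str, edited_cot: str) -> bool:
    """The test from the text: edit the 'thinking out loud' output before it
    is submitted back, and check whether the final answer changes with it."""
    return toy_model(question, original_cot) != toy_model(question, edited_cot)

question = "What is 6 * 7?"
original = "6 * 7 means six sevens: 7+7=14, 14+7=21, ... so the answer is 42."
edited = "6 * 7 means six sevens, and that comes out to 43."  # deliberately altered

print(toy_model(question, original))  # → 42
print(toy_model(question, edited))    # → 43
print(intervention_test(question, original, edited))  # → True
```

If a real model's answers track the edited reasoning the way the toy does here, that is evidence it is genuinely conditioning on the reasoning tokens rather than producing them as decoration.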