
I wrote that double descent is widespread, not grokking.

Of course a transformer can't do multiplication or any other kind of operation on an infinite set of numbers, because it has only bounded depth, which limits the number of steps of any algorithm it can emulate. But I think I see how I could build a transformer by hand that could multiply any two 4-digit numbers. The difficulty is the quadratic number of steps. Addition and subtraction are far easier; [1] shows they can be solved: "By introducing position tokens (e.g., "3 10e1 2"), the model learns to accurately add and subtract numbers up to 60 digits. We conclude that modern pretrained language models can easily learn arithmetic from very few examples, as long as we use the proper surface representation". But they needed to change the input representation, otherwise finding the n-th digit would require scanning the number from the right end while counting, which seems to be difficult to learn.
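To make the quoted representation concrete, here is a minimal sketch of what such a position-token encoding might look like, following the paper's "3 10e1 2" example for 32 (the function name and exact token format beyond that one example are my assumptions, not the paper's code):

```python
def to_position_tokens(n: int) -> str:
    """Encode a number with explicit position tokens, e.g. 32 -> "3 10e1 2".

    Every digit except the ones digit is followed by a "10e<k>" marker
    naming its power of ten, so a model can read off each digit's place
    value directly instead of counting digits from the right.
    """
    digits = str(n)
    parts = []
    for i, d in enumerate(digits):
        power = len(digits) - 1 - i  # place value of this digit
        parts.append(d)
        if power > 0:
            parts.append(f"10e{power}")
    return " ".join(parts)


# 32 -> "3 10e1 2", matching the paper's example
print(to_position_tokens(32))
```

The point of the transformation is that alignment of same-place digits (the step addition needs) becomes a local token-matching problem rather than a counting problem.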

But we are in partial agreement. I don't actually think transformers are great; I think they're awfully limited. But the fact that mere pattern-matching can achieve so much makes me highly optimistic about better methods, e.g. adding working memory.

[1] Investigating the Limitations of Transformers with Simple Arithmetic Tasks https://arxiv.org/abs/2102.13019


