Hacker News

It's important to remember the first principle of what GPT does.

It looks at the patterns among a bunch of unique tokens in a dataset (in this case, words online) and riffs on those patterns to produce outputs.

It will never learn math this way, no matter how much training you give it.

BUT we have already solved computers doing math with regular rules-based algorithms. The way to solve the math problem is to filter inputs and send some to the GPT NN and some to a regular algorithm (this is what Google Search does now, for example).
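A minimal sketch of that routing idea. The heuristic and the stubbed-out LLM call are hypothetical, purely to illustrate the split:

```python
import re

def is_arithmetic(query: str) -> bool:
    # Crude heuristic: the query is purely an arithmetic expression.
    return re.fullmatch(r"[\d\s+\-*/().]+", query.strip()) is not None

def rules_based_math(query: str) -> str:
    # Deterministic evaluation, no NN involved.
    # eval() is acceptable here only because is_arithmetic() restricted the charset.
    return str(eval(query))

def ask_llm(query: str) -> str:
    # Stub standing in for a call to a GPT-style model.
    return f"[LLM answer for: {query}]"

def answer(query: str) -> str:
    # Route math to the exact algorithm, everything else to the LLM.
    return rules_based_math(query) if is_arithmetic(query) else ask_llm(query)

print(answer("12 * (3 + 4)"))      # exact: 84
print(answer("Who wrote Hamlet?")) # goes to the LLM stub
```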

GPT is an amazing tool that can do a bunch of amazing stuff, but it will never do everything (the metaphor I always give is that your prefrontal cortex is the most complex part of your brain, but it will never learn how to beat your heart).



> It will never learn math this way, no matter how much training you give it.

Not so. For example, "grokking" is the phenomenon where, with enough training, a NN eventually undergoes a phase change from memorising data to learning the general rules underlying it [1].

Grokking isn't actually desirable; it's better for the model to go more directly and quickly to learning the general rule, which is achievable in toy problems (called "comprehension" in [2]).

People seem to have forgotten that deep learning is so powerful because it performs feature/representation learning, not because it can memorise (although that's powerful too). IMO that is the proper definition of 'deep learning'.

[1] Power &al. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets https://arxiv.org/abs/2201.02177

[2] Liu &al. Towards Understanding Grokking: An Effective Theory of Representation Learning https://arxiv.org/abs/2205.10343
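For context, the algorithmic datasets in [1] are tiny: every pair under a modular operation, split into train and held-out halves. A sketch of that setup for modular addition (p = 97 is one of the primes used in such tasks; the split fraction here is illustrative):

```python
import random

p = 97  # small prime modulus, as in the modular-arithmetic tasks of [1]

# Every equation a + b = c (mod p); the entire "dataset" is just p*p examples.
examples = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]

random.seed(0)
random.shuffle(examples)
split = len(examples) // 2           # e.g. a 50% training fraction
train, test = examples[:split], examples[split:]

# A network that merely memorises `train` scores ~0% on `test`; after grokking
# it recovers the rule (a + b) mod p and generalises to the held-out pairs.
print(len(examples), len(train), len(test))  # 9409 4704 4705
```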


A NN can certainly assimilate a simple algorithm, and will eventually be able to do so for bigger and more complex algorithms. But I think it's mostly impractical at the current level of technology, especially in terms of speed, size, and energy efficiency.

It kinda reminds me of Deep Blue. In principle, a simple DFS has always been able to beat a human at chess, but it wasn't until the 1990s that a computer finally beat a chess grandmaster. Why? Because a dumb DFS is impractically slow; the human player would die of old age before the computer finished its calculation.
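Rough arithmetic on why naive exhaustive search is hopeless: chess's average branching factor is often quoted as around 35, so the game tree explodes even at modest depths (illustrative numbers only):

```python
# Approximate game-tree size for branching factor b searched to depth d:
# nodes ~= b + b^2 + ... + b^d
def tree_nodes(b: int, d: int) -> int:
    return sum(b**k for k in range(1, d + 1))

b = 35  # rough average branching factor of chess
for d in (4, 8, 12):
    print(d, tree_nodes(b, d))
# Depth 12 alone is already ~3.4e18 nodes; a full game runs far deeper,
# which is why brute-force search "always could" win in theory but not in practice.
```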

I believe the same goes with the current AI trend. What we have right now is rather crude. The approach itself has lots of potential, but the actual solution is yet to be found. It's really sad that people keep hyping up these partial solutions as zee AI. Whatever.


> Not so. For example, "grokking" is the phenomenon where, with enough training, a NN eventually undergoes a phase change from memorising data to learning the general rules underlying it.

Reading the paper, what they seem to be getting at is "when the dataset is algorithmic (like multiplication tables), the parameters get set in a way that appears to replicate the algorithm."

That's cool, but not what GPT is.

> People seem to have forgotten that deep learning is so powerful because it performs feature/representation learning, not because it can memorise (although that's powerful too). IMO that is the proper definition of 'deep learning'.

That's not what GPT is doing.


Grokking doesn't just happen for algorithmic data; it also happens, less dramatically, in other datasets [3]. Grokking seems to be closely related to double descent [4], which is quite widespread. Anyway, I only wanted to give grokking as an example of how memorisation doesn't preclude generalisation; it may simply precede it.

> That's not what GPT is doing.

I don't follow. Of course GPT models are learning representations (but I doubt you meant to deny this); that's how they can do semantic matching against their knowledge base (memorised information) in order to generalise from it. They don't only spit out training data verbatim.

Anyway, I didn't claim any GPT variant has actually "learn[t] math", but that it's not impossible with unlimited training.

[3] Liu &al. Omnigrok: Grokking Beyond Algorithmic Data https://openreview.net/forum?id=zDiHoIWa0q1

[4] Davies &al. Unifying Grokking and Double Descent https://openreview.net/pdf?id=JqtHMZtqWm


Again, reading these papers, grokking can happen only in very limited circumstances for non-algorithmic datasets.

> They verify this observation in a student teacher setup, and show that it can arise in non-algorithmic datasets if initialized in a certain weight regime for appropriate sample size.

It’s not a widespread phenomenon by any means and it is not observably happening inside GPT. No amount of training will change that, only a drastic specialization of the training data (which defeats the purpose).

> They don't only spit out training data verbatim.

I’m not saying verbatim. But I am saying it won’t return a pattern it hasn’t seen in its dataset before. The whole point of attention is that the token isn’t just the word, but the word as it exists in context. If you expand "verbatim" to include that context as part of the token, then yes, that is exactly what GPT does: it will not connect two tokens unless it was trained on data implying those tokens should be connected; it knows nothing else about what those tokens are.

Again, to put it simply: a 3rd grader can multiply any two numbers (and I mean literally the infinite set). GPT cannot, and never will be able to multiply an infinite set of numbers.


I wrote that double descent is widespread, not grokking.

Of course a transformer can't do multiplication or any other kind of operation on an infinite set of numbers, because it has only bounded depth which limits the number of steps it can emulate of any algorithm. But I think I see how I could build a transformer by hand that could multiply any two 4-digit numbers. The difficulty is the quadratic number of steps. Addition and subtraction are far easier, [1] shows that can be solved: "By introducing position tokens (e.g., "3 10e1 2"), the model learns to accurately add and subtract numbers up to 60 digits. We conclude that modern pretrained language models can easily learn arithmetic from very few examples, as long as we use the proper surface representation". But they needed to change the input representation, otherwise finding the n-th digit would require scanning the number from the right end while counting, which seems to be difficult to learn.
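A sketch of the position-token representation described in [1]; this encoding function is my own reconstruction from the quoted example ("3 10e1 2" for 32) and may differ in detail from the paper's:

```python
def position_tokens(n: int) -> str:
    """Encode 32 as '3 10e1 2', so each digit carries its place value explicitly."""
    digits = str(n)
    parts = []
    for i, d in enumerate(digits):
        power = len(digits) - 1 - i
        parts.append(d if power == 0 else f"{d} 10e{power}")
    return " ".join(parts)

print(position_tokens(32))   # '3 10e1 2' -- the example quoted from [1]
print(position_tokens(605))  # '6 10e2 0 10e1 5'
```

The point of the representation is that the model no longer has to count digits from the right end of the number to know each digit's place value; it can read the place value off the adjacent token.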

But we are in partial agreement. I don't actually think transformers are great, I think they're awfully limited, but the fact that mere pattern-matching can achieve so much makes me highly optimistic about better methods, e.g. adding working memory.

[1] Investigating the Limitations of Transformers with Simple Arithmetic Tasks https://arxiv.org/abs/2102.13019


It is a transformer model which means it has layers for decoding and encoding information.

This means you can ask it to translate from one representation to another. You can write a sentence and turn it into an equivalent SQL query or a poem, for instance.

But this means whenever you are asking chatgpt to do something for you, it basically tries to decode your question or order and encode its answer representation.

When people ask it to write a program or a command, it turns the request into its help-text representation, which then looks like a believable command that could be executed. If you ask it to execute the code, it will try to find a representation that mirrors the output of the program.

At least that is how I imagine it works.


That's not what a transformer model is: a transformer model is just one that uses self-attention blocks in its layers to encode contextual information about the input. A non-transformer model can equally translate from one representation to another: e.g. before transformers, a commonly used architecture for seq2seq models was the RNN.
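For concreteness, the core self-attention computation is small enough to write out directly. This is a bare-bones single-head sketch in NumPy (no masking, multi-head splitting, or learned biases, so it is not GPT's actual implementation):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model); each row is one token's embedding.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # token-to-token attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # context-mixed token representations

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))                         # 5 tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): same shape as X, but each row now encodes its context
```

This is the "word as it exists in context" machinery mentioned upthread: each output row is a weighted mixture of all the value vectors, with the weights determined by query/key similarity.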


Lol, lots of people spouting off about how they imagine AI works these days. This is not an accurate description of the GPT2/3 model architectures.


It would be a lot more helpful if you could explain the difference. What’s wrong with that description? It seems pretty close to the descriptions of how it works that I’ve seen so far.



