
I was implementing my own transformer-based models and fine-tuning GPT-2 in 2019, and I've kept up with every development since then. I understand the internal structure of these things better than nearly all of the "AI Engineers" who are currently working on wrapping them up as black boxes embedded in applications.

I'm not making a "popular mistake", I'm literally describing how inference is done.



> I'm literally describing how inference is done.

You are very clearly not doing that. Nothing about your comment had anything to do with the internal structure of LLMs.

I believe you that you've set up some models with pytorch or whatever, but this seemingly hasn't translated to a sufficiently coherent mental model to make the distinction between the extrinsic optimization criterion and intrinsic behavior.


I'm not talking about optimization criteria or training, I'm talking about how we use the model for inference.

We feed in a context and it gives a probability distribution for the next word. We sample from that distribution following some set of rules. We then update the context with the new word and repeat.
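That loop can be sketched in a few lines. This is a toy illustration, not any real library's API: `next_token_probs` here is a hard-coded stand-in for the model's forward pass, and all names are made up. A real LLM differs only in how the distribution is computed, not in the shape of the loop.

```python
# Toy sketch of the autoregressive inference loop: get a distribution over
# the next token, sample from it under some rule, append, repeat.
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]

def next_token_probs(context):
    """Stand-in for a forward pass; a real LLM would run a transformer here."""
    counts = [1.0] * len(VOCAB)
    if context and context[-1] == "the":
        counts[VOCAB.index("cat")] = 5.0  # make "the cat" more likely
    total = sum(counts)
    return [c / total for c in counts]

def generate(prompt_tokens, max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    ctx = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(ctx)         # 1. distribution for next token
        token = rng.choices(VOCAB, probs)[0]  # 2. sample per some set of rules
        if token == "<eos>":
            break
        ctx.append(token)                     # 3. update context and repeat
    return ctx

print(generate(["the"]))
```

Swapping the sampling rule (greedy, temperature, top-k) changes step 2, but the overall structure stays exactly this.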

That algorithm is an autocomplete algorithm. As long as that's what LLM inference looks like, all problems that we want to feed to an LLM therefore must be translated to autocomplete.

I think you're making the mistake of assuming that because I use language that resembles the language used by cynics I'm therefore arguing that LLMs are useless. I'm not. All I'm saying is that we need to have an accurate mental model for the way these things work, and that mental model is autocomplete. Nearly every major failure in an LLM application was the result of failing to keep that in mind.


> all problems that we want to feed to an LLM therefore must be translated to autocomplete.

I don't disagree with this, but I do disagree with this earlier statement:

> The further you get from autocomplete, the less reliable the resulting product

Any naturally sequential problem is trivial to translate to autocomplete with minimal loss of fidelity.


In other words, would it be fair to say that any naturally sequential problem is not very far from autocomplete?

Again, I think you're putting words in my mouth and thoughts in my head that aren't there. A lot of people have reacted to AI hype by going the other way and underestimating them—that's not me. I think there are lots of problems they can solve, I just think they all boil down to autocomplete and if you can't boil it down to autocomplete you're not ready to implement it yet with an LLM.


What's the difference between "talk" and "produce output" in your mind? I feel that "autocomplete" makes it sound as if the complete result was already supplied to the model, or as if it just picks the most probable next word from simple frequency counts, rather than the computation the model actually does to produce each token. Autocomplete doesn't adequately describe anything. Imagine a hypothetical machine of infinite ability that still produces tokens one after another from previous tokens: you would still be arguing it's called autocomplete, and bothering me.


This is kind of reductionist. It's like saying that a human writing a book is just doing manual word completion starting from the title. It's technically correct, but what insight is contributed? Would anything about this conversation be different if someone trained a model that did diffusion-like inference in which every possible word in the answer is pushed towards the final result simultaneously? Probably not.


People fine-tune LLMs for classification tasks.

This is completely wrong.


You're just going around saying people are completely wrong without reading what they wrote or providing any justification for that claim. I'm not sure how to respond because your comment is a non sequitur.


The justification is that people are fine-tuning LLMs for classification. They take out the last layer, replace it with a layer that maps to n classes instead of vocab_size, and the training labels aren't next words, they're class labels. (I have a job that does binary classification, for example.)

It's just completely wrong to say everything in LLM land is autocomplete. It's trivial and common to do the above.
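The head swap described above can be sketched as follows. This assumes a PyTorch-style setup; `TinyBackbone` is a made-up stand-in for a pretrained transformer body, and the dimensions are illustrative, not from any particular model.

```python
# Sketch of replacing the LM head with a classification head: same backbone,
# but the final projection maps to n_classes instead of vocab_size.
import torch
import torch.nn as nn

VOCAB_SIZE, HIDDEN, N_CLASSES = 1000, 64, 2  # toy sizes; binary classification

class TinyBackbone(nn.Module):
    """Stand-in for a pretrained LM body (embeddings + transformer blocks)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.body = nn.TransformerEncoderLayer(HIDDEN, nhead=4, batch_first=True)

    def forward(self, ids):
        return self.body(self.embed(ids))    # (batch, seq, HIDDEN)

backbone = TinyBackbone()
lm_head = nn.Linear(HIDDEN, VOCAB_SIZE)      # original head: next-token logits
cls_head = nn.Linear(HIDDEN, N_CLASSES)      # replacement: class logits

ids = torch.randint(0, VOCAB_SIZE, (1, 16))  # one fake tokenized email
hidden = backbone(ids)

lm_logits = lm_head(hidden[:, -1])           # shape (1, VOCAB_SIZE)
cls_logits = cls_head(hidden[:, -1])         # shape (1, N_CLASSES)
print(lm_logits.shape, cls_logits.shape)
```

Everything up to the final linear layer is shared; only the output space and the training labels change.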


That's still autocomplete. You use it by feeding in a context (all the text so far) and asking it to produce the next word (your fine-tuned classification token). The only difference is you don't ask for more tokens once you have one.

That's a very clever way of reducing a problem to autocomplete, but it doesn't change the paradigm.
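One way to see this claim concretely: with an unmodified LM head you can classify by comparing the next-token probabilities assigned to label words, which is literally one step of autocomplete. The distribution below is hard-coded for illustration; a real LLM would produce it from the prompt.

```python
# Toy illustration of classification as one-step autocomplete: take the
# next-token distribution after a prompt and pick the most probable label word.

def fake_next_token_probs(prompt):
    # Stand-in for a model forward pass over the prompt; numbers are made up.
    return {"spam": 0.72, "legitimate": 0.05, "great": 0.10, "short": 0.13}

def classify(prompt, labels):
    probs = fake_next_token_probs(prompt)
    # Greedy "sampling" restricted to the label tokens: one autocomplete step.
    return max(labels, key=lambda lab: probs.get(lab, 0.0))

label = classify("Respond to win $100 now! This email is", ["spam", "legitimate"])
print(label)
```

Whether this counts as "really" autocomplete or as classification wearing an autocomplete costume is exactly the disagreement in this thread.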


If an email says "Respond to win $100 now!" and a classifier has it as 99%/1% for two classes representing spam/not spam, "spam" is not a sensible next token, it's a classification. The model is not trying to predict the next token, it's trying to classify the entire body of text. The training data isn't a bunch of samples where y is whatever came after those tokens.

It's a silly way to think about it. Have you seen how people are fine-tuning for classification? It's not like fine-tuning for instruction following or summarization, etc., which still use next-token prediction and where the last layer still maps to vocab_size outputs.


Yes, but it's akin to describing humans as just "interactions between molecules". Technically true, but useless.


You are conflating how it works with what the goal is. If someone asks "how does a human run fast?" you don't say "you get a stopwatch, ask them to run as fast as they can, look at the time, and figure out how to get there quicker." There is a whole explanation involving biomechanics. And if you understand how the biomechanics work, you might be able to answer how a human can jump high, too.

To make it even more concrete: I have an LLM where I removed the last layer and fine-tuned on a classification problem, so the last layer now has only two outputs rather than vocab_size outputs. The goal is binary classification. The output is not a completion of an idea or anything of the sort. It's still an LLM. It still works the same way all the way up until the last layer. The weights are the same all the way up to the last layer. It works because an LLM has to create a rich understanding of how a bunch of concepts work together.



