I am in no way, shape, or form affiliated with OpenAI or any other AI company.
What I and many others have noticed about the "Are LLMs really smart?" debate is that everyone on the "Nay" side is using 3.5 and everyone on the "Yay" side is using 4.0.
The naming and versioning imply that GPT-4 is somehow only slightly better than 3.5 -- not even a "full +1" better, just "+0.5" better. (This goes to show how trivial it is to trick "mere" humans and their primitive meat brains.)
Similarly, all pre-4 LLMs -- not just the older ChatGPT variants, but Bard, Vicuna, etc. -- are very clearly and obviously sub-par, regularly making glaring mistakes. Hence, people generalise and assume GPT-4 must be more of the same.
For the last few weeks, across many forums, every time someone has said "AIs can't do X", I have put X into GPT-4, and with very few exceptions it could do it.
The unfortunate thing is that there is no free trial for GPT 4, and the version on Bing doesn't seem to be quite the same. (It's probably too restricted by a very long system prompt.)
So no, people won't form their own opinions, at least not yet, because they can't do so without paying for access.
I've been paying for GPT-4 since it came out and have used it extensively. It's clearly an iteration on the same thing and behaves in qualitatively the same way. The differences are just differences of degree.
It's not hard to get a feel for the "edges" of an LLM. You just need to come up with a sequence of related tasks of increasing complexity. A good one is to give it a simple program and ask what it outputs, then progressively add complications until it starts failing to predict the output. You'll reliably find a point where it transitions from reliably getting it right to frequently getting it wrong -- and failing in a distinctly non-humanlike way, consistent with the space of possible programs and outputs becoming too large for its approach of predicting tokens rather than forming and mentally "executing" a model of the code. The improvement from 3.5 to 4 here is incremental: the boundary has moved a bit, but it's still there.
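To make the idea concrete, here's a sketch of what such a "ladder" of predict-the-output tasks might look like. These particular snippets are my own illustration, not a benchmark anyone has published: each step adds one complication (aliasing, then a mutable default argument) that a model predicting tokens rather than executing a mental model tends to stumble on.

```python
# Step 1: trivial arithmetic over a list -- most models get this right.
def step1():
    xs = [1, 2, 3]
    return sum(x * x for x in xs)   # 1 + 4 + 9 = 14

# Step 2: aliasing -- both names refer to the SAME list object.
def step2():
    a = [1, 2, 3]
    b = a
    b.append(4)
    return len(a)                   # 4, because b is an alias of a

# Step 3: mutable default argument -- state persists across calls.
def step3(x, acc=[]):
    acc.append(x)
    return list(acc)

print(step1())    # 14
print(step2())    # 4
print(step3(1))   # [1]
print(step3(2))   # [1, 2] -- the default list carried over from the first call
```

You can keep climbing from here (closures over loop variables, generator exhaustion, exception ordering) until the model's answers stop tracking the actual semantics.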
Most developers I've met -- let alone most humans -- can't successfully run trivial programs in their head, let alone complex ones.
I've thrown crazy complicated problems at GPT 4 and had mixed results, but then again, I get mixed results from people too.
I've had it explain a multi-page SQL query I couldn't understand myself. I asked it to write doc-comments for spaghetti code I wrote for a programming competition, and it spat out a correct comment for every function. One function in particular was nothing but unintelligible numeric operations on single-letter identifiers, and its true purpose could only be understood through seven levels of indirection! It figured it out.
The fact that we're debating the finer points of what it can and can't do is by itself staggering.
Imagine if next week you could buy a $20K Tesla bipedal home robot. I guarantee people would start arguing that it "can't really cook" because it couldn't produce a Michelin-star-quality meal with nothing but stale ingredients, one pot, and a broken spatula.
> in a distinctly non-humanlike way

You can learn a lot about how a system works from how it fails, and in this case it fails in a way consistent with the token-prediction approach we know it is using, rather than the model-forming approach some are claiming has "emerged" from that. Given its performance on a slightly simpler example, it doesn't show the performance on a marginally more complex one that you would expect from a human -- which is precisely the point Rodney Brooks is making. It applies equally to GPT-3.5 and GPT-4.
But I didn't respond to debate the nature or merits of LLMs. That's been done to death, and I wouldn't expect to change your mind. I'm just offering myself as a counterexample to your assertion that everyone (emphasis yours) who is unconvinced by some of the claims being made about LLM capabilities (I dislike your "sides" characterisation) is using GPT-3.5.
Over the long term this is going to be a primary alignment problem of AI as it becomes more capable.
What is my reasoning behind that?
Because humans suck -- or at least the constraints we're saddled with do. All the input systems feeding your brain are constantly behind "now", and the vast majority of data you could take in gets dropped on the floor. For example, if I'm designing a robotic visual input system, it makes nearly zero sense for it to behave like human vision. Your area of 20/20 visual acuity is tiny; only by moving your eyes around rapidly, and then by your brain lying to you, do you get a high-resolution view of the world.
And that's just one example of the weird human behaviours we know about. It's likely we'll find more of these shortcuts over time, because AI won't take them.
My take-away is that your interaction with the OP has not changed your opinion about "everyone", expressed above:
>> What I and many others have noticed about the "Are LLMs really smart?" debate is that everyone on the "Nay" side is using 3.5 and everyone on the "Yay" side is using 4.0.
Sometimes there really is no point in trying to make curious conversation. Curiosity has left the building.
> So no, people won't form their own opinions, at least not yet, because they can't do so without paying for access.
People will pay for access if they find it valuable enough.
I work with people who use it, and I've not seen anything impressive enough come from them to make me want to pay for it, so I don't. I've also screen-shared with them because I was curious what all the fuss was about. What I saw, and what pissed me off, was that they've stopped contributing to our internal libraries and just generate everything now. I found that kind of disturbing. It's not the product's fault, but it's the kind of thing I imagined would start happening.
I'm glad you like it, I just don't know why people feel the need to sell it so hard.
If you had used GPT-4 enough, you would know that at this point OpenAI doesn't need to pay any human to engage in online conversation, aside from legal reasons, if any.
I personally created some content-creating bots with GPT-4, and it succeeded to a level where I don't trust anything I see online anymore. It does a better job than me -- which doesn't say much, because I am an engineer, not a content creator. But still, I could get the same results as one with a script I had GPT-4 write itself.
...Yes, I am losing sleep over GPT-4's performance. If you are not losing sleep over it yet, you haven't really given it a genuine try.
The amount of comments here telling people to “upgrade to ChatGPT 4” is absolutely unprecedented.
I know it might be good, but won't people find value in it and upgrade if they see the need to do so?