With a large enough training database, this can produce surprisingly similar results in a lot of simple cases.
The real problem is that the only foolproof way to detect when they don't is to painstakingly duplicate and verify any result an LLM produces.
But this verification negates much of the advantage LLMs are supposed to offer. So the natural human response is to simply skip the due diligence. And this is a liability issue waiting to happen.
The cost of AI liability has yet to be priced into the market. I expect that some companies and AI service providers will start to restrict or even prohibit the use of AI in some cases because the cost of liability outweighs any real benefit.
Liability disclaimers don't legally apply to a lot of professional services. Selling fake "intelligence" to doctors and lawyers is a risky proposition.
You can't predict --- and therein lies the problem. All you can do is verify. And this negates a lot of the value proposition of AI.
The best use of current AI is what it was originally designed for --- things that don't matter much and are highly tolerant of errors --- like web search.
Coding agents work well enough to be useful because they can check their own work. Nevertheless, it is too generous to call that reasoning. If they're right 80% of the time and they rerun the prompt whenever the project won't build, they might be right 95% of the time on the second try, or even 99% of the time on the third. And if you know a bit about coding, you're probably able to recognize when to intervene.
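The retry arithmetic here can be sketched as a quick back-of-the-envelope calculation. This assumes each attempt succeeds independently with the same probability and that every failure is caught (e.g. the build breaks), which real agents only approximate; the 80% baseline is the illustrative figure from the comment, not a measured number:

```python
def success_within(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds,
    assuming per-attempt success probability p and that failures are
    always detected (so a retry is always triggered)."""
    return 1 - (1 - p) ** n

p = 0.80
print(success_within(p, 1))  # 0.80 on the first try
print(success_within(p, 2))  # ~0.96 by the second try
print(success_within(p, 3))  # ~0.992 by the third
```

Note that the improvement only holds when failures are objectively detectable; a subtly wrong program that still builds never triggers a retry, which is the point the next comments make.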
That's not to denigrate a real productivity booster. It is, however, a warning to anyone applying LLM-based AI to use cases that lack the kind of training corpus and formal framework around correctness that characterizes coding.
Coding surely matters. But it also might be a truly unique use case.
To an extent… that will get you a project that builds, and passes any other objective detectable test, but it can’t tell you if it’s reasonable or good.
Which means we’re getting a lot of shitty work that passes tests.
That’s a massive oversimplification of the field's trajectory.
Google introduced the Transformer model in 2017. They built interactive voice response (Google Assistant) and applied it to web search and language translation (all of which have a large error tolerance), but didn't do much more because they considered reliability to be an issue.
ChatGPT was introduced in 2022. It was based on the Transformer model as are all the current AI chatbots.
ChatGPT's big innovation was scale. They spent billions to ingest everything they could find on the web and beyond, and marketed it as a general-purpose AI.
But scale has hit a wall. Even with a world of data and an energy budget larger than a small country's, reasoning and reliability remain largely unresolved issues.
Computing has traditionally been about reliable answers at low cost. AI offers the opposite --- unreliable answers at high cost.
> While LLMs are probabilistic, their accuracy in specific domains—like Tool Calling—is already hitting near-100% reliability. That is where industrialization happens.
Is this just an AI bot replying to comments on its own AI post?
I'm not sure "true" reasoning is really necessary. Coding tasks have a lot of guessing what the problem might be and then investigating. Under those circumstances, a hallucination is just a guess that didn't work out.