Sorry, I didn’t want to talk about what I have personally done to test this. I felt it would be better to refer to other people.
In the past months I have used Gen AI to create multiple proofs of concept, including labelling and summarization tools. In addition, to make sure I took a project to its conclusion, I built a website from scratch, without any prior knowledge, using Gen AI as extensively as I could.
I am being pretty conscientious with my homework. The results of those experiments are why I am confident in this position. Not just because of the articles.
I am also pointing out that it's not the tech, it's the expectations in the market.
People expect ChatGPT to be oracular, which it just can't be - and the breathless claims from proofs of concept fan the flames.
I leave it to you to recall the results, and the blame, when those unrealistic expectations were not met.
The bigger problem is that an accurate LLM is such a massive speed-up in coding (an order of magnitude, hypothetically at least) that there is zero incentive to share it.
All American programming tech has relied on a time-and-knowledge gap to keep big companies in power.
Using Visual Studio and C++ to create programs is trivial or speedy if you have a team of programmers and know what pitfalls to avoid. If you're a public pleb/peasant who doesn't know the pitfalls, you're going to waste thousands of hours hitting pointless errors, conceptual problems and scaling issues.
Hallucinating LLMs are marketable to the public. Accurate LLMs are a weapon best kept private.
I am always intrigued by the people who say LLMs provide a massive benefit to their programming and never ever provide examples...
That’s an established technique with papers written on the topic and everything.
Anecdotally, I tested this by having GPT-4 translate Akkadian cuneiform, which it can just barely do. I had it do this four times and it returned four gibberish answers. I then prompted it with the source plus the four attempts and asked for a merged result.
It did it better than the human archeologists did! More readable and consistent. I compared it with the human version and it matched the meaning.
Makes sense, it's too obvious to not have already been studied :)
Do you know what I might search for to find info on it?
>Expensive now… soon to be standard?
With translation, better safe than sorry? It's a very important field and it preserves human history, so why not?
It's trivial to test: just use ChatGPT yourself and ask it to solve the same problem several times in new sessions. Then paste in all the attempts and ask for a combined result.
The main issue is context length: if you use 4 attempts, you have to fit in the original question, the four attempts, and the final answer. So that's 6 roughly equal-sized chunks of text. With GPT-4's 8K limit, that's just 1,300 tokens per chunk, or about 900 words. That's not a lot!
The LLMs with longer context windows are not as intelligent, and tend to miss details or follow instructions less accurately.
Right now this is just a gimmick that demonstrates that more intelligence can be squeezed out of even existing LLMs...
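If you want to script it rather than paste by hand, here's a minimal sketch of the same loop, assuming the openai Python client and an API key in the environment. The model name, the attempt count, and the merge wording are just placeholders, not a recipe:

    from openai import OpenAI

    client = OpenAI()

    def ask(prompt: str, model: str = "gpt-4") -> str:
        """One independent attempt in a fresh session (no shared history)."""
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,  # keep sampling on so the attempts actually differ
        )
        return resp.choices[0].message.content

    def multigen(question: str, n: int = 4) -> str:
        # n independent attempts, each with its own empty context
        attempts = [ask(question) for _ in range(n)]
        # feed the question plus every attempt back in and ask for a merge;
        # question + n attempts + merged answer all have to fit in the context window
        merge_prompt = (
            "Question:\n" + question + "\n\n"
            + "\n\n".join(f"Attempt {i + 1}:\n{a}" for i, a in enumerate(attempts))
            + "\n\nCombine these attempts into a single, consistent answer."
        )
        return ask(merge_prompt)

    print(multigen("Translate this Akkadian passage into English: ..."))

Temperature is deliberately left nonzero so the attempts actually differ; at temperature 0 you'd mostly get the same answer four times and the merge step would have nothing to work with.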
That only works if the generated outputs are completely independent and not correlated. I'd be interested in research that shows whether multigen actually reduces hallucination rates.
True, I'm just throwing multigen out there as a wild ass solution
However, you could do multigen across different models, e.g. GPT/Claude/LLaMA, which should not be entirely correlated.
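A rough sketch of that cross-model variant, assuming the openai and anthropic Python clients with keys in the environment; the model names are placeholders, and a local LLaMA served behind an OpenAI-compatible endpoint could be slotted in the same way:

    from anthropic import Anthropic
    from openai import OpenAI

    openai_client = OpenAI()
    anthropic_client = Anthropic()

    def ask_gpt(prompt: str) -> str:
        resp = openai_client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def ask_claude(prompt: str) -> str:
        msg = anthropic_client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text

    def cross_model_multigen(question: str) -> str:
        # attempts come from different model families, so their failure modes
        # should be less correlated than repeated samples from one model
        attempts = [ask_gpt(question), ask_claude(question)]
        merge_prompt = (
            "Question:\n" + question + "\n\n"
            + "\n\n".join(f"Attempt {i + 1}:\n{a}" for i, a in enumerate(attempts))
            + "\n\nCombine these attempts into a single, consistent answer."
        )
        return ask_gpt(merge_prompt)

The merge still has to run on some single model, so it inherits that model's blind spots, but at least the raw attempts aren't all drawn from the same distribution.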
Hallucinations to the normal person are a bug.
The issue is that only humans can hallucinate. We know there is a “reality”.
For an LLM, everything it does is a hallucination.
That’s why you have more POCs than production goods. Your “hallucination rate” is unknown.
Yesterday Ars had an article that described LLMs as a new type of CPU for a new type of architecture. Others want "LawLLM" or "healthLLM".
These are simply not going to happen.
If you even get it to 80% accuracy, that still leaves a 1-in-5 chance the answer you get is wrong.
The issue isn’t a technical one.
It’s expectations.