
I'd take a bet that it has already failed - the hype cycle has made a promise that LLMs can't keep.

Hallucinations to the normal person are a bug.

The issue is that only humans can hallucinate. We know there is a “reality”.

For an LLM, everything it does is a hallucination.

That’s why you see more POCs than production deployments. Your “hallucination rate” is unknown.

Yesterday Ars had an article that described LLMs as a new type of CPU for a new type of architecture. Others want a “LawLLM” or a “healthLLM”.

These are simply not going to happen.

Even if you get it to 80% accuracy, that’s still a 1-in-5 chance the answer is wrong.

The issue isn’t a technical one.

It’s expectations.



And while you mention one article with a negative experience, tons of positive articles came out too.

GitHub copilot is really good and useful.

All the demos I saw that used LLMs were spectacular.

The AI race started this year for everyone, which means we will continuously see progress.

And while you only mention LLMs, the whole AI space is moving at a crazy pace.

There is a high chance that the architecture behind LLMs will change.

And we haven't even touched all the possibilities of multimodal models.


Sorry, I didn’t want to talk about what I have personally done to test this. I felt it would be better to refer to other people.

Over the past months I have used Gen AI to create multiple proofs of concept, including labelling and summarization tools. In addition, to make sure I took a project to its conclusion, I built a website from scratch, without any prior knowledge - using Gen AI as extensively as I could.

I have been pretty conscientious about doing my homework. The results of those experiments are why I am confident in this position, not just the articles.

I am also pointing out that it’s not the tech, it’s the expectations in the market.

People expect ChatGPT to be oracular, which it just can’t be - and the breathless claims from proofs of concept fan the flames.

I leave it to you to recall the results and blame, when unrealistic expectations were not met.


> All the demos I saw that used LLMs were spectacular.

Aren't demos always spectacular?


Blockchain demos sure weren't.


The bigger problem is that an accurate LLM is such a massive speed-up in coding (an order of magnitude, at least hypothetically) that there is zero incentive to share it.

All American programming tech has relied on a time-and-knowledge gap to keep big companies in power.

Using Visual Studio and C++ to create programs is trivial and speedy if you have a team of programmers and know which pitfalls to avoid. If you're a public pleb/peasant who doesn't know the pitfalls, you're going to waste thousands of hours hitting pointless errors, conceptual problems and scaling issues.

Hallucinating LLMs are marketable to the public. Accurate LLMs are a weapon best kept private.

I am always intrigued by the people who say LLMs provide a massive benefit to their programming and yet never provide examples...


Why not "simply" multigen every (important) query and take the statistical average?

Hallucinations are random, the truth isn't.

This is absurdly expensive with GPT-4, cheaper with 3, and dirt cheap locally with LLaMA.
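For short factual answers, the naive version of "multigen and average" could look something like the sketch below - ask_llm is a hypothetical stand-in for whatever model API you're calling, and the majority-vote logic is just one way to approximate a "statistical average":

    from collections import Counter

    def ask_llm(prompt: str) -> str:
        """Hypothetical stand-in for one call to whatever model you use."""
        raise NotImplementedError

    def multigen(prompt: str, n: int = 5) -> str:
        """Sample the same query n times and keep the most common answer.

        Only meaningful for short answers that can be compared after light
        normalization; long-form output needs a merge step instead."""
        answers = [ask_llm(prompt) for _ in range(n)]
        normalized = [a.strip().lower() for a in answers]
        winner, _ = Counter(normalized).most_common(1)[0]
        # Return the original (un-normalized) text of the winning answer.
        return next(a for a, norm in zip(answers, normalized) if norm == winner)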


That’s an established technique with papers written on the topic and everything.

Anecdotally I tested this by having GPT-4 translate Akkadian cuneiform — which it can just barely do. I had it do this four times and it returned four gibberish answers. I then prompted it with the source plus the four attempts and asked for a merged result.

It did it better than the human archeologists did! More readable and consistent. I compared it with the human version and it matched the meaning.

Expensive now… soon to be standard?


I don’t know if this will sound rude, but your example itself illustrates the crux of the problem.

The only way you could know whether the output was right or wrong was that you were able to verify it in the first place.

You can’t verify answers for questions in unfamiliar domains - or even for novel questions in your own domain.

Hah, it feels like a weird version of P!=NP.


Makes sense, it's too obvious to not have already been studied :) Do you know what I might search for to find info on it?

> Expensive now… soon to be standard?

With translation, better safe than sorry. It's a very important field and it preserves human history, so why not?


It's trivial to test: just use ChatGPT yourself and ask it to solve the same problem several times in new sessions. Then paste in all the attempts and ask for a combined result.

The main issue is context length: if you use 4 attempts you have to fit the original question, the four intermediate answers, and the final answer. That's 6 roughly equal-sized chunks of text. With GPT-4's 8K limit that's just ~1300 tokens per chunk, or about 900 words. That's not a lot!
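To make the budget concrete, here's a back-of-the-envelope sketch - the 0.75 words-per-token figure is just the usual rule of thumb, and merge_prompt is a made-up helper, not anything from an actual API:

    # Rough budget check for the 4-attempt scheme described above.
    CONTEXT_WINDOW = 8192   # GPT-4 8K variant
    CHUNKS = 6              # question + 4 attempts + merged answer
    PER_CHUNK_TOKENS = CONTEXT_WINDOW // CHUNKS       # 1365 tokens, roughly the ~1300 above
    PER_CHUNK_WORDS = int(PER_CHUNK_TOKENS * 0.75)    # ~1000 words (rule of thumb)

    def merge_prompt(question: str, attempts: list[str]) -> str:
        """Second-stage prompt: the original question plus all drafts,
        asking the model to reconcile them into one combined answer."""
        drafts = "\n\n".join(f"Attempt {i + 1}:\n{a}" for i, a in enumerate(attempts))
        return (
            f"Question:\n{question}\n\n{drafts}\n\n"
            "Combine the attempts above into a single consistent answer, "
            "preferring content that several attempts agree on."
        )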

The LLMs with longer context windows are not as intelligent; they tend to miss details or don't follow instructions as accurately.

Right now this is just a gimmick that demonstrates that more intelligence can be squeezed out of even existing LLMs...


That only works if the generated outputs are completely independent and not correlated. I'd be interested in research that shows whether multigen actually reduces hallucination rates.


True, I'm just throwing multigen out there as a wild-ass solution. However, you could do multigen across different models, e.g. GPT/Claude/LLaMA, whose errors should not be entirely correlated.
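The cross-model version is basically the same vote with a disagreement flag - ask_gpt4 / ask_claude / ask_llama below are placeholder callables, not real client code, so treat this as a sketch:

    from collections import Counter
    from typing import Callable

    def cross_model_multigen(prompt: str,
                             models: dict[str, Callable[[str], str]]) -> tuple[str, bool]:
        """Ask each model once, majority-vote on normalized answers, and flag
        the result when the models do not all agree (worth a human check)."""
        answers = {name: ask(prompt) for name, ask in models.items()}
        votes = Counter(a.strip().lower() for a in answers.values())
        winner, count = votes.most_common(1)[0]
        return winner, count == len(models)

    # Example wiring with placeholder callables for each provider:
    # answer, unanimous = cross_model_multigen(question, {
    #     "gpt-4": ask_gpt4, "claude": ask_claude, "llama": ask_llama,
    # })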


How?

Truth is a human thing. Statistically averaging out 4, 5, 6, N text generations from an LLM will not converge to any “truth”.

You have essentially stated that outputs from a text generator are normally distributed around “Facts”.

May I gently suggest that an older quote about an infinite number of simians, typewriters and the works of Shakespeare is more appropriate?


What makes you say that a neural network's hallucinations are random? There's absolutely no reason for them to be.


Just try it!


I'll give this a shot after work I think.

The question I have is: what's a prompt that reliably triggers hallucinations but still produces the correct answer some of the time?

I know it gets some Python functions "wrong", but I think they were actually "right" in the version it was trained on, so software questions seem out.



