Deep Research is in its „ChatGPT 2.0“ phase. It will improve, dramatically. And ...

amelius · 2025-02-25T22:12:45 1740521565

This is like saying: y=e^-x+1 will soon be 0, because look at how fast it went through y=2!

submeta · 2025-02-26T05:15:59 1740546959

Many past technologies have defied “it’s flattening out” predictions. Look at Personal computing, the internet, and smartphone technology.

By conflating technology’s evolving development path with a basic exponential decay function, the analogy overlooks the crucial differences in how innovation actually happns.

j_maffe · 2025-02-26T08:49:35 1740559775

> Many past technologies have defied “it’s flattening out” predictions.

And many haven't

dingnuts · 2025-02-26T14:28:43 1740580123

everything you listed was subject to the effects of Moore's Law, explaining their trajectories, but Moore's Law doesn't apply AI in any way. And it's dead.

kridsdale3 · 2025-02-25T22:24:31 1740522271

I appreciate your style of humor.

PeterFBell · 2025-02-25T22:30:40 1740522640

Thanks for making my day :)

whyenot · 2025-02-26T00:38:40 1740530320

Tony Tromba (my math advisor at UCSC) used to tell a low key infuriating, sexist and inappropriate story about a physicist, a mathematician, and a naked woman. It ended with the mathematician giving up in despair and a happy physicist yelling "close enough."

Y_Y · 2025-02-26T14:28:53 1740580133

(from a sibling's link)

> A mathematician and a physicist agree to a psychological experiment. The mathematician is put in a chair in a large empty room and a beautiful naked woman is placed on a bed at the other end of the room. The psychologist explains, "You are to remain in your chair. Every five minutes, I will move your chair to a position halfway between its current location and the woman on the bed." The mathematician looks at the psychologist in disgust. "What? I'm not going to go through this. You know I'll never reach the bed!" And he gets up and storms out. The psychologist makes a note on his clipboard and ushers the physicist in. He explains the situation, and the physicist's eyes light up and he starts drooling. The psychologist is a bit confused. "Don't you realize that you'll never reach her?" The physicist smiles and replied, "Of course! But I'll get close enough for all practical purposes!"

Is that it? Is it sexist because the physicist and mathematician are attracted to the naked woman?

Workaccount2 · 2025-02-26T16:24:27 1740587067

In my experience people's ideas of "offensive" is all over the map. However, peoples treatment towards accusation of being offensive are all treated equally. i.e. punishment for offending is a binary function of accusation, not a function of the actual offense.

It's this mismatch which has contributed heavily towards society's whiplash over the last decade.

rpmisms · 2025-02-26T05:54:25 1740549265

This sounds like a joke with a lot of truth, even if it is offensive.

john_minsk · 2025-02-26T09:41:43 1740562903

Can I have a joke?

nicksrose7224 · 2025-02-25T22:44:09 1740523449

disagree - i actually think all the problems the author lays out about Deep Research apply just as well to GPT4o / o3-mini-whatever. These things just are absolutely terrible at precision & recall of information

simonw · 2025-02-25T22:56:26 1740524186

I think Deep Research shows that these things can be very good at precision and recall of information if you give them access to the right tools... but that's not enough, because of source quality. A model that has great precision and recall but uses flawed reports from Statista and Statcounter is still going to give you bad information.

benedictevans · 2025-02-25T23:17:18 1740525438

Deep Research doesn’t give the numbers that are in statcounter and statista. It’s choosing the wrong sources, but it’s also failing to represent them accurately.

simonw · 2025-02-25T23:34:21 1740526461

Wow, that's really surprising. My experience with much simpler RAG workflows is that once you stick a number in the context the LLMs can reliably parrot that number back out again later on.

Presumably Deep Research has a bunch of weird multi-LLM-agent things going on, maybe there's something about their architecture that makes it more likely for mistakes like that to creep in?

benedictevans · 2025-02-26T01:01:08 1740531668

Have a look at the previous essay. I couldn't get ChatGPT 4o to give me a number in a PDF correctly even when I gave it the PDF, the page number, and the row and column.

https://www.ben-evans.com/benedictevans/2025/1/the-problem-w...

simonw · 2025-02-26T01:21:44 1740532904

I have a hunch that's a problem unique to the way ChatGPT web edition handles PDFs.

Claude gets that question right: https://claude.ai/share/7bafaeab-5c40-434f-b849-bc51ed03e85c

ChatGPT treats a PDF upload as a data extraction problem, where it first pulls out all of the embedded textual content on the PDF and feeds that into the model.

This fails for PDFs that contain images of scanned documents, since ChatGPT isn't tapping its vision abilities to extract that information.

Claude (and Gemini) both apply their vision capabilities to PDF content, so they can "see" the data.

I talked about this problem here: https://simonwillison.net/2024/Jun/27/ai-worlds-fair/#slide....

So my hunch is that ChatGPT couldn't extract useful information from the PDF you provided and instead fell back on whatever was in its training data, effectively hallucinating a response and pretending it came from the document.

That's a huge failure on OpenAI's behalf, but it's not illustrative of models being unable to interpret documents: it's illustrative of OpenAI's ChatGPT PDF feature being unable to extract non-textual image content (and then hallucinating on top of that inability).

benedictevans · 2025-02-26T02:18:14 1740536294

Interesting, thanks. I think the higher level problem is that 1: I have no way to know this failure mode when using the product and 2: I don't really know if I can rely on Claude to get this right every single time either, or what else it would fail at instead.

simonw · 2025-02-26T02:22:36 1740536556

Yeah, completely understand that. I talked about this problem on stage as an illustration of how infuriatingly difficult these tools are to use because of the vast number of weird undocumented edge cases like this.

This is an unfortunate example though because it undermines one of the few ways in which I've grown to genuinely trust these models: I'm confident that if the model is top tier it will reliably answer questions about information I've directly fed into the context.

[... unless it's GPT-4o and the content was scanned images bundled in a PDF!]

It's also why I really care that I can control the context and see what's in it - systems that hide the context from me (most RAG systems, search assistants etc) leave me unable to confidently tell what's been fed in, which makes them even harder for me to trust.

fragmede · 2025-02-26T08:48:02 1740559682

Unfortunately that's not how trust works. If someone comes into your life and steals $1,000, and then the next time they steal $500, you don't trust them more, do you?

Code is one thing, but if I have to spend hours checking the output, then I'd be better off doing it myself in the first place, perhaps with the help of some tooling created by AI, and then feeding that into ChatGPT to assemble into a report. By showing off a report about smartphones that is total crap, I can't remotely trust the output of deep research.

hiq · 2025-02-26T11:08:21 1740568101

> Now after two years look at Cursor, aider, and all the llms powering them, what you can do with a few prompts and iterations.

I don't share this enthusiasm, things are better now because of better integrations and better UX, but the LLM improvements themselves have been incremental lately, with most of the gains from layers around them (e.g. you can easily improve code generation if you add an LSP in the loop / ensure the code actually compiles instead of trusting the output of the LLM blindly).

dchichkov · 2025-02-25T23:37:12 1740526632

I agree, they are only starting the data flywheel there. And at the same time making users pay $200/month for it, while the competition is only charging $20/month.

And note, the system is now directly competing with "interns". Once the accuracy is competitive (is it already?) with an average "intern", there'd be fewer reasons to hire paid "interns" (more expensive than $200/month). Which is maybe a good thing? Fewer kids wasting their time/eyes looking at the computer screens?

nuancebydefault · 2025-02-26T08:46:27 1740559587

The interns of today are tomorrow's skilled scientists.

moduspol · 2025-02-26T09:09:44 1740560984

Just FYI: They did roll out Deep Research to those of us on the $20/mo tier at (I think) about the same time you made this comment.