The dotcom bubble burst, and 26 years later we’re all hopelessly addicted to the internet, and the top companies on the stock market are almost all what would have been called “dotcoms” back then.
The railroad bubble burst in 1846 not because trains were a dead end - passenger numbers would increase more than 10x in the UK over the following 50 years.
It’s absolutely not winner take all. LLMs have become a commodity and the cost of switching models is essentially nil.
Even if ChatGPT has brand recognition amongst lay people, your grandparents aren’t the ones shelling out $200/mo for a Claude Code subscription and paying for extra Opus tokens on top of that. Anthropic’s revenue is now neck and neck with OpenAI’s, but if tomorrow they increased the price of Opus by 5x without increasing its capabilities, many would switch to Gemini, GPT 5.4, Cursor, or any cheap Chinese model. In fact I know many engineers who have multiple subscriptions active and switch when they hit the rate limits of one, precisely because the tools are so interchangeable.
At some point it could even become cheaper to just buy 8x H100s and host Qwen/Deepseek/Kimi/etc yourself, if you’re one of those companies paying $3k/mo per engineer in tokens.
I have non-tech friends telling me they prefer other models like Gemini. This feels like the early days of search engines, when people were willing to switch to find better results.
Yep, I have non-tech friends and even younger students talking about how Claude is better at certain tasks or types of homework problems lol.
If it's used as a tool and not just for search, then people will definitely talk about the alternatives. Students who rely on free tiers will also definitely just have everything bookmarked.
> What if your inquiry needs a combination of multiple sources to make sense? There is no 1:1 matching of information, never.
I don't see the problem if you give the LLM the ability to generate multiple search queries at once. Even simple vector search can give you multiple results at once.
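To make that concrete, here is a minimal sketch of multi-query retrieval: one compound question is decomposed into several search queries, each is run against a toy in-memory vector store, and the results are unioned. The word-hashing embedder, the document texts, and the hard-coded query decomposition are all made-up stand-ins; a real system would use a sentence-embedding model and have the LLM produce the query list.

```python
import math

# Toy embedder: hashes words into a small fixed-size vector.
# A real system would use a sentence-embedding model instead.
def embed(text: str, dims: int = 16) -> list[float]:
    vec = [0.0] * dims
    for word in text.lower().split():
        vec[hash(word) % dims] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Tiny in-memory "vector store" (illustrative documents).
DOCS = [
    "Mercedes E-Class models sold in Germany, 1990-1997",
    "US-approved Mercedes models by engine power, 1980-1985",
    "Diesel vs petrol engine options across Mercedes trims",
]
INDEX = [(doc, embed(doc)) for doc in DOCS]

def search(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(INDEX, key=lambda d: cosine(q, d[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# The LLM would decompose one compound question into several queries;
# here the decomposition is hard-coded for illustration.
queries = [
    "Mercedes E-Class 1990-1997 Germany",
    "Mercedes 1980-1985 USA engine power",
]
results = {doc for q in queries for doc in search(q)}
```

The point is just that nothing about vector search limits you to a single query per user message: fire N queries, union (or re-rank) the hits, and stuff the combined set into context.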
> "How many cars from 1980 to 1985 and 1990 to 1997 had between 100 and 180PS without Diesel in the color blue that were approved for USA and Germany from Mercedes but only the E unit?"
I'm a human and I have a hard time parsing that query. Are you asking only for Mercedes E-Class? The number of cars, as in how many were sold?
I agree with you that simple vector search + context stuffing is dead as a method, but I think it's ridiculous to reserve the term "RAG" for just the earliest, most basic implementation. The definition of Retrieval Augmented Generation is any method that tries to give the LLM relevant data dynamically, as opposed to relying purely on it memorising training data, or giving it everything it could possibly need and relying on long context windows.
The RAG system you mentioned is just RAG done badly, but doing it properly doesn't require a fundamentally different technique.
> it's ridiculous to reserve the term "RAG" for just the earliest most basic implementation
Whether we like it or not, dumb semantic search became the colloquial definition of RAG.
And when you hear someone saying "we use RAG here" 95% of the time this is exactly what they mean.
When you inject the user's name into the system prompt, technically you're doing RAG - but nobody thinks about it that way. I think it's one of those cases where the colloquial definition is actually more useful than the formal one.
> doing it properly doesn't require a fundamentally different technique
Then what do you call RAG done well? You need a term for it.
> And when you hear someone saying "we use RAG here" 95% of the time this is exactly what they mean.
That's just Sturgeon's law in action. 95% of every implementation is crap. Back in the 90s, you might have heard "we use OOP here" and come to a similar conclusion, but that doesn't mean you need to invent a new word for doing OOP properly.
> But agentic RAG is fundamentally different.
From an implementation POV, absolutely not.
I've personally gradually converted a dumb semantic search to a more fully featured agentic RAG in small steps like these:
- Have a separate LLM call write the query instead of just using the user's message.
- Make the RAG search a synthetic injected tool call, instead of appending it to the system prompt.
- Improve the search endpoint by using an LLM to pre-process the data into structured chunks with hierarchical categories, tags, and possible search queries, embedding the search queries separately from the desired information (versus originally just having a raw blob).
- Have the LLM be able to search both with a semantic sentence, and a list of tags.
- Have the LLM view and navigate the hierarchy in a tree-like manner.
- Make the original LLM able to call the search on its own instead of being automatically injected using a separate query rewriting call, letting it search in multiple rounds and refine its own queries.
When did the system go from RAG to "not RAG"? Because fundamentally, all you need to do to make an agentic RAG is to have the LLM be able to write/rewrite its own search queries (possibly in multiple passes) as opposed to just passing the user's message(s) directly.
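The progression above can be sketched as a minimal agent loop: the model requests searches, sees the results injected back into the transcript as tool messages, and decides whether to refine its query or answer. `FakeLLM`, the corpus, and the keyword search are all made-up stand-ins (a hard-coded script rather than a real chat model API), just to show the control flow.

```python
# Minimal agentic-RAG loop: the model rewrites its own search
# queries across multiple rounds instead of us embedding the raw
# user message once. FakeLLM is a scripted stand-in for a real model.

def keyword_search(query: str, corpus: list[str]) -> list[str]:
    terms = query.lower().split()
    return [doc for doc in corpus if any(t in doc.lower() for t in terms)]

class FakeLLM:
    """Stand-in model: first asks to search, then refines, then answers."""
    def __init__(self):
        self.turn = 0

    def step(self, history: list[dict]) -> dict:
        self.turn += 1
        if self.turn == 1:
            return {"tool_call": "search", "query": "mercedes e-class"}
        if self.turn == 2:
            # Refines its own query based on what came back.
            return {"tool_call": "search", "query": "petrol 1990"}
        hits = sum(len(m.get("results", [])) for m in history)
        return {"answer": f"Found {hits} matching documents."}

CORPUS = [
    "Mercedes E-Class, petrol, 1990, blue, approved for Germany",
    "Mercedes E-Class, diesel, 1985, approved for USA",
]

def run_agent(user_message: str, max_rounds: int = 4) -> str:
    llm = FakeLLM()
    history = [{"role": "user", "content": user_message}]
    for _ in range(max_rounds):
        action = llm.step(history)
        if "answer" in action:
            return action["answer"]
        # Synthetic tool call: inject results back into the transcript.
        results = keyword_search(action["query"], CORPUS)
        history.append({"role": "tool", "results": results})
    return "gave up"

answer = run_agent("How many blue petrol E-Class cars from the 90s?")
```

Swapping the scripted `FakeLLM` for a real tool-calling model is the only conceptual change; the loop itself is the same whether you call the result "RAG" or "agentic RAG".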
I like the audacity of the parent poster, who equates the 95% of implementations he has seen with 95% of all there is, when it could easily have been 0.01% of all there is. The world is much bigger than we think :)
>all you need to do to make an agentic RAG is to have the LLM be able to write/rewrite its own search queries (possibly in multiple passes)
I think this is a huge oversimplification, the term "search query" is doing a lot of heavy lifting here.
When Claude Code calls something like
find . -type d -maxdepth 3 -not -path '*/node_modules/*'
to understand the project hierarchy before doing any of the grep calls, I don't think it's fair to call it just a "search query", it's more like "analyze query". Just because text goes in and out in both cases, doesn't mean that it's all the same.
When you give the agent the ability to query the nature of the data (e.g. hierarchy), and not just data itself, it means that you need to design your product around it. Agentic RAG has entirely different implementation, product implications, cost, latency, and primarily, outcomes. I don't think it's useful to pretend that it's just a different flavor of the same thing, simply because at the end of the day it's just some text flying over the network.
Some previous techniques for RAG, like directly using a user message’s embedding to do a vector search and stuffing the results in the prompt, are probably obsolete. Newer models work much better if you use tool calls and let them write their own search queries (on an internal database, and perhaps with multiple rounds), and some people consider that “agentic AI” as opposed to RAG. It’s still augmenting generation with retrieved information, just in a more sophisticated way.
Mamba doesn't assume auto-regressive decoding, and you can absolutely use it for diffusion, or pretty much any other common objective. Same with a conventional transformer. For a discrete diffusion language model, the output head is essentially the same as an autoregressive one. But yes, the training/objective/inference setup is different.
You don't even need to go into the pipeline details. The 9800X3D has 4x the L2 cache, 6x the L3 cache, and 2x the memory bandwidth of the now 8-year-old i9-9900K. 3D V-Cache is pretty cool.
I'm not from that generation so that's a bit hard for me to understand. Even if you used a closed-source C compiler, wouldn't you still have been able to look at the header file, which would have been pretty self-explanatory?
And surely if you bought a C compiler, you would have gotten a manual or two with it? Documentation from the pre-Internet age tended to be much better than today.
Yeah - but you have to be a good enough programmer to really understand the headers.. the 'bootstrapping' problem was real :-) Especially if you didn't live in a metropolitan/college area. My local library was really short on programming books - especially anything 'in depth'. Also, 'C' was considered a "professional's language" back then - so bookstores/libraries were more likely to have books on BASIC than 'C'.