Hacker News | softwaredoug's comments

The real thing I think people are rediscovering with file system based search is that there’s a type of semantic search that’s not embedding based retrieval. One that looks more like how a librarian organizes files into shelves based on the domain.

We’re rediscovering forms of search we’ve known about for decades. And it turns out they’re more interpretable to agents.

https://softwaredoug.com/blog/2026/01/08/semantic-search-wit...


Exactly. Traditional library science truly captured deep patterns of information architecture.

https://x.com/wibomd/status/1818305066303910006

Disney got this right in Ralph Breaks the Internet.

https://x.com/wibomd/status/1827067434794127648


Agreed. I've been working on a codebase with 400+ Python files and the difference is stark. With embedding-based RAG, the agent kept pulling irrelevant code snippets that happened to share vocabulary. Switched to just letting the agent browse the directory tree and read files on demand -- it figured out the module structure in about 30 seconds and started asking for the right files by path.

The directory hierarchy is already a human-curated knowledge graph. We just forgot that because we got excited about vector math.
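The "just let it browse" setup needs surprisingly little machinery. A minimal sketch, assuming nothing about any particular agent framework (both tool functions here are hypothetical names):

```python
import os

# Two minimal "tools" an agent can call instead of querying a vector DB:
# list a directory to see the human-curated structure, then read a file
# it has chosen by path.
def list_dir(path="."):
    """Return directory entries, with a trailing slash on subdirectories."""
    return sorted(
        name + "/" if os.path.isdir(os.path.join(path, name)) else name
        for name in os.listdir(path)
    )

def read_file(path, max_bytes=4000):
    """Return the start of a file, capped so it fits in the context window."""
    with open(path, "rb") as f:
        return f.read(max_bytes).decode("utf-8", errors="replace")
```

Expose those two functions as tool calls and the directory names do the semantic work.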


There are a lot of methods in IR/RAG that maintain structure as metadata, used in hybrid fusion to augment search. Graph databases are an extreme form, but some RAG pipelines pull out the metadata and embed it together with the chunk. In the specific case of code, other layered approaches like ColGrep (late interaction) show promise. The point is that most search, most of the time, will benefit from a combined approach, not a silver bullet.

Just like the approach in the article.

Everything is based on the metadata stored with chunks, just allowing the agent to navigate that metadata through ls, cd, find and grep.


> Switched to just letting the agent browse the directory tree and read files on demand -- it figured out the module structure in about 30 seconds

Guess what the difference is between code and loosely structured text...


[flagged]


Parent may or may not be AI-generated or AI-edited. As such it MAY breach one of the HN commenting guidelines.

Your comment however definitely breaches several of them.


indeed. moltbook vibes

I'd rather read a hundred comments like that than one more like yours.

I spent a while working on a retrieval system for LLMs and ended up reinventing a concordance (which is like an index).

It's basically the same thing as Google's inverted index, which is how Google search works.

Nothing new under the sun :)
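For anyone who hasn't built one: a toy inverted index is only a few lines. A sketch with a made-up mini-corpus (the documents below are purely illustrative):

```python
from collections import defaultdict

# Toy corpus: doc id -> text.
docs = {
    0: "retrieval augmented generation",
    1: "inverted index retrieval",
    2: "vector search for generation",
}

# The index maps each term to the set of documents containing it --
# essentially a concordance without the positions.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

print(sorted(index["retrieval"]))  # docs 0 and 1
```

Production systems add tokenization, stemming, positions, and scoring (e.g. BM25), but the core data structure really is this simple.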


Someone simply assumed at some point that RAG must be based on vector search, and everyone followed.

It’s something of a historical accident

We started with LLMs when everyone in search was building question answering systems. Those architectures look like the vector DB + chunking we associate with RAG.

Agents’ ability to call tools, using any retrieval backend, calls that into question.

We really shouldn’t start RAG with the assumption we need that. I’ll be speaking about the subject in a few weeks

https://maven.com/p/7105dc/rag-is-the-what-agentic-search-is...


Right. R in RAG stands for retrieval, and for a brief moment initially, it meant just that: any kind of tool call that retrieves information based on query, whether that was web search, or RDBMS query, or grep call, or asking someone to look up an address in a phone book. Nothing in RAG implies vector search and text embeddings (beyond those in the LLM itself), yet somehow people married the acronym to one very particular implementation of the idea.

Yeah there's a weird thing where people would get really focused on whether something is "actually doing RAG" when it's pulling in all sorts of outside information, just not using some kind of purpose built RAG tooling or embeddings.

Now, the pendulum on that general concept seems to be swinging the opposite direction where a lot of those people just figured out that you don't need embeddings. That's true, but I'd suggest that people don't overindex on thinking that means embeddings are not actually useful or valuable. Embeddings can be downright magical in what you can build with them, they're just one more tool at your disposal.

You can mix and match these things, too! Indexing your documents into semantically nested folders for agents to peruse? Try chunking and/or summarizing each one, and putting the vectors in sidecar files, or even Yaml frontmatter. Disks are fast these days, you can rip through a lot of files indexed like that before you come close to needing something more sophisticated.
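A sketch of the frontmatter idea, with a toy stand-in for the embedding model (everything here is hypothetical, including the `fake_embed` helper; a real setup would call an actual embedding model):

```python
import json
from pathlib import Path

def fake_embed(text):
    # Stand-in for a real embedding model: a toy 4-dim vector of
    # vowel frequencies. Swap in an actual model in practice.
    return [round(text.count(c) / len(text), 3) for c in "aeio"]

def write_note(path, text):
    # Store the chunk's vector as YAML-style frontmatter, so an agent
    # (or a dumb scanner) can rip through files without a vector DB.
    frontmatter = f"---\nembedding: {json.dumps(fake_embed(text))}\n---\n"
    Path(path).write_text(frontmatter + text)

write_note("note.md", "semantic search with file systems")
print(Path("note.md").read_text().splitlines()[0])  # ---
```

Brute-force scanning vectors stored like this scales further than people expect before an ANN index becomes necessary.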


> yet somehow people married the acronym to one very particular implementation of the idea.

Likely due to the rise in popularity of semantic search via LLM embeddings, which for some reason became the main selling point for RAG. Meanwhile keyword search has existed for decades.


I'm still using the old definition, never got the memo.

That’s OK. Most got REST wrong, too.

Stuck it on my calendar, looking forward to it.

You seem like someone who knows what they're doing, and I understand the theoretical underpinnings of LLMs (math background), but I have little kids that were born in 2016 and so the entire AI thing has left me in the dust. Never any time to even experiment.

I am active in fandoms and want to create a search where someone can ask "what was that fanfic where XYZ happened?" and get an answer back in the form of links to fanfiction that are responsive.

This is a RAG system, right? I understand I need an actual model (that's something like ollama), the thing that trawls the fanfiction archive and inserts whatever it's supposed to insert into one of these vector DBs, and I need a front-facing thing I write, that takes a user query, sends it to ollama, which can then search the vector DB and return results.

Or something like that.

Is it a RAG system that solves my use case? And if so, what software might I go about using to provide this service to me and my friends? I'm assuming it's pretty low in resource usage since it's just text indexing (maybe indexing new stuff once a week).

The goal is self-hosting. I don't wanna be making monthly payments indefinitely for some silly little thing I'm doing for me and my friends.

I am just a stay at home dad these days and don't have anyone to ask. I'm totally out the tech game for a few years now. I hope that you could respond (or someone else could), and maybe it will help other people.

There's just so many moving parts these days that I can't even hope to keep up. (It's been rather annoying to be totally unable to ride this tech wave the way I've done in the past; watching it all blow by me is disheartening).


In the definition of RAG discussed here, that means the workflow looks something like this (simplified for brevity): When you send your query to the server, it will first normalise the words, then convert them to vectors, or embeddings, using an embedding model (there are also plain stochastic mechanisms to do this, but today most people mean a purpose-built LLM). An embedding is essentially an array of numeric coordinates in a high-dimensional space, e.g. [1, 2.522, …, -0.119]. The server can now use that to search a database of arbitrary documents with pre-generated embeddings of their own. Those are usually generated when the documents are inserted into the database, following the same process as your search query above, so every record in the database has its own discrete set of embeddings to be queried during searches.

The important part here is that you now don’t have to compare strings anymore (like looking for occurrences of the word "fanfiction" in the title and content), but instead you can perform arbitrary mathematical operations to compare query embeddings to stored embeddings: 1 is closer to 3 than 7, and in the same way, fanfiction is closer to romance than it is to biography. Now, if you rank documents by that proximity and take the top 10 or so, you end up with the documents most similar to your query, and thus the most relevant.

That is the R in RAG; the A as in Augmentation happens when, before forwarding the search query to an LLM, you also add all results that came back from your vector database with a prefix like "the following records may be relevant to answer the user's request", and that brings us to G as in Generation, since the LLM now responds to the question aided by a limited set of relevant entries from a database, which should allow it to yield very relevant responses.
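The retrieval step above, in miniature. The vectors here are invented for illustration; a real system would get them from an embedding model:

```python
import math

def cosine(a, b):
    # Cosine similarity: how closely two vectors point the same way.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Pretend these came from an embedding model (toy 3-dim vectors).
store = {
    "fanfic about pirates": [0.9, 0.1, 0.2],
    "biography of a chemist": [0.1, 0.8, 0.3],
    "romance on the high seas": [0.7, 0.3, 0.3],
}
query_vec = [0.85, 0.15, 0.25]  # embedding of the user's question

# Rank stored documents by proximity to the query and keep the top hits.
ranked = sorted(store, key=lambda d: cosine(query_vec, store[d]), reverse=True)
print(ranked[0])  # fanfic about pirates
```

The top-ranked documents are then pasted into the LLM prompt (the A in RAG).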

I hope this helps :-)


I think the example you give is a little backwards — a RAG system searches for relevant content before sending anything to the LLM, and includes any content retrieved this way in the generative prompt. User query -> search -> results -> user query + search results passed in same context to LLM.

Honestly, just from this question, I think you know enough that I’d go spend $20/month for a subscription to Codex, Claude Code, or Cursor, and ask them to teach you all this. I bet if you put in your comment verbatim with Opus 4.6 and went back and forth a bit, it could help you figure out exactly what you need and build a first version in a couple hours. Seriously, if you know the fundamentals and can poke and prod, these tools are amazing for helping expand your knowledge base. And constraints like how much you want to pay are excellent for steering the models. Seriously, just try it!

> Honestly, just from this question, I think you know enough that I’d go spend $20/month for a subscription to Codex, Claude Code, or Cursor, and ask them to teach you all this.

Paying $20/m sounds like overkill. I have tabs open for all of the most well-known AI chatbots. Despite trying my hardest, it is not possible to exhaust your free options just by learning.

Hell, just on the chatbots alone, small projects can be vibe-coded too! No $20/m necessary.


Yeah, but when it comes to actually building stuff, using Codex is night and day different from using ChatGPT.

We were given a demo of a vector based approach, and it didn't work. They said our docs were too big and for some reason their chunking process was failing. So we ended up using a good old fashioned Elastic backend because that's what we know, and simply forwarding a few of these giant documents to the LLM verbatim along with the user's question. The results have been great, not a single complaint about accuracy, results are fast and cheap using OpenAI's micro models, Elastic is mature tech everyone understands so it's easy to maintain.

I think this turned out to be one of those lessons about premature optimization. It didn't need to be as complex as what people initially assumed. Perhaps with older models it would have been a different story.


> They said our docs were too big and for some reason their chunking process was failing.

Why would the size of your docs have any bearing on whether or not the chunking process works? That makes no sense. Unless of course they're operating on the document entirely in memory which seems not very bright unless you're very confident of the maximum size of document you're going to be dealing with.

(I implemented a RAG process from scratch a few weeks ago, having never done so before. For our use case it's actually not that hard. Not trivial, but not that hard. I realise there are now SaaS RAG solutions but we have almost no budget and, in any case, data residence is a huge concern for us, and to get control of that you generally have to go for the expensive Enterprise tier.)


I agree it makes no sense. The whole point of chunking is to handle large documents. If your chunking system fails because a document is too big, that seems like a pretty glaring omission. I just chalked it up to the tech being new and novel and therefore having more bugs/people not fully understanding how it worked/etc. It was a vendor and they never gave us more details.

Not all problems have to be solved. We just fell back to using older, more proven technology, started with the simplest implementation and iterated as needed, and the result was great.


I don't think this was a simple assumption. LLMs used to be much dumber! GPT-3-era LLMs were not good at grep, they were not that good at recovering from errors, and they were not good at making follow-up queries over multiple turns of search. Multiple breakthroughs in code generation, tool use, and reasoning had to happen on the model side to make vector-based RAG look like unnecessary complexity.

It was the terminology that did that more than anything. The term 'RAG' just has a lot of consequential baggage. Unfortunately.

Certainly a lot of blog posts followed. Not sure that “everyone” was so blinkered.

Doesn't have to be tho, I've had great success letting an agent loose on an Apache Lucene instance. Turns out LLMs are great at building queries.

RAG is like when you want someone to know something they're not quite getting so you yell a bit louder. For a workflow that's mainly search based, it's useful to keep things grounded.

Less useful in other contexts, unless you move away from traditional chunked embeddings and into things like graphs where the relationships provide constraints as much as additional grounding


My intuition is that since AI assistants are fictional characters in a story being autocompleted by an LLM, mechanisms that are interpretable as human interactions with language and appear in the pretraining data have a surprising advantage over mechanisms that are more like speculation about how the brain works or abstract concepts.

This is also why LLMs get 80% of the way there and crap out on logic. They were trained on all the open source abandonware on GitHub.

Similar effort with PageIndex [1], which basically creates a table-of-contents-like tree. Then an LLM traverses the tree to figure out which chunks are relevant for the context in the prompt.

1: https://github.com/VectifyAI/PageIndex


Aren’t most successful RAGs using a combination of embedding similarity + BM25 + reranking? I thought there were very few RAGs that only did pure embedding similarity, but I may be mistaken.
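The usual glue for that combination is something like Reciprocal Rank Fusion, which merges a keyword ranking and an embedding ranking without needing their scores to be comparable. A sketch (the document ids and rankings below are made up):

```python
def rrf(rankings, k=60):
    # Reciprocal Rank Fusion: each list contributes 1/(k + rank + 1)
    # per document; documents ranked highly in several lists win.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d3", "d7"]    # hypothetical keyword (BM25) ranking
vector_hits = ["d1", "d9", "d3"]  # hypothetical embedding ranking
print(rrf([bm25_hits, vector_hits]))  # d1 first: top of both lists
```

A cross-encoder reranker is then typically applied to the fused top-k.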

This kind of circles back to ontological NLP, that was using knowledge representation as a primitive for language processing. There is _a ton_ of work in that direction.

Exactly. And LLMs supervised by domain experts unlock a lot of capabilities to help with these types of knowledge organization problems.

> Our documentation was already indexed, chunked, and stored in a Chroma database to power our search, so we built ChromaFs

It's obvious from that sentence that these guys neither understand RAG nor realize that the solution to their agentic problem didn't need any of these further abstractions, including vector search or grep.


I got to say people also seem to be missing really simple tricks with RAG that help. Using longer chunks and appending the file path to the chunk makes a big difference.

Having said that, generally agree that keyword searching via rg and using the folder structure is easier and better.
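The file-path trick is nearly a one-liner (the helper name and paths below are hypothetical):

```python
def make_chunk(path, text):
    # Prepend the source path so both the embedding and the LLM get
    # cheap structural context (module, domain, file type) for free.
    return f"# source: {path}\n{text}"

chunk = make_chunk("src/auth/tokens.py", "def refresh_token(session): ...")
print(chunk.splitlines()[0])  # "# source: src/auth/tokens.py"
```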


> I got to say people also seem to be missing really simple tricks with RAG that help. Using longer chunks and appending the file path to the chunk makes a big difference.
>
> Having said that, generally agree that keyword searching via rg and using the folder structure is easier and better.

It depends on the task no? Codebase RAG for example has arguably a different setup than text search. I wonder how much the FS "native" embedding would help.


I think it's cool that LLMs can effectively do this kind of categorization on the fly at relatively large scale. When you give the LLM tools beyond just "search", it really is effectively cheating.

Yep, I was using RAG for all sorts of stuff and now moved everything to just rg+fd+cd+ls, much faster, easier, etc.

And next, we’ll get to tag based file systems

More and more often you see "new discoveries" that are very old concepts. The only discovery that usually happens is that the author discovers the concept for himself. But nowadays it seems essential to post it as if you discovered something new.

Inverted indexes have the major advantages of supporting Boolean operators.
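Concretely: if the postings are stored as sets, the Boolean operators fall straight out of set algebra (toy index below, document ids invented):

```python
# Toy inverted index: term -> set of matching document ids.
index = {
    "solar":   {1, 2, 5},
    "energy":  {2, 3, 5},
    "nuclear": {3, 4},
}

print(index["solar"] & index["energy"])    # AND: intersection
print(index["solar"] | index["nuclear"])   # OR: union
print(index["energy"] - index["nuclear"])  # AND NOT: difference
```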

Turns out the millions of people in knowledge work aren't librarians, and they wing shit everywhere.

It’s not the noncompetes that are the problem, it’s confidentiality agreements with extremely broad language.

Learn about the legal principle of “inevitable disclosure”. It’s the idea that you can’t work for a competitor because you can’t help but violate an NDA.


I haven't heard much about it, but I am incredibly curious about how this is currently shaking out in the AI craze.

It seems these labs are revolving doors, and any kind of breakthrough knowledge would immediately make you incredibly valuable to other labs or incredibly valuable as a spinoff start-up. Never mind these researchers all knowing each other and certainly having more than a few common spaces (digital or IRL). And the excitement of working in a fresh field still littered with low hanging fruit.

I can't help but feel that a large part of the reason why the labs are neck and neck is because everyone is talking to everyone else.

I can't substantiate any of this though, it seems to have largely dodged anything besides internal conversation.


They're all in California where the law is very pro-employee. As long as you're not taking actual documents or code with you, there's nothing your former employer can do about what's in your head.

This is a huge part of how SV as a whole works. People figure out what works and point out how to do things better at their next roles. It's mostly a good thing. The main downside is that it exacerbates the tendency to cargo-cult solutions to problems of a particular organizational scale onto orgs that don't have them.

Inevitably, it's just the need for lawyers to intervene in "common sense" negotiations. It may never be legal to do X, Y, or Z, but if the business has all the lawyers and the employee has none, then it doesn't really matter what's legal; it's about who's willing to exhaust the cash to fight the issue.

Which of course, is why unions are what's needed to properly negotiate employee-employer relationships, the same way a strong government is needed to negotiate corporate-civil relationships.

Americans, however, have decided that "individual freedom" is _soooooo_ valuable, that it only exists for people with enough cash to defend it.


Have fun trying that in CA.

“The Great Depression: A Diary” is a great day-by-day, first-person account of someone living through the Depression. It’s a great reminder that we don’t have a monopoly on insane politics.

https://www.goodreads.com/book/show/6601224-the-great-depres...


I read this more than 10 years ago, so I don't remember a lot, but I do appreciate it for being the only account of the crash that doesn't have historical hindsight. It was interesting to hear someone trying to make sense of things on a near daily basis during the fog of uncertainty. It makes me want to find other such accounts of historical events without the inevitable-seeming cause and effect sequence of events you normally read about history.

Check out The Demon of Unrest, which covers the few months before the American Civil War. It tells the stories of Lincoln, other Union leaders, and Confederate leaders, and their understandings and misunderstandings of what the other side thought.

It's remarkable what assumptions people can make without talking to people from other places.


George Orwell's Homage to Catalonia is about his experience in the Spanish Civil War, published in 1938. He was there between '36 and '37 I think. It's pre WWII, and I found it very interesting for the same reason you say here: his account doesn't have the benefit of hindsight. The civil war wasn't even over when the book was published. It's very interesting to see his perspective, what things he saw coming, and what things he didn't.

The Wind is Rising, by HM Tomlinson. It's a diary of the first year or so of the second world war. It has an unforgettable first line: "All we hear from Berlin is the music of marrow bones and cleavers," and is similarly vivid throughout.

It looks like you can borrow it from archive.org, but I suggest buying a physical copy. It was printed in 1941 - and I don't believe ever had a second edition - so it's on thin, wartime paper, which adds to the experience of reading it. It's like something pulled out of a time-capsule, a tangible relic of the time it covers.


Interesting how there is so little information about this book online. It’s a good reminder of how a ton of stuff basically still isn’t on the internet and is still only accessible in old books.

(in no small part due to copyright law)

> It has an unforgettable first line: "All we hear from Berlin is the music of marrow bones and cleavers," and is similarly vivid throughout.

A nice example of the power of media to bring something to life in the reader.


There's the diary (1660-1669) of Samuel Pepys, which covers the Great Plague of London, the Second Anglo-Dutch War, and the Great Fire of London.

Good recommendations here, thanks. I was aware of Orwell's account of the Spanish civil war, so maybe I'll start there.

A book along similar lines: https://www.amazon.com/Not-Nickel-Spare-Sally-Cohen/dp/04399...

(haven’t read it yet so I can’t vouch for it)


I have some algorithms I absolutely must know. So I’m hand coding them and asking the agent to critique me.

I do a very similar thing in writing - I need feedback, don’t rewrite this!

In both cases I need the struggle of editing / failing to arrive at a deeper understanding.

The future dev will need to know when to hand code vs when to not waste your time. And the advantage will still go to the person willing to experience struggle to understand what they need to.


Was Sora just a honeypot to get a media company (ie Disney) to invest a lot of money into OpenAI?

Maybe it achieved its objective?


As far as I can tell Disney didn't actually hand over the money yet. They were still in preparation and it's cancelled now obviously.

So whatever reason they give for shutting this down, it was more important than a $1B investment.


Sora was fun

But it was largely fun to try to transgress against the limitations. Who could trick the AI to generate something outlandish and ridiculous.


IIRC those were solar hot water heaters. More of a curiosity than something legitimately powering the White House.

The symbolism, and the stupidity, was there though. As time has gone on it has been more clear every year how intelligent Carter's administration was and how terrible the following administration was. Investing in/promoting solar was just one of many smart moves by Carter that were attacked purely to gain political points that only harmed us in the long run.

Carter: “This energy crisis shows us how vulnerable we are to foreign autocrats. We should work toward energy independence via renewable energy and waste reduction, to lead the world away from this risky and unsustainable fossil fuel market and secure ourselves a brighter future.”

America: throws a decades-long, ongoing tantrum

It’s fairly reductive… but still kinda true.


To be fair, he was essentially wrong about the efficiency angle because of the Jevons paradox and the "make your dryer not actually dry your clothes" kind of thing was pretty stupid.

A lot of the methods of subsidizing things were also quite incompetent, e.g. Solyndra. If you want to subsidize something like this you do it on the consumer side, e.g. 75% tax credit for every US-made solar panel you install, which drives demand for US-made solar panels without opening you up to scandals like that or the usual corruption where the money goes to the administration's buddies.


"In the year 2000, the solar water heater behind me, which is being dedicated today, will still be here supplying cheap, efficient energy. A generation from now, this solar heater can either be a curiosity, a museum piece, an example of a road not taken, or it can be just a small part of one of the greatest and most exciting adventures ever undertaken by the American people: harnessing the power of the Sun to enrich our lives as we move away from our crippling dependence on foreign oil." - Jimmy Carter (1979) [1]

[1] https://energyhistory.yale.edu/president-jimmy-carters-remar...


Jimmy Carter supported not only solar energy but also domestic fossil fuel production, and encouraged both. The policy of the Carter administration was never to go 100% renewable.

"Our Nation's energy problem is very serious—and it's getting worse. We're wasting too much energy, we're buying far too much oil from foreign countries, and we are not producing enough oil, gas, or coal in the United States."

Energy Address to the Nation. April 05, 1979 https://www.presidency.ucsb.edu/documents/energy-address-the...


One of the lasting consequences of Carter's administration is the strong increase in worldwide CO2 output. Why? Yes, they did encourage the developing countries of that time (now becoming industrial countries) to pursue renewable energy sources, but the main goal was to stop them from developing nuclear technology.

Nuclear Non-Proliferation Act of 1978 Title V – United States Assistance to Developing Countries https://en.wikipedia.org/wiki/Nuclear_Non-Proliferation_Act_...

Notably absent from the "Nuclear Non-Proliferation Act of 1978" is the word "coal". Developing countries were barred from developing nuclear technology, but were free to solve their growing energy needs using coal.


Carter was too good for America, in the sense that he was actually a good person (side note: is he still alive?).

He died December 29, 2024.

I think it’s in part returning money this company paid the government

It’s not as big of a deal as it sounds.

These wind farms have not even started construction yet. Once Don Quixote is out of office, some future administration undoubtedly will start wind farm construction.


I'm actually less concerned about the continued non-existence of a bunch of windmills, vs the billion-dollar payout to ensure that they continue to not exist.

I've spent my entire life not building any windmills and nobody's paid me a billion dollars for it yet.


It’s a refund for buying the lease, not a payout.

This is so non-specific as to be meaningless.

Like actually tell us what you know so we can make useful decisions about our safety.


> Like actually tell us what you know so we can make useful decisions about our safety

"Americans abroad should follow the guidance in security alerts issued by the nearest U.S. embassy or consulate."


This is just more victim fear being pushed so (mostly Christian) conservatives can claim to be the victims, once again, as they colonize another people/land.
