Hacker News

Summary: "Semantic search on codebases works better if you first translate the code to natural language, before generating embedding vectors. It also works better if you chunk more “tightly” - on a per-function level rather than a per-file level. This is because noise negatively impacts retrieval quality in a huge way."
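To make "per-function chunking" concrete, here is a minimal sketch. It assumes the codebase is Python and uses the standard library `ast` module as a stand-in parser; `chunk_by_function` and `SAMPLE` are hypothetical names, not anything from the article.

```python
import ast

def chunk_by_function(source: str) -> list[dict]:
    """Split Python source into one chunk per function or method."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append({
                "name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                # Exact source text of this function only (decorators excluded).
                "text": ast.get_source_segment(source, node),
            })
    # ast.walk does not guarantee source order, so sort by position.
    chunks.sort(key=lambda c: c["start_line"])
    return chunks

# Hypothetical two-function file used only for illustration.
SAMPLE = '''
def total(xs):
    return sum(xs)

def mean(xs):
    return total(xs) / len(xs)
'''

chunks = chunk_by_function(SAMPLE)
for c in chunks:
    print(c["name"], c["start_line"], c["end_line"])
```

Each chunk would then be summarized in natural language and embedded on its own, rather than embedding the whole file as one vector.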

This makes a lot of sense. You should also embed information about how the code relates to other functions and code, and where it is located in the codebase. One approach is to add really wonderful comments to the code, so that humans and machines reading it are taken on a step-by-step journey through how the code fulfills its goal. I tell the LLM to explain step by step for junior developers and to inspire senior engineers with a glimpse of the profound beauty of the code and its architecture.
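One way to picture this is a rough sketch that puts location and relationship metadata in front of the natural-language description before it is embedded. Everything here is an assumption for illustration: `build_embedding_text`, the field labels, and the example path are hypothetical, and no embedding API is called.

```python
def build_embedding_text(path: str, name: str, summary: str,
                         callees: list[str]) -> str:
    """Combine location, relationships, and a description into one string.

    The returned text is what would be passed to an embedding model.
    """
    lines = [
        f"File: {path}",
        f"Function: {name}",
        # Sorted so the same inputs always produce the same text.
        f"Calls: {', '.join(sorted(callees)) if callees else 'none'}",
        f"Description: {summary}",
    ]
    return "\n".join(lines)

# Hypothetical function and path, used only as an example.
text = build_embedding_text(
    path="src/billing/invoice.py",
    name="compute_total",
    summary="Adds up line items and applies tax to get the invoice total.",
    callees=["sum_line_items", "apply_tax"],
)
print(text)
```

Keeping the metadata in a fixed, labeled layout means a query like "where is tax applied in billing" can match on the path and the callee names as well as on the description.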



We use tree-sitter right now and are currently experimenting with call graphs. We'll write about the results when we have data on that.
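For readers unfamiliar with the idea, a call graph maps each function to the functions it calls. The sketch below is not the setup described above: it swaps in Python's standard `ast` module for tree-sitter, handles only Python source, and `build_call_graph` is a hypothetical name.

```python
import ast

def build_call_graph(source: str) -> dict[str, set[str]]:
    """Map each function name to the names it calls (by simple name only)."""
    tree = ast.parse(source)
    graph: dict[str, set[str]] = {}
    for func in ast.walk(tree):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        callees = graph.setdefault(func.name, set())
        # Calls inside nested functions are also attributed to the outer one.
        for node in ast.walk(func):
            if isinstance(node, ast.Call):
                if isinstance(node.func, ast.Name):
                    callees.add(node.func.id)      # plain call: foo()
                elif isinstance(node.func, ast.Attribute):
                    callees.add(node.func.attr)    # method call: obj.foo()
    return graph

# Hypothetical source used only for illustration.
SAMPLE = '''
def total(xs):
    return sum(xs)

def mean(xs):
    return total(xs) / len(xs)
'''

graph = build_call_graph(SAMPLE)
print(graph)
```

This resolves calls by bare name, so it cannot tell apart two methods with the same name on different classes; a real pipeline would need scope or type information for that.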


I've been using tree-sitter too. Looking forward to reading about your results with call graphs.



