
I largely agree with this (and especially your first and last points!)

I'm intrigued by your second bullet. What have you seen in the way of test frameworks and test data that's been beneficial? I will admit, a lot of my iteration has fallen into the "tweak till the demos look good" camp...

On LangChain, I'd say it slightly differently. It is a fantastic tool for getting up and running with something quickly, and for giving a high-level overview of the pattern and components you will need. That said, it is pretty awful for scaling or customizing. I (and many other people I have talked to) benefited greatly from starting a project with LangChain, but then eventually re-wrote ~everything in the stack piece-by-piece, as needed. As I mentioned in the post, the LangChain loaders are the most valuable piece for me, if not for direct use, at least as a starting point of working code to build various integrations.



What you mention wrt. LangChain sounds legit. I agree that it's nice for grokking the patterns and getting something out there. Maybe I worded my post too harshly...

Testing: I haven't really found anything solid, nothing that I'd write up in good conscience as a clear 'best practice'. What I meant by a testing framework is more along the lines of "take some time to set up a bespoke test harness for your use case".

For example: a RAG summarizer chatbot-and-PowerPoint generator backed by a data store containing hundreds of sales pitches. No low-latency requirements, but we need some appreciable accuracy.

There are plenty of cool ideas to try: do I generate n expanded versions of my user query to help retrieval, get m records back, and summarize those in one go? Or do I go for a single query but let the LLM filter out false positives? Etc. etc.

It helps to invest in putting together an (almost) statistically relevant, diverse set of queries and grepping (or even ranking with an LLM) for expected results. In our case: when summarizing public sector sales over the past 5 years, did the thing exclude our Y2K work from 23 years back?

Simple, diverse test jobs give some flavor of scientific hypothesis testing to the weirdness of figuring out how to tweak everything inside the pipeline. They also let you register success criteria upfront, e.g. we accept this if we get 19/20 right. Then that becomes 190/200 in the next project phase. Then that becomes "less than 5% thumbs-down feedback from users when we go live".
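Concretely, such a harness can be as small as a list of jobs with a "must appear" / "must not appear" check, scored against the upfront threshold. A hypothetical sketch; `pipeline` is assumed to be whatever callable wraps your RAG stack and returns a summary string:

```python
# Hypothetical bespoke test harness: run fixed, diverse query jobs through
# the pipeline and compare the pass rate to an upfront success criterion
# (e.g. 19/20 -> threshold 0.95).

def evaluate(pipeline, jobs, pass_threshold=0.95):
    passed = 0
    for query, must_contain, must_not_contain in jobs:
        out = pipeline(query)
        if must_contain in out and must_not_contain not in out:
            passed += 1
    rate = passed / len(jobs)
    return rate, rate >= pass_threshold

# Example jobs in the spirit of the comment above: summarizing recent
# public sector sales should never drag in decades-old Y2K work.
JOBS = [
    ("public sector sales, past 5 years", "public sector", "Y2K"),
    ("recent healthcare pitches", "healthcare", "Y2K"),
]
```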

It's an exciting field. Makes me think of building my own little convnets before PyTorch became so good.


For testing, there’s a whole slew of ranking-based metrics you can use. You want to make sure the content being retrieved is actually relevant. Offline, precision, recall, nDCG and MRR tend to be pretty good. Online, you can directly measure user behavior as a function of your model’s inputs and outputs.
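Two of these are simple enough to compute by hand over labeled retrieval results. A minimal sketch, assuming each result list is reduced to per-position relevance scores (1/0 for binary relevance):

```python
import math

def mrr(ranked_relevances):
    # Mean reciprocal rank: 1 / rank of the first relevant item per query,
    # averaged over all queries; 0 if nothing relevant was retrieved.
    total = 0.0
    for rels in ranked_relevances:
        for i, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / i
                break
    return total / len(ranked_relevances)

def ndcg(rels, k=None):
    # Normalized discounted cumulative gain: DCG of the actual ranking
    # divided by the DCG of the ideal (sorted) ranking.
    rels = rels[:k] if k else rels
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels, start=1))
    ideal = sorted(rels, reverse=True)
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```

A perfectly ordered result list gets nDCG 1.0; relevant items pushed down the ranking pull it toward 0.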

https://en.m.wikipedia.org/wiki/Evaluation_measures_(informa...


LangChain is stuck in the LLMs of the GPT-3 era: its concepts don't translate well to GPT-4 (multi-task single-shot vs. LangChain's chains of multi-shot calls), and there it often gets in the way or overcomplicates things.



