Fair points on LangChain. I agree it's nice for grokking the patterns and getting something out there. Maybe I worded my post too harshly...
Testing: I haven't really found something solid, something I'd write up in good conscience as a clear 'best practice'. What I meant by "testing framework" is more along the lines of "take some time to set up a bespoke test harness for your use case".
For example: a RAG summarizer chatbot-and-PowerPoint generator backed by a data store containing hundreds of sales pitches. No low-latency requirements, but we do need appreciable accuracy.
There are plenty of cool ideas to try: do I generate n expanded versions of my user query to help retrieval, get m records back, and summarize those in one go? Or do I go with a single query and let the LLM filter out false positives? Etc etc. (Something like the sketch below.)
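To make that concrete, here's roughly what I mean by those two variants, as a minimal Python sketch. To be clear, llm() and vector_search() are hypothetical stand-ins for whatever model client and embedding store you're actually using, not real APIs:

    # Sketch of two retrieval strategies for the RAG summarizer.
    # llm() and vector_search() are hypothetical placeholders.

    def llm(prompt: str) -> str:
        raise NotImplementedError  # call whatever model you actually use

    def vector_search(query: str, k: int) -> list[str]:
        raise NotImplementedError  # query whatever embedding store you actually use

    def summarize_with_expansion(user_query: str, n: int = 3, m: int = 5) -> str:
        # Variant A: n expanded queries, pool the hits, summarize in one go.
        expansions = [llm(f"Rephrase this search query (variation {i}): {user_query}")
                      for i in range(n)]
        records = {r for q in [user_query, *expansions] for r in vector_search(q, k=m)}
        return llm("Summarize these sales pitches:\n" + "\n---\n".join(records))

    def summarize_with_filter(user_query: str, m: int = 10) -> str:
        # Variant B: single query, but let the LLM weed out false positives first.
        records = vector_search(user_query, k=m)
        kept = [r for r in records
                if llm(f"Is this pitch relevant to '{user_query}'? yes/no\n{r}")
                .strip().lower().startswith("yes")]
        return llm("Summarize these sales pitches:\n" + "\n---\n".join(kept))

Which variant wins is exactly the kind of thing the test harness below is for.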
It helps to invest in putting together an (almost) statistically meaningful, diverse set of queries and grepping (or even ranking with an LLM) for expected results. In our case: when summarizing public-sector sales from the past 5 years, did the thing correctly exclude our Y2K work from 23 years back?
Simple, diverse test jobs bring some flavor of scientific hypothesis testing to the weirdness of figuring out how to tweak everything inside the pipeline. They also let you register success criteria upfront, e.g. we accept this if we get 19/20 right. That becomes 190/200 in the next project phase, and eventually "less than 5% thumbs-down feedback from users once we go live". (Rough sketch of such a harness below.)
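The harness itself can be tiny. Something like this, as a hedged sketch only: run_pipeline(), the example test case, and the substring checks are illustrative placeholders, not our actual setup:

    # Sketch of a bespoke test harness: a handful of labelled queries,
    # simple must-include / must-exclude checks, and an upfront pass threshold.
    # run_pipeline() is a hypothetical stand-in for your RAG summarizer.

    def run_pipeline(query: str) -> str:
        raise NotImplementedError

    TEST_CASES = [
        {
            "query": "Summarize our public-sector sales from the past 5 years",
            "must_include": ["public sector"],
            "must_exclude": ["Y2K"],  # 23-year-old work should not show up
        },
        # ... extend with a diverse, (almost) statistically meaningful set
    ]

    PASS_THRESHOLD = 19 / 20  # success criterion registered upfront

    def evaluate() -> None:
        passed = 0
        for case in TEST_CASES:
            answer = run_pipeline(case["query"]).lower()
            ok = (all(term.lower() in answer for term in case["must_include"])
                  and all(term.lower() not in answer for term in case["must_exclude"]))
            passed += ok
        rate = passed / len(TEST_CASES)
        print(f"{passed}/{len(TEST_CASES)} passed ({rate:.0%})")
        assert rate >= PASS_THRESHOLD, "pipeline change rejected"

Run evaluate() after every pipeline tweak and you get a crude but honest accept/reject signal instead of vibes.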
It's an exciting field. It reminds me of building my own little convnets before PyTorch became so good.