
I largely agree with this (and especially your first and last points!)

I'm intrigued by your second bullet. What have you seen in the way of test frameworks and test data that's been beneficial? I will admit, a lot of my iteration has fallen into the "tweak till the demos look good" camp...

On LangChain, I'd say it slightly differently. It is a fantastic tool for getting up and running with something quickly, and for giving a high-level overview of the pattern and components you will need. That said, it is pretty awful for scaling or customizing. I (and many other people I have talked to) benefited greatly from starting a project with LangChain, but then eventually re-wrote ~everything in the stack piece-by-piece, as needed. As I mentioned in the post, the LangChain loaders are the most valuable piece for me, if not for direct use, at least as a starting point of working code to build various integrations.



What you mention wrt. LangChain sounds legit. I agree that it's nice for grokking the patterns and getting something out there. Maybe I worded my post too harshly...

Testing: I haven't really found anything solid, nothing that I'd write up in good conscience as a clear 'best practice'. What I meant by a testing framework is more along the lines of "take some time to set up a bespoke test harness for your use case".

For example: a RAG summarizer chatbot-and-PowerPoint generator backed by a data store containing hundreds of sales pitches. No low-latency requirements, but we need some appreciable accuracy.

There are plenty of cool ideas to try: do I generate n expanded versions of my user query to help retrieval, get m records back, and summarize those in one go? Or do I go for a single query but let the LLM filter out false positives? Etc. etc.

It helps to invest in putting together an (almost) statistically relevant, diverse set of queries and grepping (or even ranking with an LLM) for expected results. In our case: when summarizing public sector sales over the past 5 years, did the thing exclude our Y2K work from 23 years back?

Simple, diverse test jobs give some flavor of scientific hypothesis testing to the weirdness of figuring out how to tweak everything inside the pipeline. They also let you register success criteria upfront, e.g. we accept this if we get 19/20 right. Then that becomes 190/200 in the next project phase. Then that becomes "less than 5% thumbs-down feedback from users when we go live".
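Concretely, such a harness can be as small as a list of jobs with a "must appear" / "must not appear" check, scored against the upfront threshold. A hypothetical sketch; `pipeline` is assumed to be whatever callable wraps your RAG stack and returns a summary string:

```python
# Hypothetical bespoke test harness: run fixed, diverse query jobs through
# the pipeline and compare the pass rate to an upfront success criterion
# (e.g. 19/20 -> threshold 0.95).

def evaluate(pipeline, jobs, pass_threshold=0.95):
    passed = 0
    for query, must_contain, must_not_contain in jobs:
        out = pipeline(query)
        if must_contain in out and must_not_contain not in out:
            passed += 1
    rate = passed / len(jobs)
    return rate, rate >= pass_threshold

# Example jobs in the spirit of the comment above: summarizing recent
# public sector sales should never drag in decades-old Y2K work.
JOBS = [
    ("public sector sales, past 5 years", "public sector", "Y2K"),
    ("recent healthcare pitches", "healthcare", "Y2K"),
]
```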

It's an exciting field. Makes me think of building my own little convnets before PyTorch became so good.


For testing, there’s a whole slew of ranking-based metrics you can use. You want to make sure the content being retrieved is actually relevant. Offline, precision, recall, nDCG and MRR tend to be pretty good. Online, you can directly measure user behavior as a function of your model’s inputs and outputs.
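Two of these are simple enough to compute by hand over labeled retrieval results. A minimal sketch, assuming each result list is reduced to per-position relevance scores (1/0 for binary relevance):

```python
import math

def mrr(ranked_relevances):
    # Mean reciprocal rank: 1 / rank of the first relevant item per query,
    # averaged over all queries; 0 if nothing relevant was retrieved.
    total = 0.0
    for rels in ranked_relevances:
        for i, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / i
                break
    return total / len(ranked_relevances)

def ndcg(rels, k=None):
    # Normalized discounted cumulative gain: DCG of the actual ranking
    # divided by the DCG of the ideal (sorted) ranking.
    rels = rels[:k] if k else rels
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels, start=1))
    ideal = sorted(rels, reverse=True)
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```

A perfectly ordered result list gets nDCG 1.0; relevant items pushed down the ranking pull it toward 0.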

https://en.m.wikipedia.org/wiki/Evaluation_measures_(informa...


LangChain is stuck in the LLMs of the GPT-3 era: its concepts don't translate well to GPT-4 (multi-task single-shot vs. LangChain's chains of multi-shot calls), and there it often gets in the way or overcomplicates things.



