"We picked the latter, which also gave us our performance metric - percentage of generated comments that the author actually addresses."
This metric would go up if you leave almost no comments. Would it not be better to find a metric that rewards you for generating many comments which are addressed, not just having a high relevance?
You even mention this challenge yourselves: "Sadly, even with all kinds of prompting tricks, we simply could not get the LLM to produce fewer nits without also producing fewer critical comments."
If that was happening, that doesn't sound like it would be reflected in your performance metric.
Good criticism, and one we should pay closer attention to. Someone else pointed this out too, and since then we’ve started tracking addressed comments per file changed as well.
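To make the difference between the two metrics concrete, here is a minimal sketch. The `ReviewStats` type, the function names, and all the numbers are made up for illustration; none of this is the authors' actual tracking code. It compares a reviewer that leaves many comments with one that leaves almost none, on the same ten changed files.

```python
from dataclasses import dataclass


@dataclass
class ReviewStats:
    """Hypothetical per-period totals for one review bot configuration."""
    comments_generated: int
    comments_addressed: int
    files_changed: int


def address_rate(s: ReviewStats) -> float:
    """Share of generated comments that the author addressed."""
    if s.comments_generated == 0:
        return 0.0
    return s.comments_addressed / s.comments_generated


def addressed_per_file(s: ReviewStats) -> float:
    """Addressed comments normalized by the amount of code reviewed."""
    if s.files_changed == 0:
        return 0.0
    return s.comments_addressed / s.files_changed


# Illustrative numbers only: same 10 files, very different comment volume.
chatty = ReviewStats(comments_generated=40, comments_addressed=20, files_changed=10)
quiet = ReviewStats(comments_generated=2, comments_addressed=2, files_changed=10)

for name, stats in (("chatty", chatty), ("quiet", quiet)):
    print(
        f"{name}: address rate = {address_rate(stats):.2f}, "
        f"addressed per file = {addressed_per_file(stats):.2f}"
    )
# chatty: address rate = 0.50, addressed per file = 2.00
# quiet: address rate = 1.00, addressed per file = 0.20
```

The quiet reviewer wins on address rate even though it surfaced a tenth as many useful comments, which is the failure mode described above. Normalizing by files changed rewards volume of useful comments, though it no longer penalizes noise on its own, so the two numbers are probably best read together.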