Hey, member of the benchmark team. We iterated on the prompts based on observed model behaviors. A few key examples:
Schema introspection: Models were spending significant tokens exploring the database structure through trial-and-error SQL queries, so we included the complete data model in the system prompt upfront.
Reward hacking: We added explicit instructions against gaming the reconciliation checks. This reduced the frequency initially, but models would eventually ignore these constraints.
Domain context: Including company background (YC-backed startup) substantially improved transaction categorization, particularly for startup-specific items like SAFE notes that require domain knowledge to classify correctly.
Hey, member of the benchmark team. We actually seeded the ledger with the company's chart of accounts and 8 months of historical transactions. For the Vercel example specifically, there were prior instances showing how to categorize hosting costs that the models could reference. The expectation wasn't for them to guess blindly, but to use the provided transaction history as guidance for similar categorizations (which they often, but not always, did).
Ahh, that's a good solution! I missed that, and you definitely instruct them to do that:
> You must follow the established patterns for categorization, revrec, etc for past months... If you must use a new account or treatment, explicitly note why existing patterns don't apply
Schema introspection: Models were spending significant tokens exploring the database structure through trial-and-error SQL queries, so we included the complete data model in the system prompt upfront.
Reward hacking: We added explicit instructions against gaming the reconciliation checks. This reduced the frequency initially, but models would eventually ignore these constraints.
Domain context: Including company background (YC-backed startup) substantially improved transaction categorization, particularly for startup-specific items like SAFE notes that require domain knowledge to classify correctly.