1. People discover things LLMs can kind of do, but very poorly.
2. Frontier labs sample these discoveries and incorporate them into benchmarks that they monitor internally.
3. The next-generation model improves on those benchmarks, and the gains generalize to loosely correlated real-world tasks.