Fun experiment. Main limitation I see is the delay between actions and commentary because of the whole script generation & TTS overhead. It seems like the commentary can quickly fall behind, especially in fast-paced sports.
Naw, there are tricks you can use to pipeline these things so that apparent latency is under 500ms even with significant game state history awareness, and also to interrupt ongoing commentary that has just gone out of date.
I couldn’t get it under 250ms though (for Rocket League), but the tech should be better now than in 2024.
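For anyone wondering what the interruption part can look like, here is a minimal Python sketch. It is not the setup described above, just one assumed way to do it: commentary is split into short chunks, and a new game event throws away whatever has not been spoken yet, so the cut happens at the next chunk boundary. `CommentaryPlayer`, `on_event`, and `play_next_chunk` are made-up names, and appending to `played` stands in for real audio playback.

```python
from collections import deque


class CommentaryPlayer:
    """Plays commentary chunk by chunk and drops unspoken chunks on new events."""

    def __init__(self) -> None:
        self.current: deque[str] = deque()  # chunks of the line being spoken
        self.played: list[str] = []         # stand-in for audio sent to the speaker

    def on_event(self, chunks: list[str]) -> None:
        # A new game event makes the unspoken remainder of the old line stale,
        # so replace it instead of queueing behind it.
        self.current = deque(chunks)

    def play_next_chunk(self) -> bool:
        """Play one chunk; return False when there is nothing left to say."""
        if not self.current:
            return False
        self.played.append(self.current.popleft())
        return True


player = CommentaryPlayer()
player.on_event(["Great save", "by the keeper", "under pressure"])
player.play_next_chunk()                      # speaks "Great save"
player.on_event(["GOAL!", "What a shot"])     # the rest of the old line is dropped
while player.play_next_chunk():
    pass
print(player.played)
```

Shorter chunks make the interruption feel snappier, at the cost of more synthesis calls per line.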
Author here. TTS and script generation do add some overhead for now, which is why I've worked with metric aggregates - 30+ bounces rather than exactly 33, for example. For this game, you'd ideally want the overhead to be less than the time it takes the ball to travel from one paddle to the other, which can be around 1–2 seconds. However, there may be another strategy to (maybe?) overcome this: pre-synthesize the numbers (ignoring the fractional part) with TTS and cache them for both commentators, then patch those audio clips in once the core part is synthesized. It should be doable, I think - I just haven't gotten to it yet. Note that matching the excitement and tempo of the core commentary with those numbers is key - otherwise, it will feel janky.
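A rough Python sketch of both ideas (the "30+" aggregate and the cached number clips), under some loud assumptions: `fake_tts` is a hypothetical stand-in for a real TTS call and just returns tagged bytes, `NumberClipCache` and `bucket_label` are made-up names, and byte concatenation stands in for actual audio splicing. It does nothing about matching excitement or tempo, which is the hard part.

```python
def bucket_label(count: int, step: int = 10) -> str:
    """Round down to a multiple of `step`, e.g. 33 -> '30+'."""
    return f"{(count // step) * step}+"


def fake_tts(voice: str, text: str) -> bytes:
    """Hypothetical stand-in for a real TTS call; returns fake 'audio' bytes."""
    return f"[{voice}:{text}]".encode()


class NumberClipCache:
    """Pre-synthesizes whole-number clips for each commentator voice."""

    def __init__(self, voices, max_number, tts=fake_tts):
        self.tts = tts
        # Done once up front, so no number is synthesized during play.
        self.clips = {
            (voice, n): tts(voice, str(n))
            for voice in voices
            for n in range(max_number + 1)
        }

    def splice(self, voice, before, number, after):
        """Synthesize only the core text and reuse the cached number clip."""
        number_clip = self.clips[(voice, int(number))]  # int() drops the fraction
        return self.tts(voice, before) + number_clip + self.tts(voice, after)


cache = NumberClipCache(voices=["A", "B"], max_number=50)
print(bucket_label(33))
print(cache.splice("A", "That is ", 33.7, " bounces"))
```

In a real pipeline the splice step would probably need a short crossfade and clips recorded at a few energy levels, otherwise the numbers will stick out.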