I am similarly less than impressed. If you click through to the website, you can watch the replay of one of the games mentioned in the article (the one with the clue "invader").
In that instance, the clues each matched only 2–3 words, and the winning team got lucky twice (they guessed an unclued word through an unintended correlation, and their opponent guessed a different one of their unclued words).
You also see a number of instances where the agents keep guessing on a clue even after they've already gotten all of its matches. For instance, in round 2, for the clue "Japan (2)", the Blue team guesses "sumo" and "cherry", then makes a rather tenuous follow-up guess of "ring" for round 1's 007, despite having already gotten both of that clue's matches in the first round. A sillier example comes in the final round, where the Red team guesses all the words for its three clues (thereby identifying all nine of its target words), then goes ahead and guesses yet another word. (A sketch of the simple bookkeeping that would prevent this is below.)
(For what it's worth, I think "shark" would have been a better guess for another 007 tie-in, seeing as multiple Bond movies feature sharks, but it's also not a match, and again, I wouldn't have gone for a third guess when there were only two clued words.)
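To be concrete about the failure mode: the fix is trivial bookkeeping. Here's a toy Python sketch of the guard the agents appear to lack. The class name and structure are entirely mine, not anything from the article:

    from dataclasses import dataclass, field

    @dataclass
    class ClueLedger:
        """Track how many clued words remain unguessed for each clue."""
        outstanding: dict[str, int] = field(default_factory=dict)

        def give_clue(self, clue: str, count: int) -> None:
            # e.g. give_clue("Japan", 2) when the clue "Japan (2)" is given
            self.outstanding[clue] = count

        def record_hit(self, clue: str) -> None:
            # Called when a guess attributed to this clue turns out correct.
            self.outstanding[clue] = max(0, self.outstanding.get(clue, 0) - 1)

        def worth_guessing(self, clue: str) -> bool:
            # Once a clue's count is exhausted, further guesses are pure risk.
            return self.outstanding.get(clue, 0) > 0

With that in place, once "sumo" and "cherry" clear "Japan (2)", worth_guessing("Japan") returns False, and since both 007 words already fell in round 1, the "ring" guess wouldn't have been taken either.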
I was wondering the same thing. It's possible the instructions didn't push the agents toward aggressive play. A good model could optimize its clue to cleanly separate as many of its own words as possible from the rest of the board, and since the clue giver knows its own board state, reaching 5–6 words with a single clue should be feasible in most cases. There is also an argument for leaving certain words on the board to make it harder for the opponent to find large, clean separations, so optimal play may still include simple two-word clues on occasion. Very interesting application nonetheless. (A rough sketch of what I mean by optimizing the separation is below.)
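By "optimizing the separation" I mean something like the following toy scoring loop. Everything here is my own assumption: the stub embeddings, the margin parameter, and the greedy selection. A real agent would plug in an actual embedding model or its own semantic judgment:

    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def separation_score(clue_vec, team_vecs, other_vecs, margin=0.05):
        # A team word counts as "cleanly separated" only if it beats the
        # most clue-similar non-team word by a safety margin, so the
        # guesser can rank it above every opponent/neutral/assassin word.
        danger = max(cosine(clue_vec, v) for v in other_vecs)
        return sum(1 for v in team_vecs if cosine(clue_vec, v) > danger + margin)

    def best_clue(candidates, team_vecs, other_vecs):
        # Greedy: pick the candidate clue that cleanly separates the
        # most of your own words in one shot.
        return max(candidates,
                   key=lambda c: separation_score(candidates[c], team_vecs, other_vecs))

    # Stub demo with random vectors; real play would use model embeddings.
    rng = np.random.default_rng(0)
    words = ["sumo", "cherry", "ring", "shark", "invader", "tokyo", "bond"]
    emb = {w: rng.normal(size=64) for w in words}
    team = [emb[w] for w in ("sumo", "cherry", "ring")]
    other = [emb[w] for w in ("shark", "invader")]
    print(best_clue({w: emb[w] for w in ("tokyo", "bond")}, team, other))

The margin term is the design choice worth arguing about: a larger margin trades clue coverage for safety, which is exactly the aggressive-versus-conservative dial the article's agents don't seem to be turning.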