Look at Browser Use.
They self-reported 89% on WebVoyager. On the hard tasks of a real benchmark, they score 8.1%. That's not a performance drop. That's a different product from the one being advertised.
To be fair, this isn't just a Browser Use problem. Look at the drop-off for every agent as tasks get harder:
Operator goes from 83% easy → 43% hard. That's a 40-point cliff.
Claude Computer Use: 90% easy → 32% hard. 58-point drop.
Browser Use: 55% easy → 8% hard. A 47-point drop that leaves almost nothing.
TinyFish: 97.5% easy → 81.9% hard. 15.6-point drop.
The gap between easy and hard is where you see if a system actually works or if it's just good at simple tasks. Every other agent loses roughly half its success rate or more when tasks get complex. We lose 15.6 points.
That's the difference between "cool demo" and "I can actually ship this."
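If you want to check the drop-offs yourself, here is a minimal Python sketch. It uses only the easy and hard success rates quoted above (Browser Use at the rounded 8% figure); the names `rates` and `drop_off` are made up for illustration and are not part of any benchmark tooling. It reports both the absolute drop in points and the share of the easy-task score that is lost.

```python
# Easy/hard success rates (percent) as quoted in this post.
# Browser Use uses the rounded 8% hard-task figure.
rates = {
    "Operator": (83.0, 43.0),
    "Claude Computer Use": (90.0, 32.0),
    "Browser Use": (55.0, 8.0),
    "TinyFish": (97.5, 81.9),
}


def drop_off(easy, hard):
    """Return (absolute drop in points, relative drop as % of the easy score)."""
    absolute = easy - hard
    relative = 100.0 * absolute / easy
    return absolute, relative


# Hypothetical helper structure: agent name -> (absolute, relative) drop.
summary = {name: drop_off(easy, hard) for name, (easy, hard) in rates.items()}

for name, (absolute, relative) in summary.items():
    print(f"{name:20s} -{absolute:.1f} pts  ({relative:.1f}% of easy-task score lost)")
```

The relative column is the one behind the "roughly half or more" claim: Operator's 40-point drop is about 48% of its easy-task score, while TinyFish's 15.6 points is 16%.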