Hacker News

When you're at Roblox's scale, it is often difficult to know in advance whether you will have a lower MTTR by rolling back or fixing forward. If it takes you longer to resolve a problem by rolling back a significant change than by tweaking a configuration file, then rolling back is not the best action to take.

Also, multiple changes may have confounded the analysis. Adjusting the Consul configuration may have been one of many changes made in the recent past, and changes in client load were certainly a possible culprit as well.



In most cases, if you've planned your deployment well (meaning in part that you've specified the rollback steps for your deployment) it's almost impossible to imagine rollback being slower than any other approach.

When I worked at Amazon, oncalls within our large team initially had leeway over whether to roll backwards or try to fix problems in situ ("roll forward"). Eventually, the amount of time wasted trying to fix things, and new problems introduced by this ad hoc approach, led to a general policy of always rolling back if there were problems (I think VP approval became required for post-deploy fixes that weren't just rolling back).

In this case, though, the deployment happened ages (a whole day!) before the problems erupted. The rollback steps wouldn't necessarily be valid (to your "multiple confounding changes" point). So there was no avoiding at least some time spent analyzing and strategizing before deciding to roll back.


Some changes are extremely hard to roll back, but this doesn’t sound like one of them. From their report, it sounds like the rollback process involved simply making a config change to disable the streaming feature; it took a bit to roll out to all nodes, and then Consul performance almost immediately returned to normal.
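For reference, Consul exposes streaming as a single agent configuration flag, so if the report is describing that mechanism, the revert would be roughly a one-line change (a sketch, not Roblox’s actual config):

```hcl
# Consul agent configuration (HCL) - hypothetical sketch.
# Setting use_streaming_backend to false makes blocking queries
# fall back to the older long-polling code path instead of the
# newer streaming backend.
use_streaming_backend = false
```

The change still has to propagate to every agent and take effect, which would account for the “took a bit to roll out” part.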

Blind rollbacks are one thing, but they identified Consul as the issue early on, and they clearly made a significant Consul config change shortly before the outage started, one that was also clearly quite reversible. Not even trying to roll that back is quite strange to me - that’s gotta be something you try within the first hour of the outage, never mind the first 50 hours.


> When you're at Roblox's scale

Yet a regional Consul deployment is the single point of failure. I apologize if that sounds sarcastic. There are obviously a lot of lessons to be learned, and blame has no place in this type of situation - nor do excuses.



