Hacker News

When you're at Roblox's scale, it is often difficult to know in advance whether you will have a lower MTTR by rolling back or fixing forward. If it takes you longer to resolve a problem by rolling back a significant change than by tweaking a configuration file, then rolling back is not the best action to take.

Also, multiple changes may have confounded the analysis. Adjusting the Consul configuration may have been one of many changes made in the recent past, and changes in client load were certainly a possible culprit as well.



In most cases, if you've planned your deployment well (meaning in part that you've specified the rollback steps for your deployment) it's almost impossible to imagine rollback being slower than any other approach.

When I worked at Amazon, oncalls within our large team initially had leeway over whether to roll backwards or try to fix problems in situ ("roll forward"). Eventually, the amount of time wasted trying to fix things, and new problems introduced by this ad hoc approach, led to a general policy of always rolling back if there were problems (I think VP approval became required for post-deploy fixes that weren't just rolling back).

In this case, though, the deployment happened ages (a whole day!) before the problems erupted. The rollback steps wouldn't necessarily be valid (to your "multiple confounding changes" point). So there was no avoiding at least some time spent analyzing and strategizing before deciding to roll back.


Some changes are extremely hard to roll back, but this doesn’t sound like one of them. From their report, it sounds like the rollback process involved simply making a config change to disable the streaming feature; it took a bit to roll out to all nodes, and then Consul performance almost immediately returned to normal.
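For reference, Consul exposes streaming as a single agent configuration flag, so if the report is describing that mechanism, the revert would be roughly a one-line change (a sketch, not Roblox’s actual config):

```hcl
# Consul agent configuration (HCL) - hypothetical sketch.
# Setting use_streaming_backend to false makes blocking queries
# fall back to the older long-polling code path instead of the
# newer streaming backend.
use_streaming_backend = false
```

The change still has to propagate to every agent and take effect, which would account for the “took a bit to roll out” part.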

Blind rollbacks are one thing, but they identified Consul as the issue early on, and they clearly made a significant Consul config change shortly before the outage started, one that was also clearly quite reversible. Not even trying to roll that back is quite strange to me - that’s gotta be something you try within the first hour of the outage, never mind the first 50 hours.


> When you're at Roblox's scale

Yet a regional Consul deployment is the single point of failure. I apologize if that sounds sarcastic. There are obviously a lot of lessons to be learned, and blame has no place in this type of situation - nor do excuses.



