Some changes are extremely hard to roll back, but this doesn’t sound like one of them. From their report, it sounds like the rollback simply involved making a config change to disable the streaming feature; it took a bit to roll out to all nodes, and then Consul performance almost immediately returned to normal.
Blind rollbacks are one thing, but they identified Consul as the issue early on, and had clearly made a significant Consul config change shortly before the outage started, one that was also clearly quite reversible. Not even trying to roll that back is quite strange to me - that’s gotta be something you try within the first hour of the outage, never mind the first 50 hours.