> Running all Roblox backend services on one Consul cluster left us exposed to an outage of this nature. We have already built out the servers and networking for an additional, geographically distinct data center that will host our backend services. We have efforts underway to move to multiple availability zones within these data centers; we have made major modifications to our engineering roadmap and our staffing plans in order to accelerate these efforts.
If they were in AWS, they could have run Consul across multiple AZs and rolled out changes incrementally.
So that next time they can spend 96 hours on recovery, this time adding a split-brain issue to the list of problems to deal with. Jokes aside, the write-up is quite good; after thinking about all the problems they had to deal with, I was quite humbled.
It doesn't really explain how they reached the conclusion that that would help. Like, yes, it's a problem that they had a giant Consul cluster that was a SPOF, but you can run multiple smaller Consul clusters within a single AZ if you want.
Honestly it reads to me like an internal movement for a multi-AZ deployment successfully used this crisis as an opportunity.