I would not have guessed Roblox was on-prem with such little redundancy. Later i...

otterley · on Jan 20, 2022

Since the issue's root cause was a pathological database software issue, Roblox would have suffered the same issue in the public cloud. (I am assuming for this analysis that their software stack would be identical.) Perhaps they would have been better off with other distributed databases than Consul (e.g., DynamoDB), but at their scale, that's not guaranteed, either. Different choices present different potential difficulties.

Playing "what-if" thought experiments is fun, but when the rubber hits the road, you often find that things that are stable for 99.99%+ of load patterns encounter previously unforeseen problems once you get into that far-right-hand side of the scale. And it's not like we've completely mastered squeezing performance out of huge CPU core counts on NUMA architectures while avoiding bottlenecking on critical sections in software. This shit is hard, man.

baskethead · on Jan 20, 2022

This is not true if they handled the rollout properly. Companies like Uber have two entirely separate datacenters, and during outages they fail over to either datacenter.

Everything is duplicated, which is potentially wasteful, but it ensures complete redundancy and acts as an insurance policy. If you roll out, you roll out to each datacenter separately. So in this case rolling out in one complete datacenter and waiting a day for their Consul streaming changes probably would have caught it.

Symbiote · on Jan 20, 2022

> So in this case rolling out in one complete datacenter and waiting a day for their Consul streaming changes probably would have caught it.

But this has nothing to do with cloud vs. colo.

baskethead · on Jan 20, 2022

The parent poster said that it would have happened even if they had cloud, i.e., another datacenter. That's my assumption for the comment.

As far as I can tell from reading, Roblox doesn't have multiple datacenters. I find that really hard to believe, so if that's not true, then my point would be incorrect. If it is true, then if they completely duplicated their datacenters, they would be able to switch one datacenter to streaming while keeping the other datacenter on the old setting until they validated that everything was fine. A slow rollout across datacenters like that would have caught the problem.
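
For illustration only, the staged rollout described above could be sketched roughly like this. Every name here (`staged_rollout`, `enable`, `disable`, `is_healthy`, `soak`) is hypothetical, and this is not how Roblox or Uber actually deploy; it just assumes you can flip a setting per datacenter, wait, and check health before moving on.

```python
from typing import Callable, Dict, List


def staged_rollout(
    datacenters: List[str],
    enable: Callable[[str], None],      # hypothetical: turn the new setting on in one datacenter
    disable: Callable[[str], None],     # hypothetical: revert that datacenter to the old setting
    is_healthy: Callable[[str], bool],  # hypothetical: health check after the soak period
    soak: Callable[[], None],           # hypothetical: wait out the observation window (e.g. a day)
) -> Dict[str, str]:
    """Enable a setting one datacenter at a time, stopping at the first failure."""
    status = {dc: "old" for dc in datacenters}
    for dc in datacenters:
        enable(dc)
        soak()
        if not is_healthy(dc):
            # Roll back only the datacenter that was just changed.
            disable(dc)
            status[dc] = "rolled_back"
            # Datacenters not yet reached stay on the old setting.
            break
        status[dc] = "new"
    return status
```

The point of the design is that a bad change only ever reaches one datacenter while the other keeps serving on the old setting.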

yuliyp · on Jan 21, 2022

Uber is also a service that has a much lower tolerance for downtime: If people can't play a game, they're sad. If they're trying to get a ride and it doesn't work, or drivers' apps stop working suddenly, the stranded people get very upset in a hurry, and the company loses a lot of customers.

It can be totally reasonable for Uber to pay for 2x the amount of infra they need for serving their products while not being worth it for a company like Roblox.

otterley · on Jan 20, 2022

The Consul streaming changes were rolled out months before the incident occurred.

baskethead · on Jan 20, 2022

You didn't read it properly. The changes were rolled out months before, but the switch to streaming based on that rollout was made 1 day before the incident. That was the root cause.

noahtallen · on Jan 20, 2022

I think the public cloud is a good choice for startups, teams, and projects which don't have infrastructure experience. Plenty of companies still have their own infrastructure expertise and roll their own CDNs, as an example.

Not only can one save a significant amount of money, it can also be simpler to troubleshoot and resolve issues when you have a simpler backend tech stack. Perhaps that doesn’t apply in this case, but there are plenty of use cases which don’t need a hundred microservices on AWS, none of which anyone fully understands.

nomel · on Jan 20, 2022

> But those seem irrelevant if usage and revenue go to zero when you can’t keep a service up

You're assuming the average profits lost are more than the average cost of doing things differently, which, according to their statement, is not the case.

dylan604 · on Jan 20, 2022

>I wonder about their ability to recruit the level of talent required to run a service at this scale.

According to this user's comments, it doesn't look like it'll be that tough for them:

https://news.ycombinator.com/item?id=30014748