
Availability Zones aren't the same thing as regions. AWS regions contain multiple Availability Zones. Individual Availability Zones publish lower reliability SLAs, so you need to load balance across multiple independent Availability Zones in a region to reach higher reliability. Per-AZ SLAs are discussed in more detail here [1]
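To make the load-balancing point concrete, here's a minimal sketch. It assumes fully independent AZ failures, and the per-AZ availability number is illustrative, not AWS's published figure:

```python
def combined_availability(per_az: float, n_azs: int) -> float:
    """Availability when load balanced across n AZs, assuming failures
    are fully independent: the system is down only when every AZ is
    down at the same time."""
    return 1 - (1 - per_az) ** n_azs

# One AZ at 99.5% vs. the same workload spread across three AZs.
print(combined_availability(0.995, 1))  # 0.995
print(combined_availability(0.995, 3))  # ~0.999999875
```

The whole argument about common-mode failures downthread is about whether that independence assumption actually holds.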

(N.B. I find HN commentary on AWS outages pretty depressing because it becomes pretty obvious that folks don't understand cloud networking concepts at all.)

[1]: https://aws.amazon.com/compute/sla/



> (N.B. I find HN commentary on AWS outages pretty depressing because it becomes pretty obvious that folks don't understand cloud networking concepts at all.)

What he said was perfectly cogent.

Outages in us-east-1 AZ us-east-1a have caused outages in us-west-1a, which is a different region and a different AZ.

Or, to put it in the terms of reliability engineering: even though these are abstracted as independent systems, in reality there are common-mode failures that can cause outages to propagate.

So, if you span multiple availability zones, you are not spared from events that will impact all of them.


> Or, to put it in the terms of reliability engineering: even though these are abstracted as independent systems, in reality there are common-mode failures that can cause outages to propagate.

It's up to the _user_ of AWS to design around this level of reliability. This isn't any different than not using AWS. I can run my web business on the super cheap by running it out of my house. Of course, then my site's availability is based around the uptime of my residential internet connection, my residential power, my own ability to keep my server plugged into power, and general reliability of my server's components. I can try to make things more reliable by putting it into a DC, but if a backhoe takes out the fiber to that DC, then the DC will become unavailable.

It's up to the _user_ to architect their services to be reliable. AWS isn't magic reliability sauce you sprinkle on your web apps to make them stay up longer. AWS clearly states in their SLA pages what their EC2 instance SLAs are in a given AZ: it's 99.5% availability for a given EC2 instance in a given region and AZ. This is roughly ~1.82 days, or ~43.8 hours, of downtime in a year. If you add a SPOF around a single EC2 instance in a given AZ then your system has a 99.5% availability SLA. Remember the cloud is all about leveraging large amounts of commodity hardware instead of large, high-reliability mainframe-style design. This isn't a secret. It's openly called out, like in Nishtala et al.'s "Scaling Memcache at Facebook" [1] from 2013!
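The 99.5% figure works out as claimed; a quick back-of-the-envelope check, using a 365-day year:

```python
def annual_downtime_hours(sla: float) -> float:
    # Hours per year a service can be down while still meeting the SLA.
    return (1 - sla) * 365 * 24

hours = annual_downtime_hours(0.995)
print(hours)       # ~43.8 hours
print(hours / 24)  # ~1.82 days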

The background of all of this is that it costs money, in terms of knowledgeable engineers (not like the kinds in this comment thread who are conflating availability zones and regions) who understand these issues. Most companies don't care; they're okay with being down for a couple days a year. But if you want to design high reliability architectures, there are plenty of senior engineers willing to help, _if_ you're willing to pay their salaries.

If you want to come up with a lower cognitive overhead cloud solution for high reliability services that's economical for companies, be my guest. I think we'd all welcome innovation in this space.

[1]: https://www.usenix.org/system/files/conference/nsdi13/nsdi13...


During a recent AWS outage, the STS service running in us-east-1 was unavailable. Unfortunately, all of the other _regions_ - not AZs, but _regions_ - rely on the STS service in us-east-1, which meant that customers who had built around Amazon’s published reliability model had services in every region impacted by an outage in one specific availability zone.

This is what kreeben was referring to - not some abstract misconception about the difference between AZs and Regions, but an actual real world incident in which a failure in one AZ had an impact in other Regions.


It's more subtle than that.

For high availability, STS offers regional endpoints -- and AWS recommends using them[1] -- but the SDKs don't use them by default. The author of the client code, or the person configuring the software, has to enable them.

[1] https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credenti...

(I work for AWS. Opinions are my own and not necessarily those of my employer.)


The client code which defaults to STS in us-east-1 includes the AWS console website, as far as I can tell.

Real question, though - are those genuinely separate endpoints that remained up and operational during the outage? I don’t think I saw or knew a single person unaffected by this outage, so either there’s still some bleed over on the backend or knowledge of the regional STS endpoints is basically zero (which I can believe; y’all run a big shop)


My team didn't use STS but I know other teams at the company did. Those that relied on non-us-east-1 endpoints did stay up, IIRC. Our company barely uses the AWS console at all and bases most of our stuff around their APIs to hook into our deployment/CI processes. But I don't work at AWS, so I don't know whether there was some other backend replication lag or anything else going on that was impacted by us-east-1 being down. We had some failures for some of our older services that were not properly sharded out, but most of our stuff failed over and continued to work as expected.


> Unfortunately, all of the other _regions_ - not AZs, but _regions_, rely on the STS service in us-east-1, which meant that customers which had built around Amazon’s published reliability model had services in every region impacted by an outage in one specific availability zone.

That's not true. STS offers regional endpoints, for example if you're in Australia and don't want to pay the latency cost to transit to us-east-1 [1]. It's up to the user to opt into them though. And that goes back to what I was saying earlier, you need engineers willing to read their docs closely and architect systems properly.

[1]: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credenti...


> knowledgable engineers (not like the kinds in this comment thread who are conflating availability zones and regions)

I think this breaks the site guidelines. Worse, I don't think the other people are wrong: being in a different region implies being in a different availability zone.

That is, I've read the comments to say "they're not only in different AZ's, they're in different regions". You seem determined to pick a reading that lets you feel smugly superior about your level of knowledge, and then cast digs at other people based on that presumed superiority.


> Worse, I don't think the other people are wrong: being in a different region implies being in a different availability zone.

Availability zones do not map across regions. AZs are specific to a region. Different regions have differing numbers of AZs. us-east-1 has 3. IIRC ap-southeast-1 has 2.

> That is, I've read the comments to say "they're not only in different AZ's, they're in different regions"

So I've read. The earlier example about STS that someone brought up was incorrect; both I and another commenter linked to the doc with the correct information.

> It seems you seem determined to pick a reading that lets you feel smugly superior about your level of knowledge, and then cast out digs at other people based on that presumed superiority.

You obviously feel very strongly about this. You've replied to my parent twice now. You're right that the parenthetical was harsh but I wouldn't say it's uncalled for.

Every one of these outage threads descends into a slew of easily defensible complaints about cloud providers. The quality of these discussions is terrible. I spend a lot of time at my dayjob (and as a hobby) working on networking related things. Understanding the subtle guarantees offered by AWS is a large part of my day-to-day. When I see people here make easily falsifiable comments full of hearsay ("I had a friend of a friend who works at Amazon and they did X, Y, Z bad things") and use that to drum up a frenzy, it flies in the face of what I do everyday. There's lots of issues with cloud providers as a whole and AWS in particular but to get to that level of conversation you need to understand what the system is actually doing, not just get angry and guess why it's failing.


> > being in a different region implies being in a different availability zone.

> Availability zones do not map across regions. AZs are specific to a region. Different regions have differing numbers of AZs. us-east-1 has 3. IIRC ap-southeast-1 has 2.

Right... So if you are in a different region, you are by definition in a different availability zone.

> You obviously feel very strongly about this. You've replied to my parent twice now. You're right that the parenthetical was harsh but I wouldn't say it's uncalled for.

Yah, I really thought about it and you're just reeking of unkindness. And the people above that you're replying to and mocking are not wrong.

> Every one of these outage threads descends into a slew of easily defensible complaints about cloud providers. The quality of these discussions is terrible. I spend a lot of time at my dayjob (and as a hobby) working on networking related things. Understanding the subtle guarantees offered by AWS is a large part of my day-to-day.

If you're unable to be civil about this, maybe you should avoid the threads. Amazon seeks to avoid common-mode failures between AZs (and thus regions). This doesn't mean that Amazon attains this goal. And the larger point: as I'm sure you're aware, building a distributed system that attains higher uptimes by crossing multiple AZs is hard and costly and can only be justified in some cases.

I've got >20 years of experience in building geographically distributed, sharded, and consensus-based systems. I think you are being unfair to the people you're discussing with. Be nice.


> Amazon seeks to avoid common-mode failures between AZs (and thus regions).

there is a distinction between azs within a region vs azs in different regions. the overwhelming majority of services are offered regionally and provide slas at that level. services are expected to have entirely independent infrastructure for each region, and cross-regional/global services are built to scope down online cross regional dependencies as much as possible.

the specific example brought up (cross regional sts) is wrong in the sense that sts is fully regionalized as evidenced by the overwhelming number of aws services that leverage sts not having a global meltdown. but as others mentioned in a lot of ways it’s even worse because customers are opted into the centralized endpoint implicitly.


> If you're unable to be civil about this, maybe you should avoid the threads.

I didn't read my tone as uncivil, just harsh. I guess it came across harsher than intended. I'll try to cool it a bit more next time, but I have to say it's not like the rest of HN is taking this advice to heart when they're criticizing AWS. I realize that this isn't a defense (whataboutism), but I guess it's fine to "speak truth to power" or something? Anyway, point noted and I'll try to keep my snark down.

> Amazon seeks to avoid common-mode failures between AZs (and thus regions). This doesn't mean that Amazon attains this goal. And the larger point: as I'm sure you're aware, building a distributed system that attains higher uptimes by crossing multiple AZs is hard and costly and can only be justified in some cases.

Right, so which common mode failures are occurring here? What I'm seeing in this thread and previous AWS threads is a lot of hearsay. Stuff like "the AWS console isn't loading" or "I don't have that problem on Linode!" or "the McDonalds app isn't working so everything is broken thanks to AWS!" I'd love to see a postmortem document, like this, actually uncover one of these common mode failures. Not because I doubt they exist (any real system has bugs and I have no doubt a real distributed system has real limitations); I just haven't seen it borne out in real-world experience at my current company and other companies I've worked at which used AWS pretty heavily.

While I don't work at AWS, my company also publishes an SLA and we refund our customers when we dip below that SLA. When an outage, SLA-impacting or not, occurs, we spend a _lot_ of time getting to the bottom of what happened and documenting what went wrong. Frequently it's multiple things that go wrong which cause a sort of cascading failure that we didn't catch or couldn't reproduce in chaos testing. Part of the process of architecting solutions for high scale (~ billions/trillions of weekly requests) is to work through the AWS docs and make sure we select the right architecture to get the guarantees we seek. I'd like to see evidence of common-mode failures, and the defensive guarantees that failed in order to show proof of them, or proof positive through a dashboard or something, before I'm willing to malign AWS so easily.

> And the larger point: as I'm sure you're aware, building a distributed system that attains higher uptimes by crossing multiple AZs is hard and costly and can only be justified in some cases.

Sure if you're not operating high reliability services at high scale, it's true, you don't need cross-AZ or cross-region failover. But if you chose, through balance sheet or ignorance, not to take advantage of AWS's reliability features then you shouldn't get to complain that AWS is unreliable. Their guarantees are written on their SLA pages.


> I realize that this isn't a defense (whataboutism), but I guess it's fine to "speak truth to power" or something?

... I still don't think your starting assertion that the other people don't understand regions vs. AZs is correct, and it led you to repeatedly assert that the people you were talking to are unskilled.

I could very easily use the same words as them, and I have decade-old spreadsheets where I was playing with different combinations of latencies for commits and correlation coefficients for failures to try and estimate availability.

> Right, so which common mode failures are occurring here? What I'm seeing in this thread and previous AWS threads is a lot of hearsay. Stuff like "the AWS console isn't loading" or "I don't have that problem on Linode!" or "the McDonalds app isn't working so everything is broken thanks to AWS!" I'd love to see a postmortem document, like this, actually uncover one of these common mode failures. Not because I doubt they exist (any real system has bugs and I have no doubt a real distributed system has real limitations); I just haven't seen it borne out in real-world experience at my current company and other companies I've worked at which used AWS pretty heavily.

I remember 2011, when EBS broke across all US-EAST AZs, lots of control plane services were impacted, and you couldn't launch instances across all AZs in all regions for 12 hours.

Now maybe you'll be like "pfft, a decade ago!". I do think Amazon has significantly improved architecture. At the same time, AZs and regions being engineered to be independent doesn't mean they really are. We don't attain independent, uncorrelated failures on passenger aircraft, let alone these more complicated, larger, and less-engineered systems.

Further, even if AWS gets it right, going multi-AZ introduces new failure modes. Depending on the complexity of data model and operations on it, this stuff can be really hard to get right. Building a geographically distributed system with current tools is very expensive and there's no guarantee that your actual operational experience will be better than in a single site for quite some time of climbing the maturity curve.

> Their guarantees are written on their SLA pages.

Yup, and it's interesting to note that their thresholds don't really assume independence of failures. E.g. .995/.990/.95 are the thresholds for instances and .999/.990/.950 are the uptime thresholds for regions.

If Amazon's internal costing/reliability engineering model assumed failures would be independent, they could offer much better SLAs for regions safely (e.g., back of the envelope, 1 − (0.005 × 0.005) × C(3,2) ≈ 0.999925). Instead, they imply that they expect multi-AZ has a failure distribution that's about 5x better for short outages and about the same for long outages.
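Spelling out that back-of-the-envelope number (illustrative only: it assumes the 99.5% instance figure applies per AZ and that AZ failures are independent):

```python
from math import comb

p = 0.005                        # assumed per-AZ failure probability
pairs = comb(3, 2)               # ways to pick 2 of 3 AZs in the region
implied_sla = 1 - pairs * p * p  # chance no pair of AZs fails together
print(implied_sla)  # ~0.999925
```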

And note there's really no SLA asserting independence of regions... You just have the instance level and region level guarantees.

Further, note that the SLA very clearly excludes some causes of multi-AZ failures within a region. Force majeure, and regional internet access issues beyond the "demarcation point" of the service.


Yes, but the underlying point you're willfully missing is:

You can't engineer around AWS AZ common-mode failures using AWS.

The moment you have failures that are common-mode rather than independent, you can't just multiply failure probabilities together to compute your outage times.
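A toy illustration of that point, with made-up numbers (not AWS data): even a small common-mode term dominates the naive product of independent failure rates.

```python
p_az = 0.005      # assumed probability any one AZ is down
p_common = 0.001  # assumed probability a shared dependency takes out both

# Naive model: failures are independent, so multiply.
independent = p_az * p_az

# With a common mode: both AZs are down if the shared dependency fails,
# or (failing that) if both happen to fail on their own.
common_mode = p_common + (1 - p_common) * p_az * p_az

print(independent)  # ~2.5e-05
print(common_mode)  # ~1.02e-03, roughly 40x worse
```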


Yup, so true. People think redundant == 100% uptime, or that when they advertise 99.9% uptime, it's the same thing as 100% minus a tiny bit for "glitches".

It's not. .1% of 36524 = 87.6 hours of downtime - that's over 3 days of complete downtime every year!

For a more complete list of their SLA's for every service: https://aws.amazon.com/legal/service-level-agreements/?aws-s...

They only refund 100% when availability falls below 95%! Between 95% and 99%, the credit is 30%. I believe the real target is above 99.9% though, as that results in a 0% refund to the customer. What that means is, 3 days of downtime is acceptable!

Alternatively, you can return to your own datacenter and find out first hand that it's not particularly as easy to deliver that as you may think. You too will have power outages, network provider disruptions, and the occasional "oh shit, did someone just kick that power cord out?" or complete disk array meltdowns.

Anywho, they have a lot more room in their published SLA's than you think.

Edit: as someone correctly pointed out, I made a typo in my math; it is only ~9 hours of allotted downtime. Keep in mind that this is per service, though - meaning each service can have a different 9 hours of downtime before they need to pay out 10% for that one service. I still stand by my statement: their SLAs have a lot of wiggle room that people should take more seriously.


As someone else said, your math is off. Your point is still reasonable, though.

The uptime.is website is a handy resource for these calculations. For example, http://uptime.is/99.9 says

"SLA level of 99.9 % uptime/availability results in the following periods of allowed downtime/unavailability:

    Daily: 1m 26s
    Weekly: 10m 4s
    Monthly: 43m 49s
    Quarterly: 2h 11m 29s
    Yearly: 8h 45m 56s"


Your computation is incorrect, 3 days out of 365 is 1% of downtime, not 0.1%. I believe your error stems from reporting .1% as 0.1. Indeed:

0.001 (.1%) * 8760 (365d*24h) = 8.76h

Alternatively, the common industry standard in infrastructure (the place I work at at least,) is 4 nines, so 99.99% availability, which is around 52 mins a year or 4 mins a month iirc. There's not as much room as you'd think! :)
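The arithmetic generalizes to any SLA level and reporting period; a quick sketch reproducing the figures quoted above (365-day year assumed):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored

def allowed_downtime_minutes(sla: float, periods_per_year: int = 1) -> float:
    """Minutes of downtime allowed per period at a given SLA level."""
    return (1 - sla) * MINUTES_PER_YEAR / periods_per_year

print(allowed_downtime_minutes(0.999))       # ~525.6 min/year, i.e. ~8.76 h
print(allowed_downtime_minutes(0.9999))      # ~52.6 min/year
print(allowed_downtime_minutes(0.9999, 12))  # ~4.4 min/month
```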


> Yup, so true. People think redundant == 100% uptime, or that when they advertise 99.9% uptime, it's the same thing as 100% minus a tiny bit for "glitches".

Maybe this is the problem. 99.9% isn't being used by AWS the way people use it in conversation; it has a definite meaning, and they'll refund you based on that definite meaning.


>> you need to load balance across multiple independent availability zones

The only problem with that is, there are no independent availability zones.

What we do have, though, is an architecture where errors propagate cross-zone until they can't propagate any further, because services can't take any more requests, because they froze, because they weren't designed for a split brain scenario, and then, half the internet goes down.


> The only problem with that is, there are no independent availability zones.

There are - they can be as independent as you need them to be.

Errors won't necessarily propagate cross-zone. If they do, someone either screwed up, or they made a trade-off. Screwing up is easy, so you need to do chaos testing to make sure your system will survive as intended.


I'm not talking about my global app. I'm talking about the system I deploy to, the actual plumbing, and how a huge turd in a western toilet causes east's sewerage system to over-flow.



