>but your self-managed hardware is definitely going to be less reliable than Amazon's infrastructure.
I don't buy this. I've seen many multi-datacenter self-managed deployments provide better uptime than Amazon Web Services. You are forgetting that when you own the hardware, you can actually orchestrate maintenance windows with live migrations, etc., and then take down an entire datacenter with no impact. Guess when Amazon does maintenance? That's right, you don't know, and one screw-up can mean instances in "degraded status" (a.k.a. you might as well terminate it and launch a new one) or all of S3 being down during critical business hours.
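For concreteness, here's roughly what that kind of orchestration can look like with libvirt/KVM. This is only a sketch under some assumptions: guest disks on shared storage, SSH between hypervisors, and "hv-01"/"hv-02" are made-up hostnames.

    # Rough sketch: drain a hypervisor before maintenance by live-migrating
    # its guests elsewhere. Assumes libvirt/KVM, guest disks on shared
    # storage, and SSH between hosts; "hv-01"/"hv-02" are made-up hostnames.
    import subprocess

    SOURCE = "qemu+ssh://hv-01/system"
    DEST = "qemu+ssh://hv-02/system"

    def running_domains(uri):
        out = subprocess.run(["virsh", "-c", uri, "list", "--name"],
                             capture_output=True, text=True, check=True).stdout
        return [name for name in out.splitlines() if name.strip()]

    for dom in running_domains(SOURCE):
        # --live keeps the guest running while its memory is copied over;
        # --persistent keeps the domain defined on the destination.
        subprocess.run(["virsh", "-c", SOURCE, "migrate", "--live",
                        "--persistent", dom, DEST], check=True)

Once the host is empty you can patch, reboot, or rack-swap it without anyone noticing; do that host by host and eventually site by site.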
Of course your own hardware in a single datacenter is going to be exposed to a higher probability of failure, but that's the equivalent of using a single instance in EC2 (and I have lost two of those in the last 7 years of managing 15 or so of them for a small company).
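Taking those numbers at face value, the back-of-the-envelope rate is nothing special, which is the point:

    # Back-of-the-envelope from the figures above: 2 instances lost out of
    # roughly 15, over 7 years.
    losses, instances, years = 2, 15, 7
    rate = losses / (instances * years)      # failures per instance-year
    print(f"{rate:.3f} failures per instance-year")   # ~0.019, i.e. ~2% a year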
I will admit that it takes strong ops skills to maintain high uptime on your own hardware, but that's just due to a lack of good open source tooling in this area. I would rather see a movement to improve tooling rather than continue to boost the stranglehold the public cloud is putting on everyone.
> I've seen many multi-datacenter self-managed deployments provide better uptime than Amazon web services.
Self-managed, multi-DC? Congrats on having a lot of money to blow, I guess.
Yes, with enough money you can match Amazon for uptime or scalability or whatever metric you prefer. For the same money you can probably buy triple the capacity in Amazon or your preferred cloud provider, so this is mostly a game for people with really deep pockets, really large scale, or really poor budgeting.
> You are forgetting that when you own the hardware, you can actually orchestrate maintenance windows with live migrations, etc and then take down an entire datacenter with no impact.
How many DCs are you talking about here? Are you self-managing in 4+ DCs? Or are you running in 2 DCs and your capacity is overbuilt by 100+%? In either case, deep pockets are nice to have.
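For what it's worth, the overbuild math behind that question is simple: to survive losing one of N sites, the remaining N-1 have to carry the full load. A quick sketch:

    # Overbuild needed to survive the loss of one of N datacenters: the
    # surviving N-1 sites must absorb the full load, so total capacity is
    # overbuilt by 1/(N-1).
    for n in (2, 3, 4, 6):
        print(f"{n} DCs: {1 / (n - 1):.0%} extra capacity")
    # 2 DCs: 100%, 3 DCs: 50%, 4 DCs: 33%, 6 DCs: 20%

Which is why 2 DCs means paying double, and the overhead only gets cheap once you're running in a lot of sites.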
Also, does your maintenance strategy seriously involve bringing down entire DCs? This is kind of absurd and makes half of me jealous of the bathtub full of cash you must bathe in. It makes the other half of me question some engineering decisions you've apparently made.
> all of S3 is down during critical business hours.
I have trouble believing people when they claim to do significantly better than Amazon (or another favorite cloud provider) for infrastructure uptime. If you stand up a fairly complex system composed of a number of loosely coupled services, you're going to end up experiencing some outages, because you'll face the same challenges as Amazon, and those guys aren't idiots. You'll lose your message queue due to a bug, or you'll lose a network switch and realize your failover takes 30 minutes to complete instead of the 5 seconds you hoped for, or you'll accidentally DDoS a subsystem while exercising a failover or doing a system upgrade, or something else. Complex systems fail, and when people tell me they built an "internet scale" system with better uptime than Amazon, I'm left to assume that they probably just do a bad job of tracking uptime, or else that their systems are not at the scale they imagine. Everyone who builds large systems experiences outages.
> I have trouble believing people when they claim to do significantly better than Amazon (or another favorite cloud provider) for infrastructure uptime.
That needs a dollar-for-dollar qualification, or something to that effect. It's possible, but very expensive.
There are, for instance, long-running (and I mean really long-running, many years or even decades) experiments where any amount of downtime would cause a do-over.
One of my customers had something like this on the go. The amount of money they spent on their power and network redundancy was off the scale, but they definitely had better uptime than Amazon.
Their problems were more along the lines of 'this piece of equipment is nearly EOL, how do we replace it without interrupting the work it does'.
Yes, sorry. I was assuming similar expense. Enough money can buy just about anything, including a few additional nines.
If your goal is to build at scale more reliably than Amazon, at the same or lower cost, that's tough, and you're unlikely to achieve it unless your scale is approaching Amazon's (and you have really good people).
>Self-managed, multi-DC? Congrats on having a lot of money to blow, I guess.
Putting a rack in a colo is still self-managed for the purposes of what I'm talking about. It's easy to get multiple datacenters when you rent the space and electricity but still own the hardware, and you can make agreements with various ISPs to get connectivity.
>How many DCs are you talking about here? Are you self-managing in 4+ DCs? Or are you running in 2 DCs and your capacity is overbuilt by 100+%? In either case, deep pockets are nice to have.
See comment above.
>Also, does your maintenance strategy seriously involve bringing down entire DCs? This is kind of absurd and makes half of me jealous of the bathtub full of cash you must bathe in. It makes the other half of me question some engineering decisions you've apparently made.
See comment above. "bringing down a DC" doesn't mean shutting everything off, it means from the perspective of your end users, your service is not available there.
> because you'll face the same challenges as Amazon and those guys aren't idiots.
No, but they have much different priorities. If all I want is static asset hosting, the loosely coupled microservice architecture you are referring to is complete overkill and results in the very instability you are claiming is normal.
>Complex systems fail and when people tell me they built an "internet scale" system with better uptime than Amazon, I'm left to assume that they probably just do a bad job of tracking uptime or else that their systems are not at the scale they imagine. Everyone who builds large systems experiences outages.
Nobody except Google and Microsoft is building something as complex as the entire AWS stack. The vast majority of AWS users are using a tiny percentage of the features that come with AWS and can get by on much simpler systems that are easier to reason about and maintain.
When you dump the majority of what Amazon is actually running, you have a much simpler system and architecture and actually can beat Amazon's uptime.
Amazon charges at least 15 to 20 times the going rate for bandwidth. So if you are serving large amounts of data, it could easily be the case that you can pay for enhanced uptime with just the savings on bandwidth alone.
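To put rough numbers on that (the prices and volume below are assumptions for illustration, not quotes; AWS egress has historically been around $0.09/GB at the lower tiers, and commodity transit at a colo is commonly an order of magnitude or more cheaper per GB):

    # Illustrative only: prices and volume here are assumptions, not quotes.
    tb_per_month = 100                 # hypothetical egress volume
    aws_per_gb = 0.09                  # roughly AWS's lower-tier egress rate
    colo_per_gb = 0.005                # assumed commodity transit, ~18x cheaper

    aws_cost = tb_per_month * 1000 * aws_per_gb
    colo_cost = tb_per_month * 1000 * colo_per_gb
    print(f"AWS: ${aws_cost:,.0f}/mo, colo: ${colo_cost:,.0f}/mo, "
          f"savings: ${aws_cost - colo_cost:,.0f}/mo")
    # roughly $9,000 vs $500 a month at these assumed rates

At that kind of delta, the bandwidth savings alone can fund a second site.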