This article misses the point of the cloud. The argument for "live migrations" and the lambasting of AWS for bouncing servers for the Xen exploit are both completely misguided.
The point of the Cloud is not that a server never restarts or has downtime. It's that your app runs on many servers in different AZs and regions such that any one server failing is not going to have a real impact and can be easily and quickly replaced.
AWS, when bouncing their servers, did them one AZ and one region at a time. Because of that, if you ran across multiple AZs, you got to test your failure scenarios without being down.
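To make that concrete, here's a rough sketch of why a one-AZ-at-a-time restart is survivable when you're spread across AZs. The fleet, instance names and AZ labels are all made up for illustration; this is a toy model, not anything resembling AWS's actual maintenance process.

```python
# Hypothetical fleet: instance name -> availability zone (names are invented).
FLEET = {
    "web-1": "az-a",
    "web-2": "az-a",
    "web-3": "az-b",
    "web-4": "az-b",
    "web-5": "az-c",
}


def healthy_during_restart(fleet, restarting_az):
    """Return the instances still serving while one AZ is being bounced."""
    return sorted(name for name, az in fleet.items() if az != restarting_az)


def survives_rolling_restart(fleet):
    """True if at least one instance stays up during every per-AZ restart."""
    zones = set(fleet.values())
    return all(healthy_during_restart(fleet, az) for az in zones)


if __name__ == "__main__":
    for az in sorted(set(FLEET.values())):
        print(az, "restarting, still serving:", healthy_during_restart(FLEET, az))
    print("multi-AZ fleet survives:", survives_rolling_restart(FLEET))
    # Everything in a single AZ: the one restart takes the whole app down.
    print("single-AZ fleet survives:", survives_rolling_restart({"web-1": "az-a"}))
```

A fleet that lives in a single AZ fails this check, which is the same position you're in when a provider takes the whole cloud down at once.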
The author's focus on other cloud providers having restarts makes it sound like he's saying "Verizon is bad, but in bad company". There is no comparison to be made between taking down your whole cloud, and losing a server here and there.
Spot on. A random server could go down at any moment, either from a hardware failure or because your cloud provider decides it needs maintenance. The latter actually gives you an up-front warning; the former is just "bang". So any well-designed system, whether it runs on a cloud or on your own hardware, should be able to deal with this, since it's the reality of servers: sometimes one breaks.
All of the bigger Clouds (Amazon, Rackspace, SoftLayer) handle their security-related restarts this way: one area/region/DC at a time, carefully making sure that well-designed systems don't notice downtime.
What Verizon does is a totally different thing and something most "important but not life threatening" systems do not design for: Taking down the whole cloud with all servers at the same time.
What hasn't really been said well is that there are constant incidents going on in large cloud providers, and you don't need a maintenance period like this to experience failures. I work in a place where people are still used to individual machines being up constantly by IP alone and aren't practicing high availability architectures in their cloud environments. I had to explain to an architect that VMs can simply hiccup in AWS or any other provider and that just relaunching the instance is fine. The idea of random failures may be foreign, but I was certainly a bit surprised that someone with such a rudimentary understanding of infrastructure wouldn't be familiar with the idea of "have you tried turning it off and on again?"
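The "just relaunch it" idea boils down to something like the loop below. This is only a sketch: `is_healthy` and `launch` are hypothetical callbacks standing in for whatever health check and provisioning call your provider or tooling actually gives you, not a real API.

```python
def reconcile(instances, is_healthy, launch):
    """Keep healthy instances and replace each unhealthy one with a fresh launch.

    `is_healthy` and `launch` are hypothetical callbacks supplied by the caller;
    they stand in for a real health check and a real provisioning call.
    """
    fleet = []
    for instance in instances:
        if is_healthy(instance):
            fleet.append(instance)
        else:
            # Don't nurse the broken VM back to health, just replace it.
            fleet.append(launch())
    return fleet


if __name__ == "__main__":
    broken = {"vm-2"}  # pretend this one hiccuped
    counter = iter(range(100, 200))
    new_fleet = reconcile(
        ["vm-1", "vm-2", "vm-3"],
        is_healthy=lambda name: name not in broken,
        launch=lambda: f"vm-{next(counter)}",
    )
    print(new_fleet)
```

Nothing here cares which machine failed or what its IP was, which is the mindset shift for people used to individually named, always-up servers.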
With that said, I have zero hope that most enterprises will be able to get a decently solid failover strategy for their cloud applications, because an awful lot of them are barely able to get anything set up on even one cloud provider in the most basic layout. Half the folks I'm aware of don't even have cross-AZ redundancy, let alone cross-region failover. When people are turning off 3-4 VMs to save money, they're in no position to think about cross-provider load balancing.
Most of the enterprises I'm familiar with would throw Verizon more and more money in the blind hope that it'll make their service more reliable. They'll do this to an incredible dollar amount, partly because it'll still be cheaper than their old, traditional in-house IT that cost multiple orders of magnitude more but got way worse availability. Stuff like this scheduled maintenance shows how foolish it is to believe that someone will magically become ultra reliable by virtue of you paying them to be.