That scenario is what Disaster Recovery plans are for. Every large company I've worked for has had recovery plans in place, including scenarios as disturbing as "All data centers and offices explode simultaneously, and all staff who know how it all works are killed in the blasts."
Not only do you have backups in place, you have documentation in place too, including a backup vendor who holds copies of that documentation and can staff up workers to get everything running again without any help from existing staff.
And we tested those scenarios. I'm not sure which dry runs were less fun: when you got paged at 3 AM to go to the DR site and restore the entire infrastructure from scratch... or when you got paged at 3 AM and were instructed to stay home and not communicate with anyone for 24 hours, to prove it could be done without you. (OK, so staying home was definitely more fun, but disturbing.)
This scenario isn't as far-fetched as people think. I was running a global deployment in 2012 when Hurricane Sandy hit the East Coast. The entire eastern seaboard went offline and stayed off for several days; some data centers were down for weeks. Our plan had covered that contingency, and we failed all of our US traffic over to Amazon's two west coast regions. Our downtime on the East Coast was around two minutes. Yet a sister company had only one data center, in downtown New York, and they were offline for weeks, scrambling to get a backup loaded and online.
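The core of that kind of regional failover can be sketched in a few lines. This is a hypothetical illustration, not our actual system: the region names, preference ordering, and health-check inputs are all invented.

```python
# Hypothetical sketch of region failover: prefer the region nearest the
# user, but fall back down the preference list when health checks fail.
# Region names and orderings are invented for illustration.

REGION_PREFERENCE = {
    # nearest-first ordering per coarse user location
    "us-east": ["us-east-1", "us-west-2", "us-west-1"],
    "us-west": ["us-west-2", "us-west-1", "us-east-1"],
}

def pick_region(user_area: str, healthy: set[str]) -> str:
    """Return the first healthy region in the user's preference list."""
    for region in REGION_PREFERENCE[user_area]:
        if region in healthy:
            return region
    raise RuntimeError("no healthy region available")

# Sandy-style event: the east coast region is marked unhealthy, so
# east-coast users are routed to a west coast region instead.
print(pick_region("us-east", {"us-west-1", "us-west-2"}))  # us-west-2
```

In practice this decision usually lives in DNS (e.g. health-check-based failover routing) rather than application code, which is what makes a two-minute cutover achievable.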
I worked for a regional company in the oil and gas industry where the HQ and both datacenters were in the same earthquake zone. A twice-per-century earthquake had a real risk of taking down both DCs and the HQ. The plan was for every gas station in the vertical to switch to a contingency mode: distributing critical emergency supplies and selling non-essential supplies using off-grid procedures.
Those are some really good thoughts on DR planning. I had never imagined DR being taken to such an extent.
How many companies really plan for an event where their entire infrastructure goes offline and their entire team gets killed? Do even companies like Google plan for this kind of event?
> Some of the temporary locations, such as the W Hotel, required significant upgrades to their network infrastructure, Klepper said. "We're running a Gigabit Ethernet now here in the W Hotel," Klepper said, with a network connected to four T1 (1.54M bit/sec) circuits. That network supports the code development for a Web-based interface to the company's systems, which Klepper called "critical" to Empire's efforts to serve its customers. Despite the lost time and the lost code in the collapse of the World Trade Center towers, Klepper said, "we're going to get this done by the end of the year."
> Shevin Conway, Empire's chief technology officer, said that while the company lost about "10 days' worth" of source code, the entire object-oriented executable code survived, as it had been electronically transferred to the Staten Island data center.
The two I've worked for that took it that far were a Federal bank, and an energy company. I have no idea how far Google or other large software companies take their plans.
But based on my experience, the initial recovery planning is the hard part. The documentation to tell a new team how to do it isn't so painful once the base plan exists, although you do need to think ahead to make sure somebody at your back-up vendor has an account with enough access to set up all the other accounts that will need to be created, including authorization to spend money to make it happen.
The last company I worked for where I was (de facto) in charge of IT (a small company, so I wore lots of hats) could have recovered even if both sites had burnt down and I'd been hit by a bus. I made sure that all code, data, and instructions to bring everything back up existed off-site, and that both of the most senior managers understood how to access everything, enough to hand it to a competent firm with a memory stick and a password.
In some ways, losing your ERP and its backups would be harder to recover from than both sites burning down; insurance would at least cover the latter.
Yes, Google plans extensively and runs regular drills.
It's hearsay, but I was once told that achieving "black start" capability was a program that took many years and about a billion dollars. But they (probably) have it now.
A "black start" for GCP would be something to see. Since the global root keys for Cloud KMS are kept on physical encrypted keys in locked safes, accessible to only a few core personnel, it would be quite a procedure, akin to a missile silo launch.
"Black start" is a term that refers to bringing up services when literally everything is down.
It's most often referred to in the electricity sector, where bringing power up after a major regional blackout (think the 2003 NE blackout) is extremely nontrivial, since the normal steps to turn on a power plant usually require power: for example, operating valves in a hydro plant or blowers in a coal/gas/oil plant, synchronizing your generation with grid frequency, and having something to consume the power; even operating the relays and circuit breakers to connect to the grid may require grid power.
The idea here is presumably that Google services have so many mutual dependencies that if everything were to go down, restarting would be nontrivial because every service would be blocked on starting up due to some other service not being available.
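To make the mutual-dependency problem concrete: a cold-start order exists only if the service dependency graph is acyclic, and a circular bootstrap dependency is exactly what breaks that. A toy sketch (the service names and dependencies are invented for illustration):

```python
from graphlib import CycleError, TopologicalSorter

# Toy dependency graph: each service lists the services that must be
# running before it can start. Names are invented for illustration.
deps = {
    "storage": [],
    "auth": ["storage"],
    "config": ["storage", "auth"],
    "frontend": ["config", "auth"],
}

# An acyclic graph yields a valid cold-start order.
order = list(TopologicalSorter(deps).static_order())
print(order)  # e.g. storage, then auth, then config, then frontend

# Now add a circular dependency: storage needs auth to decrypt its
# disks, but auth needs storage to read its user database.
deps["storage"] = ["auth"]
try:
    list(TopologicalSorter(deps).static_order())
except CycleError:
    print("no valid start order: a black-start procedure must break the cycle")
```

Breaking such a cycle is the essence of a black-start plan: some service needs a degraded bootstrap mode (local key material, a static config snapshot) that doesn't depend on the rest of the stack being up.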
I work for a bank. We have to do a full DR test for our regulator every six months. That means failing all real production systems and running customer workloads in DR, for realsies, twice a year. We also have to do periodic financial stress tests - things like "$OTHER_BANK collapsed. What do you do?" - and be able to demonstrate what we'll do if our vendors choose to sever links with us or go out of business.
It's pretty much part of the basic day-to-day life in some industries.
The company I work for plans for that and it's definitely not FAANG. In fact, DR planning and testing is far more important than stuff like continuous integration, build pipelines, etc.
> Every large company I've worked for has had recovery plans in place, including scenarios as disturbing as "All data centers and offices explode simultaneously, and all staff who know how it all works are killed in the blasts."
I sat in on a DR test where the moment one of the Auckland based ops team tried asking the Wellington lead, the boss stepped in and said "Wellington has been levelled by an earthquake. Everyone is dead or trying to get back to their family. They will not be helping you during the exercise."