
Those are some really good thoughts on DR planning. I had never considered taking DR to such an extent.

How many companies really plan for an event where their entire infrastructure goes offline and their entire team gets killed? Do even companies like Google plan for this kind of event?



> How many companies really plan for an event where their entire infrastructure goes offline and their entire team gets killed?

Since 9/11, more than you might think. For example, Empire Blue Cross Blue Shield [1] had its HQ in the WTC.

https://www.computerworld.com/article/2585046/empire-blue-[1... cross-it-group-undaunted-by-wtc-attack--anthrax-scare.html


Fixed link: https://www.computerworld.com/article/2585046/empire-blue-cr...

And what a blast from the past:

> Some of the temporary locations, such as the W Hotel, required significant upgrades to their network infrastructure, Klepper said. "We're running a Gigabit Ethernet now here in the W Hotel,'' Klepper said, with a network connected to four T1 (1.54M bit/sec) circuits. That network supports the code development for a Web-based interface to the company's systems, which Klepper called "critical" to Empire's efforts to serve its customers. Despite the lost time and the lost code in the collapse of the World Trade Center towers, Klepper said, "we're going to get this done by the end of the year."

> Shevin Conway, Empire's chief technology officer, said that while the company lost about "10 days' worth" of source code, the entire object-oriented executable code survived, as it had been electronically transferred to the Staten Island data center.


The two I've worked for that took it that far were a Federal bank, and an energy company. I have no idea how far Google or other large software companies take their plans.

But based on my experience, the initial recovery planning is the hard part. The documentation to tell a new team how to do it isn't so painful once the base plan exists, although you do need to think ahead to make sure somebody at your back-up vendor has an account with enough access to set up all the other accounts that will need to be created, including authorization to spend money to make it happen.


The last company I worked for where I was (de facto) in charge of IT (small company, so I wore lots of hats) could have recovered even if both sites had burnt down and I'd been hit by a bus. I made sure that all code, data, and instructions to re-up everything existed off site, and that both of the most senior managers knew how to access it all, enough to hand a competent firm a memory stick and a password.

In some ways, losing your ERP and its backups would be harder to recover from than both sites burning down; insurance would at least cover the latter.
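
For anyone wanting to replicate that "memory stick and a password" setup, here's a minimal sketch of the idea. The paths, off-site host, and tooling are my own assumptions for illustration, not the setup described above:

    # Sketch: bundle code, data, and the recovery runbook into one
    # encrypted archive and push it off site. Paths and the off-site
    # host are placeholders.
    import subprocess
    import tarfile
    from datetime import date
    from pathlib import Path

    SOURCES = [Path("/srv/code"), Path("/srv/data"), Path("/srv/runbook")]
    ARCHIVE = Path(f"/tmp/dr-backup-{date.today()}.tar.gz")

    # 1. One archive containing everything needed to re-up the business.
    with tarfile.open(ARCHIVE, "w:gz") as tar:
        for src in SOURCES:
            tar.add(src, arcname=src.name)

    # 2. Symmetric encryption: a passphrase is all a recovery firm needs
    #    (gpg prompts for it interactively).
    subprocess.run(
        ["gpg", "--symmetric", "--cipher-algo", "AES256",
         "--output", f"{ARCHIVE}.gpg", str(ARCHIVE)],
        check=True,
    )

    # 3. Ship the encrypted archive somewhere outside both sites.
    subprocess.run(["scp", f"{ARCHIVE}.gpg", "offsite-host:/backups/"], check=True)

The point is the shape of the thing rather than the tools: everything a stranger would need, in one place, behind one secret that the senior managers hold.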


Yes, Google plans extensively and runs regular drills.

It's hearsay, but I was once told that achieving "black start" capability was a program that took many years and about a billion dollars. But they (probably) have it now.


"black start" for GCP would be something to see. Since the global root keys for Cloud KMS are kept on physical encrypted keys locked safes, accessible to only a few core personnel, that would be interesting, akin to a missile silo launch.


It would be amazing to see. But I hope we never have to.


So 'black start' is a program for starting over from scratch? The scale required for it would itself be amazing.


"Black start" is a term that refers to bringing up services when literally everything is down.

It's most often referred to in the electricity sector, where bringing power back up after a major regional blackout (think the 2003 NE blackout) is extremely nontrivial, since the normal steps to turn on a power plant usually require power: for example, operating valves in a hydro plant or blowers in a coal/gas/oil plant, synchronizing your generation with grid frequency, and having something to consume the power. Even operating the relays and circuit breakers to connect to the grid may require grid power.

The idea here is presumably that Google services have so many mutual dependencies that if everything were to go down, restarting would be nontrivial because every service would be blocked on starting up due to some other service not being available.
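
A toy way to see the problem (the service names and dependencies below are invented, not Google's actual architecture): model startup dependencies as a graph, and a clean restart exists only if that graph has a topological order. Any cycle means a set of services can only come up together via some special bootstrap path, which is what black-start planning has to provide.

    # Toy model of circular startup dependencies (names are made up).
    from graphlib import TopologicalSorter, CycleError

    # service -> set of services it needs running before it can start
    deps = {
        "storage": {"auth"},            # storage validates credentials at startup
        "auth":    {"storage"},         # auth loads its keys/config from storage
        "compute": {"storage", "auth"},
        "dns":     set(),
    }

    try:
        order = list(TopologicalSorter(deps).static_order())
        print("bring services up in this order:", order)
    except CycleError as exc:
        # storage <-> auth form a cycle: neither can start first without a
        # dedicated bootstrap path, i.e. black-start capability.
        print("no clean startup order; cycle:", exc.args[1])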


I work for a bank. We have to do a full DR test for our regulator every six months. That means failing all real production systems and running customer workloads in DR, for realsies, twice a year. We also have to do periodic financial stress tests - things like "$OTHER_BANK collapsed. What do you do?" - and be able to demonstrate what we'll do if our vendors choose to sever links with us or go out of business.

It's pretty much part of the basic day-to-day life in some industries.


The company I work for plans for that and it's definitely not FAANG. In fact, DR planning and testing is far more important than stuff like continuous integration, build pipelines, etc.



