I remember some huge DDoS attacks like a decade ago, and people were speculating about who could be behind them. The three top theories were Russian intelligence, the Mossad, and this guy on 4chan who claimed to have a botnet doing it.
That was the start of living in the future for me.
The whole WSB press rodeo felt like something straight out of a postmodern novel, where some of the usernames being read out on TV were somewhere between absurd and repulsive.
The problem with tweets on transgender bathrooms is that you can be attacked for them by either side at any point in the future, so the user OverTheCounterIvermectin should have known better.
Curious what the internal "privacy" limitations are. Certainly FB must track the reddit username : FB account mapping internally even if they don't actually display it. It just makes sense.
Well, you want the right people to have access. If you're a small shop or act like one, that's your "top" techs.
If you're a mature, larger company, that's the team leads in your networking area who deal with that service (BGP routing, or routers in general).
Most likely Facebook et al. management never understood this could happen because it's "never been a problem before".
I can't fathom how they didn't plan for this. In any business of size, you have to change configuration remotely on a regular basis, and it's easy to lock yourself out doing it. Every single system we run has a local user with a random password that we can hand out for just this kind of circumstance...
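To make that concrete, here's a minimal sketch of what I mean by a break-glass local account; the account name, the useradd/chpasswd calls, and printing the password for offline escrow are all illustrative assumptions, not anyone's actual tooling.

    #!/usr/bin/env python3
    # Break-glass provisioning sketch (illustrative only; names and paths are made up).
    # Creates a local account with a fresh random password and prints it so it can
    # be escrowed offline for exactly the "locked out of remote access" scenario.
    import secrets
    import string
    import subprocess

    USERNAME = "breakglass"   # hypothetical local account name
    PASSWORD_LEN = 32

    def random_password(n: int) -> str:
        alphabet = string.ascii_letters + string.digits
        return "".join(secrets.choice(alphabet) for _ in range(n))

    def provision() -> str:
        password = random_password(PASSWORD_LEN)
        # Create the account if it doesn't exist yet; ignore the error if it does.
        subprocess.run(["useradd", "--create-home", USERNAME], check=False)
        # chpasswd reads "user:password" lines on stdin and sets the password.
        subprocess.run(["chpasswd"], input=f"{USERNAME}:{password}\n",
                       text=True, check=True)
        return password

    if __name__ == "__main__":
        print(f"Escrow this offline: {USERNAME} / {provision()}")

Run as root on each box and file the output somewhere that doesn't depend on the network being up.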
Organizational complexity grows super-linearly; in general, the number of people a company can hire per unit time is either constant or grows linearly.
Google once had a very quiet big emergency that was, ironically(1), initiated by one of their internal disaster-recovery tests. There's a giant high-security database containing the 'keys to the kingdom', as it were... passwords, salts, etc. that cannot be represented as one-time pads and are therefore potentially dangerous magic numbers for folks to know. During one disaster-recovery test, they attempted to confirm that if the system had an outage, it would self-recover.
It did not.
This tripped a very quiet panic at Google because while the company would tick along fine for a while without access to the master password database, systems would, one by one, fail out if people couldn't get to the passwords that had to be occasionally hand-entered to keep them running. So a cross-continent panic ensued because restarting the database required access to two keycards for NORAD-style simultaneous activation. One was in the wallet of an executive who was on vacation, and they had to be flown back to the datacenter to plug it in. The other one was stored in a safe built into the floor of a datacenter, and the combination to that safe was... In the password database. They hired a local safecracker to drill it open, fetched the keycard, double-keyed the initialization machines to reboot the database, and the outside world was none the wiser.
(1) I say "ironically," but the actual point of their self-testing is to cause these kinds of disruptions before chance does. They aren't generally supposed to cause user-facing disruption; sometimes they do. Management frowns on disruption in general, but when it's due to disaster recovery testing, they attach to that frown the grain of salt that "Because this failure-mode existed, it would have occurred eventually if it didn't occur today."
Thanks for telling this story as it was more amusing than my experiences of being locked in a security corridor with a demagnetised access card, looooong ago.
EDIT: I had mis-remembered this part of the story. ;) What was stored in the executive's brain was the combination to a second floor safe in another datacenter that held one of the two necessary activation cards. Whether they were able to pass it to the datacenter over a secure / semi-secure line or flew back to hand-deliver the combination I do not remember.
If you mean "Would the pick-pocket have access to valuable Google data," I think the answer is "No, they still don't have the key in the safe on the other continent."
If you mean "Would the pick-pocket have created a critical outage at Google that would have required intense amounts of labor to recover from," I don't know because I don't know how many layers of redundancy their recovery protocols had for that outage. It's possible Google came within a hair's breadth of "Thaw out the password database from offline storage, rebuild what can be rebuilt by hand, and inform a smaller subset of the company that some passwords are now just gone and they'll have to recover on their own" territory.
Maybe because they were planning for a million other possible things to go wrong, likely with higher probability than this. And busy with each day's pressing matters.
Anyone who has actually worked in the field can tell you that a deploy or config change going wrong at some point and wiping out your remote access / ability to deploy over it is incredibly, crazily likely.
That someone will win the lottery is also incredibly likely. That a given person will win the lottery is, on the other hand, vanishingly unlikely. That a given config change will go wrong in a given way is ... eh, you see where I'm going with this
Right, which is why you just roll in protection for all manner of config changes by taking pains to ensure there are always whitelists, local users, etc. with secure(ly stored) credentials available for use if something goes wrong; rather than assuming your config changes will be perfect.
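For example, one cheap pattern along those lines (just a sketch; apply_config, rollback, and the confirmation-file path are placeholders for whatever your tooling actually does) is a dead-man's switch: apply the change, then automatically revert it unless someone confirms they still have access within a window.

    #!/usr/bin/env python3
    # Dead-man's-switch config apply: sketch only, all names are placeholders.
    import os
    import time

    CONFIRM_WINDOW_SECONDS = 120          # how long the operator gets to confirm
    CONFIRM_FILE = "/tmp/confirm-change"  # hypothetical "I still have access" marker

    def operator_confirmed() -> bool:
        # The operator touches this file over whatever access path still works.
        return os.path.exists(CONFIRM_FILE)

    def apply_with_auto_rollback(apply_config, rollback) -> bool:
        """Apply a change, then revert it automatically unless it is confirmed
        in time; if the change cut us off, nobody can confirm and it rolls back."""
        apply_config()
        deadline = time.monotonic() + CONFIRM_WINDOW_SECONDS
        while time.monotonic() < deadline:
            if operator_confirmed():
                return True               # change kept
            time.sleep(5)
        rollback()                        # assume we locked ourselves out
        return False

Same spirit as the "commit confirmed" style features some router OSes ship, just spelled out by hand.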
I'm not sure it's possible to speculate in a way which is generic over all possible infrastructures. You'll also hit the inevitable tradeoff of security (which tends towards minimal privilege, aka single points of failure) vs reliability (which favours 'escape hatches' such as you mentioned, which tend to be very dangerous from a security standpoint).
Absolutely. Having worked in a couple of DCs for three years, I'd even call it a rite of passage to lock yourself out in some way. Low-level tooling like iLO/iDRAC can sure help with those situations, but it's often ignored or too heavily abstracted away.
Exactly! Obviously they have extremely robust testing and error catching on things like code deploys: how many times do you think they deploy new code a day? And at least from what I see personally, their error rate is somewhere below 1%.
Clearly something about their networking infrastructure is not as robust.
Most likely they did plan for this. Then, something happened that the failsafe couldn't handle. E.g. if something overwrites /etc/passwd, having a local user won't help. I'm not saying that specific thing happened here -- it's actually vanishingly unlikely -- but your plan can't cover every contingency.
Agreed. It's also worth mentioning that at the end of every cloud is real physical hardware, and that is decidedly less flexible than the cloud: if you lock yourself out of a physical switch or router, you have far fewer options.
In risk management cultures where consequences from failures are much, much higher, the saying goes that “failsafe systems fail by failing to be failsafe”. Explicit accounting for scenarios where the failsafe fails is a requirement. Great truths of the 1960s to be relearned, I guess.
My company runs copies of all our internal services in air-gapped data centers for special customers. The operators are just people with security clearance who have some technical skills. They have no special knowledge of our service inner workings. We (the dev team) aren’t allowed to see screenshots or get any data back. So yeah, I have done that sort of troubleshooting many times. It’s very reminiscent of helping your grandma set up her printer over the phone.
For all the hours I spent on the phone spelling out grep, ls, cd, pwd, raging that we didn't keep nano instead of fucking vim (and I'm a vim person)... I could have stayed young and been solving real customer problems, not imperium-typing on a fucking keyboard with a 5s delay because a colleague is lost in the middle of nowhere, can't remember what file he just deleted, and the system won't start anymore. Your software is fragile, just shite.
Yes, and it works if both parties are able to communicate using precise language. The onus is on the remote SME to exactly articulate steps, and on the local hands to exactly follow instructions and pause for clarifications when necessary.
Sometimes the DR plan isn't so much that I have to have a working key; I just have to know who gets there first with a working key, and "break glass" might be literal.
Not OP, but many times. Really makes you think hard about log messages after an upset customer has to read them line by line over the phone.
One was particularly painful, as it was a "funny" log message I had added to the code for when something went wrong. Lesson learned: never add funny / stupid / goofy fail messages in the logs. You will regret it sooner or later.
This is not new; this is everyday life with helping hands, on-duty engineers, L2/L3 levels telling people with physical access which commands to run, etc. etc. etc.
The places I've seen this at had specific verification codes for this. One had a simple static code per person that the hands-on guys looked up in a physical binder on their desk. Very disaster proof.
The other ones had a system on the internal network in which they looked you up, called back on your company phone and asked for a passphrase the system showed them. Probably more secure but requires those systems to be working.
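A toy sketch of that second flow, just to show the shape of it (the directory, the phone number, and the word list here are all made up): look the caller up in an internal system, call back on the number on record, and have them read back a one-time passphrase.

    #!/usr/bin/env python3
    # Callback-passphrase verification sketch; every name here is hypothetical.
    import secrets

    DIRECTORY = {"alice": "+1-555-0100"}  # engineer -> company phone on record
    WORDS = ["anchor", "basalt", "copper", "delta", "ember", "fjord"]

    def issue_challenge(engineer: str):
        # Look the caller up and generate a one-time passphrase to be read back
        # over a call placed to the number on record, not one the caller gives you.
        phone = DIRECTORY[engineer]       # unknown callers raise KeyError
        passphrase = "-".join(secrets.choice(WORDS) for _ in range(3))
        return phone, passphrase

    def verify(expected: str, spoken: str) -> bool:
        # Constant-time comparison of what the caller reads back.
        return secrets.compare_digest(expected, spoken)

As noted, the static-binder approach survives a network outage; this one doesn't, since the lookup system itself has to be reachable.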
This is not a real datacenter case but ordinary social engineering. On the datacenter side you have many more security checks, plus much of the time the helping hands and engineers are part of the same company, using internal communication tools, etc., so they are on the same logical footprint anyhow.
Welcome to the brave new world of troubleshooting. This will seriously bite us one day.