>Trying to restore the replication process, an engineer proceeds to wipe the PostgreSQL database directory, errantly thinking they were doing so on the secondary. Unfortunately this process was executed on the primary instead. The engineer terminated the process a second or two after noticing their mistake, but at this point around 300 GB of data had already been removed.
I could feel the sweat drops just from reading this.
I'd bet every one of us has experienced the panicked Ctrl+C of Death at some point or another.
Brings back memories, though not of anything I did. Quoting a comment I made on HN recently in a different thread:
---
Back in 2009 we were outsourcing our ops to a consulting company, who managed to delete our app database... more than once.
The first time it happened, we didn't understand what, exactly, had caused it. The database directory was just gone, and it seemed to have disappeared around 11pm. I (not they!) discovered this and we scrambled to recover the data. We had replication, but for some reason the guy on call wasn't able to restore from the replica -- he was standing in for our regular ops guy, who was away on site with another customer -- so after he'd struggled for a while, I said screw it, let's just restore the last dump, which fortunately had run an hour earlier. After some time we were able to get a new master set up, and although we had lost one hour of data, it was fortunately from a quiet period with very few writes. Everyone went to bed around 1am and things were fine, the users were forgiving, and it seemed like a one-time accident. The techs promised that setting up a new replication slave would happen the next day.
Then, the next day, at exactly 11pm, the exact same thing happened! This obviously pointed to a regular maintenance job as the culprit. It turned out the script they used to rotate database backup files did an "rm -rf" of the database directory by accident. Again we scrambled to fix it. This time the dump was 4 hours old, and there was no slave we could promote to master. We restored the last dump, and I spent the night writing and running a tool that reconstructed the most important data from our logs (fortunately we logged a great deal, including the content of things users were creating). I was able to go to bed around 5am. The following afternoon, our main guy was called back to help fix things and set up replication. He had to travel back to the customer, and the last thing he told the other guy was: "Remember to disable the cron job".
Then at 10pm... well, take a guess. Kaboom, no database. Turns out they were using Puppet for configuration management, and when the on-call guy had fixed the cron job, he hadn't edited Puppet; he'd edited the crontab on the machine manually. So Puppet ran 15 mins later and put the destructive cron job back in. This time we called everyone, including the CEO. The department head cut his vacation short and worked until 4am restoring the master from the replication logs.
We then fired the company (which filed for bankruptcy not too long after), got a ton of money back (we threatened to sue for damages), and took over the ops side of things ourselves. Haven't lost a database since.
I'm no sysadmin, and I know mistakes are inevitable and all... but I find it unlikely that this kind of mistake would come from me. I feel as though a lot of developers are too nonchalant about production boxes. I think one or two close calls where I nearly did this exact thing served as a good wake-up call for me.
Steps I personally take to avoid this:
- Avoid prod boxes like the plague
- Set up a prompt (globally) to make it extremely obvious that you're in production. Something like a red background and black text saying "PRODUCTION"
- When changing data in production (DBs, config, etc.), write a script (or just commands to copy and paste) and have it peer reviewed. If anything doesn't go to plan, treat it as a red flag. This serves the dual purpose of giving you a quick record of your actions without hunting through logs.
- Never ever leave open sessions
- Avoid prod boxes. This is important enough for me to say twice. Most of the time it can be avoided, especially if you use configuration management tools and write tools to perform common operations.
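As a sketch of what the "PRODUCTION" prompt in the second step could look like: a hypothetical `~/.bashrc` fragment (the `/etc/PRODUCTION` marker file is an illustrative convention, not something from the thread).

```shell
# Hypothetical ~/.bashrc fragment for production hosts.
# If this host carries a marker file dropped by provisioning,
# prefix the prompt with "PRODUCTION" in black text on a red background.
if [ -f /etc/PRODUCTION ]; then
    PS1='\[\e[41;30m\] PRODUCTION \[\e[0m\]\u@\h:\w\$ '
fi
```

`\e[41;30m` selects a red background with black foreground, and the `\[`...`\]` wrappers tell readline those escapes are zero-width so line editing doesn't misalign.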
Now, let's just cross my fingers I don't jinx myself :-)
This sounds really cool. But what if you're working for Amazon or Google and you have 1M production boxes? A nice prompt won't save you; change management will. Write down the exact steps, read them yourself, get others to read them, and then execute them line by line. In my experience this is a much better approach than avoiding production boxes or having a red prompt, and it also scales to larger infrastructures. If something goes sideways (and in some cases it will), you can pinpoint the root cause quickly.
It may be unlikely to come from you, but it's for damn sure that it'll never happen to said engineer again.
Also, I would make sure to have a different prompt than default for non-prod systems too. That way you know to be suspicious if it hasn't been changed from default.
I don't think most of your points really apply, though. They were setting up replication in production, so they had to work on production boxes. Setting the prompt to say just "production" wouldn't help for the same reason: production was intended.
Peer review though - yes. That could help. I wouldn't say "I'm unlikely to make that mistake" - it's likely to go on the famous last words list...
Sure, maybe the point about "PRODUCTION" may not have applied; I was mainly commenting on the OP's post about how we've all been in such a situation, and the steps I take to avoid it. I'm curious about your arguments on the remaining 3/4 of the steps and how those wouldn't have applied to this situation?
The red PS1 would've clearly indicated to the engineer that he was typing `rm -rf ...` on the _master_, not the secondary. This assumes that the master and secondary would have differing prompts based on their relative importance.
That would help, but that's not what OP advocated. Sure, you can improve on those ideas. I was mainly pointing out that saying "it's unlikely to happen to me" was a bit dangerous and too sure, if most of the reasons do not apply to the situation.
Would the steps I describe have prevented the actions taken in the GitLab incident? I wouldn't make any assumptions about that. Maybe. Maybe not. Did I say following those steps would make it unlikely to happen to you? No. That's why I prefaced it with "I'm not a sysadmin." Would they prevent the cases described by the person I was responding to? Absolutely. Not 100% of the time, but some percentage of the time.
So, I'll say it more clearly, and you can mark my words. It's unlikely I'll ever log into a production system, type the wrong command, and do something bad as a result.
Could I deploy code that does very bad things to production? Yes. It'll probably happen to me. Is that the situation described above? No.
I treat logging into a production system as if one wrong move could result in me losing my job. Why? Because one wrong move could result in me losing my job. I'm not joking when I say I avoid logging into a production system like the plague. It's unlikely to happen to me because it's extremely rare for me to put myself in a situation where I could let this happen. There are almost always better alternatives that I'll resort to, well before doing anything like this.
I messed up an XP computer at home with `cd D:\backups\something; del /s * ` many years ago; `cd` without the /D flag doesn’t change the drive, so although D:\backups\something was the working directory on the D: drive, the working directory was still C:\WINDOWS\system32, and cmd.exe was running as administrator.
Fortunately disks were slower back then, so it hadn’t deleted too many files when I interrupted it, and the computer was able to be recovered without too much inconvenience.
I did the Windows equivalent once a long time ago (I think it's deltree?) and I did it on a university computer system. It cleared out a TON of files and the computer itself pretty much stopped working. I had to hard turn it off.
Fortunately the University was using some tool that re-images the computer on each boot, before Windows loads, so after starting it back up all the deleted system and application files were back.
Something else that is really useful in these situations (in bash at least) is Alt-* (Alt-Shift-8). It will expand a directory or glob into all affected top-level files/directories.
For example, it will expand `ls *` to `ls foo bar baz`, etc.
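To see concretely what that keystroke inserts (the readline command behind Alt-* is glob-expand-word), here is a small sketch; the scratch directory and file names are made up:

```shell
# Sketch of the expansion Alt-* performs on the command line:
# the shell replaces the glob with the sorted list of matching names.
demo=$(mktemp -d)   # fresh scratch directory so the glob is predictable
cd "$demo"
touch foo bar baz
echo *              # prints: bar baz foo  (what `ls *` would become)
```

Previewing a glob with `echo` like this is a handy sanity check before running anything destructive on the same pattern.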
Potential outage prevention plan: put an alias on all production machines that emails HR to schedule a disciplinary meeting every time you run `rm -rf`.
At a small web host, early in my career, I once saw the boss blur past my desk towards the server room, throw open the big vault door, and disappear inside.
Turns out he had accidentally executed an rm of the home dir on a major web server in the background, so in a panic, instead of killing the right PID, he just ran to the server and pulled the power cords. :D