>Trying to restore the replication process, an engineer proceeds to wipe the Pos...

atombender · on Feb 11, 2017

Brings back memories, though not of anything I did. Quoting a comment I made on HN recently in a different thread:

---

Back in 2009 we were outsourcing our ops to a consulting company, who managed to delete our app database... more than once.

The first time it happened, we didn't understand what, exactly, had caused it. The database directory was just gone, and it seemed to have gone around 11pm. I (not they!) discovered this and we scrambled to recover the data. We had replication, but for some reason the guy on call wasn't able to restore from them -- he was standing in for our regular ops guy, who was away on site with another customer -- so after he'd struggled for a while, I said screw it, let's just restore the last dump, which fortunately had run an hour earlier; after some time we were able to get a new master set up, and although we had lost one hour of data, it was fortunately from a quiet period with very little writes. Everyone went to bed around 1am and things were fine, the users were forgiving, and it seemed like a one-time accident. The techs promised that setting up a new replication slave would happen the next day.

Then, the next day, at exactly 11pm, the exact same thing happened! This obviously pointed to a regular maintenance job as being the culprit. It turns out the script they used to rotate database backup files did an "rm -rf" of the database directory by accident. Again we scrambled to fix. This time the dump was 4 hours old, and there was no slave we could promote to master. We restored the last dump, and I spent the night writing and running a tool that reconstructed the most important data from our logs (fortunately we logged a great deal, including the content of things users were creating). I was able to go bed around 5am. The following afternoon, our main guy was called back to help fix things and set up replication. He had to travel back to the customer, and the last things he told the other guy was: "Remember to disable the cron job".

Then at 10pm... well, take a guess. Kaboom, no database. Turns out they were using Puppet for configuration management, and when the on-call guy had fixed the cron job, he hadn't edited Puppet; he'd edited the crontab on the machine manually. So Puppet ran 15 mins later and put the destructive cron job back in. This time we called everyone, including the CEO. The department head cut his vacation short and worked until 4am restoring the master from the replication logs.

We then fired the company (which filed for bankruptcy not too long after), got a ton of money back (we threatened to sue for damages), and took over the ops side of things ourselves. Haven't lost a database since.

lojack · on Feb 11, 2017

I'm no sysadmin, and I know mistakes are inevitable and all... but I find this kind of mistake is unlikely to come from me. I feel as though a lot of developers are too nonchalant about production boxes. I think one or two close calls where I nearly did this exact thing served as a good wakeup call for me.

Steps I personally take to avoid this:

- Avoid prod boxes like the plague

- Set up a prompt (globally) to make it extremely obvious that you're in production. Something like a red background and black text saying "PRODUCTION"

- When changing data in production (DB's, config, etc) write a script (or just commands to copy and paste) and have that peer reviewed. If anything doesn't go to plan, treat it as a red flag. This serves a dual purpose of having a quick record of your actions without hunting through logs.

- Never ever leave open sessions

- Avoid prod boxes. This is important enough for me to say twice. Most of the time it can be avoided, especially if you use configuration management tools and write tools to perform common operations.

Now, lets just cross my fingers I don't jinx myself :-)

StreamBright · on Feb 11, 2017

This sounds really cool. How about you are working for amazon or google and you have 1M production boxes? A nice prompt wont save you, change management will. Writing down the exact steps and reading it yourself and get others to read it and you execute it line by line. In my experience this is much better approach than avoid production boxes or have a red prompt and also scales to larger infrastructures. If something goes sideways (and in some cases it will) you can pinpoint the root cause quickly.

DigitalJack · on Feb 11, 2017

It may be unlikely to come from you, but it's for damn sure that it'll never happen to said engineer again.

Also, I would make sure to have a different prompt than default for non-prod systems too. That way you know to be suspicious if it hasn't been changed from default.

viraptor · on Feb 11, 2017

I don't think most of your points really apply though. They were setting up replication in production, so they had to work on production boxes. Setting prompt to say just "production" wouldn't help for the same reason. Production was intended.

Peer review though - yes. That could help. I wouldn't say "I'm unlikely to make that mistake" - it's likely to go on the famous last words list...

lojack · on Feb 11, 2017

Sure, maybe the point about "PRODUCTION" may not have applied, I was mainly commenting on OPs post about how we've all been in a situation and the steps I take to avoid such situations. I'm curious about your arguments about the remaining 3/4 of the steps I take and how those wouldn't have applied to this situation?

cookiecaper · on Feb 11, 2017

The red PS1 would've clearly indicated to the engineer that he was typing `rm -rf ...` on the _master_, not the secondary. This assumes that the master and secondary would have differing prompts based on their relative importance.

viraptor · on Feb 11, 2017

That would help, but that's not what OP advocated. Sure, you can improve on those ideas. I was mainly pointing out that saying "it's unlikely to happen to me" was a bit dangerous and too sure, if most of the reasons do not apply to the situation.

lojack · on Feb 11, 2017

Would the steps I describe prevent actions taken in the GitLab incident? I would never make no assumptions to that. Maybe. Maybe not. Did I say following those steps would make it unlikely to happen to you? No. That's why I prefaced it with "I'm not a sysadmin." Would it prevent cases described by the person I was responding to? Absolutely. Not 100% of the time, but some percentage of the time.

So, I'll say it more clearly, and you can mark my words. It's unlikely I'll ever log into a production system, type the wrong command, and do something bad as a result.

Could I deploy code that does very bad things to production? Yes. It'll probably happen to me. Is that the situation described above? No.

I treat logging into a production system as if one wrong move could result in me losing my job. Why? Because one wrong move could result in me losing my job. I'm not joking when I say I avoid logging into a production system like the plague. It's unlikely to happen to me because its extremely rare for me to put myself in a situation where I could let this happen. There's almost always better alternatives that I'll resort to, well before doing anything like this.

chrismorgan · on Feb 11, 2017

I messed up an XP computer at home with `cd D:\backups\something; del /s * ` many years ago; `cd` without the /D flag doesn’t change the drive, so although D:\backups\something was the working directory on the D: drive, the working directory was still C:\WINDOWS\system32, and cmd.exe was running as administrator.

Fortunately disks were slower back then, so it hadn’t deleted too many files when I interrupted it, and the computer was able to be recovered without too much inconvenience.

rectang · on Feb 11, 2017

For me, it was when I meant to execute this...

    rm -rf ~/foo

... but executed this instead:

    rm -rf ~ /foo

cjbprime · on Feb 11, 2017

My worst data loss was:

    $ tar cvfz mbox outbox mbox.tar.gz

The argument order is backwards -- the output file is supposed to be first, then the input files.

On my system, this overwrote my full mailbox with a gzipped copy of my outbox, and a complaint that the mbox.tar.gz input file didn't exist.

That's right, the worst data loss happened while I was trying to take a backup. :(

chuckdries · on Feb 11, 2017

What were you trying to do? What's the outbox argument for?

dasil003 · on Feb 11, 2017

He was trying to backup his mbox and outbox.

BinaryIdiot · on Feb 11, 2017

I did the Windows equivalent once a long time ago (I think it's deltree?) and I did it on a university computer system. It cleared out a TON of files and the computer itself pretty much stopped working. I had to hard turn it off.

Fortunately the University was using some tool that can re-image a computer each time it boots before hitting Windows so starting it back up and all the deleted system and application files were back.

overcast · on Feb 11, 2017

Probably Ghost server, that's why schools do this.

dsavinkov · on Feb 11, 2017

this is the reason why I always use quotes before specifying folders with rf -rf :)

memracom · on Feb 11, 2017

I always type

ls /stuff/wherever/*

Then examine the output to see if that is the stuff I really want to delete and if it is,

up-arrow Ctrl-A right right backspace backspace rm -rf enter

Never deleted the wrong stuff again in 30 years of doing that

okbake · on Feb 11, 2017

Something else that is really useful in these situations (in bash at least) is alt-* (alt-shift-8). It will expand a directory or glob into all effected top level files/directories.

For example, it will expand `ls *` to `ls foo bar baz`, etc

was_boring · on Feb 11, 2017

I always cd to the parent directory, ls with tab complete to check, the rm with the same tab complete.

bartvk · on Feb 12, 2017

Great tip, thanks!

meowface · on Feb 11, 2017

For me it was a simple `rm -rf .`. Thought I was in ~/somefolder/, was actually in ~...

I stopped it after about 3 seconds, but that was enough to do critical damage.

a_t48 · on Feb 11, 2017

Misread that as "The engineer was terminated" at first. Poor guy.

haldean · on Feb 11, 2017

Potential outage prevention plan: put an alias on all production machines that emails HR to schedule a disciplinary meeting every time you run `rm -rf`.

INTPenis · on Feb 11, 2017

At a small web host, early in my career, I once saw the boss blurr past my desk towards the server room. Throw open the big vault door and disappear inside.

Turns out he had accidentally executed an rm of the home dir on a major web server in the background so in panic, instead of killing the right pid, he just ran to the server and pulled the power cords. :D

Ended up restoring a few home dirs from tape.