Do you expect management to be staring over your shoulder every time you do some kind of `rm` on a production server? With great power comes great responsibility.
People get tired, sick, frustrated, and panicky; part of being a responsible engineer is accepting that you're as fallible as the next person and building in protection against your own errors.
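One common way to build in that protection (my own sketch, not something from this thread) is to make destructive commands pause or detour on interactive shells. For example, with GNU coreutils you can have `rm` prompt before large deletions, or move files to a trash directory instead of deleting them outright; the `~/.trash` path here is arbitrary:

```shell
# Prompt once before removing more than three files or recursing.
# (GNU coreutils only; a sketch, not a complete safety net.)
alias rm='rm -I'

# A gentler variant: move things to a per-user trash directory
# instead of deleting them. The trash location is an arbitrary choice.
trash() {
  mkdir -p "$HOME/.trash"
  mv -- "$@" "$HOME/.trash/"
}
```

Neither protects against every mistake, but both buy you a moment to notice which machine you're on.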
However, if "the engineer" who caused this happens to read this: the above is not a sign that you should quit the profession and become a hermit. A chain of events caused this; you just happened to be the one without a chair when the music stopped.
One of the coolest things I've read about is how airlines do root cause analysis. If you get to a point where a single human can mess up the situation like this, it's considered a systemic issue. I'm on mobile now, but I can try to find the source later.
> That is, much like falling dominoes, Bird and others (Adams, 1976; Weaver, 1971) have described the cascading nature of human error beginning with the failure of management to control losses (not necessarily of the monetary sort) within the organization.
In general, in aviation, the existence of any single point of failure (SPOF) is considered a systemic issue, be it a single human who can fail (anyone can faint or have a heart attack), a single component, or a single process. That's why there are not only redundant systems, but redundant humans and even redundant processes for the same task (you can power the control-surface hydraulics through the engines, then through the aux power unit, then through the windmill...).
If a design contains an SPOF, then it's a bad design and should not be approved until the SPOF is removed by adequate redundancy or other means.
Management is often at fault for not giving engineers the resources to do their job properly. How many of the 'Improving Recovery Procedures' items had already been highlighted but ignored? Were the engineers pressured to deliver other features instead of bedding down their operations procedures?
I'm not saying this is the case here, but it's all too easy to blame someone for making a mistake. Even the most experienced make mistakes, yet reducing your MTTR (mean time to recovery) is often overlooked in favour of other, seemingly more pressing concerns.
It was not management's job to prevent the engineer from typing "rm". It was management's job to make sure that typing "rm" would not result in major data loss. This is assuming the engineer was not already the highest-ranking technical person in the company.
I am very happy about their open post-mortem, so that anyone can learn from it. Reading it, it looks to me like the "rm" was not the cause of the disaster; it merely triggered it. The real problem was the whole setup, which failed, and that falls under management's responsibility.
Nope, serious. I was at a place where the DBA did exactly what happened at GitLab, but in SQL:

    select * from table > script
    @script

(drop all the tables)
It was in prod, he thought it was a dev DB, and the backups had never worked. After this, the edict was that all terminals for prod would be red. A simple solution.
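The "red terminals for prod" edict can be sketched in a few lines of shell. This is my own illustration, assuming a POSIX shell and that production hostnames contain the string "prod" (an assumption; the comment doesn't say how they detected prod):

```shell
#!/bin/sh
# Build a prompt string that screams when the host looks like prod.
# 41;97 = white text on a red background; \033[0m resets the colors.
prod_prompt() {
  case "$1" in
    *prod*)
      printf '\033[41;97m[PROD] %s$\033[0m ' "$1"
      ;;
    *)
      printf '%s$ ' "$1"
      ;;
  esac
}

# In ~/.bashrc you might then set:
#   PS1="$(prod_prompt "$(hostname)")"
prod_prompt "db-prod-01"; echo
prod_prompt "dev-box"; echo
```

It's a crude signal, but that's the point: the cue fires before you type anything, not after.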