> "My error lies in forgetting that Gilbert and Lynch’s formulation of availability requires only non-failing nodes to respond, and that without ≥1 live nodes in a partition there isn’t the option for split-brain syndrome."
Yes, distributed systems are hard; even those writing blogs decrying the silliness of others can very easily make mistakes themselves.
Another way to think about partition tolerance is not only "what do I do when a net-split occurs" but also "what do I do when the net-split heals and information now has to merge." In other words, the case where one server got struck by lightning, fried, and is never coming back is the nice case to have. Having it come back a month later for whatever reason and become a "master" while discarding and logically rolling back whole swathes of data is the real problem.
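To make the heal-and-merge hazard concrete, here is a minimal sketch (my own illustration, not from the article) of a toy key-value record tagged with a version vector, showing why a naive "the rejoining node is master" policy silently rolls back data:

```python
# Minimal sketch: why "node comes back and wins" is dangerous.
# Assumes a toy key-value record tagged with a version vector per node.

def compare(vv_a, vv_b):
    """Return 'a', 'b', or 'concurrent' by comparing two version vectors."""
    nodes = set(vv_a) | set(vv_b)
    a_ge = all(vv_a.get(n, 0) >= vv_b.get(n, 0) for n in nodes)
    b_ge = all(vv_b.get(n, 0) >= vv_a.get(n, 0) for n in nodes)
    if a_ge and not b_ge:
        return "a"          # a strictly dominates: b is an ancestor, safe to drop
    if b_ge and not a_ge:
        return "b"          # b strictly dominates: a is an ancestor, safe to drop
    return "concurrent"     # neither dominates: merging needs real conflict resolution

# The returning node's stale record vs. a month of writes elsewhere:
old = {"value": "pre-partition", "vv": {"n1": 5}}
new = {"value": "post-partition", "vv": {"n1": 5, "n2": 130}}

print(compare(old["vv"], new["vv"]))   # -> "b": the old copy is an ancestor
# A naive "rejoining node becomes master" policy would keep `old` anyway,
# discarding everything recorded in n2's 130 writes.
```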
> Multi-node failures may be rarer than single-node failures, but they are still common enough to have serious effects on business. [...] PDU failures, switch failures, accidental power cycles of whole racks, whole-DC backbone failures, whole-DC power failures, and a hypoglycemic driver smashing his Ford pickup truck into a DC’s HVAC system. And I’m not even an ops guy.
I think it is funny that the author manages to blame everything and everyone except bad software. A very likely multi-node failure is hitting a bug in your own code. Not to say that hypoglycemic drivers are to be taken lightly around your high-capacity blade servers, but screwing something up in code and having that same code run on all 1000 servers could easily bring those 1000 servers down.
> I think it is funny that the author manages to blame everything and everyone except bad software.
That's a good point about bad software, but to be fair the author is reciting actual things he has seen, not hypothetical scenarios. The part of the quote you hid in an ellipsis is important for context:
> In my limited experience I’ve dealt with long-lived network partitions in a single data center (DC), PDU failures...
The article is essentially pointing out a flaw in the CAP theorem itself: Network partitions are an availability failure. It's not "consistency, availability, partition tolerance; pick two." It's "a specific type of availability failure can be mitigated by sacrificing consistency."
The problem with the CAP theorem's formulation is that it makes it seem like you have to choose between availability and consistency. But you only have to make that choice once you've already suffered a serious availability failure. The more important trade-off is between availability and cost: you can increase availability (including making network partitions less likely) by adding redundancy. That's the real trade-off. If the one switch that connects all your devices fails and has no backup, you're going to lose availability, and there is no amount of consistency you can sacrifice to get it back.
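As a back-of-the-envelope illustration of that cost/availability trade-off (my own numbers, purely illustrative), assuming independent failures, each redundant switch multiplies the unavailability:

```python
# Sketch: how redundancy, not consistency choices, moves the availability needle.
# The 99.9% per-switch figure is an assumed example value.

def combined_availability(per_unit_availability: float, redundant_units: int) -> float:
    """Availability of N redundant units, assuming independent failures."""
    return 1 - (1 - per_unit_availability) ** redundant_units

switch = 0.999  # a single switch that is up 99.9% of the time

for n in (1, 2, 3):
    a = combined_availability(switch, n)
    downtime_min = (1 - a) * 365 * 24 * 60
    print(f"{n} switch(es): {a:.6f} availability, ~{downtime_min:.1f} min/year down")

# 1 switch(es): 0.999000 availability, ~525.6 min/year down
# 2 switch(es): 0.999999 availability, ~0.5 min/year down
# 3 switch(es): 1.000000 availability, ~0.0 min/year down
```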
> "My error lies in forgetting that Gilbert and Lynch’s formulation of availability requires only non-failing nodes to respond, and that without ≥1 live nodes in a partition there isn’t the option for split-brain syndrome."
Yes distributed system are hard, even those writing blogs decrying the silliness of others also can very easily make mistakes.
Another way to think about partition tolerance is not only "what do I do when a net-split occurs" but what do I do if net-split heals and now information has to merge. In other words the fact that one server got struck by lighting, fried and is never coming back is a nice case to have. Having it come back a month later for whatever reason and becoming a "master" while discarding and logically rolling back whole swathes of data is problem.
> Multi-node failures may be rarer than single-node failures, but they are still common enough to have serious effects on business. [...] PDU failures, switch failures, accidental power cycles of whole racks, whole-DC backbone failures, whole-DC power failures, and a hypoglycemic driver smashing his Ford pickup truck into a DC’s HVAC system. And I’m not even an ops guy.
I think it is funny that the author manages to blame everything and everyone except bad software. A very likely multi-node failure is hitting a bug in your code. Not saying hypoglycemic drivers are to be taken lightly around your high capacity blade servers, but screwing up something in code, and have the same code run on all 1000 servers could easily bring those 1000 servers down.