IMO, looking at the root causes here isn't that helpful. Software is complicated, and there will always be some unknown bottleneck or bug lurking to knock you over on a bad day. The important lessons here are about:
* How their system architecture made them particularly vulnerable to this kind of issue
* Their actions to diagnose and attempt to mitigate the issue
* The whole later part, where they effectively cold-started their entire infrastructure while millions of users were banging on their metaphorical door to start using the service again