I will add my inputs on RDS. I gave this comment on the GitLab incident thread. ...

illumin8 · on Feb 11, 2017

There is no magic silver bullet that will let you upgrade a database without some minor amount of downtime. RDS minimizes this as much as possible by upgrading your standby database, initiating a failover, then creating a new standby. Clients will always be impacted because you have to, by definition, restart your database to be running the new version.

You can select your maintenance window, and you can defer updates as long as you want - nobody will force you to update, unless you check the "auto minor version update" box.

Please don't blame AWS for your lack of understanding of the platform. They try to protect you from yourself, and the default behavior of taking a final snapshot before deleting an instance is in both CloudFormation and the Console. If you choose to override those defaults, don't blame AWS.

yeukhon · on Feb 12, 2017

You come across strong. No one is blaming AWS.

cookiecaper · on Feb 11, 2017

>So I learned my first lesson with RDS - make sure final snapshot flag is enabled (for EC2 user, please remind yourself anything stored on ephemeral storage are going to be loss upon a hard VM stop/start operation, so backup!!!).

This bit us once. Someone issued a `shutdown -h now` out of habit in an instance that was going for reboot, and it came back without its data, because "shutdown" is the same as "stop", and "stop" on ephemeral instances means "delete all my data". Since the command was issued from inside the VM, no warning or message that would've appeared on the EC2 console was displayed.

Amazon's position on ephemeral storage was shockingly unacceptable and unprofessional. They claimed they had to scrub the physical storage as soon as the stop button was pressed for security purposes, which is a complete cop-out. Of course they can't reallocate that chunk of the disk to the next instance while your stuff is on it, but they could've implemented a small cooldown period between stoppage, scrubbing, and reallocating the disk so that there would at least be a panic button and/or so accidental reboots-as-shutdowns don't destroy data. The only reason they didn't do that is because they didn't want to need to expand their infrastructure to accommodate it. Very sloppy, and not at all OK. That's not how you treat customer data.

Fortunately, AWS has moved on; I don't think that any new instances can be created with ephemeral storage anymore. Pure EBS now.

>I also learned that RDS is not truly HA in the case of upgrading servers, both minor and major upgrade. I've tested major upgrade and saw DB connection unavailable up to 10 min. In some minor version upgrades both primary and secondary had to be taken down.

You need multi-AZ for true HA. Failover within the same AZ has a small delay, as you've noted.

>I still enjoy using RDS, the service is stable and quick to use, but just make sure you have the habit of backuping and take serious ownership and responsibility of the database.

As many others in this thread have said, AWS and other cloud providers aren't a silver bullet. Competent people are still needed to manage these sorts of things. GitLab most likely would not have fared any better under AWS.

illumin8 · on Feb 11, 2017

Don't blame AWS because you don't understand what ephemeral storage is.

There is a significant security reason why they blank the ephemeral storage. How would you feel if a competitor got the same physical server as you, and was able to read all of your data? AWS takes great lengths to protect customer data privacy in a shared, multi-tenant environment. They are very public through their documentation about how this works, so I think it's a bit negligent to blame them because you don't understand the platform.

cookiecaper · on Feb 11, 2017

Did you read my post? I understand what ephemeral storage is, and that giving another instance access to that physical device without scrubbing it is insecure. That's not the point. There's no reason that AWS needs to delete that data the instant a stop command is issued.

AWS gets paid the big bucks to abstract such concerns away in a pleasant manner. The device with customer data can sit in reserve, attached to the customer's account, for a cooldown period (of maybe 24 hours?) that would allow the customer to redeem it. AWS could even charge a fee for such data redemptions to compensate for the temporary utilization of the resource, or they could say ephemeral instances will always cost your use + 1 day. They can put a quota on the number of times you can hop ephemeral nodes.

They could do basically anything else, because basically anything else is better than accidentally deleting data that you need due to a counterintuitive vendor-specific quirk that conflicts with established conventions and habits and then being told "Sorry, you should've read the docs better."

This is an Amazon-specific thing that bucks established convention and converts the otherwise-harmless habits of sysadmins into potential data loss events. It's very bad to do this ever (looking at you, killall Linux v killall Solaris), but it's especially bad to do it on a new platform like AWS where you know lots of people are going to be carrying over their established habits and learning the lay of the land. It is not reasonable for Amazon to tell the users that they just have to suck it up and read the docs more thoroughly next time.

This is not like invoking rm on your system or database root, which is a multi-decade danger that everyone is aware of and acclimated to accounting for, and which has multiple system-level safeguards in place to prevent it: user access control, safe-by-default versions of rm that have been distributed with most major distributions lately, etc., and for which thorough backup and replication solutions exist to provide remedies when inevitable accidents do happen.

The point is that just instantly deleting that data ASAP and providing 0 chance for recovery is wanton recklessness, and there's no excuse for it. Security is not an excuse because there's no reason they have to reallocate the storage the instant the node is stopped.

If such deletions could only be triggered from the EC2 console after removing a safeguard similar to Termination Protection, that may be more reasonable, but allowing a shutdown command from the CLI to destroy the data is patently irresponsible.

Good system design considers that humans will use the system, that humans make mistakes, and it will provide safeguards and forgiveness. Ephemeral storage fails on all of those fronts. Yes, technically, it's the user's fault for mistakenly pressing the buttons that make this happen. But that doesn't matter. The system needs to be reasonably safe. AWS's implementation of ephemeral storage is neither safe nor reasonable.

Amazon has done a good job of tucking ephemeral storage away. It used to be the default on certain instance sizes. As another commenter points out, it now requires one to specifically launch an AMI with instance-backed storage. It's good that they've made it harder to get into this mess, but it's bad that they continue to mistreat customers this way, especially when their prices are so exorbitant.

illumin8 · on Feb 12, 2017

So, the solution to some customers not understanding the economics and functionality of ephemeral storage is to charge all customers for a minimum of 25 hours of use, even if they only use the instance for a single hour? That seems crazy.

Look, AWS is trying to balance the economics of a large, shared, multi-tenant platform. It would be great if they had enough excess capacity around to keep ephemeral instance hardware unused for 24 hours after the customer terminates or stops the instance, but frankly, that's an edge case, and they would be forcing other customers to subsidize your edge case by charging everyone more.

cookiecaper · on Feb 12, 2017

>So, the solution to some customers not understanding the economics and functionality of ephemeral storage

Let me stop you there. In our case, it wasn't that we didn't understand what ephemeral storage was or how it functioned, or that it would get cleared if the instance was stopped (though I've frequently met people who are confused over whether instance storage gets wiped when a machine is stopped or when it's terminated; it gets wiped when an instance is stopped).

The issue was that someone typed "sudo shutdown -h now" out of habit instead of "sudo shutdown -r now" (and yes, something like "sudo reboot" should've been used instead to prevent such mistakes). Stopping an instance, which is what happens when you "shut down", can have other ramifications that are annoying, like getting a different IP address when it's started back up, but those annoyances are usually pretty easy to recover from, not a big deal. Much different ball park from getting your stuff wiped.

Destroying consumer data IS a big deal. It's ALWAYS a big deal. If your system allows users to destroy their data without being 1000% clear about what's happening, your system's design is broken. High-cost actions like that should require multiple confirmations.

Even the behavior of the `rm` command has been adjusted to account for this (though it could be argued that it hasn't been adjusted far enough); for the last several years, an extra flag has been required to remove the filesystem root.

>is to charge all customers for a minimum of 25 hours of use, even if they only use the instance for a single hour? That seems crazy.

One of several potential solutions. It doesn't seem crazy to me; at least, not in comparison to making a platform with such an abnormal design that something which is an innocent, non-destructive command everywhere else can unexpectedly destroy tons of data.

The ideal solution would be for Amazon to fix their design so that this is fully transparent to the user. Instance storage should be transmitted into a temporary EBS disk on shutdown and automatically re-applied to a new instance store when it's spun back up (it's OK if this happens asynchronously). The EBS disk would follow conventional EBS disk termination policies; that data shouldn't be deleted except at times that the EBS root disk would also be deleted (typically on instance termination, unless special action is taken to preserve it).

That could be an optional extension, but it should be on by default -- that is, you could start an instance store at a lower cost per hour if you disabled this functionality, similar to reduced redundancy storage in S3, etc. Almost every company would be thrilled to pay the extra few cents per hour to safeguard against the accidental destruction of virtually any quantity of data that might be important.

>Look, AWS is trying to balance the economics of a large, shared, multi-tenant platform. It would be great if they had enough excess capacity around to keep ephemeral instance hardware unused for 24 hours after the customer terminates or stops the instance, but frankly, that's an edge case, and they would be forcing other customers to subsidize your edge case by charging everyone more.

A redemption fee would punish the user who made the mistake for failing to account for Amazon's flawed design. Under this model, such fees should be at least high enough to make up the cost incurred by Amazon in keeping the hardware idle.

This way Amazon can punish people who impugn upon its bad design choices by making them embarrass themselves before their bosses when they have to explain why the AWS bill is $300 higher this month or whatever, and the data won't be gone. Winners all around.

illumin8 · on Feb 13, 2017

A redemption fee is a good idea, but it would still take engineering effort to build such a feature, so the opportunity cost is that other features customers need wouldn't get built.

Another thing I'd like to point out is that you really need to plan for ephemeral storage to fail. All it takes is a single disk drive failure in your physical host, and you've lost data. If you are using ephemeral storage at all, you should definitely have good, reliable backups, or the data should be protected in other ways (like HDFS replication).

mortar · on Feb 11, 2017

> I don't think that any new instances can be created with ephemeral storage anymore. Pure EBS now.

Still around, can be launched with an instance-store backed AMI:

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceS...

AlisdairO · on Feb 11, 2017

Not sure how it works on cloudformation, but in the console and API you have to explicitly skip the final snapshot.