
This is usually true, except when it's not:

I personally experienced a near-catastrophic situation three years ago, where 13 out of 15 days' worth of nightly RDS MySQL snapshots were corrupt and would not restore properly.

The root cause was a silent EBS data-corruption bug (RDS is EBS-based) that, Amazon support eventually admitted to us, had slipped through and affected a "small" number of customers. Unlucky us.

We were given exceptional support including rare access to AWS engineers working on the issue, but at the end of the day, there was no other solution than to attempt restoring each nightly snapshot one after the other, until we hopefully found one that was free of table corruption. The lack of flexibility to do any "creative" problem-solving operations within RDS certainly bogged us down.

With a multi-hundred gigabyte database, the process was nerve-wracking as each restore attempt took hours to perform, and each failure meant saying goodbye to another day's worth of user data, with the looming armageddon scenario that eventually we would reach the end of our snapshots without having found a good one.

Finally, after a couple of days of complete downtime, the second to last snapshot worked (IIRC) and we went back online with almost two weeks of data loss, on a mostly user-generated content site.

We got a shitload of AWS credits for our trouble, but the company obviously went through a very near-death experience, and to this day I still don't 100% trust cloud backups unless we also have a local copy created regularly.



> We got a shitload of AWS credits for our trouble, but the company obviously went through a very near-death experience, and to this day I don't 100% trust cloud backups unless we also have a local copy created regularly.

Cloud backups, and more generally all backups, should be treated like nuclear proliferation treaties: trust, but verify!

If you periodically restore your backups, you'll catch this kind of crap when it's not yet an issue, rather than when shit has already hit the fan.


Years ago I had my side project server hacked twice. I've been security and backup paranoid ever since.

At my current startup, we have triple backup redundancy for a 500GB pg database:

1/ A Postgres streaming-replication hot standby server (which doesn't serve reads at the moment, but might in the future)

2/ WAL-level streaming backups to AWS S3 using WAL-E, which we automatically restore to our staging server every week

3/ Nightly logical pg_dump backups.

Nine months ago we only had option 3 and were hit with a database corruption problem. Restoring the logical backup took hours and caused painful downtime, as well as the loss of almost a day of user-generated content. That's why we added options 1 and 2.

I can't recommend WAL-E enough as an additional backup strategy. Restoring from a WAL (binary) backup is ~10x faster in our use case (YMMV), and the most data you can lose is about a minute's worth. As an added bonus you get the ability to roll back to any point in time, which has helped us recover data that users deleted.
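For reference, the WAL-E wiring is roughly the following (a sketch based on the WAL-E README; the env directory, data directory, and cron cadence here are assumptions, not our exact setup):

```shell
# postgresql.conf: hand every completed WAL segment to WAL-E
#   wal_level = archive            # 'replica' on PostgreSQL 9.6+
#   archive_mode = on
#   archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'

# Nightly cron job: push a full base backup to S3
envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.6/main

# Weekly staging refresh: fetch the latest base backup, then let
# Postgres replay the archived WAL on startup (via restore_command)
envdir /etc/wal-e.d/env wal-e backup-fetch /var/lib/postgresql/9.6/main LATEST
```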

We have a separate #backups Slack channel where our scripts send a message for every successful backup, along with the backup's size (in MB) and duration. This lets everyone check that backups ran, and that size and duration are increasing in an expected way.
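A minimal sketch of such a reporting step (the file path and webhook URL are hypothetical, and the actual Slack send is left commented out since it needs a real webhook):

```shell
#!/bin/sh
# Report a finished backup to Slack: size in MB plus wall-clock duration.
BACKUP_FILE=/tmp/nightly.dump
START=$(date +%s)
printf 'stand-in dump contents' > "$BACKUP_FILE"   # the real pg_dump would run here
END=$(date +%s)
SIZE_MB=$(du -m "$BACKUP_FILE" | cut -f1)
DURATION=$((END - START))
MSG="backup ok: ${SIZE_MB} MB in ${DURATION}s"
echo "$MSG"
# curl -sf -X POST -H 'Content-type: application/json' \
#      -d "{\"text\": \"$MSG\"}" "$SLACK_WEBHOOK_URL"
```

The nice side effect is that a *missing* message is itself a signal: if the channel is quiet one morning, someone notices.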

Because we restore our staging on a weekly basis, we have a fully tested restore script, so when a real restore is needed, we have a couple of people who can handle the task with confidence.

I feel like this is about as "safe" as we should be.


I agree; a backup that isn't used to populate a staging/QA instance right after being taken is untrustworthy.


Even before that there are steps you can take. For example if you take a Postgres backup with pg_dump, you can run pg_restore on it to verify it.

If a database isn't specified, pg_restore will output the SQL commands to restore the database, and the exit code will be zero (success) if it makes it through the entire backup. That tells you the original dump succeeded and there was no disk error in whatever was written. Save the file to something like S3, along with its sha256. If the hash matches after you retrieve it, you can be pretty damn sure it's a valid backup!
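Sketched as a script, assuming the custom (-Fc) dump format and a bucket name of my own invention; the pg_dump/pg_restore and aws calls are left commented so only the hash round-trip runs as-is:

```shell
#!/bin/sh
set -e
DUMP=/tmp/app.dump
# In production, the first two steps would be:
#   pg_dump -Fc appdb -f "$DUMP"
#   pg_restore "$DUMP" > /dev/null     # exit 0 means the whole archive is readable
printf 'stand-in dump bytes' > "$DUMP"   # placeholder so the hash logic runs anywhere
( cd /tmp && sha256sum app.dump > app.dump.sha256 )
# Ship both artifacts off-box (bucket is hypothetical):
#   aws s3 cp "$DUMP" s3://my-backups/ && aws s3 cp "$DUMP.sha256" s3://my-backups/
# After retrieving them later, confirm the bytes survived the round trip:
( cd /tmp && sha256sum -c app.dump.sha256 )
```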

Otherwise you get blind scripts like GitLab had, where pg_dump fails. No exit-code checking. No verification. No bueno!


Are there any guidelines on how often you should be doing restore tests?

It probably depends on the criticality of the data, but if you test, say, every two weeks, you can still end up in the OP's situation, right?

At what size/criticality should you have a daily restore test? Maybe even a rolling restore test, so you check today's backup, but then check it again every month or something?


Ideally it should be immediately after a logical backup.

For physical backups (e.g., WAL archiving), a combination of read replicas that are actively queried, rebuilt from base backups on a regular schedule, plus staging master restores on a less frequent yet still regular schedule, will give you a high level of confidence.

Rechecking old backups isn't necessary if you save the backups' hashes and can confirm they still match.


Not immediately (IMO); you should first push the backup to wherever it's being stored. Then your DB test script is the same as your DB restore script: both start by downloading the most recent backup. The things you'll catch here include, e.g., the upload process deciding to time out after pushing to S3 for an hour, silently truncating the dump, and exiting with 0 instead of failing!
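For example (a sketch: the bucket, key names, and staging database are made up, and the network calls are commented out so only the selection logic runs here):

```shell
#!/bin/sh
set -e
# One entry point for both the weekly staging refresh and a real disaster
# recovery: always start from the stored artifact, never from a local file.
latest_key() {
    # In production: aws s3 ls s3://my-backups/nightly/ | awk '{print $4}' | sort | tail -n1
    # Stand-in listing so the selection logic runs anywhere:
    printf 'backup-001.dump\nbackup-002.dump\n' | sort | tail -n1
}
LATEST=$(latest_key)
echo "restoring $LATEST"
# aws s3 cp "s3://my-backups/nightly/$LATEST" /tmp/restore.dump
# pg_restore --clean --if-exists -d staging /tmp/restore.dump
```

Because the download step is shared, a truncated or missing upload breaks the staging refresh loudly, days before anyone needs the backup for real.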


Wow, I'm sorry you experienced that. This points to the importance of regularly testing your backups. I hope AWS will offer an automated testing capability at some point in the future.

In the meantime, I hope you've developed automation to test your backups regularly. You could just launch a new RDS instance from the latest nightly snapshot, and run a few test transactions against it.
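A hedged sketch of what that automation could look like with the AWS CLI (the instance identifiers and the smoke query are assumptions; everything that needs real AWS credentials is left commented out):

```shell
#!/bin/sh
# Restore last night's automated snapshot into a throwaway instance,
# run a smoke query, then tear it down. Names are hypothetical.
DB_ID="mydb"
TEST_ID="restore-test-$(date +%s)"
echo "restore test instance: $TEST_ID"
# SNAPSHOT=$(aws rds describe-db-snapshots --db-instance-identifier "$DB_ID" \
#     --snapshot-type automated \
#     --query 'DBSnapshots[-1].DBSnapshotIdentifier' --output text)
# aws rds restore-db-instance-from-db-snapshot \
#     --db-instance-identifier "$TEST_ID" --db-snapshot-identifier "$SNAPSHOT"
# aws rds wait db-instance-available --db-instance-identifier "$TEST_ID"
# psql -h "$TEST_ENDPOINT" -c 'SELECT count(*) FROM users;'   # smoke query; table assumed
# aws rds delete-db-instance --db-instance-identifier "$TEST_ID" --skip-final-snapshot
```

Run it nightly from cron and alert if any step exits non-zero, and you've turned the OP's two-day scramble into a routine check.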


This is certainly true of all backups to an extent, not just the cloud. Back in the days of backing up to external tape storage, it was important to test restores in case heads weren't calibrated, or were calibrated differently between tape machines, etc.

I'm curious: did you manage to automate a restore smoke test after going through this?


Snapshots are not backups, although many people use them as backups and believe they are good backups. Snapshots are snapshots. Only backups are backups.


What is the difference exactly?


A snapshot could be a backup depending on what you're calling a snapshot, but yeah, in general, to be a backup things need to have these features:

1. stored on separate infrastructure so that obliteration of the primary infrastructure (AWS account locked out for non-payment, password gets stolen and everything gets deleted, datacenter gets eaten by a sinkhole, etc.) doesn't destroy the data.

2. offline, read-only. This is where most people get confused.

Backups are unequivocally NOT a live mirror like RAID 1, a slightly-delayed replication setup like most databases provide, or a double-write system. These aren't backups because they make it impossible to recover from human errors, which include obvious things like dropping the wrong table, but also less obvious things, like a subtle bug that corrupts or damages some records and may take days or weeks to notice. Your standbys/mirrors will faithfully copy both the obvious and the non-obvious mistakes before you have a chance to stop them.

This is one of the most important things to remember. Redundancy is not backup. Redundancy is redundancy and it primarily protects against hardware and network failures. It's not a backup because it doesn't protect against human or software error.

3. regularly verified by real-world restoration cases; backups can't be trusted until they're confirmed, at least on a recurring, periodic basis. Automated alarms and monitoring should be used to validate that the backup file is present and that it is within a reasonable size variance between human-supervised verifications. Automatic logical checksums like those suggested by some other users in this thread (e.g., run pg_restore on a pg_dump to make sure that the file can be read through) are great too and should be used whenever available.

4. complete, consistent, and self-contained archive up to the timestamp of the backup. Differential/incremental backups count as long as the full chain needed for a restoration is present.

This excludes COW filesystem snapshots, etc., because they're generally dependent on many internal objects dispersed throughout the filesystem; if your FS gets corrupted, it's very likely that some of the data referenced by your snapshots will be corrupted too (snapshots are only possible because COW semantics mean that the data does not have to be copied, just flagged as in use in multiple locations). If you can export the COW FS snapshot as a whole, self-contained unit that can live separately and produce a full and valid restoration of the filesystem, then that exported thing may be a backup, but the internal filesystem-local snapshot isn't (see also point 1).


Backups protect against bugs and operator errors and belong on a separate storage stack to avoid all correlation, ideally on a separate system (software bugs) with different hardware (firmware and hardware bugs), in a different location.


The purpose of a backup is to avoid data loss in scenarios included in your risk analysis. For example, your storage system could corrupt data, or an engineer could forget a WHERE clause in a delete, or a large falling object hits your data center.

Snapshots will help you against human error, so they are one kind of backup (and often very useful), but if you do not at least replicate those snapshots somewhere else, you are still vulnerable to data corruption bugs or hardware failures in the original system. Design your backup strategy to meet your requirements for risk mitigation.


I'd also add not just different location but different account.

If your cloud account, datacenter/colo, or office is terminated, hacked, burned down, or swallowed by a sinkhole, you don't want your backups going with it.

Cloud especially: even if you're on AWS and have your backups in Glacier + S3 with replication to 7 datacenters on 3 continents... if your account goes away, so do your backups (or at least your access to them).


RDS snapshots are backups. They are copied to S3 storage, which is replicated across 3 datacenters within a region.



