
This is usually true, except when it's not:

I personally experienced a near-catastrophic situation three years ago, where 13 out of 15 days' worth of nightly RDS MySQL snapshots were corrupt and would not restore properly.

The root cause was a silent EBS data-corruption bug (RDS is EBS-based) that, Amazon support eventually admitted to us, had slipped through and affected a "small" number of customers. Unlucky us.

We were given exceptional support including rare access to AWS engineers working on the issue, but at the end of the day, there was no other solution than to attempt restoring each nightly snapshot one after the other, until we hopefully found one that was free of table corruption. The lack of flexibility to do any "creative" problem-solving operations within RDS certainly bogged us down.

With a multi-hundred gigabyte database, the process was nerve-wracking as each restore attempt took hours to perform, and each failure meant saying goodbye to another day's worth of user data, with the looming armageddon scenario that eventually we would reach the end of our snapshots without having found a good one.

Finally, after a couple of days of complete downtime, the second to last snapshot worked (IIRC) and we went back online with almost two weeks of data loss, on a mostly user-generated content site.

We got a shitload of AWS credits for our trouble, but the company obviously went through a very near-death experience, and to this day I still don't 100% trust cloud backups unless we also have a local copy created regularly.



> We got a shitload of AWS credits for our trouble, but the company obviously went through a very near-death experience, and to this day I don't 100% trust cloud backups unless we also have a local copy created regularly.

Cloud backups, and more generally all backups, should be treated like nuclear proliferation treaties: trust, but verify!

If you periodically restore your backups, you'll catch this kind of crap when it's not yet an issue, rather than when shit has already hit the fan.


Years ago I had my side project server hacked twice. I've been security and backup paranoid ever since.

At my current startup, we have triple backup redundancy for a 500GB pg database:

1/ A Postgres streaming-replication hot standby server (which doesn't serve reads at the moment, but might in the future)

2/ WAL-level streaming backups to AWS S3 using WAL-E, which we automatically restore to our staging server every week

3/ Nightly logical pg_dump backups.

Nine months ago we only had option 3 and were hit with a database corruption problem. Restoring the logical backup took hours and caused painful downtime, as well as the loss of almost a day of user-generated content. That's why we added options 1 and 2.

I can't recommend WAL-E enough as an additional backup strategy. Restoring from a WAL (binary) backup is ~10x faster in our use case (YMMV), and the most data you can lose is about a minute's worth. As an added bonus you get the ability to roll back to any point in time, which has helped us recover data that users deleted.
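For reference, the WAL-E wiring is roughly the following (a sketch based on the WAL-E README; the env directory, data directory, and cron cadence here are assumptions, not our exact setup):

```shell
# postgresql.conf: hand every completed WAL segment to WAL-E
#   wal_level = archive            # 'replica' on PostgreSQL 9.6+
#   archive_mode = on
#   archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'

# Nightly cron job: push a full base backup to S3
envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.6/main

# Weekly staging refresh: fetch the latest base backup, then let
# Postgres replay the archived WAL on startup (via restore_command)
envdir /etc/wal-e.d/env wal-e backup-fetch /var/lib/postgresql/9.6/main LATEST
```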

We have a separate #backups Slack channel where our scripts send a message for every successful backup, along with the backup's size (in MB) and duration. This lets everyone check that backups ran, and that size and duration are increasing in an expected way.
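A minimal sketch of such a reporting step (the file path and webhook URL are hypothetical, and the actual Slack send is left commented out since it needs a real webhook):

```shell
#!/bin/sh
# Report a finished backup to Slack: size in MB plus wall-clock duration.
BACKUP_FILE=/tmp/nightly.dump
START=$(date +%s)
printf 'stand-in dump contents' > "$BACKUP_FILE"   # the real pg_dump would run here
END=$(date +%s)
SIZE_MB=$(du -m "$BACKUP_FILE" | cut -f1)
DURATION=$((END - START))
MSG="backup ok: ${SIZE_MB} MB in ${DURATION}s"
echo "$MSG"
# curl -sf -X POST -H 'Content-type: application/json' \
#      -d "{\"text\": \"$MSG\"}" "$SLACK_WEBHOOK_URL"
```

The nice side effect is that a *missing* message is itself a signal: if the channel is quiet one morning, someone notices.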

Because we restore our staging on a weekly basis, we have a fully tested restore script, so when a real restore is needed, we have a couple of people who can handle the task with confidence.

I feel like this is about as "safe" as we should be.


I agree; a backup that isn't used to populate a staging/QA instance right after being taken is untrustworthy.


Even before that there are steps you can take. For example if you take a Postgres backup with pg_dump, you can run pg_restore on it to verify it.

If a database isn't specified, pg_restore will output the SQL commands to restore the database, and the exit code will be zero (success) if it makes it through the entire backup. That tells you the original dump succeeded and there was no disk error in whatever was written. Save the file to something like S3, along with its sha256. If the hash matches after you retrieve it, you can be pretty damn sure it's a valid backup!
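Sketched as a script, assuming the custom (-Fc) dump format and a bucket name of my own invention; the pg_dump/pg_restore and aws calls are left commented so only the hash round-trip runs as-is:

```shell
#!/bin/sh
set -e
DUMP=/tmp/app.dump
# In production, the first two steps would be:
#   pg_dump -Fc appdb -f "$DUMP"
#   pg_restore "$DUMP" > /dev/null     # exit 0 means the whole archive is readable
printf 'stand-in dump bytes' > "$DUMP"   # placeholder so the hash logic runs anywhere
( cd /tmp && sha256sum app.dump > app.dump.sha256 )
# Ship both artifacts off-box (bucket is hypothetical):
#   aws s3 cp "$DUMP" s3://my-backups/ && aws s3 cp "$DUMP.sha256" s3://my-backups/
# After retrieving them later, confirm the bytes survived the round trip:
( cd /tmp && sha256sum -c app.dump.sha256 )
```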

Otherwise you get blind scripts like GitLab had, where pg_dump fails. No exit-code checking. No verification. No bueno!


Are there any guidelines on how often you should be doing restore tests?

It probably depends on the criticality of the data, but if you test, say, every two weeks, you can still end up in the OP's situation, right?

At what size/criticality should you have a daily restore test? Maybe even a rolling restore test, so you check today's backup, but then check it again every month or something?


Ideally it should be immediately after a logical backup.

For physical backups (e.g., WAL archiving), a combination of read replicas that are actively queried, rebuilt from base backups on a regular schedule, plus staging master restores on a less frequent yet still regular schedule, will give you a high level of confidence.

Rechecking old backups isn't necessary if you save the backups' hashes and can confirm they still match.


Not immediately (IMO); you should first push the backup to wherever it's being stored. Then your DB test script is the same as your DB restore script: both start by downloading the most recent backup. The things you'll catch here include, e.g., the upload process deciding to time out after pushing to S3 for an hour, silently truncating the dump, and exiting with 0 instead of failing!
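For example (a sketch: the bucket, key names, and staging database are made up, and the network calls are commented out so only the selection logic runs here):

```shell
#!/bin/sh
set -e
# One entry point for both the weekly staging refresh and a real disaster
# recovery: always start from the stored artifact, never from a local file.
latest_key() {
    # In production: aws s3 ls s3://my-backups/nightly/ | awk '{print $4}' | sort | tail -n1
    # Stand-in listing so the selection logic runs anywhere:
    printf 'backup-001.dump\nbackup-002.dump\n' | sort | tail -n1
}
LATEST=$(latest_key)
echo "restoring $LATEST"
# aws s3 cp "s3://my-backups/nightly/$LATEST" /tmp/restore.dump
# pg_restore --clean --if-exists -d staging /tmp/restore.dump
```

Because the download step is shared, a truncated or missing upload breaks the staging refresh loudly, days before anyone needs the backup for real.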


Wow, I'm sorry you experienced that. This points to the importance of regularly testing your backups. I hope AWS will offer an automated testing capability at some point in the future.

In the meantime, I hope you've developed automation to test your backups regularly. You could just launch a new RDS instance from the latest nightly snapshot, and run a few test transactions against it.
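A hedged sketch of what that automation could look like with the AWS CLI (the instance identifiers and the smoke query are assumptions; everything that needs real AWS credentials is left commented out):

```shell
#!/bin/sh
# Restore last night's automated snapshot into a throwaway instance,
# run a smoke query, then tear it down. Names are hypothetical.
DB_ID="mydb"
TEST_ID="restore-test-$(date +%s)"
echo "restore test instance: $TEST_ID"
# SNAPSHOT=$(aws rds describe-db-snapshots --db-instance-identifier "$DB_ID" \
#     --snapshot-type automated \
#     --query 'DBSnapshots[-1].DBSnapshotIdentifier' --output text)
# aws rds restore-db-instance-from-db-snapshot \
#     --db-instance-identifier "$TEST_ID" --db-snapshot-identifier "$SNAPSHOT"
# aws rds wait db-instance-available --db-instance-identifier "$TEST_ID"
# psql -h "$TEST_ENDPOINT" -c 'SELECT count(*) FROM users;'   # smoke query; table assumed
# aws rds delete-db-instance --db-instance-identifier "$TEST_ID" --skip-final-snapshot
```

Run it nightly from cron and alert if any step exits non-zero, and you've turned the OP's two-day scramble into a routine check.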


This is certainly true of all backups to an extent, not just the cloud. Back in the days of backing up to external tape storage, it was important to test restores in case heads weren't calibrated, or were calibrated differently between tape machines, etc.

I'm curious: did you manage to automate a restore smoke test after going through this?


Snapshots are not backups, although many people use them as backups and believe they are good backups. Snapshots are snapshots. Only backups are backups.


What is the difference exactly?


A snapshot could be a backup depending on what you're calling a snapshot, but yeah, in general, to be a backup things need to have these features:

1. stored on separate infrastructure so that obliteration of the primary infrastructure (AWS account locked out for non-payment, password gets stolen and everything gets deleted, datacenter gets eaten by a sinkhole, etc.) doesn't destroy the data.

2. offline, read-only. This is where most people get confused.

Backups are unequivocally NOT a live mirror like RAID 1, a slightly-delayed replication setup like most databases provide, or a double-write system. These aren't backups because they make it impossible to recover from human errors, which include obvious things like dropping the wrong table, but also less obvious things, like a subtle bug that corrupts or damages some records and may take days or weeks to notice. Your standbys/mirrors will faithfully copy both the obvious and the non-obvious mistakes before you have a chance to stop them.

This is one of the most important things to remember. Redundancy is not backup. Redundancy is redundancy and it primarily protects against hardware and network failures. It's not a backup because it doesn't protect against human or software error.

3. regularly verified by real-world restoration cases; backups can't be trusted until they're confirmed, at least on a recurring, periodic basis. Automated alarms and monitoring should be used to validate that the backup file is present and that it is within a reasonable size variance between human-supervised verifications. Automatic logical checksums like those suggested by some other users in this thread (e.g., run pg_restore on a pg_dump to make sure that the file can be read through) are great too and should be used whenever available.

4. complete, consistent, and self-contained archive up to the timestamp of the backup. Differential/incremental backups count as long as the full chain needed for a restoration is present.

This excludes COW filesystem snapshots, etc., because they're generally dependent on many internal objects dispersed throughout the filesystem; if your FS gets corrupted, it's very likely that some of the data referenced by your snapshots will be corrupted too (snapshots are only possible because COW semantics mean that the data does not have to be copied, just flagged as in use in multiple locations). If you can export the COW FS snapshot as a whole, self-contained unit that can live separately and produce a full and valid restoration of the filesystem, then that exported thing may be a backup, but the internal filesystem-local snapshot isn't (see also point 1).


Backups protect against bugs and operator errors and belong on a separate storage stack to avoid all correlation, ideally on a separate system (software bugs) with different hardware (firmware and hardware bugs), in a different location.


The purpose of a backup is to avoid data loss in scenarios included in your risk analysis. For example, your storage system could corrupt data, or an engineer could forget a WHERE clause in a delete, or a large falling object hits your data center.

Snapshots will help you against human error, so they are one kind of backup (and often very useful), but if you do not at least replicate those snapshots somewhere else, you are still vulnerable to data corruption bugs or hardware failures in the original system. Design your backup strategy to meet your requirements for risk mitigation.


I'd also add not just different location but different account.

If your cloud account, datacenter/colo, or office is terminated, hacked, burned down, or swallowed by a sinkhole, you don't want your backups going with it.

Cloud especially: even if you're on AWS and have your backups in Glacier + S3 with replication to 7 datacenters on 3 continents... if your account goes away, so do your backups (or at least your access to them).


RDS snapshots are backups. They are copied to S3 storage, which is replicated across 3 datacenters within a region.



