> We got a shitload of AWS credits for our trouble, but the company obviously we...

andruby · on Feb 11, 2017

Years ago I had my side project server hacked twice. I've been security and backup paranoid ever since.

At my current startup, we have triple backup redundancy for a 500GB pg database:

1/ A Postgres streaming replication hot standby server (which doesn't serve reads at the moment, but might in the future)

2/ WAL-level streaming backups to AWS S3 using WAL-E, which we automatically restore to our staging server every week

3/ Nightly logical pg_dump backups.

9 months ago we only had option 3 and were hit with a database corruption problem. Restoring the logical backup took hours and caused painful downtime as well as the loss of almost a day of user-generated content. That's why we added options 1 and 2.

I can't recommend WAL-E enough as an additional backup strategy. Restoring from a WAL (binary) backup is ~10x faster in our use case (YMMV) and the most data you can lose is about 1 minute. As an additional bonus you get the ability to roll back to any point in time. This has helped us recover user-deleted data.

We have a separate Slack #backups channel where our scripts send a message for every successful backup, along with the backup size (in MB) and duration. This lets everyone check that backups ran, and that size and duration are growing as expected.
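A minimal sketch of what such a notification script might look like, assuming a Slack incoming webhook (which accepts a JSON body with a `text` field). The function names, the message wording, and the webhook URL are hypothetical and not taken from the setup described above.

```python
import json
import urllib.request


def format_backup_message(name: str, size_bytes: int, duration_seconds: float) -> str:
    """Build a one-line status message with the size in MB and the duration."""
    size_mb = size_bytes / (1024 * 1024)
    minutes, seconds = divmod(int(duration_seconds), 60)
    return f"Backup OK: {name}, {size_mb:.1f} MB, {minutes}m {seconds:02d}s"


def post_to_slack(webhook_url: str, text: str) -> None:
    """POST the message as JSON to a Slack incoming webhook.

    urlopen raises an exception on HTTP errors, so a failed post is not silent.
    """
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()
```

Posting only on success means a missing message is the alarm, so someone still has to notice the silence. Keeping size and duration in every message makes a sudden drop or jump easy to spot by eye.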

Because we restore our staging on a weekly basis, we have a fully tested restore script, so when a real restore is needed, we have a couple of people who can handle the task with confidence.

I feel like this is about as "safe" as we should be.

voidlogic · on Feb 11, 2017

I agree, the backup that isn't used to populate a stage/qa instance right after being taken is untrustworthy.

koolba · on Feb 11, 2017

Even before that there are steps you can take. For example, if you take a Postgres backup with pg_dump, you can run pg_restore on it to verify it.

If a database isn't specified, pg_restore will output the SQL commands to restore the database and the exit code will be zero (success) if it makes it through the entire backup. That lets you know that the original dump succeeded and there was no disk error for whatever was written. Save the file, along with its sha256, to something like S3. If the hash matches after you retrieve it, you can be pretty damn sure that it's a valid backup!
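A rough sketch of that exit-code check, assuming the dump was written in an archive format pg_restore can read (for example `pg_dump -Fc`). The function name and the `restore_cmd` parameter are hypothetical. The SQL is sent to the null device with `--file`, because newer pg_restore releases require an explicit output target when no database is given.

```python
import os
import subprocess


def dump_is_readable(dump_path, restore_cmd=("pg_restore",)):
    """Return True if pg_restore reads the whole archive and exits with 0.

    No database is given, so nothing is restored; the generated SQL is
    written to the null device. restore_cmd can be overridden for testing.
    Raises FileNotFoundError if the command itself is not installed.
    """
    result = subprocess.run(
        [*restore_cmd, "--file", os.devnull, dump_path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
    )
    return result.returncode == 0
```

A backup script would call this right after pg_dump finishes and refuse to upload or report success when it returns False.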

Otherwise you get blind scripts like the ones GitLab had, where pg_dump fails. No exit code checking. No verification. No bueno!

saganus · on Feb 11, 2017

Are there any guidelines on how often you should be doing restore tests?

It probably depends on the criticality of the data, but if you test, say, every 2 weeks, you can still fall into the OP's case, right?

At what size/criticality should you have a daily restore test? Maybe even a rolling restore test, so you check today's backup, but then check it again every month or something?

koolba · on Feb 11, 2017

Ideally it should be immediately after a logical backup.

For physical backups (e.g. WAL archiving), a combination of read replicas that are actively queried against and rebuilt from base backups on a regular schedule, plus staging master restores on a less frequent yet still regular schedule, will give you a high level of confidence.

Rechecking old backups isn't necessary if you save the hashes of the backups and can verify that they still match.

x0x0 · on Feb 11, 2017

Not immediately (imo); you should first push the backup to wherever it's being stored. Then your db test script is the same as your db restore script: both start by downloading the most recent backup. The things you'll catch here are, e.g., the process that uploads the dump to S3 timing out after an hour of uploading and, instead of failing, silently truncating the file and exiting with 0!
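One way to catch that kind of silent truncation, sketched under the assumption that the backup job recorded the dump's byte size and sha256 at creation time and that the test/restore script has already downloaded the file locally. The function names are hypothetical.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash the file in chunks so large dumps are never loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloaded_backup(path, expected_size, expected_sha256):
    """Raise ValueError if the downloaded file is truncated or altered."""
    actual_size = os.path.getsize(path)
    if actual_size != expected_size:
        raise ValueError(
            f"size mismatch: expected {expected_size} bytes, got {actual_size}"
        )
    if sha256_of_file(path) != expected_sha256:
        raise ValueError("sha256 mismatch")
```

The size comparison runs first because it is cheap and already catches a truncated upload; the hash covers corruption that leaves the size unchanged.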