
> ZFS is notorious for corrupting itself when bit flips hit it and break the checksum on disk.

What's a bit flip?



Sometimes data on disk and in memory gets randomly corrupted. For a pretty amazing example, check out "bitsquatting"[1]--it's like domain name squatting, but instead of typos, you squat on domains that would be looked up in the case of random bit flips. These can occur due, e.g., to cosmic rays. On disk, HDDs and SSDs can produce the wrong data. It's uncommon to see actual invalid data rather than have an IO fail on ECC, but it certainly can happen (e.g. due to firmware bugs).

[1]: https://en.wikipedia.org/wiki/Bitsquatting


Basically it's that memory changes out from under you. As we know, computers use binary, so everything boils down to a 0 or a 1. A bit flip is what was, say, a 0 spontaneously becoming a 1, or vice versa.

Usually attributed to "cosmic rays", but it can really happen for any number of less exciting-sounding reasons.
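To make it concrete (a toy illustration, not tied to any real failure), toggling a single bit with XOR is enough to turn one character into another:

```python
# A single bit flip, simulated by XOR-ing one bit of a byte.
original = ord("A")               # 0x41 = 0b01000001
flipped = original ^ 0b00100000   # flip bit 5
print(chr(original), "->", chr(flipped))  # A -> a
```

One flipped bit turned "A" into "a"; in a pointer or a length field the same flip can be far less benign.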

Basically, there is zero double-checking in your computer for almost everything except stuff that goes across the network. Memory and disks are not checked for correctness, basically ever, on any machine anywhere. Many servers (but certainly not all) are the rare exception when it comes to memory safety. They usually have ECC (Error-Correcting Code) memory, basically a checksum on the memory, which ensures that if memory is corrupted, it's noticed and fixed.
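The "noticed and fixed" part is the key difference between a checksum and an error-correcting code. As a sketch of the idea (real DDR ECC uses wider SECDED codes, not this), the classic Hamming(7,4) code shows how a single flipped bit can be located and repaired, not just detected:

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit Hamming(7,4) codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]   # positions 1..7

def hamming74_correct(c):
    """Recompute parity; the syndrome is the 1-based index of a flipped bit."""
    c = c[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]        # checks positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]        # checks positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]        # checks positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1              # flip the bad bit back
    return c

word = hamming74_encode([1, 0, 1, 1])
corrupted = word[:]
corrupted[4] ^= 1                          # simulate a cosmic-ray bit flip
assert hamming74_correct(corrupted) == word
```

A plain checksum could only tell you the word was bad; the extra parity bits here pinpoint which bit flipped, which is why ECC memory can repair the error transparently.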

Essentially every filesystem everywhere does zero data integrity checking:

  MacOS APFS: Nope
  Windows NTFS: Nope
  Linux EXT4: Nope
  BSD's UFS: Nope
  Your mobile phone: Nope
ZFS is the rare exception among file systems in that it actually double-checks that the data you save to it is the data you get back from it. Every other filesystem is just a big ball of unknown data. You probably get back what you put in, but there are zero promises or guarantees.
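What a checksumming filesystem adds can be sketched in a few lines: store a hash alongside the data at write time, and verify it at read time. This is a toy model of the idea only, not ZFS's actual on-disk layout (ZFS stores checksums in parent blocks of a Merkle tree):

```python
import hashlib

def write_block(data: bytes):
    # Store the data together with a checksum of it.
    return {"data": data, "checksum": hashlib.sha256(data).hexdigest()}

def read_block(block):
    # On read, recompute and compare; a mismatch means silent corruption.
    if hashlib.sha256(block["data"]).hexdigest() != block["checksum"]:
        raise IOError("checksum mismatch: data corrupted at rest")
    return block["data"]

block = write_block(b"important document")
block["data"] = b"important dAcument"   # simulate on-disk bit rot
try:
    read_block(block)
except IOError as e:
    print("caught:", e)
```

A non-checksumming filesystem is the same code without the comparison: it hands back whatever bytes the disk returned.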


> disks are not checked for correctness, basically ever on any machine anywhere.

I'm not sure that's really accurate -- all modern hard drives and SSDs use error-correcting codes, as far as I know.

That's different from implementing additional integrity checking at the filesystem level. But it's definitely there to begin with.


But SSDs (to my knowledge) only implement a checksum for the data transfer. It's a requirement of the protocol. So you can be sure that the stuff in memory, and the checksum computed by the CPU, arrive exactly like that at the SSD. In the past this was a common source of errors with faulty hardware RAID.

But there is ABSOLUTELY NO checksum for the bits stored on an SSD. So bit rot in the cells of the SSD goes undetected.


That is ABSOLUTELY incorrect. SSDs have enormous amounts of error detection and correction built in, explicitly because errors on the raw medium are so common that without it you would never be able to read correct data from the device.

It has been years since I was familiar enough with the insides of SSDs to tell you exactly what they are doing now, but even ~10-15 years ago it was normal for each raw 2k block to actually be ~2176+ bytes and use at least 128 bytes for LDPC codes. Since then the block sizes have gone up (which reduces the number of bytes you need to achieve equivalent protection) and the lithography has shrunk (which increases the raw error rate).

Where exactly the error correction is implemented (individual dies, SSD controller, etc) and how it is reported can vary depending on the application, but I can say with assurance that there is no chance your OS sees uncorrected bits from your flash dies.
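As a back-of-the-envelope check on the figures above (the ~2176-byte raw block and 128 parity bytes are the numbers from this comment, not a current datasheet):

```python
data_bytes = 2048   # logical 2k block
raw_bytes = 2176    # raw block including spare area (figure quoted above)
ecc_bytes = 128     # bytes reserved for LDPC parity

spare = raw_bytes - data_bytes      # spare area available per block
overhead = ecc_bytes / data_bytes   # parity as a fraction of user data
print(f"{spare} spare bytes, {overhead:.2%} parity overhead")
```

That's roughly 6% of the raw medium spent purely on correcting the medium's own errors, which gives a sense of how far from error-free raw flash actually is.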


> I can say with assurance that there is no chance your OS sees uncorrected bits from your flash dies.

While true, there are zero promises that what you meant to save and what gets saved are the same thing. All the drive mostly promises is that if it safely wrote XYZ to the disk and you come back later, you should expect to get XYZ back.

There are lots of weasel words there on purpose. In reality there is generally zero guarantee, and drives lie all the time about data being safely written to disk, even when it wasn't. This means that on power failure/interruption, the outcome of being able to read XYZ back is 100% unknown. Drive manufacturers make zero promises here.

On most consumer compute, there are no promises or guarantees that what you wrote on day 1 will be there on day 2+. It mostly works, and the chances are better than even that your data will be mostly safe on day 2+, but there are zero promises or guarantees. We know how to guarantee it; we just don't bother (usually).

You can buy laptops and desktops with ECC RAM and use ZFS (or another checksumming FS), but basically nobody does. I'm not aware of any mobile phones that offer either option.


> While true, there are zero promises that what you meant to save and what gets saved are the same thing. All the drive mostly promises is that if it safely wrote XYZ to the disk and you come back later, you should expect to get XYZ back.

I'm not really sure what point you're trying to make. It's using ECC, so they should be the same bytes.

There isn't infinite reliability, but nothing has infinite reliability. File checksums don't provide infinite reliability either, because the checksum itself can be corrupted.
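The point that the checksum itself can be corrupted is easy to demonstrate: a verifier can only report that data and checksum disagree, not which side is wrong. A toy sketch with CRC-32:

```python
import zlib

data = b"hello"
stored_crc = zlib.crc32(data)   # checksum recorded at write time

# Case 1: the data is corrupted -> verification fails.
assert zlib.crc32(b"hellp") != stored_crc

# Case 2: the stored checksum itself is corrupted -> verification
# also fails, indistinguishably from case 1 at read time.
assert zlib.crc32(data) != stored_crc ^ 1
```

Either way the mismatch is detected, so data isn't silently wrong; but detection is probabilistic too, since distinct inputs can collide on any fixed-size checksum.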

You keep talking about promises and guarantees, but there aren't any. All we have are statistical rates of reliability. Even ECC RAM or file checksums don't offer perfect guarantees.

For daily consumer use, the level of ECC built into disks is generally sufficient. It's chosen to be so.


I would disagree that disks alone are good enough for daily consumer use. I see corruption often enough to be annoying on consumer-grade hardware without ECC and ZFS. Small images are where people usually notice: they tend to be heavily compressed, and their small size means minor changes are more noticeable. In larger files, corruption tends not to get noticed as much, in my experience.

We have 10k+ consumer devices at work, and corruption is not exactly common, but it's not rare either. A few cases a year are usually identified at the helpdesk level. It seems to be going down over time, since hardware is getting more reliable, we have a strong replacement program, and most people don't store stuff locally anymore. Our shared network drives all live on machines with ECC and ZFS.

We had a cloud provider recently move some VMs to new hardware for us. The ones with ZFS filesystems noticed corruption; the ones with ext4/NTFS/etc. filesystems didn't notice any. We made the provider move them all again, and the second time around ZFS came up clean. Without ZFS we would never have known, as none of the ext4/NTFS filesystems complained at all. Whether all the ext4/NTFS machines were corruption-free is anyone's guess.


All MLC SSDs absolutely do data checksums and error recovery; otherwise they would lose your data much more often than they do.

You can see some stats using `smartctl`.


Yes, the disk mostly promises that what you write there will be read back correctly, but that's at the disk level only. The OS, filesystem, and memory generally do no checking, so any errors at those levels will propagate. We know it happens; we just mostly choose not to do anything about it.

My point was: on most consumer compute, there are no promises or guarantees that what you see on day 1 will be there on day 2. It mostly works, and the chances are better than even that your data will be mostly safe on day 2, but there are zero promises or guarantees, even though we know how to provide them. Some systems do: those with ECC memory and ZFS, for example. Other filesystems also support checksumming, with BTRFS being the most common counter-example to ZFS, even though parts of BTRFS are still broken (see their status page for details).


"Basically, there is zero double checking in your computer for almost everything except stuff that goes across the network."

This is so not true.

All the high-speed buses (QPI, UPI, DMI, PCIe, etc.) have "bit flip" protection in multiple layers: differential-pair signaling, 8b/10b (or higher) encoding, and packet CRCs.

Hard drives (the old spinning rust kind) store data along with a CRC.

SSD/NVMe drives use strong ECC because raw flash memory flips so many bits that it is unusable without it.
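The principle behind that ECC (though real drives use far more efficient LDPC/BCH codes, not this) can be sketched with the simplest possible error-correcting code: store each bit three times and take a majority vote on read:

```python
def encode(bit):
    # 3x repetition code: store each bit in three cells.
    return [bit] * 3

def decode(triple):
    # Majority vote: any single flipped cell is outvoted.
    return 1 if sum(triple) >= 2 else 0

stored = encode(1)
stored[0] ^= 1              # one flash cell flips
assert decode(stored) == 1  # the flip is corrected on read
```

Real codes get the same single-error correction for a few percent of overhead instead of 200%, which is what makes unreliable raw flash usable at all.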

If most filesystems don't do integrity checks it's probably because there's not much need to.


I agree that transfers (like the high-speed buses) have checks to ensure transfers happen properly, but that doesn't help much if the data is/was corrupted on either side.

> If most filesystems don't do integrity checks it's probably because there's not much need to.

I would disagree that disks alone are good enough for daily consumer use. I see corruption often enough to be annoying on consumer-grade hardware without ECC and ZFS. Small images are where people usually notice: they tend to be heavily compressed, and their small size means minor changes are more noticeable. In larger files, corruption tends not to get noticed as much, in my experience.

We have 10k+ consumer devices at work, and corruption is not exactly common, but it's not rare either. A few cases a year are usually identified at the helpdesk level. It seems to be going down over time, since hardware is getting more reliable, we have a strong replacement program, and most people don't store stuff locally anymore. Our shared network drives all live on machines with ECC and ZFS.

We had a cloud provider recently move some VMs to new hardware for us. The ones with ZFS filesystems noticed corruption; the ones with ext4/NTFS/etc. filesystems didn't notice any. We made the provider move them all again, and the second time around ZFS came up clean. Without ZFS we would never have known, as none of the ext4/NTFS filesystems complained at all. Whether all the ext4/NTFS machines were corruption-free is anyone's guess.


Btrfs and bcachefs both have data checksumming. I think ReFS does as well.


Yes, ZFS is not the only filesystem with data checksumming and guarantees, but it's one of the very rare exceptions that do.

ZFS has been in production workloads since 2005, 20 years now. It's proven to be very safe.

BTRFS has known fundamental issues beyond a single disk. It is, however, improving. I will say BTRFS is fine for a single drive. Even the developers, last I checked (a few years ago), don't really recommend it past a single drive, though hopefully that's changing over time.

I'm not familiar enough with bcachefs to comment.



