Question for ZFS experts: the docs say ([1]) that I can change the checksum used, and I want to change it to BLAKE3.
However, based on [2], it seems the block doesn't store the type of checksum used, so it looks like an all-or-nothing thing instead of a setting that only applies to new blocks.
Can I actually change the checksum? What happens when I do?
Edit: It appears the type of checksum actually is stored per-block, in `blk_prop` ([3]) of `blkptr_t` ([4]).
Edit 2: The manpage says that the change only applies to new data, so yes, it's safe.
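Since the algorithm is tagged per block, a pool can hold blocks with different checksums side by side. A quick sketch of the bit layout, with offsets taken from the `BF64_GET` macros in `include/sys/spa.h` (treat the exact offsets and the BLAKE3 enum value as assumptions to verify against your ZFS version):

```shell
# blk_prop packs several fields into one 64-bit word: the checksum
# algorithm is the 8-bit field at bits 40-47 (BP_GET_CHECKSUM), the
# compression algorithm the 7-bit field at bits 32-38 (BP_GET_COMPRESS).
blk_prop=$(( (14 << 40) | (15 << 32) ))   # fabricated example value
checksum=$(( (blk_prop >> 40) & 0xFF ))   # 14: BLAKE3 in recent OpenZFS
compress=$(( (blk_prop >> 32) & 0x7F ))   # 15: example compression id
echo "checksum=$checksum compress=$compress"
# Flipping the property only tags blocks written afterwards:
#   zfs set checksum=blake3 pool/dataset
```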
A lot of really nice, focused polish items in this release that have real practical benefits. Take the corrective "healing" receive: ZFS has always been able to self-heal when there is redundancy (either at the device level or via the copies attribute), and that's always been a core feature. In its original incarnation there'd always be at least some redundancy. But as it gets wider use, the fact is the vast majority of systems have a single device and don't want to cut available storage in half with copies=2. Corruption can still be detected with a scrub thanks to checksums, so ZFS is still useful anyway. It's better to know about corruption asap, while you still have good backups. But being able to smoothly and very quickly repair that without any local redundancy, using backups you should have anyway, is a nice bit of polish and leveraging of the foundation.
One question if anyone who follows it more closely knows: are there any efforts to work specifically on ZVOL performance? They're very handy for iSCSI and other features, but my understanding is they haven't gotten much focus for quite a while now. Maybe there just isn't any real company backing for that R&D right now, but I hope it gets some attention eventually.
> But as it gets wider use, the fact is the vast majority of systems have a single device and don't want to cut available storage in half with copies=2.
Setting copies=2 for the root filesystem, but segmenting stuff like caches, user downloads, etc into separate datasets with copies=1, is totally reasonable for this purpose. The base system in most distributions only takes a few GiB. 1 TiB SSDs are pretty common even on entry level laptops these days, and it's becoming increasingly difficult to find new machines without at least 512 GiB.
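A sketch of that layout (pool and dataset names are made up; this assumes an existing pool). Note that `copies` only applies to blocks written after the change:

```shell
zfs set copies=2 tank/ROOT                  # base system: two copies of every block
zfs create -o copies=1 tank/var/cache       # caches: one copy is plenty
zfs create -o copies=1 tank/home/downloads  # ditto for downloads
zfs get -r copies tank                      # verify what applies where
```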
A scrub can both detect and repair damage. You don't need to set additional copies. If you don't want to buy 4TB of drives to store 2TB of data in a mirror, use a raidz instead.
In a raidz with at least one parity disk (and any number of data disks), corruption found on any data disk can be repaired immediately. For non-overlapping corruption a single parity disk is enough. If you have corruption of the same block on two different disks (extremely unlikely) then this can be repaired if you have two parity disks. (Etc. for three identically-corrupt data disks and three parity disks)
You can have e.g. two parity disks and eight data disks (or more), and so long as no more than two disks are corrupted in the same block at the same time, a scrub will repair it completely.
The system of parity disks in ZFS' raidz is meant for reconstructing entire failed data disks, so fixing corruption is no big deal.
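The arithmetic above is easy to sanity-check; a tiny sketch (disk counts are just the example from the comment, and raidz allocation/padding overhead is ignored):

```shell
# Usable fraction of raw capacity for a raidz vdev with D data disks
# and P parity disks.
D=8 P=2
echo "usable: $(( 100 * D / (D + P) ))% of raw capacity (a mirror gives 50%)"
# Per stripe, raidz can reconstruct as many corrupt/failed disks as it
# has parity disks:
echo "repairable corruptions per stripe: $P"
```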
> are there any efforts to work specifically on ZVOL performance
There's this[1] work, which integrated ZFS better with the kernel so it could merge small IOs on zvols. It was merged almost a year ago. However, due to a recent issue[2], it has been disabled until they can figure out what exactly is going on. If they can get it working again, it seemed to give a decent boost in certain scenarios.
> But as it gets wider use, the fact is the vast majority of systems have a single device and don't want to cut available storage in half with copies=2
I'm still patiently holding out for the first (current, usable) filesystem that lets us set a parity fraction and uses FEC to heal small errors.
I suppose in principle you could partition a drive n ways then run RAIDZ1? Though performance would probably be mediocre and there'd be some oddities.
It occurs to me now that, as a practical matter, raw storage growth may solve this for a lot of people. The pace of increase in storage/$ still seems steady, which in turn means more and more people trivially have their total needs exceeded. At some point it may be perfectly reasonable to throw half of it at redundancy?
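The partition-the-drive idea above might look something like this (placeholder device name; requires ZFS and an empty disk, and a whole-drive failure would still lose everything):

```shell
DISK=/dev/sdX   # placeholder device; all data on it will be destroyed
parted -s "$DISK" mklabel gpt \
    mkpart p1 0% 25%  mkpart p2 25% 50% \
    mkpart p3 50% 75% mkpart p4 75% 100%
# One raidz1 vdev across the four partitions of the same disk:
zpool create onedisk raidz1 "${DISK}1" "${DISK}2" "${DISK}3" "${DISK}4"
```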
If you want repair, you need to either set more than one copy in ZFS (still pretty dumb) or use multiple drives in a raidz (or mirror, though raidz is more efficient).
When you try to partition a drive like this (or configure ZFS to write multiple copies), you lose everything in the event that the whole drive fails. Honestly, it's a far better plan to just use multiple drives in a raidz so that whole-drive failures can be reconstructed in addition to data corruption.
Please read the thread before commenting. The scenarios we're discussing are explicitly in the context of single storage device systems getting helped by the new healing replication feature. Huge numbers of systems do not have multiple drives nor even necessarily the ability to have multiple drives. Nobody is suggesting it's optimal. I've run ZFS for 12+ years now with hundreds of systems. You don't need to tell us the obvious.
I'm always impressed w/ ZFS. I really like the look of:
> Corrective "zfs receive" (#9372[0]) - A new type of zfs receive which can be used to heal corrupted data in filesystems, snapshots, and clones when a replica of the data already exists in the form of a backup send stream.
Does this mean I can use the overlay2 driver with docker instead of the ZFS one? I've had many people tell me this was possible in the past when it definitely wasn't.
The performance and reliability of the ZFS driver leave a lot to be desired. The last time I ran Docker on a ZFS partition I actually used a zvol formatted as ext4 just to avoid these issues.
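That workaround looks roughly like this (pool and mountpoint names are made up; assumes an existing pool):

```shell
zfs create -s -V 50G tank/docker       # sparse 50 GiB zvol
mkfs.ext4 /dev/zvol/tank/docker        # plain ext4 on top of it
mount /dev/zvol/tank/docker /var/lib/docker
# Docker now sees ext4 and can use overlay2 instead of the zfs driver.
```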
I once wrote a line of shell piping zfs into awk while walking the tree of datasets, but it was so slow I decided I didn't need to after all. I wonder what a filesystem would look like that made the performance of meta-reflection a primary concern.
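A portable sketch of that kind of pipeline; the dataset list is faked with printf so it runs without a pool, but against a real one you'd feed it `zfs list -H -r -o name <pool>` instead:

```shell
# Stand-in for `zfs list -H -r -o name tank`:
printf '%s\n' tank tank/home tank/home/alice tank/var tank/var/cache |
awk -F/ '{
    indent = ""
    for (i = 1; i < NF; i++) indent = indent "  "   # two spaces per level
    print indent $NF                                # leaf name, indented by depth
}'
```

This prints the dataset tree with each child indented under its parent.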
Somebody chose to use a library as an abstraction that looks good but is implemented as an MVP (nothing wrong with that). "In the future, we hope to work directly with libzfs" should have raised an alarm somewhere, though.
Having a Proxmox-based homelab setup, I was waiting quite a while for Linux container support so that Docker inside LXC containers just works. It was quite inconvenient when even a pull of a simple image would freeze. Hopefully now a docker-composed single logical set of Docker containers (or just a single Docker container) per LXC container will just work without VM overhead.
I've recently tried ZFS on an NVMe drive in a USB enclosure. The USB connection is fragile, and when rsyncing data onto it, it dis- and reconnects, after which the zpool is in the SUSPENDED state and only a reboot helps :(.
But the issue isn't really at storage level, it's about the filesystem driver not dealing with it. Most modern journaling file systems don't really care if my drive intermittently disconnects, all that will happen is some failed file operations, and at worst the necessity to run a quick filesystem check. With ZFS's emphasis on storage being fallible (checksums to detect silent data corruption etc) it's not weird to expect it to also handle connection issues at least as well as other file systems.
There's fuck all that ZFS can do if the drive lies about what has or hasn't been written yet, journal or no journal. USB devices are known to do this all the time.
If the drive lies and says that part of the journal has been written when it hasn't yet, and ZFS goes ahead and writes the next part of the journal, then when you unplug the drive and the first part of the journal goes away (which the 2nd depended on) you're hosed. There have to be places where ZFS blocks until something critical has definitely, absolutely, been written to disk.
At that point the only thing ZFS can do is try to unwind back to whatever it thinks is a consistent state, but this isn't 100% guaranteed. (it depends on what old data is still hanging around)
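A userspace analogue of that "block until it is really on disk" step, as a sketch of the ordering idea (not ZFS's actual code):

```shell
echo "journal entry" >> journal   # step 1: record the intent
sync                              # barrier: block until the kernel reports
                                  # everything is on stable storage
echo "payload" > data             # step 2: the write that depends on step 1
sync
# If the device acknowledged the first sync without persisting it, the
# ordering promise is silently broken and no filesystem can fix that.
```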
I have a machine with dual-boot: Kali Linux (I think I used ext4 for it) and Windows 11 (NTFS). Due to space constraints I mount /var/cache/apt on the NTFS partition. Whenever I forget to cleanly shut down Windows 11, though, I cannot mount that partition because of potential data loss, and I first need to boot back into Windows 11. This used to never happen. But you know what? I'm OK with it. Heck, I should probably run both OSes on ZFS. Although the Windows 11 partition is just to play around with Windows 11; nothing serious on it.
To be clear, I don't mind it getting suspended when there is a hardware failure. What I don't understand is why there isn't a sane way (i.e. some command sequence) to re-attach it.
Otherwise the feature set seems great: both a volume manager and a file system, seamless compression, deduplication, COW and snapshots - what's not to love :)
> why there isn't a sane way (i.e. some command sequence) to re-attach it.
But there is. It's `zpool clear <poolname>`.
That said, you probably need to have created or imported the pool using stable device names (e.g. `zpool import -d /dev/disk/by-id` or `by-part-uuid`, for example). Otherwise, when reattaching the USB device it might get assigned a different device name (e.g. `/dev/sdb` instead of `/dev/sda`) and ZFS might think the device is still unavailable.
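Putting the two together, a recovery sequence might look like this (the pool name is hypothetical):

```shell
zpool status -x                    # confirm which pool is SUSPENDED
zpool clear mypool                 # ask ZFS to resume I/O on the pool
# If the device reappeared under a different name, re-import by stable id:
zpool export mypool
zpool import -d /dev/disk/by-id mypool
```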
Yeah, super reliable in the enterprise. It's not designed around hanging a single SSD off your RPi with a five-dollar USB-to-M.2 adapter. It's designed around having a disk shelf or at the very least a mirror.
There's fuck all that ZFS can do if hardware lies about what has or hasn't been written to disk yet. There have to be points where ZFS blocks until something critical has been absolutely, totally, 100% written to disk before continuing. Unfortunately, USB devices lie all the time.
I've been using ZFS for over a decade and love it, but it was created as an "enterprise server filesystem" and in certain areas like this it really shows.
That said, have you tried something like this[1]? Also, what device name did you use when creating the pool? Using one of the /dev/disk/by-* paths that doesn't change when you reconnect would make it a lot smoother, I imagine.
I often find myself wishing for this block cloning thing, but with rsync hooks.
Like suppose you're copying a large genome from A to B, and you already have a genome from that species (but a different organism) lying around on B.
Sure, you could clone and then rsync on top of the clone to avoid transferring the common bits again. But that's forethought users often don't have. Better to use the rolling-hash metadata that rsync generates as a query into all possible targets and pick one automatically, so the user doesn't have to think about it and just sees a really fast copy.
I wonder what holds back a ZFS-level offline dedupe function now that that's implemented since you could already basically write a shell script to do something like it.
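The detection half of such a pass is indeed scriptable today; a sketch (GNU coreutils assumed; the actual clone step would be filesystem-specific, e.g. a reflink-style operation, and is left out here):

```shell
# Group identical files by content hash. uniq -w64 compares only the
# 64 hex chars of the sha256 digest at the start of each line.
find_dupes() {
    find "$1" -type f -exec sha256sum {} + |
    sort |
    uniq -w64 --all-repeated=separate   # blank line between duplicate groups
}
# find_dupes /tank/data    # hypothetical dataset mountpoint
```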
I tried it a few months ago and ReFS ate my data. No indication of why in the event logs or SMART data. It had IsPowerProtected set because I have a UPS, and I had an unclean restart. I would expect it to lose data, but not to corrupt the filesystem metadata. I had a backup of the data but wanted some recent changes. Refsutil (the official Microsoft tool) didn't help because it has not been updated for the newest ReFS version. I couldn't read most files because I had integrity enabled and the files failed the check. Hetman's Data Recovery was able to recover most of the data. In later testing I found out that IsPowerProtected is just very unsafe. I have since put some time into testing and sometimes fixing https://github.com/openzfsonwindows/openzfs ; it is not ready for use yet, but it is making great progress.
It might be easy to extend fdupes and jdupes to do this without much effort. I haven't seen the API/syscall invoked, but I use them with btrfs for a specific use case where I have a lot of known duplicates.
I use both, for different use cases. The flexibility of BTRFS is hard to beat, I've yet to have a single data loss issue (5ish years, multiple machines), and for low resource systems (like an RPi3) it's a great choice.
I have to say, as a Linux user, I find the ZFS tooling surprisingly difficult, and I've had far more issues with Linux on a ZFS root than with BTRFS. YMMV. On my one BSD machine I currently have a USB-attached zpool that seems completely frozen in spite of reboots for over a week; all zpool and zfs commands hang indefinitely with no output (even though the machine is otherwise working fine). No idea what to make of that; it was working fine for 3+ years previously.
That said, people far more knowledgeable than me far prefer ZFS, so I keep my really important long-term storage and backups on a RAIDZ2 array.
Both are CoW filesystems with different strengths and weaknesses. BTRFS is part of the kernel, has interesting integrations with systemd (and possibly better Linux tooling in general), and has more flexibility with adding and removing disks (for now). ZFS has (arguably) better reliability and has more built-in features like encryption, read/write caching, and exposing virtual volumes.
I don't think folks should default to using one or the other, everything depends on your workloads. Ext4 might even be the best choice. Though bcachefs could be an "endgame FS" if it delivers on its promises.
> ZFS RAID is production quality and safe to use. Not the case with Btrfs
Btrfs works and has worked fine for RAID0 and RAID1. It is specifically RAID5, RAID6, et al. that are problematic.
That said, I prefer ZFS (we'll see what happens when bcachefs is merged), though Synology DSM doesn't offer it, and it can be a PITA not being able to run the latest Linux kernel. Especially (solely?) on a rolling distro, I didn't like that.
There are drivers for both to run under Windows, btw.
[1]: https://openzfs.github.io/openzfs-docs/Basic%20Concepts/Chec...
[2]: https://people.freebsd.org/~gibbs/zfs_doxygenation/html/d9/d...
[3]: https://github.com/openzfs/zfs/blob/c0e58995e33479a9c1d97fb2...
[4]: https://github.com/openzfs/zfs/blob/c0e58995e33479a9c1d97fb2...