Hacker News
OpenZFS 2.2: Block Cloning, Linux Containers, BLAKE3 (github.com/openzfs)
105 points by buybackoff on Oct 13, 2023 | 64 comments


Question for ZFS experts: the docs say ([1]) that I can change the checksum used, and I want to change it to BLAKE3.

However, based on [2], it seems the stored checksum doesn't record which algorithm produced it, which would make this an all-or-nothing setting rather than one that only applies to new blocks.

Can I actually change the checksum? What happens when I do?

Edit: It appears the type of checksum actually is stored per-block, in `blk_prop` ([3]) of `blkptr_t` ([4]).

Edit 2: The manpage says that the change only applies to new data, so yes, it's safe.

[1]: https://openzfs.github.io/openzfs-docs/Basic%20Concepts/Chec...

[2]: https://people.freebsd.org/~gibbs/zfs_doxygenation/html/d9/d...

[3]: https://github.com/openzfs/zfs/blob/c0e58995e33479a9c1d97fb2...

[4]: https://github.com/openzfs/zfs/blob/c0e58995e33479a9c1d97fb2...
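For anyone landing here with the same question, the change itself is a one-liner; per the manpage it only affects data written after the change. A sketch (pool/dataset names illustrative):

```shell
# Assumes OpenZFS 2.2+. The blake3 feature flag must be enabled on the
# pool before the property can be set. Existing blocks keep their old
# checksum (the algorithm is recorded per block pointer); only newly
# written data is checksummed with BLAKE3.
zpool set feature@blake3=enabled tank
zfs set checksum=blake3 tank/dataset
zfs get checksum tank/dataset
```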


The checksum type for the block is defined elsewhere (include/sys/zio.h) per an "enum zio_checksum".


It would go against a fundamental principle of ZFS's design if changing a setting like that mid-flight were not safe.


A lot of really nice, focused polish items in this release with real practical benefits. Take the corrective "healing" receive: ZFS has always been able to self-heal when there is redundancy (either at the device level or via the copies property), and that's always been a core feature; in its original incarnation there'd always be at least some redundancy. But as ZFS gets wider use, the fact is the vast majority of systems have a single device and don't want to cut available storage in half by setting copies=2. Corruption can still be detected with a scrub thanks to checksums, so ZFS is still useful anyway; it's better to know about corruption ASAP, while you still have good backups. But being able to smoothly and very quickly repair that corruption, without any local redundancy, using backups you should have anyway is a nice bit of polish that leverages the existing foundation.

One question if anyone who follows it more closely knows: are there any efforts to work specifically on ZVOL performance? They're very handy for iSCSI and other features, but my understanding is they haven't gotten much focus for quite a while now. Maybe that just doesn't have any real company backing right now for R&D, but I hope it gets some attention eventually.


> But as ZFS gets wider use, the fact is the vast majority of systems have a single device and don't want to cut available storage in half by setting copies=2.

Setting copies=2 for the root filesystem, but segmenting stuff like caches, user downloads, etc into separate datasets with copies=1, is totally reasonable for this purpose. The base system in most distributions only takes a few GiB. 1 TiB SSDs are pretty common even on entry level laptops these days, and it's becoming increasingly difficult to find new machines without at least 512 GiB.
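As a sketch, the layout described above might look like this (dataset names illustrative):

```shell
# copies=2 on the base system, copies=1 on caches and downloads.
# The copies property is inherited, so children of rpool/ROOT get two
# copies unless they override it.
zfs set copies=2 rpool/ROOT
zfs create -o copies=1 rpool/cache
zfs create -o copies=1 rpool/home/downloads
```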


A scrub can both detect and repair damage. You don't need to set additional copies. If you don't want to buy 4TB of drives to store 2TB of data in a mirror, use a raidz instead.

In a raidz with at least one parity disk (and any number of data disks), corruption found on any data disk can be repaired immediately. For non-overlapping corruption a single parity disk is enough. If you have corruption of the same block on two different disks (extremely unlikely) then this can be repaired if you have two parity disks. (Etc. for three identically-corrupt data disks and three parity disks)

You can have e.g. two parity disks and eight data disks (or more), and so long as no more than two disks are corrupted in the same block at the same time, a scrub will repair it completely.

The system of parity disks in ZFS' raidz is meant for reconstructing entire failed data disks, so fixing corruption is no big deal.
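The 2-parity/8-data example above would look something like this (device names illustrative):

```shell
# raidz2 over ten devices: any two devices can fail outright, or be
# corrupt in the same stripe, and a scrub/resilver can still reconstruct
# the data from parity.
zpool create tank raidz2 /dev/disk/by-id/ata-disk{0..9}
zpool scrub tank
zpool status -v tank   # shows repaired bytes and any unrecoverable errors
```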


> are there any efforts to work specifically on ZVOL performance

There's this[1] work which integrated ZFS better with the kernel so it could merge small IOs on zvols better. It was merged almost a year ago. However due to a recent issue[2] it has been disabled until they can figure out what exactly is going on. If they can get it working it seemed to give a decent boost for certain scenarios.

[1]: https://github.com/openzfs/zfs/pull/13148

[2]: https://github.com/openzfs/zfs/issues/15351


> But as ZFS gets wider use, the fact is the vast majority of systems have a single device and don't want to cut available storage in half by setting copies=2

I'm still patiently holding out for the first (current, usable) filesystem that lets us set a parity fraction and uses FEC to heal small errors.


I suppose in principle you could partition a drive n ways, then run RAIDZ1 across the partitions? Though performance would probably be mediocre and there'd be some oddities.

It occurs to me now that, as a practical matter, raw storage capacity will solve this for a lot of people. The pace of increase in storage/$ still seems steady, which in turn means more and more people trivially have their total needs exceeded. At some point it may be perfectly reasonable to throw half of it at redundancy?


This is a bad idea.

If you want repair, you need to either set more than one copy in ZFS (still pretty dumb) or use multiple drives in a raidz (or mirror, though raidz is more efficient).

When you try to partition a drive like this (or configure ZFS to write multiple copies) you lose everything in the event that the whole drive fails. Honestly, it's a far better plan to just use multiple drives in a raidz so that whole-drive failures can be reconstructed in addition to data corruption.


Please read the thread before commenting. The scenarios we're discussing are explicitly in the context of single storage device systems getting helped by the new healing replication feature. Huge numbers of systems do not have multiple drives nor even necessarily the ability to have multiple drives. Nobody is suggesting it's optimal. I've run ZFS for 12+ years now with hundreds of systems. You don't need to tell us the obvious.



Use raidz.


Note that cloning was already possible at the filesystem/snapshot level for a while:

* https://openzfs.github.io/openzfs-docs/man/master/8/zfs-clon...

* https://openzfs.github.io/openzfs-docs/man/master/7/zfsconce...

(ZFS uses the fairly 'standard' nomenclature of "snapshots" being read-only copies and "clones" being read-write copies.)
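In command form (dataset names illustrative):

```shell
# A snapshot is a read-only point-in-time copy; a clone is a writable
# filesystem that initially shares all of its blocks with that snapshot.
zfs snapshot tank/data@base
zfs clone tank/data@base tank/data-work
```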


I'm always impressed w/ ZFS. I really like the look of:

> Corrective "zfs receive" (#9372[0]) - A new type of zfs receive which can be used to heal corrupted data in filesystems, snapshots, and clones when a replica of the data already exists in the form of a backup send stream.

[0] https://github.com/openzfs/zfs/pull/9372
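Usage looks roughly like this, assuming the same snapshot exists (corrupted) on the local pool and (intact) on a backup host; all names are illustrative:

```shell
# 1. A scrub flags corrupted data it cannot repair locally.
zpool scrub tank && zpool status -v tank
# 2. Feed a matching send stream back in with the corrective receive
#    flag; only the damaged blocks are healed, nothing is rolled back.
ssh backup zfs send backup/data@daily | zfs receive -c tank/data@daily
```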


> support for overlayfs

Does this mean I can use the overlay2 driver with docker instead of the ZFS one? I've had many people tell me this was possible in the past when it definitely wasn't.

The performance and reliability of the ZFS driver leave lots to be desired. The last time I ran Docker on a ZFS partition I actually used a zvol formatted as ext4 just to avoid these issues.
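That workaround looks roughly like this (size and names illustrative):

```shell
# An ext4-formatted zvol backing /var/lib/docker, so Docker can use its
# overlay2 driver instead of the zfs graph driver. -s makes the volume
# sparse (thin-provisioned).
zfs create -s -V 64G tank/docker
mkfs.ext4 /dev/zvol/tank/docker
mount /dev/zvol/tank/docker /var/lib/docker
```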


> The performance and reliability of the ZFS driver leaves lots to be desired.

No wonder: it calls the zfs(1) tool and parses its output to do its work. Again. And again. And again.


I once wrote a line of shell piping zfs into awk while walking the tree of datasets, but it was so slow I decided I didn't need it after all. I wonder what a filesystem would look like that made the performance of meta-reflection a primary concern.


Shouldn't it be directly calling the "zed" ZFS daemon to get that info instead?


Perhaps.

Thing is, https://github.com/moby/moby/blob/670bc0a46c4ca03b75f1e72f73... is using https://github.com/mistifyio/go-zfs which features code like `out, err := zfsOutput("get", "-H", key, d.Name)` (Source: https://github.com/mistifyio/go-zfs/blob/master/zfs.go#L315) to get a single zfs property.

Somebody chose to use a library as an abstraction that looks good but is implemented as an MVP (nothing wrong with that). "In the future, we hope to work directly with libzfs" should have raised an alarm somewhere, though.


Interesting. I'd been meaning to check if there are go libraries for working with ZFS.

Looks like the answer to that is "yes, but be careful as some are kind of dodgy". :/

---

Oh, that library bills itself as "Simple wrappers for ZFS command line tools", rather than as a ZFS interface.

Wonder why the Moby project picked that one then? Maybe a case of "it was the best choice at the time".


There's https://github.com/openzfs/zfs/tree/master/contrib/pyzfs, which is as official as it gets.


Sounds like that would be the better choice. :)


Recent and related:

ZFS 2.2.0 (RC): Block Cloning merged - https://news.ycombinator.com/item?id=36588240 - July 2023 (165 comments)


Does this also include the ability to add additional drives to a zpool? For example, going from 4 disks to 6 disks in a raidz2/raidz3 zpool?


Nah, the PR for that is still being reviewed prior to getting merged:

https://github.com/openzfs/zfs/pull/15022


Nice, thanks. Looks like it'll be soon.

Can drives be removed in the same way if there's the space for it?


Good question, I'm not personally sure.


Having a Proxmox-based homelab setup, I've been waiting quite a while for Linux container support so that Docker inside LXC containers just works. It was quite inconvenient when even a pull of a simple image would freeze. Hopefully now a docker-compose'd logical set of Docker containers (or just a single Docker container) per LXC container will just work without VM overhead.


I've recently tried ZFS on an NVMe drive in a USB enclosure. The USB connection is fragile: when rsyncing data onto it, it disconnects and reconnects, after which the zpool is in a SUSPENDED state and only a reboot helps :(.


This is a very Stack Overflow response but, yes, don’t use USB for any storage you expect to be reliable.


But the issue isn't really at storage level, it's about the filesystem driver not dealing with it. Most modern journaling file systems don't really care if my drive intermittently disconnects, all that will happen is some failed file operations, and at worst the necessity to run a quick filesystem check. With ZFS's emphasis on storage being fallible (checksums to detect silent data corruption etc) it's not weird to expect it to also handle connection issues at least as well as other file systems.


There's fuck all that ZFS can do if the drive lies about what has or hasn't been written yet, journal or no journal. USB devices are known to do this all the time.

If the drive lies and says that part of the journal has been written when it hasn't yet, and ZFS goes ahead and writes the next part of the journal, then when you unplug the drive and the first part of the journal (which the second depended on) goes away, you're hosed. There have to be places where ZFS blocks until something critical has definitely, absolutely, been written to disk.

At that point the only thing ZFS can do is try to unwind back to whatever it thinks is a consistent state, but this isn't 100% guaranteed. (it depends on what old data is still hanging around)


I have a machine with dual-boot: Kali Linux (I think I used ext4 for it) and Windows 11 (NTFS). Due to space constraints I mount /var/cache/apt on the NTFS partition. Whenever I forget to cleanly shut down Windows 11, though, I cannot mount that partition because of potential data loss, and I first need to boot back into Windows 11. This used to never happen. But you know what? I'm OK with it. Heck, I should probably run both OSes on ZFS. Although the Windows 11 partition is just to play around with Windows 11; nothing serious on it.


Lol, ZFS is supposed to be this super reliable filesystem that does your laundry. But if your hard drive disconnects for a moment, tough luck.


ZFS is a very reliable file system. At some point ZFS will give up and say "I cannot reliably save bits."

It is your hardware where the unreliability lies and ZFS detects that. Would you rather prefer silent corruption?


To be clear, I don't mind it getting suspended when there is a hardware failure. What I don't understand is why there isn't a sane way (i.e. some command sequence) to re-attach it.

Otherwise the feature set seems great: both a volume manager and a file system, seamless compression, deduplication, COW and snapshots - what's not to love :)


> why there isn't a sane way (i.e. some command sequence) to re-attach it

But there is. It's `zpool clear <poolname>`.

That said, you probably need to have created or imported the pool using stable device names (e.g. `zpool import -d /dev/disk/by-id` or `by-part-uuid`, for example). Otherwise, when reattaching the USB device it might get assigned a different device name (e.g. `/dev/sdb` instead of `/dev/sda`) and ZFS might think the device is still unavailable.
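Put together, the recovery sequence after replugging would be roughly (pool name illustrative):

```shell
# Re-import using stable device names so the replugged device is
# recognized, then clear the suspended/errored state. This may still
# fail (or hang) if the kernel-side pool state is wedged.
zpool export tank || true
zpool import -d /dev/disk/by-id tank
zpool clear tank
```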


Yeah, super reliable in the enterprise. It's not designed around hanging a single SSD off your RPi with a five-dollar USB-to-M.2 adapter. It's designed around having a disk shelf, or at the very least a mirror.


> it's super reliable on high quality reliable hardware.

Okay


There's fuck all that ZFS can do if hardware lies about what has or hasn't been written to disk yet. There have to be points where ZFS blocks until something critical has been absolutely, totally, 100% written to disk before continuing. Unfortunately, USB devices lie all the time.


Lungs are usually pretty reliable, but they're still working on a long-term fix for when air becomes unavailable.


I've been using ZFS for over a decade and love it, but it was created as an "enterprise server filesystem" and in certain areas like this it really shows.

That said, have you tried something like this[1]? Also, what device name did you use when creating the pool? Using one of the /dev/disk/by-* names that doesn't change when you reconnect would make it a lot smoother, I imagine.

[1]: https://github.com/openzfsonosx/zfs/issues/104#issuecomment-...


I often find myself wishing for this block cloning thing, but with rsync hooks.

Like suppose you're copying a large genome from A to B, and you already have a genome from that species (but different organism) lying around on B.

Sure, you could clone and then rsync on top of the clone to avoid transferring the common bits again. But that's forethought that users often don't have. Better to use the rolling hash related metadata that rsync generates as a query into all possible targets and pick one automatically so that the user doesn't have to think about it and just sees it as a really fast copy.


Couldn't you just send a diff/patch?


That would require you to know ahead of time what you're patching, and to have a copy of it on the source machine so you could generate the patch.

This would be automatic. If there's similarity on the target, use it, but without requiring the user to tell you where to find it.


Previous thread with discussion of the release candidate: https://news.ycombinator.com/item?id=36588240

While block cloning isn't supported for encrypted datasets yet, looks like there is a WIP already: https://github.com/openzfs/zfs/pull/14705

Very cool to have the ability for near-instant cp or mv between datasets or snapshots (at least locally; the clones don't persist with send/recv?).


Block cloning is kind of a significant update!

I wonder what holds back a ZFS-level offline dedupe function now that this is implemented, since you could already basically write a shell script to do something like it.
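A hypothetical sketch of such a script (not a shipped ZFS feature; the function name and paths are illustrative): hash every file, then replace each duplicate with a reflink copy of its first occurrence. On OpenZFS 2.2 `cp --reflink` goes through copy_file_range() and becomes a block clone; `--reflink=auto` falls back to a plain copy on filesystems without cloning, so it runs anywhere but only saves space where cloning works.

```shell
# Hypothetical offline-dedupe sketch. Assumes GNU coreutils
# (sha256sum, cp --reflink, xargs -r).
dedupe_dir() {
    find "$1" -type f -print0 | xargs -0r sha256sum | sort -k1,1 |
    while IFS= read -r line; do
        hash=${line%% *}
        path=${line#*  }   # sha256sum separates hash and path with two spaces
        if [ "$hash" = "$prev_hash" ]; then
            # Same content as a file already seen: clone over the duplicate.
            cp --reflink=auto -p "$prev_path" "$path"
        else
            prev_hash=$hash
            prev_path=$path
        fi
    done
}
```

Content is unchanged afterwards; only the on-disk block sharing differs (where the filesystem supports it).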


For filesystem geeks (of a sort) like myself, note that ReFS has had block cloning support for some time now:

https://learn.microsoft.com/en-us/windows-server/storage/ref...


ReFS had many problems though.[1] Are they better now?

[1] https://www.reddit.com/r/DataHoarder/comments/iow60w/testing...


I tried it a few months ago and ReFS ate my data. No indication of why in event logs or SMART data. It had IsPowerProtected set because I have a UPS and I had a unclean restart, I would expect it to lose data, but not to corrupt the filesystem metadata. I had a backup of the data but wanted some recent changes. Refsutil (the official Microsoft tool) didn't help because it has not been updated for the newest ReFS version. I couldn't read most files because I had integrity enable and files failed the check. Hetman's Data Recovery was able to recover most of the data. In later testing I found out that IsPowerProtected is just very unsafe. I have since put some time into testing and sometimes fixing https://github.com/openzfsonwindows/openzfs , it is not ready for use yet, but it is making great progress.


It might be easy to extend fdupes and jdupes to do this without much effort. I haven't seen the API/syscall involved, but I use them with btrfs for a specific case where I have a lot of known duplicates.


What about NVME optimizations?


Any comparison to BTRFS? I am using BTRFS in place of ZFS because I had a problem with ZFS mounts.


I use both, for different use cases. The flexibility of BTRFS is hard to beat, I've yet to have a single data loss issue (5ish years, multiple machines), and for low resource systems (like an RPi3) it's a great choice.

I have to say, as a Linux user I find the ZFS tooling surprisingly difficult, and I've had far more issues with Linux on ZFS root than BTRFS. YMMV. On my one BSD machine I currently have a USB-attached zpool that has seemed completely frozen, in spite of reboots, for over a week: all zpool and zfs commands hang indefinitely with no output (even though the machine is otherwise working fine). No idea what to make of that; it was working fine for 3+ years previously.

That said, people far more knowledgeable than me far prefer ZFS, so I keep my really important long-term storage and backups on a RAIDZ2 array.


Both are CoW filesystems with different strengths and weaknesses. BTRFS is part of the kernel, has interesting integrations with systemd (and possibly better Linux tooling in general), and has more flexibility with adding and removing disks (for now). ZFS has (arguably) better reliability and has more built-in features like encryption, read/write caching, and exposing virtual volumes.

I don't think folks should default to using one or the other, everything depends on your workloads. Ext4 might even be the best choice. Though bcachefs could be an "endgame FS" if it delivers on its promises.


I’ve come around to using btrfs as basically ext4++ but ZFS is head and shoulders a better product.

This article is somewhat outdated but the author is knowledgeable on the subject https://arstechnica.com/gadgets/2021/09/examining-btrfs-linu...


ZFS doesn't integrate with Linux's page cache well, though. That's quite a large negative IMO.


But the reworked ARC implementation in this release should help a lot with that.


> Any comparison to BTRFS?

ZFS RAID is production quality and safe to use. Not the case with Btrfs:

* https://btrfs.readthedocs.io/en/latest/btrfs-man5.html#raid5...

Certainly striping-mirroring will give you more IOPS, but if you want more space efficiency for bulk storage, RAID-y solutions are probably better.

ZFS has been in production use since Solaris 10u2 (June 2006), so we're approaching twenty years of it being banged on.


> ZFS RAID is production quality and safe to use. Not the case with Btrfs

Btrfs works and has worked fine for RAID0 and RAID1. It is specifically RAID5, RAID6, et al. that remain problematic.

That said, I prefer ZFS (we'll see what happens when bcachefs is merged), though Synology DSM doesn't offer it, and it can be a PITA not being able to run the latest Linux kernel. Especially (solely?) on a rolling distro, I didn't like that.

There are drivers for both to run under Windows, btw.


Cool, but I'm really looking forward to being able to add disks to an existing volume.


I want this too.


Pretty exciting that we're getting reflink-style copies (block cloning) in ZFS. It used to be one of the key features of XFS, but now I can stick to ZFS only.



