Big thank you to all the people that worked on it!
BTW has anyone tried a probabilistic dedup approach using something like a bloom filter, so you don't have to store the entire dedup table of hashes verbatim? Collect groups of ~100 block hashes into a bucket each, and store a hyper-compressed representation in a bloom filter. On write, look up the hash of the block to write in the bloom filter, and if a potential dedup hit is detected, walk the 100 blocks in the matching bucket manually to look for any identical hashes. In theory you could do this with layers of bloom filters at different resolutions and dynamically swap the heavier ones out to disk when memory pressure is too high to keep the high-resolution ones in RAM. Allowing the accuracy of the bloom filter to be changed as a tunable parameter would let people choose their preferred ratio of CPU time/overhead to bytes saved.
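For what it's worth, a rough sketch of that prefilter idea in plain C (everything here is hypothetical, not ZFS code): the filter only answers "definitely new" or "maybe seen before", so a false positive just costs an extra walk of the matching bucket of full hashes, never a wrong dedup.

#include <stdint.h>
#include <string.h>

/* Hypothetical prefilter: a bloom filter in front of per-bucket lists of
 * full block hashes. Sizes are made-up tunables. */
#define BLOOM_BITS   (1u << 24)   /* 2 MiB bit array */
#define BLOOM_HASHES 4            /* probes per lookup */
#define BUCKET_SIZE  100          /* full hashes per bucket, walked on a hit */

typedef struct { uint8_t bits[BLOOM_BITS / 8]; } bloom_t;

/* Derive probe positions by slicing the (>=16 byte) block hash. */
static uint32_t probe(const uint8_t *block_hash, int i)
{
    uint32_t v;
    memcpy(&v, block_hash + i * 4, sizeof v);
    return v % BLOOM_BITS;
}

static void bloom_add(bloom_t *b, const uint8_t *block_hash)
{
    for (int i = 0; i < BLOOM_HASHES; i++) {
        uint32_t p = probe(block_hash, i);
        b->bits[p / 8] |= (uint8_t)(1u << (p % 8));
    }
}

/* Returns 0 = definitely a new block, 1 = maybe a duplicate; on 1 you fall
 * back to walking the matching ~BUCKET_SIZE-entry bucket of full hashes. */
static int bloom_maybe_dup(const bloom_t *b, const uint8_t *block_hash)
{
    for (int i = 0; i < BLOOM_HASHES; i++) {
        uint32_t p = probe(block_hash, i);
        if (!(b->bits[p / 8] & (1u << (p % 8))))
            return 0;
    }
    return 1;
}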
You may have seen in the WARC standard that they already do de-duplication based on hashes and use pointers after the first store. So this is exactly a case where FS-level dedup is not all that good.
[edit] Should add a link, this is a pretty good overview, but you can also look at implementations such as the new zeno crawler.
https://support.archive-it.org/hc/en-us/articles/208001016-A...
I know of the CDX index files produced by some tools, but I don't know the details or whether they could be used to dedup across WARCs; I've only been referencing the WARC file specs via IIPC's old standards docs.
dm-vdo has the same behaviour.
You may be better off with long-range solid compression instead, or unpacking the warc files into a directory equivalent, or maybe there is some CDC-based FUSE system out there (Seafile perhaps)
The wget extractor within archivebox can produce WARCs as an output but no parts of ArchiveBox are built to rely on those, they are just one of the optional extractors that can be run.
Just store fingerprints in a database and run through that at night and fixup the block pointers...
That's why. Due to reasons[1], ZFS does not have the capability to rewrite block pointers. It's been a long requested feature[2] as it would also allow for defragmentation.
I've been thinking this could be solved using block pointer indirection, like virtual memory, at the cost of a bit of speed.
But I'm by no means a ZFS developer, so there's surely something I'm missing.
[1]: http://eworldproblems.mbaynton.com/posts/2014/zfs-block-poin...
Once you get a lot of snapshots, though, the indirection costs start to rise.
I've also heard there are some experimental branches that make it possible to run Hammer2 on FreeBSD. But FreeBSD also lacks RDMA support. For FreeBSD 15, Chelsio has sponsored NVMe-oF target and initiator support. I think this is just TCP though.
Because:
> When dedup is enabled [...] every single write and free operation requires a lookup and then a write to the dedup table, regardless of whether or not the write or free proper was actually done by the pool.
To me, this is "obviously" the wrong approach in most cases. When I'm writing data, I want that write to complete as fast as possible, even at the cost of disk space. That's why I don't save files I'm actively working on in 7zip archives.
But later on, when the system is quiet, I would love for ZFS to go back and figure out which data is duplicated, and use the BRT or whatever to reclaim space. This could be part of a normal scrub operation.
There are of course edge cases to consider to avoid data loss, but I imagine it might come soon, either officially or as a third-party tool.
Performance-wise it can be a similar issue in some edge cases, but usually caching can address those problems.
Efficiency being the enemy of reliability, sometimes.
It’s why metadata gets duplicated in ZFS the way it does on all volumes.
Having seen this play out a bunch of times, it isn’t an uncommon need either.
Well I didn't suggest that. I said important files only for the extra backup, and I was talking about reallocating resources not getting new ones.
The simplest version is the scenario where turning on dedup means you need one less drive of space. Convert that drive to parity and you'll be better off. Split that drive from the pool and use it to backup the most important files and you'll be better off.
If you can't save much space with dedup then don't bother.
I’m noting that turning on volume-wide dedup can have the con that you can’t choose (but it looks like you can!) to manually duplicate data.
> There was an implication in your statement that volume level was the level of granularity, yeah?
There was an implication that the volume level was the level of granularity for adding parity.
But that was not the implication for "another backup of your most important files".
> I’m noting that turning on volume-wide dedup can have the con that you can’t choose (but it looks like you can!) to manually duplicate data.
You can't choose just by copying files around, but it's pretty easy to set copies=2 on specific directories. And I'd say that's generally a better option, because it keeps your copies up to date at all times. Just make sure snapshots are happening, and files in there will be very safe.
Manual duplication is the worst kind of duplication, so while it's good to warn people that it won't work with dedup on, actually losing the ability is not a big deal when you look at the variety of alternatives. It only tips the balance in situations where dedup is near-useless to start with.
I also wonder if it would make sense for ZFS to always automatically dedupe before taking a snapshot. But you'd have to make this behavior configurable since it would turn snapshotting from a quick operation into an expensive one.
This is the Windows dedupe methodology. I've used it pretty extensively and I'm generally happy with it when the underlying hardware is sufficient. It's very RAM and I/O hungry but you can schedule and throttle the "groveler".
I have had some data eating corruption from bugs in the Windows 2012 R2 timeframe.
A more precise calculation on my actual data shows that today's data would allow the dedup table to fit in RAM, but if I ever want to actually use most of the 40TB of storage, I'd need more RAM. I've had a ZFS system swap dedup to disk before, and the performance dropped to approximately zero; fixing it was a PITA, so I'm not doing that anytime soon.
In any case it doesn’t stick out to me as a problem that needs to be fixed. You can’t fill a propane tank to 100% either.
All my arrays send me nightly e-mails at 80%, so I'm aware when I get there, but on a desktop system that's typically not the case.
You should be able to detect duplicates online. Low priority sweeping is something else. But you can at least reduce pause times.
That said, you can still do a two-space GC, but it's slow and possibly wasteful.
Once that is in, any of the existing dupe-finding tools that use it (i.e. jdupes, duperemove) will just work on ZFS.
Compression, encryption and streaming sparse files together are impressive already. But now we get a new BRT entry appearing out of nowhere, dedup index pruning one that was there a moment ago, all while correctly handling arbitrary errors in whatever simultaneous deduped writes, O_DIRECT writes, FALLOC_FL_PUNCH_HOLE and reads were waiting for the same range? Sounds like adding six new places to hold the wrong lock to me.
It's no worse than anything else related to block cloning :)
ZFS already supports FICLONERANGE, the thing FIDEDUPRANGE changes is that the compare is part of the atomic guarantee.
So in fact, I'd argue it's actually better than what is there now - yes, the hardest part is the locking, but the locking is handled by the dedup range call getting the right locks upfront and passing them along, so nothing else is grabbing the wrong locks. It has to work that way because of the requirements to implement the ioctl properly. We have to be able to read both ranges, compare them, and clone them, all as an atomic operation with respect to concurrent writes. So instead of random things grabbing random locks, we pass the right locks around and everything verifies the locks.
This means fideduprange is not as fast as it maybe could be, but it does not run into the "oops we forgot the right kind of lock" issue. At worst, it would deadlock, because it's holding exclusive locks on all that it could need before it starts to do anything in order to guarantee both the compare and the clone are atomic. So something trying to grab a lock forever under it will just deadlock.
This seemed the safest course of implementation.
ficlonerange is only atomic in the cloning, which means it does not have to read anything first, it can just do blind block cloning. So it actually has a more complex (but theoretically faster) lock structure because of the relaxed constraints.
All modern tools use FIDEDUPRANGE, which is an ioctl meant for explicitly this use case - telling the FS that two files have bytes that should be shared.
Under the covers, the FS does block cloning or whatever to make it happen.
Nothing is copied.
ZFS does support FICLONERANGE, which is the same as FIDEDUPRANGE but it does not verify the contents are the same prior to cloning.
Both are atomic with respect to concurrent writes, but for FIDEDUPRANGE that means the compare is part of the atomicity. So you don't have to do any locking.
If you used FICLONERANGE, you'd need to lock the two file ranges, verify, clone, unlock
FIDEDUPRANGE does this for you.
So it is possible, with no changes to ZFS, to modify dedup tools to work on ZFS by changing them to use FICLONERANGE + locking if FIDEDUPRANGE does not exist.
Because FIDEDUPRANGE has the compare as part of the atomic guarantee, you don't need to lock in userspace around using it, and so no dedup utility bothers to do FICLONERANGE + locking. Also, ZFS is the only FS that implements FICLONERANGE but not FIDEDUPRANGE :)
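For anyone curious what that looks like from userspace, here's a minimal sketch of the call as Linux defines it in linux/fs.h (where it's spelled FIDEDUPERANGE). This is my own illustration, not taken from those tools or PRs; the filenames and the 1 MiB length are made up, and error handling is trimmed.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>      /* FIDEDUPERANGE, struct file_dedupe_range */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>

int main(void)
{
    int src = open("a.bin", O_RDONLY);  /* hypothetical files, believed to */
    int dst = open("b.bin", O_RDWR);    /* share their first 1 MiB         */
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    /* One destination range; the struct ends in a flexible array of them. */
    struct file_dedupe_range *r =
        calloc(1, sizeof(*r) + sizeof(struct file_dedupe_range_info));
    r->src_offset = 0;
    r->src_length = 1 << 20;
    r->dest_count = 1;
    r->info[0].dest_fd = dst;
    r->info[0].dest_offset = 0;

    /* The kernel re-reads and compares both ranges under its own locks;
     * blocks are only shared if the bytes still match at that instant. */
    if (ioctl(src, FIDEDUPERANGE, r) < 0) { perror("FIDEDUPERANGE"); return 1; }

    if (r->info[0].status == FILE_DEDUPE_RANGE_SAME)
        printf("deduped %llu bytes\n", (unsigned long long)r->info[0].bytes_deduped);
    else
        printf("not deduped (status %d)\n", r->info[0].status);

    free(r);
    return 0;
}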
Like jdupes or duperemove.
I sent PRs to both the ZFS folks and the duperemove folks to support the syscalls needed.
I actually have to go follow up on the ZFS one; it took a while to review and I realized I completely forgot to finish it up.
Hard links are all equivalent. A file has any number of hard links, and at least in theory you can't distinguish between them.
The risk with hardlinks is that you might alter the file. Reflinks remove that risk, and also perform very well.
However, the fact that editing one copy edits all of them still makes this a non-solution for me at least. I'd also strongly prefer deduping at the block level vs file level.
Anyway. Offline/lazy dedup (not in the zfs dedup sense) is something that could be done in userspace, at the file level on any filesystem that supports reflinks. When a tool like rdfind finds a duplicate, instead of replacing with a hardlink, create a copy of the file with `copy_file_range(2)` and let the filesystem create a reflink to it. Now you've got space savings and they're two separate files so if one is written to the other remains the same.
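A sketch of that replacement step, assuming a dupe finder has already verified the two files are byte-identical (the paths are made up). Note that whether copy_file_range actually ends up as a reflink depends on the filesystem; FICLONE/FIDEDUPERANGE are the explicit ways to ask.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Replace dup.dat with a reflink-friendly copy of orig.dat instead of a
 * hardlink. Copy into a temp file and rename(2) over the duplicate so a
 * crash can't lose it. */
int main(void)
{
    int in  = open("orig.dat", O_RDONLY);
    int out = open("dup.dat.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(in, &st) < 0) { perror("fstat"); return 1; }

    off_t remaining = st.st_size;
    while (remaining > 0) {
        /* On filesystems with reflink support the kernel can satisfy this
         * by sharing blocks rather than copying bytes through userspace. */
        ssize_t n = copy_file_range(in, NULL, out, NULL, remaining, 0);
        if (n <= 0) { perror("copy_file_range"); return 1; }
        remaining -= n;
    }

    if (rename("dup.dat.tmp", "dup.dat") < 0) { perror("rename"); return 1; }
    return 0;
}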
This is par for the course with ZFS though. If you delete a non-duplicated file you don't get the space back until any snapshots referencing the file are deleted.
Basically all modern dupe tools use fideduprange, which is meant to tell the FS which things should be sharing data, and let it take care of the rest. (BTRFS, bcachefs, etc. support this ioctl, and ZFS will soon too.)
Unlike copy_file_range, it is meant for exactly this use case, and will tell you how many bytes were dedup'd, etc.
A better design would have been to split every node that has block pointers into two sections: one that has only logical block pointers and all of whose contents get hashed into the tree, and one that has only the physical locations (as if it were a cache) of the corresponding logical block pointers in the first section, with the second section _not_ hashed into the Merkle hash tree. Then BP rewrite would only require re-writing blocks that are not part of the Merkle hash tree.
But as it is you can't get BP rewrite to work on ZFS, so you can't get what you're asking for.
Well... maybe. Perhaps on read hash mismatch ZFS could attempt to locate the pointed-to block in the dedup table using the hash from the pointer. Then ZFS could reallocate the dedup'ed block. The price you'd pay then is one pointless read -- not too bad. The impossibility of BP rewrite generally leads to band-aids like this.
It looks like it means: https://en.wikipedia.org/wiki/Content-addressable_storage
zfs set mountpoint=/mnt/foo foopy/foo
zfs set dedup=off foopy/foo
zfs set mountpoint=/mnt/baz foopy/baz
zfs set dedup=on foopy/baz
Save all your stuff in /mnt/foo, then when you want to dedup do mv /mnt/foo/bar /mnt/baz/
Yeah... this feels like picrel, and it is https://i.pinimg.com/originals/cb/09/16/cb091697350736aae53afe4b548b9d43.jpg
but it's here and now and you can do it now. That said, NVMe has changed that balance a lot, and you can afford a lot less before you're bottlenecking the drives.
Tried to find the talk but failed; I was sure I had seen it at a Developer Summit, but alas.
File systems are pretty good if you have a mix of human and programmatic uses, especially when the programmatic cases are not very heavy duty.
The programmatic scenarios are often entirely human hostile, if you try to imagine what would be involved in actually using them. Like direct S3 access, for example.
So if you have non-ASCII characters in your paths, encoding/decoding is guesswork, at worst differing from path segment to path segment, and there's no metadata attached saying which encoding to use.
For less patient readers, note that the concise summary is at the bottom of the post, not the top.
kstat.zfs.<pool>.misc.ddt_stats_<checksum>
Typesetting code on a narrow screen is tricky![1] https://despairlabs.com/blog/posts/2024-10-27-openzfs-dedup-...
It does scroll here, though only slightly; it might be worse on a smaller device with less screen space and/or different PPI settings.
> As we’ve seen from the last 7000+ words, the overheads are not trivial. Even with all these changes, you still need to have a lot of deduplicated blocks to offset the weight of all the unique entries in your dedup table. [...] what might surprise you is how rare it is to find blocks eligible for deduplication are on most general purpose workloads.
> But the real reason you probably don’t want dedup these days is because since OpenZFS 2.2 we have the BRT (aka “block cloning” aka “reflinks”). [...] it’s actually pretty rare these days that you have a write operation coming from some kind of copy operation, but you don’t know that came from a copy operation. [...] [This isn't] saving as much raw data as dedup would get me, though it’s pretty close. But I’m not spending a fortune tracking all those uncloned and forgotten blocks.
> [Dedup is only useful if] you have a very very specific workload where data is heavily duplicated and clients can’t or won’t give direct “copy me!” signal
The section labeled "summary" imo doesn't do the article justice, as it's fairly vague. I hope these quotes from near the end of the article give a more concrete idea of why (or why not) to use it.
Didn't read the 7000 words... But isn't the dedup table in the form of a bunch of bloom filters so the whole dedup table can be stored with ~1 bit per block?
When you know there is likely a duplicate, you can create a table of blocks where there is a likely duplicate, and find all the duplicates in a single scan later.
That avoids the massive accounting overhead of storing per-block metadata.
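For a rough sense of the trade-off in the "~1 bit per block" version: an optimally tuned Bloom filter's false-positive rate depends only on the bits-per-entry ratio m/n,

p = \left(\tfrac{1}{2}\right)^{k}, \quad k = \tfrac{m}{n}\ln 2 \;\Longrightarrow\; p \approx 0.6185^{\,m/n}

so at about 1 bit per block, a bit over 60% of genuinely new blocks would still get flagged for the later scan, while ~8 bits per block brings that down to roughly 2%. Either way false positives only cost extra scanning; real duplicates are never missed.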
In general, even during rsync operations one often turns off compression for large video files, as compressing them yields little or no saving on storage/transfer while eating RAM and CPU power.
De-duplication is good for Virtual Machine OS images, as the majority of the storage cost is a replicated backing image. =3
No, you won't save much on a client system. That isn't what the feature is made for.
It is very clear that consumer use was never a priority, and so I wonder how big the overlap is in the Venn diagram of 'client system' and 'ZFS filesystem'. Not that big, right?
This struck me as a very odd claim. I've worked with Pure and Dell/EMC arrays and for VMWare workloads they normally got at least 3:1 dedupe/compression savings. Only storing one copy of the base VM image works extremely well. Dedupe/compression works really well on syslog servers where I've seen 6:1 savings.
The effectiveness of dedupe is strongly affected by the size of the blocks being hashed, with the smaller the better. As the blocks get smaller the odds of having a matching block grow rapidly. In my experience 4KB is my preferred block size.
Don’t you compress these directly? I normally see at least twice that for logs doing it at the process level.
CentOS made it famous. I don't know if it has a foothold in the Debian family.
I built a very simple, custom syslog solution, a syslog-ng server writing directly to a TimescaleDB hypertable (https://www.timescale.com/) that is then presented as a Grafana dashboard, and I am getting a 30x compression ratio.
I just created the repo and uploaded the documentation, please give me some more time to write the documentation: https://github.com/josefrcm/simple-syslog-service
That makes sense considering Advanced Format harddrives already have a 4K physical sector size, and if you properly low-level format them (to get rid of the ridiculous Windows XP compatibility) they also have 4K logical sector size. I imagine there might be some real performance benefits to having all of those match up.
You need to be careful and do staggered updates in the VMs or it'll spectacularly explode but it's possible and quite performant for less than mission critical VMs.
The partition/NTFS volume is 512GB. It currently stores 1.3 TB of "dedupped" data and has about 200GB free. Dedup runs asynchronously in the background and as a job during off hours.
It's a typo, yes. Thanks.
If your data is a good fit you might get away with 1GB per TB, but if you are out of luck the 5GB might not even be enough. That's why the article speaks of ZFS dedup having a small sweet spot that your data has to hit, and why most people don't bother.
Other file systems tend to prefer offline dedupe which has more favorable economics
(5GiB / 1TiB) * 4KiB to bits
((5 gibibytes) / (1 tebibyte)) × (4 kibibytes) = 160 bits
In the end it all comes down to: there are a whole lot of trade-offs that you have to take into account, and which ceilings you hit first depends entirely on everyone’s specific situation.
Dedupe/compression works really well on syslog
I apologize for the pedantry but dedupe and compression aren't the same thing (although they tend to be bundled in the enterprise storage world). Logs are probably benefiting from compression not dedupe and ZFS had compression all along.
(Although yes I understand that file-level compression with a standard algorithm is a different thing than dedup)
Both are trying to eliminate repeating data, it's just the frame of reference that changes. Compression in this context is operating on a given block or handful of blocks. Deduplication is operating on the entire "volume" of data. "Volume" having a different meaning depending on the filesystem/storage array in question.
Compression with a global dictionary would likely do better than dedup, but it would have a lot of other issues.
Secondly, I think you are conflating two different features: compression & de-duplication. In ZFS you can have compression turned on (almost always worth it) for a pool, but still have de-duplication disabled.
I consider dedupe/compression to be two different forms of the same thing. Compression reduces short-range duplication while deduplication reduces long-range duplication of data.
The old kind of NTFS compression from 1993 is completely transparent, but it uses a weak algorithm and processes each 64KB of file completely independently. It also fragments files to hell and back.
The new kind from Windows 10 has a better algorithm and can have up to 2MB of context, which is quite reasonable. But it's not transparent to writes, only to reads. You have to manually apply it and if anything writes to the file it decompresses.
I've gotten okay use out of both in certain directories, with the latter being better despite the annoyances, but I think they both have a lot of missed potential compared to how ZFS and BTRFS handle compression.
It's useful, but if they updated it it could get significantly better ratios and have less impact on performance.
General purpose fileserving or personal desktop/laptop use generally has very few duplicated blocks and is not worth the overhead. Backups are hit or miss depending on both how the backups are implemented, and if they are encrypted prior to the filesystem level.
Compression is a totally different thing and current ZFS best-practice is to enable it by default for pretty much every workload - the CPU used is barely worth mentioning these days, and the I/O savings can be considerable ignoring any storage space savings. Log storage is going to likely see a lot better than 6:1 savings if you have typical logging, at least in my experience.
I would contend this is because we don't have good transparent deduplication right now - just some bad compromises. Hard links? Edit anything and it gets edited everywhere - not what you want. Symlinks? Look different enough that programs treat them differently.
I would argue your regular desktop user actually has an enormous demand for a good deduplicating file system - there's no end of use cases where the first step is "make a separate copy of all the relevant files just in case" and a lot of the time we don't do it because it's just too slow and wasteful of disk space.
If you're working with, say, large video files, then a good dedupe system would make copies basically instant, and then have a decent enough split algorithm that edits/cuts/etc. of the type people try to do losslessly or with editing programs are stored efficiently without special effort. How many people are producing video content today? Thanks to Tiktok we've dragged that skill right down to "random teenagers" who might hopefully pipeline into working with larger content.
> If you put all this together, you end up in a place where so long as the client program (like /bin/cp) can issue the right copy offload call, and all the layers in between can translate it (eg the Window application does FSCTL_SRV_COPYCHUNK, which Samba converts to copy_file_range() and ships down to OpenZFS). And again, because there’s that clear and unambiguous signal that the data already exists and also it’s right there, OpenZFS can just bump the refcount in the BRT.
This probably has something to do with the VM's filesystem block size. If you have a 4KB filesystem and an 8KB file, the file might be fragmented differently but is still the same 2x4KB blocks just in different places.
Now I wonder if filesystems zero the slack space at the end of a file's last block in hopes of better host compression, vs. leaving whatever stale bytes were there before.
ZFS deduplication instead tries to find existing copies of data that is being written to the volume. For some use cases it could make a lot of sense (container image storage maybe?), but it's very inefficient if you already know some datasets to be clones of the others, at least initially.
Compare this to the deduplication approach: the filesystem would need to keep tabs on the data that's already on disk, identify the case where the same data is being written and then make that a reference to the existing data instead. Very inefficient if on application level you already know that it is just a copy being made.
In both of these cases, you could say that the data ends up being deduplicated. But the second approach is what the deduplication feature does. The first one is "just" copy-on-write.
Is there an inherent performance loss of using 64kB blocks on FS level when using storage devices that are 4kB under the hood?
Is there any way to do deduplication here? Or just outright delete all the derivatives?
It also occurs to me that spatial locality on spinning-rust disks might be affected, also affecting performance.
Maybe someday…
https://www.forrestthewoods.com/blog/dependencies-belong-in-...
That was a good read! I've been thinking a lot about what comes after git too. One thing you don't address is that no one wants all parts at once either, nor would it fit on one computer, so I should be able to check out just one subdirectory of a repo.
That’s the problem that a virtual file system solves. When you clone a repo it only materializes files when they’re accessed. This is how my work repo operates. It’s great.
The problem I had with Google3 is that the tools weren't great at branching and didn't fit my workflow, which tends to involve multiple checkouts (or worktrees using git). Being forced to check out the root of the repo, and then having to manage a symlink on top of that, is no good for users who don't need/want to manage the complexity of having a single machine-global checkout.
Over the years I made multiple copies of my laptop HDD to different external HDDs, ended up with lots of duplicate copies of files.
There are a few different ways you could solve it but it depends on what final outcome you need.
I can't have like 10 external HDDs attached at the same time, so the tool needs to dump details (hashes?) somewhere on Mac HDD, and compare against those to find the duplicates.
cd /path/to/drive
find . -type f -exec sha256sum {} + | sed -E 's/^([^ ]+) +\./\1,/' >> ~/all_hashes.txt
Run that for each drive, then when you're done run: sort ~/all_hashes.txt > ~/sorted_hashes.txt
uniq -w64 -D ~/sorted_hashes.txt > ~/non_unique_hashes.txt
The output in ~/non_unique_hashes.txt will contain only the non-unique hashes that appear on more than one path.
E.g. find out what my neighbours have installed.
Or if the data before an SSH key is predictable, keep writing that out to disk guessing the next byte or something like that.
Guessing one byte at a time is not possible though because dedupe is block-level in ZFS.
Currently, it seems to be reducing on-disk footprint by a factor of 3.
When I first started this project, 2TB hard drives were the largest available.
My current setup uses slow 2.5-inch hard drives; I attempt to improve things somewhat via NVMe-based Optane drives for cache.
Every few years, I try to do a better job of things but at this point, the best improvement would be radical simplification.
ZFS has served very well in terms of reliability. I haven't lost data, and I've been able to catch lots of episodes of almost losing data. Or writing the wrong data.
Not entirely sure how I'd replace it, if I want something that can spot bit rot and correct it. ZFS scrub.
While that obviously leads to duplicate data from files installed by operating systems, there's a lot of duplicate media libraries. (File-oriented dedupe might not be very effective for old iTunes collections, as iTunes stores metadata like how many times a song has been played in the same file as the audio data. So the hash value of a song will change every time it's played; it looks like a different file. ZFS block-level dedupe might still work ok here because nearly all of the blocks that comprise the song data will be identical.)
Anyway. It's a huge pile of stuff, a holding area of data that should really be sorted into something small and rational.
The application leads to a big payoff for deduplication.
The ZIL is the "ZFS Intent Log", a log-structured ordered stream of file operations to be performed on the ZFS volume.
If power goes out, or the disk controller goes away unexpectedly, this ZIL is the log that will get replayed to bring the volume back to a consistent state. I think.
Usually the ZIL is on the same storage devices as the rest of the data. So a write to the ZIL has to wait for disk in the same line as everybody else. It might improve performance to give the ZIL its own, dedicated storage devices. NVMe is great, lower latency the better.
Since the ZFS Intent Log gets flushed to disk every five seconds or so, a dedicated ZIL device doesn't have to be very big. But it has to be reliable and durable.
Windows made small, durable NVMe cache drives a mainstream item for a while, when most laptops still used rotating hard drives. Optane NVMe at 16GB is cheap, like twenty bucks; buy three of them and use two as a mirrored pair for your ZIL.
----
Then there's the second-level read cache, the L2ARC (the ARC proper lives in RAM). I use 1TB mirrored NVMe devices.
Finally, there's a "special" vdev that can, for example, be designated for metadata-heavy things like the dedup table (which Fast Dedup is making smaller!).
- The ZIL is used exclusively for synchronous writes - critical for VMs, databases, NFS shares, and other applications requiring strict write consistency. Many conventional workloads won't benefit. Use `zilstat` to monitor.
- The cheap 16GB Optane devices are indeed great in terms of latency, but they were designed primarily for read caching and have significantly limited write speeds. If you need better throughput, look for the larger Optane models which don't have these limitations.
- SLOG doesn't need to be mirrored - the only risk is if your SLOG device fails at the exact moment your system crashes. While mirroring is reasonable for production systems, with these cheap 16GB Optanes you're just guaranteeing they'll wear out at the same time. You could kill one at a time instead. :)
- As for those 1TB NVMe devices for read cache (L2ARC) - that's probably overkill unless you have a very specific use case. L2ARC actually consumes RAM to track what's in the cache, and that RAM might be better used for ARC (the main memory cache). L2ARC only makes sense when you have well-understood workload patterns and your ARC is consistently under pressure - like in a busy database server or similar high-traffic scenario. Use `arcstat` to monitor your cache hit ratios before deciding if you need L2ARC.
I've been building a home media server lately and have thought about doing something like this. However, there's a big problem: these little 16GB Optane drives are NVMe. My main boot drive and where I keep the apps is also NVME (not mirrored, yet: for now I'm just regularly copying to the spinning disks for backup, but a mirror would be better). So ideally that's 4 NVMe drives, and that's with me "cheating" and making the boot drive a partition on the main NVMe drive instead of a separate drive as normally recommended.
So where are you supposed to plug all these things in? My pretty-typical motherboard has only 2 NVMe slots, one that connects directly to the CPU (PCIe 4.0) and one that connects through the chipset (slower PCIe 3.0). Is the normal method to use some kind of PCIe-to-NVMe adapter card and plug that into the PCIe x16 video slot?
I think you're pretty far into XY territory here. I'd recommend hanging out in r/homelab and r/zfs, read the FAQs, and then if you still have questions, maybe start out with a post explaining your high level goals and challenges.
My main question here was just what I asked about NVMe drives. Many times in my research that you recommended, people recommended using multiple NVMe drives. But even a mirror is going to be problematic: on a typical motherboard (I'm using a AMD B550 chipset), there's only 2 slots, and they're connected very differently, with one slot being much faster (PCIe4) than the other (PCIe3) and having very different latency, since the fast one connects to the CPU and the slow one goes through the chipset.
If NVMe is your only option, I'd try to find a couple used 1.92TB enterprise class drives on ebay, and go ahead and mirror those without worrying about the different performance characteristics (the pool will perform as fast as the slowest device, that's all) - but 1.92TB isn't much for a media server.
In general, I'd say consumer class SSDs aren't worth the time it'll take you to install them. I'd happily deploy almost any enterprise class SSD with 50% beat out of it over almost any brand new consumer class drive. The difference is stark - enterprise drives offer superior performance through PLP-improved latency and better sustained writes (thanks to higher quality NAND and over-provisioning), while also delivering much better longevity.
I do have 4 regular SATA spinning disks (enterprise-class), for bulk data storage, in a RAIDZ1 array. I know it's not as safe as RAIDZ2, but I thought it'd be safe enough with only 4 disks, and I want to keep power usage down if possible.
I'm using (right now) a single 512GB NVMe drive for both booting and app storage, since it's so much faster. The main data will on the spinners, but the apps themselves on the NVMe which should improve performance a lot. It's not mirrored obviously, so that's one big reason I'm asking about the NVMe slots; sticking a 2nd NVMe drive in this system would actually slow it down, since the 2nd slot is only PCIe3 and connected through the chipset, so I'm wondering if people do something different, like using some kind of adapter card for the x16 video slot. I just haven't seen any good recommendations online in this regard. For now, I'm just doing daily syncs to the raid array, so if the NVMe drive suddenly dies somehow, it won't be that hard to recover, though obviously not nearly as easy as with a mirror. This obviously isn't some kind of mission-critical system so I'm ok with this setup for now; some downtime is OK, but data loss is not.
Thanks for the advice!
Move your NVMe to the other slot, I bet you can't tell a difference without synthetic benchmarks.
cp --reflink=auto
You get file-level deduplication. The command above performs a lightweight copy (a ZFS clone at the file level), where the data blocks are copied only when modified. It's a copy, not a hard link. The same should work in other copy-on-write transactional filesystems as well, if they have reflink support.
Linked clones shouldn’t need that. They likely start out with only references to the original blocks, and then replace them when they change. If so, it’s a different concept (as it would mean that any new duplicate blocks are not shared), but for the use case of “spin up a hundred identical VMs that only change comparably little” it sounds more efficient performance-wise, with a negligible loss in space efficiency.
Am I certain of this? No, this is just what I quickly pieced together based on some assumptions (albeit reasonable ones). Happy to be told otherwise.
https://www.yellow-bricks.com/2018/05/01/instant-clone-vsphe...
In fact, it's a much more common feature than active deduplication.
VM drives are just files, and it's weird that you imply a filesystem wouldn't know about the semantics of a file getting copied and altered, and would only understand blocks.
> The downside is that every single write and free operation requires a lookup and then a write to the dedup table, regardless of whether or not the write or free proper was actually done by the pool.
Which, if universally true, is very much different from what a hypervisor could do instead, and I've detailed the potential differences. But if a hypervisor does use some sort of clone system call instead, that can indeed shift the same approach into the fs layer, and my genuine question is whether it does.
It sounds like the information you need is that cp has a flag to make cloning happen. I think it even became default behavior recently.
Also that the article quote is strictly talking about dedup. That downside does not generalize to the clone/reflink features. They use a much more lightweight method.
This is one of the relevant syscalls: https://man7.org/linux/man-pages/man2/copy_file_range.2.html
Huh? What do you mean? They absolutely are. I've made extensive use of them in ESXi/vsphere clusters in situations where I'm spinning up and down many temporary VMs.
https://docs.vmware.com/en/VMware-Fusion/13/com.vmware.fusio...
Not to be rude, but does this have any meaning?
https://blogs.gnome.org/wjjt/2021/11/24/on-flatpak-disk-usag...
If you're selling VMs to customers then there's probably no advantage in using deduplication.
Look, even Proxmox, which I totally expected to support encryption in the default installation (it has „Enterprise“ on the website), loses important features when you try to use it with encryption.
Also, please study the issue tracker; there are a few surprising things I would not have expected to exist in a production file system.
Still the same picture, encryption seems to be not a first class citizen in ZFS land.
I never changed a thing (because it also had some cons) and believe that as long as a ZFS scrub shows no errors, all is OK. Could I be missing a problem?
Is there an easy way to analyze your dataset to find if you're in this sweet spot?
If so, is anyone working on some kind of automated partial dedup system where only portions of the filesystem are dedupped based on analysis of how beneficial it would be?
I am NOT interested in finding duplicate files, but duplicate slices within all my files overall.
I can easily throw together code myself to find duplicate files.
EDIT: I guess I’m looking for a ZFS/BTRFS/other dedupe preview tool that would say “you might save this much if you used this dedupe process.”
Some of this is straight up VM storage volumes for ESX virtual disks, some direct LUNs for our file servers. Our gains are upwards of 70%.
It's not 100% clear to me why explicit deduping blocks would give you any significant benefit over a properly chosen compression algorithm.
Would someone know if the fast dedup works also for this? Anything else I could be using instead?