Big thank you to all the people that worked on it!
BTW has anyone tried a probabilistic dedup approach using something like a bloom filter, so you don't have to store the entire dedup table of hashes verbatim? Collect groups of ~100 block hashes into a bucket each, and store a hyper-compressed representation in a bloom filter. On write, look up the hash of the block to write in the bloom filter, and if a potential dedup hit is detected, walk the 100 blocks in the matching bucket manually to look for any identical hashes. In theory you could do this with layers of bloom filters at different resolutions and dynamically swap the heavier ones out to disk when memory pressure is too high to keep the high-resolution ones in RAM. Allowing the accuracy of the bloom filter to be changed as a tunable parameter would let people choose their preferred ratio of CPU time/overhead to bytes saved.
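For what it's worth, a rough sketch of that prefilter idea in plain C (everything here is hypothetical, not ZFS code): the filter only answers "definitely new" or "maybe seen before", so a false positive just costs an extra walk of the matching bucket of full hashes, never a wrong dedup.

#include <stdint.h>
#include <string.h>

/* Hypothetical prefilter: a bloom filter in front of per-bucket lists of
 * full block hashes. Sizes are made-up tunables. */
#define BLOOM_BITS   (1u << 24)   /* 2 MiB bit array */
#define BLOOM_HASHES 4            /* probes per lookup */
#define BUCKET_SIZE  100          /* full hashes per bucket, walked on a hit */

typedef struct { uint8_t bits[BLOOM_BITS / 8]; } bloom_t;

/* Derive probe positions by slicing the (>=16 byte) block hash. */
static uint32_t probe(const uint8_t *block_hash, int i)
{
    uint32_t v;
    memcpy(&v, block_hash + i * 4, sizeof v);
    return v % BLOOM_BITS;
}

static void bloom_add(bloom_t *b, const uint8_t *block_hash)
{
    for (int i = 0; i < BLOOM_HASHES; i++) {
        uint32_t p = probe(block_hash, i);
        b->bits[p / 8] |= (uint8_t)(1u << (p % 8));
    }
}

/* Returns 0 = definitely a new block, 1 = maybe a duplicate; on 1 you fall
 * back to walking the matching ~BUCKET_SIZE-entry bucket of full hashes. */
static int bloom_maybe_dup(const bloom_t *b, const uint8_t *block_hash)
{
    for (int i = 0; i < BLOOM_HASHES; i++) {
        uint32_t p = probe(block_hash, i);
        if (!(b->bits[p / 8] & (1u << (p % 8))))
            return 0;
    }
    return 1;
}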
You may have seen in the WARC standard that they already do de-duplication based on hashes and use pointers after the first store. So this is exactly a case where FS-level dedup is not all that good.
[edit] Should add a link, this is a pretty good overview, but you can also look at implementations such as the new zeno crawler.
https://support.archive-it.org/hc/en-us/articles/208001016-A...
I know of the CDX index files produced by some tools, but I don't know the details or whether they could be used to dedup across WARCs; I've only been referencing the WARC file specs via IIPC's old standards docs.
dm-vdo has the same behaviour.
You may be better off with long-range solid compression instead, or unpacking the warc files into a directory equivalent, or maybe there is some CDC-based FUSE system out there (Seafile perhaps)
The wget extractor within archivebox can produce WARCs as an output but no parts of ArchiveBox are built to rely on those, they are just one of the optional extractors that can be run.
Just store fingerprints in a database and run through that at night and fixup the block pointers...
That's why. Due to reasons[1], ZFS does not have the capability to rewrite block pointers. It's been a long requested feature[2] as it would also allow for defragmentation.
I've been thinking this could be solved using block pointer indirection, like virtual memory, at the cost of a bit of speed.
But I'm by no means a ZFS developer, so there's surely something I'm missing.
[1]: http://eworldproblems.mbaynton.com/posts/2014/zfs-block-poin...
Once you get a lot of snapshots, though, the indirection costs start to rise.
I've also heard there are some experimental branches that make it possible to run Hammer2 on FreeBSD. But FreeBSD also lacks RDMA support. For FreeBSD 15, Chelsio has sponsored NVMe-oF target and initiator support. I think this is just TCP though.
Because:
> When dedup is enabled [...] every single write and free operation requires a lookup and then a write to the dedup table, regardless of whether or not the write or free proper was actually done by the pool.
To me, this is "obviously" the wrong approach in most cases. When I'm writing data, I want that write to complete as fast as possible, even at the cost of disk space. That's why I don't save files I'm actively working on in 7zip archives.
But later on, when the system is quiet, I would love for ZFS to go back and figure out which data is duplicated, and use the BRT or whatever to reclaim space. This could be part of a normal scrub operation.
There are of course edge cases to consider to avoid data loss, but I imagine it might come soon, either officially or as a third-party tool.
Performance-wise it can be a similar issue in some edge cases, but usually caching can address those problems.
Efficiency being the enemy of reliability, sometimes.
It’s why metadata gets duplicated in ZFS the way it does on all volumes.
Having seen this play out a bunch of times, it isn’t an uncommon need either.
Well I didn't suggest that. I said important files only for the extra backup, and I was talking about reallocating resources not getting new ones.
The simplest version is the scenario where turning on dedup means you need one less drive of space. Convert that drive to parity and you'll be better off. Split that drive from the pool and use it to backup the most important files and you'll be better off.
If you can't save much space with dedup then don't bother.
I’m noting that turning on volume-wide dedup can have the con that you can’t choose (but it looks like you can!) to manually duplicate data.
> There was an implication in your statement that volume level was the level of granularity, yeah?
There was an implication that the volume level was the level of granularity for adding parity.
But that was not the implication for "another backup of your most important files".
> I’m noting that turning on volume-wide dedup can have the con that you can’t choose (but it looks like you can!) to manually duplicate data.
You can't choose just by copying files around, but it's pretty easy to set copies=2 on specific directories. And I'd say that's generally a better option, because it keeps your copies up to date at all times. Just make sure snapshots are happening, and files in there will be very safe.
Manual duplication is the worst kind of duplication, so while it's good to warn people that it won't work with dedup on, actually losing the ability is not a big deal when you look at the variety of alternatives. It only tips the balance in situations where dedup is near-useless to start with.
I also wonder if it would make sense for ZFS to always automatically dedupe before taking a snapshot. But you'd have to make this behavior configurable since it would turn snapshotting from a quick operation into an expensive one.
This is the Windows dedupe methodology. I've used it pretty extensively and I'm generally happy with it when the underlying hardware is sufficient. It's very RAM and I/O hungry but you can schedule and throttle the "groveler".
I have had some data eating corruption from bugs in the Windows 2012 R2 timeframe.
A more precise calculation on my actual data shows that today's data would allow the dedup table to fit in RAM, but if I ever want to actually use most of the 40TB of storage, I'd need more RAM. I've had a ZFS system swap dedup to disk before, and the performance dropped to approximately zero; fixing it was a PITA, so I'm not doing that anytime soon.
In any case it doesn’t stick out to me as a problem that needs to be fixed. You can’t fill a propane tank to 100% either.
All my arrays send me nightly e-mails at 80%, so I'm aware when I get there, but on a desktop system that's typically not the case.
You should be able to detect duplicates online. Low priority sweeping is something else. But you can at least reduce pause times.
That said, you can still do a two-space GC, but it's slow and possibly wasteful.
Once that is in, any of the existing dupe-finding tools that use it (i.e. jdupes, duperemove) will just work on ZFS.
Compression, encryption and streaming sparse files together are impressive already. But now we get a new BRT entry appearing out of nowhere, dedup index pruning one that was there a moment ago, all while correctly handling arbitrary errors in whatever simultaneous deduped writes, O_DIRECT writes, FALLOC_FL_PUNCH_HOLE and reads were waiting for the same range? Sounds like adding six new places to hold the wrong lock to me.
It's no worse than anything else related to block cloning :)
ZFS already supports FICLONERANGE, the thing FIDEDUPRANGE changes is that the compare is part of the atomic guarantee.
So in fact, I'd argue it's actually better than what is there now - yes, the hardest part is the locking, but the locking is handled by the dedup range call getting the right locks upfront and passing them along, so nothing else is grabbing the wrong locks. It has to work that way because of the requirements to implement the ioctl properly. We have to be able to read both ranges, compare them, and clone them, all as an atomic operation with respect to concurrent writes. So instead of random things grabbing random locks, we pass the right locks around and everything verifies the locks.
This means fideduprange is not as fast as it maybe could be, but it does not run into the "oops we forgot the right kind of lock" issue. At worst, it would deadlock, because it's holding exclusive locks on all that it could need before it starts to do anything in order to guarantee both the compare and the clone are atomic. So something trying to grab a lock forever under it will just deadlock.
This seemed the safest course of implementation.
ficlonerange is only atomic in the cloning, which means it does not have to read anything first, it can just do blind block cloning. So it actually has a more complex (but theoretically faster) lock structure because of the relaxed constraints.
All modern tools use FIDEDUPRANGE, which is an ioctl meant for explicitly this use case - telling the FS that two files have bytes that should be shared.
Under the covers, the FS does block cloning or whatever to make it happen.
Nothing is copied.
ZFS does support FICLONERANGE, which is the same as FIDEDUPRANGE but it does not verify the contents are the same prior to cloning.
Both are atomic with respect to concurrent writes, but for FIDEDUPRANGE that means the compare is part of the atomicity. So you don't have to do any locking.
If you used FICLONERANGE, you'd need to lock the two file ranges, verify, clone, unlock
FIDEDUPRANGE does this for you.
So it is possible, with no changes to ZFS, to modify dedup tools to work on ZFS by changing them to use FICLONERANGE + locking if FIDEDUPRANGE does not exist.
Because FIDEDUPRANGE has the compare as part of the atomic guarantee, you don't need to lock in userspace around using it, and so no dedup utility bothers to do FICLONERANGE + locking. Also, ZFS is the only FS that implements FICLONERANGE but not FIDEDUPRANGE :)
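For anyone curious what that looks like from userspace, here's a minimal sketch of the call as Linux defines it in linux/fs.h (where it's spelled FIDEDUPERANGE). This is my own illustration, not taken from those tools or PRs; the filenames and the 1 MiB length are made up, and error handling is trimmed.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>      /* FIDEDUPERANGE, struct file_dedupe_range */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>

int main(void)
{
    int src = open("a.bin", O_RDONLY);  /* hypothetical files, believed to */
    int dst = open("b.bin", O_RDWR);    /* share their first 1 MiB         */
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    /* One destination range; the struct ends in a flexible array of them. */
    struct file_dedupe_range *r =
        calloc(1, sizeof(*r) + sizeof(struct file_dedupe_range_info));
    r->src_offset = 0;
    r->src_length = 1 << 20;
    r->dest_count = 1;
    r->info[0].dest_fd = dst;
    r->info[0].dest_offset = 0;

    /* The kernel re-reads and compares both ranges under its own locks;
     * blocks are only shared if the bytes still match at that instant. */
    if (ioctl(src, FIDEDUPERANGE, r) < 0) { perror("FIDEDUPERANGE"); return 1; }

    if (r->info[0].status == FILE_DEDUPE_RANGE_SAME)
        printf("deduped %llu bytes\n", (unsigned long long)r->info[0].bytes_deduped);
    else
        printf("not deduped (status %d)\n", r->info[0].status);

    free(r);
    return 0;
}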
Like jdupes or duperemove.
I sent PRs to both the ZFS folks and the duperemove folks to support the syscalls needed.
I actually have to go follow up on the ZFS one; it took a while to review and I realized I completely forgot to finish it up.
Hard links are all equivalent. A file has any number of hard links, and at least in theory you can't distinguish between them.
The risk with hardlinks is that you might alter the file. Reflinks remove that risk, and also perform very well.
However, the fact that editing one copy edits all of them still makes this a non-solution for me at least. I'd also strongly prefer deduping at the block level vs file level.
Anyway. Offline/lazy dedup (not in the zfs dedup sense) is something that could be done in userspace, at the file level on any filesystem that supports reflinks. When a tool like rdfind finds a duplicate, instead of replacing with a hardlink, create a copy of the file with `copy_file_range(2)` and let the filesystem create a reflink to it. Now you've got space savings and they're two separate files so if one is written to the other remains the same.
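A sketch of that replacement step, assuming a dupe finder has already verified the two files are byte-identical (the paths are made up). Note that whether copy_file_range actually ends up as a reflink depends on the filesystem; FICLONE/FIDEDUPERANGE are the explicit ways to ask.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Replace dup.dat with a reflink-friendly copy of orig.dat instead of a
 * hardlink. Copy into a temp file and rename(2) over the duplicate so a
 * crash can't lose it. */
int main(void)
{
    int in  = open("orig.dat", O_RDONLY);
    int out = open("dup.dat.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(in, &st) < 0) { perror("fstat"); return 1; }

    off_t remaining = st.st_size;
    while (remaining > 0) {
        /* On filesystems with reflink support the kernel can satisfy this
         * by sharing blocks rather than copying bytes through userspace. */
        ssize_t n = copy_file_range(in, NULL, out, NULL, remaining, 0);
        if (n <= 0) { perror("copy_file_range"); return 1; }
        remaining -= n;
    }

    if (rename("dup.dat.tmp", "dup.dat") < 0) { perror("rename"); return 1; }
    return 0;
}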
This is par for the course with ZFS though. If you delete a non-duplicated file you don't get the space back until any snapshots referencing the file are deleted.
Basically all modern dupe tools use fideduprange, which is meant to tell the FS which things should be sharing data, and let it take care of the rest. (BTRFS, bcachefs, etc. support this ioctl, and ZFS will soon too.)
Unlike copy_file_range, it is meant for exactly this use case, and will tell you how many bytes were dedup'd, etc.
A better design would have been to split every node that has block pointers into two sections: one that has only logical block pointers and all of whose contents get hashed into the tree, and one that has only the physical locations (as if it were a cache) of the corresponding logical block pointers in the first section, with the second section _not_ hashed into the Merkle hash tree. Then BP rewrite would only require re-writing blocks that are not part of the Merkle hash tree.
But as it is you can't get BP rewrite to work on ZFS, so you can't get what you're asking for.
Well... maybe. Perhaps on read hash mismatch ZFS could attempt to locate the pointed-to block in the dedup table using the hash from the pointer. Then ZFS could reallocate the dedup'ed block. The price you'd pay then is one pointless read -- not too bad. The impossibility of BP rewrite generally leads to band-aids like this.
It looks like it means: https://en.wikipedia.org/wiki/Content-addressable_storage
zfs set mountpoint=/mnt/foo foopy/foo
zfs set dedup=off foopy/foo
zfs set mountpoint=/mnt/baz foopy/baz
zfs set dedup=on foopy/baz
Save all your stuff in /mnt/foo, then when you want to dedup do mv /mnt/foo/bar /mnt/baz/
Yeah... this feels like picrel, and it is https://i.pinimg.com/originals/cb/09/16/cb091697350736aae53afe4b548b9d43.jpg
but it's here and now and you can do it now. That said, NVMe has changed that balance a lot, and you can afford a lot less before you're bottlenecking the drives.
Tried to find the talk but failed; I was sure I had seen it at a Developer Summit, but alas.
File systems are pretty good if you have a mix of human and programmatic uses, especially when the programmatic cases are not very heavy duty.
The programmatic scenarios are often entirely human hostile, if you try to imagine what would be involved in actually using them. Like direct S3 access, for example.
So if you have non-ASCII characters in your paths, encoding/decoding is guesswork, at worst differing from path segment to path segment, and there's no metadata attached saying which encoding to use.
For less patient readers, note that the concise summary is at the bottom of the post, not the top.
kstat.zfs.<pool>.misc.ddt_stats_<checksum>
Typesetting code on a narrow screen is tricky![1] https://despairlabs.com/blog/posts/2024-10-27-openzfs-dedup-...
It does scroll here, though only slightly; it might be worse on a smaller device with less screen space and/or different PPI settings.
> As we’ve seen from the last 7000+ words, the overheads are not trivial. Even with all these changes, you still need to have a lot of deduplicated blocks to offset the weight of all the unique entries in your dedup table. [...] what might surprise you is how rare it is to find blocks eligible for deduplication are on most general purpose workloads.
> But the real reason you probably don’t want dedup these days is because since OpenZFS 2.2 we have the BRT (aka “block cloning” aka “reflinks”). [...] it’s actually pretty rare these days that you have a write operation coming from some kind of copy operation, but you don’t know that came from a copy operation. [...] [This isn't] saving as much raw data as dedup would get me, though it’s pretty close. But I’m not spending a fortune tracking all those uncloned and forgotten blocks.
> [Dedup is only useful if] you have a very very specific workload where data is heavily duplicated and clients can’t or won’t give direct “copy me!” signal
The section labeled "summary" imo doesn't do the article justice, as it's fairly vague. I hope these quotes from near the end of the article give a more concrete idea of why (or why not) to use it.
Didn't read the 7000 words... But isn't the dedup table in the form of a bunch of bloom filters so the whole dedup table can be stored with ~1 bit per block?
When you know there is likely a duplicate, you can create a table of blocks where there is a likely duplicate, and find all the duplicates in a single scan later.
That avoids the massive accounting overhead of storing per-block metadata.
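For a rough sense of the trade-off in the "~1 bit per block" version: an optimally tuned Bloom filter's false-positive rate depends only on the bits-per-entry ratio m/n,

p = \left(\tfrac{1}{2}\right)^{k}, \quad k = \tfrac{m}{n}\ln 2 \;\Longrightarrow\; p \approx 0.6185^{\,m/n}

so at about 1 bit per block, a bit over 60% of genuinely new blocks would still get flagged for the later scan, while ~8 bits per block brings that down to roughly 2%. Either way false positives only cost extra scanning; real duplicates are never missed.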
In general, even during rsync operations one often turns off compression for large video files, as compressing them yields little or no saving on storage/transfer while eating RAM and CPU power.
De-duplication is good for Virtual Machine OS images, as the majority of the storage cost is a replicated backing image. =3
No, you won't save much on a client system. That isn't what the feature is made for.
It is very clear that consumer use was never a priority, and so I wonder how big the overlap is in the Venn diagram of 'client system' and 'ZFS filesystem'. Not that big, right?
This struck me as a very odd claim. I've worked with Pure and Dell/EMC arrays and for VMWare workloads they normally got at least 3:1 dedupe/compression savings. Only storing one copy of the base VM image works extremely well. Dedupe/compression works really well on syslog servers where I've seen 6:1 savings.
The effectiveness of dedupe is strongly affected by the size of the blocks being hashed, with the smaller the better. As the blocks get smaller the odds of having a matching block grow rapidly. In my experience 4KB is my preferred block size.
Don’t you compress these directly? I normally see at least twice that for logs doing it at the process level.
CentOS made it famous. I don't know if it has a foothold in the Debian family.
I built a very simple, custom syslog solution, a syslog-ng server writing directly to a TimescaleDB hypertable (https://www.timescale.com/) that is then presented as a Grafana dashboard, and I am getting a 30x compression ratio.
I just created the repo and uploaded the documentation, please give me some more time to write the documentation: https://github.com/josefrcm/simple-syslog-service
That makes sense considering Advanced Format harddrives already have a 4K physical sector size, and if you properly low-level format them (to get rid of the ridiculous Windows XP compatibility) they also have 4K logical sector size. I imagine there might be some real performance benefits to having all of those match up.
You need to be careful and do staggered updates in the VMs or it'll spectacularly explode but it's possible and quite performant for less than mission critical VMs.
The partition/NTFS volume is 512GB. It currently stores 1.3 TB of "dedupped" data and has about 200GB free. Dedup runs asynchronously in the background and as a job during off hours.
It's a typo, yes. Thanks.
If your data is a good fit you might get away with 1GB per TB, but if you are out of luck the 5GB might not even be enough. That's why the article speaks of ZFS dedup having a small sweet spot that your data has to hit, and why most people don't bother.
Other file systems tend to prefer offline dedupe which has more favorable economics
(5GiB / 1TiB) * 4KiB to bits
((5 gibibytes) / (1 tebibyte)) × (4 kibibytes) = 160 bits
In the end it all comes down to: there are a whole lot of trade-offs that you have to take into account, and which ceilings you hit first depends entirely on everyone’s specific situation.
Dedupe/compression works really well on syslog
I apologize for the pedantry but dedupe and compression aren't the same thing (although they tend to be bundled in the enterprise storage world). Logs are probably benefiting from compression not dedupe and ZFS had compression all along.
(Although yes I understand that file-level compression with a standard algorithm is a different thing than dedup)
Both are trying to eliminate repeating data, it's just the frame of reference that changes. Compression in this context is operating on a given block or handful of blocks. Deduplication is operating on the entire "volume" of data. "Volume" having a different meaning depending on the filesystem/storage array in question.
Compression with a global dictionary would likely do better than dedup, but it would have a lot of other issues.
Secondly, I think you are conflating two different features: compression & de-duplication. In ZFS you can have compression turned on (almost always worth it) for a pool, but still have de-duplication disabled.
I consider dedupe/compression to be two different forms of the same thing. Compression reduces short-range duplication while deduplication reduces long-range duplication of data.
The old kind of NTFS compression from 1993 is completely transparent, but it uses a weak algorithm and processes each 64KB of file completely independently. It also fragments files to hell and back.
The new kind from Windows 10 has a better algorithm and can have up to 2MB of context, which is quite reasonable. But it's not transparent to writes, only to reads. You have to manually apply it and if anything writes to the file it decompresses.
I've gotten okay use out of both in certain directories, with the latter being better despite the annoyances, but I think they both have a lot of missed potential compared to how ZFS and BTRFS handle compression.
It's useful, but if they updated it it could get significantly better ratios and have less impact on performance.
General purpose fileserving or personal desktop/laptop use generally has very few duplicated blocks and is not worth the overhead. Backups are hit or miss depending on both how the backups are implemented, and if they are encrypted prior to the filesystem level.
Compression is a totally different thing and current ZFS best-practice is to enable it by default for pretty much every workload - the CPU used is barely worth mentioning these days, and the I/O savings can be considerable ignoring any storage space savings. Log storage is going to likely see a lot better than 6:1 savings if you have typical logging, at least in my experience.
I would contend this is because we don't have good transparent deduplication right now - just some bad compromises. Hard links? Edit anything and it gets edited everywhere - not what you want. Symlinks? Look different enough that programs treat them differently.
I would argue your regular desktop user actually has an enormous demand for a good deduplicating file system - there's no end of use cases where the first step is "make a separate copy of all the relevant files just in case" and a lot of the time we don't do it because it's just too slow and wasteful of disk space.
If you're working with, say, large video files, then a good dedupe system would make copies basically instant, and then have a decent enough split algorithm that edits/cuts/etc. of the type people try to do losslessly or with editing programs are stored efficiently without special effort. How many people are producing video content today? Thanks to Tiktok we've dragged that skill right down to "random teenagers" who might hopefully pipeline into working with larger content.
> If you put all this together, you end up in a place where so long as the client program (like /bin/cp) can issue the right copy offload call, and all the layers in between can translate it (eg the Window application does FSCTL_SRV_COPYCHUNK, which Samba converts to copy_file_range() and ships down to OpenZFS). And again, because there’s that clear and unambiguous signal that the data already exists and also it’s right there, OpenZFS can just bump the refcount in the BRT.
This probably has something to do with the VM's filesystem block size. If you have a 4KB filesystem and an 8KB file, the file might be fragmented differently but is still the same 2x4KB blocks just in different places.
Now I wonder if filesystems zero the slack space at the end of a file's last block in hopes of better host compression, vs. leaving whatever stale bytes were there before.
ZFS deduplication instead tries to find existing copies of data that is being written to the volume. For some use cases it could make a lot of sense (container image storage maybe?), but it's very inefficient if you already know some datasets to be clones of the others, at least initially.
Compare this to the deduplication approach: the filesystem would need to keep tabs on the data that's already on disk, identify the case where the same data is being written and then make that a reference to the existing data instead. Very inefficient if on application level you already know that it is just a copy being made.
In both of these cases, you could say that the data ends up being deduplicated. But the second approach is what the deduplication feature does. The first one is "just" copy-on-write.
Is there an inherent performance loss of using 64kB blocks on FS level when using storage devices that are 4kB under the hood?
Is there any way to do deduplication here? Or just outright delete all the derivatives?
It also occurs to me that spatial locality on spinning-rust disks might be affected, also affecting performance.
Maybe someday…
https://www.forrestthewoods.com/blog/dependencies-belong-in-...
That was a good read! I've been thinking a lot about what comes after git too. One thing you don't address is that no one wants all parts at once either, nor would it fit on one computer, so I should be able to check out just one subdirectory of a repo.
That’s the problem that a virtual file system solves. When you clone a repo it only materializes files when they’re accessed. This is how my work repo operates. It’s great.
The problem I had with Google3 is that the tools weren't great at branching and didn't fit my workflow, which tends to involve multiple checkouts (or worktrees using git). Being forced to check out the root of the repo, and then having to manage a symlink on top of that, is no good for users who don't need/want to manage the complexity of having a single machine-global checkout.
Over the years I made multiple copies of my laptop HDD to different external HDDs, ended up with lots of duplicate copies of files.
There are a few different ways you could solve it but it depends on what final outcome you need.
I can't have like 10 external HDDs attached at the same time, so the tool needs to dump details (hashes?) somewhere on Mac HDD, and compare against those to find the duplicates.
cd /path/to/drive
find . -type f -exec sha256sum {} + | sed -E 's/^([^ ]+) +\./\1,/' >> ~/all_hashes.txt
Run that for each drive, then when you're done run: sort ~/all_hashes.txt > ~/sorted_hashes.txt
uniq -w64 -D ~/sorted_hashes.txt > ~/non_unique_hashes.txt
The output in ~/non_unique_hashes.txt will contain only the non-unique hashes that appear on more than one path.
E.g. find out what my neighbours have installed.
Or if the data before an SSH key is predictable, keep writing that out to disk guessing the next byte or something like that.
Guessing one byte at a time is not possible though because dedupe is block-level in ZFS.
Currently, it seems to be reducing on-disk footprint by a factor of 3.
When I first started this project, 2TB hard drives were the largest available.
My current setup uses slow 2.5-inch hard drives; I attempt to improve things somewhat via NVMe-based Optane drives for cache.
Every few years, I try to do a better job of things but at this point, the best improvement would be radical simplification.
ZFS has served very well in terms of reliability. I haven't lost data, and I've been able to catch lots of episodes of almost losing data. Or writing the wrong data.
Not entirely sure how I'd replace it, if I want something that can spot bit rot and correct it. ZFS scrub.
While that obviously leads to duplicate data from files installed by operating systems, there's a lot of duplicate media libraries. (File-oriented dedupe might not be very effective for old iTunes collections, as iTunes stores metadata like how many times a song has been played in the same file as the audio data. So the hash value of a song will change every time it's played; it looks like a different file. ZFS block-level dedupe might still work ok here because nearly all of the blocks that comprise the song data will be identical.)
Anyway. It's a huge pile of stuff, a holding area of data that should really be sorted into something small and rational.
The application leads to a big payoff for deduplication.
The ZIL is the "ZFS Intent Log", a log-structured ordered stream of file operations to be performed on the ZFS volume.
If power goes out, or the disk controller goes away unexpectedly, this ZIL is the log that will get replayed to bring the volume back to a consistent state. I think.
Usually the ZIL is on the same storage devices as the rest of the data. So a write to the ZIL has to wait for disk in the same line as everybody else. It might improve performance to give the ZIL its own, dedicated storage devices. NVMe is great, lower latency the better.
Since the ZFS Intent Log gets flushed to disk every five seconds or so, a dedicated ZIL device doesn't have to be very big. But it has to be reliable and durable.
Windows made small, durable NVMe cache drives a mainstream item for a while, when most laptops still used rotating hard drives. Optane NVMe at 16GB is cheap, like twenty bucks; buy three of them and use two as a mirrored pair for your ZIL.
----
Then there's the second-level read cache, the L2ARC (the ARC proper lives in RAM). I use 1TB mirrored NVMe devices.
Finally, there's a "special" vdev that can, for example, be designated for metadata-heavy things like the dedup table (which Fast Dedup is making smaller!).
- The ZIL is used exclusively for synchronous writes - critical for VMs, databases, NFS shares, and other applications requiring strict write consistency. Many conventional workloads won't benefit. Use `zilstat` to monitor.
- The cheap 16GB Optane devices are indeed great in terms of latency, but they were designed primarily for read caching and have significantly limited write speeds. If you need better throughput, look for the larger Optane models which don't have these limitations.
- SLOG doesn't need to be mirrored - the only risk is if your SLOG device fails at the exact moment your system crashes. While mirroring is reasonable for production systems, with these cheap 16GB Optanes you're just guaranteeing they'll wear out at the same time. You could kill one at a time instead. :)
- As for those 1TB NVMe devices for read cache (L2ARC) - that's probably overkill unless you have a very specific use case. L2ARC actually consumes RAM to track what's in the cache, and that RAM might be better used for ARC (the main memory cache). L2ARC only makes sense when you have well-understood workload patterns and your ARC is consistently under pressure - like in a busy database server or similar high-traffic scenario. Use `arcstat` to monitor your cache hit ratios before deciding if you need L2ARC.
I've been building a home media server lately and have thought about doing something like this. However, there's a big problem: these little 16GB Optane drives are NVMe. My main boot drive and where I keep the apps is also NVME (not mirrored, yet: for now I'm just regularly copying to the spinning disks for backup, but a mirror would be better). So ideally that's 4 NVMe drives, and that's with me "cheating" and making the boot drive a partition on the main NVMe drive instead of a separate drive as normally recommended.
So where are you supposed to plug all these things in? My pretty-typical motherboard has only 2 NVMe slots, one that connects directly to the CPU (PCIe 4.0) and one that connects through the chipset (slower PCIe 3.0). Is the normal method to use some kind of PCIe-to-NVMe adapter card and plug that into the PCIe x16 video slot?
I think you're pretty far into XY territory here. I'd recommend hanging out in r/homelab and r/zfs, read the FAQs, and then if you still have questions, maybe start out with a post explaining your high level goals and challenges.
My main question here was just what I asked about NVMe drives. Many times in my research that you recommended, people recommended using multiple NVMe drives. But even a mirror is going to be problematic: on a typical motherboard (I'm using a AMD B550 chipset), there's only 2 slots, and they're connected very differently, with one slot being much faster (PCIe4) than the other (PCIe3) and having very different latency, since the fast one connects to the CPU and the slow one goes through the chipset.
If NVMe is your only option, I'd try to find a couple used 1.92TB enterprise class drives on ebay, and go ahead and mirror those without worrying about the different performance characteristics (the pool will perform as fast as the slowest device, that's all) - but 1.92TB isn't much for a media server.
In general, I'd say consumer class SSDs aren't worth the time it'll take you to install them. I'd happily deploy almost any enterprise class SSD with 50% beat out of it over almost any brand new consumer class drive. The difference is stark - enterprise drives offer superior performance through PLP-improved latency and better sustained writes (thanks to higher quality NAND and over-provisioning), while also delivering much better longevity.
I do have 4 regular SATA spinning disks (enterprise-class), for bulk data storage, in a RAIDZ1 array. I know it's not as safe as RAIDZ2, but I thought it'd be safe enough with only 4 disks, and I want to keep power usage down if possible.
I'm using (right now) a single 512GB NVMe drive for both booting and app storage, since it's so much faster. The main data will on the spinners, but the apps themselves on the NVMe which should improve performance a lot. It's not mirrored obviously, so that's one big reason I'm asking about the NVMe slots; sticking a 2nd NVMe drive in this system would actually slow it down, since the 2nd slot is only PCIe3 and connected through the chipset, so I'm wondering if people do something different, like using some kind of adapter card for the x16 video slot. I just haven't seen any good recommendations online in this regard. For now, I'm just doing daily syncs to the raid array, so if the NVMe drive suddenly dies somehow, it won't be that hard to recover, though obviously not nearly as easy as with a mirror. This obviously isn't some kind of mission-critical system so I'm ok with this setup for now; some downtime is OK, but data loss is not.
Thanks for the advice!
Move your NVMe to the other slot, I bet you can't tell a difference without synthetic benchmarks.
cp --reflink=auto
You get file-level deduplication. The command above performs a lightweight copy (a ZFS clone at the file level), where the data blocks are copied only when modified. It's a copy, not a hard link. The same should work in other copy-on-write transactional filesystems as well, if they have reflink support.
Linked clones shouldn’t need that. They likely start out with only references to the original blocks, and then replace them when they change. If so, it’s a different concept (as it would mean that any new duplicate blocks are not shared), but for the use case of “spin up a hundred identical VMs that only change comparably little” it sounds more efficient performance-wise, with a negligible loss in space efficiency.
Am I certain of this? No, this is just what I quickly pieced together based on some assumptions (albeit reasonable ones). Happy to be told otherwise.
https://www.yellow-bricks.com/2018/05/01/instant-clone-vsphe...
In fact, it's a much more common feature than active deduplication.
VM drives are just files, and it's weird that you imply a filesystem wouldn't know about the semantics of a file getting copied and altered, and would only understand blocks.
> The downside is that every single write and free operation requires a lookup and then a write to the dedup table, regardless of whether or not the write or free proper was actually done by the pool.
Which, if universally true, is very much different from what a hypervisor could do instead, and I've detailed the potential differences. But if a hypervisor does use some sort of clone system call instead, that can indeed shift the same approach into the fs layer, and my genuine question is whether it does.
It sounds like the information you need is that cp has a flag to make cloning happen. I think it even became default behavior recently.
Also that the article quote is strictly talking about dedup. That downside does not generalize to the clone/reflink features. They use a much more lightweight method.
This is one of the relevant syscalls: https://man7.org/linux/man-pages/man2/copy_file_range.2.html
Huh? What do you mean? They absolutely are. I've made extensive use of them in ESXi/vsphere clusters in situations where I'm spinning up and down many temporary VMs.
https://docs.vmware.com/en/VMware-Fusion/13/com.vmware.fusio...
Not to be rude, but does this have any meaning?
https://blogs.gnome.org/wjjt/2021/11/24/on-flatpak-disk-usag...
If you're selling VMs to customers then there's probably no advantage in using deduplication.
Look, even Proxmox, which I totally expected to support encryption in the default installation (it has „Enterprise“ on the website), loses important features when you try to use it with encryption.
Also, please study the issue tracker; there are a few surprising things I would not have expected to exist in a production file system.
Still the same picture, encryption seems to be not a first class citizen in ZFS land.
I never changed a thing (because it also had some cons) and believe that as long as a ZFS scrub shows no errors, all is OK. Could I be missing a problem?
Is there an easy way to analyze your dataset to find if you're in this sweet spot?
If so, is anyone working on some kind of automated partial dedup system where only portions of the filesystem are dedupped based on analysis of how beneficial it would be?
I am NOT interested in finding duplicate files, but duplicate slices within all my files overall.
I can easily throw together code myself to find duplicate files.
EDIT: I guess I’m looking for a ZFS/BTRFS/other dedupe preview tool that would say “you might save this much if you used this dedupe process.”
Some of this is straight up VM storage volumes for ESX virtual disks, some direct LUNs for our file servers. Our gains are upwards of 70%.
It's not 100% clear to me why explicit deduping blocks would give you any significant benefit over a properly chosen compression algorithm.
Would someone know if the fast dedup works also for this? Anything else I could be using instead?