You turned on ZFS dedup because someone said it’s “free space.” Then your pool started behaving like it’s forwarding every I/O request to a committee meeting.
Latency spikes. Writes crawl. Reboots feel like roulette. And the box that used to run hot now runs busy.
The villain is rarely “dedup” in the abstract. It’s the dedup tables (DDT): where they live, how big they get, how often you miss cache, and how brutally
ZFS will go to disk to keep its promises.
DDT in plain terms: what it is
ZFS deduplication works by noticing that two blocks are identical and storing only one physical copy. Great in theory. In practice, to avoid losing your data,
ZFS must be absolutely sure it has seen the block before. That certainty comes from a database-like structure: the Deduplication Table, or DDT.
The DDT is essentially a big index keyed by a block’s checksum (and some extra metadata). When a write comes in, ZFS computes the checksum and asks the DDT:
“Do I already have this block somewhere?” If yes, it increments a reference count and doesn’t write another copy. If no, it writes the new block and adds an entry.
That means the DDT is on the hot path for every dedup-enabled write. Not “sometimes.” Every time. And because ZFS is ZFS, it will maintain that table correctly
even if it hurts. Especially if it hurts.
The first thing to internalize: dedup is not a feature you sprinkle on datasets. It's a commitment to doing a table lookup for every dedup-enabled block you write, and dealing
with the memory and I/O consequences forever, until you rewrite the data without dedup.
Why DDT hurts: the real cost model
1) The DDT wants RAM like a toddler wants snacks
DDT entries are metadata. Metadata wants to be in memory because reading it from disk repeatedly is slow and jittery. When the working set of DDT entries fits
in ARC (ZFS’s cache), dedup can be “fine-ish.” When it doesn’t, you get DDT cache misses, which means extra random reads, which means latency, which means
angry application owners.
ZFS will keep running with a DDT that doesn’t fit in RAM. It just runs like it’s wearing ankle weights while walking uphill in the rain.
2) DDT lookups turn sequential writes into random reads
Without dedup, a streaming workload can be mostly sequential writes. With dedup, you do:
compute checksum → lookup DDT → maybe read DDT from disk → then write (or not). That lookup pattern is frequently random, and it happens at write time.
Random reads on HDDs are a tax. Random reads on SSDs are a smaller tax, but the bill still arrives.
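If you want to watch this pattern live, recent OpenZFS builds can print latency and request-size histograms straight from zpool iostat; a minimal check, assuming your pool is called tank like the examples later on:
cr0x@server:~$ zpool iostat -w tank 5
cr0x@server:~$ zpool iostat -r tank 5
-w shows latency histograms and -r shows request sizes (exact layout varies by version). If the read buckets fill up while your workload is nominally write-only, the lookup path is going to disk.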
3) “It saved 30% space” is not a performance budget
Dedup’s success metric is space saved. Production’s success metric is stable latency under load. These are related, but not aligned.
You can have excellent dedup ratio and still have awful performance because the DDT doesn’t fit, or because the storage is too slow for the lookup rate.
4) Dedup changes failure modes
With dedup, multiple logical blocks point to one physical block. That’s safe—ZFS is copy-on-write and uses checksums—but it raises the stakes on metadata health
and on the system’s ability to access DDT entries. You can still recover from device failure; you just don’t want to do it with a starved ARC and a DDT that’s
thundering to disk on every transaction group.
5) The hidden “forever cost”
The painful part isn’t only enabling dedup. It’s living with it. Once data is written with dedup=on, turning dedup=off does not undeduplicate existing blocks.
You’ve committed that on-disk layout to a DDT-driven world until you rewrite the data (send/receive to a non-dedup target, or copy it out and back).
One dry operational truth: if you enable dedup casually, you’ll later schedule a “data migration project” that is just you paying off a loan you didn’t realize you took.
Exactly one quote for the road, because it applies here perfectly:
Hope is not a strategy.
— paraphrased idea often attributed to operations leaders
Facts & historical context (short, concrete)
- Dedup shipped to solve a real problem: early VM farms and backup sets produced massive duplicate blocks across images and full backups.
- ZFS dedup is block-level, not file-level: it doesn't care about filenames, only about on-disk blocks (recordsize-sized for filesystems, volblocksize-sized for zvols).
- The DDT is persisted on disk: it is not “just cache.” ARC holds working sets, but the canonical table lives in the pool.
- Dedup in ZFS is a per-dataset property, but the DDT is a per-pool reality: multiple datasets with dedup share the same pool-level DDT structures.
- Checksum choice matters: modern ZFS defaults to strong checksums; dedup relies on them, and collision paranoia is one reason it’s not “cheap.”
- Early guidance was rough: many admins learned “1–2GB RAM per TB deduped” as a rule-of-thumb, then discovered reality is workload-dependent.
- DDT accounting improved over time: later OpenZFS versions exposed better histograms and stats so you can estimate memory impact more accurately.
- Special vdevs changed the conversation: putting metadata (including DDT) on fast devices can help, but it’s not a free pass and has sharp edges.
- Compression stole dedup’s lunch in many shops: for modern data, lz4 often yields meaningful savings with far less performance volatility.
How dedup really works (and where DDT fits)
Dedup is a lookup problem disguised as a space-saving feature
At a high level, dedup flow looks like this:
- Incoming logical block is assembled (recordsize-sized for filesystems, volblocksize for zvols).
- Compression (if enabled) is applied, then the checksum is computed over the block as it will be stored; that checksum becomes the dedup key.
- DDT is queried: does this checksum exist, and does it match the block’s metadata?
- If found, refcount increases; logical block points to existing physical block.
- If not found, block is written, DDT gets a new entry.
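A quick way to see the knobs that feed this flow for a single dataset (dataset name borrowed from the tasks below; substitute your own):
cr0x@server:~$ zfs get checksum,compression,dedup,recordsize tank/vm
The dedup and checksum properties decide the DDT key, compression decides what actually gets checksummed and stored, and recordsize decides how many DDT entries a given amount of data turns into.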
What a DDT entry roughly represents
A DDT entry is not just “checksum → block pointer.” It also carries enough information to:
verify the match, locate the physical block, and track how many logical references point to it.
This is why the memory footprint isn’t trivial. The DDT is closer to an index in a database than a simple hash map.
Why misses are catastrophic compared to normal cache misses
Lots of caches miss and the system survives. The DDT miss is special because it’s on the write path and tends to be random I/O. A normal read cache miss might
delay a read. A DDT miss adds read latency before the write can even decide what to do.
On a busy system, that means queue buildup, higher txg sync times, and eventually the familiar symptom: everything “hangs” in bursts, then recovers, then hangs again.
That pattern is classic “metadata thrash” and DDT is one of the fastest routes there.
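One hedged way to spot it on a Linux OpenZFS host is the metadata hit/miss counters in arcstats (field names can differ slightly between versions):
cr0x@server:~$ grep -E "^demand_metadata_(hits|misses)" /proc/spl/kstat/zfs/arcstats
A miss counter that climbs fast during write-heavy windows is consistent with metadata thrash. It isn't proof on its own, so correlate it with disk latency.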
Dedup ratio isn’t your goal; DDT working set is
Dedup ratio (logical vs physical) tells you the benefit. The DDT working set tells you the operational cost. You want both. You rarely get both.
Joke #1: Dedup is like buying a bigger suitcase to avoid paying baggage fees, then realizing the suitcase itself weighs 20 kilos.
Fast diagnosis playbook
When a dedup-enabled pool is slow, you can spend days arguing about “ARC size,” “SLOG,” “sync,” “network,” or “those disks are old.”
Or you can check the DDT path first and decide quickly if you’re in lookup-hell.
First: confirm dedup is actually in play
- Check dedup property on datasets and zvols doing writes.
- Check if the pool has a non-trivial DDT (if dedup was enabled historically).
Second: check if DDT fits in cache
- Inspect DDT size, histogram, and whether DDT entries are frequently missing from ARC.
- Look at ARC pressure and eviction behavior.
Third: prove it’s random-read bound
- Compare iostat: are you seeing elevated read IOPS during heavy writes?
- Check latency: random reads spike when the DDT lookup path goes to disk.
- Correlate with txg sync times and CPU time in checksum/compress paths.
Fourth: decide the escape hatch
- If DDT doesn’t fit and you can’t add RAM or fast metadata devices, plan a migration off dedup.
- If DDT is manageable, tune record sizes, workload patterns, and consider special vdev / faster storage.
Practical tasks: commands, outputs, what it means, and what you decide
Below are hands-on tasks you can run on a typical Linux OpenZFS host. Commands are realistic; outputs are representative examples. Your exact fields and
numbers will vary by version and distribution.
Task 1: Find where dedup is enabled (and where you forgot it was enabled)
cr0x@server:~$ zfs get -r -o name,property,value,source dedup tank
NAME PROPERTY VALUE SOURCE
tank dedup off default
tank/vm dedup on local
tank/vm/win10 dedup on inherited from tank/vm
tank/home dedup off default
What it means: writes to tank/vm and children do DDT lookups. tank/home does not.
Decision: if dedup is “on” anywhere, inventory the workloads there. If it’s “on” for general-purpose datasets, assume misconfiguration until proven otherwise.
Task 2: Verify whether existing data is deduped (dedup property lies about history)
cr0x@server:~$ zdb -DD tank
DDT-sha256-zap-duplicate: 123456 entries, size 350 on disk, 180 in core
DDT-sha256-zap-unique: 987654 entries, size 2100 on disk, 1200 in core
DDT histogram (aggregated over all DDTs):
bucket allocated referenced
______ ______________________________ ______________________________
refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE
----- ------ ----- ----- ----- ------ ----- ----- -----
1 900K 110G 72G 72G 900K 110G 72G 72G
2 80K 10G 6G 6G 160K 20G 12G 12G
4 10K 2G 1G 1G 40K 8G 4G 4G
What it means: the pool already contains deduped blocks (DDT entries exist). Turning dedup off now won’t remove these.
Decision: if the DDT is significant and performance is poor, you’re looking at a rewrite/migration plan, not a property flip.
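A back-of-envelope estimate of the in-core footprint, multiplying the entry counts by the per-entry "in core" sizes from the representative output above (your zdb numbers will differ):
cr0x@server:~$ echo "(123456*180 + 987654*1200)/1024/1024" | bc
1151
That's roughly 1.2 GiB of ARC just for these DDT entries, before any growth. Redo the arithmetic with your own figures before you argue about RAM.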
Task 3: Quick dedup “is it worth it?” sanity check (ratio vs pain)
cr0x@server:~$ zpool get dedupratio tank
NAME PROPERTY VALUE SOURCE
tank dedupratio 1.14x -
What it means: 1.14x is modest savings. It might be real money on petabytes; it’s rarely worth operational chaos on modest pools.
Decision: if dedupratio is close to 1.0x and you’re paying latency, plan to get off dedup unless you have a very specific reason to keep it.
Task 4: Check ARC size and pressure (are you starving the DDT?)
cr0x@server:~$ cat /proc/spl/kstat/zfs/arcstats | egrep "^(size|c |c_min|c_max|memory_throttle_count|arc_no_grow|demand_data_hits|demand_data_misses) "
size 4 25769803776
c 4 34359738368
c_min 4 8589934592
c_max 4 68719476736
arc_no_grow 4 0
memory_throttle_count 4 12
demand_data_hits 4 91522344
demand_data_misses 4 1822331
What it means: ARC is ~24GiB, target ~32GiB, max ~64GiB; memory throttling occurred (pressure). DDT competes with everything else.
Decision: if the box is memory-constrained or has frequent throttling, dedup is a bad neighbor. Add RAM, reduce workload, or exit dedup.
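If your distribution ships the arc_summary helper (many OpenZFS packages include it), it presents the same kstats as a friendlier report, which helps when you're arguing with people who don't read /proc:
cr0x@server:~$ arc_summary | less
Look for ARC size versus target and the metadata-related sections.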
Task 5: Measure pool latency and read IOPS during writes (DDT miss smell test)
cr0x@server:~$ iostat -x 1 5
avg-cpu: %user %nice %system %iowait %steal %idle
6.20 0.00 5.10 9.80 0.00 78.90
Device r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 320.0 180.0 9200.0 6400.0 56.0 4.20 10.8 9.4 13.3 0.9 45.0
sda 90.0 20.0 1100.0 400.0 27.0 8.80 78.0 85.0 45.0 7.2 79.0
sdb 95.0 22.0 1200.0 420.0 28.0 9.10 81.0 88.0 47.0 7.4 82.0
What it means: lots of reads happening while you expect mostly writes, and read await is huge on HDDs. That’s consistent with DDT lookups missing ARC.
Decision: if you see random-read pressure tied to write bursts, treat DDT misses as prime suspect and validate with DDT stats/histograms.
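Per-vdev latency from zpool iostat helps separate slow HDD data vdevs from any fast metadata devices; a minimal check, assuming a reasonably recent OpenZFS:
cr0x@server:~$ zpool iostat -v -l tank 5
If read wait times on the HDD vdevs climb during write bursts, that's the lookup tax showing up device by device.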
Task 6: Confirm recordsize/volblocksize and why it matters for dedup
cr0x@server:~$ zfs get -o name,property,value recordsize tank/vm
NAME PROPERTY VALUE
tank/vm recordsize 128K
What it means: larger blocks mean fewer DDT entries for the same data volume, but can reduce dedup opportunities and affect random IO patterns.
Decision: for VM zvols, consider aligning volblocksize to workload and avoid tiny blocks unless you have a reason. But don’t “tune” your way out of an undersized ARC.
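Note that volblocksize is fixed when a zvol is created, so "tuning" it really means creating a new zvol and migrating data into it; a hedged sketch with hypothetical names:
cr0x@server:~$ sudo zfs create -s -V 200G -o volblocksize=16K tank/vm/newvol
The -s makes it sparse, and 16K is only an example; match the block size to the guest filesystem and workload, and remember this does nothing for an undersized ARC.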
Task 7: Check if compression could replace dedup for your workload
cr0x@server:~$ zfs get -o name,property,value compression,compressratio tank/vm
NAME PROPERTY VALUE
tank/vm compression lz4
tank/vm compressratio 1.48x
What it means: you’re already getting 1.48x from compression, which often outperforms “meh” dedup ratios in real life.
Decision: if compression is giving you meaningful savings, that’s your default space lever. Leave dedup for narrow, validated cases.
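If your OpenZFS build supports it (2.0 and later), zstd is worth testing as a middle ground between lz4 and heavier algorithms; a hedged example that, like dedup, only affects newly written blocks:
cr0x@server:~$ sudo zfs set compression=zstd tank/vm
Existing blocks keep their old compression until they're rewritten.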
Task 8: See if a special vdev exists (metadata acceleration) and whether it’s sized sanely
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
special ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
nvme0n1p2 ONLINE 0 0 0
nvme1n1p2 ONLINE 0 0 0
errors: No known data errors
What it means: there is a special vdev mirror. Metadata (and potentially small blocks) can land here, reducing DDT lookup latency when missing ARC.
Decision: if you run dedup at all, special vdev can help. But treat special vdev as tier-0: if you lose it without redundancy, you can lose the pool.
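For reference, adding a special vdev looks roughly like this (device names are hypothetical; mirror it, because losing it can mean losing the pool, and on pools with raidz top-level vdevs you generally can't remove it later):
cr0x@server:~$ sudo zpool add tank special mirror nvme2n1 nvme3n1
Treat this as a one-way door and size it for metadata growth, not just for today.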
Task 9: Check where metadata is going (special_small_blocks policy)
cr0x@server:~$ zfs get -o name,property,value special_small_blocks tank
NAME PROPERTY VALUE
tank special_small_blocks 0
What it means: only metadata is allocated to the special vdev, not small file blocks. The DDT is metadata, so it can benefit even with this set to 0.
Decision: don’t blindly set special_small_blocks without understanding capacity and failure domain. It can shift a lot of data onto the special vdev.
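If you do decide to push small file blocks onto the special vdev, it's a per-dataset property with a power-of-two threshold; a hedged example:
cr0x@server:~$ sudo zfs set special_small_blocks=16K tank/vm
Blocks of 16K and smaller will then land on the special vdev (capacity permitting), which can fill it much faster than metadata alone. Size it and monitor it before flipping this.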
Task 10: Check txg sync behavior (when the system “stalls”)
cr0x@server:~$ cat /proc/spl/kstat/zfs/txg | head
0 0 0x01 7 448 13056000 2112000
# txgs synced txg ...
What it means: txg stats vary by version; the point is to correlate “stalls” with sync cadence. Dedup-induced random reads can extend sync work and block progress.
Decision: if sync intervals blow out under write load, suspect metadata and DDT misses; then validate with iostat latency and ARC pressure.
Task 11: Prove the dataset is paying the dedup tax right now (observe write latency via fio)
cr0x@server:~$ fio --name=randwrite --filename=/tank/vm/testfile --direct=1 --ioengine=libaio --rw=randwrite --bs=16k --iodepth=16 --numjobs=1 --size=2g --runtime=30 --time_based --group_reporting
randwrite: (g=0): rw=randwrite, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=16
fio-3.33
...
write: IOPS=850, BW=13.3MiB/s (14.0MB/s)(400MiB/30001msec)
slat (usec): min=7, max=300, avg=18.5, stdev=6.2
clat (msec): min=2, max=180, avg=18.9, stdev=22.1
lat (msec): min=2, max=180, avg=18.9, stdev=22.1
What it means: high tail latency (clat max 180ms) is typical when metadata lookups go to slow disks, even if average looks “okay.”
Decision: if tail latency is ugly and correlates with dedup-heavy workloads, stop trying to “optimize the app.” Fix the storage path or remove dedup.
Task 12: Check for dedup property drift (someone enabled it “temporarily”)
cr0x@server:~$ zfs get -r -s local,inherited -o name,property,value,source dedup tank | head
NAME PROPERTY VALUE SOURCE
tank/vm dedup on local
tank/vm/old dedup on inherited from tank/vm
tank/vm/tmp dedup on inherited from tank/vm
What it means: dedup is spreading via inheritance, which is how “one dataset experiment” becomes “the whole pool is now paying a lookup tax.”
Decision: break inheritance where you don’t need it. Dedup should be opt-in, not ambient.
Task 13: Turn dedup off for new writes (without pretending it fixes existing data)
cr0x@server:~$ sudo zfs set dedup=off tank/vm
What it means: new blocks won’t be deduped, but old blocks remain deduped and still depend on the DDT for reads.
Decision: do this early as damage containment, then plan the real fix: rewrite data to a non-dedup target.
Task 14: Estimate “how bad could the rewrite be?” by checking used and logicalused
cr0x@server:~$ zfs get -o name,property,value used,logicalused,referenced,logicalreferenced tank/vm
NAME PROPERTY VALUE
tank/vm used 8.50T
tank/vm logicalused 11.2T
tank/vm referenced 8.50T
tank/vm logicalreferenced 11.2T
What it means: logical data is larger than physical. If you rewrite without dedup, you need space for the physical expansion (or a bigger destination).
Decision: plan capacity before migrating off dedup. Many “we’ll just copy it” plans die when physical usage jumps.
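A crude headroom check, combining the pool dedupratio from Task 3 with the referenced size above; treat it as a floor for planning, since savings are never distributed evenly:
cr0x@server:~$ echo "8.50*1.14" | bc
9.69
Budget at least that much physical space on the destination, plus margin for snapshots taken during the migration.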
Task 15: Migrate safely via send/receive (the boring tool that works)
cr0x@server:~$ sudo zfs snapshot -r tank/vm@move-001
cr0x@server:~$ sudo zfs send -R tank/vm@move-001 | sudo zfs receive -u -o dedup=off tank_nodedup/vm
What it means: you've created a consistent copy. Because send -R replicates dataset properties (including dedup=on), override it at receive time (-o dedup=off, or -x dedup, on reasonably recent OpenZFS) so the received blocks are written without dedup.
Decision: this is the clean exit ramp. Use -u to receive unmounted, validate, then cut over deliberately.
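To keep the copy current without a long write freeze, incremental sends are the usual follow-up; a hedged sketch reusing the names above:
cr0x@server:~$ sudo zfs snapshot -r tank/vm@move-002
cr0x@server:~$ sudo zfs send -R -i @move-001 tank/vm@move-002 | sudo zfs receive -u tank_nodedup/vm
Repeat with smaller and smaller deltas, take a final incremental during a short freeze, then cut over. Keep the same property overrides you used on the initial receive.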
Three corporate mini-stories from the trenches
1) Incident caused by a wrong assumption: “Dedup is per dataset, so it can’t hurt the pool”
A mid-size company had a ZFS-backed virtualization cluster. The storage team enabled dedup on a dataset used for templated VMs, expecting big savings.
The assumption was comforting: it’s only one dataset; worst case it only slows that dataset.
Two weeks later, the helpdesk started getting “random slowness” tickets. VM boots occasionally hung. Nightly backup windows extended, then missed SLAs.
The initial response was predictable: add more spindles to the RAIDZ, tweak the hypervisor caching, blame the network, then blame “noisy neighbors.”
The turning point was noticing read IOPS rising during heavy write periods. That’s not normal. It became obvious that DDT lookups were missing cache and
forcing random reads across the pool. While dedup was enabled per dataset, the DDT lived in the pool and the I/O path affected everyone sharing those vdevs.
The fix wasn’t heroic. They disabled dedup for new writes immediately, then staged a send/receive migration to a non-dedup pool over a weekend.
Performance stabilized before the migration finished, because the active working set moved first and DDT pressure dropped.
The postmortem takeaway was blunt: in shared storage, “per dataset” features still have pool-level consequences. If the feature’s hot path touches pool-wide metadata,
treat it as a pool-level change, not a dataset experiment.
2) Optimization that backfired: “Let’s turn dedup on for backups, it’s mostly duplicates”
Another shop had a backup target on ZFS and wanted to reduce capacity growth. Someone looked at backup data, saw repetition, and made a reasonable-sounding bet:
dedup should crush redundant fulls. The environment had plenty of CPU and “enough RAM,” and the disks were a mix of HDD and a small SSD for cache.
For a month, it looked like a win. Dedupratio climbed. Storage purchase got deferred. Everyone congratulated themselves and moved on to the next fire.
Then they expanded retention. The DDT grew. ARC didn’t. Rehydrations slowed. Restore tests became unpleasant.
The day it became an incident, a large restore ran while backups were still ingesting. Latency spiked across the pool. Monitoring showed the disks were busy,
but throughput was low. Classic random I/O saturation. Metadata reads surged because the active DDT working set no longer fit in memory.
The team thought they were optimizing capacity; they had actually turned every write into a read-amplified operation at the worst possible time.
The fix was counterintuitive: they stopped dedup, leaned harder on compression, and changed the backup pipeline to avoid writing identical data in the first place
(better incremental chains, different chunking, and more intelligent copy policies).
The lesson: dedup can look great early, then cliff-dive when the table crosses a cache threshold. The backfire isn’t gradual. It’s a step function.
3) Boring but correct practice that saved the day: “Measure DDT before enabling anything”
A financial services team had a request: store multiple versions of VM images with minimal growth. The business asked for dedup. The storage team said “maybe,”
which is the polite version of “we’ll measure before we regret.”
They carved out a small test pool mirroring the production vdev type and ran a representative ingest: a few hundred gigabytes of the actual VM image churn.
They collected zdb -DD histograms, watched ARC behavior, and measured latency under concurrent load. They repeated the test with compression-only and
with different recordsize/volblocksize settings. Nothing fancy—just disciplined.
The results were boring in the best way. Dedup saved less than expected because the data was already compressed in parts and because changes were dispersed.
Compression alone produced most of the savings with stable performance. DDT working set estimates suggested they’d need more RAM than the platform could offer
without a hardware refresh.
They declined dedup in production and documented the rationale. Six months later, a new executive asked the same question. The team reopened the report,
reran a smaller validation on current data, and again avoided a self-inflicted outage.
The takeaway: the most valuable optimization is the one you decide not to deploy after measuring it.
Joke #2: The DDT is a spreadsheet that never stops growing, except it lives on your critical path and doesn’t accept “hide column” as a performance fix.
Common mistakes: symptoms → root cause → fix
1) Writes are slow, but disks show lots of reads
Symptoms: write-heavy jobs trigger read IOPS, high r_await, and bursty latency.
Root cause: DDT lookups missing ARC; DDT entries fetched from disk randomly during writes.
Fix: increase RAM/ARC headroom, move metadata to faster devices (special vdev), or migrate off dedup. If you can’t keep DDT working set hot, dedup is a bad fit.
2) “We turned dedup off but it’s still slow”
Symptoms: dedup property is off, but pool still acts metadata-heavy and unpredictable.
Root cause: existing blocks remain deduped; reads still depend on DDT and its caching behavior.
Fix: rewrite data to a non-dedup destination (send/receive), then retire the deduped copy.
3) Pool is fine until it isn’t; then everything falls off a cliff
Symptoms: months of okay performance, then sudden spikes, timeouts, long txg sync, and erratic app behavior.
Root cause: DDT crosses ARC working-set threshold; miss rate spikes and random reads explode.
Fix: treat it as a capacity planning failure; DDT growth must be budgeted like any other capacity. Add RAM/fast metadata or exit dedup.
4) Reboot takes forever, import feels fragile
Symptoms: pool imports slowly, system appears hung after reboot under load.
Root cause: huge metadata sets (including DDT) interacting with limited cache; the system has to “warm up” a working set while servicing I/O.
Fix: reduce dedup footprint by migrating; ensure adequate RAM; avoid mixing heavy production load with cold-cache imports if you can control sequencing.
5) Special vdev added, but performance didn’t improve
Symptoms: you installed fast SSDs as special vdev, but dedup pain remains.
Root cause: DDT working set is still too large; or special vdev is saturated/too small; or you expected it to fix CPU-bound checksum/compress overhead.
Fix: confirm metadata allocation and device latency. Size special vdev properly, mirror it, and still budget RAM. Special vdev is an accelerator, not a miracle.
6) Dedup enabled on a dataset that already has compressed/encrypted blocks
Symptoms: negligible dedupratio improvement, but noticeable latency increase.
Root cause: dedup opportunities are limited when data is already high-entropy (compressed, encrypted, or unique by design).
Fix: keep dedup off. Use compression, smart backup strategies, or application-level dedup/chunking if needed.
7) Dedup enabled on general file shares “because it might help”
Symptoms: intermittent slowness across mixed workloads; unclear correlation.
Root cause: dedup tax applied to broad, unpredictable data patterns; DDT churn with poor locality.
Fix: restrict dedup to narrow datasets where duplication is proven and stable. Otherwise disable and migrate.
Checklists / step-by-step plan
Decision checklist: should you enable dedup at all?
- Measure duplication on real data: don’t guess. If you can’t test, default to “no.”
- Estimate DDT size and working set: use zdb -DD on a test pool or a sample, or simulate with zdb -S (see the sketch after this checklist); assume growth with retention.
- Budget RAM with headroom: you need ARC for normal reads too. If dedup would crowd everything else out, it's a non-starter.
- Validate latency under concurrency: test with realistic parallel writes and reads, not a single-threaded benchmark.
- Decide your exit plan up front: if it goes wrong, how do you rewrite data and where does the extra physical usage fit?
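If you have representative data on an existing pool, zdb can simulate dedup without enabling it. It walks the pool, so it's read-heavy and can take a long time, and the estimate is only as good as the data you point it at:
cr0x@server:~$ sudo zdb -S tank
The output is a DDT histogram plus an estimated dedup ratio, in the same shape as zdb -DD, which is exactly the evidence this checklist asks for.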
Operational checklist: you already have dedup and it hurts
- Stop the bleeding: set dedup=off for new writes on affected datasets.
- Quantify the blast radius: collect zpool get dedupratio, zdb -DD output, ARC stats, and iostat latency.
- Identify the active hot datasets: where are the writes and reads coming from during incidents?
- Pick your remedy: add RAM, accelerate metadata (special vdev), or migrate off dedup. Do not “tune around” a fundamentally oversized DDT.
- Plan migration capacity: compare logicalused to used. Ensure the destination can hold the expanded physical data.
- Execute with snapshots and send/receive: keep it boring, verifiable, and reversible.
- Cut over deliberately: receive unmounted, validate, then switch mountpoints or consumers.
- Post-cut cleanup: destroy old deduped datasets only when you’re confident; then verify DDT shrinks over time as blocks are freed.
Preventative checklist: keep dedup from spreading quietly
- Audit dedup property inheritance monthly on production pools.
- Require a change record for enabling dedup anywhere.
- Alert on dedupratio changes and unusual read IOPS during write windows (a parsable one-liner follows this list).
- Keep “dedup-free” landing zones for general workloads; make dedup datasets explicit and rare.
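For the alerting item above, the parsable output mode makes dedupratio trivial to scrape into whatever monitoring you already run; a minimal example using the pool from earlier:
cr0x@server:~$ zpool get -H -o value dedupratio tank
1.14x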
FAQ
Is ZFS dedup “bad”?
No. It’s just expensive. It’s excellent when duplication is high, stable, and you can keep the DDT working set hot (RAM and/or fast metadata devices).
It’s a trap when you enable it based on hope.
Why does dedup slow down writes so much?
Because it adds a lookup on the write path. If the DDT entry isn’t in ARC, ZFS does extra random reads to find it, before it can decide whether to write.
Can I disable dedup and get my performance back immediately?
You can stop new writes from being deduped by setting dedup=off. But existing deduped blocks remain and still rely on the DDT for reads.
Real recovery typically requires rewriting/migrating the data.
How do I know if my DDT fits in RAM?
You estimate via zdb -DD (histograms and "in core" sizing) and then validate behavior: if write-heavy periods cause random reads and latency spikes,
DDT lookups are likely missing the ARC. The "fit" is about the working set, not the total size.
Does compression reduce dedup effectiveness?
Often, yes, because compression changes the on-disk representation and high-entropy compressed data has fewer identical blocks.
Practically, if compression already yields good savings, dedup may not be worth it.
Is dedup good for VM images?
Sometimes. Golden images and many clones can dedup well, but VM workloads are also latency-sensitive and generate random writes.
If you can’t guarantee DDT locality and fast metadata, you may turn “space saved” into “performance debt.”
Can I put the DDT on SSD?
With a special vdev, you can bias metadata allocations to fast devices, which can include structures like DDT. This can reduce the pain of cache misses.
But it increases reliance on those devices; they must be redundant and sized properly.
Will adding more disks fix dedup performance?
It can help if the bottleneck is random read IOPS and you’re on HDDs, but it’s usually an inefficient fix compared to adding RAM or moving metadata to fast storage.
If your DDT doesn’t fit, more spindles may just make the thrash slightly less awful.
What’s the safest way to “undedup” a dataset?
Snapshot and zfs send | zfs receive to a destination dataset with dedup=off. Validate, then cut over.
That rewrites blocks and removes the dependency on the DDT for that data.
Is there a rule of thumb for RAM per TB with dedup?
Rules of thumb exist because people needed something to say in meetings. The real answer depends on block size distribution, duplication patterns,
and your active working set. If you can’t measure it, assume it will be bigger than you want.
Conclusion: next steps that don’t ruin weekends
ZFS dedup is not evil. It’s just honest. It makes you pay up front—in RAM, in metadata I/O, and in operational complexity—so you can save space later.
If you’re not prepared to pay continuously, the DDT will collect interest in the form of latency.
Practical next steps:
- Audit: run a recursive zfs get dedup and find where dedup is enabled and inherited.
- Quantify: capture zdb -DD output and zpool get dedupratio for the pool.
- Correlate: use iostat -x during slow windows; look for read IOPS and high read await during writes.
- Contain: set dedup=off for new writes where you don't absolutely need it.
- Choose: either fund dedup properly (RAM + fast metadata + validation), or migrate off it with send/receive and reclaim predictability.
If you remember only one thing: dedup success is not “how much space did we save?” It’s “can we still hit latency targets when the DDT is having a bad day?”