Dedup is the kind of feature that looks like free money until your pool starts moving like it’s underwater.
The most common failure mode isn’t “dedup doesn’t work.” It’s “dedup works, then the DDT doesn’t fit in RAM,
and every random write turns into a tiny scavenger hunt on disk.”
If you’re here because someone asked, “Can we just turn on dedup?” you’re already ahead. The correct answer is,
“Maybe, after we predict the DDT size and prove the box can carry it.” Let’s do that with numbers you can defend
in a change review.
DDT, what it is (and why it hurts)
ZFS dedup is block-level: when you write a block, ZFS hashes it and checks whether that content already exists.
If it does, ZFS stores a reference instead of another copy. The index that makes that possible is the
Deduplication Table (DDT). Every unique block has an entry. Every shared block adds references.
That table is not a cute sidecar. It’s a core dependency of your write path. When dedup is enabled, a write
becomes “hash + lookup + maybe allocate + update refcounts.” If the DDT lookup is fast (ARC hit), dedup can be
tolerable. If the lookup is slow (DDT miss in ARC → disk read), performance falls off a cliff.
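If that feels abstract, here's the shape of the write path as a toy Python sketch. A dict stands in for the DDT, which is nothing like the real OpenZFS structures, but the decision flow is the point:

import hashlib

DDT = {}   # hash -> refcount; a dict standing in for the Deduplication Table

def dedup_write(block: bytes) -> str:
    """Sketch of the extra work every dedup-enabled write performs."""
    key = hashlib.sha256(block).digest()   # 1) hash the block contents
    # 2) DDT lookup: fast if the relevant DDT entry is cached in ARC,
    #    a random disk read if it is not; this is where pools fall off the cliff
    if key in DDT:
        DDT[key] += 1                      # 3a) duplicate: bump the refcount, store only a reference
        return "referenced existing block"
    DDT[key] = 1                           # 3b) unique: allocate space and add a new DDT entry
    return "allocated new block"

print(dedup_write(b"A" * 131072))  # allocated new block
print(dedup_write(b"A" * 131072))  # referenced existing block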
Here’s the operational truth: dedup is mostly a memory sizing problem disguised as a storage feature.
It can also be an IOPS sizing problem, a latency problem, and an “explain to management why backups are late”
problem. But it starts with RAM.
One more thing people miss: enabling dedup is per-dataset, but the DDT is effectively per-pool. One overeager
dataset can poison the whole pool’s cache behavior by ballooning the DDT working set.
Interesting facts and context you can use in meetings
- Dedup predates ZFS in spirit: content-addressed storage and hash-based dedup concepts were used in backup appliances long before they were mainstream in filesystems.
- Early ZFS dedup earned a reputation: many first-wave deployments enabled it on general-purpose pools and discovered that “works” and “fast” are not synonyms.
- The DDT tracks unique blocks, not logical data size: two 1 TB datasets can produce wildly different DDT sizes based on recordsize, compression, and churn.
- Compression changes dedup math: ZFS hashes the stored block (after compression), so identical uncompressed data that compresses differently won’t dedup the way you expect.
- Block size is destiny: smaller recordsizes increase block count, which increases DDT entries, which increases RAM needs. This is why VM storage is a common dedup trap.
- DDT is metadata, but not all metadata is equal: if you add a special vdev for metadata, the DDT can land there and reduce random I/O on slow disks.
- Dedup is not free on reads: it can amplify fragmentation and increase indirection, especially after lots of snapshots and deletes.
- Dedup and encryption aren’t friends by default: if data is encrypted before ZFS sees it (application-level), identical plaintext will look different and dedup will be near-zero.
The rule-of-thumb that gets people fired (and the one that doesn’t)
The bad rule-of-thumb: “Dedup needs about 1–2 GB of RAM per TB.” You’ll hear it repeated with the confidence
of a person who hasn’t been paged by it.
Why it’s wrong: it assumes a certain recordsize, a certain duplication ratio, a certain churn pattern, and a
certain DDT entry size in memory. Change any of those and your “per TB” number becomes fiction.
The good rule-of-thumb: size RAM for DDT entries, not raw terabytes. Estimate block count,
estimate unique block count, multiply by a realistic per-entry memory cost, then apply a working-set factor.
If you can’t estimate block count, you’re not ready to enable dedup.
Joke #1: Dedup without DDT sizing is like buying a truck by cargo volume and ignoring the bridge weight limit. The bridge always wins.
A defensible sizing method: estimate DDT entries and RAM
Step 1: Estimate how many blocks you will dedup
DDT entries roughly scale with the number of unique blocks written into dedup-enabled datasets.
You can approximate block count as:
Block count ≈ referenced logical bytes / average block size
“Average block size” is not necessarily your recordsize or volblocksize, because real blocks are
smaller at file tails, metadata isn’t counted the same way, and compression changes stored size. But as a
planning number, start with your configured block size for the dataset type:
- VM zvols: volblocksize (often 8K–32K in performance setups)
- File datasets: recordsize (often 128K by default)
Step 2: Estimate the unique fraction (dedup ratio)
Dedup ratio tells you how much data is duplicated. But for DDT sizing you want the opposite:
the fraction that’s unique.
Example: if you expect 2.0x dedup, then about 50% of blocks are unique. If you expect 1.1x, then about 91%
are unique, which means almost every block needs its own DDT entry and you saved almost nothing.
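The arithmetic is one line, but it's worth writing down because "dedup ratio" and "unique fraction" get confused constantly. A quick Python sketch:

# Unique fraction is the inverse of the dedup ratio: that's what the DDT must index.
for expected_ratio in (2.0, 1.5, 1.1):
    unique_fraction = 1 / expected_ratio
    print(f"dedup {expected_ratio:.1f}x -> {unique_fraction:.0%} of blocks are unique")
# dedup 2.0x -> 50% of blocks are unique
# dedup 1.5x -> 67% of blocks are unique
# dedup 1.1x -> 91% of blocks are unique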
Step 3: Estimate DDT memory per entry
The DDT entry format and in-memory overhead vary by OpenZFS version, feature flags, and whether you’re
counting just the on-disk entry or the in-ARC structures (which include hash tables, pointers, and eviction
behavior). In practice, planners use a conservative range.
Operationally safe planning numbers:
- Minimum optimistic: ~200 bytes per unique block (rarely safe for production planning)
- Common planning: ~320 bytes per unique block
- Paranoid planning: 400–500 bytes per unique block (use for high-churn VM pools)
If you’ve been burned before, use paranoid planning. RAM is cheaper than your time.
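To make those planning numbers tangible, here's what each costs per 100 million unique blocks (a quick sketch using the figures above):

# RAM cost of the DDT per 100 million unique blocks at each planning number.
UNIQUE_BLOCKS = 100_000_000
for label, bytes_per_entry in (("optimistic", 200), ("common", 320), ("paranoid", 450)):
    gib = UNIQUE_BLOCKS * bytes_per_entry / 2**30
    print(f"{label}: {bytes_per_entry} B/entry -> ~{gib:.0f} GiB per 100M unique blocks")
# optimistic: 200 B/entry -> ~19 GiB per 100M unique blocks
# common: 320 B/entry -> ~30 GiB per 100M unique blocks
# paranoid: 450 B/entry -> ~42 GiB per 100M unique blocks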
Step 4: Apply a working set factor
The DDT does not need to be 100% resident to function, but performance depends on how often the write path
needs a DDT entry that isn’t in ARC. If your workload is mostly sequential and append-only, you can sometimes
get away with a partial DDT working set. If your workload is random writes across a wide address space (hello,
virtualization), you want a large fraction of DDT hot in ARC.
Practical guidance:
- VM images, databases, CI runners: target 80–100% of DDT in ARC working set
- Backup targets, mostly sequential ingest: 30–60% might be survivable if DDT is on fast media
- Mixed workloads: assume the worst unless you can isolate dedup to a separate pool
Step 5: Add headroom for ARC, metadata, and the rest of the OS
Even if the DDT fits, you still need ARC for other metadata (dnode cache, dbufs), and you need RAM for the kernel,
services, and spikes. Overcommit memory and ZFS will politely stop caching before it starts swapping. Then your
pool starts acting like it hates you personally.
A concrete example
Say you plan to dedup 100 TB of referenced data on a dataset with 128K recordsize. Approximate blocks:
100 TB / 128K ≈ 800 million blocks. If dedup ratio is 2.0x, unique blocks ≈ 400 million.
DDT RAM at 320 bytes/entry: 400,000,000 × 320 ≈ 128 GB. That’s just the DDT. If you need 80% resident working set,
you still want ~100 GB of ARC available to it, plus everything else.
If your recordsize is 16K instead, the block count is 8x higher. Same data, same dedup ratio, now the DDT planning
number looks like a down payment on a small house. This is why “per TB” sizing lies.
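Here's that example as a small Python model you can rerun with your own numbers. The 320 bytes per entry and the 80% working-set target are the planning assumptions from the steps above, not measured values:

def ddt_plan(logical_bytes, block_bytes, dedup_ratio, bytes_per_entry=320, working_set=0.8):
    """Rough DDT RAM plan: Steps 1-4 from above, nothing clever."""
    blocks = logical_bytes / block_bytes
    unique = blocks / dedup_ratio
    ddt_gib = unique * bytes_per_entry / 2**30
    resident_gib = ddt_gib * working_set   # the fraction you want hot in ARC
    return blocks, unique, ddt_gib, resident_gib

TIB = 1024**4
for block_kb in (128, 16):   # the example above, then the 16K variant
    blocks, unique, ddt, resident = ddt_plan(100 * TIB, block_kb * 1024, dedup_ratio=2.0)
    print(f"{block_kb}K blocks: ~{blocks/1e6:,.0f}M blocks, ~{unique/1e6:,.0f}M unique, "
          f"DDT~{ddt:,.0f} GiB, ~{resident:,.0f} GiB hot in ARC")
# 128K blocks: ~839M blocks, ~419M unique, DDT~125 GiB, ~100 GiB hot in ARC
# 16K blocks: ~6,711M blocks, ~3,355M unique, DDT~1,000 GiB, ~800 GiB hot in ARC

The small gap between the prose (~128 GB) and the model (~125 GiB) is just rounding: the text rounds to 400 million blocks and uses decimal gigabytes.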
Workload shapes: when dedup behaves and when it bites
Good candidates (sometimes)
- VDI clones with identical base images where most blocks truly match and stay stable.
- Backup repositories where you write large, repetitive blocks and rarely rewrite old data.
- Artifact stores where the same binaries land repeatedly and are mostly immutable.
Bad candidates (often)
- General VM storage with small blocks and constant churn: OS updates, swap, logs, databases.
- Databases that rewrite pages frequently. You’ll dedup less than you hope and pay lookup costs forever.
- Encrypted-at-source data (application encryption) that destroys redundancy.
Churn is the quiet killer
Dedup loves stable duplicates. Churn changes hashes. Even if your dedup ratio is decent at a point in time, high
churn forces frequent DDT updates, higher write amplification, and more random I/O. Snapshots can preserve old
blocks, making the DDT larger and the pool more fragmented. Your “savings” become “metadata debt.”
Special vdevs, metadata, and where the DDT actually lives
The DDT is stored on disk as part of pool metadata. When it isn’t in ARC, ZFS must fetch it from disk. If that disk
is a bunch of HDDs, your dedup lookups will line up to take turns at 10 ms a pop. That’s how you get multi-second
latencies from otherwise decent hardware.
A special vdev can hold metadata (and optionally small blocks), which can include DDT blocks. Putting the
DDT on fast SSD/NVMe can dramatically reduce the penalty when it falls out of ARC. It does not remove the need for
RAM, but it can turn “catastrophic” into “annoying and measurable.”
Opinionated guidance:
- If you are considering dedup on rust (HDD), budget for a special vdev or don’t do dedup.
- If you have a special vdev, mirror it. Losing it can mean losing the pool, depending on configuration and what moved there.
- Don’t treat special vdev as a magic “dedup accelerator.” It’s a latency reducer for misses, not a substitute for ARC hits.
Practical tasks: commands, outputs, and decisions (12+)
These are the things I actually run before anyone enables dedup. Each task includes: a command, what the output means,
and the decision it drives. Replace pool/dataset names to match your environment.
Task 1: Confirm dedup is currently off (or where it’s on)
cr0x@server:~$ zfs get -r -o name,property,value,source dedup tank
NAME PROPERTY VALUE SOURCE
tank dedup off default
tank/vm dedup off default
tank/backups dedup off default
Meaning: Dedup is disabled everywhere shown. If some datasets show on, you already have a DDT.
Decision: If any dataset is on, you must measure the existing DDT and its cache behavior before expanding usage.
Task 2: Check pool feature flags relevant to dedup
cr0x@server:~$ zpool get -H -o property,value feature@extensible_dataset tank
feature@extensible_dataset active
Meaning: Feature flags affect on-disk formats and sometimes behavior. This confirms the pool is using modern features.
Decision: If you’re on an ancient pool version or compatibility mode, reconsider dedup or plan a migration first.
Task 3: Capture recordsize / volblocksize to estimate block count
cr0x@server:~$ zfs get -o name,property,value recordsize tank/backups
NAME PROPERTY VALUE
tank/backups recordsize 1M
Meaning: 1M recordsize means fewer, larger blocks: DDT grows slower than on 128K/16K workloads.
Decision: Large recordsize workloads are more dedup-friendly from a DDT sizing perspective.
Task 4: For zvols, check volblocksize (this is where trouble starts)
cr0x@server:~$ zfs get -o name,property,value volblocksize tank/vm/zvol0
NAME PROPERTY VALUE
tank/vm/zvol0 volblocksize 8K
Meaning: 8K blocks mean huge block counts for large logical data. That inflates DDT entries fast.
Decision: If you plan to dedup 8K zvols, assume “paranoid” per-entry RAM and near-100% working set requirement.
Task 5: Measure referenced bytes (what dedup must index)
cr0x@server:~$ zfs list -o name,used,refer,logicalused,logicalrefer -p tank/vm
NAME USED REFER LOGICALUSED LOGICALREFER
tank/vm 54975581388800 54975581388800 87960930222080 87960930222080
Meaning: Logical referenced data is ~80 TB. DDT size correlates more with logical blocks than physical bytes.
Decision: Use logicalrefer and block size to estimate block count. If snapshots exist, also consider referenced data growth.
Task 6: Check compression (it changes stored blocks and dedup effectiveness)
cr0x@server:~$ zfs get -o name,property,value compression tank/vm
NAME PROPERTY VALUE
tank/vm compression lz4
Meaning: LZ4 is good. Compression can reduce physical IO, but dedup hashes compressed blocks.
Decision: If compression is off, consider turning it on before dedup. If compression is on and savings are already good, dedup’s marginal benefit may be small.
Task 7: Inspect ARC size and pressure (can it even hold a big DDT?)
cr0x@server:~$ arcstat.py 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
12:40:01 482 41 8 3 7% 8 20% 30 73% 96G 110G
12:40:02 501 45 9 4 9% 7 16% 34 76% 96G 110G
12:40:03 490 42 9 3 7% 6 14% 33 79% 96G 110G
Meaning: ARC is ~96G with target ~110G. Miss rate ~8–9% during sampling. This is decent, but it says nothing yet about DDT residency.
Decision: If ARC is already constrained, enabling dedup will fight for cache with everything else. Plan RAM upgrades or isolate dedup to a separate pool.
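If arcstat isn't available, you can read the same counters straight from the Linux OpenZFS kstats. A minimal sketch, assuming /proc/spl/kstat/zfs/arcstats exists (ZFS on Linux); it reports ARC size, target, max, and the lifetime hit rate:

# Minimal ARC headroom check via Linux OpenZFS kstats.
def arcstats(path="/proc/spl/kstat/zfs/arcstats"):
    stats = {}
    with open(path) as f:
        for line in f.readlines()[2:]:    # first two lines are kstat headers
            name, _type, value = line.split()
            stats[name] = int(value)
    return stats

s = arcstats()
gib = 2**30
hits, misses = s["hits"], s["misses"]
print(f"ARC size {s['size']/gib:.1f} GiB, target {s['c']/gib:.1f} GiB, max {s['c_max']/gib:.1f} GiB")
print(f"lifetime hit rate {hits / (hits + misses):.1%}")
# Compare the headroom (c_max minus size, minus what other workloads need) against your DDT estimate.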
Task 8: Check whether swap is in use (a red flag for dedup plans)
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 256Gi 142Gi 12Gi 3.0Gi 102Gi 104Gi
Swap: 16Gi 0.0Gi 16Gi
Meaning: No swap use. Good. If swap is used under normal load, you’re already memory-starved.
Decision: If swap is active, do not enable dedup until you fix memory pressure. Dedup + swapping is how systems become folklore.
Task 9: Estimate duplication potential without enabling dedup (sample with hashes)
cr0x@server:~$ find /tank/backups -type f -size +128M -print0 | xargs -0 -n1 shasum | awk '{print $1}' | sort | uniq -c | sort -nr | head
12 0c4a3a3f7c4d5f0b3e1d6f1c3b9b8a1a9c0b5c2d1e3f4a5b6c7d8e9f0a1b2c
8 3f2e1d0c9b8a7f6e5d4c3b2a19080706050403020100ffeeddccbbaa998877
5 aabbccddeeff00112233445566778899aabbccddeeff001122334455667788
Meaning: This crude sample shows duplicate large files (same hash appearing multiple times). It’s not block-level dedup, but it’s a smell test.
Decision: If you see almost no duplicates even at file level, block-level dedup probably won’t justify its cost.
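If you want something closer to block-level (still crude), hash fixed-size chunks of a sampled directory instead of whole files. A Python sketch, assuming 128K chunks to mimic recordsize; it ignores compression and real ZFS block boundaries, so treat the ratio as an optimistic upper bound:

import hashlib, os, sys
from collections import Counter

CHUNK = 128 * 1024        # mimic recordsize; use 8K/16K when modeling zvol-backed data
counts = Counter()

def sample(path):
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            counts[hashlib.sha256(chunk).digest()] += 1

for root, _dirs, files in os.walk(sys.argv[1]):   # e.g. a subset under /tank/backups
    for name in files:
        try:
            sample(os.path.join(root, name))
        except OSError:
            pass                                   # skip unreadable files

total, unique = sum(counts.values()), len(counts)
print(f"chunks={total:,} unique={unique:,} naive dedup ratio~{total / max(unique, 1):.2f}x")

Point it at a representative subset, not the whole pool; the goal is a smell test, not a census.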
Task 10: If dedup already exists, measure DDT size and hits
cr0x@server:~$ zpool status -D tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
dedup: DDT entries 182345678, size 320B on disk, 453B in core
DDT histogram (aggregated over all DDTs):
bucket allocated referenced
______ ______________________________ ______________________________
refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE
------ ------ ----- ----- ----- ------ ----- ----- -----
1 141M 17.6T 12.4T 12.5T 141M 17.6T 12.4T 12.5T
2 31.2M 3.9T 2.7T 2.7T 62.4M 7.8T 5.4T 5.5T
4 7.8M 1.0T 0.7T 0.7T 31.2M 4.0T 2.8T 2.8T
Meaning: the on-disk and in-core sizes are per entry, so multiply by the entry count: ~182 million entries × 453 B ≈ 77 GiB of RAM to keep the whole DDT cached. If that total is close to your available ARC headroom, you're living dangerously.
The histogram shows where dedup wins: higher refcount buckets mean shared blocks.
Decision: If the DDT's full in-core footprint competes with ARC and your latency is already tight, either add RAM or move the DDT to faster storage (special vdev).
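The arithmetic behind that call, as a sketch. The entry count and per-entry in-core bytes come from the status output above; the ARC budget is an assumption you have to supply (how much ARC you can honestly dedicate to the DDT):

# Total in-core DDT footprint from `zpool status -D` figures.
ddt_entries = 182_345_678
in_core_bytes_per_entry = 453
arc_budget_gib = 60            # assumed share of ARC you can give the DDT

ddt_total_gib = ddt_entries * in_core_bytes_per_entry / 2**30
print(f"DDT fully cached ~{ddt_total_gib:.1f} GiB vs ARC budget {arc_budget_gib} GiB")
if ddt_total_gib > arc_budget_gib:
    print("Expect DDT misses under pressure: add RAM, shrink scope, or move the DDT to fast media.")
# DDT fully cached ~76.9 GiB vs ARC budget 60 GiB
# Expect DDT misses under pressure: add RAM, shrink scope, or move the DDT to fast media.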
Task 11: Watch latency and queueing during workload (dedup will amplify this)
cr0x@server:~$ iostat -x 1 3
avg-cpu: %user %nice %system %iowait %steal %idle
6.12 0.00 3.44 8.70 0.00 81.74
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %util aqu-sz await
sda 12.0 95.0 512 8200 0.0 2.0 84.0 6.2 55.1
sdb 10.0 90.0 488 7900 0.0 1.0 79.0 5.7 50.3
Meaning: High await and growing aqu-sz indicate the disks are queueing. Dedup misses add more random reads on top of that.
Decision: If you’re already pushing disks hard, dedup will likely degrade latency unless you move DDT reads to fast media and keep it hot in ARC.
Task 12: Confirm special vdev presence and ashift (metadata speed plan)
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
nvme0n1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
Meaning: A mirrored special vdev exists. Good. That’s your “metadata fast lane.”
Decision: If you plan dedup, validate that DDT blocks are eligible for special vdev and that it has endurance and capacity headroom.
Task 13: Check special_small_blocks setting (don’t accidentally move data you didn’t intend)
cr0x@server:~$ zfs get -o name,property,value special_small_blocks tank
NAME PROPERTY VALUE
tank special_small_blocks 0
Meaning: Only metadata goes to special vdev; small data blocks do not. That’s safer for capacity planning.
Decision: Consider leaving it at 0 unless you have a clear goal and capacity model. Moving small blocks to special can be great or catastrophic depending on fill rates.
Task 14: Model DDT entry count from block size and logical data
cr0x@server:~$ python3 - <<'PY'
logical_tb = 80        # logical data you plan to dedup, in TiB (use logicalrefer)
block_kb = 8           # volblocksize (zvols) or recordsize (files)
dedup_ratio = 1.3      # expected dedup ratio; unique fraction is 1/ratio
logical_bytes = logical_tb * (1024**4)
block_bytes = block_kb * 1024
blocks = logical_bytes / block_bytes
unique_blocks = blocks / dedup_ratio
for bpe in (320, 450):  # common and paranoid bytes per DDT entry
    ddt_gb = unique_blocks * bpe / (1024**3)
    print(f"blocks={blocks:,.0f} unique={unique_blocks:,.0f} bytes_per_entry={bpe} => DDT~{ddt_gb:,.1f} GiB")
PY
blocks=10,737,418,240 unique=8,259,552,492 bytes_per_entry=320 => DDT~2,461.5 GiB
blocks=10,737,418,240 unique=8,259,552,492 bytes_per_entry=450 => DDT~3,461.5 GiB
Meaning: This is the “don’t do it” output. 80 TB at 8K blocks with only 1.3x dedup implies multi-terabyte DDT memory. That’s not a tuning problem.
Decision: Do not enable dedup on this workload. Change the design: larger blocks, different storage tech, or accept the capacity cost.
Task 15: Check snapshot count and churn potential
cr0x@server:~$ zfs list -t snapshot -o name,used,refer -S creation | head -n 5
NAME USED REFER
tank/vm@daily-2025-12-26 0 80T
tank/vm@daily-2025-12-25 0 79T
tank/vm@daily-2025-12-24 0 79T
tank/vm@daily-2025-12-23 0 78T
Meaning: Lots of snapshots can pin old blocks, keeping DDT entries alive and preventing space reclamation. Even “USED 0” snapshots can represent huge referenced trees.
Decision: If retention is long and churn is high, assume the DDT grows and stays large. Consider limiting dedup to datasets with controlled snapshot policies.
Fast diagnosis playbook: find the bottleneck in minutes
When dedup-enabled pools get slow, people guess. Don’t. Here’s the fastest path to the truth, in order.
First: Is the DDT fitting in ARC or thrashing?
- Run zpool status -D and look at the DDT entry count and per-entry "in core" size; the product is the RAM needed to cache the whole table. If that's huge relative to available ARC, you're at risk.
- Check ARC stats during workload (arcstat.py): if miss% spikes when writes ramp, that's your sign.
- Look for sustained high latency even at low throughput: classic symptom of metadata lookups blocking writes.
Second: Are misses hitting slow media?
- Run iostat -x and watch await/aqu-sz on vdevs.
- If you have a special vdev, check whether it's the one pegged. If HDDs are pegged, DDT is probably coming from rust.
- Low CPU + high iowait + high disk await is the dedup pain signature.
Third: Is the system memory-starved or swapping?
- free -h and vmstat 1: any swap activity during normal load is a red flare.
- If ARC is being capped or shrinking aggressively, dedup will get worse as the DDT evicts.
Fourth: Is the workload simply a bad dedup candidate?
- Check the DDT histogram: if most blocks are refcnt=1, dedup is mostly overhead.
- Check the pool's dedup ratio (zpool get dedupratio, if dedup is already enabled) and compare it to your expectations.
- If dedup is ~1.0x–1.2x, you're paying a lot to save very little.
If you’re stuck: capture a short performance snapshot under load and decide whether you’re memory-bound (DDT misses),
IOPS-bound (disk queues), or design-bound (low duplication). Don’t tune blind.
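If you want to force yourself to name one of those three, here's a decision sketch. The thresholds are illustrative starting points, not gospel; feed it numbers from arcstat, iostat, and zpool status -D:

def classify(ddt_in_core_gib, arc_headroom_gib, arc_miss_pct, disk_await_ms, dedup_ratio):
    """Rough triage: memory-bound, IOPS-bound, or design-bound. Thresholds are illustrative."""
    if dedup_ratio < 1.2:
        return "design-bound: duplication is too low to justify the overhead"
    if ddt_in_core_gib > arc_headroom_gib or arc_miss_pct > 15:
        return "memory-bound: the DDT working set does not fit in ARC"
    if disk_await_ms > 20:
        return "IOPS-bound: misses (and everything else) are queueing on slow media"
    return "no classic signature: measure more before touching anything"

print(classify(ddt_in_core_gib=77, arc_headroom_gib=40, arc_miss_pct=22,
               disk_await_ms=55, dedup_ratio=1.6))
# memory-bound: the DDT working set does not fit in ARC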
Common mistakes: symptom → root cause → fix
1) Writes become spiky and slow after enabling dedup
Symptom: Latency jumps, throughput collapses, CPU looks fine, disks show random reads.
Root cause: DDT working set doesn’t fit in ARC; dedup lookups miss and fetch DDT blocks from disk.
Fix: Add RAM (real fix), or move metadata/DDT to a mirrored special vdev (mitigation), or disable dedup and rewrite data without dedup (painful but honest).
2) Dedup ratio is disappointing (near 1.0x) but performance is still worse
Symptom: Pool slowed down; space savings barely moved.
Root cause: Workload has low true duplication (encrypted data, compressed differently, high churn), but dedup still forces lookups and refcount updates.
Fix: Stop using dedup for that dataset. Use compression, better recordsize choices, or application-aware dedup (backup software) instead.
3) Pool runs fine for weeks, then gets worse “for no reason”
Symptom: Dedup pool degrades over time; reboots temporarily help.
Root cause: DDT grows with new unique blocks, snapshots pin old blocks, ARC pressure increases until misses dominate.
Fix: Revisit retention and churn. Add RAM, shorten snapshot retention on dedup datasets, and ensure DDT is on fast media. Consider splitting workloads into separate pools.
4) Special vdev fills up and the pool panics operationally
Symptom: Special vdev hits high utilization; metadata allocations fail; performance tanks.
Root cause: Mis-sized special vdev, or special_small_blocks moved more data than expected, or dedup metadata growth exceeded plan.
Fix: Plan special vdev capacity with headroom. Mirror it. If it’s already too small, migrate data to a new pool; expanding special vdev is not always trivial depending on layout.
5) “We’ll just add RAM later” turns into downtime math
Symptom: Change approval assumes RAM can be added without impact, but maintenance windows are scarce.
Root cause: Dedup enabled without a hard capacity/performance budget; now you can’t roll back easily because data is already deduped.
Fix: Treat dedup as an architectural decision. If you must test, do it on a clone pool or a limited dataset with a rollback plan.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company consolidated “misc storage” onto a single ZFS pool. It hosted VM images, CI artifacts, and a few
backup directories. The storage team saw duplicate ISO files and repeated VM templates and decided dedup would be a
tidy win. The change request said, essentially, “dedup reduces space usage; performance impact expected to be minimal.”
The wrong assumption was subtle: they sized dedup by raw TB and assumed a modest RAM overhead. They did not model
block count. The VM side ran zvols with small blocks, and the write pattern was random. Within hours, latency alarms
started tripping across unrelated systems. Builds slowed. VM consoles froze intermittently. No one could correlate it
because nothing “big” changed—no new hardware, no new application release.
The on-call SRE pulled basic telemetry: CPU was bored, network was calm, disks were screaming. Random reads spiked on
the HDD vdevs, and average await climbed into the tens of milliseconds. The box wasn’t out of space. It was out of
fast metadata. DDT lookups were missing ARC and hitting rust.
The fix wasn’t clever. They disabled dedup for future writes and began migrating the worst offenders to a non-dedup
pool. That took time, because existing deduped blocks don’t magically “undedupe” without rewriting. The postmortem
action item was blunt: dedup changes require a DDT sizing estimate based on block count and churn, reviewed like a
capacity plan, not a feature toggle.
Mini-story 2: The optimization that backfired
Another organization tried to get fancy. They knew dedup lookups were expensive, so they added a special vdev on fast
SSDs and set special_small_blocks to move small blocks as well, reasoning that “small random IO should go to SSD.”
On paper, it sounded like performance engineering.
In reality, their workloads had a lot of small blocks. Not just metadata, but actual data: logs, package caches, small
artifacts, and VM filesystem noise. The special vdev began filling faster than anyone predicted. As it approached
uncomfortable utilization, allocation behavior got tense. Latency became erratic, and the operational risk increased:
the special vdev was now critical to far more of the dataset than intended.
The backfire wasn’t that SSDs are bad. It was that they treated the special vdev as a performance junk drawer. They
mixed goals (speed up metadata misses, speed up small block reads, accelerate dedup) without a single capacity model.
When you do that, you don’t have a design. You have a surprise subscription.
The recovery plan was boring but painful: migrate to a new pool with a correctly sized special vdev, keep
special_small_blocks conservative, and treat dedup as a scoped tool only for datasets with proven duplication.
The performance improved, but the real win was that the risk envelope became legible again.
Mini-story 3: The boring but correct practice that saved the day
A large internal platform team wanted dedup for a VDI-like environment where base images were identical and patch
cycles were predictable. They were tempted to enable dedup broadly, but one engineer insisted on a staged approach:
create a separate pool dedicated to dedup candidates, collect a week of workload traces, and size RAM using a
conservative bytes-per-entry number with headroom.
They started with a pilot: a subset of desktops and only the template-derived volumes. They watched zpool status -D
daily, tracked ARC hit ratios during peak logins, and measured latency at the hypervisor layer. They also set a hard
rollback criterion: if 95th percentile write latency exceeded a threshold for more than a few minutes, the pilot would
be halted and data would be redirected.
Nothing dramatic happened. That’s the point. The DDT grew as predicted, ARC stayed healthy, and the special vdev
handled the occasional miss without dragging HDDs into the fight. Dedup savings were real because the workload was
actually duplicate and fairly stable.
When leadership asked why the rollout took longer than flipping a property, the answer was simple: “We’re paying for
certainty.” The boring practice—pilot + measurement + conservative sizing—saved them from turning a capacity project
into an incident response project.
Checklists / step-by-step plan
Pre-flight checklist (before any change request)
- List candidate datasets and confirm dedup scope. Don’t enable pool-wide by accident.
- Record block sizes: recordsize for files, volblocksize for zvols.
- Measure logicalrefer for each dataset and snapshot retention patterns.
- Estimate dedup ratio realistically (sample, or use known workload behavior like VDI templates).
- Compute estimated unique blocks and DDT RAM (common and paranoid bytes-per-entry).
- Check current ARC headroom and system memory pressure.
- Decide where DDT misses will land: HDD vs special vdev vs all-flash.
- Write down rollback criteria: latency thresholds, miss rate thresholds, business impact triggers.
Pilot plan (the safe way to learn)
- Create a dedicated dataset (or a dedicated pool if you can) for dedup candidates.
- Enable dedup only on that dataset and migrate a representative subset of data.
- Monitor DDT growth daily via zpool status -D and track its in-core footprint vs ARC size.
- Measure application latency at the layer users feel (VM IO latency, DB commit time), not just ZFS stats.
- Stress a peak period intentionally (login storm, batch job window) and observe worst-case behavior.
- Expand scope only after you can predict DDT growth over time, not just day one.
Production rollout plan (what I’d sign off on)
- Ensure RAM capacity meets the conservative DDT working-set target plus ARC headroom.
- Ensure special vdev (if used) is mirrored and sized with large headroom for metadata growth.
- Enable dedup on a limited dataset set; do not “inherit” it into everything unless you enjoy surprises.
- Set snapshot retention policies explicitly for dedup datasets to control long-term DDT growth.
- Schedule post-change validation checkpoints (1 day, 1 week, 1 month) with the same metrics each time.
Joke #2: Dedup is the only feature where saving space can cost you time, money, and several new coworkers’ opinions.
FAQ
1) Can I predict DDT size precisely before enabling dedup?
Precisely, no. You can predict it well enough to make a safe decision by estimating block count, unique fraction,
and using conservative bytes-per-entry planning. The pilot approach gets you closer to “precise” because it measures
real behavior over time.
2) Is “DDT must fit in RAM” always true?
For good performance on random-write workloads, functionally yes: the hot working set must fit. If the workload is
mostly sequential ingest and DDT blocks are on fast media, partial residency can be survivable. But “survivable” is
not the same as “pleasant.”
3) What’s a reasonable bytes-per-DDT-entry number?
Use 320 bytes as a common planning baseline and 400–500 bytes for high-churn small-block workloads. If the answer
changes your decision, pick the higher number and sleep better.
4) Does compression help or hurt dedup?
It can help capacity and reduce physical IO, but it can reduce dedup matching if “similar” data compresses into
different byte sequences. Also remember: ZFS dedup hashes the stored block, which is typically compressed.
5) If I enable dedup and regret it, can I turn it off safely?
You can disable dedup for future writes by setting dedup=off on the dataset. Existing deduped blocks remain
deduped until rewritten. Rolling back to a truly non-deduped state usually means copying data to a non-dedup dataset
(or new pool) and cutting over.
6) Will a special vdev solve my dedup problems?
It helps by making DDT/metadata misses less awful. It does not replace RAM. If your DDT miss rate is high, you still
pay extra IO and extra latency; you just pay it on NVMe instead of HDD.
7) Should I enable dedup for VM storage?
Usually no, unless you have a very specific, stable duplication pattern (like VDI clones) and you’ve modeled block
size and churn. For general VM farms, compression + good templates + thin provisioning strategies are safer.
8) What metrics prove dedup is worth it?
Two categories: capacity and latency. Capacity: meaningful dedup ratio (not barely above 1.0x) and stable over time.
Latency: p95/p99 IO latency stays within SLO under peak write periods. If either fails, dedup is a bad trade.
9) Does dedup interact with snapshots?
Yes. Snapshots pin old blocks, which keeps DDT entries alive and can increase fragmentation. Long retention plus high
churn is a classic recipe for DDT growth and worsening cache pressure over time.
10) What’s the safest way to try dedup in production?
Don’t “try” it broadly. Pilot on an isolated dataset (or separate pool), with explicit rollback criteria, and measure
DDT growth and latency during real peak workloads.
Conclusion: next steps that won’t ruin your weekend
Dedup is not a checkbox. It’s a commitment to carrying a large, performance-critical index in memory and keeping it
hot enough that your write path doesn’t turn into random IO therapy.
Practical next steps:
- Inventory candidate datasets and record their block sizes (recordsize / volblocksize).
- Measure logicalrefer and snapshot retention to understand how many blocks you'll actually index.
- Model DDT RAM with conservative bytes-per-entry and a workload-appropriate working-set factor.
- If the math looks ugly, believe it. Change the design: avoid dedup, isolate it, or move to a workload where duplication is real.
- If the math looks reasonable, run a pilot with strict latency rollback criteria and watch DDT “in core” growth over time.
One line from the reliability world fits here: "Hope is not a strategy." Size the DDT like you size risk: with evidence, headroom, and a rollback plan.