ZFS dedup: The Checkbox That Eats RAM and Ruins Weekends

ZFS deduplication is the storage equivalent of a “Free Money” button. It looks harmless, it demos well, and it has absolutely ended more than one on-call weekend. The promise is seductive: store identical blocks once, reference them many times, reclaim piles of space, and look like a wizard during budget season.

The reality is more like this: you trade disk for memory, and if you underpay that memory bill, ZFS collects interest in the form of latency, thrash, and scary pool behavior. Dedup can be a valid tool—just not the default. This piece is about running it in production without lying to yourself.

What dedup actually does (and what it doesn’t)

ZFS deduplication is block-level. When you set dedup=on on a dataset, ZFS hashes each block as it’s written. If a block with that hash already exists in the pool, ZFS skips writing a second copy and instead adds another reference to the existing block.

This is not file-level dedup, not “find identical ISO files,” and not a background job you can pause when the database gets cranky. Dedup is inline: it happens on the write path. That means every write now includes extra work: compute the checksum, look it up, and possibly update dedup metadata. You don’t get to “dedup later” unless you rewrite the data later—more on that unpleasantness shortly.
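
For orientation, the knob itself is just a dataset property. A minimal sketch on a hypothetical lab dataset: the “,verify” suffix tells ZFS to confirm a byte-for-byte match before sharing a block, trading extra reads for protection against hash collisions.

cr0x@server:~$ # hypothetical dataset; the property only affects writes made after it is set
cr0x@server:~$ sudo zfs set dedup=sha256,verify tank/dedup-lab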

Also: dedup is not compression. Compression reduces blocks by encoding them more efficiently. Dedup avoids storing duplicate blocks at all. They can complement each other, but compression is usually the first lever you should pull because it’s cheap, predictable, and often improves I/O by reducing bytes on disk.
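
If compression is the first lever, pulling it is one property away. A quick sketch on the same hypothetical dataset; like dedup, it only affects blocks written after the change, and Task 9 later in this piece shows how to read the resulting ratio.

cr0x@server:~$ # lz4 is the usual low-risk default on modern OpenZFS
cr0x@server:~$ sudo zfs set compression=lz4 tank/dedup-lab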

One sentence that belongs on a sticker: ZFS dedup is a metadata problem disguised as a capacity feature.

Joke #1: Dedup is like buying a paper shredder to save space in your filing cabinet—technically correct, until you realize it runs on electricity and you only have one outlet.

Why dedup eats RAM: the DDT and the tax you can’t avoid

The beating heart of dedup is the DDT: the Deduplication Table. Conceptually, it’s a mapping from block checksum → physical block pointer(s) and reference counts. On every write to a dedup-enabled dataset, ZFS needs to determine whether an identical block already exists. That requires a lookup in the DDT.

If the DDT is hot in memory, dedup can be merely “some overhead.” If the DDT is not in memory, dedup becomes “random reads of metadata while you’re trying to write data,” which is how you turn a respectable storage system into an interpretive performance art piece.

ARC, L2ARC, and why “just add SSD” isn’t a plan

ZFS uses ARC (Adaptive Replacement Cache) in RAM for caching. Dedup wants the DDT in ARC because it’s accessed constantly. When ARC is big enough, life is fine. When ARC isn’t, ZFS must fetch DDT entries from disk. Those are small, scattered reads that are worst-case for spinning disks and still irritating on SSDs under load.

Some environments try to “fix” this with L2ARC (secondary cache on SSD). L2ARC can help, but it’s not magic. L2ARC is slower than RAM and has its own metadata overhead. If you’re already short on memory, adding L2ARC can make things worse by consuming more RAM for L2ARC headers while still not delivering RAM-like latency.
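
You can put a number on that header overhead. A sketch assuming Linux and the arcstats kstat: l2_size is what sits on the cache device, l2_hdr_size is the RAM spent just keeping track of it.

cr0x@server:~$ # both values are zero if the pool has no cache vdev
cr0x@server:~$ awk '/^(l2_size|l2_hdr_size) / {printf "%-12s %8.2f GiB\n", $1, $3/2^30}' /proc/spl/kstat/zfs/arcstats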

Special vdevs: better, but not free

On newer OpenZFS, a special vdev can store metadata (and optionally small blocks) on fast devices. If your DDT and other metadata live on a mirrored NVMe special vdev, dedup lookups are far less painful. This is the first approach that feels like an actual architecture rather than a coping mechanism. But it has a sharp edge: if you lose the special vdev, you can lose the pool. Treat it like you’d treat your most sacred vdev, because it is.
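
For the record, building that tier looks roughly like this. A sketch with hypothetical NVMe device names; a special vdev cannot be removed from a pool that contains raidz vdevs, so treat it as a one-way door.

cr0x@server:~$ # mirror it: losing the special vdev means losing the pool
cr0x@server:~$ sudo zpool add tank special mirror nvme0n1 nvme1n1
cr0x@server:~$ # optionally route small file blocks (not just metadata) to the fast tier, per dataset
cr0x@server:~$ sudo zfs set special_small_blocks=32K tank/vmstore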

Why the “RAM per TB” folklore exists

You’ll hear rules of thumb like “1–5 GB of RAM per TB of deduped data.” The real answer depends on recordsize, workload, how much of the data is actually duplicated, and how large the DDT gets. But the folklore exists because the failure mode is consistent: once the DDT doesn’t fit in memory, latency spikes and throughput falls off a cliff.

And there’s a second trap: you don’t need ARC to fit the pool, you need ARC to fit the DDT working set. If your working set of DDT entries is large (high churn, many unique blocks), you can have lots of RAM and still have a bad time.
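
If you want to see where the folklore comes from, the arithmetic is short. A back-of-the-envelope sketch assuming roughly 320 bytes of RAM per in-core DDT entry, a widely quoted ballpark rather than a guarantee; the zpool and zdb outputs later in this piece are the authoritative numbers for your pool.

cr0x@server:~$ # 20 TiB of unique data at 128K recordsize -> DDT entries -> GiB of RAM at ~320 bytes per entry
cr0x@server:~$ echo $(( (20 * 2**40 / (128 * 2**10)) * 320 / 2**30 ))
50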

Facts & historical context (the stuff people forget)

Dedup didn’t become “controversial” because it’s broken. It became controversial because it’s easy to enable and hard to operate safely. Some context points worth knowing:

  1. ZFS dedup has been around for a long time, and early implementations earned a reputation for punishing memory usage because systems of the day were smaller and SSD metadata tiers were rare.
  2. Dedup is per-dataset, not per-pool, but the DDT lives at the pool level. A single dedup-enabled dataset can impose overhead on the whole pool’s behavior.
  3. You can’t “turn off” dedup retroactively. Setting dedup=off only affects new writes. Existing blocks remain deduped until rewritten.
  4. Dedup interacts with snapshots: deduped blocks can be referenced across snapshots and clones, increasing reference counts and making space accounting more subtle.
  5. Dedup makes certain failure modes slower: resilver, scrub, and some repair operations can take longer because more metadata must be walked and referenced.
  6. Compression often delivers most of the benefit with fewer risks, especially on log files, JSON, VM images, and database pages that are not identical but are compressible.
  7. VM golden images are the poster child for dedup wins, but modern VM stacks often use cloning/linked clones or image layering that already avoids duplication.
  8. DDT growth can surprise you when recordsize is small or workloads generate many unique blocks (databases, encrypted data, already-compressed blobs).
  9. Encrypted datasets reduce dedup opportunity: encryption tends to randomize blocks, eliminating identical-block patterns unless dedup happens before encryption in the data path (often it doesn’t).

When dedup works (rare, real, and specific)

Dedup can be the right tool when all of the following are true:

  • Your data is genuinely block-identical across many files or images. Not “similar,” not “compressible,” but identical blocks at the recordsize you’re using.
  • Your working set is stable, or at least predictable. Dedup hates high churn where the DDT working set constantly changes.
  • You can pay the metadata bill: enough RAM and/or a fast metadata tier (special vdev) to keep DDT lookups from becoming disk seeks.
  • You’ve tested with production-like data. Synthetic tests lie; dedup is especially sensitive to how data is laid out in real life (a simulation sketch follows this list).
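
The good news: you can estimate most of this before enabling anything. A minimal sketch using zdb’s dedup simulation against a pool that holds a representative copy of the data; it builds a throwaway DDT by reading everything, so expect heavy I/O and run it off-peak or on staging.

cr0x@server:~$ # prints a DDT histogram plus a simulated dedup ratio without changing any on-disk state
cr0x@server:~$ sudo zdb -S tank

If the simulated ratio comes back under roughly 1.2x, you have your answer before spending a single gigabyte of RAM on it.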

The environments where I’ve seen dedup succeed without drama tend to look like:

1) Backup targets with repeated fulls (with caveats)

If you have many full backups that contain identical blocks (for example, a farm of similar machines) and you’re writing large sequential streams, dedup can deliver big space savings. But if your backup software already deduplicates at the application layer, ZFS dedup is redundant and sometimes harmful. Also, backup writes are often bursty—meaning the DDT needs to survive peak ingest, not average.

2) VDI / VM templates (only if you’re not already using clones)

Hundreds of desktops or VMs created from the same template can share a lot of identical blocks. But many platforms already use copy-on-write cloning that avoids duplication without ZFS dedup. If you’re on ZFS already, ZFS clones and snapshots may get you most of the win with less risk.
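
For comparison, the clone path costs one command per VM and no DDT. A sketch with hypothetical dataset names; a clone shares blocks with its origin snapshot and consumes new space only as it diverges.

cr0x@server:~$ sudo zfs snapshot tank/templates/desktop@gold
cr0x@server:~$ sudo zfs clone tank/templates/desktop@gold tank/vms/desktop-0042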

3) Read-mostly datasets with identical artifacts

Think of large fleets distributing identical container layers, package repositories, or build artifacts—assuming those artifacts are truly identical at the block level and not re-signed or re-timestamped in ways that change blocks. This is rarer than people assume.

When dedup fails (common, expensive, predictable)

Dedup fails in the ways SREs hate: slowly at first, then all at once, and usually at 2 a.m.

1) Databases and anything with high churn

Databases rewrite pages. Even if the logical content is similar, the blocks differ, and the write pattern churns metadata. If your DDT misses increase and the DDT can’t stay hot in ARC, you’ll see spiky latency and falling throughput. You also risk turning routine maintenance (scrubs, resilvers) into multi-day affairs.

2) Already-compressed or encrypted data

Compressed video, archives, many binary formats, and encrypted blobs are typically low-dedup. You pay the overhead but get little space back. This is the worst deal in storage: overhead without benefit.

3) Mixed workloads on a shared pool

One team enables dedup on one dataset. The pool-level DDT grows. Another team running latency-sensitive workloads on a different dataset starts filing tickets about “random pauses.” Nobody connects the dots until you graph ARC misses and watch the DDT fight for memory.

4) “We’ll just add RAM later”

This is not a reversible toggle. If you enable dedup and ingest weeks of data, then discover the DDT is too big, disabling dedup doesn’t remove the DDT entries. You’re now in the business of rewriting data or migrating, which is the storage equivalent of discovering your “temporary” duct tape is now structural.

Joke #2: ZFS dedup is the only feature I know that can turn “we saved 20% disk” into “we lost 80% performance” with the same enthusiasm.

Three corporate-world mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized enterprise ran a private virtualization cluster on ZFS. A new storage lead inherited an expensive expansion request and decided to “be smart” before buying more disks. They enabled dedup on the dataset holding VM disks, expecting template-heavy duplication.

The wrong assumption: “VMs are all similar, so dedup will be huge.” In reality, the environment had drifted. Templates were old, many VMs were patched for years, and the platform already used cloning for initial deployment. There was duplication, but nowhere near what the spreadsheet imagined.

The first symptoms were subtle: occasional latency spikes during morning login storms. Then the helpdesk started reporting “slow VMs” and “random freezes.” IOPS looked fine in averages, but p99 write latency was ugly. Scrubs started taking longer, too—more metadata walking, more cache pressure.

The on-call team treated it like a hypervisor issue until someone correlated the spikes with storage-side DDT cache misses. The DDT was far larger than expected, ARC was constantly evicting useful data to make room for dedup metadata, and the pool was doing small random reads just to decide how to write.

They disabled dedup on the dataset, expecting an instant fix. Performance improved slightly (fewer new dedup lookups), but the DDT remained massive because existing data was still deduped. The eventual fix was a controlled migration: replicate datasets to a new pool without dedup and cut over. The “disk savings” ended up costing weeks of planning and a lot of meetings where nobody was allowed to say “I told you so.”

Mini-story 2: The optimization that backfired

A SaaS company used ZFS as a backend for an internal artifact store. The workload was mostly writes during CI peaks and reads during deployments. Space growth was a concern, and someone noticed many artifacts had identical base layers.

They ran a quick test on a small subset, saw great dedup ratios, and rolled it out broadly. The optimization backfired because the test set was biased: it contained mostly identical build outputs from a single project that reused layers heavily. The broader environment contained many projects with slightly different dependencies and frequent rebuilds that changed blocks just enough to defeat dedup.

Within a month, the artifact system started missing SLAs during peak hours. Not because disks were full, but because metadata lookups were now in the critical path for every write. ARC was starved; the system spent its time fetching DDT entries. The team tried adding L2ARC to “cache the table,” which increased RAM pressure and did little for peak latency.

The eventual fix was boring: revert dedup, enable compression, and redesign retention policies. Compression gave predictable savings with less metadata pain. The retention redesign cut long-tail storage growth. Dedup was not the villain, exactly—it was the wrong tool for a workload with high variability and insufficient metadata tiering.

Mini-story 3: The boring but correct practice that saved the day

A financial services shop ran a ZFS-based backup platform. They were tempted by dedup because backups are “full of repeats.” The storage engineer—quiet, methodical, and allergic to surprises—insisted on a preflight checklist before enabling anything.

They took a representative sample: a week of real backups, not synthetic, and pushed it into a staging pool. They measured compression ratio, dedup ratio, DDT size, ARC behavior, and peak ingest performance. They also simulated a scrub and a resilver window while the system was under moderate load, because the worst failures happen when maintenance overlaps with traffic.

The data showed dedup would help, but only if metadata lived on fast devices. So they built the pool with a mirrored special vdev sized for metadata growth and treated it like critical infrastructure (monitoring, spares, strict change control). They also capped ingest concurrency to avoid bursty DDT thrash.

When a later hardware incident forced a resilver during a busy week, the system stayed stable. It wasn’t thrilling—nobody got a trophy—but it avoided the classic dedup story: “we saved space and then lost the weekend.” The correct practice wasn’t clever; it was disciplined testing and building for the metadata path, not just the data path.

Practical tasks: commands, outputs, and how to interpret them

Below are concrete tasks you can run on a typical OpenZFS system (Linux with zfsutils). Some commands differ on FreeBSD/Illumos, but the mental model holds. The goal is not to memorize flags; it’s to make dedup behavior measurable.

Task 1: Identify which datasets have dedup enabled

cr0x@server:~$ sudo zfs get -r -o name,property,value,source dedup tank
NAME                   PROPERTY  VALUE  SOURCE
tank                   dedup     off    default
tank/vmstore           dedup     on     local
tank/backups           dedup     off    default

Interpretation: Dedup is per-dataset. If you see even one critical dataset with dedup=on, assume DDT overhead exists at the pool level. Track who enabled it and why.

Task 2: Check the actual dedup ratio at pool level

cr0x@server:~$ sudo zpool list tank
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  54.5T  38.2T  16.3T        -         -    21%    70%  1.08x  ONLINE  -

Interpretation: DEDUP 1.08x means you’re saving ~8% logical space via dedup. That may not justify the overhead. If it’s under ~1.2x, you should be suspicious unless you have a very specific reason.

Task 3: Check DDT size and how “hot” it is

cr0x@server:~$ sudo zpool status -D tank
pool: tank
state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0

dedup: DDT entries 48239124, size 34.1G on disk, 19.6G in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
     1    36.2M   9.44T   8.81T   8.81T    36.2M   9.44T   8.81T   8.81T
     2     8.7M   2.31T   2.16T   2.16T    18.1M   4.62T   4.32T   4.32T
     4     1.2M   353G    331G    331G     5.0M   1.38T   1.30T   1.30T

Interpretation: The key line is “size X on disk, Y in core.” If “in core” is large relative to your RAM, or if it can’t stay in ARC, you’re headed for trouble. Note also how many blocks are only referenced once: if most are refcnt=1, you’re paying dedup overhead for blocks that didn’t dedup.

Task 4: Estimate dedup “waste” by looking at reference counts

cr0x@server:~$ sudo zdb -DD tank | head -n 25
DDT-sha256-zap-duplicate: 48239124 entries, size 34972804096 on disk, 21043748864 in core
DDT histogram (aggregated over all DDTs):

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
     1    36.2M   9.44T   8.81T   8.81T    36.2M   9.44T   8.81T   8.81T
     2     8.7M   2.31T   2.16T   2.16T    18.1M   4.62T   4.32T   4.32T

Interpretation: If the majority of blocks are refcnt=1, dedup is not pulling its weight. If you have a strong cluster in higher refcounts (8, 16, 32), that’s when dedup starts to look like a bargain.

Task 5: Check ARC size and memory pressure signals

cr0x@server:~$ grep -E '^(size|c|c_min|c_max|hits|misses|demand_data_misses|demand_metadata_misses) ' /proc/spl/kstat/zfs/arcstats
size                            4    34325131264
c                               4    36507222016
c_min                           4    4294967296
c_max                           4    68719476736
hits                            4    1183749821
misses                          4    284993112
demand_data_misses              4    84129911
demand_metadata_misses          4    162104772

Interpretation: High demand_metadata_misses under load is a dedup smell. Dedup is metadata-heavy; if metadata misses climb, the pool will do more small reads, and latency follows.
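
For a rolling view of the same counters, the arcstat utility shipped with OpenZFS prints them as a table. A sketch; what you are looking for is metadata misses trending upward while the workload is mostly writes.

cr0x@server:~$ # one line every 5 seconds; Ctrl-C to stop
cr0x@server:~$ arcstat 5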

Task 6: Observe real-time latency and IOPS per vdev

cr0x@server:~$ sudo zpool iostat -v tank 2
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        38.2T  16.3T    110    980  11.2M  84.5M
  raidz2-0                  38.2T  16.3T    110    980  11.2M  84.5M
    sda                         -      -     18    165  1.9M  14.2M
    sdb                         -      -     19    164  1.8M  14.0M
    sdc                         -      -     17    162  1.7M  14.1M
    sdd                         -      -     18    163  1.9M  14.1M
    sde                         -      -     19    163  1.9M  14.0M
    sdf                         -      -     19    163  2.0M  14.1M

Interpretation: This shows load distribution. If dedup is forcing random metadata reads, you may see read IOPS climb even when the application is mostly writing. Pair this with latency tools (below) to confirm.
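
Recent OpenZFS can report latency from the same tool instead of making you infer it. A sketch; if your version lacks these flags, per-device iostat or fio fills the same role.

cr0x@server:~$ # -l adds average wait columns (total, disk, sync queue, async queue); -w prints full latency histograms
cr0x@server:~$ sudo zpool iostat -l -v tank 2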

Task 7: On Linux, catch ZFS-induced stalls via pressure metrics

cr0x@server:~$ cat /proc/pressure/memory
some avg10=0.32 avg60=0.18 avg300=0.07 total=12873631
full avg10=0.09 avg60=0.05 avg300=0.02 total=2219481

Interpretation: Memory pressure correlates strongly with dedup pain. If ARC is fighting the kernel and applications for RAM, you’ll see “full” pressure rise during peaks.

Task 8: Check dataset recordsize and why it matters for DDT growth

cr0x@server:~$ sudo zfs get -o name,property,value recordsize tank/vmstore
NAME         PROPERTY    VALUE
tank/vmstore recordsize  128K

Interpretation: Smaller recordsize means more blocks for the same data, which means more DDT entries. Dedup overhead scales with number of blocks, not just total bytes.

Task 9: Measure compression ratio before reaching for dedup

cr0x@server:~$ sudo zfs get -o name,property,value,source compression,compressratio tank/vmstore
NAME         PROPERTY       VALUE     SOURCE
tank/vmstore compression    lz4       local
tank/vmstore compressratio  1.52x     -

Interpretation: If you’re already getting 1.5x compression, dedup’s incremental gain may be small—especially for the cost. If compression is off, fix that first unless you have a strong reason not to.

Task 10: Find whether a dataset is paying dedup overhead without benefit

cr0x@server:~$ sudo zfs get -o name,property,value written,logicalused,usedbyrefreservation,usedbydataset tank/vmstore
NAME         PROPERTY              VALUE
tank/vmstore written               19.3T
tank/vmstore logicalused           27.8T
tank/vmstore usedbyrefreservation  0B
tank/vmstore usedbydataset         31.1T

Interpretation: logicalused vs actual used can help you reason about space efficiency, but dedup complicates accounting due to shared blocks. Use pool-level dedup ratio and DDT histograms to decide whether dedup is “working,” not just dataset used values.

Task 11: Disable dedup for new writes (and set expectations correctly)

cr0x@server:~$ sudo zfs set dedup=off tank/vmstore
cr0x@server:~$ sudo zfs get -o name,property,value dedup tank/vmstore
NAME         PROPERTY  VALUE
tank/vmstore dedup     off

Interpretation: This stops dedup on new writes only. Existing deduped blocks remain. Performance may improve if your workload is write-heavy and the DDT lookup rate drops, but the pool still carries the DDT and its associated complexity.

Task 12: Remove deduped data the only way that truly works: rewrite or migrate

cr0x@server:~$ sudo zfs snapshot tank/vmstore@migration-start
cr0x@server:~$ sudo zfs send -R tank/vmstore@migration-start | sudo zfs receive -u -o dedup=off tank2/vmstore
cr0x@server:~$ sudo zfs mount tank2/vmstore

Interpretation: A replication stream carries full blocks (plain send streams are not deduplicated), and receiving with -o dedup=off overrides the replicated dedup property, so the destination writes ordinary blocks and never builds a DDT. The operationally safe approach is: migrate to a clean pool/dataset with dedup off and validate. If your goal is “get rid of DDT,” the data has to land on a pool that never had dedup enabled, or be rewritten in place (which is rarely fun).
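
If the source stays live during the initial copy, the usual pattern is a short write freeze, a final incremental catch-up, and then cutover. A sketch continuing the example above; -F rolls the target back to the last common snapshot in case it was touched (for example, by being mounted) after the first receive.

cr0x@server:~$ sudo zfs snapshot tank/vmstore@migration-final
cr0x@server:~$ sudo zfs send -R -I tank/vmstore@migration-start tank/vmstore@migration-final | sudo zfs receive -u -F tank2/vmstore

From there, cutover is a mountpoint or share change plus validation; keeping the old dataset readonly until you are confident makes rollback trivial.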

Task 13: Confirm whether you have a special vdev for metadata

cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
config:

        NAME                              STATE     READ WRITE CKSUM
        tank                              ONLINE       0     0     0
          raidz2-0                        ONLINE       0     0     0
            sda                           ONLINE       0     0     0
            sdb                           ONLINE       0     0     0
            sdc                           ONLINE       0     0     0
            sdd                           ONLINE       0     0     0
            sde                           ONLINE       0     0     0
            sdf                           ONLINE       0     0     0
        special
          mirror-1                        ONLINE       0     0     0
            nvme0n1p2                     ONLINE       0     0     0
            nvme1n1p2                     ONLINE       0     0     0

Interpretation: A mirrored special vdev is a major mitigator for dedup metadata latency. It does not eliminate RAM needs, but it can keep the system upright when ARC is pressured.

Task 14: Watch ZFS transaction group behavior (a proxy for “is storage stalling?”)

cr0x@server:~$ column -t /proc/spl/kstat/zfs/tank/txgs | tail -n 5

Interpretation: The per-pool txgs kstat keeps one line per recent transaction group; the otime, qtime, wtime, and stime columns record how many nanoseconds each txg spent open, quiescing, waiting, and syncing. If sync times balloon during dedup peaks, the pool is struggling to commit changes. Dedup adds metadata updates that can exacerbate sync pressure.

Fast diagnosis playbook: find the bottleneck before it finds you

This is the order I use when someone pings “ZFS is slow” and dedup might be involved. The point is to converge fast, not to be philosophically complete.

Step 1: Is dedup even enabled, and is it meaningful?

  • Check datasets: zfs get -r dedup POOL
  • Check pool ratio: zpool list
  • Check DDT size: zpool status -D POOL

Decision: If dedup ratio is low (< ~1.2x) and DDT is large, you likely have “cost without benefit.” Plan to exit dedup rather than tune around it.

Step 2: Is the DDT fitting in memory (or at least staying hot)?

  • ARC stats: cat /proc/spl/kstat/zfs/arcstats
  • Look at metadata misses and overall misses rising with load.
  • Check OS memory pressure: /proc/pressure/memory

Decision: If memory pressure is high and metadata misses rise, expect dedup lookups to hit disk. That’s your latency.
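
One way to compress Step 2 into a single number during triage. A sketch against the Linux arcstats kstat; the counters are cumulative since module load, so what matters is how the ratio moves between samples taken inside the bad window.

cr0x@server:~$ awk '/^(hits|misses) / {v[$1]=$3} END {printf "lifetime ARC miss rate: %.2f%%\n", 100*v["misses"]/(v["hits"]+v["misses"])}' /proc/spl/kstat/zfs/arcstats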

Step 3: Is metadata I/O landing on slow devices?

  • Confirm special vdev: zpool status
  • Observe read IOPS during write-heavy workload: zpool iostat -v 2

Decision: If you don’t have a special vdev and you’re on HDDs, dedup-at-scale is rarely a happy ending.

Step 4: Are you actually CPU-bound on hashing?

  • Check system CPU: top / mpstat (platform dependent)
  • Correlate high CPU with write throughput and stable latency.

Decision: Hashing overhead exists, but most real-world dedup disasters are memory/metadata I/O bound, not CPU bound. Don’t chase the wrong dragon.

Step 5: Rule out “not dedup” issues quickly

  • Pool health: zpool status -x
  • Scrub/resilver activity: zpool status
  • Pool fullness and fragmentation: zpool list (watch CAP and FRAG).

Decision: A pool at high capacity with high fragmentation will perform poorly regardless of dedup. Dedup can amplify the pain, but it’s not always the root cause.

Common mistakes (symptoms and fixes)

Mistake 1: Enabling dedup because “we have a lot of similar data”

Symptom: Dedup ratio stays near 1.00x–1.15x while latency worsens.

Fix: Turn on compression (if not already), measure real dedup opportunity on a representative sample (zdb -S, as sketched earlier), and consider application-layer dedup or cloning strategies instead. If dedup is already enabled and hurting, plan a migration/rewrite strategy; don’t expect dedup=off to undo history.

Mistake 2: Underestimating DDT growth due to small recordsize

Symptom: DDT entries explode; ARC metadata misses climb; random read IOPS appear during heavy writes.

Fix: Re-evaluate recordsize for the dataset. For VM images and general-purpose storage, 128K is common; smaller sizes may be necessary for certain workloads but come with metadata costs. If you must use small blocks, dedup becomes more expensive.

Mistake 3: Relying on L2ARC to “solve” dedup

Symptom: You add SSD cache and still see stalls; RAM usage increases; the system feels worse under peak load.

Fix: Treat L2ARC as a supplement, not a substitute for RAM. If dedup is required, prioritize enough RAM and consider a special vdev for metadata. Validate with workload-based testing.

Mistake 4: Enabling dedup on a mixed-use pool without isolation

Symptom: One workload changes, and unrelated services see latency spikes. People start blaming the network.

Fix: If dedup is necessary, isolate it: separate pool, or at least separate hardware resources and a metadata tier sized for it. A shared pool is a shared blast radius.

Mistake 5: Forgetting that dedup increases operational complexity

Symptom: Scrubs/resilvers take longer than planned; maintenance windows slip; recovery drills feel slow.

Fix: Measure scrub/resilver behavior in staging with dedup enabled. Monitor and budget time. If RTO/RPO matters, weigh this cost seriously.

Mistake 6: Expecting an easy rollback

Symptom: You set dedup=off and nothing gets better enough; DDT stays large.

Fix: Accept the physics: deduped blocks remain deduped until rewritten. The clean exit is migration to a non-dedup pool/dataset, or a controlled rewrite of data (often via send/receive, rsync-style copy, or application-level rehydration).

Checklists / step-by-step plan

Checklist A: “Should we enable dedup?” (preflight)

  1. Define the goal: Is this about delaying a disk purchase, reducing backup footprint, or enabling a specific product feature?
  2. Measure compression first: Enable compression=lz4 in a test dataset and measure compressratio.
  3. Sample real data: Use a representative subset (not a cherry-picked directory) and test dedup in staging.
  4. Measure dedup ratio and DDT size: Collect zpool status -D and zdb -DD.
  5. Model worst-case ingest: Run the peak write workload, not the average day.
  6. Test a scrub under load: Confirm you can still meet latency and throughput constraints.
  7. Decide on metadata tiering: If dedup is truly needed, plan for a mirrored special vdev and sufficient RAM.
  8. Write the rollback plan: Include how you will migrate off dedup if it disappoints, and how long it will take.

Checklist B: If dedup is already enabled and you’re suffering

  1. Stop the bleeding: Set dedup=off on the affected datasets to prevent further DDT growth from new writes (unless dedup is essential for that data).
  2. Quantify DDT impact: Use zpool status -D and ARC stats to confirm metadata misses and memory pressure.
  3. Reduce concurrency temporarily: If the workload allows it (backup ingest, batch jobs), limit parallel writers to reduce lookup pressure.
  4. Add RAM if feasible: It’s the fastest “stabilize now” lever, though not always possible immediately.
  5. Consider adding a special vdev: If supported and you can do it safely, move metadata to fast mirrored devices. This can transform random metadata reads from “death” to “annoying.”
  6. Plan the real fix: Migrate datasets to a new pool with dedup off, or rewrite the data in a controlled way.
  7. Validate with load testing: Don’t declare victory after the first quiet hour.

Checklist C: “Dedup, but safely” (operational guardrails)

  1. Monitor DDT size and “in core” footprint regularly (a small trending sketch follows this checklist).
  2. Alert on rising metadata misses and sustained memory pressure.
  3. Keep pool capacity sane; avoid running near full.
  4. Schedule scrubs with awareness of peak workload windows.
  5. Document which datasets use dedup and why, with an owner.
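
A minimal sketch of guardrail 1, with hypothetical script and log paths; run it from cron and graph the log, or feed the same two commands into whatever metrics agent you already have.

cr0x@server:~$ cat /usr/local/sbin/zfs-dedup-trend.sh
#!/bin/sh
# Append a timestamp, the pool dedup ratio, and the DDT entry line to a trend log.
POOL="tank"
LOG="/var/log/zfs-dedup-trend.log"
{
  date -u +%Y-%m-%dT%H:%M:%SZ
  zpool get -H -o value dedupratio "$POOL"
  zpool status -D "$POOL" | grep 'DDT entries'
} | paste -sd' ' >> "$LOG"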

FAQ

1) Is ZFS dedup “bad”?

No. It’s specialized. It’s bad as a default because the downside is steep and the rollback is hard. In the right workload with the right hardware (RAM and fast metadata), it can be excellent.

2) How much RAM do I need for dedup?

Enough to keep the DDT working set hot. Rules of thumb exist, but the only safe answer is: measure your DDT “in core” size and observe metadata miss behavior under peak load. If the system is thrashing, you don’t have enough.

3) Can I disable dedup and get my RAM back?

Disabling dedup stops new dedup operations. It does not remove the existing DDT for already-deduped blocks. You reduce future growth, but you don’t erase the past without rewriting/migrating data.

4) Does dedup help with VM storage?

Sometimes. If you have many near-identical VM images and you’re not already using cloning/snapshots effectively, dedup can save space. But VM workloads also generate churn, and the “similarity” often fades over time. Test with real VM disks and a realistic time window.

5) Is compression a safer alternative?

Usually, yes. compression=lz4 is widely used because it often improves effective throughput (fewer bytes written/read) with minimal CPU cost on modern systems. It’s also easier to reason about and doesn’t require massive metadata tables.

6) Does encryption ruin dedup?

Often. If the data is encrypted before ZFS sees it, blocks look random and won’t dedup. If ZFS encrypts the dataset itself, dedup behavior depends on implementation details and when dedup happens relative to encryption. In many practical setups, encryption significantly reduces dedup gains.

7) Can a special vdev “fix” dedup?

It can drastically reduce the pain by keeping metadata on fast devices, which helps when DDT lookups miss ARC. It doesn’t eliminate RAM needs or operational complexity. It also becomes critical infrastructure: lose it and you may lose the pool.

8) Why is my pool slow even though the dedup ratio is high?

A high ratio means blocks are shared, but it doesn’t guarantee the DDT fits in memory or that the metadata path is fast. You can have strong dedup savings and still be bottlenecked on DDT lookups, txg sync pressure, or slow metadata devices. Measure ARC metadata misses and DDT “in core” versus RAM.

9) What’s the safest way to get rid of dedup once enabled?

Migration to a new pool (or at least a new dataset on a pool that never used dedup) with dedup disabled, followed by validation and cutover. In-place “cleanup” is usually a rewrite exercise, which is operationally harder to control.

10) Should I enable dedup on a shared storage pool used by many teams?

Only if you want dedup’s blast radius to include everyone. If dedup is required, isolate it: dedicated pool, dedicated metadata tier, clear ownership, and a rollback plan.

Conclusion

ZFS dedup is not a “nice to have.” It’s an architectural choice with a permanent signature on your pool’s metadata and your operational life. If you treat it like a checkbox, it will treat your RAM like a consumable and your latency budget like a suggestion.

When dedup works, it works because the data is truly duplicate and the system is built to serve metadata fast—RAM first, then a serious metadata tier. When dedup fails, it fails because someone assumed similarity equals duplication, or assumed they could roll it back later, or tried to solve a RAM problem with an SSD bandage. Your best move is disciplined: test with real data, size the metadata path, and have an exit plan before you need one.
