You wake up to a full root filesystem. Docker is “only using 40GB” according to docker system df,
yet df -h is screaming. CI builds are slow. Pruning doesn’t help. Someone suggests “just add disk”,
and you can feel your future on-call self filing a complaint.
On ZFS, the difference between a host that quietly runs containers for years and a host that explodes into
a thousand tiny layers, snapshots, and clones is mostly one thing: dataset layout. Get it right and disk
usage becomes legible. Get it wrong and you’ll be doing archaeology with zdb at 2 a.m.
What “layer explosion” looks like on ZFS
Docker images are stacks of layers. With the ZFS storage driver, each layer can map to a ZFS dataset
(or a clone of a snapshot) depending on Docker’s implementation and version. That’s not automatically bad.
ZFS clones are cheap at creation time. The bill arrives later, with interest:
- Clone storms: every container start produces a writable layer clone, and your pool starts looking like a genealogical chart.
- Snapshot sprawl: layers come with snapshots; snapshots keep blocks alive; “deleted” data still counts because it’s referenced.
- Metadata churn: lots of small datasets means lots of dataset metadata, mount operations, and property inheritance surprises.
- Space accounting lies (to humans): df sees a mount, Docker sees logical layers, ZFS sees referenced bytes, and your brain sees none of it clearly.
Layer explosion is not just “too many images”. It’s too many filesystem objects whose lifetime does not match
your operational lifecycle. The fix is not “prune more”; the fix is aligning ZFS dataset boundaries with what
you actually manage: engine state, image cache, writable container layers, and persistent application data.
Interesting facts and historical context
- ZFS clones were built for instant provisioning (think developer environments and VM templates). Docker’s layer model happens to match that shape—sometimes too well.
- Docker originally pushed AUFS hard because union filesystems were the simplest mental model for layers. ZFS came later as a driver with different semantics and sharper edges.
- OverlayFS won the default slot on most Linux distros largely because it’s “good enough” and lives in-kernel without ZFS’s separate module story.
- ZFS tracks “referenced” vs “logical” space. This is why “I deleted it” is not the same as “the pool got space back” when snapshots and clones are involved.
- Recordsize defaults (128K) date back to throughput-oriented workloads. Databases, small-file workloads, and container layers sometimes want different values, and “one size fits all” is a myth.
- LZ4 compression became the no-brainer default in many ZFS deployments because it’s fast enough to be boring and often reduces writes significantly—especially with image layers full of text and binaries.
- Deduplication has been a recurring cautionary tale in the ZFS world: attractive on slides, unforgiving in RAM and operational complexity. Container images tempt people into trying it.
- “ZFS on Linux” matured into OpenZFS as a cross-platform project. That maturity is why many people now run ZFS for production container hosts with confidence.
Design goals: what a sane layout buys you
If you’re running Docker on ZFS in production, you want the host to behave like an appliance:
predictable upgrades, boring rollbacks, easy capacity planning, and failures that are loud early instead of
subtle late.
1) Separate lifecycles
Docker engine state, image caches, writable layers, and persistent volumes do not share a lifecycle.
Treating them as one directory tree under a single dataset is the fastest route to “we can’t clean this up
without risking production.”
2) Make space accounting legible
ZFS can tell you exactly where the bytes are—if you give it boundaries that map to your mental model.
Datasets give you used, usedbydataset, usedbysnapshots, usedbychildren, quotas, and reservations.
A monolithic dataset gives you one big number and a headache.
3) Stop “garbage survives”
Docker churn creates garbage that lingers due to snapshots and clones. The layout should make it possible to
destroy entire subtrees safely (and quickly) when you decide the cache or layers are disposable.
4) Keep performance tunable
Container writable layers behave like random small writes. Image pulls look like sequential writes. Databases
in volumes can have their own needs. You need dataset-level properties so you can tune without turning the whole pool into a science fair.
Paraphrased idea from Werner Vogels: “Everything fails, all the time—design so failures are contained and recoverable.”
This is exactly what dataset boundaries do for storage failures: they contain blast radius.
The recommended dataset layout (do this)
The pattern is simple: one pool, a few top-level datasets with clear ownership, and exactly one place where Docker is allowed to do its weird clone/snapshot thing.
Pool and top-level datasets
- tank/ROOT/<os> — your OS root dataset(s), managed by your OS tooling
- tank/var — general /var, not Docker
- tank/var/lib/docker — Docker’s internal world (images, layers, metadata)
- tank/var/lib/docker/volumes — optionally separate, but usually I prefer volumes outside Docker’s tree (see next)
- tank/containers — persistent app data (bind mounts, compose volumes via host paths)
- tank/containers/<app> — per-application datasets with quotas/refquotas
- tank/backup — replication targets (receive-only), not live workloads
Opinionated rule: keep persistent data out of Docker’s driver dataset
Docker’s ZFS driver is optimized for image and layer behavior, not “your database that must never be deleted.”
Put persistent data in dedicated datasets, mounted somewhere stable like /containers, and bind-mount that into containers.
This gives you clean replication and clean retention policies.
Properties that usually win
These are defaults, not religion. But they’re the boring kind of defaults that survive contact with CI,
log storms, and sleepy admins.
- Docker dataset: compression=lz4, atime=off, xattr=sa, acltype=posixacl (if your distro expects it), recordsize=16K or 32K (often better for layer churn than 128K).
- Databases in persistent datasets: tune per engine; a common starting point is recordsize=16K for PostgreSQL, sometimes 8K, and consider logbias=latency for sync-heavy workloads.
- Logs dataset: often compression=lz4 and recordsize=128K is fine; the bigger fight is retention, not block size.
- Backups: readonly=on on receive targets; prevent accidental edits.
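A minimal sketch of applying those defaults with zfs set, using the dataset names from this article (tank/var/lib/docker for Docker state, and tank/containers/payments, the example persistent dataset created later in Task 14); adjust names and values to your own layout:

# Docker dataset: cheap compression, no atime churn, xattrs in inodes, POSIX ACLs, smaller records
sudo zfs set compression=lz4 tank/var/lib/docker
sudo zfs set atime=off tank/var/lib/docker
sudo zfs set xattr=sa tank/var/lib/docker
sudo zfs set acltype=posixacl tank/var/lib/docker
sudo zfs set recordsize=16K tank/var/lib/docker

# Example database dataset: records sized for DB pages, latency-biased sync writes
sudo zfs set recordsize=16K tank/containers/payments
sudo zfs set logbias=latency tank/containers/payments

Remember that recordsize only affects blocks written after the change, so set it before loading data.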
Joke #1: ZFS snapshots are like office email—deleting a thing doesn’t mean it’s gone, it means it’s “archived forever by someone else.”
Where layer explosion actually stops
Layer explosion becomes a manageable phenomenon when:
- The Docker driver dataset is disposable and bounded with quotas (so a runaway build can’t eat the host).
- Persistent data lives elsewhere, so “nuke Docker state” is a legitimate recovery option.
- Snapshots on Docker’s dataset are either avoided or tightly controlled (because Docker itself already uses snapshots/clones internally).
Why this works: ZFS mechanics that matter
Clones keep blocks alive
The ZFS driver leans on snapshots and clones: image layers become snapshots, writable layers become clones.
That’s efficient—until you try to reclaim space. A deleted layer might still reference blocks via a clone chain.
The pool sees “referenced bytes,” and referenced bytes don’t disappear just because Docker forgot about them.
Dataset boundaries are operational boundaries
If your Docker state is one dataset, you can set properties for that whole set of behaviors. You can also
destroy it as a unit. If your persistent volumes live inside that dataset, you’ve welded your crown jewels
to your trash heap.
Quota and refquota are not the same tool
quota limits a dataset plus its children. refquota limits only the dataset itself.
For “each app gets 200G but can create children datasets inside,” quota is useful.
For “the live data in this dataset must not exceed the limit, no matter how much its own snapshots or children consume,” refquota gives you more direct control.
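A quick sketch of the difference on the example payments dataset (names and sizes are illustrative):

# quota: caps the dataset plus everything beneath it, including children and all snapshots
sudo zfs set quota=200G tank/containers/payments

# refquota: caps only what this dataset itself references; its snapshots and children don't count
sudo zfs set refquota=200G tank/containers/payments

# Check what is set and how close you are to each limit
zfs get -o name,property,value quota,refquota,used,referenced tank/containers/payments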
Mount behavior matters for Docker reliability
Docker expects /var/lib/docker to be present early and stay stable. ZFS datasets mount via zfs mount
(often handled by systemd services). If you bury Docker inside an auto-mounted hierarchy with edge-case dependencies,
you’ll eventually produce a boot race and a very confused Docker daemon.
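One common mitigation, sketched here as an assumption for a systemd host using OpenZFS’s standard units (the drop-in filename is arbitrary): make docker.service wait for ZFS mounts so the daemon never starts against an empty /var/lib/docker directory.

# /etc/systemd/system/docker.service.d/wait-for-zfs.conf
[Unit]
Requires=zfs-mount.service
After=zfs-mount.service zfs.target

Run sudo systemctl daemon-reload afterwards and restart Docker in a controlled window.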
ARC pressure is real on container hosts
ZFS loves RAM. Container hosts also love RAM. If you don’t cap ARC on a busy node, you can starve container workloads
in subtle ways: elevated reclaim, latency spikes, and a lot of “it’s slow but nothing is pegged.”
Joke #2: Dedup looks like free storage until your RAM learns what “mandatory overtime” means.
Practical tasks (commands, outputs, and decisions)
These are real operational tasks you can run on a Docker host using ZFS. Each one includes: the command, example output,
what the output means, and the decision you make from it. Run them in this order when you’re building confidence, and
in a tighter loop when you’re firefighting.
Task 1: Confirm Docker is actually using the ZFS storage driver
cr0x@server:~$ docker info --format '{{.Driver}}'
zfs
Meaning: Docker’s image/layer store is ZFS-aware. If it says overlay2, this article is still useful for volumes, but not for layer mechanics.
Decision: If not zfs, stop and decide whether you’re migrating drivers or just organizing volumes.
Task 2: Identify the dataset backing /var/lib/docker
cr0x@server:~$ findmnt -no SOURCE,TARGET /var/lib/docker
tank/var/lib/docker /var/lib/docker
Meaning: Your Docker root is a dataset, not just a directory. Good—now you can set properties and quotas cleanly.
Decision: If it’s not a dataset (e.g., it shows /dev/sda2), plan a migration before you touch tuning.
Task 3: List the Docker dataset and immediate children
cr0x@server:~$ zfs list -r -o name,used,refer,avail,mountpoint tank/var/lib/docker | head
NAME USED REFER AVAIL MOUNTPOINT
tank/var/lib/docker 78.4G 1.20G 420G /var/lib/docker
tank/var/lib/docker/zfs 77.1G 77.1G 420G /var/lib/docker/zfs
Meaning: Docker’s driver often creates a child dataset (commonly named zfs) holding layer datasets.
Decision: If you see thousands of children under this tree, layer explosion is already happening; you’ll manage it with quotas and cleanup cadence.
Task 4: Count how many datasets Docker has spawned
cr0x@server:~$ zfs list -H -r tank/var/lib/docker/zfs | wc -l
3427
Meaning: That’s dataset count, not image count. Thousands is not automatically fatal, but it correlates with slow mounts, slow destroys, and slow boot sequences.
Decision: If this grows without bound, you need stricter image retention, CI cleanup, or a separate build node that you can reset.
Task 5: See where the space is: dataset vs snapshots vs children
cr0x@server:~$ zfs list -o name,used,usedbydataset,usedbysnapshots,usedbychildren -r tank/var/lib/docker | head
NAME USED USEDDS USEDSNAP USEDCHILD
tank/var/lib/docker 78.4G 1.20G 9.30G 67.9G
tank/var/lib/docker/zfs 77.1G 2.80G 8.90G 65.4G
Meaning: If usedbysnapshots is large, “deleted” data is being kept alive by snapshots. If usedbychildren dominates, the layer datasets are the space hogs.
Decision: High snapshot usage: reduce snapshotting on Docker datasets and clean old snapshots. High children usage: prune images/containers and consider resetting the Docker dataset if it’s safe.
Task 6: Find the oldest Docker-related snapshots (if any)
cr0x@server:~$ zfs list -t snapshot -o name,creation,used -s creation | grep '^tank/var/lib/docker' | head
tank/var/lib/docker@weekly-2024-11-01 Fri Nov 1 02:00 1.12G
tank/var/lib/docker@weekly-2024-11-08 Fri Nov 8 02:00 1.08G
Meaning: Host-level snapshot policies sometimes accidentally include Docker datasets. That’s usually counterproductive with the ZFS driver.
Decision: Exclude Docker datasets from generic snapshot schedules; snapshot persistent app datasets instead.
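How you do the exclusion depends on your tooling. If your snapshot automation honors the com.sun:auto-snapshot user property (zfs-auto-snapshot does; sanoid and others use their own config files, so check your tool), a sketch:

# Tag the Docker tree so property-aware snapshot tools skip it; children inherit the property
sudo zfs set com.sun:auto-snapshot=false tank/var/lib/docker

# Once you've confirmed nothing depends on them, remove the stray host-level snapshots
sudo zfs destroy tank/var/lib/docker@weekly-2024-11-01
sudo zfs destroy tank/var/lib/docker@weekly-2024-11-08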
Task 7: Check critical ZFS properties on Docker dataset
cr0x@server:~$ zfs get -o name,property,value -s local,inherited compression,atime,xattr,recordsize,acltype tank/var/lib/docker
NAME PROPERTY VALUE
tank/var/lib/docker compression lz4
tank/var/lib/docker atime off
tank/var/lib/docker xattr sa
tank/var/lib/docker recordsize 16K
tank/var/lib/docker acltype posixacl
Meaning: These properties heavily influence small-file performance and metadata overhead.
Decision: If atime=on, turn it off for Docker datasets. If compression is off, enable lz4 unless you have a very specific reason not to.
Task 8: Apply a quota to bound Docker’s blast radius
cr0x@server:~$ sudo zfs set quota=250G tank/var/lib/docker
cr0x@server:~$ zfs get -o name,property,value quota tank/var/lib/docker
NAME PROPERTY VALUE
tank/var/lib/docker quota 250G
Meaning: Docker can no longer consume the entire pool and take the host down with it.
Decision: Pick a quota that supports your expected image churn plus headroom. If you regularly hit the quota, fix retention; don’t immediately raise it.
Task 9: Confirm pool health and see if you’re capacity-constrained
cr0x@server:~$ zpool status -x
all pools are healthy
cr0x@server:~$ zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 928G 721G 207G - - 41% 77% 1.00x ONLINE -
Meaning: Healthy pool, 77% capacity, moderate fragmentation. As you approach 85–90% usage, ZFS performance and allocation behavior degrade.
Decision: If CAP is above ~85%, prioritize freeing space or adding vdevs before you chase micro-optimizations.
Task 10: Identify write amplification and latency at a glance
cr0x@server:~$ iostat -x 1 3
Linux 6.8.0 (server) 12/25/2025 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.1 0.0 6.2 9.8 0.0 71.9
Device r/s w/s rKB/s wKB/s avgrq-sz avgqu-sz await svctm %util
nvme0n1 210.0 980.0 9800.0 42000.0 72.1 8.90 7.40 0.52 62.0
Meaning: Elevated %iowait and await suggest storage latency is affecting the system. Don’t read too much into %util on NVMe: the device serves I/O in parallel, so it can be the bottleneck well below 100%; queueing and sync-write behavior usually matter more than raw throughput here.
Decision: If await is high during container build storms, consider separating build nodes, tuning sync-heavy datasets, and checking SLOG effectiveness (if any).
Task 11: Check ARC size and memory pressure signals
cr0x@server:~$ cat /proc/spl/kstat/zfs/arcstats | egrep '^(size|c|c_min|c_max) '
size 4 8589934592
c 4 10737418240
c_min 4 1073741824
c_max 4 17179869184
Meaning: ARC is currently ~8G, can grow to ~16G. On a container host, ARC growing without a cap can starve workloads.
Decision: If your node is OOM’ing containers while ARC grows, cap ARC via module parameters and leave memory for applications.
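A minimal sketch of that cap, using OpenZFS’s standard zfs_arc_max module parameter; the 8 GiB value is an example, size it for your host:

# Persist across reboots: append to your zfs modprobe config
echo "options zfs zfs_arc_max=8589934592" | sudo tee -a /etc/modprobe.d/zfs.conf

# Apply immediately at runtime (ARC shrinks gradually if it's already above the new cap)
echo 8589934592 | sudo tee /sys/module/zfs/parameters/zfs_arc_max

Then watch arcstats and container memory together for a few days before deciding the number is right.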
Task 12: Find which datasets have the most snapshots (a proxy for churn)
cr0x@server:~$ zfs list -H -t snapshot -o name | awk -F@ '{print $1}' | sort | uniq -c | sort -nr | head
914 tank/var/lib/docker/zfs/graph/3f0c2b3d2a0e
842 tank/var/lib/docker/zfs/graph/9a1d11c7e6f4
Meaning: If Docker-related datasets are accumulating snapshots outside Docker’s own management, something is taking snapshots too aggressively.
Decision: Audit your snapshot tooling; exclude Docker layer trees.
Task 13: Detect space held by deleted-but-referenced blocks (snapshots/clones)
cr0x@server:~$ zfs get -o name,property,value used,referenced,logicalused,logicalreferenced tank/var/lib/docker
NAME PROPERTY VALUE
tank/var/lib/docker used 78.4G
tank/var/lib/docker referenced 1.20G
tank/var/lib/docker logicalused 144G
tank/var/lib/docker logicalreferenced 3.10G
Meaning: Logical space is higher than physical used: compression is working, and/or shared blocks exist. The key: used includes children and snapshots; referenced is what this dataset alone would free if destroyed.
Decision: If used is huge but referenced is small, destroying the dataset could free a lot (because it takes children and snapshots with it). That’s a valid reset strategy for Docker state—if persistent data is elsewhere.
Task 14: Create a persistent application dataset with a hard boundary
cr0x@server:~$ sudo zfs create -o mountpoint=/containers tank/containers
cr0x@server:~$ sudo zfs create -o mountpoint=/containers/payments -o compression=lz4 -o atime=off tank/containers/payments
cr0x@server:~$ sudo zfs set refquota=200G tank/containers/payments
cr0x@server:~$ zfs get -o name,property,value mountpoint,refquota tank/containers/payments
NAME PROPERTY VALUE
tank/containers/payments mountpoint /containers/payments
tank/containers/payments refquota 200G
Meaning: Persistent data has its own mountpoint and a strict size limit.
Decision: Bind-mount /containers/payments into containers. If the app hits the refquota, it fails in a contained way instead of consuming the host.
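A minimal sketch of wiring that dataset into a container; the image, names, and in-container path are illustrative, not from the original setup:

# The dataset's mountpoint becomes the database's data directory;
# the refquota bounds the database, not the whole host
docker run -d --name payments-db \
  -v /containers/payments:/var/lib/postgresql/data \
  -e POSTGRES_PASSWORD=change-me \
  postgres:16

The same host path works as a bind mount in compose or Kubernetes manifests; the point is that the path is a dataset boundary you control.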
Task 15: Replicate persistent datasets safely (send/receive)
cr0x@server:~$ sudo zfs snapshot tank/containers/payments@replica-001
cr0x@server:~$ sudo zfs send -c tank/containers/payments@replica-001 | sudo zfs receive -u backup/containers/payments
cr0x@server:~$ zfs get -o name,property,value readonly backup/containers/payments
NAME PROPERTY VALUE
backup/containers/payments readonly off
Meaning: You’ve transferred a consistent snapshot. The receive dataset is not automatically read-only unless you set it.
Decision: Set readonly=on on backup targets to prevent accidental writes.
cr0x@server:~$ sudo zfs set readonly=on backup/containers/payments
cr0x@server:~$ zfs get -o name,property,value readonly backup/containers/payments
NAME PROPERTY VALUE
backup/containers/payments readonly on
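Follow-up runs can send only the delta; a sketch assuming @replica-001 still exists on both sides:

# Snapshot again and send the increment since replica-001
sudo zfs snapshot tank/containers/payments@replica-002
sudo zfs send -c -i tank/containers/payments@replica-001 tank/containers/payments@replica-002 | \
  sudo zfs receive -u backup/containers/payments

readonly=on on the target does not block zfs receive; it only blocks writes through the mounted filesystem, so the flow keeps working after the hardening above.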
Task 16: Verify Docker disk usage vs ZFS usage (spot the mismatch)
cr0x@server:~$ docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 44 12 38.7GB 22.3GB (57%)
Containers 61 9 4.1GB 3.5GB (85%)
Local Volumes 16 10 9.8GB 1.2GB (12%)
Build Cache 93 0 21.4GB 21.4GB
cr0x@server:~$ zfs list -o name,used,avail tank/var/lib/docker
NAME USED AVAIL
tank/var/lib/docker 78.4G 420G
Meaning: Docker reports logical sizes it thinks it controls. ZFS reports actual used including snapshots and clone relationships. If ZFS used is much larger than Docker’s view, you have referenced blocks outside Docker’s accounting (often snapshots).
Decision: Hunt snapshots and clones holding space; consider excluding Docker datasets from snapshot tooling and resetting Docker’s dataset if needed.
Fast diagnosis playbook
When the host is slow or full, don’t start with random pruning. Start with a tight loop that tells you
which subsystem is guilty: pool capacity, ZFS snapshot retention, Docker cache churn, or pure I/O latency.
First: capacity and “space held hostage”
- Pool capacity: zpool list. If CAP > ~85%, expect pain.
- Where space is: zfs list -o name,used,usedbysnapshots,usedbychildren -r tank/var/lib/docker.
- Snapshots: zfs list -t snapshot | grep docker. If snapshots exist on Docker datasets, that’s suspicious.
Interpretation: If snapshots dominate, prune snapshots. If children dominate, prune images/containers or consider resetting Docker dataset.
Second: bottleneck type (latency vs CPU vs memory pressure)
- Disk latency: iostat -x 1 3 and watch await, %util, %iowait.
- ARC growth: check /proc/spl/kstat/zfs/arcstats and system memory.
- CPU steal / scheduler contention: if virtualized, check %steal in iostat output.
Interpretation: ARC pressure and I/O latency often masquerade as “Docker is slow.” They’re not the same fix.
Third: Docker churn source
- Build cache explosion: docker system df and a docker builder prune strategy.
- Image retention: list old images and tags; enforce TTL in CI and registries.
- Dataset count trend: dataset count in tank/var/lib/docker/zfs week over week.
Interpretation: If dataset count grows relentlessly, your build/pull workload is effectively a layer factory. Contain it with quotas and isolation.
Three corporate mini-stories (how teams get this wrong)
Incident: the wrong assumption (“Docker prune frees space”)
A mid-sized company ran their CI on a beefy ZFS-backed Docker host. They had a nightly job: prune images,
prune containers, prune build cache. It ran green, with logs that looked responsible. Meanwhile the pool crept
from 60% to 90% over a month and then fell off a cliff during a busy release week.
The on-call did the usual: ran the prune jobs manually, restarted Docker, even rebooted the host. Nothing moved.
docker system df claimed there was plenty reclaimable. ZFS disagreed. zpool list said 94% full,
and I/O latency spiked because ZFS was allocating from the worst remaining segments.
The wrong assumption was subtle: they assumed Docker’s notion of “unused” maps to ZFS’s ability to free blocks.
But the host also ran a generic snapshot policy on tank/var, which included /var/lib/docker.
Every night, they took snapshots of a dataset full of clones and churn. That meant “deleted layers” were still
referenced by snapshots, so space was stuck.
The fix was not heroics. They excluded the Docker dataset from the snapshot policy, destroyed the old snapshots,
and moved persistent data out of Docker’s dataset so they had the option to wipe Docker state if needed.
After that, pruning started working again because ZFS was finally allowed to actually free blocks.
Optimization that backfired (“Let’s turn on dedup for images”)
Another team had a good instinct: container images share lots of identical files. Why not enable ZFS dedup on
the Docker dataset and save a ton of space? They piloted it on one node and celebrated the initial numbers.
Used space dropped. High fives were exchanged in a meeting room with a whiteboard that still had last quarter’s KPIs.
Then the node started stuttering under load. Builds became erratic. Latency spikes appeared during peak traffic,
not just during CI. The team added CPU. They added faster disks. They were doing the ritual dance of performance
debugging while the real issue sat there quietly.
Dedup increases metadata lookups dramatically and wants a lot of RAM for the DDT (dedup table). The node was
now doing extra work on every write and read path, especially with the churn of layer creation and deletion.
Worse, the performance failures were intermittent, because they depended on cache hit rates and DDT working set.
The rollback was painful because turning off dedup doesn’t retroactively “undedupe” existing blocks; it just stops
deduping new writes. They eventually migrated Docker state to a fresh dataset with dedup off, and kept compression on.
They got most of the space savings they needed from lz4 and sensible retention, without the operational tax.
Boring but correct practice that saved the day (separate datasets + quotas)
A payments platform team ran Docker on ZFS with a layout that looked almost too tidy: Docker lived in a dataset
with a firm quota. Each stateful service had its own dataset under /containers with refquota and a simple
snapshot schedule. The backup target was receive-only. Nothing fancy. No clever scripts. No “storage optimization initiative.”
One afternoon a CI misconfiguration caused a loop: a build pipeline pulled base images repeatedly and created
new tags every run. On most systems, this would just chew the disk until the host fell over. Here, the Docker dataset
hit its quota and Docker started failing pulls. Loudly. The node stayed alive. The databases kept running.
The on-call got an alert about failed builds, not a dead production host. They fixed the CI config, then cleaned up
Docker’s dataset. No data restore. No emergency capacity purchase. The quota didn’t prevent the mistake; it prevented the mistake from becoming an outage.
This is the kind of practice that never gets a celebratory post. It should. Boring storage boundaries are what turn
“oops” into “ticket,” instead of “oops” into “incident.”
Common mistakes: symptom → root cause → fix
1) “Docker prune ran, but ZFS space didn’t come back”
Symptom: Docker reports reclaimable space; ZFS used stays high.
Root cause: Snapshots on Docker datasets holding referenced blocks; or clone chains keeping blocks alive.
Fix: Stop snapshotting Docker layer datasets; destroy snapshots; consider resetting tank/var/lib/docker after moving persistent data out.
2) “The host has thousands of mounts and boot is slow”
Symptom: Long boot times, systemd mount units take ages, Docker starts late or fails.
Root cause: Docker ZFS driver produced huge dataset counts; mount handling becomes expensive.
Fix: Bound the dataset with quota; reduce image churn; rebuild the Docker dataset periodically on CI nodes; separate build nodes from long-lived prod nodes.
3) “Containers randomly slow down; CPU isn’t pegged”
Symptom: Latency spikes, timeouts, inconsistent build performance.
Root cause: Pool nearly full, fragmentation and allocation slowdowns; or ARC starving applications.
Fix: Keep pool under ~80–85%; cap ARC; add vdevs (not bigger disks in-place, if you want real performance improvement).
4) “ZFS usedbysnapshots is huge under /var/lib/docker”
Symptom: Snapshot space dominates usage numbers.
Root cause: Generic snapshot policy applied to Docker dataset; Docker already has its own internal snapshot/clone model.
Fix: Exclude Docker dataset from host snapshot schedules; snapshot /containers/<app> instead.
5) “We tuned recordsize for Docker and the database got worse”
Symptom: DB latency increased after “container storage tuning.”
Root cause: Database data stored inside Docker’s dataset or inside the driver-managed layer/volume path; recordsize chosen for layer churn, not DB patterns.
Fix: Put DB on its own dataset; tune recordsize and logbias there; keep Docker’s dataset tuned for Docker.
6) “Replication is messy and restores are scary”
Symptom: Backups include Docker layers, caches, and state, making send/receive huge and slow.
Root cause: Persistent data mixed with Docker state under one dataset tree.
Fix: Split persistent datasets under /containers; replicate those. Treat Docker’s dataset as cache/state, not backup material.
7) “We enabled dedup and now everything is unpredictable”
Symptom: Performance variance, memory pressure, weird latency spikes.
Root cause: Dedup table working set too large; extra metadata overhead for churny layers.
Fix: Don’t use dedup for Docker layer stores. Use compression and retention policies; if already enabled, migrate to a new dataset.
Checklists / step-by-step plan
Plan A: New host (clean build)
- Create pool with sane ashift and vdev design for your hardware (mirror/RAIDZ as required by your failure model).
- Create datasets:
  - tank/var/lib/docker mounted at /var/lib/docker
  - tank/containers mounted at /containers
  - Optional per-app datasets: tank/containers/<app>
- Set Docker dataset properties: compression=lz4, atime=off, xattr=sa, and consider recordsize=16K or 32K.
- Set a quota on tank/var/lib/docker sized for your expected churn.
- For each stateful service, create a dataset under /containers and set refquota.
- Configure Docker to use the ZFS driver and the correct zpool/dataset (via daemon config; see the daemon.json sketch after this list), then start Docker.
- Exclude Docker datasets from any generic snapshot automation; snapshot only persistent datasets.
- Define replication from the tank/containers subtree to a receive-only backup pool.
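For the “configure Docker” step, a minimal sketch of /etc/docker/daemon.json; "storage-driver": "zfs" is the documented key, and Docker then uses the dataset backing /var/lib/docker:

{
  "storage-driver": "zfs"
}

Restart Docker after editing. The zfs driver also accepts a zfs.fsname storage option if you need to pin the parent dataset explicitly; verify that against your Docker version’s documentation before relying on it.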
Plan B: Existing host (migrate without drama)
- Inventory what is persistent:
  - List compose stacks and their volumes.
  - Identify which volumes are actually databases or stateful services.
- Create /containers datasets per app and move data there (rsync or application-level migration).
- Update compose/k8s manifests to bind-mount host paths from /containers/<app>.
- Only after persistent data is out: apply a quota to the Docker dataset.
- Audit snapshots: if you have snapshots of Docker datasets, remove them carefully after verifying they are not part of a required rollback procedure.
- Set properties on the Docker dataset and restart Docker in a controlled window.
- If the dataset tree is already pathological (tens of thousands of datasets), consider rebuilding Docker state (see the sketch after this list):
  - Stop Docker
  - Destroy and recreate tank/var/lib/docker
  - Start Docker and re-pull images
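A minimal sketch of that rebuild, assuming persistent data already lives outside /var/lib/docker (verify that before destroying anything):

# Stop the engine so nothing holds mounts inside the dataset
# (on some distros you may also need to stop docker.socket)
sudo systemctl stop docker

# -R follows clone dependencies, which Docker's layer tree is full of
sudo zfs destroy -R tank/var/lib/docker

# Recreate with the properties you want, then start Docker and re-pull images
sudo zfs create -o mountpoint=/var/lib/docker -o compression=lz4 -o atime=off -o xattr=sa tank/var/lib/docker
sudo systemctl start docker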
Plan C: CI nodes (treat them as cattle)
- Put CI Docker state on its own dataset with a strict quota.
- Do not snapshot CI Docker datasets.
- Schedule aggressive build cache cleanup (see the sketch after this list).
- Rebuild CI nodes periodically instead of trying to “keep them clean forever.”
- Keep artifacts in an external store; keep Docker as a cache.
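A sketch of the kind of scheduled cleanup implied above; the retention windows are examples, not recommendations:

# Drop build cache entries older than 72 hours
docker builder prune --force --filter "until=72h"

# Drop unused images older than a week (--all removes tagged-but-unused images too)
docker image prune --all --force --filter "until=168h"

Run it from cron or a systemd timer on CI nodes only; on long-lived production hosts, prefer deliberate retention over aggressive pruning.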
FAQ
1) Should I use Docker’s ZFS driver or overlay2 on ZFS?
If you already have ZFS and want ZFS-native snapshots/clones for layers, use the ZFS driver. If you want the mainstream path
and simpler day-2 operations, overlay2 on top of ZFS can be acceptable—but you lose some ZFS-native semantics and may hit weird interactions.
In either case, keep persistent data in its own datasets.
2) Can I snapshot /var/lib/docker for backups?
You can. You shouldn’t. Docker’s state is reconstructible; your databases aren’t. Snapshot and replicate /containers/<app>.
Treat Docker images and layers as cache and rebuild material.
3) Why does dataset count matter so much?
Each dataset has metadata and may involve mount handling. Thousands can be fine; tens of thousands become operational friction:
slow listing, slow destroy, slow mount/unmount, and a bigger blast radius for mistakes.
4) What properties are most important for Docker datasets?
compression=lz4, atime=off, xattr=sa are the usual wins. recordsize is workload-dependent; 16K–32K often behaves better for churny layers than 128K.
5) Should I put Docker volumes under Docker’s dataset?
For ephemeral volumes, it’s fine. For stateful workloads, don’t. Use host-path bind mounts backed by datasets under /containers.
That’s how you make backups and quotas not terrifying.
6) Is adding a SLOG useful for Docker?
Only if you have sync-heavy workloads on datasets with sync=standard and your applications actually issue sync writes.
Many container workloads are not sync-bound. Test with metrics; don’t buy a SLOG as a superstition.
7) Why do I see large usedbysnapshots even when I don’t take snapshots manually?
Host snapshot tooling often targets whole trees (like tank/var). Or a backup product is snapshotting recursively.
Docker itself also uses ZFS snapshots internally, but those are typically managed under the Docker driver’s dataset tree.
The fix is to scope your snapshot automation precisely.
8) Can I “defragment” a ZFS pool to fix performance?
Not in the classic filesystem sense. The practical fix is capacity discipline (don’t run hot), good vdev design,
and sometimes rewriting data by migrating datasets (send/receive) to a fresh pool.
9) What’s the safest way to reset Docker state on a ZFS host?
Stop Docker, ensure no persistent data is inside /var/lib/docker, then destroy and recreate the Docker dataset.
This is why we separate /containers—so this move is safe when you need it.
10) How do I prevent one runaway app from filling the pool?
Put each stateful app in its own dataset and set refquota. Put Docker state under a quota.
Then alert on quota utilization before it hits the wall.
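A minimal sketch of the check behind such an alert, using parseable zfs get output; the dataset name and threshold are examples, and it assumes a refquota is actually set:

# referenced is what refquota enforces against; -Hp gives raw byte values with no headers
used=$(zfs get -Hp -o value referenced tank/containers/payments)
limit=$(zfs get -Hp -o value refquota tank/containers/payments)
pct=$(( used * 100 / limit ))
[ "$pct" -ge 80 ] && echo "WARN: tank/containers/payments at ${pct}% of refquota"

Wire the same logic into whatever exporter or cron-driven check you already run; the important part is alerting before 100%, not the tooling.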
Conclusion: next steps you can do today
If you remember one thing: Docker state is not precious, and your ZFS layout should reflect that. Give Docker its own dataset,
cap it with a quota, and stop snapshotting it like it’s a family photo album. Put persistent data in its own datasets under
/containers, with refquotas and a replication plan you can explain to a tired coworker at 3 a.m.
Practical next steps:
- Run findmnt and confirm /var/lib/docker is a dedicated dataset.
- Run zfs list -o name,usedbydataset,usedbysnapshots,usedbychildren and learn what’s actually holding space.
- Exclude Docker datasets from snapshot automation.
- Create /containers datasets for stateful services and move data there.
- Set quotas/refquotas so mistakes fail small and loud.
Layer explosion doesn’t stop because you asked nicely. It stops because you drew a boundary and enforced it.