ZFS for PostgreSQL: The Dataset and Sync Strategy That Works

PostgreSQL is honest: when it’s slow, it usually tells you why. ZFS is also honest—just not always in the same dialect.
Put them together and you can get a database that’s boringly reliable… or a system that looks fine on dashboards right up
until it doesn’t.

The pain point is always the same: sync writes. WAL durability. Latency spikes that show up only under commit-heavy load.
Someone suggests “just set sync=disabled” and suddenly you’re having an unplanned meeting about “data integrity posture.”

The strategy: what to do, what not to do

If you want PostgreSQL on ZFS to work in production, you need to stop thinking in generic filesystem tips and start thinking
in durability boundaries. PostgreSQL’s durability boundary is WAL + fsync behavior (plus checkpoints and replication).
ZFS’s durability boundary is the ZIL (and SLOG if present), transaction groups, and where synchronous writes land.

The dataset and sync strategy that works is not exotic:

  • Keep PostgreSQL data and WAL on separate datasets (and ideally separate vdevs if you can).
  • Leave PostgreSQL safety features on: fsync=on, full_page_writes=on, and sane synchronous_commit choices.
  • Use ZFS sync=standard for most things. It respects POSIX sync semantics, and Postgres uses them for a reason.
  • If commit latency is the bottleneck, fix the sync path correctly: provide a proper SLOG device (power-loss protected) or change the application durability requirement intentionally (e.g., synchronous_commit=off for specific workloads).
  • Use compression. It usually helps latency and throughput on database pages, and it’s one of the rare “free lunches” that actually delivers in the field.
  • Do not set sync=disabled on the database dataset unless you’ve written down the blast radius and got explicit sign-off from people who will be paged at 3 a.m.

The easiest way to get fired is to “optimize durability” without telling anyone. The second easiest way is to buy fast devices
and then make ZFS wait on slow sync I/O anyway.

Joke #1: If you set sync=disabled on a database and call it “eventually durable,” you’ve invented a new storage tier: hope.

Facts and context that change decisions

These aren’t trivia for trivia’s sake. Each one should nudge a real design choice.

  1. ZFS is copy-on-write (from the original Sun Microsystems work). PostgreSQL already expects to overwrite 8 KB pages; ZFS turns that into new block allocations. That affects write amplification and fragmentation.
  2. The ZIL is always there, even without a SLOG. A SLOG is not “turning on the ZIL,” it’s moving the log device to something faster and safer for sync writes.
  3. PostgreSQL’s default page size is 8 KB. ZFS’s default recordsize is often 128 KB. That mismatch is not inherently wrong, but it determines read-modify-write behavior on updates and how much data gets dirtied per write.
  4. Historically, many “ZFS is slow for databases” claims came from misconfigured pools: wrong ashift on 4K-sector disks, no SLOG for sync-heavy workloads, or RAM-starved ARC behavior.
  5. PostgreSQL WAL is sequential-ish but not purely sequential. It’s append-heavy, but fsync behavior can create bursts, and archiving/replication can create read pressure at awkward times.
  6. ZFS checksums every block. That means you can detect silent corruption that other stacks would happily feed your database until it “mysteriously” fails later. For databases, this is not optional luxury; it’s integrity.
  7. Compression became “safe by default” in many ZFS shops only after LZ4 matured (and CPUs got faster). Today, leaving compression off is often leaving performance on the table.
  8. PostgreSQL durability is not a binary switch. You can choose durability at the transaction level (synchronous_commit) or replication level (sync replication), rather than filesystem sabotage.
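
That last point deserves a concrete shape. A minimal sketch of scoping the durability tradeoff inside Postgres instead of at the filesystem (the role name reporting_batch is illustrative):

cr0x@server:~$ sudo -u postgres psql -c "ALTER ROLE reporting_batch SET synchronous_commit = off;"
cr0x@server:~$ sudo -u postgres psql -c "BEGIN; SET LOCAL synchronous_commit = off; /* batch writes here */ COMMIT;"

The first form relaxes commit durability for one role’s sessions; the second does it for a single transaction. A crash can lose those most recent commits, but it cannot corrupt the database, which is a very different contract from sync=disabled.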

Dataset layout that survives real workloads

The most common mistake is putting everything under one dataset, tuning it once, and assuming you’re done.
PostgreSQL has at least three I/O personalities: data files, WAL, and “stuff that is important but not worth optimizing at the cost of durability”
(logs, dumps, base backup staging, etc.).

A pragmatic dataset tree

Here’s a layout that maps to how Postgres behaves and how ZFS can be tuned without self-sabotage:

  • tank/pg: parent dataset for all PostgreSQL storage.
  • tank/pg/data: main data directory (PGDATA) excluding WAL if you relocate it.
  • tank/pg/wal: WAL directory (symlink pg_wal here, or point initdb --waldir at it when creating the cluster).
  • tank/pg/tmp: temp tables / sorts if you place temp_tablespaces here (optional; often better on local NVMe and acceptable to lose).
  • tank/pg/backups: backups and basebackup staging (snapshot targets, WAL archive spool, etc.).

Why separate datasets? Because recordsize, logbias, compression, and sometimes primarycache decisions differ.
WAL wants low latency and predictable sync behavior. Data files want read efficiency, caching, and sane fragmentation control.

Where to physically place WAL

If you can afford it, WAL deserves fast storage and a clean sync path. Options:

  • Same pool, separate dataset: simplest, still lets you tune properties; performance depends on vdev layout.
  • Separate mirror vdevs for WAL: good when the main pool is RAIDZ and WAL latency is suffering.
  • Separate pool for WAL: sometimes done, but operationally heavier (more things to monitor, more failure domains).

Don’t move WAL to “fast but sketchy” storage without power-loss protection if you’re relying on synchronous durability. WAL is the diary of what happened; you don’t write your diary in disappearing ink.
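
If you do go the separate-pool route in place of tank/pg/wal, the shape is simple even if the operations are not. A minimal sketch, assuming two dedicated power-loss-protected NVMe devices (the pool name and device paths are placeholders):

cr0x@server:~$ sudo zpool create -o ashift=12 -O compression=lz4 -O atime=off pgwal mirror /dev/disk/by-id/nvme-WAL-A /dev/disk/by-id/nvme-WAL-B
cr0x@server:~$ sudo zfs create -o recordsize=16K -o mountpoint=/var/lib/postgresql/16/wal pgwal/wal

Remember that a second pool means a second scrub schedule, a second capacity alarm, and a second failure domain. Only take that on if the latency win is real.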

Sync semantics: PostgreSQL vs ZFS (and where SLOG fits)

PostgreSQL writes WAL records and uses fsync() (or fdatasync()) to ensure durability when a transaction commits, depending on settings.
When Postgres says “sync,” it means: “If you lose power right now, the committed transaction should still be committed after crash recovery.”

ZFS with sync=standard honors synchronous requests by using the ZIL. The ZIL is not a write cache for everything; it is a log of synchronous writes so they can be replayed after a crash. Without a dedicated SLOG device, the ZIL lives on the main pool, meaning sync writes land on your data disks. With a SLOG, they land on the SLOG (still written safely), then later get committed to the main pool with normal transaction group behavior.

What sync really means on a dataset

  • sync=standard: honor sync requests. This is the default you want for databases.
  • sync=always: treat all writes as synchronous, even if the app didn’t ask. Usually worse for Postgres; it turns non-critical writes into latency tax.
  • sync=disabled: lie to the application about durability. Fast. Also a career-limiting move when the power blips.

SLOG: what it is, what it isn’t

A SLOG is a dedicated device for the ZIL. It improves latency for synchronous writes by providing a fast place to commit them.
But it must be the right kind of device: low latency, high write endurance, and—non-negotiable—power-loss protection.

If the SLOG lies about durability (consumer SSD with volatile cache and no PLP), you’ve effectively moved the sync=disabled gamble into hardware.
It may “work” until it doesn’t, which is the least useful reliability pattern in operations.
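
Adding a SLOG is a single command once the hardware is right. A sketch, assuming two small PLP-protected devices (paths are placeholders):

cr0x@server:~$ sudo zpool add tank log mirror /dev/disk/by-id/nvme-SLOG-A /dev/disk/by-id/nvme-SLOG-B
cr0x@server:~$ sudo zpool status tank

Check that the new logs vdev shows up ONLINE before trusting it with commit latency. Mirroring the SLOG protects the short window where the ZIL holds data not yet committed to the main pool.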

One quote you can hang an incident review on

Paraphrasing John Allspaw: reliability comes from making failure visible and survivable, not from believing it won’t happen.

Recommended ZFS properties for Postgres (with rationale)

These are opinionated defaults that work across a lot of real installations. You can deviate, but do it deliberately and test with your workload.

Pool-level prerequisites (before datasets)

  • Correct ashift for your drives (typically 12 for 4K sectors). Get this wrong and you buy permanent write amplification.
  • Mirrors for latency-sensitive workloads (including WAL-heavy Postgres). RAIDZ can be fine for capacity and throughput, but small sync I/O latency is usually better on mirrors.
  • Enough RAM for ARC, but don’t starve Postgres shared buffers. ARC will happily expand; your OOM killer won’t be impressed.
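
On Linux, the usual way to keep ARC and Postgres from fighting is to cap ARC explicitly. A sketch, assuming you want ARC held to roughly 48 GB (the number is an example, not a recommendation; append rather than clobber if /etc/modprobe.d/zfs.conf already exists):

cr0x@server:~$ echo "options zfs zfs_arc_max=51539607552" | sudo tee -a /etc/modprobe.d/zfs.conf
cr0x@server:~$ echo 51539607552 | sudo tee /sys/module/zfs/parameters/zfs_arc_max

The first line persists the cap across reboots; the second applies it immediately. Size it so shared_buffers, connection overhead, and the OS still have headroom.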

Dataset: tank/pg/data

  • recordsize=16K or recordsize=8K: Start with 16K for many OLTP workloads; 8K can reduce read-modify-write on updates but may hurt large sequential reads. Measure.
  • compression=lz4: Usually improves effective IOPS and reduces write amplification. It’s one of the few knobs that’s both safer and faster.
  • atime=off: Postgres doesn’t need atime. Stop paying for it.
  • xattr=sa (if supported): stores extended attributes as system attributes in the dnode instead of in hidden directories; reduces metadata I/O.
  • logbias=latency (default): favor low latency for synchronous operations. For the data dataset, leaving the default is fine; we’ll tune the WAL dataset separately.

Dataset: tank/pg/wal

  • recordsize=16K: WAL writes are append-ish; you don’t need huge recordsize. Avoid 128K here.
  • compression=lz4: WAL compresses well enough sometimes, but the win is often modest. Still, it usually doesn’t hurt and can reduce device writes.
  • logbias=latency or logbias=throughput (it depends):

This one needs explanation. logbias=throughput tells ZFS to bias toward writing sync data to the main pool rather than the SLOG in some patterns.
On a system with a good SLOG, you often want logbias=latency. On a system without a SLOG, WAL might already be living on the main pool ZIL; the bias matters less.
In practice: if you have a proper SLOG and commit latency is a bottleneck, keep logbias=latency for WAL. If you’re saturating the SLOG or it’s not the bottleneck, consider throughput.
Don’t cargo-cult it. Measure commit latency and ZIL behavior.

  • sync=standard: Yes, still. Your “fast mode” should be at Postgres transaction settings, not filesystem lies.
  • primarycache=metadata (sometimes): If WAL reads are pushing out useful data cache, limit cache to metadata for WAL. This depends on your RAM and workload.

Dataset: tank/pg/tmp (optional)

  • sync=disabled can be acceptable here if it is truly disposable temp space and you’re comfortable losing it on crash. Treat it like a scratch disk.
  • compression=lz4, atime=off.
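
If you do carve out temp space, the wiring looks like this (the tablespace name temp_zfs is illustrative; decide consciously whether sync=disabled is acceptable for your failure model):

cr0x@server:~$ sudo zfs set sync=disabled compression=lz4 atime=off tank/pg/tmp
cr0x@server:~$ sudo chown postgres:postgres /var/lib/postgresql/tmp
cr0x@server:~$ sudo -u postgres psql -c "CREATE TABLESPACE temp_zfs LOCATION '/var/lib/postgresql/tmp';"
cr0x@server:~$ sudo -u postgres psql -c "ALTER SYSTEM SET temp_tablespaces = 'temp_zfs';"
cr0x@server:~$ sudo -u postgres psql -c "SELECT pg_reload_conf();"

Temp files and temp tables land on the scratch dataset; everything durable stays on datasets that keep their promises.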

Joke #2: The only thing more “temporary” than tank/pg/tmp is the confidence of someone who didn’t test a failover.

Practical tasks: commands, outputs, and what you decide

These are the field exercises: what you run, what the output means, and what decision you make. The goal is to stop arguing from vibes.

Task 1: Confirm pool health (because everything else is pointless if it’s broken)

cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 02:11:03 with 0 errors on Sun Dec 22 03:00:12 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            nvme0n1p2               ONLINE       0     0     0
            nvme1n1p2               ONLINE       0     0     0
        logs
          nvme2n1p1                 ONLINE       0     0     0

errors: No known data errors

What it means: pool is ONLINE, scrub is clean, SLOG device is present and online.
Decision: proceed with performance tuning. If you see DEGRADED, checksum errors, or a dead log device, stop and fix reliability first.

Task 2: Verify ashift (silent performance killer)

cr0x@server:~$ sudo zdb -C tank | grep -E "ashift|vdev_tree" -n | head
56:        vdev_tree:
74:            ashift: 12

What it means: ashift: 12 indicates 4K sectors. Good for modern SSDs/HDDs.
Decision: if ashift is 9 on 4K drives, you’ll pay with write amplification forever. Migration/rebuild is the fix, not a knob.

Task 3: Inspect current dataset properties

cr0x@server:~$ sudo zfs get -o name,property,value,source -r recordsize,compression,atime,sync,logbias,primarycache,xattr tank/pg
NAME          PROPERTY      VALUE     SOURCE
tank/pg       atime         off       local
tank/pg       compression   lz4       local
tank/pg       logbias       latency   default
tank/pg       primarycache  all       default
tank/pg       recordsize    128K      default
tank/pg       sync          standard  default
tank/pg       xattr         sa        local
tank/pg/data  recordsize    16K       local
tank/pg/wal   recordsize    16K       local
tank/pg/wal   logbias       latency   local

What it means: you can see what’s inherited and what’s explicitly set.
Decision: set properties on the dataset that matches the behavior (data vs WAL), not the parent “because it’s convenient.”

Task 4: Create datasets with sane defaults

cr0x@server:~$ sudo zfs create -o mountpoint=/var/lib/postgresql tank/pg
cr0x@server:~$ sudo zfs create -o mountpoint=/var/lib/postgresql/16/main tank/pg/data
cr0x@server:~$ sudo zfs create -o mountpoint=/var/lib/postgresql/16/wal tank/pg/wal
cr0x@server:~$ sudo zfs create -o mountpoint=/var/lib/postgresql/tmp tank/pg/tmp

What it means: datasets exist and mount where Postgres expects (adjust for your distro/version).
Decision: keep your filesystem boundaries aligned with operational needs: separate snapshots, separate properties, separate monitoring.

Task 5: Apply properties (data)

cr0x@server:~$ sudo zfs set atime=off compression=lz4 xattr=sa recordsize=16K sync=standard tank/pg/data
cr0x@server:~$ sudo zfs get -o property,value -H atime,compression,xattr,recordsize,sync tank/pg/data
atime	off
compression	lz4
xattr	sa
recordsize	16K
sync	standard

What it means: the data dataset is tuned for Postgres page-ish behavior and reduced metadata churn.
Decision: start at 16K. If updates are heavy and you see write amplification, test 8K. Don’t guess.

Task 6: Apply properties (WAL)

cr0x@server:~$ sudo zfs set atime=off compression=lz4 recordsize=16K sync=standard logbias=latency tank/pg/wal
cr0x@server:~$ sudo zfs get -o property,value -H atime,compression,recordsize,sync,logbias tank/pg/wal
atime	off
compression	lz4
recordsize	16K
sync	standard
logbias	latency

What it means: WAL is prepared for low-latency sync behavior.
Decision: if commit latency is still high, this points toward SLOG/device latency, not “more tuning.”

Task 7: Confirm where Postgres is writing WAL

cr0x@server:~$ sudo -u postgres psql -XAtc "show data_directory; show hba_file; show config_file;"
/var/lib/postgresql/16/main
/var/lib/postgresql/16/main/pg_hba.conf
/etc/postgresql/16/main/postgresql.conf
cr0x@server:~$ sudo -u postgres psql -XAtc "select pg_walfile_name(pg_current_wal_lsn());"
00000001000000020000003A
cr0x@server:~$ sudo ls -ld /var/lib/postgresql/16/main/pg_wal
lrwxrwxrwx 1 postgres postgres 26 Dec 25 10:44 /var/lib/postgresql/16/main/pg_wal -> /var/lib/postgresql/16/wal

What it means: WAL directory is redirected correctly.
Decision: if WAL is still inside the data dataset, you lose tuning isolation and snapshots become more annoying than they should be.
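
If it is still inside the data dataset, relocation is mechanical but must happen with the cluster stopped. A sketch, assuming Debian-style service naming (adjust unit names and paths for your distro and version):

cr0x@server:~$ sudo systemctl stop postgresql@16-main
cr0x@server:~$ sudo rsync -a /var/lib/postgresql/16/main/pg_wal/ /var/lib/postgresql/16/wal/
cr0x@server:~$ sudo mv /var/lib/postgresql/16/main/pg_wal /var/lib/postgresql/16/main/pg_wal.old
cr0x@server:~$ sudo ln -s /var/lib/postgresql/16/wal /var/lib/postgresql/16/main/pg_wal
cr0x@server:~$ sudo chown -R postgres:postgres /var/lib/postgresql/16/wal
cr0x@server:~$ sudo systemctl start postgresql@16-main

Only remove pg_wal.old after the cluster has started cleanly and the ls -ld check above shows the symlink pointing where you expect.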

Task 8: Check if you’re bottlenecked on sync commits

cr0x@server:~$ sudo -u postgres psql -XAtc "select name, setting from pg_settings where name in ('fsync','synchronous_commit','wal_sync_method');"
fsync|on
synchronous_commit|on
wal_sync_method|fdatasync

What it means: Postgres is behaving safely: fsync on, synchronous commits.
Decision: if latency is unacceptable, solve it with a SLOG or with targeted Postgres-level durability tradeoffs—not by disabling sync at ZFS.
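
If you want to see raw fsync cost on the WAL dataset without involving the database, pg_test_fsync ships with Postgres. A sketch, assuming Debian-style packaging paths (adjust the binary path for your install):

cr0x@server:~$ sudo -u postgres /usr/lib/postgresql/16/bin/pg_test_fsync -f /var/lib/postgresql/16/wal/pg_test_fsync.tmp

It reports per-second rates for each wal_sync_method on that exact filesystem, which makes it a clean before/after tool when you add a SLOG or change dataset properties.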

Task 9: Measure ZIL/SLOG activity and spot sync pressure

cr0x@server:~$ sudo zpool iostat -v tank 1 5
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                         980G  2.65T    120   1800  9.2M  145M
  mirror-0                   980G  2.65T    120   1700  9.2M  132M
    nvme0n1p2                   -      -     60    850  4.6M   66M
    nvme1n1p2                   -      -     60    850  4.6M   66M
logs                             -      -      0    420    0  4.1M
  nvme2n1p1                      -      -      0    420    0  4.1M
--------------------------  -----  -----  -----  -----  -----  -----

What it means: the logs vdev is doing writes. That’s sync traffic hitting the SLOG.
Decision: if SLOG write ops are high and commit latency is high, your SLOG device may be too slow, saturated, or misbehaving.

Task 10: Watch latency directly with iostat (device-level truth)

cr0x@server:~$ iostat -x 1 3
Linux 6.8.0 (server)  12/25/2025  _x86_64_  (32 CPU)

Device            r/s   w/s  rkB/s  wkB/s  await  svctm  %util
nvme0n1          80.0  600.0  5120  65536   2.10   0.20  14.0
nvme1n1          78.0  590.0  5000  64500   2.05   0.19  13.5
nvme2n1           0.0  420.0     0   4200   0.35   0.05   2.5

What it means: low await on SLOG device suggests it’s not the bottleneck.
Decision: if await on the log device spikes into milliseconds under commit load, fix the SLOG (device, PCIe topology, firmware, PLP) or don’t use one.

Task 11: Verify compression is actually helping (and not burning CPU)

cr0x@server:~$ sudo zfs get -o name,property,value -H compressratio,compression tank/pg/data tank/pg/wal
tank/pg/data	compressratio	1.62x
tank/pg/data	compression	lz4
tank/pg/wal	compressratio	1.10x
tank/pg/wal	compression	lz4

What it means: data compresses well, WAL not so much. That’s normal.
Decision: keep compression on; the data win usually outweighs everything. If CPU is maxed, measure before changing.

Task 12: Check ARC pressure and whether ZFS is fighting Postgres for RAM

cr0x@server:~$ sudo arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
10:52:01   520    40      7    30   75    10   25     0    0   64G   96G
10:52:02   610    55      9    45   82    10   18     0    0   64G   96G
10:52:03   590    50      8    41   82     9   18     0    0   64G   96G

What it means: ARC is large but not at cap, miss% is reasonable. If miss% is high and Postgres is also caching heavily, you may be double-caching.
Decision: tune ARC max if the OS is swapping or Postgres is starved; or reduce Postgres shared_buffers if ARC is doing the job better for your access pattern.

Task 13: Check transaction group timing (a source of periodic latency)

cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_txg_timeout
5

What it means: TXG commit interval is 5 seconds (common default). Bursty workloads can see periodic flush behavior.
Decision: don’t change it just because you can. If you see regular latency spikes aligned with TXG commits, investigate write pressure and device latency before touching this.

Task 14: Confirm autotrim status (SSD performance over time)

cr0x@server:~$ sudo zpool get autotrim tank
NAME  PROPERTY  VALUE     SOURCE
tank  autotrim  on        local

What it means: TRIM is enabled; SSDs stay healthier under sustained writes.
Decision: enable it for SSD pools unless you have a specific reason not to.
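
If it is off, enabling it (plus a one-off trim to catch up) is low drama:

cr0x@server:~$ sudo zpool set autotrim=on tank
cr0x@server:~$ sudo zpool trim tank
cr0x@server:~$ sudo zpool status -t tank

zpool status -t shows per-device trim progress and when the last trim completed.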

Task 15: Snapshot policy sanity check (because backups are an operational feature)

cr0x@server:~$ sudo zfs list -t snapshot -o name,creation -S creation | head
NAME                               CREATION
tank/pg/data@hourly-2025-12-25-10   Thu Dec 25 10:00 2025
tank/pg/wal@hourly-2025-12-25-10    Thu Dec 25 10:00 2025
tank/pg/data@hourly-2025-12-25-09   Thu Dec 25 09:00 2025
tank/pg/wal@hourly-2025-12-25-09    Thu Dec 25 09:00 2025

What it means: consistent snapshot cadence exists for both datasets.
Decision: snapshots are not backups by themselves, but they enable fast rollback and replication. Ensure the WAL retention/archiving strategy matches snapshot cadence.
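
One ZFS property worth leaning on here: a recursive snapshot of the parent is taken atomically across all child datasets, so data and WAL are captured at the same instant. A sketch (the snapshot name is illustrative):

cr0x@server:~$ sudo zfs snapshot -r tank/pg@pre-maintenance-2025-12-25
cr0x@server:~$ sudo zfs list -t snapshot -o name -S creation | head -n 4

That gives you a crash-consistent point in time. For application-consistent recovery, still pair it with WAL archiving or base backups.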

Task 16: Test sync write latency with pgbench and correlate

cr0x@server:~$ sudo -u postgres pgbench -i -s 50 benchdb
dropping old tables...
creating tables...
generating data...
vacuuming...
creating primary keys...
done.
cr0x@server:~$ sudo -u postgres pgbench -c 16 -j 16 -T 60 -N benchdb
transaction type: <builtin: simple update>
scaling factor: 50
query mode: simple
number of clients: 16
number of threads: 16
duration: 60 s
number of transactions actually processed: 920000
latency average = 1.043 ms
tps = 15333.201 (without initial connection time)

What it means: -N runs the built-in simple-update script; it skips the branch/teller updates (less lock contention) but still writes and commits every transaction, so the sync path is exercised.
Decision: compare against a select-only baseline (-S, shown below); if write/commit runs are slow while the read-only run is fine, focus on WAL + sync path (SLOG, device latency, queue depth).
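
The comparison runs look like this: the default tpcb-like script for a write-heavy commit path, and -S for a select-only baseline (same benchdb as above):

cr0x@server:~$ sudo -u postgres pgbench -c 16 -j 16 -T 60 benchdb
cr0x@server:~$ sudo -u postgres pgbench -c 16 -j 16 -T 60 -S benchdb

If latency is poor only in the write-heavy run and flat in the select-only run, you are looking at the sync path, not at plans or caching.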

Fast diagnosis playbook

When production is slow, you don’t have time for philosophy. You need a short path to the bottleneck.
Here’s the order that finds real problems quickly.

First: prove it’s sync latency (or not)

  • Check Postgres: are commits slow or are queries slow for other reasons?
  • Look at pg_stat_statements and transaction latency distribution, not just average TPS.
  • If you see write latency spikes aligned with commits/checkpoints, suspect sync path.

cr0x@server:~$ sudo -u postgres psql -XAtc "select checkpoints_timed, checkpoints_req, buffers_checkpoint, buffers_clean from pg_stat_bgwriter;"
120|8|981234|44321

Decision: high checkpoints_req relative to timed checkpoints suggests pressure and checkpoint-related stalls; investigate WAL and checkpoint tuning, but also storage latency.
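
If pg_stat_statements is installed (it is an extension plus a shared_preload_libraries entry, not a default), the per-statement latency spread is one query away:

cr0x@server:~$ sudo -u postgres psql -XAtc "select calls, round(mean_exec_time::numeric,2) as mean_ms, round(stddev_exec_time::numeric,2) as stddev_ms, left(query,60) from pg_stat_statements order by mean_exec_time desc limit 5;"

High stddev on simple INSERT/UPDATE statements is the classic signature of commit-time stalls rather than bad plans.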

Second: check ZFS is healthy and not throttling itself

  • zpool status for errors and resilver/scrub activity.
  • zpool iostat -v 1 for where writes are going (main vdevs vs logs).

Third: find the device that’s actually slow

  • iostat -x 1 and look for high await and high %util.
  • If SLOG exists, check it explicitly.
  • Confirm PCIe placement and whether the “fast” device is on a shared bus or behind a weird controller.

Fourth: check caching and memory pressure

  • ARC size and misses.
  • Linux swapping or memory reclaim storms.
  • Postgres shared_buffers vs OS cache vs ARC interactions.

Fifth: check fragmentation and write amplification signals

  • zpool list for capacity; high fullness hurts performance.
  • Dataset recordsize vs actual I/O pattern.
  • Autotrim, compression ratio trends.

Common mistakes: symptom → root cause → fix

These are the ones that show up on call rotations because they’re subtle, survivable… until they aren’t.

1) “Commits are slow and jittery” → no SLOG (or a bad one) → add a proper SLOG

Symptom: TPS looks fine until concurrency rises; commit latency jumps. Users see random pauses.

Root cause: synchronous WAL writes are landing on a busy RAIDZ vdev or slow disks; or the SLOG is consumer-grade without PLP and exhibits latency spikes.

Fix: use mirrored, low-latency, power-loss-protected SLOG devices; verify with zpool iostat -v and iostat -x.

2) “Performance got worse after tuning recordsize” → recordsize mismatch and write amplification → revert and test

Symptom: after setting recordsize=8K everywhere, sequential scans and backups got slower; CPU overhead rose.

Root cause: tiny recordsize increases metadata overhead and reduces prefetch efficiency for large reads.

Fix: use 16K (sometimes 32K) for data, keep WAL smaller; benchmark with your real queries.

3) “We lost committed data after a power event” → sync=disabled (or unsafe SLOG) → restore durability and revalidate

Symptom: database restarts cleanly but missing recent writes; replicas disagree; audit logs don’t match.

Root cause: ZFS acknowledged sync writes without stable storage. Could be explicit sync=disabled or a device that lies.

Fix: set sync=standard; use a proper SLOG; run consistency checks and validate application assumptions about durability.

4) “Periodic 5–10 second stalls” → pool near full or checkpoint bursts → free space and smooth the write pattern

Symptom: stalls at roughly regular intervals; I/O looks fine otherwise.

Root cause: pool is too full (allocation gets expensive), or checkpoints are forcing huge flushes.

Fix: keep pools comfortably below high utilization; tune Postgres checkpoint/WAL settings; ensure WAL and data datasets are tuned separately.

5) “CPU is high, IO seems low, queries still slow” → compression or checksumming blamed incorrectly → check memory and lock contention

Symptom: storage charts look calm, but latency is high and CPUs are busy.

Root cause: you’re not I/O bound; maybe buffer contention, vacuum pressure, or a mis-sized shared_buffers/ARC battle.

Fix: use Postgres views (pg_stat_activity, pg_stat_bgwriter, pg_stat_io on newer versions) to find the real bottleneck. Don’t scapegoat ZFS.

Three mini-stories from corporate life

Mini-story 1: The incident caused by a wrong assumption

A mid-sized SaaS company migrated their primary Postgres cluster from ext4 on hardware RAID to ZFS on a pair of shiny NVMe mirrors.
They did the migration carefully: rehearsed, measured, even did a weekend cutover. The graphs looked great. Everyone went home.

A week later, a power maintenance event didn’t go as planned. The racks came back, Postgres started, replication caught up, and everything looked “fine.”
Then customer support tickets started: missing updates from the last few minutes before the outage. Not corruption. Not a crashed cluster. Just… absent transactions.

The wrong assumption was subtle and very human: they believed “NVMe is fast, and ZFS is safe, so we can disable sync to match ext4 performance.”
They had set sync=disabled on the dataset, thinking Postgres would still be safe because fsync=on in Postgres. But ZFS was returning success
without forcing stable storage. Postgres had been doing the right thing; the filesystem had simply declined to participate.

The post-incident fix was boring: restore sync=standard, add a mirrored SLOG with PLP, and document that “fast mode” for a subset of workloads is
synchronous_commit=off on specific sessions—not a filesystem-level global lie.

The real lesson wasn’t “never optimize.” It was: don’t optimize by changing the meaning of “committed.” Your application already has a dial for that.

Mini-story 2: The optimization that backfired

Another org—large, regulated, and fond of committees—had a reporting Postgres system that ran large sequential scans and batch writes overnight.
They read a blog post about matching recordsize to database page size and decided to enforce recordsize=8K across the pool “for consistency.”
A change request was filed. Everyone nodded. Consistency is comforting.

Within a week, the overnight jobs started taking longer. Then the “morning catch-up” window started overlapping peak hours.
The batch system wasn’t more write-heavy than before; it was just doing more work per byte stored.

The reason: the workload had huge sequential reads (analytics-style queries), and the smaller recordsize increased metadata overhead and reduced
read efficiency. ZFS prefetch and block aggregation couldn’t do their jobs as well. Compression was still on, but now it was compressing lots of small records
and tracking them individually. The CPU overhead increased. The disks weren’t saturated; the system was just spending more time on bookkeeping.

They fixed it by splitting datasets by behavior: data warehouse tablespaces on a dataset with recordsize=128K (sometimes 256K for very large scans),
and OLTP-ish datasets on 16K. WAL stayed at 16K. Everyone got to keep their “consistency,” just not at the pool-wide level.

The quiet takeaway: one number cannot describe “a database workload.” If you can’t name the I/O pattern, you can’t tune it.

Mini-story 3: The boring but correct practice that saved the day

A financial services team ran Postgres on ZFS with a conservative setup: mirrors, compression on, sync standard, and a mirrored SLOG.
Nothing fancy. Their best engineer described it as “aggressively uninteresting.” This was meant as praise.

They also had a habit that other teams mocked: weekly scrubs, and a dashboard that showed checksum errors even when “everything works.”
On paper, scrubs were “extra I/O.” In meetings, scrubs were “why are we doing this to ourselves.” In reality, scrubs were the canary.

One week, scrub reported a small number of checksum errors on a single device. No application errors yet. No customer symptoms.
They replaced the device on schedule, resilvered cleanly, and moved on. Weeks later, a similar model SSD was found to have a firmware issue
under a specific power-state transition. Teams without proactive detection found out the exciting way.

Their outcome was unglamorous: no incident, no outage, no data loss. The best kind of story in operations is the one you almost don’t tell.
But when your job is to keep promises to the business, “boring” is a feature.

Checklists / step-by-step plan

Step-by-step: build a new Postgres-on-ZFS host the sane way

  1. Design the pool for latency first: mirrors for primary OLTP if you care about commit latency. RAIDZ is for capacity and throughput.
  2. Choose correct sector alignment: confirm ashift=12 when creating the pool.
  3. Add a SLOG only if you need it (sync-heavy workloads) and only if it’s PLP-protected. Mirror it.
  4. Create separate datasets: tank/pg/data, tank/pg/wal, optionally tank/pg/tmp and tank/pg/backups.
  5. Set properties per dataset: compression, recordsize, logbias, atime, xattr.
  6. Wire Postgres directories correctly and verify WAL location.
  7. Benchmark with pgbench and a production-like query set. Measure commit latency percentiles.
  8. Enable scrubs and monitor errors (see the cron sketch after this plan). Treat checksum errors like smoke alarms, not like suggestions.
  9. Snapshot intentionally: coordinate with WAL archiving/retention and your recovery objectives.
  10. Document the durability contract: what settings are allowed (synchronous_commit), and what is forbidden (sync=disabled on critical datasets).
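
For step 8, a plain cron entry is enough if your distro does not already schedule scrubs (the day and hour are examples):

cr0x@server:~$ echo '0 3 * * 0 root /usr/sbin/zpool scrub tank' | sudo tee /etc/cron.d/zfs-scrub-tank

Then make zpool status part of whatever your monitoring already scrapes, so checksum errors page a human instead of waiting for the next incident review.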

Operational checklist: before you change any ZFS property in production

  • Can you roll back? (Property changes are easy; performance regressions are not.)
  • Do you have a workload reproduction (pgbench profile, replay, or at least a known test query set)?
  • Will it change durability semantics? If yes, do you have explicit approval?
  • Did you capture before/after: commit latency p95/p99, device await, zpool iostat?
  • Is the pool healthy, scrubbed, and not resilvering?

FAQ

1) Should I run PostgreSQL on ZFS at all?

Yes, if you value data integrity, snapshots, and operational tooling, and you’re willing to understand sync behavior.
If your team treats storage like a black box, you’ll still learn—just during an incident.

2) Do I need a SLOG for PostgreSQL?

Only if synchronous write latency is your bottleneck. Many read-heavy or async-heavy workloads won’t benefit much.
If you do need one, use PLP-protected devices and mirror them.

3) Is sync=disabled ever acceptable?

On critical Postgres datasets: no. On truly disposable temp datasets: maybe, if you’re explicit about losing it during a crash.
If you want less durability for a subset of operations, use Postgres knobs like synchronous_commit.

4) What recordsize should I use for Postgres data?

Start at 16K. Consider 8K for write-heavy OLTP if you measure reduced write amplification. Consider larger recordsize (64K–128K)
for analytics tablespaces with big sequential scans. Separate datasets make this easy.

5) Should WAL be on a separate dataset?

Yes. It gives you targeted tuning and cleaner operational boundaries. Performance isolation is a bonus; clarity is the main win.

6) Does ZFS compression help databases?

Usually, yes. LZ4 is the common default because it’s fast and low-risk. It reduces physical I/O, which is what most databases are actually waiting on.

7) How do snapshots interact with PostgreSQL consistency?

ZFS snapshots are crash-consistent at the filesystem level. For application-consistent backups, coordinate with Postgres: use base backups,
WAL archiving, or quiesce appropriately. Snapshots are great plumbing; they’re not magic consistency sauce.

8) Should I disable atime?

Yes for Postgres datasets. atime updates are write noise. Keep it off unless you have a compliance requirement that actually depends on atime semantics.

9) Mirrors or RAIDZ for Postgres?

Mirrors when latency matters (OLTP, commit-heavy). RAIDZ when capacity efficiency matters and your workload is more sequential or tolerant of latency.
You can also mix: keep a mirror vdev “fast lane” for WAL or hot data and a RAIDZ vdev for colder data, but be careful with complexity.

10) What’s the simplest safe configuration that performs well?

Mirror pool, compression=lz4, atime=off, data dataset at 16K recordsize, WAL dataset at 16K recordsize, sync=standard.
Add mirrored SLOG only if commit latency demands it.

Conclusion: next steps you can execute this week

If you’re already running Postgres on ZFS, you don’t need a heroic refactor. You need two things: dataset boundaries that match reality, and a sync path that
matches your durability promises.

  1. Split datasets into at least data and WAL. Apply properties intentionally.
  2. Measure commit latency under write load and correlate it with zpool iostat and iostat -x.
  3. If sync is the bottleneck, fix it correctly: mirrored PLP SLOG or targeted Postgres durability changes—no filesystem lies.
  4. Keep compression on and stop paying for atime.
  5. Make scrubs and error monitoring a habit. “No incidents” is not luck; it’s detection plus boring discipline.

The goal is a system that doesn’t require a storage expert to operate day-to-day. ZFS can do that for PostgreSQL.
But you have to treat sync semantics as a contract, not a suggestion.
