PostgreSQL is honest: when it’s slow, it usually tells you why. ZFS is also honest—just not always in the same dialect.
Put them together and you can get a database that’s boringly reliable… or a system that looks fine on dashboards right up
until it doesn’t.
The pain point is always the same: sync writes. WAL durability. Latency spikes that show up only under commit-heavy load.
Someone suggests “just set sync=disabled” and suddenly you’re having an unplanned meeting about “data integrity posture.”
The strategy: what to do, what not to do
If you want PostgreSQL on ZFS to work in production, you need to stop thinking in generic filesystem tips and start thinking
in durability boundaries. PostgreSQL’s durability boundary is WAL + fsync behavior (plus checkpoints and replication).
ZFS’s durability boundary is the ZIL (and SLOG if present), transaction groups, and where synchronous writes land.
The dataset and sync strategy that works is not exotic:
- Keep PostgreSQL data and WAL on separate datasets (and ideally separate vdevs if you can).
- Leave PostgreSQL safety features on: fsync=on, full_page_writes=on, and sane synchronous_commit choices.
- Use ZFS sync=standard for most things. It respects POSIX sync semantics, and Postgres uses them for a reason.
- If commit latency is the bottleneck, fix the sync path correctly: provide a proper SLOG device (power-loss protected) or change the application durability requirement intentionally (e.g., synchronous_commit=off for specific workloads; see the SQL sketch after this list).
- Use compression. It usually helps latency and throughput on database pages, and it’s one of the rare “free lunches” that actually delivers in the field.
- Do not set sync=disabled on the database dataset unless you’ve written down the blast radius and got explicit sign-off from people who will be paged at 3 a.m.
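If a specific write path really can tolerate losing the last few transactions on a crash, scope the tradeoff to those transactions instead of to the filesystem. A minimal sketch in application-side SQL (the table and values are made up):

BEGIN;
SET LOCAL synchronous_commit = off;  -- applies only to this transaction; a crash may lose it, never corrupt it
INSERT INTO metrics_raw (ts, note) VALUES (now(), 'bulk sample');
COMMIT;

SET LOCAL resets at commit, so the rest of the session keeps full durability.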
The easiest way to get fired is to “optimize durability” without telling anyone. The second easiest way is to buy fast devices
and then make ZFS wait on slow sync I/O anyway.
Joke #1: If you set sync=disabled on a database and call it “eventually durable,” you’ve invented a new storage tier: hope.
Facts and context that change decisions
These aren’t trivia for trivia’s sake. Each one should nudge a real design choice.
- ZFS is copy-on-write (from the original Sun Microsystems work). PostgreSQL already expects to overwrite 8 KB pages; ZFS turns that into new block allocations. That affects write amplification and fragmentation.
- The ZIL is always there, even without a SLOG. A SLOG is not “turning on the ZIL,” it’s moving the log device to something faster and safer for sync writes.
- PostgreSQL’s default page size is 8 KB. ZFS’s default recordsize is often 128 KB. That mismatch is not inherently wrong, but it determines read-modify-write behavior on updates and how much data gets dirtied per write.
- Historically, many “ZFS is slow for databases” claims came from misconfigured pools: wrong ashift on 4K-sector disks, no SLOG for sync-heavy workloads, or RAM-starved ARC behavior.
- PostgreSQL WAL is sequential-ish but not purely sequential. It’s append-heavy, but fsync behavior can create bursts, and archiving/replication can create read pressure at awkward times.
- ZFS checksums every block. That means you can detect silent corruption that other stacks would happily feed your database until it “mysteriously” fails later. For databases, this is not optional luxury; it’s integrity.
- Compression became “safe by default” in many ZFS shops only after LZ4 matured (and CPUs got faster). Today, leaving compression off is often leaving performance on the table.
- PostgreSQL durability is not a binary switch. You can choose durability at the transaction level (synchronous_commit) or at the replication level (synchronous replication), rather than resorting to filesystem sabotage.
Dataset layout that survives real workloads
The most common mistake is putting everything under one dataset, tuning it once, and assuming you’re done.
PostgreSQL has at least three I/O personalities: data files, WAL, and “stuff that is important but not worth optimizing at the cost of durability”
(logs, dumps, base backups staging, etc.).
A pragmatic dataset tree
Here’s a layout that maps to how Postgres behaves and how ZFS can be tuned without self-sabotage:
- tank/pg: parent dataset for all PostgreSQL storage.
- tank/pg/data: main data directory (PGDATA), excluding WAL if you relocate it.
- tank/pg/wal: WAL directory (symlink pg_wal here, or point initdb --waldir at it when creating the cluster).
- tank/pg/tmp: temp tables / sorts if you place temp_tablespaces here (optional; often better on local NVMe and acceptable to lose).
- tank/pg/backups: backups and base backup staging (snapshot targets, WAL archive spool, etc.).
Why separate datasets? Because recordsize, logbias, compression, and sometimes primarycache decisions differ.
WAL wants low latency and predictable sync behavior. Data files want read efficiency, caching, and sane fragmentation control.
Where to physically place WAL
If you can afford it, WAL deserves fast storage and a clean sync path. Options:
- Same pool, separate dataset: simplest, still lets you tune properties; performance depends on vdev layout.
- WAL on dedicated mirror vdevs: since ZFS cannot pin a dataset to specific vdevs inside one pool, this in practice means a small separate mirrored pool for WAL. Good when the main pool is RAIDZ and WAL latency is suffering, but operationally heavier (more things to monitor, more failure domains).
Don’t move WAL to “fast but sketchy” storage without power-loss protection if you’re relying on synchronous durability. WAL is the diary of what happened; you don’t write your diary in disappearing ink.
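If you relocate pg_wal with a symlink, do it with the cluster stopped. A sketch, assuming the Debian-style paths and service name used elsewhere in this article:

cr0x@server:~$ sudo systemctl stop postgresql@16-main
cr0x@server:~$ sudo chown postgres:postgres /var/lib/postgresql/16/wal && sudo chmod 700 /var/lib/postgresql/16/wal
cr0x@server:~$ sudo rsync -a /var/lib/postgresql/16/main/pg_wal/ /var/lib/postgresql/16/wal/
cr0x@server:~$ sudo mv /var/lib/postgresql/16/main/pg_wal /var/lib/postgresql/16/main/pg_wal.old
cr0x@server:~$ sudo ln -s /var/lib/postgresql/16/wal /var/lib/postgresql/16/main/pg_wal
cr0x@server:~$ sudo systemctl start postgresql@16-main

Keep pg_wal.old around until you have verified a clean start and a successful checkpoint; only then reclaim the space.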
Sync semantics: PostgreSQL vs ZFS (and where SLOG fits)
PostgreSQL writes WAL records and uses fsync() (or fdatasync()) to ensure durability when a transaction commits, depending on settings.
When Postgres says “sync,” it means: “If you lose power right now, the committed transaction should still be committed after crash recovery.”
ZFS with sync=standard honors synchronous requests by using the ZIL. The ZIL is not a write cache for everything; it is a log of synchronous writes so they can be replayed after a crash. Without a dedicated SLOG device, the ZIL lives on the main pool, meaning sync writes land on your data disks. With a SLOG, they land on the SLOG (still written safely), then later get committed to the main pool with normal transaction group behavior.
What sync really means on a dataset
- sync=standard: honor sync requests. This is the default you want for databases.
- sync=always: treat all writes as synchronous, even if the app didn’t ask. Usually worse for Postgres; it turns non-critical writes into a latency tax.
- sync=disabled: lie to the application about durability. Fast. Also a career-limiting move when the power blips.
SLOG: what it is, what it isn’t
A SLOG is a dedicated device for the ZIL. It improves latency for synchronous writes by providing a fast place to commit them.
But it must be the right kind of device: low latency, high write endurance, and—non-negotiable—power-loss protection.
If the SLOG lies about durability (consumer SSD with volatile cache and no PLP), you’ve effectively moved the sync=disabled gamble into hardware.
It may “work” until it doesn’t, which is the least useful reliability pattern in operations.
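Adding a mirrored SLOG to an existing pool is a one-liner; the device paths below are placeholders for whatever PLP-protected devices you actually bought:

cr0x@server:~$ sudo zpool add tank log mirror /dev/disk/by-id/nvme-PLP-A /dev/disk/by-id/nvme-PLP-B

Log vdevs can also be removed later with zpool remove if the experiment doesn’t pay off.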
One quote you can hang an incident review on
Paraphrasing John Allspaw: reliability comes from making failure visible and survivable, not from believing it won’t happen.
Recommended ZFS properties for Postgres (with rationale)
These are opinionated defaults that work across a lot of real installations. You can deviate, but do it deliberately and test with your workload.
Pool-level prerequisites (before datasets)
- Correct ashift for your drives (typically 12 for 4K sectors). Get this wrong and you buy permanent write amplification. (A pool-creation sketch follows this list.)
- Mirrors for latency-sensitive workloads (including WAL-heavy Postgres). RAIDZ can be fine for capacity and throughput, but small sync I/O latency is usually better on mirrors.
- Enough RAM for ARC, but don’t starve Postgres shared buffers. ARC will happily expand; your OOM killer won’t be impressed.
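A minimal pool-creation sketch with these prerequisites baked in; the device paths are placeholders for your actual drives:

cr0x@server:~$ sudo zpool create -o ashift=12 -O compression=lz4 -O atime=off tank mirror /dev/disk/by-id/nvme-DISK-A /dev/disk/by-id/nvme-DISK-B

Using /dev/disk/by-id paths keeps the pool stable across device renumbering.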
Dataset: tank/pg/data
- recordsize=16K or recordsize=8K: start with 16K for many OLTP workloads; 8K can reduce read-modify-write on updates but may hurt large sequential reads. Measure.
- compression=lz4: usually improves effective IOPS and reduces write amplification. It’s one of the few knobs that’s both safer and faster.
- atime=off: Postgres doesn’t need atime. Stop paying for it.
- xattr=sa (if supported): stores extended attributes with the dnode instead of in hidden directories; reduces metadata I/O.
- logbias=latency (default): favor low latency for synchronous operations. For the data dataset, leaving the default is fine; we’ll tune the WAL dataset separately.
Dataset: tank/pg/wal
- recordsize=16K: WAL writes are append-ish; you don’t need a huge recordsize. Avoid 128K here.
- compression=lz4: WAL sometimes compresses well, but the win is often modest. Still, it usually doesn’t hurt and can reduce device writes.
- logbias=latency or logbias=throughput:
This one needs explanation. logbias=throughput tells ZFS to bias toward writing sync data to the main pool rather than the SLOG in some patterns.
On a system with a good SLOG, you often want logbias=latency. On a system without a SLOG, WAL might already be living on the main pool ZIL; the bias matters less.
In practice: if you have a proper SLOG and commit latency is a bottleneck, keep logbias=latency for WAL. If you’re saturating the SLOG or it’s not the bottleneck, consider throughput.
Don’t cargo-cult it. Measure commit latency and ZIL behavior.
- sync=standard: yes, still. Your “fast mode” should live in Postgres transaction settings, not filesystem lies.
- primarycache=metadata (sometimes): if WAL reads are pushing useful data out of the cache, limit caching to metadata for WAL. This depends on your RAM and workload.
Dataset: tank/pg/tmp (optional)
- sync=disabled can be acceptable here if it is truly disposable temp space and you’re comfortable losing it on a crash. Treat it like a scratch disk. (A wiring sketch follows below.)
- compression=lz4, atime=off.
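If you go this route, wiring it into Postgres looks roughly like this; the tablespace name is made up, and only temporary objects should ever live here:

cr0x@server:~$ sudo chown postgres:postgres /var/lib/postgresql/tmp && sudo chmod 700 /var/lib/postgresql/tmp
cr0x@server:~$ sudo -u postgres psql -c "CREATE TABLESPACE scratch LOCATION '/var/lib/postgresql/tmp';"
cr0x@server:~$ sudo -u postgres psql -c "ALTER SYSTEM SET temp_tablespaces = 'scratch';"
cr0x@server:~$ sudo -u postgres psql -c "SELECT pg_reload_conf();"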
Joke #2: The only thing more “temporary” than tank/pg/tmp is the confidence of someone who didn’t test a failover.
Practical tasks: commands, outputs, and what you decide
These are the field exercises: what you run, what the output means, and what decision you make. The goal is to stop arguing from vibes.
Task 1: Confirm pool health (because everything else is pointless if it’s broken)
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: ONLINE
scan: scrub repaired 0B in 02:11:03 with 0 errors on Sun Dec 22 03:00:12 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme0n1p2 ONLINE 0 0 0
nvme1n1p2 ONLINE 0 0 0
logs
nvme2n1p1 ONLINE 0 0 0
errors: No known data errors
What it means: pool is ONLINE, scrub is clean, SLOG device is present and online.
Decision: proceed with performance tuning. If you see DEGRADED, checksum errors, or a dead log device, stop and fix reliability first.
Task 2: Verify ashift (silent performance killer)
cr0x@server:~$ sudo zdb -C tank | grep -E "ashift|vdev_tree" -n | head
56: vdev_tree:
74: ashift: 12
What it means: ashift: 12 indicates 4K sectors. Good for modern SSDs/HDDs.
Decision: if ashift is 9 on 4K drives, you’ll pay with write amplification forever. Migration/rebuild is the fix, not a knob.
Task 3: Inspect current dataset properties
cr0x@server:~$ sudo zfs get -o name,property,value,source -r recordsize,compression,atime,sync,logbias,primarycache,xattr tank/pg
NAME PROPERTY VALUE SOURCE
tank/pg atime off local
tank/pg compression lz4 local
tank/pg logbias latency default
tank/pg primarycache all default
tank/pg recordsize 128K default
tank/pg sync standard default
tank/pg xattr sa local
tank/pg/data recordsize 16K local
tank/pg/wal recordsize 16K local
tank/pg/wal logbias latency local
What it means: you can see what’s inherited and what’s explicitly set.
Decision: set properties on the dataset that matches the behavior (data vs WAL), not the parent “because it’s convenient.”
Task 4: Create datasets with sane defaults
cr0x@server:~$ sudo zfs create -o mountpoint=/var/lib/postgresql tank/pg
cr0x@server:~$ sudo zfs create -o mountpoint=/var/lib/postgresql/16/main tank/pg/data
cr0x@server:~$ sudo zfs create -o mountpoint=/var/lib/postgresql/16/wal tank/pg/wal
cr0x@server:~$ sudo zfs create -o mountpoint=/var/lib/postgresql/tmp tank/pg/tmp
What it means: datasets exist and mount where Postgres expects (adjust for your distro/version).
Decision: keep your filesystem boundaries aligned with operational needs: separate snapshots, separate properties, separate monitoring.
Task 5: Apply properties (data)
cr0x@server:~$ sudo zfs set atime=off compression=lz4 xattr=sa recordsize=16K sync=standard tank/pg/data
cr0x@server:~$ sudo zfs get -o property,value -H atime,compression,xattr,recordsize,sync tank/pg/data
atime off
compression lz4
xattr sa
recordsize 16K
sync standard
What it means: the data dataset is tuned for Postgres page-ish behavior and reduced metadata churn.
Decision: start at 16K. If updates are heavy and you see write amplification, test 8K. Don’t guess.
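One caveat before you test: recordsize only applies to blocks written after the change, so existing tables keep their old block size until they are rewritten. A quick way to force a rewrite, using the pgbench database and table created later in this article:

cr0x@server:~$ sudo zfs set recordsize=8K tank/pg/data
cr0x@server:~$ sudo -u postgres psql -d benchdb -c "VACUUM FULL pgbench_accounts;"

VACUUM FULL writes the table into new files, which pick up the new recordsize; a restore or pg_repack achieves the same thing with different locking tradeoffs.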
Task 6: Apply properties (WAL)
cr0x@server:~$ sudo zfs set atime=off compression=lz4 recordsize=16K sync=standard logbias=latency tank/pg/wal
cr0x@server:~$ sudo zfs get -o property,value -H atime,compression,recordsize,sync,logbias tank/pg/wal
atime off
compression lz4
recordsize 16K
sync standard
logbias latency
What it means: WAL is prepared for low-latency sync behavior.
Decision: if commit latency is still high, this points toward SLOG/device latency, not “more tuning.”
Task 7: Confirm where Postgres is writing WAL
cr0x@server:~$ sudo -u postgres psql -XAtc "show data_directory; show hba_file; show config_file;"
/var/lib/postgresql/16/main
/var/lib/postgresql/16/main/pg_hba.conf
/etc/postgresql/16/main/postgresql.conf
cr0x@server:~$ sudo -u postgres psql -XAtc "select pg_walfile_name(pg_current_wal_lsn());"
00000001000000020000003A
cr0x@server:~$ sudo ls -ld /var/lib/postgresql/16/main/pg_wal
lrwxrwxrwx 1 postgres postgres 26 Dec 25 10:44 /var/lib/postgresql/16/main/pg_wal -> /var/lib/postgresql/16/wal
What it means: WAL directory is redirected correctly.
Decision: if WAL is still inside the data dataset, you lose tuning isolation and snapshots become more annoying than they should be.
Task 8: Check if you’re bottlenecked on sync commits
cr0x@server:~$ sudo -u postgres psql -XAtc "select name, setting from pg_settings where name in ('fsync','synchronous_commit','wal_sync_method');"
fsync|on
synchronous_commit|on
wal_sync_method|fdatasync
What it means: Postgres is behaving safely: fsync on, synchronous commits.
Decision: if latency is unacceptable, solve it with a SLOG or with targeted Postgres-level durability tradeoffs—not by disabling sync at ZFS.
Task 9: Measure ZIL/SLOG activity and spot sync pressure
cr0x@server:~$ sudo zpool iostat -v tank 1 5
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 980G 2.65T 120 1800 9.2M 145M
mirror-0 980G 2.65T 120 1700 9.2M 132M
nvme0n1p2 - - 60 850 4.6M 66M
nvme1n1p2 - - 60 850 4.6M 66M
logs - - 0 420 0 4.1M
nvme2n1p1 - - 0 420 0 4.1M
-------------------------- ----- ----- ----- ----- ----- -----
What it means: the logs vdev is doing writes. That’s sync traffic hitting the SLOG.
Decision: if SLOG write ops are high and commit latency is high, your SLOG device may be too slow, saturated, or misbehaving.
Task 10: Watch latency directly with iostat (device-level truth)
cr0x@server:~$ iostat -x 1 3
Linux 6.8.0 (server) 12/25/2025 _x86_64_ (32 CPU)
Device r/s w/s rkB/s wkB/s await svctm %util
nvme0n1 80.0 600.0 5120 65536 2.10 0.20 14.0
nvme1n1 78.0 590.0 5000 64500 2.05 0.19 13.5
nvme2n1 0.0 420.0 0 4200 0.35 0.05 2.5
What it means: low await on SLOG device suggests it’s not the bottleneck.
Decision: if await on the log device spikes into milliseconds under commit load, fix the SLOG (device, PCIe topology, firmware, PLP) or don’t use one.
Task 11: Verify compression is actually helping (and not burning CPU)
cr0x@server:~$ sudo zfs get -o name,property,value -H compressratio,compression tank/pg/data tank/pg/wal
tank/pg/data compressratio 1.62x
tank/pg/data compression lz4
tank/pg/wal compressratio 1.10x
tank/pg/wal compression lz4
What it means: data compresses well, WAL not so much. That’s normal.
Decision: keep compression on; the data win usually outweighs everything. If CPU is maxed, measure before changing.
Task 12: Check ARC pressure and whether ZFS is fighting Postgres for RAM
cr0x@server:~$ sudo arcstat 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
10:52:01 520 40 7 30 75 10 25 0 0 64G 96G
10:52:02 610 55 9 45 82 10 18 0 0 64G 96G
10:52:03 590 50 8 41 82 9 18 0 0 64G 96G
What it means: ARC is large but not at cap, miss% is reasonable. If miss% is high and Postgres is also caching heavily, you may be double-caching.
Decision: tune ARC max if the OS is swapping or Postgres is starved; or reduce Postgres shared_buffers if ARC is doing the job better for your access pattern.
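If you decide to cap ARC on Linux, the knob is the zfs module parameter; 34359738368 (32 GiB) is just an example value, not a recommendation:

cr0x@server:~$ echo 34359738368 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
cr0x@server:~$ echo "options zfs zfs_arc_max=34359738368" | sudo tee /etc/modprobe.d/zfs.conf

The runtime change takes effect gradually as the ARC shrinks; the modprobe.d entry makes it persistent (on some distros the module loads from the initramfs, so you may also need to regenerate it).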
Task 13: Check transaction group timing (a source of periodic latency)
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_txg_timeout
5
What it means: TXG commit interval is 5 seconds (common default). Bursty workloads can see periodic flush behavior.
Decision: don’t change it just because you can. If you see regular latency spikes aligned with TXG commits, investigate write pressure and device latency before touching this.
Task 14: Confirm autotrim status (SSD performance over time)
cr0x@server:~$ sudo zpool get autotrim tank
NAME PROPERTY VALUE SOURCE
tank autotrim on local
What it means: TRIM is enabled; SSDs stay healthier under sustained writes.
Decision: enable it for SSD pools unless you have a specific reason not to.
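If it were off, enabling it and running a one-time manual TRIM looks like this:

cr0x@server:~$ sudo zpool set autotrim=on tank
cr0x@server:~$ sudo zpool trim tank
cr0x@server:~$ sudo zpool status -t tank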
Task 15: Snapshot policy sanity check (because backups are an operational feature)
cr0x@server:~$ sudo zfs list -t snapshot -o name,creation -S creation | head
NAME CREATION
tank/pg/data@hourly-2025-12-25-10 Thu Dec 25 10:00 2025
tank/pg/wal@hourly-2025-12-25-10 Thu Dec 25 10:00 2025
tank/pg/data@hourly-2025-12-25-09 Thu Dec 25 09:00 2025
tank/pg/wal@hourly-2025-12-25-09 Thu Dec 25 09:00 2025
What it means: consistent snapshot cadence exists for both datasets.
Decision: snapshots are not backups by themselves, but they enable fast rollback and replication. Ensure the WAL retention/archiving strategy matches snapshot cadence.
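One subtlety worth stating: snapshotting data and WAL with two separate commands is not atomic across datasets. A recursive snapshot of the parent is taken in a single transaction group, so the pair stays consistent as a unit (the snapshot name below is just an example):

cr0x@server:~$ sudo zfs snapshot -r tank/pg@hourly-2025-12-25-11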
Task 16: Test sync write latency with pgbench and correlate
cr0x@server:~$ sudo -u postgres pgbench -i -s 50 benchdb
dropping old tables...
creating tables...
generating data...
vacuuming...
creating primary keys...
done.
cr0x@server:~$ sudo -u postgres pgbench -c 16 -j 16 -T 60 -S benchdb
transaction type:
scaling factor: 50
query mode: simple
number of clients: 16
number of threads: 16
duration: 60 s
number of transactions actually processed: 920000
latency average = 1.043 ms
tps = 15333.201 (without initial connection time)
What it means: this is read-only (-S, select-only) and should not stress sync commits.
Decision: run a write-heavy test too; if only write/commit tests are slow, focus on WAL + sync path (SLOG, device latency, queue depth).
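A write-heavy counterpart uses pgbench’s default TPC-B-like script, with interval progress reporting and per-statement timing (output omitted; what you care about is how commit latency moves under load):

cr0x@server:~$ sudo -u postgres pgbench -c 16 -j 16 -T 60 -P 5 -r benchdb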
Fast diagnosis playbook
When production is slow, you don’t have time for philosophy. You need a short path to the bottleneck.
Here’s the order that finds real problems quickly.
First: prove it’s sync latency (or not)
- Check Postgres: are commits slow or are queries slow for other reasons?
- Look at pg_stat_statements and the transaction latency distribution, not just average TPS.
- If you see write latency spikes aligned with commits/checkpoints, suspect the sync path.
cr0x@server:~$ sudo -u postgres psql -XAtc "select checkpoints_timed, checkpoints_req, buffers_checkpoint, buffers_clean from pg_stat_bgwriter;"
120|8|981234|44321
Decision: high checkpoints_req relative to timed checkpoints suggests pressure and checkpoint-related stalls; investigate WAL and checkpoint tuning, but also storage latency.
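If pg_stat_statements is installed, a quick look at where execution time actually goes (column names as of PostgreSQL 13+; these are averages, so treat them as a pointer, not proof of tail latency):

cr0x@server:~$ sudo -u postgres psql -Xc "SELECT left(query, 60) AS query, calls, round(mean_exec_time::numeric, 2) AS mean_ms, round(max_exec_time::numeric, 2) AS max_ms FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"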
Second: check ZFS is healthy and not throttling itself
- zpool status for errors and resilver/scrub activity.
- zpool iostat -v 1 for where writes are going (main vdevs vs. logs).
Third: find the device that’s actually slow
- iostat -x 1 and look for high await and high %util.
- If a SLOG exists, check it explicitly.
- Confirm PCIe placement and whether the “fast” device is on a shared bus or behind a weird controller.
Fourth: check caching and memory pressure
- ARC size and misses.
- Linux swapping or memory reclaim storms.
- Postgres shared_buffers vs OS cache vs ARC interactions.
Fifth: check fragmentation and write amplification signals
- zpool list for capacity; high fullness hurts performance.
- Dataset recordsize vs. actual I/O pattern.
- Autotrim, compression ratio trends.
Common mistakes: symptom → root cause → fix
These are the ones that show up on call rotations because they’re subtle, survivable… until they aren’t.
1) “Commits are slow and jittery” → no SLOG (or a bad one) → add a proper SLOG
Symptom: TPS looks fine until concurrency rises; commit latency jumps. Users see random pauses.
Root cause: synchronous WAL writes are landing on a busy RAIDZ vdev or slow disks; or the SLOG is consumer-grade without PLP and exhibits latency spikes.
Fix: use mirrored, low-latency, power-loss-protected SLOG devices; verify with zpool iostat -v and iostat -x.
2) “Performance got worse after tuning recordsize” → recordsize mismatch and write amplification → revert and test
Symptom: after setting recordsize=8K everywhere, sequential scans and backups got slower; CPU overhead rose.
Root cause: tiny recordsize increases metadata overhead and reduces prefetch efficiency for large reads.
Fix: use 16K (sometimes 32K) for data, keep WAL smaller; benchmark with your real queries.
3) “We lost committed data after a power event” → sync=disabled (or unsafe SLOG) → restore durability and revalidate
Symptom: database restarts cleanly but missing recent writes; replicas disagree; audit logs don’t match.
Root cause: ZFS acknowledged sync writes without stable storage. Could be explicit sync=disabled or a device that lies.
Fix: set sync=standard; use a proper SLOG; run consistency checks and validate application assumptions about durability.
4) “Periodic 5–10 second stalls” → pool near full or checkpoint bursts → free space and smooth the write pattern
Symptom: stalls at roughly regular intervals; I/O looks fine otherwise.
Root cause: pool is too full (allocation gets expensive), or checkpoints are forcing huge flushes.
Fix: keep pools comfortably below high utilization; tune Postgres checkpoint/WAL settings; ensure WAL and data datasets are tuned separately.
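On the Postgres side, checkpoint smoothing usually comes down to three settings; the values here are illustrative starting points, not recommendations for your hardware:

cr0x@server:~$ sudo -u postgres psql -c "ALTER SYSTEM SET max_wal_size = '16GB';"
cr0x@server:~$ sudo -u postgres psql -c "ALTER SYSTEM SET checkpoint_timeout = '15min';"
cr0x@server:~$ sudo -u postgres psql -c "ALTER SYSTEM SET checkpoint_completion_target = 0.9;"
cr0x@server:~$ sudo -u postgres psql -c "SELECT pg_reload_conf();"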
5) “CPU is high, IO seems low, queries still slow” → compression or checksumming blamed incorrectly → check memory and lock contention
Symptom: storage charts look calm, but latency is high and CPUs are busy.
Root cause: you’re not I/O bound; maybe buffer contention, vacuum pressure, or a mis-sized shared_buffers/ARC battle.
Fix: use Postgres views (pg_stat_activity, pg_stat_bgwriter, pg_stat_io on newer versions) to find the real bottleneck. Don’t scapegoat ZFS.
Three mini-stories from corporate life
Mini-story 1: The incident caused by a wrong assumption
A mid-sized SaaS company migrated their primary Postgres cluster from ext4 on hardware RAID to ZFS on a pair of shiny NVMe mirrors.
They did the migration carefully: rehearsed, measured, even did a weekend cutover. The graphs looked great. Everyone went home.
A week later, a power maintenance event didn’t go as planned. The racks came back, Postgres started, replication caught up, and everything looked “fine.”
Then customer support tickets started: missing updates from the last few minutes before the outage. Not corruption. Not a crashed cluster. Just… absent transactions.
The wrong assumption was subtle and very human: they believed “NVMe is fast, and ZFS is safe, so we can disable sync to match ext4 performance.”
They had set sync=disabled on the dataset, thinking Postgres would still be safe because fsync=on in Postgres. But ZFS was returning success
without forcing stable storage. Postgres had been doing the right thing; the filesystem had simply declined to participate.
The post-incident fix was boring: restore sync=standard, add a mirrored SLOG with PLP, and document that “fast mode” for a subset of workloads is
synchronous_commit=off on specific sessions—not a filesystem-level global lie.
The real lesson wasn’t “never optimize.” It was: don’t optimize by changing the meaning of “committed.” Your application already has a dial for that.
Mini-story 2: The optimization that backfired
Another org—large, regulated, and fond of committees—had a reporting Postgres system that ran large sequential scans and batch writes overnight.
They read a blog post about matching recordsize to database page size and decided to enforce recordsize=8K across the pool “for consistency.”
A change request was filed. Everyone nodded. Consistency is comforting.
Within a week, the overnight jobs started taking longer. Then the “morning catch-up” window started overlapping peak hours.
The batch system wasn’t more write-heavy than before; it was just doing more work per byte stored.
The reason: the workload had huge sequential reads (analytics-style queries), and the smaller recordsize increased metadata overhead and reduced
read efficiency. ZFS prefetch and block aggregation couldn’t do its job as well. Compression was still on, but now it was compressing lots of small records
and tracking them individually. The CPU overhead increased. The disks weren’t saturated; the system was just spending more time on bookkeeping.
They fixed it by splitting datasets by behavior: data warehouse tablespaces on a dataset with recordsize=128K (sometimes 256K for very large scans),
and OLTP-ish datasets on 16K. WAL stayed at 16K. Everyone got to keep their “consistency,” just not at the pool-wide level.
The quiet takeaway: one number cannot describe “a database workload.” If you can’t name the I/O pattern, you can’t tune it.
Mini-story 3: The boring but correct practice that saved the day
A financial services team ran Postgres on ZFS with a conservative setup: mirrors, compression on, sync standard, and a mirrored SLOG.
Nothing fancy. Their best engineer described it as “aggressively uninteresting.” This was meant as praise.
They also had a habit that other teams mocked: weekly scrubs, and a dashboard that showed checksum errors even when “everything works.”
On paper, scrubs were “extra I/O.” In meetings, scrubs were “why are we doing this to ourselves.” In reality, scrubs were the canary.
One week, scrub reported a small number of checksum errors on a single device. No application errors yet. No customer symptoms.
They replaced the device on schedule, resilvered cleanly, and moved on. Weeks later, a similar model SSD was found to have a firmware issue
under a specific power-state transition. Teams without proactive detection found out the exciting way.
Their outcome was unglamorous: no incident, no outage, no data loss. The best kind of story in operations is the one you almost don’t tell.
But when your job is to keep promises to the business, “boring” is a feature.
Checklists / step-by-step plan
Step-by-step: build a new Postgres-on-ZFS host the sane way
- Design the pool for latency first: mirrors for primary OLTP if you care about commit latency. RAIDZ is for capacity and throughput.
- Choose correct sector alignment: confirm ashift=12 when creating the pool.
- Add a SLOG only if you need it (sync-heavy workloads) and only if it’s PLP-protected. Mirror it.
- Create separate datasets: tank/pg/data, tank/pg/wal, optionally tank/pg/tmp and tank/pg/backups.
- Set properties per dataset: compression, recordsize, logbias, atime, xattr.
- Wire Postgres directories correctly and verify WAL location.
- Benchmark with pgbench and a production-like query set. Measure commit latency percentiles.
- Enable scrubs and monitor errors. Treat checksum errors like smoke alarms, not like suggestions.
- Snapshot intentionally: coordinate with WAL archiving/retention and your recovery objectives.
- Document the durability contract: what settings are allowed (synchronous_commit), and what is forbidden (sync=disabled on critical datasets).
Operational checklist: before you change any ZFS property in production
- Can you roll back? (Property changes are easy; performance regressions are not.)
- Do you have a workload reproduction (pgbench profile, replay, or at least a known test query set)?
- Will it change durability semantics? If yes, do you have explicit approval?
- Did you capture before/after: commit latency p95/p99, device await, zpool iostat?
- Is the pool healthy, scrubbed, and not resilvering?
FAQ
1) Should I run PostgreSQL on ZFS at all?
Yes, if you value data integrity, snapshots, and operational tooling, and you’re willing to understand sync behavior.
If your team treats storage like a black box, you’ll still learn—just during an incident.
2) Do I need a SLOG for PostgreSQL?
Only if synchronous write latency is your bottleneck. Many read-heavy or async-heavy workloads won’t benefit much.
If you do need one, use PLP-protected devices and mirror them.
3) Is sync=disabled ever acceptable?
On critical Postgres datasets: no. On truly disposable temp datasets: maybe, if you’re explicit about losing it during a crash.
If you want less durability for a subset of operations, use Postgres knobs like synchronous_commit.
4) What recordsize should I use for Postgres data?
Start at 16K. Consider 8K for write-heavy OLTP if you measure reduced write amplification. Consider larger recordsize (64K–128K)
for analytics tablespaces with big sequential scans. Separate datasets make this easy.
5) Should WAL be on a separate dataset?
Yes. It gives you targeted tuning and cleaner operational boundaries. Performance isolation is a bonus; clarity is the main win.
6) Does ZFS compression help databases?
Usually, yes. LZ4 is the common default because it’s fast and low-risk. It reduces physical I/O, which is what most databases are actually waiting on.
7) How do snapshots interact with PostgreSQL consistency?
ZFS snapshots are crash-consistent at the filesystem level. For application-consistent backups, coordinate with Postgres: use base backups,
WAL archiving, or quiesce appropriately. Snapshots are great plumbing; they’re not magic consistency sauce.
8) Should I disable atime?
Yes for Postgres datasets. atime updates are write noise. Keep it off unless you have a compliance requirement that actually depends on atime semantics.
9) Mirrors or RAIDZ for Postgres?
Mirrors when latency matters (OLTP, commit-heavy). RAIDZ when capacity efficiency matters and your workload is more sequential or tolerant of latency.
You can also mix: keep a mirror vdev “fast lane” for WAL or hot data and a RAIDZ vdev for colder data, but be careful with complexity.
10) What’s the simplest safe configuration that performs well?
Mirror pool, compression=lz4, atime=off, data dataset at 16K recordsize, WAL dataset at 16K recordsize, sync=standard.
Add mirrored SLOG only if commit latency demands it.
Conclusion: next steps you can execute this week
If you’re already running Postgres on ZFS, you don’t need a heroic refactor. You need two things: dataset boundaries that match reality, and a sync path that
matches your durability promises.
- Split datasets into at least data and WAL. Apply properties intentionally.
- Measure commit latency under write load and correlate it with zpool iostat and iostat -x.
- If sync is the bottleneck, fix it correctly: mirrored PLP SLOG or targeted Postgres durability changes. No filesystem lies.
- Keep compression on and stop paying for atime.
- Make scrubs and error monitoring a habit. “No incidents” is not luck; it’s detection plus boring discipline.
The goal is a system that doesn’t require a storage expert to operate day-to-day. ZFS can do that for PostgreSQL.
But you have to treat sync semantics as a contract, not a suggestion.