Media storage is where good filesystems go to die quietly. You build a nice pool, copy in a few terabytes of video, add a second client, then suddenly “it was fine yesterday” becomes the most repeated sentence in your house or your office.
ZFS can make media workloads boring—in the best possible way—if you stop treating it like a generic filesystem and start tuning it for what media actually is: big sequential reads, big sequential writes, and metadata that explodes as libraries grow.
What media workloads really need (and what they don’t)
Let’s get something out of the way: most “media servers” aren’t IOPS-bound. They’re usually throughput-bound or latency-spiky from metadata and small random reads. A typical movie file is large and read mostly sequentially. Even multiple concurrent streams are often sequential reads from different offsets.
What breaks things is the stuff around the media:
- Library scans: lots of tiny reads and metadata lookups; a Plex/Jellyfin/Emby scan can look like a small-file workload disguised as a media workload.
- Thumbnails, previews, subtitles: small files that cause random I/O and metadata churn.
- Downloads/transcodes: sequential writes plus CPU-heavy tasks, sometimes in the same datasets as finished media (don’t do that).
- Backups and replication: sustained sequential reads that compete with streaming.
So the game is: optimize the large-file path without sabotaging the metadata and small-file path. ZFS gives you knobs to do exactly that, per dataset. Use them.
Opinionated baseline: if you’re storing large, mostly-read media files, you should be thinking about recordsize, compression, sync, atime, and special vdev / metadata locality before you waste time arguing about RAIDZ vs mirrors on Reddit.
Storage tuning is like dieting: the fastest gains come from stopping the obviously bad habits, not from buying a fancier scale.
Big records: how recordsize actually behaves
Recordsize is not “block size,” and it’s not a promise
ZFS stores file data in variable-sized blocks up to a maximum. For filesystems (datasets of type filesystem), that maximum is recordsize, and the default is 128K. For media, that's usually conservative.
Key behavior: ZFS will use smaller blocks when it has to. If you set recordsize=1M, you won't force every I/O to be 1M. Files smaller than one record are stored in a single block sized to fit, and files written before you changed the property keep their old block layout until they're rewritten.
Why big records help media
Media playback loves long sequential reads. Bigger records can mean:
- Fewer I/O operations for the same throughput (lower per-I/O overhead).
- Better compression opportunities (more data per compression decision).
- Fewer indirect blocks for big files (less metadata to chase).
But bigger records are not free:
- Worse read amplification for small random reads (reading a 1M record to fetch 4K can be silly).
- Potentially more painful rewrites if your workload edits in-place (rare for finalized media files, common for VM images—different problem).
- More RAM pressure in some caching patterns (ARC holds more large buffers; not always a problem, but don’t ignore it).
What to set for media
For finalized media files (movies, episodes, music archives), set recordsize=1M on the dataset that stores them. If your OpenZFS supports it and your workload is extremely sequential, recordsize=2M can work, but 1M is the sweet spot in practice: big gain, minimal weirdness.
Do not set a single global recordsize for everything. Your downloads directory, thumbnails, and application metadata should not be dragged into the “big record” world. Keep them in separate datasets with appropriate recordsize (often default 128K or even smaller for extremely metadata-heavy workloads).
And what about volblocksize?
If you’re exporting iSCSI LUNs or zvol-backed storage, this article is only partially for you. volblocksize is fixed at zvol creation time and behaves differently. Media storage is usually file-based (SMB/NFS), so focus on recordsize.
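For completeness, a minimal sketch of the zvol case; the pool and zvol names below are placeholders, and 64K is just an example block size. The point is that volblocksize can only be chosen at creation, so decide before you carve the LUN.
cr0x@server:~$ sudo zfs create -p -V 200G -o volblocksize=64K tank/vols/lun0
cr0x@server:~$ zfs get -o name,property,value volblocksize tank/vols/lun0
NAME            PROPERTY      VALUE
tank/vols/lun0  volblocksize  64K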
Big compression: when it wins, when it lies
Compression is not just about saving space
On modern CPUs, compression often increases throughput because it trades CPU cycles for fewer disk reads/writes. For media, that sounds pointless because “video is already compressed.” True, but incomplete.
Even in media libraries you’ll find:
- Text-heavy sidecars (subtitles, NFO metadata): compressible.
- Artwork: sometimes already compressed, sometimes not (PNG can still compress a little; raw images compress a lot).
- Download artifacts (logs, temporary files): highly compressible.
- Audio formats: FLAC is compressed, WAV is not.
So compression “wins” by default more often than people expect. And when it doesn’t save much space, it can still reduce I/O for metadata and sidecars.
Pick the algorithm like an adult
My usual picks for media datasets:
- compression=lz4 for always-on, low-risk, low-latency compression.
- compression=zstd for datasets with lots of metadata, documents, or mixed content. Start around zstd-3 to zstd-6. Higher levels can be fine, but test your CPU headroom.
For purely finalized video files, compression won’t dramatically shrink the movie. But it also usually won’t hurt. The only real “don’t do this” is picking a heavy compression level on a box that is already CPU-bound transcoding.
If you enable max-level compression on a server that’s already transcoding 4K, you’ve basically invented a space heater with feelings.
What compression can break
Compression itself rarely breaks correctness; ZFS is solid here. What breaks is your latency budget if you turn a CPU-sipping box into a CPU-melting box. Symptoms look like “disk is slow” because everything queues behind CPU.
Also: compression ratios can mislead. A dataset can show a “meh” overall ratio while still saving a lot of I/O for the non-video parts that cause latency spikes.
Dataset layout for media: keep it boring and fast
Split datasets by I/O behavior
Don’t put everything under one dataset called tank/media and call it a day. You want different properties for different behavior. A practical layout:
- tank/media — finalized media files (big sequential reads). recordsize=1M, compression=lz4 (or a modest zstd level), atime=off.
- tank/downloads — active writes, partial files, unpacking. Leave recordsize at the default 128K; compression helps a lot here.
- tank/app — application state (Plex/Jellyfin metadata). Often benefits from default recordsize; consider a special vdev if you have one.
- tank/backups — replication targets; tune independently to match send/receive patterns.
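If you’re building this layout from scratch, here’s a minimal creation sketch; the pool and dataset names match the examples in this article, and zstd-3 is just one reasonable starting level:
cr0x@server:~$ sudo zfs create -o recordsize=1M -o compression=lz4 -o atime=off tank/media
cr0x@server:~$ sudo zfs create -o compression=zstd-3 -o atime=off tank/downloads
cr0x@server:~$ sudo zfs create -o compression=lz4 -o atime=off tank/app
cr0x@server:~$ sudo zfs create -o compression=lz4 tank/backups
Properties are inherited, so children you later create under tank/media (per library, per show) pick up the same settings without extra work.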
atime: just turn it off for media
atime updates file access times. On media, that’s just extra metadata writes for no benefit. Set atime=off on media datasets and anything read-heavy. The only time you want atime=on is if an application uses it for logic (rare these days).
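A quick, read-only way to spot datasets that are still updating atime; nothing below changes state:
cr0x@server:~$ zfs get -r -o name,value atime tank | grep -v off
NAME             VALUE
tank             on
tank/app         on
Anything left in that list is a candidate for zfs set atime=off, unless you know a workflow depends on it.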
sync, log devices, and the media reality check
Most media ingestion is asynchronous by nature: downloads, file copies, rips. If you’re serving media over SMB/NFS, client behavior and protocol settings dictate how much synchronous write pressure you get.
Guidance:
- Don’t set sync=disabled casually. It makes benchmarks look great and makes incident reports look worse.
- If you have an SSD SLOG and your workload actually issues sync writes (databases, VM stores, or strict NFS), sync=standard plus a good SLOG can help (a sketch follows this list). For pure media read-mostly datasets, it’s often irrelevant.
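If a SLOG is genuinely warranted, the moving parts look like this; the NVMe device paths are placeholders, and the mirror is deliberate because a lost SLOG during a crash is exactly the moment you needed it:
cr0x@server:~$ zfs get -o name,property,value sync tank/downloads
NAME            PROPERTY  VALUE
tank/downloads  sync      standard
cr0x@server:~$ sudo zpool add tank log mirror /dev/disk/by-id/nvme-slog0 /dev/disk/by-id/nvme-slog1
If zpool iostat -v later shows the log vdev sitting idle, your workload wasn’t issuing sync writes and the SLOG is decorative.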
Special vdev: the underrated media upgrade
Large media files are sequential, but the metadata for a big library is not. A special vdev (fast SSDs dedicated to metadata and small blocks) can make library browsing and scans snappy without moving terabytes to flash.
Rules of engagement:
- Treat special vdev as critical. If it dies and you didn’t mirror it, your pool can be toast.
- Set special_small_blocks thoughtfully if you want small file data on the special vdev, not just metadata (sketch below).
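A sketch of both steps, assuming a pair of spare SSDs (the device paths are placeholders). Keep special_small_blocks below the dataset’s recordsize, or every newly written block qualifies and lands on the special vdev:
cr0x@server:~$ sudo zpool add tank special mirror /dev/disk/by-id/nvme-meta0 /dev/disk/by-id/nvme-meta1
cr0x@server:~$ sudo zfs set special_small_blocks=64K tank/app
cr0x@server:~$ zfs get -o name,property,value special_small_blocks tank/app
NAME      PROPERTY              VALUE
tank/app  special_small_blocks  64K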
Ashifts, sector sizes, and why your pool is forever
ashift is set at vdev creation and is painful to change. If you’re using 4K sector drives (most are), use ashift=12 (or higher if you truly have 8K). Mis-setting it won’t always show as “slow.” It often shows as “weirdly slower than it should be” plus extra write amplification.
This is one of those “pay now or pay forever” settings.
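Set it explicitly at pool creation rather than trusting drives to report their sector size honestly. A minimal sketch with placeholder device IDs:
cr0x@server:~$ sudo zpool create -o ashift=12 tank raidz2 \
    /dev/disk/by-id/ata-DISK0 /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
    /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4 /dev/disk/by-id/ata-DISK5
Verify it afterwards (Task 14 below shows the zdb check) before you copy in a single byte.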
One reliability idea worth keeping
Werner Vogels, paraphrased: everything fails, all the time. Design systems that expect failure and keep working.
Interesting facts and historical context (because it matters)
- ZFS was born at Sun Microsystems as a “storage pool” filesystem, merging volume management and filesystem semantics into one administrative model.
- Copy-on-write wasn’t a gimmick; it enabled consistent snapshots without freezing the filesystem, which is why replication workflows are so clean.
- End-to-end checksums are foundational: ZFS verifies data integrity from disk up to the application read path, not just “RAID says it’s fine.”
- The ARC is not “just cache”; it’s a memory-resident adaptive cache that can store both data and metadata and heavily influences perceived performance.
- LZ4 became the default in many distributions because it’s fast enough to be essentially “free” on modern CPUs for general workloads.
- Recordsize defaults were conservative because general-purpose filesystems must behave reasonably for mixed workloads, not because 128K is magically correct for video.
- RAIDZ was designed partly to avoid the classic RAID5 write hole and its partial-stripe parity penalty, but it still has tradeoffs for random I/O.
- The “special vdev” feature is relatively new compared to classic ZFS; it changed the economics of accelerating metadata-heavy workloads.
Practical tasks: commands, outputs, and decisions (12+)
Every task below includes: a command, realistic output, what it means, and what decision you make from it. Assumptions: Linux with OpenZFS tools installed, pool named tank, and datasets as described earlier.
Task 1: Confirm pool health and obvious red flags
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
status: Some supported features are not enabled on the pool.
action: Upgrade the pool to enable all features.
scan: scrub repaired 0B in 06:12:31 with 0 errors on Sun Dec 15 03:10:12 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-WDC_WD140EDFZ-... ONLINE 0 0 0
ata-WDC_WD140EDFZ-... ONLINE 0 0 0
ata-WDC_WD140EDFZ-... ONLINE 0 0 0
ata-WDC_WD140EDFZ-... ONLINE 0 0 0
ata-WDC_WD140EDFZ-... ONLINE 0 0 0
ata-WDC_WD140EDFZ-... ONLINE 0 0 0
errors: No known data errors
Meaning: The pool is online; scrub is clean; no read/write/checksum errors. “Some supported features not enabled” is informational.
Decision: If state isn’t ONLINE or you see CKSUM errors, stop tuning and start fixing hardware/cabling first. Performance tuning on a sick pool is performance theater.
Task 2: Check dataset properties that matter for media
cr0x@server:~$ zfs get -o name,property,value -s local,default recordsize,compression,atime,sync tank/media
NAME PROPERTY VALUE
tank/media recordsize 128K
tank/media compression off
tank/media atime on
tank/media sync standard
Meaning: This dataset is at generic defaults: 128K records, no compression, atime enabled.
Decision: For finalized media, change to recordsize=1M, compression=lz4, atime=off. Leave sync alone unless you understand your client sync-write behavior.
Task 3: Apply media-friendly settings (safely, per dataset)
cr0x@server:~$ sudo zfs set recordsize=1M compression=lz4 atime=off tank/media
Meaning: New files will use up to 1M records; existing blocks don’t magically rewrite.
Decision: If you need existing files to benefit, plan a rewrite workflow (Task 8).
Task 4: Verify settings took effect
cr0x@server:~$ zfs get -o name,property,value recordsize,compression,atime tank/media
NAME PROPERTY VALUE
tank/media recordsize 1M
tank/media compression lz4
tank/media atime off
Meaning: Dataset is now tuned for large sequential reads with low-overhead compression.
Decision: Move on to validation: are we actually seeing bigger blocks and better throughput?
Task 5: Observe real-time I/O patterns during playback or copy
cr0x@server:~$ iostat -x 2 3
Linux 6.8.0 (server) 12/25/2025 _x86_64_ (16 CPU)
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s w_await wareq-sz aqu-sz %util
sda 18.0 2400.0 0.0 0.0 8.1 133.3 0.5 64.0 2.0 128.0 0.2 14.0
sdb 17.5 2320.0 0.0 0.0 7.9 132.6 0.6 80.0 2.1 133.3 0.2 13.5
md0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Meaning: Reads are happening with average request sizes around ~128K at the block device level. That’s not necessarily “bad,” but it suggests ZFS may still be issuing smaller I/Os (or the workload is not purely sequential).
Decision: Check ZFS-side stats (ARC, prefetch, record sizes) and whether files were written before changing recordsize.
Task 6: Check ARC and memory pressure
cr0x@server:~$ arcstat 2 3
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
12:01:10 520 40 7 8 20 30 75 2 5 48.2G 52.0G
12:01:12 610 55 9 10 18 42 76 3 5 48.3G 52.0G
12:01:14 590 60 10 12 20 45 75 3 5 48.3G 52.0G
Meaning: ARC size is ~48G with a target c ~52G. Miss rate is low; prefetch misses exist (media streaming will trigger prefetch).
Decision: If ARC is tiny or constantly shrinking, investigate memory limits and other services. Media servers that also run containers and transcoding can starve ARC and then blame disks.
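If you suspect the ARC is being capped or squeezed, the Linux OpenZFS module parameter is the quick check; 0 means "use the built-in default," and the 32 GiB value below is only an example:
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_max
0
cr0x@server:~$ echo 34359738368 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
34359738368
Make any change persistent in /etc/modprobe.d if you decide to keep it.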
Task 7: Confirm actual on-disk block sizes of a media file
cr0x@server:~$ stat -c %i /tank/media/movie.mkv
128931
cr0x@server:~$ sudo zdb -dddddd tank/media 128931 | grep -A 5 "Indirect blocks"
Indirect blocks:
               0 L1  0:1b40000000:20000 20000L/4000P F=96 B=10021/10021
               0  L0 0:1a00000000:80000 80000L/80000P F=1 B=10021/10021
           80000  L0 0:1a00080000:80000 80000L/80000P F=1 B=10021/10021
          100000  L0 0:1a00100000:80000 80000L/80000P F=1 B=10021/10021
          180000  L0 0:1a00180000:80000 80000L/80000P F=1 B=10021/10021
Meaning: stat gives the file’s inode number, which is the object ID zdb expects. The L0 data blocks are 0x80000 bytes, i.e. 512K, not 1M. That’s still larger than 128K and probably reflects the recordsize in effect when the file was written.
Decision: If you need 1M blocks, rewrite or re-copy the file after setting recordsize=1M (Task 8). Don’t chase ghosts: recordsize is a maximum, not a guarantee.
Task 8: Rewrite existing media to adopt new recordsize/compression
cr0x@server:~$ rsync -aHv /tank/media/ /tank/media_new/
sending incremental file list
movie.mkv
episode01.mkv
episode02.mkv
sent 119,845,232,110 bytes received 1,842 bytes 112,394,820.31 bytes/sec
total size is 119,842,001,002 speedup is 1.00
Meaning: You’re copying files into a separate dataset (tank/media_new) that already carries the new properties, so the data gets laid out with 1M records and compression as it lands.
Decision: After verification, swap mountpoints or rename datasets. Avoid rewriting in place unless you’re confident and have backups; media libraries are large enough to make mistakes expensive.
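Once you trust the new copy, the swap itself is mundane; a sketch assuming the target dataset is tank/media_new and clients are stopped during the switch:
cr0x@server:~$ sudo zfs rename tank/media tank/media_old
cr0x@server:~$ sudo zfs rename tank/media_new tank/media
cr0x@server:~$ zfs get -o name,property,value mountpoint tank/media
NAME        PROPERTY    VALUE
tank/media  mountpoint  /tank/media
Keep tank/media_old around until you’re confident, then destroy it deliberately; that last command is where mistakes get expensive.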
Task 9: Check compression effectiveness on the dataset
cr0x@server:~$ zfs get -o name,property,value compressratio,compression tank/media
NAME PROPERTY VALUE
tank/media compression lz4
tank/media compressratio 1.08x
Meaning: Overall savings is ~8%. For a “mostly video” dataset, that’s plausible: sidecars and metadata compress; video barely does.
Decision: Keep compression on. You’re getting savings with minimal CPU. If CPU is constrained, stay with lz4 instead of heavy zstd levels.
Task 10: Measure real throughput with a read test (and interpret it carefully)
cr0x@server:~$ dd if=/tank/media/movie.mkv of=/dev/null bs=8M status=progress
40097546240 bytes (40 GB, 37 GiB) copied, 38 s, 1.1 GB/s
48273149952 bytes (48 GB, 45 GiB) copied, 46 s, 1.0 GB/s
50331648000 bytes (50 GB, 47 GiB) copied, 49 s, 1.0 GB/s
Meaning: You’re reading around 1.0 GB/s. That could be ARC (cache) or disks, depending on repetition and system state.
Decision: Dropping caches for a clean re-test isn’t straightforward with ZFS (the ARC is not the page cache). Instead, test with multiple large files and watch zpool iostat to confirm physical reads (Task 11).
Task 11: Observe ZFS vdev throughput and queueing
cr0x@server:~$ zpool iostat -v tank 2 3
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 48.2T 26.1T 320 12 1.02G 45.3M
raidz2-0 48.2T 26.1T 320 12 1.02G 45.3M
sda - - 55 2 174M 7.5M
sdb - - 54 2 171M 7.4M
sdc - - 53 2 170M 7.6M
sdd - - 54 2 171M 7.6M
sde - - 52 2 168M 7.6M
sdf - - 52 2 167M 7.6M
---------- ----- ----- ----- ----- ----- -----
Meaning: Reads are spread across drives, aggregate ~1.02 GB/s. That’s physical. Good.
Decision: If bandwidth is low but clients are buffering, look at network and CPU. If disks show high ops but low bandwidth, you’re doing small random I/O and should focus on metadata/special vdev/app dataset tuning.
Task 12: Find whether metadata is the real bottleneck (directory traversal pain)
cr0x@server:~$ sudo zpool iostat -r tank 1 5
tank          sync_read    sync_write    async_read    async_write      scrub         trim
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0      0      0
4K            412      0     18      0    265     34     12      4      0      0      0      0
8K             96     25      6      3     88     47      5      8      0      0      0      0
16K            31     19      2      2     24     28      2      6      0      0      0      0
32K             7     12      0      1      8     15      1      4      0      0      0      0
64K             2      5      0      0      3      9      0      3      0      0      0      0
128K            1      2      0      0      1      3      0      1      0      0      0      0
256K            0      0      0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0
Meaning: The request-size histogram shows almost all reads landing in the 4K-16K buckets: metadata, thumbnails, and small files, not streaming. Healthy large-file playback would pile up in the 128K-1M rows instead.
Decision: Consider special vdev or moving app metadata to SSD-backed pool/dataset. Also check primarycache and ARC sizing.
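While you’re in there, glance at the cache-related properties; this is read-only and the defaults are usually correct:
cr0x@server:~$ zfs get -o name,property,value primarycache,secondarycache tank/media tank/app
NAME        PROPERTY        VALUE
tank/media  primarycache    all
tank/media  secondarycache  all
tank/app    primarycache    all
tank/app    secondarycache  all
If someone previously set primarycache=metadata on the media dataset "to save RAM," that alone can explain sluggish streaming: file data then always comes from disk.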
Task 13: Check for pathological fragmentation and free space pressure
cr0x@server:~$ zpool list -o name,size,allocated,free,fragmentation,capacity,health tank
NAME SIZE ALLOC FREE FRAG CAP HEALTH
tank 74.3T 48.2T 26.1T 38% 64% ONLINE
Meaning: 38% fragmentation and 64% capacity used is not alarming. Fragmentation matters more at high usage and with random writes.
Decision: If you’re above ~80% capacity and frag climbs, expect performance cliffs during writes and resilvers. Plan expansion before the pool becomes a packed suitcase.
Task 14: Validate sector alignment (ashift) on vdevs
cr0x@server:~$ zdb -C tank | grep -E "ashift|vdev_tree" -n | head
56: vdev_tree:
72: ashift: 12
Meaning: ashift=12 indicates 4K sectors. Good for modern disks.
Decision: If ashift is 9 on 4K drives, performance and endurance suffer. Fixing it requires rebuilding vdevs/pool; decide early, not after 50TB.
Task 15: Check whether sync writes are unexpectedly killing ingest
cr0x@server:~$ zpool iostat tank 2 3
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 48.2T 26.1T 10 220 4.2M 38.0M
---------- ----- ----- ----- ----- ----- -----
Meaning: Lots of writes but modest bandwidth can indicate small sync writes or metadata churn.
Decision: Check your workload (SMB durable handles, NFS sync, app behavior). If you truly have sync writes and care about ingest latency, consider a mirrored SLOG on power-loss-safe SSDs. If you don’t know, don’t touch sync=disabled.
Task 16: Confirm snapshots aren’t silently ballooning space
cr0x@server:~$ zfs list -o name,used,usedbysnapshots,refer,avail -r tank/media | head
NAME USED USEDBYSNAPSHOTS REFER AVAIL
tank/media 22.1T 3.4T 18.7T 26.1T
Meaning: 3.4T is held by snapshots. That’s not “bad,” but it explains why deletes don’t free space.
Decision: If space pressure is rising, implement snapshot retention (keep a few, prune the rest) especially for datasets with churn like downloads and app metadata.
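A sketch of manual pruning on the churny dataset; the snapshot names are hypothetical, and in practice a retention tool or script should be doing this for you:
cr0x@server:~$ zfs list -r -t snapshot -o name,used -s creation tank/downloads | head -n 4
NAME                            USED
tank/downloads@auto-2025-09-01  412G
tank/downloads@auto-2025-10-01  388G
tank/downloads@auto-2025-11-01  120G
cr0x@server:~$ sudo zfs destroy -nv tank/downloads@auto-2025-09-01
would destroy tank/downloads@auto-2025-09-01
would reclaim 412G
Drop the -n once the dry run lists exactly what you expect, and nothing more.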
Fast diagnosis playbook: find the bottleneck in minutes
This is the checklist I actually use when someone says “streaming is buffering” or “copies are slow.” The goal is to avoid hours of tweaking when the problem is a cable, a full pool, or a single bad disk doing interpretive dance.
First: is the pool healthy and not rebuilding?
- Run zpool status -v. If you see DEGRADED, resilvering, or checksum errors, performance is secondary. Fix health first.
- Check if a scrub/resilver is running and saturating I/O.
Second: is it disks, CPU, or network?
- Disks: zpool iostat -v 2 and iostat -x 2. Look for one device with high await/%util relative to others.
- CPU: top or mpstat -P ALL 2. Look for a single-threaded bottleneck, or total CPU pegged (transcoding + compression + checksums can stack).
- Network: ip -s link and switch counters. Look for errors/drops or a link negotiated at 1Gb when you thought it was 10Gb.
Third: are you accidentally doing small random I/O?
- High IOPS, low bandwidth in zpool iostat is the tell.
- App metadata and thumbnail generation: move to SSD/special vdev, or isolate datasets.
- Check atime and disable it for read-heavy datasets.
Fourth: is the pool too full or too fragmented?
- zpool list for capacity and fragmentation.
- If you’re above ~80–85% usage, stop expecting miracles. Plan expansion, delete with snapshot awareness, or add vdevs.
Fifth: confirm your tuning matches the dataset
- zfs get recordsize,compression,atime for the dataset in question.
- If you changed recordsize recently, verify new files, not old ones. Old blocks persist.
Three corporate mini-stories from the trenches
1) Incident caused by a wrong assumption: “It’s just sequential reads, RAIDZ is fine”
A media production team built a shared NAS for editors. The workload “looked like media,” so they sized it for throughput: big HDD pool, RAIDZ2, lots of spindles, a fat network link. Everyone was happy during the first copy-in and the first week of playback.
Then the library grew. The asset manager started generating proxies, thumbnails, waveform previews, and index metadata. Editors began scrubbing timelines and jumping around files. The I/O pattern turned into a nasty mix: sequential reads for playback plus random reads for thumbnails plus random writes for metadata updates.
The wrong assumption was that “media equals sequential.” Media plus tools equals a mixed workload. RAIDZ2 wasn’t “bad,” but the pool had no fast place for metadata, and latency spikes made the UI feel broken. It looked like a network problem because clients buffered and the SMB sessions stayed connected. Everyone blamed the switch. Naturally.
The fix was boring: split datasets, move app metadata to SSDs, add a mirrored special vdev, and stop hammering the same dataset with both finalized media and constantly-changing metadata. Recordsize tuning helped, but the real win was giving metadata a fast lane.
After that, performance got predictable. Not “max benchmark” predictable—real predictable. The kind that stops Slack threads.
2) Optimization that backfired: turning off sync to “speed up ingest”
A corporate comms group ingested hundreds of GB per day of camera footage. Someone ran a quick benchmark, saw writes were slower than expected, and found the infamous knob: sync=disabled. Flip it, ingestion doubles. A hero is born.
Weeks later, a power event hit the building. The UPS did its best, but the host lost power mid-ingest. After reboot, the pool imported fine. The filesystem mounted. But the most recent files were corrupted in subtle ways—some clips played with glitches, some had missing tail segments, some failed checksum verification in the editing app.
No single “ZFS is broken” error appeared. That’s the trap. Disabling sync doesn’t break the pool; it breaks your promises. Applications that believed fsync() meant “on stable storage” were lied to. They did what they were supposed to do; the storage did not.
The remediation took time: re-ingest from source, implement proper power-loss protection, and restore sync=standard. They added a small mirrored SLOG on power-loss-safe SSDs, which got most of the ingest performance back while keeping semantics intact.
Performance tuning that depends on lying to applications is not tuning. It’s debt with interest.
3) Boring but correct practice that saved the day: monthly scrubs plus alerting
An internal streaming archive ran on a large ZFS pool. Nothing fancy: decent controllers, mirrored boot, RAIDZ2 data, and a special vdev mirrored for metadata. The team had one religious practice: scrubs on a schedule and alerts that actually paged someone who cared.
One month, a scrub reported a handful of checksum errors on a single disk. The system healed the data using redundancy, but the alert fired. The disk wasn’t obviously dead; SMART looked “fine,” because SMART loves optimism.
They replaced the disk during business hours, calmly, before it escalated into a multi-disk failure during a resilver. A week later, a second disk in the same batch began throwing read errors. Without the first replacement, that could have been a bad day.
No heroic tuning. No secret sysctl. Just scrubs, alerts, and a willingness to replace a disk that was telling on itself early. Reliability is mostly refusing to be surprised.
Common mistakes: symptom → root cause → fix
1) “Streaming buffers when someone runs a library scan”
Symptom: Playback pauses or UI becomes sluggish during scans, thumbnail generation, or metadata refresh.
Root cause: Metadata/small file workload is competing with streaming reads on the same vdevs; ARC is thrashed by small random reads.
Fix: Split datasets; move app metadata to SSD or add mirrored special vdev; keep media files in a big-record dataset; consider limiting scan concurrency in the app.
2) “I set recordsize=1M but nothing got faster”
Symptom: No measurable change after tuning.
Root cause: Existing files keep their old block layout; recordsize applies primarily to new writes.
Fix: Rewrite/re-copy files into a dataset with the new settings (rsync or cp; note that zfs send/receive preserves the original block layout, so it won’t re-block the data). Verify with zdb on a sample file.
3) “Writes are slow and iostat shows small writes with high latency”
Symptom: Ingest feels capped at tens of MB/s; latency spikes.
Root cause: Sync writes from SMB/NFS/app behavior, no SLOG, or slow disks forced to commit frequently.
Fix: Keep sync=standard. If sync writes are real and important, add a mirrored, power-loss-safe SLOG; otherwise adjust client/app settings that force sync.
4) “Pool got slower over time even though disks are healthy”
Symptom: Gradual performance decay, especially on writes and during maintenance.
Root cause: High pool fullness and fragmentation; lots of churn under snapshots; RAIDZ write amplification worsens at high utilization.
Fix: Keep pools under ~80–85% used; prune snapshots; add vdevs before you’re desperate; isolate churny datasets like downloads.
5) “SMB browsing is slow but streaming is fine”
Symptom: Listing directories and opening folders takes seconds; playback once started is okay.
Root cause: Metadata latency. Directory traversal and stat calls are small random reads; maybe atime updates too.
Fix: atime=off; special vdev for metadata; ensure ARC has enough RAM; don’t put metadata-heavy app state on the same spinning rust dataset as bulk media if you can avoid it.
6) “Checksum errors appear during scrub”
Symptom: zpool status shows repaired bytes or growing CKSUM counts.
Root cause: Failing disk, flaky cable/backplane, controller issues, or (less commonly) memory instability.
Fix: Treat it as a hardware incident. Reseat or replace cables, move drives to different bays, run SMART long tests, replace suspect disks. Don't "tune" checksum errors.
Checklists / step-by-step plan
Plan A: New media pool, done right
- Choose vdev topology based on failure domain and rebuild time: RAIDZ2 for large HDD pools is a reasonable default; mirrors if you need better random I/O and faster resilvers.
- Set ashift correctly at creation (typically 12 for modern 4K HDDs).
- Create datasets by workload: media, downloads, app metadata, backups.
- Set dataset properties:
  - Media: recordsize=1M, compression=lz4, atime=off.
  - Downloads: compression on; recordsize default; keep sync=standard.
  - App: default recordsize; consider SSD/special vdev.
- Implement snapshots with retention (especially for app and downloads). Keep the policy simple so it actually runs.
- Scrubs on schedule + alerts (a cron sketch follows this list). If you don’t alert, you’re not scrubbing; you’re journaling your regrets.
- Test with real workflows: simultaneous streams + scan + ingest. Synthetic tests only tell you how good your synthetic life would be.
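For the scrub bullet above, a minimal cron sketch; the binary path and schedule are examples, and some distros ship systemd scrub timers you can enable instead:
cr0x@server:~$ echo '0 3 1 * * root /usr/sbin/zpool scrub tank' | sudo tee /etc/cron.d/zfs-scrub-tank
0 3 1 * * root /usr/sbin/zpool scrub tank
Pair it with alerting you actually read (ZED email, your monitoring stack); a scrub nobody looks at is just disk exercise.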
Plan B: Existing pool with generic settings
- Confirm health: zpool status. Fix errors first.
- Split datasets: create tank/media, tank/app, tank/downloads if you don’t already have them.
- Apply media properties to tank/media only.
- Move app metadata off spinning disks if browsing is slow (SSD pool or special vdev).
- Re-evaluate capacity and fragmentation; make an expansion plan before you’re at 90% used.
Plan C: You’re already on fire (production buffering)
- Stop background tasks: pause scans, pause backups, postpone scrubs/resilvers if safe and policy allows.
- Run zpool iostat -v 2 and find any single slow drive; replace/relocate it if it’s clearly failing.
- Check network negotiation and errors.
- Confirm the workload: are you transcoding? CPU might be your “disk issue.”
- After stabilization, implement dataset separation and metadata acceleration.
FAQ
1) Should I always set recordsize=1M for media?
For finalized large media files, yes. For anything with small random reads/writes (app metadata, downloads in progress, thumbnails), no. Split datasets and tune per workload.
2) Will compression hurt video streaming performance?
Usually not with lz4. Video is already compressed, so savings are small, but CPU cost is low. The risk is using heavy zstd levels on a CPU that’s already busy transcoding.
3) If compression ratio is only 1.02x, is it pointless?
No. Even small overall gains can hide meaningful savings on sidecar files and metadata. Also, compressed blocks can mean fewer bytes read from disk for the compressible parts, reducing latency spikes.
4) What’s the best layout: RAIDZ2 or mirrors?
For bulk media on HDDs, RAIDZ2 is a sensible default for capacity and safety. Mirrors often feel faster for mixed random I/O and resilvers, but cost more in disks. If your bottleneck is metadata latency, a special vdev can be a bigger win than changing topology.
5) Do I need an L2ARC for a media server?
Rarely. Media streaming is often sequential and doesn’t benefit much from L2ARC unless you have repeated reads of the same content and insufficient RAM. If browsing is slow, fix metadata locality first.
6) Should I disable atime everywhere?
On media and read-heavy datasets, yes. On datasets where some workflow depends on atime semantics (uncommon), keep it on. If you don’t know, disable it for media and leave defaults for app datasets until proven otherwise.
7) Is sync=disabled ever acceptable?
Only when you explicitly accept data loss risk on power failure and you know the workload semantics. For business or anything you’d cry over, keep sync=standard and use proper hardware (UPS, SLOG if needed).
8) How full can I run a ZFS pool?
Try to stay under ~80–85% for predictable performance and resilience. Above that, fragmentation and allocation constraints can amplify latency, especially on RAIDZ during writes and maintenance.
9) Can I change ashift after creating the pool?
Not in-place in any sane way. The practical method is to rebuild: replace vdevs with correctly-sized ones or migrate data to a new pool. Choose ashift correctly at day zero.
10) How do snapshots affect media storage?
Snapshots are cheap until they aren’t. Media files don’t change often, so snapshots don’t grow much—unless you store churny downloads or app databases in the same dataset. Use separate datasets and snapshot retention that matches churn.
Conclusion: practical next steps
If you want the “big wins” without turning your storage into an experimental art project, do three things:
- Split datasets by behavior: finalized media, downloads, and app metadata should not share one set of properties.
- Make media boring: recordsize=1M, compression=lz4, atime=off on the media dataset.
- Diagnose with evidence: zpool status, zpool iostat, iostat, ARC stats, and capacity/fragmentation before you tweak knobs.
Then, if browsing and scans still feel like wading through molasses, stop blaming your disks and start treating metadata as a first-class workload: SSD-backed app datasets or a mirrored special vdev can change the experience more than any recordsize setting ever will.