You reboot the box and it’s fast again. Two weeks later it’s back: writes stall, latency spikes, databases start “mysteriously” timing out, and your dashboards look like a seismograph. The drive is “healthy,” the CPU is bored, and the app team is convinced you changed something. You didn’t. The SSD did.
This failure mode is common on Ubuntu 24.04 servers: SSD/NVMe performance slowly degrades as the drive’s free space map gets messy, garbage collection (GC) does more work, and TRIM/discard either isn’t happening or isn’t reaching the physical device. The good news: you can prove it with hard data, and you can fix it in ways you can verify.
What the slowdown looks like in real systems
Not all “NVMe slow” reports are TRIM/GC. But the “falls over time” pattern is a tell. Here’s what it typically looks like when discard/TRIM isn’t effective:
- Write latency gradually increases under steady workload, especially random writes and mixed read/write.
- Periodic latency spikes (hundreds of milliseconds to seconds) even when the application is not saturating bandwidth.
- Reboot or long idle improves performance temporarily. (Some drives do background GC more aggressively when idle; a reboot may also change workload patterns and give idle time.)
- Free space on the filesystem looks fine, yet the device behaves like it’s full. That’s the point: the SSD’s notion of “free” is not the filesystem’s unless TRIM tells it.
- iowait is not necessarily high. You can have a few threads blocked on storage and still melt the SLO.
Most teams discover this the same way: a production system that was “fine” at launch turns into a slow-motion incident generator. Your first instinct is to blame the database, then the kernel, then the cloud, then the intern. Don’t. Start with whether the SSD is being told what blocks are no longer in use.
Interesting facts and historical context (the short, useful kind)
- Early consumer SSDs could lose performance dramatically after being filled once, because they had limited overprovisioning and primitive garbage collection.
- TRIM was introduced to align the filesystem’s view of free space with the SSD’s. Without it, the SSD must assume every block ever written still matters.
- “Garbage collection” isn’t optional on flash: flash pages can be read and written, but erasure happens in larger erase blocks, so rewriting requires data movement.
- Write amplification is the enemy you don’t see: a small host write can cause many internal writes when the SSD is forced to consolidate valid pages.
- NVMe made latency and parallelism much better, but it didn’t repeal physics. GC is still there; it’s just happening at higher IOPS and sometimes with sharper cliffs.
- Linux has long supported TRIM, but “support” doesn’t mean “enabled end-to-end”. LUKS/dm-crypt, LVM and other device-mapper layers, and virtualized storage can all drop discards if not configured.
- Mounting with continuous discard was historically controversial because it could add overhead and fragmentation; periodic fstrim became the boring, effective default in many distros.
- Some storage arrays and cloud block devices ignore discards or translate them in ways that don’t actually free physical space. The guest OS thinks it helped; the backend shrugs.
Fast diagnosis playbook (first/second/third)
This is the “someone’s paging you” version. The goal is not perfect analysis; it’s to find the bottleneck in minutes and decide whether you’re chasing TRIM/GC or something else.
First: confirm it’s storage latency, not CPU/memory
- Check iostat for device latency (await) and utilization.
- Check per-process IO with pidstat or application metrics.
- Look for a pattern: latency rising over days/weeks, not a sudden step change after deployment.
Second: confirm the TRIM path exists end-to-end
- Is fstrim.timer enabled and succeeding?
- Does the filesystem support discard? Does the block layer accept discards? (A quick per-layer check is sketched right after this list.)
- Are you on LUKS/LVM/MD RAID where discards can be disabled or blocked?
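A quick way to eyeball the whole chain before you start digging (a sketch; the columns are standard lsblk fields, device names will differ):

cr0x@server:~$ lsblk -o NAME,TYPE,DISC-GRAN,DISC-MAX,MOUNTPOINTS

Every layer that is supposed to pass discards (disk, partition, crypt mapping, LV) needs non-zero DISC-GRAN and DISC-MAX; a zero partway up the stack marks the layer that is swallowing them.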
Third: reproduce and measure with a controlled workload
- Use fio to run a consistent random write test (carefully, on a non-production target or a spare partition).
- Run fstrim, then repeat the test and compare latency distribution, not just average MB/s.
If the post-TRIM run materially improves p95/p99 latency or sustains IOPS longer before cliffing, you have your culprit: the drive was starving for clean blocks.
Prove it’s TRIM/GC: measurements that survive arguments
TRIM and garbage collection debates tend to attract hand-waving. Don’t participate. Produce evidence that answers one question: does informing the device about freed blocks reduce write amplification symptoms and latency?
Two things matter:
- Latency distribution (p95/p99/p99.9), not just average throughput. GC pain often shows up as spikes.
- Before/after a discard event (manual fstrim or controlled workload with discard), using the same workload profile.
Also, keep in mind what TRIM is not:
- It’s not a magic “defrag” for SSDs.
- It does not instantly erase cells; it marks pages invalid so the SSD can erase blocks efficiently later.
- It does not fix a dying drive, a saturated controller, or a broken queueing setup.
One quote worth keeping on your wall, because it’s the operational posture you need here:
“Hope is not a strategy.” — General Gordon R. Sullivan
Now let’s do the work: concrete tasks with commands, outputs, and the decision you make next.
Practical tasks: commands, expected output, and what decision to make
These are ordered roughly the way I’d run them on an Ubuntu 24.04 server with suspected “SSD gets slower over time.” The commands are real and runnable. The outputs are representative; your device names will differ.
Task 1: Identify the actual block device(s)
cr0x@server:~$ lsblk -o NAME,TYPE,SIZE,FSTYPE,MOUNTPOINTS,ROTA,DISC-MAX,DISC-GRAN
NAME            TYPE   SIZE FSTYPE      MOUNTPOINTS ROTA DISC-MAX DISC-GRAN
nvme0n1         disk   1.8T                            0       2G        4K
├─nvme0n1p1     part   512M vfat        /boot/efi      0       2G        4K
└─nvme0n1p2     part   1.8T crypto_LUKS                0       2G        4K
  └─nvme0n1p2   crypt  1.8T LVM2_member                0       0B        0B
    └─vg0-root  lvm    300G ext4        /              0       0B        0B
What it means: You’re on NVMe (ROTA=0), and the stack includes LUKS crypto + LVM. That’s exactly where discards often die quietly.
Decision: You must verify discard support through dm-crypt and LVM, not just the filesystem.
Task 2: Check filesystem type and mount options (discard vs periodic)
cr0x@server:~$ findmnt -no SOURCE,FSTYPE,OPTIONS /
/dev/mapper/vg0-root ext4 rw,relatime,errors=remount-ro
What it means: No discard mount option. That’s not automatically wrong; Ubuntu often relies on periodic fstrim.
Decision: Don’t knee-jerk add discard. First, check whether periodic TRIM is enabled and working.
Task 3: Is the weekly TRIM timer enabled?
cr0x@server:~$ systemctl status fstrim.timer
● fstrim.timer - Discard unused blocks once a week
Loaded: loaded (/usr/lib/systemd/system/fstrim.timer; enabled; preset: enabled)
Active: active (waiting) since Mon 2025-12-22 00:12:01 UTC; 1 week 1 day ago
Trigger: Mon 2026-01-05 00:28:14 UTC; 5 days left
Triggers: ● fstrim.service
What it means: Timer is enabled and scheduled. That’s necessary, not sufficient.
Decision: Check the last run’s result. A timer can be enabled and still failing every time.
Task 4: Did fstrim actually succeed last run?
cr0x@server:~$ journalctl -u fstrim.service -n 50 --no-pager
Dec 29 00:28:14 server systemd[1]: Starting fstrim.service - Discard unused blocks on filesystems from /etc/fstab...
Dec 29 00:28:14 server fstrim[32107]: /: 0 B (0 bytes) trimmed on /dev/mapper/vg0-root
Dec 29 00:28:14 server systemd[1]: fstrim.service: Deactivated successfully.
Dec 29 00:28:14 server systemd[1]: Finished fstrim.service - Discard unused blocks on filesystems from /etc/fstab.
What it means: Trimmed 0 B. That can mean “nothing to trim” or “discard doesn’t pass through,” or “filesystem thinks nothing is free,” or “you’re on a virtual block device that ignores it.” On a busy system, 0 B week after week is suspicious.
Decision: Validate discard support at the block layer and through dm-crypt/LVM. Also run a manual trim and see whether it reports anything.
Task 5: Verify the device reports discard capability
cr0x@server:~$ cat /sys/block/nvme0n1/queue/discard_max_bytes
2147483648
What it means: The raw NVMe device accepts discards up to 2 GiB per command. Good.
Decision: If you’re not trimming, it’s not because the NVMe can’t. It’s because your software stack isn’t issuing discards or isn’t passing them down.
Task 6: Check whether the mapped device supports discard (dm layer)
cr0x@server:~$ cat /sys/block/dm-0/queue/discard_max_bytes
0
What it means: Your dm device (likely LVM LV or dm-crypt mapping) does not advertise discard. That’s why fstrim reports 0 bytes: the discard is blocked above the NVMe.
Decision: Inspect dm-crypt and LVM configuration. You probably need to enable discards for LUKS (allow-discards) and ensure LVM isn’t preventing it.
Task 7: Confirm dm-crypt discard setting
cr0x@server:~$ sudo cryptsetup status nvme0n1p2
/dev/mapper/nvme0n1p2 is active and is in use.
type: LUKS2
cipher: aes-xts-plain64
keysize: 512 bits
key location: keyring
device: /dev/nvme0n1p2
sector size: 512
offset: 32768 sectors
size: 3906250000 sectors
mode: read/write
flags:
What it means: The flags line is empty. When discards are allowed on the mapping, cryptsetup status lists discards under flags.
Decision: Plan a controlled change to enable discards in crypttab (or re-open mapping with discards) and validate that dm advertises discard afterwards.
Task 8: Check /etc/crypttab configuration
cr0x@server:~$ sudo cat /etc/crypttab
nvme0n1p2 UUID=9f3d7b2a-2c42-4e47-b2c6-7d92a4e5a8a1 none luks,discard
What it means: This example shows discard already present in crypttab, yet Task 7 showed an active mapping with no flags: the option hasn’t taken effect on the running mapping (for a root volume, that usually means the initramfs still carries the old settings). If your crypttab lacks the option entirely, discards are blocked by dm-crypt by design.
Decision: If the option is missing or not yet active, add discard (the underlying cryptsetup option is allow-discards), rebuild the initramfs, and schedule a reboot or maintenance window to reopen the mapping safely.
Task 9: Validate LVM is not the discard bottleneck
cr0x@server:~$ sudo lvs -o lv_name,vg_name,lv_attr,segtype,devices
LV VG Attr Type Devices
root vg0 -wi-ao---- linear /dev/mapper/nvme0n1p2(0)
What it means: A linear LV is fine. Thin pools and snapshots complicate discard semantics.
Decision: If you see thin-pool or heavy snapshot use, you must verify discard/trim support in that layout and consider periodic fstrim inside the guest plus proper discard pass-through at the host.
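If a thin pool does show up, one check worth running before anything else (a sketch; the fields are standard LVM report columns, and vg0 is the volume group from this example):

cr0x@server:~$ sudo lvs -o lv_name,segtype,discards vg0

For a thin pool you generally want discards set to passdown if trims are supposed to reach the physical device; nopassdown reclaims space only inside the pool, and ignore reclaims nothing.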
Task 10: Run a manual fstrim and interpret it
cr0x@server:~$ sudo fstrim -av
/boot/efi: 256.3 MiB (268783616 bytes) trimmed on /dev/nvme0n1p1
/: 112.7 GiB (121011388416 bytes) trimmed on /dev/mapper/vg0-root
What it means: This is what “working trim” looks like: non-zero trimmed bytes, especially on the root filesystem. If your earlier journal showed 0 B and now you see large trims, you just found a broken schedule or a previously blocked path that you fixed.
Decision: If trim is now working, move on to proving performance improvement (fio + iostat). If fstrim still reports 0 B but you know you delete data, keep digging: the discard path is still broken, or your workload isn’t freeing blocks in a trim-visible way.
Task 11: Observe latency and utilization during a slow period
cr0x@server:~$ iostat -x 1 10
Linux 6.8.0-41-generic (server) 12/30/2025 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
3.12 0.00 1.45 2.10 0.00 93.33
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz aqu-sz %util
nvme0n1 5.00 220.0 0.00 0.00 2.40 44.00 310.00 18432.0 120.00 27.91 38.50 59.46 12.40 98.00
What it means: %util near 98% with w_await ~38 ms indicates the device is saturated and writes are waiting. If this grows over time under the same workload, GC pressure is a strong suspect.
Decision: Capture this “bad” baseline, then compare immediately after a successful fstrim and a short idle window.
Task 12: Check NVMe SMART / health to rule out obvious drive issues
cr0x@server:~$ sudo nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0x00
temperature : 43 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 3%
data_units_read : 123,456,789
data_units_written : 98,765,432
host_read_commands : 3,210,987,654
host_write_commands : 2,109,876,543
controller_busy_time : 9,812
power_cycles : 27
power_on_hours : 2,144
unsafe_shutdowns : 1
media_errors : 0
num_err_log_entries : 0
What it means: No critical warnings, low wear (percentage_used). This supports the idea that the drive isn’t failing; it’s suffering from internal housekeeping under load.
Decision: If you see media errors, critical warnings, or high percentage_used, treat it as a potential hardware reliability issue first. TRIM won’t save a dying SSD.
Task 13: Measure performance with a controlled fio job (before TRIM)
Do this only on a safe target (a test LV, a spare partition, or a dedicated file on a non-critical filesystem). If you point fio at production data without thinking, you’re not an SRE; you’re a cautionary tale.
cr0x@server:~$ sudo fio --name=randwrite --filename=/var/tmp/fio.test --size=8G --direct=1 --ioengine=libaio --bs=4k --rw=randwrite --iodepth=32 --numjobs=1 --runtime=60 --time_based --group_reporting
randwrite: (groupid=0, jobs=1): err= 0: pid=4123: Tue Dec 30 11:12:01 2025
write: IOPS=18.2k, BW=71.2MiB/s (74.7MB/s)(4272MiB/60001msec); 0 zone resets
slat (nsec): min=1500, max=180512, avg=6210.3, stdev=3341.2
clat (usec): min=80, max=215000, avg=1732.4, stdev=8200.7
lat (usec): min=90, max=215020, avg=1738.8, stdev=8201.1
clat percentiles (usec):
| 1.00th=[ 120], 5.00th=[ 150], 10.00th=[ 170], 50.00th=[ 320],
| 90.00th=[ 1400], 95.00th=[ 3800], 99.00th=[20000], 99.90th=[120000]
cpu : usr=3.10%, sys=12.40%, ctx=1123456, majf=0, minf=12
IO depths : 1=0.1%, 2=0.2%, 4=0.5%, 8=1.2%, 16=7.0%, 32=91.0%, >=64=0.0%
What it means: Average latency looks “okay-ish,” but p99 and p99.9 are ugly. Those long tails are what hurt databases and API SLOs. This pattern is consistent with GC stalls.
Decision: Run fstrim (if possible), optionally idle for a few minutes, then rerun the same fio job and compare percentiles.
Task 14: Trim, then rerun fio (after TRIM)
cr0x@server:~$ sudo fstrim -v /
/: 112.7 GiB (121011388416 bytes) trimmed on /dev/mapper/vg0-root
cr0x@server:~$ sudo fio --name=randwrite --filename=/var/tmp/fio.test --size=8G --direct=1 --ioengine=libaio --bs=4k --rw=randwrite --iodepth=32 --numjobs=1 --runtime=60 --time_based --group_reporting
randwrite: (groupid=0, jobs=1): err= 0: pid=4188: Tue Dec 30 11:16:01 2025
write: IOPS=28.9k, BW=113MiB/s (118MB/s)(6780MiB/60001msec); 0 zone resets
slat (nsec): min=1500, max=110220, avg=5901.1, stdev=2987.4
clat (usec): min=70, max=42000, avg=980.2, stdev=1400.3
lat (usec): min=78, max=42012, avg=986.1, stdev=1400.5
clat percentiles (usec):
| 1.00th=[ 110], 5.00th=[ 140], 10.00th=[ 160], 50.00th=[ 280],
| 90.00th=[ 980], 95.00th=[ 1700], 99.00th=[ 5200], 99.90th=[16000]
What it means: Tail latency improved massively. The drive is spending less time relocating valid pages during writes because it has more pre-erased blocks to work with.
Decision: You’ve now “proven” the hypothesis: TRIM/GC state is affecting performance. Proceed to permanent fixes: ensure discards happen reliably, and consider operational practices that keep drives from living at 95% full.
Task 15: Confirm continuous discard is not already enabled (and decide if you want it)
cr0x@server:~$ mount | grep ' on / '
/dev/mapper/vg0-root on / type ext4 (rw,relatime,errors=remount-ro)
What it means: No discard mount option. That’s fine if periodic trim works.
Decision: Prefer periodic fstrim for general-purpose servers. Use continuous discard only for specific workloads where you’ve measured benefit and overhead is acceptable.
Task 16: Check free space and overfill risk
cr0x@server:~$ df -hT /
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/vg0-root ext4 295G 271G 10G 97% /
What it means: 97% full. This is a performance cliff zone for many SSDs, even with TRIM, because the drive has less room for wear leveling and GC.
Decision: Free space is a performance feature. Target 70–85% utilization for write-heavy systems, or add capacity. If your finance team hates this, call it “latency insurance.”
Joke #1: Running SSDs at 97% full is like scheduling your fire drill during a real fire—technically an exercise, practically a bad day.
Fixes that work (and why)
There are three categories of fixes, and you usually need at least two:
- Make TRIM actually happen (periodically or continuously), end-to-end through your stack.
- Give the SSD room to breathe (avoid living near 100% full; reduce churn on hot volumes).
- Stop making GC harder than it needs to be (workload and filesystem choices, queueing sanity, avoiding pathological layers).
Fix 1: Enable and verify periodic fstrim (the default you should want)
On Ubuntu 24.04, periodic TRIM is usually provided via fstrim.timer. You want it enabled, and you want it to report non-zero trims over time on systems that delete/overwrite data.
cr0x@server:~$ sudo systemctl enable --now fstrim.timer
Created symlink /etc/systemd/system/timers.target.wants/fstrim.timer → /usr/lib/systemd/system/fstrim.timer.
Verify it runs:
cr0x@server:~$ systemctl list-timers --all | grep fstrim
Mon 2026-01-05 00:28:14 UTC 5 days left Mon 2025-12-29 00:28:14 UTC 1 day ago fstrim.timer fstrim.service
Then verify output:
cr0x@server:~$ sudo journalctl -u fstrim.service --since "14 days ago" --no-pager | tail -n 20
Dec 29 00:28:14 server fstrim[32107]: /: 112.7 GiB (121011388416 bytes) trimmed on /dev/mapper/vg0-root
If it always trims 0 B, don’t congratulate yourself. You’re trimming air.
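A minimal guardrail for that failure mode, suitable for cron or a monitoring wrapper (a sketch, assuming journald keeps at least two weeks of history and that the fstrim log lines look like the journal output above):

#!/usr/bin/env bash
# Warn if the most recent fstrim run reported 0 B trimmed on the root filesystem.
last=$(journalctl -u fstrim.service --since "14 days ago" --no-pager | grep ' /: ' | tail -n 1)
echo "last fstrim line for /: ${last:-none found}"
if [ -z "$last" ] || echo "$last" | grep -q ' 0 B '; then
  echo "WARNING: fstrim appears to be trimming nothing on / - check the discard path" >&2
  exit 1
fi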
Fix 2: Ensure discards pass through dm-crypt (LUKS) if you use encryption
dm-crypt blocks discards by default for good reasons: discards can leak information about which blocks are in use. Many production environments accept that tradeoff because performance and predictable latency matter more than hiding allocation patterns.
What to do: Add discard in /etc/crypttab (or ensure it’s present), then reboot or reopen the mapping during a maintenance window. Afterward, confirm /sys/block/dm-*/queue/discard_max_bytes is non-zero.
Verification is non-negotiable: if dm still advertises 0, the SSD still isn’t hearing you.
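A hedged change-and-verify sequence for a maintenance window, assuming the mapping from the earlier example and a root filesystem on the encrypted volume (so the mapping is opened from the initramfs):

cr0x@server:~$ sudoedit /etc/crypttab        # append ",discard" to the options column of the LUKS entry
cr0x@server:~$ sudo update-initramfs -u      # the root mapping is opened from the initramfs, so rebuild it
cr0x@server:~$ sudo reboot                   # or reopen the mapping if it is not the root volume
cr0x@server:~$ grep . /sys/block/dm-*/queue/discard_max_bytes   # every dm layer should now report non-zero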
Fix 3: Prefer periodic fstrim over mount option discard (most of the time)
Mount option discard issues discards continuously as blocks are freed. This can be helpful in some steady-state churn workloads, but it can also add overhead, especially on filesystems with lots of small deletes, and it can interact poorly with certain device implementations.
My opinionated default:
- Use periodic fstrim for ext4/xfs on servers.
- Consider continuous discard only if:
  - you measured that periodic trim isn’t enough, and
  - your workload has constant churn and strict latency SLOs, and
  - you validated the overhead under load (a reversible remount test is sketched below).
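If you do trial continuous discard, keep the experiment reversible. A sketch using a remount (nothing persists until you also change /etc/fstab; /srv/ingest is a hypothetical mount point, substitute your churn-heavy volume):

cr0x@server:~$ sudo mount -o remount,discard /srv/ingest
cr0x@server:~$ findmnt -no OPTIONS /srv/ingest                 # confirm "discard" now appears
cr0x@server:~$ sudo mount -o remount,nodiscard /srv/ingest     # roll back after the benchmark

Benchmark the delete-heavy workload with and without it; if p99 latency and CPU system time don’t improve, go back to periodic fstrim and stop there.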
Fix 4: Stop running SSDs at “almost full”
This sounds like a budgeting problem, but it’s an engineering problem wearing a budgeting hat.
- On write-heavy volumes, target 15–30% free space as an operational buffer (a simple threshold check is sketched after this list).
- Avoid giant monolithic root volumes that collect everything. Split hot write paths to dedicated volumes where you can manage space and trim behavior.
- If you’re using thin provisioning, monitor pool utilization like it’s a pager duty rotation—because it is.
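A small free-space check you can drop into existing monitoring (a sketch; the 85% threshold mirrors the budget above and is an assumption to tune, not a magic number):

#!/usr/bin/env bash
# Flag filesystems that have crossed the free-space budget for write-heavy SSD volumes.
THRESHOLD=85
df --output=pcent,target -x tmpfs -x devtmpfs -x squashfs | tail -n +2 | while read -r pcent target; do
  use=${pcent%\%}
  if [ "$use" -ge "$THRESHOLD" ]; then
    echo "WARNING: $target is ${use}% full - GC headroom is shrinking"
  fi
done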
Fix 5: Let the drive idle occasionally (yes, really)
Many SSDs perform background GC more effectively when they have idle time. This is not a substitute for TRIM, but it can reduce the severity of latency spikes.
If your server is pegged 24/7 with constant writes, you are forcing GC to happen in the foreground. You can mitigate by smoothing write bursts (app-level batching, queueing), or by scaling out so each drive gets breathing room.
Fix 6: Align filesystems and stacks with discard semantics
Some configurations make discard complicated:
- LVM thin pools: discards may need explicit support. Thin provisioning plus heavy churn is a classic “looks fine until it doesn’t” setup (a pool-utilization check is sketched after this list).
- Snapshots everywhere: snapshots retain old blocks, meaning your deletes don’t actually free space from the lower layer’s point of view. Trim may not reflect real reclaimable space.
- Virtualized disks: discards might be ignored or delayed. Your guest can do everything right and still not influence physical media.
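For the thin-pool case specifically, watch pool fill the way you watch disk space (a sketch; vg0 is the volume group from the earlier example, and the fields are standard LVM report columns):

cr0x@server:~$ sudo lvs -o lv_name,lv_size,data_percent,metadata_percent vg0

Alert well before data_percent or metadata_percent approach 100; an exhausted thin pool does not degrade gracefully, it stops.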
Joke #2: Storage stacks are like lasagna—every extra layer makes it tastier until you realize you’re debugging cheese.
Three corporate mini-stories (anonymized, plausible, and technically accurate)
Mini-story 1: The incident caused by a wrong assumption
They migrated a fleet of Ubuntu servers from SATA SSDs to shiny NVMe. The migration went smoothly, and the first few weeks were glorious. Latency dropped, dashboards looked calm, and everyone congratulated themselves in the chat channel that exists exclusively for congratulating themselves.
Six weeks in, the alerts started: database write latency p99 going vertical during peak hours. The on-call did the normal dance—checked CPU, checked memory, checked network, blamed the ORM, then stared at iostat like it would confess. The NVMe devices showed high utilization and long write awaits, but nothing was “broken.” SMART looked clean. So they scaled up instances. It got better for a while, then it came back.
The wrong assumption was subtle: “NVMe is fast, so storage housekeeping won’t matter.” They didn’t realize the new build used LUKS encryption with default settings, and discards were blocked. fstrim ran weekly and reported success, but trimmed 0 bytes every time. No one noticed because “Finished successfully” is the sort of line humans are trained to misread.
Once they enabled discards through dm-crypt and verified /sys/block/dm-*/queue/discard_max_bytes was non-zero, the weekly trim started trimming real space. The next peak period still had load, but the latency cliff was gone. The postmortem action item was not “buy faster drives.” It was “test discard end-to-end in the golden image pipeline.”
Mini-story 2: The optimization that backfired
A different org had a log ingestion pipeline that wrote continuously: small files, constant deletes, lots of metadata churn. They read somewhere that mounting with discard keeps SSDs happy. So they pushed a change to mount the log volume with continuous discard. They did not benchmark it because, and I quote the vibe, “it’s just discard.”
Within days, ingest throughput dipped and CPU system time climbed. The disk didn’t look saturated by bandwidth, but latency got noisier. They had effectively moved discard work into the hot path: constant tiny discards, constant extra device commands, and more overhead per delete. The SSD was being politely informed about free space thousands of times per second, like a coworker who insists on narrating every keystroke.
They rolled back to periodic fstrim and scheduled trim during off-peak. Throughput returned, CPU calmed down, and the SSD still got the information it needed—just in batches where the overhead was amortized.
The lesson: continuous discard is not evil, but it is a workload-dependent knob. If you turn it on everywhere by policy, you’re not tuning; you’re hoping.
Mini-story 3: The boring but correct practice that saved the day
A payments team had a habit that wasn’t glamorous: every storage-related change required a “prove it” checklist. Not a document to satisfy auditors—an actual engineering ritual. They measured p95/p99 write latency with a synthetic fio profile that matched their database pattern. They recorded it before changes, after changes, and again two weeks later.
When they moved to Ubuntu 24.04, the numbers looked fine on day one. Two weeks later, the scheduled benchmark showed tail latency creeping up. Not catastrophic yet, but trending in the wrong direction. Because this was caught by routine measurement, they had time to investigate without an active incident.
They found that a new thin-provisioned LVM layout was used for convenience. Discards from the filesystem weren’t reclaiming space in the thin pool as expected, and the thin pool was running hot. They adjusted the layout (simplified for the database volume), ensured periodic trim worked, and enforced a free-space budget. No heroics, no midnight pages.
Nothing about that story is exciting. That’s the point. The boring practice—measure now, measure later—saved them from the exciting kind of storage problem.
Common mistakes: symptom → root cause → fix
This section is the accumulated scar tissue. Match your symptom to a likely cause, then verify with the tasks above.
1) fstrim runs “successfully” but always trims 0 bytes
- Symptom: journal shows /: 0 B trimmed week after week.
- Root cause: Discards blocked by dm-crypt, LVM, thin pool settings, or the backend device ignores discards.
- Fix: Check /sys/block/*/queue/discard_max_bytes for the actual mount source device. Enable discards in crypttab, confirm dm advertises discard, rerun a manual fstrim -av.
2) Performance improves after reboot, then decays
- Symptom: “Reboot fixes it” folklore, repeating every few weeks.
- Root cause: Drive got idle time or queue conditions changed; underlying issue is lack of TRIM or chronic near-full state causing GC under load.
- Fix: Prove with fio before/after manual trim; enable periodic trim; maintain free space.
3) NVMe looks healthy, but write p99 is awful
- Symptom: SMART fine, no media errors, but long-tail latency spikes.
- Root cause: Foreground GC due to insufficient clean blocks; write amplification; device at high utilization with mixed workload.
- Fix: Ensure effective TRIM; reduce fill level; consider separating workloads (logs vs DB); verify queueing and scheduler defaults.
4) Continuous discard enabled and performance got worse
- Symptom: CPU sys time increases, throughput drops, latency noisier under delete-heavy workload.
- Root cause: Discard overhead in hot path; too many small discards.
- Fix: Remove the discard mount option; use periodic fstrim.timer; schedule trims off-peak.
5) TRIM works on bare metal, not in VMs
- Symptom: Guest runs fstrim, reports trimming, but host device performance still degrades; or guest trims 0 bytes.
- Root cause: Hypervisor/backing storage doesn’t propagate discards; virtual disk type doesn’t support it; cloud provider behavior.
- Fix: Verify discard support at the virtualization layer. If discards don’t propagate, use host-side reclaim mechanisms or accept that you need spare capacity and periodic re-provisioning.
6) Thin provisioning hits a wall
- Symptom: Thin pool or backing device fills, performance tanks, trims don’t seem to free space.
- Root cause: Discards not passed into thin pool, snapshots pin blocks, or pool is overcommitted and under-monitored.
- Fix: Validate thin discard settings; reduce snapshot retention; monitor pool data% like a first-class SLO; keep free space.
Checklists / step-by-step plan
This is the operational plan you can hand to an on-call rotation without also handing them your therapist’s number.
Checklist A: Prove the problem is TRIM/GC (30–90 minutes)
- Capture baseline device latency during the “bad” period (iostat -x and application p99 write latency); a consolidated capture sketch follows this checklist.
- Confirm discard capability on the raw device (/sys/block/nvme*/queue/discard_max_bytes).
- Confirm discard capability on the actual mount source device (often dm-*).
- Run sudo fstrim -av and record trimmed bytes per mount.
- Run a controlled fio test and record p95/p99 percentiles.
- Repeat fio after fstrim (and a short idle window) using identical parameters.
- If tail latency improves materially: treat TRIM path and free-space budget as the fix target.
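A consolidated capture sketch for the steps above. It only gathers evidence and changes nothing beyond issuing the trim; run the fio job separately before and after so the percentiles stay comparable (output paths are assumptions):

#!/usr/bin/env bash
# Checklist A evidence capture: discard capabilities, latency snapshots, and the trim result.
set -u
out="/var/tmp/trim-evidence-$(date +%Y%m%d-%H%M)"
mkdir -p "$out"
grep . /sys/block/*/queue/discard_max_bytes > "$out/discard_caps.txt"   # raw and dm devices
iostat -x 1 10                              > "$out/iostat_before.txt"
sudo fstrim -av                             > "$out/fstrim.txt" 2>&1
iostat -x 1 10                              > "$out/iostat_after.txt"
echo "Evidence written to $out - attach the before/after fio percentiles to the same directory."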
Checklist B: Make TRIM reliable (maintenance window)
- Enable fstrim.timer and verify it’s scheduled.
- Ensure /etc/crypttab includes discard options if using LUKS and your threat model allows it.
- Reboot or safely reopen dm-crypt mappings as required.
- Confirm /sys/block/dm-*/queue/discard_max_bytes is non-zero.
- Run a manual fstrim -av once and confirm non-zero trims.
- Track weekly fstrim logs and alert if it trims 0 bytes repeatedly on churn-heavy volumes.
Checklist C: Keep it fixed (ongoing hygiene)
- Set capacity targets: keep 15–30% free space on hot write volumes.
- Split write-heavy workloads across volumes/devices when feasible.
- Record a standard fio profile for your environment and rerun it monthly (or after kernel/storage changes); a wrapper sketch follows this checklist.
- Monitor NVMe SMART: temperature, critical warnings, media errors, percentage_used.
- Monitor latency percentiles, not just throughput and average await.
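One way to keep the “standard fio profile” honest is a small wrapper you rerun unchanged (a sketch mirroring the randwrite job used earlier; the target filename and output path are assumptions to adapt):

#!/usr/bin/env bash
# Monthly baseline: identical parameters every run so latency percentiles stay comparable.
set -eu
target=/var/tmp/fio.baseline           # keep this on a non-critical filesystem
result="/var/tmp/fio-baseline-$(date +%Y%m%d).json"
fio --name=baseline-randwrite --filename="$target" --size=8G \
    --direct=1 --ioengine=libaio --bs=4k --rw=randwrite --iodepth=32 \
    --numjobs=1 --runtime=60 --time_based --group_reporting \
    --output-format=json --output="$result"
echo "Baseline written to $result - compare clat percentiles against previous runs."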
FAQ
1) Why does SSD/NVMe performance degrade over time at all?
Because flash requires erase-before-write at the block level. When the SSD can’t find clean blocks, it must move valid data out of the way (GC) before writing. Without TRIM, it assumes more data is valid, making GC heavier and increasing write amplification and latency spikes.
2) Isn’t NVMe supposed to be “always fast”?
NVMe is a protocol and interface optimized for parallelism and low overhead. It doesn’t change how NAND flash works internally. A fast interface can deliver a faster fall off a cliff.
3) Should I mount ext4 with discard on Ubuntu 24.04?
Usually no. Prefer periodic fstrim.timer. Use continuous discard only after benchmarking, and only on volumes where the delete pattern and latency requirements justify it.
4) How often should I run fstrim?
Weekly is a sane default. For high-churn workloads, you might run it daily during off-peak. Measure: if performance decays within days, trim more often or fix fill-level and workload issues.
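If you land on daily, a systemd drop-in is cleaner than editing the packaged unit (a sketch; the empty OnCalendar= line clears the weekly schedule before setting the new one):

cr0x@server:~$ sudo systemctl edit fstrim.timer
In the editor that opens, add:
[Timer]
OnCalendar=
OnCalendar=daily
Then confirm the new schedule:
cr0x@server:~$ systemctl list-timers fstrim.timer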
5) Does TRIM work through LUKS encryption?
It can, but only if enabled. dm-crypt may block discards unless configured (commonly via discard in /etc/crypttab). Enabling it can leak allocation patterns, so decide based on your security model.
6) I ran fstrim and it trimmed a lot. Is that bad for SSD lifespan?
TRIM itself doesn’t write data; it informs the SSD that blocks are no longer needed. It can actually reduce write amplification by helping the SSD clean more efficiently. The bigger risk to lifespan is sustained write amplification from running near full with no effective trim.
7) Why does fstrim show 0 bytes on a busy server where we delete data daily?
Common reasons: discards blocked at dm-crypt/LVM; thin provisioning or snapshots keep blocks “in use”; the backend ignores discards (some virtual disks); or your workload overwrites in-place without freeing extents in a way that generates discardable space.
8) Can I “reset” the SSD to restore performance?
Some drives support secure erase or format operations that return the device to a fresh state, but that’s destructive. Operationally, the non-destructive approach is: ensure TRIM works, maintain free space, and avoid pathological stacking that blocks discards.
9) My drive is only 3% used (SMART percentage_used), so why is it slow?
SMART percentage_used is wear, not fullness. You can have a nearly new drive with terrible performance if it’s kept at high logical fill, under constant writes, with no effective TRIM and lots of internal GC.
10) Do I need to change the I/O scheduler for NVMe on Ubuntu 24.04?
Often no; modern kernels default to reasonable choices (commonly none for NVMe). Scheduler tweaks won’t compensate for a drive that’s forced into heavy foreground GC. Fix discard and free space first, then tune if needed.
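Checking what you actually have costs one command (the bracketed entry is the active scheduler; on NVMe it is commonly none):

cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler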
Conclusion: practical next steps
If your Ubuntu 24.04 SSD/NVMe gets slower over time, treat it like an engineering problem, not folklore. Prove the hypothesis with before/after measurements: same fio job, same device, with and without a successful trim. If tail latency improves after trim, stop debating and start fixing the discard path end-to-end.
Next steps that are worth doing in this order:
- Check the discard chain: raw NVMe discard support, then dm device discard support, then fstrim output.
- Make fstrim reliable: enable the timer, confirm non-zero trims over time, alert if it silently does nothing.
- Fix the structural problem: stop running hot write volumes at 95–99% full; it’s a self-inflicted latency tax.
- Keep receipts: store a baseline fio profile and rerun it after upgrades, image changes, and storage stack modifications.
The drive will keep doing garbage collection. Your job is to keep it in the background where it belongs.