Ubuntu 24.04: Performance tank after update — the first 6 checks that reveal the culprit


It’s the classic: you patch a perfectly fine Ubuntu box, reboot, and suddenly your “snappy” system feels like it’s trying to run through wet cement. Dashboards lag. Deploys crawl. IO graphs look like a heart monitor. Your first instinct is to blame “the update” as one giant blob. That’s emotionally satisfying. It’s also operationally useless.

Performance regressions are usually one or two concrete bottlenecks wearing a trench coat. The job is to catch them quickly, identify whether they’re real regressions or just different defaults, and make a calm decision: roll back, tune, or wait for an upstream fix.

Fast diagnosis playbook (what to check first)

If you only have 15 minutes before someone pings you again, do this in order. The point is not to collect data for a museum. The point is to identify the bottleneck class: CPU, memory pressure, disk latency, network, or “a new service is eating the host.”

1) Confirm what changed (kernel, microcode, drivers, services)

  • Check current kernel and last boot time.
  • List recent package upgrades and new units enabled.
  • Look for new snap refreshes if you run snap-heavy systems.
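
If you run snaps, two quick commands show what refreshed and when, before you dig anywhere else. A minimal sketch, assuming snapd is present:

snap changes | tail -n 5     # recent snap transactions (refreshes, installs) and their status
snap refresh --time          # when the last refresh ran and when the next window opens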

2) Classify the slowdown: CPU bound vs I/O bound vs memory bound

  • If load is high and iowait is high, it’s usually storage or filesystem behavior.
  • If CPU is pegged in user/system with low iowait, it’s a runaway process, syscall storm, or a “helpful” new daemon.
  • If you see high swap activity, it’s memory pressure or a change that grew the working set.

3) Identify the top offender process and confirm it correlates with the slowdown

  • Don’t stop at “top says X uses CPU.” Check if it’s blocked on disk or waiting on locks.

4) Measure disk latency (not throughput) and queueing

  • Throughput can look fine while latency destroys tail performance.
  • Confirm whether it’s one device, one filesystem, or one mount option.

5) Check CPU frequency scaling and power profiles

  • After updates, power-related defaults and drivers can shift. If your CPU is stuck in powersave, everything feels “mysteriously slow.”

6) Only then: dig into kernel logs, driver regressions, AppArmor denials, and journald/log growth

  • These are common on Ubuntu 24.04 because the platform is modern and busy: new kernels, new drivers, and new policies.

One paraphrased idea from Gene Kim (DevOps author) that operations teams should tattoo on their runbooks: optimize for fast detection and fast recovery, not perfect prevention. That mindset is how you stay sane during “it got slow” season.

The first 6 checks that usually reveal the culprit

Check 1: Are you actually running the kernel you think you’re running?

After updates, “we upgraded” and “we are running the upgraded kernel” are not the same statement. If you use Livepatch, multiple kernels, or deferred reboots, you can be in a half-updated state where userland expects one behavior and the kernel delivers another. Also: a new kernel may have changed a driver path (NVMe, network, GPU), and now you’re riding a regression.
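
A quick way to compare “installed” with “actually running”; a minimal sketch using the stock Ubuntu package naming:

uname -r                               # the kernel you are running right now
dpkg -l 'linux-image-*' | grep '^ii'   # the kernel packages installed on disk

If the newest installed image doesn’t match uname -r, you haven’t actually booted into the update yet.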

Check 2: Is the system I/O-bound (iowait, disk latency, queue depth)?

I/O regressions are the top source of “the whole machine is slow” complaints because they hit everything: boot, package installs, database commits, service restarts, log writes. After an update, common triggers include:

  • Filesystem mount options changed or got “improved.”
  • Journald log volume increased and is now syncing harder.
  • Snap refreshes kicking off at inconvenient times.
  • Storage driver changes (NVMe/APST, scheduler, multipath).
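
Two of those storage knobs are cheap to read straight from sysfs. A minimal sketch; nvme0n1 is a placeholder for your device, and the APST parameter only appears when the nvme_core module is loaded:

cat /sys/block/nvme0n1/queue/scheduler                          # active I/O scheduler is shown in [brackets]
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us  # APST latency ceiling; 0 disables the deepest power states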

Check 3: Is memory pressure forcing reclaim or swap?

Ubuntu 24.04 is happy on modern hardware, but your workload might not be. A package update can change default cache sizes, introduce a new service, or upgrade a runtime that uses more memory. Once you get into sustained reclaim, you’ll see “random” slowness: everything waits behind memory management.

Check 4: Did CPU frequency scaling/power mode change?

On laptops it’s obvious. On servers it’s sneakier. A change in power profiles, BIOS settings, microcode, or driver behavior can keep cores at low frequency under load, or leave them oscillating between states. It looks like “CPU is only at 40% but requests are slow.”
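
A fast way to see governor, driver, and live frequency in one place; a sketch that assumes the linux-tools package matching your kernel is installed for cpupower:

cpupower frequency-info                     # governor, driver, and current frequency range
watch -n1 'grep "cpu MHz" /proc/cpuinfo'    # confirm frequencies actually rise while you apply load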

Check 5: Did a service get enabled, reconfigured, or start thrashing?

Systemd makes it easy to enable stuff. Updates can also change unit defaults. A small daemon doing “just a little scanning” becomes a 24/7 disk crawler on large filesystems. The usual suspects: indexers, security agents, backup hooks, log shippers, and anything that “discovers” container images.
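
Two commands narrow this down quickly; the snapshot path below is illustrative:

systemctl list-timers --all | head -n 15                             # scheduled jobs that can wake up and start scanning
systemctl list-unit-files --state=enabled | sort > /tmp/enabled.now  # snapshot to diff against a pre-update or healthy-host copy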

Check 6: Are kernel logs showing errors, resets, or policy denials?

If performance tanked, it’s often because the system is retrying something. Link flaps. NVMe resets. Filesystem warnings. AppArmor denies a hot path, causing retries and weird fallbacks. Kernel logs are where the body is buried.
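
AppArmor denials land in the kernel log, so one filtered query covers them; a sketch that assumes auditd is not redirecting audit messages elsewhere:

sudo journalctl -k -S "1 hour ago" --no-pager | grep -i 'apparmor="DENIED"'   # recent denials to correlate with latency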

Joke #1: The only thing more consistent than “it got slow after the update” is “it was already slow, the update just made people look.”

Practical tasks: commands, interpretation, decisions (12+)

These are the tasks I run on real production boxes because they answer questions quickly. Each task includes: a command, what the output means, and the decision you make. Run them as a user with sudo where needed.

Task 1: Confirm kernel, boot time, and basic platform

cr0x@server:~$ uname -a
Linux server 6.8.0-41-generic #41-Ubuntu SMP PREEMPT_DYNAMIC x86_64 GNU/Linux
cr0x@server:~$ uptime -s
2025-12-29 02:14:03

Meaning: You see exactly which kernel is active and when it booted. If the update happened after the last boot, you didn’t reboot and you’re not actually testing the updated kernel.

Decision: If regression started only after reboot, suspect kernel/driver/services. If it started before reboot, suspect userspace updates or background tasks (snap refreshes, indexers, log growth).

Task 2: List recent upgrades to correlate the “slowdown start”

cr0x@server:~$ grep -E " upgrade | install " /var/log/dpkg.log | tail -n 15
2025-12-28 21:05:12 upgrade linux-image-6.8.0-41-generic:amd64 6.8.0-40.40 6.8.0-41.41
2025-12-28 21:05:20 upgrade linux-modules-6.8.0-41-generic:amd64 6.8.0-40.40 6.8.0-41.41
2025-12-28 21:06:01 upgrade systemd:amd64 255.4-1ubuntu8 255.4-1ubuntu9
2025-12-28 21:06:29 upgrade openssh-server:amd64 1:9.6p1-3ubuntu13 1:9.6p1-3ubuntu14

Meaning: Shows what changed recently. Kernel + systemd updates are high-leverage suspects because they touch boot, device management, cgroups, and logging behavior.

Decision: Pick 1–3 “most suspicious” updates and check their changelog impact in your environment (especially kernel and storage/network stack).

Task 3: Check systemd for failed units and slow boot services

cr0x@server:~$ systemctl --failed
  UNIT                 LOAD   ACTIVE SUB    DESCRIPTION
● fwupd.service         loaded failed failed Firmware update daemon
cr0x@server:~$ systemd-analyze blame | head
35.114s snapd.seeded.service
12.892s dev-nvme0n1p2.device
 8.310s systemd-journald.service
 6.942s NetworkManager-wait-online.service

Meaning: Failed units can cause retry loops. The blame output shows what’s eating startup time; it’s also a hint about runtime slowness (e.g., journald taking long, devices slow to appear).

Decision: If a unit is failing or restarting, fix that first. If snapd seeding dominates and coincides with the slowdown, you likely have background I/O churn.

Task 4: Identify the current top consumers (CPU, memory) without guessing

cr0x@server:~$ ps -eo pid,comm,%cpu,%mem,etimes,state --sort=-%cpu | head -n 12
  PID COMMAND         %CPU %MEM ELAPSED S
 2143 node            280.5  6.1   15421 R
  982 systemd-journal   32.1  0.3   86322 R
 1777 snapd            24.8  0.6   74210 S
 1460 postgres         18.2 12.4   99111 S

Meaning: You get a ranked list, with elapsed time and state. “R” means running; “D” (uninterruptible sleep) often means stuck on I/O.

Decision: If a process is pegging CPU, dig into it. If many important processes are in “D”, stop blaming CPU and go to storage.

Task 5: Check load, iowait, and context switching

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0  10240 812344  92160 6123456   0    0   112   205 3210 5400 18  6 70  6  0
 6  3  10304  92344  90240 5943120  40   88  9800  6100 5400 8800 22  9 40 29  0
 4  2  10496  74320  90112 5901220  52  140 10400  7200 5900 9400 18 10 41 31  0
 7  4  10560  70212  90224 5899900  60  160 12000  8400 6100 9900 16 11 38 35  0
 3  1  10560  69300  90240 5898000   0    0  2100  1900 4100 7200 14  8 70  8  0

Meaning: Watch wa (I/O wait) and swap in/out (si/so). Sustained swap activity means memory pressure. Sustained high wa means storage latency/queueing.

Decision: High swap: fix memory (reduce working set, add RAM, tune caches). High iowait: measure disk latency and find the device or mount causing it.

Task 6: Measure disk latency and utilization by device (the truth serum)

cr0x@server:~$ iostat -xz 1 3
Linux 6.8.0-41-generic (server)  12/29/2025  _x86_64_  (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          19.20    0.00    7.10   28.40    0.00   45.30

Device            r/s     w/s   rKB/s   wKB/s  await  aqu-sz  %util
nvme0n1         120.0   310.0  6400.0 41000.0  18.40    6.20  98.00
sda               0.0     2.0     0.0    48.0   1.10    0.01   0.20

Meaning: await is average I/O latency. aqu-sz and %util show queueing and saturation. A device at ~98% util with rising await is a bottleneck.

Decision: If the primary disk is saturated, find what’s writing/reading and whether it’s expected (DB load) or new (log storm, snap refresh, indexing, backup agent).

Task 7: Attribute disk I/O to processes (who is hammering the disk)

cr0x@server:~$ sudo iotop -oP
Total DISK READ: 28.40 M/s | Total DISK WRITE: 41.10 M/s
  PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN  IO>    COMMAND
  982 be/4  syslog      0.00 B/s   9.50 M/s  0.00 % 95.00% systemd-journald
 1777 be/4  root        0.00 B/s   7.20 M/s  0.00 % 80.00% snapd
 2143 be/4  app         6.10 M/s   2.40 M/s  0.00 % 35.00% node /srv/app/server.js

Meaning: You see which processes are doing real I/O right now. High IO> suggests the process spends time waiting on I/O rather than executing.

Decision: If journald is writing a lot, inspect log rate and persistence settings. If snapd is writing heavily during refresh/seeding, schedule or control refresh windows.
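
If snap refreshes keep colliding with peak traffic, you can move the window. A sketch; the window value is illustrative, pick one that matches your SLOs:

sudo snap set system refresh.timer="fri,23:00-01:00"   # weekly low-traffic refresh window
snap refresh --time                                    # confirm the schedule took effect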

Task 8: Check journald size and rate (log storms are performance storms)

cr0x@server:~$ journalctl --disk-usage
Archived and active journals take up 6.2G in the file system.
cr0x@server:~$ sudo journalctl -p warning -S "1 hour ago" | tail -n 8
Dec 29 09:12:41 server kernel: nvme nvme0: I/O 123 QID 7 timeout, aborting
Dec 29 09:12:43 server kernel: nvme nvme0: Controller reset, clearing queue
Dec 29 09:12:44 server systemd-journald[982]: Missed 312 log messages due to rate-limiting

Meaning: Big journals aren’t automatically bad, but heavy churn is. Also, warnings about device timeouts correlate strongly with latency spikes.

Decision: If you see storage timeouts/resets, you’re in driver/firmware territory. If journald is huge and growing rapidly, rate-limit noisy services and set sane retention.
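
For the journald side, a minimal retention-and-rate-limit sketch; the 2G cap and the drop-in file name are illustrative, not recommendations:

sudo journalctl --vacuum-size=2G        # one-time trim of archived journals
sudo mkdir -p /etc/systemd/journald.conf.d
printf '[Journal]\nSystemMaxUse=2G\nRateLimitIntervalSec=30s\nRateLimitBurst=10000\n' | sudo tee /etc/systemd/journald.conf.d/90-size.conf
sudo systemctl restart systemd-journald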

Task 9: Inspect kernel ring buffer for storage and network errors

cr0x@server:~$ dmesg -T | grep -Ei "nvme|reset|timeout|error|ext4|xfs|zfs|link is down|iommu" | tail -n 25
[Mon Dec 29 09:12:41 2025] nvme nvme0: I/O 123 QID 7 timeout, aborting
[Mon Dec 29 09:12:43 2025] nvme nvme0: Controller reset, clearing queue
[Mon Dec 29 09:12:44 2025] EXT4-fs (nvme0n1p2): warning: mounting fs with errors, running e2fsck is recommended
[Mon Dec 29 09:13:02 2025] igb 0000:03:00.0 eno1: Link is Down

Meaning: Timeouts and resets are not “noise.” They are performance killers because they trigger retries, queue flushes, and application stalls. Filesystem warnings can trigger extra journaling work.

Decision: Storage errors: stop tuning and start stabilizing (firmware, cabling, BIOS, kernel regression). Network link flaps: check NIC driver/firmware and switchport.

Task 10: Validate CPU frequency and governor (the silent slowdown)

cr0x@server:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
powersave
cr0x@server:~$ grep -E "model name|cpu MHz" /proc/cpuinfo | head -n 6
model name	: Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz
cpu MHz		: 800.028
model name	: Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz
cpu MHz		: 800.031

Meaning: If you’re pinned at ~800 MHz under load, you’re not “having a bad day,” you’re power-limited. Updates can change power-profiles-daemon behavior on some setups or reveal BIOS constraints.

Decision: On servers, you usually want predictable performance: set governor appropriately (often performance) or use a tuned profile that matches your SLOs.
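
Setting and verifying the governor is a one-liner plus a check. A sketch; it only works if your cpufreq driver offers the performance governor, and it does not persist across reboots on its own:

echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # verify on at least one core
cpupower frequency-info | grep -i governor                  # cross-check if linux-tools is installed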

Task 11: See if memory pressure is real (reclaim, swap, and OOM risks)

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        49Gi       1.2Gi       1.1Gi        12Gi       6.8Gi
Swap:          8.0Gi       2.4Gi       5.6Gi
cr0x@server:~$ cat /proc/pressure/memory
some avg10=0.25 avg60=0.18 avg300=0.12 total=92841712
full avg10=0.03 avg60=0.01 avg300=0.01 total=3812233

Meaning: Low “available” plus swap in use may be fine for some workloads, but PSI (/proc/pressure) tells you if tasks are waiting on memory. full pressure > 0 suggests serious contention.

Decision: If memory PSI climbs during incidents, tune memory usage, cap runaway services, or add RAM. If swap is thrashing, reduce swappiness or fix the workload.
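
If swap churn is the immediate problem, a minimal swappiness sketch; 10 is a common conservative value, not a universal answer, and the drop-in file name is illustrative:

sudo sysctl vm.swappiness=10                                              # takes effect immediately, not persistent
printf 'vm.swappiness=10\n' | sudo tee /etc/sysctl.d/99-swappiness.conf   # persists across reboots
sudo sysctl --system                                                      # reload sysctl drop-ins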

Task 12: Look for cgroup limits or container changes after updates

cr0x@server:~$ systemctl status docker --no-pager | sed -n '1,12p'
● docker.service - Docker Application Container Engine
     Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2025-12-29 02:16:12 UTC; 7h ago
cr0x@server:~$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc

Meaning: Ubuntu 24.04 uses cgroup v2 by default in common setups. Upgrades sometimes change container runtime assumptions. An application can become CPU-throttled or I/O-limited by cgroup policy, which looks like “the host is fine but the app is slow.”

Decision: If you run containers, verify resource constraints and runtime versions. Fix the limit instead of tuning the host blindly.
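
A quick way to see whether a container, not the host, is the one being throttled. A sketch; <container> is a placeholder:

docker stats --no-stream                                                               # live CPU/memory per container
docker inspect <container> --format '{{.HostConfig.NanoCpus}} {{.HostConfig.Memory}}'  # configured limits; 0 means unlimited
systemd-cgtop -b -n 1 -c | head -n 15                                                  # cgroup-level CPU view across the host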

Task 13: Check mount options and filesystem health (especially after kernel changes)

cr0x@server:~$ findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /
/dev/nvme0n1p2 / ext4 rw,relatime,errors=remount-ro
cr0x@server:~$ sudo tune2fs -l /dev/nvme0n1p2 | grep -Ei "Filesystem state|Errors behavior|Default mount options"
Filesystem state:         clean
Errors behavior:          Remount read-only
Default mount options:    user_xattr acl

Meaning: You confirm what filesystem you’re on and how it’s mounted. If errors were detected, ext4 may do extra work or remount read-only.

Decision: If the filesystem reports errors or the kernel logged warnings, schedule a maintenance window for fsck and investigate storage stability. Don’t “optimize” a broken disk.

Task 14: Check network path quickly (because latency can masquerade as CPU issues)

cr0x@server:~$ ip -s link show eno1 | sed -n '1,12p'
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 3c:fd:fe:aa:bb:cc brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
    918273645  821933     12      98       0    1203
    TX:  bytes packets errors dropped carrier collsns
    827364554  801102      0       0       7       0
cr0x@server:~$ ss -s
Total: 1048 (kernel 0)
TCP:   812 (estab 621, closed 134, orphaned 0, synrecv 0, timewait 134/0), ports 0
Transport Total     IP        IPv6
RAW       0         0         0
UDP       18        12        6

Meaning: RX/TX errors and carrier issues can drag performance down or create retry storms. ss -s gives a quick health overview of socket state.

Decision: If errors jumped after an update, suspect driver/firmware or offload settings changes. Validate with switch counters and consider pinning/rolling back the NIC driver if needed.
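
A quick NIC sanity pass; a sketch that assumes ethtool is installed and uses the same eno1 interface as above:

sudo ethtool -S eno1 | grep -iE "err|drop|miss" | head -n 15   # driver-level error and drop counters
sudo ethtool -k eno1 | grep -i offload                         # current offload settings; change only with evidence
sudo ethtool eno1 | grep -E "Speed|Duplex"                     # confirm the link negotiated what you expect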

Interesting facts & context (why this keeps happening)

  • Kernel updates change more than “the kernel.” They ship new device drivers, scheduler tweaks, block layer behavior, and sometimes new defaults. Storage and NIC paths are especially sensitive.
  • Linux I/O schedulers have evolved dramatically. Modern kernels often default to mq-deadline or none for NVMe; older tuning guides recommending CFQ are historical artifacts from a different era.
  • NVMe power management has a long history of surprising people. Features like APST can be great for power, but firmware quirks can cause latency spikes or resets on specific hardware.
  • systemd has gradually absorbed responsibilities once handled by separate daemons. That’s good for coherence, but it means systemd/journald changes can show up as performance behavior changes after updates.
  • “iowait” is not a measure of disk speed. It’s CPU time spent waiting for I/O completion. It’s a symptom that something is blocking, not a root cause by itself.
  • cgroup v2 is now the default in many mainstream distributions. It improves resource control, but upgrades can expose previously hidden throttling or mis-set limits, especially in containerized environments.
  • Snap’s transactional model trades some I/O for reliability. It’s a deliberate design choice: more metadata and more writing during updates, often noticeable on busy or small disks.
  • Journaling filesystems prioritize consistency. ext4 and XFS are designed to survive crashes, but certain workloads plus heavy logging can make journal commit behavior visible as latency.
  • Microcode updates can shift performance characteristics. They fix real security and stability issues, but can also change boost behavior or mitigate CPU errata in ways that affect tail latency.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

The company was mid-migration to Ubuntu 24.04, and the first wave looked fine. Then a cluster of “identical” nodes started timing out under moderate load right after the scheduled patch window. The team assumed it was the application release—because application releases are always guilty until proven innocent. Rollback was initiated. Nothing improved.

Graphs showed the hosts were not CPU-saturated. Memory was comfortable. But tail latency was wrecked. The logs were full of harmless-looking warnings that no one had time to read. So they did what teams do under stress: they added more replicas. That made it worse, because the system became more I/O parallel and pushed the storage device into deeper queueing.

The wrong assumption was subtle: “All NVMe drives behave the same.” These nodes had a different NVMe model due to supply-chain substitutions. The new kernel activated a power-management path that the older kernel didn’t tickle. Under certain access patterns, the drive firmware would reset the controller. Each reset caused seconds of blocked I/O, and every blocked I/O turned into application timeouts.

The fix wasn’t heroic. They pinned a known-good kernel for that hardware cohort, disabled the problematic NVMe power-saving setting, and scheduled a firmware validation. The lesson was painfully simple: hardware variance plus kernel change equals “read the dmesg before you touch the app.”

Mini-story 2: The optimization that backfired

A different team saw disk writes spike after upgrading a fleet. They noticed journald had grown large and decided to “optimize” by putting the journal on the fastest device and increasing log retention “so we can debug better.” It sounded reasonable and even got a thumbs-up in chat.

Within a week, the fastest device was now the busiest device. Their database shared that same NVMe. During peak hours, latency climbed, and p99 queries got ugly. The database wasn’t running out of throughput; it was losing the latency lottery because logging had become a constant background writer with sync points.

The backfire wasn’t journald itself—it was the combination of a higher log volume from a newly verbose service plus longer retention plus co-location with latency-sensitive storage. They had optimized for “I/O locality” and got “I/O contention.”

The eventual fix was boring: cap journal size, reduce log verbosity for the chatty service, and push high-volume logs off-host asynchronously. They also separated “fast” from “critical.” Fast storage is not an infinite sink; it’s where you put the workload that cannot tolerate queueing.

Mini-story 3: The boring but correct practice that saved the day

One org had a rule: every patch window produces a “change record” with three artifacts—kernel version, the top 20 packages changed, and a 10-minute snapshot of baseline performance counters (CPU, memory PSI, iostat latency, and network errors). It wasn’t glamorous. Nobody got promoted for it. But it made incidents short.

After an update, a set of API servers slowed down. The on-call pulled the previous baseline from the last patch window and immediately saw a new pattern: CPU frequency was lower under load and context switching was higher. Disk latency was unchanged, network was clean. The bottleneck class was CPU behavior, not storage.

The root cause was a power profile change after an unrelated package update. They didn’t spend hours blaming the kernel or chasing phantom I/O. They restored the performance profile, verified frequency scaling under load, and closed the incident before the business escalated it.

This is the kind of practice that sounds like paperwork until it saves you: capture a baseline while the system is healthy. Otherwise you’re stuck arguing with your own memory, which is not a reliable monitoring tool.

Common mistakes: symptom → root cause → fix

These are the patterns I see repeatedly after Ubuntu updates, including 24.04. The symptoms are what people report. The root cause is what’s actually happening. The fix is specific and testable.

1) “Load is high but CPU usage is low”

  • Symptom: Load average spikes; top shows CPUs mostly idle; apps stall.
  • Root cause: Threads blocked in uninterruptible I/O sleep (storage latency or driver resets).
  • Fix: Run iostat -xz and dmesg -T. If you see NVMe timeouts/resets, stabilize firmware/kernel settings; if latency is real, find the writer with iotop.

2) “Everything is slower after reboot, but then it gradually improves”

  • Symptom: First hour after reboot is terrible; later it’s tolerable.
  • Root cause: Post-update tasks: snap seeding/refresh, cache rebuilds, updatedb, language pack generation, container image scanning.
  • Fix: Use systemd-analyze blame, journalctl, and iotop to identify churn. Reschedule tasks or limit them. Don’t benchmark during “first boot after update.”

3) “Disk throughput looks fine, but the app is timing out”

  • Symptom: MB/s is high; dashboards show healthy throughput; p99 latency explodes.
  • Root cause: Latency and queueing, not throughput. Mixed reads/writes plus sync-heavy logging can cause long tail latencies.
  • Fix: Focus on await, aqu-sz, and per-process I/O. Reduce sync frequency, split noisy writers, and validate filesystem mount options.

4) “CPU is only 30% but requests are slow”

  • Symptom: Plenty of headroom; still slow.
  • Root cause: CPU stuck at low frequency (governor/power profile) or throttling via cgroup limits.
  • Fix: Check scaling_governor and CPU MHz; verify container limits; correct power profile or cgroup policy.

5) “After the update, the disk is constantly busy doing ‘nothing’”

  • Symptom: High disk utilization at idle; fans spin; interactive shell lags.
  • Root cause: Journald growth, log storms, snap refresh, or a newly enabled scanner/indexer.
  • Fix: Identify the writer with iotop. Rate-limit the noisy source, cap journal size, and manage snap refresh windows.

6) “Network became flaky and now performance is terrible”

  • Symptom: Retries, slow API calls, intermittent timeouts.
  • Root cause: Driver regression, offload setting change, link flaps, or MTU mismatch after update.
  • Fix: Inspect ip -s link errors and dmesg link messages; coordinate with network counters; adjust offloads only with evidence.

Joke #2: Kernel regressions are like office coffee machines: the moment you rely on them, they develop “character.”

Checklists / step-by-step plan

Checklist A: 10-minute triage on a single host

  1. Record kernel and uptime (uname -a, uptime -s).
  2. Check recent upgrades (/var/log/dpkg.log tail).
  3. Check failed services and boot blame (systemctl --failed, systemd-analyze blame).
  4. Classify with vmstat 1 (watch wa, si, so).
  5. Measure disk latency with iostat -xz.
  6. Attribute I/O with iotop -oP if disk is hot.
  7. Check kernel logs for resets/timeouts (dmesg -T filtered).
  8. Check CPU governor and frequency (scaling_governor, /proc/cpuinfo).
  9. Check memory PSI (/proc/pressure/memory) and swap usage (free -h).
  10. Make a call: mitigate (stop the offender), roll back kernel, or open a hardware/driver investigation.

Checklist B: When it’s a fleet problem (not one host)

  1. Pick three hosts: one fast, one slow, one average.
  2. Compare kernel versions and microcode packages.
  3. Compare hardware IDs (NVMe model, NIC model) so you don’t chase “software” that’s really hardware variance.
  4. Compare iostat latency and dmesg errors across cohorts.
  5. If it correlates with a specific kernel build, consider pinning or rolling back that kernel on affected hardware.
  6. Confirm whether new services are enabled on slow hosts (diff the output of systemctl list-unit-files --state=enabled between cohorts; a sketch follows).
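
For item 6, a minimal host-to-host diff sketch; hostnames and paths are illustrative:

ssh host-fast 'systemctl list-unit-files --state=enabled' | sort > /tmp/fast.units
ssh host-slow 'systemctl list-unit-files --state=enabled' | sort > /tmp/slow.units
diff -u /tmp/fast.units /tmp/slow.units              # units enabled on one cohort but not the other
ssh host-slow 'uname -r; lsblk -d -o NAME,MODEL'     # kernel and disk models for the hardware comparison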

Checklist C: Controlled mitigation actions (do these, not random tuning)

  • If disk is saturated by logs: reduce verbosity, cap journal size, move high-volume logs off-host asynchronously.
  • If snap refresh is disrupting SLOs: schedule refresh windows and avoid running seeding during peak.
  • If NVMe resets appear: prioritize firmware validation and kernel parameter mitigations over “filesystem tuning.”
  • If CPU frequency is stuck: correct governor/profile and confirm under load with repeatable tests.
  • If memory PSI is high: reduce working set, fix leaks, set limits, or add RAM—then re-measure.
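
Two of those mitigations in command form. A sketch; the hold targets the stock Ubuntu meta-packages, and the NVMe parameter is a commonly cited mitigation that you should validate on your hardware before rolling it out:

sudo apt-mark hold linux-generic linux-image-generic linux-headers-generic   # stop new kernel builds from arriving while you investigate
# Boot the known-good kernel from GRUB's "Advanced options" submenu, or set GRUB_DEFAULT to it in /etc/default/grub.
# NVMe power-saving mitigation: add nvme_core.default_ps_max_latency_us=0 to GRUB_CMDLINE_LINUX_DEFAULT
# in /etc/default/grub, then run: sudo update-grub && sudo reboot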

FAQ

1) How do I know if the slowdown is CPU, disk, or memory?

Use vmstat 1 and iostat -xz. High wa + high await points to disk latency. Swap activity and memory PSI point to memory pressure. High user/system CPU with low iowait points to CPU or syscall overhead.

2) The update finished, but I didn’t reboot. Can performance still change?

Yes. Userspace services can restart, configuration can change, and background tasks (snap refresh, cache rebuild) can run immediately. Kernel/driver behavior generally changes after reboot, but not all regressions require it.

3) Why does iowait matter so much if CPU looks idle?

Because your application threads are waiting for storage. CPUs being idle doesn’t help if the request path is blocked on fsync, metadata writes, or device retries. Measure disk latency, not just utilization.

4) Should I roll back the kernel immediately?

If you have clear evidence of driver resets/timeouts or a cohort-based regression tied to a specific kernel, rolling back is a reasonable containment move. If the bottleneck is a noisy service or logging, rolling back won’t help and wastes time.

5) Snapd is busy after the update—do I remove snaps?

Not as a knee-jerk reaction. First confirm snapd is the offender with iotop and systemd-analyze blame. Then control refresh windows and reduce conflict with latency-sensitive workloads. Removing snaps can create more operational surface area than it saves.

6) What’s the fastest way to catch NVMe issues?

dmesg -T filtered for nvme, timeout, and reset, plus iostat -xz for latency spikes. If you see controller resets, stop tuning the filesystem and start validating firmware and kernel behavior.

7) How do AppArmor denials show up as “performance problems”?

Denials can force fallbacks, repeated failures, or retries in hot paths. Check journalctl for AppArmor messages and correlate timestamps with latency. Fix is typically policy adjustment or correcting a service’s expected file paths—not disabling AppArmor broadly.

8) Is it safe to set the CPU governor to performance on Ubuntu 24.04 servers?

Often yes for latency-sensitive production, assuming power and thermals are sized correctly. But treat it as a change: apply to a subset, measure, and confirm you aren’t triggering thermal throttling or violating power budgets.

9) Why does journald sometimes become a bottleneck?

High-volume logging is sustained small writes plus metadata updates. If a service becomes noisy after an update, journald can create constant write pressure and sync points. Cap size, reduce verbosity, and don’t colocate log churn with your most latency-sensitive storage if you can avoid it.

Practical next steps

Do this today, while the incident is still fresh and the evidence hasn’t rotated out of the logs:

  1. Run the fast diagnosis playbook on one affected host and one unaffected host. Write down: kernel, iostat latency, top I/O process, and any dmesg errors.
  2. If you see device timeouts/resets: contain the blast radius (pin/rollback kernel on affected hardware cohort) and open a firmware/driver investigation.
  3. If you see log or snap-driven I/O churn: cap it, schedule it, or move it. Don’t let background maintenance compete with foreground SLOs.
  4. If CPU frequency is wrong: fix the governor/profile and verify under load with repeatable measurements, not vibes.
  5. Capture a baseline artifact after you stabilize. Next patch window, you’ll thank you.
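
Here’s a minimal baseline-capture sketch you can attach to the change record; paths and durations are illustrative, iostat comes from the sysstat package, and dmesg usually wants sudo:

#!/usr/bin/env bash
# Capture a small performance baseline while the host is healthy.
out="/var/tmp/baseline-$(hostname)-$(date +%Y%m%d-%H%M)"
mkdir -p "$out"
uname -a                > "$out/kernel.txt"
uptime -s              >> "$out/kernel.txt"
vmstat 5 3              > "$out/vmstat.txt"       # CPU, swap, iowait over a short sampling window
iostat -xz 5 3          > "$out/iostat.txt"       # per-device latency and utilization
grep -H . /proc/pressure/* > "$out/psi.txt"       # CPU/IO/memory pressure (PSI)
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor > "$out/governor.txt"
ip -s link              > "$out/net.txt"          # interface error/drop counters
dmesg -T | tail -n 200  > "$out/dmesg-tail.txt"   # recent kernel messages
echo "baseline written to $out"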