Ubuntu 24.04: PHP-FPM keeps crashing — the log line you must find (and the fixes)

You upgraded to Ubuntu 24.04, everything looked fine, then your site started dropping into 502 Bad Gateway like it’s practicing for a disaster recovery drill. You restart php-fpm, it behaves for a few minutes, then faceplants again. You stare at Nginx logs like they owe you money.

Here’s the reality: PHP-FPM crashes are rarely “mysterious.” They’re usually one decisive log line away from being boring. Find that line, and you’ll stop guessing and start fixing.

The log line you must find (and why it matters)

If PHP-FPM is “crashing,” you want the line that says who killed it or what signal ended it. Not the Nginx 502. Not the WordPress warning. Not your app’s stack trace (yet). The line you must find is one of these:

  • Kernel OOM killer line: Out of memory: Killed process ... (php-fpm...) or oom-kill: entries.
  • systemd result line: Main process exited, code=killed, status=9/KILL or status=11/SEGV.
  • PHP-FPM child crash: child ... exited on signal 11 (SIGSEGV) or child ... exited with code 255.
  • Socket/listen failures: unable to bind listening socket, Address already in use, Permission denied.

Those lines decide your branch of reality:

  • OOM killer? Stop tuning PHP and start sizing memory, pm.max_children, and per-request memory behavior.
  • SIGSEGV? Treat it like a native crash: extension bug, Opcache JIT edge case, corrupted shared memory, or a bad module build.
  • Status 9/KILL without OOM? Likely systemd watchdog/timeouts, admin scripts, cgroup limits, or container memory limits.
  • Socket bind/permission problems? That’s a deployment/config mistake, not a performance issue.

One quote worth keeping on a sticky note, because it forces discipline:

Paraphrased idea — Werner Vogels: “Everything fails; build systems that expect failure and recover fast.”

We’re going to do the expecting part: grab the right line first, then fix the right problem. Your uptime budget deserves better than “restart and hope.”

Fast diagnosis playbook (first/second/third)

This is the fast path when production is bleeding and you don’t have time for interpretive log dance.

First: confirm what “crash” means in systemd’s eyes

  • Check service state and last exit status.
  • Decide: OOM vs signal vs configuration failure vs dependency issue.

Second: look for the killer (kernel OOM, systemd kill, or segfault)

  • Search journalctl for oom, killed process, SEGV, SIG.
  • If you see OOM: stop and fix memory and concurrency. Don’t “optimize PHP” yet.

Third: correlate with Nginx and pool logs to learn the blast radius

  • Do errors line up with traffic spikes, cron jobs, deploys, or backups?
  • Is it one pool or all pools?
  • Are requests slow/hanging before death? If yes, enable slowlog and timeouts.

That’s it. If you can’t find the killer line in 10 minutes, you’re probably looking in the wrong place or your logs are misrouted.

Interesting facts & context (the stuff that explains today’s weirdness)

  1. PHP-FPM wasn’t always “default PHP.” FastCGI Process Manager began as a third-party patch set and was merged into PHP core in PHP 5.3.3, which explains some “legacy” knobs that still exist.
  2. systemd changed the troubleshooting workflow. The journal often has the truth even when app logs don’t, because systemd records exit codes, signals, and restart loops.
  3. Exit code 139 is usually a segfault. In Linux conventions, 128 + signal number; signal 11 (SIGSEGV) becomes 139. It’s not magic, it’s arithmetic with consequences.
  4. OOM killer is a policy decision, not a bug. The kernel kills something to keep the system alive. If it chose PHP-FPM, it’s telling you which process was “most killable” under pressure.
  5. Opcache is a performance feature that lives in shared memory. When it misbehaves, it can take down multiple workers in ways that look random, because the shared state is the common denominator.
  6. Unix sockets are faster, but stricter. TCP listens are forgiving about permissions; Unix sockets are not. One wrong mode/owner and Nginx will scream while PHP-FPM insists it’s “running.”
  7. pm.max_children is not “how many CPU cores you have.” It’s how many concurrent PHP processes you allow, and memory is usually the real limiter long before CPU.
  8. Ubuntu upgrades can quietly swap defaults. Newer PHP builds, different systemd unit hardening, and updated OpenSSL/ICU libraries can change extension behavior and stability.

Joke #1: PHP-FPM is like a coffee shop—if you open the doors to unlimited customers with one barista, your “latency” becomes a lifestyle.

Practical tasks: commands, outputs, and decisions

You wanted real tasks, not vibes. Each task below includes: the command, a realistic sample output, what it means, and what decision you make.

Task 1: Check service state and exit status

cr0x@server:~$ systemctl status php8.3-fpm --no-pager
● php8.3-fpm.service - The PHP 8.3 FastCGI Process Manager
     Loaded: loaded (/usr/lib/systemd/system/php8.3-fpm.service; enabled; preset: enabled)
     Active: failed (Result: signal) since Mon 2025-12-29 09:14:11 UTC; 2min 3s ago
    Process: 18244 ExecStart=/usr/sbin/php-fpm8.3 --nodaemonize --fpm-config /etc/php/8.3/fpm/php-fpm.conf (code=killed, signal=SEGV)
   Main PID: 18244 (code=killed, signal=SEGV)
        CPU: 2.114s

Meaning: systemd saw PHP-FPM die by SIGSEGV. This is not “it got slow.” It crashed.

Decision: Investigate extension/Opcache/JIT/coredump rather than tuning timeouts first.

Task 2: Pull the decisive lines from the journal (last boot)

cr0x@server:~$ journalctl -u php8.3-fpm -b -n 200 --no-pager
Dec 29 09:14:11 server systemd[1]: php8.3-fpm.service: Main process exited, code=killed, status=11/SEGV
Dec 29 09:14:11 server systemd[1]: php8.3-fpm.service: Failed with result 'signal'.
Dec 29 09:14:11 server systemd[1]: php8.3-fpm.service: Scheduled restart job, restart counter is at 5.
Dec 29 09:14:11 server systemd[1]: Stopped php8.3-fpm.service - The PHP 8.3 FastCGI Process Manager.

Meaning: Repeating segfault; systemd is restart-looping.

Decision: Stabilize by disabling suspect features (JIT), consider coredumps, and identify the crashing request/extension.

Task 3: Check for kernel OOM killer evidence

cr0x@server:~$ journalctl -k -b | grep -Ei 'oom|out of memory|killed process' | tail -n 20
Dec 29 08:57:02 server kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/php8.3-fpm.service,task=php-fpm8.3,pid=17602,uid=33
Dec 29 08:57:02 server kernel: Out of memory: Killed process 17602 (php-fpm8.3) total-vm:1884320kB, anon-rss:812344kB, file-rss:0kB, shmem-rss:0kB, UID:33 pgtables:2308kB oom_score_adj:0
Dec 29 08:57:03 server kernel: oom_reaper: reaped process 17602 (php-fpm8.3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Meaning: PHP-FPM got killed by OOM, not “randomly crashed.” The SEGV in Task 2 might be a different incident; don’t mix timelines.

Decision: Treat as capacity/concurrency problem: reduce pm.max_children, fix memory leaks, or add memory/swap. Also check if PHP-FPM is in a cgroup with a memory cap.

Task 4: Identify which pool(s) are involved

cr0x@server:~$ ls -1 /etc/php/8.3/fpm/pool.d/
api.conf
worker.conf
www.conf

Meaning: Multiple pools. Failures may be isolated to one pool with different settings.

Decision: Narrow scope: find which pool log shows child exits; don’t assume “PHP is down” if one pool is broken.

Task 5: Check PHP-FPM pool logs for child deaths and slow requests

cr0x@server:~$ grep -R --line-number -E 'SIGSEGV|exited on signal|server reached pm\.max_children|slowlog|terminated' /var/log/php8.3-fpm.log | tail -n 30
14592:WARNING: [pool api] child 19044 exited on signal 11 (SIGSEGV) after 3.221146 seconds from start
14593:NOTICE: [pool api] child 19061 started
14610:WARNING: [pool www] server reached pm.max_children setting (20), consider raising it
14625:WARNING: [pool www] child 19104, script '/var/www/site/index.php' (request: "GET /") executing too slow (5.112 sec), logging

Meaning: Two separate issues: API pool is segfaulting; WWW pool is capacity-limited and slow.

Decision: Split work: stabilize API pool (extensions/JIT), then tune WWW pool concurrency/timeouts.

Task 6: Confirm listen socket and Nginx upstream alignment

cr0x@server:~$ ss -xlpn | grep -E 'php|fpm'
u_str LISTEN 0 4096 /run/php/php8.3-fpm.sock 32564  * 0 users:(("php-fpm8.3",pid=19212,fd=8))

Meaning: Socket exists and is listened by PHP-FPM.

Decision: If Nginx still throws 502, check permissions on the socket and Nginx user/group; it’s not a missing socket.

Task 7: Verify socket ownership and mode

cr0x@server:~$ stat -c '%n %U:%G %a' /run/php/php8.3-fpm.sock
/run/php/php8.3-fpm.sock www-data:www-data 660

Meaning: Socket is owned by www-data and group-writeable; typical for Nginx running as www-data.

Decision: If Nginx runs as nginx user, add it to the socket group or adjust listen.owner/listen.group/listen.mode in the pool.
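
If you need to change it, the directives live in the pool file. A minimal sketch, assuming the Nginx workers run as nginx and the pool file is www.conf (adjust names and paths to your setup):

; /etc/php/8.3/fpm/pool.d/www.conf (illustrative values)
listen = /run/php/php8.3-fpm.sock
listen.owner = www-data
listen.group = nginx          ; the group Nginx workers run as
listen.mode = 0660            ; group read/write, no world access

Reload PHP-FPM afterwards; the socket is recreated with the new ownership and mode.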

Task 8: Inspect Nginx error log for the precise upstream error

cr0x@server:~$ tail -n 30 /var/log/nginx/error.log
2025/12/29 09:14:12 [error] 20220#20220: *884 connect() to unix:/run/php/php8.3-fpm.sock failed (111: Connection refused) while connecting to upstream, client: 198.51.100.24, server: example, request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/run/php/php8.3-fpm.sock:", host: "example"
2025/12/29 09:14:20 [error] 20220#20220: *901 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 198.51.100.29, server: example, request: "POST /api HTTP/1.1", upstream: "fastcgi://unix:/run/php/php8.3-fpm.sock:", host: "example"

Meaning: First error: PHP-FPM wasn’t accepting (down or restarting). Second error: PHP-FPM accepted but didn’t respond fast enough.

Decision: Fix the crash/restart loop first. Timeouts are often the symptom of not enough workers or blocked I/O.

Task 9: Check whether PHP-FPM is memory-capped by systemd/cgroups

cr0x@server:~$ systemctl show php8.3-fpm -p MemoryMax -p MemoryHigh -p TasksMax -p OOMPolicy
MemoryMax=536870912
MemoryHigh=infinity
TasksMax=512
OOMPolicy=stop

Meaning: There is a 512 MiB cap for the service. Under load, you’ll hit it even if the host has free RAM.

Decision: Raise or remove the cap (carefully), then revisit pool sizing. If you didn’t set it, audit who did.

Task 10: Measure per-worker memory to size pm.max_children sanely

cr0x@server:~$ ps -o pid,rss,cmd -C php-fpm8.3 --sort=-rss | head -n 8
  PID   RSS CMD
19244 148320 php-fpm: pool www
19241 142908 php-fpm: pool www
19262 139440 php-fpm: pool api
19251 136112 php-fpm: pool www
19212  31240 php-fpm: master process (/etc/php/8.3/fpm/php-fpm.conf)

Meaning: Workers are ~135–150 MiB RSS each under current workload. 20 children can mean ~3 GB just for workers, plus Opcache and everything else.

Decision: If you have 2–4 GB RAM, pm.max_children=20 is a fantasy. Reduce children or reduce memory per request (app/limits), or add RAM.
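
If you want the arithmetic done for you, here is a rough sketch. It averages current RSS across all php-fpm8.3 processes (master included, which skews the average slightly low) against a hypothetical 1.5 GiB budget for PHP workers; the printed numbers are illustrative:

cr0x@server:~$ ps -o rss= -C php-fpm8.3 | awk -v budget_mb=1536 '
  { sum += $1; n++ }                                  # RSS is reported in KiB
  END {
    avg_mb = sum / n / 1024
    printf "avg RSS per process: %.0f MiB\n", avg_mb
    printf "suggested pm.max_children: %d\n", budget_mb / avg_mb
  }'
avg RSS per process: 124 MiB
suggested pm.max_children: 12

Treat the result as a ceiling, not a target: peak requests use more memory than the average you just measured.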

Task 11: Enable and read PHP-FPM slowlog to catch “it hung then died”

cr0x@server:~$ sudo grep -nE 'slowlog|request_slowlog_timeout' /etc/php/8.3/fpm/pool.d/www.conf
55:request_slowlog_timeout = 5s
56:slowlog = /var/log/php8.3-fpm-www-slow.log

Meaning: Slowlog triggers at 5s. That’s aggressive enough to catch pathological requests without drowning you.

Decision: If you don’t have slowlog enabled during incidents, you’re choosing ignorance. Turn it on in the affected pool(s), then reproduce under load.
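
If the pool doesn’t have it yet, the two directives are quick to add. A minimal sketch (threshold and log path are illustrative):

; /etc/php/8.3/fpm/pool.d/api.conf
request_slowlog_timeout = 5s
slowlog = /var/log/php8.3-fpm-api-slow.log

Reload PHP-FPM afterwards and watch the file once a slow request crosses the threshold.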

Task 12: Confirm PHP version and loaded modules for crash suspects

cr0x@server:~$ php-fpm8.3 -v
PHP 8.3.6 (fpm-fcgi) (built: Nov 21 2025 10:14:22)
Copyright (c) The PHP Group
Zend Engine v4.3.6, Copyright (c) Zend Technologies
    with Zend OPcache v8.3.6, Copyright (c), by Zend Technologies
cr0x@server:~$ php -m | grep -Ei 'opcache|redis|imagick|swoole|xdebug'
imagick
opcache
redis

Meaning: Opcache is present (expected). Imagick is a frequent “native crash” source because it binds to ImageMagick libraries; redis can also have version-sensitive behavior.

Decision: If you see SIGSEGV: temporarily disable high-risk extensions to isolate the culprit (extension loading is per FPM instance, not per pool, so plan a short window). The goal is to make the crash stop, then reintroduce.
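
On Debian/Ubuntu packaging, phpdismod/phpenmod toggle the per-SAPI ini symlinks, which is tidier than editing files by hand. A sketch, assuming imagick is the suspect:

cr0x@server:~$ sudo phpdismod -v 8.3 -s fpm imagick
cr0x@server:~$ sudo systemctl restart php8.3-fpm
cr0x@server:~$ php-fpm8.3 -m | grep -i imagick
cr0x@server:~$

No output from the last command means the module is no longer loaded in FPM. Re-enable it with phpenmod once you’ve confirmed or cleared the suspect.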

Task 13: Check core dump handling (so crashes become actionable)

cr0x@server:~$ coredumpctl list php-fpm8.3 | head
TIME                            PID   UID   GID SIG COREFILE  EXE
Mon 2025-12-29 09:14:11 UTC   18244    0     0  11 present   /usr/sbin/php-fpm8.3

Meaning: A core exists. That’s gold for debugging native crashes.

Decision: If you can, extract a backtrace in staging with debug symbols; in production, at least preserve the core and correlate with deploy changes and enabled modules.
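
When you’re ready to look inside, coredumpctl can hand the core straight to gdb. A sketch, reusing the PID from the sample above; the backtrace frames are purely illustrative, and you’ll want gdb plus any available -dbgsym packages installed for readable symbols:

cr0x@server:~$ sudo coredumpctl gdb 18244
...
(gdb) bt
#0  0x00007f3c2a1b4e22 in ?? () from /usr/lib/php/20230831/imagick.so
#1  0x000055d0e31a7c90 in execute_internal ()
...

Even a partial backtrace like this names the module that faulted, which is usually enough to decide what to disable or rebuild.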

Task 14: Verify file descriptor limits (sneaky crash/instability vector)

cr0x@server:~$ systemctl show php8.3-fpm -p LimitNOFILE
LimitNOFILE=1024

Meaning: 1024 open files might be too low for busy sites (sockets, files, logs, upstreams). When you hit it, you get weird failures: failed accepts, failed opens, sometimes cascading timeouts.

Decision: Raise LimitNOFILE in a systemd override if you’re at scale; then verify with runtime checks.
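
The unit file and reality can disagree after manual edits, so verify what the running master actually got. A quick sketch (the numbers are illustrative):

cr0x@server:~$ grep 'Max open files' /proc/$(pgrep -o php-fpm8.3)/limits
Max open files            1024                 524288               files

If the soft limit still shows the old value after your override, you edited the wrong unit or skipped daemon-reload and a restart.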

Task 15: Check AppArmor denials (Ubuntu loves AppArmor)

cr0x@server:~$ journalctl -k -b | grep -i apparmor | tail -n 10
Dec 29 09:02:41 server kernel: audit: type=1400 apparmor="DENIED" operation="open" profile="/usr/sbin/php-fpm8.3" name="/srv/secrets/api-key.txt" pid=18011 comm="php-fpm8.3" requested_mask="r" denied_mask="r" fsuid=33 ouid=0

Meaning: PHP-FPM process was denied access. That can cause fatal errors, endless retries, or cascading failures depending on app behavior.

Decision: Fix paths and permissions or adjust AppArmor profile. Don’t “chmod 777” your way into compliance.
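
To see whether a PHP-FPM profile is loaded at all, and to buy debugging time without disabling AppArmor entirely, the utilities from apparmor-utils help. A sketch (the profile name is illustrative; use whatever aa-status actually reports):

cr0x@server:~$ sudo aa-status | grep -i php
   /usr/sbin/php-fpm8.3
cr0x@server:~$ sudo aa-complain /usr/sbin/php-fpm8.3
Setting /usr/sbin/php-fpm8.3 to complain mode.

Complain mode logs denials instead of enforcing them. It’s a diagnostic aid, not a fix: return the profile to enforce mode once the paths are sorted out.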

Task 16: Validate configuration before restarting (stop shipping broken syntax)

cr0x@server:~$ php-fpm8.3 -t
[29-Dec-2025 09:20:55] NOTICE: configuration file /etc/php/8.3/fpm/php-fpm.conf test is successful

Meaning: Syntax is OK. Not a guarantee of correct semantics, but it avoids humiliating restarts.

Decision: Make php-fpm -t part of deploy pipelines and pre-restart hooks.

Crash patterns on Ubuntu 24.04: what they look like

Pattern A: OOM kills (the classic “it restarts, then dies again”)

Symptoms:

  • Intermittent 502s under load, worse during traffic spikes or batch jobs.
  • status=9/KILL in systemd, Out of memory: Killed process in kernel logs.
  • Workers show large RSS; memory use climbs until sudden death.

Mechanics:

  • Each worker is a process with its own memory footprint. With pm=dynamic, concurrency increases with demand, and memory usage increases with it.
  • Requests can allocate large memory for JSON, ORM hydration, image processing, PDF generation, or just a plain leak.
  • If you run everything on one VM (web + db + caches) you’re creating a cage match for RAM.

Pattern B: Segfaults (exit code 139, SIGSEGV)

Symptoms:

  • systemd shows status=11/SEGV or log shows child exited on SIGSEGV.
  • Often tied to a specific route, upload type, image operation, or extension call.
  • May begin after an upgrade: PHP minor version, library update, or extension rebuild.

Mechanics:

  • PHP itself is C. Extensions are C. A segfault means someone touched memory they shouldn’t. PHP userland can trigger it indirectly.
  • Opcache and JIT can amplify rare bugs because they alter execution paths and memory layouts.
  • Mixed packages or stale .so modules cause ABI mismatches; it can run for a while then crash.

Pattern C: Socket binding and permissions (not a crash, but looks like one)

Symptoms:

  • PHP-FPM fails to start: “unable to bind listening socket.”
  • Nginx error log: connect() ... failed (13: Permission denied).

Mechanics:

  • Stale socket file or wrong listen address.
  • Pool running under different user than expected; Nginx can’t access socket.
  • Two pools trying to bind the same socket path.

Pattern D: “Not crashed” — just hung or saturated

Symptoms:

  • Nginx says upstream timed out; PHP-FPM stays “active (running).”
  • Log line: server reached pm.max_children.
  • Slowlog shows the same bottleneck function: DB calls, network calls, filesystem I/O.

Mechanics:

  • You’re out of workers. The queue backs up. Nginx waits. Clients leave.
  • Or you have workers, but they’re blocked on something external and won’t return to the pool.

Fixes that actually stick (OOM, segfaults, sockets, limits)

Fix bucket 1: Stop OOM kills by sizing concurrency to memory

The unglamorous truth: most PHP-FPM “crashes” under load are self-inflicted by allowing more concurrency than RAM can support.

1) Reduce pm.max_children based on observed RSS

Use Task 10 to measure realistic RSS during typical and peak requests. Then do the math. If workers average 140 MiB RSS and you can spare 1.5 GiB for PHP workers, you’re in the ~10 child range, not 30.

In the pool file:

cr0x@server:~$ sudo grep -nE 'pm\.max_children|pm\.start_servers|pm\.min_spare_servers|pm\.max_spare_servers' /etc/php/8.3/fpm/pool.d/www.conf
90:pm.max_children = 20
91:pm.start_servers = 4
92:pm.min_spare_servers = 2
93:pm.max_spare_servers = 6

Decision: Drop pm.max_children to a safe number, then observe queueing and latency. It’s better to queue than to die.

2) Put a ceiling on per-request memory (strategically)

Don’t set memory_limit to something “huge” because one report page needs it. That’s how you hand each worker a grenade. Instead, handle heavy tasks asynchronously, or isolate them in a dedicated pool with strict controls.

3) Fix the “one pool does everything” anti-pattern

Create separate pools for:

  • Public web requests (latency sensitive)
  • API requests (often bursty)
  • Admin/cron/background (memory hungry, slower tolerance)

This limits blast radius. A runaway admin export shouldn’t evict your homepage from memory.
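
A minimal sketch of a dedicated background pool, with its own socket and stricter budget (names, paths, and limits are illustrative; tune them to your workloads):

; /etc/php/8.3/fpm/pool.d/jobs.conf (illustrative)
[jobs]
user = www-data
group = www-data
listen = /run/php/php8.3-fpm-jobs.sock
listen.owner = www-data
listen.group = www-data
listen.mode = 0660
pm = ondemand
pm.max_children = 4                        ; few workers; this pool is memory-hungry
pm.process_idle_timeout = 30s
request_terminate_timeout = 120s           ; jobs may be slow, but not infinite
php_admin_value[memory_limit] = 512M       ; generous here so interactive pools can stay strict

Point only cron, queue workers, and admin exports at this socket; web and API traffic keep their own pools and their own, tighter limits.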

4) Review systemd memory caps if present

If MemoryMax is set (Task 9), adjust it with a drop-in override. This is a surgical change, not a ritual.

cr0x@server:~$ sudo systemctl edit php8.3-fpm
# (editor opens; add the following)
# [Service]
# MemoryMax=infinity
# LimitNOFILE=65535
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart php8.3-fpm
cr0x@server:~$ systemctl status php8.3-fpm --no-pager
● php8.3-fpm.service - The PHP 8.3 FastCGI Process Manager
     Active: active (running) since Mon 2025-12-29 09:28:12 UTC; 3s ago

Decision: If you remove caps, compensate with sane pm.max_children and monitoring, or you’ll just move the failure to “whole VM OOM.”

Fix bucket 2: Stop segfaults by isolating the culprit

Segfaults are deterministic eventually. Your job is to shrink the search space.

1) Temporarily disable JIT if enabled

JIT can be fast; it can also be a crash amplifier for corner cases. If you don’t know whether JIT is on, assume someone turned it on “for performance” and then forgot.

cr0x@server:~$ php-fpm8.3 -i | grep -i opcache.jit
opcache.jit => tracing => tracing
opcache.jit_buffer_size => 128M => 128M

Turn it off (for now) in /etc/php/8.3/fpm/conf.d/10-opcache.ini or a dedicated override:

cr0x@server:~$ sudo sed -n '1,120p' /etc/php/8.3/fpm/conf.d/10-opcache.ini
opcache.enable=1
opcache.memory_consumption=256
opcache.jit=tracing
opcache.jit_buffer_size=128M

Decision: Set opcache.jit=off and opcache.jit_buffer_size=0, restart, and see if the crashes stop. If they do, you found a big lever.
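
A minimal sketch of a dedicated override file, so the change is obvious and easy to revert later (the filename is an assumption; any later-sorting file in conf.d wins for the same directive):

; /etc/php/8.3/fpm/conf.d/99-disable-jit.ini (illustrative)
opcache.jit = off
opcache.jit_buffer_size = 0

Restart PHP-FPM after adding it; reverting is just deleting the file and restarting again.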

2) Disable high-risk extensions to isolate the culprit

Per-pool php_admin_value works for INI settings, but it cannot unload an extension: modules are loaded once by the FPM master and shared by every pool. Practically, you isolate by running a separate FPM instance with its own php.ini and conf.d where possible, or by disabling the extension for the whole instance during a maintenance window (on Debian/Ubuntu, phpdismod -s fpm does this cleanly).

If Imagick is involved and crashes correlate with image operations, test by removing it in staging or temporarily disabling the feature flag that triggers it.

3) Capture a core dump and backtrace (the adult approach)

When the crash is expensive, stop guessing. Use coredumpctl (Task 13) and extract information. Even without full symbols you can often see the crashing module name.

4) Purge mixed packages / stale modules after upgrade

Ubuntu upgrades can leave old modules behind. Ensure you aren’t loading an extension built for a different PHP minor.

cr0x@server:~$ php -i | grep -E '^extension_dir'
extension_dir => /usr/lib/php/20230831 => /usr/lib/php/20230831

Decision: Verify extension_dir matches the installed PHP API version and that modules in that directory are from the same repo/build set.
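
One quick consistency check is to ask dpkg which package owns each loaded .so. A sketch (module and package names are illustrative):

cr0x@server:~$ dpkg -S /usr/lib/php/20230831/imagick.so
php-imagick: /usr/lib/php/20230831/imagick.so

If dpkg answers that no path matches the pattern, the module was dropped there by hand (pecl install, a leftover from an old build) and is a prime ABI-mismatch suspect after a PHP upgrade.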

Fix bucket 3: Make sockets and permissions boring

Socket issues waste time because they look like “PHP down” while PHP-FPM thinks it’s fine.

1) Ensure unique listen sockets per pool

If two pools listen on the same socket path, one wins, one fails, and your outage becomes a coin flip.

cr0x@server:~$ grep -R --line-number '^listen\s*=' /etc/php/8.3/fpm/pool.d/
/etc/php/8.3/fpm/pool.d/www.conf:34:listen = /run/php/php8.3-fpm.sock
/etc/php/8.3/fpm/pool.d/api.conf:34:listen = /run/php/php8.3-fpm.sock

Decision: Fix immediately: give each pool its own socket (or TCP port), and update Nginx upstreams accordingly.
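
The change itself is mechanical: one socket per pool, and a matching fastcgi_pass on the Nginx side. A sketch with illustrative paths:

; /etc/php/8.3/fpm/pool.d/api.conf
listen = /run/php/php8.3-fpm-api.sock

# /etc/nginx/sites-enabled/example, inside the API location block
fastcgi_pass unix:/run/php/php8.3-fpm-api.sock;

Restart PHP-FPM, reload Nginx, then confirm with ss -xlpn that each pool is listening on its own path.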

2) Align Nginx user with socket permissions

Check Nginx user:

cr0x@server:~$ grep -nE '^\s*user\s+' /etc/nginx/nginx.conf
1:user www-data;

Decision: If Nginx runs as nginx, then set listen.group=nginx (or add nginx to www-data group), and keep 660 permissions.

Fix bucket 4: Stop restart loops from becoming outages

systemd restart loops are useful until they aren’t. If PHP-FPM crashes instantly, systemd will keep trying, eating CPU, spamming logs, and making Nginx’s experience… exciting.

1) Slow down the restart loop while you debug

Add a systemd override with RestartSec. This buys you time and reduces log noise.

cr0x@server:~$ sudo systemctl edit php8.3-fpm
# [Service]
# RestartSec=5s

Decision: Use this as a temporary seatbelt, not a long-term fix. The goal is not “restart slower,” it’s “stop crashing.”

2) Use reload when safe, restart when necessary

Reload is gentler; restart is a hammer. For config changes that PHP-FPM can reload, do:

cr0x@server:~$ sudo systemctl reload php8.3-fpm
cr0x@server:~$ journalctl -u php8.3-fpm -n 5 --no-pager
Dec 29 09:33:10 server systemd[1]: Reloaded php8.3-fpm.service - The PHP 8.3 FastCGI Process Manager.

Decision: If you suspect memory corruption or extension issues, restart is safer. Reload won’t save you from a poisoned process state.

Joke #2: Restarting PHP-FPM every five minutes is not “self-healing.” It’s “self-soothing,” and it doesn’t impress your on-call rotation.

Three corporate-world mini-stories (what really goes wrong)

Mini-story 1: The incident caused by a wrong assumption

They migrated a mid-sized e-commerce stack to Ubuntu 24.04 over a weekend. The change plan was solid: new AMIs, blue/green, canary traffic. Monday morning, the canary looked fine. By lunchtime, support tickets lit up: sporadic checkouts failing with 502s.

The on-call engineer assumed “network flakiness” because the errors were intermittent and clustered around a payment API call. They chased upstream TLS, DNS caching, even firewall conntrack. They were smart, they were busy, and they were wrong.

The real clue was in the kernel journal: OOM kills against php-fpm. The new base image had a systemd drop-in setting MemoryMax for “hardening.” It was copied from a smaller service template where it made sense. Here, it was a trap. Under real traffic, a handful of large cart requests pushed PHP workers into the cgroup ceiling.

Once they stopped assuming “network,” the fix took an hour: raise the memory cap, reduce pm.max_children based on measured RSS, and split out the admin pool used for imports. They also added an alert on OOM kill events. Not because it’s fancy. Because it’s the one log line that ends arguments.

The interesting part wasn’t the fix; it was the failure mode. Intermittent 502s can be pure capacity. Your intuition will try to blame the network. The kernel will quietly tell you it’s memory.

Mini-story 2: The optimization that backfired

A different team had a latency problem. Pages were slow after an application upgrade. Someone proposed enabling Opcache JIT because they’d read it could improve CPU-heavy workloads. They enabled it globally, bumped the buffer, and declared victory after a quick benchmark.

Two days later, random PHP-FPM worker segfaults began. Not constant, not immediately reproducible. Just enough to trigger restarts during peak hours, which turned into customer-visible errors. The logs showed child exited on signal 11. That’s the kind of line that makes you stare at the ceiling for a minute.

They did the right thing next: they reverted JIT and the crashes stopped. That didn’t prove JIT was “bad.” It proved the change interacted with their particular extension set and request mix in a way the benchmark never exercised. The benchmark hit the happy path. Production always hits the weird path.

The postmortem was blunt: performance features are changes, not free lunch. They introduced a policy: enable JIT only per pool, behind a controlled rollout, and with crash-rate monitoring. The funny thing is the policy was cheaper than the incident. That’s usually how it goes.

Mini-story 3: The boring but correct practice that saved the day

A SaaS company ran multiple PHP pools: www for interactive traffic, api for partners, and jobs for background tasks. Each pool had its own socket, its own slowlog, and strict timeouts. It wasn’t glamorous. It was also the reason their incident was only mildly annoying.

One afternoon, a background job started generating huge PDFs due to a bad template change. Memory usage per request doubled. The jobs pool began hitting its own pm.max_children limit and queueing. The workers churned. But the public site stayed up.

Because the blast radius was contained, the on-call could debug without a customer-facing fire. Slowlog captured stacks showing the PDF generator path. They rolled back the template, cleared the job queue, and then reduced the memory limit for that pool so it would fail fast next time.

The lesson wasn’t “be clever.” It was “be partitioned.” When PHP-FPM crashes, isolation is your shock absorber. Separate pools are operational circuit breakers, and they cost almost nothing compared to downtime.

Common mistakes: symptom → root cause → fix

1) Symptom: Nginx shows 502; PHP-FPM is “active (running)”

Root cause: Wrong socket path in Nginx, or permission mismatch on the socket, or Nginx pointing to a pool that isn’t listening.

Fix: Use ss -xlpn to confirm the exact socket; confirm Nginx upstream matches; verify stat owner/mode; align listen.owner/listen.group/listen.mode.

2) Symptom: PHP-FPM keeps restarting; systemd shows status=9/KILL

Root cause: Kernel OOM kill or cgroup MemoryMax enforcement.

Fix: Check journalctl -k for OOM lines; inspect systemctl show for memory caps; reduce pm.max_children; fix memory-heavy endpoints; increase RAM if needed.

3) Symptom: Exit code 139 / status=11/SEGV

Root cause: Native crash in PHP or an extension, often tied to library updates, Opcache/JIT, Imagick, or ABI mismatches.

Fix: Disable JIT; remove/disable suspect extensions; capture core dump with coredumpctl; ensure consistent package set; reproduce on staging with the same binaries.

4) Symptom: “server reached pm.max_children” and rising latency

Root cause: Not enough workers for request concurrency, or workers are blocked on I/O (DB, network, filesystem), so they don’t return to the pool.

Fix: Use slowlog to find blocking calls; fix DB indexes or external calls; tune pm.max_children only after confirming memory headroom; consider caching and async jobs.

5) Symptom: PHP-FPM won’t start after upgrade

Root cause: Pool configuration error, duplicate listen sockets, or a stale pid/socket file.

Fix: Run php-fpm -t; grep pool files for duplicate listen; remove stale socket if necessary; restart cleanly.

6) Symptom: Random “Permission denied” on files that “should be readable”

Root cause: AppArmor denial or incorrect user/group in the pool.

Fix: Check kernel audit logs for AppArmor denials; keep secrets in approved locations; adjust profile or path; do not broaden permissions indiscriminately.

7) Symptom: Works for hours, then starts failing until restart

Root cause: Memory leak, file descriptor leak, or Opcache fragmentation/pressure under specific traffic patterns.

Fix: Track RSS growth per worker; raise LimitNOFILE if needed; add periodic graceful reload only if you understand the leak; fix the root cause in app/extension.

Checklists / step-by-step plan

Checklist A: Triage in production (15 minutes)

  1. Run systemctl status php8.3-fpm. Record exit status/signal.
  2. Pull journalctl -u php8.3-fpm -b. Find the line with SEGV, KILL, or bind errors.
  3. Search kernel logs for OOM: journalctl -k -b | grep -Ei 'oom|killed process'.
  4. Check Nginx error log for upstream error type (refused vs timeout vs permission).
  5. Confirm socket existence and permissions: ss -xlpn, stat.
  6. Identify which pool is failing (pool name in PHP-FPM logs).

Checklist B: Stabilize (same day)

  1. If OOM: reduce pm.max_children immediately and isolate heavy jobs into a separate pool.
  2. If SEGV: disable JIT (if enabled) and remove high-risk extensions temporarily; preserve core dumps.
  3. If timeout saturation: enable slowlog, set request_slowlog_timeout, and confirm pm.max_children isn’t too low.
  4. Set realistic request_terminate_timeout per pool to avoid infinite hangs.
  5. Raise LimitNOFILE if you’re approaching FD limits.

Checklist C: Make it not happen again (this week)

  1. Add alerting on kernel OOM kill events targeting PHP-FPM (a minimal check is sketched after this list).
  2. Track PHP-FPM pool metrics: active processes, listen queue, max children reached, request duration percentiles.
  3. Document pool intent and resource budgets: web vs api vs jobs.
  4. Test upgrades with representative traffic patterns, not just a home page curl.
  5. Make config validation (php-fpm -t) mandatory in deploy.
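
A minimal sketch of such an OOM check, suitable for a cron job or systemd timer feeding whatever alerting you already have (the script name, window, and exit-code convention are assumptions, not a standard):

#!/usr/bin/env bash
# oom-check.sh - exit non-zero if the kernel OOM-killed a php-fpm process recently (illustrative)
set -euo pipefail

WINDOW="${1:-10 minutes ago}"   # how far back to look; any journalctl --since expression works

if journalctl -k --since "$WINDOW" --no-pager | grep -Eiq 'killed process [0-9]+ \(php-fpm'; then
  echo "OOM kill of php-fpm detected since ${WINDOW}" >&2
  exit 1
fi
exit 0

Wire the non-zero exit into your monitoring (a failing systemd service, a cron mail, a pushed metric), and the “one log line that ends arguments” becomes an alert instead of an archaeology exercise.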

FAQ

Why is Ubuntu 24.04 making PHP-FPM crash “more” than before?

It usually isn’t Ubuntu “causing” it; upgrades change PHP versions, linked libraries, and systemd defaults. That can expose extension bugs, memory caps, or changed performance characteristics.

What log should I read first: Nginx, PHP-FPM, or systemd?

Start with systemd/journal (journalctl -u php8.3-fpm) and kernel logs for OOM. Nginx tells you the symptom; systemd/kernel tells you the cause.

What does “status=11/SEGV” mean?

The process died with a segmentation fault (SIGSEGV). Treat it as a native crash: suspect extensions, Opcache/JIT, ABI mismatches, or library issues.

What does “server reached pm.max_children” mean, and should I just increase it?

It means all workers were busy and new requests queued. Increasing it can help if you have memory headroom and the workers are not blocked. If you’re already near memory limits, increasing it will convert latency into an OOM kill.

How do I size pm.max_children properly?

Measure worker RSS under real workload (ps sorted by RSS), decide how much RAM you can dedicate to PHP workers, then divide. Keep buffer for OS, Nginx, caches, and spikes.

Is swap a valid fix for PHP-FPM OOM?

Swap can prevent immediate OOM kills, but it can also turn your server into a latency machine. Use swap as a safety net, not a capacity plan. If swap usage climbs under normal traffic, you’re undersized or misconfigured.

Can a single bad request crash all PHP-FPM workers?

Yes. A request that triggers a segfault in a shared extension path can repeatedly kill workers. Also, shared-state issues (Opcache shared memory) can create correlated failures across workers.

Should I use TCP instead of Unix sockets to stop 502s?

No. TCP vs socket is rarely the root cause. Use TCP if you need network separation or container boundaries; otherwise, fix the permission and listen configuration.

What’s the quickest way to catch slow or hanging PHP requests?

Enable request_slowlog_timeout and slowlog per pool, and set sane termination timeouts. Slowlog gives you stack traces for “why is it stuck?” without guessing.

Can systemd itself be killing PHP-FPM?

Yes—via resource limits (MemoryMax, TasksMax), watchdog/timeouts in unusual setups, or aggressive restart policies interacting with failing config. Inspect with systemctl show.

Conclusion: next steps you should take today

If PHP-FPM is crashing on Ubuntu 24.04, stop treating it like weather. Find the line that names the killer: kernel OOM, systemd signal, or PHP-FPM child exit. That single line tells you which toolbox to open.

Do these next steps in order:

  1. Grab systemctl status and journalctl -u; write down the exit status/signal.
  2. Check kernel logs for OOM kills; if present, fix concurrency vs memory immediately.
  3. If SEGV, disable JIT (if enabled), isolate extensions, and preserve core dumps.
  4. Split pools by workload and give each pool its own socket and resource budget.
  5. Turn on slowlog in the pool that hurts, then fix what it points at (DB, external calls, filesystem).

Once you’ve done that, PHP-FPM becomes what it should be: boring infrastructure. The best kind.
