Ubuntu 24.04 “Illegal instruction”: CPU flags vs binaries — fix deployments cleanly (case #56)

You upgrade to Ubuntu 24.04, your deploy pipeline stays green, and then a service falls over with a blunt message:
Illegal instruction. No stack trace you trust. No useful logs. Just a core dump and a pager that now owns your evening.

This failure is usually not “Ubuntu being weird.” It’s physics: the CPU executed an opcode it doesn’t support because your binary (or one of its libraries) assumed a different set of CPU flags than the machine actually has. The fix is not “reinstall the package” or “try a different kernel.” The fix is to make your build and your deployment agree on what instructions are legal.

Fast diagnosis playbook

When a process dies with Illegal instruction, you’re debugging an instruction set mismatch until proven otherwise. Don’t start by
reinstalling packages. Start by proving which opcode faulted and which CPU features are available.

First: confirm it’s really SIGILL and capture where

  • Check journald / kernel message for SIGILL and the faulting instruction pointer.
  • Confirm the binary that crashed (not just the wrapper script).
  • Get a backtrace (even a crude one) from a core file.

Second: compare CPU flags vs what the binary expects

  • Collect CPU flags from /proc/cpuinfo (and from inside the container/VM if applicable).
  • Identify if the binary was compiled with -march=native, -mavx2, or an x86-64 micro-architecture baseline.
  • Check if your libc is selecting an optimized variant via hwcaps directories.

Third: decide the clean fix path

  • If it’s your code: rebuild with the right baseline, publish multiple targets, or add runtime dispatch.
  • If it’s a third-party binary: get a compatible build, pin a compatible version, or change the CPU model exposed by your hypervisor.
  • If it’s a fleet heterogeneity problem: segment deployments by CPU capability and make your scheduler enforce it.

Reliability engineers like to say that hope is not a strategy.
In this specific case: “we hoped all hosts had AVX2” is not a plan. It’s an incident timeline.

What “Illegal instruction” actually means on Linux

On Linux, “Illegal instruction” almost always corresponds to a SIGILL signal delivered to the process.
The CPU attempted to decode/execute an opcode that is invalid for the current ISA level, or it hit a privileged/forbidden instruction in user space.
In production practice, the common cause is ISA extension mismatch: your binary uses AVX/AVX2/AVX-512, SSE4.2, BMI1/2, FMA, etc.,
but the CPU (or the virtual CPU presented to the guest) doesn’t support it.

The failure signature is rude: no helpful error, often no application logs, sometimes not even a useful backtrace if you didn’t enable core dumps.
You fix it by aligning what you compiled for with what you deployed onto.

One small but recurring confusion: SIGILL is not the same as a segfault. A segfault is memory access.
SIGILL is “the CPU refuses.” It can also happen due to corrupt binaries or bad JIT output, but in fleets it’s mostly feature mismatch.
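
If you want to see the mechanics without a crashing service: shells encode fatal signals in the exit status as 128 plus the signal number, and signal 4 is SIGILL. A quick sanity check (the service path matches the examples further down):

kill -l 4                                   # prints ILL: signal 4 is SIGILL
/usr/local/bin/myservice; echo "exit: $?"   # exit status 132 = 128 + 4 = killed by SIGILL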

Joke #1: An illegal instruction is the CPU’s way of saying, “I don’t speak that dialect,” except it expresses it by flipping the table.

Facts and context you can use in a postmortem

These are the kind of short, concrete details that help a team move from “mystery crash” to “understood failure mode.”

  1. SIGILL is older than your build system. Unix signals like SIGILL go back decades; it’s a first-class OS way to report illegal opcodes.
  2. x86-64 is not one thing anymore. Modern distros increasingly distinguish x86-64 baseline levels (often referred to as x86-64-v1/v2/v3/v4).
  3. SSE2 is part of the x86-64 baseline. Every 64-bit x86 CPU has it, which is why “x86-64 baseline” binaries can safely assume at least SSE2.
  4. AVX and AVX2 are not “free speed.” They can trigger frequency downclocking on some CPUs, so “compiled with AVX2” can be both faster and slower depending on workload.
  5. Virtualization can lie by omission. A VM may run on an AVX2-capable host but present a virtual CPU model without AVX2 to the guest, causing illegal instructions in guest binaries built for AVX2.
  6. Containers don’t get their own CPU. They share the host kernel and see the host CPU’s features directly, but your container image might have been built on a different machine with different assumptions.
  7. glibc can select optimized code paths at runtime. With “hardware capabilities” (hwcaps), libc can load optimized implementations depending on CPU features, changing behavior after upgrades.
  8. “Works on my machine” is often literally “works on my CPU.” Build boxes tend to be newer than production nodes; this mismatch is a classic silent foot-gun.
  9. Some language runtimes ship multiple code paths. Others don’t. If your runtime lacks runtime dispatch, your compiled extension modules might be the ones triggering SIGILL.

CPU flags vs binaries: where mismatches come from

The three ways you get SIGILL in real deployments

  1. You built “too new.” The binary includes instructions not supported by some fraction of the fleet.
    Common culprits: -march=native, building on a laptop/workstation CPU, or using an optimized vendor build.
  2. You deployed onto “too old.” Hardware refresh cycles are messy. The fleet ends up mixed:
    new nodes with AVX2, old nodes without; or newer Intel vs older AMD; or cloud instance families with different feature sets.
  3. Your platform masked CPU features. Hypervisors, live migration policies, or conservative CPU models can hide features.
    So you compile against one set of flags, but the runtime environment doesn’t match.

Why Ubuntu 24.04 shows up in these incidents

Ubuntu 24.04 is not “breaking CPUs.” What it does bring is a newer toolchain, newer libraries, and a more modern packaging environment.
That matters because:

  • A new compiler may enable different auto-vectorization patterns at the same optimization level.
  • Newer libraries may ship more aggressive optimized variants and select them based on hwcaps.
  • Your rebuilds triggered by the OS upgrade might have changed baseline flags (for example, CI runners upgraded and now compile for newer CPUs).

What to do with this knowledge

Treat ISA level like an API contract. If you don’t specify it, the toolchain will happily infer it. And it will infer it from the machine you built on,
which is almost never the worst CPU you deploy onto.
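
Two commands make that contract concrete. The first asks GCC what “native” would resolve to on the build machine; the second pins an explicit, portable baseline instead (x86-64-v2 is an example choice, and main.c/myservice are placeholders):

# What would -march=native mean on THIS machine? (resolved -march plus per-feature toggles)
gcc -march=native -Q --help=target | grep -E '^\s+-march=|^\s+-mavx2'

# Release builds: state the baseline explicitly instead of inheriting the build host's CPU.
gcc -O2 -march=x86-64-v2 -o myservice main.c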

Practical tasks: commands, outputs, decisions (12+)

These are the tasks I actually run when a service starts face-planting with SIGILL. Each task includes the command, realistic output,
what the output means, and the decision you make next.

Task 1: Confirm SIGILL in journald

cr0x@server:~$ sudo journalctl -u myservice --since "10 min ago" -n 50
Dec 30 10:11:02 node-17 systemd[1]: Started myservice.
Dec 30 10:11:03 node-17 myservice[24891]: Illegal instruction (core dumped)
Dec 30 10:11:03 node-17 systemd[1]: myservice.service: Main process exited, code=killed, status=4/ILL
Dec 30 10:11:03 node-17 systemd[1]: myservice.service: Failed with result 'signal'.

Meaning: systemd reports status=4/ILL. That’s SIGILL, not a segfault.
Decision: stop chasing memory corruption theories; move to ISA mismatch workflow and capture a core.

Task 2: Check if the kernel logged the faulting RIP (instruction pointer)

cr0x@server:~$ sudo dmesg --ctime | tail -n 20
[Mon Dec 30 10:11:03 2025] myservice[24891]: trap invalid opcode ip:000055d2f2b1c3aa sp:00007ffeefb6f1d0 error:0 in myservice[55d2f2b00000+1f000]

Meaning: “invalid opcode” is the kernel’s way of saying “CPU rejected the instruction.”
Decision: use the IP address with addr2line or gdb later; also confirm which binary image is executing.
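
A small worked example of that: the dmesg line says the image is loaded at 0x55d2f2b00000 and the faulting ip is 0x55d2f2b1c3aa, so the offset inside the binary is 0x1c3aa. If the binary was built with symbols or debug info, addr2line can turn that into a function and source line:

# ip (0x55d2f2b1c3aa) minus load base (0x55d2f2b00000) = 0x1c3aa for this PIE binary
addr2line -e /usr/local/bin/myservice -f -C 0x1c3aa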

Task 3: Identify the exact binary and its architecture

cr0x@server:~$ file /usr/local/bin/myservice
/usr/local/bin/myservice: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=a1b2c3..., for GNU/Linux 3.2.0, not stripped

Meaning: It’s a 64-bit x86-64 ELF. Nothing exotic like a wrong-arch container.
Decision: proceed to CPU flags comparison and library selection checks.

Task 4: Collect CPU flags from the host

cr0x@server:~$ lscpu | sed -n '1,25p'
Architecture:             x86_64
CPU op-mode(s):           32-bit, 64-bit
Address sizes:            46 bits physical, 48 bits virtual
Byte Order:               Little Endian
CPU(s):                   32
Vendor ID:                GenuineIntel
Model name:               Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Flags:                    fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 popcnt aes xsave avx

Meaning: This CPU has AVX but not AVX2 (no avx2 flag).
Decision: if your binary uses AVX2, it will SIGILL here. Next: verify whether the binary or a library expects AVX2.

Task 5: Confirm flags from /proc/cpuinfo (helps in containers too)

cr0x@server:~$ grep -m1 -oE 'flags\s*:.*' /proc/cpuinfo | cut -c1-260
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 popcnt aes xsave avx

Meaning: Matches lscpu. No AVX2.
Decision: treat this host as “x86-64 with AVX but without AVX2” for deployment targeting.

Task 6: Check whether the binary contains AVX2 instructions (quick scan)

cr0x@server:~$ objdump -d /usr/local/bin/myservice | grep -m1 -E '\bvpmaddubsw\b|\bvpbroadcastd\b|\bvpermd\b'
000000000000f7c0:	vpbroadcastd 0x10(%rdi),%ymm0

Meaning: That’s a YMM instruction commonly associated with AVX2. Not definitive on its own, but suspicious.
Decision: validate via gdb on the faulting address, or check build flags if you control the build.

Task 7: Use gdb with a core file to confirm the faulting instruction

cr0x@server:~$ coredumpctl gdb myservice
           PID: 24891 (myservice)
           UID: 1001 (svc-myservice)
        Signal: 4 (ILL)
     Timestamp: Mon 2025-12-30 10:11:03 UTC (3min ago)
  Command Line: /usr/local/bin/myservice --config /etc/myservice/config.yml
    Executable: /usr/local/bin/myservice
     Control Group: /system.slice/myservice.service
             Unit: myservice.service
          Message: Process 24891 (myservice) of user 1001 dumped core.

(gdb) info registers rip
rip            0x55d2f2b1c3aa
(gdb) x/3i $rip
=> 0x55d2f2b1c3aa: vpbroadcastd 0x10(%rdi),%ymm0
   0x55d2f2b1c3b0: vpmaddubsw %ymm1,%ymm0,%ymm0
   0x55d2f2b1c3b5: vpmaddwd %ymm2,%ymm0,%ymm0

Meaning: The crash is on vpbroadcastd, an AVX2 instruction. Your CPU doesn’t have AVX2. Case closed.
Decision: deploy a non-AVX2 build, add runtime dispatch, or constrain scheduling to AVX2-capable nodes.

Task 8: Find out what shared libraries the binary loads (and from where)

cr0x@server:~$ ldd /usr/local/bin/myservice | head -n 20
	linux-vdso.so.1 (0x00007ffeefbf9000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f4b5a9d0000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f4b5a8e9000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f4b5a6d0000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f4b5a6b0000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f4b5a4a0000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f4b5aa0f000)

Meaning: Normal dynamic linking; nothing obviously off-path.
Decision: if the binary itself isn’t AVX2 but a library is, you’ll need to inspect libraries too. Otherwise focus on rebuilding the app.
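
When you do need to check the libraries, a crude scan over everything ldd resolves is usually enough to shortlist suspects. A sketch, assuming binutils (objdump) is installed; hitting %ymm registers is a hint, not proof, because many libraries gate those paths behind runtime CPU checks:

for lib in $(ldd /usr/local/bin/myservice | awk '/=>/ {print $3}'); do
  # Flag libraries whose disassembly touches YMM registers (AVX/AVX2 code paths).
  if objdump -d "$lib" 2>/dev/null | grep -qm1 '%ymm'; then
    echo "contains AVX/AVX2 code paths: $lib"
  fi
done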

Task 9: Check glibc hwcaps selection (diagnose “it changed after upgrade”)

cr0x@server:~$ LD_DEBUG=libs /usr/local/bin/myservice 2>&1 | head -n 25
     24988:	find library=libc.so.6 [0]; searching
     24988:	 search path=/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v3:/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v2:/lib/x86_64-linux-gnu/tls:/lib/x86_64-linux-gnu
     24988:	  trying file=/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v3/libc.so.6
     24988:	  trying file=/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v2/libc.so.6
     24988:	  trying file=/lib/x86_64-linux-gnu/libc.so.6

Meaning: The loader searches hwcaps directories first. On some systems, a v3/v2 variant may get loaded.
Decision: if SIGILL started after upgrading glibc and you’re on old CPUs, ensure the right variant is selected (or remove an incompatible override).

Task 10: Verify which libc variant you actually loaded

cr0x@server:~$ LD_DEBUG=libs /bin/true 2>&1 | grep -E 'trying file=.*/glibc-hwcaps' | head -n 5
     25033:	  trying file=/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v3/libc.so.6
     25033:	  trying file=/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v2/libc.so.6

Meaning: The loader is searching the hwcaps paths on this host; which variant actually loads depends on the CPU’s capabilities and which variant files exist.
Decision: if you’re debugging a container or chroot, confirm the libc in that filesystem tree and not the host’s.
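
On glibc 2.33 and newer (Ubuntu 24.04 qualifies), the dynamic loader can also report this directly. A quick check, assuming the standard x86-64 loader path; the entries it marks as supported and searched are the variants that can actually be loaded on this CPU:

# Ask the dynamic loader which glibc-hwcaps subdirectories it would search on this CPU.
/lib64/ld-linux-x86-64.so.2 --help | grep -A8 'glibc-hwcaps'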

Task 11: Determine baseline ISA level supported (quick heuristic)

cr0x@server:~$ python3 - <<'PY'
import re
flags = open("/proc/cpuinfo").read()
m = re.search(r'^flags\s*:\s*(.*)$', flags, re.M)
f = set(m.group(1).split()) if m else set()
need_v2 = {"sse3","ssse3","sse4_1","sse4_2","popcnt","cx16"}
need_v3 = need_v2 | {"avx","avx2","bmi1","bmi2","fma"}
print("has_v2:", need_v2.issubset(f))
print("has_v3:", need_v3.issubset(f))
print("missing_for_v3:", sorted(list(need_v3 - f))[:20])
PY
has_v2: True
has_v3: False
missing_for_v3: ['avx2', 'bmi1', 'bmi2']

Meaning: This host is roughly “v2-ish” but not v3 (missing AVX2/BMI).
Decision: don’t deploy x86-64-v3 binaries here; build for v2 or provide a v2 fallback.

Task 12: Inspect build metadata from a Go binary (example of self-reporting)

cr0x@server:~$ go version -m /usr/local/bin/myservice | head -n 30
/usr/local/bin/myservice: go1.22.2
	path	example.com/myservice
	build	-ldflags="-s -w"
	build	CGO_ENABLED=1

Meaning: It’s a Go binary and CGO is enabled. That means native C/C++ libs or extensions could be injecting CPU-specific instructions.
Decision: if SIGILL occurs, inspect the CGO-linked libraries or rebuild with controlled CFLAGS (not -march=native).
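
If you own the Go build, you can pin both halves explicitly: GOAMD64 controls the microarchitecture level the Go compiler targets (Go 1.18+), and CGO_CFLAGS controls the C toolchain used for cgo code. A sketch; the baseline and package path are illustrative:

# Keep both the Go code generator and the cgo-compiled C code at a portable baseline.
GOAMD64=v2 CGO_ENABLED=1 CGO_CFLAGS="-O2 -march=x86-64-v2" \
  go build -o myservice ./cmd/myservice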

Task 13: For C/C++ projects, prove whether “native” snuck into build flags

cr0x@server:~$ strings /usr/local/bin/myservice | grep -E -- '-march=|-mavx|-mavx2|-msse4\.2' | head
-march=native
-mavx2

Meaning: The binary likely contains embedded compiler flags (not always, but common in builds with metadata).
Decision: assume the build is not portable. Rebuild with an explicit baseline: for example, -march=x86-64-v2 or a conservative target.
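
A cheap guard in the release pipeline keeps this from coming back. It only catches builds that record their flags (enabling -frecord-gcc-switches helps with that), so treat it as a tripwire rather than a guarantee; the artifact path is illustrative:

# Fail the release job if host-dependent tuning flags leaked into the binary.
if strings ./myservice | grep -Eq -- '-march=native|-mtune=native'; then
  echo "ERROR: non-portable compiler flags embedded in release artifact" >&2
  exit 1
fi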

Task 14: Validate container sees the same CPU flags (and catch “it fails only in one environment”)

cr0x@server:~$ docker run --rm ubuntu:24.04 bash -lc "lscpu | grep -E 'Model name|Flags' | head -n 2"
Model name:                           Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 popcnt aes xsave avx

Meaning: Containers see the host CPU flags (as expected).
Decision: if your container image crashes with SIGILL on this host, it’s the image’s binaries, not container CPU masking.

Task 15: Validate VM CPU model and exposed features (KVM/libvirt example)

cr0x@hypervisor:~$ sudo virsh dominfo appvm-03 | sed -n '1,12p'
Id:             7
Name:           appvm-03
UUID:           9c6d1c9e-3a9a-4f62-9b61-0a0f3c7a2c11
OS Type:        hvm
State:          running
CPU(s):         8
CPU time:       18344.1s
cr0x@hypervisor:~$ sudo virsh dumpxml appvm-03 | grep -nE '<model|<feature|</cpu>'
58:    <model fallback='forbid'>Haswell-noTSX</model>
59:    <feature policy='disable' name='avx2'/>
60:  </cpu>

Meaning: The VM is explicitly configured to disable AVX2. A guest binary compiled for AVX2 will SIGILL even if the host CPU supports it.
Decision: fix the VM CPU model (if safe) or build/deploy a non-AVX2 binary to that VM.
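
To see exactly which features the hypervisor is withholding, diff the flag sets seen on the host and inside the guest. A sketch, assuming you can reach the guest over SSH (the hostname is illustrative):

# Features present on the host but hidden from the guest show up in the output.
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | sort > /tmp/host-flags
ssh appvm-03 "grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | sort" > /tmp/guest-flags
comm -23 /tmp/host-flags /tmp/guest-flags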

Three corporate-world mini-stories (anonymized)

Story 1: The incident caused by a wrong assumption

A company ran a small on-prem Kubernetes cluster with a mixed bag of servers. Some were new-ish with AVX2; some were old but reliable, kept around
because “they still have plenty of RAM” and nobody wanted to touch the rack.

The team upgraded their CI runner fleet first. That quietly changed the compilation environment for a latency-sensitive service written in C++.
They also flipped a build option to “native” because they saw a microbenchmark improve on the CI runner machine.
It merged cleanly. Tests passed. The service deployed.

Then the rollout hit one of the older nodes. The pod restarted instantly and kept restarting. Logs showed one line: Illegal instruction.
The on-call did the normal things: delete pod, reschedule, drain node. The scheduler kept placing it back onto old nodes because there was no constraint.

The hidden assumption was simple: “All x86_64 is basically the same.” It isn’t. The fix was also simple, but required discipline:
rebuild with an explicit baseline target and publish it as the default artifact. They also added a node label for AVX2 and pinned the AVX2 build
only to labeled nodes. After that, the incident class disappeared.

Story 2: The optimization that backfired

Another organization used a vendor-provided binary for a data processing agent. The vendor offered two downloads:
a “standard” build and an “optimized” build. Someone chose “optimized” because the word sounds like free money, and the agent did run faster on their
staging environment.

Production was split across two cloud instance families. One family exposed AVX2; the other didn’t (or, more precisely, a particular generation didn’t).
The optimized binary assumed AVX2. Half the fleet crashed at startup with SIGILL, but only after a rolling upgrade reached the older family.

The backfire wasn’t only the crash. It was the operational blast radius. Restart loops overwhelmed logging. Autoscaling tried to compensate.
A downstream queue backed up because a portion of the agents were dead, and the “healthy” ones couldn’t keep up.
The optimization turned into a distributed failure amplifier.

The fix wasn’t heroic. They reverted to the standard build and then introduced an explicit “capability-aware” deployment:
separate node pools per instance family and separate artifact selection. The lesson was boring but expensive: if you don’t know the baseline CPU for your
fleet, you don’t get to run “optimized” binaries by default.

Story 3: The boring but correct practice that saved the day

A third team had a policy: every service artifact must declare a CPU baseline in its release metadata, and every cluster must publish a “minimum ISA”
fact that the scheduler enforces. It wasn’t glamorous. People grumbled. It felt like paperwork.

Then Ubuntu 24.04 entered the environment. A new toolchain version nudged performance-oriented builds to emit different vector instructions in hot paths.
A few services were rebuilt. One of them would have crashed on older nodes if deployed broadly.

It didn’t crash. The deployment system refused to schedule it on incompatible nodes because the artifact metadata said “requires x86-64-v3”.
The rollout went only to the nodes that met the requirement. Users never noticed. On-call never got paged. The team got to keep sleeping, which is the
correct outcome for almost all engineering work.

The postmortem was a non-event: a ticket to expand the v3 node pool, a note to keep building a v2 fallback for legacy pools, and a quiet appreciation
for guardrails that don’t negotiate.

Joke #2: The only thing faster than an AVX2 build is an AVX2 build crashing instantly on an AVX-only CPU.

Containers and VMs: the CPU you think you have vs the CPU you get

The CPU feature story is different depending on whether you’re on bare metal, containers, or virtual machines.
Production outages happen when teams apply the wrong mental model.

Containers: same CPU, different assumptions

Containers don’t emulate the CPU. If your container executes vpbroadcastd on a CPU without AVX2, it will die the same way a host process dies.
The mismatch usually comes from the build environment:
images built on newer runners, or multi-stage builds that compile with “native” optimizations because nobody pinned CFLAGS.

The clean approach: build portable artifacts, then optionally ship additional “accelerated” variants and choose at runtime or via node scheduling.
Do not ship one “optimized” image and pray the cluster is homogeneous. It never is for long.
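
On Kubernetes, “choose via node scheduling” can be as simple as a capability label plus a nodeSelector on the accelerated Deployment (the node and label names here are illustrative; Checklist C below builds on the same idea):

# Label a node that has AVX2, then target it from the accelerated workload only.
kubectl label node node-21 cpu.example.com/avx2=true
kubectl get nodes -L cpu.example.com/avx2
# In the accelerated Deployment, set spec.template.spec.nodeSelector to
#   cpu.example.com/avx2: "true"
# and leave the baseline Deployment unconstrained.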

VMs: CPU models, migration, and conservative defaults

Hypervisors often expose a virtual CPU model. This is not just branding; it determines which instruction set extensions are available in the guest.
Some environments intentionally hide features to allow live migration across a wider range of hosts.

If you compile inside a VM that exposes AVX2, then deploy to a VM that doesn’t, you get SIGILL. If you compile on bare metal with AVX2 and deploy to a VM
with a conservative CPU model, you get SIGILL. If you compile inside the same VM type but the hypervisor configuration differs, you get SIGILL.

The fix is governance: define CPU models per environment, document them as compatibility contracts, and make your build pipeline target those contracts.
“Whatever the cloud gives us” is not a CPU contract. It’s a surprise generator.

glibc hwcaps and why Ubuntu 24.04 can “suddenly” pick faster code

glibc has supported multiple optimized implementations of functions for a long time via mechanisms like IFUNC resolvers.
More recently, distributions have leaned into hwcaps directories:
filesystem paths that contain optimized library builds for specific x86-64 baseline levels.

The dynamic loader searches these directories first (as you saw in the LD_DEBUG=libs output).
On a CPU that meets the criteria, glibc can load a v2/v3 variant of a library, enabling faster implementations.
On a CPU that doesn’t, it should fall back.

When this goes wrong operationally, it’s often because of one of these patterns:

  • A chroot/container image contains hwcaps variants that don’t match the actual CPU it runs on (for example, copied from a different rootfs).
  • An environment variable or loader configuration causes unexpected search paths.
  • A third-party library bundles optimized code and does its own detection poorly.

The practical takeaway: if SIGILL appears after a base-image upgrade, don’t assume it’s only “your binary.”
It can be “the library variant you now load.” Prove it with core analysis and loader debug output.

Common mistakes: symptoms → root cause → fix

1) Crash immediately at startup after upgrade

Symptom: service starts then instantly dies with Illegal instruction (core dumped).

Root cause: a newly built binary assumes AVX2/SSE4.2/FMA due to build host flags or new toolchain decisions.

Fix: rebuild with an explicit baseline (-march=x86-64-v2 or conservative target), and enforce that baseline in CI.

2) Only some nodes crash; rescheduling “fixes” it

Symptom: the same container image runs fine on some nodes and crashes on others.

Root cause: heterogeneous fleet CPU features; image was built for the “better” half.

Fix: label nodes by capability; constrain scheduling; ship multiple image variants or a portable baseline with runtime dispatch.

3) Works on bare metal, fails in VM (or vice versa)

Symptom: binary runs on a workstation but SIGILLs on a VM guest.

Root cause: VM CPU model hides features (AVX2 disabled, conservative baseline for migration).

Fix: align VM CPU model with requirements, or compile for the guest’s exposed flags; don’t compile inside an environment with different flags than production.

4) Only one code path crashes under load

Symptom: service runs for a while, then SIGILL during specific operations.

Root cause: runtime dispatch/jitted code or a plugin selects an AVX2 path conditionally; or a rarely used function gets called.

Fix: capture core at crash; identify the library/function; disable the optimized path, or fix dispatch logic to check correct flags.

5) “But the host CPU supports AVX2” and it still crashes

Symptom: someone points at the spec sheet and insists the server supports the instruction.

Root cause: microcode/BIOS settings, VM masking, or container running on a different node than assumed.

Fix: trust lscpu and the core dump, not the spec sheet; validate flags in the actual runtime environment.

6) A third-party binary from the internet crashes on older hardware

Symptom: vendor tool works in staging, crashes in a “legacy” environment.

Root cause: vendor built for newer baseline (x86-64-v3) without providing a portable build.

Fix: insist on a baseline-compatible build; if unavailable, isolate to compatible nodes or replace the component.

Checklists / step-by-step plan

Checklist A: When you’re paged for SIGILL (operator workflow)

  1. Confirm SIGILL: check journalctl and systemd status. If it’s not status=4/ILL, stop and re-scope.
  2. Capture the crash location: look for invalid opcode ip: in dmesg.
  3. Ensure core dumps exist: if absent, enable them temporarily and reproduce in a controlled way (see the sketch after this checklist).
  4. Open the core: use coredumpctl gdb, disassemble at RIP, identify the instruction.
  5. Compare to CPU flags: does the CPU have the extension required by that instruction?
  6. Identify whether it’s your binary or a shared library: check backtrace frames and loaded objects.
  7. Decide mitigation: roll back to a portable build, pin to compatible nodes, or change VM CPU model.
  8. Write down the baseline contract and enforce it in CI/CD so you don’t do this again next week.
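
For step 3, a minimal way to make sure a single unit is allowed to dump core during the incident, assuming systemd-coredump is in use (the drop-in filename is illustrative):

# Permit core dumps for this unit, reproduce once, then inspect with coredumpctl.
sudo mkdir -p /etc/systemd/system/myservice.service.d
printf '[Service]\nLimitCORE=infinity\n' | sudo tee /etc/systemd/system/myservice.service.d/coredump.conf
sudo systemctl daemon-reload && sudo systemctl restart myservice
coredumpctl list myservice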

Checklist B: Clean build-and-release policy (team workflow)

  1. Declare a fleet baseline ISA (per environment). Example: “prod-x86 must support x86-64-v2; some pools support v3.”
  2. Build artifacts with explicit targets; forbid -march=native in release builds.
  3. If you need speed, ship multiple builds: baseline + accelerated (v3/v4). Make selection explicit (scheduler labels or runtime dispatch).
  4. Add a startup self-check: log detected CPU flags and refuse to start if requirements aren’t met (fail loudly, not randomly; a wrapper sketch follows this checklist).
  5. Maintain a compatibility test host (or VM CPU model) that mimics the oldest supported CPU. Run smoke tests there.
  6. Version and pin toolchains in CI to reduce “silent rebuild drift.”
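
For step 4, the self-check doesn’t need to live inside the application; a wrapper script in front of the real binary is enough. A sketch, assuming AVX2/FMA/BMI2 are the requirements of this particular build:

#!/usr/bin/env bash
# Refuse to start on a CPU that lacks the features this build was compiled for.
required="avx2 fma bmi2"
flags="$(grep -m1 '^flags' /proc/cpuinfo)"
for f in $required; do
  if ! grep -qw "$f" <<<"$flags"; then
    echo "FATAL: CPU is missing required feature '$f'; refusing to start" >&2
    exit 1
  fi
done
exec /usr/local/bin/myservice "$@"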

Checklist C: Kubernetes capability-aware deployment (practical)

  1. Label nodes based on CPU flags (AVX2 present or not).
  2. Use node selectors/affinity for accelerated workloads.
  3. Keep baseline image as default; use separate deployment for accelerated image.
  4. Monitor crash loops and correlate with node labels to catch drift early.

FAQ

1) Is “Illegal instruction” always a CPU flag mismatch?

No, but in production it’s the dominant cause. Other causes exist: corrupted binaries on disk, bad RAM, broken JIT codegen, or executing data as code.
Your first job is to prove the faulting instruction via a core dump.

2) Why did it start after moving to Ubuntu 24.04?

Because upgrades change toolchains and library selection behavior. Even if your source code didn’t change, your build outputs can.
Also, glibc hwcaps and optimized code paths might be selected differently after a distro upgrade.

3) If I compile with -O2, can the compiler emit AVX2 anyway?

Not unless your target allows it. The default target depends on compiler configuration and flags.
The real trap is when build systems inject -march=native or when you build on a CPU with broader capabilities and then ship the binary elsewhere.

4) What’s the cleanest “one artifact for everywhere” option?

Build for a conservative baseline (often x86-64-v2 in modern fleets, sometimes even older depending on your hardware) and use runtime dispatch in hot code.
You lose some peak performance but gain predictability and operational simplicity.

5) How do I know if a binary requires AVX2 without crashing it?

You can disassemble it and look for instruction mnemonics associated with AVX2, but the most reliable method is: run it under controlled conditions,
collect a core on SIGILL, and inspect the faulting instruction at RIP.

6) Why does it only crash under certain requests?

Some libraries use runtime dispatch: they check CPU flags and choose a fast path. If that detection is wrong, or if the fast path is in a plugin
only used for specific workloads, you see “random” crashes tied to certain inputs.

7) Can virtualization cause this even if the host CPU supports the instruction?

Yes. The guest sees the virtual CPU model, not the host. Hypervisors can disable features (intentionally, for compatibility or migration).
Always check CPU flags inside the guest.

8) What about ARM servers and “Illegal instruction”?

Same signal, different extensions. On ARM, you’ll see mismatches like compiling for newer ARMv8.x features and running on older cores,
or using crypto extensions not present. The workflow is the same: prove the opcode, compare to CPU features, align build targets.

9) Should we just standardize the fleet on AVX2 and move on?

If you can, yes—homogeneity reduces failure modes. But standardization is a program, not a wish. Until it’s complete, assume heterogeneity and deploy accordingly.

10) What’s the best long-term mitigation?

Treat ISA requirements as deploy constraints. Declare them, test them, enforce them. The “best” fix is the one that prevents the class of incident,
not the one that wins the current firefight.

Next steps you can execute this week

  1. Pick a baseline ISA for each environment (prod/staging/dev) and write it down in your platform contract.
  2. Audit release builds for -march=native and other host-dependent flags. Remove them from production artifacts.
  3. Add a “lowest-common-denominator” CI job that runs tests on a VM CPU model matching your oldest supported nodes (see the qemu sketch after this list).
  4. For heterogeneous fleets, label nodes by capability and enforce scheduling. Stop letting the scheduler “discover” compatibility the hard way.
  5. Enable core dumps in a controlled manner for services where SIGILL would be catastrophic; keep the knob ready for incident response.
  6. If you must ship accelerated builds, do it deliberately: baseline + v3/v4, clear selection logic, and clear rollback paths.
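
For step 3, you don’t necessarily need to keep old hardware around: QEMU’s user-mode emulator can run a single binary under a restricted CPU model as a smoke test. A sketch; package and CPU model names vary by distro and QEMU version, and this checks correctness, not performance:

sudo apt-get install -y qemu-user
# Nehalem predates AVX/AVX2, so any AVX2 instruction in the binary faults here
# even if the CI runner's real CPU supports it.
qemu-x86_64 -cpu Nehalem ./myservice --version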

The clean deployment story is not “we never use advanced CPU features.” It’s “we use them on purpose.” Ubuntu 24.04 didn’t betray you.
Your build pipeline did what you implicitly asked for. Now make the request explicit.
