Ubuntu 24.04: Certbot renews but your app still fails — fix permissions and reload hooks

You run certbot renew. It says “Congratulations.” Your monitoring says “Absolutely not.” Users see TLS errors, or the app keeps serving a certificate that expired yesterday. This is the particular kind of annoyance where the console is cheerful and production is on fire.

On Ubuntu 24.04, the usual culprit isn’t Let’s Encrypt itself. It’s the boring plumbing around it: file permissions, symlinks in /etc/letsencrypt/live, systemd timers that renew but don’t restart anything, and applications that can’t (or won’t) reload certs without a firm nudge.

What’s really happening when renewal “succeeds”

Certbot renewing a certificate is only one leg of a three-legged stool:

  1. Issuance/renewal: Let’s Encrypt signs a new certificate and Certbot stores it under /etc/letsencrypt/archive/<name>/, then updates symlinks in /etc/letsencrypt/live/<name>/.
  2. Access: Your application (nginx, Apache, HAProxy, a Java service, a container) must be able to read fullchain.pem and privkey.pem. “Able” means Unix permissions and path traversal through every parent directory. Not just “the file exists.”
  3. Reload: The process must reload the files (or be restarted) after renewal. Some daemons reload on SIGHUP. Some need a config test first. Some need a full restart. Some never reload certs at runtime and will happily serve the old one until the next deploy window.

Most failures are in legs 2 or 3. Certbot doesn’t automatically know how your service consumes certs. Also, in Ubuntu 24.04, systemd and packaged defaults push you toward automation (timers, services), which is great right up until nobody wires in the “reload the actual thing” part.

One operational truth: a renewed certificate is useless until the process presenting it has re-opened the keypair. Your client only cares about the bytes it receives during the TLS handshake, not what certbot printed.

Exactly one quote, because it’s still true decades later: “Hope is not a strategy.” — General Gordon R. Sullivan

Joke #1: TLS renewal without a reload hook is like changing the batteries in a smoke detector you haven’t installed. Technically progress, practically smoke.

Fast diagnosis playbook (first/second/third)

If you’re on call, you don’t want a lecture. You want the shortest path to “Is the cert renewed, readable, and loaded?” Use this order. It minimizes thrash.

First: what certificate is the client actually seeing?

  • Check the served certificate’s expiry and serial number from a client perspective.
  • If it’s old: this is a reload/selection/routing issue. Stop blaming Let’s Encrypt.
  • If it’s new but clients still fail: you may have chain issues, SNI mismatch, or the wrong vhost binding.

Second: does the service process have access to the private key path?

  • Confirm the file path in the service config matches your intended /etc/letsencrypt/live symlink.
  • Test permissions as the service user (or using namei to check traversal).
  • Look for AppArmor/SELinux restrictions (Ubuntu often: AppArmor).

Third: what is (or isn’t) triggering reload?

  • Certbot can run on a timer and renew quietly. Your nginx won’t magically notice.
  • Check Certbot logs for “Deploying Certificate” and hook execution.
  • Add a deploy hook that reloads your service only when renewal actually happened.

Only after those three do you dig into DNS challenges, ACME rate limits, firewall rules, or random cloud load balancer behavior. Those happen, but they’re not the common case when renewal says success.

Interesting facts and context (why this keeps biting teams)

  • Let’s Encrypt’s debut (2015) turned TLS into a default expectation, but it also turned “certificate renewal” into a recurring operational task rather than a calendar reminder.
  • Certbot’s storage model uses an “archive” plus “live symlink” layout specifically to enable safe atomic-ish updates: new files land in archive/, symlinks move in live/.
  • The private key in /etc/letsencrypt/live is normally 0600 root:root. That’s correct security posture, and also why non-root apps frequently break after “helpful” refactors.
  • systemd timers replaced cron for many packaged renewals because timers integrate with journald logging and service health. They also make it easier to forget that “renewal” is not “deployment.”
  • Nginx can reload configuration without dropping connections, but only if the reload is triggered. Without it, nginx will keep using whatever it loaded at start.
  • Apache’s graceful reload behavior differs by MPM and module stack; it’s capable, but broken permissions or a failed config test can cause it to keep running with the old cert.
  • ACME challenges (http-01, dns-01, tls-alpn-01) solve issuance, not deployment. Teams conflate “challenge succeeded” with “site fixed” because both happen in the same command output.
  • Snap-packaged Certbot changed paths and confinement behavior for some installs, which can surprise people migrating between Ubuntu versions or following stale blog posts.

Practical tasks: commands, outputs, and decisions (12+)

These are real operational tasks. Each one includes: command, realistic output, what it means, and the decision you make.

Task 1: Confirm what cert the client sees (expiry, subject, issuer)

cr0x@server:~$ echo | openssl s_client -servername app.example.com -connect 127.0.0.1:443 2>/dev/null | openssl x509 -noout -subject -issuer -dates
subject=CN = app.example.com
issuer=C = US, O = Let's Encrypt, CN = R11
notBefore=Dec 29 02:10:11 2025 GMT
notAfter=Mar 29 02:10:10 2026 GMT

What it means: The process on localhost:443 is serving a certificate valid until Mar 29. If your alert says “expired,” your alert may be checking a different endpoint, or a different proxy layer is serving the old cert.

Decision: If served cert is old/expired, jump to reload/permissions. If it’s fresh, check routing/SNI and intermediate chain.

Task 2: Check the certificate files Certbot thinks are current

cr0x@server:~$ sudo certbot certificates
Saving debug log to /var/log/letsencrypt/letsencrypt.log

Found the following certs:
  Certificate Name: app.example.com
    Serial Number: 4e6a0f9a4b3c17c2a3b9e1d0c4a1a9f2
    Key Type: ECDSA
    Domains: app.example.com www.app.example.com
    Expiry Date: 2026-03-29 02:10:10+00:00 (VALID: 89 days)
    Certificate Path: /etc/letsencrypt/live/app.example.com/fullchain.pem
    Private Key Path: /etc/letsencrypt/live/app.example.com/privkey.pem

What it means: Certbot’s view is fine: it has a valid cert and canonical paths.

Decision: If Certbot shows a new expiry but clients see old expiry, your service is not reading these paths, can’t read them, or hasn’t reloaded.

Task 3: Verify the live symlinks actually point to the newest archive version

cr0x@server:~$ sudo ls -l /etc/letsencrypt/live/app.example.com/
total 4
lrwxrwxrwx 1 root root  43 Dec 29 02:10 cert.pem -> ../../archive/app.example.com/cert4.pem
lrwxrwxrwx 1 root root  44 Dec 29 02:10 chain.pem -> ../../archive/app.example.com/chain4.pem
lrwxrwxrwx 1 root root  48 Dec 29 02:10 fullchain.pem -> ../../archive/app.example.com/fullchain4.pem
lrwxrwxrwx 1 root root  46 Dec 29 02:10 privkey.pem -> ../../archive/app.example.com/privkey4.pem
-rw-r--r-- 1 root root 692 Jun  1  2024 README

What it means: The live links now point to “4”. Renewal rotated the symlinks.

Decision: If the symlink number didn’t change after renewal, renewal might not actually have happened; check logs and rate limits.

Task 4: Confirm nginx/Apache config references the correct paths

cr0x@server:~$ sudo grep -RnE "ssl_certificate(_key)?\s" /etc/nginx/sites-enabled/ | head
/etc/nginx/sites-enabled/app.conf:12:    ssl_certificate /etc/letsencrypt/live/app.example.com/fullchain.pem;
/etc/nginx/sites-enabled/app.conf:13:    ssl_certificate_key /etc/letsencrypt/live/app.example.com/privkey.pem;

What it means: nginx is configured to use the expected symlink paths.

Decision: If you see /etc/letsencrypt/archive/... hardcoded, change it. Hardcoding archive files is how you guarantee pain next renewal.

Task 5: Check if the service is actually running and what user it runs as

cr0x@server:~$ systemctl status nginx --no-pager
● nginx.service - A high performance web server and a reverse proxy server
     Loaded: loaded (/usr/lib/systemd/system/nginx.service; enabled; preset: enabled)
     Active: active (running) since Mon 2025-12-29 02:11:03 UTC; 2h 14min ago
       Docs: man:nginx(8)
   Main PID: 1842 (nginx)
      Tasks: 3 (limit: 19092)
     Memory: 8.3M
        CPU: 1.742s
     CGroup: /system.slice/nginx.service
             ├─1842 "nginx: master process /usr/sbin/nginx -g daemon on; master_process on;"
             ├─1843 "nginx: worker process"
             └─1844 "nginx: worker process"

What it means: nginx is running. Workers are typically www-data.

Decision: If the service is failing or flapping, permissions errors will likely show in journald right here. If it’s running but serving old cert, it probably hasn’t reloaded.

Task 6: Look for permission errors in journald

cr0x@server:~$ sudo journalctl -u nginx -n 50 --no-pager
Dec 29 02:10:59 server nginx[1842]: nginx: [emerg] cannot load certificate "/etc/letsencrypt/live/app.example.com/fullchain.pem": BIO_new_file() failed (SSL: error:8000000D:system library::Permission denied:calling fopen(/etc/letsencrypt/live/app.example.com/fullchain.pem, r) error:10080002:BIO routines::system lib)
Dec 29 02:10:59 server systemd[1]: nginx.service: Control process exited, code=exited, status=1/FAILURE
Dec 29 02:10:59 server systemd[1]: nginx.service: Failed with result 'exit-code'.

What it means: Classic. nginx can’t read the cert file. Usually the unit has been changed so the master process no longer runs as root and can’t read root-only key material, or directory traversal into /etc/letsencrypt/live is blocked.

Decision: Don’t chmod 777 your way out. Fix permissions with a deliberate model (see the permissions section).

Task 7: Check path traversal permissions with namei (this catches the “directory is 0700” trap)

cr0x@server:~$ sudo namei -l /etc/letsencrypt/live/app.example.com/privkey.pem
f: /etc/letsencrypt/live/app.example.com/privkey.pem
drwxr-xr-x root root /
drwxr-xr-x root root etc
drwxr-xr-x root root letsencrypt
drwx------ root root live
drwxr-xr-x root root app.example.com
lrwxrwxrwx root root privkey.pem -> ../../archive/app.example.com/privkey4.pem

What it means: /etc/letsencrypt/live is 0700, so non-root users can’t traverse into it, even if the file itself had more permissive bits.

Decision: If your app runs as non-root and reads directly from live, it will fail. Either make the app read from a controlled copy path, or use ACLs carefully.

Task 8: Validate the certificate files parse correctly (catch partial writes or wrong file referenced)

cr0x@server:~$ sudo openssl x509 -in /etc/letsencrypt/live/app.example.com/fullchain.pem -noout -text | grep -E "Not After|Subject:"
            Not After : Mar 29 02:10:10 2026 GMT
        Subject: CN = app.example.com

What it means: The on-disk file is a valid X.509 cert and has the expected expiry.

Decision: If parsing fails, you might have pointed at the wrong file or have corruption. Fix before touching reloads.

Task 9: Confirm Certbot’s timer/service is present and when it last ran

cr0x@server:~$ systemctl list-timers --all | grep -E "certbot|letsencrypt"
Mon 2025-12-29 14:32:16 UTC 10h left      Mon 2025-12-29 02:07:41 UTC 2h 17min ago certbot.timer                certbot.service

What it means: Renewal is handled by a timer. That’s fine, but you need to see what the service actually does.

Decision: Inspect certbot.service definition and hook behavior next.

Task 10: Inspect what the Certbot systemd service executes (where hooks may or may not be wired)

cr0x@server:~$ systemctl cat certbot.service
# /usr/lib/systemd/system/certbot.service
[Unit]
Description=Certbot
Documentation=file:///usr/share/doc/certbot/readme.Debian.gz
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/bin/certbot -q renew
PrivateTmp=true

What it means: It runs certbot -q renew quietly. No --deploy-hook here. So unless you configured hooks elsewhere, nothing reloads.

Decision: Add a deploy hook in /etc/letsencrypt/renewal-hooks/deploy/ or override the systemd unit (more on that later).

Task 11: Check Certbot logs for hook execution and renewal outcome

cr0x@server:~$ sudo tail -n 60 /var/log/letsencrypt/letsencrypt.log
2025-12-29 02:10:11,214:INFO:certbot._internal.renewal:Cert is due for renewal, auto-renewing...
2025-12-29 02:10:12,992:INFO:certbot._internal.client:Successfully received certificate.
2025-12-29 02:10:13,103:INFO:certbot._internal.storage:Writing new private key to /etc/letsencrypt/archive/app.example.com/privkey4.pem.
2025-12-29 02:10:13,214:INFO:certbot._internal.storage:Deploying certificate to /etc/letsencrypt/live/app.example.com/fullchain.pem.
2025-12-29 02:10:13,215:INFO:certbot._internal.storage:Deploying key to /etc/letsencrypt/live/app.example.com/privkey.pem.

What it means: Renewal happened. But there’s no evidence a deploy hook ran (those lines would show hook execution if configured).

Decision: Implement deploy hooks. If hooks exist but didn’t run, check file permissions/executable bits on the hook scripts.
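
If hooks exist but nothing reloaded, the executable bit is the first thing to rule out (the listing below is illustrative):

cr0x@server:~$ sudo ls -l /etc/letsencrypt/renewal-hooks/deploy/
total 4
-rw-r--r-- 1 root root 312 Dec 10 09:14 reload-nginx

A hook sitting in that directory without the execute bit simply won’t run; chmod 0755 it and test again.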

Task 12: Manual reload with config test (avoid restarting into a broken config)

cr0x@server:~$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful

What it means: Reload is safe.

Decision: Proceed to reload. If the test fails, fix config first—no hook should reload a daemon into failure.

Task 13: Reload the service and re-check the served cert

cr0x@server:~$ sudo systemctl reload nginx
cr0x@server:~$ echo | openssl s_client -servername app.example.com -connect 127.0.0.1:443 2>/dev/null | openssl x509 -noout -dates
notBefore=Dec 29 02:10:11 2025 GMT
notAfter=Mar 29 02:10:10 2026 GMT

What it means: After reload, the served cert matches the renewed cert.

Decision: Your fix is “ensure reload happens after successful renewal.” Now automate it with a deploy hook.

Task 14: Test permissions as the service user (the only test that matters)

cr0x@server:~$ sudo -u www-data bash -lc 'head -n 1 /etc/letsencrypt/live/app.example.com/fullchain.pem'
head: cannot open '/etc/letsencrypt/live/app.example.com/fullchain.pem' for reading: Permission denied

What it means: As expected, www-data cannot read the file. If nginx needs to read certs as www-data at reload time, you’ll fail.

Decision: Do not “fix” this by making private keys world-readable. Use a root-readable model with root-performed reload, or a controlled copy/ACL approach.

Fixing permissions without creating a security incident

The temptation is immediate: chmod -R 755 /etc/letsencrypt, reload, go home. That works right up until you realize you just made private keys readable to more principals than intended. In some environments, that’s an incident by itself.

Here’s the practical mental model:

  • Private key secrecy is the whole game. If an attacker reads privkey.pem, they can impersonate your service until the certificate is revoked/rotated and clients stop trusting the old chain.
  • Most daemons don’t need the key readable by the worker user if the master process starts/reloads as root and then drops privileges. nginx is a classic example: the master runs as root, reads keys, then workers run unprivileged.
  • Problems happen when you run the whole service as non-root (containers, hardened units, custom user) and still point it at /etc/letsencrypt/live.

Choose one of three sane patterns

Pattern A (preferred): Service reload runs as root; cert files stay root-only

If you can reload nginx/Apache/HAProxy as root via systemd, keep /etc/letsencrypt defaults. This is safest and easiest.

What you do: create a deploy hook that runs nginx -t then systemctl reload nginx. Cert files remain root-only. The reload is privileged, so the service can read the key.

Pattern B: Controlled copy to an app-readable directory (good for non-root apps)

Some apps (or containers) run fully non-root and must read key material directly. Don’t point them at /etc/letsencrypt/live. Instead:

  • Create a dedicated directory like /etc/ssl/app.example.com/ with tight ownership and permissions.
  • On deploy hook, copy fullchain.pem and privkey.pem there with install (sets mode/owner atomically enough for our purposes).
  • Reload the service after copying.

This reduces the blast radius: you don’t weaken /etc/letsencrypt, you expose only what the app needs, to the exact user it runs as.

Pattern C: ACLs on specific paths (use sparingly, document aggressively)

You can use POSIX ACLs to grant read/traverse rights to a service user for just the needed files/directories. This can work, but it’s easy to forget and hard to audit in a hurry.

If you choose ACLs, bake verification into your runbook. Otherwise your future self will “fix” it again with chmod at 3am.
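
If you do go this route, a minimal sketch looks like this, assuming a hypothetical service user appsvc (the key’s real file lives under archive/, so the user needs traverse rights on both directory trees plus read on the key itself):

cr0x@server:~$ sudo setfacl -m u:appsvc:x /etc/letsencrypt/live /etc/letsencrypt/archive
cr0x@server:~$ sudo setfacl -m u:appsvc:x /etc/letsencrypt/live/app.example.com /etc/letsencrypt/archive/app.example.com
cr0x@server:~$ sudo setfacl -m u:appsvc:r /etc/letsencrypt/live/app.example.com/privkey.pem
cr0x@server:~$ sudo -u appsvc head -c 1 /etc/letsencrypt/live/app.example.com/privkey.pem >/dev/null && echo "appsvc can read the key"
appsvc can read the key

Note the catch: the read ACL lands on the current archive file behind the symlink, and the next renewal writes a new key file without it. Re-apply (and verify) from a deploy hook, or you’re back to chmod at 3am.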

What not to do (unless you enjoy incident retros)

  • Don’t make privkey.pem group-readable by a broad group like www-data if that group contains other services. That’s lateral movement as a feature.
  • Don’t point services at /etc/letsencrypt/archive. Renewal increments filenames; your config won’t follow.
  • Don’t build hooks that restart critical services on every timer run even when nothing renewed. That’s self-inflicted churn.

Reload hooks done right (deploy hooks, systemd, and gotchas)

Certbot has multiple hook types. The one you want for “reload after successful renewal” is typically the deploy hook. It runs only when a certificate is actually renewed (or newly issued), which keeps your services stable on days nothing changes.

The deploy hook directory (the easiest, least surprising option)

Drop an executable script into:

  • /etc/letsencrypt/renewal-hooks/deploy/

Certbot will run it after it has written the new cert and updated live symlinks.

Example: nginx reload hook with safety checks

cr0x@server:~$ sudo install -d -m 0755 /etc/letsencrypt/renewal-hooks/deploy
cr0x@server:~$ sudo tee /etc/letsencrypt/renewal-hooks/deploy/reload-nginx >/dev/null <<'EOF'
#!/bin/bash
set -euo pipefail

# Only reload if nginx is installed and running.
if ! command -v nginx >/dev/null 2>&1; then
  exit 0
fi

if ! systemctl is-active --quiet nginx; then
  exit 0
fi

# Validate config before reload; fail hook if config is broken.
nginx -t

# Reload picks up new cert without dropping connections.
systemctl reload nginx
EOF
cr0x@server:~$ sudo chmod 0755 /etc/letsencrypt/renewal-hooks/deploy/reload-nginx

Why this script is opinionated: it does nothing if nginx isn’t present or active (useful on multi-role servers), and it refuses to reload a broken config. Hooks should be safe in the presence of unrelated changes.

Test the hook without waiting for renewal day

Use Certbot’s dry run. It simulates renewal against the staging environment and exercises the renewal machinery. One caveat: pre/post hooks run, but depending on your Certbot version, deploy hooks may be skipped during a dry run (recent releases offer a --run-deploy-hooks flag to include them).

cr0x@server:~$ sudo certbot renew --dry-run
Saving debug log to /var/log/letsencrypt/letsencrypt.log

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/app.example.com.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Simulating renewal of an existing certificate for app.example.com and www.app.example.com

Congratulations, all simulated renewals succeeded:
  /etc/letsencrypt/live/app.example.com/fullchain.pem (success)

What it means: The dry-run validated the ACME flow end to end. Whether your deploy hook also ran depends on your Certbot version and flags, so confirm via journald timestamps or /var/log/letsencrypt/letsencrypt.log rather than assuming it did.

Decision: If dry-run works but production renewal doesn’t reload, inspect file executability, SELinux/AppArmor, or Snap confinement differences.

When you should use --deploy-hook instead

If you want the hook tied to a specific invocation (say a special unit or a particular certificate), you can pass --deploy-hook on the command line. But on Ubuntu with systemd timers, the hook directory is usually simpler, because it applies consistently even when humans run certbot renew manually.
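
One detail worth knowing if you use the flag: Certbot saves a --deploy-hook into that certificate’s renewal configuration, so later plain runs of certbot renew keep executing it. You can see what’s recorded (the hook path here is just an example):

cr0x@server:~$ sudo grep renew_hook /etc/letsencrypt/renewal/app.example.com.conf
renew_hook = /usr/local/bin/reload-edge

If you later standardize on directory hooks, remove that line deliberately, or you’ll have two reload paths and the mystery double reloads mentioned below.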

systemd overrides: for when you must control behavior centrally

If your organization insists that “everything must be in systemd units,” override the service:

cr0x@server:~$ sudo systemctl edit certbot.service
cr0x@server:~$ sudo systemctl cat certbot.service
# /usr/lib/systemd/system/certbot.service
[Unit]
Description=Certbot
Wants=network-online.target
After=network-online.target

# /etc/systemd/system/certbot.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/certbot -q renew --deploy-hook "systemctl reload nginx"

What it means: You replaced ExecStart with one that contains a deploy hook. It will reload nginx only on actual renewals.

Decision: Pick either directory hooks or unit overrides. Don’t do both unless you like mystery double reloads.

Gotcha: reload vs restart

Prefer reload if the service supports it properly. It’s less disruptive. Use restart when:

  • The daemon can’t reload certs cleanly (some app servers).
  • You’re inside a container where “reload” doesn’t exist and you have to bounce the process.
  • You’ve verified reload does not pick up new keys (rare but real, depending on integration).

Joke #2: If your renewal hook uses restart for everything, you’ve reinvented “planned downtime,” just with more surprise.

Containers and reverse proxies: where reloads go to die

Ubuntu 24.04 didn’t invent containers, but it did inherit their favorite property: they make filesystem assumptions someone else’s problem. Certbot runs on the host, updates files on the host, and your TLS terminator might be:

  • nginx on the host (simple)
  • nginx in a container (file sharing and signaling issues)
  • Traefik/HAProxy in a container (dynamic reload options vary)
  • a cloud load balancer (certbot renewal is irrelevant unless you upload certs there)

Container case 1: host certs bind-mounted read-only

Common pattern: mount /etc/letsencrypt/live/app.example.com into the container. The container can read the files, but it won’t know they changed unless:

  • the process polls or watches for changes, or
  • you trigger a reload signal into the container.

Certbot renew succeeded; the container still serves old cert; everything looks “fine.” That’s not a cert problem. It’s a lifecycle problem.
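
A minimal deploy-hook sketch for that case, assuming Docker and a container named edge-nginx (both the runtime and the name are placeholders for your setup):

cr0x@server:~$ sudo tee /etc/letsencrypt/renewal-hooks/deploy/reload-edge-container >/dev/null <<'EOF'
#!/bin/bash
set -euo pipefail

CONTAINER="edge-nginx"

# Do nothing if the container isn't running on this host.
if ! docker ps --format '{{.Names}}' | grep -qx "${CONTAINER}"; then
  exit 0
fi

# Validate config inside the container, then reload without dropping connections.
docker exec "${CONTAINER}" nginx -t
docker exec "${CONTAINER}" nginx -s reload
EOF
cr0x@server:~$ sudo chmod 0755 /etc/letsencrypt/renewal-hooks/deploy/reload-edge-container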

Container case 2: non-root container can’t traverse /etc/letsencrypt/live

Even with a bind mount, the directory permissions can block access. Remember the earlier namei output showing live as 0700. If you mount /etc/letsencrypt broadly, you may still be blocked at the mount root. The right move is usually Pattern B: copy certs to a container-readable directory with strict permissions, and mount that.

Reverse proxy layer mismatch

Another classic: you renew certs on the app server, but TLS is terminated on a front proxy (nginx/HAProxy) or a load balancer. Your app never presents a certificate to clients, so renewing it does nothing. Meanwhile, the certificate that matters is sitting elsewhere, expiring in peace.

Operational advice: map the handshake path. The certificate that matters is the one on the first TLS hop from the client. Everything behind it is internal traffic unless you’re doing mTLS end-to-end.
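
Mapping it can be as simple as checking the public name and the backend separately; the backend address below (10.0.3.14:8443) is a placeholder for wherever your app actually listens:

cr0x@server:~$ echo | openssl s_client -servername app.example.com -connect app.example.com:443 2>/dev/null | openssl x509 -noout -serial -enddate
serial=09B3C5D1E2F40A6677C1D2E3F4051627
notAfter=Jan 02 11:47:02 2026 GMT
cr0x@server:~$ echo | openssl s_client -servername app.example.com -connect 10.0.3.14:8443 2>/dev/null | openssl x509 -noout -serial -enddate
serial=4E6A0F9A4B3C17C2A3B9E1D0C4A1A9F2
notAfter=Mar 29 02:10:10 2026 GMT

Different serials on different hops mean you renewed the backend’s certificate while the one clients actually validate, on the edge, is still marching toward expiry.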

Three corporate-world mini-stories (pain included)

Mini-story 1: The incident caused by a wrong assumption

They had a clean setup: Certbot on Ubuntu, nginx terminating TLS, and a couple of upstream services. Someone rotated a bunch of servers to a “hardened baseline” and proudly removed root privileges from several systemd units. nginx was among them, because “least privilege.” The change request looked reasonable; it even came with a security sign-off.

Certbot renewal day arrived. The timer ran, wrote new files, updated symlinks, and printed success. The deploy hook triggered a reload. nginx attempted to reopen the certificate files. And promptly failed because the unit was now running as an unprivileged user with no access to /etc/letsencrypt/live (which is 0700 at the directory level).

The assumption was subtle: “If nginx workers can run as non-root, then nginx can run as non-root.” Not always. nginx’s master process traditionally starts as root specifically so it can bind to low ports and read key material, then drops privileges for workers. Running it non-root changes what it can read, and suddenly cert renewal becomes a reliability event.

They fixed it by reverting that unit hardening for nginx (keeping worker privilege separation) and writing down an explicit rule: services that terminate TLS must have a clear, reviewed method to access private keys. The postmortem didn’t blame Certbot. It blamed the lack of an end-to-end test that validated “served cert expiry after renewal.”

Mini-story 2: The optimization that backfired

A platform team wanted to reduce reload churn. They had dozens of certificates across many vhosts and decided they’d run certbot renew hourly “just in case,” but only reload nginx once a day in a separate job. Fewer reloads, less noise, fewer chances of disrupting long-lived connections. The idea sounded clean in a spreadsheet.

Then they added a new domain with a certificate that renewed earlier than the daily reload time. Certbot renewed it happily, but nginx kept serving the old cert for almost 24 hours. One client with strict TLS checks started failing. The support team saw “renewal success” in logs and assumed it was a client issue. It wasn’t.

The optimization broke the hidden contract: renewal must be coupled to deployment. You can optimize reload frequency only if your TLS terminator supports loading certs dynamically per handshake (many don’t) or you implement a smarter reload condition. Otherwise you’re optimizing the wrong variable: reload count instead of correctness.

The fix was to reload on actual renewals only (deploy hook), and to add a guardrail check: “served certificate expiry is at least 20 days out” from the proxy itself. That replaced assumptions with measurement.

Mini-story 3: The boring but correct practice that saved the day

Another org had an unglamorous rule: every TLS endpoint must have a local script that prints the served certificate expiry and compares it to what’s on disk. The script ran as a daily health check and during deployments. Nobody loved it. Nobody put it on a T-shirt. It was just there.

One morning, a renewal happened and the proxy didn’t reload. The hook existed, but a packaging update replaced a systemd override and removed a custom --deploy-hook flag. The timer still ran. Certbot still renewed. The service stayed up and kept serving the old cert, so no alarms fired from uptime monitors.

The boring script caught it before users did: “served expiry does not match live expiry.” The on-call had a single command to run, a single place to look, and a single fix: restore the deploy hook via /etc/letsencrypt/renewal-hooks/deploy (which survived packaging changes better).

That practice didn’t prevent the misconfiguration, but it turned it into a calm maintenance ticket instead of a public outage. Boring, correct, and weirdly heroic.

Common mistakes: symptom → root cause → fix

1) “Certbot says renewed, but browser shows expired cert”

Symptom: certbot renew reports success; clients still see an old expiry.

Root cause: Service didn’t reload, or the TLS endpoint is not the service you renewed (proxy/load balancer mismatch).

Fix: Add a deploy hook to reload the correct process; verify from the client side using openssl s_client against the real endpoint.

2) “nginx fails to reload after renewal with Permission denied”

Symptom: journald shows BIO_new_file() failed ... Permission denied.

Root cause: Key/cert path not readable due to directory traversal (/etc/letsencrypt/live is 0700) or service runs non-root.

Fix: Keep nginx master privileged for key reads, or copy cert/key to a controlled directory readable by the service user.

3) “Reload hook runs, but service still serves old certificate”

Symptom: Hook executed; reload command succeeded; served cert unchanged.

Root cause: Service isn’t using /etc/letsencrypt/live paths, or you’re hitting a different vhost/SNI, or there’s another TLS terminator in front.

Fix: Confirm config paths with nginx -T / Apache vhost configs; check SNI with -servername; trace the handshake path.
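
The SNI part is quick to check (the default-vhost certificate shown here is illustrative):

cr0x@server:~$ echo | openssl s_client -connect 127.0.0.1:443 2>/dev/null | openssl x509 -noout -subject
subject=CN = default.invalid
cr0x@server:~$ echo | openssl s_client -servername app.example.com -connect 127.0.0.1:443 2>/dev/null | openssl x509 -noout -subject
subject=CN = app.example.com

If a probe hits the IP without SNI, it grades the default vhost’s certificate, not the one you renewed.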

4) “Everything works manually, but automation fails”

Symptom: Running certbot renew by hand works; systemd timer renews but doesn’t reload.

Root cause: Your manual command includes flags/hook; timer’s unit doesn’t. Or hook scripts aren’t executable in the timer context.

Fix: Put hooks in /etc/letsencrypt/renewal-hooks/deploy and ensure chmod 0755. Validate with certbot renew --dry-run.

5) “After renewal, some clients fail with chain errors”

Symptom: Intermittent failures, “unable to get local issuer certificate,” while some clients work.

Root cause: You configured ssl_certificate to cert.pem instead of fullchain.pem, or a mixed chain across proxies.

Fix: Serve fullchain.pem for nginx/HAProxy typical setups; test with openssl s_client -showcerts.
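
A quick sanity check on what’s actually presented (for a typical Let’s Encrypt setup, 2 means leaf plus intermediate):

cr0x@server:~$ echo | openssl s_client -servername app.example.com -connect 127.0.0.1:443 -showcerts 2>/dev/null | grep -c 'BEGIN CERTIFICATE'
2

A count of 1 means you’re serving only the leaf; point the config at fullchain.pem and reload.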

6) “Renewal works, but the container still serves old cert”

Symptom: Host files updated; container traffic shows old cert.

Root cause: Containerized process doesn’t reload, or mount points/permissions prevent seeing updates.

Fix: Implement a hook that signals the container (or restarts it) after renewal, or move to a proxy that supports dynamic cert reload.

Checklists / step-by-step plan

Checklist A: Stop the bleeding (restore valid TLS now)

  1. Check served cert expiry from the endpoint the user hits (openssl s_client with SNI).
  2. If served cert is old: reload the TLS terminator (systemctl reload nginx after nginx -t).
  3. If reload fails: read journald for permission errors and fix access model (do not chmod private keys broadly).
  4. Re-check served cert after reload. Only then close the incident.

Checklist B: Make renewal actually deploy (so you don’t do this again)

  1. Decide your permissions pattern: A (root reload), B (controlled copy), or C (ACLs).
  2. Create a deploy hook script in /etc/letsencrypt/renewal-hooks/deploy/.
  3. In the hook: test configuration (nginx -t / apachectl configtest) before reload/restart.
  4. Run certbot renew --dry-run and verify the hook executed via journald timestamps (mind the dry-run caveat above: deploy hooks may be skipped on some Certbot versions).
  5. Add a health check that compares served expiry vs on-disk expiry to catch broken hooks after updates (a minimal sketch follows this checklist).
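
Here’s a minimal sketch of that health check, assuming nginx terminates TLS locally on 443; the script name and the simple equality comparison are illustrative, so adapt them to your monitoring:

cr0x@server:~$ sudo tee /usr/local/bin/check-served-cert >/dev/null <<'EOF'
#!/bin/bash
set -euo pipefail

DOMAIN="app.example.com"
LIVE="/etc/letsencrypt/live/${DOMAIN}/fullchain.pem"

# Expiry of what clients are actually handed during the handshake.
served=$(echo | openssl s_client -servername "${DOMAIN}" -connect 127.0.0.1:443 2>/dev/null \
  | openssl x509 -noout -enddate | cut -d= -f2)

# Expiry of what Certbot wrote to disk.
ondisk=$(openssl x509 -in "${LIVE}" -noout -enddate | cut -d= -f2)

if [ "${served}" != "${ondisk}" ]; then
  echo "MISMATCH: served cert expires '${served}', on-disk cert expires '${ondisk}' - reload hook is probably broken" >&2
  exit 1
fi

echo "OK: served certificate matches on-disk certificate (expires ${served})"
EOF
cr0x@server:~$ sudo chmod 0755 /usr/local/bin/check-served-cert
cr0x@server:~$ sudo /usr/local/bin/check-served-cert
OK: served certificate matches on-disk certificate (expires Mar 29 02:10:10 2026 GMT)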

Checklist C: Permissions model for non-root services (Pattern B)

  1. Create a restricted directory owned by the service user (or a dedicated group) with 0750 or tighter.
  2. Use install in a deploy hook to copy cert and key with specific mode/owner.
  3. Point the app to the copied paths, not Let’s Encrypt’s live directory.
  4. Reload/restart the service after copying.
  5. Audit: confirm only intended principals can read the private key.

Example Pattern B hook: copy files and reload

cr0x@server:~$ sudo tee /etc/letsencrypt/renewal-hooks/deploy/publish-app-cert >/dev/null <<'EOF'
#!/bin/bash
set -euo pipefail

DOMAIN="app.example.com"
SRC_DIR="/etc/letsencrypt/live/${DOMAIN}"
DST_DIR="/etc/ssl/${DOMAIN}"

install -d -m 0750 -o root -g appsvc "${DST_DIR}"

# Copy with explicit permissions. Private key is readable only by root and group appsvc.
install -m 0644 -o root -g appsvc "${SRC_DIR}/fullchain.pem" "${DST_DIR}/fullchain.pem"
install -m 0640 -o root -g appsvc "${SRC_DIR}/privkey.pem"   "${DST_DIR}/privkey.pem"

# Validate service config if applicable, then reload.
if systemctl is-active --quiet appsvc; then
  systemctl reload appsvc || systemctl restart appsvc
fi
EOF
cr0x@server:~$ sudo chmod 0755 /etc/letsencrypt/renewal-hooks/deploy/publish-app-cert

Decision point: Make appsvc a tight group with only the service account. Don’t reuse a shared group just because it exists.
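
Creating that group is two commands, assuming a hypothetical runtime account app-runtime (the GID in the output is illustrative):

cr0x@server:~$ sudo groupadd --system appsvc
cr0x@server:~$ sudo usermod -aG appsvc app-runtime
cr0x@server:~$ getent group appsvc
appsvc:x:988:app-runtime

Anything else that shows up in that getent output later is scope creep; the audit step in Checklist C exists to catch it.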

FAQ

1) Why does Certbot renew successfully but my site still shows the old certificate?

Because renewal updates files on disk, not the running process. Your TLS terminator must reload or restart to re-read the cert/key.

2) Should I point nginx at /etc/letsencrypt/archive to avoid symlinks?

No. Archive filenames increment on renewal (fullchain4.pem, fullchain5.pem). Use /etc/letsencrypt/live so updates follow symlinks.

3) Is it safe to make /etc/letsencrypt/live readable by www-data?

Usually no. It expands who can read the private key. Prefer root-performed reloads (Pattern A) or a controlled copy to a dedicated directory with a dedicated group (Pattern B).

4) What’s the difference between a deploy hook and a post hook?

A deploy hook runs only when a cert is actually renewed/issued. A post hook runs after every Certbot run, even if nothing changed. For reloads, deploy hooks are the sane default.

5) Why does certbot renew --dry-run matter?

It validates the ACME flow against the staging environment and exercises pre/post hooks without waiting for real expiry. Deploy hooks may be skipped during a dry run depending on your Certbot version (recent releases offer --run-deploy-hooks to include them), so check the Certbot log to see what actually executed. It’s still the quickest way to shake out renewal and hook wiring issues before renewal day.

6) My service runs in Docker. How do I reload it from Certbot?

Either (a) mount a published cert directory into the container and send a signal/restart via the container runtime from a deploy hook, or (b) terminate TLS outside the container on the host proxy.

7) I reloaded nginx but clients still fail TLS. What now?

Check chain configuration (serve fullchain.pem), SNI mismatch (use -servername in tests), and whether a front proxy/load balancer is serving a different certificate.

8) Does Ubuntu 24.04 change anything about Certbot specifically?

The bigger change is packaging and automation expectations: systemd timers are common, Snap vs apt installs can differ, and confinement can change filesystem assumptions. Your hooks must match your install method.

9) How can I prove the running process has loaded the new cert?

Compare the served cert’s serial/expiry (via openssl s_client) with the on-disk live cert. If they match, the process loaded it. If not, it didn’t.
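
The comparison itself is two commands, reusing the same domain as the earlier tasks:

cr0x@server:~$ echo | openssl s_client -servername app.example.com -connect 127.0.0.1:443 2>/dev/null | openssl x509 -noout -serial
serial=4E6A0F9A4B3C17C2A3B9E1D0C4A1A9F2
cr0x@server:~$ sudo openssl x509 -in /etc/letsencrypt/live/app.example.com/fullchain.pem -noout -serial
serial=4E6A0F9A4B3C17C2A3B9E1D0C4A1A9F2

Matching serials mean the process re-read the renewed files; a mismatch means the reload step is still missing.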

Conclusion: next steps that stick

The reliable path on Ubuntu 24.04 is blunt and effective:

  1. Verify what clients see, not what Certbot claims.
  2. Keep private keys locked down. Fix access by design, not by chmod panic.
  3. Wire renewal to deployment with a deploy hook that tests config and reloads the right service.
  4. Add a health check that compares served expiry to on-disk expiry so packaging updates or refactors can’t quietly break you.

If you only do one thing after reading this: create a deploy hook in /etc/letsencrypt/renewal-hooks/deploy/ that safely reloads your TLS terminator, then prove it with certbot renew --dry-run plus a served-versus-on-disk expiry check. That’s how you turn “certificates are automated” from a slogan into a property of your production system.
