Proxmox SSL certificate broke: fast ways to restore Web UI access safely


The Proxmox Web UI is down, your browser is screaming about TLS, and someone is already asking if you “can just turn off HTTPS.”
Meanwhile, you’ve got VMs to babysit and a cluster that absolutely will not appreciate improvisation.

This is one of those outages that feels bigger than it is—until you make it bigger with a hasty “quick fix.”
The goal is simple: regain safe, authenticated access to port 8006 fast, without turning your hypervisor into a trust exercise.

Fast diagnosis playbook (check these first)

When Proxmox “breaks SSL,” you’re usually dealing with one of four buckets:
(1) time is wrong, (2) the proxy can’t load keys/certs, (3) the service isn’t listening, or (4) something in front is intercepting.
The playbook below is tuned for speed: find the bottleneck in minutes, not in a long, emotional SSH session.

First: is the UI actually down, or just untrusted?

  • If port 8006 is listening and you can fetch a certificate, your issue is likely trust/expiry/chain.
  • If port 8006 isn’t listening or the TLS handshake fails immediately, your issue is likely pveproxy startup or cert/key parsing.
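
A quick way to tell the two apart from any machine that can reach the node (10.20.0.11 is this article's example node; -k deliberately ignores trust, so only reachability and the handshake get tested):

cr0x@server:~$ curl -ksS -o /dev/null -w 'HTTP %{http_code}\n' https://10.20.0.11:8006/
HTTP 200

Any HTTP status here means TLS itself works and you're in trust/expiry/chain territory. A connection refused or handshake error points at pveproxy startup or its cert/key files.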

Second: check time, then services

  • A wrong clock makes valid certificates look “not yet valid” or “expired” instantly. Fix time before touching cert files.
  • pveproxy down means the UI is down regardless of certificate validity.

Third: check certificate integrity and chain

  • Key mismatch (private key doesn’t match cert) kills pveproxy.
  • Broken PEM (wrong format, missing headers, CRLF junk) kills pveproxy.
  • Incomplete chain breaks browsers and reverse proxies but may still let pveproxy run.

Fourth: did you break one node or the whole cluster?

  • In a cluster, cert handling can differ per node. Don’t assume one fix propagates unless you built it that way.
  • If you use a shared VIP/load balancer, one “bad” node can poison the experience unpredictably.

One rule: don’t disable TLS to “get in quickly.” That’s how “temporary” becomes “we should file an incident report.”

What actually broke when “the SSL certificate broke”

Proxmox VE serves the Web UI via pveproxy on TCP/8006. That proxy expects a certificate and private key it can read and parse.
If it can’t, it often won’t start. If it can start but the certificate is expired, not trusted, or missing intermediates, browsers and API clients may refuse to connect.

Common root causes hiding behind the same symptom

  • Expired certificate: you get a browser warning; automation (Terraform, scripts) may hard-fail.
  • Clock drift or wrong timezone: valid certificate appears expired or not valid yet.
  • Key/cert mismatch: pveproxy fails to start; port 8006 may be closed.
  • Permissions/ownership: private key unreadable by the service (or too open, depending on hardening).
  • Bad copy/paste: PEM format ruined; missing BEGIN/END lines; Windows line endings.
  • Wrong certificate installed: certificate for the load balancer hostname, but you access node IP (or vice versa).
  • Proxy in front: reverse proxy presents its own certificate; you fixed Proxmox but users still see the old cert.
  • SNI confusion: browser requests one name; proxy serves another certificate.

Joke #1: Certificates are like milk: the label looks fine until you actually open it and regret your life choices.

Interesting facts and historical context

A little context helps when you’re deciding whether to “just regenerate” versus “surgically fix”:

  1. TLS used to be called SSL; everyone still says “SSL cert” because “TLS certificate broke” sounds like a kernel panic.
  2. Browsers steadily tightened rules on certificate validity, name matching, and chain completeness; an old “works on my laptop” cert can suddenly stop working after a browser update.
  3. Let’s Encrypt popularized short-lived certificates (typically 90 days), trading long expirations for automation. If automation breaks, expiry shows up fast.
  4. Modern clients reject SHA-1 signatures and weak RSA keys; legacy internal PKI habits can become outages when crypto policy moves on.
  5. Time is a dependency for trust: X.509 validation is effectively “cryptography + clocks.” NTP failures have caused real outages across the industry.
  6. Intermediate CA certificates matter: many issuers rely on intermediates; if you install only the leaf cert, you get “works on some machines, fails on others.”
  7. SNI (Server Name Indication) allows multiple certificates on one IP:port. Without it, you’d still be burning IP addresses like it’s 2005.
  8. Self-signed defaults are common in infrastructure UIs because secure-by-default beats “ship without TLS.” The tradeoff is trust management.

Practical recovery tasks with commands (and how to decide)

Below are real tasks you can run over SSH on a Proxmox node. Each one includes: the command, the output you care about, and the decision you make next.
The theme: measure first, then change the minimum necessary.

Task 1: Confirm the node is reachable and you’re not chasing a network ghost

cr0x@server:~$ ping -c 2 proxmox01
PING proxmox01 (10.20.0.11) 56(84) bytes of data.
64 bytes from 10.20.0.11: icmp_seq=1 ttl=64 time=0.310 ms
64 bytes from 10.20.0.11: icmp_seq=2 ttl=64 time=0.295 ms

--- proxmox01 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1025ms

Decision: If packet loss or no route, fix networking/VLAN/routing before touching TLS. Certificates don’t fix ARP.

Task 2: Is port 8006 listening?

cr0x@server:~$ sudo ss -lntp | grep 8006
LISTEN 0      4096               0.0.0.0:8006       0.0.0.0:*    users:(("pveproxy",pid=1324,fd=6))

Decision: If you see pveproxy listening, the UI is up and your issue is likely certificate trust/expiry/chain.
If nothing is listening, go straight to service logs and certificate parsing.

Task 3: Check service health quickly

cr0x@server:~$ sudo systemctl status pveproxy --no-pager
● pveproxy.service - PVE API Proxy Server
     Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Wed 2025-12-24 09:41:12 UTC; 3min ago
    Process: 2041 ExecStart=/usr/bin/pveproxy (code=exited, status=1/FAILURE)

Decision: If it’s failed, don’t guess. Pull the journal next; it often spells out the exact file it can’t read or parse.

Task 4: Read the logs for the real reason (usually cert/key)

cr0x@server:~$ sudo journalctl -u pveproxy -n 60 --no-pager
Dec 24 09:41:12 proxmox01 pveproxy[2041]: cannot load certificate '/etc/pve/local/pve-ssl.pem': PEM routines:get_name:no start line
Dec 24 09:41:12 proxmox01 systemd[1]: pveproxy.service: Main process exited, code=exited, status=1/FAILURE
Dec 24 09:41:12 proxmox01 systemd[1]: pveproxy.service: Failed with result 'exit-code'.

Decision: “no start line” means PEM formatting is broken (bad copy/paste, wrong file, or binary data). Fix the file content, not the service.

Task 5: Check system time and NTP status before you touch certificates

cr0x@server:~$ timedatectl
               Local time: Wed 2025-12-24 09:44:31 UTC
           Universal time: Wed 2025-12-24 09:44:31 UTC
                 RTC time: Wed 2025-12-24 09:44:30
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

Decision: If “System clock synchronized” says no, or the time is wildly off, fix NTP first. Otherwise you may “fix” a cert that wasn’t broken.

Task 6: Validate the certificate dates (expiry and not-before)

cr0x@server:~$ sudo openssl x509 -in /etc/pve/local/pve-ssl.pem -noout -dates -subject
notBefore=Dec  1 00:00:00 2025 GMT
notAfter=Mar  1 23:59:59 2026 GMT
subject=CN = proxmox01.example.internal

Decision: If notAfter is in the past, renew/replace. If notBefore is in the future, your time is wrong or you installed the wrong cert.

Task 7: Confirm the certificate matches the private key (mismatch is a pveproxy killer)

cr0x@server:~$ sudo openssl x509 -noout -modulus -in /etc/pve/local/pve-ssl.pem | openssl md5
MD5(stdin)= 8b2f6a8d7efbd33a1c3c7f2e4e2a1b51
cr0x@server:~$ sudo openssl rsa -noout -modulus -in /etc/pve/local/pve-ssl.key | openssl md5
MD5(stdin)= 8b2f6a8d7efbd33a1c3c7f2e4e2a1b51

Decision: Hashes must match. If they don’t, you installed a cert that doesn’t belong to that key. Reinstall the correct pair or regenerate properly.
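
Note that the modulus comparison assumes an RSA key. If the key might be ECDSA (common with ACME-issued certificates), comparing public-key hashes works for any algorithm; this is plain OpenSSL, not Proxmox-specific:

cr0x@server:~$ sudo openssl x509 -in /etc/pve/local/pve-ssl.pem -noout -pubkey | sha256sum
b7f3a2c19e4d5f60a8c1d2e3f4051627b8c9dae1f203142536475869707a8b9c  -
cr0x@server:~$ sudo openssl pkey -in /etc/pve/local/pve-ssl.key -pubout | sha256sum
b7f3a2c19e4d5f60a8c1d2e3f4051627b8c9dae1f203142536475869707a8b9c  -

Same rule: the two hashes must match.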

Task 8: Check file permissions and ownership (the “works until reboot” classic)

cr0x@server:~$ sudo ls -l /etc/pve/local/pve-ssl.*
-rw-r----- 1 root www-data  1826 Dec 24 09:39 /etc/pve/local/pve-ssl.pem
-rw------- 1 root www-data  1704 Dec 24 09:39 /etc/pve/local/pve-ssl.key

Decision: If the key is world-readable, you’ve got a security issue. If it’s unreadable by the service context, you’ve got an availability issue.
Keep in mind that /etc/pve is the cluster filesystem (pmxcfs), which manages ownership and modes itself, so odd permissions there usually signal a deeper problem. Either way, don’t “chmod 777” your way into production.

Task 9: Check what certificate the node is actually presenting on 8006

cr0x@server:~$ openssl s_client -connect 10.20.0.11:8006 -servername proxmox01.example.internal -showcerts </dev/null
CONNECTED(00000003)
depth=0 CN = proxmox01.example.internal
verify error:num=20:unable to get local issuer certificate
verify return:1
---
Certificate chain
 0 s:CN = proxmox01.example.internal
   i:CN = Example Intermediate CA 01
---
SSL handshake has read 1657 bytes and written 407 bytes

Decision: If you see “unable to get local issuer certificate,” you likely have an incomplete chain.
That can be fine for internal systems with enterprise trust stores, but browsers and automation may still reject it.

Task 10: Validate the certificate chain file content (leaf + intermediates)

cr0x@server:~$ sudo awk 'BEGIN{c=0} /BEGIN CERTIFICATE/{c++} END{print c}' /etc/pve/local/pve-ssl.pem
1

Decision: If you intended to install a full chain, but the file contains only 1 cert, you’re missing intermediates.
Fix by concatenating leaf + intermediate(s) in the correct order (leaf first).
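
If you're assembling the chain by hand, the order is leaf first, then intermediates. The file names below are hypothetical placeholders for whatever your CA delivered; re-run the count before installing anything:

cr0x@server:~$ cat node-leaf.pem issuing-intermediate.pem > fullchain.pem
cr0x@server:~$ awk '/BEGIN CERTIFICATE/{c++} END{print c}' fullchain.pem
2

Only copy the result into place once this count and the key-match check from Task 7 both pass.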

Task 11: Quick check that you didn’t break DNS/name matching

cr0x@server:~$ sudo openssl x509 -in /etc/pve/local/pve-ssl.pem -noout -ext subjectAltName
X509v3 Subject Alternative Name:
    DNS:proxmox01.example.internal, DNS:pve.example.internal, IP Address:10.20.0.11

Decision: If you access via IP but the cert has only DNS names (or vice versa), users will get name mismatch.
Either change how people access it (preferred) or reissue the cert with correct SANs.

Task 12: Restart the right service and verify it stays up

cr0x@server:~$ sudo systemctl restart pveproxy
cr0x@server:~$ sudo systemctl is-active pveproxy
active

Decision: If it flaps back to failed, immediately re-check the journal; repeated restarts are not a strategy, they’re a denial.

Task 13: Confirm the API is responding locally (bypasses browser trust drama)

cr0x@server:~$ curl -k https://127.0.0.1:8006/api2/json/version
{"data":{"release":"8.2","repoid":"b2c3d4e5","version":"8.2.2"}}

Decision: If local curl works, your issue might be client trust, reverse proxying, or name mismatch—not Proxmox itself.

Task 14: If you use a reverse proxy, prove which certificate users are seeing

cr0x@server:~$ openssl s_client -connect pve.example.internal:443 -servername pve.example.internal </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer -dates
subject=CN = pve.example.internal
issuer=CN = Example Public CA R3
notBefore=Dec  1 00:00:00 2025 GMT
notAfter=Mar  1 23:59:59 2026 GMT

Decision: If the reverse proxy presents a different cert than the node, fix the proxy config/cert store. Don’t keep “fixing Proxmox” for a proxy problem.

Task 15: Verify cluster filesystem health (because certs may live in /etc/pve)

cr0x@server:~$ pvecm status
Cluster information
-------------------
Name:             labcluster
Config Version:   17
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Dec 24 09:49:18 2025
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.2a
Quorate:          Yes

Decision: If the cluster is not quorate or /etc/pve is unhealthy, you can see weird “file missing” behavior.
Fix quorum/cluster comms before you do large config edits.

Task 16: Confirm you’re editing the right files (Proxmox has opinions)

cr0x@server:~$ sudo ls -l /etc/pve/local/ | grep ssl
-rw------- 1 root www-data  1704 Dec 24 09:39 pve-ssl.key
-rw-r----- 1 root www-data  1826 Dec 24 09:39 pve-ssl.pem
-rw------- 1 root www-data  3272 Dec 20 14:02 pveproxy-ssl.key
-rw-r----- 1 root www-data  3150 Dec 20 14:02 pveproxy-ssl.pem

Decision: If the custom pair (pveproxy-ssl.pem/.key) exists, pveproxy uses it in preference to the default pve-ssl pair, so make sure you’re editing the file that’s actually loaded. If you’re unsure, the pveproxy journal names the exact path it tried to read.

Safe restoration paths: choose your repair strategy

Once you’ve identified the category—time, service, certificate content, or fronting proxy—pick a strategy that matches the situation.
There’s no prize for the most creative fix. There’s a prize for the fix that survives the next reboot.

Path A: UI is up, cert is expired/untrusted → renew or replace, keep service running

If ss shows pveproxy listening and local curl -k returns version JSON, you’re not in a hard outage.
You’re in a trust outage. That’s better.

  • If you already have Let’s Encrypt automation: repair automation, then renew.
  • If you use internal PKI: reissue with correct SANs, include the chain, and deploy consistently.
  • If you used the default self-signed: regenerate safely and distribute trust where needed (admins’ browsers, automation nodes).
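
Whichever issuance route you take, one low-drama way to deploy the result is the custom pair Proxmox looks for: /etc/pve/local/pveproxy-ssl.pem (full chain) and /etc/pve/local/pveproxy-ssl.key, which pveproxy prefers over the default pve-ssl pair when present. The source file names below are hypothetical; validate the pair (Task 7) before the restart:

cr0x@server:~$ sudo cp fullchain.pem /etc/pve/local/pveproxy-ssl.pem
cr0x@server:~$ sudo cp node.key /etc/pve/local/pveproxy-ssl.key
cr0x@server:~$ sudo systemctl restart pveproxy
cr0x@server:~$ sudo systemctl is-active pveproxy
active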

Path B: pveproxy won’t start → fix parsing/permissions first, then consider regeneration

The fastest safe recovery is usually: restore a known-good cert/key pair from backup or regenerate the Proxmox proxy certificate using the standard tooling.
What you want to avoid is hand-editing PEM files under stress without validation.
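
If the default self-signed pair itself is the casualty and there’s no backup, the stock tooling can regenerate it. A sketch assuming current Proxmox VE; confirm the flag against your version’s pvecm help before relying on it:

cr0x@server:~$ sudo pvecm updatecerts --force
cr0x@server:~$ sudo systemctl restart pveproxy
cr0x@server:~$ curl -k https://127.0.0.1:8006/api2/json/version
{"data":{"release":"8.2","repoid":"b2c3d4e5","version":"8.2.2"}}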

If you must “get in now” while you repair trust, your least-bad approach is SSH + local API checks.
Don’t weaken the system’s network exposure because your browser is impatient.

Path C: Reverse proxy in front → fix the proxy first or bypass it temporarily

If users hit pve.example.internal and that points to Nginx/HAProxy/Traefik, the certificate the browser sees may not be Proxmox’s certificate at all.
You can spend all day renewing on the node and still fail the handshake at the edge.

In incidents, I prefer a controlled bypass: access the node directly over the management network, verify it’s healthy, then repair the edge.
This prevents a proxy misconfiguration from masquerading as a Proxmox outage.
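
To prove where the divergence is, compare the SHA-256 fingerprint served at the edge hostname with the one served by the node directly (hostnames and fingerprints below are this article’s illustrative examples):

cr0x@server:~$ openssl s_client -connect pve.example.internal:443 -servername pve.example.internal </dev/null 2>/dev/null | openssl x509 -noout -fingerprint -sha256
SHA256 Fingerprint=4E:91:0C:7D:2A:5B:88:13:C6:F0:9E:42:7A:1D:55:E8:03:B2:6C:DD:41:98:AF:10:5C:27:EE:84:39:D0:62:B7
cr0x@server:~$ openssl s_client -connect 10.20.0.11:8006 -servername proxmox01.example.internal </dev/null 2>/dev/null | openssl x509 -noout -fingerprint -sha256
SHA256 Fingerprint=A7:33:F1:0B:96:58:2D:C4:71:EA:05:BE:68:9F:12:D3:4C:80:57:2E:FB:19:A6:C0:3D:74:8B:E5:90:1F:6A:D8

Different fingerprints mean the edge is presenting its own certificate, and that’s where the fix belongs.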

Path D: Time is wrong → fix time, then re-test before changing anything else

Certificates are time-bound. If time is off, everything looks broken:
certificate validation, token validity, sometimes even cluster interactions. Fixing time is almost insultingly effective.
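
A minimal time repair, assuming chrony is the NTP client (the default on current Proxmox releases); if you run a different daemon, substitute its equivalents:

cr0x@server:~$ sudo systemctl restart chrony
cr0x@server:~$ chronyc tracking
Reference ID    : 0A1400FE (ntp1.example.internal)
Stratum         : 3
System time     : 0.000211428 seconds fast of NTP time
Leap status     : Normal
cr0x@server:~$ timedatectl | grep -E 'synchronized|NTP'
System clock synchronized: yes
              NTP service: active

For a large offset, chronyc makestep (run as root) forces an immediate correction. Once the clock is sane, re-run the certificate date check before changing anything else.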

Joke #2: NTP is like toothpaste—only noticed when missing, and the situation gets messy fast.

Three corporate mini-stories (how this goes wrong in real life)

Mini-story 1: The outage caused by a wrong assumption

A mid-size company ran a small Proxmox cluster for internal services. They used the default Proxmox certificate on each node and trained admins to click through the browser warning.
It wasn’t elegant, but it was stable—until a new security team rolled out a corporate browser policy update.

The next morning, nobody could reach the Web UI. The assumption was, “Proxmox is down.” It wasn’t. Port 8006 was up, VMs were running, storage was fine.
The browser policy simply refused self-signed certs without a specific exception mechanism.

The team’s first move was to restart services and “regenerate certificates” on all nodes. That created a second problem: each node now presented a different self-signed cert than the day before,
so admins who had previously accepted the warnings were prompted all over again, and automation that had pinned the old certificates started failing outright.

The fix ended up boring: they deployed an internal CA-signed certificate with proper SANs, then pushed the internal CA to managed endpoints.
After that, the UI “magically” worked again, and the incident was reclassified as a policy change impact, not a Proxmox failure.

The lesson: “Browser says no” is not the same as “service is down.” Treat it as a trust pipeline failure until proven otherwise.

Mini-story 2: The optimization that backfired

A different org wanted “one clean URL” for Proxmox: a single hostname behind a reverse proxy with a publicly trusted certificate.
They fronted the cluster with a proxy that terminated TLS, then forwarded to each node on 8006. It worked nicely in demos.

Then someone optimized: they enabled aggressive health checks and “smart routing” based on HTTP responses.
Proxmox, being a management UI with auth, didn’t love being poked incorrectly. The proxy started flagging healthy nodes as unhealthy whenever login redirects briefly changed the HTTP responses it was probing.

During a later certificate renewal, they updated the proxy cert but forgot to update the upstream SNI/host header behavior.
Some clients saw the new cert; others saw an old one cached or served based on a different SNI route. The incident had the delightful property of being intermittent.

They eventually simplified: separate a stable TCP-level health check from HTTP-level assumptions, and route deterministically.
The “optimization” added fragility and obscured where TLS was actually terminated.

The lesson: reverse proxies can hide problems and create new ones. If you front Proxmox, be explicit about where certificates live and what your health checks mean.

Mini-story 3: The boring practice that saved the day

A regulated environment ran Proxmox with an internal PKI and short-lived certificates. Their ops team did two unsexy things:
(1) kept a configuration backup of cert/key material in an encrypted vault with change history, and (2) tested renewal in a staging node monthly.

One weekend, an intermediate CA was rotated. Monday morning, some automation that talked to the Proxmox API started failing with “unknown issuer.”
The Web UI was reachable for humans who had the updated trust store, but the automation nodes were behind on CA updates.

Because they had a clean record of what changed, they immediately confirmed the Proxmox nodes were serving the new chain correctly.
The fix wasn’t on Proxmox at all—it was pushing the new intermediate to the automation nodes’ trust stores.

No panic reissues. No random restarts. No guessing.
The incident was closed quickly because the team could prove where trust was broken and roll out a precise change.

The lesson: boring discipline beats heroics. When certs break, your ability to compare “before/after” is gold.

Common mistakes: symptom → root cause → fix

This is the section where future-you quietly thanks past-you for being honest.
These are common, specific failure modes I’ve seen in production.

1) Browser says “Your connection is not private” and refuses to proceed

  • Symptom: UI loads with a hard block; no “proceed anyway.”
  • Root cause: Managed browser policy disallows self-signed or unknown CAs; or cert is truly expired.
  • Fix: Use a CA-trusted certificate (internal PKI or public CA) with correct SANs; or renew the expired cert. Verify with openssl x509 -dates.

2) Port 8006 is closed; pveproxy keeps failing

  • Symptom: ss shows nothing on 8006; systemd shows failed.
  • Root cause: PEM file corrupted, wrong format, or key mismatch.
  • Fix: Read journalctl -u pveproxy. Validate PEM headers, match modulus hashes, restore known-good files, then restart.

3) “NET::ERR_CERT_COMMON_NAME_INVALID” or name mismatch warnings

  • Symptom: Cert is valid but for a different hostname; error mentions CN/SAN mismatch.
  • Root cause: Accessing by IP or a different DNS name than what’s in SANs; often after renaming a node or introducing a VIP.
  • Fix: Standardize access (preferred: one DNS name per node or one VIP name), reissue cert with SANs covering actual access patterns.
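
If the reissue goes through an internal CA, most of the battle is a CSR with the right SANs. A generic OpenSSL sketch (needs OpenSSL 1.1.1+ for -addext; names, paths, and key size are illustrative):

cr0x@server:~$ openssl req -new -newkey rsa:2048 -nodes \
    -keyout node.key -out node.csr \
    -subj "/CN=proxmox01.example.internal" \
    -addext "subjectAltName=DNS:proxmox01.example.internal,DNS:pve.example.internal,IP:10.20.0.11"

Submit node.csr to the CA, then deploy the returned leaf plus intermediates as a full chain.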

4) Works from some machines, fails from others

  • Symptom: One admin can log in; another gets issuer/chain errors.
  • Root cause: Missing intermediate certificate in the served chain; or different trust stores across clients.
  • Fix: Serve full chain (leaf + intermediates), and ensure clients trust the root/intermediate. Verify chain with openssl s_client -showcerts.

5) Cert “expired” right after renewal

  • Symptom: You renewed, restarted, still shows old expiry date.
  • Root cause: You updated the wrong certificate path; reverse proxy still serving old cert; browser cache; or multiple nodes behind a VIP and only one updated.
  • Fix: Confirm the presented cert with openssl s_client against the exact hostname users hit. Then update the actual termination point.

6) After “fixing permissions,” pveproxy starts but security got worse

  • Symptom: Someone ran permissive chmod; key is readable to too many users.
  • Root cause: Panic hardening/loosening without understanding service needs.
  • Fix: Restore minimal required permissions; keep private key readable only by root and the service account if needed. Document expected modes.

7) Cluster node UI works, but cluster join/auth breaks

  • Symptom: Web UI accessible; but inter-node operations fail or show auth errors.
  • Root cause: Confusing pveproxy UI cert with cluster communication/auth; or breaking /etc/pve sync/quorum while editing.
  • Fix: Verify cluster health with pvecm status, confirm /etc/pve is mounted and consistent, then apply changes node-by-node carefully.

Checklists / step-by-step plan

These plans assume you have SSH access (or console) to the host. If you don’t, fix that first.
Out-of-band access is not a luxury; it’s the “seatbelt” of hypervisor operations.

Checklist 1: Fast restore of Web UI when pveproxy is down (single node)

  1. Confirm time: run timedatectl. If wrong, fix NTP/time and retest. Do not touch certs yet.
  2. Check if 8006 listens: ss -lntp | grep 8006. If nothing, proceed.
  3. Inspect pveproxy logs: journalctl -u pveproxy -n 60. Identify exact error (PEM parse, permission, missing file).
  4. Validate certificate file format: open the referenced file and ensure it contains proper PEM blocks. Count cert blocks with awk.
  5. Verify key/cert match: modulus md5 check. If mismatch, stop and reinstall correct pair.
  6. Restore from known-good backup if available. This is usually the fastest safe move.
  7. Restart pveproxy: systemctl restart pveproxy, then check systemctl is-active.
  8. Verify local API: curl -k https://127.0.0.1:8006/api2/json/version.
  9. Then verify remotely: test with openssl s_client from an admin machine to confirm the served certificate.

Checklist 2: UI works but browser/clients reject the certificate (trust outage)

  1. Prove the service is up: ss -lntp and local curl -k.
  2. Inspect certificate dates: openssl x509 -dates. If expired, renew/reissue.
  3. Check SANs: openssl x509 -ext subjectAltName. If mismatch, reissue with correct names.
  4. Check chain completeness: openssl s_client -showcerts. If missing intermediates, serve full chain.
  5. Decide the trust model:
    • Internal CA: distribute CA to clients, then deploy leaf+intermediates.
    • Public CA: ensure validation method works (DNS/HTTP challenge) and renew automation is reliable.
    • Self-signed: acceptable only if you control client trust stores; otherwise you’re just scheduling future pain.
  6. Verify from at least two client types: managed browser + CLI tool (curl/openssl). Different trust behaviors catch different failures.
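
For the CLI half of that last step, testing against the CA bundle you actually expect clients to trust catches drift that curl -k would hide (the CA path here is a hypothetical example):

cr0x@server:~$ curl --cacert /usr/local/share/ca-certificates/example-internal-ca.crt https://proxmox01.example.internal:8006/api2/json/version
{"data":{"release":"8.2","repoid":"b2c3d4e5","version":"8.2.2"}}

If this succeeds but a managed browser still refuses, the remaining gap is client-side trust distribution or policy, not the node.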

Checklist 3: Cluster approach (avoid fixing one node and breaking the rest)

  1. Check quorum: run pvecm status on a healthy node first.
  2. Pick a “gold” node to validate the certificate procedure before touching others.
  3. Document the access path: direct node access vs VIP vs reverse proxy. Your certificate must match that reality.
  4. Apply changes one node at a time and verify:
    • pveproxy is active
    • 8006 listens
    • presented certificate matches expectation
  5. If behind a load balancer: temporarily drain a node before changing its cert to avoid user roulette.
  6. After all nodes: validate the VIP/proxy termination point separately.

FAQ

1) Can I just disable HTTPS on Proxmox to get the UI back?

You can, but you shouldn’t. The UI is an admin interface to a hypervisor—credentials and session tokens matter.
If you need emergency access, use SSH and local API checks while you restore TLS properly.

2) Why does Proxmox call it “SSL” when it’s really TLS?

Legacy naming. The industry kept saying “SSL certificate” long after TLS replaced SSL in practice. Your browser speaks TLS; your coworkers say SSL; everyone keeps shipping.

3) My certificate is valid but browsers still complain about “issuer” errors. What gives?

Usually an incomplete chain: you installed only the leaf certificate, but clients need the intermediate(s) to build trust to a root CA.
Confirm with openssl s_client -showcerts and fix by serving the full chain.
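
You can also validate a chain offline before deploying it, against the root you expect clients to trust (file names are hypothetical):

cr0x@server:~$ openssl verify -CAfile example-root-ca.pem -untrusted issuing-intermediate.pem node-leaf.pem
node-leaf.pem: OK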

4) Why did this happen right after a reboot?

Two frequent causes: time drift (NTP didn’t come up cleanly and the clock is wrong) or permissions/paths changed and pveproxy can’t read the key at startup.
Check timedatectl and journalctl -u pveproxy.

5) Is it safe to use the default self-signed Proxmox certificate?

It’s safe in the sense that the traffic is encrypted, but not automatically trusted.
In managed environments, self-signed certs often become operational debt because policy and tooling increasingly reject them.

6) I updated the certificate files but the browser still shows the old one. Why?

You’re probably not looking at the node you edited (VIP/load balancer), or a reverse proxy is terminating TLS.
Prove what’s served with openssl s_client against the exact hostname and port users hit.

7) What’s the safest “fast fix” if pveproxy won’t start and I’m under pressure?

Restore a known-good cert/key pair from a secured backup, restart pveproxy, and confirm local API response.
Regenerating is fine too, but restoring reduces the chance you introduce a mismatch or missing chain under stress.

8) Does fixing the Web UI certificate affect VM traffic or storage?

Not directly. This certificate is for the management plane (Web UI/API) via pveproxy.
VM networking and storage access usually keep running even when the UI is inaccessible—unless you break other services while troubleshooting.

9) How do I know if this is a Proxmox issue or a reverse proxy issue?

Test both endpoints. If direct node access on 8006 presents the correct certificate and works, but the public hostname fails, your proxy/edge termination is the culprit.

10) What’s the single most common cause you see?

Expiry combined with “we thought auto-renew was working,” followed by someone discovering the renewal job depended on a DNS token that expired months ago.
Second place: incomplete chains.

Next steps that prevent a rerun

Here’s the reliability truth: TLS outages are rarely “hard.” They’re just predictable, and predictability is insulting when you’re on call.
The fix isn’t heroics; it’s putting certificates on the same lifecycle rails as everything else you care about.

Do these next, while the incident is still fresh

  1. Write down the actual termination point: direct Proxmox, reverse proxy, or load balancer. One sentence. No ambiguity.
  2. Standardize access names: choose DNS names people must use, and reissue certificates with SANs that match reality.
  3. Automate renewal with alerts: if renewal fails, you should know long before users do. Track expiration dates and job status (a minimal expiry check follows this list).
  4. Keep a known-good rollback package: encrypted backup of cert/key/chain (and notes on where they live), plus a tested restore procedure.
  5. Test from two client perspectives: a managed browser and a CLI tool. This catches chain and policy issues early.
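
For the renewal-alerting item, even a cron-friendly one-liner beats nothing: openssl’s -checkend flag exits non-zero when the certificate expires within the given number of seconds (30 days below), so it slots into whatever alerting you already run:

cr0x@server:~$ sudo openssl x509 -in /etc/pve/local/pve-ssl.pem -noout -checkend 2592000 || echo "ALERT: certificate expires within 30 days"
Certificate will not expire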

Paraphrased idea from John Allspaw: reliability comes from systems and feedback loops, not from individual heroics. Build the loop; your future weekends will improve.
