Proxmox firewall locked you out: restore SSH/Web UI from console without panic

You changed a firewall rule in Proxmox. Now the Web UI times out, SSH is dead, and your stomach is trying to exit via your throat. You’re not “down.” You’re just locally available—like a museum exhibit.

This is a console-first recovery guide for when Proxmox’s firewall policy blocks your own management access. It’s written for people who run production systems: you want the quickest safe path back to SSH and port 8006, and you want to understand what failed so it doesn’t happen again.

Fast diagnosis playbook

If you’re locked out, your job is not to “debug everything.” Your job is to restore one working management path, then improve safely. Here’s the fastest route that avoids making things worse.

First: determine whether this is firewall, service, or networking

  1. Can you reach the box via console? If yes, proceed. If not, your problem is upstream (hypervisor console, IPMI/iLO/DRAC, physical access).
  2. Is the host network up? Check link state, IP address, default route. If networking is broken, the firewall isn’t the primary problem.
  3. Is the Proxmox management stack running? Check pveproxy (Web UI) and sshd (SSH). If services are down, fix services first.

Second: confirm the firewall is actually active

  1. Check whether Proxmox firewall is enabled at the datacenter/node level.
  2. Check whether pve-firewall is running and whether it installed rules into nftables/iptables.
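
A quick console check for both, assuming standard Proxmox tooling; the output and the PVEFW match below are illustrative, and chain names can vary by backend:

cr0x@server:~$ pve-firewall status
Status: enabled/running
cr0x@server:~$ iptables-save 2>/dev/null | grep -c PVEFW    # roughly: how many Proxmox-managed rules are loaded (0 if none)

If the status is disabled/stopped, or no PVEFW rules are loaded, the Proxmox firewall is probably not your blocker.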

Third: implement the least-risk recovery

  1. Temporarily disable the Proxmox firewall service (systemctl stop pve-firewall) to restore access, or
  2. Inject a narrow allow rule for SSH and 8006 from your admin IP(s), then reload firewall.

Fourth: verify externally, then fix policy properly

  1. Verify SSH and Web UI from a known admin host.
  2. Review the rule set that caused lockout (datacenter vs node vs VM, input vs output, interface scope).
  3. Implement a permanent “management allowlist” strategy, not a pile of exceptions.

Dry truth: if you’re unsure which rule is wrong, stop guessing. Disable the firewall, restore access, then correct rules in daylight.

How Proxmox firewall actually works (the parts that matter during an outage)

Proxmox VE has a firewall feature integrated into the UI, layered across scopes:

  • Datacenter rules: global across nodes (cluster-wide intent).
  • Node rules: host-specific rules, often for management/control-plane access.
  • VM/CT rules: applied to guests (useful, but not your current emergency).

Under the hood, Proxmox programs the host firewall through its pve-firewall service. Depending on the distribution and kernel version, rules land in nftables or the iptables compatibility layer. During a lockout, you care about two things:

  1. Policy defaults: input policy may effectively become “drop unless allowed” once firewall is enabled with a default deny stance.
  2. Management ports: SSH (typically 22) and Proxmox Web UI (8006/TCP via pveproxy).
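
If you want to see those defaults without a working UI, the policy lines live in the firewall config files under /etc/pve/firewall (they only exist once the firewall has been configured). A rough check; output below is abbreviated and illustrative:

cr0x@server:~$ grep -HE '^(enable|policy_in|policy_out)' /etc/pve/firewall/*.fw 2>/dev/null
/etc/pve/firewall/cluster.fw:enable: 1
/etc/pve/firewall/cluster.fw:policy_in: DROP
/etc/pve/firewall/cluster.fw:policy_out: ACCEPT

A scope with policy_in: DROP and no management allow rules is your prime suspect.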

Proxmox’s firewall is not magic, but it can feel like it when it blocks you and you’re staring at a console. Your recovery hinges on understanding that:

  • Stopping pve-firewall usually removes the rules it manages (that’s the “big red button”).
  • Restarting pveproxy doesn’t help if the firewall is dropping traffic before it hits the service.
  • Cluster rules can propagate and ruin your day across nodes if you used datacenter scope recklessly.

A paraphrased reliability idea from John Allspaw: incidents are often the result of normal work and local decisions that made sense at the time. Treat your lockout like an incident. You’ll fix it faster and learn more.

Interesting facts and historical context (so your brain stops guessing)

  • Port 8006 is Proxmox’s default Web UI port; it’s not “special,” just easy to block with a single bad rule.
  • Proxmox VE is Debian-based, which means your rescue toolkit is classic Linux: systemd, journalctl, iproute2, and nftables/iptables.
  • Linux firewalling evolved: iptables dominated for years; nftables is the modern replacement, but compatibility layers can confuse output during emergencies.
  • Stateful firewalls (conntrack) mean “allow established/related” rules can save existing sessions while blocking new ones—so your current SSH might live while everyone else dies.
  • Default drop policies are safer than default allow, but only if you pre-allow management. Security loves “deny by default.” Operations loves “don’t brick the host.” You can have both.
  • Cluster config propagation is a force multiplier: it makes changes consistent, and it makes mistakes consistent too.
  • Web UI availability depends on pveproxy and TLS; a firewall block looks identical to a dead proxy from the browser side.
  • Out-of-band consoles (IPMI/iLO/DRAC) exist because networks fail and humans misconfigure them; “console access” is not a luxury feature.
  • Firewall UI abstractions are helpful until they hide the underlying rule order and interface matching—then you learn the hard way that “simple” still has edge cases.

Get to a real console (no, not your half-working SSH tab)

When Proxmox locks you out, you need local access. Options, in order of sanity:

  1. IPMI/iLO/DRAC/KVM console: best for remote recovery. Use it.
  2. Physical console: crash cart, monitor, keyboard. Old-school, works.
  3. Provider console: if hosted, use their “VNC/Serial console.” Accept the jank.

Once you’re in, you want to avoid thrashing. Don’t start randomly editing every config file you can find. Take 90 seconds to confirm: network, services, firewall. Then act.

Joke #1: The firewall didn’t “break.” It just developed a strong sense of personal boundaries.

Console recovery tasks (commands, outputs, and decisions)

Below are practical tasks you can run from the console. Each includes: the command, what the output means, and the decision you make. Don’t run them blindly; follow the decision logic.

Task 1: Confirm you’re on the right node and not dreaming

cr0x@server:~$ hostnamectl
 Static hostname: pve-01
       Icon name: computer-server
         Chassis: server
      Machine ID: 2f5c0c0d3f3a4d44a1f8b3a2f0d0c111
         Boot ID: 7c2b3a7d2c3a4bdabf9dd0f66e1b2222
Operating System: Debian GNU/Linux 12 (bookworm)
          Kernel: Linux 6.2.16-20-pve
    Architecture: x86-64

What it means: You’re on the node you think you’re on, and it’s running a Proxmox kernel.

Decision: If this isn’t the correct node, stop and find the right one. If it is, proceed.

Task 2: Check link and IP addressing (basic network reality check)

cr0x@server:~$ ip -br link
lo               UNKNOWN        00:00:00:00:00:00
enp3s0           UP             3c:ec:ef:12:34:56
vmbr0            UP             3c:ec:ef:12:34:56
cr0x@server:~$ ip -br addr show vmbr0
vmbr0            UP             10.20.30.11/24 fe80::3eec:efff:fe12:3456/64

What it means: The bridge is up and has the management IP.

Decision: If the interface is DOWN or IP is missing, fix networking first (wrong VLAN, cable, bond, bridge config). Firewall changes won’t help if the host has no IP.

Task 3: Confirm default route and that you can reach your gateway

cr0x@server:~$ ip route
default via 10.20.30.1 dev vmbr0 proto kernel onlink
10.20.30.0/24 dev vmbr0 proto kernel scope link src 10.20.30.11
cr0x@server:~$ ping -c 2 10.20.30.1
PING 10.20.30.1 (10.20.30.1) 56(84) bytes of data.
64 bytes from 10.20.30.1: icmp_seq=1 ttl=64 time=0.401 ms
64 bytes from 10.20.30.1: icmp_seq=2 ttl=64 time=0.388 ms

--- 10.20.30.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1021ms
rtt min/avg/max/mdev = 0.388/0.394/0.401/0.006 ms

What it means: L2/L3 basic connectivity exists. You can likely recover management without re-cabling your brain.

Decision: If ping to gateway fails, look for VLAN trunk issues, bridge ports, bond status, or upstream switch config. Again: don’t blame the firewall yet.

Task 4: Check whether SSH and the Web UI services are running

cr0x@server:~$ systemctl status ssh --no-pager
● ssh.service - OpenBSD Secure Shell server
     Loaded: loaded (/lib/systemd/system/ssh.service; enabled; preset: enabled)
     Active: active (running) since Thu 2025-12-26 08:14:21 UTC; 1h 22min ago
       Docs: man:sshd(8)
             man:sshd_config(5)
   Main PID: 1123 (sshd)
      Tasks: 1 (limit: 154000)
     Memory: 6.2M
        CPU: 210ms
cr0x@server:~$ systemctl status pveproxy --no-pager
● pveproxy.service - PVE API Proxy Server
     Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; preset: enabled)
     Active: active (running) since Thu 2025-12-26 08:13:58 UTC; 1h 23min ago
   Main PID: 1055 (pveproxy)
      Tasks: 4 (limit: 154000)
     Memory: 154.8M
        CPU: 1.934s

What it means: Services are up. If you still can’t reach them remotely, the firewall is highly suspect.

Decision: If either service is inactive/failed, fix that first. A firewall change won’t resurrect a dead daemon.

Task 5: Verify services are listening on the expected ports locally

cr0x@server:~$ ss -lntp | egrep '(:22|:8006)\s'
LISTEN 0      128          0.0.0.0:22        0.0.0.0:*    users:(("sshd",pid=1123,fd=3))
LISTEN 0      4096         0.0.0.0:8006      0.0.0.0:*    users:(("pveproxy",pid=1055,fd=6))

What it means: The host is accepting connections on 22 and 8006. If remote clients can’t connect, packets are being filtered or misrouted.

Decision: If ports aren’t listening, check service configs, cert issues, or crash loops in logs.
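
If they aren’t listening, the unit logs usually say why (bind failures, certificate problems, crash loops). Standard journalctl, nothing exotic:

cr0x@server:~$ journalctl -u pveproxy -n 50 --no-pager
cr0x@server:~$ journalctl -u ssh -n 50 --no-pager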

Task 6: Prove the firewall is the blocker with a local curl

cr0x@server:~$ curl -kI https://127.0.0.1:8006/
HTTP/1.1 200 OK
server: pve-api-daemon/3.0
content-type: text/html; charset=UTF-8
cache-control: max-age=0

What it means: The Web UI works locally. The “down” symptom is network path or firewall, not pveproxy itself.

Decision: Move to firewall inspection and rollback.

Task 7: Check Proxmox firewall enablement at runtime (service + state)

cr0x@server:~$ systemctl status pve-firewall --no-pager
● pve-firewall.service - Proxmox VE firewall
     Loaded: loaded (/lib/systemd/system/pve-firewall.service; enabled; preset: enabled)
     Active: active (running) since Thu 2025-12-26 09:01:12 UTC; 36min ago
   Main PID: 2289 (pve-firewall)
      Tasks: 1 (limit: 154000)
     Memory: 19.7M
        CPU: 4.311s

What it means: Proxmox firewall is active and managing rules.

Decision: If you’re locked out and you need a quick restore, stopping this service is a legitimate emergency move. If you need partial restoration (tight allowlist), edit rules instead.

Task 8: Look at the firewall logs for drops (this saves guessing)

cr0x@server:~$ journalctl -u pve-firewall --since "30 min ago" --no-pager | tail -n 20
Dec 26 09:19:10 pve-01 pve-firewall[2289]: status update OK
Dec 26 09:19:10 pve-01 pve-firewall[2289]: compile new ruleset
Dec 26 09:19:11 pve-01 pve-firewall[2289]: firewall update successful
cr0x@server:~$ journalctl -k --since "30 min ago" --no-pager | egrep -i 'PVEFW|DROP|REJECT' | tail -n 10
Dec 26 09:28:03 pve-01 kernel: PVEFW-DROP-IN: IN=vmbr0 OUT= MAC=3c:ec:ef:12:34:56 SRC=10.20.30.50 DST=10.20.30.11 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=22222 DF PROTO=TCP SPT=51234 DPT=8006 WINDOW=64240 SYN
Dec 26 09:28:08 pve-01 kernel: PVEFW-DROP-IN: IN=vmbr0 OUT= MAC=3c:ec:ef:12:34:56 SRC=10.20.30.50 DST=10.20.30.11 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=22223 DF PROTO=TCP SPT=51235 DPT=22 WINDOW=64240 SYN

What it means: Kernel logs show your admin host (10.20.30.50) being dropped on ports 8006 and 22. That’s a smoking gun, not an interpretation.

Decision: Apply an emergency allow (preferred if you can be precise) or stop pve-firewall (fastest).

Task 9 (fastest): Stop Proxmox firewall to restore access immediately

cr0x@server:~$ systemctl stop pve-firewall
cr0x@server:~$ systemctl is-active pve-firewall
inactive

What it means: The Proxmox-managed rule set should be withdrawn. Remote access should come back if the firewall was the blocker.

Decision: Try SSH/Web UI from your admin machine now. If access returns, you’ve confirmed root cause. Next: fix rules properly, then re-enable firewall.

Task 10: If access still doesn’t return, inspect nftables/iptables directly

Sometimes other tooling, custom scripts, or leftover rule sets remain. Check what’s actually loaded.

cr0x@server:~$ nft list ruleset | head -n 40
table inet filter {
	chain input {
		type filter hook input priority filter; policy accept;
		ct state established,related accept
		iif "lo" accept
	}
}

What it means: A minimal nftables ruleset with input policy accept. If this is what you see, firewall filtering likely isn’t happening at nft level.

Decision: If you see policy drop and no allow rules for your management ports, you can add a temporary allow rule (see next task). If nft is empty but you suspect iptables, check iptables too.

cr0x@server:~$ iptables -S | head -n 30
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT

What it means: iptables is permissive. If you’re still locked out, suspect routing, rp_filter, upstream ACLs, or wrong interface/IP.

Decision: Continue with connectivity tests and upstream checks.

Task 11 (surgical): Temporarily allow SSH and 8006 from your admin IP using nftables

If you need to keep a default-drop posture while recovering, add a narrow allow for your workstation or bastion host. This assumes nftables is active on your node.

cr0x@server:~$ nft add table inet emergency
cr0x@server:~$ nft 'add chain inet emergency input { type filter hook input priority -50; policy accept; }'
cr0x@server:~$ nft 'add rule inet emergency input ip saddr 10.20.30.50 tcp dport { 22, 8006 } accept'
cr0x@server:~$ nft list table inet emergency
table inet emergency {
	chain input {
		type filter hook input priority -50; policy accept;
		ip saddr 10.20.30.50 tcp dport { 22, 8006 } accept
	}
}

What it means: You created a high-priority “emergency” chain that accepts management traffic from one IP before other chains can drop it.

Decision: Use this to get back in remotely, then fix Proxmox firewall rules properly. Remove the emergency table afterward; don’t leave crime-scene tape in production.
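
When you’re done, tearing down the emergency table is one command:

cr0x@server:~$ nft delete table inet emergency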

Task 12: Check the Proxmox firewall config files (find the rule that hurt you)

cr0x@server:~$ grep -R "enable" -n /etc/pve/firewall | head
/etc/pve/firewall/cluster.fw:2:enable: 1
/etc/pve/firewall/pve-01.fw:2:enable: 1
cr0x@server:~$ sed -n '1,120p' /etc/pve/firewall/cluster.fw
[OPTIONS]
enable: 1
policy_in: DROP
policy_out: ACCEPT

[RULES]
IN DROP -p tcp -dport 8006 -log nolog

What it means: Cluster-wide input policy is DROP and there’s an explicit DROP for 8006. That’s how you bricked the UI—politely and at scale.

Decision: Remove the bad DROP rule, add an allowlist rule for your admin subnet, and keep policy_in DROP if that’s your security posture. Just don’t deny your own control plane.
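
As a sketch, the corrected [RULES] section in cluster.fw could look like this, assuming 10.20.30.0/24 really is your admin subnet (adjust sources to your environment):

[RULES]
IN ACCEPT -source 10.20.30.0/24 -p tcp -dport 22 -log nolog
IN ACCEPT -source 10.20.30.0/24 -p tcp -dport 8006 -log nolog

Task 13 shows the same idea applied at the node level, which is often the safer scope.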

Task 13: Add a correct allow rule in Proxmox firewall (preferred permanent fix)

Example: allow management from an admin subnet to the node.

cr0x@server:~$ sed -n '1,120p' /etc/pve/firewall/pve-01.fw
[OPTIONS]
enable: 1
policy_in: DROP
policy_out: ACCEPT

[RULES]
IN ACCEPT -source 10.20.30.0/24 -p tcp -dport 22 -log nolog
IN ACCEPT -source 10.20.30.0/24 -p tcp -dport 8006 -log nolog
IN ACCEPT -p icmp -log nolog

What it means: The node has explicit allows for SSH and Web UI from your admin subnet, while still default-dropping everything else inbound.

Decision: If your admin subnet is not trustworthy, replace it with a bastion IP or VPN range. “0.0.0.0/0 can manage my hypervisor” is not a security strategy.

Task 14: Reload Proxmox firewall and verify it compiles

cr0x@server:~$ systemctl start pve-firewall
cr0x@server:~$ systemctl status pve-firewall --no-pager
● pve-firewall.service - Proxmox VE firewall
     Loaded: loaded (/lib/systemd/system/pve-firewall.service; enabled; preset: enabled)
     Active: active (running) since Thu 2025-12-26 10:02:13 UTC; 2s ago
   Main PID: 9912 (pve-firewall)
cr0x@server:~$ pve-firewall status
Status: enabled/running

What it means: The service is active and the firewall reports enabled/running. If you want to see exactly what was generated, pve-firewall compile prints the compiled rule set and will complain about syntax errors.

Decision: If compile fails, the firewall may revert or partially apply. Fix syntax before you trust it. A “half-applied” firewall is how you get haunted.

Task 15: Validate from the host: do packets still drop?

You can’t perfectly simulate remote access from the local console, but you can check whether the kernel is still logging drops for your admin source.

cr0x@server:~$ journalctl -k --since "5 min ago" --no-pager | egrep -i 'PVEFW-DROP-IN' | tail -n 5

What it means: If there are no new drops while you attempt to connect remotely, you likely fixed it. If drops persist, your allow rule isn’t matching (wrong source range, interface, protocol, or scope).

Decision: If drops persist, check rule order and scope (datacenter vs node). Also check whether you’re hitting the host from a different source IP than you think (NAT, VPN, jump host).

Task 16: Confirm remote path with tcpdump (when you need proof, not vibes)

cr0x@server:~$ tcpdump -ni vmbr0 -c 10 'tcp port 22 or tcp port 8006'
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vmbr0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:06:41.112233 IP 10.20.30.50.51234 > 10.20.30.11.8006: Flags [S], seq 1234567890, win 64240, options [mss 1460,sackOK,TS val 111 ecr 0,nop,wscale 7], length 0
10:06:41.112455 IP 10.20.30.11.8006 > 10.20.30.50.51234: Flags [S.], seq 987654321, ack 1234567891, win 65160, options [mss 1460,sackOK,TS val 222 ecr 111,nop,wscale 7], length 0

What it means: You see SYN and SYN-ACK. That’s connectivity. If the SYN arrives but no SYN-ACK, the host is dropping or not listening.

Decision: SYN-only usually means firewall drop. SYN+SYN-ACK means you’re back; any remaining issue is likely TLS/cert/browser or a reverse proxy in front.

Task 17: Check if you accidentally blocked yourself by interface (vmbr0 vs a VLAN sub-interface)

cr0x@server:~$ ip -d link show vmbr0 | head -n 12
4: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 3c:ec:ef:12:34:56 brd ff:ff:ff:ff:ff:ff
    bridge forward_delay 1500 hello_time 200 max_age 2000 ageing_time 30000 stp_state 0 priority 32768 vlan_filtering 1 vlan_default_pvid 1 vlan_protocol 802.1Q

What it means: VLAN filtering is enabled on the bridge. If your management VLAN tagging or PVID assumptions are wrong, traffic might not even hit the host the way you think.

Decision: If you recently changed VLANs/bridge config alongside firewall, isolate: confirm L2/VLAN first, then firewall.
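
To see what the bridge actually thinks about VLAN membership, the iproute2 bridge tool plus the interfaces file is enough; output varies with your config:

cr0x@server:~$ bridge vlan show dev vmbr0
cr0x@server:~$ grep -A8 'iface vmbr0' /etc/network/interfaces

Compare the PVID and tagged VLANs against where your management IP is supposed to live.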

Task 18: Verify cluster filesystem health (because /etc/pve is special)

cr0x@server:~$ pvecm status 2>/dev/null | head -n 15
Cluster information
-------------------
Name:             corp-cluster
Config Version:   42
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Dec 26 10:08:12 2025
Quorum:           Yes

What it means: If you’re in a cluster, /etc/pve is a cluster filesystem. Quorum issues can make config edits tricky or non-propagating.

Decision: If there’s no quorum, avoid making “global” assumptions. Prefer node-local emergency access restoration and proceed carefully.

Joke #2: Nothing teaches “change control” like being physically introduced to the server room at 2 a.m.

Three corporate mini-stories (pain, regret, and one quiet hero)

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company ran Proxmox for internal workloads: CI runners, a few databases, and a cluster of “temporary” VMs that had somehow lived for years. A new security lead asked for “default deny inbound” on the hypervisors. Reasonable. The team agreed—then did the dangerous part: they assumed their VPN subnet was the only source of management traffic.

In reality, the operations team used a bastion host on a different subnet for maintenance. The bastion had been set up during a network redesign and quietly became the de facto admin path. The firewall change allowed SSH and 8006 from the VPN range, and dropped everything else.

At first, nothing looked broken. People already logged in via existing SSH sessions kept working because stateful rules allowed established connections. That made the change seem safe. Then the bastion host was rebooted for unrelated kernel updates. When it came back, it couldn’t SSH into the nodes. The Web UI was dead too. Suddenly the “default deny” rollout looked like an outage.

The fix was simple: console into one node, stop pve-firewall, add an allow for the bastion subnet, start pve-firewall, then test from bastion and VPN. The lesson wasn’t “don’t do default deny.” The lesson was that assumptions about management paths are lies unless you inventory them.

After the incident, they wrote a one-page “management plane contract”: which subnets may access hypervisors, which ports, and how to test from each path before enabling DROP policies. It was dull. It worked.

Mini-story 2: The optimization that backfired

A different org had a habit of “optimizing” firewall rules to keep them tidy. One engineer decided to consolidate rules into datacenter scope so every node behaved the same. The argument: fewer per-node exceptions, fewer snowflakes, easier compliance. In principle, fine.

The backfire came from heterogeneity. Some nodes had management on vmbr0 with an untagged VLAN; others used a tagged management VLAN on vmbr0.20. The consolidated rules included interface matches that were correct for half the fleet and wrong for the other half. The affected nodes didn’t match the allow rules, so default DROP applied.

They lost access to multiple nodes at once, which is the kind of outage that makes executives ask questions like “Why can’t we just log in?” The team had to use out-of-band consoles for recovery, one host at a time, while the rest of the company watched a status page that said something non-committal.

What made it worse: because the rules were cluster-wide, “fixing it quickly” meant pushing another cluster-wide change, which the team hesitated to do while unsure of rule behavior. They ended up stopping pve-firewall on the locked nodes, restoring access, then redesigning rules around management groups: an allowlist based on source networks and ports without fragile per-interface assumptions.

The point: consolidation is not automatically simplification. If the environment is not uniform, a “global optimization” is sometimes just a global blast radius.

Mini-story 3: The boring but correct practice that saved the day

A financial services shop ran a Proxmox cluster for test environments—still important, still tied to real deadlines. They had a strict rule: any firewall change must include a pre-staged “break glass” path, tested from a different network.

That path was intentionally boring: a dedicated admin jump host with a fixed IP, allowed to reach SSH and 8006 on all hypervisors, and monitored. They also had IPMI enabled and documented, with credentials stored in a controlled vault. Nobody liked maintaining it. It wasn’t glamorous work.

One afternoon, someone accidentally applied a datacenter rule that dropped all inbound TCP except a handful of application ports. It was an honest UI mistake: the engineer thought they were editing a VM policy, not the datacenter policy. Within seconds, most management access disappeared.

The team didn’t panic. They used the jump host, which still had access by design, to log into nodes and revert the rule. No console gymnastics. No guessing. The postmortem was short and unromantic: improve UI guardrails, add confirmation prompts for datacenter changes, and keep the “boring” jump host monitored.

Sometimes the best engineering move is to be uninteresting on purpose.

Common mistakes: symptom → root cause → fix

This section is where you match what you see to what’s actually happening. Quick pattern recognition beats creative suffering.

1) Browser times out on port 8006, but console curl works

  • Symptom: Web UI unreachable remotely, but curl -kI https://127.0.0.1:8006/ returns 200.
  • Root cause: Firewall dropping inbound TCP/8006 or upstream ACL block.
  • Fix: Stop pve-firewall to restore access, then add explicit allow rules for admin sources to TCP/8006 and reload.

2) SSH was working for one person, then stopped when they disconnected

  • Symptom: Existing SSH session survived; new SSH sessions fail.
  • Root cause: Stateful allow for established connections, but new inbound TCP/22 is dropped.
  • Fix: Add explicit allow for TCP/22 from admin sources; don’t rely on conntrack luck.

3) Only one node is locked out; others are fine

  • Symptom: Cluster mostly accessible; one node dead to SSH/UI.
  • Root cause: Node-level firewall enabled with stricter policy, interface mismatch, or node-specific rule typo.
  • Fix: Check /etc/pve/firewall/pve-XX.fw for enable and policies; compare to a working node; restart pve-firewall after correction.

4) Everything is locked out right after a datacenter firewall edit

  • Symptom: Multiple nodes become unreachable nearly simultaneously.
  • Root cause: Cluster-wide (datacenter) rule dropped management ports or default policy changed to DROP without allowlist.
  • Fix: Console into one node, stop pve-firewall, edit /etc/pve/firewall/cluster.fw to allow management, start firewall, validate, then repeat as needed.

5) Web UI works on LAN but not over VPN

  • Symptom: Local subnet can reach 8006; VPN users cannot.
  • Root cause: Allow rules scoped to one source subnet; VPN NAT changes perceived source IP; or MTU issues masquerading as firewall.
  • Fix: Confirm VPN source IP seen by Proxmox via firewall logs or tcpdump; adjust allow rule source accordingly; if SYN-ACK seen but UI still fails, check MTU/TLS rather than firewall.

6) Stopping pve-firewall doesn’t restore access

  • Symptom: systemctl stop pve-firewall but SSH/UI still unreachable.
  • Root cause: Rules enforced by nftables/iptables not owned by pve-firewall, upstream firewall/ACL, wrong route, or wrong IP.
  • Fix: Inspect nft list ruleset and iptables -S, verify IP/route, run tcpdump to see whether SYN reaches host, and check upstream network policy.

7) You allowed 8006 but still can’t log in

  • Symptom: TCP connects but UI login fails or loads partially.
  • Root cause: Not firewall: could be pveproxy issue, time skew, auth backend trouble, or certificate/hostname mismatch (especially behind proxies).
  • Fix: Check systemctl status pveproxy, inspect journalctl -u pveproxy, confirm time sync, and test local login first.

8) You edited /etc/pve/firewall files but nothing changes

  • Symptom: Changes don’t apply, rules seem unchanged.
  • Root cause: Cluster filesystem not healthy (quorum issues), or you edited the wrong scope file.
  • Fix: Check pvecm status for quorum; ensure you edited the correct file (cluster.fw vs node *.fw); restart pve-firewall and confirm compile status.

Checklists / step-by-step plan

Emergency step-by-step (restore access in under 10 minutes)

  1. Open console access (IPMI/KVM/provider console).
  2. Confirm IP and route:
    • ip -br addr shows correct management IP.
    • ip route shows default route.
  3. Confirm services:
    • systemctl status ssh and systemctl status pveproxy.
    • ss -lntp | egrep '(:22|:8006)'.
  4. Confirm firewall drops (optional but fast):
    • journalctl -k | egrep -i 'PVEFW|DROP|REJECT'.
  5. Pick a recovery mode:
    • Fastest: systemctl stop pve-firewall.
    • More controlled: add a narrow allow for your admin IP (temporary nft emergency chain) or edit /etc/pve/firewall/*.fw.
  6. Verify from outside: test SSH and https://node:8006 from a known-good admin host.
  7. Fix the actual rule: remove deny, add correct allowlist, reload firewall (systemctl start pve-firewall and pve-firewall compile).
  8. Remove temporary emergency nft rules if you created them:
    • nft delete table inet emergency

Stabilization checklist (after you’re back in)

  1. Identify scope: was it Datacenter, Node, or VM/CT firewall?
  2. Confirm policies: policy_in and policy_out at each scope.
  3. Define management sources: VPN subnet, jump host IP, on-call laptops, office egress NAT—write the list down.
  4. Implement a management allowlist:
    • Allow TCP/22 and TCP/8006 from that list.
    • Allow ICMP from that list (optional, but operationally useful).
  5. Test from each path: VPN, jump host, office network, etc. Don’t assume.
  6. Enable logging for drops only where needed; too much log noise hides the real drop.
  7. Document break-glass: where the console is, how to access it, and who can.

Prevention: make lockouts boring

Once you’ve recovered access, do the responsible thing: prevent recurrence. Not with hope. With design.

1) Separate “management plane” from “everything else”

If your hypervisor management shares the same LAN and policy as guest traffic, you’re inviting fun. Better patterns:

  • Dedicated management VLAN/subnet with controlled ingress.
  • Admin VPN that lands you in that subnet (or NATs to a known egress IP).
  • Bastion host with a stable IP and strong auth. Allow management only from it if you can.

2) Keep your “allowlist for SSH and 8006” explicit and local

You can set policy_in DROP globally, but don’t rely on one fragile datacenter rule to preserve your only access path. Use node rules for management allowlist, especially if nodes differ (interfaces, VLAN tags, subnets).

3) Use “two-person rule” for datacenter firewall changes

Datacenter scope is cluster blast radius. Treat it like production database schema changes: review, verify scope, and have a rollback.

4) Test with a new connection, not your existing lucky session

Stateful firewalls can trick you. Always validate with a fresh SSH connection from a second terminal, or from a different host, before declaring success.
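
If you use SSH connection multiplexing (ControlMaster), a “new” session may silently reuse the existing TCP connection and prove nothing. A sketch, run from a second host (shown here as cr0x@admin, against this article’s example node IP), forcing a genuinely fresh connection:

cr0x@admin:~$ ssh -o ControlMaster=no -o ControlPath=none root@10.20.30.11 'echo fresh connection ok'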

5) Prefer narrow temporary changes during incidents

When you’re in an outage, “temporarily stop pve-firewall” is acceptable because it’s reversible and fast. But don’t leave the firewall off for days. That’s how temporary becomes permanent, and permanent becomes policy.

6) Build a “break-glass” workflow you can execute half-asleep

At minimum:

  • Out-of-band access verified quarterly.
  • A documented console login procedure.
  • A known-good admin IP/range to allow.
  • A pre-written snippet for node firewall rules allowing TCP/22 and TCP/8006 (example below).
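
For example, a pre-staged node file snippet could reuse the format from Task 13, assuming 10.20.30.50 is your jump host (adjust before the incident, not during it):

[OPTIONS]
enable: 1
policy_in: DROP
policy_out: ACCEPT

[RULES]
IN ACCEPT -source 10.20.30.50 -p tcp -dport 22 -log nolog
IN ACCEPT -source 10.20.30.50 -p tcp -dport 8006 -log nolog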

FAQ

1) Should I just disable the Proxmox firewall permanently?

No. Disable it briefly to recover, then fix the policy. Hypervisors are privileged systems; running them wide open is an invitation to get audited the hard way.

2) What’s the quickest safe command to restore access?

From console: systemctl stop pve-firewall. If services are running and networking is fine, that usually restores SSH and the Web UI immediately.

3) Why does the Web UI use port 8006, and can I change it?

8006 is the default for pveproxy. You can change exposure with reverse proxies and network policy, but changing the port itself is rarely worth the operational friction during incidents.

4) I stopped pve-firewall but I’m still locked out. Now what?

Prove whether packets reach the host. Use tcpdump -ni vmbr0 tcp port 22 while attempting a connection. If no SYN arrives, it’s upstream (routing, VLAN, ACL). If SYN arrives but no SYN-ACK, it’s local filtering or service/listen issues.

5) Is it better to fix rules via the UI or by editing /etc/pve/firewall/*.fw?

During an incident, edit whichever is faster and less error-prone for you. The UI is nicer but unavailable during lockout. Console editing is fine; just validate with pve-firewall compile and restart the service.

6) What rule should always exist to prevent management lockout?

An explicit allow for TCP/22 and TCP/8006 from a controlled management source (bastion IP or admin subnet), at the node level if your environment is heterogeneous.

7) Can VM/CT firewall rules lock me out of the host?

Not directly. VM/CT firewall rules apply to guest interfaces. Host lockouts typically come from datacenter/node firewall settings or other host-level firewall tooling.

8) How do I know if Proxmox is using nftables or iptables?

Check what has rules loaded: nft list ruleset and iptables -S. On modern Debian-based systems, nftables is common, sometimes with iptables compatibility. During outages, trust what’s actually present, not what you remember from last year.

9) What about IPv6—can that cause “half-lockouts”?

Yes. If clients prefer IPv6 and your firewall rules only allow IPv4 (or vice versa), you’ll see inconsistent access. Check ip -br addr for IPv6 addresses and ensure your policy matches reality.
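
A quick way to separate the two stacks from your admin machine is to force each protocol explicitly (pve-01 here is this article’s example node name; substitute your own):

cr0x@admin:~$ ssh -4 root@pve-01        # test the IPv4 path
cr0x@admin:~$ ssh -6 root@pve-01        # test the IPv6 path
cr0x@admin:~$ curl -4 -kI https://pve-01:8006/
cr0x@admin:~$ curl -6 -kI https://pve-01:8006/

If one protocol works and the other times out, your firewall rules and your clients’ address preference disagree.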

10) If I’m in a cluster, can I fix this from any node?

Maybe. If the firewall lockout affects all nodes, you’ll need console access to at least one. Also, if the cluster lacks quorum, /etc/pve behavior can be constrained. In emergencies, restore access per-node first, then reconcile cluster config once stable.

Conclusion: next steps you should do today

Getting locked out by Proxmox firewall rules is common because it’s a clean UI over a sharp tool. The recovery is straightforward: confirm networking, confirm services, confirm firewall drops, then either stop pve-firewall or add a narrow allow to regain access. After that, fix rules properly with explicit management allowlists and the right scope.

Next steps that pay rent:

  1. Create a node-level management allowlist for TCP/22 and TCP/8006 from a bastion or admin subnet.
  2. Verify out-of-band console access works before the next incident forces you to discover it doesn’t.
  3. Add a simple pre-change test: “Can a fresh SSH connection and a fresh 8006 connection succeed from each admin path?”
  4. Keep your rollback plan as a command you can type from console without thinking: stop firewall, restore access, then correct policy.