The breach story in virtualization shops is rarely exotic. It’s usually “someone reused a password,” “the management UI was reachable,”
or “we patched later because it was inconvenient.” And then it becomes very inconvenient.
Proxmox VE is great because it’s honest: it’s Debian underneath, Linux networking underneath, and your operational decisions on top.
If you want it secure, you have to run it like production, not like a lab that accidentally started paying invoices.
Threat model in one page: what you’re defending
Proxmox is a management plane, not just a hypervisor. If an attacker gets into Proxmox, they don’t just get a single VM.
They get the ability to mount disks, snapshot and exfiltrate data, read console output (hello, secrets), attach ISOs, and
change networking. “Root on the host” is catastrophic; “admin in the UI” is often equivalent.
What usually goes wrong
- Exposed management interface (8006) to the Internet, sometimes with password auth and no 2FA.
- Over-permissioned operators (“just give them PVEAdmin so they can do their job”) whose access quietly becomes permanent.
- Firewall assumptions: people think the datacenter firewall is “on” because the UI has a checkbox.
- Patch avoidance because “we can’t reboot.” Spoiler: you can, or you’ll reboot at 3 a.m. during an incident.
- Remote access shortcuts: SSH from everywhere, shared jump boxes, or worse, shared accounts.
Hardening priorities (opinionated)
- Make the management plane unreachable from untrusted networks. VPN or dedicated admin network.
- Enforce 2FA for all humans. Tokens for automation.
- Use RBAC properly: least privilege, explicit roles, and separation between “operate VMs” and “change cluster.”
- Turn on and verify firewalling at the right layer, with default-deny where it matters.
- Patch with a plan: predictable maintenance windows, staged updates, and reboot discipline.
Paraphrased idea from Werner Vogels (reliability/operations): “Everything fails eventually; design assuming failure, not assuming perfection.”
That’s the energy you want here.
Interesting facts and context (so you stop making 2008 mistakes)
- Fact 1: Proxmox VE’s default management port is 8006, and scanners love it because it’s consistent and chatty.
- Fact 2: The Proxmox firewall is not “just iptables.” It compiles rules and manages them; you need to verify the effective rules on the host.
- Fact 3: RBAC in Proxmox is path-based. Permissions inherit down the object tree (datacenter → node → VM), which is powerful and easy to overdo.
- Fact 4: API tokens exist for a reason: automation should not log in as a human. Tokens can be scoped and rotated with less drama.
- Fact 5: Management networks became a standard pattern in the 2000s because “admin VLAN” mistakes were cheaper than “entire fleet owned” incidents.
- Fact 6: “Password-only SSH” survived longer than it should have because it’s convenient. Convenience is not a control; it’s a liability.
- Fact 7: Debian’s packaging culture favors stability, but security fixes still arrive frequently. “Stable” is not “static.”
- Fact 8: Cluster traffic (Corosync) has very different needs than web UI traffic; treating it like normal user traffic is a great way to invent outages.
Checklists / step-by-step plan (do this, not vibes)
Day 0–1: Stop the bleeding
- Block public access to 8006 and SSH at the edge. If you can’t, you don’t have a management plane—you have a public API.
- Enable 2FA for all interactive users. No exceptions for “temporary” accounts.
- Audit users/tokens. Remove stale accounts, rotate secrets, kill shared logins.
- Confirm firewall state (datacenter + node) and verify effective rules on the host.
- Patch baseline: bring all nodes current on security updates; schedule reboots.
Week 1: Make it boring
- Define RBAC roles for common jobs (VM operator, backup operator, storage admin, cluster admin).
- Isolate management onto an admin network or VPN, and bind services to it where practical.
- Harden SSH (keys, no root login, restricted source networks).
- Logging and alerting: track auth failures, configuration changes, and firewall drops.
Month 1: Reduce blast radius
- Separate duties: cluster membership changes and storage changes should be limited to a tiny set of people.
- Test restore paths and lock down backup storage; backups are the first exfil target after the primary data (see the restore sketch after this list).
- Document runbooks for incident response: lost 2FA device, compromised account, node rebuild.
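To make “test restore paths” concrete: a hedged restore drill with qmrestore, assuming a hypothetical backup archive path, a spare VMID, and a storage named quarantine that is not your production pool. Every name here is a placeholder; the point is that the restore lands somewhere isolated and gets verified.
cr0x@server:~$ qmrestore /mnt/backup/vzdump-qemu-101-2025_12_20-02_00_01.vma.zst 9101 --storage quarantine --unique 1
cr0x@server:~$ qm start 9101
cr0x@server:~$ qm status 9101
If the restored VM boots and the data looks right, note it in the runbook. If it doesn’t, you just found an incident before it found you.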
2FA that actually reduces risk (and doesn’t brick admins)
2FA isn’t magic. It’s a seatbelt: it doesn’t prevent accidents; it prevents a bad day from becoming a dead day.
Use TOTP or WebAuthn depending on your culture and tooling. In practice, TOTP is easiest to roll out. WebAuthn is nicer when your org
already understands security keys and device lifecycle.
Rules I enforce in production
- 2FA required for all human accounts with UI or shell access.
- No shared accounts. Shared accounts make audits fictional.
- Break-glass exists, but it’s ugly: a sealed procedure, stored offline, and used with a paper trail.
- Automation uses API tokens, not password logins. Tokens get rotated; humans get fired for embedding passwords in scripts.
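A minimal sketch of that token discipline, using hypothetical names (alice@pam, a backup-bot token, a pve1.example.internal endpoint, a placeholder CA path) and the built-in PVEVMUser role; adjust the ACL path and role to what the job actually needs:
cr0x@server:~$ pveum user token add alice@pam backup-bot --privsep 1 --comment "vzdump orchestration"
cr0x@server:~$ pveum acl modify /vms --tokens 'alice@pam!backup-bot' --roles PVEVMUser
cr0x@server:~$ curl -fsS --cacert /path/to/pve-root-ca.pem \
    -H 'Authorization: PVEAPIToken=alice@pam!backup-bot=<secret-printed-at-creation>' \
    https://pve1.example.internal:8006/api2/json/version
With privilege separation on, the token’s ACLs are separate from Alice’s, so it can hold strictly less power than the human it belongs to. The secret is printed once at creation; it goes into a secret manager, not into a script.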
Joke 1: 2FA is like flossing—everyone agrees it’s a good idea, and then they “start next week” for three years.
Failure modes to plan for
- Lost device: if you don’t have a recovery workflow, your “security improvement” becomes downtime.
- Time drift: TOTP fails if host time is wrong. NTP isn’t optional.
- Clipboard and screen-recording malware: 2FA doesn’t fix compromised endpoints. It just raises the bar.
RBAC: permissions design for humans who will make mistakes
The trap with Proxmox RBAC is that it’s both powerful and deceptively simple. You assign a role to a user at a path, and it inherits.
Two months later, someone wonders why a contractor can modify storage on every node. Answer: because you gave them permissions at /.
RBAC design pattern that works
- Use groups for humans. Assign roles to groups, not individuals. Individuals come and go; groups are policy.
- Use narrow paths. Assign VM operators at /vms or specific pools, not /.
- Separate “operate” from “change infrastructure.” VM start/stop is not the same as adding storage or joining nodes to a cluster.
- Limit audit/log access. Logs contain secrets more often than you think (cloud-init user-data, console output, tokens in environment variables).
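A hedged sketch of that pattern with pveum, assuming a hypothetical vm-ops group and a pool named prod-web; the names don’t matter, the narrow path and the group-based assignment do:
cr0x@server:~$ pveum group add vm-ops --comment "VM operators: run workloads, change nothing structural"
cr0x@server:~$ pveum user modify bob@pve --groups vm-ops
cr0x@server:~$ pveum acl modify /pool/prod-web --groups vm-ops --roles PVEVMUser
Note what isn’t there: nothing assigned at /. Broad paths are where accidental god mode comes from.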
Roles worth defining (examples)
- VM Operator: start/stop, console access, snapshot (maybe), but no storage, no networking changes.
- Backup Operator: run/monitor backups, restore to a quarantine pool, but not modify production networking.
- Storage Admin: storage config, replication, prune jobs. No permission to change user auth.
- Cluster Admin: rare. Can change cluster membership, corosync config, node maintenance.
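If the built-in roles are broader than the sentence you can write, define your own. A sketch with pveum role add, using privilege names that exist in current PVE releases; trim or extend the lists after checking pveum role list on your version:
cr0x@server:~$ pveum role add VM-Operator --privs "VM.Audit VM.Console VM.PowerMgmt VM.Snapshot"
cr0x@server:~$ pveum role add Backup-Operator --privs "VM.Audit VM.Backup Datastore.Audit Datastore.AllocateSpace"
cr0x@server:~$ pveum acl modify /pool/prod-web --groups vm-ops --roles VM-Operator
The split keeps “can touch VMs” and “can touch backup storage” as separately grantable things.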
If you can’t explain a role in one sentence, it’s not a role. It’s a bucket.
Firewalling Proxmox: host, datacenter, and “don’t expose the UI”
The Proxmox firewall can be excellent. It can also be a placebo if you don’t understand what’s actually applied.
Security posture should never depend on one checkbox in a GUI. You need defense in depth: edge firewall, management network isolation,
and host-level controls.
Non-negotiable inbound rules
- Management UI (8006): only from admin network / VPN.
- SSH (22): only from admin network / bastion.
- Cluster traffic: only between nodes on the cluster network (and don’t NAT it).
- Storage backends (NFS, iSCSI, Ceph, etc.): only between the right participants. No “any to any.”
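What those rules look like in the datacenter firewall file, as an illustrative excerpt of /etc/pve/firewall/cluster.fw. The subnets are placeholders, and the corosync/knet port range shown is the commonly cited default; verify it against your own corosync config before relying on it:
# /etc/pve/firewall/cluster.fw (illustrative excerpt; placeholder subnets)
[OPTIONS]
enable: 1
policy_in: DROP
policy_out: ACCEPT

[ALIASES]
admin_net 10.50.0.0/24
cluster_net 10.60.0.0/24

[RULES]
IN ACCEPT -source admin_net -p tcp -dport 8006 # management UI, admin network only
IN ACCEPT -source admin_net -p tcp -dport 22 # SSH, admin network only
IN ACCEPT -source cluster_net -p udp -dport 5405:5412 # corosync/knet between nodes; verify ports
This only takes effect if the firewall is enabled at both the datacenter and node level; Task 8 below checks exactly that, and Task 10 checks what actually got compiled into nftables.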
Default deny where it counts
The safest model is default-deny inbound on management interfaces, with explicit allows. For VM bridges, the story depends on your tenant model.
If you’re running internal workloads and you trust your east-west traffic policies, you can be less strict. If you’re multi-tenant,
assume you’re hosting creative adversaries.
Don’t confuse “can ping” with “is safe”
ICMP reachability tells you almost nothing. Attackers don’t need ping. They need a reachable socket and one mistake.
Your job is to make the socket unreachable from places it shouldn’t exist.
Updates and maintenance: patch fast without gambling uptime
Patch discipline is a culture issue disguised as a technical issue. The technical part is easy: apt updates, reboot when needed,
migrate VMs, repeat. The cultural part is harder: getting leadership to accept that planned downtime is cheaper than surprise downtime.
Production patch strategy that scales
- Staggered updates: never update every node at once. One node first, then the rest.
- Maintenance window: even if it’s just one hour weekly. Predictable beats heroic.
- Reboot policy: kernel updates mean reboot. Yes, you can postpone; no, you shouldn’t indefinitely.
- Test after update: cluster quorum, storage mounts, VM start, backups, and firewall rules.
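A sketch of the per-node cycle, assuming the node’s workloads have been migrated away or accepted into the window; the commands are standard PVE/Debian tooling, the sequencing is the point:
cr0x@server:~$ apt-get -s dist-upgrade # simulate first and read what will change
cr0x@server:~$ apt update && apt dist-upgrade -y
cr0x@server:~$ [ -f /var/run/reboot-required ] && echo "reboot required"
cr0x@server:~$ reboot # only if required, only inside the window
cr0x@server:~$ pvecm status | grep -i quorate # after it returns, before touching the next node
cr0x@server:~$ pveversion
Only when quorum, storage, and a test VM start look normal does the next node get the same treatment.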
Joke 2: “We don’t reboot production” is a great policy—until your kernel reboots production for you at the worst possible time.
Security updates vs feature updates
Separate them mentally. Security updates get urgency. Feature updates get scrutiny. In practice on Proxmox, you’ll often do both through apt,
so your process must include a quick sanity check. If you’re allergic to change, you’re also allergic to security.
Safe remote access: VPN, bastions, and SSH hygiene
The only safe way to manage Proxmox remotely is to reduce the number of places management is reachable from.
“Allowed from the Internet but protected by a password manager” is not a strategy. It’s a confession.
Preferred patterns (pick one, don’t improvise)
- VPN to an admin network: simplest operational model. Admins authenticate to VPN, then access Proxmox privately (see the WireGuard sketch after this list).
- Bastion / jump host: SSH to bastion with strong auth, then hop to Proxmox nodes. UI access via port-forwarding or internal proxy.
- Dedicated management network reachable only via on-prem or VPN: the “boring enterprise” model, for good reason.
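If you pick the VPN pattern and don’t already have one, a single WireGuard interface in front of the admin subnet is a common minimal approach. A sketch of the gateway-side config, on a dedicated gateway rather than a hypervisor node; every key, address, and port here is a placeholder:
# /etc/wireguard/wg-admin.conf on the VPN gateway (placeholder keys and addresses)
[Interface]
# the admin subnet your Proxmox firewall rules trust
Address = 10.50.0.1/24
ListenPort = 51820
PrivateKey = <gateway-private-key>

[Peer]
# one peer block per admin, pinned to a single address; no broad corporate ranges
PublicKey = <alice-public-key>
AllowedIPs = 10.50.0.11/32
Bring it up with wg-quick up wg-admin, then point the 8006/22 allow rules at that admin subnet and nothing else.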
SSH hardening rules that don’t ruin your life
- Keys only, no password authentication.
- No root login via SSH. Use sudo with audited users.
- Limit source IPs to admin subnets.
- Separate admin accounts per person; use groups and sudo rules.
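Those rules translate into a short sshd_config fragment. A sketch assuming the stock Debian include of /etc/ssh/sshd_config.d/ and a hypothetical pve-admins group; note the root-login nuance, because Proxmox nodes use key-based root SSH between each other for some cluster operations:
# /etc/ssh/sshd_config.d/50-hardening.conf (illustrative)
PasswordAuthentication no
PubkeyAuthentication yes
# 'prohibit-password' keeps intra-cluster, key-based root SSH working; 'no' is stricter but test migrations first
PermitRootLogin prohibit-password
# only named admin accounts (and root for cluster internals) get a shell
AllowGroups pve-admins root
The “restricted source networks” part is easiest to enforce in the firewall layer you already verified, rather than duplicating it inside sshd and having two places to be wrong.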
Web UI access without opening the world
If you must access the UI remotely, do it through VPN. If you can’t, put a reverse proxy in front with strong auth and IP allowlisting
and accept you’ve added complexity. Complexity is not a control, but it can be a compensating control if you operate it well.
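If you do go down that road, the proxy is where the allowlisting and the WebSocket plumbing live. A minimal nginx sketch with placeholder hostnames, paths, and subnets; without the upgrade headers the noVNC/xterm.js consoles silently break, and without X-Forwarded-For your audit trail loses real source IPs (the exact problem from Story 2 below):
# illustrative nginx vhost in front of the PVE UI
server {
    listen 443 ssl;
    server_name pve.example.internal;

    # allowlist first, authentication second, proxying last
    allow 10.50.0.0/24;
    deny all;

    ssl_certificate     /etc/nginx/tls/pve.crt;
    ssl_certificate_key /etc/nginx/tls/pve.key;

    location / {
        proxy_pass https://127.0.0.1:8006;
        proxy_http_version 1.1;
        # console and task viewers need WebSocket upgrades
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        # keep the real client address visible for auditing
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
Treat this proxy as production infrastructure with an owner, or skip it and use the VPN.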
Audit tasks with commands: what to run, what it means, what you decide
These are real operator tasks. Run them on each node (or centrally where noted). For each: command, sample output, what it means,
and the decision you make.
Task 1: Verify Proxmox version and cluster state
cr0x@server:~$ pveversion -v
pve-manager/8.2.2/9355359f (running kernel: 6.8.12-2-pve)
proxmox-ve/8.2.0 (running kernel: 6.8.12-2-pve)
pve-kernel-6.8/6.8.12-2
What it means: You’re on PVE 8.2.x with a specific kernel. If different nodes report different minor versions, you’re in “works until it doesn’t” land.
Decision: Align versions across the cluster before you debug weirdness or roll new features.
Task 2: Check cluster health and quorum
cr0x@server:~$ pvecm status
Cluster information
-------------------
Name: prod-cluster
Config Version: 18
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Sun Dec 28 10:14:22 2025
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.2a
Quorate: Yes
What it means: Quorum is healthy. If Quorate: No, don’t change cluster config; you’re one mistake away from downtime.
Decision: If not quorate, fix networking between nodes first; postpone updates and reboots.
Task 3: Confirm NTP/time sync (2FA depends on it)
cr0x@server:~$ timedatectl
Local time: Sun 2025-12-28 10:14:42 UTC
Universal time: Sun 2025-12-28 10:14:42 UTC
RTC time: Sun 2025-12-28 10:14:41
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
What it means: Clock is synchronized. If not, TOTP failures and TLS weirdness show up like ghosts.
Decision: If unsynchronized, fix NTP before rolling out or enforcing 2FA.
Task 4: List users and identify local vs realm accounts
cr0x@server:~$ pveum user list
┌──────────────┬─────────┬───────────────────────┬────────────┬────────┬──────────────┐
│ userid │ enable │ expire │ firstname │ lastname│ email │
╞══════════════╪═════════╪═══════════════════════╪════════════╪════════╪══════════════╡
│ root@pam │ 1 │ │ │ │ │
│ alice@pam │ 1 │ │ Alice │ Ops │ │
│ bob@pve │ 1 │ 2026-01-31 00:00:00 │ Bob │ Eng │ │
└──────────────┴─────────┴───────────────────────┴────────────┴────────┴──────────────┘
What it means: You have users in different realms (@pam, @pve). Expiration exists; use it for contractors.
Decision: Disable or expire accounts that don’t map to a current human with a ticket.
Task 5: Check 2FA/TOTP configuration for a user
cr0x@server:~$ pveum user get alice@pam
enable: 1
expire: 0
firstname: Alice
lastname: Ops
groups: ops
keys:
totp:
enabled: 1
issuer: pve
realm: pam
What it means: TOTP is enabled for the user. If enabled: 0, they can still log in with just a password unless you enforce policy externally.
Decision: Require 2FA for interactive users; build a break-glass method that is auditable.
Task 6: Review groups and role assignments (RBAC audit)
cr0x@server:~$ pveum group list
┌────────────┬────────┬─────────────┐
│ groupid │ comment│ users │
╞════════════╪════════╪═════════════╡
│ ops │ │ alice@pam │
│ vm-ops │ │ bob@pve │
└────────────┴────────┴─────────────┘
cr0x@server:~$ pveum acl list
┌────────────┬──────────┬───────────────┬───────────┐
│ path │ type │ ugid │ roleid │
╞════════════╪══════════╪═══════════════╪═══════════╡
│ / │ group │ ops │ PVEAdmin │
│ /vms │ group │ vm-ops │ PVEVMAdmin│
└────────────┴──────────┴───────────────┴───────────┘
What it means: The ops group has PVEAdmin at /, which is effectively “god mode.” This is a common overreach.
Decision: Move broad roles off /. Give PVEAdmin only to a tiny cluster-admin group, and scope others to pools/paths.
Task 7: Inspect API tokens (automation hygiene)
cr0x@server:~$ pveum user token list alice@pam
┌───────────────┬────────┬──────────────┬──────────┐
│ tokenid │ enable │ expire │ privsep │
╞═══════════════╪════════╪══════════════╪══════════╡
│ backup-bot │ 1 │ 2026-03-01 │ 1 │
└───────────────┴────────┴──────────────┴──────────┘
What it means: Tokens exist and privsep is enabled (good). Expiration is set (better).
Decision: Rotate tokens on a schedule; disable tokens that aren’t tied to an owned workflow.
Task 8: Confirm Proxmox firewall is enabled (datacenter and node)
cr0x@server:~$ pvesh get /cluster/firewall/options
{ "enable": 1, "policy_in": "DROP", "policy_out": "ACCEPT", "log_level_in": "info" }
cr0x@server:~$ pvesh get /nodes/pve1/firewall/options
{ "enable": 1, "policy_in": "DROP", "policy_out": "ACCEPT", "log_level_in": "info" }
What it means: Firewall is enabled and inbound default policy is DROP. If disabled at either layer, rules may not apply where you think.
Decision: If inbound is ACCEPT on management interfaces, change it and explicitly allow required admin subnets.
Task 9: Verify effective listening services and bind addresses
cr0x@server:~$ ss -lntp | head
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0 4096 0.0.0.0:8006 0.0.0.0:* users:(("pveproxy",pid=1520,fd=6))
LISTEN 0 4096 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=1042,fd=3))
LISTEN 0 4096 127.0.0.1:85 0.0.0.0:* users:(("pvedaemon",pid=1400,fd=12))
What it means: UI and SSH listen on all interfaces (0.0.0.0). That’s not automatically wrong, but it raises stakes: firewalling and network isolation must be correct.
Decision: If you can, bind management to an admin interface; otherwise enforce strict inbound allowlists.
Task 10: Check actual firewall rules on the host (truth serum)
cr0x@server:~$ nft list ruleset | sed -n '1,80p'
table inet filter {
chain input {
type filter hook input priority filter; policy drop;
iif "lo" accept
ct state established,related accept
ip saddr 10.50.0.0/24 tcp dport { 22, 8006 } accept
ip saddr 10.50.0.0/24 icmp type echo-request accept
counter log prefix "PVE-DROP " drop
}
}
What it means: Default drop, explicit allow from admin subnet to SSH/UI. This is what you want: readable, minimal, enforceable.
Decision: If policy is accept or rules are missing, fix firewall config and re-verify here—not only in the UI.
Task 11: Verify SSH config isn’t quietly permissive
cr0x@server:~$ sshd -T | egrep '^(passwordauthentication|permitrootlogin|pubkeyauthentication|allowusers|allowgroups)'
passwordauthentication no
permitrootlogin no
pubkeyauthentication yes
What it means: Password auth is off; root login is off; key auth is on. If passwordauthentication yes, you’re inviting brute force if exposed.
Decision: Disable password authentication and restrict sources; then test access with a second session open.
Task 12: Check for pending reboots after kernel/security updates
cr0x@server:~$ [ -f /var/run/reboot-required ] && echo "reboot required" || echo "no reboot required"
reboot required
What it means: You’re running with a kernel/userspace mismatch risk and un-applied security fixes that require reboot.
Decision: Schedule migration and reboot during the next window. Don’t let this sit for months.
Task 13: See what updates are pending (before you surprise yourself)
cr0x@server:~$ apt-get -s dist-upgrade | sed -n '1,40p'
Reading package lists... Done
Building dependency tree... Done
Calculating upgrade... Done
The following packages will be upgraded:
pve-manager pve-kernel-6.8 proxmox-widget-toolkit
3 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
What it means: Simulated upgrade shows what will change. If you see corosync or libc changes, plan for reboots and careful sequencing.
Decision: Approve the change in your maintenance plan; stage one node first.
Task 14: Validate management UI TLS certificate state (avoid self-inflicted distrust)
cr0x@server:~$ openssl s_client -connect 127.0.0.1:8006 -servername pve1 -showcerts </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer -dates
subject=CN = pve1
issuer=CN = pve1
notBefore=Sep 1 00:00:00 2025 GMT
notAfter=Sep 1 00:00:00 2035 GMT
What it means: You’re using a self-signed cert (issuer equals subject). That’s fine internally if you trust/distribute it; it’s sloppy if you expose UI externally.
Decision: Keep UI internal. If you must expose, terminate TLS properly with a managed certificate and strong auth in front.
Task 15: Check auth failures in logs (attack surface reality check)
cr0x@server:~$ journalctl -u pveproxy --since "24 hours ago" | egrep -i 'authentication|login|failed' | tail -n 10
Dec 28 08:41:12 pve1 pveproxy[1520]: authentication failure; rhost=198.51.100.23 user=root@pam
Dec 28 08:41:14 pve1 pveproxy[1520]: authentication failure; rhost=198.51.100.23 user=admin@pve
What it means: Someone is knocking. If that IP isn’t your VPN/bastion, your management plane is reachable from untrusted networks.
Decision: Fix routing/firewalling now. Then consider banning at the edge; host bans are a bandage, not armor.
Task 16: Confirm only expected subnets can reach the UI from a remote vantage point
cr0x@server:~$ nc -vz -w2 10.10.10.11 8006
nc: connect to 10.10.10.11 port 8006 (tcp) failed: Connection refused
What it means: From this vantage point, UI is not reachable. “Connection refused” or “timed out” is what you want from non-admin networks.
Decision: If the port is open from general networks, correct ACLs and validate again from multiple networks.
Fast diagnosis playbook (first/second/third checks)
When “security stuff” breaks, it tends to break access. When access breaks, people panic and start disabling controls.
This playbook is designed to keep you from “fixing” security by turning it off.
Scenario A: Admin can’t log in to the UI
- Check time sync (TOTP/WebAuthn failure is often clock drift). Run timedatectl. If unsynchronized, fix NTP and retry.
- Check reachability from the right network. Are you on VPN/admin VLAN? Validate with nc -vz pve1 8006.
- Check pveproxy logs for auth errors vs TLS vs network blocks (journalctl -u pveproxy).
Scenario B: Cluster operations are slow or failing after firewall changes
- Check quorum (pvecm status). If quorum is unstable, stop making changes.
- Check inter-node connectivity on the cluster network (ping is insufficient; verify relevant ports if you know them, and check drops).
- Inspect nftables counters/logs (nft list ruleset, then look for drop counters). If you see drops between node IPs, fix allow rules.
Scenario C: You suspect the UI is exposed publicly
- Check logs for foreign IPs hitting auth endpoints (journalctl -u pveproxy).
- Confirm bind/listen (ss -lntp). Listening on 0.0.0.0:8006 isn’t proof of exposure, but it increases risk.
- Verify edge controls: can a non-admin vantage point reach 8006? If yes, lock it down at the edge and at the host.
Common mistakes: symptom → root cause → fix
1) Symptom: 2FA codes “never work” for multiple users
Root cause: Time drift on the node (or clients), NTP disabled, or virtualization host clock issues.
Fix: Enable NTP, verify timedatectl shows synchronized, then re-enroll if needed. Don’t lower auth requirements to “fix” this.
2) Symptom: Operator can delete storage or change networking “by accident”
Root cause: RBAC role assigned at / or too-broad built-in roles used as convenience.
Fix: Create scoped groups and ACLs at specific paths/pools; reserve PVEAdmin for a small cluster-admin group.
3) Symptom: Firewall enabled, but port 8006 is still reachable from everywhere
Root cause: Firewall enabled at datacenter but disabled at node, or rules applied to wrong interface, or an upstream firewall/NAT bypasses expectations.
Fix: Verify both datacenter and node options via pvesh. Confirm effective rules via nft list ruleset. Fix edge ACLs.
4) Symptom: Cluster becomes unstable after “tightening firewall rules”
Root cause: Blocking corosync/knet traffic between nodes, or mixing mgmt and cluster networks with asymmetric rules.
Fix: Explicitly allow inter-node cluster traffic on the cluster interface. Validate quorum stability before and after changes.
5) Symptom: After updates, VMs migrate but performance tanks
Root cause: Kernel/driver mismatch, NIC offload settings changed, or storage path impacted; also possible that one node is now “different.”
Fix: Align versions across nodes. Check dmesg, NIC driver versions, and storage health. Roll changes gradually next time.
6) Symptom: SSH access lost after hardening
Root cause: Disabled password auth before distributing keys, or restricted IPs incorrectly, or locked out by firewall.
Fix: Always keep an active root console (iLO/IPMI/KVM) before SSH changes. Apply allowlists carefully and test with a second session open.
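A sketch of the non-lockout sequence, assuming systemd-based Debian/PVE, a hypothetical admin account and node name, and console access held in reserve:
cr0x@server:~$ sshd -t # syntax-check the new config before applying it
cr0x@server:~$ systemctl reload ssh
cr0x@server:~$ ssh -o ConnectTimeout=5 admin@pve1 'echo new sessions still work' # from a separate terminal
Only after the new session succeeds do you close the old one. Existing SSH sessions survive a reload; the open second session is your proof that new ones do too.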
7) Symptom: Backups succeed but restores fail during an incident
Root cause: Restores weren’t tested; backup storage permissions/paths changed; encryption keys unavailable; or restore targets blocked by RBAC.
Fix: Test restores quarterly (at least). Store keys in a controlled system. Ensure backup operators can restore to a quarantine environment.
Three corporate mini-stories from the trenches
Story 1: The incident caused by a wrong assumption
A mid-sized company ran a three-node Proxmox cluster for internal services: CI runners, a few databases, and a “temporary” jump box
that became permanent because everything temporary is permanent. They had a next-gen firewall at the edge and believed the management UI
“was not exposed.” No one could point to a rule, but everyone could point to a belief.
An audit started after a vendor asked for an IP allowlist. The security team did a quick scan from outside and found port 8006 open.
Not “filtered.” Open. The edge firewall had a NAT rule for convenience during a remote migration months earlier, and nobody removed it.
The UI was reachable with only password auth for a couple of legacy accounts.
The event timeline wasn’t dramatic. A bot hit the UI, tried common usernames, found one account that used an old password pattern, and logged in.
Once inside, it didn’t need kernel exploits. It used the UI like an admin: downloaded VM backups, mounted disks, and created a new privileged user.
The “hypervisor compromise” was really just “admin plane compromise.”
Recovery was painful because the team’s mental model was wrong. They kept investigating VM vulnerabilities while the attacker was operating the control plane.
The fix was also boring: remove public exposure, enforce 2FA, rotate credentials, and add a standing change-control checklist for NAT rules.
They later admitted the actual root cause: assuming the firewall configuration matched the architecture diagram.
Story 2: The optimization that backfired
Another organization wanted “less friction” for operators. They centralized access by putting the Proxmox UI behind an internal reverse proxy,
then opened the proxy to a broader corporate network so on-call could reach it from anywhere on the company VPN. They also added an SSO-like flow
for convenience. It worked. Until it didn’t.
The proxy became a single point of failure and a single point of compromise. During a routine proxy upgrade, TLS settings changed and some clients
started failing. Operators, under pressure, began bypassing the proxy by directly exposing 8006 on a temporary basis “just for tonight.”
That temporary state lived long enough to show up in logs as a steady stream of auth attempts from networks that were never meant to touch management.
Meanwhile, the proxy obscured source IPs unless carefully configured, which made auditing suspicious logins harder. The team had effectively traded
a clear security boundary for a convenience layer that demanded its own operational excellence. They didn’t staff for that excellence.
The eventual fix was to simplify: UI access only via a dedicated admin VPN segment; no broad corporate reach. They kept the proxy internally for a few
workflows, but removed it as a “security feature.” The lesson was blunt: convenience layers are systems, and systems need owners.
Story 3: The boring but correct practice that saved the day
A smaller team running Proxmox for a SaaS product had one habit that looked almost old-fashioned: a weekly maintenance window with a written runbook.
Every Wednesday, they patched one node, migrated workloads, rebooted if needed, then moved to the next node. They logged what changed and what they observed.
It wasn’t glamorous. It was reliable.
One week, they applied updates and noticed pvecm status showed intermittent quorum warnings during migrations. Because they always checked cluster health
after each node, they caught it early. The cause was a switch port flapping on the cluster network—unrelated to the patch, but revealed by the routine.
They paused the rollout, fixed the physical issue, and resumed. No customer impact. The on-call engineer even slept.
The security win wasn’t just “they patched.” It was that their process made anomalies obvious before they became incidents.
Boring practices don’t get budget applause, but they do prevent budget meetings with unpleasant facial expressions.
FAQ
1) Should I ever expose Proxmox port 8006 to the Internet if I use 2FA?
No. 2FA reduces account takeover risk; it does not reduce service exposure risk. Keep 8006 on an admin network or behind VPN/bastion.
2) Is Proxmox RBAC enough, or do I still need Linux user controls?
You need both. RBAC governs Proxmox API/UI actions. Linux users/SSH govern host access. Treat host shell access as higher privilege than UI-only access.
3) What’s the minimum viable “secure remote access” setup?
A VPN into an admin subnet, with firewall rules allowing 22 and 8006 only from that subnet, plus 2FA for UI logins and SSH keys for shell.
4) How do I handle break-glass access when 2FA is mandatory?
Create a dedicated break-glass account with strong controls: stored credentials offline, restricted IP allowlist, monitored use, and a documented procedure.
Use it only to recover access, then rotate it.
5) Are API tokens safer than passwords?
Usually yes, because they can be scoped, rotated, and revoked without disrupting a human account. But treat tokens like secrets: protect them, log their use,
and expire them.
6) Do I need fail2ban on Proxmox?
If your management plane is correctly isolated, fail2ban is optional. If you suspect exposure, fail2ban can reduce noise, but it’s not a substitute for fixing exposure.
7) What firewall policy should I use: DROP or REJECT?
For management interfaces, prefer DROP to reduce signal. For internal troubleshooting convenience, REJECT can be helpful. Pick one deliberately and document it.
8) How often should I patch Proxmox nodes?
Weekly or biweekly is a healthy cadence for most environments, with the ability to accelerate for urgent security advisories. The key is consistency and staging.
9) If I isolate management on a VLAN, do I still need the Proxmox firewall?
Yes. VLANs reduce exposure; firewalls reduce blast radius when VLAN boundaries fail, are misconfigured, or when an internal system is compromised.
10) What’s the biggest “silent risk” in Proxmox security?
Over-broad permissions that stick around. Most organizations don’t get hacked by a genius; they get hurt by yesterday’s convenience.
Next steps you can execute this week
- Prove management isn’t exposed: from a non-admin network, confirm 8006 and 22 are unreachable. If reachable, fix edge rules immediately.
- Turn on 2FA for every human. Fix NTP first. Create a documented break-glass flow.
- RBAC cleanup sprint: list ACLs, remove PVEAdmin from / except for a tiny admin group, and scope everything else.
- Verify firewall effectiveness: check configuration via pvesh and reality via nft list ruleset.
- Patch with staging: simulate updates, update one node, verify quorum and core workflows, then roll forward.
- SSH harden carefully: keys-only, no root login, restricted sources; keep console access available while changing it.
- Write the two runbooks you’ll need: “lost 2FA device” and “compromised admin account.” Incidents don’t wait for documentation.
The goal isn’t to build an impenetrable fortress. It’s to make Proxmox management access intentionally scarce, permissions intentionally narrow,
and changes intentionally routine. The rest is just Linux.