Somewhere in your environment, there’s a script, a CI runner, or a “temporary” admin workstation with a Proxmox root password baked into it. It works. It also means one leaked string can power-cycle production, delete backups, and reconfigure networks before your on-call coffee cools.
API tokens are how you stop treating the root password like a master key that everyone borrows. Done right, they’re scoped, revocable, auditable, and boring. Boring is the goal.
What you’re really securing (and why passwords keep failing)
Proxmox isn’t “just a hypervisor.” It’s your control plane: power actions, storage wiring, network changes, VM consoles, snapshots, replication, backups, firewall rules, and user identities. If someone can hit the Proxmox API with broad permissions, they don’t need to “hack” your VMs. They can become your platform team.
Root passwords fail in predictable ways:
- They spread. One password becomes a shared secret across humans and machines.
- They don’t scope. You can’t say “this automation can only start/stop these VMs and read these metrics” with a password.
- They’re hard to rotate. Rotation means breaking scripts. People delay. Attackers don’t.
- They blur accountability. “root did it” is not an audit trail, it’s a shrug.
API tokens don’t automatically fix culture, but they make good behavior cheaper than bad behavior. Tokens can be scoped and rotated without humans retyping secrets at 2 a.m. Tokens can be created per app, per pipeline, per integration. Tokens can die alone without dragging the whole org into a password reset spiral.
Facts and context that change decisions
Here are concrete facts—some historical, some architectural—that should affect how you design Proxmox access:
- Proxmox VE centralizes control via a REST API. The web UI is essentially an API client. If you can do it in the UI, you can usually do it via API—and scripts will.
- API tokens are per-user, not independent identities. In Proxmox, tokens attach to a user identity; your RBAC design still matters.
- RBAC in Proxmox is built around roles and ACL paths. If you don't understand path scoping (like /vms/100 vs /), you will either break automation or over-grant it.
- "Least privilege" became mainstream because perimeter security failed repeatedly. Once networks stopped being "trusted inside," scoped credentials became the only sane assumption.
- Credential stuffing is old, reliable, and boring. Attackers reuse leaked passwords because it still works. Root passwords are especially reusable because humans reuse them in especially human ways.
- Incident response runs on revocation. Password rotation is slow; token revocation is fast. Fast wins during containment.
- Auditability is a feature, not compliance theater. When a VM gets destroyed, you want “which token” and “which service” not “which admin maybe.”
- Automation increases blast radius if you don’t deliberately shrink it. CI systems often run with “cluster admin” because it’s easy. Easy is how outages become multi-site.
Joke #1: A shared root password is like a communal toothbrush—technically functional, morally questionable, and you don’t want to know where it’s been.
Threat model: the three attackers you actually have
Security advice gets weird when it’s built for movie villains. Let’s keep it operational. In real environments, Proxmox access typically fails because of one of these:
1) The well-meaning internal engineer with too much access
This is not a character flaw; it’s a systems flaw. If the easiest way to fix a problem is to use root, people will use root. The result is accidental damage that looks exactly like malice: deleted disks, wrong firewall rule pushed, live migration triggered during maintenance on the wrong node.
2) The compromised automation runner
Your CI runner, GitOps agent, or “backup controller” is a computer on the internet (or at least on a network). It will eventually be popped or misconfigured. When it is, the question is not “can it access Proxmox,” but “what can it do once inside?”
3) The external attacker who got a foothold elsewhere
Most attackers don’t start at Proxmox. They start at email, a VPN credential, a developer laptop, a web app, or a supply-chain dependency. Then they pivot. Proxmox is a pivot jackpot: control plane access is the shortest path to persistence and destruction.
Your job is to build a system where compromise of one token or one runner does not equal compromise of the cluster.
Proxmox API tokens: the model, limits, and sharp edges
Proxmox API tokens are credentials you create under a Proxmox user. They can be granted privileges via the same RBAC and ACL system as users. Tokens can also be marked as “privilege separation” (meaning they don’t automatically inherit everything the user has). That knob is where most security posture is won or lost.
Tokens are not magic; RBAC is the magic
If you create tokens for root@pam and give them broad cluster permissions, you have not improved anything. You have just changed the shape of the problem and made it easier to exfiltrate: tokens are designed to be used by machines, so machines will store them.
Scope is path-based, and path mistakes are common
Proxmox ACLs apply to paths like:
- / (cluster-wide)
- /nodes/<node> (node-specific)
- /vms/<vmid> (VM-specific)
- /storage/<storage-id> (storage-specific)
- /pool/<poolname> (resource pools)
Most “oops” incidents come from granting at / because someone wanted a token to do one task and couldn’t figure out the right path or role. So they went global. Congrats, you just built an outage button.
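To build intuition for path scoping, here is a deliberately simplified Python model of how a propagating grant covers sub-paths. This is an illustration of the principle, not Proxmox's actual permission-resolution code (which also handles groups, role privileges, and overrides at deeper paths):

```python
# Simplified model of Proxmox-style ACL path scoping (illustrative only).

def grant_covers(grant_path: str, resource_path: str, propagate: bool) -> bool:
    """Return True if a grant at grant_path applies to resource_path."""
    if grant_path == resource_path:
        return True
    if not propagate:
        return False
    # A propagating grant covers everything strictly beneath its path.
    prefix = grant_path.rstrip("/") + "/"
    return resource_path.startswith(prefix)

# A propagating grant at "/" covers everything -- the outage button.
assert grant_covers("/", "/vms/100", propagate=True)
# A grant scoped to one VM does not leak to its neighbors.
assert grant_covers("/vms/100", "/vms/100", propagate=True)
assert not grant_covers("/vms/100", "/vms/101", propagate=True)
# Prefix matching respects path boundaries, not raw string prefixes.
assert not grant_covers("/nodes/pve1", "/nodes/pve1-extra", propagate=True)
```

The point of the model: a grant at / plus propagation is a superset of every narrower grant you will ever write, which is exactly why it keeps getting used as a shortcut.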
Token lifecycle matters more than token creation
Creating tokens is easy. Managing them is where programs go to die:
- Where are tokens stored? (CI secrets, Vault, env vars, files on disk.)
- How often are they rotated?
- How do you detect usage anomalies?
- How do you revoke in minutes?
Paraphrased idea (Gene Kim): “Improving systems is about shortening feedback loops and making changes safe to repeat.” That applies to credential rotation, too.
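The lifecycle questions above are easy to operationalize with even a tiny inventory. A minimal sketch of such a registry follows; the field names and the 90-day rotation window are assumptions for illustration, not Proxmox features:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class TokenRecord:
    token_id: str      # e.g. "backup@pve!restic"
    owner: str         # a team that answers pages, never "everyone"
    purpose: str
    store: str         # where the secret lives (Vault path, CI secret name)
    last_rotated: date

def overdue(records: list[TokenRecord], max_age_days: int = 90) -> list[str]:
    """Return token IDs whose last rotation is older than the policy window."""
    cutoff = date.today() - timedelta(days=max_age_days)
    return [r.token_id for r in records if r.last_rotated < cutoff]

registry = [
    TokenRecord("backup@pve!restic", "platform", "VM backups",
                "vault:kv/proxmox/backup", date.today()),
    TokenRecord("ci@pve!ansible", "delivery", "config mgmt",
                "ci-secret:PVE_TOKEN", date(2024, 1, 1)),
]
print(overdue(registry))  # the stale CI token shows up for rotation
```

Even a flat file with these five fields beats tribal knowledge: it answers "who owns this" and "when does it rotate" without an archaeology project.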
Design principles: least privilege that survives real life
1) Tokens are per-integration, not per-team
If “the DevOps token” exists, you’ve built shared fate. Create tokens per tool and per environment: terraform-prod, backup-controller, monitoring-readonly, ci-staging. Humans shouldn’t share tokens; machines shouldn’t share tokens either.
2) Use “privilege separation” unless you are deliberately doing an admin token
A token created without privilege separation inherits all of the parent user's privileges. With privilege separation (the default), the token has only the ACLs you explicitly attach. The parent user can be an admin for emergency use, while the token stays scoped for automation.
3) Prefer resource pools to per-VM ACL sprawl
Granting ACLs on dozens of individual VMs is how you end up with “just give it /” later. Pools let you group resources and grant permissions at the pool path. Your future self will thank you with fewer 2 a.m. edits.
4) Separate duties: provisioning is not the same as operating
Provisioning tools often need rights to create VMs, attach storage, set tags, and configure network. Operators and backup systems need different rights. Don’t bundle everything because it’s “one pipeline.” Pipelines are not identities.
5) Build for revocation drills
Revocation isn’t an emergency-only maneuver. Practice it. You should be able to revoke a token and watch a controlled failure in the dependent system, with a clean error message and a rollback plan.
6) Design around the most common failure mode: someone grants too much
Make over-granting harder:
- Restrict who can edit ACLs at /.
- Keep a small number of predefined roles (and review them).
- Require change control for cluster-wide permissions.
Joke #2: “Temporary admin access” has the same lifespan as a plastic bag in the ocean.
Practical tasks (commands, output, and decisions)
These are hands-on tasks you can run on a Proxmox node or via the CLI tools. Each includes what the output means and the decision you make from it. The point is operational clarity, not checkbox security.
Task 1: Confirm cluster state before changing auth
cr0x@server:~$ pvecm status
Cluster information
-------------------
Name: prod-pve
Config Version: 17
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Tue Feb 4 11:12:21 2026
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.24
Quorate: Yes
What it means: You’re quorate and cluster config changes should replicate cleanly. If quorum is No, avoid RBAC churn; you may create drift between nodes.
Decision: Proceed only when quorate. If not, fix cluster health first.
Task 2: List users to find “automation hiding as humans”
cr0x@server:~$ pveum user list
┌──────────────┬───────────┬───────────┬────────────────────────────┐
│ userid │ enable │ expire │ firstname │
╞══════════════╪═══════════╪═══════════╪════════════════════════════╡
│ root@pam │ 1 │ 0 │ │
│ alice@pve │ 1 │ 0 │ Alice │
│ ci@pve │ 1 │ 0 │ CI Runner │
│ monitor@pve │ 1 │ 0 │ Monitoring │
└──────────────┴───────────┴───────────┴────────────────────────────┘
What it means: Local users exist. If you see “service accounts” as regular named humans or vice versa, you likely have token sprawl and unclear ownership.
Decision: Create dedicated service users (or realm-backed service identities) for each integration. Don’t reuse human users for automation.
Task 3: Inspect API tokens and spot inheritance mistakes
cr0x@server:~$ pveum user token list ci@pve
┌──────────────┬───────────────┬────────┬──────────────┐
│ tokenid │ expire │ enable │ privsep │
╞══════════════╪═══════════════╪════════╪══════════════╡
│ terraform │ 0 │ 1 │ 1 │
│ ansible │ 0 │ 1 │ 0 │
└──────────────┴───────────────┴────────┴──────────────┘
What it means: privsep=0 means the token inherits the user’s privileges. That’s often how “least privilege” quietly dies.
Decision: Recreate tokens with privilege separation enabled unless you have a documented reason not to (privilege separation is set when a token is created). Then explicitly ACL the token.
Task 4: Create a dedicated service user (no shell, no extras)
cr0x@server:~$ sudo pveum user add backup@pve --comment "Backup controller service user"
What it means: You created an identity that can hold tokens and ACLs. It’s not a human; it shouldn’t be used interactively.
Decision: Use one service user per integration domain (backup, monitoring, provisioning). Avoid “one service user to rule them all.”
Task 5: Create a token with privilege separation and capture it once
cr0x@server:~$ sudo pveum user token add backup@pve restic --privsep 1
┌──────────────┬──────────────────────────────────────────────┐
│ key │ value │
╞══════════════╪══════════════════════════════════════════════╡
│ full-tokenid │ backup@pve!restic │
│ value │ 7c9dbbb2-3e76-4b3b-8d9f-0c8af2c5d2a1 │
└──────────────┴──────────────────────────────────────────────┘
What it means: This is the only time you’ll see the token value in plaintext. Store it in your secret manager immediately.
Decision: If you can’t store secrets properly, stop here and fix that first. Tokens amplify whatever secret hygiene you already have.
Task 6: Create a custom role instead of reusing “Administrator”
cr0x@server:~$ sudo pveum role add BackupOperator -privs "VM.Audit VM.Backup Datastore.Audit"
What it means: You defined a role with explicit privileges. The exact privilege strings should match your Proxmox version and intended operations.
Decision: Prefer small roles with names matching a job function. If you can’t explain a privilege in one sentence, don’t grant it yet.
Task 7: Apply ACLs at a constrained path (pool or specific storage)
cr0x@server:~$ sudo pveum acl modify /pool/prod-vms --tokens 'backup@pve!restic' --roles BackupOperator
What it means: The token gets permissions only for resources within the prod-vms pool (depending on what lives there and how you operate backups).
Decision: If you can’t scope by pool, you’re likely missing good resource organization. Fix organization instead of granting /.
Task 8: Verify ACLs and catch accidental cluster-wide grants
cr0x@server:~$ pveum acl list
┌───────────────┬───────────────────────┬───────────────┬─────────────┐
│ path │ ugid │ roleid │ propagate │
╞═══════════════╪═══════════════════════╪═══════════════╪═════════════╡
│ / │ root@pam │ Administrator │ 1 │
│ /pool/prod-vms│ backup@pve!restic │ BackupOperator│ 1 │
│ / │ ci@pve!ansible │ Administrator │ 1 │
└───────────────┴───────────────────────┴───────────────┴─────────────┘
What it means: The ci@pve!ansible token has Administrator at /. That’s a problem unless it’s a consciously accepted risk with compensating controls.
Decision: Remove broad ACLs for automation tokens. Create scoped roles and paths.
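The check in Task 8 is worth automating. Below is a sketch that flags broad token grants from pveum's JSON output; the assumption is that the JSON records carry the same path/ugid/roleid fields as the table columns above (verify the exact field names against your pveum version, e.g. via pveum acl list --output-format json):

```python
import json

# Sample records shaped like the Task 8 table; in real use, feed in the
# JSON output of pveum acl list instead of this hardcoded sample.
acl_json = '''[
  {"path": "/", "ugid": "root@pam", "roleid": "Administrator", "propagate": 1},
  {"path": "/pool/prod-vms", "ugid": "backup@pve!restic", "roleid": "BackupOperator", "propagate": 1},
  {"path": "/", "ugid": "ci@pve!ansible", "roleid": "Administrator", "propagate": 1}
]'''

def risky_token_grants(acls: list[dict]) -> list[dict]:
    """Flag API tokens (ugid contains '!') granted Administrator at '/'."""
    return [a for a in acls
            if "!" in a["ugid"] and a["path"] == "/" and a["roleid"] == "Administrator"]

flagged = risky_token_grants(json.loads(acl_json))
for a in flagged:
    print(f"REVIEW: {a['ugid']} has {a['roleid']} at {a['path']}")
```

Run something like this on a schedule and page on non-empty output; a cluster-admin automation token should never appear without someone noticing.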
Task 9: Remove an overpowered ACL (safely)
cr0x@server:~$ sudo pveum acl delete / --tokens 'ci@pve!ansible' --roles Administrator
What it means: You revoked cluster-admin from that token at the root path.
Decision: Immediately confirm what the automation still needs; re-grant only the minimum required permissions at narrow paths.
Task 10: Validate token authentication against the local API endpoint
cr0x@server:~$ curl -k -s \
-H "Authorization: PVEAPIToken=backup@pve!restic=7c9dbbb2-3e76-4b3b-8d9f-0c8af2c5d2a1" \
https://127.0.0.1:8006/api2/json/version | jq
{
"data": {
"release": "8.1",
"repoid": "c6d7f9a0",
"version": "8.1.4"
}
}
What it means: The token is valid and can reach the API. This is a basic liveness/auth test.
Decision: If this fails, don’t start reworking roles blindly. First confirm time sync, token formatting, and that you’re hitting the right node/realm.
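The same liveness check from Python, for automation authors. The header format PVEAPIToken=<user>@<realm>!<tokenid>=<secret> is the documented token scheme; the token ID and secret below are placeholders from the examples above:

```python
def pve_token_header(full_token_id: str, secret: str) -> dict[str, str]:
    """Build the Authorization header for a Proxmox API token."""
    return {"Authorization": f"PVEAPIToken={full_token_id}={secret}"}

hdr = pve_token_header("backup@pve!restic",
                       "7c9dbbb2-3e76-4b3b-8d9f-0c8af2c5d2a1")
print(hdr["Authorization"])

# Usage sketch with requests (placeholder host; verify=False mirrors
# curl -k and should be replaced with proper CA trust in production):
# import requests
# r = requests.get("https://127.0.0.1:8006/api2/json/version",
#                  headers=hdr, verify=False)
# print(r.json())
```

Keeping header construction in one tested function also stops the most common leak: engineers pasting the full curl command, secret included, into tickets and chat.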
Task 11: Confirm authorization by trying an action that should be denied
cr0x@server:~$ curl -k -s \
-H "Authorization: PVEAPIToken=backup@pve!restic=7c9dbbb2-3e76-4b3b-8d9f-0c8af2c5d2a1" \
https://127.0.0.1:8006/api2/json/nodes | jq
{
"errors": {
"permission": "permission denied - invalid privileges"
}
}
What it means: Good. Your backup token can’t enumerate nodes cluster-wide. That’s least privilege working.
Decision: Keep it denied unless your backup system truly needs node inventory. If it does, grant read-only at a narrower scope, not admin.
Task 12: Check for root SSH access (because API tokens won’t save you from SSH drift)
cr0x@server:~$ sudo sshd -T | egrep 'permitrootlogin|passwordauthentication'
permitrootlogin yes
passwordauthentication yes
What it means: Root can SSH in with a password. That’s a credential bypass around your token program.
Decision: Disable password auth and direct root SSH. Use sudo from named accounts or use console access for emergencies.
Task 13: Enforce safer SSH defaults (and know what you changed)
cr0x@server:~$ sudo sh -c 'cat >/etc/ssh/sshd_config.d/99-hardening.conf <<EOF
PermitRootLogin no
PasswordAuthentication no
EOF'
cr0x@server:~$ sudo systemctl reload ssh
What it means: Root SSH is off; passwords are off. If someone steals a password, it’s less useful. If someone steals a token, it won’t help them SSH.
Decision: Ensure you have tested key-based access for break-glass accounts before turning this on cluster-wide.
Task 14: Check time sync (token auth failures love clock skew)
cr0x@server:~$ timedatectl
Local time: Tue 2026-02-04 11:15:10 UTC
Universal time: Tue 2026-02-04 11:15:10 UTC
RTC time: Tue 2026-02-04 11:15:10
Time zone: UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
What it means: Clock is synced. When it isn’t, you get intermittent auth and TLS weirdness that looks like “Proxmox is flaky.” It’s not. Your time is.
Decision: If not synchronized, fix NTP before debugging tokens or TLS.
Task 15: Monitor auth and API errors in logs
cr0x@server:~$ sudo journalctl -u pveproxy -u pvedaemon --since "30 min ago" | tail -n 20
Feb 04 10:58:22 pve1 pveproxy[1832]: authentication failure; rhost=10.20.1.55 user=backup@pve msg=invalid token value
Feb 04 11:02:17 pve1 pvedaemon[1711]: api call failed: permission denied - invalid privileges
Feb 04 11:07:44 pve1 pveproxy[1832]: successful auth for user 'monitor@pve' from 10.20.2.20
What it means: You can distinguish bad token values from insufficient privileges. That difference matters: one is secret distribution, the other is RBAC design.
Decision: If “invalid token value,” rotate and fix secret handling. If “invalid privileges,” adjust ACL paths/roles, not token storage.
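The triage rule in Task 15 ("bad secret vs bad RBAC") is mechanical enough to script. A sketch that classifies journal lines by the two message fragments shown above (the bucket names are assumptions for illustration):

```python
def classify_auth_failure(line: str) -> str:
    """Map a pveproxy/pvedaemon log line to a remediation bucket."""
    if "invalid token value" in line:
        return "secret-handling"   # rotate the token, fix distribution
    if "invalid privileges" in line:
        return "rbac-design"       # fix ACL paths/roles, not token storage
    if "successful auth" in line:
        return "ok"
    return "unclassified"

log = [
    "pveproxy[1832]: authentication failure; user=backup@pve msg=invalid token value",
    "pvedaemon[1711]: api call failed: permission denied - invalid privileges",
    "pveproxy[1832]: successful auth for user 'monitor@pve' from 10.20.2.20",
]
print([classify_auth_failure(l) for l in log])
```

Wire a classifier like this into log shipping and you get two distinct alerts instead of one ambiguous "Proxmox auth failure" page.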
Task 16: Find which process is listening on the Proxmox API port
cr0x@server:~$ sudo ss -ltnp | grep ':8006'
LISTEN 0 4096 0.0.0.0:8006 0.0.0.0:* users:(("pveproxy",pid=1832,fd=6))
What it means: pveproxy is listening on 8006. If you see something else, you have a packaging or process problem.
Decision: If pveproxy isn’t listening, don’t blame tokens. Fix the service first.
Task 17: Rotate a token without downtime (pattern)
cr0x@server:~$ sudo pveum user token add backup@pve restic-v2 --privsep 1
┌──────────────┬──────────────────────────────────────────────┐
│ key │ value │
╞══════════════╪══════════════════════════════════════════════╡
│ full-tokenid │ backup@pve!restic-v2 │
│ value │ 5c1a4e61-9b9c-4f1f-9c7f-9d7a1b4a8d20 │
└──────────────┴──────────────────────────────────────────────┘
cr0x@server:~$ sudo pveum acl modify /pool/prod-vms --tokens 'backup@pve!restic-v2' --roles BackupOperator
cr0x@server:~$ sudo pveum user token list backup@pve
┌──────────────┬───────────────┬────────┬──────────────┐
│ tokenid │ expire │ enable │ privsep │
╞══════════════╪═══════════════╪════════╪══════════════╡
│ restic │ 0 │ 1 │ 1 │
│ restic-v2 │ 0 │ 1 │ 1 │
└──────────────┴───────────────┴────────┴──────────────┘
What it means: You now have two valid tokens with identical ACLs. Update the dependent system to use v2, verify, then revoke v1.
Decision: Always rotate by overlap, not by cutover, unless you enjoy paging yourself.
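Rotation-by-overlap is a small state machine: never revoke the old token until the new one is deployed and verified. A sketch with injected callables so the ordering can be dry-run without a cluster (the callable names are assumptions for illustration; the comments show which pveum/API steps each stand-in represents):

```python
from typing import Callable

def rotate_by_overlap(create_v2: Callable[[], str],
                      deploy: Callable[[str], None],
                      verify: Callable[[str], bool],
                      revoke_v1: Callable[[], None]) -> str:
    """Create, deploy, and verify the new token before revoking the old one."""
    new_secret = create_v2()       # e.g. pveum user token add ... restic-v2
    deploy(new_secret)             # push to the secret store / runner
    if not verify(new_secret):     # e.g. GET /api2/json/version using v2
        raise RuntimeError("v2 token failed verification; v1 left in place")
    revoke_v1()                    # only now: pveum user token remove ... restic
    return new_secret

# Dry run with stand-ins to show the ordering guarantee:
events = []
secret = rotate_by_overlap(
    create_v2=lambda: "new-secret",
    deploy=lambda s: events.append(f"deployed {s}"),
    verify=lambda s: True,
    revoke_v1=lambda: events.append("revoked v1"),
)
print(events)  # deploy happens before revoke, by construction
```

If verification fails, the function raises before revocation, so the old token keeps working and nobody gets paged for a self-inflicted outage.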
Task 18: Revoke a token cleanly during an incident
cr0x@server:~$ sudo pveum user token remove backup@pve restic
cr0x@server:~$ sudo pveum user token list backup@pve
┌──────────────┬───────────────┬────────┬──────────────┐
│ tokenid │ expire │ enable │ privsep │
╞══════════════╪═══════════════╪════════╪══════════════╡
│ restic-v2 │ 0 │ 1 │ 1 │
└──────────────┴───────────────┴────────┴──────────────┘
What it means: The compromised token is dead. Calls using it should start failing immediately.
Decision: After revocation, check logs for continued attempts. If attempts continue, you’ve found the compromised runner or leaked secret location.
Fast diagnosis playbook
When “the token doesn’t work” or “the automation is broken,” don’t wander. Follow a sequence that isolates the bottleneck in minutes.
First: is the API reachable and healthy?
- Check ss -ltnp | grep :8006 to ensure pveproxy is listening.
- Check systemctl status pveproxy pvedaemon for crash loops.
- From the runner host, test connectivity to node:8006 (firewall/routing).
Interpretation: If the port is down or services are unhealthy, token changes won’t matter. Fix platform health first.
Second: is auth failing or authorization failing?
- Look at journalctl -u pveproxy -u pvedaemon for “invalid token value” vs “invalid privileges”.
- Try a simple API call like /api2/json/version.
Interpretation: Invalid token value = secret handling/format. Invalid privileges = RBAC/ACL design.
Third: did you scope the ACL to the right path?
- List ACLs and look for where the token is granted.
- Confirm the resource is actually under that pool/path.
- Check role privileges are sufficient but not broad.
Interpretation: Most least-privilege failures are wrong path. Second most are wrong role privilege strings.
Fourth: is “privilege separation” doing what you think it is?
- List tokens and confirm privsep=1.
- Check whether you accidentally granted the parent user too much at / and relied on inheritance.
Interpretation: If a token is overpowered, it’s usually because it inherited something or was granted at /.
Fifth: is time/TLS causing intermittent auth issues?
- Check timedatectl across nodes and runners.
- If runners validate TLS, confirm certificate trust paths and hostname match.
Interpretation: Clock skew and TLS mismatch masquerade as “token is bad,” especially during node rebuilds.
Common mistakes: symptoms → root cause → fix
1) Symptom: “Everything works in the web UI, but the token gets permission denied”
Root cause: You tested as a human user (who has broad privileges) and assumed the token inherits them. Token is privsep=1 with no ACLs, or ACLs are on the wrong path.
Fix: Attach ACLs to the token explicitly at the correct path; validate with a minimal API call that requires the exact privilege.
2) Symptom: “Token works for a day, then randomly fails”
Root cause: Token value rotated in secret store but not deployed everywhere; or multiple runners have stale values; or clock skew triggers auth edge behavior.
Fix: Implement rotation-by-overlap; add a deployment check that confirms token works before rollout; enforce NTP everywhere.
3) Symptom: “Automation can delete VMs even though it shouldn’t”
Root cause: Token has Administrator at / (often because someone granted it for debugging and never removed it), or token inheritance from an admin user is enabled.
Fix: Remove the root-path ACL; recreate the token with --privsep 1; build narrow roles that grant only the privileges needed (e.g., just VM.PowerMgmt).
4) Symptom: “Revoking the token didn’t stop the behavior”
Root cause: The actor isn’t using that token (wrong assumption), or there are multiple tokens, or they are using a password/SSH route instead.
Fix: Grep logs for the user/token identity; inventory all tokens; disable password-based access routes; revoke systematically.
5) Symptom: “We can’t scope permissions without breaking the pipeline”
Root cause: The pipeline is doing too much: provisioning, firewall, storage, and VM operations under one identity. Or resource organization is missing (no pools, inconsistent naming).
Fix: Split identities by function and stage; create pools; redesign pipeline steps so each uses a specific token.
6) Symptom: “Tokens are leaking into logs”
Root cause: Tooling prints headers, environment variables, or debug output; engineers copy-paste failing commands into chat.
Fix: Disable verbose HTTP logging; scrub CI logs; use masked secrets; enforce “no secrets in tickets” operational policy with tooling support.
Three corporate-world mini-stories (all too plausible)
Mini-story 1: The incident caused by a wrong assumption
They had a Proxmox cluster supporting internal services: build agents, artifact storage, some small databases that everyone swore were “temporary.” The platform team decided to “do the right thing” and moved automation from a root password to an API token.
The wrong assumption: “If the user can do it in the UI, the token will be able to do it too.” They created a token under an admin user, turned on privilege separation because it sounded safer, and never attached ACLs to the token. The user had everything. The token had almost nothing.
At midnight, their CI system tried to spin up a VM for a deployment and received permission denied. The retry logic was enthusiastic. It hammered the API, filled logs, and the on-call saw “Proxmox authentication failure” alerts and assumed compromise. They started revoking credentials and breaking other integrations, because the first mental model was “attack,” not “mis-scoped token.”
What fixed it wasn’t heroics. It was reading the logs carefully: “invalid privileges,” not “invalid token value.” They attached the right role at the pool path, then rate-limited the pipeline retries. The biggest lesson was cultural: when you change auth, you must treat it like a production deployment with staged rollout, not a UI tweak.
Mini-story 2: The optimization that backfired
A different company wanted faster provisioning. Their Terraform pipeline created VMs, set tags, attached ISOs, configured firewall rules, and even did node maintenance tasks. All through one token. It was fast because it was effectively cluster-admin.
They “optimized” further by caching the token in a shared runner image so jobs wouldn’t need to fetch secrets at runtime. This shaved seconds off builds. It also meant every ephemeral runner had a copy of the token on disk, and old images lived in registries and caches longer than anyone admitted.
Months later, an unrelated security review found the token in a runner image layer. Not because of advanced forensics—because someone ran strings on an image during a routine scan and it popped out like a confession. They revoked the token. Half their automation stopped working. The other half kept working because it had a second token hardcoded in a different place, left behind by an earlier migration.
The backfire wasn’t just exposure; it was loss of control. They didn’t know where the secrets were. They couldn’t rotate cleanly. They couldn’t even inventory. After the cleanup, their pipeline was slower by a bit, but their incident response time improved drastically. Speed is good; predictable revocation is better.
Mini-story 3: The boring but correct practice that saved the day
Another org ran Proxmox for edge workloads. Nothing glamorous. Lots of small clusters. Lots of automation. They did one boring thing consistently: every token had an owner, a purpose, a scope, and a rotation date. They stored this metadata in a simple internal registry, and they practiced rotation quarterly.
One morning, monitoring showed unusual API activity: repeated permission denied attempts from a host that shouldn’t have been talking to Proxmox. The logs had a token ID in the auth header identity. They found the exact token in the registry: it belonged to a staging CI runner, scoped to a staging pool, and it was not supposed to be used from that network segment.
They revoked it immediately, and nothing production-related broke because the token was scoped and staging-only. Then they traced the runner host: it had been reimaged and accidentally placed into a broader network. That was the real bug. The token design turned a scary signal into a contained incident.
Their “boring practice” wasn’t expensive. It was disciplined. It prevented a staging mistake from becoming a production outage, and it gave them a calm Tuesday instead of an executive call.
Checklists / step-by-step plan
Step-by-step: move an integration off root password to a scoped token
- Inventory current access. Identify where the root password is used (CI variables, scripts, cron jobs, config management).
- Create a dedicated service user. Name it after the integration, not a person (e.g., backup@pve).
- Create a token with privilege separation. Store the token value once in a secret manager.
- Create or reuse a narrow role. Only the privileges required for the integration.
- Scope ACLs by pool/storage/path. Avoid / unless it is genuinely a cluster-admin integration (rare).
- Test with one minimal API call. Validate auth (/version), then validate an authorized call, then validate a denied call.
- Roll out in a canary. One runner, one environment, one job.
- Enable overlap rotation from day one. Add v2 token, deploy, verify, revoke v1.
- Remove old password usage. Delete it from secrets, scripts, images; don’t leave “backup” credentials lying around.
- Write down ownership and expiry/rotation. If it isn’t owned, it will become immortal.
Checklist: what “good” looks like
- Root password is not used by automation.
- Root SSH login is disabled; password auth is disabled where feasible.
- Tokens use privilege separation by default.
- Tokens are scoped to pools, VMs, storage IDs, or nodes, not /.
- Roles are small, named, and reviewed periodically.
- Token inventory exists (owner, purpose, created, last rotated).
- Rotation is practiced, not promised.
- Logs can distinguish bad secrets vs bad permissions quickly.
Checklist: incident containment for suspected credential compromise
- Identify the token/user from logs (pveproxy/pvedaemon).
- Revoke the token immediately (don’t “wait for confirmation”).
- Search for continued attempts to use the revoked token.
- Locate the secret distribution point (CI, config, disk).
- Rotate adjacent tokens used on the same runner or secret store.
- Confirm no broad ACLs exist at / for automation tokens.
- Review recent API actions (who did what, from where) and validate VM and storage integrity.
FAQ
1) Should I create tokens under root@pam?
No for automation. Use dedicated service users and privilege separation. Root should be for emergency human access, not CI.
2) What does “privilege separation” actually buy me?
It prevents the token from automatically inheriting the user’s permissions. That forces you to explicitly grant what the token needs, which is the whole point.
3) Can I restrict a token to a specific VM only?
Yes, by applying ACLs at /vms/<vmid> with a narrow role. In practice, pools scale better, but per-VM is useful for high-risk systems.
4) How do I rotate tokens without breaking production?
Use overlap rotation: create a second token with identical ACLs, deploy it, verify it, then revoke the old one. Don’t “flip the switch” unless you can tolerate downtime.
5) Do tokens replace two-factor authentication (2FA)?
No. Tokens are for machine-to-machine authentication. 2FA is for humans. You want both: strong human login controls and scoped machine credentials.
6) What’s the safest place to store tokens?
A proper secret manager with access policies and audit logs. If you must use CI secret variables, ensure masking is enforced and debug logs can’t print headers.
7) Why not just put Proxmox behind a VPN and keep using passwords?
Because internal networks and VPNs are not trust boundaries anymore. You need scoped, revocable credentials even when the network is “private.” VPN is a layer, not a strategy.
8) How many roles should we have?
Fewer than you think, but more than one. Start with 5–10 job-focused roles (read-only monitoring, VM power control, provisioning, backup, network admin). Expand only when you can’t express a need cleanly.
9) What if an integration needs broad access across many VMs?
Group those VMs into a pool and scope to /pool/<name>. If it truly needs cluster-wide permissions, treat that token like a production admin credential: extra controls, shorter rotation, tighter storage, and explicit approval.
10) How do I know which tokens exist and who owns them?
Proxmox can list tokens, but ownership is a process problem. Maintain an internal registry (even a simple table) mapping token IDs to owners, systems, and rotation cadence.
Conclusion: next steps that actually stick
API tokens are not a security trophy. They’re a way to make compromise survivable and operations sane. If you keep the root password as the default automation credential, you’re effectively telling attackers—and tired engineers—that the control plane is one leaked secret away.
Practical next steps:
- Pick one integration (CI, backup, monitoring) and move it to a dedicated service user + privilege-separated token this week.
- Create one narrow custom role that matches that integration’s real job, not your fears.
- Scope the ACL to a pool or specific paths. Remove any Administrator grants at / for automation tokens.
- Disable root SSH login and password authentication where feasible, and verify you still have a working break-glass path.
- Implement overlap rotation and do a rotation drill. Treat it like a deploy: staged, tested, reversible.
- Start a token inventory with owners and rotation dates. If you can’t name the owner, you don’t have a token—you have a liability.
If you do only one thing: stop letting “root, everywhere” be the easiest path. Replace it with scoped tokens and an audit trail that tells the truth.