Proxmox VE for beginners: the first 10 settings to fix after install (security, storage, backups)

You installed Proxmox VE. The web UI loads. You can create a VM. You feel powerful.

Then a disk fills at 3 a.m., backups “succeed” but don’t restore, and someone discovers your admin UI is reachable from the guest network. That’s the moment Proxmox stops being a lab toy and becomes infrastructure. The good news: the first 10 fixes are mostly boring, mostly quick, and wildly effective.

Quick historical context (because it explains the sharp edges)

  • Proxmox VE’s roots are Debian. That means you get Debian’s stability and packaging culture—plus Debian’s expectation that you act like a sysadmin, not a wizard.
  • KVM became “the Linux hypervisor” over a decade ago. Proxmox rides that maturity. Most weirdness you see is networking/storage, not KVM itself.
  • LXC containers are older than Docker. LXC is closer to “lightweight VM” semantics: great for services, but you inherit kernel-sharing constraints.
  • ZFS wasn’t built for “oops I filled the pool.” It’s a transactional filesystem that rewards planning and punishes lying about free space.
  • Thin provisioning predates cloud hype. Overcommit has always been an accounting trick with a physical bill coming later.
  • Cluster filesystems and quorum are old, cranky disciplines. Corosync isn’t new magic; it’s distributed systems reality with better tooling.
  • Backup “success” has been misleading since tape days. A green checkmark means “job ran,” not “data is recoverable.”
  • “Infrastructure as pets” used to be normal. Proxmox is flexible enough to let you keep doing that. Don’t.

One quote to keep taped to your monitor, paraphrasing an idea from John Allspaw: reliability comes from designing for failure and learning quickly, not from pretending you can eliminate failure.

1) Fix the package repositories (and update like an adult)

Right after install, Proxmox is either pointed at the enterprise repository (which needs a subscription) or at something you inherited from a guide you half-trust. Your first job is to make updates predictable. Predictability beats heroics.

Task 1: See what repositories you’re actually using

cr0x@server:~$ grep -R --line-number -E 'pve|proxmox|ceph' /etc/apt/sources.list /etc/apt/sources.list.d/*.list
/etc/apt/sources.list.d/pve-enterprise.list:1:deb https://enterprise.proxmox.com/debian/pve bookworm pve-enterprise
/etc/apt/sources.list.d/ceph.list:1:deb http://download.proxmox.com/debian/ceph-quincy bookworm no-subscription

What the output means: You’re currently using the enterprise repo for PVE (subscription required) and no-subscription for Ceph.

Decision: If you don’t have a subscription, disable pve-enterprise and use the no-subscription repo. If you do have one, keep enterprise and remove no-subscription to avoid mixed provenance.

Task 2: Disable enterprise repo (if you don’t have a subscription)

cr0x@server:~$ sed -i 's/^deb /# deb /' /etc/apt/sources.list.d/pve-enterprise.list
cr0x@server:~$ cat /etc/apt/sources.list.d/pve-enterprise.list
# deb https://enterprise.proxmox.com/debian/pve bookworm pve-enterprise

What the output means: It’s commented out. APT won’t try to hit enterprise and fail with 401s.

Decision: Add the no-subscription repo so you can get security updates.

Task 3: Add the no-subscription repo (PVE)

cr0x@server:~$ echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" | tee /etc/apt/sources.list.d/pve-no-subscription.list
deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription
cr0x@server:~$ apt-get update
Hit:1 http://download.proxmox.com/debian/pve bookworm InRelease
Hit:2 http://security.debian.org/debian-security bookworm-security InRelease
Reading package lists... Done

What the output means: Indexes updated successfully. No red “401 Unauthorized” spam.

Decision: Patch the host now, then adopt a routine (weekly is fine for most small shops; daily for exposed environments).

Task 4: Check what will be upgraded before you do it

cr0x@server:~$ apt-get -s dist-upgrade | sed -n '1,25p'
Reading package lists... Done
Building dependency tree... Done
Calculating upgrade... Done
The following packages will be upgraded:
  pve-kernel-6.8.12-4-pve proxmox-ve pve-manager ...
12 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

What the output means: Simulation mode (-s) shows impact. Kernel updates mean reboots. Management updates can restart services.

Decision: If this is production, schedule the reboot window. If it’s your first node, reboot now and learn the cycle.

Opinionated rule: Don’t “set and forget” host updates. Proxmox is a control plane. Unpatched control planes are how you earn weekend work.

2) Lock down users, realms, and permissions

Proxmox makes it easy to run everything as root@pam. That’s also how you end up with a shared password in a sticky note ecosystem. Use named accounts and least privilege. Your future self will send a thank-you card.

Task 5: List users and see who can do what

cr0x@server:~$ pveum user list
┌─────────────┬───────────┬───────────┬──────────────────────┐
│ userid      │ enable    │ expire    │ firstname            │
╞═════════════╪═══════════╪═══════════╪══════════════════════╡
│ root@pam    │ 1         │           │                      │
│ alice@pam   │ 1         │           │ Alice                │
└─────────────┴───────────┴───────────┴──────────────────────┘
cr0x@server:~$ pveum acl list | head
/:
  user:root@pam role:Administrator

What the output means: Only root@pam is explicitly admin. That’s typical on fresh installs.

Decision: Create a named admin user, then restrict root usage to break-glass.

Task 6: Create a named admin and require MFA at the edge (or at least strong auth)

cr0x@server:~$ pveum user add sre-admin@pam --comment "Named admin account"
cr0x@server:~$ pveum passwd sre-admin@pam
Enter new password:
Retype new password:
cr0x@server:~$ pveum aclmod / -user sre-admin@pam -role Administrator
cr0x@server:~$ pveum acl list | head -n 10
/:
  user:root@pam role:Administrator
  user:sre-admin@pam role:Administrator

What the output means: You now have a second admin, which means you can stop using root day-to-day.

Decision: Keep root@pam for emergencies and automation that truly needs it, not for casual clicking.

Corporate mini-story: incident caused by a wrong assumption

At a mid-sized company, a new virtualization admin assumed “the Proxmox UI is on the management LAN, so it’s not reachable by anyone untrusted.” It was true in his head, not in the switches.

A contractor plugging into a spare port landed in the same VLAN as the hypervisors. No one noticed because everything “worked,” and the networking diagram was last updated when people still printed them.

They didn’t get breached by a nation-state. They got hit by a bored scanner that found the login page and brute-forced a weak shared root password. The attacker created a VM, used it as a pivot, and the first sign was a spike in outbound traffic.

The fix wasn’t exotic: separate management networking, restrict UI access, disable password SSH for root, and stop sharing credentials. The painful lesson was that assumptions are not controls.

3) Harden SSH and stop logging in as root

SSH is the skeleton key. Treat it like one. You want: key-based auth, no root login, and a trail you can follow.

Task 7: Check current SSH settings that matter

cr0x@server:~$ sshd -T | egrep '^(permitrootlogin|passwordauthentication|pubkeyauthentication|kbdinteractiveauthentication) '
permitrootlogin yes
passwordauthentication yes
pubkeyauthentication yes
kbdinteractiveauthentication no

What the output means: Root login and password authentication are allowed. That’s convenient; it’s also an invitation.

Decision: Disable root login and password authentication once you have keys working for a named user.
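
If you haven’t pushed a key yet, ssh-copy-id is the least error-prone way. A minimal sketch, assuming the named admin and address used throughout this article:

cr0x@server:~$ ssh-copy-id -i ~/.ssh/id_ed25519.pub sre-admin@192.0.2.10
Number of key(s) added: 1
cr0x@server:~$ ssh sre-admin@192.0.2.10 'true' && echo "key auth works"
key auth works

Only after that test passes should you apply the hardening drop-in below.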

Task 8: Apply a basic hardening drop-in

cr0x@server:~$ install -d -m 0755 /etc/ssh/sshd_config.d
cr0x@server:~$ cat > /etc/ssh/sshd_config.d/99-hardening.conf <<'EOF'
PermitRootLogin no
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes
EOF
cr0x@server:~$ sshd -t && systemctl reload ssh

What the output means: sshd -t validates syntax. Reload applies it without dropping existing sessions.

Decision: Test a new SSH session before closing your current one. Locking yourself out is a rite of passage; it’s also avoidable.

Task 9: Confirm you can still log in (and root cannot)

cr0x@server:~$ ssh -o PreferredAuthentications=publickey sre-admin@192.0.2.10 'id'
uid=1001(sre-admin) gid=1001(sre-admin) groups=1001(sre-admin)
cr0x@server:~$ ssh -o PreferredAuthentications=password root@192.0.2.10 'true'
root@192.0.2.10: Permission denied (publickey).

What the output means: Named admin works with keys. Root is blocked.

Decision: Keep console access (IPMI/iKVM/physical) as your break-glass path. Don’t rely on SSH as your only door.

4) Turn on the firewall (the right way)

Proxmox has a decent firewall. It’s not magical. The most common failure mode is enabling it everywhere without understanding rule order and locking yourself out.

Short joke #1: Firewalls are like seatbelts: you only regret not using them during the exciting parts.

Task 10: See current firewall status and rules

cr0x@server:~$ pve-firewall status
Status: disabled/running (cluster: disabled)
cr0x@server:~$ pve-firewall localnet
192.0.2.0/24

What the output means: The firewall daemon is running but not enforcing. localnet is what Proxmox treats as “trusted.” If that’s wrong, your rules will be wrong.

Decision: Define the management subnet(s) as localnet, then enable firewall at the datacenter level, then node level—carefully.

Task 11: Validate the management IP and what subnet it’s on

cr0x@server:~$ ip -br addr show vmbr0
vmbr0             UP             192.0.2.10/24 fe80::5054:ff:fe12:3456/64
cr0x@server:~$ ip route show default
default via 192.0.2.1 dev vmbr0

What the output means: Your management interface is on vmbr0 with 192.0.2.10/24. Default route is also there.

Decision: If guests share this network, stop and redesign. Management should not be “just another VLAN” that VMs can join casually.

Task 12: Enable firewall and allow only what you need to the host

cr0x@server:~$ pvesh set /cluster/firewall/options --enable 1
cr0x@server:~$ pvesh set /nodes/$(hostname)/firewall/options --enable 1
cr0x@server:~$ pvesh get /nodes/$(hostname)/firewall/options
┌────────────┬────────┐
│ key        │ value  │
╞════════════╪════════╡
│ enable     │ 1      │
└────────────┴────────┘

What the output means: Firewall is enabled cluster-wide and on the node. You still need explicit rules, or you may break access depending on defaults.

Decision: Add a management allow rule for the web UI and SSH from localnet, then a default drop for everything else.
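
One way to do that from the CLI is the firewall rules API. A sketch, assuming 192.0.2.0/24 really is your management subnet; keep a console session open while you test:

cr0x@server:~$ pvesh create /cluster/firewall/rules --type in --action ACCEPT --source 192.0.2.0/24 --proto tcp --dport 22,8006 --enable 1 --comment "mgmt: SSH + web UI"
cr0x@server:~$ pvesh set /cluster/firewall/options --policy_in DROP
cr0x@server:~$ pve-firewall status
Status: enabled/running (cluster: enabled)

Verify you can still reach 8006 and 22 from a management host before you walk away.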

Task 13: Confirm ports are listening, then confirm they’re only reachable from the right place

cr0x@server:~$ ss -lntp | egrep '(:22|:8006)\s'
LISTEN 0      128          0.0.0.0:22       0.0.0.0:*    users:(("sshd",pid=1123,fd=3))
LISTEN 0      4096         0.0.0.0:8006     0.0.0.0:*    users:(("pveproxy",pid=1876,fd=6))

What the output means: SSH and the Proxmox web proxy are bound on all IPv4 addresses. That’s normal on a single-interface node.

Decision: If you can’t isolate with binding, isolate with network design and firewall rules. “Bound to 0.0.0.0” is fine if your firewall and VLANs are correct.

5) Fix management networking and bridges

Most Proxmox pain is actually Linux networking pain in a nice coat. Bridges are powerful; they’re also honest. If your physical network is sloppy, Proxmox will faithfully reproduce that sloppiness at line rate.

Task 14: Inspect the actual bridge config Proxmox is using

cr0x@server:~$ cat /etc/network/interfaces
auto lo
iface lo inet loopback

auto enp3s0
iface enp3s0 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.0.2.10/24
        gateway 192.0.2.1
        bridge-ports enp3s0
        bridge-stp off
        bridge-fd 0

What the output means: Simple: one NIC, one bridge. Guests attached to vmbr0 are on the same L2 as your management interface.

Decision: If this is anything but a lab, split management and guest traffic with VLANs or a second NIC. At minimum: management on a tagged VLAN only IT can reach.
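
A common pattern is a VLAN-aware bridge with management on a tagged sub-interface. A sketch for /etc/network/interfaces, assuming VLAN 10 is your management VLAN (IDs and addresses are placeholders; apply with ifreload -a from console access, not over SSH):

auto vmbr0
iface vmbr0 inet manual
        bridge-ports enp3s0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

auto vmbr0.10
iface vmbr0.10 inet static
        address 192.0.2.10/24
        gateway 192.0.2.1

Guests then attach to vmbr0 with their own VLAN tags, and untagged access to the management IP goes away.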

Task 15: Check link health and driver issues (cheap wins)

cr0x@server:~$ ethtool enp3s0 | egrep 'Speed|Duplex|Link detected'
Speed: 1000Mb/s
Duplex: Full
Link detected: yes

What the output means: You’re at 1GbE. That’s fine for a homelab; it’s also how you accidentally build a “slow storage” problem that’s actually a “slow network” problem.

Decision: If you plan shared storage or backups over the network, 10GbE isn’t luxury. It’s margin.

6) Time, NTP, and certificates: boring until it burns you

If time is off, TLS complains, clusters get weird, and logs become fiction. If certificates are a mess, your browser trains you to ignore warnings—until the one time it shouldn’t.

Task 16: Verify time sync and time zone

cr0x@server:~$ timedatectl
               Local time: Sun 2025-12-28 11:02:15 UTC
           Universal time: Sun 2025-12-28 11:02:15 UTC
                 RTC time: Sun 2025-12-28 11:02:16
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

What the output means: Clock is synchronized. UTC is a good default for servers unless you have a strong reason not to.

Decision: Keep hosts in UTC. Adjust display in your monitoring tool if humans insist.
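
If your nodes should sync against an internal time source, point chrony at it (PVE 8 ships chrony; older installs may use systemd-timesyncd). A sketch, with ntp1.internal.example as a placeholder:

cr0x@server:~$ echo "server ntp1.internal.example iburst" > /etc/chrony/conf.d/10-internal.conf
cr0x@server:~$ systemctl restart chrony
cr0x@server:~$ chronyc tracking | head -n 2
Reference ID    : C0000201 (ntp1.internal.example)
Stratum         : 3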

Task 17: Check certificate status (and avoid training bad habits)

cr0x@server:~$ pvecm status 2>/dev/null | head -n 5
Cluster information
-------------------
Name:             pve-cluster
Config Version:   1
Transport:        knet

What the output means: If you’re already clustered, certificate and hostname consistency matter more. Even on a single node, keep the hostname stable.

Decision: Don’t keep re-installing and renaming nodes casually. Proxmox is forgiving, but clusters remember.
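
To inspect the certificate itself, openssl works against the files Proxmox manages. A sketch, using the default node certificate path:

cr0x@server:~$ openssl x509 -in /etc/pve/local/pve-ssl.pem -noout -subject -enddate
subject=OU = PVE Cluster Node, O = Proxmox Virtual Environment, CN = server.example.lan
notAfter=Dec 27 10:00:00 2027 GMT

If the CN doesn’t match the hostname you browse to, fix the hostname or reissue the certificate (pvecm updatecerts) instead of teaching yourself to click through warnings.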

7) Choose a storage layout that matches reality

Storage decisions are where beginners get quietly ruined. Proxmox will happily let you build something that benchmarks great and fails catastrophically. You need to decide: local storage vs shared storage, ZFS vs LVM-thin, and where backups land.

Three sane beginner patterns

  • Single node, local disks: ZFS mirror for VMs + separate disk/dataset for backups (preferably not the same pool).
  • Small cluster without shared storage: Local ZFS on each node + backups to a dedicated Proxmox Backup Server (PBS) + accept that live migration needs shared storage or replication strategies.
  • Cluster with shared storage: Ceph (3+ nodes minimum for sanity) or a dedicated SAN/NAS. This is not “beginner day one” unless you enjoy distributed systems at 2 a.m.

Task 18: Identify your disks and current storage stack

cr0x@server:~$ lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINTS
NAME          SIZE TYPE FSTYPE MOUNTPOINTS
sda         447.1G disk
├─sda1       1007K part
├─sda2          1G part vfat   /boot/efi
└─sda3      446.1G part zfs_member
sdb           3.6T disk
└─sdb1        3.6T part

What the output means: Your OS/VM pool is on ZFS (zfs_member). You have another large disk (sdb) that’s currently unused or formatted elsewhere.

Decision: Use sdb for backups, not for “extra VM space.” Backups competing with VM I/O is how you manufacture latency.
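
One way to dedicate sdb to backups is a single-purpose pool registered as directory storage. A sketch, assuming sdb is empty and expendable (this destroys anything on it; pool and storage names are placeholders):

cr0x@server:~$ zpool create -o ashift=12 backuppool /dev/sdb
cr0x@server:~$ zfs set compression=lz4 backuppool
cr0x@server:~$ pvesm add dir backup-local --path /backuppool --content backup

Same-host backups are a convenience tier, not disaster recovery; section 9 deals with getting them off the box.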

Task 19: Check pool health before you trust it

cr0x@server:~$ zpool status
  pool: rpool
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          sda3      ONLINE       0     0     0

errors: No known data errors

What the output means: Pool is online with no errors. Also: it’s a single disk. That’s not redundancy; it’s hope.

Decision: If this is anything important, rebuild as a mirror. A single disk ZFS root pool is fine for a lab, not for your payroll VM.

Interesting fact: ZFS checksums every block and can self-heal—but only if you give it a second copy (mirror/RAIDZ). On a single disk, it can detect corruption and then shrug.
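
Converting that single-disk pool to a mirror is one command once a matching partition exists on the second disk. A sketch, assuming sdb has been partitioned to mirror sda’s layout (for a root pool you must also replicate the boot/EFI partitions and register them with proxmox-boot-tool, which this sketch skips):

cr0x@server:~$ zpool attach rpool sda3 /dev/sdb3
cr0x@server:~$ zpool status rpool | grep -A2 mirror
          mirror-0  ONLINE       0     0     0
            sda3    ONLINE       0     0     0
            sdb3    ONLINE       0     0     0

Resilver runs in the background; until it completes, you still have only one real copy.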

8) ZFS basics: what to set, what to leave alone

ZFS tuning has a hobbyist culture. Some of it is real. Much of it is people cargo-culting 2016 forum posts onto 2025 kernels. Your job is to keep it simple, measurable, and reversible.

The beginner ZFS defaults you should not fight

  • Let ARC manage itself unless you have a specific memory pressure problem.
  • Don’t enable random compression algorithms because you read they’re “faster.” Use lz4 and move on.
  • Don’t set a giant recordsize without understanding your workload. VMs are random I/O creatures.

Task 20: Confirm key ZFS dataset properties for VM storage

cr0x@server:~$ zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool              98G   300G    96K   /rpool
rpool/ROOT         12G   300G    96K   /rpool/ROOT
rpool/data         72G   300G    96K   /rpool/data
rpool/data/vm-100-disk-0  40G  300G    40G   -
cr0x@server:~$ zfs get -o name,property,value compression,atime,recordsize,volblocksize rpool/data
NAME       PROPERTY      VALUE
rpool/data compression   lz4
rpool/data atime         off
rpool/data recordsize    128K
rpool/data volblocksize  -

What the output means: lz4 and atime=off are good. recordsize on a dataset matters for files; ZVOLs use volblocksize.

Decision: For VM disks, focus on ZVOL block size at creation time (often 16K is a good general compromise for mixed workloads). Don’t churn these settings blindly.
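
You can check what an existing ZVOL uses and set the default for new disks at the storage level. A sketch, assuming a zfspool storage named local-zfs backed by rpool/data (volblocksize is fixed at creation; changing the storage default only affects disks created afterwards):

cr0x@server:~$ zfs get -H -o value volblocksize rpool/data/vm-100-disk-0
16K
cr0x@server:~$ pvesm set local-zfs --blocksize 16k

Existing disks keep their block size; to change one, move the disk to a storage with the new default.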

Task 21: Check pool free space and fragmentation risk

cr0x@server:~$ zpool list
NAME    SIZE   ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH
rpool   446G    98G   348G         -     12%    22%  1.00x  ONLINE

What the output means: 22% used is fine. Trouble begins when you treat “FREE” as spendable down to zero.

Decision: Keep ZFS pools under ~80% used for steady performance. Under ~70% if you want fewer surprises with snapshots and churn.

Task 22: Set an explicit refreservation/space policy for safety (optional but smart)

cr0x@server:~$ zfs set refreservation=20G rpool/data
cr0x@server:~$ zfs get refreservation rpool/data
NAME       PROPERTY        VALUE   SOURCE
rpool/data refreservation  20G     local

What the output means: You reserved 20G so the pool doesn’t hit “0 bytes free” as easily under snapshot growth.

Decision: On small pools, a reservation is cheap insurance. On large pools, you may prefer monitoring + strict quotas.

Corporate mini-story: optimization that backfired

A team wanted to “save space” on a Proxmox cluster. They flipped on deduplication in ZFS because it looked like free money in a slide deck. They also enabled aggressive compression tweaks and kept thin-provisioning everything because the numbers looked pretty.

For a week, it was fine. Then the workload changed: more database VMs, more churn, more snapshots. CPU usage on the hosts climbed. Latency became spiky. The helpdesk called it “random slowness,” the most infuriating category of slowness.

Root cause: ZFS dedup is memory-hungry and punishing when starved. The DDT didn’t fit in RAM, so reads and writes dragged the pool through extra I/O. They had optimized for capacity and accidentally purchased a latency generator.

The rollback was painful because data was now in a deduped state. They ended up migrating VM disks off to a new pool built sanely: lz4, no dedup, ample free space, and a clear snapshot retention policy. The space savings they wanted were real, but the cost was operational. Capacity is a metric. Latency is a user experience.

9) Backups that restore: PBS, schedules, retention, verification

Proxmox makes it easy to schedule backups. It also makes it easy to think you have backups when you have a folder full of files that nobody has ever restored.

Do this properly:

  • Back up to a separate system (ideally PBS). Not the same pool. Not the same host.
  • Use sensible retention. Not “keep forever,” not “keep two.”
  • Verify. Test restore. Automate a canary restore if you can.

Task 23: Confirm current backup jobs (or the lack thereof)

cr0x@server:~$ pvesh get /cluster/backup --output-format json-pretty
[]

What the output means: No scheduled backups. This is the default and also the trap.

Decision: Create a job today. If you’re “not ready,” you’re also not ready to lose data.
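
Creating a scheduled job from the CLI is one API call. A sketch, assuming a current PVE release where backup jobs take calendar-event schedules (storage and retention values are placeholders; point it at remote storage once you have some):

cr0x@server:~$ pvesh create /cluster/backup --schedule "02:00" --storage local --mode snapshot --compress zstd --all 1 --prune-backups keep-daily=7,keep-weekly=4
cr0x@server:~$ pvesh get /cluster/backup --output-format json-pretty | grep -E 'schedule|storage'
    "schedule" : "02:00",
    "storage" : "local",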

Task 24: Check configured storage targets (where backups could go)

cr0x@server:~$ pvesm status
Name             Type     Status           Total            Used       Available        %
local             dir     active        98.00 GiB       12.30 GiB       85.70 GiB   12.55%
local-lvm     lvmthin     active       300.00 GiB       70.20 GiB      229.80 GiB   23.40%

What the output means: You only have local storage. Backups here are better than nothing, but not survivable if the host dies.

Decision: Add PBS or at least an NFS target on a different box. Prefer PBS for dedup + verification semantics designed for virtualization backups.
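
Attaching a PBS datastore from the CLI is one pvesm call once the PBS side exists. A sketch with placeholder server, datastore, credentials, and fingerprint (the fingerprint comes from the PBS dashboard):

cr0x@server:~$ pvesm add pbs pbs-main --server 192.0.2.50 --datastore main --username backup@pbs --password 'REPLACE-ME' --fingerprint 64:d3:...:2f
cr0x@server:~$ pvesm status | grep pbs-main
pbs-main          pbs     active         1.95 TiB      120.00 GiB        1.83 TiB    6.01%

Don’t leave real secrets in shell history; in practice, set the password via the UI or use an API token.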

Task 25: Validate that your backup destination won’t fill immediately

cr0x@server:~$ df -h /var/lib/vz
Filesystem      Size  Used Avail Use% Mounted on
rpool/ROOT/pve-1  98G   13G   86G  13% /

What the output means: If you back up to local, you’re spending root filesystem space. That’s how you brick a node with “No space left on device” during a backup window.

Decision: Do not dump backups on the root filesystem long-term. Add separate backup storage.

Task 26: Run a manual backup and read the result like it matters

cr0x@server:~$ vzdump 100 --storage local --mode snapshot --compress zstd --notes-template '{{guestname}} {{vmid}}'
INFO: starting new backup job: vzdump 100 --storage local --mode snapshot --compress zstd
INFO: Backup of VM 100 started at 2025-12-28 11:22:10
INFO: status = running
INFO: VM Name: app01
INFO: include disk 'scsi0' 'local-lvm:vm-100-disk-0' 50G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive 'vzdump-qemu-100-2025_12_28-11_22_10.vma.zst'
INFO: Total bytes written: 9876543210 (9.1GiB, 220.0MiB/s)
INFO: backup ended at 2025-12-28 11:23:05
INFO: Backup job finished successfully

What the output means: Snapshot mode succeeded, compression used, throughput is shown. “Finished successfully” is necessary—but still not sufficient for recoverability.

Decision: Immediately test restore into a new VM ID or restore to a scratch location. Don’t wait for the incident.

Task 27: Validate the backup file exists and has reasonable size

cr0x@server:~$ ls -lh /var/lib/vz/dump | tail -n 3
-rw-r--r-- 1 root root 9.2G Dec 28 11:23 vzdump-qemu-100-2025_12_28-11_22_10.vma.zst
-rw-r--r-- 1 root root  637 Dec 28 11:23 vzdump-qemu-100-2025_12_28-11_22_10.log

What the output means: Archive plus log file exist. If your “backup” is a few megabytes for a big VM, you likely backed up the wrong thing or hit an error.

Decision: Read the log for warnings and confirm disks were included.

Task 28: Test restore (the part everyone skips)

cr0x@server:~$ qmrestore /var/lib/vz/dump/vzdump-qemu-100-2025_12_28-11_22_10.vma.zst 900 --storage local-lvm
restore vma archive: zstd -q -d -c /var/lib/vz/dump/vzdump-qemu-100-2025_12_28-11_22_10.vma.zst | vma extract -v -r /var/tmp/vzdumptmp1234 - /var/tmp/vzdumptmp1234
progress 1% (reading archive)
progress 55% (reading archive)
progress 100% (reading archive)
restore successful

What the output means: The restore pipeline ran and completed. That’s your real signal.

Decision: Boot the restored VM in an isolated network, verify application health, then delete it. Repeat regularly.
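
“Isolated” can be as simple as starting the restored VM with its NIC administratively down. A sketch against the scratch VM ID used above:

cr0x@server:~$ qm set 900 --net0 virtio,bridge=vmbr0,link_down=1
update VM 900: -net0 virtio,bridge=vmbr0,link_down=1
cr0x@server:~$ qm start 900
cr0x@server:~$ qm destroy 900 --purge

link_down keeps the guest off the network while you verify it from the console; destroy it only after the application checks out.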

Interesting fact: The phrase “backup window” comes from a time when you had to stop everything else to stream to tape. Your SSDs didn’t get the memo, but your users still feel the pain if you saturate storage I/O.

10) Logs, alerts, and capacity signals

A Proxmox node can be “healthy” right up until it isn’t. You want early signals: disk usage, pool health, SMART warnings, backup failures, and unexpected reboots.

Task 29: Check journal for errors worth caring about

cr0x@server:~$ journalctl -p warning -b --no-pager | tail -n 20
Dec 28 09:12:44 server kernel: nvme nvme0: I/O 123 QID 4 timeout, aborting
Dec 28 09:12:45 server kernel: ata1.00: failed command: READ DMA EXT
Dec 28 09:12:45 server kernel: blk_update_request: I/O error, dev sda, sector 1234567 op 0x0:(READ)

What the output means: This is not “noise.” Timeouts and I/O errors are hardware or cabling warnings, often days before a failure.

Decision: If you see storage I/O errors, stop optimizing and start replacing. Run SMART, check cables/backplane, and plan migration.

Task 30: Check SMART health quickly (if using SATA/SAS)

cr0x@server:~$ smartctl -H /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-4-pve] (local build)
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

What the output means: “PASSED” is not “good as new.” It just means the drive hasn’t declared itself dead yet.

Decision: Look at reallocated sectors, pending sectors, and error logs, not just the headline.
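
The attributes worth grepping for on SATA drives are the ones that only move when the drive is genuinely degrading. A quick sketch:

cr0x@server:~$ smartctl -A /dev/sda | egrep 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Always       -       0

Nonzero raw values here, or values that grow between checks, mean plan a replacement.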

Task 31: Check VM disk usage and thin-provisioning risk

cr0x@server:~$ lvs -o lv_name,vg_name,lv_size,data_percent,metadata_percent,lv_attr
  lv_name   vg_name   lv_size  data_percent  metadata_percent  lv_attr
  data      pve      300.00g       78.12             4.55      twi-aotz--

What the output means: Your thin pool is 78% full. Thin pools fail badly when they hit 100%: VMs can go read-only or crash depending on workload.

Decision: Add space or reduce allocations before 90%. Also monitor metadata usage; metadata full is its own special disaster.
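
Extending the thin pool needs free extents in the volume group; check first, then grow data and metadata. A sketch, assuming the installer-default pve/data layout (sizes are placeholders):

cr0x@server:~$ vgs pve
  VG  #PV #LV #SN Attr   VSize    VFree
  pve   1  12   0 wz--n- <465.76g  80.00g
cr0x@server:~$ lvextend -L +50G pve/data
cr0x@server:~$ lvextend --poolmetadatasize +1G pve/data

If VFree is zero, you’re adding a disk or deleting something. Do it before the pool decides for you.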

Short joke #2: A thin pool at 99% is Schrödinger’s storage: it’s both fine and on fire until you write one more block.

Corporate mini-story: boring but correct practice that saved the day

A finance org ran Proxmox for internal apps. Nothing glamorous: a couple of databases, file services, and a handful of VMs that everyone pretended were “temporary” for three years.

Their SRE lead insisted on a tedious routine: weekly host updates, nightly backups to PBS, and a monthly restore test of one randomly chosen VM into an isolated network. Nobody loved it. It was paperwork with extra steps.

One quarter-end, a storage controller started returning intermittent errors. The ZFS pool stayed online, then began showing checksum errors. Performance cratered. The incident commander made the only sane call: stop trying to be clever, fail over by restoring critical VMs onto a standby node with clean storage.

The restore worked because it had been tested. Credentials were documented. The PBS datastore had been verified. The team didn’t have to learn the restore process while the business was watching.

Afterward, nobody bragged about heroics. That was the point. The most valuable reliability work is the kind that looks boring in a status update.

Fast diagnosis playbook: what to check first/second/third

When Proxmox “feels slow,” people blame the hypervisor. Usually it’s one of three things: storage latency, network mis-design, or memory pressure. Here’s how to find the bottleneck fast, without guessing.

First: is the host in distress right now?

cr0x@server:~$ uptime
 11:41:02 up 3 days,  2:17,  2 users,  load average: 8.21, 7.95, 7.10
cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            64Gi        56Gi       1.2Gi       1.1Gi       6.8Gi       4.0Gi
Swap:          8.0Gi       6.5Gi       1.5Gi

How to interpret: High load with low available memory and heavy swap suggests memory pressure. Load alone is ambiguous; swap isn’t.

Decision: If swap is active and growing, reduce memory overcommit, add RAM, or move workloads. Also check ballooning and host caches.
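
Ballooning settings are visible per VM with qm config. A quick sketch against the db01 VM that appears in the inventory below (VMID 110 is from this article’s examples):

cr0x@server:~$ qm config 110 | egrep '^(memory|balloon)'
balloon: 8192
memory: 16384

A balloon value below memory means the host can reclaim the difference under pressure; balloon: 0 disables ballooning entirely.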

Second: is storage latency the real villain?

cr0x@server:~$ iostat -x 1 3
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.00    0.00    6.00   25.00    0.00   57.00

Device            r/s     w/s   rkB/s   wkB/s  await  svctm  %util
sda              5.0   120.0    80.0  9000.0  35.20   2.10  98.00

How to interpret: %iowait is high and disk %util is near 100% with high await. That’s classic storage saturation.

Decision: Find the talker VM, reduce backup/replication concurrency, move hot disks to faster storage, or add spindles/SSDs. Do not “tune kernel knobs” first.

Third: is it actually the network (especially with NFS/Ceph/PBS)?

cr0x@server:~$ ip -s link show enp3s0 | sed -n '1,8p'
2: enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 3c:ec:ef:12:34:56 brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
    9876543210 6543210      0     120       0   12345
    TX:  bytes packets errors dropped carrier collsns
    8765432109 5432109      0     250       0       0

How to interpret: Dropped packets (RX/TX) aren’t normal on clean networks. They indicate congestion, driver issues, or mismatched MTU/queues.

Decision: Fix the network before blaming Ceph/NFS. Then validate MTU consistency end-to-end if you use jumbo frames.
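
MTU validation is one ping with fragmentation forbidden. A sketch for 9000-byte jumbo frames (8972 = 9000 minus 28 bytes of IP/ICMP headers; 192.0.2.50 is a placeholder peer):

cr0x@server:~$ ping -c 2 -M do -s 8972 192.0.2.50
PING 192.0.2.50 (192.0.2.50) 8972(9000) bytes of data.
8980 bytes from 192.0.2.50: icmp_seq=1 ttl=64 time=0.21 ms
8980 bytes from 192.0.2.50: icmp_seq=2 ttl=64 time=0.19 ms

If you see “Frag needed” or timeouts instead, some hop in the path isn’t doing jumbo frames, and your storage traffic is quietly paying for it.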

Bonus: find the noisy neighbor VM quickly

cr0x@server:~$ pvesh get /nodes/$(hostname)/qemu --output-format json-pretty | head -n 20
[
  {
    "cpu": 0.85,
    "mem": 4294967296,
    "name": "app01",
    "pid": 4321,
    "status": "running",
    "vmid": 100
  },
  {
    "cpu": 3.72,
    "mem": 17179869184,
    "name": "db01",
    "pid": 9876,
    "status": "running",
    "vmid": 110
  }
]

How to interpret: Quick view of VMs with CPU usage. Not perfect, but good triage.

Decision: If one VM is pegging CPU, check its disk I/O and memory too. “High CPU” can be the symptom of storage stalls causing busy loops.

Common mistakes: symptoms → root cause → fix

1) Symptom: backups are “successful” but restore fails

Root cause: Backups stored on the same host/pool, corrupted archive, missing VM config, or never tested restore procedure.

Fix: Back up to PBS or external storage, run a monthly restore test, and keep logs. Use verification on PBS and read failures like production incidents.

2) Symptom: random VM freezes during backup window

Root cause: Storage saturation (especially with HDDs), too many concurrent backup jobs, or thin pool near full causing metadata pressure.

Fix: Reduce concurrency, stagger schedules, set I/O limits for backup traffic, move backups off primary pool, and keep free space headroom.

3) Symptom: Proxmox web UI is reachable from places it shouldn’t be

Root cause: Management and guest networks share the same bridge/VLAN; firewall disabled or misconfigured; default “0.0.0.0” binding exposed on broader LAN.

Fix: Separate management VLAN, enforce firewall rules, restrict access upstream (switch ACLs), and stop treating “internal network” as trusted.

4) Symptom: cluster gets weird after a reboot (nodes disagree, GUI shows errors)

Root cause: Time drift, DNS inconsistencies, or broken quorum expectations (two-node cluster without a tie-breaker design).

Fix: Fix NTP, use stable hostnames and forward/reverse DNS, design quorum properly (odd number of voters or a qdevice).
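
For a two-node cluster, the tie-breaker is a QDevice on a third machine. A sketch, assuming corosync-qnetd is already running on a third box at a placeholder address:

cr0x@server:~$ apt-get install -y corosync-qdevice
cr0x@server:~$ pvecm qdevice setup 192.0.2.60
cr0x@server:~$ pvecm status | grep -A2 Votequorum
Votequorum information
----------------------
Expected votes:   3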

5) Symptom: ZFS pool is “ONLINE” but performance is terrible

Root cause: Pool too full, heavy snapshot churn, slow disks, or a workload mismatch (databases on HDD RAIDZ with high sync write load).

Fix: Keep pool under ~80%, review snapshot retention, use mirrors for IOPS-heavy workloads, add SLOG only if you understand sync writes, and measure latency with iostat.

6) Symptom: VMs suddenly become read-only or crash; host logs show thin pool errors

Root cause: LVM-thin pool hit 100% data or metadata usage.

Fix: Extend thin pool immediately, free space, and prevent recurrence with monitoring and conservative provisioning.

Checklists / step-by-step plan

Day 0 (first hour after install)

  1. Repos: pick enterprise vs no-subscription; run apt-get update and simulate upgrades.
  2. Named admin: create sre-admin@pam, grant admin role, stop using root@pam daily.
  3. SSH hardening: add a drop-in to disable root + passwords; reload SSH and test logins.
  4. Networking reality check: inspect /etc/network/interfaces. If management shares the guest bridge, plan a fix now, not “later.”
  5. Firewall staging: confirm localnet, enable firewall carefully, ensure you can still reach 8006 and 22 from management.
  6. Time sync: confirm timedatectl shows synchronized.

Day 1 (before you put anything important on it)

  1. Storage decision: pick ZFS mirror or LVM-thin intentionally. Avoid single-disk “production.”
  2. Pool health: check zpool status and zpool list; set headroom expectations.
  3. Backup target: add PBS or external storage. Do not accept “backups on the node” as a final state.
  4. First backup + restore test: run vzdump and qmrestore to prove recoverability.

Week 1 (make it sustainable)

  1. Monitoring signals: decide what alerts you need: disk fullness, ZFS errors, SMART warnings, backup job failures.
  2. Patch cadence: pick a window; document reboot expectations for kernel updates.
  3. Access policy: who gets console access, who gets admin, who gets VM-level permissions; remove shared credentials.
  4. Runbook: write down restore steps and where backups live. When you’re stressed, you won’t remember.

FAQ

1) Should I use ZFS or LVM-thin for VM storage?

If you want data integrity features (checksums, snapshots that behave predictably, easy replication tooling), pick ZFS. If you want simplicity and you understand thin-pool monitoring, LVM-thin is fine. Beginners typically do better with ZFS mirrors than with a thin pool they forget to monitor.

2) Can I store backups on the same Proxmox host?

You can, but you shouldn’t call that “backup” for anything you can’t lose. Host failure, ransomware on the host, or a storage pool corruption event takes your VMs and their backups together. Use a separate system.

3) Is it okay to run a single-node Proxmox in production?

Yes, if you accept the risk and build good backups. Many small businesses do. But be honest: a single node means downtime during updates and zero hardware redundancy unless your storage is mirrored and you have spare parts.

4) Why does everyone say “keep ZFS under 80% full”?

ZFS performance and allocator behavior degrade as free space shrinks, especially with snapshots and fragmented workloads like VM images. You can run it fuller, but you’re trading steady latency for capacity bragging rights.

5) Do I need Ceph?

No, not for “I have two nodes and vibes.” Ceph shines when you have enough nodes, enough network, and enough operational maturity to run distributed storage. If you just want shared storage for a small cluster, evaluate simpler options first.

6) What’s the easiest safe way to expose the Proxmox web UI?

Don’t expose it directly to the internet. Put it on a management network, access via VPN or a bastion, and enforce MFA where you terminate remote access. If you must publish it, treat it like any other admin plane and lock it down aggressively.

7) Why did my VM backup take forever even though my disks are fast?

Backups are a pipeline: VM disk reads, compression, and writing to the destination. CPU can be the bottleneck (compression), network can be the bottleneck (PBS/NFS), or storage can be saturated by concurrent jobs. Use iostat, ss/ip -s link, and CPU metrics during the backup window.

8) How do I know if thin provisioning is safe?

It’s safe when you monitor it and when you can grow it before it fills. If you don’t have alerting and you routinely run pools above 85–90%, it’s not thin provisioning—it’s gambling with extra steps.

9) Do I need to change default ports (SSH/8006)?

Port changes are not security; they’re a mild reduction in background noise. Real controls are network isolation, firewall rules, strong authentication, and patching. If changing ports helps your threat model, fine—just don’t confuse it with protection.

Practical next steps

Proxmox is friendly enough to get you running fast, and honest enough to expose every bad habit you bring to it. Fix the basics now: repositories, accounts, SSH, firewall, network separation, time sync, storage layout, ZFS headroom, real backups, and real signals.

Then do the thing that separates “I run a lab” from “I run production”: schedule one restore test on your calendar and treat a failed backup as a real incident. Your VMs are workloads. Your Proxmox node is the platform. Keep the platform boring, patched, and predictable.
