Fix Proxmox pvestatd.service Failed: Restore Graphs and Statistics Fast

Your Proxmox UI looks alive—VMs run, storage mounts, backups happen. But the graphs are dead. No CPU history, no IO lines, no comforting “this was fine yesterday” evidence. You check the node summary and it’s blank or stuck. Then you see it: pvestatd.service failed.

This is one of those failures that feels cosmetic until you need it. The monitoring pipeline is your alibi during incidents. When it goes missing, the next outage is going to be loud and personal. Let’s fix it properly, not by ritual restarts and wishful thinking.

What pvestatd actually does (and what it doesn’t)

pvestatd is Proxmox’s node statistics daemon. Its job is to periodically collect metrics (node, VMs/CTs, storage, some cluster-facing details) and store them as time-series data so the GUI can draw graphs and show trends. It’s not the GUI itself, and it’s not a “monitoring system” in the Prometheus sense. It’s the plumbing that makes the built-in graphs work.

On most installations, graphs are backed by RRD (Round Robin Database) files under /var/lib/rrdcached/db and updates are mediated by rrdcached to avoid constant disk I/O. When pvestatd can’t write updates—or can’t talk to rrdcached, or can’t read system stats cleanly—it fails or becomes stuck. The GUI then shows stale or empty graphs, often without a helpful explanation.

Here’s the practical takeaway: when graphs disappear, don’t stare at the browser. You debug pvestatd, rrdcached, storage health, permissions, and time sync. In that order.
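
If you want that order as a copy-paste triage pass, here’s a minimal read-only sketch (safe on a live node; adjust the RRD path if yours lives elsewhere):

# 1. Service state and the first real error
systemctl status pvestatd --no-pager
journalctl -u pvestatd -b --no-pager -n 50

# 2. rrdcached state and socket
systemctl status rrdcached --no-pager
ss -xlpn | grep rrdcached

# 3. Space, inodes, and ownership where RRD data lives
df -h /var/lib/rrdcached
df -i /var/lib/rrdcached
ls -ld /var/lib/rrdcached /var/lib/rrdcached/db

# 4. Time sync
timedatectl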

One dry truth: if you run without reliable time-series data, you’re operating production like a pilot who thinks altimeters are “optional accessories.”

Fast diagnosis playbook (first/second/third)

First: confirm the failure mode (is it actually pvestatd?)

  • Check service state and recent logs. You’re looking for a clear error: permission denied, disk full, socket missing, corrupt RRD, time jump.
  • Check if rrdcached is running and its socket exists.
  • Check free space and inode pressure on the filesystem holding RRD data.

Second: isolate whether it’s write-path or read-path

  • Write-path: can we update RRDs? Socket? Permissions? Disk? Corruption?
  • Read-path: does pvestatd crash while collecting stats from storage backends (ZFS, Ceph, NFS, iSCSI)? Does a single broken storage stall everything?

Third: confirm cluster implications

  • In a cluster, see whether one node is sick or all nodes are sick.
  • Check time sync and cluster communication. Weird time jumps can make RRD graphs look “flatlined” or missing even when updates happen.
  • If using shared storage for RRD (unusual, but people do it), check mount health and latency.

Don’t start with reinstalls. Don’t start with random deletions. And please don’t “fix” it by disabling stats collection. That’s like curing a fever by breaking the thermometer.

Interesting facts and context (why this breaks the way it does)

  1. RRD was designed in the late 1990s for compact time-series storage with fixed size, trading detail for predictability. Great for embedded graphs; not great for forensic depth.
  2. rrdcached exists to reduce write amplification: without it, frequent RRD updates can cause lots of small synchronous writes and needless SSD wear.
  3. Proxmox historically leaned on RRD for built-in graphs because it’s simple, local, and doesn’t require a monitoring stack. That simplicity is why it’s everywhere—and why failures feel surprising.
  4. RRD files are binary and picky. A partial write, disk full, or filesystem error can make them unreadable, and tooling often reports “invalid header” rather than “your disk lied to you.”
  5. Time jumps can break the illusion of continuity: RRD consolidates data into buckets; if system time leaps backward/forward, graphs may show gaps or flatlines without an obvious error.
  6. Proxmox node services are systemd units. A single missing directory, wrong ownership, or failed dependency can push pvestatd into restart loops that look like “it’s running” at a glance.
  7. Graph rendering is not done by pvestatd. The daemon only updates data; the GUI consumes it. People often chase the wrong component first (usually the web UI).
  8. Storage backends can poison the stats pipeline: a stalled NFS mount, degraded Ceph, or slow ZFS command can block collection, causing pvestatd timeouts and failures.

Practical tasks: commands, outputs, and decisions (12+)

These are real tasks I’d run on a node with missing graphs and a failing pvestatd. Each includes what the output means and the decision you make from it. Run them as root or with appropriate privileges.

Task 1: Confirm service state and last failure reason

cr0x@server:~$ systemctl status pvestatd --no-pager
● pvestatd.service - PVE status daemon
     Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled)
     Active: failed (Result: exit-code) since Tue 2025-12-23 09:11:31 UTC; 2min 10s ago
    Process: 1442 ExecStart=/usr/bin/pvestatd (code=exited, status=1/FAILURE)
Dec 23 09:11:31 server pvestatd[1442]: unable to write rrd: Permission denied
Dec 23 09:11:31 server systemd[1]: pvestatd.service: Failed with result 'exit-code'.

Meaning: You have a concrete error. Don’t guess. “Permission denied” strongly points to ownership/mode on RRD directories or rrdcached socket.

Decision: Jump to the permission and ownership checks (Tasks 10-12 and the “Storage pressure, permissions” section below) and the rrdcached socket checks.

Task 2: Read the journal for context (often shows the first real error)

cr0x@server:~$ journalctl -u pvestatd -b --no-pager -n 200
Dec 23 09:11:30 server pvestatd[1442]: starting server
Dec 23 09:11:31 server pvestatd[1442]: rrdcached: connect to /var/run/rrdcached.sock: No such file or directory
Dec 23 09:11:31 server pvestatd[1442]: can't initialize rrd: No such file or directory

Meaning: Not a permissions issue; the socket is missing. That’s either rrdcached not running, or it’s configured with a different socket path, or the directory is missing.

Decision: Check rrdcached service and socket path.

Task 3: Check rrdcached health and socket existence

cr0x@server:~$ systemctl status rrdcached --no-pager
● rrdcached.service - data caching daemon for rrdtool
     Loaded: loaded (/lib/systemd/system/rrdcached.service; enabled)
     Active: active (running) since Tue 2025-12-23 08:55:02 UTC; 18min ago
   Main PID: 612 (rrdcached)
     Tasks: 1
     Memory: 2.3M
Dec 23 08:55:02 server rrdcached[612]: starting up

cr0x@server:~$ ls -l /var/run/rrdcached.sock
ls: cannot access '/var/run/rrdcached.sock': No such file or directory

Meaning: Service is running but socket isn’t at that path. The daemon might be using a different socket location (common across distro packaging differences) or failing to create it.

Decision: Find actual socket path and align Proxmox expectations or service config.

Task 4: Locate the real rrdcached socket and verify it’s a socket

cr0x@server:~$ ss -xlpn | grep rrdcached
u_str LISTEN 0      4096   /run/rrdcached/rrdcached.sock  12345              * 0    users:(("rrdcached",pid=612,fd=4))

Meaning: The socket is actually /run/rrdcached/rrdcached.sock. If pvestatd is hardcoded or configured to use /var/run/rrdcached.sock, you’ve got a mismatch.

Decision: Confirm how Proxmox is configured to connect; fix by adjusting config or symlinking carefully.

Task 5: Check for the expected symlink (common fix)

cr0x@server:~$ ls -ld /var/run
lrwxrwxrwx 1 root root 4 Dec 23 08:40 /var/run -> /run

cr0x@server:~$ ls -l /run/rrdcached/rrdcached.sock
srwxrwxrwx 1 root root 0 Dec 23 08:55 /run/rrdcached/rrdcached.sock

Meaning: /var/run maps to /run. But the expected socket name differs. Sometimes Proxmox is looking for /run/rrdcached.sock while rrdcached uses a subdir.

Decision: Either reconfigure rrdcached to create the socket where Proxmox expects, or create a symlink (prefer config, accept symlink if you know what you’re doing).

Task 6: Inspect rrdcached defaults (socket and base directory)

cr0x@server:~$ grep -E 'SOCK|BASE|OPTS' /etc/default/rrdcached
SOCKGROUP=www-data
BASE_PATH=/var/lib/rrdcached
DAEMON_OPTS="-l unix:/run/rrdcached/rrdcached.sock -g -w 1800 -z 1800 -f 3600 -j /var/lib/rrdcached/journal/"

Meaning: Socket is explicitly configured. Proxmox needs to talk to that socket. If it can’t, it fails.

Decision: Make Proxmox use the configured socket, or configure rrdcached to provide the expected one.
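
If you take the “put the socket where Proxmox expects it” route, the change is one line. A minimal sketch, assuming pvestatd is looking for /var/run/rrdcached.sock (as the Task 2 journal error suggested) and that your defaults file uses the DAEMON_OPTS variable shown above:

# /etc/default/rrdcached: listen where pvestatd expects
# old: DAEMON_OPTS="-l unix:/run/rrdcached/rrdcached.sock -g -w 1800 -z 1800 -f 3600 -j /var/lib/rrdcached/journal/"
DAEMON_OPTS="-l unix:/var/run/rrdcached.sock -g -w 1800 -z 1800 -f 3600 -j /var/lib/rrdcached/journal/"

# Apply in order: daemon first, then the stats collector
systemctl restart rrdcached
ss -xlpn | grep rrdcached     # should now show /run/rrdcached.sock (/var/run is a symlink to /run)
systemctl restart pvestatd

The symlink alternative (ln -s from the real socket to the expected path) also works, but /run is tmpfs, so a hand-made symlink disappears on reboot. Prefer the config change; it survives.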

Task 7: Check disk space and inodes where RRD lives (graphs die on “full”)

cr0x@server:~$ df -h /var/lib/rrdcached /var
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        30G   30G     0 100% /

cr0x@server:~$ df -i /var/lib/rrdcached
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/sda2      196608 196608     0  100% /

Meaning: You’re out of both space and inodes. This is not “monitoring broke,” this is “system is suffocating.” RRD updates fail; daemons crash; other services soon follow.

Decision: Free space/inodes immediately, then repair RRD if needed. Do not keep restarting services on a full filesystem; you’ll just generate more logs and pain.

Task 8: Identify what ate your disk (quickly, without drama)

cr0x@server:~$ du -xhd1 /var | sort -h
120M    /var/cache
380M    /var/lib
21G     /var/log
22G     /var

Meaning: Disk usage is concentrated in /var/log, which matches the runaway logs you’re about to find. Now drill down into the individual log files.

Decision: Rotate/truncate runaway logs, prune caches, fix the underlying log storm cause. Then restart stats services.

Task 9: Look for log storms that also cause inode exhaustion

cr0x@server:~$ ls -lh /var/log | tail -n 10
-rw-r----- 1 root adm  12G Dec 23 09:10 syslog
-rw-r----- 1 root adm 8.1G Dec 23 09:09 kern.log
drwxr-x--- 2 www-data www-data 4.0K Dec 23 08:50 pveproxy

Meaning: Syslog and kernel logs are huge. Something is spamming. Fixing pvestatd without fixing the spam is like bailing a boat while leaving the hole open.

Decision: Stop the spam source; rotate or truncate logs safely; reclaim space. Then confirm pvestatd and rrdcached can write.
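
“Truncate logs safely” deserves one concrete sketch. Truncating in place keeps the file handle valid for whatever is still writing, which is why it beats deleting the file (illustrative targets; confirm what is actually spamming first):

# Reclaim journald space first: bounded and safe
journalctl --disk-usage
journalctl --vacuum-size=200M

# Truncate runaway plain-text logs in place so rsyslog keeps a valid handle
truncate -s 0 /var/log/syslog
truncate -s 0 /var/log/kern.log

# Force a rotation so limits apply going forward
logrotate -f /etc/logrotate.conf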

Task 10: Verify ownership and permissions of RRD directories

cr0x@server:~$ ls -ld /var/lib/rrdcached /var/lib/rrdcached/db
drwxr-xr-x 5 root root 4096 Dec 23 08:55 /var/lib/rrdcached
drwxr-xr-x 6 root root 4096 Dec 23 08:55 /var/lib/rrdcached/db

Meaning: Owned by root, which is often wrong for rrdcached’s update model. Depending on your setup, rrdcached expects ownership by rrdcached user/group or specific socket group permissions.

Decision: Check rrdcached runtime user and correct ownership. Avoid recursive chmod 777 “fixes.” You’ll regret it later.

Task 11: Confirm rrdcached user and effective permissions

cr0x@server:~$ ps -o user,group,pid,cmd -p $(pidof rrdcached)
USER     GROUP    PID CMD
rrdcached rrdcached 612 rrdcached -l unix:/run/rrdcached/rrdcached.sock -g -w 1800 -z 1800 -f 3600 -j /var/lib/rrdcached/journal/

Meaning: The daemon runs as rrdcached. It needs write access to its journal and DB directories.

Decision: Set correct ownership to rrdcached:rrdcached for /var/lib/rrdcached (if that’s your config), and ensure group permissions match the socket access group if applicable.

Task 12: Fix directory ownership safely (only after confirming config)

cr0x@server:~$ chown -R rrdcached:rrdcached /var/lib/rrdcached
cr0x@server:~$ systemctl restart rrdcached
cr0x@server:~$ systemctl restart pvestatd
cr0x@server:~$ systemctl is-active pvestatd
active

Meaning: Services came back. That’s necessary, not sufficient.

Decision: Confirm graphs are updating by checking RRD file timestamps and pvestatd logs.

Task 13: Validate RRD updates (timestamps should move)

cr0x@server:~$ find /var/lib/rrdcached/db -type f -name '*.rrd' -printf '%TY-%Tm-%Td %TH:%TM %p\n' | tail -n 5
2025-12-23 09:14 /var/lib/rrdcached/db/pve2-node/server/cpu.rrd
2025-12-23 09:14 /var/lib/rrdcached/db/pve2-node/server/mem.rrd

Meaning: RRD files are being touched. That’s a strong sign updates resumed.

Decision: If timestamps don’t move, return to socket/write-path debugging.

Task 14: Catch corruption (RRD “invalid header” is your clue)

cr0x@server:~$ rrdtool info /var/lib/rrdcached/db/pve2-node/server/cpu.rrd | head
ERROR: /var/lib/rrdcached/db/pve2-node/server/cpu.rrd: invalid header (bad magic)

Meaning: That file is corrupt. It won’t graph. pvestatd might choke when attempting to update it (depending on handling).

Decision: Replace the corrupt RRD with a fresh one (data loss for that series) or restore from backups if you have them.

Task 15: Identify how widespread corruption is

cr0x@server:~$ find /var/lib/rrdcached/db -type f -name '*.rrd' -print0 | xargs -0 -n1 sh -c 'rrdtool info "$0" >/dev/null 2>&1 || echo "BAD $0"' | head
BAD /var/lib/rrdcached/db/pve2-node/server/cpu.rrd
BAD /var/lib/rrdcached/db/pve2-node/server/loadavg.rrd

Meaning: You can quantify the blast radius. If it’s just a couple of files, replace those. If it’s hundreds, consider resetting the whole RRD tree after addressing the underlying disk issue.

Decision: Targeted replacement for small corruption; full reset for widespread rot (after ensuring the filesystem is healthy).

Task 16: Check time sync (graphs can “stop” because time is wrong)

cr0x@server:~$ timedatectl
               Local time: Tue 2025-12-23 09:16:02 UTC
           Universal time: Tue 2025-12-23 09:16:02 UTC
                 RTC time: Tue 2025-12-23 06:12:44
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: no
              NTP service: active
          RTC in local TZ: no

Meaning: NTP is “active” but the clock isn’t synchronized and RTC is drifting. If time jumps, RRD behavior becomes confusing and graphs may show gaps.

Decision: Fix time sync before assuming stats are broken. Time is a dependency, not a suggestion.
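
A minimal remediation sketch, assuming the node runs chrony (the default on recent Proxmox releases; adapt if you use systemd-timesyncd):

# Confirm the time service is actually running and has sources
systemctl status chrony --no-pager
chronyc sources -v

# Force an immediate correction if the offset is large
chronyc makestep

# Re-check: you want "System clock synchronized: yes"
timedatectl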

Task 17: Spot storage backend timeouts that stall stats collection

cr0x@server:~$ journalctl -u pvestatd -b --no-pager | tail -n 20
Dec 23 09:05:11 server pvestatd[1320]: storage 'nfs-archive' status error: timeout waiting for response
Dec 23 09:05:11 server pvestatd[1320]: unable to update storage statistics

Meaning: A single problematic storage can break or delay the stats loop. This is a classic: your graphs die because an NFS mount is hung, not because RRD is broken.

Decision: Fix or temporarily disable the broken storage definition; confirm pvestatd stabilizes.

Task 18: Verify Proxmox storage status from the CLI (no UI guesswork)

cr0x@server:~$ pvesm status
Name         Type     Status           Total            Used       Available        %
local         dir     active        29454016        10240000        17600000   34.77%
nfs-archive   nfs     inactive              0               0               0    0.00%

Meaning: Storage is inactive; pvestatd complained about it. Now you have a clean reason.

Decision: If it’s required, fix the mount/network. If it’s retired, remove it from config. Zombie storages are stats-killers.
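
If nfs-archive is genuinely retired, removing or disabling it is a one-liner. A sketch using the name from the output above (double-check nothing still references it before removing):

# Softer option: keep the definition but tell Proxmox to stop touching it
pvesm set nfs-archive --disable 1

# Harder option: remove the definition entirely (config only; data on the NFS server is untouched)
pvesm remove nfs-archive

# Either way, watch pvestatd stop timing out
journalctl -u pvestatd -f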

RRD and rrdcached: repair, reset, and when to accept data loss

Understand the moving parts

The pipeline usually looks like this:

  • pvestatd collects values periodically.
  • It submits updates to rrdcached via a UNIX socket.
  • rrdcached caches updates and flushes them to RRD files.
  • The Proxmox web UI reads RRD data and renders graphs.
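
If you want to poke the middle of that pipeline directly, rrdcached speaks a simple text protocol on its socket. A sketch using socat (install it if missing; the socket path is the one found in Task 4):

# Ask rrdcached for its counters: queue length, updates received, data sets written
printf 'STATS\nQUIT\n' | socat - UNIX-CONNECT:/run/rrdcached/rrdcached.sock

# A growing QueueLength with flat UpdatesWritten means the daemon accepts data
# but can't flush it: look at disk space, permissions, or the journal directory next.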

Failures cluster around:

  • Socket problems: wrong path, permissions, daemon down, stale socket file.
  • Directory/file permissions: rrdcached can’t write its journal/db, or pvestatd can’t connect.
  • Disk full / inode full: writes fail; journals can’t flush; daemons crash.
  • RRD corruption: invalid headers, truncated files, random read errors due to underlying storage issues.
  • Time discontinuities: updates appear to “not show” because the system time is wrong.

Decide: salvage vs reset

RRD is not a transactional database. If you have corruption, your options are limited:

  • Small number of corrupt files: delete/replace those files and let Proxmox recreate them.
  • Large-scale corruption: reset the entire RRD tree after fixing the underlying disk/FS issue. Accept losing history; regain correctness.
  • You have backups/snapshots of /var/lib/rrdcached: restore the directory (best case).

Here’s the opinionated part: if the host had a disk-full incident or filesystem errors, and you see multiple corrupt RRD files, stop trying to preserve them. You don’t want “historical” graphs built on a pile of silent corruption.

Targeted repair: replace corrupt series only

After stopping services, you can move corrupt RRDs aside and allow regeneration. Proxmox will recreate missing RRDs as data comes in.

cr0x@server:~$ systemctl stop pvestatd
cr0x@server:~$ systemctl stop rrdcached
cr0x@server:~$ mkdir -p /root/rrd-quarantine
cr0x@server:~$ mv /var/lib/rrdcached/db/pve2-node/server/cpu.rrd /root/rrd-quarantine/
cr0x@server:~$ systemctl start rrdcached
cr0x@server:~$ systemctl start pvestatd
cr0x@server:~$ systemctl is-active pvestatd
active

What this means: You’ve sacrificed one time-series file to get updates flowing. The new file will be created; your historical CPU graph resets. That’s fine. You wanted monitoring, not nostalgia.

Full reset: when everything is rotten

If most RRDs are corrupt or the DB tree is a mess, reset it. But only after you’ve fixed disk space, filesystem integrity, and time sync—otherwise you’ll just corrupt the fresh ones.

cr0x@server:~$ systemctl stop pvestatd
cr0x@server:~$ systemctl stop rrdcached
cr0x@server:~$ mv /var/lib/rrdcached/db /var/lib/rrdcached/db.broken.$(date +%s)
cr0x@server:~$ mkdir -p /var/lib/rrdcached/db
cr0x@server:~$ chown -R rrdcached:rrdcached /var/lib/rrdcached
cr0x@server:~$ systemctl start rrdcached
cr0x@server:~$ systemctl start pvestatd

What this means: You’re starting fresh. Within minutes, the GUI graphs should begin to populate. If they don’t, the problem was never “old RRD files”—it’s the pipeline or dependencies.

One-liner sanity check: can we talk to rrdcached?

A quick check is to see if the socket exists and is writable for the service user group. You can’t always “test write” easily without knowing RRD names, but you can confirm the socket and permissions.

cr0x@server:~$ stat /run/rrdcached/rrdcached.sock
  File: /run/rrdcached/rrdcached.sock
  Size: 0         	Blocks: 0          IO Block: 4096   socket
Device: 0,21	Inode: 12345       Links: 1
Access: (0777/srwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2025-12-23 08:55:02.000000000 +0000
Modify: 2025-12-23 08:55:02.000000000 +0000
Change: 2025-12-23 08:55:02.000000000 +0000

What this means: Socket exists and is broadly writable (maybe too broad). If pvestatd still complains about connecting, it may be using a different path or chroot-like environment (rare), or the socket is recreated on reboot in a different location.
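
If you do know an RRD path (and the pve2-node tree used throughout this article gives you one), you can go one step further than stat and ask rrdcached to flush that file through the daemon. A minimal sketch; a clean exit means the socket, the daemon, and the write path to that file all work:

# Ask the daemon to flush one known RRD file; an error here names the failing layer
rrdtool flushcached --daemon unix:/run/rrdcached/rrdcached.sock \
    /var/lib/rrdcached/db/pve2-node/server/cpu.rrd && echo "flush OK"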

Joke #1: When the graphs go missing, it’s not “just UI.” It’s the system politely telling you it’s out of patience.

Cluster and multi-node gotchas

In a standalone node, pvestatd problems are usually local: disk, permissions, socket, corruption. In a cluster, you get two extra classes of pain:

  • Split symptoms: one node missing graphs, another fine, a third intermittently updating.
  • Cross-dependencies: broken storage definitions or time drift on one node can create confusing UI behavior when you jump between nodes.

Cluster rule: confirm whether the issue is per-node

Run service checks on each node. Don’t assume because the GUI is blank everywhere that it’s “the GUI.” It may just be your browser caching, or it may be that all nodes share the same root cause (for example, a template hardening change applied everywhere).

cr0x@server:~$ for n in pve1 pve2 pve3; do echo "== $n =="; ssh $n systemctl is-active pvestatd; done
== pve1 ==
active
== pve2 ==
failed
== pve3 ==
active

Meaning: This is node-specific. Stop looking for a cluster-wide magic fix. Compare pve2’s filesystem usage, rrdcached config, and storage definitions to the other nodes.

Time drift in a cluster is a special kind of stupid

RRD doesn’t like time going backwards. Cluster services don’t like it either. If one node drifts, it can show graphs with gaps, and you’ll misdiagnose “stats broken” when it’s really “time broken.”

cr0x@server:~$ chronyc tracking
Reference ID    : 0A0B0C0D (ntp1)
Stratum         : 3
Last offset     : +0.000842 seconds
RMS offset      : 0.001210 seconds
System time     : 0.000311 seconds fast of NTP time
Leap status     : Normal

Meaning: Time is stable. If you see large offsets or “Not synchronised,” fix that first.

Broken storage definitions affect stats collection

Proxmox’s stats collection often touches storage backends. A dead NFS server, a stuck iSCSI path, or a Ceph command that hangs can cause pvestatd to block or fail.

Your best move: get pvesm status clean. If storage is dead and not required, remove it. If it’s required, fix the underlying mount/network issue. “Inactive forever” is not a harmless state; it’s an invitation for daemons to time out.

Storage pressure, permissions, and “why graphs die first”

Why do graphs disappear early in a host’s decline? Because they are write-heavy and not mission-critical to keep VMs running. The kernel will happily keep your workloads alive while background services quietly fail to update history.

Disk-full incidents: the chain reaction

When root fills:

  • rrdcached can’t flush journals or update RRD files.
  • pvestatd fails to write and exits or retries aggressively.
  • Logging can get louder, which uses more disk, which makes it worse.
  • Eventually other components fail (package updates, authentication caches, cluster messaging).

If you only remember one thing: solve space/inodes before anything else. Every other fix is downstream of the ability to write a few kilobytes.

Permissions: the quiet sabotage

Permissions problems often show up after:

  • Manual “cleanup” by someone who ran chown on /var/lib or restored from backup with wrong ownership.
  • Hardening scripts that tighten /run or change default umasks.
  • Moving directories to different filesystems and losing ACLs or extended attributes.

A good debugging approach is to treat /var/lib/rrdcached like an application data directory: it needs stable ownership, stable permissions, and stable free space. If you keep it on the root filesystem with everything else, you’re betting your monitoring history on your log rotation being perfect. That’s a bold bet.

Filesystem health: if you saw corruption, don’t ignore the disk

RRD corruption is often a symptom, not the disease. If you had a power event, a drive hiccup, or a VM host crash, and now RRDs are corrupt, you should at least check kernel logs for I/O errors and verify SMART status. Otherwise you’ll “fix” graphs and then lose the node.

cr0x@server:~$ dmesg -T | egrep -i 'ext4|xfs|btrfs|I/O error|nvme|ata' | tail -n 15
[Mon Dec 23 08:41:12 2025] EXT4-fs error (device sda2): ext4_journal_check_start:83: Detected aborted journal
[Mon Dec 23 08:41:12 2025] Buffer I/O error on device sda2, logical block 1234567

Meaning: Your filesystem had a bad day. Expect RRD corruption. Do not just delete RRD files and move on; schedule a maintenance window to repair filesystem and validate hardware.
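
The SMART half of that check is quick. A sketch, assuming smartmontools is installed and the suspect disk is /dev/sda as in the dmesg output (use the equivalent device names on NVMe hosts):

# Overall verdict plus the attributes that actually predict trouble
smartctl -H /dev/sda
smartctl -A /dev/sda | grep -Ei 'realloc|pending|uncorrect|crc'

# Kick off a short self-test, then read the result a few minutes later
smartctl -t short /dev/sda
smartctl -l selftest /dev/sda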

Here’s the one quote, a paraphrased idea attributed to Gene Kranz: “Hope is not a strategy; build systems that assume failure.”

Three corporate mini-stories from the trenches

1) The incident caused by a wrong assumption: “It’s just the web UI”

The setting: a mid-sized company with a Proxmox cluster powering internal services—CI runners, a few databases, and a lot of “temporary” VMs that became permanent. One morning, graphs disappeared on the busiest node. The on-call engineer assumed it was a browser issue or a UI regression after updates. They restarted pveproxy and pvedaemon because those are the services everyone knows.

The graphs stayed blank. So they escalated to “Proxmox is broken,” opened a vendor ticket, and planned a rolling reboot of the cluster nodes to “clear the state.” Meanwhile, the real problem was the root filesystem at 100% because a single VM’s backup job had been logging authentication failures every few seconds, and log rotation wasn’t catching up.

When the first node rebooted, it came back slower than usual. Services that needed to write to disk during boot stumbled. A few VMs failed to start automatically. Suddenly the scope expanded: now it wasn’t graphs; it was availability.

Once someone finally ran df -h, the fix was obvious: stop the log spam, truncate the logs carefully, reclaim space, restart rrdcached and pvestatd. Graphs returned. The reboot was the only truly dangerous step taken, and it was done because someone assumed “graphs are a UI thing.”

Wrong assumption lesson: graphs are a symptom of your write-path health. When they vanish, your first instinct should be “can this node still write data reliably?” not “is my browser cached?”

2) The optimization that backfired: “Let’s move rrdcached to faster storage”

A different org, same vibe: performance-minded team, lots of SSDs, a desire to keep root clean. Someone decided to relocate /var/lib/rrdcached to a “fast” ZFS dataset with aggressive compression and a slightly exotic mount setup. It worked in testing. It even looked cleaner in production.

Then came the backfire. The dataset got included in a snapshot/replication routine with frequent snapshots. RRD files are small but numerous and update often. The snapshot overhead wasn’t catastrophic, but it introduced enough latency spikes during flush windows that rrdcached began falling behind. Occasional timeouts appeared. Graphs went stale intermittently—exactly the kind of issue that makes you mistrust monitoring.

The team’s next “optimization” was increasing rrdcached write intervals to reduce flush frequency. That reduced I/O, yes. It also increased the amount of data lost during an unclean shutdown and amplified the “graphs lag behind reality” effect. People stopped checking graphs because they weren’t timely. The monitoring became decorative.

The eventual fix was boring: keep /var/lib/rrdcached on a stable local filesystem with predictable latency, exclude it from aggressive snapshot policies, and accept that RRD data is not precious enough to replicate with the same zeal as VM disks.

Backfired optimization lesson: the monitoring backend should be stable and dull. Low latency helps, but predictability beats cleverness.

3) The boring but correct practice that saved the day: “Separate /var and enforce thresholds”

A third team had been burned by disk-full outages before. They did two unglamorous things: they put /var on a dedicated filesystem with headroom, and they enforced alerting thresholds at 70% and 85% usage with an escalation policy. Not sexy. Not a conference talk. It just works.

One day, a network change caused repeated auth failures for a storage mount, producing a steady log stream. Disk usage climbed. The 70% alert fired and got acknowledged. The on-call didn’t “fix it later.” They investigated immediately because that’s what the runbook said, and because they’d seen what happens after 90%.

The root cause was addressed (credentials and mount retry behavior), and logs were rotated. pvestatd never failed. Graphs never went missing. Nobody had to guess what happened later, because the time-series history remained intact and boring.

Saved-by-boring lesson: separate critical write-heavy areas, alert early, and treat disk fullness as a reliability incident, not a housekeeping chore.

Joke #2: Disk space is like office politics—ignore it long enough and it will eventually schedule a meeting with your CEO.

Common mistakes: symptom → root cause → fix

1) Symptom: GUI shows no graphs; pvestatd “failed”

Root cause: rrdcached socket path mismatch or missing socket.

Fix: Confirm actual socket with ss -xlpn, align configuration in /etc/default/rrdcached, restart rrdcached then pvestatd.

2) Symptom: pvestatd logs “Permission denied” writing RRD

Root cause: Wrong ownership/mode on /var/lib/rrdcached or journal directory; sometimes from restores or manual changes.

Fix: Confirm rrdcached user with ps, apply correct chown -R to /var/lib/rrdcached, restart services.

3) Symptom: Graphs stopped after “everything was fine,” but services show “active”

Root cause: Disk full or inode exhaustion causing silent update failures; or rrdcached behind with an ever-growing journal.

Fix: Check df -h and df -i. Free space/inodes. Then verify RRD timestamps move.

4) Symptom: Errors like “invalid header (bad magic)”

Root cause: Corrupt RRD files, often after disk-full or filesystem errors.

Fix: Quarantine corrupt RRDs or reset the RRD tree; investigate underlying disk/filesystem health.

5) Symptom: pvestatd intermittently fails; logs mention storage timeouts

Root cause: One broken storage backend (hung NFS, iSCSI path issues, slow Ceph) blocks stats collection.

Fix: pvesm status to identify inactive storages. Fix network/mount, or remove dead storage config.

6) Symptom: Graphs show gaps or odd “flatlines” across nodes

Root cause: Time drift or NTP not synchronized; sometimes after VM host suspend/resume or bad RTC.

Fix: Use timedatectl and chronyc tracking. Fix time sync; avoid large manual time jumps.

7) Symptom: Restarting pvestatd “fixes” it briefly, then it fails again

Root cause: Restart loop hiding an upstream persistent issue (disk, permissions, storage timeouts). Restarts are a painkiller, not surgery.

Fix: Read the journal for the first error and resolve the dependency. Stop flapping; stabilize.

Checklists / step-by-step plan

Checklist A: Get graphs back within 15 minutes (safe path)

  1. Check systemctl status pvestatd and journalctl -u pvestatd -b for the first real error.
  2. Check systemctl status rrdcached and confirm the socket with ss -xlpn | grep rrdcached.
  3. Check free space and inodes: df -h /, df -i /.
  4. If disk/inodes are full, free space first. Prefer stopping the spam source, then rotating/truncating logs.
  5. Verify /var/lib/rrdcached ownership matches the rrdcached runtime user.
  6. Restart in order: systemctl restart rrdcached then systemctl restart pvestatd.
  7. Confirm RRD updates: check RRD file timestamps in /var/lib/rrdcached/db.
  8. If corruption appears, quarantine single corrupt files. If many are corrupt, reset RRD tree after ensuring storage health.

Checklist B: Stabilize and prevent recurrence (the grown-up path)

  1. Ensure time sync is stable: enable and verify chrony/systemd-timesyncd; fix RTC drift if it’s chronic.
  2. Ensure /var has sufficient headroom; consider separate filesystem or dataset with monitoring thresholds.
  3. Audit log rotation settings. Confirm you can’t generate multi-GB logs in a few hours without alerts (see the sketch after this checklist).
  4. Review storage definitions; remove dead storages from config. “Inactive” should be exceptional, not normal.
  5. If you saw filesystem errors, schedule maintenance: fsck where appropriate, SMART tests, replace questionable media.
  6. Document the rrdcached socket path and permissions. Avoid “mystery symlinks” nobody understands.
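
For the log rotation item above, a minimal sketch. The journald cap is an edit to /etc/systemd/journald.conf; for rsyslog-managed files, add a size trigger to the existing stanza rather than creating a duplicate entry (values are illustrative):

# /etc/systemd/journald.conf: hard cap on journal disk usage
# SystemMaxUse=500M
systemctl restart systemd-journald

# /etc/logrotate.d/rsyslog: add a size trigger inside the existing stanza
#     maxsize 500M
#     rotate 4
# Then confirm logrotate still parses cleanly (dry run, changes nothing):
logrotate -d /etc/logrotate.conf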

Checklist C: When you must reset RRD data (without making it worse)

  1. Fix disk space and verify filesystem health first.
  2. Stop pvestatd and rrdcached.
  3. Move /var/lib/rrdcached/db aside (don’t delete immediately).
  4. Create a fresh db directory with correct ownership.
  5. Start rrdcached then pvestatd.
  6. Verify new RRD files appear and timestamps advance.

FAQ

1) Do I need to reboot the Proxmox node to fix pvestatd?

No. If a reboot “fixes it,” you likely had a transient mount issue or a service state problem. Diagnose the underlying cause (space, socket, permissions, storage timeouts) so it doesn’t return.

2) Why are my VMs fine but graphs are missing?

Because the monitoring pipeline is not required for virtualization to function. It’s a canary for write-path health and service dependencies. Treat it as an early warning, not a cosmetic glitch.

3) I restarted pveproxy and still no graphs. Why?

Because pveproxy renders the UI but doesn’t generate the time-series. If pvestatd can’t update RRD data, the UI has nothing new to render.

4) Can a broken NFS storage really break node graphs?

It can. If stats collection tries to query storage status and that query blocks or times out, the loop may fail or lag. Fix the mount, network, or remove dead storage definitions.

5) Is it safe to delete RRD files?

It’s safe in the sense that you won’t break VMs. You will lose historical graphs for the deleted series. If the files are corrupt, deletion/quarantine is often the fastest route back to working monitoring.

6) Why does time sync matter for graphs?

RRD buckets data by time. Large jumps create gaps or confusing consolidation artifacts. In clusters, time drift also causes broader operational weirdness. Keep NTP healthy.

7) My filesystem isn’t full. What else commonly causes “Permission denied”?

Wrong ownership from restores, moved directories, or hardening scripts changing umasks or runtime directories. Confirm the service user and directory ownership. Don’t fix with world-writable permissions.

8) How do I confirm graphs are updating without using the GUI?

Check modification times on *.rrd files under /var/lib/rrdcached/db. If timestamps advance every few minutes, updates are flowing. You can also confirm pvestatd logs go quiet (in a good way).

9) Why does rrdcached sometimes run but the socket is missing?

If the socket directory doesn’t exist at startup, or permissions prevent socket creation, rrdcached may not expose the expected endpoint. Confirm socket path in config and verify the directory exists and is writable.
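
On systemd hosts, the durable fix for a missing /run subdirectory is a tmpfiles.d entry, because /run is tmpfs and anything you create there by hand vanishes on reboot. A minimal sketch (the file name is illustrative; match the user/group to your rrdcached runtime user):

# /etc/tmpfiles.d/rrdcached.conf -- recreate the socket directory on every boot
# type  path            mode  user       group      age
d       /run/rrdcached  0755  rrdcached  rrdcached  -

# Apply immediately without rebooting
systemd-tmpfiles --create /etc/tmpfiles.d/rrdcached.conf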

10) Should I move RRD data to shared storage?

Generally no. RRD is small and write-heavy; shared storage adds latency and failure modes. Keep it local and boring. If you must centralize metrics, use a proper monitoring stack rather than trying to “cluster” RRD files.

Next steps that keep this from coming back

Restoring graphs is the easy part. Keeping them reliable is the job.

  • Make disk pressure visible early: alert on space and inode usage long before 95% (a minimal check sketch follows this list).
  • Keep rrdcached predictable: stable socket path, correct ownership, avoid clever relocations that add latency.
  • Clean up dead storage definitions: don’t let broken mounts linger as “inactive” for months.
  • Fix time sync permanently: stable NTP, sane RTC, no manual time jumps in production.
  • If you saw corruption, investigate hardware: don’t treat RRD errors as purely a software inconvenience.
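
If nothing better is wired up yet, even a dumb cron check beats discovering disk pressure through missing graphs. A minimal sketch, assuming outbound mail works from the node (swap the mail call for whatever alerting you actually use; the script path is illustrative):

#!/bin/sh
# /usr/local/bin/check-var-pressure.sh -- warn before /var suffocates
THRESHOLD=70
USED=$(df --output=pcent /var/lib/rrdcached | tail -1 | tr -dc '0-9')
INODES=$(df --output=ipcent /var/lib/rrdcached | tail -1 | tr -dc '0-9')
if [ "$USED" -ge "$THRESHOLD" ] || [ "$INODES" -ge "$THRESHOLD" ]; then
    echo "$(hostname): /var at ${USED}% space, ${INODES}% inodes" \
        | mail -s "disk pressure warning" ops@example.com
fi

Run it from cron every 15 minutes. The point is the early warning, not the elegance.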

If you do those, pvestatd becomes what it should be: background noise you never think about. That’s the best kind of monitoring.
