Your Proxmox UI looks alive—VMs run, storage mounts, backups happen. But the graphs are dead. No CPU history, no IO lines, no comforting “this was fine yesterday” evidence. You check the node summary and it’s blank or stuck. Then you see it: pvestatd.service failed.
This is one of those failures that feels cosmetic until you need it. The monitoring pipeline is your alibi during incidents. When it goes missing, the next outage is going to be loud and personal. Let’s fix it properly, not by ritual restarts and wishful thinking.
What pvestatd actually does (and what it doesn’t)
pvestatd is Proxmox’s node statistics daemon. Its job is to periodically collect metrics (node, VMs/CTs, storage, some cluster-facing details), and store time-series data so the GUI can draw graphs and show trends. It’s not the GUI itself and it’s not a “monitoring system” in the Prometheus sense. It’s the plumbing that makes the built-in graphs work.
On most installations, graphs are backed by RRD (Round Robin Database) files under /var/lib/rrdcached/db and updates are mediated by rrdcached to avoid constant disk I/O. When pvestatd can’t write updates—or can’t talk to rrdcached, or can’t read system stats cleanly—it fails or becomes stuck. The GUI then shows stale or empty graphs, often without a helpful explanation.
Here’s the practical takeaway: when graphs disappear, don’t stare at the browser. You debug pvestatd, rrdcached, storage health, permissions, and time sync. In that order.
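A quick way to confirm the write path is alive before touching anything: check whether any RRD files were modified in the last few minutes. The path below matches the layout used throughout this article's examples; adjust it if your RRD data lives elsewhere.

cr0x@server:~$ find /var/lib/rrdcached/db -type f -name '*.rrd' -newermt '5 minutes ago' | wc -l

A result of zero means updates have stopped and the rest of this article applies; a non-zero count means the plumbing is writing and your problem is probably in rendering or in a specific series.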
One dry truth: if you run without reliable time-series data, you’re operating production like a pilot who thinks altimeters are “optional accessories.”
Fast diagnosis playbook (first/second/third)
First: confirm the failure mode (is it actually pvestatd?)
- Check service state and recent logs. You’re looking for a clear error: permission denied, disk full, socket missing, corrupt RRD, time jump.
- Check if rrdcached is running and its socket exists.
- Check free space and inode pressure on the filesystem holding RRD data.
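If you want those three checks in one paste during an incident, something like this works (it assumes RRD data sits on the root filesystem, as in a default install):

cr0x@server:~$ systemctl is-active pvestatd rrdcached; ss -xl | grep -c rrdcached; df -h / | tail -n 1; df -i / | tail -n 1

Two "active" lines, a non-zero socket count, and sane space/inode numbers mean you can move straight on to the write-path questions.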
Second: isolate whether it’s write-path or read-path
- Write-path: can we update RRDs? Socket? Permissions? Disk? Corruption?
- Read-path: does pvestatd crash while collecting stats from storage backends (ZFS, Ceph, NFS, iSCSI)? Does a single broken storage stall everything?
Third: confirm cluster implications
- In a cluster, see whether one node is sick or all nodes are sick.
- Check time sync and cluster communication. Weird time jumps can make RRD graphs look “flatlined” or missing even when updates happen.
- If using shared storage for RRD (unusual, but people do it), check mount health and latency.
Don’t start with reinstalls. Don’t start with random deletions. And please don’t “fix” it by disabling stats collection. That’s like curing a fever by breaking the thermometer.
Interesting facts and context (why this breaks the way it does)
- RRD was designed in the late 1990s for compact time-series storage with fixed size, trading detail for predictability. Great for embedded graphs; not great for forensic depth.
- rrdcached exists to reduce write amplification: without it, frequent RRD updates can cause lots of small synchronous writes and needless SSD wear.
- Proxmox historically leaned on RRD for built-in graphs because it’s simple, local, and doesn’t require a monitoring stack. That simplicity is why it’s everywhere—and why failures feel surprising.
- RRD files are binary and picky. A partial write, disk full, or filesystem error can make them unreadable, and tooling often reports “invalid header” rather than “your disk lied to you.”
- Time jumps can break the illusion of continuity: RRD consolidates data into buckets; if system time leaps backward/forward, graphs may show gaps or flatlines without an obvious error.
- Proxmox node services are systemd units. A single missing directory, wrong ownership, or failed dependency can push pvestatd into restart loops that look like “it’s running” at a glance.
- Graph rendering is not done by pvestatd. The daemon only updates data; the GUI consumes it. People often chase the wrong component first (usually the web UI).
- Storage backends can poison the stats pipeline: a stalled NFS mount, degraded Ceph, or slow ZFS command can block collection, causing pvestatd timeouts and failures.
Practical tasks: commands, outputs, and decisions (12+)
These are real tasks I’d run on a node with missing graphs and a failing pvestatd. Each includes what the output means and the decision you make from it. Run them as root or with appropriate privileges.
Task 1: Confirm service state and last failure reason
cr0x@server:~$ systemctl status pvestatd --no-pager
● pvestatd.service - PVE status daemon
Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled)
Active: failed (Result: exit-code) since Tue 2025-12-23 09:11:31 UTC; 2min 10s ago
Process: 1442 ExecStart=/usr/bin/pvestatd (code=exited, status=1/FAILURE)
Dec 23 09:11:31 server pvestatd[1442]: unable to write rrd: Permission denied
Dec 23 09:11:31 server systemd[1]: pvestatd.service: Failed with result 'exit-code'.
Meaning: You have a concrete error. Don’t guess. “Permission denied” strongly points to ownership/mode on RRD directories or rrdcached socket.
Decision: Jump to permission checks (see the “Storage pressure, permissions” section later in this article) and rrdcached socket checks.
Task 2: Read the journal for context (often shows the first real error)
cr0x@server:~$ journalctl -u pvestatd -b --no-pager -n 200
Dec 23 09:11:30 server pvestatd[1442]: starting server
Dec 23 09:11:31 server pvestatd[1442]: rrdcached: connect to /var/run/rrdcached.sock: No such file or directory
Dec 23 09:11:31 server pvestatd[1442]: can't initialize rrd: No such file or directory
Meaning: Not a permissions issue; the socket is missing. That’s either rrdcached not running, or it’s configured with a different socket path, or the directory is missing.
Decision: Check rrdcached service and socket path.
Task 3: Check rrdcached health and socket existence
cr0x@server:~$ systemctl status rrdcached --no-pager
● rrdcached.service - data caching daemon for rrdtool
Loaded: loaded (/lib/systemd/system/rrdcached.service; enabled)
Active: active (running) since Tue 2025-12-23 08:55:02 UTC; 18min ago
Main PID: 612 (rrdcached)
Tasks: 1
Memory: 2.3M
Dec 23 08:55:02 server rrdcached[612]: starting up
cr0x@server:~$ ls -l /var/run/rrdcached.sock
ls: cannot access '/var/run/rrdcached.sock': No such file or directory
Meaning: Service is running but socket isn’t at that path. The daemon might be using a different socket location (common across distro packaging differences) or failing to create it.
Decision: Find actual socket path and align Proxmox expectations or service config.
Task 4: Locate the real rrdcached socket and verify it’s a socket
cr0x@server:~$ ss -xlpn | grep rrdcached
u_str LISTEN 0 4096 /run/rrdcached/rrdcached.sock 12345 * 0 users:(("rrdcached",pid=612,fd=4))
Meaning: The socket is actually /run/rrdcached/rrdcached.sock. If pvestatd is hardcoded or configured to use /var/run/rrdcached.sock, you’ve got a mismatch.
Decision: Confirm how Proxmox is configured to connect; fix by adjusting config or symlinking carefully.
Task 5: Check for the expected symlink (common fix)
cr0x@server:~$ ls -ld /var/run
lrwxrwxrwx 1 root root 4 Dec 23 08:40 /var/run -> /run
cr0x@server:~$ ls -l /run/rrdcached/rrdcached.sock
srwxrwxrwx 1 root root 0 Dec 23 08:55 /run/rrdcached/rrdcached.sock
Meaning: /var/run maps to /run. But the expected socket name differs. Sometimes Proxmox is looking for /run/rrdcached.sock while rrdcached uses a subdir.
Decision: Either reconfigure rrdcached to create the socket where Proxmox expects, or create a symlink (prefer config, accept symlink if you know what you’re doing).
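If you do take the symlink route, keep it minimal and remember that /run is a tmpfs, so a hand-made symlink vanishes at the next reboot. A sketch, using the paths from this example:

cr0x@server:~$ ln -s /run/rrdcached/rrdcached.sock /var/run/rrdcached.sock
cr0x@server:~$ systemctl restart pvestatd

For anything persistent, change the socket path in /etc/default/rrdcached (or wherever your packaging configures it) and document whichever choice you make.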
Task 6: Inspect rrdcached defaults (socket and base directory)
cr0x@server:~$ grep -E 'SOCK|BASE|OPTS' /etc/default/rrdcached
SOCKGROUP=www-data
BASE_PATH=/var/lib/rrdcached
DAEMON_OPTS="-l unix:/run/rrdcached/rrdcached.sock -g -w 1800 -z 1800 -f 3600 -j /var/lib/rrdcached/journal/"
Meaning: Socket is explicitly configured. Proxmox needs to talk to that socket. If it can’t, it fails.
Decision: Make Proxmox use the configured socket, or configure rrdcached to provide the expected one.
Task 7: Check disk space and inodes where RRD lives (graphs die on “full”)
cr0x@server:~$ df -h /var/lib/rrdcached /var
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 30G 30G 0 100% /
cr0x@server:~$ df -i /var/lib/rrdcached
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda2 196608 196608 0 100% /
Meaning: You’re out of both space and inodes. This is not “monitoring broke,” this is “system is suffocating.” RRD updates fail; daemons crash; other services soon follow.
Decision: Free space/inodes immediately, then repair RRD if needed. Do not keep restarting services on a full filesystem; you’ll just generate more logs and pain.
Task 8: Identify what ate your disk (quickly, without drama)
cr0x@server:~$ du -xhd1 /var | sort -h
120M /var/cache
380M /var/lib
1.2G /var/log
28G /var
Meaning: Disk usage concentrated in /var. Now drill down into /var/log and /var/lib.
Decision: Rotate/truncate runaway logs, prune caches, fix the underlying log storm cause. Then restart stats services.
Task 9: Look for log storms that also cause inode exhaustion
cr0x@server:~$ ls -lh /var/log | tail -n 10
-rw-r----- 1 root adm 12G Dec 23 09:10 syslog
-rw-r----- 1 root adm 8.1G Dec 23 09:09 kern.log
-rw-r----- 1 root adm 55M Dec 23 08:50 pveproxy/access.log
Meaning: Syslog and kernel logs are huge. Something is spamming. Fixing pvestatd without fixing the spam is like bailing a boat while leaving the hole open.
Decision: Stop the spam source; rotate or truncate logs safely; reclaim space. Then confirm pvestatd and rrdcached can write.
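Reclaiming space safely is mostly about not deleting open files: truncation keeps the inode the logger is still writing to, so the blocks are actually freed, while rm on an open log leaves the space held by the writer's file descriptor. A sketch, using the log names from the listing above and only after the spam source is stopped:

cr0x@server:~$ truncate -s 0 /var/log/syslog /var/log/kern.log
cr0x@server:~$ journalctl --vacuum-size=200M
cr0x@server:~$ df -h /

The vacuum size is an example; pick whatever headroom your node needs, re-check df, then restart the stats services.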
Task 10: Verify ownership and permissions of RRD directories
cr0x@server:~$ ls -ld /var/lib/rrdcached /var/lib/rrdcached/db
drwxr-xr-x 5 root root 4096 Dec 23 08:55 /var/lib/rrdcached
drwxr-xr-x 6 root root 4096 Dec 23 08:55 /var/lib/rrdcached/db
Meaning: Owned by root, which is often wrong for rrdcached’s update model. Depending on your setup, rrdcached expects ownership by rrdcached user/group or specific socket group permissions.
Decision: Check rrdcached runtime user and correct ownership. Avoid recursive chmod 777 “fixes.” You’ll regret it later.
Task 11: Confirm rrdcached user and effective permissions
cr0x@server:~$ ps -o user,group,pid,cmd -p $(pidof rrdcached)
USER GROUP PID CMD
rrdcached rrdcached 612 rrdcached -l unix:/run/rrdcached/rrdcached.sock -g -w 1800 -z 1800 -f 3600 -j /var/lib/rrdcached/journal/
Meaning: The daemon runs as rrdcached. It needs write access to its journal and DB directories.
Decision: Set correct ownership to rrdcached:rrdcached for /var/lib/rrdcached (if that’s your config), and ensure group permissions match the socket access group if applicable.
Task 12: Fix directory ownership safely (only after confirming config)
cr0x@server:~$ chown -R rrdcached:rrdcached /var/lib/rrdcached
cr0x@server:~$ systemctl restart rrdcached
cr0x@server:~$ systemctl restart pvestatd
cr0x@server:~$ systemctl is-active pvestatd
active
Meaning: Services came back. That’s necessary, not sufficient.
Decision: Confirm graphs are updating by checking RRD file timestamps and pvestatd logs.
Task 13: Validate RRD updates (timestamps should move)
cr0x@server:~$ find /var/lib/rrdcached/db -type f -name '*.rrd' -printf '%TY-%Tm-%Td %TH:%TM %p\n' | tail -n 5
2025-12-23 09:14 /var/lib/rrdcached/db/pve2-node/server/cpu.rrd
2025-12-23 09:14 /var/lib/rrdcached/db/pve2-node/server/mem.rrd
Meaning: RRD files are being touched. That’s a strong sign updates resumed.
Decision: If timestamps don’t move, return to socket/write-path debugging.
Task 14: Catch corruption (RRD “invalid header” is your clue)
cr0x@server:~$ rrdtool info /var/lib/rrdcached/db/pve2-node/server/cpu.rrd | head
ERROR: /var/lib/rrdcached/db/pve2-node/server/cpu.rrd: invalid header (bad magic)
Meaning: That file is corrupt. It won’t graph. pvestatd might choke when attempting to update it (depending on handling).
Decision: Replace the corrupt RRD with a fresh one (data loss for that series) or restore from backups if you have them.
Task 15: Identify how widespread corruption is
cr0x@server:~$ find /var/lib/rrdcached/db -type f -name '*.rrd' -print0 | xargs -0 -n1 sh -c 'rrdtool info "$0" >/dev/null 2>&1 || echo "BAD $0"' | head
BAD /var/lib/rrdcached/db/pve2-node/server/cpu.rrd
BAD /var/lib/rrdcached/db/pve2-node/server/loadavg.rrd
Meaning: You can quantify the blast radius. If it’s just a couple of files, replace those. If it’s hundreds, consider resetting the whole RRD tree after addressing the underlying disk issue.
Decision: Targeted replacement for small corruption; full reset for widespread rot (after ensuring the filesystem is healthy).
Task 16: Check time sync (graphs can “stop” because time is wrong)
cr0x@server:~$ timedatectl
Local time: Tue 2025-12-23 09:16:02 UTC
Universal time: Tue 2025-12-23 09:16:02 UTC
RTC time: Tue 2025-12-23 06:12:44
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: no
NTP service: active
RTC in local TZ: no
Meaning: NTP is “active” but the clock isn’t synchronized and RTC is drifting. If time jumps, RRD behavior becomes confusing and graphs may show gaps.
Decision: Fix time sync before assuming stats are broken. Time is a dependency, not a suggestion.
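Current Proxmox releases ship chrony, so assuming that is your NTP client, getting back in sync usually looks like this (makestep forces an immediate correction, which is acceptable here precisely because the graphs are already broken):

cr0x@server:~$ systemctl restart chrony
cr0x@server:~$ chronyc makestep
200 OK
cr0x@server:~$ chronyc tracking | grep -E 'Leap|System time'

Re-run timedatectl afterwards and confirm "System clock synchronized: yes" before you judge the graphs again. If your node uses systemd-timesyncd instead, restart that unit rather than chrony.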
Task 17: Spot storage backend timeouts that stall stats collection
cr0x@server:~$ journalctl -u pvestatd -b --no-pager | tail -n 20
Dec 23 09:05:11 server pvestatd[1320]: storage 'nfs-archive' status error: timeout waiting for response
Dec 23 09:05:11 server pvestatd[1320]: unable to update storage statistics
Meaning: A single problematic storage can break or delay the stats loop. This is a classic: your graphs die because an NFS mount is hung, not because RRD is broken.
Decision: Fix or temporarily disable the broken storage definition; confirm pvestatd stabilizes.
Task 18: Verify Proxmox storage status from the CLI (no UI guesswork)
cr0x@server:~$ pvesm status
Name Type Status Total Used Available %
local dir active 29454016 10240000 17600000 34.77%
nfs-archive nfs inactive 0 0 0 0.00%
Meaning: Storage is inactive; pvestatd complained about it. Now you have a clean reason.
Decision: If it’s required, fix the mount/network. If it’s retired, remove it from config. Zombie storages are stats-killers.
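Both moves are one-liners; pick the one that matches reality. Disabling keeps the definition (and the data behind it) but stops Proxmox from polling it; removing deletes only the storage entry from the config, not the data. Using the storage name from this example, and the standard disable option from the storage config:

cr0x@server:~$ pvesm set nfs-archive --disable 1
cr0x@server:~$ pvesm remove nfs-archive

Disable first when you are not sure; remove only when the storage is genuinely retired.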
RRD and rrdcached: repair, reset, and when to accept data loss
Understand the moving parts
The pipeline usually looks like this:
- pvestatd collects values periodically.
- It submits updates to rrdcached via a UNIX socket.
- rrdcached caches updates and flushes them to RRD files.
- The Proxmox web UI reads RRD data and renders graphs.
Failures cluster around:
- Socket problems: wrong path, permissions, daemon down, stale socket file.
- Directory/file permissions: rrdcached can’t write its journal/db, or pvestatd can’t connect.
- Disk full / inode full: writes fail; journals can’t flush; daemons crash.
- RRD corruption: invalid headers, truncated files, random read errors due to underlying storage issues.
- Time discontinuities: updates appear to “not show” because the system time is wrong.
Decide: salvage vs reset
RRD is not a transactional database. If you have corruption, your options are limited:
- Small number of corrupt files: delete/replace those files and let Proxmox recreate them.
- Large-scale corruption: reset the entire RRD tree after fixing the underlying disk/FS issue. Accept losing history; regain correctness.
- You have backups/snapshots of /var/lib/rrdcached: restore the directory (best case).
Here’s the opinionated part: if the host had a disk-full incident or filesystem errors, and you see multiple corrupt RRD files, stop trying to preserve them. You don’t want “historical” graphs built on a pile of silent corruption.
Targeted repair: replace corrupt series only
After stopping services, you can move corrupt RRDs aside and allow regeneration. Proxmox will recreate missing RRDs as data comes in.
cr0x@server:~$ systemctl stop pvestatd
cr0x@server:~$ systemctl stop rrdcached
cr0x@server:~$ mkdir -p /root/rrd-quarantine
cr0x@server:~$ mv /var/lib/rrdcached/db/pve2-node/server/cpu.rrd /root/rrd-quarantine/
cr0x@server:~$ systemctl start rrdcached
cr0x@server:~$ systemctl start pvestatd
cr0x@server:~$ systemctl is-active pvestatd
active
What this means: You’ve sacrificed one time-series file to get updates flowing. The new file will be created; your historical CPU graph resets. That’s fine. You wanted monitoring, not nostalgia.
Full reset: when everything is rotten
If most RRDs are corrupt or the DB tree is a mess, reset it. But only after you’ve fixed disk space, filesystem integrity, and time sync—otherwise you’ll just corrupt the fresh ones.
cr0x@server:~$ systemctl stop pvestatd
cr0x@server:~$ systemctl stop rrdcached
cr0x@server:~$ mv /var/lib/rrdcached/db /var/lib/rrdcached/db.broken.$(date +%s)
cr0x@server:~$ mkdir -p /var/lib/rrdcached/db
cr0x@server:~$ chown -R rrdcached:rrdcached /var/lib/rrdcached
cr0x@server:~$ systemctl start rrdcached
cr0x@server:~$ systemctl start pvestatd
What this means: You’re starting fresh. Within minutes, the GUI graphs should begin to populate. If they don’t, the problem was never “old RRD files”—it’s the pipeline or dependencies.
One-liner sanity check: can we talk to rrdcached?
A quick check is to see if the socket exists and is writable for the service user group. You can’t always “test write” easily without knowing RRD names, but you can confirm the socket and permissions.
cr0x@server:~$ stat /run/rrdcached/rrdcached.sock
File: /run/rrdcached/rrdcached.sock
Size: 0 Blocks: 0 IO Block: 4096 socket
Device: 0,21 Inode: 12345 Links: 1
Access: (0777/srwxrwxrwx) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2025-12-23 08:55:02.000000000 +0000
Modify: 2025-12-23 08:55:02.000000000 +0000
Change: 2025-12-23 08:55:02.000000000 +0000
What this means: Socket exists and is broadly writable (maybe too broad). If pvestatd still complains about connecting, it may be using a different path or chroot-like environment (rare), or the socket is recreated on reboot in a different location.
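One way to exercise the socket end to end is to ask rrdcached to flush a specific file, using any RRD path you know exists (the one below is from the earlier tasks):

cr0x@server:~$ rrdtool flushcached --daemon unix:/run/rrdcached/rrdcached.sock /var/lib/rrdcached/db/pve2-node/server/cpu.rrd
cr0x@server:~$ echo $?
0

An exit code of 0 means the socket accepted the command; an error here points at the daemon or its permissions rather than at pvestatd.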
Joke #1: When the graphs go missing, it’s not “just UI.” It’s the system politely telling you it’s out of patience.
Cluster and multi-node gotchas
In a standalone node, pvestatd problems are usually local: disk, permissions, socket, corruption. In a cluster, you get two extra classes of pain:
- Split symptoms: one node missing graphs, another fine, a third intermittently updating.
- Cross-dependencies: broken storage definitions or time drift on one node can create confusing UI behavior when you jump between nodes.
Cluster rule: confirm whether the issue is per-node
Run service checks on each node. Don’t assume because the GUI is blank everywhere that it’s “the GUI.” It may just be your browser caching, or it may be that all nodes share the same root cause (for example, a template hardening change applied everywhere).
cr0x@server:~$ for n in pve1 pve2 pve3; do echo "== $n =="; ssh $n systemctl is-active pvestatd; done
== pve1 ==
active
== pve2 ==
failed
== pve3 ==
active
Meaning: This is node-specific. Stop looking for a cluster-wide magic fix. Compare pve2’s filesystem usage, rrdcached config, and storage definitions to the other nodes.
Time drift in a cluster is a special kind of stupid
RRD doesn’t like time going backwards. Cluster services don’t like it either. If one node drifts, it can show graphs with gaps, and you’ll misdiagnose “stats broken” when it’s really “time broken.”
cr0x@server:~$ chronyc tracking
Reference ID : 0A0B0C0D (ntp1)
Stratum : 3
Last offset : +0.000842 seconds
RMS offset : 0.001210 seconds
System time : 0.000311 seconds fast of NTP time
Leap status : Normal
Meaning: Time is stable. If you see large offsets or “Not synchronised,” fix that first.
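To compare all nodes at once, reuse the SSH loop from earlier (the node names are placeholders from that example):

cr0x@server:~$ for n in pve1 pve2 pve3; do echo "== $n =="; ssh $n "chronyc tracking | grep -E 'Leap status|Last offset'"; done

Any node reporting "Not synchronised" or a large offset is your next stop.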
Broken storage definitions affect stats collection
Proxmox’s stats collection often touches storage backends. A dead NFS server, a stuck iSCSI path, or a Ceph command that hangs can cause pvestatd to block or fail.
Your best move: get pvesm status clean. If storage is dead and not required, remove it. If it’s required, fix the underlying mount/network issue. “Inactive forever” is not a harmless state; it’s an invitation for daemons to time out.
Storage pressure, permissions, and “why graphs die first”
Why do graphs disappear early in a host’s decline? Because they are write-heavy and not mission-critical to keep VMs running. The kernel will happily keep your workloads alive while background services quietly fail to update history.
Disk-full incidents: the chain reaction
When root fills:
- rrdcached can't flush journals or update RRD files.
- pvestatd fails to write and exits or retries aggressively.
- Logging can get louder, which uses more disk, which makes it worse.
- Eventually other components fail (package updates, authentication caches, cluster messaging).
If you only remember one thing: solve space/inodes before anything else. Every other fix is downstream of the ability to write a few kilobytes.
Permissions: the quiet sabotage
Permissions problems often show up after:
- Manual “cleanup” by someone who ran chown on /var/lib, or a restore from backup with wrong ownership.
- Hardening scripts that tighten /run or change default umasks.
- Moving directories to different filesystems and losing ACLs or extended attributes.
A good debugging approach is to treat /var/lib/rrdcached like an application data directory: it needs stable ownership, stable permissions, and stable free space. If you keep it on the root filesystem with everything else, you’re betting your monitoring history on your log rotation being perfect. That’s a bold bet.
Filesystem health: if you saw corruption, don’t ignore the disk
RRD corruption is often a symptom, not the disease. If you had a power event, a drive hiccup, or a VM host crash, and now RRDs are corrupt, you should at least check kernel logs for I/O errors and verify SMART status. Otherwise you’ll “fix” graphs and then lose the node.
cr0x@server:~$ dmesg -T | egrep -i 'ext4|xfs|btrfs|I/O error|nvme|ata' | tail -n 15
[Mon Dec 23 08:41:12 2025] EXT4-fs error (device sda2): ext4_journal_check_start:83: Detected aborted journal
[Mon Dec 23 08:41:12 2025] Buffer I/O error on device sda2, logical block 1234567
Meaning: Your filesystem had a bad day. Expect RRD corruption. Do not just delete RRD files and move on; schedule a maintenance window to repair filesystem and validate hardware.
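A quick hardware sanity check before you rebuild anything (the device name comes from the dmesg output above; NVMe drives report different attribute names, so adjust the grep for those):

cr0x@server:~$ smartctl -H /dev/sda
cr0x@server:~$ smartctl -A /dev/sda | grep -Ei 'reallocat|pending|uncorrect|crc'

A failing health status or climbing reallocated/pending counts means the RRD corruption was a warning shot, not the disease.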
Here’s the one quote, a paraphrased idea attributed to Gene Kranz: “Hope is not a strategy; build systems that assume failure.”
Three corporate mini-stories from the trenches
1) The incident caused by a wrong assumption: “It’s just the web UI”
The setting: a mid-sized company with a Proxmox cluster powering internal services—CI runners, a few databases, and a lot of “temporary” VMs that became permanent. One morning, graphs disappeared on the busiest node. The on-call engineer assumed it was a browser issue or a UI regression after updates. They restarted pveproxy and pvedaemon because those are the services everyone knows.
The graphs stayed blank. So they escalated to “Proxmox is broken,” opened a vendor ticket, and planned a rolling reboot of the cluster nodes to “clear the state.” Meanwhile, the real problem was the root filesystem at 100% because a single VM’s backup job had been logging authentication failures every few seconds, and log rotation wasn’t catching up.
When the first node rebooted, it came back slower than usual. Services that needed to write to disk during boot stumbled. A few VMs failed to start automatically. Suddenly the scope expanded: now it wasn’t graphs; it was availability.
Once someone finally ran df -h, the fix was obvious: stop the log spam, truncate the logs carefully, reclaim space, restart rrdcached and pvestatd. Graphs returned. The reboot was the only truly dangerous step taken, and it was done because someone assumed “graphs are a UI thing.”
Wrong assumption lesson: graphs are a symptom of your write-path health. When they vanish, your first instinct should be “can this node still write data reliably?” not “is my browser cached?”
2) The optimization that backfired: “Let’s move rrdcached to faster storage”
A different org, same vibe: performance-minded team, lots of SSDs, a desire to keep root clean. Someone decided to relocate /var/lib/rrdcached to a “fast” ZFS dataset with aggressive compression and a slightly exotic mount setup. It worked in testing. It even looked cleaner in production.
Then came the backfire. The dataset got included in a snapshot/replication routine with frequent snapshots. RRD files are small but numerous and update often. The snapshot overhead wasn’t catastrophic, but it introduced enough latency spikes during flush windows that rrdcached began falling behind. Occasional timeouts appeared. Graphs went stale intermittently—exactly the kind of issue that makes you mistrust monitoring.
The team’s next “optimization” was increasing rrdcached write intervals to reduce flush frequency. That reduced I/O, yes. It also increased the amount of data lost during an unclean shutdown and amplified the “graphs lag behind reality” effect. People stopped checking graphs because they weren’t timely. The monitoring became decorative.
The eventual fix was boring: keep /var/lib/rrdcached on a stable local filesystem with predictable latency, exclude it from aggressive snapshot policies, and accept that RRD data is not precious enough to replicate with the same zeal as VM disks.
Backfired optimization lesson: the monitoring backend should be stable and dull. Low latency helps, but predictability beats cleverness.
3) The boring but correct practice that saved the day: “Separate /var and enforce thresholds”
A third team had been burned by disk-full outages before. They did two unglamorous things: they put /var on a dedicated filesystem with headroom, and they enforced alerting thresholds at 70% and 85% usage with an escalation policy. Not sexy. Not a conference talk. It just works.
One day, a network change caused repeated auth failures for a storage mount, producing a steady log stream. Disk usage climbed. The 70% alert fired and got acknowledged. The on-call didn’t “fix it later.” They investigated immediately because that’s what the runbook said, and because they’d seen what happens after 90%.
The root cause was addressed (credentials and mount retry behavior), and logs were rotated. pvestatd never failed. Graphs never went missing. Nobody had to guess what happened later, because the time-series history remained intact and boring.
Saved-by-boring lesson: separate critical write-heavy areas, alert early, and treat disk fullness as a reliability incident, not a housekeeping chore.
Joke #2: Disk space is like office politics—ignore it long enough and it will eventually schedule a meeting with your CEO.
Common mistakes: symptom → root cause → fix
1) Symptom: GUI shows no graphs; pvestatd “failed”
Root cause: rrdcached socket path mismatch or missing socket.
Fix: Confirm actual socket with ss -xlpn, align configuration in /etc/default/rrdcached, restart rrdcached then pvestatd.
2) Symptom: pvestatd logs “Permission denied” writing RRD
Root cause: Wrong ownership/mode on /var/lib/rrdcached or journal directory; sometimes from restores or manual changes.
Fix: Confirm rrdcached user with ps, apply correct chown -R to /var/lib/rrdcached, restart services.
3) Symptom: Graphs stopped after “everything was fine,” but services show “active”
Root cause: Disk full or inode exhaustion causing silent update failures; or rrdcached behind with an ever-growing journal.
Fix: Check df -h and df -i. Free space/inodes. Then verify RRD timestamps move.
4) Symptom: Errors like “invalid header (bad magic)”
Root cause: Corrupt RRD files, often after disk-full or filesystem errors.
Fix: Quarantine corrupt RRDs or reset the RRD tree; investigate underlying disk/filesystem health.
5) Symptom: pvestatd intermittently fails; logs mention storage timeouts
Root cause: One broken storage backend (hung NFS, iSCSI path issues, slow Ceph) blocks stats collection.
Fix: pvesm status to identify inactive storages. Fix network/mount, or remove dead storage config.
6) Symptom: Graphs show gaps or odd “flatlines” across nodes
Root cause: Time drift or NTP not synchronized; sometimes after VM host suspend/resume or bad RTC.
Fix: Use timedatectl and chronyc tracking. Fix time sync; avoid large manual time jumps.
7) Symptom: Restarting pvestatd “fixes” it briefly, then it fails again
Root cause: Restart loop hiding an upstream persistent issue (disk, permissions, storage timeouts). Restarts are a painkiller, not surgery.
Fix: Read the journal for the first error and resolve the dependency. Stop flapping; stabilize.
Checklists / step-by-step plan
Checklist A: Get graphs back within 15 minutes (safe path)
- Check systemctl status pvestatd and journalctl -u pvestatd -b for the first real error.
- Check systemctl status rrdcached and confirm the socket with ss -xlpn | grep rrdcached.
- Check free space and inodes: df -h /, df -i /.
- If disk/inodes are full, free space first. Prefer stopping the spam source, then rotating/truncating logs.
- Verify /var/lib/rrdcached ownership matches the rrdcached runtime user.
- Restart in order: systemctl restart rrdcached, then systemctl restart pvestatd.
- Confirm RRD updates: check RRD file timestamps in /var/lib/rrdcached/db.
- If corruption appears, quarantine single corrupt files. If many are corrupt, reset the RRD tree after ensuring storage health.
Checklist B: Stabilize and prevent recurrence (the grown-up path)
- Ensure time sync is stable: enable and verify chrony/systemd-timesyncd; fix RTC drift if it’s chronic.
- Ensure /var has sufficient headroom; consider a separate filesystem or dataset with monitoring thresholds.
- Audit log rotation settings. Confirm you can’t generate multi-GB logs in a few hours without alerts (a size-cap sketch follows after this checklist).
- Review storage definitions; remove dead storages from config. “Inactive” should be exceptional, not normal.
- If you saw filesystem errors, schedule maintenance: fsck where appropriate, SMART tests, replace questionable media.
- Document the rrdcached socket path and permissions. Avoid “mystery symlinks” nobody understands.
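For the log rotation item above, the piece most default configs lack is a hard size cap. A minimal sketch of what the merged stanza might look like, to be folded into your distro's existing rsyslog rotation config rather than copied blindly (path and limits are illustrative):

/var/log/syslog {
    daily
    rotate 7
    maxsize 500M
    compress
    delaycompress
    missingok
    notifempty
}

Keep whatever postrotate handling your distro already ships; the point here is the maxsize line, and pair it with more frequent logrotate runs if daily invocations can't contain a log storm.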
Checklist C: When you must reset RRD data (without making it worse)
- Fix disk space and verify filesystem health first.
- Stop pvestatd and rrdcached.
- Move /var/lib/rrdcached/db aside (don’t delete immediately).
- Create a fresh db directory with correct ownership.
- Start rrdcached, then pvestatd.
- Verify new RRD files appear and timestamps advance.
FAQ
1) Do I need to reboot the Proxmox node to fix pvestatd?
No. If a reboot “fixes it,” you likely had a transient mount issue or a service state problem. Diagnose the underlying cause (space, socket, permissions, storage timeouts) so it doesn’t return.
2) Why are my VMs fine but graphs are missing?
Because the monitoring pipeline is not required for virtualization to function. It’s a canary for write-path health and service dependencies. Treat it as an early warning, not a cosmetic glitch.
3) I restarted pveproxy and still no graphs. Why?
Because pveproxy renders the UI but doesn’t generate the time-series. If pvestatd can’t update RRD data, the UI has nothing new to render.
4) Can a broken NFS storage really break node graphs?
It can. If stats collection tries to query storage status and that query blocks or times out, the loop may fail or lag. Fix the mount, network, or remove dead storage definitions.
5) Is it safe to delete RRD files?
It’s safe in the sense that you won’t break VMs. You will lose historical graphs for the deleted series. If the files are corrupt, deletion/quarantine is often the fastest route back to working monitoring.
6) Why does time sync matter for graphs?
RRD buckets data by time. Large jumps create gaps or confusing consolidation artifacts. In clusters, time drift also causes broader operational weirdness. Keep NTP healthy.
7) My filesystem isn’t full. What else commonly causes “Permission denied”?
Wrong ownership from restores, moved directories, or hardening scripts changing umasks or runtime directories. Confirm the service user and directory ownership. Don’t fix with world-writable permissions.
8) How do I confirm graphs are updating without using the GUI?
Check modification times on *.rrd files under /var/lib/rrdcached/db. If timestamps advance every few minutes, updates are flowing. You can also confirm pvestatd logs go quiet (in a good way).
9) Why does rrdcached sometimes run but the socket is missing?
If the socket directory doesn’t exist at startup, or permissions prevent socket creation, rrdcached may not expose the expected endpoint. Confirm socket path in config and verify the directory exists and is writable.
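If the runtime directory itself is what keeps disappearing after reboots (because /run is a tmpfs), a systemd-tmpfiles entry can recreate it at boot. This is a sketch, assuming rrdcached runs as the rrdcached user as in Task 11; most packages handle this themselves, so only add it once you've confirmed the gap:

cr0x@server:~$ echo 'd /run/rrdcached 0755 rrdcached rrdcached -' > /etc/tmpfiles.d/rrdcached.conf
cr0x@server:~$ systemd-tmpfiles --create /etc/tmpfiles.d/rrdcached.conf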
10) Should I move RRD data to shared storage?
Generally no. RRD is small and write-heavy; shared storage adds latency and failure modes. Keep it local and boring. If you must centralize metrics, use a proper monitoring stack rather than trying to “cluster” RRD files.
Next steps that keep this from coming back
Restoring graphs is the easy part. Keeping them reliable is the job.
- Make disk pressure visible early: alert on space and inode usage long before 95%.
- Keep rrdcached predictable: stable socket path, correct ownership, avoid clever relocations that add latency.
- Clean up dead storage definitions: don’t let broken mounts linger as “inactive” for months.
- Fix time sync permanently: stable NTP, sane RTC, no manual time jumps in production.
- If you saw corruption, investigate hardware: don’t treat RRD errors as purely a software inconvenience.
If you do those, pvestatd becomes what it should be: background noise you never think about. That’s the best kind of monitoring.