Proxmox SMB/CIFS Is Slow for VM Disks: Why It’s Bad and What to Use Instead

You provision a shiny new Proxmox cluster. You point VM disks at an SMB share because the file server is “already there” and permissions are “easy.”
The first day looks fine. Then the ticket queue fills with the same sentence written in different fonts: “VMs randomly freeze.”

This isn’t a mystery. SMB/CIFS is a good protocol for office files and home directories. It is routinely a bad choice for VM disks.
Not because SMB is “slow” in the abstract—because the failure modes are ugly, the latency tax is real, and the recovery story is mean when things wobble.

What “SMB is slow” actually looks like in Proxmox

When people say “SMB is slow,” they usually mean one of three things:
throughput is low, latency is high, or performance is spiky.
VM disks care most about the last two. A VM can boot on mediocre throughput.
It will not tolerate random 500 ms write latency while the hypervisor waits on synchronous semantics.

The on-call reality is not “copying a file is slow.” It’s:

  • VM boot takes 4–10 minutes, then suddenly goes normal once caches are warm.
  • Databases stall. Journals wait. “fsync() taking too long” shows up in logs.
  • Interactive latency: SSH sessions freeze for a second, then catch up.
  • Cluster operations look haunted: migrations hang, backups time out, qemu processes go D-state.
  • The worst: everything is fine until it isn’t, then it’s a full-host incident.

SMB-backed VM disks often fail the “predictability test.” A storage stack that delivers 5 ms most of the day but 800 ms
every time antivirus scans the file server is not “fast,” it’s “a surprise generator.”

Joke #1: SMB for VM disks is like using a delivery truck as a race car. It moves, but nobody’s happy about the lap times.

Why SMB/CIFS is a poor fit for VM disks

1) VM disk I/O is small, sync-heavy, and latency-sensitive

A lot of VM I/O is 4K–64K random reads/writes, plus metadata churn.
Guest OSes and applications also love fsync(), barriers, and journal commits.
Even when your application is “async,” the guest filesystem usually is not.

SMB is a remote filesystem protocol. That means:
every write acknowledgement is a network conversation plus server-side work plus whatever durability semantics the server enforces.
You can optimize some of that with SMB3 features, but you don’t get to delete physics.
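
Back-of-envelope, with illustrative numbers rather than a benchmark: one synchronous write costs at least a network round trip plus the server-side commit, and those costs serialize.

  0.5 ms RTT, instant server commit   → at best ~2,000 serialized sync writes/s per queue
  0.5 ms RTT + 1 ms server fsync      → roughly 650 writes/s, no matter how fat the link is

Bandwidth never appears in that math. The waits do.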

2) Extra layers add extra waits (and extra ways to stall)

With SMB VM disks, the I/O path is typically:

  • Guest filesystem → guest block layer → virtio-scsi/virtio-blk
  • QEMU on Proxmox → host kernel → CIFS client
  • Network stack → switch → NIC → file server NIC
  • Samba/SMB server → local filesystem on server → server’s storage

Any one of those layers can introduce head-of-line blocking. CIFS mounts can hang on reconnect.
The file server's filesystem can pause during snapshots. Antivirus can lock files.
A “simple” write becomes a committee meeting.

3) Locking, leases, oplocks, and caching are tuned for files—VM disks behave like hot blocks

SMB is great at coordinating access to shared files. VM disks aren’t “shared files” in the happy-path sense.
They’re big images with intense random I/O patterns, often accessed by one hypervisor process that expects stable semantics and steady latency.

SMB’s caching and locking features (oplocks/leases) can help normal file workloads.
But when you run a database inside a VM inside a file, you’re building a Jenga tower of caches.
A transient lock delay becomes “MySQL froze” and now you’re debugging the wrong layer.

4) Failover and reconnect semantics are not what your VMs want

SMB can reconnect. That sounds good until you realize what “reconnect” looks like to a VM disk:
long I/O stalls. QEMU threads stuck. Guest timeouts. Sometimes filesystem corruption if the stack lies about durability.

Yes, SMB3 has durable handles, witness, multichannel, and more.
In practice, you’re still betting the farm on the file server’s SMB stack behaving perfectly under partial failure.
That bet loses more often than people admit in postmortems.

5) Sync write semantics can quietly kneecap you

VM disks, especially with QEMU defaults and guest behavior, can generate lots of synchronous writes.
On the server side, those might map to “write-through” behavior, forcing stable storage commits.
If your file server’s underlying storage doesn’t have write-back caching with protection (BBU/flash-backed),
your IOPS are now gated by rotational latency or low-end SSD write amplification.

Even with good disks, SMB adds round trips. Latency dominates.
You can have a 10GbE link and still get trash IOPS if each I/O requires multiple serialized waits.
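
If you want a per-request latency number before breaking out fio, ioping (a small package in the standard repos) gives a quick feel. Treat it as a sanity check, not a benchmark; the path below is wherever your SMB storage is mounted (Proxmox puts CIFS storages under /mnt/pve/<storage>).

  apt install ioping
  ioping -c 20 -D /mnt/pve/smbstore    # direct-I/O request latency against the mount

Steady sub-millisecond numbers don't prove you're safe; spiky double-digit milliseconds prove you're not.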

6) The “easy permissions” benefit is irrelevant for VM disks

People like SMB because ACLs integrate with corporate identity systems.
That matters for user shares. It rarely matters for VM disk images.
For VM storage you want: predictable performance, predictable failure, and predictable recovery.
ACL elegance does not help you at 3 a.m. when the CEO’s ERP VM is hung in D-state.

Paraphrased idea (John Allspaw, operations/reliability): “Complex systems fail in complex ways; reliability comes from learning and resilience, not wishful configuration.”

Interesting facts and historical context (short, concrete, and relevant)

  1. SMB began life in the 1980s as a network file sharing protocol; its DNA is “files,” not “virtual block devices.”
  2. CIFS was essentially SMB1 branding and became infamous for chatty behavior; SMB2/3 improved a lot, but the reputation was earned.
  3. SMB2 reduced round trips and introduced credit-based flow control; that helped WANs, but VM disks still punish latency.
  4. SMB3 added encryption and multichannel; great for security and throughput, but encryption can cost CPU and increase jitter under load.
  5. Windows file servers popularized “the share as storage platform”; virtualization pushed the industry back toward block or distributed storage for disks.
  6. NFS became a virtualization staple partly because its semantics and implementations mapped more cleanly to VM image access patterns (especially on enterprise arrays).
  7. iSCSI survived because it’s boring: block device, clear multipath story, predictable semantics for hypervisors.
  8. Ceph’s RADOS Block Device (RBD) design was shaped by the need to serve VM-like random I/O at scale with replication and failure handling built in.
  9. “NAS vs SAN” debates have cycled for decades; VM disks keep dragging everyone back to latency and write ordering fundamentals.

Fast diagnosis playbook: find the bottleneck in minutes

This is the order I use when a Proxmox host is running VMs from SMB and users claim “the whole cluster is slow.”
The goal isn’t perfect benchmarking. It’s to locate the dominant limiter fast so you can make a decision: tune, migrate, or replace.

First: confirm it’s storage latency, not CPU or memory pressure

  • Check host load and I/O wait. If iowait spikes with VM freezes, you’re in the right neighborhood.
  • Confirm qemu processes stuck in D-state (uninterruptible sleep). That’s usually storage waits.

Second: isolate where the latency is introduced

  • Is the SMB mount itself stalling? Look for CIFS reconnects, timeouts, or “server not responding.”
  • Is the network dropping packets or negotiating wrong settings (MTU mismatch, flow control weirdness)?
  • Is the file server’s disk subsystem saturated or forcing sync writes to slow media?

Third: validate the semantics you accidentally configured

  • SMB mount options: caching mode, actimeo, vers, multichannel, signing, encryption.
  • Samba/server: strict sync, sync always, oplocks/leases, aio settings.
  • Proxmox/QEMU: disk cache mode, IO thread, aio backend, discard settings.

Fourth: decide whether you’re tuning a dead end

If the workload is databases, CI runners, mail servers, or anything with lots of small sync writes, stop tuning SMB and start planning migration.
If it’s light desktop VMs and the latency is “only” moderately bad, you might get away with changes—temporarily.

Practical tasks: commands, outputs, and decisions (12+)

These are real operational checks you can run on a Proxmox host and, where relevant, on the SMB server.
Each task includes: the command, what typical output means, and the decision you make from it.

Task 1: Confirm iowait and run queue during the “slow” window

cr0x@pve1:~$ mpstat -P ALL 1 5
Linux 6.8.12-pve (pve1)  12/26/2025  _x86_64_ (32 CPU)

12:01:10 PM  CPU   %usr  %nice   %sys %iowait  %irq %soft  %steal  %idle
12:01:11 PM  all   12.3   0.0     4.1   38.7    0.0  0.6    0.0     44.3

Meaning: 38.7% iowait is a red flag: CPUs are idle but stuck waiting on I/O completion.
Decision: treat this as storage latency, not “needs more CPU.” Move to I/O path checks.

Task 2: Identify which devices are actually slow

cr0x@pve1:~$ iostat -x 1 3
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          10.52    0.00    3.90   36.10    0.00   49.48

Device            r/s     w/s   rMB/s   wMB/s  await  aqu-sz  %util
sda              0.2     1.1     0.0     0.1   2.10    0.00   0.3
dm-5            55.0   420.0     4.2    31.8  180.4   22.10  99.2

Meaning: dm-5 is saturated with ~180 ms average wait. That’s “VMs will feel drunk.”
Decision: map dm-5 back to a real volume (lsblk, dmsetup) and work out whether the pain is local or remote.
Note that a CIFS mount never appears in iostat at all; if local devices look healthy while VMs stall, the waits live in the network filesystem, so proceed to the CIFS-specific checks.
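
A quick way to do that mapping (the names below are whatever your host reports, not anything specific to this example):

  lsblk -o NAME,KNAME,TYPE,SIZE,MOUNTPOINT   # ties dm-* names to LVs, partitions, and mounts
  dmsetup ls --tree                          # shows the device-mapper stacking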

Task 3: See if qemu processes are blocked on I/O

cr0x@pve1:~$ ps -eo pid,stat,comm,args | grep -E "qemu-system|D " | head
21133 D    qemu-system-x86_64 /usr/bin/kvm -id 101 -name vm101 ...

Meaning: D state indicates uninterruptible sleep, commonly storage waits.
Decision: confirm which filesystem/mount those I/Os hit; look for CIFS stalls and server-side latency.
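
If you want proof rather than inference, the kernel stack of a blocked qemu thread is blunt and honest (needs root; the PID is the one from the ps output above):

  cat /proc/21133/stack

Frames that mention cifs or long waits point at the network filesystem path rather than local storage.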

Task 4: Confirm the VM disk is on SMB storage in Proxmox

cr0x@pve1:~$ qm config 101 | sed -n '1,120p'
boot: order=scsi0;ide2;net0
memory: 8192
name: vm101
scsi0: smbstore:101/vm-101-disk-0.qcow2,cache=writeback,discard=on,size=80G
scsihw: virtio-scsi-single

Meaning: The disk is a qcow2 file on a storage named smbstore.
Decision: expect double indirection (qcow2 metadata + SMB filesystem semantics). Plan migration to raw-on-block or RBD.

Task 5: Inspect the CIFS mount options on the Proxmox host

cr0x@pve1:~$ mount | grep cifs
//filesrv01/vmstore on /mnt/pve/smbstore type cifs (rw,relatime,vers=3.1.1,cache=strict,username=svc_pve,uid=0,gid=0,soft,nounix,serverino,mapposix,nofail)

Meaning: soft can cause I/O errors under transient outages; cache=strict can increase metadata round trips.
Decision: remove “cute” options. Avoid soft for VM disks; prefer hard semantics or—better—stop using SMB for VM disks.
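
If you have to stay on SMB for a while, test option changes on a scratch mount point rather than on the live storage. The mount point and credentials file below are placeholders:

  mkdir -p /mnt/cifs-test
  mount -t cifs //filesrv01/vmstore /mnt/cifs-test \
    -o vers=3.1.1,hard,credentials=/root/.smbcred-vmstore

hard means I/O blocks instead of erroring during an outage. That is still ugly for guests, but at least it doesn't hand them EIO mid-write.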

Task 6: Check CIFS client stats for reconnects and timeouts

cr0x@pve1:~$ cat /proc/fs/cifs/Stats
Resources in use
CIFS Session: 1
Share (unique mount targets): 1
SMB Request/Response Buffer: 1 Pool size: 5
SMB Small Req/Resp Buffer: 3 Pool size: 30
Operations (MIDs): 0

0 session 17 share reconnects
Total vfs operations: 121044
Total timeouts: 9
Total retries: 203

Meaning: reconnects and timeouts during business hours are a performance and data-integrity risk.
Decision: treat network and server stability as a requirement. If you can’t get reconnects to zero, migrate off SMB.

Task 7: Look for CIFS errors in the kernel log

cr0x@pve1:~$ dmesg -T | grep -i cifs | tail -n 8
[Thu Dec 26 12:03:18 2025] CIFS: VFS: Server filesrv01 has not responded in 60 seconds. Reconnecting...
[Thu Dec 26 12:04:19 2025] CIFS: VFS: cifs_reconnect: server reconnect succeeded

Meaning: That 60 seconds is an eternity for VM disks. Guests will hang; some will crash.
Decision: stop pretending this is acceptable for production VM storage. Fix network/server, or move storage.

Task 8: Confirm network path health (drops, errors, MTU surprises)

cr0x@pve1:~$ ip -s link show dev bond0
4: bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    RX:  bytes  packets  errors  dropped  missed  mcast
    9123349901  9812331  0       214      0       0
    TX:  bytes  packets  errors  dropped  carrier collsns
    8331990021  8441122  0       0        0       0

Meaning: RX drops exist. They may be harmless, or they may correlate with spikes.
Decision: if drops increase during stalls, investigate switch buffers, NIC ring sizes, flow control, and MTU consistency end-to-end.
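
One check worth doing explicitly when jumbo frames are in play (bond0 reports MTU 9000 above): confirm the whole path actually carries them without fragmentation.

  ping -M do -s 8972 -c 5 filesrv01    # 8972 bytes of payload + 28 bytes of headers = 9000

If this fails while normal ping works, something in the middle is fragmenting or dropping large frames, which shows up as exactly the kind of jitter SMB hates.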

Task 9: Measure latency to the file server under load (not just ping once)

cr0x@pve1:~$ ping -c 20 -i 0.2 filesrv01
PING filesrv01 (10.20.0.10) 56(84) bytes of data.
64 bytes from 10.20.0.10: icmp_seq=1 ttl=63 time=0.380 ms
64 bytes from 10.20.0.10: icmp_seq=2 ttl=63 time=0.412 ms
64 bytes from 10.20.0.10: icmp_seq=9 ttl=63 time=12.881 ms
64 bytes from 10.20.0.10: icmp_seq=10 ttl=63 time=0.401 ms

--- filesrv01 ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 3810ms
rtt min/avg/max/mdev = 0.362/1.104/12.881/2.776 ms

Meaning: occasional 10–15 ms spikes are already bad for sync-heavy I/O. SMB amplifies spikes.
Decision: if latency jitter exists, don’t put VM disks on a remote filesystem that depends on consistent round trips.

Task 10: Quick and dirty I/O test on the SMB mount (latency tells the story)

cr0x@pve1:~$ fio --name=randwrite --directory=/mnt/pve/smbstore --size=2G --bs=4k --rw=randwrite --ioengine=libaio --iodepth=16 --numjobs=1 --runtime=30 --time_based --direct=1
randwrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, ioengine=libaio, iodepth=16
...
  write: IOPS=420, BW=1680KiB/s (1720kB/s)(50.0MiB/30500msec)
    lat (usec): min=2100, max=980000, avg=38000.12, stdev=92000.55

Meaning: average 38 ms with near-1s max latency. That’s not “a bit slow.” That’s “VM pauses.”
Decision: stop using SMB for VM disks for this workload. You can tune, but you can’t tune away 980 ms tail latency.
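
For context, run the same job against local storage; /var/lib/vz (the default local dir storage) works as a rough baseline. The point is the gap, not the absolute numbers.

  fio --name=randwrite-local --directory=/var/lib/vz --size=2G --bs=4k \
      --rw=randwrite --ioengine=libaio --iodepth=16 --numjobs=1 \
      --runtime=30 --time_based --direct=1

If the local tail latency is an order of magnitude lower, you've confirmed the mount, not the host, is the problem.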

Task 11: Check Proxmox storage configuration (what it thinks “smbstore” is)

cr0x@pve1:~$ cat /etc/pve/storage.cfg | sed -n '1,200p'
dir: local
        path /var/lib/vz
        content iso,vztmpl,backup

cifs: smbstore
        path /mnt/pve/smbstore
        server filesrv01
        share vmstore
        content images,backup
        username svc_pve
        vers 3.1.1

Meaning: Proxmox treats this as file-based storage suitable for images and backups.
Decision: keep SMB for backups/ISOs if you must; remove images content if you want fewer incidents.
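
Once no VM disks live on it, demoting the storage is one command; check first that nothing still references it (qm config per VM, or a search through the VM configs).

  pvesm set smbstore --content backup

After that, Proxmox stops offering it as a target for new disk images, which is the point.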

Task 12: Inspect QEMU disk cache mode (it can make things worse or merely different)

cr0x@pve1:~$ qm config 101 | grep -E '^scsi0:'
scsi0: smbstore:101/vm-101-disk-0.qcow2,cache=writeback,discard=on,size=80G

Meaning: cache=writeback may improve perceived speed but increases the blast radius of host crashes and SMB stalls.
Decision: don’t use cache mode as a band-aid for bad storage. If you need writeback to be “fast enough,” the backend is wrong.

Task 13: Check the Samba server for synchronous write policies (server side)

cr0x@filesrv01:~$ testparm -sv 2>/dev/null | grep -E 'strict sync|sync always|aio read size|aio write size'
        aio read size = 1
        aio write size = 1
        strict sync = Yes
        sync always = Yes

Meaning: sync always = Yes is a performance cliff for VM disks: it forces stable storage semantics on every write.
Decision: if this is a file server for documents, keep durability. If you were trying to use it as a SAN, stop and redesign.
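
For reference, these are per-share smb.conf settings. A documents share can reasonably stay strict; the share name and path here are placeholders, not an endorsement of serving VM disks this way:

  [documents]
      path = /srv/documents
      strict sync = yes      # honor client sync requests
      sync always = no       # don't force every single write to stable storage
      aio read size = 1      # allow async reads
      aio write size = 1     # allow async writes

Turning sync always off will make an SMB-backed VM store feel faster and your durability story murkier. It does not turn a file server into a SAN.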

Task 14: Verify the file server’s filesystem and storage are not the real limiter

cr0x@filesrv01:~$ iostat -x 1 3
Device            r/s     w/s   rMB/s   wMB/s  await  aqu-sz  %util
nvme0n1         120.0   980.0    22.4   110.8   3.20    1.90  72.0
md0              10.0   320.0     1.1    38.0  45.00   18.50  99.0

Meaning: One device is fine (NVMe), the RAID/md device is pegged at 99% util and 45 ms await.
Decision: even if SMB were perfect, the backend isn’t. VM disks will punish this design. Move VM storage to faster, VM-oriented storage.

What to use instead: sane storage options for Proxmox

If you remember one thing: VM disks want block storage or block-like semantics with predictable latency.
They also want a recovery story that doesn’t involve “hope the file server reconnects quickly.”

Option A: Local NVMe/SSD with ZFS (fast, predictable, surprisingly operable)

For many Proxmox shops, the best answer is also the least glamorous:
put VM disks on local NVMe/SSD and use ZFS for integrity and snapshots.
You lose “shared storage” live migration unless you replicate or use ZFS replication.
You gain tail latency that doesn’t look like a heart monitor.

When local ZFS wins:

  • You can tolerate planned migrations or replication-based failover.
  • Your workload is latency-sensitive and you value consistency over centralization.
  • You want strong data integrity checksums and sane snapshot tooling.

The key: store VM disks as ZVOLs (block devices) or raw files on ZFS, not qcow2 on SMB. Give ZFS RAM. Monitor write amplification.
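
A minimal sketch of that layout, assuming two spare NVMe devices and a pool named tank (device paths are placeholders; use /dev/disk/by-id on real hardware):

  zpool create -o ashift=12 tank mirror /dev/nvme1n1 /dev/nvme2n1
  zfs create tank/vm
  pvesm add zfspool tank-vm --pool tank/vm --content images,rootdir

The zfspool storage type stores VM disks as ZVOLs, which is exactly the block-like behavior you want instead of qcow2 files on a share.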

Option B: Ceph RBD (shared storage done the hard way, which is the right way)

If you need shared storage and live migration at scale, Ceph is the Proxmox-native answer.
It’s not “easy.” It’s a storage system with its own failure modes and operational requirements.
But it’s designed for what you’re doing: serving block devices to hypervisors with replication and recovery baked in.

When Ceph wins:

  • You need shared VM disks across nodes.
  • You need failure tolerance without a single file server as a choke point.
  • You’re willing to run storage as a first-class system: monitoring, capacity planning, and upgrade discipline.

If your cluster is small and you don’t have the operational maturity for Ceph, don’t force it. Shared storage isn’t free.
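
If you do have the appetite, the Proxmox-native path looks roughly like this (a sketch for a hyperconverged cluster; the storage network and OSD device are placeholders):

  pveceph install
  pveceph init --network 10.20.1.0/24      # once, on the first node
  pveceph mon create
  pveceph osd create /dev/nvme1n1          # repeat per node / per device
  pveceph pool create vmpool
  pvesm add rbd ceph-vm --pool vmpool --content images

Give Ceph its own network, monitor it like a product, and size for rebuild traffic, not just steady state.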

Option C: iSCSI + LVM (boring block storage, works well, predictable semantics)

iSCSI is a frequent “adult in the room” choice: a storage array exports LUNs, hosts see block devices, multipath handles path failure.
You can layer LVM or LVM-thin on top in Proxmox.
The performance is usually solid, and the behavior under load is easier to reason about than SMB.

Where iSCSI shines:

  • You have a real array or a well-built target with proper caching and redundancy.
  • You want central storage and consistent performance.
  • You need a clear multipath story.
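
The host-side plumbing is old and boring in the best way (portal, target, and VG names below are placeholders):

  iscsiadm -m discovery -t sendtargets -p 10.20.0.50
  iscsiadm -m node --login
  multipath -ll                            # confirm every path is active
  # create a VG on the multipath device, then:
  pvesm add lvm san-vm --vgname vg_san --content images --shared 1

Proxmox also has a native iSCSI storage type if you want it to manage the target; either way, LVM on top gives you one plain logical volume per VM disk.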

Option D: NFS (better than SMB for VM images, still not my first pick for write-heavy workloads)

NFS is commonly used for VM images and can work well when backed by an enterprise NAS tuned for virtualization.
It tends to have less “Windows file server baggage” and simpler UNIX-oriented semantics in many environments.
Still: it’s a network filesystem. Latency and server-side behavior still matter.

If you use NFS, use a storage appliance that is built and supported for hypervisor VM storage,
and validate latency under sync-write load.
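
Validating that is one fio run away; the mount point below is a placeholder for wherever the NFS export lands:

  fio --name=nfs-sync --directory=/mnt/pve/nfsstore --size=2G --bs=4k \
      --rw=randwrite --ioengine=libaio --iodepth=8 --fsync=1 \
      --runtime=60 --time_based

--fsync=1 forces a flush after every write, which is roughly what a database journal does to your storage all day.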

Option E: SMB is fine for backups, ISOs, templates, and cold storage

SMB is not evil. It’s just being asked to be something it isn’t.
Use it for:

  • ISO repositories
  • Proxmox Backup Server datastores (usually better on local disks, but SMB can be acceptable for secondary copies)
  • Exporting backups to another system
  • Templates and “not latency sensitive” artifacts

Keep VM disks off SMB unless the workload is trivial and the outage impact is trivial.
If both are trivial, you probably didn’t need Proxmox either—but here we are.

Joke #2: The fastest way to speed up SMB VM storage is to stop using SMB VM storage.

Three corporate mini-stories from the land of “it seemed fine”

Mini-story 1: The incident caused by a wrong assumption (“10GbE means it’s fast”)

A mid-sized company had a Proxmox cluster for internal services: Git, CI, a few Windows app servers, and a database that everyone pretended wasn’t critical.
Storage was an SMB share on a Windows file server because the server already had “a lot of space,” and the infrastructure team liked the ACL tooling.
They also had 10GbE uplinks, which made everyone feel modern and therefore safe.

The first serious incident started during a routine security scan. The scanner hit the VM datastore share.
The file server’s antivirus saw thousands of block-level changes inside qcow2 files as “interesting.”
CPU spiked on the file server, disk queue depth climbed, and SMB response times went from sub-millisecond to “eventually.”

On the Proxmox side, qemu processes piled up in D-state. Guests didn’t crash immediately—they just stopped answering.
Monitoring reported healthy VMs because the processes existed and the hosts were up. Users reported “everything is frozen.”
The team rebooted a Proxmox node, which made the problem worse: cached writes that hadn’t been safely committed turned into guest filesystem checks and partial corruption.

The wrong assumption was simple: 10GbE throughput equals VM storage performance.
They had bandwidth. They did not have low, stable latency under sync-heavy random writes.
After the postmortem, they moved VM disks to local ZFS mirrors and used replication for the handful of VMs that needed faster recovery.
The file server went back to what it was good at: files.

Mini-story 2: The optimization that backfired (cache modes and “just make it writeback”)

Another organization ran a small Proxmox farm where someone noticed SMB-backed VMs were slow during morning logins.
A well-meaning engineer toggled QEMU disk cache mode to writeback for “performance.”
It worked. Login storms were smoother. Ticket volume dropped. Everyone moved on.

Two months later a power event hit the rack. UPS held, then didn’t. Hosts went down hard.
On reboot, most VMs came back. A few didn’t. One database VM booted, but the application threw integrity errors later that day.
Nobody had a clear “we lost X seconds of writes” statement because the storage path involved multiple caches with different durability semantics.

The backfire wasn’t that writeback is always wrong. It’s that writeback was used to disguise a backend that couldn’t meet the workload’s sync latency needs.
They optimized the symptom and increased the ambiguity of failure.
The follow-up fix was boring: move to iSCSI from a small array with protected cache and configure multipath correctly.
Performance became stable and failures became comprehensible.

There’s a special kind of operational debt where you “speed things up” by reducing how often you ask the truth.
It looks like success until the day reality files a complaint.

Mini-story 3: The boring but correct practice that saved the day (measure tail latency, not averages)

A fintech-ish company (the kind that loves dashboards) planned a Proxmox expansion and wanted shared storage for migrations.
SMB was proposed because there was an existing highly available file server pair and the storage team promised “it’s enterprise.”
The SRE team didn’t argue on taste. They asked for a test window.

They ran fio from a Proxmox node to the proposed SMB mount with a workload shaped like VM disks: 4K random writes, sync-ish engine, iodepth tuned to mimic contention.
Average latency wasn’t terrible. The 99.9th percentile latency was a mess, with periodic spikes in the hundreds of milliseconds.
Then they repeated the same test while triggering normal operational activity on the file server: snapshots, log rotations, and a backup job.
The tail got worse.

The result was politically inconvenient but operationally perfect: SMB was approved for backups and ISO storage only.
VM disks went to Ceph RBD because the org already had the appetite to operate a distributed system, and they wanted node-level failure tolerance.
When a top-of-rack switch later misbehaved and caused transient packet loss, Ceph degraded but stayed usable; SMB would likely have turned it into VM-wide stalls.

The practice that saved them wasn’t magic. It was measuring the right thing (tail latency) and testing during realistic background noise.
Boring, correct, and it avoided an incident that would have been blamed on “random Proxmox instability.”

Common mistakes: symptom → root cause → fix

1) Symptom: VMs “freeze” but host CPU is mostly idle

Root cause: high storage latency and blocked I/O (qemu in D-state), often during SMB reconnects or server-side pauses.

Fix: confirm with mpstat, iostat, dmesg, and CIFS stats; then migrate VM disks off SMB (Ceph RBD, iSCSI, or local ZFS).

2) Symptom: Backups are fine, but interactive workloads are awful

Root cause: throughput is okay, tail latency is not. Backup streams are sequential; VM disks are random and sync-heavy.

Fix: benchmark with small-block random I/O and examine max/percentiles. Don’t rely on file copy speed as a proxy.

3) Symptom: Performance is great after reboot, then degrades

Root cause: cache warmth hides backend latency until the working set grows; SMB metadata and writeback caches mask reality.

Fix: test cold and warm. Watch server-side cache eviction, writeback behavior, and disk queues. Prefer storage with consistent latency.

4) Symptom: Random “Input/output error” inside guests

Root cause: CIFS mounted with soft or aggressive timeout/retry behavior under transient failures.

Fix: avoid soft for VM storage; fix network stability; better yet stop using SMB for VM disks.

5) Symptom: Migration stalls or takes forever

Root cause: shared SMB storage becomes a serialization point; metadata ops and locking slow down under load; network jitter hurts.

Fix: use Ceph/shared block for live migration, or use local storage with replication and accept planned downtime.

6) Symptom: Everything goes bad when “someone runs a scan” or “backup starts”

Root cause: file server background tasks (AV scanning, snapshots, backups) compete with VM I/O and introduce latency spikes.

Fix: separate concerns. VM disk storage should not share the same appliance and policy set as user file shares.

7) Symptom: Good latency in ping, bad latency in disk I/O

Root cause: SMB operations aren’t ICMP; they involve server CPU, filesystem locks, journaling, and storage commits.

Fix: measure what matters: I/O latency and tail behavior with fio and real VM workload traces.

Checklists / step-by-step plan

Checklist A: If you’re already on SMB and suffering

  1. Confirm the symptom is I/O latency: run mpstat and iostat -x during the incident.
  2. Check CIFS client health: look for reconnects/timeouts in /proc/fs/cifs/Stats and kernel logs.
  3. Validate network basics: drops/errors, MTU consistency, and latency jitter to the server.
  4. Inspect server-side constraints: disk queue, sync policies, snapshot jobs, AV scans.
  5. Stop the bleeding: move the noisiest VMs (databases, CI, mail) first to local SSD or block storage.
  6. Change Proxmox storage usage: remove images from SMB storage and keep it for backups/templates.
  7. Build the replacement: pick Ceph/iSCSI/local ZFS based on operational reality, not aspiration.
  8. Migrate with a plan: schedule maintenance windows, test restores, and validate guest filesystem integrity after moves (a minimal disk-move sketch follows).
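
The move itself is usually one command per disk. The target storage name is a placeholder, and the subcommand spelling depends on your Proxmox version (qm disk move on current releases, qm move_disk on older ones):

  qm disk move 101 scsi0 local-zfs --delete

Online moves work for most disks, but for the loud sync-heavy VMs a short maintenance window plus a verified backup beats explaining a half-moved image.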

Checklist B: Choosing the right alternative

  • Need live migration and shared disks? Prefer Ceph RBD or a proper iSCSI/FC SAN.
  • Need simplest reliable performance? Local NVMe mirrors with ZFS, plus replication for critical workloads.
  • Already have an enterprise NAS built for VM storage? NFS can be acceptable; prove tail latency first.
  • Using a Windows file server because it exists? Use it for files and backup exports, not primary VM disks.

Checklist C: Migration safety moves (because storage changes are where careers go to die)

  1. Take a fresh backup and do a test restore of at least one VM to the target storage.
  2. Move one non-critical VM first and validate performance and logs for a full business day.
  3. Track tail latency (99th/99.9th), not just averages, before and after.
  4. Document rollback: where the old disk image is, how to reattach it, and what DNS/app dependencies exist.
  5. Only then migrate the loud workloads.

FAQ

1) Is SMB always slow, or just slow for VM disks?

Mostly slow (or unstable) for VM disks. SMB can be perfectly fine for user files, media, backups, and artifacts.
VM disks turn latency and jitter into outages.

2) What if I’m using SMB3.1.1 with multichannel and a fast Windows server?

You can improve throughput and resilience, but you’re still using a network filesystem for latency-sensitive block-like I/O.
If your workload is light, it might be acceptable. For databases and busy servers, it’s still the wrong tool.

3) Can I make SMB acceptable with mount options?

You can reduce damage. You can’t make it behave like a purpose-built VM storage backend.
If your “fix” is a pile of mount options plus a spreadsheet of do-not-touch server policies, you’ve already lost.

4) Is qcow2 on SMB worse than raw on SMB?

Usually yes. qcow2 adds metadata reads/writes and fragmentation behavior that amplifies latency on remote filesystems.
Raw reduces overhead, but it doesn’t fix SMB’s fundamental latency and failure semantics.

5) Is NFS really better than SMB for Proxmox VM disks?

Often, yes, especially on storage designed for virtualization. But NFS is still a network filesystem.
It can be great, mediocre, or terrible depending on the server and network. Measure tail latency under realistic load.

6) Should I run Ceph for a small 3-node cluster?

Maybe. Proxmox makes Ceph approachable, but it’s still a distributed system.
If you can commit to monitoring, capacity discipline, and predictable networking, it can work well.
If your team struggles with basic storage hygiene, local ZFS plus replication is usually the safer win.

7) What’s the simplest “good enough” architecture for reliable VM storage?

Local NVMe mirrors on each node with ZFS, plus scheduled replication for critical VMs, plus good backups.
It’s not as glamorous as shared storage, but it’s extremely effective and easy to reason about when things break.

8) Why do VM freezes correlate with “file server maintenance jobs”?

Because those jobs cause latency spikes: snapshots, antivirus, dedupe, tiering, cloud sync, and backup agents all compete for I/O and locks.
VM disks experience those spikes as stalled writes and blocked I/O threads.

9) If SMB is bad for VM disks, why does Proxmox support CIFS storage at all?

Because it’s useful for other content types: ISO images, backups, templates, and general file sharing.
“Supported” doesn’t mean “a good idea for every workload.”

10) What’s the single metric I should alert on for this problem?

Host-level disk latency (await) and guest-visible fsync latency if you can collect it.
Also alert on CIFS reconnects/timeouts—those are early smoke signals before users start screaming.

Next steps you can do this week

If you’re running VM disks on SMB today and you care about uptime, treat this as technical debt with interest.
Your goal is not “make SMB faster.” Your goal is “stop letting the VM storage path depend on a file server’s mood.”

  1. Run the fast diagnosis playbook during your next slow period and capture iowait, iostat, CIFS stats, and dmesg excerpts.
  2. Classify workloads: identify the top 5 VMs by write IOPS and fsync sensitivity (databases, CI, mail, directory services).
  3. Pick your target:
    • Local ZFS if you want predictability fast.
    • Ceph RBD if you need shared storage and can operate it.
    • iSCSI if you have an array and want boring block semantics.
  4. Move one VM, validate tail latency and stability for a full day, then move the rest in batches.
  5. Demote SMB to backups/templates/ISOs. Let it do what it’s good at.

The payoff is immediate: fewer “random freezes,” fewer midnight reboots, and fewer meetings where you have to explain that storage “was technically up.”
Production systems don’t care about technically.
