Backups to a NAS fail in boring, predictable ways—until they don’t. Then you get the 2 a.m. “it worked yesterday” ticket, and your only artifact is a one-line error code and a suspiciously quiet network share.
This is the production-grade setup: hardened SMB, sane permissions, stable credentials, predictable scheduling, measurable performance, and verification that catches silent corruption and “success-with-missing-files” lies.
What we’re actually building (and what we’re not)
We’re building reliable Windows backups to a NAS over SMB with:
- Two backup types: file-level (Robocopy) and optional image-level (Windows Server Backup to SMB where appropriate).
- Controlled identity: a dedicated backup account, stable credentials, no “who ran the task last?” surprises.
- Predictable IO: SMB settings that don’t self-sabotage, and NAS storage choices that won’t crumble under small-file storms.
- Proof: logs, exit codes, retention, verification and restore drills.
- Diagnosis speed: you’ll know whether the bottleneck is DNS, auth, SMB negotiation, disk, CPU, MTU, or schedule overlap—fast.
We are not pretending this replaces proper 3-2-1. A NAS share is not an offsite copy, not immutable by default, and not ransomware-proof unless you make it so.
Facts and context: why Windows-to-NAS is weird
Some background matters because the failure modes are baked into history, not your competence.
- SMB is old. The protocol lineage goes back to the 1980s; compatibility layers still haunt modern deployments in “helpful” defaults.
- SMB1 is effectively a fossil. It was widely exploited (think worms and lateral movement); modern Windows versions disable it by default in many SKUs for good reasons.
- “Network share” and “backup target” are different beasts. File shares are built for interactive access; backups hammer metadata and create long sequential streams, often in the same job.
- Windows file semantics are strict. Alternate data streams, long paths, ACLs, and open-file handling are normal; many NAS appliances imitate this imperfectly.
- VSS exists because Windows apps don’t politely close. Shadow copies were introduced to get consistent snapshots while apps keep writing.
- SMB signing became a security baseline. Many environments require it; it can impact throughput on weaker NAS CPUs or under heavy concurrency.
- Time skew breaks auth in non-obvious ways. Kerberos and token lifetimes don’t care that “it’s only five minutes.” Backup jobs fail like they’re haunted.
- NAS vendors optimize for mixed workloads. Your backup workload (millions of tiny files, then big streams) is the worst of both worlds.
- “Success” often means “success-ish.” Tools can exit 0 with skipped files, excluded junctions, or path issues unless you treat the logs like evidence.
One quote you should keep around, because it’s the whole job: “Hope is not a strategy” (a line commonly repeated in operations circles).
Design principles that prevent random failure
1) Make identity boring: one backup account per scope
Create a dedicated account (local or domain) used only for backups. Give it read access to sources (or admin rights if you must, but don’t start there), and write-only where possible to the NAS target. Avoid using your own admin token for scheduled backups. Your password rotation policy will eventually meet Task Scheduler, and only one of them will survive.
2) Make the NAS share purpose-built
A general “Public” share is how you get retention chaos and ransomware-assisted self-harm. Create a share just for backups with:
- separate dataset/volume (so you can snapshot and enforce quotas)
- separate SMB share (so you can tune settings without breaking user shares)
- separate permissions (so “accidental delete” requires effort)
3) Prefer push; use “pull” only if you can secure it
Backing up by pushing from Windows to NAS is common and fine. Pulling from NAS (NAS connects to Windows and copies data) can reduce credential sprawl on endpoints, but it’s often worse for Windows permissions and VSS consistency. If you have to choose, push is usually simpler to get correct—then harden the NAS so the pushed data can’t be modified later.
4) Eliminate silent failure paths
Backups fail loudly when the network is down. They fail quietly when:
- you hit path length limits
- your backup user can’t read a subset of folders
- your job runs longer than the next scheduled run (overlap)
- your NAS runs out of inodes / metadata space / snapshot reserve
- SMB sessions are dropped under idle timeouts mid-transfer
Design for detection: explicit exit code checks, log shipping, and periodic restore tests.
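As a sketch of that “fail loudly” posture: a minimal wrapper that logs every command and propagates nonzero exit codes instead of swallowing them. The log path and the demo commands are placeholders, not part of any real setup described here.

```shell
#!/usr/bin/env bash
# run_logged: run a command, append its output to a log, and fail loudly.
# The log location is a placeholder; point it at wherever your jobs log.
run_logged() {
  local log=$1; shift
  if "$@" >>"$log" 2>&1; then
    echo "OK: $*" >>"$log"
  else
    local rc=$?                          # exit status of the failed command
    echo "FAILED (rc=$rc): $*" >>"$log"
    return "$rc"                         # propagate, never swallow
  fi
}

log=$(mktemp)
run_logged "$log" true                           # succeeds quietly
run_logged "$log" false || echo "caught failure" # failure is visible, not silent
```

The point is the `return "$rc"`: a wrapper that only checks “did a log file appear” is exactly the silent-failure path described above.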
5) Don’t let “optimization” outrun observability
Compression, dedupe, multi-threaded copy, huge MTUs, jumbo reads—these can be great. They can also make failures intermittent and hard to reproduce. Optimize only after you can measure baseline throughput and error rate.
Joke #1: Backups are like parachutes: the one time you need them is a terrible moment to discover you bought “some assembly required.”
NAS-side setup that survives reality
Choose a storage layout that matches backup IO
Backups are typically write-heavy and bursty. File-level backups can be metadata-heavy (small files), while image-level backups are large sequential writes. Your NAS should have:
- Enough RAM to avoid thrashing metadata caches.
- Disks that can sustain writes without falling off a cliff (SMR drives are a known “surprise!” for sustained writes).
- Redundancy appropriate to business impact (mirrors/RAIDZ/RAID6 etc.).
If your NAS is ZFS-based, be cautious with deduplication. It’s not a free lunch; it’s a recurring invoice in RAM and CPU.
SMB share settings: security first, then performance
Defaults vary by vendor. What you want, generally:
- SMB2/SMB3 only (disable SMB1).
- Encryption: enable if you’re crossing untrusted networks; otherwise evaluate CPU impact.
- Signing: follow your security baseline; measure throughput.
- Opportunistic locking / leasing: usually fine; backup workloads are mostly sequential, but metadata storms can behave oddly.
- Durable handles: helpful when clients reconnect after brief glitches.
Permissions: “backup can write, but not rewrite history”
On a perfect day, your Windows host can create new backup sets but cannot delete or modify old ones. Achieving true immutability on SMB is tricky, but you can approximate it:
- Use separate subfolders per host and per backup type.
- Make the backup account write/create to its own folder; restrict delete where feasible.
- Use NAS snapshots (hourly/daily) with retention. Snapshots are your “undo.”
- If your platform supports it, use WORM/immutable snapshots or snapshot locking.
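On a Samba-based NAS, the share side of these goals might look like the stanza below. This is a sketch, assuming the `/mnt/backups` path and `CORP\backup_svc` account used elsewhere in this article; note that Samba has no per-share “deny delete” switch, so delete restriction lives in filesystem ACLs, and snapshots remain your real undo.

```ini
[backups]
    path = /mnt/backups
    valid users = CORP\backup_svc
    read only = no
    browseable = no
    ; tune this share without touching user shares;
    ; restrict delete via filesystem ACLs on /mnt/backups,
    ; and rely on snapshot retention as the history layer
```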
DNS, NTP, and certificates: the boring parts that bite
If the NAS and Windows disagree on time, or if DNS returns the wrong IP due to stale records, SMB auth fails in a way that looks like “random network weirdness.” Put the NAS on the same NTP source as your domain controllers (or at least within sane drift) and treat DNS records as production config.
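The tolerance worth memorizing is Kerberos’s default clock-skew limit of five minutes (300 seconds). The comparison itself is trivial; the epoch values below are illustrative, and in practice you would fetch the remote clock via an NTP query (for example `chronyc tracking` or `ntpdate -q`) rather than hardcode it.

```shell
# drift_ok: compare two clocks (epoch seconds) against a max skew.
# 300 s is the default Kerberos clock-skew tolerance.
drift_ok() {
  local a=$1 b=$2 max=${3:-300}
  local d=$(( a > b ? a - b : b - a ))
  if (( d <= max )); then
    echo "OK (drift ${d}s)"
  else
    echo "SKEW (drift ${d}s)"
  fi
}

drift_ok 1754000000 1754000120   # prints "OK (drift 120s)"
drift_ok 1754000000 1754000400   # prints "SKEW (drift 400s)"
```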
Windows-side setup: credentials, scripts, and scheduling
Use Robocopy for file-level backups, but treat it like a power tool
Robocopy is reliable and brutally honest—if you read its exit codes correctly. The key decisions:
- /MIR mirrors deletions (dangerous if pointed wrong, useful if you design for it).
- /Z (restartable mode) helps over flaky links; /ZB falls back to Backup mode on access-denied, which requires the backup privilege (SeBackupPrivilege).
- /R and /W decide whether you want to wait forever on locked files (you don’t).
- /COPY:DAT vs /COPY:DATSOU depends on whether you need ACLs and auditing preserved.
For most backup-to-NAS setups, you’ll copy data and timestamps, and you’ll log aggressively. If you need ACL fidelity, you must test restores and verify the NAS actually preserves them correctly.
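Robocopy’s exit code is a bitmask, not a pass/fail: bit 0 (1) files were copied, bit 1 (2) extra files detected, bit 2 (4) mismatches, bit 3 (8) some copies failed, bit 4 (16) fatal error. Anything below 8 is success-ish; 8 and above must fail the job. The gate is a few lines; shown here in portable shell for clarity, though on Windows you would express the same check in your batch or PowerShell wrapper.

```shell
# Interpret a Robocopy exit code (a bitmask). Codes >= 8 are real failures.
check_robocopy_rc() {
  local rc=$1
  (( rc & 16 )) && { echo "FATAL (rc=$rc)"; return 2; }
  (( rc & 8 ))  && { echo "FAILED copies (rc=$rc)"; return 1; }
  echo "OK (rc=$rc)"   # 0-7: copied/extras/mismatches, no hard failures
}

check_robocopy_rc 1   # prints "OK (rc=1)"
check_robocopy_rc 3   # prints "OK (rc=3)"
check_robocopy_rc 9 || echo "gate would fail the job here"
```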
Windows Server Backup to SMB: useful, but know its personality
Windows Server Backup (WSB) can target a remote shared folder, but it has constraints: it manages its own folder structure, may not keep multiple versions the way you expect on a share, and can behave differently than backing up to a dedicated disk. Use it when you specifically need system state, bare-metal recovery, or app-aware VSS integration on Windows Server.
Task Scheduler: run as a service identity, not “when I’m logged on”
Scheduled backups should run whether or not someone is logged in. Use a dedicated account (domain or local), set “Run whether user is logged on or not,” and store credentials. Then harden that account. This is not the place to get creative.
Joke #2: Nothing ages you faster than a “completed successfully” backup that can’t restore.
Practical tasks (commands, outputs, decisions)
These are concrete tasks you can run from a Linux admin box, a NAS shell, or a jump host. Each one includes what the output means and what decision you make from it. Use them as your standard operating procedure when you’re diagnosing “random failures.”
Task 1: Verify DNS is stable for the NAS name
cr0x@server:~$ dig +short nas01.corp.local
10.20.30.40
Output meaning: You got one A record. If you see multiple changing IPs, you may be hitting round-robin, stale records, or multi-homed NAS confusion.
Decision: If the IP is unstable or wrong, fix DNS first. Don’t tune SMB on a moving target.
Task 2: Check basic reachability and latency (ICMP)
cr0x@server:~$ ping -c 5 10.20.30.40
PING 10.20.30.40 (10.20.30.40) 56(84) bytes of data.
64 bytes from 10.20.30.40: icmp_seq=1 ttl=64 time=0.512 ms
64 bytes from 10.20.30.40: icmp_seq=2 ttl=64 time=0.488 ms
64 bytes from 10.20.30.40: icmp_seq=3 ttl=64 time=0.501 ms
64 bytes from 10.20.30.40: icmp_seq=4 ttl=64 time=0.497 ms
64 bytes from 10.20.30.40: icmp_seq=5 ttl=64 time=0.493 ms
--- 10.20.30.40 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4090ms
rtt min/avg/max/mdev = 0.488/0.498/0.512/0.010 ms
Output meaning: Low latency and no loss. If you see loss or spikes, SMB will “randomly” stall and reconnect.
Decision: If packet loss exists, stop blaming Windows Backup. Fix switching, Wi‑Fi, cabling, or overloaded links.
Task 3: Confirm TCP 445 is reachable (SMB)
cr0x@server:~$ nc -vz 10.20.30.40 445
Connection to 10.20.30.40 445 port [tcp/microsoft-ds] succeeded!
Output meaning: Port is open. If it times out or is refused, you have firewall or routing issues.
Decision: If blocked, fix network policy before touching credentials or share permissions.
Task 4: List SMB shares to confirm negotiation works
cr0x@server:~$ smbclient -L //10.20.30.40 -U 'CORP\backup_svc%REDACTED'
Sharename Type Comment
--------- ---- -------
backups Disk Windows backups
IPC$ IPC IPC Service (nas01)
SMB1 disabled -- no workgroup available
Output meaning: Authentication succeeded and the share exists. The “SMB1 disabled” line is fine; it’s a good sign.
Decision: If listing fails with NT_STATUS_LOGON_FAILURE, you’re in credential/identity territory, not storage.
Task 5: Validate you can write and read a test file on the share
cr0x@server:~$ smbclient //10.20.30.40/backups -U 'CORP\backup_svc%REDACTED' -c 'mkdir _smb_test; cd _smb_test; put /etc/hosts hosts.txt; ls; get hosts.txt /tmp/hosts.txt; del hosts.txt; cd ..; rmdir _smb_test'
hosts.txt A 173 Tue Feb 4 12:10:11 2026
getting file \_smb_test\hosts.txt of size 173 as /tmp/hosts.txt (28.2 KiloBytes/sec) (average 28.2 KiloBytes/sec)
Output meaning: Create/write/read/delete succeeded. If delete fails but write succeeds, your “append-only-ish” permissions may be working (or misconfigured).
Decision: Decide whether backups should be allowed to delete. If you use /MIR in Robocopy, deletion permission matters a lot.
Task 6: Measure raw throughput with a large file (quick reality check)
cr0x@server:~$ dd if=/dev/zero of=/tmp/backup_test.bin bs=16M count=256 status=progress
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 2 s, 2.1 GB/s
cr0x@server:~$ time smbclient //10.20.30.40/backups -U 'CORP\backup_svc%REDACTED' -c 'put /tmp/backup_test.bin _perf/backup_test.bin'
real 0m39.214s
user 0m0.081s
sys 0m0.051s
Output meaning: If a 4 GiB file takes ~39s, you’re around ~100–120 MB/s effective, which is typical for 1GbE once overhead shows up. If it’s 10 MB/s, something is wrong.
Decision: If throughput is low, you’ll hunt network (duplex, MTU), NAS CPU (signing/encryption), or disk write speed.
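The arithmetic behind that judgment, as a quick helper (integer MB/s from bytes and elapsed seconds):

```shell
# Effective throughput in MB/s from bytes transferred and elapsed seconds.
mb_per_s() { echo $(( $1 / $2 / 1000000 )); }

mb_per_s 4294967296 39    # prints 110 -- the Task 6 numbers, about 1GbE line rate
mb_per_s 4294967296 430   # prints 9  -- this would mean something is wrong
```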
Task 7: Check the NAS disk health quickly (SMART summary)
cr0x@server:~$ sudo smartctl -H /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.15.0] (local build)
SMART overall-health self-assessment test result: PASSED
Output meaning: “PASSED” is necessary, not sufficient. If it fails, stop: you’re backing up to a dying target.
Decision: Replace failing disks, then re-check array resilver/rebuild status before trusting backups.
Task 8: Inspect NAS capacity and inode pressure
cr0x@server:~$ df -h /mnt/backups
Filesystem Size Used Avail Use% Mounted on
tank/backups 40T 31T 9.0T 78% /mnt/backups
cr0x@server:~$ df -i /mnt/backups
Filesystem Inodes IUsed IFree IUse% Mounted on
tank/backups 20000000 8400000 11600000 42% /mnt/backups
Output meaning: Plenty of space and inodes. If inode usage is near 100%, small-file backups will fail even with free TBs remaining.
Decision: If inode pressure is high, adjust retention, consolidate small files (archive them), or use a different backup approach for that dataset.
Task 9: Detect SMB session drops by watching logs (NAS syslog)
cr0x@server:~$ sudo tail -n 20 /var/log/samba/log.smbd
[2026/02/04 12:03:09.120145, 1] ../../source3/smbd/service.c:787(make_connection_snum)
make_connection_snum: Connect path is '/mnt/backups' for service [backups]
[2026/02/04 12:05:44.772903, 1] ../../source3/smbd/sesssetup.c:157(auth3_check_password)
check_ntlm_password: Authentication for user [CORP]\[backup_svc] -> [backup_svc] FAILED with error NT_STATUS_WRONG_PASSWORD
Output meaning: You have authentication failures midstream. That’s not “random.” It’s credential rotation, cached credentials, or multiple tasks using different passwords.
Decision: Fix the account management: either coordinate rotation with scheduled tasks or use a gMSA (if you’re domain-based and the client supports it).
Task 10: Confirm the Windows host can resolve and reach the NAS (from a Linux jump host, check ARP and routing)
cr0x@server:~$ ip route get 10.20.30.40
10.20.30.40 via 10.20.30.1 dev eth0 src 10.20.30.50 uid 1000
cache
Output meaning: The route is clear. If traffic hairpins through a firewall unexpectedly, you’ll see it here.
Decision: If routing is odd, fix it—SMB hates asymmetric routing and stateful firewall surprises.
Task 11: Measure path MTU issues (blackhole detection)
cr0x@server:~$ ping -c 3 -M do -s 1472 10.20.30.40
PING 10.20.30.40 (10.20.30.40) 1472(1500) bytes of data.
1480 bytes from 10.20.30.40: icmp_seq=1 ttl=64 time=0.605 ms
1480 bytes from 10.20.30.40: icmp_seq=2 ttl=64 time=0.598 ms
1480 bytes from 10.20.30.40: icmp_seq=3 ttl=64 time=0.602 ms
--- 10.20.30.40 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2041ms
Output meaning: Standard MTU works without fragmentation. If this fails on a network where jumbo frames are “enabled,” you may have inconsistent MTU causing stalls.
Decision: Either standardize MTU end-to-end or disable jumbo frames. Half-configured jumbo is a classic “random throughput” generator.
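The 1472 in Task 11 is not arbitrary: an ICMP echo payload of N bytes occupies N + 8 (ICMP header) + 20 (IPv4 header) bytes on the wire. A helper for picking probe sizes:

```shell
# Wire size of a ping with a payload of $1 bytes (IPv4, no IP options).
wire_size() { echo $(( $1 + 28 )); }

wire_size 1472   # prints 1500 -- fills a standard MTU exactly
wire_size 8972   # prints 9000 -- the probe size to verify jumbo frames
```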
Task 12: Verify the NAS is not CPU-bound during SMB transfers
cr0x@server:~$ top -b -n 1 | head -n 12
top - 12:11:26 up 31 days, 4:22, 2 users, load average: 6.21, 6.02, 5.44
Tasks: 214 total, 1 running, 213 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.4 us, 3.1 sy, 0.0 ni, 82.9 id, 0.2 wa, 0.0 hi, 1.4 si, 0.0 st
MiB Mem : 64384.0 total, 8120.5 free, 10212.0 used, 46051.5 buff/cache
MiB Swap: 2048.0 total, 2048.0 free, 0.0 used. 53500.2 avail Mem
Output meaning: CPU is mostly idle. If you see high system CPU during transfers (especially with SMB encryption/signing), the NAS CPU becomes the bottleneck.
Decision: If CPU-bound, consider disabling SMB encryption on trusted LANs, upgrading NAS CPU, or moving to 10GbE only if the CPU can keep up.
Task 13: Confirm ZFS pool/dataset health (if applicable)
cr0x@server:~$ zpool status
pool: tank
state: ONLINE
status: Some supported features are not enabled on the pool.
action: Enable all features using 'zpool upgrade'. Once this is done, the pool may no longer be accessible by software that does not support the features.
scan: scrub repaired 0B in 09:12:33 with 0 errors on Sun Feb 2 03:10:22 2026
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
errors: No known data errors
Output meaning: The pool is healthy; scrubs are running; no errors. If you see checksum errors, your backup target may be corrupting data.
Decision: If errors exist, stop treating the NAS as safe. Fix hardware, cabling, HBA firmware, RAM issues, then re-verify backups.
Task 14: List snapshots and confirm retention exists (if applicable)
cr0x@server:~$ zfs list -t snapshot -o name,used,creation -s creation | tail -n 5
tank/backups@auto-2026-02-04_0800 1.2G Tue Feb 4 08:00 2026
tank/backups@auto-2026-02-04_0900 1.1G Tue Feb 4 09:00 2026
tank/backups@auto-2026-02-04_1000 1.3G Tue Feb 4 10:00 2026
tank/backups@auto-2026-02-04_1100 1.2G Tue Feb 4 11:00 2026
tank/backups@auto-2026-02-04_1200 1.4G Tue Feb 4 12:00 2026
Output meaning: Snapshots are happening regularly. If you have no snapshots, you have no “oops buffer” against ransomware or accidental /MIR disasters.
Decision: If snapshots are missing, implement them before you declare the backup system “done.”
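If you need to audit retention from names alone, the date embedded in `auto-YYYY-MM-DD_HHMM` snapshot names sorts lexically, so a cutoff comparison works on strings. A sketch: the cutoff is hard-coded for illustration (derive it from `date -d "-7 days" +%Y-%m-%d` in a real policy), and destruction is deliberately not shown.

```shell
# expired_snaps: read snapshot names on stdin, print those older than $1.
expired_snaps() {
  local cutoff=$1 snap d
  while read -r snap; do
    d=${snap##*@auto-}; d=${d%%_*}       # extract YYYY-MM-DD from the name
    if [[ "$d" < "$cutoff" ]]; then echo "expired: $snap"; fi
  done
}

# Illustrative cutoff standing in for "today minus retention".
expired_snaps "2026-02-01" <<'EOF'
tank/backups@auto-2026-01-28_0800
tank/backups@auto-2026-02-04_0800
EOF
```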
Task 15: Check SMB server config for weak protocol fallback (Samba)
cr0x@server:~$ testparm -s | egrep -i 'server min protocol|server max protocol|smb encrypt|server signing'
server min protocol = SMB2_10
server max protocol = SMB3_11
server signing = mandatory
smb encrypt = disabled
Output meaning: You’re not allowing SMB1. Signing is mandatory. Encryption is off (fine on trusted LAN; not fine on hostile networks).
Decision: If min protocol is SMB1, change it. If encryption is needed, enable it and re-measure CPU and throughput.
Task 16: Confirm the share path isn’t on a slow filesystem tier
cr0x@server:~$ mount | grep backups
tank/backups on /mnt/backups type zfs (rw,xattr,noacl)
Output meaning: You’re on the intended dataset. If you find backups landing on a small OS volume or a cache device by accident, you’ve found the “randomly ran out of space” culprit.
Decision: Ensure share paths map to the dataset/volume with the right quotas and snapshot policy.
Fast diagnosis playbook
This is the “someone is waiting on a call” version. The goal is to identify the bottleneck in minutes, not hours.
First: Is it name resolution, routing, or port access?
- Check DNS resolution for the NAS name (Task 1).
- Ping and look for loss/latency spikes (Task 2).
- Confirm TCP 445 is reachable (Task 3).
If any of these fail: it’s network/infrastructure. Don’t waste time rotating passwords or tuning Robocopy.
Second: Is it authentication or authorization?
- List shares and authenticate (Task 4).
- Write/read a test file (Task 5).
- Check NAS Samba logs for logon failures (Task 9).
If you can list shares but can’t write: permissions. If you can write sometimes, but not after password changes: credential hygiene.
Third: Is it performance (NAS CPU, disk, or network MTU)?
- Quick throughput test with a big file (Task 6).
- MTU sanity check (Task 11).
- NAS CPU load during transfer (Task 12).
- Disk health and pool status (Tasks 7 and 13).
- Capacity/inodes (Task 8).
Rule of thumb: If big-file throughput is fine but backups are slow, the problem is metadata/small files, antivirus scanning, or too many concurrent jobs. If big-file throughput is bad, it’s the transport or the NAS write path.
Fourth: Is it scheduling and overlap?
This is the stealth failure. Jobs run long, the next run starts, sessions collide, locks increase, and you get partial backups. Audit schedules and ensure only one job per host is active at a time unless you’ve proven concurrency works.
Common mistakes: symptom → root cause → fix
1) Symptom: “It fails with access denied, but only on some folders”
Root cause: Backup account doesn’t have rights to protected paths, or UAC/token filtering changes behavior between interactive and scheduled runs.
Fix: Use a dedicated service account; explicitly grant it read rights to required trees; test using the same identity the scheduled task uses. Avoid “it works when I run it as admin.” That’s not a test; that’s a confession.
2) Symptom: “Robocopy says success, but files are missing”
Root cause: You ignored Robocopy exit codes and summary lines; skipped files due to path length, junctions, locked files, or exclusions.
Fix: Parse exit codes; fail the job on skipped/failed counts; enable logging; decide how to handle junctions (/XJ) and long paths (enable long paths in Windows policy where applicable).
3) Symptom: “Backup to NAS is randomly slow, sometimes fine”
Root cause: MTU mismatch, Wi‑Fi links, power-saving NIC settings, SMB signing/encryption CPU saturation, or concurrent jobs causing NAS disk contention.
Fix: Standardize MTU; use wired links for servers; measure NAS CPU; stagger schedules; cap Robocopy threads; consider separate NIC/VLAN for backups.
4) Symptom: “It worked for months, then started failing after password rotation”
Root cause: Scheduled task stored old credentials; NAS has cached session tokens; multiple tasks use different secrets.
Fix: Adopt gMSA where possible; otherwise coordinate credential rotation and update all scheduled tasks in one change. Verify by checking NAS auth failure logs.
5) Symptom: “NAS has space, but backups fail with ‘no space left’”
Root cause: Snapshot reserve, quotas, or inode exhaustion. Or the share points to a smaller volume than you think.
Fix: Check df -h and df -i (Task 8), quotas, and the share’s backing path (Task 16). Adjust retention; grow the dataset; stop pretending 90% full is “plenty.”
6) Symptom: “WSB to network share keeps only one version / overwrites weirdly”
Root cause: WSB behavior for remote shares differs from dedicated disks; it manages versions differently and may not maintain the history you expect.
Fix: Use WSB to a dedicated disk or iSCSI target if you need proper versioning, or wrap with NAS snapshots so versions exist at the storage layer.
7) Symptom: “Backups stop mid-transfer, then resume, then corrupt”
Root cause: Unstable network, SMB session drops, or flaky NIC/driver offloads. Sometimes “helpful” firewall inspection resets sessions.
Fix: Check loss and logs (Tasks 2 and 9). Disable problematic NIC offloads on Windows (test), and remove stateful middleboxes from the backup path where feasible.
8) Symptom: “Ransomware encrypted the backups on the NAS too”
Root cause: Backup credentials had delete/modify rights; backups were just another writable share; no immutable snapshot policy.
Fix: Implement snapshot retention and lock/immutability if available; use separate credentials; restrict delete; segment access; add offline/offsite copy.
Checklists / step-by-step plan
Phase 1: Build a NAS target that behaves
- Create a dedicated dataset/volume for backups (separate from user shares).
- Enable regular snapshots (hourly + daily is common; tune to business needs).
- Create a dedicated SMB share (e.g., backups) pointing only to that dataset.
- Disable SMB1; require SMB2+.
- Decide on signing/encryption based on your security baseline; measure CPU headroom.
- Create a dedicated backup identity (CORP\backup_svc or a local NAS user).
- Set permissions so the account can write to its target but can’t casually erase history.
- Set quotas/alerts so “NAS full” becomes a ticket before it becomes a failure.
Phase 2: Build a Windows job that doesn’t lie
- Pick a backup method per workload:
- Robocopy for file-level (good default).
- WSB for system state/bare metal (server use cases).
- Write a script that logs to a stable local path and to the NAS (if reachable).
- Make the job fail on non-zero/undesired exit codes, not on vibes.
- Schedule it with Task Scheduler under a dedicated identity.
- Stagger schedules across hosts to avoid NAS stampedes at midnight.
- Run a restore drill on a non-production host. Time it. Document it.
Phase 3: Operationalize it (the part people skip)
- Ship logs somewhere central (even if it’s just another share or a log collector).
- Alert on failure, but also alert on “job didn’t run” and “job took 2x longer.”
- Monthly: verify snapshots exist and retention is as expected.
- Quarterly: restore drill for at least one representative host.
- After any NAS firmware update: re-run throughput and auth tests.
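The “took 2x longer” alert is worth sketching because almost nobody builds it. Comparing a run’s duration against a stored baseline is a few lines; the numbers and the state-file idea are illustrative, not tied to any particular scheduler.

```shell
# runtime_guard: alert when a job's duration exceeds 2x its baseline.
# In practice, persist the baseline in a small state file per job.
runtime_guard() {
  local baseline=$1 last=$2
  if (( last > 2 * baseline )); then
    echo "ALERT: job took ${last}s (baseline ${baseline}s)"
  else
    echo "ok: ${last}s"
  fi
}

runtime_guard 3600 3900   # prints "ok: 3900s"
runtime_guard 3600 7300   # prints "ALERT: job took 7300s (baseline 3600s)"
```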
Three corporate mini-stories from the trenches
Incident caused by a wrong assumption: “It’s a file share, so it’s fine”
A mid-sized company moved from USB backup disks to a shiny NAS. The plan was simple: create one share, grant “Domain Admins” full rights, and point every Windows server’s scheduled Robocopy at it. They called it centralized backups. It was centralized, all right.
The wrong assumption was subtle: they assumed SMB semantics are uniform across devices and that “full rights” prevents permission problems. In reality, several servers had long paths and junctions into application caches. Robocopy skipped some paths, logged warnings, returned a code that wasn’t “0,” and the wrapper script still emailed “SUCCESS” because it only checked whether a log file existed.
The failure showed up during a restore request: one server’s “backup” didn’t include the one directory that mattered. Not because the NAS was down, but because the copy had been silently incomplete for months.
The fix wasn’t exotic. They wrote a real exit-code gate, turned junction handling into an explicit policy (/XJ in some places, include where needed), enabled long paths where appropriate, and added a weekly sampling restore test. The NAS didn’t change. The truthiness of their process did.
Optimization that backfired: “Let’s enable all the speed features”
Another org had performance complaints: backups were running into business hours. Someone toggled jumbo frames on the NAS NICs and enabled SMB encryption “for compliance,” figuring modern hardware could handle it.
It worked in a quick test: a single large file copied fast from one host. Then the nightly backups started. Under concurrency, the NAS CPU spiked on encryption and signing, and one intermediate switch had jumbo frames misconfigured. Some clients would stall, reconnect, then continue—except the reconnect behavior wasn’t consistent across Windows versions and NIC drivers.
The result looked like randomness: some hosts succeeded, some failed, some ran 8 hours instead of 2. The ticket volume climbed. People began debating which phase of the moon was best for backups.
The eventual solution was dull: set MTU back to 1500 everywhere (until they could guarantee end-to-end jumbo), keep signing per policy, selectively disable encryption on the trusted backup VLAN, and cap concurrency. Backups finished before dawn again. Performance came from consistency, not hero toggles.
Boring but correct practice that saved the day: snapshots + least privilege
A third environment had done two “unsexy” things from the start: the backup share lived on its own dataset with hourly snapshots, and the backup account could write new data but didn’t have broad delete rights. It wasn’t perfect immutability, but it was meaningfully harder to ruin history.
They still got hit by ransomware on a workstation that had access to some shared drives. The malware tried to traverse the network and encrypt anything writable. It reached the NAS, found the backup share, and did damage—limited damage.
What saved them wasn’t a magical endpoint product. It was that last night’s snapshot was intact, and the malware couldn’t easily delete the snapshot history. The restore path was clear: roll back the affected backup dataset to a known-good snapshot, then restore clients.
The postmortem was almost boring, which is the highest compliment in operations. Their “extra” snapshot and permission work turned a potentially catastrophic restore into a long weekend of routine, documented steps.
FAQ
1) Should I use Windows built-in “Backup and Restore (Windows 7)” to a NAS?
You can, but it’s legacy and quirky. For modern setups, prefer Robocopy for file-level, and Windows Server Backup for server imaging/system state if you specifically need that workflow.
2) Is SMB signing required, and will it slow backups?
Many corporate baselines require it. Yes, it can reduce throughput on weaker NAS CPUs or with many parallel streams. Measure before and after, and watch NAS CPU during transfers.
3) Should I enable SMB encryption?
If the backup traffic crosses untrusted networks or shared infrastructure you don’t control, encryption is sensible. On a dedicated, trusted backup VLAN inside a data center, it may be optional. If you enable it, re-test throughput and CPU headroom.
4) Do I need VSS if I’m just copying files?
If you copy databases, PST files, or anything that stays open and changes during backup, you need an application-aware approach. Robocopy alone can copy inconsistent versions of live files. Use VSS-aware tools or application-native backup methods for those workloads.
5) Should backups use /MIR in Robocopy?
Only when you are absolutely sure about the target path and you have snapshot protection on the NAS. /MIR will delete files on the destination that are not on the source. Point it wrong once and you’ll learn what adrenaline tastes like.
6) How do I keep multiple versions if Robocopy mirrors changes?
Use NAS snapshots for versioning, or implement rotation into folder structures (date-based sets) and prune with a policy. Snapshots are usually the cleanest “versions” layer for a NAS.
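If you take the folder-rotation route instead of snapshots, date-named set folders sort naturally, which makes pruning trivial. A sketch using a temp directory as a stand-in for the share; real deletion is left out, and this only lists the sets that fall outside retention.

```shell
# Date-named backup sets sort lexically; keep only the newest $KEEP.
DEST=$(mktemp -d)                        # stand-in for a per-host share folder
for d in 2026-01-01 2026-01-02 2026-01-03 2026-01-04 2026-01-05; do
  mkdir "$DEST/$d"
done
KEEP=3
ls -1 "$DEST" | sort | head -n -"$KEEP"  # sets beyond retention (GNU head)
```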
7) How do I know backups are restorable?
By restoring. At minimum: pick one host per quarter, restore a representative dataset to an isolated location, and verify application start or file integrity. Logs are not evidence of restorability; restores are.
8) What’s the best way to protect NAS backups from ransomware?
Layer it: least-privilege backup credentials, snapshots with retention (and immutability/locking if supported), network segmentation, and at least one offline/offsite copy. A writable SMB share alone is not protection.
9) My backups are slow only on small files. Why?
Small files are metadata-heavy: lots of directory lookups, ACL checks, and SMB round trips. Improve with faster disks/SSDs for metadata, more RAM, fewer concurrent jobs, and realistic expectations. Also exclude irrelevant caches and build outputs from backups.
10) Can I use a single share for all servers?
You can, but you shouldn’t unless you have strict subfolder permissions and quotas. One share tends to become a junk drawer with retention fights and accidental overwrites. Separate by host or by environment at least.
Next steps that make this stick
- Pick your backup method per workload: Robocopy for file-level, WSB for system state/bare metal needs.
- Build the NAS target correctly: dedicated dataset, snapshots, SMB2/3 only, sane permissions.
- Standardize identity: one backup service account, tested in the same context Task Scheduler uses.
- Instrument and alert: exit codes, log parsing, “didn’t run,” and “took too long.”
- Do a restore drill and write down the steps while it’s still fresh and mildly embarrassing.
If you do only one thing this week: implement snapshots and verify a restore. Everything else is optimization; those two are survival.