Compare Two Folders: Detect Missing/Changed Files Instantly

Somewhere, right now, a backup job is “green” while quietly omitting files. Another team is shipping a release built from a folder that “should match prod.” And someone is about to copy a terabyte twice because they don’t trust the first copy.

Comparing two folders sounds like a basic task. In production it’s a trap door. The right method depends on what you’re proving: presence, byte-for-byte identity, or “close enough” for a deploy. Choose wrong and you get either false confidence or an expensive, slow answer that arrives after the incident.

What you’re actually proving (and why it matters)

“Compare two folders” is not one problem. It’s at least five. If you don’t say which one you mean, you’ll default to something slow (hash everything) or misleading (trust file sizes). Here’s the taxonomy I use on call.

1) Existence: are all the expected files present?

This is the “missing files” problem: a partial copy, a failed backup, an incomplete artifact sync. You care about filenames, and maybe directory structure. You usually don’t care about file contents yet.

Good tools: rsync --dry-run, comm on sorted file lists, find manifests.

2) Metadata equality: do timestamps, permissions, ownership match?

This matters when you’re restoring systems, migrating home directories, or moving data between NFS servers where modes and owners are the “real data.” If permissions differ, your app fails in ways that look like “random outages.” They aren’t random.

Good tools: rsync -a --dry-run, stat, getfacl if ACLs matter.

3) Content equality: are bytes identical?

This is integrity verification. You’re proving that two files are the same, even if timestamps or file names lie. It’s slower, but it’s the only thing that shuts down arguments after a restore.

Good tools: checksums (SHA-256), rsync --checksum (with care), content sampling when you’re time-boxed.

4) Logical equality: do they produce the same result?

For builds, containers, and datasets, you might not care about byte-for-byte equality. You care whether the application behaves the same. That’s a different proof. Folder compare can still help, but don’t confuse “looks the same” with “is the same.”

5) Consistency under churn: are you comparing a moving target?

Comparing a folder while files are being written is like measuring a fish while it’s escaping. Sometimes you can do it (snapshots, quiescing, immutable builds). Sometimes you can’t, and the right answer is “stop the world, then compare.”

One operational quote to keep handy:

Paraphrased idea from Gene Kim: high-performing ops teams shorten feedback loops and make work visible, so problems surface before they become incidents.

Folder comparison is visibility. Not glamorous. But it’s the difference between “we think it copied” and “we can prove it copied.”

Fast methods first: picking the right tool

Rule #1: don’t hash a petabyte because you’re nervous

Checksums are the gold standard, but they’re also the “read every byte” tax. If your only question is “did we miss anything?”, hashing is self-harm. Start with a manifest of relative paths and sizes; then escalate.

Rule #2: rsync --dry-run is the most useful “compare folders” tool on Earth

It speaks the language of operations: “What would change if I synced this?” That’s a diff with consequences. Use it locally, over SSH, across mount points. When the pressure is on, it gives you an actionable list.

Rule #3: timestamps are often lies

Some copy tools preserve mtimes; others don't. Filesystems round mtimes differently. Time zones change and clocks drift. If you compare mtimes, treat the results as a hint, not a verdict.

Rule #4: permissions and ACLs are data

If you’re restoring a service, wrong ownership is “data loss” with better marketing. If your compare ignores ACLs, you will ship a problem and call it a successful restore.

Rule #5: if the dataset is live, use snapshots (or accept uncertainty)

On ZFS, snapshots make this easy. On LVM, you can do it too. On cloud object storage, you’ll need versioned manifests. Without a stable point-in-time view, your compare will show changes that are just today’s writes.
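
A minimal sketch on ZFS, assuming the dataset is tank/data mounted at /data/src (the names used in Task 14); adjust for your pool. Snapshot contents are reachable under the mountpoint's .zfs/snapshot directory:

cr0x@server:~$ zfs snapshot tank/data@verify
cr0x@server:~$ rsync -a --dry-run --itemize-changes /data/src/.zfs/snapshot/verify/ /data/dst/

The snapshot gives rsync a frozen view of the source, so every difference it reports is real drift, not an in-flight write. Destroy the snapshot when you're done (zfs destroy tank/data@verify).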

Joke #1: Hashing everything “just to be safe” is like weighing your car every morning to check the fuel gauge. Accurate, yes. Necessary, no.

Practical tasks (commands, outputs, decisions)

Below are production-grade tasks. Each includes a command, a realistic output snippet, and the decision you make from it. Run these with a clear goal: find missing files, changed files, metadata drift, or integrity problems.

Task 1: Fast missing-file detection with rsync (local-to-local)

cr0x@server:~$ rsync -a --dry-run --itemize-changes /data/src/ /data/dst/
sending incremental file list
>f+++++++++ reports/2026-01.csv
>f..t...... images/logo.png

sent 1,204 bytes  received 88 bytes  2,584.00 bytes/sec
total size is 9,812,441,102  speedup is 7,589,904.72 (DRY RUN)

What it means: >f+++++++++ is a file that would be created in the destination (missing there). >f..t...... means the timestamp differs (content may or may not differ). *deleting lines flag extra files in the destination; they only appear if you add delete flags (see the next task).

Decision: If you’re verifying a copy, missing files (+++++++++) are a stop-the-line issue. Timestamp-only differences require further checks (size/hash) depending on your risk tolerance.

Task 2: Detect extras in destination (safely) with rsync

cr0x@server:~$ rsync -a --dry-run --delete --itemize-changes /data/src/ /data/dst/
sending incremental file list
*deleting   tmp/debug.dump
*deleting   cache/.DS_Store

sent 1,112 bytes  received 64 bytes  2,352.00 bytes/sec
total size is 9,812,441,102  speedup is 8,342,119.31 (DRY RUN)

What it means: With --delete, rsync will propose deleting files in destination that aren’t in source. In dry-run, it only reports.

Decision: If destination is supposed to be a mirror (backups, replicas), these extras are drift. If destination is an archive, you probably do not want delete semantics.

Task 3: Compare two directories quickly with diff -rq

cr0x@server:~$ diff -rq /data/src /data/dst | head
Only in /data/src/reports: 2026-01.csv
Files /data/src/images/logo.png and /data/dst/images/logo.png differ
Only in /data/dst/tmp: debug.dump

What it means: Missing files and “differ” lines. diff compares actual content, which means it reads every byte of files present on both sides; on huge trees that can be far slower than rsync’s metadata-based detection.

Decision: Use this when you need a straightforward “same/different” report and the dataset is moderate. If it’s multi-terabyte, prefer rsync manifests and targeted hashing.

Task 4: Build a stable manifest of paths and sizes (cheap, usually enough)

cr0x@server:~$ cd /data/src
cr0x@server:~$ find . -type f -printf '%P\t%s\n' | sort > /tmp/src.pathsizes
cr0x@server:~$ wc -l /tmp/src.pathsizes
84217 /tmp/src.pathsizes

What it means: A line per file: relative path and size. Sorting makes it comparable.

Decision: If src and dst manifests match exactly, you’ve proven “same files, same sizes.” Not byte-identical, but strong evidence for copies where corruption is unlikely and time is tight.

Task 5: Compare manifests with comm to find missing files

cr0x@server:~$ cd /data/dst
cr0x@server:~$ find . -type f -printf '%P\t%s\n' | sort > /tmp/dst.pathsizes
cr0x@server:~$ comm -3 /tmp/src.pathsizes /tmp/dst.pathsizes | head
reports/2026-01.csv	12044
	tmp/debug.dump	4096

What it means: Lines prefixed with a tab exist only in the second file; lines without leading tab exist only in the first. Here: report missing in dst; debug.dump extra in dst.

Decision: Missing files: recopy or resync. Extra files: decide whether dst should be an exact mirror; if yes, delete or rebuild dst from source-of-truth.

Task 6: Identify “same path, different size” quickly

cr0x@server:~$ join -t $'\t' -j 1 /tmp/src.pathsizes /tmp/dst.pathsizes \
  | awk -F'\t' '$2 != $3 {print $1, $2, $3}' | head
images/logo.png 18432 18312
db/seed.sql 901122 901114

What it means: Same relative path exists in both, but file sizes differ.

Decision: Treat as changed content. If it’s supposed to be identical, resync those paths and then hash them to confirm.
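
A follow-up sketch, reusing the manifests from Tasks 4-5: save the size-different paths to a file and hand it to rsync, which accepts a list of relative paths via --files-from (line-based lists break on paths containing newlines):

cr0x@server:~$ join -t $'\t' /tmp/src.pathsizes /tmp/dst.pathsizes | awk -F'\t' '$2 != $3 {print $1}' > /tmp/changed.paths
cr0x@server:~$ rsync -a --files-from=/tmp/changed.paths /data/src/ /data/dst/

This resyncs only the suspect paths instead of rescanning the whole tree, then you can hash just those paths to confirm.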

Task 7: Targeted hashing of only suspicious files (fast escalation)

cr0x@server:~$ sha256sum /data/src/images/logo.png /data/dst/images/logo.png
a1c3d2d19c9f7f0c2a2a0ddc7f6d4b2e9f1b2d3c4a5b6c7d8e9f001122334455  /data/src/images/logo.png
b88f2bd42a9e0f1c9d8e7f6a5b4c3d2e1f0a9b8c7d6e5f4a3b2c1d0e9f8a7b6c  /data/dst/images/logo.png

What it means: Different hashes: different bytes, full stop.

Decision: If dst is a backup/replica, investigate transfer/storage corruption or an unsynchronized update. If this is a deploy artifact, stop and rebuild from a single source-of-truth.

Task 8: Full-tree checksum manifest (slow, definitive)

cr0x@server:~$ cd /data/src
cr0x@server:~$ find . -type f -print0 | sort -z | xargs -0 sha256sum > /tmp/src.sha256
cr0x@server:~$ head -3 /tmp/src.sha256
9f2a...  ./bin/app
44c1...  ./conf/app.yaml
e11b...  ./images/logo.png

What it means: Stable, sorted checksum manifest for the entire tree. Sorting with null separators avoids path weirdness and ordering drift.

Decision: Use this when you need court-grade proof (restores, compliance, artifact promotion). Accept that it’s I/O heavy; schedule it, don’t surprise prod.

Task 9: Verify destination matches a checksum manifest

cr0x@server:~$ cd /data/dst
cr0x@server:~$ sha256sum -c /tmp/src.sha256 | head
./bin/app: OK
./conf/app.yaml: OK
./images/logo.png: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match

What it means: OK means exact match. FAILED means content differs. If it says “No such file,” the file is missing.

Decision: Any FAILED or missing is a hard failure for backup verification. Recopy those files; if failures persist, suspect storage/media problems or silent corruption.

Task 10: Compare folders over SSH with rsync (dry-run)

cr0x@server:~$ rsync -a --dry-run --itemize-changes -e ssh /data/src/ backup01:/data/dst/
sending incremental file list
>f+++++++++ reports/2026-01.csv
>f..t...... images/logo.png

sent 1,882 bytes  received 112 bytes  3,988.00 bytes/sec
total size is 9,812,441,102  speedup is 4,920,184.83 (DRY RUN)

What it means: Same semantics, but now you’re validating remote state. Note: rsync compares metadata by default, not checksums.

Decision: Use as a first-pass remote audit. If anything surprising shows up, do targeted hashing over SSH for the suspect files (or build a manifest on each side).
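
One way to hash a suspect on both ends without pulling file data across the network, assuming SSH access to backup01:

cr0x@server:~$ sha256sum /data/src/images/logo.png
cr0x@server:~$ ssh backup01 sha256sum /data/dst/images/logo.png

Only the hash strings cross the wire. If the two lines disagree, treat it exactly like the mismatch in Task 7.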

Task 11: Catch permission drift (the “it exists but doesn’t work” class)

cr0x@server:~$ stat -c '%a %U:%G %n' /data/src/conf/app.yaml /data/dst/conf/app.yaml
640 app:app /data/src/conf/app.yaml
600 app:app /data/dst/conf/app.yaml

What it means: Same owner, different modes. That can break reads, writes, or config management expectations.

Decision: If permissions must match, fix your copy method: rsync -a (and run as root when ownership matters), or preserve ACLs/xattrs if your environment uses them.

Task 12: Compare extended attributes (xattrs) when they matter

cr0x@server:~$ getfattr -d -m - /data/src/bin/app 2>/dev/null
# file: data/src/bin/app
security.capability=0sAQAAAgAgAAAAAAAAAAAAAAAAAAA=

cr0x@server:~$ getfattr -d -m - /data/dst/bin/app 2>/dev/null
# file: data/dst/bin/app

What it means: Source has a Linux capability xattr; destination does not. The binary may fail to bind to low ports or behave differently.

Decision: Use rsync -aX (xattrs) and possibly -A (ACLs). If you’re on filesystems that drop xattrs, stop pretending it’s a faithful copy.

Task 13: Find files that changed recently (triage during an incident)

cr0x@server:~$ find /data/src -type f -mmin -60 -printf '%TY-%Tm-%Td %TH:%TM %p\n' | head
2026-02-05 09:12 /data/src/reports/2026-02.csv
2026-02-05 09:21 /data/src/tmp/run.log

What it means: Files modified in the last hour. Great for “why does compare keep changing?” questions.

Decision: If your tree is actively changing, stop comparing it live. Use snapshots, freeze writes, or compare only immutable subtrees (like release artifacts).

Task 14: Quick “are these directories on different filesystems with different semantics?” check

cr0x@server:~$ df -T /data/src /data/dst
Filesystem     Type  1K-blocks       Used  Available Use% Mounted on
tank/data      zfs   9767548928  512334112 9255214816   6% /data/src
backup:/data   nfs4  9767548928  515220480 9252328448   6% /data/dst

What it means: Source is ZFS; destination is NFS. Expect different timestamp precision, ownership mapping weirdness, and attribute caching.

Decision: If compare results are “noisy,” validate with checksums for critical files and tune NFS mount options if necessary. Also: do not assume POSIX features behave identically across these.

Task 15: Measure the cost of your compare method (don’t guess)

cr0x@server:~$ /usr/bin/time -v rsync -a --dry-run /data/src/ /data/dst/ >/dev/null
	Command being timed: "rsync -a --dry-run /data/src/ /data/dst/"
	User time (seconds): 0.82
	System time (seconds): 4.11
	Percent of CPU this job got: 62%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.96
	File system inputs: 126544
	File system outputs: 0

What it means: Lots of filesystem inputs even for dry-run. That’s metadata I/O. On cold caches, this can slam your storage.

Decision: If metadata scans are bottlenecking prod, schedule compares off-peak, warm caches carefully, or keep a rolling manifest updated incrementally.
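
A sketch of the incremental-manifest idea, using a timestamp file as a watermark (the stamp path is illustrative):

cr0x@server:~$ find /data/src -type f -newer /tmp/manifest.stamp -printf '%P\t%s\n' | sort > /tmp/src.delta
cr0x@server:~$ touch /tmp/manifest.stamp

Only files modified since the last run get rescanned; merge /tmp/src.delta into the previous manifest instead of walking the whole tree every time. This misses deletions, so schedule an occasional full scan to true things up.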

Task 16: When you absolutely need byte-level compare for a single huge file

cr0x@server:~$ cmp -n 1048576 /data/src/big.img /data/dst/big.img && echo "first 1MiB matches"
first 1MiB matches

What it means: The first 1 MiB matches. This is not full verification, but it’s a fast sanity check when you’re triaging.

Decision: Use as a quick smoke test, then follow up with full checksums if the file is critical.

Joke #2: Nothing builds team unity like a missing file discovered five minutes before a demo.

Fast diagnosis playbook

When “folder compare is slow” or “results don’t make sense,” don’t thrash. Diagnose like an SRE: isolate the bottleneck, then pick the least invasive fix.

First: confirm the problem type

  • Missing files? Use rsync -a --dry-run or path manifests. Don’t hash yet.
  • Changed content? Check size/mtime deltas, then hash only the delta set.
  • Permission/ACL issues? Compare metadata; hashing won’t help.
  • Live churn? Stop comparing moving targets: snapshots or freeze writes.

Second: identify what’s slow (CPU, metadata I/O, network, or disk)

cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0 (server)  02/05/2026  _x86_64_  (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          3.11    0.00    9.44   41.02    0.00   46.43

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   w_await  aqu-sz  %util
nvme0n1        821.0  66240.0     0.0   0.00   47.20    80.70     2.0     64.0    1.50    38.80  99.20

Interpretation: High %iowait and %util near 100%: storage is the bottleneck. Directory scans are punishing your disks.

Action: Stop doing full-tree scans during peak. Use snapshots/manifests, or run compares on a replica/secondary.

cr0x@server:~$ sar -n DEV 1 3 | tail -n +4
Average:        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   %ifutil
Average:         eth0    120.11    115.88   9821.33   9012.47     82.50

Interpretation: Network is busy; remote compares may be constrained by link utilization or latency.

Action: Prefer remote-side manifest generation and compare small text outputs. Avoid reading entire files over the network unless necessary.
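
A sketch of remote-side manifest generation, reusing the Task 4 format; only the sorted text comes back over SSH:

cr0x@server:~$ ssh backup01 "cd /data/dst && find . -type f -printf '%P\t%s\n' | sort" > /tmp/dst.pathsizes
cr0x@server:~$ comm -3 /tmp/src.pathsizes /tmp/dst.pathsizes | head

The manifest is a few megabytes even for millions of files, which beats dragging gigabytes of file data across a busy link.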

Third: verify filesystem and mount semantics

  • NFS attribute caching can make mtimes look “wrong” temporarily.
  • SMB/CIFS may normalize case; macOS copies might create AppleDouble files.
  • Different timestamp precision can produce constant “changed” signals.
cr0x@server:~$ mount | grep -E ' /data/dst | /data/src '
tank/data on /data/src type zfs (rw,xattr,noacl)
backup:/data on /data/dst type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2)

Interpretation: Different platforms. Expect metadata edge cases.

Action: If the goal is integrity, don’t argue with mtimes. Use checksum manifests for critical subsets.

Fourth: reduce scope intelligently

  • Compare only immutable directories (releases, snapshots, daily partitions).
  • Exclude caches and temp folders; they’re entropy generators.
  • Start with “list-only” comparisons and escalate.
cr0x@server:~$ rsync -a --dry-run --itemize-changes --exclude='tmp/' --exclude='cache/' /data/src/ /data/dst/ | head
sending incremental file list
>f+++++++++ reports/2026-01.csv

Interpretation: Signal-to-noise improves immediately.

Action: Bake excludes into your standard compare scripts, but document them so you don’t hide real problems.

Three corporate mini-stories (pain included)

Incident caused by a wrong assumption: “size match means content match”

A team migrated a large report archive between two NAS boxes. They used a quick check: count of files and total bytes. Everything lined up. They declared victory and decommissioned the old system.

Weeks later, auditors asked for a specific set of monthly reports. A handful were unreadable: PDFs opened with errors. The filenames existed, and sizes matched. That’s what made it nasty; nobody suspected corruption.

The root cause was a flaky network path during the original copy. The copy tool retried and “completed,” but a few files were silently truncated and then padded by an application process that touched metadata. Same size. Different bytes. The file count check didn’t catch it; size checks didn’t catch it; timestamp checks didn’t catch it.

The fix was boring: generate a SHA-256 manifest on the source, verify it on the destination, and only then delete the old dataset. They rehydrated the corrupted files from older backups and implemented checksum verification for future migrations.

The lesson: if you’re retiring the source-of-truth, do at least one content-based verification pass. Not necessarily for everything daily—but for migrations, always.

Optimization that backfired: “let’s use --checksum everywhere”

A platform group got tired of debating whether two directories were “really” the same. They wrapped rsync in a script and used --checksum by default. On paper, it solved the trust problem: rsync would hash each file and compare hashes.

In reality, it moved the cost from “sometimes we re-copy too much” to “we read the entire dataset every run.” On a storage cluster with heavy metadata pressure and busy disks, that meant the compare job became a denial-of-service against their own infrastructure. Latency spiked. Backup windows slipped. A few services started timing out against shared storage.

The most annoying part: the script ran as a cron job. So the blast radius was periodic and confusing, like a haunted house but with graphs. They initially blamed the network. Then they blamed the storage vendor. Then someone finally ran iostat during the event and saw read saturation.

The eventual fix was layered verification: use rsync metadata compare for daily sync checks, and run a checksum manifest weekly (or on demand after incidents). They also added scope control: only hash immutable release artifacts and compliance-critical directories.

The lesson: “correct” can still be operationally wrong when applied indiscriminately. Deterministic truth is great; deterministic outages are not.

Boring but correct practice that saved the day: snapshot + manifest

A data engineering team needed to replicate a dataset nightly to a disaster recovery site. The dataset was large, and it changed constantly during business hours. Early attempts to compare live directories produced inconsistent results: files appeared and disappeared mid-scan, and diffs were noisy.

They adopted a strict practice: at 01:00, take a snapshot of the source dataset, replicate that snapshot, then compare the snapshot contents to the replicated snapshot. No “live” comparisons. No arguments about churn.

On a random Tuesday, replication “succeeded” but downstream jobs started failing in DR. Their compare pipeline flagged a small number of missing files in a partition. Because the compare was against a snapshot, it was unambiguous: the files were missing from the replicated view, not “still being written.”

They found the culprit quickly: a misconfigured exclude pattern in the replication job that filtered out a directory name used only for one client. It had been introduced during a cleanup. The compare pipeline caught it immediately, before a disaster forced them to rely on DR.

The lesson: snapshots turn comparison from opinion into math. The practice is boring. That’s why it works.

Interesting facts & history you can actually use

  • Rsync’s “delta transfer” idea dates back to the mid-1990s and made remote synchronization practical over slow links by not re-sending unchanged blocks.
  • Unix diff predates most modern filesystems. It was built for text, but its recursive mode became a blunt instrument for directory trees.
  • MD5 used to be common for integrity checks, but collision attacks made it a poor choice for adversarial contexts. For ops verification, SHA-256 is the sane default.
  • Filesystem timestamp precision varies: ext4 supports nanoseconds; some network filesystems round to 1 second (or worse). Your “changed” signal may be rounding noise.
  • Case sensitivity differs: Linux filesystems are usually case-sensitive; many Windows and default macOS setups are case-insensitive. Two different files can collapse into one during a copy.
  • Hard links complicate “file count” metrics: one inode can have multiple directory entries. Some copy methods duplicate data instead of preserving links unless configured.
  • Extended attributes and ACLs became widely used as systems grew more security-aware; losing them can change execution and access behavior without changing file contents.
  • Silent data corruption is real: modern systems rely on end-to-end checks (filesystems like ZFS, application checksums, or manifest verification) because disks, controllers, and RAM can all lie occasionally.
  • “Backup succeeded” rarely means “restore verified”: operational maturity often shows up as routine verification, not prettier dashboards.

Common mistakes: symptoms → root cause → fix

1) Symptom: compare shows thousands of “changed” files every run

Root cause: timestamp precision mismatch (local ext4 vs NFS/SMB), time skew, or a copy tool that doesn’t preserve mtimes.

Fix: Stop trusting mtimes as a primary signal. Use size manifests first; then hash deltas. If you need metadata fidelity, use rsync -a and ensure both sides support it.

2) Symptom: files exist on destination but app fails with “permission denied”

Root cause: ownership/mode/ACL/xattr not preserved; restore performed as non-root; ACLs dropped by filesystem.

Fix: Use rsync -aAX where appropriate. Validate with stat, getfacl, getfattr. If destination can’t store ACLs/xattrs, change destination or adjust the security model intentionally.

3) Symptom: “Only in source” files come and go during the scan

Root cause: comparing a live directory that is being written, rotated, or cleaned.

Fix: Compare snapshots or stop writes during comparison. For logs and temp paths, exclude them from the compare scope.

4) Symptom: rsync shows differences, but hashing shows files are identical

Root cause: metadata differences (mtime, perms) or timestamp rounding; rsync is telling the truth about metadata, not content.

Fix: Decide what “same” means for this workflow. For backup integrity, content matters most; for system restores, metadata matters too. Configure compare accordingly.

5) Symptom: checksum verification is unbearably slow and impacts production

Root cause: full read of large datasets, cold caches, contention with user I/O; doing it at peak.

Fix: Schedule checksum scans off-peak. Hash only immutable or high-value subsets. Maintain rolling manifests per partition/day rather than scanning the whole tree.

6) Symptom: destination has extra dotfiles and “weird” metadata files

Root cause: copying from macOS (AppleDouble ._ files), Windows metadata, or backup software artifacts.

Fix: Exclude known junk patterns intentionally, but document exclusions. Better: separate “data” from “client OS artifacts” at the source.

7) Symptom: compare results differ depending on who runs it

Root cause: permission differences affect what find can see; some files are unreadable to non-root. Also, NFS root-squash can hide ownership realities.

Fix: Run compares with consistent privileges. For system-level restores, do it as root on both ends (carefully). Capture errors from find and treat them as failures, not noise.

8) Symptom: a file with the same name overwrote another during copy

Root cause: case-insensitive destination filesystem; Readme and README collide.

Fix: Do not copy case-sensitive trees to case-insensitive volumes unless you have a naming policy and validation. Detect collisions by normalizing case in manifests before migration.
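
A quick collision check against the Task 4 manifest: lowercase every path and look for duplicates (the output line is illustrative).

cr0x@server:~$ cut -f1 /tmp/src.pathsizes | tr '[:upper:]' '[:lower:]' | sort | uniq -d
docs/readme.md

Any output means two or more source paths would collapse into one on a case-insensitive destination.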

Checklists / step-by-step plan

Checklist A: “I just need to know what’s missing” (fast, low risk)

  1. Confirm the scope: pick the root directories and ensure trailing slashes are correct for rsync (/src/ vs /src).
  2. Run rsync dry-run itemized compare:
    cr0x@server:~$ rsync -a --dry-run --itemize-changes /data/src/ /data/dst/ | head -50
    sending incremental file list
    >f+++++++++ reports/2026-01.csv
  3. If you need to detect extras in destination, add --delete (still dry-run) and review *deleting lines.
  4. Decide: if the goal is mirror, fix by syncing; if the goal is “backup retains more,” don’t delete.

Checklist B: “I need to prove integrity” (slower, defensible)

  1. Stabilize the dataset: snapshot or pause writes. If you can’t, be honest about uncertainty.
  2. Generate a sorted SHA-256 manifest on source:
    cr0x@server:~$ cd /data/src
    cr0x@server:~$ find . -type f -print0 | sort -z | xargs -0 sha256sum > /tmp/src.sha256
    cr0x@server:~$ tail -1 /tmp/src.sha256
    3d4c...  ./reports/2026-01.csv
  3. Copy the manifest to destination (or generate it there too) and verify:
    cr0x@server:~$ cd /data/dst
    cr0x@server:~$ sha256sum -c /tmp/src.sha256 | tail -3
    ./images/logo.png: OK
    ./reports/2026-01.csv: OK
  4. Any FAILED or “No such file” is a hard failure. Resync and re-verify.
  5. Decide: if failures persist, stop and investigate storage/network corruption, bad RAM, controller issues, or an application that modifies files post-copy.

Checklist C: “Restore must work” (permissions, ACLs, and xattrs)

  1. Pick a representative sample of files: configs, executables, secrets, and directories that enforce permissions.
  2. Compare metadata:
    cr0x@server:~$ stat -c '%a %U:%G %n' /data/src/bin/app /data/dst/bin/app
    755 root:root /data/src/bin/app
    755 root:root /data/dst/bin/app
  3. If your environment uses ACLs/xattrs, check them explicitly (getfacl, getfattr).
  4. Fix the copy method (rsync -aAX) and rerun validation.

Checklist D: “Large dataset, limited time” (triage)

  1. Run a paths+sizes manifest compare (cheap) to find missing and size-different files.
  2. Hash only the size-different files and a random sample of “same size” files (see the sketch after this list).
  3. If random sample fails, expand hashing scope. If it passes, proceed with cautious confidence and document what you did.
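
A minimal sketch for steps 1-2, building on the manifests from Tasks 4-5 (paths with embedded newlines will break this line-based loop):

cr0x@server:~$ join -t $'\t' /tmp/src.pathsizes /tmp/dst.pathsizes | awk -F'\t' '$2 != $3 {print $1}' > /tmp/suspects
cr0x@server:~$ shuf -n 20 /tmp/src.pathsizes | cut -f1 >> /tmp/suspects
cr0x@server:~$ while IFS= read -r p; do sha256sum "/data/src/$p" "/data/dst/$p"; done < /tmp/suspects

Compare the hash pairs as they print; any mismatch expands the hashing scope per step 3.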

FAQ

1) What’s the fastest way to compare two folders on Linux?

For operational use, rsync -a --dry-run --itemize-changes is usually the fastest actionable compare. It mostly reads metadata, not full file contents, so it scales better than hashing.

2) Does diff -rq verify file contents?

Yes. It reads and compares file contents, which makes it more definitive than mtime/size comparisons. It can also be much slower on large trees and noisy on binary-heavy datasets.

3) When should I use checksums?

Use checksums when you need proof: migrations where you’ll delete the source, compliance verification, restore testing, or when you suspect corruption. Use manifests and rsync for daily drift detection.

4) Is rsync --checksum the same as a checksum manifest?

Not exactly. --checksum makes rsync compute checksums to decide whether to transfer. It still requires reading all file contents, and it’s not a durable artifact unless you capture the results separately. A manifest is a standalone record you can re-check later.
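
If you do run --checksum, one way to keep a durable record is to capture the itemized dry-run output with a timestamp (the log path is illustrative):

cr0x@server:~$ rsync -ai --checksum --dry-run /data/src/ /data/dst/ | tee /tmp/checksum-audit.$(date +%F).log

That log is evidence you can attach to a ticket; a checksum manifest (Task 8) is still the stronger artifact, because it can be re-verified later without rereading the source.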

5) Why do I see changes even when nobody touched the files?

Common causes: timestamp rounding differences, timezone/clock drift, metadata updates by antivirus/indexers, NFS attribute caching, or tools that rewrite files in place (even if content is logically the same).

6) How do I compare folders over SSH without pulling all data across the network?

Generate small manifests on each side (paths+sizes, or checksums if needed) and compare the text outputs. Or use rsync dry-run over SSH, which transfers only metadata and file lists.

7) What about symlinks and hard links?

Symlinks can be compared as links or as dereferenced targets, depending on your tool and flags. Hard links require special handling if you want to preserve link relationships; otherwise copies may duplicate data and still “look correct” by file content.

8) How do I handle filenames with spaces and weird characters in manifests?

Use null-delimited pipelines: find ... -print0 with sort -z and xargs -0. Avoid naive line-based parsing when paths can contain tabs or newlines.

9) Can I trust file size comparisons for integrity?

File size matches are useful for quick triage and catching obvious truncation, but they do not prove content identity. If the stakes are high, size is a hint, not evidence.

10) How do I compare only a subset (exclude caches, temp, logs)?

Use rsync excludes (or find pruning) explicitly and consistently. Keep the exclude list version-controlled, because “temporary exclusions” have a habit of becoming permanent blind spots.

Next steps you can do today

  1. Pick a standard for your org: rsync dry-run for daily compares, checksum manifests for migrations and restore verification.
  2. Write down what “same” means for each workflow: existence-only, metadata fidelity, or byte identity. Put it in the runbook so 3 a.m. you doesn’t improvise.
  3. Build one reusable manifest script that is null-safe and sorted (a sketch follows this list), and store manifests next to backups/snapshots.
  4. Schedule heavy checks (full-tree hashing) off-peak and measure impact with iostat and timing. Don’t guess.
  5. Practice restore verification on a small but representative subset weekly. Not because you love paperwork. Because incidents love surprises.
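
A starting point for item 3, as a sketch; the script name and location are up to you, and the output path should live outside the tree being scanned:

cr0x@server:~$ cat /usr/local/bin/make-manifest
#!/bin/sh
# make-manifest DIR OUTFILE: sorted, null-safe SHA-256 manifest of DIR
set -eu
cd "$1"
find . -type f -print0 | sort -z | xargs -0 sha256sum > "$2"

Invoke it as make-manifest /data/src /tmp/src.sha256, then verify on the other side with sha256sum -c, as in Task 9.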

If you take one thing from all this: start cheap, escalate only where evidence says you must, and never compare a moving target unless you enjoy philosophical debates with your storage array.
