Excel Runs the World: Terrifying Stories That Keep Happening

The pager goes off, the dashboard is red, and the executive question arrives in your inbox with the subtlety of a brick: “Why are we paying customers negative money?”

You trace the “data pipeline” and discover the last mile isn’t Kafka or Snowflake. It’s a spreadsheet named final_FINAL_reallyfinal_v7.xlsx on someone’s desktop, exported to CSV on Fridays, then manually uploaded into production. The scary part isn’t that this exists. The scary part is that it works—until it doesn’t.

Why Excel keeps winning (even in 2026)

Excel is the most successful end-user programming platform ever shipped. Not because it’s “easy,” but because it’s available, fast to iterate on, and lets domain experts move without waiting for engineering capacity. A spreadsheet is a user interface, a data store, a compute engine, and a report generator. In one file. With copy/paste.

That’s also why it’s dangerous. It collapses boundaries that production systems rely on: version control, typed schemas, deterministic builds, access controls, reproducible environments, automated tests, and clear ownership. Excel’s superpower is that it bypasses governance by being “just a file.”

In SRE terms, spreadsheets become production when:

  • They are the source of truth for customer-impacting decisions (pricing, eligibility, payouts, inventory).
  • They sit on the critical path (weekly billing run, month-end close, compliance reporting).
  • They have no documented SLO but will absolutely ruin your week if wrong.
  • They have two owners (meaning: zero owners).

People don’t use Excel because they’re reckless. They use it because it’s the shortest path from question to answer. Your job is to stop that path from going off a cliff.

A single quote worth keeping on your desk

Werner Vogels, speaking about reliability at scale, put it bluntly: “Everything fails, all the time.”

Translate that into spreadsheet reality: everything corrupts, all the time. Not always loudly. Sometimes as a silent formatting change that turns IDs into scientific notation and ruins your join keys.

Joke #1: Excel is a wonderful database as long as your database needs exactly one user, one table, and one nervous breakdown.

Facts and history: how we got here

Excel horror isn’t new. What’s new is how many modern systems still terminate in it. Here are concrete facts and context points that explain why spreadsheets keep showing up in serious places:

  1. VisiCalc (1979) is often credited as the “killer app” that sold early personal computers to businesses—spreadsheets were foundational, not a side feature.
  2. Excel debuted in 1985 (first for Macintosh), and it quickly became the default for business modeling because it combined calculation with a flexible grid UI.
  3. Excel introduced PivotTables in the 1990s, making ad-hoc analysis available to non-programmers—great for insights, terrible for governance.
  4. Spreadsheet “macro” languages (like VBA) turned files into executable programs with weak deployment practices: copy a file, now you’ve deployed code.
  5. CSV became the universal interchange format not because it’s good, but because it’s everywhere; Excel made CSV feel “standard” even though its quoting and encodings never were.
  6. Excel’s automatic type inference (dates, numbers, scientific notation) is a design choice optimized for interactive convenience, not data integrity.
  7. Row limits mattered historically (65,536 rows in older Excel versions); people learned to split data into multiple sheets/files, then glue them back together manually.
  8. Regulated industries adopted spreadsheets because audit teams could “see” the logic in a worksheet; visibility replaced correctness as the perceived control.
  9. Modern SaaS exports are still “Excel-first” because stakeholders want downloads that open in Excel, not in a notebook or BI tool.

If you’re trying to eradicate Excel from your org, start by acknowledging a hard truth: Excel isn’t a tool; it’s a treaty between engineering and the business. Break it without offering an alternative, and you’ll just drive it underground.

The real failure modes: not “human error,” but predictable system behavior

When an incident is blamed on “someone edited a spreadsheet,” it’s usually because nobody wanted to describe the system honestly. The system is: manual workflows, implicit schemas, fragile exports, and approvals done by vibes.

1) Silent type coercion (Excel “helps” you)

Excel guesses. It guesses aggressively. It guesses wrong in ways that look right.

  • Long numeric IDs become scientific notation; leading zeros vanish.
  • Date-like strings become actual dates; “03-04” becomes April 3rd or March 4th depending on locale.
  • Large integers lose precision when treated as floating point.

Operational result: joins fail, dedupe fails, and “missing records” appear in downstream systems. You don’t get an exception; you get wrong money.
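
You can see the precision cliff without opening Excel. Anything that stores numbers as IEEE-754 doubles rounds integers above 2^53, and Excel’s grid does exactly that (it also keeps at most 15 significant digits). A ten-second demonstration using awk, which uses the same doubles:

cr0x@server:~$ awk 'BEGIN{ printf "%.0f\n", 12345678901234567 }'
12345678901234568

The last digit changed and nothing complained. That is exactly what happens to a 17-digit account number that passes through a numeric cell.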

2) Implicit schemas and drifting columns

Spreadsheets invite column drift: someone inserts a column for “Notes,” another renames “SKU” to “Sku,” someone else merges cells to “make it pretty.” Downstream tooling doesn’t care that it’s pretty. It cares that column 7 used to be price.

3) The copy/paste supply chain

The most common “pipeline” is: export → paste into another workbook → filter → copy visible rows → paste values → export. At each step you can lose rows, reorder them, keep stale formulas, or export only what’s visible.

Filtering is especially brutal: it creates a UI state that is not obvious in the exported output. Your “complete dataset” might be “whatever was visible when you exported.”

4) Concurrency without locks

Files don’t merge cleanly. People email attachments. They keep two versions open. They “save as” to resolve conflicts. This is distributed systems without any of the fun parts.

5) Macros and external links: the hidden dependency graph

Workbooks that reference other workbooks, network shares, or ODBC sources behave like programs with runtime dependencies. A file share permission change or a moved folder can silently break logic. Sometimes it “fixes” itself because a cached value is used, which is worse.

6) Spreadsheet-as-database: performance cliffs

People sort 200k rows, use volatile formulas, and wonder why Excel freezes. Then they “optimize” by splitting the file, which creates reconciliation problems. Or they disable calculation, which creates stale totals. Or they export to CSV and import into something else, which introduces encoding and quoting issues.

Joke #2: A spreadsheet is just a program where every variable is named “Column F” and every bug report starts with “It used to work.”

Three corporate mini-stories from the spreadsheet front

Mini-story #1: The incident caused by a wrong assumption

At a mid-size company, the pricing team maintained a spreadsheet that generated discount tiers for enterprise renewals. Sales ops exported the sheet weekly to CSV and uploaded it to an internal admin tool, which pushed the tiers into a pricing service. The workflow was old, boring, and everyone assumed it was “fine.”

A new region launched. Someone added a Region column near the front of the spreadsheet to keep the sheet “readable.” The export still produced a CSV, the upload still succeeded, and the admin tool still returned a green “Import complete.” The importer was positional, not header-based. It read the new Region values as CustomerSegment and shifted every later field by one.

The pricing service didn’t crash. It happily accepted garbage. Suddenly, certain customers got discounts meant for a different segment, while others lost their negotiated tiers. The on-call was paged for “conversion drop” and “spike in support tickets,” not for a data integrity failure.

The root cause wasn’t “someone added a column.” The wrong assumption was that an import job that says “complete” means “correct,” and that humans will never reorganize a spreadsheet. The fix was also not “tell people to stop.” They changed the importer to match by header name, validated required fields, rejected unknown columns, and produced a diff report. The spreadsheet could change shape, but the system wouldn’t silently accept it.
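
A minimal sketch of the header-name idea in plain awk; the required column names here are illustrative, not the team’s actual contract:

awk -F, '
  NR==1 {
    for (i = 1; i <= NF; i++) col[$i] = i                  # header name -> position map
    if (!("customer_id" in col) || !("discount_pct" in col)) {   # adjust the required set to your contract
      print "required column missing" > "/dev/stderr"; exit 1
    }
    next
  }
  { print $(col["customer_id"]), $(col["discount_pct"]) }  # read fields by name, not position
' /data/staging/pricing_tiers.csv

Inserting a Region column anywhere in the sheet no longer shifts the mapping, and a missing required column fails loudly instead of importing garbage.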

Mini-story #2: The optimization that backfired

A logistics team had a workbook that calculated replenishment quantities for warehouses. It pulled data from an exported report and used a jungle of formulas. The workbook became slow. To “speed it up,” an analyst replaced formulas with hard-coded values after the weekly refresh—copy, paste values, done. It worked. It was also a time bomb.

Months later, a new warehouse was added. The workbook had named ranges and a few formulas that were supposed to extend automatically. They didn’t. Because the sheet was now mostly pasted values, the missing warehouse never entered the calculation. The dashboard still looked healthy because totals were close enough. Nobody noticed that a whole facility was being replenished using last month’s pattern and manual adjustments.

When inventory accuracy finally dipped enough to trigger alarms, engineering got involved. They found a “fast” workbook whose logic had been amputated to make it responsive. The optimization wasn’t wrong in spirit—it was solving a real performance problem—but it destroyed the workbook’s ability to adapt to changing inputs.

The fix was to move the calculations into a proper pipeline: extract report data, validate schema, calculate replenishment in code, store outputs, and let Excel become a view layer if people still needed it. Performance improved, and the logic stopped being editable by accidental copy/paste surgery.

Mini-story #3: The boring but correct practice that saved the day

A finance operations team ran monthly payouts to partners. Their source data arrived as spreadsheets from multiple channels. The process had a reputation for being fragile, so one ops engineer insisted on something deeply unglamorous: checksums, immutable raw inputs, and a reproducible transform step.

They required that every inbound file be stored unchanged in an “incoming” directory with a timestamp, then converted into a normalized CSV via a scripted process. The script validated column headers, counts, and basic invariants (no negative amounts unless flagged, partner IDs match expected patterns). It also produced a summary report and a diff against the previous month’s totals per partner.

One month, a partner’s file arrived with a subtle issue: the “Amount” column contained values with commas in a locale-specific format, and Excel displayed them fine. The transform script rejected the file because numeric parsing failed. Instead of silently “fixing” it, the process forced a human decision: request a corrected file or use an explicit parser configuration for that partner.

The payout was delayed by a few hours. That was the “cost.” The avoided cost was a seven-figure mispayment and a compliance event. The boring practice was the win: treat spreadsheets as untrusted inputs, store raw artifacts immutably, validate aggressively, and force failures early where they’re cheap.
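
A minimal sketch of that intake step in shell, assuming a local drop directory (paths and naming are illustrative):

# every inbound file becomes a timestamped, read-only artifact plus an append-only hash record
# (drop directory and naming convention are illustrative)
ts=$(date -u +%Y%m%dT%H%M%SZ)
dst="/data/incoming/partner_A_${ts}.xlsx"
cp --no-clobber /uploads/partner_A.xlsx "$dst"
chmod 440 "$dst"
sha256sum "$dst" >> /data/incoming/MANIFEST.sha256

Nothing is ever overwritten, so “which file produced this payout run” always has an answer.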

Fast diagnosis playbook: find the bottleneck fast

When Excel is involved in production outcomes, your incident triage needs to cover both infrastructure and the human-in-the-loop pipeline. The fastest path is to determine whether you have a data correctness problem, a data freshness problem, or a system availability problem.

First: classify the incident in 5 minutes

  1. Correctness: numbers are wrong, but jobs “succeeded.” Look for parsing, coercion, column drift, duplicate uploads, stale formulas.
  2. Freshness: numbers are old. Look for missed uploads, cron failures, stuck queues, manual step not executed, timezone confusion.
  3. Availability: pipeline jobs fail loudly. Look for storage full, permissions, network share down, API outages, DB constraint failures.

Second: identify the “handoff boundary”

There is always a moment where “system” becomes “spreadsheet” (export), and another where “spreadsheet” becomes “system” (import). Most failures cluster at these boundaries:

  • Export encoding/locale changed (UTF-8 vs Windows-1252, decimal comma vs dot).
  • Header row missing or duplicated.
  • Only visible rows exported due to filtering.
  • Excel auto-converted identifiers.
  • Import script assumes column positions.

Third: check invariants, not just logs

Logs tell you whether the job ran. Invariants tell you whether the job did the right thing. You want quick “smoke tests”:

  • Row counts within expected range.
  • Unique keys are actually unique.
  • Totals reconcile to baseline within tolerance.
  • Schema matches expected headers and types.

Fourth: decide whether to stop the line

If money moves, stop the line when invariants fail. Don’t “patch” by editing the spreadsheet in place. That’s how you lose auditability and end up arguing about what was changed and when. Freeze inputs, rerun transforms, and only then re-import.

Practical tasks: commands, outputs, and decisions (the SRE edition)

Below are real tasks you can run during an incident or as preventive hygiene. Each includes a command, sample output, what it means, and the decision to make.

Task 1: Confirm the file you’re about to process is immutable (hash it)

cr0x@server:~$ sha256sum /data/incoming/pricing_tiers.xlsx
a3f2c1d8bb9c2d2bb1e0e1e2b72f16c4d6b3a2a0c3c1ad4f1c9c0b7a3d8e2f11  /data/incoming/pricing_tiers.xlsx

Meaning: This is the fingerprint of the exact file used for processing.

Decision: Store the hash alongside the run metadata. If someone “fixes the spreadsheet,” require a new file and new hash; never overwrite.

Task 2: See when the file changed (catch last-minute edits)

cr0x@server:~$ stat /data/incoming/pricing_tiers.xlsx
  File: /data/incoming/pricing_tiers.xlsx
  Size: 4821932    Blocks: 9424       IO Block: 4096   regular file
Device: 0,42   Inode: 1311042     Links: 1
Access: (0640/-rw-r-----)  Uid: ( 1001/  ingest)   Gid: ( 1001/  ingest)
Access: 2026-01-22 08:11:17.000000000 +0000
Modify: 2026-01-22 08:10:59.000000000 +0000
Change: 2026-01-22 08:10:59.000000000 +0000

Meaning: Modify time near the run time can indicate someone edited and re-uploaded mid-process.

Decision: If a file changed during a run window, quarantine it and re-run from a stable snapshot.

Task 3: Convert XLSX to CSV in a deterministic way (avoid Excel export)

cr0x@server:~$ ssconvert --export-type=Gnumeric_stf:stf_csv /data/incoming/pricing_tiers.xlsx /data/staging/pricing_tiers.csv
Importing file `/data/incoming/pricing_tiers.xlsx'
Saving file `/data/staging/pricing_tiers.csv'

Meaning: You’ve converted using a server-side tool; the output is reproducible and scriptable.

Decision: Standardize conversion in CI/CD or a controlled job runner. Don’t rely on human “Save As CSV.”

Task 4: Detect non-UTF8 encodings (silent corruption source)

cr0x@server:~$ file -bi /data/staging/pricing_tiers.csv
text/plain; charset=utf-8

Meaning: Confirms the file is UTF-8; a Windows-1252 export would typically show up as charset=iso-8859-1 or unknown-8bit instead.

Decision: If not UTF-8, convert explicitly before parsing and document the source system’s encoding.

Task 5: Check for weird delimiter/quote behavior fast

cr0x@server:~$ head -n 3 /data/staging/pricing_tiers.csv
customer_id,segment,region,discount_pct
001234,enterprise,EU,12.5
009876,midmarket,US,7.0

Meaning: Quick eyeball: expected header, commas present, decimals are dots.

Decision: If you see semicolons, inconsistent quoting, or embedded newlines, adjust parser settings or reject and request a corrected export.

Task 6: Validate row count against expectation (freshness/correctness smoke test)

cr0x@server:~$ wc -l /data/staging/pricing_tiers.csv
4821 /data/staging/pricing_tiers.csv

Meaning: 4,821 lines includes the header; compare to previous run’s typical range.

Decision: If the count drops or spikes unexpectedly, stop the import and investigate filters, missing sheets, or duplicated sections.

Task 7: Confirm headers exactly match the contract (kill column drift)

cr0x@server:~$ head -n 1 /data/staging/pricing_tiers.csv | tr ',' '\n' | nl -ba
     1	customer_id
     2	segment
     3	region
     4	discount_pct

Meaning: You have an ordered list of columns; this is your schema interface.

Decision: If any column is missing, renamed, or added, fail the pipeline loudly. Don’t try to guess.

Task 8: Detect duplicate keys (a classic spreadsheet “paste twice”)

cr0x@server:~$ awk -F, 'NR>1{print $1}' /data/staging/pricing_tiers.csv | sort | uniq -d | head
001234
004455

Meaning: These customer IDs appear more than once.

Decision: Decide the policy: reject duplicates, or require an explicit “effective_date” column and deterministic resolution rules.

Task 9: Detect “scientific notation” damage in identifiers

cr0x@server:~$ awk -F, 'NR>1 && $1 ~ /E\+/ {print NR ":" $1; exit}' /data/staging/pricing_tiers.csv
129:1.23457E+11

Meaning: An ID got converted into scientific notation somewhere in the process.

Decision: Reject the file and fix the export path. IDs must be treated as strings end-to-end.

Task 10: Check numeric sanity (negative values, impossible percentages)

cr0x@server:~$ awk -F, 'NR>1 && ($4<0 || $4>100){print NR ":" $0}' /data/staging/pricing_tiers.csv | head
77:008812,enterprise,US,150

Meaning: 150% discount is probably not a strategy; it’s a parsing or data entry error.

Decision: Define invariants and enforce them as gates. If exceptions exist, require an explicit override column and approvals.

Task 11: Compare today’s totals to a baseline (catch silent shifts)

cr0x@server:~$ awk -F, 'NR>1{sum+=$4} END{printf "%.2f\n", sum}' /data/staging/pricing_tiers.csv
39210.50

Meaning: A crude aggregate. It won’t prove correctness, but it will catch huge deviations.

Decision: If the sum deviates beyond a tolerance band from prior runs, block the import and open an investigation.

Task 12: Validate freshness by checking pipeline logs for last successful run

cr0x@server:~$ journalctl -u pricing-importer --since "2 days ago" | tail -n 8
Jan 22 08:12:03 server pricing-importer[24911]: Starting import: /data/staging/pricing_tiers.csv
Jan 22 08:12:04 server pricing-importer[24911]: Parsed rows=4820 rejected=0
Jan 22 08:12:05 server pricing-importer[24911]: Upsert complete
Jan 22 08:12:05 server pricing-importer[24911]: Import success run_id=7b3d2c

Meaning: Confirms the importer ran recently, and how many rows were parsed/rejected.

Decision: If the last success is too old, treat it as a freshness incident and look for stuck handoffs.

Task 13: Check DB for the latest data timestamp (don’t trust the job log)

cr0x@server:~$ psql -d pricing -c "select max(updated_at) from discount_tiers;"
         max
---------------------
 2026-01-22 08:12:05
(1 row)

Meaning: Confirms downstream state changed when you think it did.

Decision: If timestamps lag, the import may have succeeded but wrote nothing (constraints, dedupe rules, wrong key mapping).

Task 14: Confirm storage isn’t full (because “import failed” is often “disk full”)

cr0x@server:~$ df -h /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       500G  496G  4.0G  100% /data

Meaning: You are living on fumes. Temp files and conversions will fail unpredictably.

Decision: Free space immediately and add alerting on filesystem utilization with time-to-full estimates.

Task 15: Find the biggest offenders in the staging area

cr0x@server:~$ du -sh /data/staging/* | sort -h | tail -n 5
3.2G  /data/staging/archive
5.8G  /data/staging/tmp
7.1G  /data/staging/exports

Meaning: Space is being consumed by archives/tmp; cleanup policy may be broken.

Decision: Implement retention (e.g., keep last N versions) and move long-term storage to object storage with lifecycle rules.

Task 16: Verify the cron/systemd timer actually fired (the “nobody pressed the button” check)

cr0x@server:~$ systemctl list-timers --all | grep pricing
Fri 2026-01-23 08:12:00 UTC  23h left  Thu 2026-01-22 08:12:00 UTC  2min ago  pricing-import.timer  pricing-import.service

Meaning: The timer fired; if the service didn’t run, look at service failures or dependencies.

Decision: If timers aren’t firing, you have scheduling/host issues. If they fire but data is wrong, focus on inputs/parsing.

Task 17: Catch schema drift by diffing headers against a pinned contract

cr0x@server:~$ diff -u /etc/pricing/header_contract.txt <(head -n 1 /data/staging/pricing_tiers.csv)
--- /etc/pricing/header_contract.txt
+++ /dev/fd/63
@@ -1 +1 @@
-customer_id,segment,region,discount_pct
+customer_id,segment,Region,discount_pct

Meaning: Case change (“Region” vs “region”) can break strict parsers or downstream mapping.

Decision: Decide whether headers are case-sensitive; then enforce consistently. My advice: be strict and make producers comply.

Task 18: Verify permissions on shared drop location (ingest failures that look like “no data”)

cr0x@server:~$ namei -l /data/incoming
f: /data/incoming
drwxr-xr-x root   root   /
drwxr-xr-x root   root   data
drwxrwx--- ingest ingest incoming

Meaning: Only the ingest group can write. If a human uploader isn’t in the group, they’ll “upload” somewhere else or fail silently.

Decision: Fix group membership and document the upload path; consider a web upload with authentication rather than a shared folder.

Common mistakes: symptom → root cause → fix

These show up again and again. Treat them like known vulnerabilities.

1) “Import completed” but downstream values are nonsense

Symptom: Job logs show success; dashboards shift wildly; no exceptions thrown.

Root cause: Positional CSV parsing with column drift, or Excel export reordered columns.

Fix: Parse by header name, validate required/allowed columns, and fail on unknown fields. Produce a human-readable diff report before applying changes.
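
A cheap version of that diff report, assuming you keep the last applied CSV next to the candidate one (the /data/applied path is illustrative):

# sorted body-vs-body diff: what rows will change if this import is applied
# (keep the previously applied CSV wherever your runs archive artifacts)
diff -u <(tail -n +2 /data/applied/pricing_tiers_prev.csv | sort) \
        <(tail -n +2 /data/staging/pricing_tiers.csv | sort)

It won’t explain why a row changed, but it makes “this import will touch 3,000 customers” visible before anyone presses apply.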

2) Missing records after a spreadsheet update

Symptom: A subset of customers/products vanish after a “minor edit.”

Root cause: Excel filters left active; only visible rows exported. Or a sheet tab wasn’t included in export.

Fix: Server-side conversion from the raw XLSX; reject files with filtered ranges (if detectable), and validate row count ranges and key coverage.
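
Key coverage is easy to check before the import if you kept the previous file (paths are illustrative; comm needs sorted input):

# customer IDs present last run but missing from the new export (a classic "filter left on" signature)
comm -23 <(awk -F, 'NR>1{print $1}' /data/applied/pricing_tiers_prev.csv | sort -u) \
         <(awk -F, 'NR>1{print $1}' /data/staging/pricing_tiers.csv | sort -u)

Any output here should block the import until someone explains the disappearance.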

3) IDs stop matching across systems

Symptom: Joins fail; “new” entities appear that look like duplicates.

Root cause: IDs auto-coerced (leading zeros dropped, scientific notation, numeric rounding).

Fix: Enforce IDs as strings at the earliest ingestion step; validate patterns (length/regex). Never accept numeric IDs from a spreadsheet without explicit formatting.
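
A sketch of that gate, assuming the contract says IDs are zero-padded six-digit strings (adjust the check to your real key format):

# reject the file if any ID is not exactly six digits (leading zeros must survive)
awk -F, 'NR>1 && (length($1) != 6 || $1 !~ /^[0-9]+$/) {print "line " NR ": bad id " $1; bad=1}
         END{exit bad}' /data/staging/pricing_tiers.csv

Exit status 1 means the export mangled identifiers; fix the producer, don’t massage the data.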

4) Month-end “just takes longer now”

Symptom: Spreadsheet processing time grows until it becomes a fire drill.

Root cause: Spreadsheet has become a compute engine; volatile formulas, huge ranges, multiple cross-sheet lookups.

Fix: Move computation into a pipeline or database; keep Excel as a front-end or reporting artifact. Set a hard row/complexity limit.

5) Two teams swear they used “the same file”

Symptom: Conflicting outcomes, finger-pointing, unreproducible results.

Root cause: No immutability, no artifact retention, no checksums; files overwritten or emailed around.

Fix: Store raw inputs immutably with hashes and timestamps. Require run IDs and attach artifacts to the run.

6) “We fixed it” but it breaks again next week

Symptom: Repeat incidents with new small variations.

Root cause: Fix was a manual patch in the spreadsheet; pipeline accepts whatever arrives; no tests or contracts.

Fix: Write a schema contract, enforce invariants, add pre-import validation gates, and set ownership with change control.

7) Currency/decimal issues across regions

Symptom: Values off by 10x/100x; decimals appear as thousands separators.

Root cause: Locale mismatch (comma vs dot), Excel formatting vs raw value confusion, or CSV exported with localized separators.

Fix: Normalize numeric formats in the ingestion layer. Require ISO currency codes and explicit decimal separators. Reject ambiguous formats.
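
A sketch of one such normalization, assuming a semicolon-delimited partner export with European number formatting in column 4 and no quoted fields (delimiter and column position are illustrative):

# strip thousands separators, then turn the decimal comma into a dot, for the amount column only
# (column 4 as "amount" is an assumption about this partner's export)
awk 'BEGIN{FS=";"; OFS=","} NR>1 {gsub(/\./, "", $4); sub(/,/, ".", $4)} {$1=$1; print}' \
    /data/incoming/partner_amounts.csv > /data/staging/partner_amounts_normalized.csv

The point isn’t this exact one-liner; it’s that the rule lives in the ingestion layer, per source, instead of in someone’s head.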

8) Import starts failing “randomly” after OS updates

Symptom: Same spreadsheet “works on my machine,” fails on server, or vice versa.

Root cause: Dependency drift: converter versions, library parsing differences, or macro security settings.

Fix: Containerize conversion/parsing tooling; pin versions; create golden test fixtures and run them on every deploy.
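
Even without containers, you can at least refuse to run when tooling drifts. A sketch, assuming you store the known-good version string next to the pipeline config (the /etc/pricing path mirrors the header contract above and is illustrative):

# stop the pipeline if the converter on this host is not the version we validated against
# (/etc/pricing/converter.version is an illustrative location for the pinned string)
pinned=$(cat /etc/pricing/converter.version)
actual=$(ssconvert --version 2>&1 | head -n 1)
[ "$actual" = "$pinned" ] || { echo "converter drift: got '$actual', expected '$pinned'" >&2; exit 1; }

Golden fixtures plus a version gate catch most “works on my machine” parsing surprises before they touch data.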

Checklists / step-by-step plan

These are not aspirational. They’re the minimum to keep spreadsheet-driven operations from becoming a recurring incident category.

Checklist A: If Excel is a source of truth, treat it like production code

  1. Define ownership: one accountable owner (not a committee), with a backup.
  2. Define a schema contract: required columns, allowed columns, types, and invariants.
  3. Stop overwrites: store every inbound file immutably with timestamp + hash.
  4. Automate conversion: XLSX → normalized CSV using server-side tooling, not Excel UI.
  5. Validate before import: row count ranges, unique keys, type checks, invariants, and sanity aggregates.
  6. Make imports idempotent: rerunning the same file produces the same DB state (see the sketch after this checklist).
  7. Produce an approval diff: “Here’s what will change” summary for humans to review.
  8. Log run IDs: every downstream change tied to a run ID and input hash.
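
For item 6, one common shape is an insert-or-update keyed on the natural key, so replaying the same file converges to the same rows. A sketch against the pricing table from the tasks above; the extra columns and the unique constraint on customer_id are assumptions:

# replaying the same values produces the same final row state (requires a unique constraint on customer_id)
psql -d pricing -c "
  INSERT INTO discount_tiers (customer_id, segment, region, discount_pct, source_run_id)
  VALUES ('001234', 'enterprise', 'EU', 12.5, '7b3d2c')
  ON CONFLICT (customer_id) DO UPDATE
    SET segment = EXCLUDED.segment, region = EXCLUDED.region,
        discount_pct = EXCLUDED.discount_pct, source_run_id = EXCLUDED.source_run_id;"

Pair it with the input hash from item 3 and the run ID from item 8, and a rerun becomes a non-event instead of a double-apply.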

Checklist B: Step-by-step incident response when the spreadsheet is suspected

  1. Freeze the pipeline: pause imports or disable the job to stop compounding damage.
  2. Identify last known good run: find run ID, input hash, and downstream timestamp.
  3. Collect artifacts: raw XLSX, normalized CSV, validation reports, importer logs.
  4. Run invariants: row counts, duplicates, schema diff, numeric sanity checks.
  5. Compare to baseline: totals and key distributions vs previous run.
  6. Decide rollback strategy: revert DB to last good state or re-import known-good file.
  7. Fix at the boundary: don’t “correct” the spreadsheet in place; correct the ingestion contract or request a new file.
  8. Post-incident: add a gate that would have prevented the same class of failure.

Checklist C: Migration plan (Excel to something safer) without a civil war

  1. Inventory critical spreadsheets: which ones move money, change customer entitlements, or affect compliance?
  2. Pick the first target by blast radius: highest impact × highest change frequency.
  3. Extract the model: identify inputs, outputs, and hidden dependencies (other sheets, external links, macros).
  4. Write a reference implementation: code that reproduces outputs from the same inputs.
  5. Run in parallel: spreadsheet and new system produce outputs side-by-side until deltas are understood.
  6. Keep Excel as UI if needed: allow export/import but under contract, with validation and approvals.
  7. Cut over with guardrails: feature flag, rollback plan, and monitoring for invariants.
  8. Lock down the old path: remove “manual upload” or make it an emergency-only workflow with explicit approvals.

FAQ

1) Is Excel “bad”?

No. Excel is fantastic for exploration, prototyping, and one-off analysis. It becomes dangerous when it’s the system of record without system-grade controls.

2) What’s the single highest-leverage control to add?

Immutable raw input storage with checksums and a run ID. If you can’t prove what file produced what output, you don’t have operations—you have folklore.

3) Why not just tell people to stop using spreadsheets?

Because they won’t. Or they’ll comply publicly and keep doing it privately. Replace the workflow: offer a safer tool that is as fast as Excel for their job.

4) CSV seems simple. Why does it cause so many outages?

Because “CSV” isn’t one format. Delimiters, quoting rules, encodings, embedded newlines, and locale-specific numbers turn “simple” into “ambiguous.” Excel exports frequently amplify that ambiguity.

5) We already have a data warehouse. Why does Excel still matter?

Because decisions often happen outside the warehouse: someone downloads, edits, and re-uploads. The warehouse can be pristine while the last-mile operational file is chaos.

6) How do I detect spreadsheet-driven incidents earlier?

Monitor invariants: row counts, key uniqueness, distribution shifts, and reconciliation totals. Alert on deviations, not on “job failed,” because spreadsheet failures often succeed.

7) What should be the boundary between Excel and production systems?

Excel can be an input, but only through a controlled ingestion path that normalizes formats, validates schema, and produces an approval diff. Excel should not directly write into production state without gates.

8) Are macros always unacceptable?

In regulated or high-impact workflows, treat macros like unreviewed code running on random endpoints—because that’s what they are. If you must keep them, version them, sign them, and move execution server-side.

9) How do we handle “but the spreadsheet contains business logic we can’t lose”?

Extract and encode it. Start by writing tests from known historical inputs/outputs, then reimplement the logic in code. Keep Excel as a view during transition, not as the engine.

10) What’s a pragmatic first migration target?

Any spreadsheet that feeds an import job. Replace “manual export/import” with: raw artifact capture → deterministic conversion → validation → idempotent import → audit trail.

Conclusion: what to do on Monday

If Excel is in your production chain, you’re not doomed. You just need to treat it as an untrusted, high-variance interface—like the public internet, but with more merged cells.

Next steps that actually reduce incidents:

  1. Find the top five spreadsheets that move money or permissions. Write down owners, cadence, and downstream dependencies.
  2. Install guardrails at the boundaries: immutable storage, hashes, deterministic conversion, schema contracts, invariant checks.
  3. Make failure loud and early: reject ambiguous files, block imports on drift, and require a new artifact for any “fix.”
  4. Move compute out of Excel: if a workbook is doing serious transformations, it belongs in code and a database.
  5. Stop relying on heroics: build the boring pipeline that makes the right thing the easy thing.

Excel will keep running the world. Your job is to stop it from running your incident queue.
