Somewhere in your company, a spreadsheet is quietly making decisions that look like “the system” from the outside.
It approves credits, sets prices, allocates inventory, pays bonuses, forecasts cash, throttles marketing spend, and
feeds dashboards that executives treat as physical reality.
Then one day: the numbers are off, the warehouse stops shipping, finance can’t close, or a regulator asks how you
derived a figure and you point to “Sheet3 (final) (really final).xlsx”. Welcome to Excel-driven disaster: not because
Excel is “bad”, but because you used it like production infrastructure without production-grade controls.
Why this keeps happening (and why it’s rational)
Excel isn’t “just a spreadsheet.” In many organizations it’s the fastest available programming environment with a UI,
a distribution channel (email), and an execution engine (whoever opens it). It’s the only tool that lets a domain
expert go from “I have an idea” to “I have a working thing” before the next meeting starts.
In SRE terms, Excel is a serverless platform where every user is a scheduler and every laptop is production. That’s
why it spreads. It’s also why it fails: you get all the power of software with none of the guardrails of software.
The classic escalation pattern is boring and predictable:
- Phase 1: Analysis. A spreadsheet is built for a one-off decision.
- Phase 2: Reuse. Someone copies it “for next month,” adds a new tab, ships it via email.
- Phase 3: Dependency. Another team starts relying on it. It becomes a “source of truth.”
- Phase 4: Integration. A script exports CSV from it. Now it’s in a pipeline.
- Phase 5: Institutionalization. The spreadsheet has an owner, but no SLA, no monitoring, and no change control.
- Phase 6: Incident. The spreadsheet breaks and the business breaks with it.
This isn’t a moral failure. It’s an incentives failure. The spreadsheet’s value is immediate and local; the risk is
delayed and distributed. The spreadsheet author gets praised for speed. The incident is paid for later by operations,
finance, customers, and audit.
You don’t fix this by banning Excel. You fix it by deciding which spreadsheets are allowed to be “important,” and
then treating those like production systems: versioning, validation, access control, reproducible builds, and
observability. The goal isn’t purity. The goal is to stop being surprised.
Facts and history: spreadsheets have been breaking things for decades
A little context helps, because this problem is older than your current tech stack.
- Spreadsheets predate most enterprise software. VisiCalc (1979) popularized the “electronic spreadsheet” and became the early business case for putting personal computers in offices.
- Excel inherited a compatibility burden. Decades of file formats, formulas, and add-ins created a huge surface area where “it works on my machine” becomes policy.
- Spreadsheet errors are a studied phenomenon. Research in end-user computing repeatedly finds non-trivial error rates in real spreadsheets, even when built by careful professionals.
- CSV is barely a standard; it’s a détente. RFC 4180 exists, but it is loosely followed and says nothing about types, locales, or time zones, and little about encoding. CSV stays “universal” precisely because it avoids those agreements. That’s not a feature when money is involved.
- Excel has multiple date systems. Workbooks can use a 1900-based or a 1904-based epoch, and the infamous 1900 leap-year bug (Excel treats 1900 as a leap year, inherited from Lotus 1-2-3 compatibility) is preserved on purpose. It still bites cross-system date comparisons.
- Auto-formatting has caused data loss in science. Gene names and identifiers have been silently converted to dates or scientific notation in published datasets; the same mechanics hit product SKUs and account numbers.
- Large Excel files are effectively “stateful apps.” Once macros, Power Query, external connections, and pivot caches appear, the file is no longer just data; it’s a program with hidden state.
- Email is an unreliable message bus. Spreadsheet workflows often rely on human forwarding, which is the only distributed system that makes packet loss feel personal.
If you’re hoping for a world where spreadsheets don’t matter: sorry. Spreadsheets are the user interface of business
reality. The task is to make that interface less capable of detonating production.
Failure modes: how spreadsheets fail in the real world
1) The wrong assumption: “This column is always numeric”
Spreadsheets are permissive. That’s why people love them. They also accept mixed types, blank cells, strings that
look like numbers, numbers that look like dates, and whitespace that looks like nothing. A human eye glides over
“1,234” and “1234” as the same value; parsers don’t.
The failure shows up downstream: a join fails, a null propagates, a sum changes, a filter excludes half the data.
The spreadsheet didn’t “crash.” It simply produced a plausible wrong answer. Those are the worst outages.
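A tiny Python sketch makes the divergence concrete (the sample values are illustrative, not taken from any real file):

# type_drift_demo.py -- what a parser does with values a human reads as "the same number"
values = ["1234", "1,234", " 1234 ", "1.234", "", "N/A"]
for v in values:
    try:
        parsed = float(v.replace(",", ""))   # the common "quick fix"; note it also rewrites 1,234 vs 1.234 semantics
    except ValueError:
        parsed = None                        # blanks and "N/A" quietly become nulls and propagate downstream
    print(repr(v), "->", parsed)

The point isn’t the snippet; it’s that every consumer of the spreadsheet has its own private version of this logic unless you centralize it.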
2) Hidden state: cached pivots, volatile formulas, and ghost references
Pivot tables can cache data. Power Query can cache results. External links can point to network paths that used to
exist. Named ranges can continue to reference old sheets even after “cleanup.” Volatile formulas like NOW()
and RAND() make outputs non-reproducible.
In operational terms: you can’t rebuild the artifact deterministically. That makes incident response painful because
you can’t answer “what changed?” with confidence.
3) Concurrency and ownership: “Who edited the source of truth?”
Excel in shared drives introduces concurrency without real conflict resolution. Even with modern co-authoring,
you can end up with logic changes mixed into data updates. A key risk is that the spreadsheet has two roles:
data store and program. You can lock one, but people will duplicate the other.
4) Performance as a reliability issue
When spreadsheets get big, they get slow. Users respond by killing calculation, disabling refresh, copying values,
or saving “a lighter version.” Each of those is an optimization that quietly changes semantics.
Joke #1: Nothing is more permanent than a “temporary” Excel workaround, except maybe the “temporary” VPN exception.
5) Security and compliance: sensitive data in the wrong place
Spreadsheets are portable, which means they are exfiltration-friendly. They get emailed, synced, copied, and backed
up in consumer tools. They also hold secrets: customer lists, payroll, pricing, credentials (yes, really), and token
dumps from some analyst’s “quick export.”
6) The spreadsheet-to-pipeline interface: CSV and hope
The operationally dangerous moment is when a spreadsheet becomes input to automation: nightly import jobs, “upload
this template,” or a script pulling the newest file from a folder. You just turned manual work into a production data
pipeline without schema, tests, or validation.
At that point, you’re no longer arguing about tool preference. You’re running a distributed system with no contract.
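A contract doesn’t have to be heavyweight. At minimum it’s the expectations written down as data a machine can check. A minimal sketch (every name and rule below is illustrative, not taken from a real template):

# contract.py -- a hypothetical data contract for one uploaded template
CONTRACT = {
    "delimiter": ",",
    "encoding": "utf-8",
    "required_columns": ["sku", "region", "price", "currency"],
    "id_columns": ["sku"],                      # treat as strings, never numbers
    "key": "sku",                               # must be unique per file
    "rules": {
        "sku": r"^[0-9A-Z]{6,12}$",
        "price": "decimal, 0 < price < 100000",
        "currency": "ISO 4217 code",
    },
}

Once the contract exists as data, the same object can drive validation, generate documentation, and produce a rejection message a sender actually understands.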
Fast diagnosis playbook (find the bottleneck fast)
When “the spreadsheet” breaks the business, you need a path to certainty. Not a debate. Not a blame tour. A fast
diagnosis sequence that narrows the failure domain.
First: confirm what actually changed
- Is the wrongness in the spreadsheet artifact? The file’s content or formulas changed.
- Or is the wrongness in the inputs? Upstream exports changed shape, encoding, delimiters, or time ranges.
- Or is the wrongness in the import process? A job parsed differently after an OS/library update.
Your first job is to locate the boundary where “correct” becomes “incorrect.” If you can’t do that, you’ll spend
hours “re-running” the same broken pipeline in different costumes.
Second: verify the schema and types, not just totals
Totals can be accidentally right. Type drift is the real killer: IDs losing leading zeros, dates changing locale,
decimal separators flipping, and “N/A” turning into null. Look at a handful of representative fields and confirm
type expectations.
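One way to do that spot check quickly is a rough per-column profile of the exported CSV. This is a diagnostic sketch (the path and sample size are arbitrary), not a gate:

# profile_columns.py -- rough per-column type profile of an exported CSV
import collections
import csv
import sys

def kind(v):
    v = (v or "").strip()
    if v == "" or v.upper() in {"N/A", "NA", "NULL"}:
        return "empty"
    if v.isdigit():
        return "int-like (leading zero)" if v.startswith("0") and len(v) > 1 else "int-like"
    try:
        float(v.replace(",", ""))
        return "float-like"
    except ValueError:
        return "text"

with open(sys.argv[1], newline="", encoding="utf-8") as f:
    profile = collections.defaultdict(collections.Counter)
    for i, row in enumerate(csv.DictReader(f)):
        if i >= 1000:                            # a sample is enough to spot drift
            break
        for col, val in row.items():
            profile[col][kind(val)] += 1

for col, counts in profile.items():
    print(col, dict(counts))

If an ID column shows a mix of “int-like” and “int-like (leading zero)”, you’ve found your leading-zero problem before the database does.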
Third: check time, time zones, and date systems
If finance says “yesterday is missing,” assume time zone or boundary logic. If operations says “today doubled,”
assume duplicate ingestion. If sales says “this month looks weird,” assume a calendar and fiscal month mismatch.
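Excel-side date confusion is easier to untangle once you remember that Excel stores dates as day serials. A conversion sketch for the default 1900-based date system (the serial below is just an example); the odd-looking 1899-12-30 epoch absorbs the inherited leap-year bug for dates from 1900-03-01 onward:

# excel_serial.py -- convert an Excel 1900-system date serial to a calendar date
from datetime import date, timedelta

EPOCH = date(1899, 12, 30)       # compensates for Excel's fictitious 1900-02-29

def excel_serial_to_date(serial: int) -> date:
    # Valid for serial >= 61 (1900-03-01 onward); earlier serials need special-casing,
    # and workbooks saved in the 1904 date system need a different epoch entirely.
    return EPOCH + timedelta(days=serial)

print(excel_serial_to_date(45657))   # 2024-12-31

The same discipline applies to timestamps: convert to UTC ISO-8601 at the pipeline boundary and compare there, not in whatever local time the workbook happened to be opened in.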
Fourth: decide whether you’re dealing with data or logic
- Data incident: wrong rows, missing rows, duplicated rows, wrong encoding.
- Logic incident: changed formulas, broken references, altered pivot filters, macro behavior changes.
Data incidents are usually easier to roll back if you have immutable raw inputs. Logic incidents require change
history and reproducible builds. If you don’t have those, you’ll be doing spreadsheet archaeology.
Fifth: implement a safe stop
If the spreadsheet feeds payouts, pricing, or inventory: stop propagation. Put the pipeline in “read-only” mode or
freeze uploads until the boundary is identified. It’s better to ship late than to ship wrong.
Practical tasks: commands, outputs, and the decision you make
The fastest way to reduce spreadsheet-driven incidents is to treat the spreadsheet as an artifact in a pipeline:
you inventory it, inspect it, validate it, and observe it. Below are concrete tasks I’ve used in real environments.
Each includes a runnable command, sample output, what it means, and the decision you make.
Task 1: Inventory spreadsheet sprawl by extension and age
cr0x@server:~$ find /data/shared -type f \( -iname "*.xlsx" -o -iname "*.xlsm" -o -iname "*.csv" \) -printf "%TY-%Tm-%Td %p\n" | sort | tail -n 5
2026-01-18 /data/shared/finance/close/close_model_final_v7.xlsx
2026-01-19 /data/shared/sales/forecast/region_rollup.xlsm
2026-01-20 /data/shared/ops/inventory/inbound_template.xlsx
2026-01-21 /data/shared/pricing/pricebook_export.csv
2026-01-22 /data/shared/bonus/payout_calc.xlsm
What it means: You have a list of “recently touched” spreadsheet artifacts. Fresh edits correlate with incidents.
Decision: Put the newest business-critical files on a watchlist: change control, checksum monitoring, and a named owner.
Task 2: Detect duplicate “final” files (a reliability smell)
cr0x@server:~$ find /data/shared -type f \( -iname "*final*.xlsx" -o -iname "*final*.xlsm" \) | wc -l
137
What it means: “Final” is not a state; it’s a cry for help.
Decision: Pick a canonical location and naming convention, then enforce it with access controls and automation (not policy PDFs).
Task 3: Identify who last modified critical spreadsheets
cr0x@server:~$ stat /data/shared/bonus/payout_calc.xlsm
File: /data/shared/bonus/payout_calc.xlsm
Size: 4839012 Blocks: 9456 IO Block: 4096 regular file
Device: 8,17 Inode: 5501021 Links: 1
Access: (0660/-rw-rw----) Uid: ( 1032/ finops) Gid: ( 210/ finance)
Access: 2026-01-22 07:41:10.000000000 +0000
Modify: 2026-01-22 07:40:55.000000000 +0000
Change: 2026-01-22 07:40:55.000000000 +0000
What it means: You know when the artifact changed. On SMB/NFS you can often correlate this with audit logs.
Decision: If it changed near incident start, treat it as a suspect. Freeze edits until you capture a copy for analysis.
Task 4: Hash a spreadsheet to get immutability you can reason about
cr0x@server:~$ sha256sum /data/shared/bonus/payout_calc.xlsm
8b0b7f5b73a3b5c6f7e1ab2d0c8f4e2c1e2d9b5f1f18db7f24a9e51b6fcb2d11 /data/shared/bonus/payout_calc.xlsm
What it means: This fingerprint lets you compare “the file used for the run” vs “the file after people started poking it.”
Decision: Store hashes with each pipeline run. If the hash differs, you have a change event even if nobody admits it.
Task 5: Inspect an XLSX as a ZIP to find suspicious internals
cr0x@server:~$ unzip -l /data/shared/finance/close/close_model_final_v7.xlsx | head
Archive: /data/shared/finance/close/close_model_final_v7.xlsx
Length Date Time Name
--------- ---------- ----- ----
1167 2026-01-18 10:12 [Content_Types].xml
587 2026-01-18 10:12 _rels/.rels
251943 2026-01-18 10:12 xl/workbook.xml
48712 2026-01-18 10:12 xl/styles.xml
1822345 2026-01-18 10:12 xl/worksheets/sheet1.xml
687220 2026-01-18 10:12 xl/sharedStrings.xml
18873 2026-01-18 10:12 xl/externalLinks/externalLink1.xml
What it means: Modern Office files are archives. You can inspect structure without “opening” the file.
Decision: If you suspect external links or query connections, extract and inspect xl/externalLinks and workbook relationships.
Task 6: Search for external links that make results non-local
cr0x@server:~$ unzip -p /data/shared/finance/close/close_model_final_v7.xlsx xl/_rels/workbook.xml.rels | sed -n '1,30p'
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet" Target="worksheets/sheet1.xml"/>
<Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/externalLink" Target="externalLinks/externalLink1.xml"/>
</Relationships>
What it means: There is at least one external link. Your spreadsheet output can change when some other file changes.
Decision: Treat this as a production dependency. Either vendor the inputs into the pipeline, or remove external links and replace with controlled imports.
Task 7: Detect macro-enabled files and treat them as code
cr0x@server:~$ file /data/shared/sales/forecast/region_rollup.xlsm
/data/shared/sales/forecast/region_rollup.xlsm: Microsoft Excel 2007+
What it means: The .xlsm extension marks the file as macro-enabled; the presence of xl/vbaProject.bin inside the archive confirms there is actual macro code. Logic can execute on open, refresh, or button clicks.
Decision: Require code review equivalents: limited editors, signed macros where possible, and a “build” step that exports deterministic outputs.
Task 8: Extract and review macros (VBA) without trusting Excel’s UI
cr0x@server:~$ oledump.py /data/shared/sales/forecast/region_rollup.xlsm | head -n 15
A: xl/vbaProject.bin
 A1:       443 'PROJECT'
 A2:        59 'PROJECTwm'
 A3: m     977 'VBA/ThisWorkbook'
 A4: M   19244 'VBA/Module1'
 A5:      2110 'VBA/_VBA_PROJECT'
 A6:       569 'VBA/dir'
What it means: VBA streams live inside xl/vbaProject.bin; the uppercase M flags streams that contain macro code. You can extract Module1 (oledump’s -s selects a stream, -v decompresses the VBA) and see what it does (file I/O, web calls, hidden edits).
Decision: If macros touch network paths, shells, or hidden sheets, treat as a software component. Assign ownership and a change gate.
Task 9: Validate CSV delimiter and detect “CSV that isn’t really CSV”
cr0x@server:~$ head -n 3 /data/shared/pricing/pricebook_export.csv
sku,region,price,currency
001234,US,12.50,USD
001235,EU,11,EUR
What it means: The file parses as comma-delimited, but many EU-locale exports also use the comma as the decimal separator. The EU row’s price of 11 (no decimals) might be the real value, or the remains of “11,50” mangled by delimiter and locale confusion.
Decision: Declare a contract: delimiter, decimal separator, quoting rules, encoding, and required headers. If you can’t, stop pretending it’s safe to automate.
Task 10: Detect encoding issues before they corrupt names and addresses
cr0x@server:~$ file -bi /data/shared/pricing/pricebook_export.csv
text/plain; charset=utf-8
What it means: UTF-8 is good news. If this were “unknown-8bit” or a Windows codepage, you’d expect mojibake in downstream systems.
Decision: Standardize on UTF-8. Reject uploads that aren’t UTF-8, or transcode them in a controlled step and log it.
Task 11: Catch schema drift by comparing headers
cr0x@server:~$ grep -qI . /data/shared/ops/inventory/inbound_template.xlsx && head -n 1 /data/shared/ops/inventory/inbound_template.xlsx || echo "XLSX is binary; export to CSV for header checks"
XLSX is binary; export to CSV for header checks
What it means: XLSX is a ZIP container, not a text file; you can’t reliably pull headers out of it with shell tools. That’s the point: you need a controlled export step to a text-based format first.
Decision: Make the pipeline export XLSX → CSV (or Parquet) in a controlled container, then validate schema on the exported artifact.
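The export step itself can be small. A sketch using openpyxl (paths and sheet selection are illustrative); note that data_only=True returns the values Excel last calculated and cached, so an uncalculated workbook will export blanks where formulas live:

# export_xlsx.py -- controlled XLSX -> CSV export sketch
import csv
import sys
from openpyxl import load_workbook

src, dst = sys.argv[1], sys.argv[2]
wb = load_workbook(src, read_only=True, data_only=True)   # cached values, not formulas
ws = wb.active                                            # or wb["Inbound"] if the sheet name is part of the contract

with open(dst, "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)       # quote everything so IDs stay strings
    for row in ws.iter_rows(values_only=True):
        writer.writerow(["" if v is None else v for v in row])

Run it in a pinned container, log the input and output hashes, and you have the deterministic build step the rest of this section keeps asking for.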
Task 12: Validate exported CSV headers against an expected contract
cr0x@server:~$ expected="sku,warehouse,qty,arrival_date"
cr0x@server:~$ actual=$(head -n 1 /var/pipeline/stage/inbound_export.csv)
cr0x@server:~$ echo "expected=$expected"; echo "actual=$actual"
expected=sku,warehouse,qty,arrival_date
actual=sku,warehouse,qty,arrival_date,notes
What it means: A new column appeared. That might be harmless, or it might shift column-based imports and poison your load.
Decision: If your loader is position-based, fail the job immediately. If it’s header-based, decide whether to ignore unknown columns or update the contract with a reviewed change.
Task 13: Detect leading-zero loss (classic SKU/account number failure)
cr0x@server:~$ awk -F, 'NR>1{ if ($1 ~ /^[0-9]+$/ && length($1) < 6) c++ } END{print "short_numeric_skus=" c+0}' /var/pipeline/stage/pricebook_export.csv
short_numeric_skus=42
What it means: You have numeric-only SKUs shorter than expected length; that’s often what “001234” becomes after Excel helpfully “fixes” it.
Decision: Treat identifiers as strings. Enforce quoting or a “SKU text” column. Reject files where the ID column violates a regex contract.
Task 14: Find duplicates introduced by copy/paste or re-exports
cr0x@server:~$ awk -F, 'NR>1{print $1}' /var/pipeline/stage/pricebook_export.csv | sort | uniq -d | head
001235
001240
What it means: Duplicate keys exist. If you do an “upsert” without deterministic precedence, you’ll get non-repeatable outcomes.
Decision: Decide a rule: reject duplicates, or require a timestamp/version column and select the latest deterministically.
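If the rule is “latest wins,” make the precedence explicit in code rather than trusting file order. A sketch (the updated_at column and field names are assumptions about the export, not facts about it):

# dedupe_latest.py -- deterministic "latest wins" dedup keyed on sku
import csv
import sys

best = {}
with open(sys.argv[1], newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        key = row["sku"]
        # ISO-8601 timestamps compare correctly as strings, so no date parsing needed here
        if key not in best or row["updated_at"] > best[key]["updated_at"]:
            best[key] = row

fields = ["sku", "region", "price", "currency", "updated_at"]
writer = csv.DictWriter(sys.stdout, fieldnames=fields, extrasaction="ignore")
writer.writeheader()
for row in sorted(best.values(), key=lambda r: r["sku"]):   # stable output order is part of determinism
    writer.writerow(row)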
Task 15: Verify row counts across stages to catch silent truncation
cr0x@server:~$ wc -l /var/pipeline/raw/pricebook_export.csv /var/pipeline/stage/pricebook_export.csv
50001 /var/pipeline/raw/pricebook_export.csv
49980 /var/pipeline/stage/pricebook_export.csv
99981 total
What it means: You lost 21 rows during staging. That’s rarely “fine.” It’s usually parse errors, embedded newlines, or bad quoting.
Decision: Fail the pipeline and surface the rejected rows. Silent drops are how you end up with customer-impacting “partial truth.”
Task 16: Look for embedded newlines that break naive CSV parsing
cr0x@server:~$ awk 'BEGIN{FS=","} { if (NF<4) bad++ } END{print "rows_with_too_few_fields=" bad+0}' /var/pipeline/raw/pricebook_export.csv
rows_with_too_few_fields=21
What it means: Exactly the number of missing rows. Likely multiline fields or broken quoting.
Decision: Use a real CSV parser in staging (Python csv, Go encoding/csv) and enforce quoting rules in the template.
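The practical difference between “split on commas” and a real parser is exactly the multiline-field case above. A staging-side sketch using Python’s csv module (the expected field count and paths are assumptions):

# stage_parse.py -- parse with a real CSV parser and surface rejects instead of dropping them
import csv
import sys

EXPECTED_FIELDS = 4                      # sku, region, price, currency -- per the declared contract

ok, rejects = 0, []
with open(sys.argv[1], newline="", encoding="utf-8") as f:
    reader = csv.reader(f)               # understands quoting, escaped quotes, embedded newlines
    next(reader)                         # header
    for recno, row in enumerate(reader, start=2):
        if len(row) != EXPECTED_FIELDS:
            rejects.append((recno, row))
        else:
            ok += 1

print(f"ok={ok} rejected={len(rejects)}")
for recno, row in rejects[:10]:          # show the evidence, don't hide it
    print(f"reject record {recno}: {row!r}", file=sys.stderr)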
Task 17: Check database load errors (because the database is often the first honest system)
cr0x@server:~$ sudo -u postgres psql -d pricing -c "select now() as ts, count(*) as rows, min(updated_at), max(updated_at) from pricebook;"
ts | rows | min | max
-----------------------------+--------+---------------------------+---------------------------
2026-01-22 08:05:31.182+00 | 49892 | 2026-01-21 00:02:11+00 | 2026-01-22 07:59:58+00
What it means: Loaded row count doesn’t match raw export line count. Something got filtered, rejected, or deduplicated.
Decision: If you can’t explain the delta, stop using the table for pricing decisions until reconciliation passes.
Task 18: Find pipeline errors in logs around the incident window
cr0x@server:~$ journalctl -u pricebook-import --since "2026-01-22 06:00" --until "2026-01-22 08:10" | tail -n 20
Jan 22 07:58:12 server pricebook-import[19244]: parsed rows=50000 rejected=21 reason="unclosed quote"
Jan 22 07:58:12 server pricebook-import[19244]: load started table=pricebook
Jan 22 07:58:13 server pricebook-import[19244]: load finished inserted=49892 updated=0
Jan 22 07:58:13 server systemd[1]: pricebook-import.service: Succeeded.
What it means: The system told you exactly what happened. The painful part is that it still “succeeded.”
Decision: Change the service to fail non-zero when rejections > 0 (or > threshold) and page the owner. Partial loads are incidents.
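One way to implement that policy without rewriting the importer is a small gate that reads the summary line (like the rejected=21 above) and converts it into an exit code systemd will treat as failure. A sketch (the threshold and summary format are assumptions):

# reject_gate.py -- turn "rejected=N" into a non-zero exit so the unit actually fails
import re
import sys

THRESHOLD = 0                                    # Tier 0 policy: any reject is an incident

summary = sys.stdin.read()                       # pipe the importer's summary/log line in here
m = re.search(r"rejected=(\d+)", summary)

if m is None:
    print("no rejection count found; failing closed", file=sys.stderr)
    sys.exit(2)
rejected = int(m.group(1))
if rejected > THRESHOLD:
    print(f"rejected={rejected} exceeds threshold={THRESHOLD}", file=sys.stderr)
    sys.exit(1)
print(f"rejected={rejected} within threshold")

Failing closed matters: a missing summary is treated as a failure, not as silence.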
Three mini-stories from the corporate wild
Mini-story 1: The incident caused by a wrong assumption
A retail-ish company had a spreadsheet that acted as the weekly “promotion calendar.” Marketing filled in SKUs,
discount percentages, and start/end dates. A cron job converted the sheet to CSV and loaded it into a service that
applied prices at checkout.
The wrong assumption was clean and almost reasonable: “SKU is numeric.” In the database schema, SKU was stored
as an integer. It had been that way forever. For years, SKUs were mostly numeric, and the few alphanumeric ones were
handled manually because “they’re special anyway.”
Then the company acquired a smaller brand whose SKUs had leading zeros and occasional letters. The spreadsheet
workflow didn’t change. Excel “helped” by trimming leading zeros. The import job “helped” by parsing integers. The
pricing service “helped” by applying discounts to whatever matched.
The blast radius wasn’t immediate because some SKUs still matched. The incident arrived as a slow-burn: customer
complaints about wrong prices and a finance team seeing margin erosion on a subset of items. The worst part was that
all the numbers looked plausible. They were just wrong in a pattern.
The fix was not heroic: change SKU types to text end-to-end, add a contract validator on upload, and make the import
fail when a SKU doesn’t match the expected regex. Once the pipeline started rejecting bad files loudly, the business
adapted within a week. People will comply when the system is consistent and the failure is immediate.
Mini-story 2: The optimization that backfired
In a services company, finance owned a large workbook that calculated monthly commissions. It pulled CRM exports,
applied rules, and produced payout files. The workbook was slow. Like “open it, go make coffee, come back, still
calculating” slow.
An analyst did what smart people do under pressure: optimized. They turned calculation to manual, replaced lookups
with pasted values, and disabled refresh on open. The file became snappy. People applauded. The monthly close got
faster.
The backfire came two months later when a new commission rule rolled out mid-month. The CRM export changed slightly,
adding a new field and shifting a column in the raw tab. In a healthy system, schema drift would trigger a failure.
In this workbook, the raw data was pasted as values, calculation was manual, and the lookups were based on
positional ranges that no longer aligned.
The workbook produced payouts that were internally consistent but wrong. It was a perfect failure: fast, quiet,
and confidently incorrect. The first signal was HR getting a handful of angry emails from people who are usually
very polite until money is involved.
They fixed it by un-optimizing in the right direction: move transformations to a controlled staging job, store raw
exports immutably, add tests on row counts and key uniqueness, and generate payout outputs in a deterministic build.
Performance improved again, but this time without changing semantics. The lesson: speed hacks that remove
recomputation remove correctness checks. You didn’t optimize; you disabled brakes.
Mini-story 3: The boring but correct practice that saved the day
A logistics team relied on an “inbound schedule” spreadsheet shared with a vendor. It drove staffing, dock
assignments, and purchase order receipts. It was business-critical, and everyone knew it—even if nobody admitted it
in architecture reviews.
One engineer insisted on a tedious practice: every day’s inbound spreadsheet export was archived to an immutable
folder with a timestamp, plus a checksum file. The import job logged row counts, rejected rows, and a summary of key
fields. There was a simple dashboard: “last successful import,” “rows today vs yesterday,” and “duplicate PO count.”
The day the vendor accidentally sent a sheet with a duplicated header row in the middle of the data (copy/paste
gone wild), the import job rejected it and paged the on-call for the pipeline owner. Operations was annoyed for ten
minutes.
The alternative timeline is obvious: half the inbound schedule would have been mis-read, the warehouse would have
staffed incorrectly, and the “outage” would have manifested as chaos rather than an alert. Instead, the team pulled
the previous day’s archived artifact, diffed it, pinpointed the format change, and asked the vendor to resend.
The business impact was “late by an hour,” not “broken for a day.”
Joke #2: The only thing scarier than an untested spreadsheet is a tested spreadsheet whose test failures nobody reads.
Common mistakes: symptom → root cause → fix
This section is intentionally specific. If you can map an incident to one of these patterns, you can stop arguing
and start repairing.
1) Symptom: totals are off by a small percentage
- Root cause: Hidden filters in a pivot table or table view; someone saved the file with a filter applied.
- Fix: Ban filtered pivots as sources. Export from raw tables only. Add a check that the export row count matches the raw source row count.
2) Symptom: IDs don’t match between systems
- Root cause: Leading zeros trimmed; IDs interpreted as numbers; scientific notation applied to long identifiers.
- Fix: Treat identifiers as strings end-to-end. Add regex validation (length, allowed characters). Reject files that violate it.
3) Symptom: “Yesterday’s data” appears today or vice versa
- Root cause: Time zone mismatch, local time parsing, Excel date serial conversion, or midnight boundary logic.
- Fix: Standardize to ISO timestamps in UTC in pipeline artifacts. Convert at the UI edge, not in storage. Add boundary tests around DST changes.
4) Symptom: pipeline runs “succeed” but downstream is missing rows
- Root cause: Import job drops bad rows silently (parse errors, unclosed quotes, multiline fields).
- Fix: Fail the job if rejects > 0 (or above a small threshold with explicit policy). Output a rejects file and alert on-call.
5) Symptom: numbers change when you open the same file twice
- Root cause: Volatile functions, external connections, cached refresh results, or macros that mutate data on open.
- Fix: Remove volatility from production artifacts. Freeze inputs and generate outputs via a deterministic exporter (containerized). Log exact artifact hashes.
6) Symptom: “It works on my laptop” but fails on the server
- Root cause: Locale differences (decimal separators), font/format dependencies, Excel version differences, or add-in availability.
- Fix: Run conversion in a controlled environment with pinned versions. Store environment metadata with runs. Normalize locale explicitly during parsing.
7) Symptom: intermittent wrong results after “minor” edits
- Root cause: Named ranges shifted, formulas copied with relative references, or accidental range truncation.
- Fix: Use structured tables with explicit columns. Add automated checks that expected ranges contain no blanks where keys must exist.
8) Symptom: compliance asks “who approved this change?” and nobody knows
- Root cause: Email-based distribution; no change history; no access control; macros treated like documents not code.
- Fix: Put critical spreadsheets in a managed repository, restrict editors, require peer review for logic changes, and keep audit logs.
Checklists / step-by-step plan (what to do Monday)
Step 1: Classify spreadsheets by blast radius
- Tier 0 (do-not-break): Affects money movement, pricing, customer entitlements, compliance reporting, inventory fulfillment.
- Tier 1 (important): Affects planning, forecasting, staffing, executive dashboards.
- Tier 2 (local): Personal productivity and ad-hoc analysis.
Only Tier 0 and Tier 1 need governance. Tier 2 is where you let people be fast.
Step 2: Assign a real owner and an on-call path
- Name an owner team for each Tier 0 spreadsheet-driven workflow.
- Define what “broken” means (SLO-style): timeliness, correctness, and completeness.
- Decide who gets paged when the pipeline rejects input or sees anomalies.
Step 3: Turn the spreadsheet into an artifact, not a live pet
- Store the input file immutably per run (timestamped, checksum).
- Export to a normalized format (CSV with strict rules, or Parquet) in a controlled environment.
- Store the exported artifact and treat it as the source for downstream steps.
Step 4: Add contract validation
- Header validation (required columns, no unexpected columns unless allowed).
- Type validation (IDs as strings, decimals as decimals).
- Uniqueness constraints (no duplicate keys).
- Completeness checks (no nulls in required fields).
- Range checks (discount between 0 and 1, quantities non-negative, dates within expected windows).
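A minimal validation gate covering most of this list (the header comes from the inbound example earlier; the regex, required columns, and limits are placeholders to adapt, not a real contract):

# validate_contract.py -- minimal sketch of a contract gate for an exported CSV
import csv
import re
import sys
from decimal import Decimal, InvalidOperation

REQUIRED = ["sku", "warehouse", "qty", "arrival_date"]
SKU_RE = re.compile(r"^[0-9A-Z]{6,12}$")

errors, seen = [], set()
with open(sys.argv[1], newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        errors.append(f"missing columns: {missing}")
    else:
        for n, row in enumerate(reader, start=2):
            sku = (row["sku"] or "").strip()
            if not SKU_RE.match(sku):
                errors.append(f"row {n}: bad sku {sku!r}")
            if sku in seen:
                errors.append(f"row {n}: duplicate sku {sku!r}")
            seen.add(sku)
            try:
                if Decimal(row["qty"] or "") < 0:
                    errors.append(f"row {n}: negative qty")
            except InvalidOperation:
                errors.append(f"row {n}: qty not a number: {row['qty']!r}")

for e in errors[:20]:
    print(e, file=sys.stderr)
sys.exit(1 if errors else 0)             # a loudly rejected file is a success, not a failure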
Step 5: Implement change control that matches risk
- Tier 0: logic changes require peer review and a recorded approval (ticket, pull request, or equivalent).
- Tier 1: at minimum, keep version history and editor logs; require a second set of eyes for structural changes.
- Tier 2: no bureaucracy. Just education and templates.
Step 6: Build observability for spreadsheet pipelines
- Metrics: rows ingested, rows rejected, duplicates found, null rates, runtime, file hash.
- Dashboards: last success time, today vs yesterday deltas, top rejection reasons.
- Alerts: rejects > 0 for Tier 0, large deltas, schema drift, missed schedules.
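The metrics don’t need a fancy agent to start with; one structured line per run is enough to build the dashboard and the alerts on. A sketch (field names are illustrative; reject and duplicate counts would come from the validation step):

# run_metrics.py -- emit one JSON line per pipeline run
import csv
import hashlib
import json
import sys
import time

path = sys.argv[1]

sha = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha.update(chunk)

with open(path, newline="", encoding="utf-8") as f:
    rows = sum(1 for _ in csv.reader(f)) - 1          # minus header

print(json.dumps({
    "ts": int(time.time()),
    "artifact": path,
    "sha256": sha.hexdigest(),
    "rows": rows,
}))

Ship that line wherever your logs already go; “rows today vs yesterday” and “last successful run” fall out of it almost for free.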
Step 7: Decide what to migrate, and what to harden
Not everything needs a grand rewrite. Some spreadsheets should be migrated to proper services or databases; others
should be hardened and left alone because they’re stable and the business value is in the UI.
- Migrate when: multiple teams depend on it, logic is complex, changes are frequent, or it feeds automated actions.
- Harden when: the spreadsheet is mainly a data-entry UI but outputs can be validated and exported deterministically.
One operations quote (because reliability is a mindset)
“Hope is not a strategy.” —General Gordon R. Sullivan
You can run a business on Excel for a long time. You just can’t run it on hope that the next edit won’t matter.
FAQ
1) Should we ban Excel for anything operational?
No. Bans create shadow systems. Instead, classify spreadsheets by risk and apply controls to Tier 0/Tier 1 workflows.
Let low-risk work stay fast.
2) When does a spreadsheet become “production”?
When it triggers automated actions (pricing, payouts, inventory), becomes a system of record, or is required for a
recurring business process with deadlines. If an outage bridge would mention it, it’s production.
3) What’s the single highest-value control to add first?
Immutable archiving plus validation. If you can reproduce what ran (archived artifact + hash) and you can reject bad
inputs loudly, you cut both incident frequency and mean time to recovery.
4) Why do “small” spreadsheet errors create massive business impact?
Because spreadsheets often sit at decision points: money movement, eligibility rules, pricing, and allocations.
A one-cell error can multiply across thousands of rows and become policy.
5) Are macros always a problem?
Macros are code. Code can be fine. The problem is running code without code controls: review, versioning, restricted
edit access, and reproducible execution. Treat macros like software, not like formatting.
6) How do we handle vendors who insist on “upload this Excel template”?
You don’t have to win a philosophical battle. You need a contract at the boundary: export their file to a normalized
format in a controlled step, validate it, and reject it when it violates the contract. Vendors learn quickly when
the feedback loop is immediate.
7) What’s the safest file format for pipeline ingestion?
For human exchange, XLSX is unavoidable. For machine ingestion, prefer a format with explicit schema and types
(Parquet/Avro) or a strictly validated CSV. The key is not the format; it’s enforcing a contract.
8) How do we convince leadership this is worth doing?
Don’t sell “governance.” Sell reduced payout mistakes, fewer pricing incidents, faster closes, cleaner audits, and
shorter incident bridges. Tie it to risk and recovery time, not aesthetics.
9) What if our “spreadsheet system” is too big to migrate?
Then harden first: immutable inputs, deterministic exports, validation gates, and observability. Migration can be a
gradual extraction of the highest-risk logic, not a big-bang rewrite.
10) How do we avoid breaking analysts’ workflows?
Don’t fight the UI. Keep Excel as the interface where it helps, but move correctness-critical transformation and
storage into controlled steps. Analysts get speed; the business gets reliability.
Conclusion: ship fewer spreadsheets, ship more certainty
Excel isn’t the enemy. Uncontrolled logic is. The spreadsheet that breaks your business is almost always doing one of
three things: acting like a database, acting like an application, or acting like a message bus. It is rarely good at
any of those roles at scale.
Practical next steps:
- Inventory Tier 0 and Tier 1 spreadsheets by business impact.
- Make every run reproducible: archive inputs, compute hashes, keep exported artifacts.
- Add validation that fails loudly on schema drift, rejects, duplicates, and type violations.
- Instrument the pipeline: row counts, rejects, deltas, and “last good run.”
- Migrate the most dangerous logic out of spreadsheets first: payouts, pricing, entitlements, and inventory rules.
Run spreadsheets like you run systems: assume change, assume failure, and build guardrails that turn “we got lucky”
into “we were prepared.”