You inherit a system held together by cron jobs, tribal knowledge, and a database schema that looks like it was designed during a fire drill. Someone says the magic words: “Let’s rewrite it from scratch.” Heads nod. Roadmaps get refreshed. A new repo appears like a fresh notebook on January 1st.
Then production happens. The rewrite doesn’t know your customers, your edge cases, your operational constraints, or your data gravity. And your on-call rotation definitely didn’t sign up for “two systems, both broken, forever.”
The lie: why “rewrite from scratch” feels true
Rewrites sell hope. They offer a clean break from accumulated mess: no more legacy frameworks, no more “temporary” hacks from 2017, no more untestable modules, no more of that one stored procedure everyone is scared to touch. The pitch is emotionally correct. The problem is that production systems don’t run on emotions. They run on invariants.
A rewrite from scratch is usually sold as a technical project. It is actually an organizational bet: that you can rebuild not just code, but behavior, data semantics, operational posture, and failure handling—while the old system continues to evolve under real customer load.
Here’s the part people omit in the rewrite pitch deck: the old system is a fossil record of real incidents. It contains the scar tissue of outages, fraud attempts, weird client devices, partial failures, and regulatory surprises. That scar tissue is ugly. It is also valuable.
Rewrites ignore that the “requirements” are not in the ticket system. They are in production graphs, in on-call notes, and in the silent assumptions that keep the lights on. When you rewrite, you delete those assumptions—then rediscover them at 2:13 a.m.
Joke #1: A rewrite plan is like buying a new treadmill to get fit. The purchase feels productive; the running part is where reality shows up.
Why rewrites fail in production (the real reasons)
1) Feature parity is a trap, not a milestone
Teams treat “feature parity” as a checklist. In practice, the old system doesn’t have features; it has behaviors. Behaviors include undocumented defaults, timing quirks, idempotency expectations, retry semantics, and data correction workflows that happen outside the happy path.
When a rewrite aims at parity, it targets the visible UI/API surface and misses the messy parts that matter: how the system behaves when a payment gateway times out, when a downstream is slow, when a client retries a POST, when clocks skew, when you have to reprocess a day of events.
2) Data is the product, and data is heavy
Most systems are data systems wearing a UI. A rewrite that doesn’t start with data semantics—what records mean, how they change over time, what’s allowed to be eventually consistent—will drift into a “new database schema that feels nicer” and then crash into reality at cutover.
Data migration is not a weekend project. It’s a sustained reliability exercise with backfills, dual-writes (or change data capture), reconciliation, and rollback plans. If your rewrite plan does not include months of running both data paths, you’re not planning a cutover—you’re planning a coin toss.
3) The rewrite creates a split-brain organization
Two codebases means two priorities, two bug queues, two operational models, and one shared customer base that expects the same service. Usually the senior people get pulled into the rewrite, leaving the legacy system with reduced capacity and increasing risk. Then an incident happens in the legacy system, and the rewrite schedule gets raided to respond. The rewrite slows. The legacy decays. Everyone loses.
4) Operational readiness is not “we have Kubernetes now”
Modern stacks can make things worse when they’re used as credibility props. Swapping one set of failure modes for another isn’t progress; it’s just new ways to page people.
Operational readiness is about: well-defined SLOs, instrumentation, alert quality, controlled rollouts, capacity modeling, dependency management, runbooks, and a culture that can sustain ongoing change. If the rewrite team can’t run the old system well, it won’t run the new system well either—just with shinier YAML.
5) Performance is an emergent property, and you can’t unit-test it into existence
The old system has performance hacks that came from battle: caching at strange layers, denormalized tables, precomputed aggregates, carefully placed indexes, request coalescing, and “don’t do that on the hot path” rules. The rewrite often starts clean, then performance regressions show up under production-like load. Then you bolt on caches and queues and background jobs, eventually rebuilding the same complexity—without the institutional memory.
6) The new system is correct in the small and wrong in the large
Code reviews catch local issues. They don’t catch system-level behavior under partial failures. Rewrites fail because they model the world as reliable and consistent. Production is neither.
7) Security and compliance aren’t “later”
Rewrites frequently postpone security controls, audit trails, retention rules, and least-privilege access. Then you discover that the legacy system’s “weird” logging and access patterns were there because an auditor once asked a very specific question. You either scramble or you delay cutover. Both are expensive.
Facts & history: the industry has been here before
- 1980s–1990s: Large organizations repeatedly attempted “CASE tool” driven rewrites of legacy mainframe systems; many collapsed under scope and data migration complexity.
- The Year 2000 (Y2K) effort taught enterprises a brutal lesson: replacing everything is rarely feasible; remediation and risk-based triage often win.
- The “big bang” ERP rollout era showed a pattern: cutovers fail when business processes aren’t mapped to real workflows and exceptions.
- The rise of service-oriented architecture (SOA) promised modularity; many projects delivered distributed monoliths with more latency and harder debugging.
- Microservices popularity (mid-2010s) increased the temptation to rewrite, but also increased the cost of operational maturity: tracing, dependency mapping, and failure containment became mandatory.
- Change data capture (CDC) tooling matured and made incremental migrations more practical, shifting the economics away from big-bang rewrites.
- Cloud elasticity reduced some capacity risks, but introduced new ones: noisy neighbors, service quotas, and billing-driven incidents.
- Observability as a discipline (metrics, logs, traces) became mainstream; it exposed that many “legacy” outages were actually dependency and capacity issues.
Three mini-stories from the corporate trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized SaaS company rewrote its billing service in a new language to “make it maintainable.” The team did a careful job on unit tests and the API contract. They built a clean database schema and shipped behind a feature flag.
The wrong assumption was subtle: they assumed all clients would treat POST /charge as non-idempotent and would never retry automatically. The legacy system had quietly implemented idempotency using a client-supplied token, because years earlier a mobile client had been retrying requests on flaky networks.
The rewrite didn’t implement that. Under a routine network jitter event between regions, a subset of clients retried charges. The new service dutifully created multiple charges. Support lit up. Finance got involved. Engineers got paged for a “data correctness incident,” which is the kind that doesn’t stop hurting when the graphs go green.
The fix was not “add more tests.” The fix was to treat idempotency and retries as first-class requirements, document them as invariants, and build reconciliation tooling. They also added a canary that simulated retries and verified the ledger stayed stable.
The moral: if you don’t explicitly model client behavior, the network will do it for you.
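The idempotency pattern the legacy system had quietly implemented can be sketched in a few lines. This is a minimal illustration, not the company's actual code: the in-memory dict stands in for a durable store, and all names (`ChargeService`, `charge`) are hypothetical.

```python
import uuid

class ChargeService:
    """Minimal sketch: dedupe charges by a client-supplied idempotency key.

    A real service would persist keys durably (ideally in the same
    transaction as the charge) and expire them; the dict here is
    illustrative only.
    """
    def __init__(self):
        self._seen = {}  # idempotency_key -> charge record

    def charge(self, idempotency_key, amount_cents):
        # A retried request with the same key returns the original result
        # instead of creating a second charge.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        record = {"charge_id": str(uuid.uuid4()), "amount_cents": amount_cents}
        self._seen[idempotency_key] = record
        return record

svc = ChargeService()
first = svc.charge("key-123", 500)
retry = svc.charge("key-123", 500)  # flaky network: client retries the POST
assert first["charge_id"] == retry["charge_id"]  # exactly one charge exists
```

The point is that the key comes from the client, so the server can recognize a retry of the same logical operation rather than guessing from timing.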
Mini-story 2: The optimization that backfired
An enterprise internal platform team rewrote a reporting pipeline to reduce costs. They replaced a database-driven aggregation job with a streaming pipeline and aggressively tuned batching to minimize CPU and storage.
The optimization looked great in synthetic benchmarks. Production was different. Their batching increased end-to-end latency and created bursty load on downstream services. A “cheap” pipeline turned into a thundering herd generator. The downstream rate-limited. Retries piled up. The streaming system’s internal buffers grew, and once they hit their limits, the pipeline began shedding (dropping) messages under sustained load.
Now the system had two problems instead of one: reports were late, and some were wrong. The team spent weeks building compensating controls: dead-letter queues, replay tooling, and a “late data” correction process. Costs went up, not down, because operational overhead is also a cost—just paid in human attention.
They eventually backed off the batching, accepted a higher steady-state compute cost, and put strict SLOs around freshness and correctness. The “optimization” had optimized the wrong thing: the bill, not the product.
Mini-story 3: The boring but correct practice that saved the day
A company modernizing an authentication stack resisted the rewrite temptation and did something painfully unglamorous: they built a compatibility test suite based on real production traffic. Not just unit tests—recorded requests, edge-case tokens, and representative failure modes.
They deployed the new service as a shadow reader first. It validated tokens and computed decisions but did not enforce them. For weeks it compared its outputs to the legacy system’s decisions and logged mismatches with enough context to debug.
They found a long tail of oddities: clock skew tolerance, an older signing algorithm still in use by one customer, and a specific error code that a partner depended on. None of this was in the spec. All of it mattered.
When they finally cut traffic over, the launch was almost boring. On-call got a few pages—mostly due to dashboards being too sensitive—and then things stabilized. The “boring practice” was treating migration as an evidence-gathering exercise, not a heroic leap.
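The shadow-reader pattern from this story fits in one function. This is a minimal sketch under assumptions, not the team's implementation: `legacy_decide` and `new_decide` are hypothetical callables standing in for the two auth stacks.

```python
import logging

log = logging.getLogger("shadow")

def shadow_compare(request, legacy_decide, new_decide):
    """Serve the legacy decision; run the new system in shadow and log
    mismatches with enough context to debug. Names are illustrative.
    """
    legacy = legacy_decide(request)
    try:
        candidate = new_decide(request)
        if candidate != legacy:
            log.warning("mismatch request=%r legacy=%r new=%r",
                        request, legacy, candidate)
    except Exception:
        # A crash in the shadow path must never affect real traffic.
        log.exception("shadow path failed for request=%r", request)
    return legacy  # the legacy system remains authoritative
```

The two design points worth copying: the shadow path is wrapped so its failures cannot leak into user-facing responses, and mismatches are logged with the full request context, because a mismatch count without context is not debuggable.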
What actually works: patterns that survive contact with reality
Start with invariants, not architecture
Before you debate frameworks, write down the invariants. The things that must remain true even when dependencies fail:
- Idempotency rules: which operations can be safely retried, and how.
- Data correctness constraints: what “cannot happen” (double charge, negative balance, lost audit record).
- Latency budgets and availability targets: what the user will tolerate.
- Consistency requirements: where eventual consistency is acceptable and where it isn’t.
- Rollback requirements: what it means to undo a deploy, a migration, a backfill.
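Invariants earn their keep when they are executable, not just written down. A hedged sketch, with a hypothetical billing-ledger shape (`idempotency_key`, `amount_cents` are illustrative field names): the same checks can run in unit tests, reconciliation jobs, and canaries.

```python
def check_ledger_invariants(charges, refunds):
    """Hypothetical executable invariants for a billing ledger.

    Each check encodes a 'cannot happen' from the list above. Run these
    continuously (reconciliation jobs, canaries), not just at build time.
    """
    errors = []

    # No double charge: an idempotency key maps to at most one charge.
    keys = [c["idempotency_key"] for c in charges]
    if len(keys) != len(set(keys)):
        errors.append("duplicate idempotency key => possible double charge")

    # No negative balance: total refunds never exceed total charges.
    charged = sum(c["amount_cents"] for c in charges)
    refunded = sum(r["amount_cents"] for r in refunds)
    if refunded > charged:
        errors.append(f"refunded {refunded} exceeds charged {charged}")

    return errors

# A healthy ledger produces no errors.
assert check_ledger_invariants(
    [{"idempotency_key": "a", "amount_cents": 500}], []) == []
```

If a rewrite cannot pass the legacy system's invariant checks against real data, it is not ready, no matter how clean the code looks.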
Use the strangler fig pattern (and mean it)
The strangler fig pattern works because it respects that production systems are living ecosystems. You don’t replace the tree in one day; you grow a new system around it and gradually move responsibilities over.
Practically, this means:
- Put a routing layer (API gateway, reverse proxy, or service mesh ingress) in front of the old system.
- Move one endpoint, one workflow, or one domain slice at a time.
- Keep a fast rollback: route traffic back immediately.
- Use shadow reads and comparison when possible.
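The routing-layer decision above is usually a few lines, whatever the gateway. A sketch of the core logic, assuming percentage-based rollout per endpoint (function and parameter names are hypothetical); sticky hashing keeps each user on one system, and rollback is setting a percentage back to zero.

```python
import hashlib

def route_to_new_system(endpoint, user_id, rollout):
    """Sketch of a strangler-fig routing decision.

    `rollout` maps endpoint -> percentage of traffic sent to the new
    system. Hashing the user id keeps each user's experience sticky
    across requests. All names are illustrative.
    """
    pct = rollout.get(endpoint, 0)  # unknown endpoints stay on legacy
    if pct <= 0:
        return False
    # Stable 0-99 bucket per user; compare against the rollout percentage.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < pct

rollout = {"/v1/invoices": 5}  # 5% canary on one vertical slice
assert route_to_new_system("/v1/charges", "u1", rollout) is False  # untouched
```

Setting `rollout["/v1/invoices"] = 0` is the "fast rollback: route traffic back immediately" from the list above, which is why the percentage should live in dynamic config, not in a deploy.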
Prefer “replace behind an interface” over “rewrite everything”
When a subsystem is truly rotten, replace it behind a stable interface. Keep the contract. Keep the metrics. Keep the operational runbooks. Change the internals. This reduces blast radius and keeps teams from rebuilding the entire world just to fix one wall.
Data migration: dual-write or CDC, plus reconciliation
Choose your poison carefully:
- Dual-write: application writes to old and new stores. Simpler conceptually, harder to make correct under partial failure.
- CDC: treat the old database as the source of truth and stream changes to the new store. Often more robust, but requires careful ordering and schema evolution discipline.
Regardless, you need reconciliation: periodic jobs that compare counts, checksums, and invariants between old and new. Without reconciliation you are operating on vibes.
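One cheap reconciliation primitive is an order-independent fingerprint over (id, version) pairs from each store: compare fingerprints first, and drill into id-level diffs only when they disagree. A minimal sketch; the row shape is illustrative.

```python
import hashlib

def table_fingerprint(rows):
    """Order-independent fingerprint of (id, version) pairs.

    XOR of per-row hashes, so row order and batching don't matter.
    Compare old-store vs new-store fingerprints; equal fingerprints mean
    no drift (with overwhelming probability), unequal means investigate.
    """
    digest = 0
    for row_id, version in rows:
        h = hashlib.sha256(f"{row_id}:{version}".encode()).hexdigest()
        digest ^= int(h, 16)  # XOR is commutative: order-independent
    return digest

old = [(1, "v3"), (2, "v1"), (3, "v7")]
new = [(3, "v7"), (1, "v3"), (2, "v1")]      # same data, different order
assert table_fingerprint(old) == table_fingerprint(new)
drifted = [(1, "v3"), (2, "v2"), (3, "v7")]  # row 2 diverged
assert table_fingerprint(old) != table_fingerprint(drifted)
```

Run it per table (or per time-bucket of rows) on a schedule, and alert on any mismatch: that alert is the difference between operating on evidence and operating on vibes.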
Build operational parity before feature parity
Operational parity is the ability to run, debug, and recover. It includes:
- Dashboards that show saturation, errors, latency, and dependency health.
- Alerting that is actionable (pages on symptoms, not noise).
- Runbooks that assume partial failures and include rollbacks.
- Load testing that matches production shape, not just volume.
One quote you should tape to your monitor
Hope is not a strategy.
— traditional SRE saying
Operations isn’t cynical; it’s allergic to magical thinking. Plan for the failure modes you will definitely have.
Keep the old system healthy while you migrate
This is where leadership needs to grow up. If you starve the legacy system while building the replacement, the legacy will collapse and consume the replacement team. Allocate explicit capacity for legacy reliability work during the migration. Treat it as risk reduction, not “wasted effort.”
Joke #2: The only thing worse than one brittle system is two brittle systems that disagree about whose fault it is.
Practical tasks: commands, outputs, what it means, and what you decide
These are the tasks you actually run when someone says “the rewrite will fix performance/reliability.” You don’t argue. You measure. Each task below includes a realistic command, typical output, what that output means, and the decision you make from it.
1) Identify CPU saturation vs. latency complaints
cr0x@server:~$ uptime
14:22:01 up 37 days, 4:11, 2 users, load average: 18.42, 17.96, 16.88
What it means: A load average far above the core count (check with nproc) suggests a runnable-queue backlog. Note that on Linux, load average also counts tasks in uninterruptible I/O sleep, so high load can mean CPU contention, I/O stalls, or both.
Decision: Don’t start a rewrite because “it’s slow.” First determine if you’re CPU-bound, I/O-bound, or lock-bound. Next: check CPU breakdown and run queue.
2) Check per-CPU utilization and iowait
cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.5.0 (prod-app-01) 02/04/2026 _x86_64_ (16 CPU)
22:22:10 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
22:22:11 all 62.11 0.00 11.83 18.44 0.00 0.52 0.00 0.00 0.00 7.10
22:22:11 0 71.00 0.00 12.00 10.00 0.00 0.00 0.00 0.00 0.00 7.00
What it means: High %iowait suggests the CPU is waiting on disk/network storage. High %usr suggests compute-bound. Here it’s mixed: CPU is busy and waiting on I/O.
Decision: Investigate storage and database latency before rewriting application logic. A rewrite won’t change your disks.
3) Check memory pressure and swapping
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 62Gi 54Gi 1.1Gi 1.8Gi 6.9Gi 4.2Gi
Swap: 8.0Gi 2.7Gi 5.3Gi
What it means: Swap use is non-trivial. If the system is actively swapping, tail latency will spike.
Decision: Before rewriting, fix memory sizing, leaks, or container limits. If you can’t run the current service without swapping, the new one will probably do it too—just faster.
4) Check active swapping and major faults
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
8 2 2795520 1187328 9120 6852140 0 64 1220 1780 8210 9230 61 11 7 21 0
7 1 2795584 1169000 9120 6854100 0 128 1100 1650 8030 9012 60 12 8 20 0
What it means: so (swap out) indicates active swapping. wa is also high, consistent with I/O wait.
Decision: Treat this as an ops incident, not a roadmap opportunity. Reduce memory footprint, fix noisy neighbors, or scale. Rewriting won’t cure swapping.
5) Find top CPU consumers
cr0x@server:~$ ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head
PID COMMAND %CPU %MEM
8123 java 345.2 18.4
9001 redis-server 72.1 3.2
7442 nginx 38.0 0.8
What it means: A single process consuming multiple cores might be expected, but confirm it aligns with throughput. If CPU is high and throughput is low, you’re spinning or lock-bound.
Decision: Profile the hot process and check thread contention. If the rewrite pitch is “new language will be faster,” demand evidence with flame graphs first.
6) Identify disk bottlenecks quickly
cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0 (prod-db-01) 02/04/2026 _x86_64_ (16 CPU)
Device r/s w/s rMB/s wMB/s await svctm %util
nvme0n1 220.0 410.0 35.2 88.1 18.40 0.90 92.50
What it means: High %util with rising await implies the device is saturated or requests are queueing. Low svctm alongside high await points to queue depth/latency rather than raw device slowness (treat svctm skeptically: it is deprecated in modern iostat).
Decision: You need query tuning, index changes, or IO distribution. A rewrite that keeps the same access patterns will hit the same wall.
7) Check filesystem capacity and inode exhaustion
cr0x@server:~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 900G 855G 45G 96% /
cr0x@server:~$ df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/nvme0n1p2 5900000 5892000 8000 100% /
What it means: Disk nearly full is bad; inode exhaustion is sneakier and can break deployments, logging, and temp files.
Decision: Stop. Clean up. Add retention policies. If your rewrite is “because deploys are failing,” and the reason is inodes, you don’t need a new codebase—you need housekeeping.
8) Verify network errors and retransmits
cr0x@server:~$ netstat -s | egrep -i 'retrans|listen|listenoverflows|packet receive errors' | head
12455 segments retransmitted
37 packet receive errors
What it means: Retransmits and receive errors can produce “random latency.” Your app may be innocent.
Decision: Investigate NIC, MTU mismatches, overloaded load balancers, or cross-AZ issues before rewriting the service layer.
9) Inspect TCP connection states (leaks or slow clients)
cr0x@server:~$ ss -s
Total: 14021
TCP: 10234 (estab 812, closed 9132, orphaned 5, timewait 7210)
Transport Total IP IPv6
RAW 0 0 0
UDP 29 25 4
TCP 1102 1011 91
INET 1131 1036 95
FRAG 0 0 0
What it means: Excessive timewait can indicate short-lived connections without keep-alives, or aggressive client retry behavior.
Decision: Tune connection reuse, load balancer settings, and client behavior. A rewrite won’t change TCP physics.
10) Check container throttling (Kubernetes CPU limits bite)
cr0x@server:~$ kubectl -n payments top pods | head
NAME CPU(cores) MEMORY(bytes)
payments-api-6d8d6c6b6c-2qz7m 980m 740Mi
payments-api-6d8d6c6b6c-pk9h4 995m 755Mi
cr0x@server:~$ kubectl -n payments describe pod payments-api-6d8d6c6b6c-2qz7m | egrep -i 'Limits|Requests|throttl' -n | head -n 20
118: Limits:
119: cpu: 1
120: memory: 1Gi
What it means: Pods pegged at the CPU limit likely experience throttling, causing latency spikes that look like “the new service is slower.”
Decision: Revisit CPU limits/requests and HPA policies. If you rewrite onto Kubernetes without understanding throttling, you’ve just moved the problem into YAML.
11) Identify database lock contention (a classic rewrite blind spot)
cr0x@server:~$ psql -U postgres -d appdb -c "select pid, wait_event_type, wait_event, state, query from pg_stat_activity where wait_event_type is not null order by pid limit 5;"
pid | wait_event_type | wait_event | state | query
------+-----------------+----------------+--------+------------------------------------------
4142 | Lock | transactionid | active | UPDATE invoices SET status='paid' ...
4221 | Lock | relation | active | ALTER TABLE ledger ADD COLUMN ...
What it means: Requests are blocked on locks. Performance problems may be due to migration DDL, not application code quality.
Decision: Schedule heavy migrations, reduce lock scopes, use online schema change techniques. Don’t rewrite because “Postgres is slow” while you’re holding locks.
12) Check slow queries and pick the top offenders
cr0x@server:~$ psql -U postgres -d appdb -c "select calls, mean_exec_time, rows, left(query,120) as q from pg_stat_statements order by mean_exec_time desc limit 5;"
calls | mean_exec_time | rows | q
-------+----------------+------+------------------------------------------------------------
412 | 982.14 | 12 | SELECT * FROM orders WHERE customer_id = $1 ORDER BY created_at DESC LIMIT 50
201 | 744.33 | 1 | SELECT balance FROM accounts WHERE id = $1 FOR UPDATE
What it means: You have concrete targets: add indexes, change query shape, reduce locking. This is usually cheaper than rewriting.
Decision: Fix the hot queries first. If you still want a rewrite, at least carry over the query lessons so the new system doesn’t repeat them.
13) Validate replication lag before a cutover
cr0x@server:~$ mysql -e "SHOW SLAVE STATUS\G" | egrep -i 'Seconds_Behind_Master|Slave_IO_Running|Slave_SQL_Running'
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 43
What it means: Replication lag means your “read from new system” might be stale. This can break user expectations during migration.
Decision: Either accept staleness explicitly (and design for it) or don’t cut reads over until lag is consistently low.
14) Detect error rate changes during canary rollout
cr0x@server:~$ kubectl -n payments logs deploy/payments-api --since=5m | egrep -c " 5[0-9][0-9] "
27
What it means: A rising 5xx count after a deploy is a canary failure until proven otherwise.
Decision: Roll back fast, then debug with traces and dependency checks. Don’t “push through” because the rewrite roadmap says you must.
15) Confirm you have enough file descriptors under load
cr0x@server:~$ ulimit -n
1024
cr0x@server:~$ cat /proc/$(pgrep -n nginx)/limits | egrep -i "open files"
Max open files 1024 1024 files
What it means: 1024 FDs is often too low for busy proxies/services. You can get connection failures that look like “the new app is flaky.”
Decision: Raise limits, verify container runtime settings, and retest. Again: fix fundamentals before redesigning the universe.
Fast diagnosis playbook: what to check first/second/third
This is the drill when someone claims “the legacy system is the bottleneck” or “the rewrite will be faster.” You can run this in under an hour on a live incident (carefully) or in a staging environment with production-like load.
First: Is it saturation, errors, or dependency latency?
- Check error rates (5xx/4xx spikes, timeouts). If errors spiked, performance may be a symptom of partial failure.
- Check saturation: CPU iowait, disk await, network retransmits, DB connections, thread pools.
- Check dependency health: DB, cache, message broker, external APIs, DNS, certificate expiration.
Goal: classify the issue: compute-bound, I/O-bound, lock-bound, network-bound, or dependency failure.
Second: Find the tight loop in the request path
- Pick one user-facing endpoint (highest traffic or highest latency).
- Trace it end-to-end (distributed tracing if available; otherwise log correlation IDs).
- Measure time spent in: app CPU, DB query, cache, upstream, serialization, retries.
Goal: identify where the time goes, not where it feels like it goes.
Third: Validate with a controlled experiment
- Make one change (index, cache TTL, connection pool size, CPU limit).
- Canary it to a small slice of traffic.
- Compare: latency percentiles, error rates, saturation metrics.
Goal: evidence-driven decisions. If the rewrite proposal can’t survive this level of scrutiny, it’s a morale project, not an engineering project.
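The "compare latency percentiles" step can be made mechanical. A minimal sketch, with an illustrative threshold (10% over baseline p95); real gates should come from your SLO, and nearest-rank is one of several percentile definitions.

```python
def percentile(samples, p):
    """Nearest-rank percentile; adequate for canary-vs-baseline checks."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

def canary_regressed(baseline_ms, canary_ms, p=95, tolerance=1.10):
    """True if the canary's p95 exceeds the baseline's p95 by more than
    10%. Both the percentile and the tolerance are illustrative: derive
    yours from the SLO, not from vibes.
    """
    return percentile(canary_ms, p) > percentile(baseline_ms, p) * tolerance

baseline = [40, 42, 45, 48, 50, 120]  # ms, toy samples
canary = [41, 44, 46, 49, 52, 180]    # tail got worse
assert canary_regressed(baseline, canary) is True
```

The discipline matters more than the math: one change, a small traffic slice, and a pass/fail rule written down before you look at the graphs.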
Common mistakes: symptoms → root cause → fix
“We rewrote and latency got worse”
Symptoms: p95/p99 latency up, CPU okay, dashboards show more network calls.
Root cause: You decomposed into services without a latency budget, turning in-process calls into RPC chains. You built a distributed monolith.
Fix: Collapse chatty boundaries, batch calls, introduce local caching, and enforce budgets per hop. Prefer coarse-grained APIs over “pure” microservice boundaries.
“Cutover worked, then data drift appeared”
Symptoms: Reports don’t match, balances differ, customers see inconsistent states days later.
Root cause: Dual-write without exactly-once semantics; missing reconciliation; out-of-order events; inconsistent timezone/rounding rules.
Fix: Implement reconciliation jobs and invariants checks; use idempotency keys; define authoritative source per field; adopt CDC with ordering guarantees when possible.
“The new system is stable, but on-call is worse”
Symptoms: More alerts, harder debugging, more unknown unknowns.
Root cause: Observability and runbooks were deferred; alerts are based on raw metrics rather than user-impact signals; no tracing.
Fix: Instrument golden signals (latency, traffic, errors, saturation). Add tracing. Rewrite alerts to be symptom-based and tie to SLOs.
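One common way to make alerts symptom-based and SLO-tied is error-budget burn rate. A hedged sketch, borrowing standard SRE practice; the thresholds you page on (fast burn vs slow burn) are policy choices, not shown here.

```python
def burn_rate(bad_events, total_events, slo_target=0.999):
    """Error-budget burn rate: 1.0 means spending the budget exactly on
    pace for the SLO window. Page on fast burn, ticket on slow burn;
    the 99.9% target here is illustrative.
    """
    if total_events == 0:
        return 0.0
    error_ratio = bad_events / total_events
    budget = 1 - slo_target  # the allowed error ratio
    return error_ratio / budget

# 0.5% errors against a 99.9% SLO burns budget ~5x faster than allowed.
assert abs(burn_rate(5, 1000) - 5.0) < 1e-9
```

An alert on burn rate fires when users are actually being hurt relative to the promise you made them, which is exactly the difference between a symptom-based page and raw-metric noise.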
“We can’t ship because we’re chasing parity forever”
Symptoms: Rewrite project runs for quarters/years, business keeps adding features to legacy, rewrite never catches up.
Root cause: Big-bang mindset; no incremental cutovers; rewrite team isolated from real product priorities.
Fix: Strangler pattern with thin vertical slices. Move one workflow end-to-end. Freeze some legacy features or redirect new features to the new path only.
“We replaced the database schema and everything hurt”
Symptoms: Slow queries, lock contention, migration windows expand, rollbacks risky.
Root cause: Schema redesign ignored access patterns and operational constraints; missing indexes; unbounded migrations.
Fix: Start with query profiling and index strategy. Use online migrations, backfills, and phased constraints. Keep old schema as an adapter layer when needed.
“We rewrote to improve security and introduced new holes”
Symptoms: Missing audit trails, weaker authorization checks, secrets sprawl.
Root cause: Security controls were implicit in legacy and not modeled; new stack shipped without threat modeling.
Fix: Inventory security invariants (authz rules, logging, retention). Add automated checks in CI. Use least privilege and centralized secrets management from day one.
Checklists / step-by-step plan
Decision checklist: should you rewrite at all?
- Can you name the bottleneck? If not, do measurement first (see tasks and diagnosis playbook).
- Is the problem code maintainability or system behavior? If incidents are mostly capacity/dependency, rewriting app code won’t help.
- Is there a stable contract? If the interface is unstable, lock it down before moving internals.
- Is there a data plan? If you can’t articulate dual-write/CDC, reconciliation, and rollback, you’re not ready.
- Do you have ops maturity? Dashboards, alerts, tracing, runbooks, staged rollouts. If no, build that first.
- Can you staff two systems? If not, do incremental replacement, not parallel rewrites.
A safer modernization plan (works even with limited time)
- Inventory invariants: idempotency, correctness rules, retention, authz, error codes, rate limits.
- Instrument the legacy system if it’s blind: add request IDs, latency histograms, error taxonomies.
- Put a routing layer in front: gateway/proxy that can split traffic and rollback instantly.
- Pick one vertical slice: one workflow that delivers real value and exercises real dependencies.
- Shadow first: new system computes answers and logs mismatches, but doesn’t serve them.
- Canary: 1% traffic, then 5%, then 25%, measuring SLOs and invariants.
- Cut read paths carefully: stale reads are user-visible. Use consistency budgets and clear behavior.
- Cut write paths last: make sure idempotency, retries, and reconciliation are proven.
- Decommission in chunks: remove legacy endpoints as they drain to zero traffic; keep archive access paths for audit.
Release checklist for a migrated component
- SLO defined; dashboards show golden signals.
- Alerting tuned; on-call has runbooks and rollback instructions.
- Capacity tested with production-like load shape.
- Dependency timeouts and retries configured (with budgets).
- Idempotency implemented for unsafe operations.
- Data reconciliation jobs in place; mismatch triage process defined.
- Security controls validated: authz parity, audit logs, retention.
- Game day performed: dependency failure, slow DB, partial deploy, rollback.
FAQ
1) When is a rewrite actually justified?
When the current system cannot be evolved safely: unsupported runtime with unpatchable security risk, licensing constraints, or architecture that blocks critical business requirements. Even then, prefer incremental replacement behind stable interfaces.
2) Isn’t incremental migration slower than rewriting?
Incremental feels slower because it’s honest about operating two realities. Big rewrites feel fast until you hit integration, data, and operations—then time explodes. Incremental migration wins by shipping value early and reducing existential risk.
3) We have terrible code quality. Doesn’t that demand a rewrite?
Bad code quality demands boundaries, tests around invariants, and operational visibility. Often you can isolate the worst modules and replace them behind an interface. A full rewrite resets code quality to “unknown,” which is not automatically better.
4) How do we avoid “two systems forever”?
By migrating in slices that fully retire legacy responsibilities. Don’t build a parallel system that duplicates everything before shipping. Route traffic, cut over a slice, then delete the old slice. Deletion is a milestone.
5) What’s the biggest hidden risk in rewrites?
Semantic drift: the new system behaves differently under retries, partial failures, and weird input. Users don’t file tickets for “semantic drift.” They file tickets for money missing, data wrong, and “your API is flaky.”
6) Does microservices architecture require a rewrite?
No. You can carve services out of a monolith over time. The first step is often to create internal modular boundaries and extract one domain with clear ownership and data contracts.
7) How do we handle data migration without downtime?
Use CDC or dual-write, then reconcile. Cut reads when staleness is acceptable or mitigated, cut writes last with strong idempotency. Always have a rollback route and a plan for backfills.
8) What should leadership measure to know the migration is healthy?
Not story points. Measure SLO attainment, incident rate, rollback frequency, time-to-detect/time-to-recover, and migration progress in retired legacy surface area (endpoints/workflows removed).
9) How do we stop engineers from “boiling the ocean”?
Define a thin vertical slice that reaches production, then require every expansion to include an exit plan for the equivalent legacy path. Reward deletion and operational stability, not novelty.
Next steps you can ship this quarter
If you’re sitting in a meeting where someone is pitching a rewrite as a cure-all, here’s what you do instead—practically, without drama:
- Run the fast diagnosis playbook and publish the bottleneck classification. Get the debate out of the realm of aesthetics.
- Write down invariants (idempotency, correctness, latency budgets, authz rules). Make them reviewable and testable.
- Pick one workflow and migrate it using routing + canary + rollback. Prove you can move slices safely.
- Invest in operational parity: dashboards, tracing, alert hygiene, runbooks. Make it easier to run systems than to argue about them.
- Make data reconciliation a product feature, not a side quest. If you can’t prove data correctness, you don’t have correctness.
The rewrite-from-scratch lie survives because it offers a story where complexity disappears. In real systems, complexity doesn’t disappear; it moves. Your job is to move it into places where it’s measurable, controllable, and boring. Boring is underrated. Boring ships.