You don’t notice lock-in when the dashboard is green and the sales engineer is still replying in under five minutes.
You notice it when an outage drags on because the only people who can fix it are a vendor escalation team in a different time zone,
and your “standard export” turns out to be a politely formatted hostage note.
Lock-in is rarely a single bad decision. It’s the compound interest of small conveniences: proprietary APIs, managed add-ons,
special drivers, opaque billing, and contracts written like they were designed by a labyrinth enthusiast.
What vendor lock-in really is (and what it isn’t)
Vendor lock-in isn’t “we use a product.” It’s “we can’t leave without taking a material outage, a financial hit, or a compliance hit.”
The difference matters because every serious system depends on vendors. The goal isn’t purity. The goal is leverage.
The three kinds of lock-in (and how to recognize each)
- Technical lock-in: proprietary APIs, formats, drivers, or control planes that become embedded in your app and operations.
  You recognize it when “migration” means “rewrite.”
- Economic lock-in: egress fees, long-term commitments, or pricing models that punish portability.
  You recognize it when the CFO is suddenly in your architecture meetings.
- Operational lock-in: only vendor staff can debug certain failure modes, or your team stops understanding the system because the vendor hides it.
  You recognize it when your incident channel contains the phrase “waiting on vendor.”
Lock-in isn’t always evil. It’s often just unpaid debt.
Some lock-in is an explicit trade: “We accept proprietary features because speed matters more than portability.”
That can be rational—if you price the exit upfront, keep your data portable, and maintain an escape route that isn’t imaginary.
Most teams do none of that. They treat “we can always migrate later” like it’s a feature, not a project.
One practical heuristic: if you cannot explain your exit plan in ten minutes, you don’t have an exit plan. You have hope.
Facts and historical context you can use in meetings
A little history helps because lock-in is not a modern cloud invention. It’s an old pattern wearing a hoodie.
Here are concrete points you can use to steer decisions without sounding like you’re afraid of change.
- 1960s–1970s: IBM “bundling” shaped the industry. The unbundling of software from hardware created an ecosystem,
  but also taught vendors that control planes and proprietary interfaces are power.
- 1980s: proprietary Unix variants proliferated. Portability existed, but “portable” often meant
  “recompile and pray,” with vendor-specific tooling creeping in.
- 1990s: enterprise storage arrays became ecosystems. You didn’t just buy disks; you bought management software,
  replication features, and special firmware—plus the idea that leaving would be unpleasant.
- Early 2000s: virtualization reduced some lock-in and created new kinds. It became easier to move workloads as “VMs,”
  but hypervisor ecosystems (backup plugins, drivers, orchestration) became sticky.
- 2010s: cloud made portability look easy—until data gravity arrived. Moving compute is usually a weekend. Moving many petabytes
  is a season, sometimes a fiscal year.
- Containers revived the dream of portability. But the real lock-in often moved to managed control planes
  (IAM, managed databases, managed Kubernetes add-ons, proprietary observability).
- S3’s API became a de facto standard. That helped portability for object storage, but not for the surrounding
  lifecycle policies, identity integration, eventing, and analytics tie-ins.
- “Open source” isn’t automatically “portable.” Managed open-source services can still lock you in via
  proprietary extensions, billing models, and operational dependency on the provider.
Paraphrased idea, attributed to Werner Vogels (Amazon CTO): Everything fails, all the time—so design for failure as a default state.
Lock-in is what happens when you design for success only.
Lock-in failure modes: where it bites ops first
1) Your data is “exportable” but not “usable”
Vendors love the word “export.” Export in what format? With what metadata? With what checksums? With what ordering guarantees?
Does it preserve ACLs, timestamps, object versions, retention locks, legal holds, KMS context, and audit trails?
If your compliance story depends on those, “export” without fidelity is a breach waiting to happen.
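One way to make “export fidelity” concrete is to check, mechanically, that an export bundle actually carries the metadata your compliance story needs. The sketch below assumes an export that can be parsed into per-object records; the field names are hypothetical placeholders, not any vendor’s real schema.

```python
# Sketch: validate that an export preserves compliance metadata, not just bytes.
# The field names below are assumptions -- substitute whatever your vendor's
# export format actually calls them.
REQUIRED_FIELDS = {
    "key", "version_id", "etag", "last_modified",
    "retention_mode", "retain_until", "legal_hold", "kms_key_id",
}

def missing_fidelity(records: list[dict]) -> dict[str, set[str]]:
    """Return, per object key, the required fields the export dropped."""
    gaps = {}
    for rec in records:
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            gaps[rec.get("key", "<unknown>")] = missing
    return gaps

# Example: an export that kept the data but lost the retention evidence.
export = [
    {"key": "doc/1", "version_id": "v3", "etag": "ab12", "last_modified": "2026-02-04",
     "retention_mode": "COMPLIANCE", "retain_until": "2033-01-01",
     "legal_hold": False, "kms_key_id": "key-1"},
    {"key": "doc/2", "version_id": "v1", "etag": "cd34", "last_modified": "2026-02-04"},
]
print(missing_fidelity(export))  # doc/2 is "exported" but unusable as evidence
```

Run a check like this against every export format you intend to rely on, before an auditor makes you do it under time pressure.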
2) The control plane becomes the single point of failure
Plenty of managed services are operationally solid until the day the control plane stutters.
If you can’t rotate credentials, can’t create volumes, can’t reschedule nodes, can’t view metrics, can’t open support cases due to SSO issues—congratulations,
your incident response now depends on a web UI.
3) “Managed” means “you don’t get the knobs you need during an incident”
In a storage incident, you want IOPS caps, queue depth control, per-tenant throttling, visibility into replication lag,
and the ability to isolate noisy neighbors. In a managed service, those knobs may exist only as “contact support.”
That’s not a knob. That’s a ticket.
4) Billing is the hidden availability risk
Egress and request pricing can alter architecture decisions under pressure. Teams start doing unsafe things—like disabling replication,
compressing logs “later,” or stretching retention windows—because someone got spooked by an unexpected bill.
When cost becomes a surprise, reliability becomes negotiable.
Joke #1: Vendor lock-in is like a hotel mini-bar—you don’t notice the pricing until you’re already thirsty and out of options.
5) Skills atrophy: your team stops knowing how the system works
Lock-in isn’t just about technology. It’s about cognition. If a vendor’s black box runs “fine” for two years, the team forgets
how to run the equivalent themselves. Then the vendor changes terms, sunsets a feature, or has a regional event. Suddenly you’re migrating
with a team that no longer has the muscle memory.
6) Your incident timeline includes “procurement”
The moment your DR plan needs a last-minute license, a re-negotiation, or an approval from a vendor account manager, you don’t have a DR plan.
You have a motivational poster.
Three mini-stories from the corporate trenches
Mini-story #1: The outage caused by a wrong assumption (the “standard API” trap)
A mid-sized fintech rebuilt a document pipeline around “S3-compatible” object storage, provided by a specialty vendor that promised
low latency and integrated compliance features. The dev team was careful: they used the S3 SDK, avoided vendor-specific endpoints,
and even wrote integration tests against a local S3 emulator. They considered it portable.
Then legal asked for a full export of a subset of objects with immutable retention proof—object versions, retention mode, and access logs—because of an audit.
The team assumed they could pull that via standard S3 APIs. The vendor did support the core object APIs, sure. But the compliance proof chain
lived in proprietary metadata services and a proprietary audit log pipeline.
Exporting “the data” was easy. Exporting “the data plus the evidence” was not. They spent days assembling a chain of custody manually,
parsing vendor-provided CSV dumps, and arguing about timestamp precision. Meanwhile, a release was blocked because compliance couldn’t sign off.
The failure wasn’t technical incompetence. It was an assumption: “If the data API is standard, the operational semantics are standard.”
They aren’t. S3 compatibility can cover reads and writes while everything that matters in regulated environments—retention, legal hold, audit trails, key management—remains proprietary.
The fix was painful but clarifying: they defined a “portability contract” for storage, including which metadata must be extractable, in which format,
and how it will be validated. They also started mirroring compliance logs into an independent system they controlled, even if the vendor stored the objects.
More cost. Much more leverage.
Mini-story #2: The optimization that backfired (the “let’s use the vendor’s magic feature” incident)
An e-commerce company had a busy PostgreSQL fleet and a lot of read traffic. They moved to a managed database offering and immediately fell in love
with a vendor feature: near-zero-config read scaling with proprietary read replicas and a transparent routing layer.
Performance improved. The team celebrated. Then they got ambitious: they pushed read-heavy workloads, analytics queries, and some critical API endpoints
through the vendor’s router. It was convenient. It was also a new dependency that wasn’t visible in application code.
During a regional network event, the routing layer degraded in a way that didn’t look like a database issue. Connections were accepted, then stalled.
Timeouts multiplied. The application autoscaled, making it worse. Their runbook said “fail over to self-managed replicas.” Except the application
no longer had a straightforward connection string to do that. The router was now the contract.
They escaped by adding an explicit database access layer in their platform: connection endpoints were abstracted behind internal DNS,
and the vendor router became one implementation, not the implementation. They also added an emergency “direct-to-primary” path and tested it quarterly.
The lesson: optimizations that remove knobs from your hands are not optimizations. They’re outsourcing your ability to improvise.
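The access-layer pattern from this story can be sketched in a few lines: the application asks for a role, not a vendor hostname, and an environment flag flips to a break-glass direct path. The hostnames, IPs, and the `DB_DIRECT` variable below are hypothetical, and a real implementation would live behind internal DNS rather than a Python dict.

```python
import os

# Sketch of endpoint indirection: the app resolves a role ("primary", "read"),
# so the vendor's router is one implementation of that role, not THE contract.
# All names and the DB_DIRECT env var are assumptions for illustration.
ENDPOINTS = {
    "read":    "db-read.internal.example",     # internal DNS -> vendor router today
    "primary": "db-primary.internal.example",  # internal DNS -> managed primary
}
BREAK_GLASS = {"read": "10.0.4.11", "primary": "10.0.4.10"}  # direct replica/primary

def resolve(role: str) -> str:
    """Pick the endpoint for a role; DB_DIRECT=1 flips to the break-glass path."""
    if os.environ.get("DB_DIRECT") == "1":
        return BREAK_GLASS[role]
    return ENDPOINTS[role]

print(resolve("read"))      # normal path via internal DNS
os.environ["DB_DIRECT"] = "1"
print(resolve("read"))      # emergency path, no router in the loop
```

The design point is the quarterly test: a break-glass path that has never been exercised is exactly as useful as the runbook in the story above.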
Mini-story #3: The boring but correct practice that saved the day (the “exit drill” win)
A healthcare SaaS company ran a mix of cloud services and on-prem storage, and they had a deeply unsexy rule: every quarter,
they did an “exit drill” for one critical dependency. Not a full migration. A rehearsal. The goal was to prove you could move a representative slice of production data,
validate it, and run a key workload elsewhere.
They maintained a second identity provider configuration in “cold standby,” kept Terraform modules provider-agnostic where possible,
and stored their encryption key material in a way that could be rehydrated outside a single cloud KMS. They also wrote down the annoying bits:
which service quotas needed pre-approval, what DNS changes were required, and which teams had to sign off.
One year, their primary cloud provider had a control plane incident that blocked provisioning and broke parts of their CI/CD pipeline.
Not catastrophic, but it stretched into business hours, which in healthcare is a long time to be “sort of down.”
Their response wasn’t heroic. It was procedural. They flipped a subset of customer workloads to the alternate environment using pre-planned DNS weights,
restored from continuously replicated backups, and kept the lights on. No one wrote a viral postmortem thread. That’s the point.
The boring practice that saved them was a quarterly exercise that executives had once questioned. After that incident, nobody questioned it again.
Fast diagnosis playbook: find the bottleneck before you blame the vendor
When things go sideways, vendor lock-in amplifies panic because it reduces your options. Your job is to separate
“the system is slow” from “we are trapped.” Here’s the fast triage order I use in production.
First: confirm the blast radius and whether it’s control-plane or data-plane
- Control-plane symptoms: can’t create resources, can’t change config, auth/SSO broken, dashboards down, API calls failing fast.
- Data-plane symptoms: reads/writes time out, latency spikes, replication lag grows, throughput collapses.
Second: identify the bottleneck class
- Network: packet loss, MTU mismatch, TLS issues, cross-region routing weirdness, saturated links.
- Storage: IOPS/throughput caps, queue depth, throttling, noisy neighbor, replication backpressure.
- Compute: CPU steal, memory pressure, GC pauses, thread pool exhaustion, kernel limits.
- Service limits: quotas, rate limits, connection caps, API throttling.
- App behavior: retry storms, thundering herd, bad rollout, inefficient queries.
Third: decide if you need an escape hatch now
An escape hatch isn’t “migrate everything.” It’s “reduce dependency quickly.” Examples:
switch read traffic to cached responses, fail over to a simpler DB endpoint, temporarily stop non-critical writes,
or route object downloads through a CDN you control rather than the vendor endpoint.
Joke #2: The best time to design an exit strategy was last year. The second best time is before your pager starts composing poetry.
Hands-on: 14 real tasks to measure lock-in and plan an exit
These are intentionally practical. Each task includes: a command, realistic output, what it means, and the decision you make.
Run them on a representative host (or in CI) and paste results into your architecture notes.
Task 1: Inventory proprietary agents and vendor daemons
cr0x@server:~$ systemctl list-units --type=service --state=running | egrep -i 'agent|vendor|backup|snap|replic|monitor' | head
vendorx-agent.service loaded active running VendorX Node Agent
backup-appliance-connector.service loaded active running Backup Appliance Connector
node-exporter.service loaded active running Prometheus Node Exporter
What it means: You have at least one vendor-specific agent in the critical path. That’s not automatically bad, but it’s a dependency.
Decision: For each proprietary service, document: what breaks if it stops, how to remove it, and what replaces its function.
Task 2: Identify kernel modules/drivers that tie you to a platform
cr0x@server:~$ lsmod | egrep -i 'nvme|virtio|vendor|mlx|zfs' | head
nvme 57344 2
virtio_net 45056 0
mlx5_core 942080 0
zfs 4157440 3
What it means: Standard modules are fine. A vendor-specific kernel module is a migration risk (different kernels, signing, support).
Decision: If you see a vendor driver, plan a test boot without it in staging and validate performance/functionality.
Task 3: Quantify data gravity (how much you would have to move)
cr0x@server:~$ df -hT | egrep -v 'tmpfs|overlay'
Filesystem Type Size Used Avail Use% Mounted on
/dev/nvme0n1p2 ext4 1.8T 1.2T 530G 70% /
/dev/nvme1n1 ext4 7.0T 6.6T 110G 99% /data
What it means: 6.6T of hot data isn’t “just copy it.” It’s time, bandwidth, and validation. At 200 MB/s effective, that’s ~9.2 hours minimum, before verification and retries.
Decision: Classify datasets into: must-migrate, can-rebuild, can-expire. Exit plans are won by deleting data, not copying it.
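The 9.2-hour figure above is the optimistic floor. A small helper makes the ballpark repeatable; the 30% overhead factor for verification, retries, and throttling is an assumption you should replace with measured numbers.

```python
def transfer_hours(dataset_tb: float, effective_mb_s: float, overhead: float = 1.3) -> float:
    """Ballpark wall-clock hours to move a dataset.
    overhead is a fudge factor (assumption) for verification, retries, throttling."""
    seconds = (dataset_tb * 1_000_000) / effective_mb_s  # decimal TB -> MB
    return seconds * overhead / 3600

# The 6.6T /data volume above, at 200 MB/s effective:
print(f"{transfer_hours(6.6, 200):.1f} h")  # ~11.9 h with 30% overhead
```

Feed it your real sustained throughput (measured, not the NIC speed) and run it per dataset class: the must-migrate number is the one that sets your exit timeline.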
Task 4: Measure real disk performance vs expectations
cr0x@server:~$ fio --name=randread --filename=/data/fio.test --size=8G --direct=1 --rw=randread --bs=4k --iodepth=32 --numjobs=4 --runtime=30 --time_based --group_reporting
randread: (groupid=0, jobs=4): err= 0: pid=2213: Tue Feb 4 10:21:31 2026
read: IOPS=58.2k, BW=227MiB/s (238MB/s)(6815MiB/30001msec)
slat (usec): min=3, max=612, avg=12.1, stdev=6.3
clat (usec): min=121, max=48912, avg=2188.4, stdev=1330.2
lat (usec): min=138, max=48926, avg=2200.9, stdev=1330.5
What it means: You can now compare the measured IOPS/latency to vendor claims and to your SLOs. High tail latency (max ~49ms) might hurt databases.
Decision: If tail latency is high, don’t adopt a platform feature that assumes predictable storage latency (some managed DB tiers do).
Task 5: Detect storage throttling or IO wait under load
cr0x@server:~$ iostat -xz 1 5
avg-cpu: %user %nice %system %iowait %steal %idle
12.30 0.00 3.10 24.80 0.00 59.80
Device r/s w/s rkB/s wkB/s await svctm %util
nvme1n1 820.0 410.0 98000 42000 18.4 0.9 98.7
What it means: High %util and high await suggest the device is saturated or throttled. If this device is “managed,” you may not have knobs.
Decision: If saturation is frequent, plan for a migration path to storage where you can scale IOPS independently or add caching.
Task 6: Check filesystem and mount options that can block portability
cr0x@server:~$ findmnt -no TARGET,SOURCE,FSTYPE,OPTIONS /data
/data /dev/nvme1n1 ext4 rw,noatime,nodiratime,discard
What it means: Mount options like discard, noatime are portable; but some filesystems (or features like reflinks, compression) can change migration tooling requirements.
Decision: Standardize on portable filesystem features for “moveable” data; reserve fancy features for data you can regenerate.
Task 7: Validate you can export and restore a database logically
cr0x@server:~$ pg_dump --format=custom --file=/tmp/appdb.dump --dbname=postgresql://app@db01/appdb
pg_dump: dumping contents of database "appdb" ...
pg_dump: finished
What it means: Logical dumps are portable across many environments (with caveats). If pg_dump is too slow or too large, you’re relying on physical snapshots—often vendor-tied.
Decision: If you can’t complete a logical dump in your RTO/RPO window, invest in logical replication or dual-writing patterns before you’re forced to migrate.
Task 8: Check if your infrastructure is glued to one provider via Terraform state and providers
cr0x@server:~$ terraform providers
Providers required by configuration:
.
├── provider[registry.terraform.io/hashicorp/aws] ~> 5.0
├── provider[registry.terraform.io/hashicorp/kubernetes] ~> 2.0
└── provider[registry.terraform.io/hashicorp/helm] ~> 2.0
Providers required by state:
provider[registry.terraform.io/hashicorp/aws]
What it means: If state is dominated by one provider, migration requires either importing into a new state or rebuilding—both risky under time pressure.
Decision: Introduce an abstraction boundary: modules that can target multiple backends, and separate states per environment and per dependency class.
Task 9: Identify container images tied to a vendor registry or base image
cr0x@server:~$ crictl images | head
IMAGE TAG IMAGE ID SIZE
vendor.registry.local/platform/app 1.42.0 9b1d0c4a2f3e 312MB
docker.io/library/nginx 1.25 e34f3c9c7b7a 187MB
What it means: Vendor registry dependency can become a production outage if auth breaks or registry goes down.
Decision: Mirror critical images into a registry you control, and pin by digest for reproducibility.
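That decision can be enforced mechanically. A minimal sketch, assuming you maintain a set of registries you control (the `registry.internal.example` mirror name is hypothetical):

```python
# Sketch: flag images pulled from a registry you don't control, or pinned by
# tag instead of digest. The approved set is an assumption.
APPROVED_REGISTRIES = {"registry.internal.example"}  # mirrors you control

def audit_image(ref: str) -> list[str]:
    """Return a list of portability problems for one image reference."""
    issues = []
    registry = ref.split("/", 1)[0]
    if registry not in APPROVED_REGISTRIES:
        issues.append(f"unmirrored registry: {registry}")
    if "@sha256:" not in ref:
        issues.append("pinned by tag, not digest")
    return issues

print(audit_image("vendor.registry.local/platform/app:1.42.0"))   # two problems
print(audit_image("registry.internal.example/platform/app@sha256:9b1d0c4a"))  # []
```

Wire something like this into admission control or CI so the audit runs on every deploy, not once a year.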
Task 10: Confirm your Kubernetes cluster relies on vendor-specific CRDs/controllers
cr0x@server:~$ kubectl get crd | egrep -i 'vendor|ingress|certificate|backup' | head
backups.vendorx.io
snapshots.vendorx.io
certificaterequests.cert-manager.io
What it means: Vendor CRDs imply your workloads and backups may not be portable as-is.
Decision: Prefer upstream APIs (CSI snapshots, standard Ingress) where possible; for vendor CRDs, document an equivalence mapping.
Task 11: Inspect CSI drivers and storage classes for portability
cr0x@server:~$ kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION
fast-ssd csi.vendorx.io Delete WaitForFirstConsumer true
standard kubernetes.io/no-provisioner Delete Immediate false
What it means: A vendor CSI provisioner can lock your PV lifecycle and snapshot/restore mechanics.
Decision: Ensure you can restore PV data outside the cluster (file-level backups) and that you’ve tested moving a stateful workload to another CSI backend.
Task 12: Verify backups are actually restorable without the vendor appliance
cr0x@server:~$ borg info /backup/borg::appdb-2026-02-04
Archive name: appdb-2026-02-04
Archive fingerprint: d2d84c4c7b3a...
Time (start): Tue, 2026-02-04 02:00:02
Original size: 18.42 GB
What it means: This is an independent, file-based backup format you can restore anywhere you can run borg. If your backups only restore via a vendor UI, you’re locked.
Decision: Require at least one backup path that is vendor-independent and tested via CLI restore.
Task 13: Estimate egress exposure by measuring outbound traffic
cr0x@server:~$ sar -n DEV 1 3 | egrep 'IFACE|eth0'
IFACE rxpck/s txpck/s rxkB/s txkB/s
eth0 1420.00 1890.00 9820.50 21540.25
eth0 1411.00 1922.00 9750.10 21910.80
eth0 1468.00 1855.00 10020.90 21010.40
What it means: Your sustained outbound rate is ~21 MB/s (txkB/s). Multiply by hours/day and you can ballpark monthly egress if you had to move data out or serve traffic cross-zone.
Decision: If egress is a material budget line, add caching/CDN, compress responses, and architect for locality before you need an exit.
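The multiplication is trivial, but doing it once in code keeps everyone honest about units (sar reports kB/s; bills arrive in GB or TB):

```python
def monthly_egress_gb(tx_kb_s: float, hours_per_day: float = 24) -> float:
    """Ballpark monthly egress from a sustained outbound rate (sar txkB/s).
    Assumes the rate holds for hours_per_day, 30 days; decimal units throughout."""
    gb_per_hour = tx_kb_s * 3600 / 1_000_000  # kB/s -> GB/h
    return gb_per_hour * hours_per_day * 30

# ~21,500 kB/s sustained, as in the sar output above:
print(f"{monthly_egress_gb(21500):.0f} GB/month")  # roughly 56 TB/month
```

Multiply that by your provider’s per-GB egress rate and you have the number to bring to the pricing negotiation.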
Task 14: Find vendor-specific endpoints hard-coded in configs
cr0x@server:~$ grep -R --line-number -E 'vendorx|\.cloudprovider\.internal|kms|s3-' /etc /opt/app 2>/dev/null | head
/opt/app/config.yaml:17: endpoint: https://s3-us-east-1.cloudprovider.internal
/opt/app/config.yaml:22: kms_key_id: arn:cloud:kms:us-east-1:acct:key/1234abcd
/etc/systemd/system/vendorx-agent.service:5:ExecStart=/usr/local/bin/vendorx-agent --region us-east-1
What it means: Hard-coded endpoints and KMS identifiers are migration landmines. They guarantee surprises.
Decision: Move these into an indirection layer: DNS names you own, config variables with per-environment overrides, and a secrets system that can switch backends.
Common mistakes: symptoms → root cause → fix
1) Symptom: “We can’t fail over because the managed service won’t let us change X”
Root cause: You designed around a control plane you don’t control (IAM, routing, provisioning) and didn’t keep a manual path.
Fix: Maintain a documented break-glass path: static credentials stored securely, alternate endpoints, and a tested procedure to run core workloads with fewer features.
2) Symptom: “Export succeeded, but the restored system is missing permissions/version history”
Root cause: You exported objects/rows but not the semantics (ACLs, retention, audit logs, schema extensions, triggers, users).
Fix: Define an export manifest: data + metadata + verification. Run restore drills that validate security posture, not just checksums.
3) Symptom: “Migration estimates keep slipping; we keep discovering hidden dependencies”
Root cause: No dependency graph. Teams used vendor SDK conveniences and platform features informally.
Fix: Build an inventory: SDKs, APIs, CRDs, IAM policies, KMS keys, DNS names, network dependencies, and operational runbooks. Make it part of change review.
4) Symptom: “Costs spike during incidents; leadership demands risky cost cuts”
Root cause: Pricing model is coupled to reliability actions (egress for DR, cross-region replication, extra logs, extra reads).
Fix: Pre-approve a reliability budget line item. Treat DR traffic as a capacity reservation, not a surprise.
5) Symptom: “Support says it’s our fault, we say it’s theirs, nothing moves”
Root cause: No shared observability. You can’t provide evidence because telemetry is trapped in their platform.
Fix: Export metrics/logs/traces to a system you control (or at least dual-home them). During an incident, evidence wins arguments.
6) Symptom: “We can’t change providers because our IaC is provider-specific”
Root cause: Terraform modules reflect one provider’s primitives, not your platform’s intent.
Fix: Rewrite modules around intent (“network”, “cluster”, “db”, “object store”) and keep provider implementations behind interfaces. Separate state files.
7) Symptom: “Backups are green, but restores are terrifying”
Root cause: Backups rely on proprietary snapshot formats or appliances without independent restore tooling.
Fix: Add at least one portable backup: file-level, logical dumps, or open formats, and run restores on a schedule. Treat restore time as an SLO.
8) Symptom: “Our app can’t run outside this cloud because of IAM/KMS assumptions”
Root cause: Identity and encryption are implemented with provider-specific constructs, directly referenced in code/config.
Fix: Introduce an internal identity abstraction (OIDC/SAML boundaries) and key abstraction (envelope encryption with rewrap capability). Use indirection for identifiers.
Checklists / step-by-step plan: escaping cleanly
Step 1: Admit what you’re locked into (inventory + classification)
- Data layer: databases, object storage, block storage, backup systems, queues.
- Control plane: IAM, KMS, DNS, secrets, CI/CD, Kubernetes management.
- Ops layer: observability, incident tooling, on-call workflows, runbooks, support contracts.
Make a simple table for each dependency:
What it does, what depends on it, how to run without it for 24 hours,
what the exit looks like, what data must move.
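If you want that table to survive longer than one architecture review, give it a schema. A minimal sketch with hypothetical example values:

```python
from dataclasses import dataclass, asdict

# Sketch of the per-dependency record from Step 1; fields mirror the table
# columns above. All example values are hypothetical.
@dataclass
class Dependency:
    name: str
    does: str
    depended_on_by: list[str]
    survives_24h_without: str   # how you run without it for a day
    exit_plan: str
    data_to_move: str

dep = Dependency(
    name="managed-postgres",
    does="primary OLTP database",
    depended_on_by=["api", "billing", "reporting"],
    survives_24h_without="read-only mode from replicas + cached responses",
    exit_plan="logical replication to self-managed PostgreSQL",
    data_to_move="1.2 TB appdb + roles + extensions list",
)
print(asdict(dep))
```

Keeping these records in version control means the inventory gets reviewed in pull requests, which is where architecture decisions actually happen.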
Step 2: Define “portable by design” standards
- Protocols: prefer standard protocols and APIs (PostgreSQL wire protocol, S3 core API, CSI, OIDC).
- Formats: use exportable formats (Parquet/CSV where appropriate, logical DB dumps, open backup formats).
- Identifiers: avoid hard-coding vendor ARNs/IDs in app configs; use indirection layers.
- Secrets and keys: design for key rewrap and secret backend changes.
Step 3: Build an “exit budget” and get it approved
Exits cost money: extra storage for dual-write, extra bandwidth for replication, extra environments for drills.
Treat it like insurance. If leadership wants portability, leadership funds portability.
Step 4: Choose an exit architecture pattern (pick one, don’t mash them)
- Cold standby: minimal resources elsewhere; restore from backups.
  Cheap, slower RTO. Great for boring services.
- Warm standby: replicated data + periodically tested workloads.
  Balanced cost and speed. Often the best default.
- Active-active: dual-run across providers/regions.
  Expensive and operationally complex. Worth it for a small number of critical systems.
Step 5: Make portability testable
You can’t manage what you can’t test. Add CI checks that fail builds when someone introduces a hard-coded endpoint,
a vendor-specific SDK feature, or a non-portable CRD dependency without an explicit exception.
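A CI gate for this can be a few dozen lines. The sketch below fails on vendor endpoint patterns unless the line carries an explicit exception marker; the patterns and the marker string are assumptions you would tailor to your own stack.

```python
import re

# Sketch of a CI portability gate: fail the build when configs reference
# vendor endpoints without an explicit, reviewable exception marker.
# The patterns and marker string are assumptions.
FORBIDDEN = re.compile(r"(vendorx|\.cloudprovider\.internal|arn:cloud:)")
EXCEPTION_MARKER = "portability-exception:"  # e.g. "# portability-exception: ARCH-77"

def scan(text: str, filename: str = "<config>") -> list[str]:
    """Return violations as 'file:line: content' strings."""
    violations = []
    for lineno, line in enumerate(text.splitlines(), 1):
        if FORBIDDEN.search(line) and EXCEPTION_MARKER not in line:
            violations.append(f"{filename}:{lineno}: {line.strip()}")
    return violations

sample = """\
endpoint: https://s3-us-east-1.cloudprovider.internal
kms_key_id: arn:cloud:kms:us-east-1:acct:key/1234abcd  # portability-exception: ARCH-77
"""
for v in scan(sample, "config.yaml"):
    print(v)   # only line 1 is reported; line 2 carries an approved exception
```

The exception marker matters as much as the check: it turns “we used a proprietary thing” from an invisible habit into a decision with a name attached.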
Step 6: Execute a staged migration (the clean way)
- Rehearse with non-prod: prove restore and cutover mechanics.
- Migrate the read path first: replicate data, serve reads, validate correctness.
- Introduce dual-write or change-data-capture: keep systems in sync for a controlled period.
- Cut over writes: short freeze window if needed; keep rollback plan.
- Run parallel validation: compare counts, checksums, sampling, and business metrics.
- Decommission deliberately: keep a defined retention window, then remove access and costs.
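The parallel-validation step above can be sketched as a count comparison plus checksums of a random row sample. The lists and dict below stand in for queries against the old and new databases; in practice you would stream rows, not load them.

```python
import hashlib, random

# Sketch of post-migration validation: compare row counts, then checksums of
# a seeded random sample. The in-memory rows are stand-ins for real queries.
def row_digest(row: tuple) -> str:
    return hashlib.sha256(repr(row).encode()).hexdigest()

def validate(old_rows: list, new_index: dict, sample_size: int = 3,
             seed: int = 42) -> list[str]:
    problems = []
    if len(old_rows) != len(new_index):
        problems.append(f"count mismatch: {len(old_rows)} vs {len(new_index)}")
    rng = random.Random(seed)   # fixed seed so reruns check the same sample
    for row in rng.sample(old_rows, min(sample_size, len(old_rows))):
        pk = row[0]
        if new_index.get(pk) != row_digest(row):
            problems.append(f"checksum mismatch for pk={pk}")
    return problems

old = [(1, "alice"), (2, "bob"), (3, "carol")]
new = {row[0]: row_digest(row) for row in old}   # a faithful migration
print(validate(old, new))                        # []
new[2] = "corrupted"
print(validate(old, new))                        # pk=2 is caught
</n```

Counts catch wholesale loss, checksums catch silent corruption, and business metrics (run both systems against real dashboards for a week) catch the semantic drift neither of the first two can see.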
Step 7: Institutionalize the boring practices
- Quarterly exit drills for one dependency.
- Restore tests with measured RTO/RPO.
- Dependency inventory as part of architecture review.
- Contract review checklist: termination assistance, export formats, support SLAs, pricing change clauses.
FAQ
1) Is multi-cloud the same thing as avoiding lock-in?
No. Multi-cloud can reduce some lock-in, but it can also double your operational complexity and create “lock-in to your own mess.”
The better goal is exit-capable: you can leave within an agreed time/cost envelope.
2) What’s the fastest way to tell if we’re locked into a managed database?
Try to answer three questions with evidence: Can you do a logical export/import at required scale? Can you run the same engine elsewhere without feature loss?
Can you operate it (monitoring, backups, failover) without the vendor console?
3) Are “open source managed services” safe from lock-in?
Safer, not safe. You can still be locked in by proprietary extensions, operational dependencies, or opaque billing. “Open source” reduces one category of risk:
code availability. It doesn’t automatically give you portability of operations.
4) What contract terms actually matter for lock-in?
Termination assistance, data export formats and timelines, support response during termination, pricing change clauses, egress cost predictability,
and what happens to keys/logs/audit trails after termination. Also: how fast you can raise limits and whether that requires human approval.
5) What’s the biggest technical lock-in trap in Kubernetes?
Vendor-specific CRDs for core workflows (ingress, backups, policy, networking) and proprietary storage snapshots. They seep into manifests and runbooks.
Use upstream APIs where possible, and treat CRDs as code you must migrate later.
6) How do I reduce egress pain without rewriting everything?
Cache aggressively, compress responses, keep data local to compute, avoid chatty cross-region traffic, and use replication strategies that minimize full re-reads.
Also: measure outbound traffic now, so you can negotiate pricing from a position of facts.
7) What’s a realistic “clean exit” timeline?
For moderate systems: weeks to months. For multi-petabyte, compliance-heavy estates: quarters. The gating factor is usually data validation and operational parity,
not copying bytes.
8) How do we keep engineers from using proprietary features?
Don’t rely on vibes. Create a portability policy: which features are allowed, which require explicit approval, and what the exit plan is.
Add CI checks that detect hard-coded endpoints and vendor-specific APIs, and make exceptions visible.
9) What’s the minimum viable exit strategy for a small team?
A portable backup you can restore without the vendor, DNS indirection for endpoints, infrastructure definitions stored in version control,
and one quarterly restore drill. You can be small and still be difficult to trap.
Conclusion: next steps you can do this week
Vendor lock-in isn’t a moral failing. It’s a predictable outcome when convenience compounds and nobody prices the exit.
The fix isn’t “never use managed services.” The fix is to keep leverage: portable data, portable operations, and contracts that don’t punish you for leaving.
Do these next steps in order
- Run the inventory tasks above on one representative production host and one cluster. Create a dependency list you can defend.
- Pick one critical dependency (database, object store, identity, backups) and schedule an exit drill within 30 days.
- Add one portable backup path if you don’t already have it, and restore it in a clean environment.
- Introduce indirection: DNS names you control, config-driven endpoints, and a break-glass access procedure.
- Write down the exit SLO: “We can leave this vendor in X days for Y cost with Z downtime.” Then make it true.
If you do nothing else: make your data independently restorable, and practice doing it. Everything else is negotiable.