DNS TTL Mistakes That Haunt Migrations — How to Set TTL Like a Pro

Every migration plan has a lie in it: “DNS cutover will be quick.” The truth is harsher. DNS doesn’t “propagate” like gossip; it’s cached on purpose, in layers, by systems you don’t control, with clocks you can’t reset.

If you set TTLs casually, DNS becomes the slowest part of your migration. If you set them professionally, DNS becomes boring. And boring is what you want at 2:17 AM when the CFO is “just checking in.”

TTL is not “propagation”: what actually happens

TTL (time to live) is the number of seconds a DNS answer is allowed to live in cache. Not how long it takes “the internet” to hear about your change. That “DNS propagation” phrase is marketing for people who sell domain names and aspirin.

Here’s the real flow in production:

  • Authoritative DNS hosts the truth for a zone (your records, your SOA, your NS set).
  • Recursive resolvers (ISP resolvers, corporate resolvers, public resolvers) fetch and cache answers from authoritative servers.
  • Stub resolvers on machines ask a recursive resolver. They may also cache, depending on OS, libraries, and local daemons.
  • Application-level caching happens too. Java, Go, Envoy, browsers, CDN edge, and mystery SDKs sometimes cache DNS in ways you didn’t ask for.

When you change a record at the authoritative level, nothing pushes that change outward. The world learns the new answer only when caches expire and resolvers ask again. TTL is the permission slip that tells caches how lazy they’re allowed to be.

One more thing: TTL is per recordset in the response. That includes negative answers (NXDOMAIN), which are cached too. People forget that, then wonder why the brand-new hostname “doesn’t exist” for an hour.

Joke #1: DNS “propagation” is like “we’ll circle back” in corporate email: it means “not on your schedule.”

Interesting facts and short history (you can use in meetings)

  1. DNS replaced HOSTS.TXT because a centrally maintained file didn’t scale. TTL exists because caching is the only way a global naming system survives.
  2. TTL has been in DNS since the early RFCs as a core mechanism for controlling cache behavior. It’s not an optional tuning knob; it’s a contract.
  3. Negative caching became standardized later, after operators realized that repeatedly asking “does this name exist?” melts resolvers during outages and attacks.
  4. The SOA record’s MINIMUM field has carried two meanings: originally a default/minimum TTL for the zone’s records, later redefined (RFC 2308) as the negative-caching TTL. Confusion between the two still causes migration pain.
  5. Resolvers are allowed to cap TTLs (both minimum and maximum) for policy or performance. Your 30-second TTL may turn into 300 seconds on some networks.
  6. Some platforms ignore TTL in practice because they pin DNS answers or cache them aggressively. That’s not DNS being broken; it’s application behavior.
  7. CDNs and global load balancers often rely on DNS precisely because TTL gives controlled “eventual movement” of traffic. Used well, it’s reliable and predictable.
  8. Low TTLs were historically discouraged when authoritative servers were weaker and bandwidth expensive. Today it’s less about cost and more about operational discipline.
  9. EDNS and modern resolver features improved performance and robustness, but they didn’t eliminate caching. Caching is still the point.

A mental model for TTL that survives real networks

If you remember only one model, use this:

Observed cutover time = max(cache layers) + your own mistakes.

Layer 1: Recursive resolver caching

This is the big one. Your users don’t query your authoritative servers directly; they query a resolver. That resolver typically obeys TTL, but may clamp it. If the resolver cached the old answer five minutes ago with a 1-hour TTL, you can change the record all you want—those users are glued to the old answer for up to 55 more minutes.
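
A quick way to see this layer rather than just reason about it: ask the authoritative server and the resolver your clients actually use, back to back. This is a sketch; ns1.example.net and 10.0.0.53 are stand-ins for your own authoritative server and resolver.

cr0x@server:~$ dig @ns1.example.net www.example.com A +noall +answer
cr0x@server:~$ dig @10.0.0.53 www.example.com A +noall +answer

The authoritative answer always shows the configured TTL; the resolver’s answer shows how many seconds of caching remain for everyone behind it. If the two match, the resolver just refreshed; if the resolver’s number is lower, that’s how long those users may keep getting the old answer.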

Layer 2: Stub resolver and host-level caching

On Linux you might have systemd-resolved, nscd, dnsmasq, or nothing. On macOS, mDNSResponder handles system-level caching. On Windows, the DNS Client service keeps its own cache. Some of these honor TTL, some add their own logic, and some applications bypass them entirely.

Layer 3: Application and runtime caching

Browsers can cache DNS, but so can your HTTP client stack. The JVM historically had “cache forever” defaults in certain modes. Some service meshes or sidecars cache aggressively for performance. And some teams have their own “DNS cache” library because they once got paged for resolver latency. Guess what: now you get paged for migrations.
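
For the JVM case specifically, here is a minimal sketch of what to check, assuming a standard JDK layout (app.jar is a placeholder, and the exact java.security path varies by JDK version). The networkaddress.cache.ttl security property governs positive DNS caching; older code paths also respect the sun.net.inetaddr.ttl system property.

cr0x@server:~$ # placeholder paths/values; verify against your JDK's documentation
cr0x@server:~$ grep -n "networkaddress.cache" "$JAVA_HOME/conf/security/java.security"
cr0x@server:~$ java -Dsun.net.inetaddr.ttl=60 -jar app.jar

Sixty seconds is an example, not a recommendation; the point is that the value should be explicit and known before migration night, not discovered during it.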

Layer 4: Connection reuse and long-lived sessions

Even if DNS updates instantly, traffic might not move because clients keep existing TCP connections alive. HTTP/2, gRPC, WebSockets, database pools—these are designed to be sticky. DNS only affects new connections (unless you actively drain/close). During migration, connection lifetime can matter more than TTL.

Why “set TTL to 60 seconds” doesn’t automatically mean “cutover in 60 seconds”

Because the record might already be cached with a higher TTL, because negative caching exists, because resolvers clamp, because application caches exist, and because connection reuse exists.

TTL is not a stopwatch. It’s a maximum age from the time of caching. If you need predictability, you plan backwards: lower TTL well in advance so caches refresh on the low TTL before the cutover window.

One quote to keep handy in reliability conversations:

“Hope is not a strategy.” — paraphrased idea attributed to many reliability engineers and SRE leaders

How to set TTL like a pro (pre-cutover, cutover, post-cutover)

Step 0: Decide what “cutover” means in your system

DNS cutover is only one lever. Before you touch TTLs, decide:

  • Are you moving all traffic or just a subset?
  • Do you need fast rollback (minutes) or is hours acceptable?
  • Are clients internal (corp resolvers you control) or external (the chaos zoo)?
  • Are services stateful (databases) or stateless (web/API)?
  • Do you have a second control plane (load balancer weights, CDN config, service discovery) that might be better than DNS?

DNS is excellent for coarse traffic movement and failover when you can tolerate cache windows. It’s a lousy tool for second-by-second routing. If you treat it like a load balancer, it will remind you who’s boss.

Pre-cutover: lower TTLs early, not “right before”

The pro move is to lower TTLs at least one full old-TTL period before cutover. Ideally two. Because the world may have cached the record at any point during the prior TTL window.

Example: your current TTL is 3600 seconds (1 hour). You want a 60-second cutover window. Lower TTL to 60 at least 1 hour before cutover—and preferably earlier. That gives caches a chance to refresh and start obeying the new low TTL.
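
The arithmetic is worth writing down rather than doing in your head at midnight. A sketch using GNU date (the cutover time is made up):

cr0x@server:~$ OLD_TTL=3600; CUTOVER="2025-07-01 00:00"
cr0x@server:~$ date -d "$CUTOVER $((2*OLD_TTL)) seconds ago" "+lower TTL no later than: %F %H:%M"
lower TTL no later than: 2025-06-30 22:00

Two old-TTL periods of margin means even a resolver that cached at the worst possible moment, then refreshed at the worst possible moment again, is comfortably on the low TTL before you flip anything.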

Pick sane TTL values (and don’t get cute)

Here’s an opinionated starting point that works for most migrations:

  • Normal operations (steady-state): 300–900 seconds for most A/AAAA records; 900–3600 for records that rarely change.
  • Migration window: 30–120 seconds for the specific names you’ll cut over.
  • Post-cutover: raise back to 300–900 once stable; don’t leave everything at 30 seconds forever unless you’ve priced the query load and audited resolvers.

Yes, 30-second TTLs are possible. No, they are not free. You pay in resolver load, authoritative QPS, and incident complexity when your DNS provider has a bad day.
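
A hedged back-of-envelope shows why. Assume, purely for illustration, 10,000 distinct resolvers each re-fetching one name once per TTL:

cr0x@server:~$ for ttl in 3600 300 60 30; do echo "TTL ${ttl}s -> ~$((10000/ttl)) qps at the authoritative servers"; done
TTL 3600s -> ~2 qps at the authoritative servers
TTL 300s -> ~33 qps at the authoritative servers
TTL 60s -> ~166 qps at the authoritative servers
TTL 30s -> ~333 qps at the authoritative servers

The real numbers depend on your resolver population and query mix, but the shape of the curve is why low TTLs belong on migration-critical records, not on the whole zone.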

Cutover: change the right record in the right place

Most migrations fail because someone changed a record, not the record.

  • If you have a chain (CNAME to another CNAME to A/AAAA), you must understand TTL at each hop.
  • If you have split-horizon DNS, you must change internal and external views intentionally.
  • If you use managed DNS plus an internal override zone, know which wins for which clients.

Also: if you’re using the zone apex (example.com) with CNAME-like tricks (ALIAS, ANAME, or CNAME flattening), be extra careful. Provider-specific behavior can change what TTLs actually appear in responses.
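
Don’t trust the console or the docs alone; ask the authoritative side what it actually serves at the apex and compare with a public resolver (ns1.example.net is a stand-in for your provider’s server):

cr0x@server:~$ dig @ns1.example.net example.com A +noall +answer
cr0x@server:~$ dig @1.1.1.1 example.com A +noall +answer

Whatever TTL appears in these responses is the one caches will obey, regardless of what the flattening configuration screen implied.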

Rollback planning: TTL is your blast radius knob

Rollback timing is also controlled by caches. If you cut over and it’s bad, rolling back is not “instant,” even if you’re fast. If you want fast rollback, you needed low TTL before the cutover, and you need to consider connection reuse. Otherwise rollback is just a second migration, also subject to caching.

Post-cutover: raise TTLs, then monitor for stragglers

After you’re stable, raise TTLs back to a value that reduces query load and operational churn. But don’t rush it. Keep TTLs low long enough to support rollback while the new environment settles.

And watch for straggler clients. There are always a few: embedded devices, old JVMs, corporate proxies, and “helpful” libraries that pin DNS.

Joke #2: Setting TTLs to 5 seconds feels powerful until your DNS bill arrives like a performance review.

Practical tasks: 12+ command-driven checks and decisions

These are the tasks I actually run during a migration. Each one has: command, what the output means, and what decision you make.

Task 1: Verify what the world sees (authoritative answer via +trace)

cr0x@server:~$ dig +trace www.example.com A

; <<>> DiG 9.18.24 <<>> +trace www.example.com A
;; Received 525 bytes from 127.0.0.1#53(127.0.0.1) in 1 ms

www.example.com.      60      IN      A       203.0.113.42
;; Received 56 bytes from 198.51.100.53#53(ns1.example.net) in 22 ms

Meaning: The final answer shows TTL=60 at the authoritative source (as observed through trace). That’s the “truth” being served now.

Decision: If TTL here is not what you expect, stop. Fix the authoritative zone first. Don’t debug clients yet.

Task 2: Check the recursive resolver you actually use

cr0x@server:~$ dig @1.1.1.1 www.example.com A +noall +answer +ttlid

www.example.com.      47      IN      A       203.0.113.42

Meaning: Cloudflare’s resolver has cached the record and will keep it for 47 more seconds.

Decision: If the TTL remaining is huge, your earlier TTL-lowering didn’t “take” in time, or the resolver clamped it. Adjust expectations and rollback strategy.
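
If you want to watch a resolver’s cache drain in real time during the window, a simple sketch (pick the resolver and interval that match your situation):

cr0x@server:~$ watch -n 10 'dig @1.1.1.1 www.example.com A +noall +answer'

The TTL column should tick down and then snap back to your configured value when the resolver re-fetches; if it snaps back to something larger than you set, you’re looking at clamping or a stale upstream.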

Task 3: Compare multiple resolvers to detect clamping or stale caches

cr0x@server:~$ for r in 1.1.1.1 8.8.8.8 9.9.9.9; do dig @$r www.example.com A +noall +answer +ttlid; done

www.example.com.      52      IN      A       203.0.113.42
www.example.com.      300     IN      A       203.0.113.42
www.example.com.      58      IN      A       203.0.113.42

Meaning: One resolver is effectively using 300 seconds. That could be a clamp, or it cached before TTL was lowered.

Decision: If external user experience matters, plan for the worst observed caching. Your “60-second” cutover is not globally 60 seconds.

Task 4: Inspect CNAME chains and TTL at each hop

cr0x@server:~$ dig www.example.com CNAME +noall +answer +ttlid

www.example.com.      60      IN      CNAME   app-lb.example.net.
cr0x@server:~$ dig app-lb.example.net A +noall +answer +ttlid

app-lb.example.net.   300     IN      A       198.51.100.77

Meaning: Even if the CNAME TTL is 60, the A record it points to may have TTL 300 and be cached independently.

Decision: During migrations, lower TTLs consistently across the chain, or change the record at the right level (often the CNAME target).

Task 5: Confirm AAAA behavior (IPv6 can surprise you)

cr0x@server:~$ dig www.example.com AAAA +noall +answer +ttlid

www.example.com.      60      IN      AAAA    2001:db8:10::42

Meaning: Clients preferring IPv6 will use this path. If you only updated A, half your traffic may ignore you.

Decision: Treat A and AAAA as a pair. Migrate both, or intentionally disable one (with full awareness of impact).

Task 6: Check negative caching (NXDOMAIN) before creating new names

cr0x@server:~$ dig newservice.example.com A +noall +answer +authority

example.com.          900     IN      SOA     ns1.example.net. hostmaster.example.com. 2025123101 3600 600 1209600 300

Meaning: There is no answer section; the authority section carries the SOA record, whose last field (300 here) is the negative-caching TTL. Resolvers can cache this NXDOMAIN for up to that long.

Decision: If you’re about to create a brand-new name during cutover, check and tune negative caching in advance. Otherwise “it doesn’t exist” can persist.

Task 7: Verify authoritative NS set and delegation correctness

cr0x@server:~$ dig example.com NS +noall +answer

example.com.          3600    IN      NS      ns1.example.net.
example.com.          3600    IN      NS      ns2.example.net.

Meaning: These are the authoritative servers clients should reach (after delegation).

Decision: If you’re migrating DNS providers, mismatched NS sets or partial delegation will create “some users see old, some new” chaos for days.

Task 8: Check SOA serial and whether secondaries picked up changes

cr0x@server:~$ dig @ns1.example.net example.com SOA +noall +answer
example.com.          900     IN      SOA     ns1.example.net. hostmaster.example.com. 2025123102 3600 600 1209600 300
cr0x@server:~$ dig @ns2.example.net example.com SOA +noall +answer
example.com.          900     IN      SOA     ns1.example.net. hostmaster.example.com. 2025123101 3600 600 1209600 300

Meaning: ns2 is behind (serial differs). Your “change” isn’t globally served yet.

Decision: Fix zone transfer/propagation between authoritative servers before cutover. Otherwise resolvers will get different answers depending on which NS they hit.

Task 9: Observe TTL from a specific client host (system resolver path)

cr0x@server:~$ resolvectl query www.example.com

www.example.com: 203.0.113.42                    -- link: eth0
                 (A) --> 47s

Meaning: systemd-resolved shows remaining TTL in its cache for that host.

Decision: If the host cache is sticky or wrong, you may need to flush local caches for critical systems (or restart a service) as part of cutover.
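
On a host running systemd-resolved that you control, the flush itself is a one-liner (a sketch for that specific stack; nscd, dnsmasq, and others have their own equivalents):

cr0x@server:~$ sudo resolvectl flush-caches
cr0x@server:~$ resolvectl statistics | grep -i cache

The statistics output should show the cache size drop right after the flush, which is a cheap way to prove the step actually ran during a hectic cutover.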

Task 10: Identify which resolver a host is using (you’d be amazed)

cr0x@server:~$ cat /etc/resolv.conf
nameserver 10.0.0.53
search corp.example
options timeout:1 attempts:2

Meaning: This host uses a corporate resolver, not public DNS. Your external tests might be irrelevant.

Decision: Run cutover validation against the resolvers your clients actually use. If you don’t know them, you don’t have a plan.

Task 11: Confirm the resolver’s cache status (BIND example)

cr0x@server:~$ sudo rndc dumpdb -cache
cr0x@server:~$ sudo grep -n "www.example.com" /var/cache/bind/named_dump.db | head
12451:www.example.com. 47 IN A 203.0.113.42

Meaning: The local recursive resolver has the record cached with 47 seconds remaining.

Decision: If the resolver cache is stale during cutover and you control it, consider flushing the specific name (not the entire cache unless you enjoy self-inflicted outages).
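
For BIND, the targeted flush looks like this, assuming you have rndc access to the resolver (a sketch, not a full runbook):

cr0x@server:~$ sudo rndc flushname www.example.com
cr0x@server:~$ dig @127.0.0.1 www.example.com A +noall +answer

The follow-up query forces an immediate re-fetch from the authoritative servers, so you can confirm the resolver now serves the new answer before telling anyone else it’s safe.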

Task 12: Measure whether clients are still hitting the old endpoint (via its logs)

cr0x@server:~$ sudo awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head
  912 203.0.113.10
  301 198.51.100.77
  118 203.0.113.11

Meaning: In the default log format, the first field is the client address, so this shows which clients are still sending requests to the frontend whose log you’re reading. Run it on the old endpoint after cutover: a persistent tail of client IPs means those clients are still resolving to (or pinned on) the old address.

Decision: Keep the old endpoint healthy until the tail dies down, or actively drain/redirect. Don’t turn it off “because DNS.”

Task 13: Validate that the new endpoint answers correctly (don’t trust DNS yet)

cr0x@server:~$ curl -sS -o /dev/null -w "%{http_code} %{remote_ip}\n" --resolve www.example.com:443:203.0.113.42 https://www.example.com/healthz
200 203.0.113.42

Meaning: You forced the connection to the new IP while keeping the hostname for TLS/SNI. Health is good.

Decision: If this fails, do not cut DNS. Fix the new endpoint first. DNS is not a test tool; it’s a steering wheel.

Task 14: Check for long-lived connections that ignore DNS changes

cr0x@server:~$ ss -antp | grep ':443' | head
ESTAB 0 0 203.0.113.42:443 198.51.100.25:52144 users:(("nginx",pid=2210,fd=44))
ESTAB 0 0 203.0.113.42:443 198.51.100.25:52145 users:(("nginx",pid=2210,fd=45))

Meaning: Active sessions exist. If you change DNS, existing clients may keep talking to the old endpoint until these connections close.

Decision: Plan draining and connection lifetime controls (keepalive timeouts, graceful shutdown) alongside TTL changes.
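
What “draining” means depends on the stack. For an nginx frontend, a minimal sketch (the 30s value and config path are illustrative): shorten keepalive ahead of cutover so clients reconnect, and therefore re-resolve, more often; use a graceful quit when you finally retire the old endpoint.

cr0x@server:~$ grep -n keepalive_timeout /etc/nginx/nginx.conf
23:    keepalive_timeout 30s;
cr0x@server:~$ sudo nginx -s quit

The graceful quit stops accepting new connections and lets in-flight requests finish; it is not a substitute for waiting out the DNS cache tail.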

Task 15: Confirm DNSSEC status if you’re changing providers

cr0x@server:~$ dig www.example.com A +dnssec +noall +answer +adflag

www.example.com.      60      IN      A       203.0.113.42

Meaning: +dnssec asks for DNSSEC records, but +noall suppresses the header, so rerun without +noall (or add +comments) to see whether the AD (authentic data) flag is set. AD from a validating resolver means validation succeeded; if validation breaks after your changes, validating resolvers typically return SERVFAIL rather than the answer.

Decision: During provider migrations, manage DS records and signing carefully. DNSSEC failures look like random outages with “but DNS looks fine for me” sprinkled on top.

Fast diagnosis playbook (first/second/third)

When traffic isn’t moving and the migration window is burning, you need a triage order that works.

First: is the authoritative truth correct?

  • Run dig +trace for the exact name and type (A, AAAA, CNAME).
  • Verify TTL, targets, and that the answer matches the intended destination.
  • Check SOA serial on all authoritative servers if you have multiple.

If authoritative is wrong, stop. Fix the zone. Everything else is noise.

Second: are resolvers returning stale answers?

  • Query the resolvers your clients use (corp resolvers, public resolvers, regional resolvers).
  • Compare TTL remaining and IPs returned.
  • Look for clamps (unexpectedly high TTL) or split behavior (some resolvers old, some new).

If resolvers are stale, you’re waiting on caches unless you control the resolvers and can flush selectively.

Third: is traffic stuck for non-DNS reasons?

  • Check whether clients keep existing connections to old endpoints.
  • Check client-side DNS caching behavior in app runtimes (JVM, sidecars, proxies).
  • Check LB/health checks: maybe DNS moved but the new backend is unhealthy so failover logic routes back.

If DNS is correct but traffic still doesn’t move, you’re dealing with connection lifetime, application caching, or upstream routing—not DNS TTL.

Three corporate mini-stories (anonymized, painfully plausible)

1) The incident caused by a wrong assumption

They were migrating a customer portal from a colo to a cloud load balancer. The plan said: “Lower TTL to 60 seconds the day before, then flip the A record at midnight.” Clean and civilized.

The engineer on duty lowered the TTL—on the wrong record. There was a pretty CNAME chain: portal.example.com CNAME’d to portal.edge.example.net, and that had the A record pointing at the old VIP. They lowered TTL on portal.example.com, but left portal.edge.example.net at 3600.

At midnight they flipped the A record at the target name, but caches around the world still held the old A record for up to an hour. Some customers hit the new portal, some hit the old one, and sessions bounced depending on whose resolver you got. Support saw it as “random logouts.” Engineering saw it as “impossible to reproduce.” Everyone was wrong at the same time.

The postmortem wasn’t about “DNS is unreliable.” DNS did exactly what it was asked to do. The failure was assuming the visible hostname was the one controlling cache behavior. CNAME chains are little TTL time machines. If you don’t map them, they will map you.

2) The optimization that backfired

A different company ran a high-traffic API and decided DNS latency was too expensive. They introduced an in-process DNS cache in their client library with a 10-minute minimum TTL. The stated goal: reduce resolver QPS and shave p99 latency.

It worked. Resolver graphs went down. Latency improved a bit. Everyone forgot about it because the dashboard was green and nobody enjoys reading RFCs during a good quarter.

Then came a regional migration. They planned a gradual DNS shift using a weighted setup behind a CNAME. On paper, a 60-second TTL gave them fast control. In reality, large customers running the client library held onto old answers for 10 minutes at a time, and the shift behaved like a stubborn staircase.

Worse: rollback wasn’t real. When a subset of requests started failing in the new region, they “rolled back” DNS, but the cache kept sending traffic to the broken region for minutes. Engineers started to doubt their own tools. The business started to doubt engineering. That’s how you burn credibility, not just uptime.

The fix was not “never cache DNS.” The fix was to respect TTL and make caching behavior observable and configurable. Performance hacks that override contracts always come due during migrations.

3) The boring but correct practice that saved the day

One enterprise team had a ritual. Before any major migration, they ran a DNS readiness drill 48 hours ahead. Not a meeting—a drill. They verified current TTLs, lowered them in a controlled change, and then validated from multiple vantage points (corp resolvers, public resolvers, and a couple of cloud regions).

They also had a policy: no “brand-new name” created during the migration window. If a new hostname was needed, it was created a week prior, queried repeatedly, and monitored specifically to burn off negative caching and weird resolver behaviors.

On migration night, they still hit a snag: one authoritative secondary wasn’t picking up updates reliably due to a firewall rule that had been “temporarily” changed during another project. The drill had caught it two days earlier. They fixed it during business hours with time to spare.

Cutover was uneventful. Not because they were geniuses. Because they made DNS boring on purpose. The best migrations look like nothing happened, which is exactly the point.

Common mistakes: symptom → root cause → fix

1) “We changed DNS but some users still hit the old site for hours”

Symptom: Mixed traffic to old and new endpoints long after cutover.

Root cause: Old TTL was high and you lowered it too late; or resolvers cached before the TTL change; or a resolver clamps TTL upward.

Fix: Lower TTL at least one full old-TTL period before cutover (preferably two). Validate TTL on multiple resolvers. Plan to keep old endpoint alive for the tail.

2) “Internal users see new, external users see old”

Symptom: Corp employees report success; customers report failures (or vice versa).

Root cause: Split-horizon DNS, internal override zones, or different resolvers with different caching states.

Fix: Document which resolvers and zones serve which clients. Test cutover from both inside and outside. Update both views intentionally.

3) “The hostname says NXDOMAIN, then later it works”

Symptom: Newly created names appear broken intermittently.

Root cause: Negative caching of NXDOMAIN due to SOA negative TTL; or name was queried before it existed and cached as “doesn’t exist.”

Fix: Pre-create names well before the window; keep negative caching TTL reasonable; verify SOA parameters; avoid introducing new names at cutover.

4) “We lowered TTL but clients still don’t respect it”

Symptom: Some clients stick to an IP far beyond TTL.

Root cause: Application/runtime DNS caching (JVM settings, custom caches, sidecars), or long-lived connections.

Fix: Audit client DNS behavior. Configure caches to honor TTL. Set max connection ages, drain connections, and plan for session stickiness separately from DNS.

5) “We flipped A record but nothing changed”

Symptom: Monitoring and users still go to old endpoint.

Root cause: You changed the wrong name/type (CNAME chain, different record in use), or there’s an internal override.

Fix: Map the resolution path (dig CNAME + trace). Confirm the queried name and type used by clients. Remove or update overrides.

6) “After DNS provider migration, some users can’t resolve at all”

Symptom: SERVFAIL or timeouts for a subset of resolvers.

Root cause: Delegation mismatch, missing records, partial zone, or DNSSEC DS/signing mismatch.

Fix: Validate NS delegation, authoritative completeness, and DNSSEC chain prior to NS cut. Keep old provider serving during overlap if possible.

7) “Rollback didn’t rollback”

Symptom: You revert DNS but traffic keeps hitting the new broken target.

Root cause: Caches now hold the new answer; long-lived connections persist; some clients pinned DNS.

Fix: Treat rollback as a planned move with its own cache window. Keep TTL low pre-cutover and manage connection draining/timeouts.

Checklists / step-by-step plan

Plan backwards from the cutover window

  1. Inventory the names involved (customer-facing, internal, API, callbacks, webhook endpoints, certificate SANs).
  2. Map resolution: A/AAAA vs CNAME chain, split-horizon views, internal overrides.
  3. Record current TTLs for each record in the chain. This is your “old TTL” you must out-wait.
  4. Decide migration TTL (usually 30–120 seconds) and rollback requirements.
  5. Lower TTLs at least one old-TTL period before cutover (two if you want sleep).
  6. Verify on multiple resolvers that the low TTL is now what they’re caching.
  7. Validate new endpoints by forcing resolution (curl --resolve) and running health checks.
  8. Cut over DNS at the planned time. Log the exact change, time, and serial.
  9. Monitor the tail: old endpoint traffic should decay over a few TTL windows; watch error rates and regional patterns.
  10. Keep rollback viable while you’re still in the “unknown unknowns” period.
  11. Raise TTLs back to steady-state once stable and rollback window closes.
  12. Postmortem the process: what surprised you (clamping, app caches, internal overrides), and bake it into the next runbook.

Cutover-night operational checklist (the stuff you actually do)

  • Confirm authoritative answer with dig +trace just before the change.
  • Query corp resolvers and public resolvers and record TTL remaining (the snapshot sketch after this checklist makes that repeatable).
  • Verify A and AAAA answers (or intentionally disabled IPv6) match your plan.
  • Force-test new endpoint with curl --resolve for TLS correctness and health.
  • Confirm observability: logs, metrics, and alerting for both old and new endpoints.
  • Make the DNS change; increment SOA serial if relevant.
  • Re-check authoritative and resolver answers immediately after.
  • Watch client error rates and old-endpoint traffic. Don’t shut down the old endpoint early.
  • If rollback is needed, execute it fast, but expect cache tail—communicate that reality.
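
A minimal snapshot helper for the resolver checks above, so “record TTL remaining” becomes a file you can diff rather than a memory. It’s a sketch: the name, resolver list, and the ns1.example.net placeholder need editing for your environment.

cr0x@server:~$ cat > dns-snapshot.sh <<'EOF'
#!/usr/bin/env bash
# Snapshot what the authoritative server and selected resolvers answer right now.
# NAME and RESOLVERS are placeholders; edit them for your environment.
NAME="www.example.com"
RESOLVERS="ns1.example.net 10.0.0.53 1.1.1.1 8.8.8.8"
for r in $RESOLVERS; do
  for t in A AAAA; do
    echo "== $(date -u '+%Y-%m-%dT%H:%M:%SZ') @$r $NAME $t"
    dig "@$r" "$NAME" "$t" +noall +answer
  done
done
EOF
cr0x@server:~$ bash dns-snapshot.sh | tee snapshot-before.log

Run it immediately before the change and again right after; the diff shows exactly which resolvers have picked up the new answer and how much TTL the stale ones are still holding.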

Policy checklist (the boring rules that prevent “creative” outages)

  • Don’t introduce brand-new hostnames during a cutover window.
  • Don’t set extremely low TTLs globally; scope them to migration-critical records.
  • Don’t rely on a single resolver’s behavior for validation.
  • Don’t assume “TTL=60” equals “users move in 60 seconds.”
  • Do document where DNS answers come from (authoritative provider, internal views, overrides).
  • Do audit application DNS caching and connection reuse behavior before migrations.

FAQ (the questions you’ll get five minutes before cutover)

1) What TTL should we use for a migration?

For the specific names you’ll flip: 30–120 seconds during the window. But only after you lower TTL early enough for caches to refresh.

2) How early should we lower TTL?

At least one full old TTL before cutover; two if you want more consistent behavior. If the old TTL is 86400, you lower it days ahead, not “tonight.”

3) Why does “DNS propagation” take longer than TTL?

Because the record might have been cached earlier under an older TTL, some resolvers clamp TTL, and some applications cache independently. TTL is maximum cache age from when it was cached.

4) Should we set TTL to 0?

Don’t. Many systems treat 0 in surprising ways, and you’ll spike DNS query load dramatically. If you need near-instant traffic steering, use a load balancer or service discovery mechanism designed for it.

5) Is a CNAME safer than an A record for migrations?

CNAMEs are great for indirection, but they add another caching layer. If you use CNAMEs, manage TTLs across the chain or you’ll get inconsistent cutovers.

6) Can we “flush the internet’s DNS cache”?

No. You can flush caches you control (your resolvers, your hosts, your apps), but not everyone else’s. Plan for the tail and keep the old endpoint alive.

7) Why do some users keep hitting the old endpoint even after caches expire?

Long-lived connections. Clients may reuse TCP connections for minutes or hours. DNS changes only affect new connection establishment unless you drain/close connections.

8) How do we validate the new endpoint without changing DNS yet?

Use forced resolution (for HTTPS, preserve hostname for SNI): curl --resolve. Or edit a controlled client’s hosts file for testing, but don’t mistake that for real-world behavior.

9) What about DNSSEC during migrations?

DNSSEC is fine until you change signing keys or providers. A DS mismatch can cause widespread SERVFAIL. Treat DNSSEC changes as a separate, carefully staged migration.

10) After cutover, when do we raise TTL again?

After you’ve observed stability and your rollback window closes. Commonly: keep low TTLs for a few hours to a day depending on risk, then raise to 300–900 seconds.

Conclusion: practical next steps

If DNS TTL keeps haunting your migrations, it’s usually not because DNS is flaky. It’s because TTL was treated as a last-minute tweak instead of a schedule.

Do these next:

  1. Pick one critical hostname in your environment and map its full resolution chain (including AAAA).
  2. Record current TTLs and decide your migration TTL target (30–120 seconds for cutover names).
  3. Run a drill: lower TTL early, confirm via dig +trace and multiple resolvers, and measure how quickly traffic actually moves.
  4. Audit application DNS caching and connection lifetimes. Fix the “pins DNS forever” cases before they fix you at 3 AM.
  5. Write the fast diagnosis playbook into your on-call runbook, and make someone run it once before the real night.

Set TTL like a pro and DNS becomes a predictable part of your migration, not a ghost story you tell new hires.
