The pitch usually happens in a meeting where nobody has paged out at 3 a.m. recently. “Just use NoSQL,” someone says, as if choosing a database were like choosing a font. A few months later, you’re staring at duplicated orders, missing ledger entries, and a “temporary” reconciliation job that has become your most reliable system.
The problem isn’t that NoSQL is bad. The problem is that “just use NoSQL” is a lazy answer to a hard question: how do you store truth under real production constraints—latency, cost, concurrency, failure, human error, and the business changing its mind every two sprints?
Why “just use NoSQL” burns teams
“Just use NoSQL” is usually a proxy for one of these beliefs:
- SQL won’t scale (translation: we once had a slow query and blamed the database).
- Schemas slow teams down (translation: we don’t want to argue about the data model today).
- We need flexibility (translation: our product requirements are unstable, so we’ll encode that instability into storage).
- Joins are expensive (translation: we saw a join in a slow query plan once).
- We want “web scale” (translation: we want to feel safe without doing capacity planning).
Those beliefs aren’t always wrong. But the advice is incomplete—dangerously incomplete—because it skips the part where you define invariants and decide where to enforce them.
The invariants are the product
If you store money, inventory, permissions, billing entitlements, seat counts, or “only one active session per user,” you are storing invariants. That’s not an implementation detail; that’s the business. The database is where invariants go to either live in peace or die loudly.
Relational databases are not magical because they speak SQL. They’re powerful because they offer a mature, integrated set of tools for enforcing truth:
- Transactions with well-understood isolation levels
- Constraints (foreign keys, unique constraints, check constraints)
- Declarative indexes and query planning
- Durable write-ahead logging and predictable recovery
- Operational observability that has been sharpened by decades of pain
NoSQL systems can absolutely be correct, reliable, and fast. But many of them push correctness up the stack: to the application, to background jobs, to compensating transactions, to “we’ll clean it up later.” That “later” becomes a permanent tax, paid in on-call fatigue.
One practical rule: If the data needs to be correct today, enforce it where it is written. If you can tolerate being “eventually correct,” you still need to define what “eventually” means and how you detect when it didn’t happen.
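As a toy illustration of “enforce it where it is written” (SQLite standing in for any relational store; the table and names are invented for the example), a CHECK constraint plus an atomic conditional UPDATE makes “inventory cannot go negative” impossible to violate, no matter how many app servers race:

```python
import sqlite3

# Illustrative schema: the invariant lives in the datastore, not in a
# read-then-write check in application code.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE inventory (
        sku TEXT PRIMARY KEY,
        available INTEGER NOT NULL CHECK (available >= 0)
    )
""")
conn.execute("INSERT INTO inventory VALUES ('widget', 3)")

def reserve(conn, sku, qty):
    """Atomically decrement stock; returns False if it would go negative."""
    cur = conn.execute(
        "UPDATE inventory SET available = available - ? "
        "WHERE sku = ? AND available >= ?",
        (qty, sku, qty),
    )
    return cur.rowcount == 1  # 0 rows updated means insufficient stock

print(reserve(conn, "widget", 2))  # True: 3 -> 1
print(reserve(conn, "widget", 2))  # False: only 1 left, write refused
```

The point is not SQLite; it’s that the refusal happens at write time, with no reconciliation job needed afterward.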
Short joke #1: Eventual consistency is like a group chat apology: it arrives later, and it doesn’t always fix what broke.
NoSQL is not a single thing
When someone says “NoSQL,” ask which category they mean, because the operational profiles and correctness stories differ:
- Document stores (e.g., Mongo-like): great for nested documents, can be a trap for cross-document invariants.
- Key-value (e.g., Redis-like): fantastic for caching and ephemeral state, terrible as a source of truth unless carefully constrained.
- Wide-column (e.g., Cassandra-like): excellent for high write throughput with known access patterns; hot partitions will ruin your weekend.
- Search indexes (e.g., Elasticsearch-like): not databases; they’re indexes with opinions.
- Graph databases: niche but useful when the relationships are the query.
“Just use NoSQL” ignores the part where you match data shape, access patterns, and failure tolerance to a system’s actual guarantees.
Eight quick facts and a little history
Some context helps, because the industry keeps re-learning the same lessons with different branding.
- Relational databases got their theory in 1970 when E. F. Codd published the relational model. That wasn’t “SQL”; it was a correctness model.
- SQL standardization started in the 1980s. The boring parts—transactions, constraints, query optimization—are where most of the value lives.
- “NoSQL” as a term was popularized around 2009, largely as a banner for systems built to address scale and distribution pains of the time.
- The CAP theorem was formalized in the early 2000s. People still misuse it, mostly to justify broken behavior as a “tradeoff.”
- Amazon’s Dynamo paper (2007) influenced a generation of key-value and wide-column designs focused on availability and partition tolerance.
- Google’s Bigtable paper (2006) shaped wide-column stores and LSM-tree-heavy storage engines optimized for write throughput.
- Two-phase commit predates most cloud platforms. It’s not new; it’s just expensive in distributed environments and painful to operate.
- “NewSQL” wasn’t a magic replacement; it was an attempt to bring SQL semantics to distributed systems, sometimes successfully, often with new operational constraints.
History doesn’t pick your database. But it does explain why certain tradeoffs exist, and why your “flexible” schema keeps turning into a series of bespoke backfills.
The real alternative: design from invariants, then pick tech
The real alternative to “just use NoSQL” is not “just use PostgreSQL.” It’s a decision process:
- Write down the invariants. Not features. Invariants. “An invoice total must equal the sum of its line items.” “A user cannot have two active subscriptions.” “Inventory cannot go negative.”
- Define the consistency contract. For each invariant, answer: must it be true at write time, or can it be reconciled later?
- Model the access patterns. Read paths, write paths, cardinality, bursts, fan-outs.
- Choose where truth lives. One system is the system of record. Everything else is derived, cached, indexed, or denormalized.
- Pick the minimal set of databases. Every additional datastore is an additional failure mode, on-call runbook, backup story, and data migration plan.
“Use one database” is not dogma; it’s a cost model
Teams love polyglot persistence until they have to restore it. “We have Postgres for transactions, Redis for cache, Elasticsearch for search, Kafka for events, and a document store for flexible stuff.” That can be correct. It can also be a reliability horror show.
A practical default for many companies:
- PostgreSQL (or another relational DB) as the system of record.
- Redis for caching and rate-limiting, not for canonical state unless you accept the risks.
- A search index for text search, fed from the system of record.
- A log/event bus if you need async integration and replayability.
Start with the simplest architecture that can be correct. Scale it with boring techniques—indexes, query tuning, partitioning, read replicas—before you graduate to “we need a distributed database.” The graduation criteria should be load you can measure, not fear you can imagine.
One quote, because it’s still true
Paraphrased idea (Werner Vogels): “You build it, you run it” means teams own operational outcomes, not just code merges.
Database choices are operational outcomes. If your database is “someone else’s problem,” your incident queue will teach you otherwise.
NoSQL failure modes you will actually hit
1) “Flexible schema” becomes “silent corruption”
Document stores make it easy to write data with missing fields, wrong types, or subtly inconsistent shapes. Without constraints, your application becomes the schema enforcer. And applications change. Engineers rotate. Migrations get deferred.
Failure pattern: one service writes price_cents as a number, another writes it as a string, analytics casts it “helpfully,” and now revenue is a rounding error with a quarterly review.
2) Cross-entity invariants become background jobs
Need “only one active subscription per account”? In a relational DB: unique constraint plus transaction. In many NoSQL setups: check in the app, race under concurrency, fix with a periodic “dedupe” job, and add a manual tool for the support team.
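A minimal sketch of the relational version, using SQLite as a stand-in for any database with partial unique indexes (the schema is illustrative):

```python
import sqlite3

# "Only one active subscription per account" enforced by the datastore,
# not by application code that can race under concurrency.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE subscriptions (
        id INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL,
        status TEXT NOT NULL CHECK (status IN ('active', 'canceled'))
    )
""")
# The invariant lives here: two 'active' rows for one account cannot exist.
conn.execute("""
    CREATE UNIQUE INDEX one_active_subscription
    ON subscriptions (account_id) WHERE status = 'active'
""")

conn.execute("INSERT INTO subscriptions (account_id, status) VALUES (7, 'active')")
try:
    conn.execute("INSERT INTO subscriptions (account_id, status) VALUES (7, 'active')")
    raced = True   # an app-level check could reach here under concurrency
except sqlite3.IntegrityError:
    raced = False  # the database refused the duplicate; no dedupe job needed
print(raced)  # False
```

PostgreSQL supports the same pattern with `CREATE UNIQUE INDEX ... WHERE status = 'active'`; the dedupe job and the support tool simply never need to exist.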
3) Hot partitions and uneven keys
Distributed NoSQL systems love even key distribution. Real businesses love sequential IDs, tenants with huge traffic, and “today’s date” prefixes. You can build a wide-column cluster that handles millions of writes… until one partition key becomes the black hole.
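A quick sketch of why date-prefixed keys hurt and how hash bucketing helps (the partition count and key shape are invented for the example):

```python
import hashlib
from collections import Counter

N_PARTITIONS = 8  # illustrative; real clusters have many more

def naive_partition(key: str) -> str:
    # "2026-02-04:event-123" -> every write today lands on one partition
    return key.split(":")[0]

def bucketed_partition(key: str) -> str:
    # Add a small deterministic hash bucket to spread the same traffic
    bucket = int(hashlib.sha256(key.encode()).hexdigest(), 16) % N_PARTITIONS
    return f"{key.split(':')[0]}#{bucket}"

events = [f"2026-02-04:event-{i}" for i in range(1000)]
naive = Counter(naive_partition(k) for k in events)
spread = Counter(bucketed_partition(k) for k in events)
print(len(naive))   # all 1000 writes pile onto a single partition
print(len(spread))  # load spread across 8 buckets
```

The cost is real: reading “all of today” now requires querying all buckets. That tradeoff is fine for write-heavy ingestion and painful for ad hoc reads, which is exactly why access patterns must be known up front.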
4) Secondary indexes are not free (and sometimes not real)
Some systems treat secondary indexes as optional, eventually consistent, or expensive to maintain. Your query works fine in staging and then falls apart under production cardinality.
5) Operational complexity becomes the product
Replication, compaction, repair, hinted handoff, quorum reads, tombstones—these are not “advanced features.” They are the cost of admission. If your team doesn’t have the appetite for that work, don’t buy that system.
6) Backup/restore and point-in-time recovery get weird
A relational database with WAL archiving gives you a straightforward story: full backup + WAL replay. In some NoSQL systems, backups are “take snapshots per node and hope they line up.” You can do it, but you need to practice it. Otherwise the first restore attempt will be during an outage. That’s not a drill; it’s a career moment.
When NoSQL is the right call (and when it isn’t)
Use NoSQL when these are true
- Your access patterns are known and stable. You can describe your queries up front and they won’t mutate weekly.
- You need massive write throughput across partitions and can design keys to avoid hotspots.
- You can tolerate weaker transactional semantics for the primary workflow, or you have a compensating design that you have tested.
- You are building derived views: caches, materialized read models, search indexes, time-series ingestion layers.
- Your team can operate it. Not “someone can Google it.” Someone will own it.
Avoid NoSQL as your system of record when these are true
- You need multi-entity invariants at write time. Billing, ledger, permissions, inventory, entitlements.
- You don’t know the access patterns yet. “Flexible schema” won’t save you from unknown queries; it just delays the argument.
- You can’t afford ongoing data-cleanup work. Background reconciliation is a hidden headcount plan.
- You need straightforward auditing. You want constraints, logs, and a crisp story for “who changed what.”
Short joke #2: “Schemaless” usually means “schema written in Slack and lost in a channel nobody checks.”
Three corporate mini-stories from the trenches
Mini-story #1: The incident caused by a wrong assumption (eventual consistency as a default)
A mid-sized SaaS company rebuilt its billing workflow to “move faster.” The old monolith used a relational database with transactions. The new architecture split “subscriptions,” “invoices,” and “payments” into separate services, each with its own document store. The services talked over an event bus.
The wrong assumption was subtle: they assumed “eventual consistency” meant “a few seconds.” In reality, it meant “whenever the consumers are healthy, caught up, and not reprocessing.” During a deploy, the invoice service lagged behind payment events for long enough that the UI showed “payment received” but “invoice unpaid.” Support tickets spiked, then finance escalated because automated dunning fired on paid accounts.
On-call tried to patch it with retries and consumer scaling. It got worse. Replays re-ordered events. Some events were processed twice. A small number were dropped due to a bug in idempotency keys. Nobody had a single place to query “what is the truth for account X right now?”
The fix was not heroic. They made one service—the billing ledger—authoritative in a relational database. Other services became projections. They added idempotency enforcement at the ledger write path and made the UI read from the ledger for customer-facing status. The event bus remained, but its role was integration, not truth.
Mini-story #2: The optimization that backfired (denormalization to avoid joins)
An e-commerce platform had a slow “order details” endpoint. Someone profiled it, saw multiple joins, and made a confident call: “We should denormalize. Joins don’t scale.” They moved order line items and shipping status into a single document per order in a document store. Reads got faster immediately. Everyone celebrated.
Then the backfires arrived on schedule. Returns processing needed to update individual line items while preserving an audit trail. Customer support needed partial updates without overwriting concurrent changes. A fraud workflow added tags and notes. Three teams started writing to the same document, from different services, with different assumptions about concurrency.
Conflicts turned into last-write-wins data loss. Engineers added per-field merge logic. Then they added a “document version” field and retried updates. Then they added a background “repair” job to reconcile missing fields. The order document became a contested territory. Incidents became political: “your service overwrote our fields.”
The eventual stable design reintroduced normalization where concurrency mattered. The order header stayed in one row, line items in another table, and mutable annotations in their own store with explicit ownership. The “fast read” came from materialized views and caching—not from turning a correctness problem into a JSON blob.
Mini-story #3: The boring but correct practice that saved the day (tested restores and PITR)
A B2B platform ran Postgres as its primary store. Nothing flashy. But an SRE insisted on two boring disciplines: WAL archiving for point-in-time recovery and a monthly restore drill. Not “we have backups,” but “we restored to a new cluster and ran application checks.”
One afternoon, an engineer ran a migration that dropped a column and re-created it with the same name but a different type. The migration passed in staging. In production, the app started writing nonsense into the new column. The bug was caught quickly, but the data was already wrong in a way that couldn’t be “fixed forward” without a reliable reference point.
They performed a point-in-time recovery to minutes before the migration, extracted the affected tables, and replayed the legitimate writes using application logs and idempotent job runners. The outage was painful, but bounded. No “we lost a day of data” email went out. Finance didn’t get involved. Nobody had to explain “we think it’s correct.”
That incident didn’t make a conference talk. It did prevent a quarter of slow-motion chaos. Boring wins because it is repeatable.
Fast diagnosis playbook: what to check first/second/third
This is the on-call sequence that finds the bottleneck without spending an hour arguing with dashboards.
First: decide if it’s saturation, contention, or correctness
- Saturation: CPU, IO, memory, or network maxed. Latency rises with load.
- Contention: locks, hot keys, limited concurrency. Throughput stalls; queues build.
- Correctness: replication lag, inconsistent reads, missing indexes causing timeouts that look like outages.
Second: locate the choke point (app vs database vs storage)
- Check app p95 latency and error rate around the DB calls.
- Check DB health: connections, lock waits, slow queries.
- Check storage: iowait, disk latency, saturation, filesystem errors.
Third: pick the fastest safe mitigation
- Reduce load: rate limit, shed non-critical traffic, disable expensive endpoints.
- Improve query plan: add/adjust index, fix N+1 query patterns, cap unbounded queries.
- Fix contention: avoid hot rows/partitions, shorten transactions, tune isolation, add queues.
- Scale safely: add read replicas, add cache, increase instance size only if you understand the limit.
Hands-on tasks: commands, outputs, and the decision you make
These are real tasks you can run today. Each includes a command, what the output tells you, and what decision to take. Assume Linux servers and a mix of PostgreSQL and common NoSQL components. Adjust hostnames and credentials accordingly.
Task 1: Check if the box is IO-bound (Linux iostat)
cr0x@server:~$ iostat -x 1 3
Linux 6.1.0 (db01) 02/04/2026 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.31 0.00 4.22 31.88 0.00 51.59
Device r/s w/s rkB/s wkB/s await svctm %util
nvme0n1 85.0 220.0 5200.0 14800.0 18.2 0.9 92.4
What it means: High %iowait and high disk %util suggest storage saturation. await indicates queueing latency.
Decision: Stop “optimizing code” first. Reduce write amplification (indexes, autovacuum, compaction), move WAL to faster disk, or scale storage/IOPS.
Task 2: See who is eating memory and whether swapping is involved
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 64Gi 52Gi 1.2Gi 1.1Gi 11Gi 8.5Gi
Swap: 8Gi 2.9Gi 5.1Gi
What it means: Swap usage on a DB host often correlates with random latency spikes.
Decision: Reduce memory pressure (tune shared buffers / cache sizes, fix leaks, right-size). If you keep swapping, your “database problem” is a kernel problem.
Task 3: Confirm filesystem space and inode health
cr0x@server:~$ df -h /var/lib/postgresql
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 900G 812G 42G 96% /var/lib/postgresql
What it means: 96% full is an outage in slow motion (vacuum, compaction, and temp files need headroom).
Decision: Expand storage or purge safely. Add alerts at 80/85/90%. Do not wait for 100%.
Task 4: Check disk latency directly
cr0x@server:~$ nvme smart-log /dev/nvme0n1 | sed -n '1,12p'
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0x00
temperature : 41 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 4%
data_units_read : 1,210,331
data_units_written : 2,980,552
host_read_commands : 18,220,114
host_write_commands : 55,991,202
What it means: Not a latency metric, but it tells you if the device is unhealthy. Critical warnings matter.
Decision: If warnings/SMART errors appear, stop debating query plans and start planning disk replacement or migration.
Task 5: Find the slowest PostgreSQL queries by total time
cr0x@server:~$ psql -d appdb -c "SELECT query, calls, total_time, mean_time FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;"
query | calls | total_time | mean_time
----------------------------------------------------------------------+-------+------------+----------
SELECT * FROM orders WHERE account_id = $1 ORDER BY created_at DESC... | 9842 | 812345.12 | 82.54
UPDATE inventory SET available = available - $1 WHERE sku = $2 | 62111 | 420998.77 | 6.78
SELECT * FROM events WHERE tenant_id = $1 AND created_at > $2 | 2109 | 318120.33 | 150.86
What it means: total_time finds “death by a thousand cuts.” mean_time finds sharp knives. (On PostgreSQL 13 and later, these columns are named total_exec_time and mean_exec_time.)
Decision: Fix high total_time first if it dominates load; fix high mean_time if it dominates tail latency.
Task 6: Explain a specific slow query (Postgres EXPLAIN ANALYZE)
cr0x@server:~$ psql -d appdb -c "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE account_id = 42 ORDER BY created_at DESC LIMIT 50;"
Limit (cost=0.56..220.14 rows=50 width=312) (actual time=120.331..120.410 rows=50 loops=1)
  Buffers: shared hit=2110 read=21530
  -> Index Scan Backward using orders_created_at_idx on orders (cost=0.56..44120.12 rows=10024 width=312) (actual time=120.329..120.401 rows=50 loops=1)
       Filter: (account_id = 42)
       Rows Removed by Filter: 580000
       Buffers: shared hit=2110 read=21530
Planning Time: 0.219 ms
Execution Time: 120.450 ms
What it means: The index is on created_at, but you filter by account_id. It’s scanning a lot then filtering.
Decision: Add a composite index like (account_id, created_at DESC) or restructure query. This is “SQL doesn’t scale” only if you refuse to index correctly.
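The same fix in miniature, using SQLite’s EXPLAIN QUERY PLAN as a stand-in for Postgres EXPLAIN (the table is invented; the planner text is a SQLite detail, not a Postgres one):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL,
        created_at TEXT NOT NULL
    )
""")
# Composite index matching the filter column first, then the sort column.
conn.execute("CREATE INDEX orders_account_created ON orders (account_id, created_at)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM orders WHERE account_id = 42 ORDER BY created_at DESC LIMIT 50"
).fetchall()
detail = " ".join(row[-1] for row in plan)
print("orders_account_created" in detail)  # the planner seeks, not scans
```

The shape of the fix is identical in Postgres: put the equality column first, the ORDER BY column second, and the “scan everything then filter” plan disappears.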
Task 7: Check lock contention in PostgreSQL
cr0x@server:~$ psql -d appdb -c "SELECT blocked.pid AS blocked_pid, blocking.pid AS blocking_pid, blocked.query AS blocked_query FROM pg_stat_activity blocked JOIN pg_locks bl ON bl.pid = blocked.pid JOIN pg_locks kl ON kl.locktype = bl.locktype AND kl.database IS NOT DISTINCT FROM bl.database AND kl.relation IS NOT DISTINCT FROM bl.relation AND kl.page IS NOT DISTINCT FROM bl.page AND kl.tuple IS NOT DISTINCT FROM bl.tuple AND kl.virtualxid IS NOT DISTINCT FROM bl.virtualxid AND kl.transactionid IS NOT DISTINCT FROM bl.transactionid AND kl.classid IS NOT DISTINCT FROM bl.classid AND kl.objid IS NOT DISTINCT FROM bl.objid AND kl.objsubid IS NOT DISTINCT FROM bl.objsubid AND kl.pid != bl.pid JOIN pg_stat_activity blocking ON blocking.pid = kl.pid WHERE NOT bl.granted;"
blocked_pid | blocking_pid | blocked_query
------------+--------------+--------------------------------
21934 | 21811 | UPDATE inventory SET available =
What it means: A transaction is blocking others. The DB is fine; your concurrency model is not.
Decision: Shorten transactions, add proper indexes to avoid lock escalation, or redesign the hot-row pattern (per-SKU row counters are notorious).
Task 8: Check replication lag (Postgres streaming replica)
cr0x@server:~$ psql -d appdb -c "SELECT application_name, state, sync_state, write_lag, flush_lag, replay_lag FROM pg_stat_replication;"
 application_name |   state   | sync_state | write_lag | flush_lag | replay_lag
------------------+-----------+------------+-----------+-----------+------------
 replica01        | streaming | async      | 00:00:01  | 00:00:02  | 00:00:15
What it means: Replay lag means reads from the replica may be stale by that duration.
Decision: If your app reads-after-writes from replicas, you need read routing rules, session stickiness, or synchronous replication for specific workflows.
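One way to sketch read-your-writes routing: after a session writes, pin its reads to the primary until the known lag bound has passed. The class, the 15-second bound, and the fake clock are all illustrative assumptions:

```python
import time

class ReadRouter:
    """Pin a session's reads to the primary for a staleness window after a write."""

    def __init__(self, staleness_bound_s: float, clock=time.monotonic):
        self.staleness_bound_s = staleness_bound_s
        self.clock = clock
        self.last_write_at: dict[str, float] = {}

    def record_write(self, session_id: str) -> None:
        self.last_write_at[session_id] = self.clock()

    def choose_replica(self, session_id: str) -> str:
        wrote_at = self.last_write_at.get(session_id)
        if wrote_at is not None and self.clock() - wrote_at < self.staleness_bound_s:
            return "primary"   # the replica may not have replayed this write yet
        return "replica"

# Deterministic demo with an injected clock instead of wall time:
now = [0.0]
router = ReadRouter(staleness_bound_s=15.0, clock=lambda: now[0])
router.record_write("sess-1")
print(router.choose_replica("sess-1"))  # primary (just wrote)
now[0] += 20.0
print(router.choose_replica("sess-1"))  # replica (lag window passed)
```

A fixed bound is a blunt instrument; tracking the actual replayed LSN per replica is stricter, but the fixed window is often good enough and far simpler to operate.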
Task 9: See if autovacuum is keeping up (Postgres bloat risk)
cr0x@server:~$ psql -d appdb -c "SELECT relname, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 5;"
relname | n_dead_tup | last_autovacuum
------------+------------+--------------------------
events | 81234567 | 2026-02-03 01:12:02+00
sessions | 21003444 | 2026-01-29 12:40:10+00
orders | 8120032 | 2026-02-04 00:02:11+00
What it means: High dead tuples mean table bloat, worse cache hit rates, and slower queries.
Decision: Tune autovacuum, adjust fillfactor, or partition large churn-heavy tables. Bloat is not “mysterious slowness”; it’s physics.
Task 10: Check connection pressure (Postgres)
cr0x@server:~$ psql -d appdb -c "SELECT count(*) AS connections, state FROM pg_stat_activity GROUP BY state ORDER BY connections DESC;"
connections | state
------------+---------------------
220 | active
180 | idle
40 | idle in transaction
What it means: Too many active connections can thrash CPU. “Idle in transaction” is a silent killer (it holds locks and prevents vacuum).
Decision: Add a pooler (e.g., pgbouncer), fix transaction handling, and cap connection counts per service.
Task 11: Detect a hot key pattern at the application layer (nginx / access logs example)
cr0x@server:~$ awk '{print $7}' /var/log/nginx/access.log | grep -E '^/api/orders\?account_id=' | sort | uniq -c | sort -nr | head
18421 /api/orders?account_id=42
9211 /api/orders?account_id=17
4420 /api/orders?account_id=5
What it means: A small number of keys dominate traffic. That maps to hot partitions, hot rows, or cache stampedes.
Decision: Add caching, pagination limits, or precomputed views for hot accounts/tenants. Also consider per-tenant throttles.
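The same analysis is easy to automate in code; here is a sketch with fabricated sample log lines (the format mimics common access logs):

```python
from collections import Counter

# Fabricated samples standing in for /var/log/nginx/access.log lines.
log_lines = (
    ['1.2.3.4 - - "GET /api/orders?account_id=42 HTTP/1.1" 200'] * 80
    + ['1.2.3.4 - - "GET /api/orders?account_id=17 HTTP/1.1" 200'] * 15
    + ['1.2.3.4 - - "GET /api/orders?account_id=5 HTTP/1.1" 200'] * 5
)

# Extract the request path (second token inside the quoted request).
paths = Counter(line.split('"')[1].split()[1] for line in log_lines)
top_key, hits = paths.most_common(1)[0]
share = hits / sum(paths.values())
print(top_key, f"{share:.0%}")  # one key carries 80% of the traffic
```

When one key carries most of the traffic, the fix is rarely “bigger cluster”; it’s caching, precomputation, or per-tenant throttling aimed at that key.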
Task 12: Verify Kafka consumer lag (if you use events for “eventual truth”)
cr0x@server:~$ kafka-consumer-groups.sh --bootstrap-server kafka01:9092 --describe --group invoice-service
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID
invoice-service billing 0 1288812 1299910 11098 consumer-1
invoice-service billing 1 1290011 1299988 9977 consumer-2
What it means: Lag means your projections are stale. If your UI or workflows depend on those projections, users will see inconsistent state.
Decision: Either scale consumers/fix processing, or stop using projections as the system of record for user-visible truth.
Task 13: Check MongoDB index usage (example with mongosh)
cr0x@server:~$ mongosh --quiet --eval 'db.orders.find({account_id: 42}).sort({created_at:-1}).limit(50).explain("executionStats").executionStats'
{
"nReturned" : 50,
"executionTimeMillis" : 187,
"totalKeysExamined" : 0,
"totalDocsExamined" : 580050
}
What it means: Examining many documents with zero keys examined suggests no useful index for that query shape.
Decision: Add compound index {account_id:1, created_at:-1} or reconsider data model. Your cluster is not slow; your query is expensive.
Task 14: Check Redis memory and eviction policy (cache turning into outage)
cr0x@server:~$ redis-cli INFO memory | egrep 'used_memory_human|maxmemory_human|mem_fragmentation_ratio'
used_memory_human:14.92G
maxmemory_human:16.00G
mem_fragmentation_ratio:1.62
What it means: Close to max memory with fragmentation means you are approaching eviction or OOM behavior depending on config.
Decision: If Redis is a cache, verify eviction policy is set and safe. If Redis stores canonical state, reconsider your life choices and add persistence/backup discipline.
Common mistakes: symptoms → root cause → fix
1) Symptom: duplicate “unique” records (users, subscriptions, orders)
Root cause: uniqueness enforced in application code, not in the datastore; concurrency races win.
Fix: enforce uniqueness at write time with a database constraint or an atomic conditional write. If you can’t, redesign: pick a canonical key, use idempotency keys, and introduce a de-duplication process with auditability.
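A minimal sketch of the idempotency-key half of that fix (SQLite again as the stand-in; schema and key format are invented):

```python
import sqlite3

# The idempotency key is the primary key, so a retry cannot create a
# second order; the constraint does the dedup, not a background job.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        idempotency_key TEXT PRIMARY KEY,
        account_id INTEGER NOT NULL,
        amount_cents INTEGER NOT NULL
    )
""")

def place_order(conn, key, account_id, amount_cents):
    """Returns True if this call created the order, False if it was a retry."""
    before = conn.total_changes
    conn.execute(
        "INSERT OR IGNORE INTO orders VALUES (?, ?, ?)",
        (key, account_id, amount_cents),
    )
    return conn.total_changes - before == 1  # 0 changes: the key already existed

print(place_order(conn, "req-abc", 42, 1999))  # True: first write wins
print(place_order(conn, "req-abc", 42, 1999))  # False: retry is a no-op
```

In PostgreSQL the equivalent is `INSERT ... ON CONFLICT (idempotency_key) DO NOTHING`; in NoSQL stores, look for an atomic conditional put on the key.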
2) Symptom: “random” stale reads after deploys
Root cause: reading from replicas or projections without a staleness contract; replication lag or consumer lag spikes during deploys.
Fix: read-your-writes routing (stick to primary for a session), bounded staleness checks, or synchronous replication for the workflows that require it.
3) Symptom: tail latency spikes under moderate load
Root cause: storage saturation (iowait), compaction/vacuum pressure, or cache thrash from bloat.
Fix: reduce write amplification, tune autovacuum/compaction, add appropriate indexes, partition churn-heavy tables, and ensure the disk is sized for the workload.
4) Symptom: the database is “slow,” but CPU is low
Root cause: lock contention or queueing on a single hot key/partition/row. Or you’re IO-bound.
Fix: identify blockers (lock views), redesign hot counters, shard by a better key, or batch updates. For IO-bound, address disk and write patterns.
5) Symptom: schema changes take weeks and are scary
Root cause: no migration discipline; large table rewrites; missing feature flags; insufficient backfill strategy.
Fix: adopt expand/contract migrations, backfill in controlled batches, and keep code compatible across versions until rollout completes.
6) Symptom: “We can’t restore quickly”
Root cause: backups exist but restores are untested; no RTO/RPO defined; no runbook.
Fix: run restore drills, define RPO/RTO, implement PITR where possible, and keep automation for provisioning a restore environment.
7) Symptom: search results don’t match the source of truth
Root cause: search index treated as a database; indexing pipeline has lag, drops, or reorder issues.
Fix: treat search as derived state; add reindex capability; use versioning and idempotency; build an integrity check comparing index vs source.
8) Symptom: NoSQL cluster is “healthy” but app errors increase
Root cause: timeouts and retries amplify load; clients misconfigured; p95 latency rises but average looks fine.
Fix: instrument client-side latency, cap retries, add circuit breakers, and tune timeouts to match SLA and failure behavior.
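A sketch of the capped-retry part (backoff values are illustrative; a production client would add jitter and a circuit breaker):

```python
import time

def call_with_retries(op, max_attempts=3, base_delay_s=0.01, sleep=time.sleep):
    """Run op() with bounded retries and exponential backoff on timeouts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except TimeoutError:
            if attempt == max_attempts:
                raise              # surface the failure; do not retry forever
            sleep(base_delay_s * 2 ** (attempt - 1))

# Demo: a permanently failing call stops after the cap instead of
# hammering an already-unhealthy datastore.
attempts = []
def flaky():
    attempts.append(1)
    raise TimeoutError("datastore timeout")

try:
    call_with_retries(flaky, max_attempts=3, sleep=lambda s: None)
except TimeoutError:
    pass
print(len(attempts))  # 3: bounded, not an unbounded retry storm
```

The cap is the point: uncapped retries turn a latency blip into a self-inflicted outage, because every slow response multiplies the offered load.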
Checklists / step-by-step plan
Step-by-step: choosing the datastore without lying to yourself
- List invariants (write-time truth vs eventual truth).
- Define failure tolerance: what happens if writes partially succeed? What’s the user-visible impact?
- Define read patterns: list the top 10 queries/endpoints, not “we’ll search later.”
- Estimate growth: data volume, write QPS, read QPS, and distribution across tenants/keys.
- Pick one system of record. Everything else is derived.
- Choose the simplest DB that satisfies invariants. If that’s relational, embrace it.
- Plan migrations and rollbacks. If you cannot roll back safely, you do not have a plan.
- Operationalize: monitoring, backups, restore drills, schema change policy, capacity alerts.
Checklist: production readiness for any database
- RPO and RTO defined, written down, agreed by the business
- Backups automated, encrypted, and verified
- Restore drill performed (not “we checked the logs”)
- Capacity alerts for CPU, memory, disk, and IOPS
- Slow query visibility with sampling and retention
- Clear ownership and escalation path
- Client timeouts and retries configured intentionally
- Schema/migration process with expand/contract
- Data retention and archiving plan
- Load testing that includes failure modes (node loss, network partitions, consumer lag)
Checklist: if you insist on “eventual consistency”
- Define “eventual”: seconds? minutes? hours? What’s acceptable?
- Idempotency keys for every consumer that mutates state
- Replay strategy tested (including out-of-order and duplicate events)
- Dead-letter queue with operational ownership
- Reconciliation job that is auditable, not a black box
- A single place to query current truth (system of record)
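The idempotency-key item above can be sketched as a consumer that tolerates duplicates and replays (event shape and in-memory store are illustrative; a real consumer would persist the dedupe set and the state change in one transaction):

```python
processed_ids: set[str] = set()
balances: dict[str, int] = {}

def handle_payment_event(event: dict) -> bool:
    """Apply an event at most once; return False for duplicates/replays."""
    if event["event_id"] in processed_ids:
        return False  # duplicate delivery or replay: safe no-op
    balances[event["account"]] = (
        balances.get(event["account"], 0) + event["amount_cents"]
    )
    processed_ids.add(event["event_id"])
    return True

event = {"event_id": "evt-1", "account": "acct-9", "amount_cents": 500}
print(handle_payment_event(event))        # True: applied
print(handle_payment_event(dict(event)))  # False: duplicate ignored
print(balances["acct-9"])                 # 500, not 1000
```

If the dedupe set and the state live in different stores, the crash between the two writes is exactly the inconsistency you were trying to avoid, which is why the transactional version matters.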
FAQ
1) Is “SQL vs NoSQL” even the right question?
Not really. The real question is: where do you enforce invariants, and what failure modes can your team operate? SQL is often the simplest correct answer for systems of record.
2) But doesn’t NoSQL scale better?
Some NoSQL systems scale write throughput and horizontal distribution very well—when you model keys and access patterns correctly. Relational systems also scale far beyond what most teams reach, especially with proper indexing, partitioning, and read scaling.
3) Are joins inherently slow?
No. Bad joins are slow. Missing indexes are slow. Joining huge unfiltered datasets is slow. A well-indexed join on selective predicates is often fast and predictable.
4) What’s the biggest hidden cost of “schemaless” databases?
Operational and organizational drift: you lose a central enforcement point. Every service becomes a schema enforcer, migrations become “best effort,” and data cleanup becomes a permanent workflow.
5) Can I do transactions in NoSQL databases?
Sometimes, yes—within limits. The question is whether the transaction model is mature, what it costs, and whether it’s operationally predictable under load and partitions. Test failure modes, not demos.
6) What’s the “real alternative” if my team wants speed and flexibility?
Use a relational DB as the ledger of truth, then build flexible derived models: JSON columns where appropriate, materialized views, caches, and search indexes. You get agility without abandoning integrity.
7) When is document storage the correct system of record?
When documents are truly independent and your invariants are mostly within the document boundary. Example: content management blobs, user preferences, or session-like data—assuming you still design for migrations and data validation.
8) How do we avoid hot partitions in a distributed NoSQL store?
Design partition keys for even distribution, avoid monotonically increasing prefixes, and simulate real traffic. Watch top keys and skew. If one tenant dominates, build tenant-aware isolation and throttling.
9) What if we already chose NoSQL and now regret it?
Don’t rewrite everything in a panic. Identify the invariants that are hurting you, introduce an authoritative store for those (often relational), and migrate workflows incrementally with dual writes only if you can verify correctness.
10) What’s the first database “upgrade” most teams should do?
Not sharding. Usually: fix indexing, adopt connection pooling, implement backup+PITR, and build a disciplined migration process. Sharding is an operational lifestyle, not a feature flag.
Next steps you can do this week
- Write down your top 10 invariants and mark which must be true at write time.
- Pick the system of record for each invariant (ideally one). Declare everything else derived.
- Run the fast diagnosis playbook on your current pain: is it saturation, contention, or correctness?
- Do one restore drill to a fresh environment. Time it. Record the steps. That’s your real RTO.
- Kill one “temporary” reconciliation job by moving the invariant into a write-time constraint or atomic operation.
- Make one query boring: add the right index, verify with EXPLAIN (ANALYZE), and lock in performance with a regression test.
Database technology won’t save you from unclear truth. But a clear truth model will save you from most database technology debates. Choose systems the way you choose incident response: by what happens when things fail, not by what happens in a demo.