MariaDB vs SQLite Backups: Simple Restore vs Real PITR


The fastest way to ruin a calm Tuesday is to discover your “backup” is really just a file you copied once,
from a process that was actively writing to it, and nobody ever practiced restoring. The second fastest way
is to realize you need point-in-time recovery (PITR) but you never retained the logs that make PITR possible.

SQLite and MariaDB sit on opposite ends of the operational spectrum. SQLite is a file. MariaDB is a server with
moving parts, write-ahead logging, replication, and the ability to rewind time—if you set it up.
This is a practical guide to backups that restore, not backups that soothe.

What you’re really buying: “restore” vs “PITR”

“Backup” is an overloaded word. In most orgs it means “we have some bytes somewhere else.” In production, it means:
we can restore service and data to an acceptable point, within an acceptable time, with predictable risk.
That’s RPO (how much data you can lose) and RTO (how long you can be down) in plain clothes.

SQLite’s superpower is simplicity: one database is one file. If you can make a consistent copy and store versions,
you can restore quickly. But PITR is not a native, first-class experience. You can approximate it with frequent snapshots,
WAL archiving, or application-level journaling, but it’s not the same as replaying a transaction log to a specific second.

MariaDB’s superpower is exactly that transaction log: binary logs (binlogs). If you keep a consistent base backup and
retain binlogs, you can restore to “right before someone dropped a table,” or “right before the bug shipped,” within the
resolution of events and log positions. It’s more operational work, more disk, more failure modes, and a much better answer
when the business asks: “Can we get back what we had at 10:41?”

Simple restore (good enough when…)

  • You can tolerate losing the last N hours of writes (RPO is hours, not minutes).
  • Writes are low volume, and the DB is not the system of record.
  • The cost of setting up PITR (and operating it) exceeds the cost of occasional loss.
  • Your main threat is infrastructure failure, not user error or bad deploys.

Real PITR (non-negotiable when…)

  • Accidental deletes, bad migrations, or app bugs are realistic threats (they are).
  • You need to recover to “a time,” not just “a day.”
  • You need audit-grade recovery: explainable, repeatable, and testable.
  • Regulated data or contractual SLAs make “we lost half a day” unacceptable.

One dry truth: backups are a reliability feature. Treat them like any other production feature—design, operate, test, and
measure. The “feature” is not the backup job; it’s the restore outcome.

Paraphrased idea from Werner Vogels (reliability/operations): “Everything fails, all the time; design assuming failure.”
Your backups are the part where you admit he’s right.

Interesting facts & historical context (why the defaults look weird)

  1. SQLite was designed in 2000 by D. Richard Hipp for embedded use—small, portable, zero-admin.
    It’s not “a small MySQL.” It’s a library with a file format.
  2. SQLite’s single-writer model (with multiple readers) is a deliberate design choice to keep locking simple.
    WAL mode improves concurrency, but it doesn’t turn SQLite into a server.
  3. SQLite’s “hot backup” concept exists because copying a live database file can be inconsistent.
    The safe backup path is a first-class API and CLI feature.
  4. MariaDB was forked from MySQL in 2009, when Oracle moved to acquire Sun (the deal closed in 2010); many operational behaviors
    (binlogs, InnoDB) remain compatible in spirit, even as features diverge.
  5. MySQL/MariaDB binlogs were built for replication and later became the backbone of PITR in many shops.
    PITR is a happy consequence of “we need to ship changes elsewhere.”
  6. InnoDB crash recovery (redo logs) is not the same as PITR. Redo logs help you recover to a consistent state after a crash,
    not to a chosen point before a bad query.
  7. Logical dumps (mysqldump) used to be the default backup approach because they’re portable and simple, but they’re slow at scale,
    and restores are often slower still.
  8. Percona XtraBackup popularized “hot physical backups” for InnoDB by copying pages and applying logs; MariaDB’s mariabackup
    descends from that lineage, with its own tooling and compatibility quirks, but the operational idea is the same.

SQLite backups: file restores done correctly

SQLite backup strategy is mostly about not lying to yourself about “just copy the file.”
If the process is writing while you copy, the file might not represent a valid consistent state.
Sometimes you’ll get lucky. Production is where luck goes to die.

Three SQLite realities you should accept early

  • Backups are either consistent or they’re theater. A byte-identical copy made at the wrong time is still garbage.
  • WAL changes the backup story. If you run in WAL mode, the DB state is spread across the main file and the WAL file.
    Backing up one without the other is a classic “it worked in staging” trap.
  • PITR is not a checkbox. You can approximate PITR via frequent snapshots (filesystem/ZFS/LVM),
    WAL archiving, or app-level event logs. But if the question is “restore to 10:41:17,” your life is harder.

What “good” looks like for SQLite backups

For most embedded deployments, “good” is:

  • Use WAL mode if you need concurrency, but back it up correctly.
  • Use SQLite’s online backup mechanism (VACUUM INTO or .backup) for consistent copies.
  • Version the backups (timestamped or snapshot-based).
  • Run integrity checks on restored copies, not on the live DB under load.

SQLite “PITR-like” options (pick one, be honest about the gaps)

SQLite doesn’t ship a binlog that you can replay to arbitrary times. Here are the practical approximations:

  • Frequent snapshots of the DB file(s). Works well with ZFS/btrfs/LVM snapshots. Your “PITR resolution” is snapshot frequency (see the sketch after this list).
  • Archive WAL files. This can provide finer-grained recovery, but you need a disciplined scheme and careful restore procedure.
    Many teams stop here and still can’t replay cleanly because they didn’t capture checkpoint boundaries or they rotate WAL incorrectly.
  • Application-level event log. If the DB is a cache or local store, maybe you can rebuild from upstream events.
    That’s not PITR; that’s reconstruction. It can be better—if upstream is trustworthy.
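
If you go the snapshot route, the mechanics are short. A minimal sketch, assuming the database directory lives on its own ZFS dataset (the dataset name tank/appdata is hypothetical):

cr0x@server:~$ sudo zfs snapshot tank/appdata@app-$(date -u +%Y%m%dT%H%M%SZ)

The snapshot is atomic at the dataset level, so it captures app.db, app.db-wal, and app.db-shm at the same instant, which is exactly what a raw cp of the main file does not give you. Your recovery resolution is whatever frequency you schedule this at.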

Joke #1: SQLite backups are like houseplants—ignore them for a month, and suddenly you’re Googling “why is this brown” at 2 a.m.

MariaDB backups: full, incremental, and binlog-based PITR

MariaDB backup design is a three-part system: a consistent base backup, binlog retention,
and a rehearsed restore procedure. Skip any one of these and your “PITR” is a slide deck.

Base backups: logical vs physical

The internet loves arguments here. In production, you choose based on dataset size, restore time, and operational constraints.

  • Logical backup (mariadb-dump / mysqldump): Portable and simple. Slower, larger, and restores are often painfully slow.
    Useful for small datasets, schema migrations, or when you need human-readable output.
  • Physical backup (mariabackup / xtrabackup-style): Faster for large datasets, and restores can be much faster.
    Requires engine/file layout compatibility and more careful handling, but it’s the serious option for serious volume.

Binlogs: the difference between “restore” and “rewind”

Binlogs record changes. If you keep them, you can replay forward from a base backup to a chosen moment.
If you don’t keep them, you can’t. There’s no hack, only bargaining.

PITR with MariaDB typically looks like the following (a command-level sketch follows this list):

  1. Restore base backup to a new server or isolated instance.
  2. Apply binlogs up to a stop time (or stop position) right before the disaster.
  3. Validate data consistency and application behavior.
  4. Cut over traffic or export corrected data back to production.
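
A command-level sketch of those four steps, run on the isolated restore host (shown here as cr0x@restore), assuming a prepared mariabackup base at /backups/mariadb/base and archived binlogs under /backups/binlogs; both paths are hypothetical, and the tasks later in this article break each command down:

cr0x@restore:~$ sudo systemctl stop mariadb
cr0x@restore:~$ sudo mariabackup --copy-back --target-dir=/backups/mariadb/base
cr0x@restore:~$ sudo chown -R mysql:mysql /var/lib/mysql && sudo systemctl start mariadb
cr0x@restore:~$ mysqlbinlog --start-position=184467 --stop-datetime="2025-12-30 10:41:00" \
    /backups/binlogs/mariadb-bin.000233 /backups/binlogs/mariadb-bin.000234 | mariadb

The --copy-back step expects an empty datadir, and --start-position applies only to the first binlog file listed, which is why the coordinates recorded at backup time (Task 13) matter so much.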

Replication is not backup (but it can be part of the plan)

Replication copies mistakes at the speed of light. That’s its job. A replica can help if:

  • It’s delayed replication (intentional lag) and you notice the incident before the delay window closes.
  • You can stop the SQL thread quickly and treat the replica as a “time capsule.”
  • You still have binlogs for replay or to build a clean clone.

Joke #2: Replication isn’t a backup—it’s a group chat where everyone immediately repeats the wrong answer.

The reality of “real PITR” (and what it costs)

PITR isn’t “turn on binlogs.” It’s an operational contract:

  • Retention: enough binlogs to cover your recovery window (plus slack).
  • Base backup cadence: frequent enough that replay time is acceptable.
  • Time synchronization: consistent system clocks, known time zones, and careful interpretation of timestamps.
  • Restore practice: monthly or quarterly at minimum, with documented timings and failure notes.

The biggest hidden cost is human: the first time you do a binlog replay under pressure, you will discover which assumptions
were wrong. If you’re lucky, you discover them in a drill.

Practical tasks: commands, outputs, and decisions (12+)

These are the tasks I actually run (or ask someone to run) when evaluating backup posture or executing a restore.
Each has: command, sample output, what it means, and what decision you make.

Task 1 — SQLite: determine journal mode (WAL changes everything)

cr0x@server:~$ sqlite3 /var/lib/app/app.db 'PRAGMA journal_mode;'
wal

Meaning: WAL mode is enabled. Writes go to app.db-wal and checkpoints merge them into the main DB file.

Decision: You must use a hot-backup method or snapshot both the DB and WAL state safely. “cp app.db” is not acceptable.

Task 2 — SQLite: take a consistent online backup using .backup

cr0x@server:~$ sqlite3 /var/lib/app/app.db ".backup '/backups/app.db.2025-12-30T0200Z'"

Meaning: SQLite uses the backup API to produce a consistent copy even while the DB is in use.

Decision: Prefer this over filesystem copy when you can afford the I/O. If this is too slow, move to snapshot-based backups.

Task 3 — SQLite: use VACUUM INTO for a compact, consistent backup

cr0x@server:~$ sqlite3 /var/lib/app/app.db "VACUUM INTO '/backups/app.compact.db';"

Meaning: Produces a new database file, defragmented and consistent. Often smaller, sometimes much smaller.

Decision: Use for nightly backups when size matters. Avoid during peak load: it can be I/O heavy.

Task 4 — SQLite: integrity check a restored copy (not the live DB)

cr0x@server:~$ sqlite3 /backups/app.db.2025-12-30T0200Z "PRAGMA integrity_check;"
ok

Meaning: Internal consistency checks passed.

Decision: Treat this as a gate. If it’s not “ok,” your backup pipeline is broken until proven otherwise.

Task 5 — SQLite: check for WAL files and size (backup risk indicator)

cr0x@server:~$ ls -lh /var/lib/app/app.db*
-rw-r----- 1 app app 1.2G Dec 30 01:59 /var/lib/app/app.db
-rw-r----- 1 app app 512M Dec 30 02:00 /var/lib/app/app.db-wal
-rw-r----- 1 app app  32K Dec 30 01:58 /var/lib/app/app.db-shm

Meaning: WAL is large. A checkpoint may be lagging; copies that ignore WAL will be missing recent transactions.

Decision: Investigate checkpointing and backup method. Consider forcing a checkpoint in a controlled window or tune WAL settings.
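
If you do force a checkpoint in a controlled window, a minimal sketch follows; the output shape is busy-flag|WAL frames|frames checkpointed, and the numbers here are illustrative. TRUNCATE also resets the WAL file to zero length:

cr0x@server:~$ sqlite3 /var/lib/app/app.db 'PRAGMA wal_checkpoint(TRUNCATE);'
0|131072|131072

If the first number comes back 1, the checkpoint was blocked from completing (usually a long-running reader); investigate that before relying on your backup window.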

Task 6 — MariaDB: confirm binlogging is enabled

cr0x@server:~$ mariadb -e "SHOW VARIABLES LIKE 'log_bin';"
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| log_bin       | ON    |
+---------------+-------+

Meaning: Server is generating binlogs.

Decision: If OFF, stop pretending you have PITR. Turn it on, plan for restart if required, and design retention immediately.
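
A minimal configuration sketch for turning this on; the file path, log location, and retention value are assumptions you should adapt, and enabling log_bin requires a server restart:

# /etc/mysql/mariadb.conf.d/60-binlog.cnf  (path varies by distro)
[mariadb]
server_id                  = 1
log_bin                    = /var/lib/mysql/mariadb-bin
binlog_format              = ROW
binlog_expire_logs_seconds = 1209600   # 14 days; size to detection + recovery window
sync_binlog                = 1         # durability over a little write latency

Older MariaDB versions express retention as expire_logs_days instead of binlog_expire_logs_seconds; check which one your server actually honors.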

Task 7 — MariaDB: check binlog format (row-based is your friend)

cr0x@server:~$ mariadb -e "SHOW VARIABLES LIKE 'binlog_format';"
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| binlog_format | ROW   |
+---------------+-------+

Meaning: Row-based logging records row changes, generally safest for PITR and replication correctness.

Decision: Prefer ROW for PITR. If you’re on STATEMENT, be prepared for non-determinism with functions and triggers.

Task 8 — MariaDB: confirm binlog retention settings

cr0x@server:~$ mariadb -e "SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';"
+----------------------------+--------+
| Variable_name              | Value  |
+----------------------------+--------+
| binlog_expire_logs_seconds | 604800 |
+----------------------------+--------+

Meaning: Binlogs expire after 7 days (604800 seconds).

Decision: Match retention to your required recovery window plus time-to-detect. If incidents take two weeks to notice, 7 days is fantasy.
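
Retention can usually be raised at runtime and then persisted in the config; this assumes your MariaDB version exposes binlog_expire_logs_seconds as a dynamic variable (older versions use expire_logs_days):

cr0x@server:~$ mariadb -e "SET GLOBAL binlog_expire_logs_seconds = 1209600;"

A larger window only helps if the disk holding the binlogs can absorb it; recheck free space after changing this.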

Task 9 — MariaDB: list available binlogs (do you even have history?)

cr0x@server:~$ mariadb -e "SHOW BINARY LOGS;"
+--------------------+-----------+
| Log_name           | File_size |
+--------------------+-----------+
| mariadb-bin.000231 | 104857600 |
| mariadb-bin.000232 | 104857600 |
| mariadb-bin.000233 |  52428800 |
+--------------------+-----------+

Meaning: Binlog files exist and sizes look plausible.

Decision: If the list is short or empty, PITR range is short or nonexistent. Fix retention and archive strategy.

Task 10 — MariaDB: take a logical base backup (small datasets / portability)

cr0x@server:~$ mariadb-dump --single-transaction --routines --triggers --events --all-databases > /backups/mariadb.full.sql

Meaning: Consistent snapshot for InnoDB via --single-transaction. Includes routines/events/triggers.

Decision: Use if dataset is modest and restore time is acceptable. If restore takes hours and your RTO is minutes, switch to physical backups.

Task 11 — MariaDB: take a physical base backup with mariabackup

cr0x@server:~$ mariabackup --backup --target-dir=/backups/mariadb/base --user=backup --password='REDACTED'
[00] 2025-12-30 02:05:12 completed OK!

Meaning: Files copied with necessary metadata to apply logs.

Decision: If this fails intermittently, suspect permissions, I/O saturation, or insufficient temp space. Fix that before promising PITR.

Task 12 — MariaDB: prepare (apply logs) before restore

cr0x@server:~$ mariabackup --prepare --target-dir=/backups/mariadb/base
[00] 2025-12-30 02:08:44 InnoDB: Shutdown completed; log sequence number 987654321

Meaning: Backup is made consistent; redo applied.

Decision: A backup that wasn’t prepared isn’t ready to restore. Make “prepare succeeded” part of your pipeline.

Task 13 — MariaDB: capture binlog coordinates at backup time

cr0x@server:~$ mariadb -e "SHOW MASTER STATUS\G"
*************************** 1. row ***************************
             File: mariadb-bin.000233
         Position: 184467
     Binlog_Do_DB:
 Binlog_Ignore_DB:

Meaning: This is the binlog file and position representing the point consistent with your snapshot (for many backup methods).

Decision: Store this alongside the base backup metadata. Without coordinates, binlog replay becomes archaeology.
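
If the base backup came from mariabackup, the coordinates are usually already written into the backup directory for you (the file name and exact contents vary by version, so verify rather than assume); the values below mirror the SHOW MASTER STATUS output above:

cr0x@server:~$ cat /backups/mariadb/base/xtrabackup_binlog_info
mariadb-bin.000233	184467	0-1-12345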

Task 14 — MariaDB: dry-run binlog inspection around an incident time

cr0x@server:~$ mysqlbinlog --start-datetime="2025-12-30 10:35:00" --stop-datetime="2025-12-30 10:45:00" /var/lib/mysql/mariadb-bin.000233 | head
# at 184467
#251230 10:36:01 server id 1  end_log_pos 184742 CRC32 0x9a3c1f2e  GTID 0-1-12345  trans
BEGIN
# at 184742
#251230 10:36:01 server id 1  end_log_pos 185120 CRC32 0x0e1d2b44  Table_map: `prod`.`orders` mapped to number 241

Meaning: You can see the transaction boundaries and event timing.

Decision: Identify the “bad” transaction window. Decide whether you will stop by datetime or by position/GTID for precision.

Task 15 — MariaDB: replay binlogs to a stop time (PITR execution)

cr0x@server:~$ mysqlbinlog --stop-datetime="2025-12-30 10:41:00" /backups/binlogs/mariadb-bin.000233 /backups/binlogs/mariadb-bin.000234 | mariadb

Meaning: Replays changes into the restored instance up to the stop time.

Decision: If you see errors, stop. Decide whether you’re applying to the correct restored base, and whether GTIDs or server IDs are mismatched.
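
When a one-minute datetime window is too blunt (legitimate writes landed in the same minute as the bad transaction), stop on a byte offset instead. A sketch, assuming 187345 is the hypothetical “# at” offset of the bad transaction’s first event from your Task 14 inspection; events at or after that offset are not applied:

cr0x@server:~$ mysqlbinlog --stop-position=187345 /backups/binlogs/mariadb-bin.000233 | mariadb

Note that --stop-position refers to the last binlog file named on the command line, so replay earlier files in a separate pass (or one file at a time) when precision matters.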

Task 16 — MariaDB: verify replication/GTID mode before relying on GTID-based recovery

cr0x@server:~$ mariadb -e "SHOW VARIABLES LIKE 'gtid_strict_mode';"
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| gtid_strict_mode | ON    |
+------------------+-------+

Meaning: GTID strictness is enabled; GTID-based positioning is likely reliable.

Decision: If GTIDs are inconsistent across environments, prefer file+position and keep the restore environment isolated.

Task 17 — Sanity: measure restore throughput and decide if base backup cadence must change

cr0x@server:~$ /usr/bin/time -f "elapsed=%E cpu=%P" mariadb < /backups/mariadb.full.sql
elapsed=01:42:18 cpu=86%

Meaning: Logical restore took ~1h42m on this hardware.

Decision: If your RTO is 30 minutes, you need physical restores, smaller datasets per instance, or a different architecture. Don’t negotiate with physics.

Fast diagnosis playbook (find the bottleneck fast)

When backups or restores are slow or failing, don’t guess. Check the boring constraints first. They’re usually guilty.

First: storage and I/O saturation

  • Check disk space (backup targets and source). Out-of-space errors often show up as “corruption” later.
  • Check I/O wait. A “slow backup tool” is often a saturated volume (quick commands after this list).
  • Check filesystem latency spikes. Snapshotting or compaction jobs can ruin backup windows.
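
Two commands answer most of this in a minute; the mount points and the sysstat package are assumptions:

cr0x@server:~$ df -h /var/lib/mysql /backups
cr0x@server:~$ iostat -x 5 3

High %util and await on the source or target volume during the backup window means the tool isn’t slow, the disk is busy.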

Second: consistency mechanics

  • SQLite: Are you backing up via API or blindly copying files? Are you capturing WAL/shm state correctly?
  • MariaDB: Are you using --single-transaction for logical backups? Is the physical backup being prepared?
  • Binlogs: Do you have the right binlog coordinates? Are the needed binlog files still retained?

Third: operational plumbing

  • Permissions and SELinux/AppArmor profiles.
  • Clock/time zone confusion (stop-datetime applied in the wrong timezone is a classic).
  • Network bottlenecks to object storage or NFS targets.
  • Compression CPU saturation (especially when you “optimized” it).

Three corporate mini-stories from the backup trenches

Mini-story 1: the incident caused by a wrong assumption

A mid-sized company ran a customer support product that stored queue state in SQLite on each application node.
It wasn’t the system of record—until it quietly became one. A new feature started writing “final” customer outcomes locally
to speed up page loads. Nobody updated the data classification. It happens.

Backups were a cron job: cp app.db /mnt/backups/app.db. Seemed fine. Restores worked in a sandbox,
because the sandbox workload was light and checkpoints were frequent.

Then an incident: a node rebooted during peak hours. After restart, the app crashed when reading certain rows.
The team tried to restore from the copied file. Same crash. They tried older copies. Some worked, some didn’t.
The backups were a roulette wheel.

The root cause was simple: WAL mode had been enabled months earlier to reduce writer blocking, but the “backup” copied only the main DB file.
Sometimes the WAL had critical transactions not checkpointed yet. The copied file alone was inconsistent with application expectations.

The fix was equally simple, and slightly humiliating: replace cp with SQLite’s backup API, and add an integrity check step on the backup artifact.
They also documented the fact that WAL exists and is not optional. The incident was less about SQLite and more about assuming a file copy equals a database backup.

Mini-story 2: the optimization that backfired

Another org ran MariaDB for a line-of-business platform. Backups were physical and solid, but the backup window started encroaching on business hours.
Someone had the bright idea to compress backups harder to reduce storage costs and speed up uploads to remote storage.

Compression did reduce bytes on disk. It also pegged CPU for hours and increased I/O wait because the pipeline stopped being “copy speed” and became “CPU speed.”
The database server began to show latency spikes during backups, and the app team blamed “MariaDB being slow again.”

They responded by lowering InnoDB buffer pool pressure and tuning query caches (yes, really). Latency got worse.
The real issue was that the backup process was competing with production traffic for CPU cycles and cache.

The backfiring part arrived during a restore drill. Decompression became the long pole. The theoretical bandwidth savings didn’t matter
because the RTO was dominated by CPU-bound inflate time. The restore missed the internal objective by a wide margin.

The postmortem outcome was unglamorous: use lighter compression or none for the “hot” restore path, keep heavy compression only for cold archives,
and run backups on a replica or dedicated backup host when possible. Storage is cheaper than downtime; also, your CFO doesn’t take pages at 3 a.m.

Mini-story 3: the boring but correct practice that saved the day

A financial services team ran MariaDB with strict change controls. They did one thing that looked almost quaint:
every month, they performed a full restore drill into an isolated environment, then ran a known query suite to validate business totals.
It wasn’t fun. It was scheduled. It was documented. People complained.

One Friday, an engineer ran a schema migration that included a data cleanup script. The script had a bug: it matched more rows than intended.
Production started shedding “valid” records. Monitoring didn’t immediately flag it because QPS and error rates were fine. The data was just… wrong.

They detected the issue through application-level anomaly detection and stopped the bleeding quickly. Now the question:
“Can we restore to right before the migration?” This is where teams discover whether PITR is real.

They restored the last physical base backup to a fresh instance, applied binlogs up to minutes before the migration using a stop-datetime,
verified totals via the same monthly drill queries, and cut over. The whole process was practiced; nobody was learning mysqlbinlog under stress.

The boring monthly drill didn’t just save data. It saved decision time. Everyone already trusted the runbook, the timings, and the validation checks.
That’s what “operational maturity” looks like when it’s not a slide.

Common mistakes: symptoms → root cause → fix

1) “SQLite restore works sometimes, sometimes it corrupts”

Symptoms: Restored file opens intermittently; missing recent changes; “database disk image is malformed.”

Root cause: Backing up by copying the main DB file while in WAL mode or during active writes; ignoring -wal and -shm.

Fix: Use .backup or VACUUM INTO. If using filesystem snapshots, snapshot atomically and include WAL state; run integrity checks on artifacts.

2) “MariaDB PITR failed: missing binlog file”

Symptoms: mysqlbinlog errors: cannot open binlog; gaps in binlog sequence; restore stops early.

Root cause: Binlog retention too short; rotation/cleanup misconfigured; binlogs stored only locally and lost with the server.

Fix: Increase binlog_expire_logs_seconds to match detection + recovery window; archive binlogs to durable storage; alert on missing sequences.

3) “Logical backup succeeded, but restore is inconsistent”

Symptoms: Foreign key errors; partial data; tables missing rows that “should be there.”

Root cause: Dump taken without --single-transaction on InnoDB; concurrent writes produced a non-atomic snapshot.

Fix: Use --single-transaction and avoid dumping non-transactional tables without planning; consider physical backups for mixed engines.

4) “PITR restores to the wrong moment”

Symptoms: Data includes changes that should have been excluded; missing changes that should be present.

Root cause: Time zone confusion; server clock drift; using stop-datetime interpreted differently than expected; ignoring DST transitions.

Fix: Use UTC for servers and runbooks; record incident times in UTC; consider stopping by binlog position/GTID when precision matters.

5) “Backups succeed, but restores are too slow to meet RTO”

Symptoms: Restore takes hours; decompression dominates time; app team starts talking about “acceptable data loss” during the incident.

Root cause: Strategy optimized for storage cost, not restore time; logical restores used at large scale; underpowered restore environment.

Fix: Measure restores, then redesign: physical backups, faster restore hosts, parallelism, or sharding. Keep hot-path backups in a restore-friendly format.

6) “Backups cause production latency spikes”

Symptoms: Increased query latency during backup windows; CPU pegged; I/O wait climbs.

Root cause: Backup process competing for I/O/CPU; heavy compression on primary; snapshots triggering copy-on-write churn.

Fix: Run backups on replicas; cap CPU/I/O for backup jobs; move compaction jobs out of peak; test snapshot performance characteristics.

Checklists / step-by-step plan

Decision checklist: choose SQLite “simple restore” vs MariaDB “real PITR”

  • Is the data authoritative? If yes, favor MariaDB (or another server DB) with PITR.
  • Do you need to undo user error? If yes, you want PITR. Snapshots alone are usually too coarse.
  • How fast do you need to restore? If minutes, avoid slow logical restore paths.
  • Can you run and monitor a server DB? If not, SQLite with disciplined backups and frequent snapshots may be more honest.

SQLite step-by-step: a backup that actually restores

  1. Confirm journal mode and whether WAL is in play.
  2. Use .backup or VACUUM INTO to create a consistent artifact (a scripted sketch follows this list).
  3. Store backups with versioning (timestamped names or snapshot IDs).
  4. Run PRAGMA integrity_check; on the artifact.
  5. Practice restore: replace DB file, start app, run a minimal health query.
  6. Decide your “PITR-like” resolution: snapshot hourly? every 5 minutes? Be explicit.
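
A minimal sketch of steps 2-4 as a cron-able script, assuming the paths used elsewhere in this article and the sqlite3 CLI on the host; the retention count and locations are placeholders:

#!/bin/sh
# sqlite-backup.sh: consistent SQLite backup with an integrity gate (sketch)
set -eu

DB=/var/lib/app/app.db
TS=$(date -u +%Y-%m-%dT%H%MZ)
OUT=/backups/app.db.$TS

# Consistent online copy via the backup API; safe while the app is writing
sqlite3 "$DB" ".backup '$OUT'"

# Gate on the artifact, not the live database
if [ "$(sqlite3 "$OUT" 'PRAGMA integrity_check;')" != "ok" ]; then
    echo "integrity_check failed for $OUT" >&2
    exit 1
fi

# Keep the newest 14 artifacts; adjust to your retention policy
ls -1t /backups/app.db.* | tail -n +15 | xargs -r rm --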

MariaDB step-by-step: base backup + binlogs for PITR

  1. Enable binlogs and select ROW format.
  2. Set retention to match business requirements; archive binlogs off-host if the host can die (see the sketch after this list).
  3. Take periodic base backups (physical for large datasets).
  4. Record binlog coordinates (file/position or GTID) with the base backup metadata.
  5. Restore drill: restore base to isolated instance, apply binlogs to a chosen stop time, validate with queries, document timing.
  6. Operationalize: alerts for binlog gaps, backup failures, restore test failures.
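
A minimal sketch for the archiving half of step 2, rotating the active binlog and syncing everything closed to another host (the host name, paths, and cron cadence are assumptions; object storage works the same way):

cr0x@server:~$ mariadb -e "FLUSH BINARY LOGS;"
cr0x@server:~$ ls /var/lib/mysql/mariadb-bin.[0-9]* | head -n -1 \
    | xargs -r -I{} rsync -a {} backup-host:/backups/binlogs/

The head -n -1 skips the binlog the server is currently writing; the FLUSH beforehand guarantees everything older is closed and safe to copy. For continuous streaming instead of a cron cadence, mysqlbinlog with --read-from-remote-server --raw --stop-never is the usual alternative.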

FAQ

1) Can SQLite do “real” PITR?

Not in the MariaDB sense. SQLite doesn’t provide a first-class transaction log you can replay to an arbitrary point.
You can approximate with frequent snapshots and careful WAL handling, but call it what it is: snapshot-based recovery.

2) If I use filesystem snapshots, can I just snapshot the SQLite file?

Only if you understand the journal mode. In WAL mode, you need a crash-consistent snapshot of the DB plus WAL state.
The safer default is using SQLite’s backup API, unless snapshots are atomically coordinated and tested.

3) Is replication in MariaDB enough to recover from mistakes?

No. Replication faithfully reproduces mistakes. It can help if you run delayed replication or stop a replica quickly,
but PITR still depends on binlogs and a base backup.

4) Do I need physical backups to do PITR in MariaDB?

No. You need a consistent base backup plus binlogs. The base can be logical or physical.
Physical is often the only way to hit an aggressive RTO at scale.

5) Why not just run mysqldump nightly and call it a day?

Nightly dumps give you a daily restore point. That’s fine if your RPO is a day and your data loss tolerance is real.
If you need “undo what happened at 10:41,” you need binlogs and the operational discipline around them.

6) How long should I keep binlogs?

Keep them for at least your required recovery window plus your detection window plus slack.
If you typically detect issues in 72 hours but occasionally in 10 days, size for 10 days, not your best-case.
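A worked example with illustrative numbers: 10 days of worst-case detection + 1 day to execute and validate the recovery + 3 days of slack = 14 days, i.e. binlog_expire_logs_seconds = 14 × 86400 = 1209600.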

7) What’s the single most reliable way to know my backups work?

Restore them. Regularly. Automatically if possible. A backup without a restore test is a hope, not a control.

8) Can I apply binlogs directly onto production to “undo” a mistake?

Usually: don’t. Do PITR into an isolated instance, validate, then cut over or selectively export corrected data.
Directly manipulating production during an incident is how you turn one mistake into a trilogy.

9) Why does binlog replay sometimes fail with duplicate keys or missing tables?

Because your base backup doesn’t match the binlog stream you’re replaying (wrong coordinates, wrong server, wrong dataset),
or you restored a partial backup. Treat backup metadata as part of the backup, not optional garnish.

10) What should I monitor for backup health?

Backup job success is table stakes. Monitor: backup artifact size anomalies, integrity check results, restore drill timings,
binlog retention coverage, and storage free space on source and target.

Conclusion: what to do next week

If you run SQLite, stop copying the file like it’s a JPEG. Use the backup API, verify integrity on the artifact, and pick a snapshot cadence you can defend.
If you need true PITR semantics, admit SQLite isn’t built for that job and move the system of record to a server database—or build an event log that can reconstruct truth.

If you run MariaDB, make PITR real: ensure binlogs are on, retained, and archived; take consistent base backups; record coordinates; and rehearse restore.
Do one restore drill this month. Time it. Write down what surprised you. Then fix those surprises while nobody is paging you.
