NixOS 25.11 Install: Reproducible Setup, Zero Config Drift, Maximum Control

You know the feeling: a server “mostly works,” except no one can explain why. The firewall rule was “temporarily” changed last quarter, the kernel parameter was edited at 2 a.m., and the disk layout is a surprise waiting to happen during the next outage.

NixOS is the antidote if you treat it like an operating system, not a hobby. This is a production-minded install of NixOS 25.11 aimed at reproducibility, minimal drift, and the kind of control you only appreciate once you’ve debugged a broken bootloader in a cold data center.

What you’re building (and what you’re not)

You’re building a machine that can be rebuilt. Not “reinstalled with notes.” Rebuilt. From a repo. On demand. With an audit trail.

That means three non-negotiables:

  • Declarative system state: services, users, packages, networking, filesystems, kernel params.
  • Atomic changes: you don’t “apply patches,” you build a new system generation and switch to it.
  • Rollback as a first-class feature: because production loves surprises, and NixOS dislikes them.

What you are not building: a one-off snowflake box “tuned” with hand edits across /etc and a pile of shell history. If you want that life, any other Linux will happily enable it.

Also, let’s get one thing straight: NixOS won’t save you from bad decisions. It will, however, preserve them perfectly and reproducibly. That’s a feature, not a threat.

Interesting facts and context (why this exists)

NixOS didn’t appear because someone wanted a new init system to argue about. It’s the practical outcome of treating software deployment as a functional build problem.

8 short facts worth knowing

  1. Nix predates containers as a mainstream ops tool. The Nix package manager’s ideas go back to early 2000s research on purely functional deployment.
  2. /nix/store is content-address-ish by design. Store paths include hashes so you can install multiple versions and variants side-by-side without file collisions.
  3. NixOS makes system upgrades transactional. New configs build a new “system closure” and boot entries are generated for each generation.
  4. Rollback is built into the boot menu. You can boot last-known-good even when networking is dead and SSH is a memory.
  5. Configurations are code, not a pile of diff-unfriendly files. This is why NixOS is brutally good at fleet consistency.
  6. Nixpkgs is one of the largest package collections. Not because it’s fashionable, but because Nix’s build isolation scales maintainably.
  7. “Declarative OS” wasn’t invented by NixOS, but NixOS operationalized it. You can find echoes in tools like CFEngine/Puppet/Chef, but NixOS makes the OS itself declarative.
  8. Generations are cheap compared to traditional image builds. You keep multiple system states without cloning full disks, because the store deduplicates content.

One paraphrased principle: Gene Kim’s reliability work repeatedly pushes the same idea: make changes small, reversible, and routine. NixOS bakes that into the OS workflow.

Decisions that matter: disks, boot, secrets, updates

Pick your filesystem like you’ll be on-call for it

For servers: you want predictable recovery, solid tooling, and a failure model you understand. Choose based on your team’s comfort, not internet points.

  • ext4: boring, stable, fast enough. If you don’t need snapshots at the filesystem layer, ext4 is the “get home safe” choice.
  • Btrfs: snapshots and subvolumes are useful; send/receive helps; it can be fantastic. Also: you must respect its operational sharp edges.
  • ZFS: best-in-class for checksumming, snapshots, and replication workflows. Also: you’re signing up for ZFS operational discipline (memory, ARC behavior, pool health).

This guide uses ZFS root as the “maximum control” path, with notes for ext4/Btrfs where it changes decisions.

UEFI only, GPT only, and no “clever” partitioning

Use UEFI + GPT. Create an EFI System Partition. Keep it boring. You can be clever later, when you have monitoring, backups, and a lab to test in.

Flakes: yes, unless you have a strict reason not to

Use flakes for new installs. They’re not perfect, but they standardize pinning and composition. The alternative is “channels,” which is basically “trust me bro, it’s latest.”

Secrets: decide now, not later

If you’re deploying anything beyond a personal laptop, you will have secrets: SSH host keys, TLS certs, tokens, Wi‑Fi PSKs, API keys. Decide your secret strategy before you ship:

  • For small setups: encrypted files in git + age-based tooling (common pattern), decrypted at deploy time.
  • For corp setups: integrate with a real secret backend and wire your NixOS config to fetch at runtime.

Do not store secrets in world-readable Nix store paths. The store is designed for reproducibility, not confidentiality.
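
For the small-setup pattern, a common tool is agenix (assumed here; sops-nix is a similar alternative). A minimal sketch of the wiring, with hypothetical file and user names:

```nix
# Hypothetical sketch assuming the agenix NixOS module is imported.
# The encrypted .age file is safe to commit; it is decrypted at
# activation time under /run/agenix, not into the world-readable store.
age.secrets.api-token = {
  file = ./secrets/api-token.age;  # encrypted against host/admin age keys
  owner = "myservice";             # hypothetical service user
  mode = "0400";
};
```

Services then read config.age.secrets.api-token.path at runtime instead of embedding the secret in a Nix string.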

Updates: pick a cadence and write it down

NixOS makes it easy to update. That’s not the same as safe. You want a rhythm:

  • Pin inputs (flake.lock) and update intentionally.
  • Test builds (even on the host) before switching.
  • Always keep a rollback path (boot entries + previous generations).

Joke #1: Configuration drift is like entropy: you can’t argue with it, only budget for the heat death of your /etc.

Checklists / step-by-step plan

Pre-flight checklist (before you boot the ISO)

  • Decide filesystem: ext4 vs Btrfs vs ZFS. If you pick ZFS, decide pool name and dataset layout.
  • Decide boot: UEFI + systemd-boot is fine for most. If you need Secure Boot, plan it now.
  • Decide identity: hostname, static IP or DHCP, SSH access model, admin users, and how you’ll store config (git).
  • Decide update channel: stable release branch vs tracking. For servers, prefer stable + planned bumps.
  • Decide secrets: encrypted repo, external secret manager, or manual provisioning (last resort).
  • Decide remote management: will you deploy locally or via SSH from a build host?

Install checklist (high-level)

  1. Boot NixOS ISO, confirm network works.
  2. Partition disks (GPT + EFI + ZFS partition).
  3. Create ZFS pool + datasets (or ext4/Btrfs equivalents).
  4. Mount filesystems at /mnt.
  5. Generate initial NixOS config.
  6. Convert to flake-based structure.
  7. Set bootloader, networking, users, sshd, time, locale.
  8. Install NixOS, set root password (or disable password and rely on SSH keys).
  9. Reboot, validate boot entries, validate ZFS import, validate SSH.
  10. Commit config to git. Treat it like infrastructure.

Post-install checklist (first hour)

  • Confirm rollback entries exist in the boot menu.
  • Confirm you can rebuild and switch without errors.
  • Set up monitoring basics (disk, memory, load, ZFS health).
  • Set up backups (snapshots + replication, or file-level backups).
  • Decide how upgrades are triggered (manual, timer, CI-driven).
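
If you land on timer-driven upgrades, NixOS ships an option for it; a hedged sketch, with an illustrative flake URL:

```nix
# Sketch: pull-based auto-upgrade from a pinned flake.
# Keep allowReboot = false on servers until reboot behavior is tested.
system.autoUpgrade = {
  enable = true;
  flake = "github:example/infra#server";  # hypothetical repo
  dates = "04:00";                        # systemd calendar expression
  allowReboot = false;
};
```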

Practical tasks: commands, expected output, and decisions

These are the tasks I actually run. Each one includes what the output means and what you decide next. Assume you’re on the NixOS installer environment unless stated otherwise.

Task 1: Confirm you booted in UEFI mode

cr0x@server:~$ ls /sys/firmware/efi
efivars  fw_platform_size

What it means: If that directory exists, you’re in UEFI mode. If it doesn’t, you booted legacy BIOS.

Decision: If UEFI is missing, reboot and fix firmware settings. Don’t install in legacy mode unless you have a very specific constraint.

Task 2: Identify disks and their topology

cr0x@server:~$ lsblk -o NAME,SIZE,TYPE,MODEL,SERIAL,FSTYPE,MOUNTPOINTS
NAME        SIZE TYPE MODEL           SERIAL    FSTYPE MOUNTPOINTS
nvme0n1   953.9G disk Samsung SSD     S6ABC123
├─nvme0n1p1   1G part                 vfat   /mnt/boot
└─nvme0n1p2 952.9G part                 zfs_member
sda        3.6T disk ST4000NM         Z4XYZ789

What it means: You see device names, sizes, and whether a filesystem already exists. Also: check you’re not about to nuke the wrong disk.

Decision: Choose the install target(s). If this is a mirrored boot setup, you want two disks. If it’s a VM, one is fine.

Task 3: Sanity-check existing partition tables (and wipe if appropriate)

cr0x@server:~$ sudo sgdisk -p /dev/nvme0n1
Disk /dev/nvme0n1: 2000409264 sectors, 953.9 GiB
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): 4A5F0B45-7E62-4C6D-A3D2-8D5F7E0C1A1B
Partition table holds up to 128 entries
Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048         2099199   1024.0 MiB  EF00  EFI System
   2         2099200      2000408575   952.9 GiB   BF01  ZFS

What it means: GPT is present. Partition codes EF00 (EFI) and BF01 (ZFS) are what we want.

Decision: If you see an old messy layout and you’re reinstalling, wipe the disk label cleanly (don’t half-reuse partitions unless you’re also writing a runbook for future-you).

Task 4: Partition a disk (single-disk example)

cr0x@server:~$ sudo sgdisk --zap-all /dev/nvme0n1
Creating new GPT entries in memory.
GPT data structures destroyed! You may now partition the disk using fdisk or other utilities.

What it means: Old GPT/MBR data is removed.

Decision: Proceed only if you are absolutely sure you selected the right disk.

cr0x@server:~$ sudo sgdisk -n1:1M:+1G -t1:EF00 -c1:"EFI System" /dev/nvme0n1
The operation has completed successfully.
cr0x@server:~$ sudo sgdisk -n2:0:0 -t2:BF01 -c2:"ZFS" /dev/nvme0n1
The operation has completed successfully.

What it means: You created a 1 GiB ESP and a ZFS partition consuming the rest.

Decision: If you plan to use hibernation or have special swap requirements, plan that partition now. Note that swap files don’t work on ZFS (the filesystem can’t back them), so on a ZFS root it’s a dedicated partition, a zvol, or no swap at all, depending on workload.
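
If you do need swap on a ZFS root, the usual pattern is a zvol, since swap files on ZFS datasets are not supported. A sketch, assuming the pool layout from this guide:

```nix
# Sketch: swap on a ZFS zvol. Create the zvol first, e.g.:
#   zfs create -V 8G -b $(getconf PAGESIZE) -o compression=off \
#     -o logbias=throughput -o sync=always -o primarycache=metadata \
#     rpool/local/swap
swapDevices = [
  { device = "/dev/zvol/rpool/local/swap"; }
];
```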

Task 5: Format the EFI System Partition

cr0x@server:~$ sudo mkfs.fat -F 32 -n EFI /dev/nvme0n1p1
mkfs.fat 4.2 (2021-01-31)

What it means: ESP is FAT32-labeled EFI.

Decision: If you’re doing mirrored boot, you’ll format an ESP on each disk and mount both (or use a sync approach). Single-disk is simpler.

Task 6: Create the ZFS pool (root on ZFS)

cr0x@server:~$ sudo zpool create -f -o ashift=12 \
  -O atime=off -O compression=zstd -O xattr=sa -O normalization=formD \
  -O mountpoint=none -O acltype=posixacl \
  rpool /dev/nvme0n1p2
cr0x@server:~$ sudo zpool status
  pool: rpool
 state: ONLINE
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          ONLINE       0     0     0
          nvme0n1p2    ONLINE       0     0     0

errors: No known data errors

What it means: Pool is created and healthy.

Decision: If you see anything but ONLINE with zero errors, stop. Fix cabling, disk health, or device paths. Do not “continue and hope.”

Task 7: Create datasets (separate root, nix store, and persistent state)

cr0x@server:~$ sudo zfs create -o mountpoint=none rpool/local
cr0x@server:~$ sudo zfs create -o mountpoint=none rpool/safe
cr0x@server:~$ sudo zfs create -o mountpoint=legacy rpool/local/root
cr0x@server:~$ sudo zfs create -o mountpoint=legacy -o compression=zstd rpool/local/nix
cr0x@server:~$ sudo zfs create -o mountpoint=legacy rpool/safe/persist
cr0x@server:~$ sudo zfs list -o name,mountpoint,compression
NAME               MOUNTPOINT  COMPRESSION
rpool              none        zstd
rpool/local        none        zstd
rpool/local/nix    legacy      zstd
rpool/local/root   legacy      zstd
rpool/safe         none        zstd
rpool/safe/persist legacy      zstd

What it means: You’ve separated datasets by “can be rebuilt” (local) versus “should survive reimage” (safe/persist).

Decision: Put /nix on its own dataset. It grows. It churns. You want to snapshot it differently than your persistent state.

Task 8: Mount filesystems for installation

cr0x@server:~$ sudo mount -t zfs rpool/local/root /mnt
cr0x@server:~$ sudo mkdir -p /mnt/nix /mnt/persist /mnt/boot
cr0x@server:~$ sudo mount -t zfs rpool/local/nix /mnt/nix
cr0x@server:~$ sudo mount -t zfs rpool/safe/persist /mnt/persist
cr0x@server:~$ sudo mount /dev/nvme0n1p1 /mnt/boot
cr0x@server:~$ mount | head -n 5
rpool/local/root on /mnt type zfs (rw,xattr,posixacl)
rpool/local/nix on /mnt/nix type zfs (rw,xattr,posixacl)
rpool/safe/persist on /mnt/persist type zfs (rw,xattr,posixacl)
/dev/nvme0n1p1 on /mnt/boot type vfat (rw,relatime,fmask=0022,dmask=0022)

What it means: The target root is at /mnt, with separate mounts for /nix, /persist, and /boot.

Decision: If /mnt/boot isn’t mounted, your systemd-boot install will silently produce a system that doesn’t boot. Fix it now.

Task 9: Generate the initial configuration

cr0x@server:~$ sudo nixos-generate-config --root /mnt
writing /mnt/etc/nixos/hardware-configuration.nix...
writing /mnt/etc/nixos/configuration.nix...

What it means: NixOS detected hardware and wrote a baseline config.

Decision: Treat hardware-configuration.nix as mostly machine-generated and stable; keep human intent in separate modules where possible.

Task 10: Validate that ZFS mounts are reflected in hardware config

cr0x@server:~$ sudo sed -n '1,200p' /mnt/etc/nixos/hardware-configuration.nix
{ config, lib, pkgs, modulesPath, ... }:

{
  imports = [ (modulesPath + "/installer/scan/not-detected.nix") ];

  boot.supportedFilesystems = [ "zfs" ];

  fileSystems."/" = {
    device = "rpool/local/root";
    fsType = "zfs";
  };

  fileSystems."/nix" = {
    device = "rpool/local/nix";
    fsType = "zfs";
  };

  fileSystems."/persist" = {
    device = "rpool/safe/persist";
    fsType = "zfs";
  };

  fileSystems."/boot" = {
    device = "/dev/disk/by-uuid/ABCD-EF01";
    fsType = "vfat";
  };
}

What it means: The installer encoded mounts correctly. ESP uses UUID, ZFS uses dataset names.

Decision: Keep ZFS dataset names stable. Do not rely on /dev/nvme0n1p2 in long-term configs unless you enjoy post-reboot surprises. Also set networking.hostId: the ZFS module refuses to build without it, and it should stay stable so pools import cleanly across reboots.

Task 11: Preflight the build before installation (catch obvious errors)

cr0x@server:~$ sudo nix build /mnt/etc/nixos#nixosConfigurations.server.config.system.build.toplevel
warning: Git tree '/mnt/etc/nixos' is dirty
cr0x@server:~$ readlink result
/nix/store/2h3...-nixos-system-server-25.11.20260201

What it means: The configuration evaluates and the full system closure builds, without touching the target. The “dirty” warning is fine right now.

Decision: If the build fails, fix it now while you still have the ISO environment. Don’t “install and hope.”

Task 12: Install and set up the system

cr0x@server:~$ sudo nixos-install --root /mnt --no-root-passwd
building the system configuration...
installing the boot loader...
setting up /etc...
updating GRUB 2 menu...
installation finished!

What it means: NixOS installed and wrote bootloader entries. (Depending on config, you may see systemd-boot instead of GRUB lines.)

Decision: If bootloader install errors, do not reboot. Inspect /mnt/boot and your boot loader settings first.

Task 13: Before reboot, verify boot entries exist (UEFI/systemd-boot path)

cr0x@server:~$ sudo ls -la /mnt/boot/loader/entries | head
total 16
drwxr-xr-x 2 root root 4096 Jan  1 00:10 .
drwxr-xr-x 3 root root 4096 Jan  1 00:10 ..
-rwxr-xr-x 1 root root  420 Jan  1 00:10 nixos-generation-1.conf

What it means: systemd-boot entries exist on the ESP.

Decision: If the directory is empty, your ESP wasn’t mounted during install. Mount it and reinstall bootloader from a chroot (or rerun install carefully).

Task 14: After reboot, confirm you’re running the intended generation

cr0x@server:~$ sudo nix-env -p /nix/var/nix/profiles/system --list-generations
   1   2026-02-05 00:12:33   (current)

What it means: You have at least one generation and it is current.

Decision: If generations are missing or switching fails later, you probably have a broken bootloader install or a filesystem mount issue.

Task 15: Confirm ZFS pool imports cleanly on boot

cr0x@server:~$ sudo zpool status -x
all pools are healthy

What it means: No known pool issues.

Decision: If it reports degraded pools, failing devices, or checksum errors, treat it as urgent. ZFS tells the truth; people tend to negotiate with it anyway.

Task 16: Validate services and boot performance quickly

cr0x@server:~$ systemd-analyze time
Startup finished in 3.221s (kernel) + 6.483s (userspace) = 9.704s
multi-user.target reached after 6.439s in userspace

What it means: A quick top-line on boot time split between kernel/userspace.

Decision: If userspace is huge, inspect which units are blocking. If kernel is huge, it’s often firmware, storage init, or initrd complexity.

cr0x@server:~$ systemd-analyze blame | head
3.812s zfs-import-cache.service
2.104s network-online.target
1.442s nix-daemon.service
1.015s systemd-journald.service

What it means: Which units took longest.

Decision: Don’t “optimize” until you know whether it matters. If network-online.target is slow, check DHCP or remove unnecessary dependencies.

Task 17: Confirm SSH is reachable and keys are in place

cr0x@server:~$ sudo systemctl status sshd --no-pager
● sshd.service - OpenSSH Daemon
     Loaded: loaded (/etc/systemd/system/sshd.service; enabled; preset: enabled)
     Active: active (running) since Thu 2026-02-05 00:20:14 UTC; 1min ago
       Docs: man:sshd(8)
             man:sshd_config(5)

What it means: sshd is enabled and running.

Decision: If it’s not active, do not proceed with remote-only administration plans. Fix networking/sshd now while you still have console access.

Task 18: Check closure size and store growth (capacity planning)

cr0x@server:~$ nix path-info -Sh /run/current-system
/nix/store/2h3...-nixos-system-server-25.11.20260201  1.8G

What it means: Approximate size of the current system closure.

Decision: If you’re disk-constrained, plan garbage collection and keep /nix on its own dataset or partition.

A sane NixOS 25.11 configuration structure

Keep your configuration readable. Future-you will be tired and mildly annoyed. Don’t make it worse.

Minimal flake layout that scales past one host

Recommended structure:

  • flake.nix and flake.lock at repo root
  • hosts/server/configuration.nix for host intent
  • modules/ for reusable components (ssh, monitoring, zfs, users)
  • secrets/ for encrypted material (not stored in Nix store as plaintext)

Example flake (trimmed to the essentials):

cr0x@server:~$ cat /etc/nixos/flake.nix
{
  description = "NixOS fleet";

  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-25.11";
  };

  outputs = { self, nixpkgs, ... }:
  let
    system = "x86_64-linux";
  in {
    nixosConfigurations.server = nixpkgs.lib.nixosSystem {
      inherit system;
      modules = [
        ./hosts/server/configuration.nix
      ];
    };
  };
}

Then the host config imports modules and nails down intent:

cr0x@server:~$ sed -n '1,220p' /etc/nixos/hosts/server/configuration.nix
{ config, pkgs, ... }:

{
  imports = [
    ../../hardware-configuration.nix
    ../../modules/base.nix
    ../../modules/ssh.nix
    ../../modules/zfs.nix
  ];

  networking.hostName = "server";
  time.timeZone = "UTC";

  users.users.cr0x = {
    isNormalUser = true;
    extraGroups = [ "wheel" ];
    openssh.authorizedKeys.keys = [
      "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAI... cr0x@laptop"
    ];
  };

  services.openssh.enable = true;
  services.openssh.settings = {
    PasswordAuthentication = false;
    PermitRootLogin = "no";
  };

  boot.loader.systemd-boot.enable = true;
  boot.loader.efi.canTouchEfiVariables = true;

  system.stateVersion = "25.11";
}

What to avoid: giant monolithic configuration.nix with everything from nginx to fonts to experimental kernel flags. Split it. You will debug faster.

Persistence: decide what survives rebuilds

If you created /persist, use it. Store state you care about: SSH host keys, machine-id strategy, service data, logs (maybe), and application state (definitely). The exact wiring varies by preference, but the principle is stable: separate “rebuildable” from “must survive.”

One practical approach is to bind-mount selected directories to /persist (or use impermanence-style patterns). The benefit is clarity during incident response: you can blow away the root dataset and keep state intact.
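
A minimal sketch of the bind-mount wiring (paths are illustrative; pick the directories your services actually need):

```nix
# Keep SSH host keys and service state on the persistent dataset.
fileSystems."/persist".neededForBoot = true;  # mount early, before sshd
fileSystems."/etc/ssh" = {
  device = "/persist/etc/ssh";
  options = [ "bind" ];
};
fileSystems."/var/lib/postgresql" = {         # hypothetical service state
  device = "/persist/var/lib/postgresql";
  options = [ "bind" ];
};
```

The neededForBoot flag matters: without it, sshd can start before its host keys exist.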

Nix settings that prevent self-inflicted pain

  • Enable the daemon (default) and use sandboxed builds where possible.
  • Prefer pinned nixpkgs inputs and review updates.
  • Plan garbage collection and generation retention explicitly.
cr0x@server:~$ sudo nixos-option nix.gc.automatic
Value:
false

What it means: GC isn’t enabled by default in many setups.

Decision: For servers with small disks, enable GC with a retention policy that doesn’t delete your last-known-good generation right before you need it.
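
An explicit policy might look like this (retention numbers are illustrative):

```nix
# Weekly GC, keep roughly two weeks of generations as a rollback window.
nix.gc = {
  automatic = true;
  dates = "weekly";                     # systemd calendar expression
  options = "--delete-older-than 14d";  # passed to nix-collect-garbage
};
```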

Upgrades, rollbacks, and drift control in the real world

NixOS gives you a lot of safety rails. You still need to drive like you’re hauling glass.

How “zero config drift” actually works

Drift happens when reality changes without the source of truth changing. NixOS fights this in two ways:

  • Rebuild overwrites configuration-managed files. That blocks casual edits in /etc from sticking.
  • The system is a build artifact. You don’t “tweak live state,” you switch to a new generation.

But drift can still happen in state directories (databases, caches, logs), in manual firewall changes, in “temporary” kernel module loads, and in out-of-band changes like firmware settings. Your job is to minimize out-of-band paths and document the few that remain.

The safe upgrade pattern

This is the pattern I want in production:

  1. Update inputs (flake.lock) intentionally.
  2. Build the system.
  3. Switch to it.
  4. Validate critical services.
  5. Keep rollback available and don’t garbage-collect immediately.
cr0x@server:~$ sudo nixos-rebuild build --flake /etc/nixos#server
building the system configuration...
/nix/store/8m7...-nixos-system-server-25.11.20260205

What it means: You built a new system closure without activating it.

Decision: If build fails, you haven’t touched the running system. Fix before switching.

cr0x@server:~$ sudo nixos-rebuild switch --flake /etc/nixos#server
building the system configuration...
activating the configuration...
setting up /etc...
reloading user units for cr0x...

What it means: The new generation is active. Services may have restarted.

Decision: Validate the specific things your business cares about (ports, API health, storage mounts). Don’t trust “switch succeeded.”

Rollback you can do under pressure

There are two rollback paths:

  • At boot: choose an older generation from the bootloader menu.
  • From a running system: switch to an older generation in-place.
cr0x@server:~$ sudo nix-env -p /nix/var/nix/profiles/system --list-generations
   1   2026-02-05 00:12:33
   2   2026-02-05 01:05:10   (current)

What it means: You have at least two generations to choose from.

Decision: If you only keep one generation, you’ve turned NixOS into a fancy package manager. Keep a few.

cr0x@server:~$ sudo /nix/var/nix/profiles/system-1-link/bin/switch-to-configuration switch
setting up /etc...
reloading user units for cr0x...

What it means: You switched back to generation 1.

Decision: If the system is in a bad state, prefer boot-time rollback. It resets more assumptions.

Garbage collection without shooting yourself

Space pressure is real, especially on small root disks. But aggressive GC is how you delete your only working kernel right before a reboot.

cr0x@server:~$ sudo nix-collect-garbage -d
finding garbage collector roots...
deleting old generations of profile /nix/var/nix/profiles/system
deleting unused links...
0 store paths deleted, 0.00 MiB freed

What it means: In this example, nothing was freed: there were no old generations or unreferenced store paths left to delete.

Decision: Before enabling automated GC, set a policy for how many generations you keep and how long. Disk is cheap; downtime is not.

Joke #2: “Let’s enable daily garbage collection” is how you discover your rollback plan was, in fact, interpretive dance.

Storage engineering notes: ZFS, Btrfs, ext4, and reality

Storage is where “it worked in staging” goes to die. The good news: NixOS makes storage config reproducible, which means you can actually reason about it.

ZFS root: what you gain, what you pay

Gains: end-to-end checksums, snapshots, dataset properties per subtree, replication workflows, and extremely good observability (scrubs, error counts, clear failure signals).

Costs: You must plan memory usage, scrubs, monitoring, and import behavior. ZFS is stable, but it’s also honest: it will tell you your disk is dying before your RAID controller does.

Dataset layout: don’t make it complicated, make it useful

The earlier dataset layout (rpool/local/root, rpool/local/nix, rpool/safe/persist) is designed for operational control:

  • Different snapshot cadences: frequent for persist, less frequent for nix, maybe none for root.
  • Different replication rules: persist replicates off-host; nix is rebuildable and optional.
  • Clear blast radius: you can roll back persist without rolling back the OS, and vice versa.
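
One way to wire different cadences is the built-in auto-snapshot service; datasets opt in via the com.sun:auto-snapshot property (set it to true on rpool/safe, false on rpool/local). Retention counts below are illustrative:

```nix
services.zfs.autoSnapshot = {
  enable = true;
  frequent = 4;   # keep four 15-minute snapshots
  hourly = 24;
  daily = 7;
  weekly = 4;
  monthly = 0;    # rely on off-host replication beyond a month
};
```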

ARC and memory: what to watch

ZFS will use memory for ARC caching. This is usually good. It can also be awkward on small-memory VMs or mixed workloads where the page cache and ARC fight for space.

cr0x@server:~$ head -n 5 /proc/meminfo
MemTotal:       16342132 kB
MemFree:         2145320 kB
MemAvailable:    8021444 kB
Buffers:          212344 kB
Cached:          4219072 kB

What it means: “MemAvailable” is your real headroom. “Cached” covers the page cache, but ZFS ARC is not counted there; check /proc/spl/kstat/zfs/arcstats (or run arc_summary) to see the actual ARC size.

Decision: If you’re tight on RAM and see OOM pressure under load, consider constraining ZFS ARC. Do it intentionally, with measurement, not superstition.
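
If measurement says the ARC must be capped, a module parameter does it; the 4 GiB figure below is illustrative, not a recommendation:

```nix
# Cap the ZFS ARC at 4 GiB (value is in bytes). Only after measuring.
boot.kernelParams = [ "zfs.zfs_arc_max=4294967296" ];
```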

Scrubs: schedule them like backups—because they are a kind of backup

Scrubs read all data and verify checksums. They don’t replace backups, but they turn silent corruption into loud alerts.

cr0x@server:~$ sudo zpool scrub rpool
cr0x@server:~$ sudo zpool status
  pool: rpool
 state: ONLINE
  scan: scrub in progress since Thu Feb  5 02:11:33 2026
        32.1G scanned at 1.20G/s, 1.02G issued at 39.1M/s, 120G total
        0B repaired, 0.85% done, 0:01:39 to go

What it means: Scrub is running; progress and throughput are visible.

Decision: If scrubs reveal repaired bytes or checksum errors, treat it as a signal to investigate drives and cabling. Don’t just clear the error and move on.
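
NixOS can own the scrub schedule so it doesn’t depend on someone remembering; the interval below is an assumption, pick one that fits your pool size:

```nix
services.zfs.autoScrub.enable = true;
services.zfs.autoScrub.interval = "monthly";  # systemd calendar expression
```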

Ext4/Btrfs differences (if you don’t do ZFS)

If you choose ext4, your “maximum control” shifts up the stack: you’ll rely more on backups, LVM snapshots (maybe), and careful upgrades.

If you choose Btrfs, you get snapshots and send/receive. The operational discipline is different: watch for free space behavior, balance operations, and choose a sane subvolume layout.

The NixOS part is consistent: the OS state is declarative, rollbacks exist, and drift is minimized. Your storage plan fills in the rest of the reliability story.

Fast diagnosis playbook

This is the “it’s slow / it won’t boot / it’s flapping” triage order that saves time. The point is not completeness; it’s speed and signal.

First: confirm what changed

  • Did you switch generations? Did inputs change?
  • Is the running system the one you think it is?
cr0x@server:~$ readlink /run/current-system
/nix/store/8m7...-nixos-system-server-25.11.20260205

Decision: If the store path isn’t the expected one, stop guessing. Align the team on “what is running.”

Second: is it boot, storage, network, or service?

Pick the category fast. Most outages are boring.

Boot and initrd issues

cr0x@server:~$ journalctl -b -p err --no-pager | tail -n 20
Feb 05 02:20:31 server kernel: zfs: module license 'CDDL' taints kernel.
Feb 05 02:20:33 server systemd[1]: Failed to mount /persist.
Feb 05 02:20:33 server systemd[1]: Dependency failed for Local File Systems.

Decision: If mounts fail, you have a filesystem/dataset/mount ordering issue. Don’t chase app logs yet.

Storage health

cr0x@server:~$ sudo zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices could not be opened.
action: Attach the missing device and online it using 'zpool online'.
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          DEGRADED     0     0     0
          nvme0n1p2    ONLINE       0     0     0
          nvme1n1p2    UNAVAIL      0     0     0  cannot open

Decision: If ZFS is degraded, your priority is device availability and data safety. Performance debugging can wait.

Network reachability

cr0x@server:~$ ip -br a
lo               UNKNOWN        127.0.0.1/8 ::1/128
enp1s0           UP             10.20.30.40/24 fe80::a00:27ff:fe4e:66a1/64

Decision: If the interface isn’t UP or has no address, fix network config before you blame DNS, TLS, or applications.

Service health

cr0x@server:~$ systemctl --failed --no-pager
  UNIT                     LOAD   ACTIVE SUB    DESCRIPTION
● nginx.service             loaded failed failed Nginx Web Server

1 loaded units listed.

Decision: Failed units tell you where to focus. Read the unit logs; don’t restart blindly.

Third: find the bottleneck (CPU, IO, memory, or lock contention)

cr0x@server:~$ top -b -n 1 | head -n 12
top - 02:31:10 up 12 min,  1 user,  load average: 7.12, 6.80, 4.21
Tasks: 168 total,   2 running, 166 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.4 us,  3.1 sy,  0.0 ni, 82.9 id,  1.4 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem :  15959.1 total,   2210.5 free,   5300.8 used,   8447.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   8658.3 avail Mem

Decision: If IO wait is high, look at disk. If memory is low and swap absent, you may be heading toward OOM. If CPU is pegged, profile the service or scale.

Common mistakes: symptom → root cause → fix

1) System doesn’t boot after “successful” install

Symptom: Boot drops to firmware menu or “no bootable device.”

Root cause: ESP wasn’t mounted at /mnt/boot during install, so bootloader entries were written to the wrong place (or not at all).

Fix: Boot ISO, mount root at /mnt, mount ESP at /mnt/boot, then reinstall bootloader via nixos-enter or rerun installation steps carefully.

2) ZFS pool won’t import at boot

Symptom: Emergency shell; errors about mounting root or missing pool.

Root cause: Missing ZFS support in the config (boot.supportedFilesystems = [ "zfs" ];), missing networking.hostId (required by the ZFS module), wrong dataset names, or an initrd that doesn’t include the ZFS bits.

Fix: Confirm hardware config includes ZFS support; rebuild; if already broken, boot older generation or use ISO to import pool and fix config.

3) “It worked, then after reboot SSH is dead”

Symptom: You can’t SSH in after a rebuild; console shows system is up.

Root cause: Network config drift (DHCP vs static), firewall rules changed declaratively, or sshd settings tightened without keys in place.

Fix: On console, check ip -br a, systemctl status sshd, and journalctl -u sshd. Revert generation if needed, then fix config with authorized keys and a staged rollout.

4) Disk fills up unexpectedly on NixOS

Symptom: /nix grows until the system starts failing in weird ways.

Root cause: Old generations and store paths accumulate; GC not configured; build artifacts kept.

Fix: Set a GC policy and generation retention; move /nix to its own dataset/partition; monitor free space.

5) “nixos-rebuild switch” takes forever

Symptom: Rebuilds are slow; CPU and disk churn.

Root cause: Frequent source builds, missing binary cache access, or an overgrown configuration pulling too many dependencies.

Fix: Confirm you’re using binary caches, prefer stable packages, and avoid compiling the world on production nodes.

6) Secrets leaked into the Nix store

Symptom: A token shows up in /nix/store or can be read by unintended users.

Root cause: You placed secrets directly in Nix expressions or created files at build time that land in the store.

Fix: Move secrets to runtime provisioning (tmpfiles, external secret manager, encrypted files decrypted at activation) and rotate compromised credentials.

Three corporate mini-stories (pain, humility, outcomes)

Incident caused by a wrong assumption: “The interface name will be eth0”

At a mid-sized company, a team migrated a small fleet from traditional config management to NixOS. They did the right thing: everything was declarative, builds were pinned, rollbacks worked. Clean.

Then they bought new hardware. The NIC changed. Predictable interface names did their job, and the interface became something like enp129s0f0, not eth0. Their NixOS config had a static network stanza referencing the old name. On the bench, someone tested it with a single host and fixed it manually “just to get it online.” That manual fix never made it to git.

During the rollout window, half the servers came up without networking. The monitoring system showed a partial outage. The on-call engineer couldn’t SSH in, because that was the entire failure. Console access existed, but not quickly—these were remote sites.

Root cause wasn’t NixOS; it was the assumption that interface naming was stable across hardware. The real operational sin was the manual bench fix that bypassed the source of truth. That’s how drift sneaks in: through urgency and good intentions.

The fix was simple and boring: match interfaces by MAC address in the configuration and stop writing configs that depend on device names. The postmortem action item wasn’t “test more.” It was “no local manual edits without a commit,” and “network config must be keyed on stable identifiers.”
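Keying on the MAC address can be done with a systemd link unit: rename whatever device carries that MAC to a stable name, then reference only that name elsewhere. The MAC and address below are examples:

```nix
{
  # Rename the NIC with this MAC to "lan0", regardless of what the kernel calls it
  systemd.network.links."10-lan" = {
    matchConfig.MACAddress = "aa:bb:cc:dd:ee:ff";
    linkConfig.Name = "lan0";
  };

  # Every other reference uses the stable name
  networking.interfaces.lan0.ipv4.addresses = [
    { address = "192.0.2.10"; prefixLength = 24; }
  ];
}
```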

Optimization that backfired: “Aggressive garbage collection will keep disks clean”

A different org ran NixOS on build nodes and a handful of internal services. Someone noticed /nix was eating disk space. Fair. They enabled automatic garbage collection daily with an aggressive deletion policy. The free space graph looked great.

Weeks later, a security update required a reboot of a few nodes. One node didn’t come back cleanly—an unrelated hardware issue slowed device enumeration and the newest generation had a boot-time race involving a network dependency. The fix was to boot the previous generation. Except the previous generation’s closure had been collected.

Now they had a broken boot and no rollback path. The irony was sharp enough to cut glass: the OS designed for rollbacks had none because they optimized storage like it was a scratch disk.

The recovery involved booting rescue media, restoring enough of the store from backups to boot, and then rolling forward with a corrected configuration. Afterward, they changed the policy: keep multiple generations for a time window, and only garbage collect after a successful reboot test. Disk usage rose. Incidents fell.

Boring but correct practice that saved the day: “Build, then switch, then verify”

A fintech team used NixOS for a set of internal APIs. They weren’t loud about it; they just liked that production could be rebuilt from git and that deployments were predictable. Their runbook had a dull rule: always run nixos-rebuild build first, then switch, and always run a small verification suite afterward. Nobody loved it, but it stuck.

One day, an engineer updated pinned inputs to pick up a library fix. The build succeeded, but the activation step failed because a systemd unit had a changed option name upstream. That failure happened during the switch step, not after a reboot, and it was immediately visible at the console and in the deploy logs.

Because they had built first, they already knew it wasn’t a compilation problem. Because they switched in a controlled window, the blast radius was limited. Because they had verification steps, they didn’t accidentally ship a half-working state where some services used old libraries and others used new ones.

The rollback was quick: they flipped back to the previous generation, fixed the unit configuration in code, rebuilt, switched, verified, and moved on. No drama. Nobody got celebrated. That’s the point.
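The runbook itself is short. A sketch, assuming a flake-based host named myhost and a health endpoint (both are examples, not universal):

```shell
# 1. Build first: catches evaluation and compile errors
#    without touching the running system
nixos-rebuild build --flake .#myhost

# 2. Switch in a controlled window: activation failures
#    surface here, at the console, not after a reboot
sudo nixos-rebuild switch --flake .#myhost

# 3. Verify: whatever "healthy" means for this host
systemctl --failed
curl -fsS http://localhost:8080/healthz
```

If step 3 fails, rollback is one command: sudo nixos-rebuild switch --rollback, then fix the configuration in git.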

FAQ

1) Is NixOS actually “zero drift”?

For declaratively managed parts, yes: rebuilding enforces the declared state. Drift can still exist in persistent application state, firmware settings, and anything you change out-of-band.

2) Should I use flakes on servers?

Yes. Pinning and reproducibility are the whole game. Flakes make it harder to accidentally upgrade the world because a channel moved.

3) Can I manage multiple machines from one repo?

That’s one of NixOS’s best uses. Keep per-host modules and shared modules. Avoid copy-paste by factoring common roles cleanly.

4) Do I need ZFS for NixOS to be reproducible?

No. NixOS reproducibility is about system configuration and the Nix store. ZFS adds strong storage semantics and snapshots, which are great, but it’s optional.

5) What’s the simplest safe bootloader choice?

UEFI + systemd-boot is straightforward on many systems. If you have complex multi-boot or legacy requirements, GRUB can be appropriate. Pick one, test rollback boot entries, and document it.

6) How do I prevent secrets from landing in the Nix store?

Do not bake secrets into Nix derivations. Provision them at runtime (activation scripts, tmpfiles) or fetch from a secret manager. Rotate any secret you suspect was placed in the store.

7) What’s the right way to do rollbacks in production?

Keep multiple generations, ensure bootloader entries exist, and test booting an older generation after a significant change. Rollback should be muscle memory, not a theory.

8) How do I debug a failed service after switching generations?

Start with systemctl --failed, then journalctl -u <service>. If the failure is config-related, roll back immediately and fix it in code.

9) Can I do remote deploys?

Yes, and it’s worth doing once you trust your pipeline. The key is to keep console access for failures and to avoid switching generations without a verification step.

10) What’s the minimum monitoring I should have?

Disk space (especially /nix), memory pressure, load, service health, and for ZFS: pool status and scrub results. If you can't see it, you will get paged for it later.

Next steps you should actually do

You’ve installed NixOS 25.11. That’s the easy part. Now make it operational.

  1. Commit your configuration repo (including flake.lock) and treat it as the source of truth.
  2. Write a rollback runbook with the exact commands and the boot-menu steps for your hardware.
  3. Set generation retention and GC policy that preserves safety, not just free space.
  4. Decide and implement secret handling before you add more services.
  5. Add basic monitoring: disk, memory, service health, and ZFS status if applicable.
  6. Test a full rebuild in a VM or spare node. If you can’t rebuild, you don’t own the system.

Reproducibility is not a vibe. It’s a habit backed by tooling. NixOS gives you the tooling. The habit is on you.
