ZFS Pool-Failure Alerting (ZED)

How the fleet gets alerted when a ZFS pool goes DEGRADED or FAULTED. Alerting is email-only, host-side, and codified: the source of truth is the ZED block in infra/terraform/files/provision-host.sh (search for ZED, around lines 183–202). This page is the index.

Last verified: 2026-06-15 (both boxes; codified after the Chicago reinstall).

Why ZED, not Netdata

Netdata is the per-second infra-health layer, but it is blind to pool state. The zpool binary isn’t in the container image, and container-to-host traffic on an arbitrary port is firewalled here (Docker isolates the kamal bridge from docker0; a host-side socat exporter was tried and the container could not reach it; see config/netdata/README.md §2c). Netdata does get ZFS ARC stats and per-dataset capacity for free, but the DEGRADED/FAULTED signal has to come from the host.

ZED (the ZFS Event Daemon) runs on the host, watches ZFS events (zevents), and emails on vdev/pool state changes (DEGRADED / FAULTED) and scrub errors. It is part of OpenZFS (installed here as the zfs-zed package), so it needs no extra collector and no container.

What it alerts on

vdev / pool state changes — DEGRADED, FAULTED (a dropped/failing disk in the RAID1 mirror, a pool import failure).
scrub errors — checksum/read errors surfaced by a scrub.
ZED_NOTIFY_VERBOSE=0 → problems only, no healthy-scrub-completion spam.

Pool data stays local; this is purely the alert path.

Codified config

provision-host.sh applies this idempotently. It installs the daemon, comments out the stock root default, (re)writes a single heatwave-managed block, and enables the service:

apt-get install -y -qq zfs-zed
# sed comments out the stock active default: ZED_EMAIL_ADDR="root"
# then writes the managed block to /etc/zfs/zed.d/zed.rc:
ZED_EMAIL_ADDR="sysadmin@warmlyyours.com"
ZED_EMAIL_PROG="mail"
ZED_EMAIL_OPTS="-s '@SUBJECT@' @ADDRESS@"
ZED_NOTIFY_VERBOSE=0
# symlinks statechange-notify.sh zedlet if missing, then:
systemctl enable --now zfs-zed.service

zed.rc is shell-sourced, so the managed block (appended last) wins over the stock default. Runs on BOTH boxes — Dallas (dal-01) and Chicago (chi-02).
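
The comment-then-append behaviour can be sketched as follows. This is a minimal illustration, not the actual contents of provision-host.sh: the function name apply_zed_block and the BEGIN/END marker strings are assumptions made for the example, and the real script may delimit its managed block differently.

```shell
#!/bin/sh
# Hypothetical marker strings (assumption; not taken from provision-host.sh).
BEGIN_MARK="# BEGIN heatwave-managed ZED block"
END_MARK="# END heatwave-managed ZED block"

# apply_zed_block <path-to-zed.rc>  (hypothetical helper name)
apply_zed_block() {
    rc="$1"
    # 1. Comment out the stock active default; already-commented lines do not match.
    # 2. Delete any previous managed block so re-runs never stack copies.
    sed -e 's/^ZED_EMAIL_ADDR="root"/#&/' \
        -e "/^${BEGIN_MARK}\$/,/^${END_MARK}\$/d" \
        "$rc" > "$rc.tmp" && mv "$rc.tmp" "$rc"
    # 3. Append one fresh managed block at the end of the file, so its
    #    assignments are the last ones seen when zed.rc is sourced.
    cat >> "$rc" <<EOF
$BEGIN_MARK
ZED_EMAIL_ADDR="sysadmin@warmlyyours.com"
ZED_EMAIL_PROG="mail"
ZED_EMAIL_OPTS="-s '@SUBJECT@' @ADDRESS@"
ZED_NOTIFY_VERBOSE=0
$END_MARK
EOF
}
```

Running the function twice against the same file leaves exactly one active ZED_EMAIL_ADDR line, which is the property that makes the day-2 re-run safe.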

Applied two ways, both via the same script: cloud-init on a fresh provision, and the heatwave-host-config TFC workspace re-running it over SSH (day-2, no reinstall). See infra/terraform/host-config/.

Email relay path

ZED shells out to the host mail command, which hands off to the postfix→SendGrid relay that provision-host.sh also configures — so alerts leave the box and reach sysadmin@warmlyyours.com externally. With the stock ZED_EMAIL_ADDR="root" they would land in a local root mailbox nobody reads. That override is the whole point.
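
To make the placeholders concrete, the sketch below approximates how the @SUBJECT@ and @ADDRESS@ tokens in ZED_EMAIL_OPTS turn into a mail command line. It is a simplified illustration rather than ZED's own code: render_mail_cmd is a hypothetical helper, and the sample subject and the pool name tank in the usage note are made up for the example.

```shell
#!/bin/sh
# Values from the managed block.
ZED_EMAIL_ADDR="sysadmin@warmlyyours.com"
ZED_EMAIL_PROG="mail"
ZED_EMAIL_OPTS="-s '@SUBJECT@' @ADDRESS@"

# render_mail_cmd <subject>  (hypothetical helper)
# Prints the command line that results from substituting both placeholders.
# The subject must not contain "/" or "&", since it is used as a sed replacement.
render_mail_cmd() {
    subject="$1"
    opts=$(printf '%s\n' "$ZED_EMAIL_OPTS" |
        sed -e "s/@ADDRESS@/$ZED_EMAIL_ADDR/g" -e "s/@SUBJECT@/$subject/g")
    printf '%s %s\n' "$ZED_EMAIL_PROG" "$opts"
}
```

Calling render_mail_cmd "pool tank DEGRADED" prints mail -s 'pool tank DEGRADED' sysadmin@warmlyyours.com. With the stock default the final argument would be root, which is why the override matters.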

Verify on a box

systemctl is-active zfs-zed                      # → active
grep ZED_EMAIL_ADDR /etc/zfs/zed.d/zed.rc        # → sysadmin@warmlyyours.com (managed block), root line commented
zpool status -x                                  # → "all pools are healthy"

zpool status -x is the quick fault-path read; anything other than “all pools are healthy” is what ZED would have emailed on.
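
The same checks can be scripted. In this sketch the function names effective_zed_addr and check_pools are hypothetical. The first sources zed.rc in a subshell to report the address ZED will actually use (grep alone shows both the commented stock line and the managed line); the second treats any zpool status -x output other than the healthy message as a problem.

```shell
#!/bin/sh
# effective_zed_addr <path-to-zed.rc>  (hypothetical helper)
# Sources the file the way a shell would and prints the winning value.
effective_zed_addr() {
    # The subshell keeps the sourced variables out of the caller's environment.
    ( . "$1" >/dev/null 2>&1 && printf '%s\n' "$ZED_EMAIL_ADDR" )
}

# check_pools <output of `zpool status -x`>  (hypothetical helper)
# Prints OK and returns 0 only for the healthy message; otherwise PROBLEM and 1.
check_pools() {
    if [ "$1" = "all pools are healthy" ]; then
        echo "OK"
    else
        echo "PROBLEM"
        return 1
    fi
}
```

On a box these would be called as effective_zed_addr /etc/zfs/zed.d/zed.rc and check_pools "$(zpool status -x)".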

Reinstall lesson (2026-06-15)

When the Chicago standby was reinstalled (an approved Terraform plan whose only effective diff was a provision-host.sh edit triggered a full latitudesh_server reinstall), the box came up running its local, pre-ZED copy of provision-host.sh. Result: zed.rc reverted to ZED_EMAIL_ADDR="root", so pool alerts would NOT have reached the team. This is the classic “manual host config lost on reinstall, so codify it” pattern: the ZED block is now in provision-host.sh, so any future host-config apply or reinstall restores correct alerting automatically. Full record: doc/tasks/202606151240_CHICAGO_REINSTALL_DR_RECOVERY.md.