ZFS Pool-Failure Alerting (ZED)
How the fleet gets paged when a ZFS pool goes DEGRADED or FAULTED.
Email-only, host-side, codified — the source of truth is the ZED block in
infra/terraform/files/provision-host.sh
(search ZED, ~lines 183–202). This page is the index.
Last verified: 2026-06-15 (both boxes; codified after the Chicago reinstall).
Why ZED, not Netdata
Netdata is the per-second infra-health layer, but it is blind to pool state.
zpool isn’t in the container image, and container→host on an arbitrary port is
firewalled here (Docker isolates the kamal bridge from docker0; a host-side
socat exporter was tried and the container could not reach it — see
config/netdata/README.md §2c). Netdata does
get ZFS ARC stats and per-dataset capacity for free, but the
DEGRADED/FAULTED signal has to come from the host.
ZED (the ZFS Event Daemon) runs on the host, watches ZFS events (zevents), and emails on vdev/pool state changes (DEGRADED / FAULTED) and scrub errors. It ships with ZFS — no extra collector, no container.
What it alerts on
- vdev / pool state changes: DEGRADED, FAULTED (a dropped/failing disk in the RAID1 mirror, a pool import failure).
- scrub errors: checksum/read errors surfaced by a scrub.
ZED_NOTIFY_VERBOSE=0 → problems only, no healthy-scrub-completion spam.
Pool data stays local; this is purely the alert path.
Codified config
provision-host.sh applies this idempotently (installs the daemon, comments the
stock root default, (re)writes a single heatwave-managed block, enables the
service):

    apt-get install -y -qq zfs-zed
    # sed comments out the stock active default: ZED_EMAIL_ADDR="root"
    # then writes the managed block to /etc/zfs/zed.d/zed.rc:
    ZED_EMAIL_ADDR="sysadmin@warmlyyours.com"
    ZED_EMAIL_PROG="mail"
    ZED_EMAIL_OPTS="-s '@SUBJECT@' @ADDRESS@"
    ZED_NOTIFY_VERBOSE=0
    # symlinks statechange-notify.sh zedlet if missing, then:
    systemctl enable --now zfs-zed.service

zed.rc is shell-sourced, so the managed block (appended last) wins over the
stock default. Runs on BOTH boxes: Dallas (dal-01) and Chicago (chi-02).
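The actual block lives in provision-host.sh and is not reproduced on this page. As a minimal sketch of the idempotent pattern described above (comment out the stock default, replace any earlier managed block, append exactly one new one), something like the following would work. The BEGIN/END marker strings and the function name apply_zed_block are assumptions made for illustration, not the script's real contents.

```shell
#!/bin/sh
# Sketch only: the marker strings and function name are hypothetical and are
# NOT taken from provision-host.sh.
ZED_BEGIN='# BEGIN heatwave-managed ZED block'
ZED_END='# END heatwave-managed ZED block'

apply_zed_block() {
    rc=$1                       # path to zed.rc, e.g. /etc/zfs/zed.d/zed.rc
    tmp=$(mktemp) || return 1

    # 1. Drop any previously written managed block (between the markers).
    # 2. Comment out the stock active default so only one active value remains.
    sed -e "/^${ZED_BEGIN}\$/,/^${ZED_END}\$/d" \
        -e 's/^ZED_EMAIL_ADDR="root"/#&/' "$rc" > "$tmp" || { rm -f "$tmp"; return 1; }

    # 3. Append exactly one managed block. Because zed.rc is shell-sourced,
    #    the last assignment is the one that takes effect.
    cat >> "$tmp" <<EOF
$ZED_BEGIN
ZED_EMAIL_ADDR="sysadmin@warmlyyours.com"
ZED_EMAIL_PROG="mail"
ZED_EMAIL_OPTS="-s '@SUBJECT@' @ADDRESS@"
ZED_NOTIFY_VERBOSE=0
$ZED_END
EOF

    # Write back in place so the original file's owner and mode are kept.
    cat "$tmp" > "$rc" && rm -f "$tmp"
}
```

Running apply_zed_block twice against the same file leaves it byte-for-byte unchanged after the first run, which is the property that makes both the cloud-init path and the day-2 SSH re-run safe.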
Applied two ways, both via the same script: cloud-init on a fresh
provision, and the heatwave-host-config TFC workspace re-running it over
SSH (day-2, no reinstall). See infra/terraform/host-config/.
Email relay path
ZED shells out to the host mail command, which hands off to the
postfix→SendGrid relay that provision-host.sh also configures — so alerts
leave the box and reach sysadmin@warmlyyours.com externally. With the stock
ZED_EMAIL_ADDR="root" they would land in a local root mailbox nobody reads.
That override is the whole point.
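To make the hand-off concrete: ZED replaces the @SUBJECT@ and @ADDRESS@ placeholders in ZED_EMAIL_OPTS before invoking ZED_EMAIL_PROG. The sketch below imitates that substitution so the resulting mail arguments can be previewed without sending anything. The helper expand_email_opts is hypothetical and is not ZED's own code, and the subject string in the usage note is made up.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZED): preview how the option template
# from zed.rc expands for a given subject and recipient.
expand_email_opts() {
    template=$1; subject=$2; address=$3
    # Plain sed substitution; a subject containing "/" or "&" would need
    # escaping before being used here.
    printf '%s\n' "$template" |
        sed -e "s/@SUBJECT@/$subject/g" -e "s/@ADDRESS@/$address/g"
}
```

For example, expand_email_opts "-s '@SUBJECT@' @ADDRESS@" "ZFS pool state change" "sysadmin@warmlyyours.com" prints the arguments that would be passed to mail. To exercise the relay itself end to end, piping a short body into mail with a -s subject and the sysadmin address, then confirming external delivery, tests the same postfix path that ZED uses.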
Verify on a box
    systemctl is-active zfs-zed                # → active
    grep ZED_EMAIL_ADDR /etc/zfs/zed.d/zed.rc  # → sysadmin@warmlyyours.com (managed block), root line commented
    zpool status -x                            # → "all pools are healthy"

zpool status -x is the quick fault-path read; anything other than “all pools
are healthy” is what ZED would have emailed on.
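The three checks above can be wrapped into one pass/fail script, which is handy after a reinstall or a host-config apply. This is a sketch under assumptions: the function names are hypothetical, and the address check sources zed.rc in a subshell to read the value that actually wins, which goes one step further than the grep shown above.

```shell
#!/bin/sh
# Sketch: function names are hypothetical; the commands they wrap are the
# same three used in the manual check.

# Print the ZED_EMAIL_ADDR that is in effect once the whole file is sourced.
# Runs in a subshell so nothing leaks into the caller's environment.
effective_zed_addr() {
    ( unset ZED_EMAIL_ADDR; . "$1" >/dev/null 2>&1; printf '%s\n' "$ZED_EMAIL_ADDR" )
}

# Succeed only if the given `zpool status -x` output reports no problems.
pools_healthy() {
    [ "$1" = "all pools are healthy" ]
}

# Run all three checks; print one line per failure, return non-zero if any fail.
check_zed() {
    rc=${1:-/etc/zfs/zed.d/zed.rc}
    fail=0
    [ "$(systemctl is-active zfs-zed 2>/dev/null)" = "active" ] ||
        { echo "zfs-zed is not active"; fail=1; }
    [ "$(effective_zed_addr "$rc")" = "sysadmin@warmlyyours.com" ] ||
        { echo "ZED_EMAIL_ADDR is not the sysadmin address"; fail=1; }
    pools_healthy "$(zpool status -x 2>&1)" ||
        { echo "zpool status -x reports a problem"; fail=1; }
    return "$fail"
}
```

Calling check_zed with no arguments checks the default /etc/zfs/zed.d/zed.rc. Sourcing the file executes it as shell, which is the same trust ZED already places in it.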
Reinstall lesson (2026-06-15)
When the Chicago standby was reinstalled, the box came up running its local,
pre-ZED copy of provision-host.sh. (The trigger was an approved Terraform plan
whose only effective diff was a provision-host.sh edit, which forced a full
latitudesh_server reinstall.) Result: zed.rc reverted to
ZED_EMAIL_ADDR="root", so pool alerts would NOT have reached the team. This
is the classic "manual host config lost on reinstall → codify it" pattern: the
ZED block is now in provision-host.sh, so any future host-config apply or
reinstall restores correct alerting automatically. Full record:
doc/tasks/202606151240_CHICAGO_REINSTALL_DR_RECOVERY.md.