HAProxy — DB write-VIP failover router

The single point that decides "which Postgres node takes writes right now."
heatwave-haproxy is a Kamal accessory co-located on the app host; the app's
DATABASE_HOST points at it, and it TCP-passthrough routes every connection to
the live primary's PgBouncer. A pg_promote on the standby reroutes the
write path with no databases.ini edit and no app redeploy — HAProxy
follows each node's health probe.

This is the operator-facing summary. Deep design, the staged rollout, the
staging rehearsal, and every gotcha live in
doc/tasks/202606120820_HAPROXY_ROUTING_LAYER.md.

Config: config/haproxy/production.cfg + config/haproxy/staging.cfg
(authoritative). The haproxy accessory is defined in config/deploy.yml /
config/deploy.staging.yml — image haproxy:3.0-alpine (tag + index digest
pinned). It is not a new central SPOF: it's part of the app's local stack
(like the co-located PgBouncer), so one HAProxy per app host; if it dies the
blast radius is that one host, which is already the failure unit.

Topology

   app (Rails) ── DATABASE_HOST=heatwave-haproxy:6433 ──┐
                                                        ▼
        ┌──────────── heatwave-haproxy (Dallas, kamal accessory) ───────────┐
        │  listen pg-primary  :6433  (mode tcp, passthrough)                 │
        │  listen stats       :8404  (stats UI + /metrics for netdata)       │
        │                                                                    │
        │  backend = ONE pool of both nodes; httpchk keeps only the          │
        │  current LEADER "up" → all writes go to that node's pgbouncer      │
        └───────────────┬──────────────────────────────┬───────────────────┘
            track 200    │ pg-health                    │ pg-health   track 503
                         ▼ (LOCAL: kamal-net DNS)        ▼ (REMOTE: tailnet IP)
        ┌──── dal-01 (100.123.47.52) ────┐   ┌──── chi-02 (100.68.157.49) ────┐
        │ heatwave-pg-health  :8008      │   │ heatwave-pg-health-replica :8008│
        │   GET / → 200 "primary"        │   │   GET / → 503 "standby"         │
        │ heatwave-pgbouncer  :6432 ◀ UP │   │ heatwave-pgbouncer-replica :6432│
        │ heatwave-postgres   :5432      │   │ heatwave-postgres-replica  :5432│
        │   (PG18 PRIMARY, RW)           │   │   (PG18 STANDBY)  ◀ DOWN        │
        └────────────────────────────────┘   └─────────────────────────────────┘

Routing-decision logic

One pool, health-gated. Both Postgres nodes sit in a single backend. A
per-node pg-health sidecar runs SELECT pg_is_in_recovery() and answers
HTTP 200 "primary" (recovery=false) / 503 "standby" (recovery=true) on
:8008. HAProxy option httpchk (GET /, expect 200) marks only the
node answering 200 as "up", so exactly the current leader's PgBouncer takes
every write.

A flip is just promote + rebuild. pg_promote on the standby flips its
recovery state → its pg-health goes 200 → HAProxy marks it up, while the old
primary (drained and fenced before the promote) stays down. Each node's
PgBouncer points only at its local Postgres (static, never edited at a flip),
so nothing downstream reconfigures.

Session-safe passthrough. mode tcp + timeout client/server 12h, so
session-mode pooling and session-scoped advisory locks (28 call sites) pass
through untouched and long-lived idle connections are never reaped.

Flap guard. default-server inter 2s rise 2 fall 3 on-marked-down
shutdown-sessions: a single transient probe failure can't reroute (it takes
three consecutive failures, roughly 6 s at a 2 s interval, to mark a node
down), and once a node is marked down its connections are closed immediately,
so the app reconnects onto the new leader fast.

Single-primary is the one safety invariant. Health-based routing is only
safe while exactly one node reports primary; two writable nodes (split-brain)
would both return 200 and HAProxy would balance writes across both. Mitigation
is the controlled flip discipline (demote/fence the old primary before
promoting), which is why cross-DC automatic failover is not enabled yet.
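
The health gating and flap guard above can be modelled in a few lines. This
is a minimal Python sketch, not HAProxy's implementation: probe_status,
ServerHealth and writable_nodes are hypothetical names, and the counter is a
simplified model of rise 2 / fall 3 that counts consecutive probe results. It
assumes the sidecar answers exactly as described (200 for a primary, 503 for
a standby).

```python
from dataclasses import dataclass


def probe_status(in_recovery: bool) -> tuple[int, str]:
    """Map the result of pg_is_in_recovery() to the sidecar's HTTP answer."""
    return (503, "standby") if in_recovery else (200, "primary")


@dataclass
class ServerHealth:
    """Simplified rise/fall hysteresis for one backend server (hypothetical model)."""
    rise: int = 2      # consecutive good probes needed to go DOWN -> UP
    fall: int = 3      # consecutive bad probes needed to go UP -> DOWN
    up: bool = False
    streak: int = 0    # consecutive probes that contradict the current state

    def observe(self, status: int) -> bool:
        """Feed one probe result; return whether the server is considered up."""
        ok = status == 200
        if ok == self.up:
            # Probe agrees with the current state: any pending flip is cancelled.
            self.streak = 0
        else:
            self.streak += 1
            needed = self.rise if ok else self.fall
            if self.streak >= needed:
                self.up = ok
                self.streak = 0
        return self.up


def writable_nodes(servers: dict[str, ServerHealth]) -> list[str]:
    """Names of servers currently up; more than one means split-brain."""
    return sorted(name for name, s in servers.items() if s.up)


if __name__ == "__main__":
    nodes = {"dal-01": ServerHealth(up=True), "chi-02": ServerHealth()}
    nodes["dal-01"].observe(503)   # one blip: dal-01 stays up
    print(writable_nodes(nodes))   # ['dal-01']
```

In this model a node that starts failing is dropped only on the third
consecutive bad probe, and a newly promoted node takes traffic on its second
consecutive 200. If writable_nodes ever returns two names, the single-primary
invariant has been violated.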

Addressing gotcha (load-bearing)

A container cannot hairpin to a port its own host published on the
tailnet interface. So in production.cfg:

Dallas (LOCAL) is reached by kamal-net DNS (heatwave-pgbouncer /
heatwave-pg-health). Because HAProxy forbids a hostname in check addr, this
forces a separate probe backend that the routed server follows via track,
plus a resolvers docker section.

Chicago (REMOTE), a genuine peer, is reached by its tailnet IP
100.68.157.49.

The first prod boot used the host IP for both nodes and came up with the primary
backend DOWN (L4 timeout to its own 100.123.47.52:8008); fixed in
ad45d9f12b ("HAProxy prod — reach the LOCAL node by kamal-net DNS, not its host
IP", direct-to-master 2026-06-12). Revisit the DNS-vs-IP split only at a DC
migration, never at a flip.
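
To tell a hairpin failure (L4 timeout) apart from a healthy standby (HTTP
503), it can help to probe the pg-health endpoint by hand from a container on
the same Docker network. The following is a minimal Python sketch under that
assumption; classify and probe are hypothetical helpers, not part of the
repository, and the haproxy image itself does not ship Python.

```python
import socket
import urllib.error
import urllib.request
from typing import Optional


def classify(status: Optional[int]) -> str:
    """Turn a probe outcome into the state HAProxy would effectively see."""
    if status is None:
        return "unreachable"   # no TCP/HTTP answer at all (e.g. L4 timeout)
    if status == 200:
        return "primary"
    if status == 503:
        return "standby"
    return "unexpected"


def probe(host: str, port: int = 8008, timeout: float = 2.0) -> str:
    """GET / on a pg-health sidecar and classify the result."""
    url = f"http://{host}:{port}/"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as exc:
        # urllib raises on non-2xx responses; a 503 lands here.
        return classify(exc.code)
    except (urllib.error.URLError, socket.timeout, OSError):
        return classify(None)


if __name__ == "__main__":
    # Hostname and IP are the ones named in this document.
    for target in ("heatwave-pg-health", "100.123.47.52", "100.68.157.49"):
        print(target, probe(target))
```

Run on the Dallas app host's Docker network, the expectation from the
incident above is that the kamal-net name answers while the host's own
tailnet IP reports unreachable.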

Ops commands

bin/recovery <env> topology: read-only. Shows each node's recovery state,
pg-health, and HAProxy server status. Start here.

bin/recovery <env> flip-db: planned primary↔standby switchover. Drains the
old node via HAProxy's in-container Runtime API admin socket
(set server … state maint + shutdown sessions) before promoting, so
HAProxy never routes onto the dying node; waits for the standby to replay the
old primary's LSN (no data loss), then runs pg_promote, then rebuilds the
demoted node as a fresh streaming standby. (-y / RECOVERY_YES=1 for
non-interactive.)

bin/recovery <env> rebuild-standby: re-basebackup a wiped/stale node with
no flip. Takes a sudo zfs snapshot of the demoted dataset right before the
wipe (kept on failure as a zfs rollback net), then restarts the rebuilt
node's PgBouncer to clear a stale upstream-DNS cache.

bin/maintenance {up,down} <env>: full maintenance window. up = proxy 503 →
stop web + sidekiq (+ on prod the Chicago databasus controller container);
down reverses it. Wrap a flip in it: bin/maintenance up <env> →
bin/recovery <env> flip-db → bin/maintenance down <env>.
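
The maintenance-wrapped flip can be scripted. Below is a minimal Python
sketch, assuming it is run from the repository root and that "staging" is a
valid <env> value (an assumption based only on the config file names).
flip_plan and run_flip are hypothetical helpers, not part of bin/recovery.
The sketch sets RECOVERY_YES=1 for non-interactive use and stops at the first
failing command, which deliberately leaves maintenance up so an operator can
inspect the state.

```python
import os
import subprocess


def flip_plan(env: str) -> list[list[str]]:
    """The three commands of a maintenance-wrapped flip, in order."""
    return [
        ["bin/maintenance", "up", env],
        ["bin/recovery", env, "flip-db"],
        ["bin/maintenance", "down", env],
    ]


def run_flip(env: str, runner=subprocess.run) -> int:
    """Run the plan; raise on the first failure. Returns the number of steps run."""
    child_env = {**os.environ, "RECOVERY_YES": "1"}  # non-interactive confirmation
    done = 0
    for cmd in flip_plan(env):
        # check=True raises CalledProcessError, so later steps
        # (including "maintenance down") are skipped on failure.
        runner(cmd, check=True, env=child_env)
        done += 1
    return done


if __name__ == "__main__":
    run_flip("staging")
```

Running bin/recovery <env> topology before and after the script is a cheap
way to confirm which node HAProxy considers up.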

The Phase-4 prod drill (round-trip Dallas→Chicago→Dallas) passed: the flip +
reroute worked in both directions, and the cross-DC basebackup ran at
~620 MB/s (~6 min full rebuild).

Stats UI